Overview

Pipecat provides two variants of the OpenAI Responses API LLM service:
  • OpenAIResponsesLLMService (WebSocket-based, recommended): Maintains a persistent WebSocket connection for lower-latency inference and automatically uses previous_response_id to send only incremental context when possible.
  • OpenAIResponsesHttpLLMService (HTTP-based): Uses server-sent events (SSE) via HTTP streaming. Each request opens a new connection. Use this when WebSocket is not available or preferred.
Both variants support streaming text responses, function calling, usage metrics, and out-of-band inference, and work with the universal LLMContext and LLMContextAggregatorPair.
The Responses API is a newer OpenAI API designed for conversational AI applications. It differs from the Chat Completions API in its request/response structure and streaming format. See OpenAI Responses API documentation for more details.

WebSocket vs HTTP

Use WebSocket (OpenAIResponsesLLMService) when:
  • You need the lowest possible latency for real-time conversations
  • Your workflow involves frequent tool/function calls
  • You want automatic incremental context optimization without server-side storage
Use HTTP (OpenAIResponsesHttpLLMService) when:
  • WebSocket connections are blocked by your infrastructure
  • You prefer stateless request/response patterns
  • You don’t need the incremental context optimization
The WebSocket variant’s previous_response_id optimization works with store=False (the default) using a connection-local in-memory cache—no conversations are stored on OpenAI’s servers. The HTTP variant does not offer this optimization by default, as it would require store=True (30-day OpenAI-side conversation storage).
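The optimization can be pictured as a connection-local map from a conversation prefix to the response ID that covered it. The following is an illustrative plain-Python sketch of that idea, not Pipecat's actual implementation; the names `ResponseCache`, `build_request`, and `record` are hypothetical.

```python
import hashlib
import json


class ResponseCache:
    """Illustrative connection-local cache mapping a conversation
    prefix to the response ID that already covered it."""

    def __init__(self):
        self._prefix_hash = None
        self._response_id = None

    @staticmethod
    def _hash(messages):
        return hashlib.sha256(json.dumps(messages).encode()).hexdigest()

    def build_request(self, messages):
        """Build a request payload, sending only the incremental
        messages when a cached prefix still matches."""
        for cut in range(len(messages), 0, -1):
            if self._hash(messages[:cut]) == self._prefix_hash:
                return {
                    "previous_response_id": self._response_id,
                    "input": messages[cut:],   # only the new messages
                    "store": False,            # nothing stored server-side
                }
        return {"input": messages, "store": False}  # full context fallback

    def record(self, messages, response_id):
        # Remember which prefix this response ID covers.
        self._prefix_hash = self._hash(messages)
        self._response_id = response_id


cache = ResponseCache()
turn1 = [{"role": "user", "content": "Hi"}]
req1 = cache.build_request(turn1)  # first turn: full context, no previous_response_id

history = turn1 + [{"role": "assistant", "content": "Hello!"}]
cache.record(history, "resp_1")

turn2 = history + [{"role": "user", "content": "How are you?"}]
req2 = cache.build_request(turn2)  # prefix matched: only the new user message is sent
```

If the conversation prefix changes (for example, after context trimming), no prefix matches and the sketch falls back to sending the full context, which mirrors the behavior described above.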

OpenAI Responses API Reference

Pipecat’s API methods for OpenAI Responses integration

Example Implementation

Interruptible conversation example

OpenAI Documentation

Official OpenAI Responses API documentation

OpenAI Platform

Access models and manage API keys

Installation

To use OpenAI services, install the required dependencies:
pip install "pipecat-ai[openai]"

Prerequisites

OpenAI Account Setup

Before using OpenAI Responses LLM services, you need:
  1. OpenAI Account: Sign up at OpenAI Platform
  2. API Key: Generate an API key from your account dashboard
  3. Model Selection: Choose from available models (GPT-4.1, GPT-4o, GPT-4o-mini, etc.)
  4. Usage Limits: Set up billing and usage limits as needed

Required Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for authentication
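For example, in a POSIX shell (replace the placeholder value with your actual key):

```shell
# Export the API key so Pipecat can read it from the environment
export OPENAI_API_KEY="sk-..."
```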

Configuration

Common Parameters

These parameters are available for both OpenAIResponsesLLMService and OpenAIResponsesHttpLLMService:
api_key (str, default: None)
    OpenAI API key. If None, uses the OPENAI_API_KEY environment variable.
base_url (str, default: None)
    Custom base URL for the OpenAI API. Override for proxied or self-hosted deployments.
organization (str, default: None)
    OpenAI organization ID.
project (str, default: None)
    OpenAI project ID.
default_headers (Mapping[str, str], default: None)
    Additional HTTP headers to include in every request.
service_tier (str, default: None)
    Service tier to use (e.g., “auto”, “flex”, “priority”).
settings (OpenAIResponsesLLMSettings, default: None)
    Runtime-configurable model settings. See Settings below.

WebSocket-Specific Parameters

The following parameter is only available for OpenAIResponsesLLMService (WebSocket variant):
ws_url (str, default: "wss://api.openai.com/v1/responses")
    WebSocket endpoint URL. Override for custom deployments or proxies.

Settings

Runtime-configurable settings passed via the settings constructor argument using OpenAIResponsesLLMService.Settings(...). These can be updated mid-conversation with LLMUpdateSettingsFrame. See Service Settings for details.
model (str, default: "gpt-4.1")
    OpenAI model identifier. (Inherited from base settings.)
system_instruction (str, default: None)
    System instruction/prompt for the model. (Inherited from base settings.)
temperature (float, default: NOT_GIVEN)
    Sampling temperature (0.0 to 2.0). Lower values are more focused, higher values are more creative.
top_p (float, default: NOT_GIVEN)
    Top-p (nucleus) sampling (0.0 to 1.0). Controls diversity of output.
frequency_penalty (float, default: None)
    Penalty for frequent tokens (-2.0 to 2.0). Positive values discourage repetition.
presence_penalty (float, default: None)
    Penalty for new topics (-2.0 to 2.0). Positive values encourage the model to talk about new topics.
seed (int, default: None)
    Random seed for deterministic outputs.
max_completion_tokens (int, default: NOT_GIVEN)
    Maximum completion tokens to generate.
NOT_GIVEN values are omitted from the API request entirely, letting the OpenAI API use its own defaults. This is different from None, which would be sent explicitly.
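This tri-state distinction (a value, an explicit None, or nothing at all) can be illustrated with a plain-Python sentinel. The sketch below is illustrative of the pattern only, not Pipecat's actual serialization code; `build_payload` and the `_NotGiven` class are hypothetical.

```python
class _NotGiven:
    """Sentinel meaning 'the caller said nothing at all' —
    distinct from an explicit None."""

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def build_payload(**params):
    """Drop NOT_GIVEN entries entirely; keep explicit None values."""
    return {k: v for k, v in params.items() if v is not NOT_GIVEN}


payload = build_payload(temperature=0.7, top_p=NOT_GIVEN, seed=None)
# temperature is sent, top_p is omitted from the request,
# and seed is sent as an explicit null
```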

Usage

Basic Setup

WebSocket variant (recommended):
import os

from pipecat.services.openai.responses.llm import OpenAIResponsesLLMService

llm = OpenAIResponsesLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIResponsesLLMService.Settings(
        model="gpt-4.1",
        system_instruction="You are a helpful assistant.",
    ),
)
HTTP variant:
import os

from pipecat.services.openai.responses.llm import OpenAIResponsesHttpLLMService

llm = OpenAIResponsesHttpLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIResponsesHttpLLMService.Settings(
        model="gpt-4.1",
        system_instruction="You are a helpful assistant.",
    ),
)

With Custom Settings

import os

from pipecat.services.openai.responses.llm import OpenAIResponsesLLMService

llm = OpenAIResponsesLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIResponsesLLMService.Settings(
        model="gpt-4.1",
        temperature=0.7,
        max_completion_tokens=1000,
        frequency_penalty=0.5,
    ),
)
Both OpenAIResponsesLLMService.Settings and OpenAIResponsesHttpLLMService.Settings use the same OpenAIResponsesLLMSettings class, so settings are identical between variants.

Updating Settings at Runtime

Model settings can be changed mid-conversation using LLMUpdateSettingsFrame:
from pipecat.frames.frames import LLMUpdateSettingsFrame

await task.queue_frame(
    LLMUpdateSettingsFrame(
        delta=llm.Settings(
            temperature=0.3,
            max_completion_tokens=500,
        )
    )
)

Out-of-Band Inference

Run a one-shot inference without pushing frames through the pipeline:
from pipecat.processors.aggregators.llm_context import LLMContext

context = LLMContext()
context.add_user_message("What is the capital of France?")

response = await llm.run_inference(
    context=context,
    max_tokens=100,
    system_instruction="You are a helpful geography assistant.",
)
print(response)  # "The capital of France is Paris."

Notes

  • WebSocket is the new default: As of the Pipecat release that includes PR #4141, OpenAIResponsesLLMService uses WebSocket transport by default. If you need the HTTP streaming behavior, use OpenAIResponsesHttpLLMService instead. Both have identical constructor arguments and settings.
  • Persistent WebSocket connection: The WebSocket variant maintains a persistent connection to wss://api.openai.com/v1/responses and automatically reconnects on connection loss. Connection lifetime is limited to 60 minutes server-side, after which automatic reconnection occurs.
  • Incremental context optimization: The WebSocket variant uses previous_response_id to send only incremental context when the conversation prefix hasn’t changed, reducing latency and costs. This works with store=False (no server-side storage) via a connection-local cache.
  • Responses API vs Chat Completions API: The Responses API has a different request/response structure compared to the Chat Completions API. Use OpenAILLMService for the Chat Completions API and OpenAIResponsesLLMService or OpenAIResponsesHttpLLMService for the Responses API.
  • Universal LLM Context: Both services work with the universal LLMContext and LLMContextAggregatorPair, making it easy to switch between different LLM providers.
  • Function calling: Supports OpenAI’s tool/function calling format. Register function handlers on the pipeline task to handle tool calls automatically.
  • Usage metrics: Automatically tracks token usage, including cached tokens and reasoning tokens.
  • Service tiers: Supports OpenAI’s service tier system for prioritizing requests.

Event Handlers

Both OpenAIResponsesLLMService and OpenAIResponsesHttpLLMService support the following event handlers, inherited from LLMService:
on_completion_timeout
    Called when an LLM completion request times out.
on_function_calls_started
    Called when function calls are received and execution is about to start.
@llm.event_handler("on_completion_timeout")
async def on_completion_timeout(service):
    print("LLM completion timed out")
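Under the hood, this decorator style amounts to a name-to-handler registry. The following is a minimal plain-Python sketch of that pattern for illustration only; it is not Pipecat's implementation, and the `EventEmitter` class here is hypothetical.

```python
import asyncio


class EventEmitter:
    """Minimal decorator-based event registry, illustrating the
    @service.event_handler("...") registration pattern."""

    def __init__(self):
        self._handlers = {}

    def event_handler(self, name):
        # Returns a decorator that records the handler under `name`.
        def register(func):
            self._handlers.setdefault(name, []).append(func)
            return func
        return register

    async def emit(self, name, *args):
        # Invoke every registered handler, passing the emitter first,
        # mirroring the `service` argument in the example above.
        for handler in self._handlers.get(name, []):
            await handler(self, *args)


emitter = EventEmitter()
events = []


@emitter.event_handler("on_completion_timeout")
async def on_timeout(service):
    events.append("timeout")


asyncio.run(emitter.emit("on_completion_timeout"))
```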