Overview

Pipecat provides two variants of the OpenAI Responses API LLM service:
  • OpenAIResponsesLLMService (WebSocket-based, recommended): Maintains a persistent WebSocket connection for lower-latency inference and automatically uses previous_response_id to send only incremental context when possible.
  • OpenAIResponsesHttpLLMService (HTTP-based): Uses server-sent events (SSE) via HTTP streaming. Each request opens a new connection. Use this when WebSocket is not available or preferred.
Both variants support streaming text responses, function calling, usage metrics, and out-of-band inference, and work with the universal LLMContext and LLMContextAggregatorPair.
The Responses API is a newer OpenAI API designed for conversational AI applications. It differs from the Chat Completions API in its request/response structure and streaming format. See OpenAI Responses API documentation for more details.

WebSocket vs HTTP

Use WebSocket (OpenAIResponsesLLMService) when:
  • You need the lowest possible latency for real-time conversations
  • Your workflow involves frequent tool/function calls
  • You want automatic incremental context optimization without server-side storage
Use HTTP (OpenAIResponsesHttpLLMService) when:
  • WebSocket connections are blocked by your infrastructure
  • You prefer stateless request/response patterns
  • You don’t need the incremental context optimization
The WebSocket variant’s previous_response_id optimization works with store=False (the default) using a connection-local in-memory cache—no conversations are stored on OpenAI’s servers. The HTTP variant does not offer this optimization by default, as it would require store=True (30-day OpenAI-side conversation storage).
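The optimization can be pictured as a connection-local map from a conversation prefix to the response ID that covered it. The following is an illustrative plain-Python sketch of that idea, not Pipecat's actual implementation; the names `ResponseCache`, `build_request`, and `record` are hypothetical.

```python
import hashlib
import json


class ResponseCache:
    """Illustrative connection-local cache mapping a conversation
    prefix to the response ID that already covered it."""

    def __init__(self):
        self._prefix_hash = None
        self._response_id = None

    @staticmethod
    def _hash(messages):
        return hashlib.sha256(json.dumps(messages).encode()).hexdigest()

    def build_request(self, messages):
        """Build a request payload, sending only the incremental
        messages when a cached prefix still matches."""
        for cut in range(len(messages), 0, -1):
            if self._hash(messages[:cut]) == self._prefix_hash:
                return {
                    "previous_response_id": self._response_id,
                    "input": messages[cut:],   # only the new messages
                    "store": False,            # nothing stored server-side
                }
        return {"input": messages, "store": False}  # full context fallback

    def record(self, messages, response_id):
        # Remember which prefix this response ID covers.
        self._prefix_hash = self._hash(messages)
        self._response_id = response_id


cache = ResponseCache()
turn1 = [{"role": "user", "content": "Hi"}]
req1 = cache.build_request(turn1)  # first turn: full context, no previous_response_id

history = turn1 + [{"role": "assistant", "content": "Hello!"}]
cache.record(history, "resp_1")

turn2 = history + [{"role": "user", "content": "How are you?"}]
req2 = cache.build_request(turn2)  # prefix matched: only the new user message is sent
```

If the conversation prefix changes (for example, after context trimming), no prefix matches and the sketch falls back to sending the full context, which mirrors the behavior described above.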

OpenAI Responses API Reference

Pipecat’s API methods for OpenAI Responses integration

Example Implementation

Interruptible conversation example

OpenAI Documentation

Official OpenAI Responses API documentation

OpenAI Platform

Access models and manage API keys

Installation

To use OpenAI services, install the required dependencies:
pip install "pipecat-ai[openai]"

Prerequisites

OpenAI Account Setup

Before using OpenAI Responses LLM services, you need:
  1. OpenAI Account: Sign up at OpenAI Platform
  2. API Key: Generate an API key from your account dashboard
  3. Model Selection: Choose from available models (GPT-4.1, GPT-4o, GPT-4o-mini, etc.)
  4. Usage Limits: Set up billing and usage limits as needed

Required Environment Variables

  • OPENAI_API_KEY: Your OpenAI API key for authentication
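For example, in a POSIX shell (replace the placeholder value with your actual key):

```shell
# Export the API key so Pipecat can read it from the environment
export OPENAI_API_KEY="sk-..."
```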

Configuration

Common Parameters

These parameters are available for both OpenAIResponsesLLMService and OpenAIResponsesHttpLLMService:
api_key (str, default: None)
    OpenAI API key. If None, uses the OPENAI_API_KEY environment variable.
base_url (str, default: None)
    Custom base URL for the OpenAI API. Override for proxied or self-hosted deployments.
organization (str, default: None)
    OpenAI organization ID.
project (str, default: None)
    OpenAI project ID.
default_headers (Mapping[str, str], default: None)
    Additional HTTP headers to include in every request.
service_tier (str, default: None)
    Service tier to use (e.g., “auto”, “flex”, “priority”).
settings (OpenAIResponsesLLMSettings, default: None)
    Runtime-configurable model settings. See Settings below.

WebSocket-Specific Parameters

The following parameter is only available for OpenAIResponsesLLMService (WebSocket variant):
ws_url (str, default: "wss://api.openai.com/v1/responses")
    WebSocket endpoint URL. Override for custom deployments or proxies.

Settings

Runtime-configurable settings passed via the settings constructor argument using OpenAIResponsesLLMService.Settings(...). These can be updated mid-conversation with LLMUpdateSettingsFrame. See Service Settings for details.
model (str, default: "gpt-4.1")
    OpenAI model identifier. (Inherited from base settings.)
system_instruction (str, default: None)
    System instruction/prompt for the model. (Inherited from base settings.)
temperature (float, default: NOT_GIVEN)
    Sampling temperature (0.0 to 2.0). Lower values are more focused, higher values are more creative.
top_p (float, default: NOT_GIVEN)
    Top-p (nucleus) sampling (0.0 to 1.0). Controls diversity of output.
frequency_penalty (float, default: None)
    Penalty for frequent tokens (-2.0 to 2.0). Positive values discourage repetition.
presence_penalty (float, default: None)
    Penalty for new topics (-2.0 to 2.0). Positive values encourage the model to talk about new topics.
seed (int, default: None)
    Random seed for deterministic outputs.
max_completion_tokens (int, default: NOT_GIVEN)
    Maximum completion tokens to generate.
NOT_GIVEN values are omitted from the API request entirely, letting the OpenAI API use its own defaults. This is different from None, which would be sent explicitly.
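This tri-state distinction (a value, an explicit None, or nothing at all) can be illustrated with a plain-Python sentinel. The sketch below is illustrative of the pattern only, not Pipecat's actual serialization code; `build_payload` and the `_NotGiven` class are hypothetical.

```python
class _NotGiven:
    """Sentinel meaning 'the caller said nothing at all' —
    distinct from an explicit None."""

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def build_payload(**params):
    """Drop NOT_GIVEN entries entirely; keep explicit None values."""
    return {k: v for k, v in params.items() if v is not NOT_GIVEN}


payload = build_payload(temperature=0.7, top_p=NOT_GIVEN, seed=None)
# temperature is sent, top_p is omitted from the request,
# and seed is sent as an explicit null
```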

Usage

Basic Setup

WebSocket variant (recommended):
import os

from pipecat.services.openai.responses.llm import OpenAIResponsesLLMService

llm = OpenAIResponsesLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIResponsesLLMService.Settings(
        model="gpt-4.1",
        system_instruction="You are a helpful assistant.",
    ),
)
HTTP variant:
import os

from pipecat.services.openai.responses.llm import OpenAIResponsesHttpLLMService

llm = OpenAIResponsesHttpLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIResponsesHttpLLMService.Settings(
        model="gpt-4.1",
        system_instruction="You are a helpful assistant.",
    ),
)

With Custom Settings

import os

from pipecat.services.openai.responses.llm import OpenAIResponsesLLMService

llm = OpenAIResponsesLLMService(
    api_key=os.getenv("OPENAI_API_KEY"),
    settings=OpenAIResponsesLLMService.Settings(
        model="gpt-4.1",
        temperature=0.7,
        max_completion_tokens=1000,
        frequency_penalty=0.5,
    ),
)
Both OpenAIResponsesLLMService.Settings and OpenAIResponsesHttpLLMService.Settings use the same OpenAIResponsesLLMSettings class, so settings are identical between variants.

Updating Settings at Runtime

Model settings can be changed mid-conversation using LLMUpdateSettingsFrame:
from pipecat.frames.frames import LLMUpdateSettingsFrame

await task.queue_frame(
    LLMUpdateSettingsFrame(
        delta=llm.Settings(
            temperature=0.3,
            max_completion_tokens=500,
        )
    )
)

Out-of-Band Inference

Run a one-shot inference without pushing frames through the pipeline:
from pipecat.processors.aggregators.llm_context import LLMContext

context = LLMContext()
context.add_user_message("What is the capital of France?")

response = await llm.run_inference(
    context=context,
    max_tokens=100,
    system_instruction="You are a helpful geography assistant.",
)
print(response)  # "The capital of France is Paris."

Notes

  • WebSocket is the new default: As of the Pipecat release that includes PR #4141, OpenAIResponsesLLMService uses WebSocket transport by default. If you need the HTTP streaming behavior, use OpenAIResponsesHttpLLMService instead. Both have identical constructor arguments and settings.
  • Persistent WebSocket connection: The WebSocket variant maintains a persistent connection to wss://api.openai.com/v1/responses and automatically reconnects on connection loss. Connection lifetime is limited to 60 minutes server-side, after which automatic reconnection occurs.
  • Incremental context optimization: The WebSocket variant uses previous_response_id to send only incremental context when the conversation prefix hasn’t changed, reducing latency and costs. This works with store=False (no server-side storage) via a connection-local cache.
  • Responses API vs Chat Completions API: The Responses API has a different request/response structure compared to the Chat Completions API. Use OpenAILLMService for the Chat Completions API and OpenAIResponsesLLMService or OpenAIResponsesHttpLLMService for the Responses API.
  • Universal LLM Context: Both services work with the universal LLMContext and LLMContextAggregatorPair, making it easy to switch between different LLM providers.
  • Function calling: Supports OpenAI’s tool/function calling format. Register function handlers on the pipeline task to handle tool calls automatically.
  • Usage metrics: Automatically tracks token usage, including cached tokens and reasoning tokens.
  • Service tiers: Supports OpenAI’s service tier system for prioritizing requests.

Event Handlers

Both OpenAIResponsesLLMService and OpenAIResponsesHttpLLMService support the following event handlers, inherited from LLMService:
on_completion_timeout
    Called when an LLM completion request times out.
on_function_calls_started
    Called when function calls are received and execution is about to start.
@llm.event_handler("on_completion_timeout")
async def on_completion_timeout(service):
    print("LLM completion timed out")
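Under the hood, this decorator style amounts to a name-to-handler registry. The following is a minimal plain-Python sketch of that pattern for illustration only; it is not Pipecat's implementation, and the `EventEmitter` class here is hypothetical.

```python
import asyncio


class EventEmitter:
    """Minimal decorator-based event registry, illustrating the
    @service.event_handler("...") registration pattern."""

    def __init__(self):
        self._handlers = {}

    def event_handler(self, name):
        # Returns a decorator that records the handler under `name`.
        def register(func):
            self._handlers.setdefault(name, []).append(func)
            return func
        return register

    async def emit(self, name, *args):
        # Invoke every registered handler, passing the emitter first,
        # mirroring the `service` argument in the example above.
        for handler in self._handlers.get(name, []):
            await handler(self, *args)


emitter = EventEmitter()
events = []


@emitter.event_handler("on_completion_timeout")
async def on_timeout(service):
    events.append("timeout")


asyncio.run(emitter.emit("on_completion_timeout"))
```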