Add WebsocketLLMService as a base class for WebSocket-based LLM services,
parallel to WebsocketTTSService/WebsocketSTTService but codifying a
transactional request-response model rather than a continuous background
receive loop.
WebsocketLLMService provides:
- Connection lifecycle (start/stop/cancel → connect/disconnect)
- _ws_send/_ws_recv with transparent ConnectionClosed handling
(auto-reconnect via exponential backoff → WebsocketReconnectedError)
- _ensure_connected with retry via _try_reconnect
OpenAIResponsesLLMService now inherits from WebsocketLLMService, removing
duplicated connection management code (_connect, _disconnect, _reconnect,
_ensure_connected, _ws_send, start, stop, cancel) and simplifying
_process_context from a loop with attempt tracking to a flat try/except
with a single retry.
- Use finally block in _disconnect to ensure state is always cleaned
up, even if websocket.close() throws — prevents stale cancellation
state (e.g. _cancel_pending_response) from polluting a new connection
- Catch ConnectionClosed in _drain_cancelled_response alongside
TimeoutError — prevents _needs_drain from staying True and bricking
the service on every subsequent inference attempt
- Fall back to OPENAI_API_KEY env var when api_key is not passed,
since the WebSocket connection uses raw websockets (not the
AsyncOpenAI client which handles this automatically)
- Use _clear_cancellation_state() instead of piecemeal resets where
appropriate
Instead of trying to filter stale events inline (unreliable — the API
doesn't provide a way to correlate events to a specific response),
drain remaining events from a cancelled response before starting the
next one. On cancellation, send response.cancel and set a drain flag.
At the start of the next _process_context, read and discard events
until a terminal event arrives, ensuring a clean connection. Falls
back to reconnecting if draining times out.
Over HTTP, previous_response_id requires store=True (30-day OpenAI-side
conversation storage). The WebSocket variant avoids this via a
connection-local in-memory cache that works with store=False. Add
comments explaining this in both class docstrings, at the store=False
parameter, and in the adapter's previous_response_id note.
Add detailed trace-level logging to _apply_previous_response_optimization
showing why the optimization was applied or fell back to full context,
including the relevant data for debugging.
Use append_to_context=False for the filler TTSSpeakFrame in the
function-calling example to avoid altering the conversation history
and breaking the previous_response_id prefix match.
When using previous_response_id, the server already knows its own
output from the previous response. Store the raw response output and,
on the next call, compare it against the items following the matched
input prefix — checking role and text content for messages, and call_id
for function calls. If the items match, skip them and send only truly
new input (user messages, tool results). Falls back to full context if
either the prefix or the output comparison fails.
Introduce a WebSocket variant of the OpenAI Responses API service that
maintains a persistent connection to wss://api.openai.com/v1/responses
for lower-latency inference. The WebSocket variant automatically uses
previous_response_id to send only incremental context when possible,
falling back to full context on reconnection or cache miss.
The WebSocket variant becomes the new default OpenAIResponsesLLMService,
and the HTTP variant is renamed to OpenAIResponsesHttpLLMService. Both
share a private base class with common settings, parameter building,
and run_inference (always HTTP) logic.
Update langchain 0.3→1.2, langchain-community 0.3→0.4, and
langchain-openai 0.3→1.1. This also unblocks openai>=2.26 which
was previously constrained by the now-removed openpipe package.
OpenPipe was acquired by CoreWeave in September 2025. The Python package
hasn't been updated since June 2025 and the repo since 2024. The openpipe
package caps openai<=1.97.1, creating dependency conflicts with other
extras. Remove the dead integration to clean up the codebase.
- Add Nebius LLM service wrapping OpenAI-compatible Token Factory API
- Set supports_developer_role = False (Nebius rejects developer role)
- Default to openai/gpt-oss-120b model (supports function calling)
- Add Nebius function-calling example and env.example entry
- Fix Sarvam developer role support
- Update examples to use developer role for intro messages
Adds an OpenAI-compatible LLM service for Nebius Token Factory, supporting
open-source models (Meta Llama, Qwen, DeepSeek) via their OpenAI-compatible
REST API at https://api.tokenfactory.nebius.com/v1/.
When the remote side disconnects while send() is in flight, send() was
setting _closing=True. This prevented the receive loop from firing
on_client_disconnected, causing the pipeline to hang waiting for a
disconnect signal that never came.
The fix removes _closing from send() (that flag means we initiated the
close) and instead checks Starlette application_state in _can_send()
to suppress subsequent sends after a failure.
Fixes#3912
Add `await asyncio.sleep(0)` after `create_task()` calls in
UserIdleController, SpeechTimeoutUserTurnStopStrategy,
TurnAnalyzerUserTurnStopStrategy, and UserTurnCompletionLLMServiceMixin
so the event loop schedules the newly created timer tasks before the
caller continues.