The _self_queued_frames set and _internal_queue_frame wrapper were used
to prevent re-processing SpeechControlParamsFrame that the aggregator
queued to itself. Now that the frame is no longer special-cased, this
tracking is unnecessary. Also removes unused FrameCallback import.
Adds `enable_prompt_caching` setting to `AWSBedrockLLMSettings`. When
enabled, appends `cachePoint` markers to system prompts and tool
definitions in ConverseStream requests.
This can reduce TTFT by up to 85% for multi-turn conversations where
the system prompt stays constant (e.g. voice agents, chat assistants).
Follows the same pattern as `AnthropicLLMService.enable_prompt_caching`.
Usage:
```python
llm = AWSBedrockLLMService(
settings=AWSBedrockLLMSettings(
model="au.anthropic.claude-haiku-4-5-20251001-v1:0",
enable_prompt_caching=True,
),
)
```
See: https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
Remove EmulateUserStartedSpeakingFrame, EmulateUserStoppedSpeakingFrame
(deprecated since v0.0.99), and the emulated field from
UserStartedSpeakingFrame and UserStoppedSpeakingFrame. Clean up the
handling code in base_input.py and a stale comment in nova_sonic/llm.py.
The interruption_strategies mechanism was deprecated in v0.0.99 in favor
of LLMUserAggregator's user_turn_strategies. All evaluation logic was
already removed — this removes the remaining field definitions, property,
StartFrame propagation, conditional check in base_input.py, strategy
files, and test.
This field was deprecated in v0.0.99 in favor of LLMUserAggregator's
user_turn_strategies / user_mute_strategies parameters. Since the default
was True (interruptions allowed), removing the guards keeps the current
default behavior.
The location and project_id fields were deprecated since 0.0.90 in
favor of direct __init__ parameters. Now that InputParams is removed,
project_id is required and location defaults to "us-east4" directly
in the signature.
Override _update_settings in CartesiaTTSService to flush the current
audio context and assign a new turn context ID when voice, model, or
language settings change. This prevents Context has closed errors
from Cartesia API, which locks these parameters per context.
Remove the deprecated text_aggregator parameter from TTSService,
CartesiaTTSService, and RimeTTSService, and the deprecated text_filter
parameter from TTSService. Users should use LLMTextProcessor before
the TTS service instead. Update the voice-switching example to use
LLMTextProcessor with PatternPairAggregator.
Change the default from 10s to None so deferred function calls can run
indefinitely when no timeout is configured. Only create the timeout
task when a timeout is actually provided (per-call or service-level).
Add SmallestSTTService using the Pulse WebSocket API for real-time
transcription. Includes SmallestSTTSettings dataclass, 32-language
support with resolve_language fallback, VAD-driven finalize signal,
and SMALLEST_TTFS_P99 latency constant.
Also adds X-Source and X-Pipecat-Version headers to Smallest STT
and TTS WebSocket connections.
Example files like openai.py shadow installed packages when Python adds the
script directory to sys.path. Prepend the parent folder name to each example
file (e.g. openai.py -> function-calling-openai.py). Also split
thinking-and-mcp/ into separate mcp/ and thinking/ directories.
Now that LLMContextFrame is the only frame that provides a context,
remove the intermediate `context = None` / `if context:` pattern
and handle context processing directly in the isinstance branch.
Replace the nested services/speech/ and services/function-calling/ with
top-level voice/ and function-calling/ directories. Update eval script
paths and README to match.
Move 304 examples from a flat numbered directory into 14 descriptive
subfolders: getting-started, services (speech + function-calling),
transcription, vision, realtime, persistent-context,
context-summarization, update-settings (stt/tts/llm), turn-management,
thinking-and-mcp, transports, video-avatar, video-processing, and
features.
Strip numbered prefixes from filenames (e.g. 07c-interruptible-deepgram.py
becomes services/speech/deepgram.py) since the folder context makes them
redundant. Keep numbered prefixes only in getting-started/ where ordering
matters.
Update eval script paths and README to match the new structure.
Add WebsocketLLMService as a base class for WebSocket-based LLM services,
parallel to WebsocketTTSService/WebsocketSTTService but codifying a
transactional request-response model rather than a continuous background
receive loop.
WebsocketLLMService provides:
- Connection lifecycle (start/stop/cancel → connect/disconnect)
- _ws_send/_ws_recv with transparent ConnectionClosed handling
(auto-reconnect via exponential backoff → WebsocketReconnectedError)
- _ensure_connected with retry via _try_reconnect
OpenAIResponsesLLMService now inherits from WebsocketLLMService, removing
duplicated connection management code (_connect, _disconnect, _reconnect,
_ensure_connected, _ws_send, start, stop, cancel) and simplifying
_process_context from a loop with attempt tracking to a flat try/except
with a single retry.
When a user interruption causes the LLM chunk stream to exit early,
function call arguments may be incomplete JSON. Wrap json.loads() in
try/except JSONDecodeError to skip malformed function calls with a
warning instead of crashing. Fixes#2461.
Buffer raw bytes and only decode after splitting on newline boundaries,
preventing multi-byte UTF-8 characters from being split at chunk edges.
Fixes#3538
- Use finally block in _disconnect to ensure state is always cleaned
up, even if websocket.close() throws — prevents stale cancellation
state (e.g. _cancel_pending_response) from polluting a new connection
- Catch ConnectionClosed in _drain_cancelled_response alongside
TimeoutError — prevents _needs_drain from staying True and bricking
the service on every subsequent inference attempt
- Fall back to OPENAI_API_KEY env var when api_key is not passed,
since the WebSocket connection uses raw websockets (not the
AsyncOpenAI client which handles this automatically)
- Use _clear_cancellation_state() instead of piecemeal resets where
appropriate
Instead of trying to filter stale events inline (unreliable — the API
doesn't provide a way to correlate events to a specific response),
drain remaining events from a cancelled response before starting the
next one. On cancellation, send response.cancel and set a drain flag.
At the start of the next _process_context, read and discard events
until a terminal event arrives, ensuring a clean connection. Falls
back to reconnecting if draining times out.
Over HTTP, previous_response_id requires store=True (30-day OpenAI-side
conversation storage). The WebSocket variant avoids this via a
connection-local in-memory cache that works with store=False. Add
comments explaining this in both class docstrings, at the store=False
parameter, and in the adapter's previous_response_id note.
Add detailed trace-level logging to _apply_previous_response_optimization
showing why the optimization was applied or fell back to full context,
including the relevant data for debugging.
Use append_to_context=False for the filler TTSSpeakFrame in the
function-calling example to avoid altering the conversation history
and breaking the previous_response_id prefix match.
When using previous_response_id, the server already knows its own
output from the previous response. Store the raw response output and,
on the next call, compare it against the items following the matched
input prefix — checking role and text content for messages, and call_id
for function calls. If the items match, skip them and send only truly
new input (user messages, tool results). Falls back to full context if
either the prefix or the output comparison fails.
Introduce a WebSocket variant of the OpenAI Responses API service that
maintains a persistent connection to wss://api.openai.com/v1/responses
for lower-latency inference. The WebSocket variant automatically uses
previous_response_id to send only incremental context when possible,
falling back to full context on reconnection or cache miss.
The WebSocket variant becomes the new default OpenAIResponsesLLMService,
and the HTTP variant is renamed to OpenAIResponsesHttpLLMService. Both
share a private base class with common settings, parameter building,
and run_inference (always HTTP) logic.
Update langchain 0.3→1.2, langchain-community 0.3→0.4, and
langchain-openai 0.3→1.1. This also unblocks openai>=2.26 which
was previously constrained by the now-removed openpipe package.
OpenPipe was acquired by CoreWeave in September 2025. The Python package
hasn't been updated since June 2025 and the repo since 2024. The openpipe
package caps openai<=1.97.1, creating dependency conflicts with other
extras. Remove the dead integration to clean up the codebase.
- Add Nebius LLM service wrapping OpenAI-compatible Token Factory API
- Set supports_developer_role = False (Nebius rejects developer role)
- Default to openai/gpt-oss-120b model (supports function calling)
- Add Nebius function-calling example and env.example entry
- Fix Sarvam developer role support
- Update examples to use developer role for intro messages
Adds an OpenAI-compatible LLM service for Nebius Token Factory, supporting
open-source models (Meta Llama, Qwen, DeepSeek) via their OpenAI-compatible
REST API at https://api.tokenfactory.nebius.com/v1/.
When the remote side disconnects while send() is in flight, send() was
setting _closing=True. This prevented the receive loop from firing
on_client_disconnected, causing the pipeline to hang waiting for a
disconnect signal that never came.
The fix removes _closing from send() (that flag means we initiated the
close) and instead checks Starlette application_state in _can_send()
to suppress subsequent sends after a failure.
Fixes#3912
Add `await asyncio.sleep(0)` after `create_task()` calls in
UserIdleController, SpeechTimeoutUserTurnStopStrategy,
TurnAnalyzerUserTurnStopStrategy, and UserTurnCompletionLLMServiceMixin
so the event loop schedules the newly created timer tasks before the
caller continues.
Gemini 3.1 Flash Live won't reliably report ending its turn until
after it says something following a tool call. Restructure the system
instruction so the model says goodbye *after* calling
end_conversation, and add a comment explaining the deferred EndFrame
behavior that makes this work.
The base serializer filters out RTVI protocol messages by default
(ignore_rtvi_messages=True) to prevent them from being sent over
telephony media streams. ProtobufFrameSerializer is used by WebSocket
transports, which are the delivery channel for these messages, so
disable the filter there.
All recent Gemini Live models (including the default
gemini-2.5-flash-native-audio-preview-12-2025, and going at least as
far back as gemini-2.5-flash-native-audio-preview-09-2025) only
support AUDIO as a response modality. We considered using
`modalities=TEXT` as a Pipecat-level signal to suppress audio output
frames (so developers could pair Gemini Live with an external TTS),
but the output transcription from the API arrives too late relative
to the audio to be useful for driving an external TTS service.
For now, just log a warning when a TEXT modality is configured
(at init or via set_model_modalities) and proceed as normal. The 26d
text-modality example is removed since it no longer represents a
viable configuration.
The heartbeat monitor timeout (`HEARTBEAT_MONITOR_SECS`) was a static
module-level constant that never derived from the user-configurable
`heartbeats_period_secs`. This meant overriding the heartbeat interval
had no effect on the monitor window, causing spurious warnings or
delayed detection depending on the configured interval.
Add a new `heartbeats_monitor_secs` parameter to `PipelineParams` so
the monitor timeout is independently configurable (defaults to 10s).
The monitor handler now reads from the instance param instead of the
hard-coded constant.
Made-with: Cursor
This fixes the 26c example when using Gemini 3.1 Flash Live, which seems to be more strict about not receiving real-time input (at least, video messages) before conversation history.
Rime's WebSocket API sends a done message when synthesis completes.
Handle it to stop TTFB metrics, push TTSStoppedFrame, and remove the
audio context immediately instead of relying on the 3-second
stop_frame_timeout_s fallback.
DailyCallbacks gained a required on_dtmf_event field in PR #4047.
PR #4079 fixed this for TavusTransportClient but
LemonSliceTransportClient.setup() was not updated, causing a pydantic
ValidationError at pipeline setup time.
Only trigger handle_error for ErrorFrames originating from the active
service, not any managed service. This prevents edge cases where errors
from a non-active service could incorrectly trigger failover.
Expose a public method for retrieving all stored memories outside the
pipeline, avoiding the need for callers to reimplement client branching,
OR filter construction, and asyncio.to_thread wrapping. Simplify the
example get_initial_greeting() to use it.
Move blocking Mem0 API calls off the event loop using asyncio.to_thread().
Store messages as a fire-and-forget background task via create_task() since
the result is not needed. Insert memory messages at the configured position
in the context instead of always appending.
Closes#1741
Deepgram SDK 6.x surfaces connection errors from send_media() instead
of silently swallowing them. This causes error floods when the WebSocket
disconnects since every queued audio frame hits the dead connection.
Wrap send_media() in try/except: on failure, log one warning and set
self._connection = None so subsequent frames skip until the existing
_connection_handler reconnects.
In some cases the openai provider could answer with a `chunk.choices[0].delta.audio = None`, so the process context fails with error:
```
pipecat/services/openai/base_llm.py:552): Error during completion: 'NoneType' object has no attribute 'get'
```
When an InterruptionFrame arrives, the Python-side audio task is
cancelled but frames already submitted to rtc.AudioSource continue
playing from its internal buffer. This causes the bot to keep speaking
for several seconds after being interrupted.
Fix by overriding process_frame in LiveKitOutputTransport to call
audio_source.clear_queue() on InterruptionFrame, immediately flushing
the buffered audio.
ErrorFrames propagating upstream from downstream processors (e.g. TTS) would
enter the ServiceSwitcher via process_frame, traverse the active service sub-pipeline,
and reach push_frame where they incorrectly triggered failover. Now only errors whose
processor is one of the managed services trigger handle_error. Also fix the log in
handle_error to attribute errors to the actual source processor rather than the
current active_service.
Closes#4139
Gemini 3.x can bundle multiple fields (e.g. model_turn and
output_transcription) on the same server_content message. The previous
elif chain would only process the first matching field and silently
drop the rest. Switch to independent if checks so every field is
handled.
When Gemini Live was configured with local VAD (server-side VAD disabled),
the service was listening for the wrong frame types and not sending
ActivityStart/ActivityEnd events to the server. Now it listens for
VADUserStartedSpeakingFrame/VADUserStoppedSpeakingFrame and sends the
appropriate activity signals when local VAD is in use.
Also removes the unnecessary local SileroVADAnalyzer from server-side VAD
examples and adds a new 26a example demonstrating local VAD configuration.
Set GRPC_VERBOSITY=ERROR by default so users do not see noisy fork
handler and abseil warnings from the gRPC C library. Users can still
override by setting GRPC_VERBOSITY themselves.
nvidia-riva-client 2.25.1 ships with gencode compiled against protobuf
6.31.1, which requires a runtime >= 6.31.1. Update protobuf from 5.29.6
to >=6.31.1,<7 and grpcio-tools from 1.67.1 to 1.78.0 to match.
Regenerate frames_pb2.py with the new compiler.
Add DeepgramFluxSageMakerSTTService that combines SageMaker's HTTP/2
transport with Flux's JSON turn detection protocol (StartOfTurn,
EndOfTurn, EagerEndOfTurn, TurnResumed). Includes mid-stream Configure
support, silence watchdog, and an example bot.
Both GrokLLMService and XAIHttpTTSService use the same xAI API (api.x.ai),
so move Grok source files into the xai module. Leave deprecation shims in
the old grok/ paths for backward compatibility.
- Rename XAITTSService → XAIHttpTTSService and XAITTSSettings → XAIHttpTTSSettings
- Add language_to_xai_language() with explicit LANGUAGE_MAP using resolve_language()
- Remove deprecated InputParams, params, voice, language init params
- Remove XAI_DEFAULT_SAMPLE_RATE and XAI_PCM_CODEC constants; add encoding param
- Set sample_rate=None default (picked up from PipelineParams or user)
- Use Language.EN enum instead of string "en" for default language
- Add changelog/4031.added.md
- Add 07e-interruptible-xai.py foundational example
- Update 14g-function-calling-grok.py to use XAIHttpTTSService
- Register 07e in run-release-evals.py
Test that OpenAI Realtime, Grok Realtime, and Nova Sonic adapters
prefer init-provided system_instruction over context-provided, warn
on conflicts, and don't warn for developer messages.
Add system_instruction parameter to the Grok Realtime adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Update the Grok Realtime example to use settings.system_instruction
instead of session_properties.instructions.
Add system_instruction parameter to the OpenAI Realtime adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Add system_instruction parameter to the Nova Sonic adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Remove the service-side fallback logic, as the adapter now handles
resolution.
Pass self._system_instruction_from_init to the adapter's
get_llm_invocation_params(), which calls _resolve_system_instruction()
to prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Also fix the reconnect check to only reconnect when the resolved
system instruction actually differs from what the initial connection
used, avoiding unnecessary reconnects.
Update documented event signatures to include transcript argument
where the code actually passes it. Remove stale on_speech_started
and on_utterance_end entries that were never registered.
Fires after the final transcript is pushed in both Pipecat and
AssemblyAI turn detection modes, giving users a reliable hook
that arrives after all transcript frames. Matches the existing
Deepgram Flux on_end_of_turn pattern.
The previous default (meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo) is
no longer available as a serverless Together.ai model and now requires a
custom deployment. The new default is openai/gpt-oss-20b, one of
Together's recommended models for small & fast use-cases.
OpenAI-compatible services that don't support the "developer" message
role can now set supports_developer_role = False on the service class.
BaseOpenAILLMService passes this as convert_developer_to_user to the
adapter, which converts developer messages to user messages before
sending them to the API. Applied to Cerebras and Perplexity.
Also removes the now-redundant developer→user conversion step from
PerplexityLLMAdapter (handled by the parent adapter via the flag).
_system_instruction_from_init was being set from the deprecated
`system_instruction` constructor parameter instead of
`self._settings.system_instruction`, so system instructions provided
via settings were silently ignored.
OpenAI Realtime, Grok Realtime, and AWS Nova Sonic adapters now convert
"developer" role messages to "user" (consistent with all other non-OpenAI
adapters). Previously these messages were silently dropped. Adds starter
unit tests for all three realtime adapters.
These messages are developer instructions to the assistant (e.g. "Please
introduce yourself to the user"), not simulated user input. The
"developer" role is semantically correct for this purpose.
Developer messages are now always converted to "user" in non-OpenAI
adapters, never promoted to the system instruction. This removes an
inconsistency where adding an unrelated message to context would change
whether a developer message got promoted.
Simplifications:
- Rename _extract_initial_system_or_developer → _extract_initial_system
- Return Optional[str] instead of Tuple (role is always "system")
- Drop initial_context_message_role from _resolve_system_instruction
- Drop system_role fields from all ConvertedMessages dataclasses
When the only message in context was a system message,
_extract_initial_system_or_developer would convert it to "user" (to
prevent empty history) without warning about the conflict with
system_instruction. Now warns inline before converting, with a message
explaining both the conflict and the user-role conversion.
Two goals:
1. Centralize system_instruction vs context system message resolution into
the LLM adapters. This eliminates duplication between in-pipeline and
out-of-band (run_inference) code paths across ~16 locations in service
llm.py files.
2. Add support for "developer" role messages in conversation context, which
is facilitated by the above centralization.
Shared helpers on BaseLLMAdapter:
- _extract_initial_system_or_developer: extracts/converts messages[0]
based on role and whether system_instruction is provided
- _resolve_system_instruction: warns on conflicts between system_instruction
and context system messages, returns the effective instruction
Developer message handling (new):
- Non-OpenAI adapters: an initial "developer" message is promoted to the
system instruction when no system_instruction is provided; otherwise it
is converted to "user". Subsequent "developer" messages are always
converted to "user". No conflict warning is emitted for developer
messages (unlike "system" messages).
- OpenAI adapter: "developer" messages pass through in conversation
history without triggering conflict warnings.
- OpenAI Responses adapter: "developer" messages are kept as "developer"
role (same as "system", which is also converted to "developer" for the
Responses API).
Other behavior changes:
- Gemini: "initial" system message detection now checks messages[0] only
(previously searched anywhere in the list)
- Bedrock: a lone system message is now converted to "user" instead of
being extracted to an empty message list (matches existing Anthropic
behavior)
Route LLMFullResponseEndFrame through the serialization queue instead
of pushing it directly downstream when push_text_frames is enabled.
This ensures the frame is emitted only after the audio context is
fully drained, preserving correct ordering relative to TTSTextFrames.
Previously, the final sentence TTSTextFrame would arrive at the
LLMAssistantAggregator after LLMFullResponseEndFrame, causing it to
be dropped from the conversation context (especially with RTVI text
input where no subsequent interruption would flush the orphaned text).
Cancel the deferral timeout task and clear the pending EndFrame during
disconnect, which could otherwise be left dangling after a
CancelFrame-triggered shutdown.
When an interruption arrives before any LLM text reaches run_tts, the
turn context ID exists but was never registered via create_audio_context.
Calling flush_audio for this unregistered context sends a message to the
provider (e.g. ElevenLabs) with a context_id it has never seen, which
implicitly creates a server-side context that is never closed. After
enough rapid interruptions these phantom contexts accumulate and exceed
the providers limit (ElevenLabs: 5 simultaneous contexts, 1008 policy
violation).
Guard the flush call with audio_context_available so it only fires when
the context was actually opened.
Fixes#4114
When an EndFrame arrives while the bot is mid-response, it is deferred
until turn_complete is received. If turn_complete never arrives, the
EndFrame gets stuck forever and the pipeline hangs indefinitely.
Add a 30-second timeout: if turn_complete hasn't arrived by then, the
deferred EndFrame is released anyway with a warning log. The timeout
is cancelled if turn_complete arrives normally.
We observed a case where a deferred EndFrame was never released in
Gemini Live, causing the pipeline to hang indefinitely. The EndFrame
deferral mechanism waits for _handle_msg_turn_complete to set
_bot_is_responding back to False, but turn_complete messages were only
processed if they also contained usage_metadata. If Gemini ever sent
turn_complete without usage_metadata, the message would be silently
dropped and the deferred EndFrame would never be released.
Now turn_complete is always handled regardless of usage_metadata
presence, with usage_metadata processing only when available.
Note: we have not actually observed a turn_complete without
usage_metadata in practice, so this is a theoretical fix for the
EndFrame-deferral hang. The actual root cause of the observed hang
may lie elsewhere.
- Route audio through audio contexts (append_to_audio_context) instead of
pushing frames directly, enabling proper turn management and interruptions
- Add push_stop_frames and push_start_frame so the base class handles
TTSStartedFrame/TTSStoppedFrame lifecycle
- Remove manual context_id tracking (self._context_id) in favor of
get_active_audio_context_id()
- Don't call remove_audio_context on "complete" — Smallest sends one
per request, not per turn; let the base class timeout handle cleanup
- Guard v2-only params (consistency, similarity, enhancement) so they
aren't sent to lightning-v3.1
- Remove request_id from request payload (not a documented request field)
- Add flush_audio override to send flush to WebSocket
Adds SmallestTTSService, a WebSocket-based TTS service using Smallest AI's
Lightning v3.1 model. Follows current Pipecat service conventions:
- SmallestTTSSettings dataclass with runtime-updatable settings (voice,
language, speed, etc.)
- Reconnects on model change; keepalive every 30s to prevent idle timeout
- TTS settings default to None so the API applies its own defaults
- Model enum: SmallestTTSModel.LIGHTNING_V3_1
Includes a foundational example (07zl-interruptible-smallest.py) using
Deepgram STT + Smallest TTS + OpenAI LLM.
STT integration will follow in a separate PR once the hallucination/finalize
behaviour is resolved.
Made-with: Cursor
Gets Gemini 3 support to the point where it works with:
- The "legacy" pattern from the previous (removed) 26- example
- inference_on_context_initialization=True (the default)
- inference_on_context_initialization=False
Add `domain` field to AssemblyAISTTSettings to support AssemblyAI's
streaming API `domain` query parameter, enabling specialized recognition
modes like Medical Mode (`medical-v1`).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add warnings in SpeechTimeoutUserTurnStopStrategy and
TurnAnalyzerUserTurnStopStrategy when stop_secs differs from the
recommended default (0.2s) or when stop_secs >= STT p99 latency,
which collapses the STT wait timeout to 0s. Document the stop_secs=0.2
assumption in stt_latency.py.
Send a setup message with client_req_id before the first text message
for each context, matching Gradium multiplexing protocol. This allows
Gradium to associate each session with its setup configuration when
using close_ws_on_eos=False.
The AudioHook protocol requires every message to carry a `parameters`
object. `_create_message` conditionally included it only when parameters
were truthy, so pong responses and closed responses without
outputVariables were sent without the field.
Clients that validate message structure (including the Genesys reference
implementation) rejected these messages, which broke server sequence
tracking and prevented outputVariables from reaching the Architect flow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Align with the OpenTelemetry GenAI semantic convention
gen_ai.system_instructions for system prompts. The old "system"
attribute name was unrelated to gen_ai.system (which is for
provider name).
Replace adapter-based extraction in traced_llm with direct reads from
_settings.system_instruction (priority) and context messages (fallback).
The old approach had three bugs: signature mismatch with Anthropic
adapter, key name inconsistency, and unnecessary overhead from full
message/tools conversion.
Also deduplicate the system instruction in spans -- it was appearing as
both "system" and "param.system_instruction".
These services were pushing audio frames directly via push_frame() in their
WebSocket receive loops, bypassing the base TTSService audio context
serialization queue. This causes incorrect frame ordering and broken
interruption handling.
Changes per service:
- Fish Audio: use append_to_audio_context(), replace _handle_interruption
with on_audio_context_interrupted()
- LMNT: use append_to_audio_context(), remove redundant push_frame override
- Neuphonic: use append_to_audio_context(), remove redundant push_frame and
process_frame overrides (base class handles pause/resume)
- Rime NonJson: use append_to_audio_context(), remove redundant push_frame
override
The LLMContext format already matches the expected Responses API
shape for input_audio, so no adapter conversion will be needed
once OpenAI enables support.
The API-provided full model name is more specific than the
user-provided model name (e.g. includes version/snapshot details).
Reorder the lookup in _get_model_name and add a comment where the
Responses service sets the field.
The override would re-add `instructions` after the adapter had
intentionally converted it to a developer message for empty contexts.
Added a regression test.
Rewrite docstrings to more clearly explain what SyncParallelPipeline
does: hold all output until every parallel branch finishes, so frames
produced in response to a single input are released together.
Adds a FrameOrder enum with ARRIVAL (default, existing behavior) and
PIPELINE (pushes frames in pipeline definition order). This lets callers
guarantee output ordering between parallel pipelines — e.g. ensuring
image frames precede audio frames — without needing a separate reordering
processor downstream.
Updates the 05-sync-speech-and-image example to use FrameOrder.PIPELINE,
removing the ImageBeforeAudioReorderer class entirely.
Add a processor after SyncParallelPipeline that ensures each image frame
precedes its corresponding TTS audio frames. SyncParallelPipeline batches
them together but doesn't guarantee branch ordering. The reorderer detects
when TTS frames arrive before their image (via context_id tracking) and
holds them until the image arrives.
Also rename ImageAudioSync to MarkImageForPlaybackSync for clarity.
Add a `sync_with_audio` field to `OutputImageRawFrame` that routes image
frames through the audio queue in the output transport, ensuring images
are only displayed after all preceding audio has been sent. This enables
proper audio/image synchronization in pipelines like the calendar month
narration example.
Update the 05-sync-speech-and-image example to use an `ImageAudioSync`
processor that sets this flag on image frames.
The FrameProcessor two-queue architecture processes SystemFrames and
non-SystemFrames on separate concurrent async tasks. Both paths called
SyncParallelPipeline.process_frame(), which used the same per-pipeline
sink queues. A SystemFrame's wait_for_sync could steal frames from a
concurrent non-SystemFrame's wait_for_sync, corrupting synchronization
and stalling the pipeline.
This was triggered by the auto-embedded RTVI processor (added in
v0.0.101) which floods OutputTransportMessageUrgentFrame SystemFrames
through the pipeline during LLM responses.
Fix: SystemFrames (except EndFrame) now take a fast path — passed
through internal pipelines and pushed downstream directly without
touching the sink queues or drain logic. EndFrame retains the full
drain behavior as a lifecycle frame.
- Add WakePhraseUserTurnStartStrategy for gating interaction behind wake
phrase detection, with timeout and single_activation modes
- Add default_user_turn_start_strategies() and
default_user_turn_stop_strategies() helper functions
- Deprecate WakeCheckFilter in favor of the new strategy
- Extend ProcessFrameResult to stop strategies for short-circuit evaluation
- Fix MinWordsUserTurnStartStrategy including filtered text in output
* Fix empty user transcription causing spurious interruption in Nova Sonic
Skip _report_user_transcription_ended() when _user_text_buffer is empty,
which happens when the initial prompt is text-only. Previously, an empty
TranscriptionFrame was pushed upstream, triggering a chain reaction:
on_user_turn_stopped → UserStartedSpeakingFrame → interruption →
premature BotStoppedSpeaking → multiple response start/stop cycles.
* Improve TextFrame and assistant end of turn logic
Now, SPECULATIVE text results are used to push the LLMTextFrame,
AggregatedTextFrame, and TTSTextFrame. Additionally, the TTSTextFrames
are push at the end of the corresponding audio segment.
* Remove BotStoppedSpeakingFrame fallback from Nova Sonic
Now that assistant response end is detected directly from Nova Sonic
contentEnd events (END_TURN and INTERRUPTED), the BotStoppedSpeakingFrame
handler is no longer needed. Inline the cleanup logic in reset_conversation.
* Remove duplicate reconnection logic from Gradium STT
The _receive_messages method had its own while-True reconnect loop,
duplicating the reconnection handling already provided by
WebsocketService._receive_task_handler (exponential backoff, max
retries, error reporting). Flatten to just the inner message loop
and let the base class handle reconnection.
* Align Gradium STT VAD handling with base class patterns
Replace the process_frame override with a _handle_vad_user_stopped_speaking
override, which is the proper hook provided by STTService. Move
start_processing_metrics() into run_stt (matching Gladia's pattern).
Remove unused FrameDirection and VADUserStartedSpeakingFrame imports.
* Add transcript aggregation delay after flushed to capture trailing tokens
Gradium flushed response can arrive before all text tokens have been
delivered. Instead of finalizing immediately on flushed, start a short
timer (100ms) that allows trailing tokens to accumulate before pushing
the final TranscriptionFrame.
* Add changelog for PR #4066
* Change default encoding to pcm_16000
* Decouple encoding from sample_rate in Gradium STT
The encoding parameter now takes just the base type (pcm, wav, opus)
and the sample rate is derived from the pipeline audio_in_sample_rate,
assembled dynamically via input_format_from_encoding(). This fixes the
mismatch where SAMPLE_RATE=24000 was passed to the base class while
encoding defaulted to pcm_16000.
Set store=False in Responses API calls since we send full conversation
history as input items and don't use previous_response_id.
Add 5 run_inference tests for OpenAIResponsesLLMService using real
LLMContext and adapter (only HTTP client mocked).
Uses real LLMContext and adapter (only HTTP client is mocked) to test
basic inference, client exception propagation, system_instruction
override, empty context fallback, and max_tokens override.
Add OpenAIResponsesLLMService using the Responses API, with a dedicated
adapter that converts LLMContext messages to Responses API input items
(system→developer, tool_calls→function_call, tool→function_call_output,
multimodal content conversion, and tools schema flattening).
- New adapter: open_ai_responses_adapter.py
- New service: openai/responses/llm.py
- Examples: 07-interruptible and 14-function-calling variants
- 19 unit tests for adapter conversion logic
- Eval entries for both examples
List-valued settings like keyterm, keywords, search, redact, and replace
were being converted to strings before being passed to the SDK connect()
method. The SDK expects lists so its encode_query can produce repeated
query params (keyterm=a&keyterm=b).
When an LLM returns a tool call with no arguments (arguments=null in
the streaming chunks), the tool call is silently dropped because:
1. `tool_call.function.arguments` is None, so nothing is accumulated
and `arguments` stays as "" (empty string)
2. `if function_name and arguments:` treats "" as falsy, skipping the
entire tool call execution
OpenAI always sends arguments="{}" even for parameterless tools,
masking this bug. But vLLM, Ollama, and other OpenAI-compatible
providers may omit arguments entirely when the tool schema has no
required parameters, causing tool calls to be silently ignored.
Fix: check only `function_name` (not `arguments`) and default empty
arguments to "{}" so `json.loads` produces an empty dict. Apply the
same fallback for intermediate tool calls in multi-tool responses.
Raw strings like "de-DE" passed as the language parameter to TTS/STT services
were bypassing the Language enum resolution logic, causing silent failures
(e.g. ElevenLabs expects "de" not "de-DE"). Now raw strings are first converted
to Language enums so they go through the same resolve_language() path, with a
warning logged for unrecognized strings.
Reset stop strategies at turn start (not just turn stop) so that late
transcriptions arriving between turns do not leave stale _text that
causes premature stops on the next turn. Also cancel pending timeout
tasks in reset() for both SpeechTimeout and TurnAnalyzer strategies.
Expose enable_dialout as a configure() parameter (default False) so
dial-out examples can opt in without needing to build DailyRoomProperties
manually.
Narrow misleading Optional type hints on parameters that never accept
None, extract the duplicated token_exp_duration * 60 * 60 calculation,
remove unnecessary forward-reference quotes on DailyMeetingTokenProperties,
and clarify why enable_dialout is explicitly set to False.
Handle Daily's on_dtmf_event callback, convert it to an
InputDTMFFrame pushed into the input transport. Also add __str__
methods to InputDTMFFrame and OutputDTMFFrame for better logging.
Refactor language_to_soniox_language to use resolve_language + LANGUAGE_MAP
pattern consistent with other services. Fix resolve_language fallback to use
str(language) instead of language.value so plain strings don't crash.
The Inworld WS TTS plugin previously relied on the base TTS service's 3-second AUDIO_CONTEXT_TIMEOUT to detect when audio was done, then sent close_context in on_audio_context_completed. This added unnecessary latency before TTSStoppedFrame was emitted.
The original implementation likely borrowed this idea from the 11labs' impelementation. But it's likely better to mirror the Cartesia plugin pattern where on_audio_context_completed is a no-op because the server signals completion directly.
Now close_context is sent in on_turn_context_completed (right after flush_context), so the server responds with contextClosed immediately after the last audio byte. The existing receive handler already calls remove_audio_context on contextClosed, which exits the audio context handler cleanly.
The base_url parameter previously forced wss:// and https:// schemes,
breaking air-gapped or private deployments that need ws:// or http://.
Extract URL derivation into _derive_deepgram_urls() helper that respects
the developers scheme choice while deriving the paired WebSocket and
HTTP URLs the Deepgram SDK requires.
Closes#4019
Now that the base TTSService and STTService handle Language enum
conversion at init time, subclasses no longer need to convert in their
own __init__ methods. Remove conversion calls from hardcoded defaults,
params paths, and deprecated direct arg paths across 22 service files.
Services just pass raw Language enums and let the base class convert
via language_to_service_language() polymorphic dispatch.
When a Language enum (e.g. Language.ES) is passed via
settings=Service.Settings(language=Language.ES), it gets stored as-is
without conversion to the service-specific code. The base
_update_settings() handles this for runtime updates, but at init time
apply_update() copies the raw enum. This causes API errors because
services send the unconverted enum value.
Add language conversion in TTSService.__init__ and STTService.__init__
after super().__init__(), using the subclass language_to_service_language()
via normal method resolution.
Both analyzers are superseded by LocalSmartTurnAnalyzerV3. Added
deprecation warnings and docstring notices following the existing
pattern from LocalSmartTurnAnalyzer.
Perplexity appears to have statefulness within a conversation, so
converting a system message to "user" in one call and then back to
"system" in the next (after more messages are appended) causes API
errors. Remove the trailing system→user conversion entirely — if the
context only has system messages, the API call will fail but the
mistake will be caught right away.
Add test exercising the step 3 ordering where stripping a trailing
assistant exposes a system message that then gets converted to user.
Move the reasoning about when a trailing system message can occur
into the docstring.
Perplexity allows multiple initial system messages, so don't merge them.
Instead, skip system-system pairs during the consecutive same-role merge
step. Broaden the trailing message fix to convert any trailing system
message to user (not just a lone system message), so contexts with only
system messages don't fail.
Perplexity's API is stricter than OpenAI about conversation history:
- Requires strict alternation between user/tool and assistant messages
- Disallows system messages except as the initial message
- Requires the last message to be user or tool
The new adapter transforms messages before sending to satisfy all three
constraints: merging consecutive initial system messages, converting
non-initial system to user, merging consecutive same-role messages, and
removing trailing assistant messages.
Also adds dual-system-instruction warnings to Cerebras, Fireworks,
Mistral, Perplexity, and SambaNova services (matching the existing
BaseOpenAILLMService pattern), and updates the warning text in
BaseOpenAILLMService to be more descriptive.
EndTaskFrame and StopTaskFrame are now ControlFrames instead of
SystemFrames, so they flow through the pipeline and queue behind
pending work. This prevents races where EndFrame could overtake
in-flight frames (e.g. function call responses).
CancelTaskFrame and InterruptionTaskFrame remain SystemFrames
(via new TaskSystemFrame base): since they need immediate propagation.
The sink now catches EndTaskFrame, StopTaskFrame and CancelTaskFrame
downstream and re-queues it upstream to the task, ensuring the full
pipeline drains before shutdown begins.
Wait for _audio_context_task to finish draining the contexts queue
before canceling _stop_frame_task, ensuring all pending audio
contexts are processed during shutdown.
Flush buffered frames before pushing the synchronization frame so
downstream processors see the buffered frames first. Switch to a
while-loop with pop(0) so frames added to the buffer during flush
are also drained.
Add convenience parameters to configure() so callers don't need to
manually construct DailyRoomProperties/DailyRoomSipParams for common
SIP provider and geo configuration.
When `service` is set and doesn't match, the service forwards the frame instead of consuming it. This allows targeting a specific service when multiple services of the same type exist in the pipeline.
Align Simli with HeyGen/Tavus by extending AIService instead of
FrameProcessor and using a ServiceSettings dataclass. InputParams is
preserved but deprecated; its fields are promoted to direct init params.
Lifecycle handling moves to start()/stop()/cancel() methods.
The default model for OpenAILLMService and AzureLLMService was still set
to gpt-4o. Restored it to gpt-4.1. Also, removed hardcoded gpt-4o/gpt-4o-mini
model references from examples so they pick up the new default.
Move the warning helper into AIService as _warn_init_param_moved_to_settings.
It now uses type(self).__name__ to produce messages like
"Use settings=AnthropicLLMService.Settings(model=...)" instead of the raw
settings class name "AnthropicLLMSettings(model=...)". Callers no longer need
to pass the settings class explicitly.
Replace direct references to settings class names (e.g. `FooSettings`) with the nested `Settings` alias form throughout all 87 service files:
- Type annotations: `Settings`
- Runtime code: `self.Settings`
- Docstrings: `ServiceClass.Settings`
- Cross-file inheritance: `ParentService.Settings`
This makes the `Settings` alias the canonical way to reference a service's settings, keeping only the class definition and alias assignment as the remaining hits for each raw settings class name.
* Add ServiceSwitcherStrategyFailover for automatic error-based service switching
Introduce a strategy hierarchy: ServiceSwitcherStrategy (base) →
ServiceSwitcherStrategyManual (handles ManuallySwitchServiceFrame) →
ServiceSwitcherStrategyFailover (adds error-based failover). ServiceSwitcher
now defaults to ServiceSwitcherStrategyManual with strategy_type optional.
Non-fatal ErrorFrames are forwarded to the strategy via handle_error().
* Move metadata request into _set_active_if_available
Requesting metadata is part of making a service active, so it belongs
alongside setting _active_service and firing on_service_switched. This
removes the duplicate queue_frame calls from ServiceSwitcher push_frame
and process_frame.
DailyTransportClient.start_transcription() accepted a settings
parameter but always used self._params.transcription_settings
instead, silently discarding any custom settings passed by callers.
Change transcription_settings to Optional[DailyTranscriptionSettings]
defaulting to None. The default settings are now applied at the call
site when transcription is started, and start_transcription receives
the serialized settings dict directly.
Use CustomVideoSource/CustomVideoTrack for the default camera output instead of
VirtualCameraDevice, mirroring how audio already uses CustomAudioSource/CustomAudioTrack.
Add support for custom video destinations (register_video_destination, add/remove
custom video tracks, routing in write_video_frame) so multiple video tracks can be
published simultaneously.
* Add system_instruction parameter to run_inference
Allow callers to provide a custom system instruction directly when calling
run_inference, without having to construct provider-specific context objects.
For OpenAI, the instruction is prepended as a system message (preserving
existing messages). For Anthropic, Google, and AWS Bedrock, it overrides the
single system field with a warning when an existing system instruction is
present in the context.
* Use system_instruction parameter in _generate_summary
Pass the summarization prompt via run_inference's system_instruction
parameter instead of embedding it as a system message in the context.
* Add changelog for #3968
Constructor/settings system_instruction now takes priority over the
context system message. Previously the context value would overwrite
the constructor value on every call. Warn when both are set.
After interruption, both _playing_context_id and _turn_context_id are
None. If a subclass calls append_to_audio_context(None, frame), the
recovery path matches (None == None) and creates a bogus audio context
that blocks the handler from ever processing the real context.
Early-return when context_id is falsy to prevent this.
The Deepgram TTS service was bypassing pipecats audio context management
system, pushing audio frames directly via push_frame() instead of routing
them through append_to_audio_context(). This caused stale audio to leak
into the pipeline after interruptions and missed ordered playback
guarantees.
- Route audio frames through append_to_audio_context() with context
availability checks to discard stale post-interruption frames
- Handle Flushed responses by appending TTSStoppedFrame and removing
the audio context to signal completion
- Replace _handle_interruption override with on_audio_context_interrupted
hook (the recommended pattern used by ElevenLabs and Cartesia)
- Remove redundant process_frame override that caused double-flush
(base class already flushes via on_turn_context_completed)
- Remove redundant start_tts_usage_metrics call (base class handles
aggregated usage metrics)
Make `region` optional so users can provide only `private_endpoint`.
Raise ValueError if neither is provided, and warn if both are given
(private_endpoint takes priority).
Enhanced the logic for extracting the system message in the traced_llm decorator to support LLMContext via adapter and handle exceptions gracefully. This improves compatibility with different context types and ensures better tracing information.
SpeechConfig does not accept both `region` and `endpoint` simultaneously —
they are mutually exclusive. The previous code always passed both, which
raises ValueError when a user supplies a private_endpoint URL. Now we
conditionally pass either `endpoint` or `region`, never both.
Update the tests in test_integration_unified_function_calling.py to not specify particular models but instead just use service defaults (the tests shouldn't be model-dependent anyway)
Forward the on_summary_applied event from the internal summarizer to
the aggregator so users can listen for it without accessing private
members. Update summarization examples to use the new public event.
_on_call_state_updated passes (self, state) to _call_event_handler,
but _run_handler already prepends self when invoking the handler.
This causes handlers to receive 3 positional arguments instead of 2,
making the on_call_state_updated event unusable.
This aligns with how _on_first_participant_joined correctly passes
only the data arg without self.
Turn completion instructions were being injected as a system message in
the LLM context, which caused warning spam when system_instruction was
also set, did not persist across full context updates, and broke LLMs
that do not support consecutive system messages.
Instead, compose the turn completion instructions into the LLM service
system_instruction field. This is managed via _base_system_instruction
which stores the original value for restoration when turn completion is
disabled.
- mcp_service.py: remove unnecessary try/except around debug log,
use len(available_tools.tools) to match actual iteration target
- bedrock_adapter.py, aws/llm.py: add AttributeError to except tuple
to handle None content (previously caught by bare except)
Wire up the existing settings update infrastructure to send a Configure
WebSocket message when keyterm, eot_threshold, eager_eot_threshold, or
eot_timeout_ms change mid-stream, avoiding a full reconnect.
Add a `Settings` class-level alias on every STT, LLM, TTS, image,
vision, and video service class pointing to its settings dataclass.
This lets developers discover the right settings class via the service
class itself (e.g. `GoogleSTTService.Settings(...)`) without needing
to know or import the separate settings class name.
Adds the explicit "no params object" step 3 comment to all
LLM services that skip from step 2 to step 4 in their
settings initialization sequence, matching the pattern
established in services that do have a params object.
Replace references to undefined `_params` with `self._settings` for
language and VAD config. Add missing `system_instruction` to default
settings to satisfy validate_complete(). Remove redundant line that
read language from the deprecated `params` arg.
- Speechmatics: move config build after super().__init__ and settings
delta so turn_detection_mode (e.g. ADAPTIVE) takes effect
- Google STT: fix example passing bare Language enum instead of list
- Google TTS: add missing explicit defaults for all custom settings fields
- Soniox: fix accidental tuple wrapping of STT service in example
- Speechmatics examples: fix system->user role in kick-off messages
- Deepgram Flux: move tag from settings to __init__ (billing metadata)
- ElevenLabs STT: default tag_audio_events to None (use API default)
- Fal STT: simplify language default handling
- Google TTS: rename GoogleStreamTTSSettings to GoogleTTSSettings
Use "AI service" language instead of listing specific types, add
ServiceSettings as a fallback for direct AIService subclasses, and
clarify delta mode description with a concrete frame example.
The audio fields (sample rates, sample sizes, channel counts) on the deprecated `Params` class had no non-deprecated equivalent. This adds an `AudioConfig` class and `audio_config` init arg so users can specify audio configuration without relying on the deprecated `params` parameter.
Move `session_properties` into `GrokRealtimeLLMSettings`, making `settings` the canonical way to configure Grok Realtime — matching the pattern used across the rest of the codebase. The `session_properties` init arg is now deprecated in favor of `settings=GrokRealtimeLLMSettings(session_properties=...)`.
`system_instruction` is synced bidirectionally between the top-level settings field and `session_properties.instructions`, with top-level taking precedence on conflict. (Unlike OpenAI Realtime, Grok's `SessionProperties` has no `model` field, so no model sync is needed.)
Replace the monolithic rtvi.py with a proper package split by concern
protocol version:
- models_v0.py: deprecated pre-1.0 Pydantic models
- models_v1.py: current RTVI protocol v1 message models
- frames.py: RTVI pipeline frame dataclasses
- observer.py: RTVIObserver and RTVIObserverParams
- processor.py: RTVIProcessor (now lean, imports from submodules)
- __init__.py: re-exports full public API for backward compatability
Move `session_properties` into `OpenAIRealtimeLLMSettings`, making `settings` the canonical way to configure OpenAI Realtime — matching the pattern used across the rest of the codebase. The `session_properties` init arg is now deprecated in favor of `settings=OpenAIRealtimeLLMSettings(session_properties=...)`.
`model` and `system_instruction` are synced bidirectionally between the top-level settings fields and `session_properties.model`/`.instructions`, with top-level taking precedence on conflict.
Add `system_instruction=None` to `default_settings` for OpenAIRealtimeLLMService, GrokRealtimeLLMService, UltravoxRealtimeLLMService, AWSNovaSonicLLMService (Azure inherits from OpenAI), and OpenAIRealtimeBetaLLMService (Azure Beta inherits from OpenAI Beta).
Deprecate `system_instruction` init arg in AWSNovaSonicLLMService in favor of `settings=AWSNovaSonicLLMSettings(system_instruction=...)`. Use `self._settings.system_instruction` directly instead of storing a separate `self._system_instruction`.
Deprecation of `params` and `session_properties` in favor of `settings` for realtime services will be tackled in future work.
Add `system_instruction` field to `LLMSettings` so it is runtime-updatable via settings.
For Google (GoogleLLMService, GoogleVertexLLMService), deprecate the init-time arg since it was already shipped. For Anthropic, AWS Bedrock, and OpenAI, remove the init-time arg entirely since it was never shipped.
Still need to handle realtime services (OpenAI Realtime, Grok Realtime, Gemini Live).
Broaden the "Dynamic Settings Updates" section into "Service Settings"
covering the complete settings pattern: defining a Settings subclass,
wiring it into __init__ with defaults + apply_update, and distinguishing
init-only config from runtime-updatable fields.
- Add dedicated Settings subclasses to 20 LLM services that were
borrowing parent Settings classes (e.g. AzureLLMSettings,
GroqLLMSettings) so users don't need cross-module imports
- Fix field defaults to NOT_GIVEN in BaseWhisperSTTSettings,
OpenAIRealtimeSTTSettings, and NvidiaSegmentedSTTSettings for
delta-mode safety
- Fix incomplete default_settings in AWS, Cartesia, ElevenLabs,
Fish, and Whisper services so validate_complete() passes
- Add auto-discovered tests that verify all Settings classes default
to NOT_GIVEN (delta safety) and all services initialize with
complete settings (store completeness)
Move output_container, output_encoding, output_sample_rate out of
CartesiaTTSSettings into plain instance attributes since they cannot
change at runtime without breaking the audio pipeline. Remove deprecated
speed/emotion fields and their dead references in _build_msg() and
run_tts(). Remove the from_mapping override that only existed to
destructure those now-removed output format fields.
Update all ~192 call sites across 84 service files to pass class references
(e.g. `CartesiaTTSSettings`) instead of string names (`"CartesiaTTSSettings"`)
to `_warn_deprecated_param()`. This enables better IDE refactoring support.
Also fix `from_mapping` return type annotations in 5 settings subclasses to
use `typing.Self` instead of forward reference strings.
ServiceSettings types were introduced for runtime updates via ServiceUpdateSettingsFrame, but there was tension between init-time and runtime APIs: overlapping-but-different InputParams vs ServiceSettings classes, and runtime-updatable fields like `model` and `voice` scattered as direct init args rather than living in a settings object. This unifies them so developers use the same settings type at both init and runtime, improving ergonomics and consistency.
Every concrete AIService subclass (LLM, TTS, STT, ImageGen, Vision, Video) now accepts a `settings` parameter for runtime-updatable config. Old init args (`model`, `voice_id`, `params`/`InputParams`) still work but emit DeprecationWarnings pointing to the new API. When both are provided, `settings` takes precedence. Leaf classes emit warnings; base classes do not, avoiding double warnings in inheritance chains.
system_instruction from the constructor always takes precedence. A
warning is now logged when the context also contains a system message
so users can spot the conflict.
Add vad_threshold parameter to AssemblyAIConnectionParams to support
voice activity detection threshold configuration for the u3-rt-pro model.
This parameter allows users to align AssemblyAI's VAD threshold with
their external VAD systems (e.g., Silero VAD) to avoid the "dead zone"
where AssemblyAI transcribes speech that the external VAD hasn't
detected yet, which can delay interruption handling.
- Range: 0.0 to 1.0 (lower = more sensitive)
- Default: 0.3 (API default when not sent)
- Only applicable to u3-rt-pro model
- Automatically included in WebSocket query parameters
Recommended usage: Set vad_threshold to match your VAD's activation
threshold (e.g., both at 0.3) for optimal performance.
Widen version ranges for stable packages (anthropic, azure, deepgram,
groq, livekit, nvidia-riva-client, fastapi, ormsgpack, opentelemetry,
faster-whisper) and add upper bounds to previously uncapped packages
(hume, pyjwt, livekit-api, camb).
Replace CartesiaHttpTTSService's internal use of the Cartesia SDK with
direct aiohttp calls, accepting an optional aiohttp_session parameter.
Replace fal-client SDK calls in FalSTTService and FalImageGenService
with direct HTTP to bypass the SDK's aggressive retry/backoff logic
that caused significant latency regressions.
Widen version ranges for stable packages (aiofiles, docstring_parser,
onnxruntime) while adding upper bounds to previously uncapped packages
(transformers, numba, wait_for2). Bump soxr to 1.0.0 and pyloudnorm
to 0.2.0. Move silero extra to empty since onnxruntime is now a core dep.
Add appropriate log levels to dial-in/dial-out, participant, transcription,
and recording event handlers. Move transcription error log from client
callback to transport handler to keep logging consistent at the transport
level.
The default function call timeout (10s) causes silent failures for
long-running tools. This adds an optional timeout_secs parameter to
register_function() and register_direct_function() so individual tools
can override the global function_call_timeout_secs. The warning message
now mentions both the per-tool and global timeout options.
Allow either threshold to be set to None to cleanly disable that trigger,
instead of requiring users to set a very large number as a workaround.
At least one of the two must remain set (validated at construction time).
- u3-rt-pro: Does not set parameter (not used)
- universal-streaming models: Set to 1.0 to maintain fast response
- This ensures fast response time matches previous implementation
Replace the round-trip push_interruption_task_frame_and_wait() mechanism
with broadcast_interruption(), which pushes an InterruptionFrame both
upstream and downstream directly from the calling processor.
This eliminates race conditions (transcription arriving before the
InterruptionFrame comes back), swallowed-event timeouts (frame blocked
before reaching the sink), and the complexity of _wait_for_interruption
flag / queue bypass / frame.complete() obligations.
- Add broadcast_interruption() to FrameProcessor
- Deprecate push_interruption_task_frame_and_wait() (delegates to new method)
- Remove event field and complete() from InterruptionFrame/InterruptionTaskFrame
- Remove _wait_for_interruption flag and all special-case logic
- Remove frame.complete() calls in stt_mute_filter and llm_response_universal
- Update all 17 call sites to use broadcast_interruption()
- Update tests
Enables .model_dump() serialization for Pipecat Cloud collection.
All metrics now include start_time (Unix timestamp) for timeline
plotting alongside duration_secs.
Add per-service latency breakdown metrics alongside existing user-to-bot
latency measurement. When enable_metrics=True, the observer now emits an
on_latency_breakdown event with TTFB, text aggregation, and user turn
duration metrics collected between VADUserStoppedSpeakingFrame and
BotStartedSpeakingFrame.
- Add LatencyBreakdown dataclass with ttfb, text_aggregation,
user_turn_secs fields
- Accumulate MetricsFrame data during user→bot cycles
- Reset accumulators on InterruptionFrame to discard stale metrics
- Measure user_turn_secs from actual user silence (VAD timestamp -
stop_secs) to turn release (UserStoppedSpeakingFrame)
- Filter zero-value TTFB entries from startup metric resets
- Add frame deduplication using bounded deque + set pattern
- Update example 29 with latency breakdown display
The ServiceSettings refactor (PR #3714) changed self._settings from
dicts to dataclass subclasses, but tracing code still used .items(),
in containment, and subscript access, causing AttributeError on
every traced call. Use given_fields() for iteration and attribute
access for named fields.
Replace the PipelineSink detection in StartupTimingObserver with an
on_pipeline_started() callback from PipelineTask via TaskObserver.
This fixes premature report emission when using ParallelPipeline,
which has its own inner PipelineSinks per branch.
Use start_offset_secs (offset from StartFrame) on ProcessorStartupTiming
instead of a wall-clock timestamp. Reports keep a single start_time
anchor for dashboard visualization. Remove _mono_to_wall conversion.
Switch ProcessorStartupTiming, StartupTimingReport, and
TransportTimingReport from dataclasses to Pydantic BaseModel. Add
start_time (Unix timestamp) fields and wall clock conversion for
monotonic observer timestamps.
Add BotConnectedFrame (SystemFrame) pushed by SFU transports (Daily,
LiveKit, HeyGen, Tavus) when the bot joins the room. Replace the
on_transport_readiness_measured event with on_transport_timing_report
which includes both bot_connected_secs and client_connected_secs.
Introduce ClientConnectedFrame (SystemFrame) pushed by all transports
when a client connects. StartupTimingObserver uses this to measure
transport readiness — the time from StartFrame to first client
connection — via a new on_transport_readiness_measured event.
Azure TTS _handle_canceled was putting None (the normal completion
signal) into the audio queue for all cancellation reasons, so run_tts
treated errors identically to success—silently producing no audio.
Now error cancellations put an Exception marker in the queue, which
run_tts converts to an ErrorFrame.
Azure STT had no canceled event handler at all, so auth failures,
network errors, and rate-limit cancellations were invisible. Added
_on_handle_canceled which pushes an ErrorFrame upstream via push_error.
Fixespipecat-ai/pipecat#3892
Tracks how long each processor start method takes during pipeline
startup by measuring StartFrame arrive/leave deltas. Emits a timing
report via the on_startup_timing_report event and auto-logs a summary.
Internal pipeline processors are excluded from reports by default.
The Whisper-based ONNX model expects 16 kHz audio, but the
_predict_endpoint method had five hardcoded references to 16000 without
checking the actual pipeline sample rate. When running at 8 kHz (e.g.
Twilio telephony), audio was fed to the feature extractor at the wrong
rate, causing the model to perceive speech at 2x speed with shifted
formant frequencies and produce incorrect end-of-turn predictions.
Add automatic resampling via numpy interpolation before feature
extraction and replace all hardcoded sample rate values with a
_MODEL_SAMPLE_RATE constant. Also fix the WAV debug logger to write
files with the correct sample rate header.
Fixes#3844
Snapshot the blocks into immutable bytes and trim the buffer BEFORE any await, so no memoryview is
held across async yield points. Without this, a concurrent filter() or stop() call could try to
extend() or clear() the bytearray while a memoryview still exports it, raising "Existing exports
of data: object cannot be re-sized".
When filter_incomplete_user_turns is enabled and an LLMMessagesUpdateFrame
replaces the context via set_messages(), the turn completion instructions
system message was lost. This caused the LLM to stop emitting turn
completion markers. Re-inject the instructions after set_messages() to
fix this.
- Keep old parameter name for backward compatibility
- Add deprecation warning when old parameter is used
- Automatically migrate old parameter value to new min_turn_silence parameter
- Exclude deprecated parameter from WebSocket URL to avoid sending it to API
- New parameter takes precedence if both are set
- Update 13d-assemblyai-transcription.py to explicitly use u3-rt-pro model
- Update 55d-update-settings-assemblyai-stt.py to demonstrate keyterms updates instead of language updates
- Add helpful logging to show before/after keyterms boosting effect
- Use difficult names (Xiomara, Saoirse, Krzystof) to demonstrate boosting effectiveness
- Add "beta feature" note to custom prompt warning
- Rename min_end_of_turn_silence_when_confident parameter to min_turn_silence across all AssemblyAI code
- Update documentation, examples, and test files to use new parameter name
Allow pushing frames upstream through the pipeline by passing
FrameDirection.UPSTREAM. Downstream frames use the existing push queue,
while upstream frames are pushed directly from the pipeline sink.
The ServiceSettings refactor (PR #3714) changed self._settings from
dicts to dataclass subclasses, but tracing code still used .items(),
in containment, and subscript access, causing AttributeError on
every traced call. Use given_fields() for iteration and attribute
access for named fields.
The update-docs workflow intermittently failed with "Input required and
not supplied: token" because pull_request events from fork PRs don't
have access to repository secrets. Switching to pull_request_target
runs the workflow in the base repo's context, ensuring secrets are
always available. This is safe since the workflow only runs on
already-merged PRs.
- 07o-interruptible-assemblyai.py: Basic example using Pipecat VAD mode
- 07o-interruptible-assemblyai-stt.py: Advanced example using STT-controlled
turn detection with comprehensive documentation on u3-rt-pro features
(turn detection tuning, prompt-based enhancement, speaker diarization)
The request_finalize() method in STTService is synchronous (sets a flag),
but was being called with await in the VAD turn endpoint handling code.
This caused "object NoneType can't be used in 'await' expression" errors.
Also includes automatic formatting improvements from ruff.
Replace _rtvi_external instance variable with a local prepend_rtvi flag
since it is only used during __init__ to decide whether to prepend the
RTVIProcessor to the pipeline.
When the user places an RTVIProcessor inside their pipeline and provides
a custom RTVIObserver subclass in observers, PipelineTask correctly
detects both and logs "skipping default ones." However it then
unconditionally prepends self._rtvi to the pipeline, causing the
processor to appear twice in the frame chain.
Track whether the RTVIProcessor was found externally (inside the user
pipeline) vs created internally. Only prepend it when created internally.
Fixes#3867
- Remove unused Mapping import
- Remove info logs at initialization (connection params)
- Remove info logs in _handle_transcription (transcript details, text sent to LLM)
- Remove info logs in _build_ws_url (WebSocket URL and params)
- Keep debug logs (less verbose, appropriate for development)
u3-rt-pro guarantees SpeechStarted is always sent before transcripts,
so the fallback UserStartedSpeakingFrame broadcast is never needed.
This ensures clean pairing of UserStarted/StoppedSpeakingFrame:
- Start: Always from _handle_speech_started
- Stop: Always from _handle_transcription on final turn
- Add request_finalize() before sending ForceEndpoint in Pipecat mode
- Keep confirm_finalize() when receiving formatted finals in Pipecat mode
- Remove confirm_finalize() from STT mode (use finalized=True instead)
This follows Pipecat's two-step finalization pattern where request_finalize()
is called when sending a finalize request to the STT service, and
confirm_finalize() is called when receiving confirmation back.
Even when summarization_timeout is explicitly set to None, use a
DEFAULT_SUMMARIZATION_TIMEOUT (120s) fallback so the LLM call can
never hang indefinitely. Applied in both LLMService and the dedicated
LLM path in LLMContextSummarizer.
The dedicated LLM logic lived in LLMAssistantAggregator, creating two
code paths and requiring the aggregator to call a private LLMService
method. Move it into the summarizer which already owns the config and
summarization lifecycle, keeping the aggregator handler as a single-line
upstream push.
Adds a configurable summarization_timeout (default 120s) that cancels
summary generation if the LLM hangs. On timeout, an error result is
returned so _summarization_in_progress resets and future
summarizations are unblocked.
Adds an field to LLMContextSummarizationConfig that allows
routing summarization to a separate LLM service (e.g., Gemini Flash)
instead of the pipeline's primary model. This avoids paying for
expensive inference when compressing context in long-running sessions.
Allows applications to customize how the summary is wrapped when
injected into context (e.g., XML tags, custom delimiters) so system
prompts can distinguish summaries from live conversation.
Add deprecation warnings to start_processing_metrics() and
stop_processing_metrics() on FrameProcessorMetrics and FrameProcessor.
Mark ProcessingMetricsData as deprecated in docstring. All existing
behavior is preserved — the warnings inform users that these will be
removed in a future version.
Runs Claude Code Action after PRs merge to main when source files
in services/transports/serializers/processors/audio/turns/observers/pipeline
are changed. Creates a docs PR on pipecat-ai/docs with targeted edits
following the existing update-docs skill instructions.
- Fix speaker diarization: Add field alias for speaker_label → speaker
mapping in TurnMessage model
- Add warning for non-optimal min_end_of_turn_silence_when_confident
values (recommends 100ms for best latency)
- Improve max_turn_silence override warning message clarity
- Update custom prompt warning (remove 88% accuracy claim)
- Add comprehensive logging for debugging:
- Log final connection params after modifications
- Log WebSocket URL and parsed parameters
- Log speaker field in transcripts
- Log text sent to LLM with speaker formatting
- Support dynamic configuration updates via STTUpdateSettingsFrame:
- keyterms_prompt (when AssemblyAI API supports it)
- prompt
- max_turn_silence
- min_end_of_turn_silence_when_confident
- Add InterruptionFrame handling with stop_all_metrics()
- Add processing metrics (start/stop) at response boundaries
- Fix agent transcript handling for voice and text modalities:
- Voice mode: push LLMTextFrame (append_to_context=False) and
TTSTextFrame for deltas, skip duplicated final text
- Text mode: push LLMTextFrame with proper response lifecycle,
no TTSTextFrame (downstream TTS handles audio)
- Add output_medium parameter to AgentInputParams and OneShotInputParams
- Improve TTFB measurement using VAD speech end time
- Update example with user turn strategies and transcript events
- Add text-only output example (50a-ultravox-realtime-text.py)
Move the sentence vs token aggregation concern into text aggregators
so all text flows through them regardless of mode. This enables
pattern detection and tag handling to work in TOKEN mode.
- Add TextAggregationMode enum (SENTENCE, TOKEN) as the user-facing
TTS setting, separate from the internal AggregationType
- Add TOKEN mode support to Simple, SkipTags, and PatternPair aggregators
- Add text_aggregation_mode parameter to TTSService and all TTS subclasses
- Deprecate aggregate_sentences in favor of text_aggregation_mode
- Merge TTSService._process_text_frame() into a single codepath
Add TextAggregationMetricsData measuring the time from the first LLM
token to the first complete sentence, representing the latency cost of
sentence aggregation in the TTS pipeline.
- Wire up passing speed setting to Groq, even though only a value of 1.0 is supported today
- Update the 55y example to switch voices instead of changing speed
- Add a 55zn example to exercise runtime updates of Groq STT
The only (rare) exception—where a service directly still needs to directly call `self._sync_model_name_to_metrics()`—is when the model name need to be "pulled" from another field (or nested field) in settings up to settings.model on a settings update. This only occurs in Deepgram services, where we use the voice as the model name.
This change has the side-effect of bringing model name to metrics for a number of services that were accidentally omitting it before.
This was added in 31daa889e8, but only
to `RimeTTSService`, not to `RimeNonJsonTTSService. Bringing these
to parity means that users switching between the two, with the same
inputs, have more consistent vocalization behaviors.
Introduce a generic TurnMetricsData class for turn detection metrics,
replacing the service-specific SmartTurnMetricsData (now deprecated).
Add end-to-end processing time measurement to KrispVivaTurn, tracking
the interval from VAD speech-to-silence transition to model threshold
crossing. Consume metrics in the strategy _handle_input_audio path
so they are pushed immediately when fresh.
The Krisp VIVA SDK v1.8.0 requires a license key in globalInit(). Add
api_key parameter to KrispVivaSDKManager, KrispVivaTurn, and
KrispVivaFilter with fallback to KRISP_API_KEY env var. Maintain
backwards compatibility with older SDK versions by catching TypeError
and falling back to the old 3-arg signature.
Brings back the 6 development workflow skills (changelog, cleanup,
code-review, docstring, pr-description, pr-submit) that were moved
to pipecat-ai/skills, and adds a .claude-plugin/marketplace.json so
other pipecat-ai repos can install them. Updates README contributing
section with installation instructions.
Audio filters like RNNoise, KrispViva, and AIC return empty bytes while
buffering audio to accumulate their required frame size. These empty
frames were flowing downstream, causing misleading "Empty audio frame
received for STT service" warnings.
Skip the frame in BaseInputTransport when audio is empty, preventing
unnecessary processing in VAD and downstream processors.
Fixes#3517
These frames were falling through to the else branch and being pushed
downstream, unlike TranscriptionFrame which is explicitly consumed.
This aligns with how the assistant aggregator already filters them.
PR #3776 replaced manual timestamp tracking with stop_ttfb_metrics() in
the timeout handler, but without an end_time it uses time.time() at
timeout expiry—producing TTFB = timeout + stop_secs (~2.2s) instead of
the actual transcript latency.
Restore _last_transcript_time tracking so the timeout handler measures
to when the transcript arrived, and skip reporting if none arrived.
Before this change, settings updates were often not applied. For example, a `TTSUpdateSettingsFrame` queued while the bot was speaking would only have an effect at the end of the bot's reply, and any interruption before the end of the reply would "cancel" the update.
Remove bundled Claude Code skills (changelog, cleanup, code-review,
docstring, pr-description, pr-submit) that now live in
https://github.com/pipecat-ai/skills. Add a section to the README
with installation instructions. The update-docs skill remains as
it is specific to this repository.
This makes each service-specific field individually visible to the delta/update mechanism (`apply_update`, `given_fields`) and removes the need for complex sync logic between `input_params` and top-level fields like `model`.
- Soniox: replace `input_params: SonioxInputParams` with 8 individual fields, simplify `_update_settings` by removing model sync logic, remove unused `is_given` import
- Gladia: replace `input_params: GladiaInputParams` with 11 individual fields, resolve deprecated `language` → `language_config` at init time rather than at `_prepare_settings` time
- Storage mode: for use in `self._settings`. All fields should be specified, i.e. should not be `NOT_GIVEN`.
- Delta mode: for use in `*UpdateSettingsFrame`.
In service of this, this commit:
- Adds a runtime check that all fields are specified in storage mode
- Updates all services to specify all fields in stored settings
- Updates all services to no longer check for `is_given` in stored settings (not necessary anymore)
- Updates relevant docstrings
- Renames `update` to `delta` in `*UpdateSettingsFrame`
- Updates community integrations guide
Move start_processing_metrics from run_stt (called per audio chunk,
producing noisy 0ms logs) to _receive_messages when the first final
token arrives for a new utterance. The existing stop_processing_metrics
in send_endpoint_transcript completes the pair, giving a meaningful
measurement of time from first recognition to finalized transcript.
Commit 859cd7c9 refactored STT TTFB measurement to use the base class
start_ttfb_metrics/stop_ttfb_metrics, which are gated behind
can_generate_metrics(). Soniox and AWS Transcribe never overrode this
method (default returns False), so TTFB was silently never reported.
Move speech detection tracking outside the per-frame loop in append_audio
since is_speech applies to the whole buffer. Add debug log in
analyze_end_of_turn to show state and probability at decision time. Update
the Krisp VIVA example to use Cartesia TTS and turn analyzer strategy.
Use wildcard `*UpdateSettingsFrame` to cover all frame types. Clarify that NOT_GIVEN only appears in update deltas, not in the service's current settings state.
Split the single "changed" entry into separate "added", "changed", and "deprecated" entries for clarity. Add a note about the subtle behavior change in the deprecated set_model/set_voice/set_language methods.
Simplify the reconnect example to show a common pattern (reconnect on any change) and improve the _warn_unhandled_updated_settings example to show selective handling of specific fields.
Update docstrings for ServiceSettings, LLMSettings, TTSSettings, and STTSettings to make clear these capture only the subset of service configuration that can be changed while the pipeline is running via UpdateSettingsFrame, not all constructor parameters.
Document the existing convention: use @dataclass for frames and
internal pipeline data, use Pydantic BaseModel for configuration,
parameters, metrics, and external API data.
Replace self-referential `pipecat-ai[local-smart-turn-v3]` extra in core
dependencies with the actual packages (`transformers`, `onnxruntime`).
Self-referential extras are not supported by Poetry and cause dependency
resolution failures. Since these are required by the default turn stop
strategy (LocalSmartTurnAnalyzerV3), they belong in core dependencies.
- Remove `local-smart-turn-v3` optional extra from pyproject.toml
- Remove try/except ModuleNotFoundError guard (now always installed)
- Remove `--extra local-smart-turn-v3` from CI workflows
- indicate clearly that it's not meant for public use
- make it clear the `self._settings` is the single source of truth for model information
- set the stage for an upcoming change where `AIService` subclasses won't have to ever worry about explicitly calling an `AIService` method to sync model name to metrics
Across all services, switch from accessing `self._model_name` or `self.model_name` in favor of `self._settings.model`.
- AWSNovaSonicLLMService: new `AWSNovaSonicLLMSettings` with `voice_id` and `endpointing_sensitivity`; remove `self._params` entirely, storing audio I/O config as plain instance variables
- NeuphonicHttpTTSService: reuse `NeuphonicTTSSettings`; use inherited `language` field instead of bespoke `lang_code`
- NvidiaTTSService: new `NvidiaTTSSettings` with `quality`
- PiperTTSService / PiperHttpTTSService: new `PiperTTSSettings` / `PiperHttpTTSSettings` (no extra fields)
- SpeechmaticsTTSService: new `SpeechmaticsTTSSettings` with `max_retries`
Also remove redundant `lang_code` from `NeuphonicTTSSettings` (both WS and HTTP services now use the inherited `TTSSettings.language` field, with automatic enum conversion via the base class).
HTTP services (Neuphonic HTTP, Piper HTTP, Speechmatics) don't override `_update_settings` since the base class applies changes to `self._settings` and subsequent requests read from it automatically.
Also:
- remove unnecessary pass-through `_update_settings` implementation in `FalSTTService`
- warn that `AsyncAITTSService` doesn't currently support runtime settings updates
- update how `GradiumTTSService._update_settings` checks for voice changes
- remove a couple of unnecessary args (because they specified defaults) in other examples
ThinkingConfig was defined as an inner class on the service but referenced in the Settings dataclass declared before the service class, causing a crash at import time. Move ThinkingConfig to a standalone class defined before Settings, and keep a class attribute alias for backward compatibility.
Every `*Settings` dataclass field whose default is `NOT_GIVEN` now carries `_NotGiven` in its type union so the type system accurately reflects the three-state semantics (real value, `None` where applicable, or not-yet-specified). Fields previously typed as bare `Any`, `str`, `float`, `bool`, `list`, `dict`, or `Optional[X]` are now narrowed to the specific type from the corresponding `InputParams` Pydantic model.
- Move `CommitStrategy` up in the file so it could be used by `ElevenLabsRealtimeSTTSettings`
- Fix a bug where `run_tts` would erroneously try to reconnect if a reconnection was already in flight (like a reconnection triggered by `_update_settings`)
42 examples covering STT (13), TTS (21), LLM (4), and realtime (4) services. Each demonstrates updating service settings 10 seconds after client connects, verifying the typed settings machinery end-to-end for every provider.
HumeTTSService now stores its params (description, speed, trailing_silence) in a proper `HumeTTSSettings` dataclass instead of a separate `_params` Pydantic model, making it work with `TTSUpdateSettingsFrame(update=...)`. The old `update_setting(key, value)` method is kept but deprecated.
Also removes the unused no-op `TTSService.update_setting` base method, which was never called by the `TTSUpdateSettingsFrame` pipeline.
The dataclass-based API (`*UpdateSettingsFrame(update=*Settings(...))`) is the preferred path since 0.0.103. The dict path still works but now emits a `DeprecationWarning`.
Change `TTSSettings.language` and `STTSettings.language` from `Any` to `Language | str | _NotGiven`. Add `language_to_service_language` base method and centralized `isinstance`-guarded conversion in `STTService._update_settings` (mirroring TTS). Update the TTS guard from `is not None` to `isinstance(…, Language)` so raw strings pass through unchanged.
Remove now-redundant per-service language conversion from `_update_settings` overrides (ElevenLabs, Azure, Fal, Whisper). Add `language_to_service_language` to Azure STT so the centralized conversion picks it up. Fix AWS and NVIDIA STT `__init__` to convert language at construction time, then simplify their runtime accessors to read `_settings.language` directly.
Note that for services that previously handled applying updates (through methods like `set_model` and `set_language`), we're keeping the update-applying logic (some or most of which is already well-tested) and expanding it to cover all relevant settings fields. Services under this bucket are:
- Deepgram STT
- Deepgram Sagemaker STT
- Elevenlabs STT
- Google STT
- Gradium STT
- OpenAI STT
- Speechmatics STT
Now that all services use typed `ServiceSettings` objects, this removes the interim scaffolding that supported both dict-based and typed settings paths in parallel. Specifically: removes old dict-based `_update_settings(settings: Mapping)` methods from base classes, removes `isinstance(self._settings, ServiceSettings)` guards, simplifies `process_frame` branching, and renames `_update_settings_from_typed` to `_update_settings` across all ~30 service implementations. Also renames the no-arg `_update_settings()` helper on realtime services to `_send_session_update()` to avoid collision, adds `from_mapping` overrides on `GoogleLLMSettings` and `AnthropicLLMSettings` for ThinkingConfig dict-to-object conversion, and replaces a broken no-arg `_update_settings()` call in Gemini Live with a TODO.
- NvidiaSTTService.set_model: convert to proper DeprecationWarning (model can't change at runtime for Riva streaming STT)
- NvidiaTTSService.set_model: same treatment for Riva TTS
- NvidiaSegmentedSTTService.set_model: remove — base class now routes through _update_settings_from_typed which re-creates the recognition config
- GeminiTTSService.set_voice: remove — move AVAILABLE_VOICES validation into _update_settings_from_typed so it fires on both legacy and new paths
Standardize all STT, TTS, and LLM service classes to declare `_settings` with the narrowed Settings type as a class-level annotation. This gives editors and type checkers the specific type when hovering or autocompleting on `self._settings` in each service and its subclasses. Inline `self._settings: Type = ...` assignments are replaced with plain `self._settings = ...`.
`filter_incomplete_user_turns` and `user_turn_completion_config` were only handled in the legacy dict-based `_update_settings` code path. This adds them to `LLMSettings` and introduces `LLMService._update_settings_from_typed` so the typed path handles them too.
Does not (yet) touch `InputParams`, to avoid scope creep and touching something currently part of the public API. But there is a lot of overlap between `*Settings` object fields and `InputParams` fields.
Other than discoverability/typing, these are some other improvements brought by this refactor:
- There is now a single code path (see `_update_settings_from_typed`) where services can respond to settings changes (by, say, reconnecting if needed), improving maintainability and guaranteeing one and only one reconnection no matter which settings changed
- `set_language`/`set_model`/`set_voice`—which we're assuming are usable as public methods, though *not* recommended over `*UpdateSettingsFrame`—all use the same code path as settings updates. They're also now all consistent in that, if a service needs to respond to a change (by, say, reconnecting if needed), any of these methods will kick off that process. Note that this is technically a behavior change.
- Several services now properly react to changed settings by reconnecting:
- `AWSTranscribeSTTService`
- `AzureSTTService`
- `SonioxSTTService`
- `GladiaSTTService`
- `SpeechmaticsSTTService`
- `AssemblyAISTTService`
- `CartesiaSTTService`
- `FishAudioTTSService` (would previously only reconnect when `model` changed)
- `GoogleSTTService`
- `SpeechmaticsSTTService` (which previously only handled *some* settings updates through a nonstandard public `update_params` method)
- `GradiumSTTService`
- `NvidiaSegmentedSTTService` (which previously only handled changes to language)
- Bookkeeping across various services has been reduced, mostly by deduping ivars; the `self._settings` ivar is treated as the source of truth
NOTE: I pretty much guarantee that there are services missed in this PR in terms of bringing to consistency with how updates are handled (like whether changes in certain fields trigger reconnects when they need to). We can squash remaining inconsistencies as we stumble onto them, service by service. The goal here is to get things *mostly* in order, and establish the infrastructure and patterns we'll need going forward.
2026-02-13 15:12:26 -05:00
772 changed files with 57701 additions and 35755 deletions
@@ -32,6 +32,20 @@ Create changelog files for the important commits in this PR. The PR number is pr
6. Use ⚠️ emoji prefix for breaking changes.
7. **Write changes in user-facing terms first.** Lead with what users of the framework will notice: new APIs, changed behavior, new parameters, fixed bugs they might have hit, etc. Implementation details (internal refactoring, how something is wired up under the hood) can be included as secondary context after the user-facing description, but should never be the *only* content of a changelog entry when there is a user-visible effect.
**Good** (user-facing first, implementation detail as context):
```
- Turn completion instructions now persist correctly across full context updates when using `system_instruction`. Previously they were injected as a context system message, which caused warning spam and didn't survive context updates.
```
**Bad** (implementation detail only, no user-facing framing):
```
- Fixed turn completion instructions being injected as a context system message instead of using `system_instruction`.
```
Ask yourself: "If I'm a developer building on Pipecat, what would I notice changed?" Start there.
## Example
For PR #3519 with a new feature and a bug fix:
@@ -43,5 +57,5 @@ For PR #3519 with a new feature and a bug fix:
`changelog/3519.fixed.md`:
```
- Fixed an issue where something was not working correctly.
- Fixed an issue where something was not working correctly in some user-visible scenario. The root cause was an internal implementation detail.
The **Code Cleanup Skill** reviews, refactors, and documents code changes in your current branch, ensuring alignment with **Pipecat’s architecture, coding standards, and example patterns**.
The **Code Cleanup Skill** reviews, refactors, and documents code changes in your current branch, ensuring alignment with **Pipecat's architecture, coding standards, and example patterns**.
It focuses on **readability, correctness, performance, and consistency**, while avoiding breaking changes.
---
@@ -28,9 +28,9 @@ This skill analyzes all changes introduced in your branch and performs the follo
Invoke the skill using any of the following commands:
-“Clean up my branch code”
-“Refactor the changes in my branch”
-“Review and improve my branch code”
-"Clean up my branch code"
-"Refactor the changes in my branch"
-"Review and improve my branch code"
-`/cleanup`
---
@@ -144,7 +144,7 @@ class InputParams(BaseModel):
#### Examples
Validated against `examples/foundational/07-interruptible.py`:
description: Document a Python module and its classes using Google style
---
Document a Python module and its classes using Google-style docstrings following project conventions. The class name is provided as an argument.
Document a Python module or class using Google-style docstrings following project conventions. The argument can be a class name or a module path.
## Instructions
1.First, find the class in the codebase:
```
Search for "class ClassName" in src/pipecat/
```
1.Determine what to document based on the argument:
2. If multiple files contain that class name:
- List all matches with their file paths
- Ask the user which one they want to document
- Wait for confirmation before proceeding
**If a module path is provided** (e.g. `src/pipecat/audio/vad/vad_analyzer.py`):
- Use that file directly
3. Once the file is identified, read the module to understand its structure:
**If a class name is provided** (e.g. `VADAnalyzer`):
- Search for `class ClassName` in `src/pipecat/`
- If multiple files contain that class name, list all matches with their file paths, ask the user which one they want to document, and wait for confirmation
2. Once the file is identified, read the module to understand its structure:
- Identify all classes, functions, and important type aliases
@@ -157,7 +157,11 @@ After processing all mapped pairs, check for two kinds of gaps:
**Missing sections**: Mapped doc pages that are missing standard sections compared to the source. For example, a transport page with no Configuration section, or a service page with no InputParams table when the source defines `InputParams(BaseModel)`. Flag these and offer to add the missing sections.
If the user wants a new page, create it using this template structure:
If the user wants a new page, do all three of the following:
#### 8a: Create the doc page
Create the new `.mdx` file using this template structure:
Add the new page path to `DOCS_PATH/docs.json` in the correct navigation group. The path format is `server/services/{category}/{provider}` (without the `.mdx` extension).
Find the matching group in the navigation structure:
- **STT** → `"group": "Speech-to-Text"` under Services
- **TTS** → `"group": "Text-to-Speech"` under Services
- **LLM** → `"group": "LLM"` under Services
- **S2S** → `"group": "Speech-to-Speech"` under Services
- **Transport** → `"group": "Transport"` under Services
- **Serializer** → `"group": "Serializers"` under Services
- **Image generation** → `"group": "Image Generation"` under Services
- **Video** → `"group": "Video"` under Services
- **Memory** → `"group": "Memory"` under Services
- **Vision** → `"group": "Vision"` under Services
- **Analytics** → `"group": "Analytics & Monitoring"` under Services
Insert the new entry **alphabetically** within the group's `pages` array. For example, adding a new STT service "foo":
```json
{
"group": "Speech-to-Text",
"pages": [
"server/services/stt/assemblyai",
"server/services/stt/aws",
...
"server/services/stt/foo",
...
]
}
```
#### 8c: Add to supported-services.mdx
Add a new row to the correct category table in `DOCS_PATH/server/services/supported-services.mdx`.
- **DisplayName**: Use the service's human-readable name (e.g., "ElevenLabs", "AWS Polly", "Google Gemini")
- **package**: Look at the service's `pyproject.toml` extras or the import pattern in the source code. For example, if the service is in `src/pipecat/services/foo/`, the package is typically `foo`.
- If no pip dependencies are required, use `No dependencies required` instead.
Insert the new row **alphabetically** within the table. Match the column alignment of the existing rows.
### Step 9: Output summary
After all edits are complete, print a summary:
@@ -221,6 +272,9 @@ After all edits are complete, print a summary:
### Updated guides
- `guides/learn/speech-to-text.mdx` — Updated code example (renamed `old_param` → `new_param`)
### New service pages
- `server/services/tts/newprovider.mdx` — Created page, added to docs.json (Text-to-Speech), added to supported-services.mdx
### Unmapped source files
- `src/pipecat/services/newprovider/tts.py` — NewProviderTTSService (no doc page exists)
@@ -247,4 +301,6 @@ Before finishing, verify:
- [ ] New parameters have accurate types and defaults from source
- [ ] Formatting matches the existing page style
- [ ] Guides referencing changed APIs were checked and updated
- [ ] New service pages were added to `docs.json` in the correct group, alphabetically
- [ ] New service pages were added to `supported-services.mdx` in the correct table, alphabetically
@@ -10,7 +10,7 @@ Pipecat is an open-source Python framework for building real-time voice and mult
```bash
# Setup development environment
uv sync --group dev --all-extras --no-extra gstreamer --no-extra krisp
uv sync --group dev --all-extras --no-extra gstreamer
# Install pre-commit hooks
uv run pre-commit install
@@ -25,7 +25,7 @@ uv run pytest tests/test_name.py
uv run pytest tests/test_name.py::test_function_name
# Preview changelog
towncrier build --draft --version Unreleased
uv run towncrier build --draft --version Unreleased
# Lint and format check
uv run ruff check
@@ -74,7 +74,7 @@ All data flows as **Frame** objects through a pipeline of **FrameProcessors**:
- **Context Aggregation**: `LLMContext` accumulates messages for LLM calls; `UserResponse` aggregates user input
- **Turn Management**: Turn management is done through `LLMUserAggregator` and
`LLMAssistantAggregator`, created with `LLMContextAggregatorPair`
`LLMAssistantAggregator`, created with `LLMContextAggregatorPair`
- **User turn strategies**: Detection of when the user starts and stops speaking is done via user turn start/stop strategies. They push `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` respectively.
@@ -90,23 +90,26 @@ All data flows as **Frame** objects through a pipeline of **FrameProcessors**:
- **Type hints**: Required for complex async code.
- **Dataclass vs Pydantic**: Use `@dataclass` for frames and internal pipeline data (high-frequency, no validation needed). Use Pydantic `BaseModel` for configuration, parameters, metrics, and external API data (benefits from validation and serialization). Specifically:
-`@dataclass`: Frame types, context aggregator pairs, internal data containers
-`BaseModel`: Service `InputParams`, transport/VAD/turn params, metrics data, API request/response models, serializer params
### Docstring Example
@@ -152,4 +155,3 @@ When adding a new service:
## Testing
Test utilities live in `src/pipecat/tests/utils.py`. Use `run_test()` to send frames through a pipeline and assert expected output frames in each direction. Use `SleepFrame(sleep=N)` to add delays between frames.
@@ -23,9 +23,8 @@ Create your integration following the patterns and examples shown in the "Integr
Your repository must contain these components:
- **Source code** - Complete implementation following Pipecat patterns
- **Foundational example** - Single file example showing basic usage (see [Pipecat examples](https://github.com/pipecat-ai/pipecat/tree/main/examples/foundational))
- **Foundational example** - Single file example showing basic usage (see [Pipecat examples](https://github.com/pipecat-ai/pipecat/tree/main/examples))
- **README.md** - Must include:
- Introduction and explanation of your integration
- Installation instructions
- Usage instructions with Pipecat Pipeline
@@ -66,12 +65,25 @@ Once your PR is submitted, post in the `#community-integrations` Discord channel
#### Websocket-based Services
**Base class:**`WebsocketSTTService`
**Use for:** Services where you manage the websocket connection directly. Combines `STTService` with `WebsocketService` for automatic reconnection and keepalive support.
@@ -109,56 +121,59 @@ Once your PR is submitted, post in the `#community-integrations` Discord channel
#### Key requirements:
- **`_process_context(self, context: LLMContext)`** — The main method that processes an LLM context and generates a response. Each LLM service overrides `process_frame` to extract context from `LLMContextFrame` and calls `_process_context`.
- **`adapter_class`** — Class attribute pointing to a `BaseLLMAdapter` subclass. Defaults to `OpenAILLMAdapter`. Non-OpenAI services must implement their own adapter (see `src/pipecat/adapters/base_llm_adapter.py`) with methods:
-`get_llm_invocation_params(context)` — Extract provider-specific params from universal context
-`to_provider_tools_format(tools_schema)` — Convert standard tools to provider format
-`get_messages_for_logging(context)` — Format messages for logging
- **Frame sequence:** Output must follow this frame sequence pattern:
-`LLMFullResponseStartFrame` — Signals the start of an LLM response
-`LLMTextFrame` — Contains LLM content, typically streamed as tokens
-`LLMFullResponseEndFrame` — Signals the end of an LLM response
-`LLMFullResponseStartFrame` - Signals the start of an LLM response
-`LLMTextFrame` - Contains LLM content, typically streamed as tokens
-`LLMFullResponseEndFrame` - Signals the end of an LLM response
-**Thought frames (reasoning models):** If the model supports extended thinking / chain-of-thought, emit thought frames alongside the response:
-`LLMThoughtStartFrame` — Signals the start of a thought
-`LLMThoughtTextFrame` — Contains thought content, streamed as tokens
-`LLMThoughtEndFrame` — Signals the end of a thought
- **Context aggregation:** Implement context aggregation to collect user and assistant content:
- Aggregators come in pairs with a `user()` instance and `assistant()` instance
- Context must adhere to the `LLMContext` universal format
- Aggregators should handle adding messages, function calls, and images to the context
- **Context aggregation** is handled by the framework via `LLMContext` + `LLMContextAggregatorPair`. The LLM service just processes context it receives — no need to implement aggregators.
STT, LLM, and TTS services support `ServiceUpdateSettingsFrame` for dynamic configuration changes. The base STTService has an `_update_settings()` method that handles settings, and the private `_settings``Dict` is used to store settings and provide access to the subclass.
Every AI service (STT, LLM, TTS, image generation, etc.) exposes a **Settings dataclass** that serves two roles:
1.**Store mode** — the service's `self._settings` holds the current value of every runtime-updatable field.
2.**Delta mode** — an update frame (e.g. `TTSUpdateSettingsFrame`) specifies only the fields that should change; unspecified fields remain `NOT_GIVEN`.
#### Defining your Settings class
Extend `STTSettings`, `TTSSettings`, `LLMSettings`, or `ImageGenSettings` (or, if your service directly subclasses `AIService`, `ServiceSettings`). The base classes already provide common fields (e.g. `model`, `voice`, `language`). You only need to add **service-specific knobs that should be runtime-updatable**:
```python
asyncdefset_language(self,language:Language):
"""Set the recognition language and reconnect.
fromdataclassesimportdataclass,field
Args:
language: The language to use for speech recognition.
Note that, in this example, Deepgram requires the websocket connection be disconnected and reconnected to reinitialize the service with the new value. Consider if your service requires reconnection.
**What goes in Settings vs. `__init__` params:**
| Belongs in Settings | Stays as `__init__` params |
| Anything users may want to change mid-session | Audio encoding, sample format |
| | Connection parameters (timeouts, retries) |
The rule of thumb: if a caller might send an update frame to change it at runtime, it belongs in Settings. Everything else is init-only config stored as `self._xxx`.
#### Wiring settings into `__init__`
Accept an **optional**`settings` parameter. Build a `default_settings` object with all fields set to real values, then merge any caller overrides with `apply_update`.
Add a `Settings`**class attribute** that points to your settings dataclass. This lets callers access the settings class through the service itself (e.g. `MyTTSService.Settings(...)`) without a separate import:
```python
fromtypingimportOptional
classMyTTSService(TTSService):
Settings=MyTTSSettings
_settings:Settings
def__init__(
self,
*,
api_key:str,
settings:Optional[Settings]=None,
**kwargs,
):
# 1. Defaults — every field has a real value (store mode).
default_settings=self.Settings(
model="my-model-v1",
voice="default-voice",
language="en",
speaking_rate=1.0,
)
# 2. Merge caller overrides (only given fields win).
ifsettingsisnotNone:
default_settings.apply_update(settings)
# 3. Pass the fully-populated settings to the base class.
AI services support runtime configuration changes via `*UpdateSettingsFrame`s (e.g. `STTUpdateSettingsFrame`, `TTSUpdateSettingsFrame`, `LLMUpdateSettingsFrame`).
To react to runtime setting changes, override `_update_settings`. The base implementation applies the delta to `self._settings` and returns a `dict` mapping each changed field name to its **pre-update** value. Your override should call `super()` first, then act on the changed fields. A common implementation might look like:
"""Apply a settings update, reconfiguring the connection if needed."""
changed=awaitsuper()._update_settings(update)
ifnotchanged:
returnchanged
awaitself._disconnect()
awaitself._connect()
returnchanged
```
The dict keys work like a set for membership tests (`"language" in changed`) and truthiness (`if changed`). Use `changed.keys() - {"language"}` for set difference, or `changed["language"]` to inspect the previous value of a field.
Note that, in this example, the service requires a reconnect to apply the new language. Consider, for each setting, whether your service requires reconnection or can apply changes in-place.
If your service can't yet apply certain settings at runtime, call `self._warn_unhandled_updated_settings(changed)` with any unhandled field names so users get a clear log message:
**Pipecat** is an open-source Python framework for building real-time voice and multimodal conversational agents. Orchestrate audio and video, AI services, different transports, and conversation pipelines effortlessly—so you can focus on what makes your agent unique.
> Want to dive right in? Try the [quickstart](https://docs.pipecat.ai/getting-started/quickstart).
> Want to dive right in? Run `pipecat init quickstart` or follow the [quickstart guide](https://docs.pipecat.ai/getting-started/quickstart).
## 🚀 What You Can Build
@@ -55,6 +55,20 @@ Looking for help debugging your pipeline and processors? Check out [Whisker](htt
Love terminal applications? Check out [Tail](https://github.com/pipecat-ai/tail), a terminal dashboard for Pipecat.
### 🤖 Claude Code Skills
Use [Pipecat Skills](https://github.com/pipecat-ai/skills) with [Claude Code](https://claude.ai/code) to scaffold projects, deploy to Pipecat Cloud, and more. Install the marketplace with:
```
claude plugin marketplace add pipecat-ai/skills
```
and install any of the available plugins.
### 🧩 Community Integrations
Build and share your own Pipecat service integrations! Browse existing [community integrations](https://docs.pipecat.ai/server/services/community-integrations) or check out our [guide](COMMUNITY_INTEGRATIONS.md) to create your own.
### 📺️ Pipecat TV Channel
Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.youtube.com/playlist?list=PLzU2zoMTQIHjqC3v4q2XVSR3hGSzwKFwH) channel.
@@ -66,24 +80,25 @@ Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.yout
| Community | [Browse community integrations →](https://docs.pipecat.ai/server/services/community-integrations) |
📚 [View full services documentation →](https://docs.pipecat.ai/server/services/supported-services)
@@ -127,15 +142,15 @@ You can get started with Pipecat running on your local machine, then move your a
## 🧪 Code examples
- [Foundational](https://github.com/pipecat-ai/pipecat/tree/main/examples/foundational) — small snippets that build on each other, introducing one or two concepts at a time
- [Foundational](https://github.com/pipecat-ai/pipecat/tree/main/examples) — small snippets that build on each other, introducing one or two concepts at a time
- [Example apps](https://github.com/pipecat-ai/pipecat-examples) — complete applications that you can use as starting points for development
## 🛠️ Contributing to the framework
### Prerequisites
**Minimum Python Version:** 3.10
**Recommended Python Version:** 3.12
**Minimum Python Version:** 3.11
**Recommended Python Version:** >= 3.12
### Setup Steps
@@ -151,7 +166,6 @@ You can get started with Pipecat running on your local machine, then move your a
```bash
uv sync --group dev --all-extras \
--no-extra gstreamer \
--no-extra krisp \
--no-extra local \
```
@@ -163,6 +177,15 @@ You can get started with Pipecat running on your local machine, then move your a
> **Note**: Some extras (local, gstreamer) require system dependencies. See documentation if you encounter build errors.
### Claude Code Skills
Install development workflow skills for contributing to Pipecat with [Claude Code](https://claude.ai/code):
```
claude plugin marketplace add pipecat-ai/pipecat
claude plugin install pipecat-dev@pipecat-dev-skills
- Switched `GradiumTTSService` from `InterruptibleWordTTSService` to `AudioContextWordTTSService`, eliminating websocket disconnect/reconnect on every interruption by using `client_req_id`-based multiplexing.
- ⚠️ Added WebSocket-based `OpenAIResponsesLLMService` as the new default for the OpenAI Responses API. It maintains a persistent connection to `wss://api.openai.com/v1/responses` and automatically uses `previous_response_id` to send only incremental context, falling back to full context on reconnection or cache miss. The previous HTTP-based implementation is now available as `OpenAIResponsesHttpLLMService`.
- ⚠️ Removed `OpenPipeLLMService` and the `openpipe` extra. OpenPipe was acquired by CoreWeave and the package is no longer maintained. If you were using `openpipe` as an LLM provider, switch to the underlying provider directly (e.g. `openai`). The OpenPipe interface can still be used with `OpenAILLMService` by specifying a `base_url`.
- ⚠️ Updated `langchain` extra to require langchain 1.x (from 0.3.x), langchain-community 0.4.x (from 0.3.x), and langchain-openai 1.x (from 0.3.x). If you pin these packages in your project, update your pins accordingly.
- Fixed `InworldHttpTTSService` streaming responses crashing with `UnicodeDecodeError` when multi-byte UTF-8 characters were split across chunk boundaries. This caused TTS audio to cut off mid-sentence intermittently.
- Fixed a crash (`JSONDecodeError`) when a user interruption occurs while the LLM is streaming function call arguments. Previously, the incomplete JSON arguments were passed directly to `json.loads()`, causing an unhandled exception. Affected services: OpenAI, Google (OpenAI-compatible), and SambaNova.
- ⚠️ Removed deprecated `on_pipeline_ended`, `on_pipeline_cancelled`, and `on_pipeline_stopped` events from `PipelineTask`. Use `on_pipeline_finished` instead.
- ⚠️ Removed deprecated `camera_in_enabled`, `camera_in_is_live`, `camera_in_width`, `camera_in_height`, `camera_out_enabled`, `camera_out_is_live`, `camera_out_width`, `camera_out_height`, and `camera_out_color` transport params. Use the `video_in_*` and `video_out_*` equivalents instead.
- ⚠️ Removed deprecated RTVI models, frames, and processor methods including `RTVIConfig`, `RTVIServiceConfig`, `RTVIServiceOptionConfig`, various `RTVI*Data` models, `RTVIActionFrame`, and `RTVIProcessor.handle_function_call`/`handle_function_call_start`. Use the updated RTVI processor API instead.
- ⚠️ Removed deprecated transport frames: `TransportMessageFrame`, `TransportMessageUrgentFrame`, `InputTransportMessageUrgentFrame`, `DailyTransportMessageFrame`, and `DailyTransportMessageUrgentFrame`. Use `OutputTransportMessageFrame`, `OutputTransportMessageUrgentFrame`, `InputTransportMessageFrame`, `DailyOutputTransportMessageFrame`, and `DailyOutputTransportMessageUrgentFrame` instead.
- ⚠️ Removed deprecated interruption frames: `StartInterruptionFrame` and `BotInterruptionFrame`. Use `InterruptionFrame` and `InterruptionTaskFrame` instead.
- ⚠️ Removed deprecated `pipecat.services.gemini_multimodal_live` package. Use `pipecat.services.google.gemini_live` instead. Note that class names no longer include "Multimodal" (e.g. `GeminiMultimodalLiveLLMService` → `GeminiLiveLLMService`).
- ⚠️ Removed deprecated `OpenAIRealtimeBetaLLMService` and `AzureRealtimeBetaLLMService`. Use `OpenAIRealtimeLLMService` and `AzureRealtimeLLMService` from `pipecat.services.openai.realtime` and `pipecat.services.azure.realtime` instead.
- ⚠️ Removed deprecated `pipecat.services.deepgram.stt_sagemaker` and `pipecat.services.deepgram.tts_sagemaker` modules. Use `pipecat.services.deepgram.sagemaker.stt` and `pipecat.services.deepgram.sagemaker.tts` instead.
- ⚠️ Removed deprecated `GoogleLLMOpenAIBetaService` from `pipecat.services.google.openai`. Use `GoogleLLMService` from `pipecat.services.google.llm` instead.
- ⚠️ `BaseOpenAILLMService.get_chat_completions()` now accepts an `LLMContext` instead of `OpenAILLMInvocationParams`. If you override this method, update your signature accordingly.
- ⚠️ Removed deprecated service-specific context and aggregator machinery, which was superseded by the universal `LLMContext` system.
Service-specific classes removed: `AnthropicLLMContext`, `AnthropicContextAggregatorPair`, `AWSBedrockLLMContext`, `AWSBedrockContextAggregatorPair`, `OpenAIContextAggregatorPair`, and their user/assistant aggregators. Also removed `create_context_aggregator()` from `LLMService`, `OpenAILLMService`, `AnthropicLLMService`, and `AWSBedrockLLMService`.
- ⚠️ Removed deprecated frame types `LLMMessagesFrame` and `OpenAILLMContextAssistantTimestampFrame` from `pipecat.frames.frames`. Instead of `LLMMessagesFrame`, use `LLMContextFrame` with the new messages, or `LLMMessagesUpdateFrame` with `run_llm=True`.
- ⚠️ Removed `VisionImageFrameAggregator` (from `pipecat.processors.aggregators.vision_image_frame`). Vision/image handling is now built into `LLMContext` (from `pipecat.processors.aggregators.llm_context`). See the `12*` examples for the recommended replacement pattern.
- ⚠️ Removed `OpenAILLMContext`, `OpenAILLMContextFrame`, and `OpenAILLMContext.from_messages()`. Use `LLMContext` (from `pipecat.processors.aggregators.llm_context`) and `LLMContextFrame` (from `pipecat.frames.frames`) instead. All services now exclusively use the universal `LLMContext`.
From the developer's point of view, migrating will usually be a matter of going from this:
- Fixed `CartesiaTTSService` failing with "Context has closed" errors when switching voice, model, or language via `TTSUpdateSettingsFrame`. The service now automatically flushes the current audio context and opens a fresh one when these settings change.
- ⚠️ Removed deprecated service parameters and shims that have been replaced by the `settings=Service.Settings(...)` pattern or direct `__init__` parameters:
-`GoogleVertexLLMService`: `InputParams` class with `location`/`project_id` fields (use direct init params); `project_id` is now required, `location` defaults to `"us-east4"`
-`MiniMaxHttpTTSService`: `english_normalization` from `InputParams` (use `text_normalization`)
-`SimliVideoService`: `simli_config` init param (use `api_key`/`face_id`), `use_turn_server` init param; `api_key` and `face_id` are now required
-`AnthropicLLMService`: `enable_prompt_caching_beta` from `InputParams` (use `enable_prompt_caching`)
- ⚠️ `LLMService.function_call_timeout_secs` now defaults to `None` instead of `10.0`. Deferred function calls will run indefinitely unless a timeout is explicitly set at the service level or per-call. If you relied on the previous 10-second default, pass `function_call_timeout_secs=10.0` explicitly.
- ⚠️ Removed deprecated `pipecat.transports.services` and `pipecat.transports.network` module aliases. Update imports to use `pipecat.transports.daily.transport`, `pipecat.transports.livekit.transport`, `pipecat.transports.websocket.*`, `pipecat.transports.webrtc.*`, and `pipecat.transports.daily.utils` respectively.
- ⚠️ Removed deprecated `EmulateUserStartedSpeakingFrame` and `EmulateUserStoppedSpeakingFrame` frames, and the `emulated` field from `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame`.
- ⚠️ Removed deprecated `STTMuteFilter`, `STTMuteConfig`, and `STTMuteStrategy` from `pipecat.processors.filters.stt_mute_filter`. Use `pipecat.turns.user_mute` strategies with `LLMUserAggregator`'s `user_mute_strategies` parameter instead.
- ⚠️ Removed deprecated `allow_interruptions` parameter from `PipelineParams`, `StartFrame`, and `FrameProcessor`. Interruptions are now always allowed by default. Use `LLMUserAggregator`'s `user_turn_strategies` / `user_mute_strategies` parameters to control interruption behavior.
- ⚠️ Removed `ExternalUserTurnStrategies` and the automatic fallback to it in `LLMUserAggregator` when a `SpeechControlParamsFrame` was received from the transport.
- ⚠️ Removed `vad_analyzer` and `turn_analyzer` parameters from `TransportParams` and all transport input classes, along with all deprecated VAD/turn analysis logic in `BaseInputTransport`. VAD and turn detection are now handled entirely by `LLMUserAggregator`.
This directory contains examples to help you learn how to build with Pipecat.
This directory contains examples showing how to build voice and multimodal agents with Pipecat.
## Getting Started
## Setup
New to Pipecat? Start here:
1. Follow the [README](https://github.com/pipecat-ai/pipecat/blob/main/README.md#%EF%B8%8F-contributing-to-the-framework) steps to get your local environment configured.
- **[Quickstart](quickstart/)** - Get your first voice AI bot running in 5 minutes _(coming soon)_
- **[Client/Server Web](client-server-web/)** - Learn to build web applications with Pipecat's client SDKs _(coming soon)_
- **[Phone Bot with Twilio](phone-bot-twilio/)** - Connect your bot to a phone number _(coming soon)_
> **Run from root directory**: Make sure you are running the steps from the root directory.
## Foundational Examples
> **Using local audio?**: The `LocalAudioTransport` requires a system dependency for `portaudio`. Install the dependency to use the transport.
Single-file examples that introduce core Pipecat concepts one at a time. These examples:
2. Copy the [`env.example`](../env.example) file and add API keys for services you plan to use:
- Build on each other progressively
- Focus on specific features or integrations
- Are used for testing with every Pipecat release
```bash
cp env.example .env
# Edit .env with your API keys
```
See the **[Foundational Examples README](foundational/)** for the complete list.
3. Run any example:
## More Advanced Examples
```bash
uv run python getting-started/01-say-one-thing.py
```
Ready to explore complex use cases? Visit **[pipecat-examples](https://github.com/pipecat-ai/pipecat-examples)** for:
4. Open the web interface at http://localhost:7860/client/ and click "Connect"
- Production-ready applications
- Multi-platform client implementations
- Telephony integrations
- Multimodal and creative applications
- Deployment and monitoring examples
## Running examples with other transports
Most examples support running with other transports, like Twilio or Daily.
### Daily
You need to create a Daily account at https://dashboard.daily.co/u/signup. Once signed up, you can create your own room from the dashboard and set the environment variables `DAILY_ROOM_URL` and `DAILY_API_KEY`. Alternatively, you can let the example create a room for you (still needs `DAILY_API_KEY` environment variable). Then, start any example with `-t daily`:
```bash
uv run getting-started/06-voice-agent.py -t daily
```
### Twilio
It is also possible to run the example through a Twilio phone number. You will need to setup a few things:
1. Install and run [ngrok](https://ngrok.com/download).
```bash
ngrok http 7860
```
2. Configure your Twilio phone number. One way is to setup a TwiML app and set the request URL to the ngrok URL from step (1). Then, set your phone number to use the new TwiML app.
Then, run the example with:
```bash
uv run getting-started/06-voice-agent.py -t twilio -x NGROK_HOST_NAME
```
## Directory Structure
### [`getting-started/`](./getting-started/)
Progressive introduction to Pipecat, from minimal TTS to a full voice agent with function calling.
### [`voice/`](./voice/)
Full STT + LLM + TTS voice agent pipelines showcasing different speech service providers (Deepgram, ElevenLabs, Cartesia, etc.)
### [`function-calling/`](./function-calling/)
Function calling with different LLM providers (OpenAI, Anthropic, Google, etc.)
### [`transcription/`](./transcription/)
Speech-to-text examples with various STT providers.
### [`vision/`](./vision/)
Image description and vision capabilities with different multimodal LLMs.
### [`realtime/`](./realtime/)
Realtime and multimodal live APIs (OpenAI Realtime, Gemini Live, AWS Nova Sonic, Ultravox, Grok).
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.