Mirrors the deprecation in ``QwenLLMService.__init__``: ``model`` should
be passed via ``settings=QwenLLMService.Settings(model=...)`` instead of
as a direct constructor arg.
TTS services whose wire protocol does not echo the context_id back on
incoming audio (Sarvam, Smallest, Soniox, Inworld, ...) call
``get_active_audio_context_id()`` to tag each chunk. That accessor
returned only ``_playing_context_id`` — the playback-side cursor set
asynchronously by ``_audio_context_task_handler`` when it pops a context
off the serialization queue.
Result: incoming audio that arrived in the gap between contexts or at
the very start of a turn (before the playback loop popped) had
``context_id=None`` and was dropped with
``unable to append audio to context: no context ID provided``.
Fall back to ``_turn_context_id`` (the synthesis-side cursor, set as
soon as the turn's context is created) so the gap is covered without
prematurely nulling the playback cursor.
- Replace custom LANGUAGE_MAP fallback in language_to_inworld_language with
resolve_language(language, LANGUAGE_MAP, use_base_code=False) to match the
pattern used by other services and restore the unverified-language warning
- Tighten delivery_mode type from str to Literal["STABLE", "BALANCED", "CREATIVE"]
- Update changelog entry to mention delivery_mode and language normalization
traced_llm only attached the aggregated ``output`` attribute to the
span after the wrapped function returned successfully. When the LLM
call was cancelled mid-stream (e.g. interruption during generation),
the accumulated text was discarded — the span had no ``output``.
Moved the attribute assignment into the ``finally`` block alongside
the existing TTFB write so the partial text we already captured via
the patched ``push_frame`` lands on the span regardless of whether
``f`` returned normally, raised, or was cancelled.
@traced_stt had the same root issue as @traced_tts: the span lifetime
was tied to a per-transcript handler call, which doesn't match the
operation we want to trace. Now uses the __set_name__ pattern to
install:
- A push_frame wrapper that drives one STT span per finalized
TranscriptionFrame. The span is anchored at speech start
(VADUserStartedSpeakingFrame.timestamp - start_secs) but lazy-opened
on the first TranscriptionFrame. Opening earlier (on VAD or
UserStartedSpeakingFrame) races with TurnTraceObserver._handle_turn_started,
which runs as a background task via _call_event_handler (sync=False),
so the span would end up parented to the previous turn. Deferring
the open to the first TranscriptionFrame avoids that race because
STT only emits transcripts well after the turn observer has set
the current turn's context.
- A stop_ttfb_metrics wrapper that closes the span on the TTFB-timeout
path (called with end_time != None from stt_service.py:566). The
span is marked stt.timed_out=True and its end_time is pinned to
the timeout's end_time (= _last_transcript_time) so the duration
reflects when STT actually stopped responding, not when the timeout
fired.
Span lifecycle:
- Open: lazy on first TranscriptionFrame of a segment.
- Close (success): finalized=True attaches metrics.ttfb and closes
the span. Multiple finalized transcripts in a single turn produce
multiple spans.
- Close (timeout): stop_ttfb_metrics(end_time=...) closes with
stt.timed_out=True.
- Close (orphan): UserStoppedSpeakingFrame closes any still-open
span with stt.incomplete=True (covers turns where no finalized
transcript and no timeout fired).
No changes required outside service_decorators.py — stt_service.py
and every per-service file are untouched.
Previously @traced_tts scoped the span to the lifetime of run_tts(). For
streaming TTS services run_tts() returns as soon as the synthesis request
is sent, long before audio chunks arrive, so:
- The span duration measured the WebSocket-send time, not synthesis time.
- The first synthesis recorded the WS-send duration as metrics.ttfb (via
the in-progress fallback in FrameProcessorMetrics.ttfb).
- Subsequent syntheses recorded the previous call's TTFB on the current
span (off-by-one).
The decorator now uses a __set_name__ descriptor to wrap the owning
class's setup() at class definition time. setup() installs per-instance
patches on create_audio_context, append_to_audio_context,
remove_audio_context, on_audio_context_completed, and
reset_active_audio_context. These patches own the span lifetime:
- create_audio_context: open span, set baseline attributes.
- append_to_audio_context: record metrics.ttfb on the first
TTSAudioRawFrame (when stop_ttfb_metrics has produced a real value),
end span on appended TTSStoppedFrame.
- on_audio_context_completed: end span on natural completion (handles
services that auto-push TTSStoppedFrame via push_frame, bypassing
append_to_audio_context).
- remove_audio_context: safety net for explicit removal paths.
- reset_active_audio_context: interruption hook (always reached from
_handle_interruption); marks the span tts.interrupted=true only when
nothing else has closed it.
The run_tts wrapper now only attaches per-call attributes (text,
metrics.character_count) to the already-open span. No changes required
in tts_service.py or in any of the per-service files.
Same async-tool routing approach as #4441: detect async-tool messages in
the LLM context, deliver the final result via the formal tool-result
channel.
Caveat: as of this writing, Inworld Realtime doesn't appear to handle
the resulting delayed tool result reliably, so the routing is
best-effort and the service emits a one-time warning when async-tool
messages are seen. Streamed intermediate results remain unsupported.
Also adds function calling to the realtime-inworld.py example, and
softens the Inworld mention in the #4447 changelog now that the
exclusion is being closed.
A single Realtime API response can now contain more than one audio item
(observed with gpt-realtime-2), and the first item's audio.done can
arrive after deltas from the second have started arriving. Deltas still
arrive strictly in playback order across items, so we keep forwarding
them as received — matching OpenAI's reference implementation.
Adjusted OpenAIRealtimeLLMService so a multi-item response is treated as
one continuous TTS turn:
- _handle_evt_audio_delta: on item switch, advance the tracked item in
place (reset total_size) without emitting another TTSStartedFrame.
Truncation now always targets the latest item.
- _handle_evt_audio_done: debug-trace only; no longer pushes
TTSStoppedFrame.
- _handle_evt_response_done: pushes a single TTSStoppedFrame per turn,
bookending the audio with the Started pushed on the first delta.
Added tests covering single-item, overlapping multi-item, non-overlapping
multi-item, and interrupt-during-multi-item (last-item-wins truncation).
`collections.abc.Coroutine` doesn't expose `cr_code`/`co_name`; only
native coroutine objects do. Use `getattr` chains so pyright is happy
and any non-native awaitable falls back to a generic task name instead
of crashing.
TaskObserver previously took a TaskManager in __init__ and reached into
it directly. Since BaseObject now provides task_manager / create_task /
cancel_task, drop the constructor argument and call
`observer.setup(task_manager)` from PipelineTask._setup() before
starting it.
PipelineTask owns its TaskManager but is itself a BaseObject, so it
inherits create_task/cancel_task. Replace the explicit
self._task_manager.create_task(coro, f"{self}::name") call sites with
self.create_task(coro, "name") for consistency with other BaseObject
subclasses.
PipelineTask owns its TaskManager (still constructed in __init__ since
TaskObserver needs it eagerly). Adding the explicit
`await super().setup(self._task_manager)` in `_setup()` formalizes the
BaseObject lifecycle so any future wiring added to BaseObject.setup is
picked up automatically.
PipelineTask owns its TaskManager but is itself a BaseObject, so it
inherits create_task/cancel_task. Replace the explicit
self._task_manager.create_task(coro, f"{self}::name") call sites with
self.create_task(coro, "name") for consistency with other BaseObject
subclasses.
Lift the task manager wiring (`_task_manager`, `task_manager` property,
`create_task`, `cancel_task`, and `setup(task_manager)`) up to
`BaseObject`. Owners propagate the task manager to their child
`BaseObject`s via `await child.setup(task_manager)`, matching the
existing convention.
Removes duplicated `_task_manager` / `task_manager` property / setup
implementations from `FrameProcessor`, `FrameProcessorMetrics`,
`UserIdleController`, `UserTurnController`,
`BaseUserTurnStartStrategy`, and `BaseUserTurnStopStrategy`.
GeminiLiveVertexLLMService overrides _supports_non_blocking_tools to
return False — Vertex AI's Gemini Live endpoint doesn't yet accept the
NON_BLOCKING behavior field on function declarations or the scheduling
field on FunctionResponse, and sending either breaks tool calling.
Effect: function declarations sent to Vertex no longer carry
NON_BLOCKING; FunctionResponses no longer carry scheduling: WHEN_IDLE.
Users registering a function with cancel_on_interruption=False against
Vertex get the same one-time logger.error + push_error the base class
surfaces on Gemini 3.x.
Mirrors the same change applied to AWSNovaSonicLLMService and
OpenAIRealtimeLLMService in #4441 / GrokRealtimeLLMService in #4447:
replaces the implicit "final happens last" pattern in
_process_completed_function_calls with an explicit
`if async_payload.kind == "final":` block, plus a trailing defensive
`continue` so async-tool messages with an unrecognized kind don't fall
through to the regular tool-result handling block.
Honors cancel_on_interruption=False on Gemini Live for models that support
Gemini's NON_BLOCKING tool mechanism (Gemini 2.x at the time of writing).
Function declarations registered via register_function(...,
cancel_on_interruption=False) are sent with behavior: NON_BLOCKING so the
conversation continues while the tool runs; the matching FunctionResponse
carries scheduling: WHEN_IDLE so the result lands at a graceful pause
rather than mid-sentence. Synchronous (default) tools stay BLOCKING —
applying NON_BLOCKING uniformly produced filler responses like "let me
look that up for you" on regular calls, since the model knew it would
have an opportunity to keep talking while waiting.
A new _supports_non_blocking_tools property gates the flow. On models
that don't support it (currently Gemini 3.x), the service falls back to
plain blocking behavior and surfaces a one-time error + ErrorFrame the
moment async-tool messages first appear in the context, explaining that
the flag's intent is not achievable.
Caveat (Gemini 2.5): an intermittent server-side 1008 "Operation is not
implemented" error can fire when realtime input arrives during a pending
tool call. We auto-reconnect, but the user may need to repeat what they
were saying. The proposed mitigation
(https://discuss.ai.google.dev/t/gemini-live-api-websocket-error-1008-operation-is-not-implemented-or-supported-or-enabled/114644/56)
of gating realtime input during pending tool calls is fundamentally
incompatible with NON_BLOCKING tool calling, so we don't apply it.
Lays groundwork for cancel_on_interruption=False support on Gemini Live by
restructuring _process_completed_function_calls to match the shape used by
AWSNovaSonicLLMService and OpenAIRealtimeLLMService in #4441: a single-pass
forward iteration over raw context messages that detects async-tool
messages via async_tool_messages.parse_message and routes them — started
skipped silently, intermediate logged-as-error and surfaced via push_error,
final delivered via the formal FunctionResponse channel.
Replaces the prior two-pass structure that went through the adapter for
sync results — the service now uses a lightweight self._tool_call_id_to_name
map (populated when the model issues tool calls) for the name lookup the
adapter used to provide. Extracts a new GeminiLLMAdapter.to_function_response_dict
static method for the dict-coercion logic that wraps non-dict tool returns
as {value: <result>} for Gemini's FunctionResponse.response field; the
adapter's existing inline copy in _from_standard_message uses it too.
Example consolidation:
- Folds realtime-gemini-live-function-calling.py into the base
realtime-gemini-live.py example so the base exercises function calling
out of the box (matching realtime-openai.py and realtime-aws-nova-sonic.py).
- Renames realtime-gemini-live-vertex-function-calling.py to
realtime-gemini-live-vertex.py, mirroring the consolidation.
- Adds realtime-gemini-live-async-tool.py.
- Updates scripts/evals/run-release-evals.py for the renames.
This commit alone doesn't make cancel_on_interruption=False fully work on
Gemini Live — additional investigation is pending. This is foundational
work to be built on.
Renames _ASYNC_TOOL_PLACEHOLDER_RESULT to _ASYNC_TOOL_STARTED_RESULT to
match the kind names from async_tool_messages, and lifts the inline
"[Async tool result for tool_call_id=...] {result}" into a sibling
_ASYNC_TOOL_FINAL_RESULT_TEMPLATE constant for the same reason.
Replaces the prior "log a warning and skip" approach with actual handling
of async-tool messages on Ultravox.
The catch with Ultravox is that its API freezes the conversation between
client_tool_invocation and the matching client_tool_result — there's no
"keep talking while the tool runs" channel like NON_BLOCKING on Gemini
or function_call_output-without-blocking on OpenAI Realtime. So:
- When the model invokes an async-registered function (cancel_on_inter
ruption=False), the service immediately ships a placeholder
client_tool_result that tells the model "the actual result isn't
ready yet; a follow-up will arrive shortly; keep the conversation
going". This unfreezes the conversation. The placeholder is sent
from _handle_tool_invocation, since the started async-tool message
doesn't reach the context-frame path until later.
- When the real tool finishes, the final async-tool message lands in
the context. _handle_context now forward-iterates and routes
async-tool messages: started is a no-op (placeholder already sent),
intermediate is logged-as-error and dropped (matching the other
realtime services), and final is injected as user-side text via
user_text_message with bracketed framing — the only mechanism
Ultravox offers for adding non-tool input mid-conversation.
Hoists the registry-lookup helper to LLMService as
_function_is_async(name) so future services can use the same pattern
without re-implementing it.
Adds an async-tool example file for Ultravox modeled on the existing
ones for the other realtime services.
Mirrors the same change applied to AWSNovaSonicLLMService and
OpenAIRealtimeLLMService in #4441: replaces the implicit "final happens
last" pattern in _process_completed_function_calls with an explicit
`if async_payload.kind == "final":` block, plus a trailing defensive
`continue` so async-tool messages with an unrecognized kind don't fall
through to the regular tool-result handling block.
Applies the same async-tool message routing introduced for AWSNovaSonicLLMService
and OpenAIRealtimeLLMService to additional realtime LLM services where the
flag's intent ("keep talking while the tool runs") is achievable:
- GrokRealtimeLLMService (xAI Realtime — also benefits the deprecated Grok
alias since it re-exports the xAI module)
- AzureRealtimeLLMService picks up the fix transitively by inheriting from
OpenAIRealtimeLLMService — no code change needed.
GrokRealtimeLLMService's _process_completed_function_calls now matches
the canonical pattern: skip LLMSpecificMessage, detect async-tool messages
via parse_message and route them — started skipped silently, intermediate
logged as an error and surfaced via push_error, final delivered through
the same channel as a synchronous result.
UltravoxRealtimeLLMService instead gets a one-time warning when async-tool
messages appear in the context. The Ultravox API freezes the conversation
during tool execution
(https://docs.ultravox.ai/tools/async-tools#custom-tool-timeouts), so the
flag's "keep talking while the tool runs" intent isn't achievable there —
applying the same code pattern would mislead users into expecting a UX
Ultravox can't deliver. Surfacing a clear warning is the right behavior
until Ultravox grows true async tool support.
Adds async-tool example files for Grok and Azure modeled on the existing
Nova Sonic / OpenAI Realtime ones (10s simulated network delay, weather
tool registered with cancel_on_interruption=False).
Two services remain excluded:
- GeminiLiveLLMService — the async-tool path needs deeper investigation.
- InworldRealtimeLLMService — appears to have a pre-existing problem
with even simple synchronous tool calling on its Realtime API (the
request reaches the server fine, but response generation fails with a
generic server_error).
Replaces the implicit "final happens last" pattern in
_process_completed_function_calls with an explicit
`if async_payload.kind == "final":` block in both AWSNovaSonicLLMService
and OpenAIRealtimeLLMService. Adds a trailing defensive `continue` so
async-tool messages with an unrecognized kind don't fall through to the
regular tool-result handling block — clearer at the call site, and safer
against future additions to AsyncToolMessageKind.
Before the new async-tool mechanism landed, AWSNovaSonicLLMService and
OpenAIRealtimeLLMService honored cancel_on_interruption=False by simply
not cancelling in-flight function calls on interruption — the eventual
result then flowed through the same channel as any synchronous tool
result. The new mechanism (which appends started/intermediate/final
messages to the LLM context as the underlying task progresses) broke
that path: the realtime services didn't know how to interpret those
messages, and the eventual result was never delivered to the provider.
Restore the flag's behavior by teaching both services to detect
async-tool messages in the context and route them appropriately:
- started → skipped silently. The provider already issued the tool call
and natively awaits a result; nothing to send for the started marker.
- final → delivered via the formal tool-result channel. Same path as a
synchronous tool result, just delayed.
Streamed intermediate results (FunctionCallResultProperties(is_final=
False)) are not supported on these realtime services. An intermediate
result is logged as an error and surfaced via push_error, then dropped.
Use a non-realtime LLM service if a tool needs to stream intermediate
results. (Docstrings on register_function, register_direct_function, and
FunctionCallResultProperties.is_final updated to call this out.)
A new shared module pipecat.processors.aggregators.async_tool_messages
is the single source of truth for the on-the-wire payload shape: the
aggregator uses its build_*_message functions when injecting messages,
and the realtime services use parse_message when scanning the context.
Adds two example files exercising a network-delayed weather tool with
each service. The plain realtime-aws-nova-sonic.py example is also
reverted to a synchronous tool call now that the async variant lives in
its own file.
Similar fixes for other realtime services are forthcoming.
Aligns deprecation docstrings on LLMUserAggregatorParams and
LLMAssistantAggregatorParams with CONTRIBUTING.md conventions:
present-tense parameter descriptions plus a `.. deprecated:: 1.2.0`
directive noting replacement and 2.0.0 removal. Also adds a runtime
DeprecationWarning for `user_turn_completion_config`, which previously
had no warning despite being deprecated.
The old name overlapped semantically with `UserStoppedSpeakingFrame`:
both could be read as "the user's turn is done." They're at different
layers — `UserStoppedSpeakingFrame` is the acoustic stop signal,
while this frame is the post-judgment "inference about the turn is
now complete (turn is semantically final)" signal emitted by the LLM
mixin (on ✓), an end-of-turn classifier, or a custom producer.
The new name pairs naturally with the existing
`on_user_turn_inference_triggered` event vocabulary and removes the
ambiguity with `UserStoppedSpeakingFrame`.
Wrap the detector chain with `deferred(...)` and append the LLM
completion gate via a `UserTurnStrategies` specialization rather than
a free-standing helper, mirroring the existing
`ExternalUserTurnStrategies` pattern. The class lives next to other
strategy containers in `pipecat.turns.user_turn_strategies`, so users
discover it where they're already configuring `user_turn_strategies`.
The deprecated `filter_incomplete_user_turns` flag now rewires
through `FilterIncompleteUserTurnStrategies` under the hood, keeping
the migration path identical to before. `deferred(...)` stays public
as the explicit escape hatch for non-default compositions.
When a stop-strategy chain splits inference-triggered from
finalization (e.g. `LLMTurnCompletionUserTurnStopStrategy` gating a
deferred detector), more than one inference can fire inside a single
user turn — each adds the new transcription segment to the context.
Previously each inference overwrote `_pending_user_turn_aggregation`,
so the eventual `on_user_turn_stopped` event surfaced only the
segment from the last inference, dropping anything the user said
before it.
Concatenate each segment into `_full_user_turn_aggregation` instead
of overwriting, and combine that running buffer with any post-final-
inference segment when emitting the public event.
Add an `LLMMarkerFrame(DataFrame)` for sideband LLM markers that need
to be persisted to context but should not flow through the standard
text path (TTS, transcript). The frame carries an
`append_to_context_immediately` flag so the assistant aggregator can
either commit the marker as a stand-alone message (○ / ◐) or merge it
with the upcoming aggregation as a prefix on the response (✓).
`UserTurnCompletionLLMServiceMixin` now emits `LLMMarkerFrame` instead
of pushing the marker as `LLMTextFrame(skip_tts=True)`, which fixes
the case where an incomplete-turn marker (○ / ◐) was aggregated by
the assistant aggregator but never committed to the context because
the assistant turn lifecycle didn't run to completion (no spoken
response, no `LLMFullResponseEndFrame`-driven `push_aggregation`).
The frame is intentionally generic so other components — STT services
with built-in turn signals, end-of-turn classifiers, custom
annotations — can use the same mechanism to inject sideband signals
into the assistant context.
`LLMTurnCompletionUserTurnStopStrategy` previously bundled two
concerns: pushing `LLMUpdateSettingsFrame` on `StartFrame`, and
finalizing the turn on `UserTurnCompletedFrame`. The latter is
producer-agnostic — any component that emits `UserTurnCompletedFrame`
(STT with built-in turn detection, dedicated end-of-turn classifiers,
custom code) can drive finalization the same way.
Move the frame-handling half into a new
`ExternalUserTurnCompletionStopStrategy`. The LLM-specific subclass
now only adds the settings-frame push and inherits finalization. Mirrors
the existing `ExternalUserTurnStopStrategy` naming pattern.
Fixes a real bug: with `filter_incomplete_user_turns` enabled, the
smart-turn detector's tentative stop was firing `on_user_turn_stopped`
before the LLM had a chance to veto it. Observers, transcript
appenders and UI indicators received an early — and sometimes
duplicated — signal.
Decomposes the single stop concern into two events:
- `on_user_turn_inference_triggered` fires when a stop strategy has
enough signal to start LLM inference. The aggregator pushes the
context here, kicking off the LLM call.
- `on_user_turn_stopped` fires only when the user turn is semantically
final. Built-in strategies fire both events at the same call site,
preserving today's behavior for the common case.
Adds `LLMTurnCompletionUserTurnStopStrategy`, which gates
finalization on a `UserTurnCompletedFrame` (a fieldless system frame
emitted by any component judging turn completeness — currently the
`UserTurnCompletionLLMServiceMixin` on `✓`).
Adds `deferred(strategy)` / `DeferredUserTurnStopStrategy`, a thin
wrapper that forwards an inner strategy's events except
`on_user_turn_stopped`. Use this to install a stop strategy as an
inference trigger only, leaving finalization to a peer (e.g. the LLM
completion strategy).
Adds `llm_completion_user_turn_stop_strategies()` for the common
case:
UserTurnStrategies(
stop=llm_completion_user_turn_stop_strategies(),
)
Deprecates `LLMUserAggregatorParams.filter_incomplete_user_turns`.
The aggregator emits a `DeprecationWarning`, wraps existing stop
strategies with `deferred(...)`, and appends
`LLMTurnCompletionUserTurnStopStrategy` automatically.
Adds an explicit Code Style bullet for the `.. deprecated::` Sphinx
directive (forbidding inline `[DEPRECATED]` tags) and extends the
Docstring Example with a Pydantic params class showing the directive
inside a `Parameters:` block — the context CONTRIBUTING.md's existing
example didn't cover.
Replaces the inline `[DEPRECATED]` tag with a `.. deprecated:: 1.1.0`
directive per CONTRIBUTING.md docstring conventions, so the deprecation
shows up properly in the rendered docs.
When a non-uninterruptible frame was being processed slowly and an
uninterruptible frame was waiting in the queue, _start_interruption
skipped task cancellation. This caused interruptions to stall until
the slow frame finished, even though it had no reason to block them.
The fix: only skip cancellation when the *current* frame is
uninterruptible. Uninterruptible frames already in the queue are
preserved regardless, because __create_process_task calls
__reset_process_queue internally, which always retains them.
Fixes: https://github.com/pipecat-ai/pipecat/issues/4412
grok-3 is being retired from the xAI API on May 15, 2026. Switch the
default to grok-4.20-non-reasoning, which xAI recommends for non-reasoning
workloads and is appropriate for real-time voice AI.
PR #4344 unconditionally switched to normalizedAlignment to fix garbled
words with pronunciation dictionaries (#4316). But normalizedAlignment
returns the post-normalized form of what was spoken - including
romanization of non-Latin scripts (Chinese rendered as pinyin), which
ends up in the LLM context and degrades subsequent turns.
Gate the switch on pronunciation_dictionary_locators being configured.
Adds a _select_alignment helper with preferred-with-fallback (both
fields are nullable per the API schema), used by both the WebSocket
and HTTP services. Tests cover dictionary mode, default mode, fallback
when preferred is missing or null, and HTTP field-name variants.
``examples/function-calling/function-calling-missing-handler.py``
demonstrates the missing-handler path by deliberately advertising a
tool to the LLM without registering its handler — what happens when a
developer forgets to call ``register_function``. Exercises the new
``logger.error`` severity end-to-end without needing to coax the LLM
into hallucinating.
When tools change mid-conversation, LLMs can produce a few different
flavors of tool-call-related hallucination: calling tools that have
been removed, avoiding tools that have been re-added, or hallucinating
output (made-up answers or tool-call-shaped non-tool-calls) when tools
are unavailable.
This change introduces an opt-in ``add_tool_change_messages`` flag on
the LLM aggregators (preferred entry point: ``LLMContextAggregatorPair(
..., add_tool_change_messages=True)``) that appends a developer-role
message to the context whenever ``LLMSetToolsFrame`` changes the set
of advertised standard tools. Helps the LLM stay coherent across tool
changes by spelling out exactly what just became available or
unavailable. Both aggregators participate; whichever handles the
frame first wins, and the other (if any) sees an empty diff against
the shared context and stays silent — order-independent regardless of
whether the frame flows downstream or upstream.
Also tightens the existing missing-handler path (introduced in #4301):
- Reworded the terminal tool result to a neutral "The function
``X`` is not currently available." (overridable via
``LLMService.MISSING_FUNCTION_CALL_MESSAGE_TEMPLATE``). Previously
read "Error: function 'X' is not registered."
- Logs at the call site now distinguish developer error (tool
advertised but no handler registered → ``logger.error``) from
hallucination (tool not advertised → ``logger.warning``).
Includes a manual validation harness
(``examples/features/features-add-tool-change-messages.py``) that
exercises the new ``add_tool_change_messages`` mitigation by flipping
tool availability on a turn counter so its effect can be observed
end-to-end with the flag on vs. off.
Flip the default Inworld TTS model from inworld-tts-1.5-max to
inworld-tts-2 across:
- InworldHttpTTSService (HTTP)
- InworldTTSService (WebSocket)
- InworldRealtimeLLMService (cascade Realtime)
inworld-tts-1.5-max and inworld-tts-1.5-mini remain valid options;
existing users can pin the prior model explicitly via the model
setting. Docstring examples updated to reference the new default.
Polly TTS, Bedrock LLM, and AgentCore previously did
`arg or os.getenv("AWS_...")` and handed the result straight to
aioboto3. When only one of `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY`
was set, aioboto3 received a half-populated kwarg and errored instead of
falling through to the boto3 credential provider chain (instance
profiles, IRSA, ECS task roles, SSO, etc.).
Route credential resolution through the shared `resolve_credentials()`
helper introduced for AWS Transcribe so all four services follow the
same `explicit → env → boto3 chain` fallback. Add an
`AWSCredentials.to_boto_kwargs()` method to bridge the dataclass field
names (`access_key`, `secret_key`) to the aioboto3 kwargs
(`aws_access_key_id`, `aws_secret_access_key`).
No public API changes. Behaviour is identical for fully-explicit and
fully-env-var configurations; partial env vars now correctly trigger
the chain instead of erroring.
Resolve and contain the user-supplied filename before serving it from
the runner's /files endpoint. Also raise a 404 (instead of returning
None) when the downloads folder is unset, and use the resolved
basename for Content-Disposition.
AWS Transcribe STT previously only supported credentials via explicit
parameters or environment variables. Services running with IAM roles
(EKS pod roles, IRSA, ECS task roles, EC2 instance profiles) or SSO
couldn't use Transcribe without exporting static credentials.
Changes:
- Add resolve_credentials() to utils.py providing a standard fallback
chain: explicit params → environment variables → boto3 credential
provider chain (instance profiles, IRSA, pod roles, SSO, etc.)
- Add AWSCredentials dataclass for type-safe credential passing
- Update AWSTranscribeSTTService to use resolve_credentials() instead
of manual os.getenv() calls
- The boto3 fallback is only attempted when both access key and secret
key are unresolved, avoiding replacement of explicitly provided creds
- boto3 is imported lazily inside the function to avoid hard dependency
for services that don't need the fallback chain
- Add 7 unit tests covering the credential resolution chain
The Bedrock LLM and Polly TTS services already support the full
credential chain via aioboto3.Session() and are not modified.
Related to #4197
Two issues were causing TTSSpeakFrame(append_to_context=True) greetings to
silently lose their trailing words and never fire on_assistant_turn_stopped:
- LLMAssistantPushAggregationFrame was emitted without a PTS, so the
transport routed it through the audio (sync) queue while word-level
TTSTextFrames travel through the clock queue. The aggregation could reach
the assistant aggregator before the final words, leaving them orphaned
in the buffer. Stamp the frame with `_word_last_pts + 1` when there are
word timestamps so it can't overtake them.
- The aggregator's LLMAssistantPushAggregationFrame handler called
push_aggregation() directly, bypassing _trigger_assistant_turn_stopped.
For TTS-only flows there is no LLMFullResponseStartFrame, so the turn
start timestamp was never set and on_assistant_turn_stopped never fired.
Open a turn (if needed) and trigger stopped from the handler.
Fixes#4264.
The UI Agent Protocol lets server-side AI agents observe and drive
a GUI app on the client side through structured RTVI messages.
Five new top-level RTVI types in kebab-case, in line with the rest
of the protocol:
ui-event client → server (named event with payload)
ui-command server → client (named command with payload)
ui-snapshot client → server (accessibility tree of the page)
ui-cancel-task client → server (cancel an in-flight task group)
ui-task server → client (task lifecycle envelope)
Each ships paired ``*Data`` / ``*Message`` pydantic models in
``rtvi.models``, following the existing RTVI envelope convention
(``BotReady`` / ``BotReadyData``, ``Error`` / ``ErrorData``, etc.).
Built-in command payload models (``Toast``, ``Navigate``,
``ScrollTo``, ``Highlight``, ``Focus``, ``Click``, ``SetInputValue``,
``SelectText``) ship alongside; matching default React handlers
live in ``@pipecat-ai/client-react``.
Bumps the RTVI ``PROTOCOL_VERSION`` from ``1.2.0`` to ``1.3.0``.
Purely additive: only new top-level message types are introduced;
no existing wire shapes are changed. The major-version
compatibility check on ``client-ready`` still passes for older
1.x clients, so old clients continue to connect without warning;
they simply will not exercise the new types.
The ``RTVIProcessor`` registers a new ``on_ui_message`` event
handler that fires for inbound ``ui-event`` / ``ui-snapshot`` /
``ui-cancel-task`` with the parsed Message envelope, mirroring how
``on_client_message`` works for ``client-message``.
Five new pipeline frames let pipeline observers and processors see
UI traffic the same way they see other RTVI messages, mirroring
the frame-and-event pattern used by ``client-message``:
RTVIUICommandFrame(command_name, payload)
Pushed by downstream code (e.g. ``pipecat-ai-subagents``'s
bridge) to send a UI command to the client. Wrapped by the
observer into a ``UICommandMessage`` envelope.
RTVIUITaskFrame(data: UITaskData)
Same shape but for ``ui-task``; wrapped into ``UITaskMessage``.
``UITaskData`` is a discriminated union of the four lifecycle
kinds (group_started / task_update / task_completed /
group_completed).
RTVIUIEventFrame(msg_id, event_name, payload)
RTVIUISnapshotFrame(msg_id, tree)
RTVIUICancelTaskFrame(msg_id, task_id, reason)
Pushed by ``RTVIProcessor._handle_message`` whenever the
matching inbound message arrives, alongside firing
``on_ui_message``. Pipeline observers and processors can match
on the frame; subscribers like the subagents bridge keep using
the event handler.
The data layer is the canonical authority for the wire format:
higher-level frameworks like ``pipecat-ai-subagents`` build the
agent abstractions on top, and single-LLM Pipecat apps can target
the same wire format directly via custom tools that emit these
typed messages.
The pyright job in `format.yaml` previously installed only `--extra
daily --extra tracing`. That was sufficient when most optional-dep-
using files were in the pyright ignore list, but as this PR has
cleared dozens of files, those files now reference symbols from
optional-dep modules (`aiortc.RTCIceServer` via `IceServer`,
`google.genai.types.HttpOptions`, etc.). `reportMissingImports: false`
tolerates the failed imports themselves, but the imported names
become `Unknown` and using them as type expressions trips
`reportInvalidTypeForm` / `reportAttributeAccessIssue` — errors
that aren't gated by that flag.
Switch to `--all-extras --no-extra gstreamer --no-extra local`
(matching the dev setup in README.md), so pyright sees the same
dependency set the code is intended to be type-checked against and
the install-set scales naturally as more files leave the ignore list.
Also reconcile CLAUDE.md's setup command, which only excluded
`gstreamer`. README.md is canonical and additionally excludes
`local` (pyaudio requires `portaudio` native libs that aren't
installed by default on a clean Ubuntu CI runner).
A fourth pass over low-error-count files. Drops 8 files (57 → 49) and
full-pyright errors from 525 → 496. Default pyright stays clean.
Optional access on transport/client receivers (4 files). Same fix
shape as #4359 — a receiver typed `X | None` accessed without a
guard. For "should never happen" cases (caller's lifecycle ensures
the field is non-None when the method runs), used `assert` rather
than silent early-return so an invariant violation surfaces loudly:
- `transports/whatsapp/client.py` (5 errors): `_validate_whatsapp_webhook_request`
was typed `bytes` / `str` but called with `bytes | None` / `str | None`.
Widened the helper signature and pushed the explicit None-check
inside (matching its existing empty-string check). Also handled
`pipecat_connection.get_answer()` returning `None` — would have
crashed at `.get("sdp")` before.
- `transports/websocket/client.py` (5 errors): four are the deprecated
`websockets.WebSocketClientProtocol` alias (same `# pyright: ignore[reportAttributeAccessIssue]`
as the `services/websocket_service.py` fix from earlier in this PR).
The fifth was `async for message in self._websocket` — traced the
call chain and confirmed `_client_task` is created only after
`self._websocket` is assigned and cancelled before it's cleared, so
the field is never None when `_client_task_handler` runs. Used `assert`.
- `services/openai/stt.py` (4 errors): same pattern. `_receive_messages`
is started by `_connect()` only when `self._websocket` is set, and
the reconnect loop in `WebsocketService._receive_task_handler`
re-establishes it before each retry. `assert` at entry. Plus L478/L483:
the `try`/`except ModuleNotFoundError` import-guard makes
`websocket_connect` and `State` `<type> | None`; `__init__` already
raises `ImportError` if either is None, so an `assert` at the
`_connect_websocket` use site is honest. Plus an L538 `Language | str`
cast (same shape as last batch).
- `services/deepgram/flux/base.py` (2 errors): `event = data.get("event")`
flowed into `_handle_turn_resumed(event: str)` as `Any | None`.
Tightened with an `isinstance(event, str)` guard before the
`FluxEventType(event)` lookup. The other error (`average_confidence > min_confidence`
where `min_confidence: float | None`) was a latent crash on missing
confidence data — restored the original `not min_confidence` (which
treats both `None` and `0.0` as "no filter") and added an explicit
drop-on-missing-confidence-data branch.
`gemini_live` Settings/InputParams (vertex). The deprecated `InputParams`
declares `modalities: GeminiModalities | None` and `media_resolution: GeminiMediaResolution | None`,
but their downstream usage at `services/google/gemini_live/llm.py:952,959`
calls `.value` on each — `None` would crash. Rather than touching the
deprecated input model, translate `None` to the canonical defaults
(`GeminiModalities.AUDIO`, `GeminiMediaResolution.UNSPECIFIED`) at the
assignment site in `vertex/llm.py`. Also fixed an unrelated annotation
bug: `_get_credentials` was annotated `-> str` but actually returns
`service_account.Credentials` (used correctly by the caller — only
the annotation was wrong).
`moondream/vision.py` (3 errors). `frame.format` is `str | None` but
`Image.frombytes(mode, ...)` requires `str`; raise instead of crashing
on missing format. The other two errors are pyright thinking the
moondream2-custom `encode_image` and `query` methods are `Tensor`
(rather than callables) — those are provided by the model code via
`trust_remote_code=True` and aren't visible to pyright on the base
`AutoModelForCausalLM` type. Scoped `# pyright: ignore[reportCallIssue]`
on the two call sites.
`transports/base_output.py` (3 errors). Two are `self._mixer.mix(...)`
calls in `with_mixer`, a closure invoked only when `self._mixer` is
truthy at the call site — captured the mixer to a local variable
inside the closure with an `assert`, then used that. Third is the
PIL `frombytes(mode, ...)` shape — `frame.format is None` early-
return guard at the top of `resize_frame` so the main resize logic
reads cleanly.
`elevenlabs/tts.py` (4 errors). The payload-building dict at L1271
was typed `dict[str, str | dict[str, float | bool]]` — an aspirational
shape that matched only the first two assignments. Subsequent code
assigned `list[dict[...]]` (pronunciation locators) and bools, all
violating the annotation. Same pattern at L926 (the WebSocket-init
`msg`). Both widened to `dict[str, Any]`, which is the honest shape
for a JSON request payload and what similar code uses elsewhere.
Files dropped from the ignore list (57 → 49):
services/deepgram/flux/base.py, services/elevenlabs/tts.py,
services/google/gemini_live/vertex/llm.py,
services/moondream/vision.py, services/openai/stt.py,
transports/base_output.py, transports/websocket/client.py,
transports/whatsapp/client.py.
A third pass over low-error-count files in the ignore list. Drops 10
files (67 → 57) and full-pyright errors from 555 → 525. Default
pyright stays clean.
Optional access guards (4 files). The same fix shape as 9e9b1f39e:
a receiver typed `X | None` accessed without a guard, fixed with a
local-var capture or an early return.
- `mistral/stt.py`: `_connection.send_audio` could crash if
`_connect()` swallowed an exception and left `_connection` unset;
drop the audio chunk with a warning instead. `_receive_events`
iterating `_connection.events()` got the same defensive narrowing.
- `deepgram/flux/stt.py`: `_websocket_url` is set in `_connect`
before `_connect_websocket` is called, but pyright doesn't track
that across methods — assert at the use site. `websocket.response`
is `Response | None` in the websockets stubs even though it's
always populated post-handshake; guarded with a fallback.
- `audio/filters/rnnoise_filter.py`: the module-level import sets
`RNNoise` to `None` if `pyrnnoise` isn't installed; raise
`ImportError` explicitly instead of relying on the existing try-
block to catch the `None(...)` call. Also gated `filter()` with
`or self._rnnoise is None` so pyright sees the narrowing.
- `transports/smallwebrtc/request_handler.py`: `get_answer()`
legitimately returns `None`; raise instead of crashing on three
subscript accesses.
`TTSService` `audio-context` API tightening. Mirroring the
`append_to_audio_context` fix from the previous batch:
`remove_audio_context` was typed `str` but is called with `str | None`
from `get_active_audio_context_id()` results. Widened to `str | None`
and the `None` handling lives in the function body (early debug log
+ return) — matching `append_to_audio_context`'s shape.
`audio_context_available` keeps its narrow `str` signature; asking
"is `None` available?" isn't a meaningful question (`_audio_contexts`
is `dict[str, asyncio.Queue]`). The internal call site in
`on_turn_context_completed` narrows `_turn_context_id` explicitly
before passing it. Side effect: deepgram/tts.py's L307 error clears
without local changes.
`deepgram/tts.py` (4 errors → 0): the same `push_error(ErrorFrame(...))`
latent bug we fixed in resembleai earlier in this PR — `push_error`
takes a string; there's a separate `push_error_frame` for frames.
Two sites switched. The Optional `_websocket.response` access is
guarded the same way as deepgram/flux/stt.py. The `remove_audio_context`
error was cleared by the tightening above.
`aws/utils.py` (3 errors → 0): `AWSTranscribePresignedURL` declared
`session_token: str` but the dict source is `str | None` (AWS
supports long-term IAM creds without a session token). Same for
`vocabulary_name`/`vocabulary_filter_name` on `get_request_url`,
which were typed `str = ""` even though the body uses truthy checks
to skip them. Widened to `str | None = None` — matches actual
runtime semantics.
`audio/dtmf/utils.py` (2 errors → 0): `files("...").joinpath(...)`
returns a `Traversable`, but `aiofiles.open` wants a real path. For
regular pip installs this worked in practice (Traversable was a
`Path`), but it would fail for zipped distributions (zipapp,
zipimport) where the resource isn't on disk. Wrapped in
`importlib.resources.as_file(...)` — the canonical bridge that
extracts to a temp file when the resource isn't already on the
filesystem. Validated end-to-end: regular install still reads bytes;
ad-hoc zipapp test confirmed `as_file` extracts the resource and
returns a real Path.
`openai/image.py` (2 errors → 0): the `size` arg to
`images.generate` is `Literal[...] | None` in the SDK but our
settings field is `str | None`. Mirrored the `groq/tts.py`
hint-not-constraint pattern from the previous batch: defined a
module-level `OpenAIImageSize = Literal[...]` alias with a comment
attributing the upstream symbol and documenting the cast contract
(callers can pass any string; invalid values surface as an OpenAI
API error). Also guarded `image.data[0]` (response.data is
`list[Image] | None`).
`processors/frameworks/{langchain,strands_agents}.py` (4 + 4 → 0):
both processors do `messages[-1]["content"]` on a value typed
`LLMStandardMessage | LLMSpecificMessage` (the latter is a dataclass,
not a dict, so `__getitem__` errors). Historically these only
handled plain-text user messages, so the fix is two explicit guards
(skip if the last message isn't a dict; skip if `content` isn't a
string) plus a TODO noting that other shapes (multi-modal content,
provider-specific messages) aren't supported yet. langchain's
`__get_token_value` also got a small fix where `AIMessageChunk.content`
is `str | list[parts]` but the function declares `-> str`; stringify
the list case. strands_agents' surfaced two unrelated narrows: a
`graph_exit_node: str | None` arg gated by an `__init__`-time assert,
and `agent.stream_async` reached only when we're not in graph mode.
Files dropped from the ignore list (67 → 57):
audio/dtmf/utils.py, audio/filters/rnnoise_filter.py,
processors/frameworks/langchain.py,
processors/frameworks/strands_agents.py, services/aws/utils.py,
services/deepgram/flux/stt.py, services/deepgram/tts.py,
services/mistral/stt.py, services/openai/image.py,
transports/smallwebrtc/request_handler.py.
A second pass over the low-error-count files in the ignore list. Drops
10 files (77 → 67) and full-pyright errors from 580 → 555. Default
pyright stays clean.
Three coherent shapes plus a handful of one-offs:
`Language | str | None` → `Language | None` at STT frame boundaries.
`assert_given(self._settings.language)` returns `Language | str | None`
(strips `_NotGiven`, keeps the rest), but `TranscriptionFrame.language`
expects `Language | None`. In practice both `_settings.language` and
SDK-supplied codes resolve to a `Language` enum value, but technically
they could be raw strings — and `Language` is a StrEnum, so downstream
consumers (which mostly compare/serialize as strings) handle either.
Used `cast("Language | None", ...)` at each call site rather than a
runtime-validating helper, so an unrecognised code (e.g. one we
haven't added to the enum yet) still flows through unchanged. Cleared
azure/stt.py, aws/stt.py, gradium/stt.py; mistral/stt.py keeps the
cast at the SDK boundary (storing under `_detected_language: Language
| None`) but stays in the ignore list because of two unrelated
Optional-access errors.
aiobotocore `async with` stub gap. `aioboto3.Session().client(...)`
is an async context manager at runtime but its stubs don't advertise
`__aenter__`/`__aexit__` to pyright. Scoped
`# pyright: ignore[reportGeneralTypeIssues]` on the two affected
sites: aws/agent_core.py and aws/tts.py. aws/tts.py also had a latent
bug on the no-`AudioStream` path: the original code set
`audio_data = None` and then crashed in `resample(...)` and
`len(audio_data)` below; replaced with an early `return` after
logging — matches the convention elsewhere (OpenAI TTS, etc.) of not
recording usage metrics on the error path.
heygen `event_id: str | None` → `str` at transport→client boundary.
Three call sites in transports/heygen/transport.py passed `self._event_id`
(`str | None`) into client methods that take `str`. Added a guard at
each: `agent_speak_end` and `interrupt` only fire when `_event_id` is
set; `write_audio_frame` warn-and-drops when there's no active bot
event rather than sending a malformed message.
`OpenAIResponsesLLMInvocationParams` TypedDict.
`get_llm_invocation_params` always sets both `input` and `tools` in
the same dict literal, but the TypedDict was `total=False` so direct
subscript access (`invocation_params["input"]`) tripped
`reportTypedDictNotRequiredAccess` in services/openai/responses/llm.py.
Marked both keys `Required[...]`; `instructions` stays non-required
since it's only added when a system instruction is present.
Latent bug in heygen/api_interactive_avatar.py: the code accessed
`request_data.voice.voiceId` and `request_data.voice.elevenlabsSettings`,
but those names are Pydantic *aliases*; the actual attribute names
(used for attribute access) are `voice_id` and `elevenlabs_settings`.
Switched to the field names — those camelCase accesses would have
raised AttributeError at runtime if `voice` was set.
Other small fixes:
- assemblyai/stt.py: the deprecated `connection_params=` init path
was reading `formatted_finals` and `word_finalization_max_wait_time`
off `AssemblyAIConnectionParams`, but those fields were never on
the deprecated input model — they were added to Settings later.
Removed the reads (with a comment noting they're only available
via the canonical `settings=...` API); the deprecated input model
is unchanged.
- rtvi/processor.py: two `about: Mapping[str, Any] = None` parameter
signatures — declared `Mapping`, defaulted to `None`, and both
function bodies already handled the None case. Widened to
`Mapping[str, Any] | None = None`.
- aws/stt.py: `subprotocols=["mqtt"]` failed against websockets'
`Sequence[Subprotocol] | None` (Subprotocol is a NewType wrapper).
Wrapped: `subprotocols=[Subprotocol("mqtt")]`.
Files dropped from the ignore list (77 → 67):
processors/frameworks/rtvi/processor.py, services/assemblyai/stt.py,
services/aws/agent_core.py, services/aws/stt.py, services/aws/tts.py,
services/azure/stt.py, services/gradium/stt.py,
services/heygen/api_interactive_avatar.py,
services/openai/responses/llm.py, transports/heygen/transport.py.
Several adjacent fix shapes that together drop 19 files from the
pyrightconfig.json ignore list (96 → 77) and full-pyright errors from
605 → 580. Default pyright stays clean.
TTS voice/context_id None handling — most files in this batch had a
single error of the shape "value typed `T | None` passed where `T` is
required" coming out of `assert_given(self._settings.voice)` (which
strips `_NotGiven` but not `None`) or `get_active_audio_context_id()`.
Two patterns:
- For services where a missing voice means the request can't proceed
(hume, openai, xtts, groq, kokoro, piper), added an explicit None
check. Inside `run_tts` we yield an `ErrorFrame` and return — matching
each service's existing error-emission style (a few wrap `Exception`
broadly and were fine; openai/hume/xtts had narrower or no try blocks
so a bare `raise ValueError` would have escaped uncaught). Piper
validates in `__init__`, where failing fast at construction is the
right shape. OpenAI also gained a `voice not in VALID_VOICES` guard
with a clear message listing supported voices.
- For services where a missing audio context just means "skip this
message" (fish, lmnt, smallest, sarvam, neuphonic), widened
`TTSService.append_to_audio_context`'s `context_id` signature to
`str | None`. The function body already explicitly handled the None
case with a debug log + early return, so the prior `str` annotation
was a lie; making it honest cleared call sites without local guards.
inworld's `_close_context` got the same treatment.
google.genai imports — switched `from google import genai` to
`import google.genai as genai` in google/image.py and google/llm.py.
The dotted form sidesteps a PEP 420 namespace-package stub gap (the
`google` namespace stubs come from a different distribution and don't
declare `genai`), which means pyright now resolves `genai` to the
real module rather than `Unknown`. IDE autocomplete on `genai.<x>`
works for the first time. In image.py this surfaced three latent
bugs that the `Unknown` resolution had been hiding (model was
`str | _NotGiven | None` not narrowed before passing to the SDK; two
spots accessed `.image_bytes` on an `Image | None` without a guard) —
all fixed. llm.py's dotted import surfaced 8 errors (Content-list
typing nuances, internal `_api_client` access, a few small Optionals);
deferred to a future pass since they're outside this commit's scope,
so the file stays in the ignore list with the dotted import.
Latent bug fixes spotted along the way:
- resembleai/tts.py was calling `push_error(ErrorFrame(...))`, but
`push_error` takes a string — there's a separate `push_error_frame`
for the frame case. Switched to the right method.
- openai/base_llm.py: `max_completion_tokens` was the only sibling
field on `OpenAILLMSettings` missing `| None` in its type, which
caused the assignment in openai/llm.py from `params.max_completion_tokens`
(`int | None`) to fail. Added `| None` for consistency with
`max_tokens` etc.
- heygen/base_api.py: `livekit_url: str = None` and `ws_url: str = None`
declared `str` while defaulting to `None`. Removed the bogus
defaults — both fields are required at construction in every
in-tree call site, and the previous `str = None` was a Pydantic
footgun.
Other small ones: gladia/stt.py needed a None guard on `_session_url`
before `websocket_connect`; openrouter/llm.py's
`build_chat_completion_params` override widened to `dict[str, Any]`
diverging from the parent's `OpenAILLMInvocationParams` — restored
the parent's type; neuphonic/tts.py guarded the receive loop's
`async for message in self._websocket` with a local-variable narrowing
matching the pattern from 9e9b1f39e.
groq/tts.py: tightened `output_format`'s typing to
`Literal["flac","mp3","mulaw","ogg","wav"] | str = "wav"`. The literal
side gives IDE autocomplete hints for the currently-supported set;
the `| str` side keeps callers unblocked if groq adds a new format
before this list is updated. A `cast` at the API boundary satisfies
groq's stricter `Literal` parameter type. The literal alias mirrors
the inlined Literal on `groq.resources.audio.speech.AsyncSpeech.create`'s
`response_format` (the SDK doesn't export it as a named symbol).
websocket_service.py: scoped `# pyright: ignore[reportAttributeAccessIssue]`
on `websockets.WebSocketClientProtocol`. That alias is now a deprecated
re-export from the legacy submodule and pyright doesn't surface it
on the top-level `websockets` namespace; runtime is fine. Migrating
to `websockets.ClientConnection` is a separate piece of work
(transports/websocket/client.py uses the same alias four times) and
left for a future commit.
Files dropped from the ignore list: fish/tts.py, gladia/stt.py,
google/image.py, groq/tts.py, heygen/base_api.py, hume/tts.py,
inworld/tts.py, kokoro/tts.py, lmnt/tts.py, neuphonic/tts.py,
openai/llm.py, openai/tts.py, openrouter/llm.py, piper/tts.py,
resembleai/tts.py, sarvam/tts.py, smallest/tts.py,
websocket_service.py, xtts/tts.py.
Same approach as the previous round — apply boundary casts where the
code does dict-style mutation on TypedDict-typed values, narrow at
return sites, and document the LLMSpecificMessage limitation in
realtime adapters that pack history into a single text message.
aws_nova_sonic_adapter.py — pure typing + small narrowing fixes:
- Filter LLMSpecific items in `_from_universal_context_messages`
(documented).
- `_from_universal_context_message` now declared
`-> AWSNovaSonicConversationHistoryMessage | None` (it already had
paths returning None implicitly).
- `get_messages_for_logging` returns `dict[str, Any]` per element
via `dataclasses.asdict`, matching the declared return type.
- Use a local `role` variable so pyright keeps the narrowing across
the truthy-content guard.
grok_realtime_adapter.py / inworld_realtime_adapter.py — same shape
of fix as `open_ai_realtime_adapter.py` from the previous batch.
The two files are essentially copies of the OpenAI Realtime adapter,
so the same template applies: cast at the boundary, filter
LLMSpecificMessage with a documented note, replace the implicit-None
fallthrough with `raise ValueError`, and switch the `text_content +=`
pattern (which fails when one of the parts is None) to a
`text_parts.append(...)` + `" ".join(...)` pattern.
open_ai_adapter.py — pure typing. Cast at the
`OpenAILLMInvocationParams` return, narrow the system-instruction
warning's `initial_content` to `str | None`, and cast the custom-tools
list to `list[ChatCompletionToolParam]`.
open_ai_responses_adapter.py — pure typing. Same shape: narrow
`first_content` to `str | None` for the warning resolver, cast the
constructed dict literals at append sites where the target is
`ResponseInputItemParam`, and cast `get_messages_for_logging`'s
return to the declared `list[dict[str, Any]]`.
processors/aggregators/llm_context.py — pure typing. Cast the
deepcopied message in the redaction loop in `get_messages` to
`dict[str, Any]` and the create_image/audio_message return-dict
literals to `LLMContextMessage`.
Removes 6 newly-clean files from the pyright ignore list.
Net: -77 pyright errors (full-config: 680 -> 603).
Same shape of fix we applied to anthropic_adapter.py earlier — these
adapters do dict-style mutation on values typed as
ChatCompletionMessageParam (a union of TypedDicts) or against Optional
fields. Apply boundary casts (`cast(dict[str, Any], ...)` for the
mutation block, cast back to the TypedDict at return sites). Most
changes are pure typing (rename + cast); a handful in gemini and
openai_realtime are small defensive bug fixes for code paths that
were latently broken by Optional fields slipping through:
perplexity_adapter.py — pure typing. Cast the deepcopied messages to
`list[dict[str, Any]]` for the role-merging / system-conversion /
trailing-assistant-removal transformations and cast back to
ChatCompletionMessageParam at the return.
bedrock_adapter.py — pure typing. Cast the message to
`dict[str, Any]` at the top of `_from_standard_message` for the
tool-result / tool-use / image-content transformations. Cast the
constructed dict at the return site of `get_llm_invocation_params`.
gemini_adapter.py — typing + several None guards on Content.parts and
related Optional fields. Each guard turns a latent
`TypeError`/`AttributeError` (when the type-system-allowed None
showed up at runtime) into a defensive skip — the type annotations
say these can be None and we now handle that.
open_ai_realtime_adapter.py:
- Typing: cast the deepcopied messages, cast back where needed.
- LLMSpecificMessage handling: previously the function would crash on
the first `.get()` call if any LLMSpecificMessage was in the list.
Filter them out and document the limitation — this adapter's
pack-into-single-text-message strategy doesn't compose with opaque
per-provider payloads.
- Real bug fix: `events.ConversationItem` is a Pydantic BaseModel,
not a TypedDict. The bulk-packing path was constructing a raw dict
where a ConversationItem was expected. Replaced with proper
constructor calls (matches what the single-user-message path
already does).
- Real bug fix: `_from_universal_context_message` was declared
`-> events.ConversationItem` but on the unhandled-message
fallthrough it logged and returned None implicitly. Raise
ValueError so the violation is loud, not silent.
Removes 4 newly-clean files from the pyright ignore list:
adapters/services/{perplexity,bedrock,gemini,open_ai_realtime}_adapter.py.
Net: -95 pyright errors (full-config: 775 -> 680).
Six pyright errors followed the same pattern: a value flowed out of
`self._settings.X` (typed `T | _NotGiven`) into a context that wanted
the plain `T`. Wrap each with `assert_given(...)` so the sentinel
gets stripped at the boundary:
- aws/nova_sonic/llm.py: `_settings.model` (in InvokeModel...Input)
and `_settings.system_instruction` (passed to the adapter).
- deepgram/flux/base.py: iterating `_settings.keyterm`.
- google/stt.py: iterating `_settings.languages`.
- google/tts.py: iterating `_settings.speaker_configs`.
- openai/base_llm.py: `_settings.system_instruction` passed to the
adapter.
Also takes a deeper pass at the related Google STT issue: the override
of `language_to_service_language` had been broadened to take
`Language | list[Language]` and return `str | list[str]`, a Liskov
violation against the base's `Language -> str | None` contract.
External callers always pass a single Language, and the only consumer
of the list path was Google STT's own `_get_language_codes`. Restore
the override to a single-Language signature and let
`_get_language_codes` iterate. The override is also tightened to
return `str` (narrower than the base's `str | None`, which is
LSP-compatible) since it always falls back to `"en-US"` rather than
returning None.
Net: -7 pyright errors (full-config run: 782 -> 775).
These provider-specific helpers are all thin wrappers around
`resolve_language(...)`, which itself returns `str` — never `None`.
The `str | None` annotations were misleading and were producing
spurious pyright errors at the call sites that assigned the result
into a `str` field. Update each helper's signature to `str` and
rewrite the `Returns:` docstring to describe the actual fallback
behaviour (resolve to base or full code, with a warning).
Importantly, the per-class `language_to_service_language(...)`
methods on `STTService`/`TTSService` subclasses keep `str | None` as
their return type. That signature is an extension hook for future
and/or third-party subclasses that may genuinely not be able to
produce a code for some languages, even though all in-tree first-
party services currently return a string.
Also includes one small unrelated tightening in azure/stt.py: wrap
`self._settings.language` with `assert_given(...)` so the truthy
fallback to `language_to_azure_language(Language.EN_US)` doesn't
silently swallow a NotGiven sentinel.
Net: -3 pyright errors (full-config run: 785 -> 782).
Pyright flagged 19 sites where `await self._<connection>.send/recv/...`
was called on a receiver typed `X | None`. Each kind of call site
needed a slightly different fix to be both type-safe and behaviour-
preserving:
Streaming/user-facing paths (early return + warn — drop and warn is
the right runtime fail-safe when reconnect didn't succeed):
- cartesia/stt.py (run_stt)
- soniox/stt.py (_send_keepalive)
- elevenlabs/tts.py (run_tts — yields ErrorFrame and returns)
- deepgram/sagemaker/tts.py (run_tts)
- transports/lemonslice/transport.py (send_message)
- transports/tavus/transport.py (send_message)
"Should never happen" cases (early return with comment, no warn —
caller already gated on a separate `_is_*` check, so a warn would be
noise):
- deepgram/flux/stt.py (transport methods, gated by _transport_is_active)
- deepgram/flux/sagemaker/stt.py (same)
- stt_service.py (_send_keepalive, gated by _is_keepalive_ready)
- elevenlabs/stt.py (_send_keepalive, same)
- llm_service.py (_ws_recv — raises ConnectionError to match
_ensure_connected's contract)
- heygen/client.py (receive loop, gated by self._connected)
Just-assigned-above (use a local variable so pyright keeps the
narrowing across statements):
- lmnt/tts.py
- gradium/stt.py
- fish/tts.py
Other:
- transports/websocket/server.py — used the existing local `websocket`
parameter in scope instead of `self._websocket` for the close call.
- websocket_service.py — `send_with_retry` raises ConnectionError when
`self._websocket` is None inside the existing try-block, so the
broad `except Exception` triggers reconnect just as it would on a
real send failure (preserving the prior behaviour where None
silently fell through to the AttributeError-driven reconnect path).
Drops three now-clean files from the pyright ignore list: cartesia/stt.py,
elevenlabs/stt.py, and soniox/stt.py.
After making LLMService generic, an unparameterized subclass
(`class MyService(LLMService):` with no bracket — the third-party
provider pattern) saw `get_llm_adapter()` return `Unknown` rather
than `BaseLLMAdapter` as it did before the refactor.
Add `default=BaseLLMAdapter` (PEP 696) on the TypeVar — via
`typing_extensions.TypeVar` so older Python targets keep working —
so unparameterized callers get `LLMService[BaseLLMAdapter]` and
`get_llm_adapter()` returns `BaseLLMAdapter`, matching the
pre-refactor type precision.
Two internal fallouts of having a default (where the default makes
unannotated `LLMService` resolve invariantly to
`LLMService[BaseLLMAdapter]`):
- `FunctionCallParams.llm` is now `LLMService[Any]` so concrete
parameterizations like `LLMService[OpenAILLMAdapter]` can be
passed where the field is set.
- The explicit `LLMService.__init__(self, **kwargs)` in
`WebsocketLLMService.__init__` gets a `pyright: ignore[reportArgumentType]`
comment — pyright's invariance handling can't see through the
multi-inheritance + generic + default combination, but the
runtime call is correct (generics are erased).
Two follow-ups now that LLMService is generic over its adapter:
- Add an explicit backward-compat test verifying that an LLMService
subclass with no generic parameter (the third-party-provider
pattern) instantiates and returns a usable adapter. The existing
MockLLMService (declared without brackets) already exercised this
implicitly, but it's worth a named assertion.
- Drop the now-redundant `params: SomeLLMInvocationParams = ...`
variable annotations on `adapter.get_llm_invocation_params()`
results. Since `get_llm_adapter()` now returns the precise adapter
type, and `BaseLLMAdapter` is generic in its invocation-params
type, the call already infers the right TypedDict.
Previously, `LLMService.get_llm_adapter()` returned `BaseLLMAdapter`,
which forced every caller that wanted the precise adapter type to
write `adapter: SomeAdapter = self.get_llm_adapter()` and accept
pyright's complaint that the assignment doesn't match the declared
type. That pattern existed in 17 places across the LLM services.
Make `LLMService` generic over its adapter type — `LLMService(...,
Generic[TAdapter])` with `TAdapter = TypeVar("TAdapter",
bound=BaseLLMAdapter)` — so subclasses opt in via
`LLMService[XAdapter]` and callers get the precise type back from
`get_llm_adapter()` automatically.
Backward-compatible for third-party providers: code that says
`class MyService(LLMService):` (no bracket) still type-checks, with
TAdapter resolving to BaseLLMAdapter from the bound — identical to
the pre-refactor behavior. The `adapter_class` attribute keeps its
loose `type[BaseLLMAdapter] = OpenAILLMAdapter` typing so the default
remains usable; one localized cast in `__init__` bridges the loose
class attr to the precise instance attr.
In-tree subclasses opted in:
- AnthropicLLMService -> LLMService[AnthropicLLMAdapter]
- AWSBedrockLLMService -> LLMService[AWSBedrockLLMAdapter]
- AWSNovaSonicLLMService -> LLMService[AWSNovaSonicLLMAdapter]
- BaseOpenAILLMService -> LLMService[OpenAILLMAdapter] (propagates to
~15 OpenAI-compatible providers like Cerebras, Groq, Together)
- GeminiLiveLLMService -> LLMService[GeminiLLMAdapter]
- GoogleLLMService -> LLMService[GeminiLLMAdapter]
- GrokRealtimeLLMService -> LLMService[GrokRealtimeLLMAdapter]
- InworldRealtimeLLMService -> LLMService[InworldRealtimeLLMAdapter]
- OpenAIRealtimeLLMService -> LLMService[OpenAIRealtimeLLMAdapter]
- _BaseOpenAIResponsesLLMService -> LLMService[OpenAIResponsesLLMAdapter]
- WebsocketLLMService is also generic so the multi-inheritance case
(OpenAIResponsesLLMService) can keep both bases agreeing on TAdapter.
All 17 redundant `adapter: SomeAdapter = self.get_llm_adapter()`
annotations are now plain `adapter = self.get_llm_adapter()`.
Same pattern as the earlier get_setup_params fix: when context tools
are absent, the fallback `adapter.from_standard_tools(self._tools)`
can return the NotGiven sentinel, and `_send_prompt_start_event`
expects a list. Coerce via `or []` so the NotGiven case becomes an
empty list.
Three small changes that resolve pyright errors and sharpen the logic:
- Guard `self._context` with the codebase's "should never happen"
early-return pattern, so we don't blindly call `.get_messages()` on
None.
- Skip `LLMSpecificMessage` items in the iteration. They're opaque
provider-specific payloads with no `.get()`, and the surrounding
logic only applies to standard tool-result messages.
- Match `role == "tool"` explicitly. The previous truthy-only check
was working by accident — the `tool_call_id` filter further down
was effectively narrowing to tool messages, but the intent is
clearer when stated upfront.
reset_conversation is part of the public AWSNovaSonicLLMService API and
is also called internally from the receive-task error handler.
Previously it captured `self._context` (typed `LLMContext | None`) and
unconditionally passed it to `_handle_context`, which expects a real
context — silently doing the wrong thing if no initial context had
been received yet.
Treat that as developer error: log a warning and return early. Nothing
to preserve means nothing to reset.
The service implements the NovaSonicSessionSender protocol so the
session-continuation helper can target either the current or next
session. The protocol declares
`get_setup_params(self) -> tuple[str | None, list]`, but the
implementation was unannotated and could return NotGiven in the tools
position when from_standard_tools fell through to its NotGiven
sentinel. Add the matching return annotation and coerce the NotGiven
case to an empty list.
Same MessageParam content-typing issue as the consecutive-message merge
fix: pyright doesn't carry the str-to-list narrowing forward, and
Iterable has no `[-1]` access. Cast to `list[Any]` and document the
chain of assumptions (list, non-empty, dict-typed last item) and where
each is upheld upstream.
This brings anthropic_adapter.py to 0 pyright errors (down from 115).
The function takes an OpenAI ChatCompletionMessageParam (a union of
TypedDicts) and returns an Anthropic MessageParam (a different
TypedDict). It does the conversion via dict-level mutations that don't
type-check against either side's TypedDict schema. Work with the
deepcopied message as a plain dict and cast to MessageParam at the
return sites — matching the boundary-cast convention noted in
llm_context.py.
Drops anthropic_adapter.py from 20 to 2 pyright errors.
The fallback path in `_from_universal_context_message` returns
`message.message` from an `LLMSpecificMessage`, which is typed loosely
(`Any | dict`). The surrounding comment already documents the
assumption that the message is already in Anthropic format — make that
assumption explicit to pyright with a cast.
MessageParam types content as `str | Iterable[...]`, and Iterable has
no `.extend()`. After the str-to-list conversions, pyright re-reads
the TypedDict field as the original wide type rather than carrying the
narrowing forward. Cast to `list[Any]` to express the codebase's
existing str-or-list assumption.
Drops anthropic_adapter.py from 23 to 21 pyright errors.
Content items in MessageParam have a heterogeneous union type (Pydantic
ContentBlock variants and TypedDict *BlockParam variants), neither of
which supports the dict-style access and mutation this sanitizer does.
Treat the deepcopied message as a plain dict and guard each content
item with isinstance(item, dict) — matches the runtime shape produced
by _from_standard_message and avoids crashing if a non-dict ever flows
through the LLMSpecificMessage path.
Drops anthropic_adapter.py from 115 to 23 pyright errors.
xAI's Voice Agent API selects the model via the ?model= query
parameter on the WebSocket URL; it cannot be changed later via
session.update. The Grok Realtime service was setting the model in
Settings but never including it in the connection URL, so every
session silently fell back to the deprecated default
grok-voice-fast-1.0.
Append the model from Settings to the WebSocket URL on connect, and
default to the recommended grok-voice-think-fast-1.0.
Adds a `mip_opt_out` init parameter to both `DeepgramTTSService` (WebSocket)
and `DeepgramHttpTTSService` so callers can opt out of the Deepgram Model
Improvement Program. When set, the value is forwarded as a query parameter
on the request, matching the pattern used by the Deepgram STT services.
Broaden `tool_resources` to `app_resources` for easy access not just in
tool handlers but in other places like custom `FrameProcessor`s.
Involves 3 changes:
- A rename: `tool_resources` -> `app_resources`
- A new property on `PipelineTask`: `app_resources`
- A new property on `FrameProcessor`: `pipeline_task`
Usage in tool handler:
async def get_weather(params: FunctionCallParams):
resources = cast(MyAppResources, params.app_resources)
...
Usage in custom `FrameProcessor`:
class MyProcessor(FrameProcessor):
async def process_frame(self, frame, direction):
await super().process_frame(frame, direction)
if self.pipeline_task is not None:
resources = cast(MyAppResources, self.pipeline_task.app_resources)
...
The previous `tool_resources` aliases (on `PipelineTask`,
`FunctionCallParams`, and `FrameProcessorSetup`) keep working but are
deprecated as of 1.2.0 and emit `DeprecationWarning`s.
The four krisp test files installed a process-wide mock of
importlib.metadata.version with `patch(...).start()` at module level and
never called .stop(). Once any of these files was collected, the mock
leaked across the rest of the test session, returning '0.0.0-dev' for
every version check. This corrupted unrelated tests that triggered
transformers' import-time dependency check (e.g. lazy imports of
LocalSmartTurnAnalyzerV3) — transformers saw tqdm=='0.0.0-dev' and
refused to load.
Wrap the pipecat imports in `with patch(...)` so the mock is active
during import (where pipecat's krisp version check needs it) and torn
down before any tests run.
Importing pipecat.turns.user_turn_strategies pulled in
LocalSmartTurnAnalyzerV3 → transformers → onnxruntime at module load
time. Since this module is imported by llm_response_universal (and
therefore most LLM services), any LLM service import paid the cost of
loading transformers and triggered its missing-backend warning in
environments without PyTorch/TF/Flax.
Move the LocalSmartTurnAnalyzerV3 import into
default_user_turn_stop_strategies() so it only loads when the default
smart-turn strategy is actually constructed.
Fixes#4392
The non-200 branch yielded an ErrorFrame and then raised, which the outer
except caught and yielded a second, less informative "Unknown error" frame.
Return after the yield and fold the status code into the message.
Pyright flagged the .post() call on a possibly-None _session. Raise a
clear RuntimeError if start() wasn't called instead of crashing on the
attribute access.
SPELL/EMOTION_TAG/PAUSE_TAG/VOLUME_TAG/SPEED_TAG are stateless and worked
only via class-level access. Decorating them lets instance access work too
and silences the missing-self lint warning.
- Bump default cartesia_version to 2026-03-01.
- Replace deprecated use_original_timestamps with use_normalized_timestamps
so word timestamps match what was actually spoken.
- Add max_buffer_delay_ms init arg; auto-derive 0 in SENTENCE mode to avoid
the doc-warned "middle ground" of client + server buffering, leave unset
in TOKEN mode for managed buffering.
- Silently consume flush_done messages now emitted per transcript when
server-side buffering is disabled.
Adds a `session_id: str | None` field to `RunnerArguments` so bots can
log/trace a per-session identifier in local development the same way
they can in Pipecat Cloud (where it is provided via the
`x-daily-session-id` header).
The local runner now mints a UUID at every `*RunnerArguments`
construction site. For paths that already returned a `sessionId` to the
caller (Daily `/start`, dial-in webhook), a single UUID is now generated
and shared between `runner_args.session_id` and the response body
instead of being thrown away. The SmallWebRTC `/api/offer` endpoint
accepts an optional `session_id` so the `/sessions/{session_id}/...`
proxy can thread it through.
This is the prerequisite step for collapsing pipecat-cloud's
`SessionArguments` / `*SessionArguments` hierarchy onto the upstream
runner types.
So the rendered changelog has the (PR [...]) line aligned as a list
continuation under its bullet. Verified with both short and wrapped
entries via `towncrier build --draft`.
Introduce SonioxTTSService, a WebSocket TTS provider that streams text and
receives audio over a persistent connection, multiplexing up to 5 concurrent
streams per socket via Soniox's `stream_id`. Also updates the README service
table and the Soniox voice example to use the new TTS end-to-end.
Replaces the hardcoded camera publishing send settings in
DailyTransport with a new DailyParams.camera_out_send_settings dict that
applications can pass through verbatim to the Daily client. This makes
the encoding/codec/bitrate configuration user-controllable instead of
being driven solely by the generic TransportParams fields.
As a consequence, TransportParams.video_out_bitrate is deprecated for
the Daily transport (now configured via camera_out_send_settings) and
its default is changed to None.
Adds a dedicated screen video track alongside the existing camera track
so applications can publish to Daily's built-in "screenVideo" destination
via video_out_destinations. The track is created at join time and wired
into the client settings (inputs and publishing) when "screenVideo" is
configured; write_video_frame routes frames to the appropriate track
based on the frame's transport_destination.
Bound methods are created fresh on each attribute access, so
'self._missing_function_call_handler is self._missing_function_call_handler'
is always False. Using 'is' meant the placeholder branch never fired and
both warnings logged when a function was missing at queue time.
Switch to == so equality compares the underlying function and instance.
Strengthen the missing-at-queue-time test to assert the second warning
does not fire.
Address review feedback: a function may be unregistered between when
run_function_calls queues it and when _run_function_call executes it.
Restore the live lookup, falling back to the missing-function handler
when the entry is gone, so the call still terminates with a normal
tool result. Factor the missing-handler item construction into a
helper since it's now built in two places.
Runner-created Daily rooms previously had no expiration when callers
posted partial `dailyRoomProperties` (e.g. `{"start_video_off": true}`).
The model-default `exp=None` and `eject_at_room_exp=False` meant Daily's
cron never cleaned them up, so rooms accumulated indefinitely.
Encode the policy in the runner: define `PIPECAT_CLOUD_ROOM_EXP_HOURS=4.0`,
inject `exp` and `eject_at_room_exp=True` into user-supplied properties via
`setdefault` (so explicit caller values still win), and pass
`room_exp_duration` to all four `configure()` call sites.
* VIVA SDK TT v3 support
* Format fix.
* Renamed the API naming, removed '3' from the name.
* Implementation of User turn start strategy using Krisp VIVA Interruption Prediction in scope of TT v3 support.
* TT demo tool
* Some improvements for demo scripts, audio recordin, etc.
* Enhance demo scripts with VAD selection and audio embedding features. Updated HTML report to include annotated audio players and improved response time metrics in summary formatting. Added README for setup and usage instructions.
* Refactor interrupt prediction demo to compare multiple interruption strategies (Krisp IP vs VAD). Updated README with usage instructions and output details. Enhanced audio processing with new helper functions for generating beeps and mixing audio.
* Refactor demo scripts to improve latency metrics by introducing total_delay property in TurnEvent. Update formatting in reports and visualizations to reflect accurate speech end times, including VAD wait times. Enhance HTML report with detailed latency information and adjust audio processing to account for VAD stop seconds.
* Add audio resampling functionality and update demo scripts for improved audio processing
- Introduced `resample_audio` function to handle audio resampling with linear interpolation.
- Updated `demo_audio_recorder.py` to utilize the new resampling feature, ensuring audio is saved at the requested sample rate.
- Modified `demo_interrupt_prediction.py` and `demo_turn_taking.py` to resample audio to 16 kHz for compatibility with Silero VAD.
- Adjusted imports in demo scripts to include the new resampling function.
- Enhanced error handling for sample rate discrepancies in audio recording.
* Enhance demo_interrupt_prediction.py with VAD type selection and improved processing logic
- Added support for selecting between "silero" and "krisp" VAD engines in the demo script.
- Introduced a new create_vad function to configure VAD analyzers based on the selected type.
- Updated audio processing logic to handle VAD type-specific resampling and state management.
- Modified the KrispVivaIPUserTurnStartStrategy to utilize a separate vad_flag for per-frame VAD input, improving interruption detection accuracy.
* Refactor audio processing scripts for improved readability and consistency
- Updated type hinting in `resample_audio` function to use `tuple` instead of `Tuple`.
- Simplified print statements in `demo_audio_recorder.py`, `demo_formatting.py`, and `demo_interrupt_prediction.py` for better readability.
- Adjusted argument formatting in `demo_audio_recorder.py` and `demo_formatting.py` for consistency.
- Cleaned up list comprehensions in `demo_formatting.py`, `demo_html_report.py`, and `demo_interrupt_prediction.py` for clarity.
- Enhanced error handling in `__init__.py` for the KrispVivaIPUserTurnStartStrategy import.
* Refactor VAD handling in KrispVivaIPUserTurnStartStrategy and update tests for clarity
- Simplified the argument formatting in the _handle_vad_started method for improved readability.
- Updated test assertions to reflect changes in VAD processing logic, ensuring that the vad_flag is correctly set to False during continuous state processing.
- Enhanced test cases to verify that the process method is called appropriately under different conditions.
* more format fixes.
* removed demo scripts.
* reverted wrongly removed file.
* Corrected the IP integration logic.
* style fix.
* Refactor audio processing and state management in KrispVivaIPUserTurnStartStrategy
- Removed the unused _vad_flag attribute to streamline state tracking.
- Updated the reset method to clear the audio buffer instead of resetting the vad_flag.
- Adjusted the process_frame method to use _speech_active for VAD input, enhancing clarity in the logic.
- Modified tests to reflect changes in state management and ensure proper functionality of the reset method and audio buffer handling.
* FIxed formatting
---------
Co-authored-by: Aram Poghosyan <apoghosyan@krisp.ai>
Pipecat 1.0.8 hard-required protobuf 6.x via the base `protobuf>=6.31.1,<7`
pin, blocking users whose dependency graph already constrains protobuf to
the 5.x line. The original bump (PR #4136) was only needed because
`nvidia-riva-client>=2.25.1` ships gencode compiled with protoc 6.31.1.
Changes:
- Widen base pin to `protobuf>=5.29.6,<7`.
- Regenerate `frames_pb2.py` with `grpcio-tools~=1.67.1` (protoc 5.x). Per
Google's cross-version runtime guarantee, 5.x gencode runs on both 5.x
and 6.x runtimes, so this single artifact serves all users.
- Loosen the dev pin `grpcio-tools` to `>=1.67.1,<2` so contributors can
install `pipecat[dev,nvidia]` without resolver conflict. Comment in
`frames.proto` documents the 1.67.x requirement for regeneration.
- Add an explicit `protobuf>=6.31.1,<7` to the `nvidia` extra. This
compensates for nvidia-riva-client's missing `protobuf` install
requirement (upstream packaging gap, see
https://github.com/nvidia-riva/python-clients/issues/172). When that
issue is resolved, the explicit protobuf entry in the `nvidia` extra
can be removed.
Verified: pipecat imports cleanly on both protobuf 5.29.6 and 6.33.6;
`tests/test_protobuf_serializer.py` passes; `import riva.client` succeeds
when `pipecat[nvidia]` is installed.
Nova Sonic sessions have an AWS-imposed ~8-minute time limit. This adds
transparent session continuation that rotates sessions in the background
before the limit is reached, preserving conversation context with no
user-perceptible interruption.
Implementation follows the AWS reference architecture:
- Monitor loop detects when session age exceeds threshold
- On assistant AUDIO contentStart: start buffering user audio, create next
session (sessionStart + promptStart + system instruction)
- Track SPECULATIVE/FINAL text counts as completion signal
- On completion signal: send conversation history + audioInputStart +
buffered audio to next session, then promote immediately
- Close old session in background (non-blocking)
- Dead session detection: recreate next session if idle >30s
Key design decisions:
- Session continuation enabled by default (fundamental for long conversations)
- Conversation history tracked in real-time via _sc_conversation_history
(independent of pipeline context aggregator which updates asynchronously)
- Completion signal check in _handle_content_end_event (after history update)
to ensure latest text is included in handoff
- Rolling audio buffer (default 3s) captures user audio during transition
- transition_threshold_seconds capped at 420s (7min) for safety margin
- Unified event methods (_send_text_event, _send_client_event, etc.) accept
optional stream/prompt_name params, eliminating duplicate SC methods
Also adds:
- SessionContinuationParams config (enabled, threshold, buffer, timeout)
LLMContext's NotGiven, LLMContextToolChoice, and LLMStandardMessage are
currently aliased to their OpenAI equivalents, so passing values
between the two sides type-checks implicitly. That works today but
obscures the fact that these are meant to be conceptually distinct —
if LLMContext ever diverges from OpenAI's types, every implicit
crossing would silently break.
Introduce two module-private cast helpers in open_ai_adapter.py:
- _openai_from_llm_context_tool_choice(tool_choice)
- _openai_from_llm_standard_message(message)
Both are typed no-ops today (implemented with typing.cast) but each
carries a docstring explaining why the cast is present, and every
boundary crossing now routes through a named function. Future readers
(and future greps) can find the crossings; a later divergence becomes
a mechanical find-and-update rather than hunting through adapter code.
No behavior change, no pyright error delta.
After widening TTSSettings.voice to str | None | _NotGiven (so other
TTS services can opt into None as a valid "no voice" state), pyright
flagged Speechmatics' URL builder receiving str | None where it
required str.
Speechmatics has no "no voice" mode (the URL path includes the voice
name), so override the inherited field in SpeechmaticsTTSSettings to
str | _NotGiven. The call site stays as a plain assert_given(...)
without an extra None check.
Three LLM services initialize certain Settings fields with the SDK's
NOT_GIVEN (openai.NOT_GIVEN or anthropic.NOT_GIVEN) so the value
flows unmodified into SDK API calls. The inherited field types from
LLMSettings only admit pipecat's _NotGiven, so pyright flagged each
constructor call as a flavor mismatch.
Widen the field types in each service-specific Settings subclass so
they accept both pipecat's _NotGiven (for delta-mode defaults) and
the corresponding SDK NotGiven (for store-mode passthrough):
- OpenAILLMSettings: frequency_penalty, presence_penalty, seed,
temperature, top_p, max_tokens, max_completion_tokens.
- OpenAIResponsesLLMSettings: temperature, top_p,
max_completion_tokens.
- AnthropicLLMSettings: temperature, top_k, top_p, thinking.
Every overridden field is genuinely read from self._settings and
passed directly to the SDK, so none of the overrides are vestigial.
Clears 21 pyright errors and restores test_service_settings_complete
parity with the pre-NOT_GIVEN-swap state.
asyncai/tts and google/vertex/llm are now clean after the missing-None
sweep (both benefited from the TTSSettings.voice / LLMSettings
cascades).
- src/pipecat/services/asyncai/tts.py
- src/pipecat/services/google/vertex/llm.py
Service-specific Settings subclasses declared fields as T | _NotGiven
(no None), but the services routinely pass None to those fields during
init to mean "don't override — use the vendor's default". The field
type just didn't reflect that a None value is valid, so pyright
flagged every None at the call sites.
Change the declarations to T | None | _NotGiven, matching the pattern
already used by ServiceSettings.model and TTSSettings.language. No
constructor-call changes; the default_factory stays NOT_GIVEN.
Fields touched across 11 files:
- services/settings.py: TTSSettings.voice (base class; covers
asyncai, cartesia, elevenlabs, fish, hume, kokoro, lmnt, mistral,
neuphonic, piper, resembleai, rime, xtts TTS services).
- services/aws/llm.py: latency.
- services/aws/tts.py: engine, pitch, rate, volume, lexicon_names.
- services/azure/tts.py: emphasis, pitch, rate, role, style,
style_degree, volume.
- services/google/gemini_live/llm.py: vad.
- services/google/llm.py: thinking.
- services/google/stt.py: language_codes.
- services/inworld/tts.py: speaking_rate, temperature.
- services/openai/tts.py: instructions, speed.
- services/speechmatics/stt.py: 13 fields (domain, operating_point,
max_delay, end_of_utterance_*, punctuation_overrides, *_partials,
split_sentences, enable_diarization, speaker_*, max_speakers,
prefer_current_speaker, extra_params).
- services/ultravox/llm.py: output_medium.
Clears 94 pyright errors (1035 -> 941).
Three files no longer have pyright errors after the is_given /
assert_given sweep — remove them from the ignore list (which serves as
a live todo of files with remaining type errors).
- src/pipecat/processors/gstreamer/pipeline_source.py
- src/pipecat/services/camb/tts.py
- src/pipecat/services/speechmatics/tts.py
Apply assert_given across service modules to narrow reads from
store-mode settings fields (self._settings.X, default_settings.X),
where _NotGiven is declared in the field type but should never appear
at runtime (enforced by validate_complete()).
Two idioms used:
- Inline wrap for single uses:
func(assert_given(self._settings.enable_prompt_caching), ...)
- Extract-and-reuse when the same value is used multiple times:
thinking = assert_given(self._settings.thinking)
if thinking:
params["thinking"] = thinking.model_dump(...)
43 service files touched. Cleared ~172 pyright errors; remaining
_NotGiven-related errors are in adjacent categories (flavor mismatch
between openai/anthropic NotGiven and pipecat _NotGiven, settings
field types that should allow None but don't) that need different
fixes.
In store-mode settings objects, _NotGiven should never appear (the
invariant enforced by validate_complete). But the declared field types
still include _NotGiven because the same class doubles as delta mode,
so every field read is typed X | None | _NotGiven and pyright flags
operations that assume X | None.
assert_given is a one-line extractor that narrows away _NotGiven and
raises loudly if the invariant is violated — preferable to scattering
is_given guards that defend against something that can't occur in
practice.
resolved_model = assert_given(self._settings.model) # str | None
Replace direct identity checks against NOT_GIVEN with is_given() at
sites where pyright's inability to narrow on non-singleton sentinels
was causing type errors.
- adapters/services/anthropic_adapter.py: narrow converted.system for
_resolve_system_instruction.
- services/openai/llm.py: narrow params.service_tier using OpenAI's
is_given.
- services/sarvam/llm.py: narrow tools / tool_choice using OpenAI's
is_given (aliased as openai_is_given alongside the existing
settings.is_given import).
- services/sarvam/tts.py: narrow settings.voice using settings.is_given.
Pyright can't narrow identity checks against module-level NotGiven
sentinels (they aren't typed as singletons), which leaves many
NotGiven-bearing unions stuck as unnarrowed types throughout the
codebase. Introduce is_given TypeGuard helpers so narrowing works via
isinstance under the hood.
Each helper is co-located with the NotGiven flavor it guards:
- services/settings.py: upgrade the existing is_given to a TypeGuard.
- processors/aggregators/llm_context.py: add an is_given for
LLMContext's NotGiven. Treat LLMContext's re-exported types
(LLMStandardMessage, LLMContextToolChoice, NOT_GIVEN, NotGiven) as
LLMContext's own — independent definitions that happen to coincide
with OpenAI's as an implementation detail.
- adapters/services/anthropic_adapter.py: add is_given for anthropic's
NotGiven.
- adapters/services/open_ai_adapter.py: add is_given for openai's
NotGiven.
TypedDict types are not subtypes of dict[...] in the type system
(per PEP 589), so TypedDict-based invocation param classes could not
satisfy the TypeVar bound. Mapping[str, Any] accepts TypedDicts while
preserving the "string-keyed mapping" constraint.
The original contributor's PR (#4328) landed as #4355. Rename the fragment
so the rendered changelog links to the merged PR, and add the leading `- `
bullet prefix that towncrier expects.
Extends the reconnect re-seeding fix to work cleanly on Gemini Live 2.5,
which has stricter seed requirements than 3.x and a documented audio-input /
history-recall limitation. Both initial connection and reconnect now share a
single code path (`_create_initial_response(for_reconnect=...)`), with four
well-documented cases.
On Gemini 2.5 reconnect, `turn_complete=True` is now forced on the seed so
the model produces a recap-style response immediately instead of briefly
acting "forgetful" on the user's next utterance — the latter being
especially jarring mid-conversation. When a 2.5 seed doesn't already end
with a user turn (e.g. the bot had finished speaking before the disconnect),
a blank user turn is appended to satisfy the server's seed-shape
requirement. Gemini 3.x needs neither workaround.
Tkinter's `Label` only stores `PhotoImage` references at the C level, so
Python GC eats them unless something on the Python side keeps a
reference. The canonical fix is to stash the reference on the widget
itself: `label.image = photo`. Tkinter widgets are plain Python objects,
so the assignment works at runtime, but the stub declares no `image`
attribute (correctly — there isn't one; we're adding it).
Narrow the suppression to `# type: ignore[attr-defined]` on the one
line. The existing comment above the assignment already documents why.
Mistral imposes three conversation-history quirks on top of the
OpenAI-compatible wire format: tool messages must be followed by an
assistant message; non-initial system messages are rejected; trailing
assistant messages require `prefix=True`. These rules were applied
inline in `MistralLLMService.build_chat_completion_params`, which is the
wrong layer — every other provider with OpenAI-compatible-but-quirky
shape (Perplexity, etc.) owns its transformations in a
`BaseLLMAdapter` subclass that runs during `get_llm_invocation_params`.
Create `MistralLLMAdapter(OpenAILLMAdapter)` on the Perplexity template
and wire it in via the existing `adapter_class` dispatch. The service
now only handles Mistral-specific request-level mapping (`random_seed`
in place of `seed`), and the message shape concerns live with other
provider format logic.
No behavior change. The transform function casts to `list[dict[str,
Any]]` internally because mutating `role` and attaching Mistral's
non-standard `prefix` field both step outside OpenAI's TypedDict
contract; the cast at the return boundary encodes that we're emitting
Mistral's extended schema, not OpenAI's.
`inspect.getdoc()` returns `str | None`, but `docstring_parser.parse()`
requires `str`. Functions without a docstring produced `None`, which
the type checker correctly flagged.
Coerce to `""` at the call site. `docstring_parser.parse("")` returns
an empty docstring whose `.description` and `.params` are already
handled by the surrounding `or ""` fallbacks, so runtime behavior is
unchanged.
`ToolsSchema.__init__` declared `standard_tools: list[FunctionSchema |
DirectFunction]`. Callers (`BaseLLMAdapter`, `MCPService`) pass in
`list[FunctionSchema]`, which is not assignable to the union list
because `list` is invariant in its element type.
Widen the parameter to `Sequence[...]` (covariant) so `list[X]` and
`list[X | Y]` both fit. A narrower `list[FunctionSchema]` is still
accepted, and nothing in this class mutates the argument — the
constructor immediately copies it via `_map_standard_tools`.
Also correct the `custom_tools` property return type to include
`None`, matching the stored `_custom_tools` field.
This single edit clears the pyright errors for three ignore-list
entries: `tools_schema.py`, `base_llm_adapter.py`, and `mcp_service.py`.
Two services were reading `_settings.model` (typed `str | _NotGiven |
None` because NOT_GIVEN is the default) and coercing it with `or ""`
or similar. `_NotGiven.__bool__` returns False, so the runtime
behavior happened to work, but the type was a lie — pyright saw
`str | _NotGiven` flowing into APIs that required `str` or `str | None`.
- `AIService._sync_model_name_to_metrics`: use `isinstance(model, str)`
narrowing with an empty-string fallback. Equivalent runtime behavior,
honest type, no truthiness dependency on a sentinel.
- `SarvamLLMService.__init__`: validate the model is a real string
before handing it to `_validate_model(str)`. A non-string model at
this point is a configuration bug; raise `ValueError` so the error
is clear and survives `python -O` (unlike an assert).
Three spots had the same shape: a field starts None, a later method
populates it, a read site later reads it. Pyright can't track the
cross-method invariant. Rather than spray assertions at the read
sites, fix each site at the structural level:
- `FastAPIWebsocketInputTransport._monitor_websocket` now takes the
session timeout as an argument. The task-creation site already
guards on truthiness, so the call can pass the non-None value
directly and the method's signature tells the truth.
- `FrameProcessorMetrics.task_manager` raises `RuntimeError` instead
of asserting. Asserts are stripped under `python -O`; a real raise
keeps the runtime safety net and still narrows the type for pyright.
- `SOXRStreamAudioResampler._maybe_initialize_sox_stream` returns the
initialized stream. Callers use the return value and never touch
the Optional `_soxr_stream` attribute, so narrowing stays inside
the init method where the invariant is established.
`ImageGenService.run_image_gen` and `VisionService.run_vision` were
declared `async def ... -> AsyncGenerator[Frame, None]` with `pass`
bodies. Without a `yield` anywhere in the body, Python treats the
function as a coroutine returning an `AsyncGenerator`, not as an async
generator itself, so callers got a coroutine where they expected an
iterator.
Add `raise NotImplementedError; yield` so the body contains a yield
(making this a real async generator) while still raising cleanly if a
subclass ever calls `super().run_*` by mistake.
Deepgram STT, Gradium TTS, Smallest STT, and xAI STT/TTS had exactly
one pyright error each, all of them the AsyncGenerator return-type
mismatch resolved in 08fe9157c. Remove them from the ignore list.
AssemblyAI, Cartesia, Gradium, and Soniox STT services sent audio over
the WebSocket without catching transient send failures, so a single
network hiccup could propagate an exception up through process_frame
and end the pipeline. Other push-based STT services (Deepgram, xAI,
Azure, Smallest, etc.) already guard their sends.
Follow the deepgram/stt.py pattern: log a warning and continue. The
existing connection-state check at the top of each call handles
recovery on the next invocation.
The push-based STT/TTS implementations send audio/text over a socket and
receive results via a separate receive task, so there is nothing to
yield inline. They yield `None` by design. The previous declaration of
`AsyncGenerator[Frame, None]` disagreed with that, while the consumer
(`AIService.process_generator`) already accepted `Frame | None`. Widen
the producer side (abstract base and every subclass) so the type honestly
describes the contract.
Pure annotation change; no runtime behavior difference.
Previously, six modules (adapters, audio, processors, serializers,
services, transports) were ignored wholesale. Many files in those
modules already pass type checking, but we had no way to protect them
from regressions or make the remaining work visible.
Switch the include list to src/pipecat so any new module is checked by
default, and replace directory-level ignores with the 140 specific
files that still fail. This puts 189 previously-untyped files under
type checking immediately and turns the remaining work into a concrete,
shrinking TODO list.
Moves src/pipecat/serializers into pyright's include list. Narrows
self._params to each subclass's InputParams in exotel, vonage, plivo,
twilio, genesys, and telnyx. In protobuf.py, renames the reassigned
frame local to avoid clobbering its Frame type and silences two dynamic
attribute accesses on the generated frames_pb2 module.
Also aligns telnyx and plivo hangup validation with twilio: if
auto_hang_up=True (the default) but required credentials are missing,
__init__ now raises ValueError instead of silently logging a warning
at call-end time. Previously a misconfigured serializer would construct
fine and fail to hang up the call later, leaving a phantom billable
session.
Collapse the separate fallback timer into the existing user_speech_timeout
timer, restarted when a transcript arrives without a VAD stop. stt_timeout
has no meaning on the fallback path, so the stt wait is marked done
immediately. This drops the _fallback_timeout_task / _fallback_expired
bookkeeping and the branched trigger condition.
Adds XAITTSService in the existing xai/tts.py module, alongside the
existing XAIHttpTTSService. Connects to xAI's streaming endpoint at
wss://api.x.ai/v1/tts, streams text.delta chunks up and base64 audio.delta
chunks down on the same connection so audio starts flowing before the full
utterance is synthesized.
Extends InterruptibleTTSService since xAI's protocol is strictly sequential
per connection and exposes neither a cancel verb nor a context ID — the
only way to stop an in-flight utterance is to tear down the WebSocket,
which is exactly what InterruptibleTTSService does on interruption when
the bot is speaking.
Voice, language, codec, and sample_rate are passed as query-string params
at connect time; runtime setting changes reconnect the socket. Defaults to
raw PCM so emitted TTSAudioRawFrame objects need no decoding downstream.
Splits the existing example into voice-xai.py (WebSocket) and
voice-xai-http.py (batch HTTP) so each variant has its own entry point.
Promotes the xai extra to depend on pipecat-ai[websockets-base] since the
new service imports the websockets library.
Remove `examples/` from the `pyrightconfig.json` ignore list and fix
the resulting type errors across all example files. Common fixes:
- Required API keys: `os.getenv("X")` -> `os.environ["X"]` so the
return type is `str` rather than `str | None`, and misconfiguration
fails fast.
- Narrow `LLMContextMessage` union members with `isinstance(..., dict)`
before dict-style access.
- `assert isinstance(params.llm, ...)` before calling service-specific
methods that aren't on the base `LLMService`.
- Guard optional frame fields (e.g. `LLMSearchResponseFrame.search_result`)
before use.
If the WebSocket handshake is cancelled or fails before `keepalive_task`
is assigned (e.g. an STTUpdateSettingsFrame triggers a reconnect during
initial connect), the `finally` block tried to cancel an unbound local.
Initialize `keepalive_task = None` before the try and guard the cancel.
New `XAISTTService` wraps xAI's real-time speech-to-text WebSocket
(`wss://api.x.ai/v1/stt`). It extends `WebsocketSTTService`, authenticates
with the `XAI_API_KEY` as a Bearer token on the WS handshake, and streams
raw audio (PCM/mu-law/A-law) with configurable interim results, endpointing,
language, multichannel, and diarization settings.
- `src/pipecat/services/xai/stt.py`: new service, settings dataclass, and
`language_to_xai_stt_language` helper.
- `src/pipecat/services/stt_latency.py`: `XAI_TTFS_P99` default.
- `pyproject.toml` / `uv.lock`: `xai` extra now pulls in `websockets-base`.
- `README.md`: link to xAI STT in the services table.
- `examples/voice/voice-xai.py`: swap DeepgramSTTService for XAISTTService so
the xAI voice example is fully xAI.
- `examples/transcription/transcription-xai.py`: new transcription-only
example using the new service.
SpeechTimeoutUserTurnStopStrategy previously collapsed two waits into
max(stt_timeout, user_speech_timeout), which over-waited for finalizing
STT services and could also end the turn early in a legacy code path.
Run them as independent timers instead:
- user_speech_timeout: policy floor, always runs to completion.
- stt_timeout: latency safety net, short-circuited by a finalized
transcript since STT has signaled it has nothing more to send.
The no-VAD fallback now waits only user_speech_timeout rather than
max(stt_timeout, user_speech_timeout); stt_timeout is defined relative
to VAD stop and has no meaning when no VAD event occurred. This
shortens the fallback wait for users who set stt_timeout greater than
user_speech_timeout.
* Fix Smallest AI TTS WebSocket endpoint URL to match API documentation
Update base URL from waves-api.smallest.ai to api.smallest.ai and
fix path prefix from /api/v1/ to /waves/v1/ per the v4.0.0 docs.
* Update keepalive using silent space message instead of unsupported flush
Pylance analyzes open files even when they're outside the `include`
set, producing noise in the editor. Adding these paths to `ignore`
suppresses diagnostics without affecting import resolution.
Some TTS providers (e.g. Inworld) return verbatim tokens where spaces and
punctuation are already embedded in the token text. When downstream consumers
join these tokens with an extra space they produce "hello , world" instead of
"hello, world".
Add an opt-in `includes_inter_frame_spaces: bool = False` parameter to
`add_word_timestamps` / `_add_word_timestamps`. The flag is threaded through
`_WordTimestampEntry` and stamped onto every emitted `TTSTextFrame`.
Defaults to `False` — no behaviour change for existing services.
`InworldTTSService` passes `includes_inter_frame_spaces=True` and stops
pre-processing tokens in `_calculate_word_times`, returning them verbatim.
Tests added to `test_tts_frame_ordering.py` covering both HTTP and WebSocket
delivery paths: verbatim text preservation, PTS ordering, text-before-audio
ordering, and the Inworld punctuation-token scenario.
Made-with: Cursor
The two logger.error lines in krisp_instance.py fired at module-load time
whenever anything transitively imported it (e.g. pipecat.turns.user_start
pulling in krisp_viva_ip_user_turn_start_strategy), producing noisy output
for users who never asked for Krisp. Drop the log calls and raise a more
informative ImportError that names the affected classes so direct
importers still get clear guidance.
- Fall back to Language.EN in _primary_detected_language when model is
flux-general-en, preserving prior behavior on the default model.
- Standardize example on DeepgramFluxSTTService.Settings and drop the
now-redundant DeepgramFluxSTTSettings import.
- Narrow the changed-behavior changelog to reflect that flux-general-en
frames still carry Language.EN.
Enables the flux-general-multi model with one or more language_hints.
Hints are sent as repeatable URL params at connect time and via a
Configure control message when updated mid-stream (detect-then-lock).
TranscriptionFrame.language now reflects the language Flux detected
for each turn via the TurnInfo `languages` field.
Add changelog entries for the pyright introduction and the
LiveKitRunnerArguments.token signature tightening. Restore the
indented multi-line format for the WhatsApp missing-env error,
now listing only the vars that are actually missing.
Make required parameters non-optional: LiveKitRunnerArguments.token,
_create_telephony_transport args. Use os.environ[] instead of
os.getenv() for required WhatsApp env vars. Guard spec/loader None
in module loading. Tighten sip_caller_phone guard in daily.py.
* VIVA SDK TT v3 support
* Format fix.
* Renamed the API naming, removed '3' from the name.
* Implementation of User turn start strategy using Krisp VIVA Interruption Prediction in scope of TT v3 support.
* Typo fix in voice-krisp-viva example to use KrispVivaFilter class
* style fix.
* test run error fixes.
* some test related changes.
* Fixed tests
* Stule fixes.
SentryMetrics.stop_ttfb_metrics and stop_processing_metrics called the
base FrameProcessorMetrics implementation but discarded its return
value (implicit `return None`). FrameProcessorMetrics.stop_ttfb_metrics
/ stop_processing_metrics build and return a MetricsFrame, which
FrameProcessor.stop_ttfb_metrics / stop_processing_metrics then pushes
downstream so observers (e.g. UserBotLatencyObserver,
MetricsLogObserver) can see TTFB / processing metrics.
Because SentryMetrics returned None, the FrameProcessor never pushed
the MetricsFrame, so any pipeline using metrics=SentryMetrics() on STT
/ LLM / TTS services silently lost all downstream TTFB and processing
MetricsFrames. The metrics were still calculated and logged
internally, and Sentry transactions still finished correctly, but
observers never saw them.
Forward the MetricsFrame returned by the base class so FrameProcessor
can push it into the pipeline.
Use Sequence[FrameProcessor] instead of list[FrameProcessor] in Pipeline,
ServiceSwitcher, and ServiceSwitcherStrategy parameters to accept subtype
lists. Add cast() in LLMSwitcher for narrowed return types. Guard against
None in task_observer._send_to_proxy and replace hasattr with truthiness
check in task._cleanup.
Widen base strategy process_frame return types to ProcessFrameResult |
None to match actual behavior (None treated as CONTINUE). Give
UserTurnCompletionLLMServiceMixin a FrameProcessor base class so pyright
can see create_task, cancel_task, process_frame, and push_frame.
Tighten LLMMessagesAppendFrame and LLMMessagesUpdateFrame message fields
from list[dict] to list[LLMContextMessage] to match actual usage. Add
type annotations on inline message lists in IVR navigator and voicemail
detector.
In token-streaming mode, _push_tts_frames previously stripped only
leading newlines and dropped any pure-whitespace frame. That silently
discarded meaningful inter-token whitespace (e.g. a standalone "\n"
token between "hello" and "world"), losing prosody cues and any
downstream sentence-boundary semantics.
Track whether a non-whitespace character has been sent in the current
context. While the flag is false, strip all leading whitespace; once
true, let whitespace tokens flow through. Reset the flag on
LLMFullResponseEndFrame/EndFrame and on interruption, and save/restore
it around TTSSpeakFrame since each utterance is its own context.
Sentence-aggregation mode preserves the existing behavior.
Group three co-assigned fields (_start_frame_id, _start_frame_arrival_ns,
_start_wall_clock) into a single _StartFrameInfo dataclass. This makes
the "always set together" invariant structural rather than implicit, and
fixes the incorrect str | None annotation on _start_frame_id (Frame.id
is int).
Add pyrightconfig.json with basic type checking for zero-error modules
(clocks, metrics, transcriptions, frames) and enforce via CI. The
include list will expand as modules are fixed.
* Improve HeyGen LiveAvatar plugin reliability and performance
- Add WebSocket ready gate: wait for session.state_updated connected
event before sending commands (prevents silently dropped messages)
- Add keep-alive mechanism: send session.keep_alive every 2.5 min to
prevent 5-minute inactivity timeout
- Optimize audio chunking: 600ms first chunk for faster initial
response, 1s subsequent chunks for efficient streaming
- Fix audio buffer flush: send remaining buffered audio on utterance
end instead of discarding it
- Fix WS state cleanup: properly reset connected/ready state when
WebSocket drops unexpectedly
- Add livekit_config passthrough in LiveAvatar session token creation
- Replace stray print() with logger.debug()
* Fix HeyGenOutputTransport.start() signature and use 400ms first chunk
- Update transport.py to match new client.start() signature (no
audio_chunk_size param)
- Change first chunk size from 600ms to 400ms per feedback
* Fix transport audio resampling and client.start() error propagation
- Add audio resampling in HeyGenOutputTransport.write_audio_frame() to
ensure audio is always 24kHz before sending to HeyGen (was sending
at pipeline sample rate, causing garbled audio)
- Raise exception on WS ready timeout instead of silently returning,
preventing transport from appearing ready when WS connection failed
* Fix session readiness gate to work with LITE mode
LITE mode does not send session.state_updated WS events. Instead,
use a dual-signal _session_ready event that fires on either:
- WS session.state_updated connected (FULL mode)
- LiveKit participant connected (LITE mode)
Also reorder start() to connect both WS and LiveKit before waiting,
since the WS events may depend on LiveKit being connected.
Verified with live sandbox session - all tests pass.
* Simplify session readiness to use only WS ready gate
Remove _session_ready dual-signal and use only _ws_ready, which fires
on the session.state_updated connected WS event. Increase timeout to
30s. LiveKit is connected before waiting so the WS event can arrive.
* Reduce WS ready gate timeout back to 10s
* Remove WS ready gate (session.state_updated not reliably received)
The session.state_updated connected event is not reliably received
via the websockets library. Remove the gate for now and assume the
session is ready after WS + LiveKit connect. Keep-alive, chunking,
buffer flush, state cleanup, and other improvements remain.
Mirrors the existing `from_string` classmethod and lets callers
turn a frame's `buttons` list back into a dial string like `"123#"`.
`__str__` and the Daily transport's native DTMF path reuse it.
The single-key `button` field on `OutputDTMFFrame` and
`OutputDTMFUrgentFrame` is kept as a first-class ergonomic shortcut
for the common single-keypress case, equivalent to
`buttons=[button]`. `buttons` takes precedence when both are set.
Replaces the string-based `tones` field with a type-safe
`buttons: list[KeypadEntry]` on `OutputDTMFFrame` and
`OutputDTMFUrgentFrame`, matching the existing singular `button`
field on `InputDTMFFrame`. A `from_string` classmethod builds the
list from a dial string like `"123#"` (invalid characters raise
ValueError from the `KeypadEntry` constructor).
The base output audio fallback now iterates `frame.buttons`
directly, LiveKit sends `frame.buttons[0].value`, and the Daily
transport joins the button values into the single string Daily's
`send_dtmf` expects.
Introduces a new `tones` field on `OutputDTMFFrame` and
`OutputDTMFUrgentFrame` for sending multi-digit DTMF sequences and
deprecates the existing single-key `button` field. When only `button`
is set, it is used as a single-character `tones` string for backward
compatibility.
`DTMFFrame` is kept as an empty marker class so both input and output
DTMF frames can still be identified via isinstance. `InputDTMFFrame`
keeps its required `button` field (single keypress semantics).
The Daily-specific `DailyOutputDTMFFrame` and
`DailyOutputDTMFUrgentFrame` frames no longer need to override
`button` and simply add `session_id` and `digit_duration_ms`, which
are forwarded to Daily's `send_dtmf` as `sessionId` and
`digitDurationMs`.
The base output audio fallback now iterates `tones` and generates a
tone per character; LiveKit's native DTMF path sends `tones[0]` since
its API is single-tone.
Introduces Daily-specific DTMF output frames that carry explicit
`tones`, `session_id` and `digit_duration_ms` fields, forwarded to
Daily's `send_dtmf` as `tones`, `sessionId` and `digitDurationMs`.
The inherited `button` and `transport_destination` fields are
ignored for these frames in the Daily transport.
When the LLM returned zero text tokens (e.g. it was interrupted before producing
tokens or about to push tokens), push_aggregation() returned an empty string and
on_assistant_turn_stopped was never emitted. This left consumers waiting for an
event that would never arrive.
Now on_assistant_turn_stopped always fires, with an empty content string when
the LLM produced no text tokens.
Fixes#4292
Only treat messages[0] as the initial system prompt when determining the
summarization range. Previously, the code scanned the entire context for
the first system-role message, which caused failures when the only system
message was a mid-conversation injection (e.g. "The user has been quiet").
In that case summary_start exceeded summary_end, producing an empty range
and "No messages to summarize" errors.
Fixes#4286
The enable_logging and enable_ssml_parsing URL params used truthy checks,
so False was treated the same as None (both skipped). Also, Python's
str(False) produces "False" but the API expects lowercase "false".
Additionally, add enable_logging support to ElevenLabsHttpTTSService
which was missing entirely.
When the STT p99 timeout fires without a transcript, the turn stop
strategy previously did nothing — falling through to the 5-second
user_turn_stop_timeout. Now, a _timeout_expired flag tracks when the
timeout has elapsed so that a late transcript triggers the turn stop
immediately instead of waiting for the fallback.
Previously settings updates were ignored with a TODO comment. Now when
model/language changes via STTUpdateSettingsFrame the service disconnects
and reconnects with the new query parameters.
Key changes:
- Implement _update_settings to disconnect/reconnect on changes
- Check `is not State.OPEN` in run_stt to catch CLOSING state
- Send `done` command before closing for clean session shutdown
- Capture websocket reference in _disconnect_websocket to prevent a
concurrent _connect from having its new connection nulled by a stale
finally block
The strategy schedules background tasks during setup. Fast-running
tests could observe state before those tasks had a chance to run;
yielding once via asyncio.sleep(0) ensures they do.
Enable callers to get a compact version of context messages suitable
for serialization, logging, and debugging tools. For standard
messages, known binary data (base64 images, audio) is fully elided.
For LLM-specific messages, long string values are recursively
truncated. Adapter get_messages_for_logging() methods now use this.
Example files can live under subdirectories (e.g. foundational/01.py),
so the recording path needs its parent directory created before the
audio file is written.
Replaces the per-frame asyncio.Event signaling with a monotonic
timestamp updated on each audio frame. The handler sleeps until the
next deadline (last_audio_time + timeout), recomputing on each wake-up
to account for audio arriving during sleep.
This avoids waking the handler on every audio frame (~50/s at 20ms
chunks), and guarantees detection latency is bounded by timeout rather
than 2 * timeout.
Also renames audio_starvation_timeout to audio_idle_timeout and
associated identifiers for consistency with existing pipecat naming
(user_idle_timeout, etc.).
These are TypedDicts (plain dicts at runtime), so no behavioral change
— just more descriptive type hints for readers. Use ToolParam instead
of FunctionToolParam for the Responses adapter to reflect that custom
non-function tools are supported. Use ChatCompletionToolParam instead
of Any for the completions adapter return type. Update tests to use
typed params in expected values.
During pipeline shutdown, proxy tasks must be cancelled before observer
resources are cleaned up. Previously, stop() was called inside
_cancel_tasks() and start() was called in _start_tasks(), which could
lead to proxy tasks still consuming frames after observer resources
were closed.
Now the lifecycle is explicit in _handle_start_frame: start() after all
observers are loaded, and stop() before cleanup() on shutdown.
Also fixes misleading variable name in TaskObserver.cleanup() where
iterating self._proxies yields observer keys, not Proxy values.
Fixes#4195
Move event.clear() from finally block to success path in
IdleFrameProcessor and UserIdleProcessor._idle_task_handler().
The finally block unconditionally cleared signals set during
async timeout callbacks, causing false-positive idle detection.
Closes#3402
* Add Inworld Realtime LLM service
Adds a WebSocket-based realtime service for Inworld's cascade
STT/LLM/TTS API with semantic VAD, function calling, and streaming
transcription support.
New files:
- src/pipecat/services/inworld/realtime/ (service, events)
- src/pipecat/adapters/services/inworld_realtime_adapter.py
- examples/foundational/19zb-inworld-realtime.py
Also includes:
- websockets dependency for inworld extra in pyproject.toml
- Adapter and settings tests matching OpenAI/Grok realtime patterns
- Fix for double-response when server-side VAD is enabled
* Prefer init-provided system instruction in Inworld Realtime
Adopt _resolve_system_instruction() from BaseLLMAdapter, matching the
pattern applied to OpenAI Realtime, Grok Realtime, Gemini Live, and
Nova Sonic in the pk/realtime-services-init-v-context-system-instructions-cleanup
branch.
* Update changelog entry with PR number
* Fix changelog format to use bullet point
* Polish PR: default model, example cleanup, changelog update
- Change default model from gpt-4.1-nano to gpt-4.1-mini
- Add function calling demo to example
- Remove demo-testing artifact from system instruction
- Mention Router support in changelog
* Address PR review feedback for Inworld Realtime
- Move example to examples/realtime/realtime-inworld.py
- Change initial context role from "user" to "developer"
- Remove explicit sample rates from example; sync them in
_ensure_audio_config so Inworld gets the transport's actual rates
- Add audio race condition guard in _handle_evt_audio_delta (matches
OpenAI realtime pattern)
- Convert remaining "system"/"developer" messages to "user" in adapter
- Add clarifying comment for local-VAD vs server-VAD metrics paths
* Simplify example, add provider tracking, remove local VAD path
- Remove function calling from example, switch model to xai/grok-4-1-fast-non-reasoning
- Add pipecat-realtime session key prefix and provider_data metadata
for Inworld traffic attribution
- Remove local VAD code path (Inworld only supports server-side VAD)
- Use typed InputAudioBufferAppendEvent for audio sends
* Default TTS model to inworld-tts-1.5-max
* Remove dead shimmed tools code, set STT/VAD defaults
- Remove non-functional AdapterType.SHIM custom tools code from adapter
- Default STT model to assemblyai/u3-rt-pro
- Default VAD eagerness to low
Integrate with Mistral's Voxtral TTS API (voxtral-mini-tts-2603) using
HTTP streaming with Server-Sent Events. Converts base64-encoded float32
PCM chunks from the API to int16 for the Pipecat pipeline.
After a reconnect, _ready_for_realtime_input was never set back to True
because _create_initial_response (which sets the flag) is only called on
initial connection. This caused all audio/video/text to be silently
dropped after reconnecting, making the bot appear to hang.
Set the flag in _handle_session_ready when we detect a reconnect, either
via session_resumption_handle (server restores state) or via existing
context (rare case where connection drops before first resumption handle).
After a reconnect, _ready_for_realtime_input was never set back to True
because _create_initial_response (which sets the flag) is only called on
initial connection. This caused all audio/video/text to be silently
dropped after reconnecting, making the bot appear to hang.
Set the flag in _handle_session_ready when context already exists
(i.e. reconnect case) since we don't need to go through
_create_initial_response again.
Remove the deprecation proxy infrastructure that allowed old-style flat
imports (e.g. `from pipecat.services.openai import OpenAILLMService`).
Users must now import from specific submodules
(`from pipecat.services.openai.llm import OpenAILLMService`), which is
already the established pattern across all internal code and 179+ examples.
- Strip 32 proxy `__init__.py` files to empty
- Strip 3 non-proxy files with bare star imports (minimax, sambanova, sarvam)
- Strip google/gemini_live `__init__.py` re-exports
- Remove DeprecatedModuleProxy class and helpers from services/__init__.py
- Remove ruff per-file ignore for services/__init__.py
- Fix 2 examples using old-style imports
Patch Pydantic's DICT_TYPES check in conf.py to accept Union-wrapped
dict types, fixing the autodoc import failure for models using
ConfigDict(extra="allow").
Make -W (warnings as errors) opt-in via --strict flag instead of
default, and update README to reflect uv-based workflow and current
directory structure.
Add tests for LLMRunFrame, LLMMessagesAppendFrame, LLMMessagesUpdateFrame,
and LLMMessagesTransformFrame sent upstream to LLMAssistantAggregator,
mirroring the existing LLMUserAggregator downstream tests. Add
frames_to_send_direction param to run_test helper to support this.
The previous approach required the caller to directly grab a reference to the context object, grab a "snapshot" of its messages *at that point in time*, transform the messages, and then push an `LLMMessagesUpdateFrame` with the transformed messages. This approach can lead to problems: what if there had already been a change to the context queued in the pipeline? The transformed messages would simply overwrite it without consideration.
Napoleon's Attributes section creates class-level attribute docs that
duplicate the __init__ parameter docs when napoleon_include_init_with_doc
is enabled. Using Parameters avoids the duplication.
- Remove expect_stripped_words from LLMAssistantAggregatorParams and related warnings
- Remove old multi-parameter on_push_frame observer signature support in TaskObserver
- Remove deprecated context field from UserImageRequestFrame
- Remove deprecated LiveKitTransportMessageFrame and LiveKitTransportMessageUrgentFrame
- Remove deprecated pipecat.turns.mute shim module
Replace Markdown code blocks with RST syntax in genesys.py, fix
deprecated directive transitions in nvidia and summarization modules,
remove stray bullet prefix in whisper arg docs, restructure code block
in turn completion mixin, and add deepgram mock to Sphinx conf.
Remove stale riva mock imports from autodoc_mock_imports since the riva
service was removed and nvidia-riva-client is installed during doc builds.
Add pipecat.turns and pipecat.extensions to import_core_modules() and
add Turns to the index.rst toctree. Regenerate uv.lock to reflect the
riva extra removal from pyproject.toml.
Move the FastAPI instance to module level so other packages can import
it and register routes before main() is called. main() now configures
the existing app with transport-specific routes instead of creating a
new one.
Deepgram's built-in VAD events were deprecated in 0.0.99 in favor of
Silero VAD. This removes vad_events from settings and LiveOptions,
the should_interrupt parameter, the vad_enabled property,
_on_speech_started/_on_utterance_end handlers, and simplifies
_on_message and process_frame accordingly.
Remove the send_transcription_frames parameter from OpenAI Realtime LLM
(deprecated since 0.0.92). Also fix undefined _warn_deprecated_param
calls in both OpenAI and xAI realtime services, replacing them with the
existing _warn_init_param_moved_to_settings method.
UserBotLatencyLogObserver (deprecated 0.0.102) is replaced by
UserBotLatencyObserver. UserIdleProcessor (deprecated 0.0.100) is
replaced by LLMUserAggregator with user_idle_timeout.
The _self_queued_frames set and _internal_queue_frame wrapper were used
to prevent re-processing SpeechControlParamsFrame that the aggregator
queued to itself. Now that the frame is no longer special-cased, this
tracking is unnecessary. Also removes unused FrameCallback import.
Adds `enable_prompt_caching` setting to `AWSBedrockLLMSettings`. When
enabled, appends `cachePoint` markers to system prompts and tool
definitions in ConverseStream requests.
This can reduce TTFT by up to 85% for multi-turn conversations where
the system prompt stays constant (e.g. voice agents, chat assistants).
Follows the same pattern as `AnthropicLLMService.enable_prompt_caching`.
Usage:
```python
llm = AWSBedrockLLMService(
settings=AWSBedrockLLMSettings(
model="au.anthropic.claude-haiku-4-5-20251001-v1:0",
enable_prompt_caching=True,
),
)
```
See: https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
Remove EmulateUserStartedSpeakingFrame, EmulateUserStoppedSpeakingFrame
(deprecated since v0.0.99), and the emulated field from
UserStartedSpeakingFrame and UserStoppedSpeakingFrame. Clean up the
handling code in base_input.py and a stale comment in nova_sonic/llm.py.
The interruption_strategies mechanism was deprecated in v0.0.99 in favor
of LLMUserAggregator's user_turn_strategies. All evaluation logic was
already removed — this removes the remaining field definitions, property,
StartFrame propagation, conditional check in base_input.py, strategy
files, and test.
This field was deprecated in v0.0.99 in favor of LLMUserAggregator's
user_turn_strategies / user_mute_strategies parameters. Since the default
was True (interruptions allowed), removing the guards keeps the current
default behavior.
The location and project_id fields were deprecated since 0.0.90 in
favor of direct __init__ parameters. Now that InputParams is removed,
project_id is required and location defaults to "us-east4" directly
in the signature.
Override _update_settings in CartesiaTTSService to flush the current
audio context and assign a new turn context ID when voice, model, or
language settings change. This prevents Context has closed errors
from Cartesia API, which locks these parameters per context.
Remove the deprecated text_aggregator parameter from TTSService,
CartesiaTTSService, and RimeTTSService, and the deprecated text_filter
parameter from TTSService. Users should use LLMTextProcessor before
the TTS service instead. Update the voice-switching example to use
LLMTextProcessor with PatternPairAggregator.
Change the default from 10s to None so deferred function calls can run
indefinitely when no timeout is configured. Only create the timeout
task when a timeout is actually provided (per-call or service-level).
Add SmallestSTTService using the Pulse WebSocket API for real-time
transcription. Includes SmallestSTTSettings dataclass, 32-language
support with resolve_language fallback, VAD-driven finalize signal,
and SMALLEST_TTFS_P99 latency constant.
Also adds X-Source and X-Pipecat-Version headers to Smallest STT
and TTS WebSocket connections.
Example files like openai.py shadow installed packages when Python adds the
script directory to sys.path. Prepend the parent folder name to each example
file (e.g. openai.py -> function-calling-openai.py). Also split
thinking-and-mcp/ into separate mcp/ and thinking/ directories.
Now that LLMContextFrame is the only frame that provides a context,
remove the intermediate `context = None` / `if context:` pattern
and handle context processing directly in the isinstance branch.
Replace the nested services/speech/ and services/function-calling/ with
top-level voice/ and function-calling/ directories. Update eval script
paths and README to match.
Move 304 examples from a flat numbered directory into 14 descriptive
subfolders: getting-started, services (speech + function-calling),
transcription, vision, realtime, persistent-context,
context-summarization, update-settings (stt/tts/llm), turn-management,
thinking-and-mcp, transports, video-avatar, video-processing, and
features.
Strip numbered prefixes from filenames (e.g. 07c-interruptible-deepgram.py
becomes services/speech/deepgram.py) since the folder context makes them
redundant. Keep numbered prefixes only in getting-started/ where ordering
matters.
Update eval script paths and README to match the new structure.
Add WebsocketLLMService as a base class for WebSocket-based LLM services,
parallel to WebsocketTTSService/WebsocketSTTService but codifying a
transactional request-response model rather than a continuous background
receive loop.
WebsocketLLMService provides:
- Connection lifecycle (start/stop/cancel → connect/disconnect)
- _ws_send/_ws_recv with transparent ConnectionClosed handling
(auto-reconnect via exponential backoff → WebsocketReconnectedError)
- _ensure_connected with retry via _try_reconnect
OpenAIResponsesLLMService now inherits from WebsocketLLMService, removing
duplicated connection management code (_connect, _disconnect, _reconnect,
_ensure_connected, _ws_send, start, stop, cancel) and simplifying
_process_context from a loop with attempt tracking to a flat try/except
with a single retry.
When a user interruption causes the LLM chunk stream to exit early,
function call arguments may be incomplete JSON. Wrap json.loads() in
try/except JSONDecodeError to skip malformed function calls with a
warning instead of crashing. Fixes#2461.
Buffer raw bytes and only decode after splitting on newline boundaries,
preventing multi-byte UTF-8 characters from being split at chunk edges.
Fixes#3538
A single service failing to reconnect should not kill the entire
pipeline. Non-fatal errors flow through the pipeline so application
code (e.g. ServiceSwitcher) can handle failover to a backup service.
When a WebSocket server accepts the handshake but immediately closes the
connection (e.g. invalid API key returning close code 1008), the existing
exponential backoff does not help because the handshake keeps succeeding.
This tracks how long each connection survives and emits a non-fatal
ErrorFrame after 3 consecutive sub-5s failures, allowing ServiceSwitcher
failover instead of killing the pipeline.
Fixes#3711
- Use finally block in _disconnect to ensure state is always cleaned
up, even if websocket.close() throws — prevents stale cancellation
state (e.g. _cancel_pending_response) from polluting a new connection
- Catch ConnectionClosed in _drain_cancelled_response alongside
TimeoutError — prevents _needs_drain from staying True and bricking
the service on every subsequent inference attempt
- Fall back to OPENAI_API_KEY env var when api_key is not passed,
since the WebSocket connection uses raw websockets (not the
AsyncOpenAI client which handles this automatically)
- Use _clear_cancellation_state() instead of piecemeal resets where
appropriate
Instead of trying to filter stale events inline (unreliable — the API
doesn't provide a way to correlate events to a specific response),
drain remaining events from a cancelled response before starting the
next one. On cancellation, send response.cancel and set a drain flag.
At the start of the next _process_context, read and discard events
until a terminal event arrives, ensuring a clean connection. Falls
back to reconnecting if draining times out.
Over HTTP, previous_response_id requires store=True (30-day OpenAI-side
conversation storage). The WebSocket variant avoids this via a
connection-local in-memory cache that works with store=False. Add
comments explaining this in both class docstrings, at the store=False
parameter, and in the adapter's previous_response_id note.
Add detailed trace-level logging to _apply_previous_response_optimization
showing why the optimization was applied or fell back to full context,
including the relevant data for debugging.
Use append_to_context=False for the filler TTSSpeakFrame in the
function-calling example to avoid altering the conversation history
and breaking the previous_response_id prefix match.
When using previous_response_id, the server already knows its own
output from the previous response. Store the raw response output and,
on the next call, compare it against the items following the matched
input prefix — checking role and text content for messages, and call_id
for function calls. If the items match, skip them and send only truly
new input (user messages, tool results). Falls back to full context if
either the prefix or the output comparison fails.
Introduce a WebSocket variant of the OpenAI Responses API service that
maintains a persistent connection to wss://api.openai.com/v1/responses
for lower-latency inference. The WebSocket variant automatically uses
previous_response_id to send only incremental context when possible,
falling back to full context on reconnection or cache miss.
The WebSocket variant becomes the new default OpenAIResponsesLLMService,
and the HTTP variant is renamed to OpenAIResponsesHttpLLMService. Both
share a private base class with common settings, parameter building,
and run_inference (always HTTP) logic.
Update langchain 0.3→1.2, langchain-community 0.3→0.4, and
langchain-openai 0.3→1.1. This also unblocks openai>=2.26 which
was previously constrained by the now-removed openpipe package.
OpenPipe was acquired by CoreWeave in September 2025. The Python package
hasn't been updated since June 2025 and the repo since 2024. The openpipe
package caps openai<=1.97.1, creating dependency conflicts with other
extras. Remove the dead integration to clean up the codebase.
- Add Nebius LLM service wrapping OpenAI-compatible Token Factory API
- Set supports_developer_role = False (Nebius rejects developer role)
- Default to openai/gpt-oss-120b model (supports function calling)
- Add Nebius function-calling example and env.example entry
- Fix Sarvam developer role support
- Update examples to use developer role for intro messages
Adds an OpenAI-compatible LLM service for Nebius Token Factory, supporting
open-source models (Meta Llama, Qwen, DeepSeek) via their OpenAI-compatible
REST API at https://api.tokenfactory.nebius.com/v1/.
When the remote side disconnects while send() is in flight, send() was
setting _closing=True. This prevented the receive loop from firing
on_client_disconnected, causing the pipeline to hang waiting for a
disconnect signal that never came.
The fix removes _closing from send() (that flag means we initiated the
close) and instead checks Starlette application_state in _can_send()
to suppress subsequent sends after a failure.
Fixes#3912
Add `await asyncio.sleep(0)` after `create_task()` calls in
UserIdleController, SpeechTimeoutUserTurnStopStrategy,
TurnAnalyzerUserTurnStopStrategy, and UserTurnCompletionLLMServiceMixin
so the event loop schedules the newly created timer tasks before the
caller continues.
Gemini 3.1 Flash Live won't reliably report ending its turn until
after it says something following a tool call. Restructure the system
instruction so the model says goodbye *after* calling
end_conversation, and add a comment explaining the deferred EndFrame
behavior that makes this work.
The base serializer filters out RTVI protocol messages by default
(ignore_rtvi_messages=True) to prevent them from being sent over
telephony media streams. ProtobufFrameSerializer is used by WebSocket
transports, which are the delivery channel for these messages, so
disable the filter there.
All recent Gemini Live models (including the default
gemini-2.5-flash-native-audio-preview-12-2025, and going at least as
far back as gemini-2.5-flash-native-audio-preview-09-2025) only
support AUDIO as a response modality. We considered using
`modalities=TEXT` as a Pipecat-level signal to suppress audio output
frames (so developers could pair Gemini Live with an external TTS),
but the output transcription from the API arrives too late relative
to the audio to be useful for driving an external TTS service.
For now, just log a warning when a TEXT modality is configured
(at init or via set_model_modalities) and proceed as normal. The 26d
text-modality example is removed since it no longer represents a
viable configuration.
The heartbeat monitor timeout (`HEARTBEAT_MONITOR_SECS`) was a static
module-level constant that never derived from the user-configurable
`heartbeats_period_secs`. This meant overriding the heartbeat interval
had no effect on the monitor window, causing spurious warnings or
delayed detection depending on the configured interval.
Add a new `heartbeats_monitor_secs` parameter to `PipelineParams` so
the monitor timeout is independently configurable (defaults to 10s).
The monitor handler now reads from the instance param instead of the
hard-coded constant.
Made-with: Cursor
This fixes the 26c example when using Gemini 3.1 Flash Live, which seems to be more strict about not receiving real-time input (at least, video messages) before conversation history.
Rime's WebSocket API sends a done message when synthesis completes.
Handle it to stop TTFB metrics, push TTSStoppedFrame, and remove the
audio context immediately instead of relying on the 3-second
stop_frame_timeout_s fallback.
DailyCallbacks gained a required on_dtmf_event field in PR #4047.
PR #4079 fixed this for TavusTransportClient but
LemonSliceTransportClient.setup() was not updated, causing a pydantic
ValidationError at pipeline setup time.
Only trigger handle_error for ErrorFrames originating from the active
service, not any managed service. This prevents edge cases where errors
from a non-active service could incorrectly trigger failover.
Expose a public method for retrieving all stored memories outside the
pipeline, avoiding the need for callers to reimplement client branching,
OR filter construction, and asyncio.to_thread wrapping. Simplify the
example get_initial_greeting() to use it.
Move blocking Mem0 API calls off the event loop using asyncio.to_thread().
Store messages as a fire-and-forget background task via create_task() since
the result is not needed. Insert memory messages at the configured position
in the context instead of always appending.
Closes#1741
Deepgram SDK 6.x surfaces connection errors from send_media() instead
of silently swallowing them. This causes error floods when the WebSocket
disconnects since every queued audio frame hits the dead connection.
Wrap send_media() in try/except: on failure, log one warning and set
self._connection = None so subsequent frames skip until the existing
_connection_handler reconnects.
In some cases the openai provider could answer with a `chunk.choices[0].delta.audio = None`, so the process context fails with error:
```
pipecat/services/openai/base_llm.py:552): Error during completion: 'NoneType' object has no attribute 'get'
```
When an InterruptionFrame arrives, the Python-side audio task is
cancelled but frames already submitted to rtc.AudioSource continue
playing from its internal buffer. This causes the bot to keep speaking
for several seconds after being interrupted.
Fix by overriding process_frame in LiveKitOutputTransport to call
audio_source.clear_queue() on InterruptionFrame, immediately flushing
the buffered audio.
ErrorFrames propagating upstream from downstream processors (e.g. TTS) would
enter the ServiceSwitcher via process_frame, traverse the active service sub-pipeline,
and reach push_frame where they incorrectly triggered failover. Now only errors whose
processor is one of the managed services trigger handle_error. Also fix the log in
handle_error to attribute errors to the actual source processor rather than the
current active_service.
Closes#4139
Gemini 3.x can bundle multiple fields (e.g. model_turn and
output_transcription) on the same server_content message. The previous
elif chain would only process the first matching field and silently
drop the rest. Switch to independent if checks so every field is
handled.
When Gemini Live was configured with local VAD (server-side VAD disabled),
the service was listening for the wrong frame types and not sending
ActivityStart/ActivityEnd events to the server. Now it listens for
VADUserStartedSpeakingFrame/VADUserStoppedSpeakingFrame and sends the
appropriate activity signals when local VAD is in use.
Also removes the unnecessary local SileroVADAnalyzer from server-side VAD
examples and adds a new 26a example demonstrating local VAD configuration.
Set GRPC_VERBOSITY=ERROR by default so users do not see noisy fork
handler and abseil warnings from the gRPC C library. Users can still
override by setting GRPC_VERBOSITY themselves.
nvidia-riva-client 2.25.1 ships with gencode compiled against protobuf
6.31.1, which requires a runtime >= 6.31.1. Update protobuf from 5.29.6
to >=6.31.1,<7 and grpcio-tools from 1.67.1 to 1.78.0 to match.
Regenerate frames_pb2.py with the new compiler.
Add DeepgramFluxSageMakerSTTService that combines SageMaker's HTTP/2
transport with Flux's JSON turn detection protocol (StartOfTurn,
EndOfTurn, EagerEndOfTurn, TurnResumed). Includes mid-stream Configure
support, silence watchdog, and an example bot.
Both GrokLLMService and XAIHttpTTSService use the same xAI API (api.x.ai),
so move Grok source files into the xai module. Leave deprecation shims in
the old grok/ paths for backward compatibility.
- Rename XAITTSService → XAIHttpTTSService and XAITTSSettings → XAIHttpTTSSettings
- Add language_to_xai_language() with explicit LANGUAGE_MAP using resolve_language()
- Remove deprecated InputParams, params, voice, language init params
- Remove XAI_DEFAULT_SAMPLE_RATE and XAI_PCM_CODEC constants; add encoding param
- Set sample_rate=None default (picked up from PipelineParams or user)
- Use Language.EN enum instead of string "en" for default language
- Add changelog/4031.added.md
- Add 07e-interruptible-xai.py foundational example
- Update 14g-function-calling-grok.py to use XAIHttpTTSService
- Register 07e in run-release-evals.py
Test that OpenAI Realtime, Grok Realtime, and Nova Sonic adapters
prefer init-provided system_instruction over context-provided, warn
on conflicts, and don't warn for developer messages.
Add system_instruction parameter to the Grok Realtime adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Update the Grok Realtime example to use settings.system_instruction
instead of session_properties.instructions.
Add system_instruction parameter to the OpenAI Realtime adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Add system_instruction parameter to the Nova Sonic adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Remove the service-side fallback logic, as the adapter now handles
resolution.
Pass self._system_instruction_from_init to the adapter's
get_llm_invocation_params(), which calls _resolve_system_instruction()
to prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Also fix the reconnect check to only reconnect when the resolved
system instruction actually differs from what the initial connection
used, avoiding unnecessary reconnects.
Update documented event signatures to include transcript argument
where the code actually passes it. Remove stale on_speech_started
and on_utterance_end entries that were never registered.
Fires after the final transcript is pushed in both Pipecat and
AssemblyAI turn detection modes, giving users a reliable hook
that arrives after all transcript frames. Matches the existing
Deepgram Flux on_end_of_turn pattern.
The previous default (meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo) is
no longer available as a serverless Together.ai model and now requires a
custom deployment. The new default is openai/gpt-oss-20b, one of
Together's recommended models for small & fast use-cases.
OpenAI-compatible services that don't support the "developer" message
role can now set supports_developer_role = False on the service class.
BaseOpenAILLMService passes this as convert_developer_to_user to the
adapter, which converts developer messages to user messages before
sending them to the API. Applied to Cerebras and Perplexity.
Also removes the now-redundant developer→user conversion step from
PerplexityLLMAdapter (handled by the parent adapter via the flag).
_system_instruction_from_init was being set from the deprecated
`system_instruction` constructor parameter instead of
`self._settings.system_instruction`, so system instructions provided
via settings were silently ignored.
OpenAI Realtime, Grok Realtime, and AWS Nova Sonic adapters now convert
"developer" role messages to "user" (consistent with all other non-OpenAI
adapters). Previously these messages were silently dropped. Adds starter
unit tests for all three realtime adapters.
These messages are developer instructions to the assistant (e.g. "Please
introduce yourself to the user"), not simulated user input. The
"developer" role is semantically correct for this purpose.
Developer messages are now always converted to "user" in non-OpenAI
adapters, never promoted to the system instruction. This removes an
inconsistency where adding an unrelated message to context would change
whether a developer message got promoted.
Simplifications:
- Rename _extract_initial_system_or_developer → _extract_initial_system
- Return Optional[str] instead of Tuple (role is always "system")
- Drop initial_context_message_role from _resolve_system_instruction
- Drop system_role fields from all ConvertedMessages dataclasses
When the only message in context was a system message,
_extract_initial_system_or_developer would convert it to "user" (to
prevent empty history) without warning about the conflict with
system_instruction. Now warns inline before converting, with a message
explaining both the conflict and the user-role conversion.
Two goals:
1. Centralize system_instruction vs context system message resolution into
the LLM adapters. This eliminates duplication between in-pipeline and
out-of-band (run_inference) code paths across ~16 locations in service
llm.py files.
2. Add support for "developer" role messages in conversation context, which
is facilitated by the above centralization.
Shared helpers on BaseLLMAdapter:
- _extract_initial_system_or_developer: extracts/converts messages[0]
based on role and whether system_instruction is provided
- _resolve_system_instruction: warns on conflicts between system_instruction
and context system messages, returns the effective instruction
Developer message handling (new):
- Non-OpenAI adapters: an initial "developer" message is promoted to the
system instruction when no system_instruction is provided; otherwise it
is converted to "user". Subsequent "developer" messages are always
converted to "user". No conflict warning is emitted for developer
messages (unlike "system" messages).
- OpenAI adapter: "developer" messages pass through in conversation
history without triggering conflict warnings.
- OpenAI Responses adapter: "developer" messages are kept as "developer"
role (same as "system", which is also converted to "developer" for the
Responses API).
Other behavior changes:
- Gemini: "initial" system message detection now checks messages[0] only
(previously searched anywhere in the list)
- Bedrock: a lone system message is now converted to "user" instead of
being extracted to an empty message list (matches existing Anthropic
behavior)
Route LLMFullResponseEndFrame through the serialization queue instead
of pushing it directly downstream when push_text_frames is enabled.
This ensures the frame is emitted only after the audio context is
fully drained, preserving correct ordering relative to TTSTextFrames.
Previously, the final sentence TTSTextFrame would arrive at the
LLMAssistantAggregator after LLMFullResponseEndFrame, causing it to
be dropped from the conversation context (especially with RTVI text
input where no subsequent interruption would flush the orphaned text).
Cancel the deferral timeout task and clear the pending EndFrame during
disconnect, which could otherwise be left dangling after a
CancelFrame-triggered shutdown.
When an interruption arrives before any LLM text reaches run_tts, the
turn context ID exists but was never registered via create_audio_context.
Calling flush_audio for this unregistered context sends a message to the
provider (e.g. ElevenLabs) with a context_id it has never seen, which
implicitly creates a server-side context that is never closed. After
enough rapid interruptions these phantom contexts accumulate and exceed
the providers limit (ElevenLabs: 5 simultaneous contexts, 1008 policy
violation).
Guard the flush call with audio_context_available so it only fires when
the context was actually opened.
Fixes#4114
When an EndFrame arrives while the bot is mid-response, it is deferred
until turn_complete is received. If turn_complete never arrives, the
EndFrame gets stuck forever and the pipeline hangs indefinitely.
Add a 30-second timeout: if turn_complete hasn't arrived by then, the
deferred EndFrame is released anyway with a warning log. The timeout
is cancelled if turn_complete arrives normally.
We observed a case where a deferred EndFrame was never released in
Gemini Live, causing the pipeline to hang indefinitely. The EndFrame
deferral mechanism waits for _handle_msg_turn_complete to set
_bot_is_responding back to False, but turn_complete messages were only
processed if they also contained usage_metadata. If Gemini ever sent
turn_complete without usage_metadata, the message would be silently
dropped and the deferred EndFrame would never be released.
Now turn_complete is always handled regardless of usage_metadata
presence, with usage_metadata processing only when available.
Note: we have not actually observed a turn_complete without
usage_metadata in practice, so this is a theoretical fix for the
EndFrame-deferral hang. The actual root cause of the observed hang
may lie elsewhere.
- Route audio through audio contexts (append_to_audio_context) instead of
pushing frames directly, enabling proper turn management and interruptions
- Add push_stop_frames and push_start_frame so the base class handles
TTSStartedFrame/TTSStoppedFrame lifecycle
- Remove manual context_id tracking (self._context_id) in favor of
get_active_audio_context_id()
- Don't call remove_audio_context on "complete" — Smallest sends one
per request, not per turn; let the base class timeout handle cleanup
- Guard v2-only params (consistency, similarity, enhancement) so they
aren't sent to lightning-v3.1
- Remove request_id from request payload (not a documented request field)
- Add flush_audio override to send flush to WebSocket
Adds SmallestTTSService, a WebSocket-based TTS service using Smallest AI's
Lightning v3.1 model. Follows current Pipecat service conventions:
- SmallestTTSSettings dataclass with runtime-updatable settings (voice,
language, speed, etc.)
- Reconnects on model change; keepalive every 30s to prevent idle timeout
- TTS settings default to None so the API applies its own defaults
- Model enum: SmallestTTSModel.LIGHTNING_V3_1
Includes a foundational example (07zl-interruptible-smallest.py) using
Deepgram STT + Smallest TTS + OpenAI LLM.
STT integration will follow in a separate PR once the hallucination/finalize
behaviour is resolved.
Made-with: Cursor
Gets Gemini 3 support to the point where it works with:
- The "legacy" pattern from the previous (removed) 26- example
- inference_on_context_initialization=True (the default)
- inference_on_context_initialization=False
Add `domain` field to AssemblyAISTTSettings to support AssemblyAI's
streaming API `domain` query parameter, enabling specialized recognition
modes like Medical Mode (`medical-v1`).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add warnings in SpeechTimeoutUserTurnStopStrategy and
TurnAnalyzerUserTurnStopStrategy when stop_secs differs from the
recommended default (0.2s) or when stop_secs >= STT p99 latency,
which collapses the STT wait timeout to 0s. Document the stop_secs=0.2
assumption in stt_latency.py.
Send a setup message with client_req_id before the first text message
for each context, matching Gradium multiplexing protocol. This allows
Gradium to associate each session with its setup configuration when
using close_ws_on_eos=False.
The AudioHook protocol requires every message to carry a `parameters`
object. `_create_message` conditionally included it only when parameters
were truthy, so pong responses and closed responses without
outputVariables were sent without the field.
Clients that validate message structure (including the Genesys reference
implementation) rejected these messages, which broke server sequence
tracking and prevented outputVariables from reaching the Architect flow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Align with the OpenTelemetry GenAI semantic convention
gen_ai.system_instructions for system prompts. The old "system"
attribute name was unrelated to gen_ai.system (which is for
provider name).
Replace adapter-based extraction in traced_llm with direct reads from
_settings.system_instruction (priority) and context messages (fallback).
The old approach had three bugs: signature mismatch with Anthropic
adapter, key name inconsistency, and unnecessary overhead from full
message/tools conversion.
Also deduplicate the system instruction in spans -- it was appearing as
both "system" and "param.system_instruction".
These services were pushing audio frames directly via push_frame() in their
WebSocket receive loops, bypassing the base TTSService audio context
serialization queue. This causes incorrect frame ordering and broken
interruption handling.
Changes per service:
- Fish Audio: use append_to_audio_context(), replace _handle_interruption
with on_audio_context_interrupted()
- LMNT: use append_to_audio_context(), remove redundant push_frame override
- Neuphonic: use append_to_audio_context(), remove redundant push_frame and
process_frame overrides (base class handles pause/resume)
- Rime NonJson: use append_to_audio_context(), remove redundant push_frame
override
The LLMContext format already matches the expected Responses API
shape for input_audio, so no adapter conversion will be needed
once OpenAI enables support.
The API-provided full model name is more specific than the
user-provided model name (e.g. includes version/snapshot details).
Reorder the lookup in _get_model_name and add a comment where the
Responses service sets the field.
The override would re-add `instructions` after the adapter had
intentionally converted it to a developer message for empty contexts.
Added a regression test.
Rewrite docstrings to more clearly explain what SyncParallelPipeline
does: hold all output until every parallel branch finishes, so frames
produced in response to a single input are released together.
Adds a FrameOrder enum with ARRIVAL (default, existing behavior) and
PIPELINE (pushes frames in pipeline definition order). This lets callers
guarantee output ordering between parallel pipelines — e.g. ensuring
image frames precede audio frames — without needing a separate reordering
processor downstream.
Updates the 05-sync-speech-and-image example to use FrameOrder.PIPELINE,
removing the ImageBeforeAudioReorderer class entirely.
Add a processor after SyncParallelPipeline that ensures each image frame
precedes its corresponding TTS audio frames. SyncParallelPipeline batches
them together but doesn't guarantee branch ordering. The reorderer detects
when TTS frames arrive before their image (via context_id tracking) and
holds them until the image arrives.
Also rename ImageAudioSync to MarkImageForPlaybackSync for clarity.
Add a `sync_with_audio` field to `OutputImageRawFrame` that routes image
frames through the audio queue in the output transport, ensuring images
are only displayed after all preceding audio has been sent. This enables
proper audio/image synchronization in pipelines like the calendar month
narration example.
Update the 05-sync-speech-and-image example to use an `ImageAudioSync`
processor that sets this flag on image frames.
The FrameProcessor two-queue architecture processes SystemFrames and
non-SystemFrames on separate concurrent async tasks. Both paths called
SyncParallelPipeline.process_frame(), which used the same per-pipeline
sink queues. A SystemFrame's wait_for_sync could steal frames from a
concurrent non-SystemFrame's wait_for_sync, corrupting synchronization
and stalling the pipeline.
This was triggered by the auto-embedded RTVI processor (added in
v0.0.101) which floods OutputTransportMessageUrgentFrame SystemFrames
through the pipeline during LLM responses.
Fix: SystemFrames (except EndFrame) now take a fast path — passed
through internal pipelines and pushed downstream directly without
touching the sink queues or drain logic. EndFrame retains the full
drain behavior as a lifecycle frame.
- Add WakePhraseUserTurnStartStrategy for gating interaction behind wake
phrase detection, with timeout and single_activation modes
- Add default_user_turn_start_strategies() and
default_user_turn_stop_strategies() helper functions
- Deprecate WakeCheckFilter in favor of the new strategy
- Extend ProcessFrameResult to stop strategies for short-circuit evaluation
- Fix MinWordsUserTurnStartStrategy including filtered text in output
* Fix empty user transcription causing spurious interruption in Nova Sonic
Skip _report_user_transcription_ended() when _user_text_buffer is empty,
which happens when the initial prompt is text-only. Previously, an empty
TranscriptionFrame was pushed upstream, triggering a chain reaction:
on_user_turn_stopped → UserStartedSpeakingFrame → interruption →
premature BotStoppedSpeaking → multiple response start/stop cycles.
* Improve TextFrame and assistant end of turn logic
Now, SPECULATIVE text results are used to push the LLMTextFrame,
AggregatedTextFrame, and TTSTextFrame. Additionally, the TTSTextFrames
are push at the end of the corresponding audio segment.
* Remove BotStoppedSpeakingFrame fallback from Nova Sonic
Now that assistant response end is detected directly from Nova Sonic
contentEnd events (END_TURN and INTERRUPTED), the BotStoppedSpeakingFrame
handler is no longer needed. Inline the cleanup logic in reset_conversation.
* Remove duplicate reconnection logic from Gradium STT
The _receive_messages method had its own while-True reconnect loop,
duplicating the reconnection handling already provided by
WebsocketService._receive_task_handler (exponential backoff, max
retries, error reporting). Flatten to just the inner message loop
and let the base class handle reconnection.
* Align Gradium STT VAD handling with base class patterns
Replace the process_frame override with a _handle_vad_user_stopped_speaking
override, which is the proper hook provided by STTService. Move
start_processing_metrics() into run_stt (matching Gladia's pattern).
Remove unused FrameDirection and VADUserStartedSpeakingFrame imports.
* Add transcript aggregation delay after flushed to capture trailing tokens
Gradium flushed response can arrive before all text tokens have been
delivered. Instead of finalizing immediately on flushed, start a short
timer (100ms) that allows trailing tokens to accumulate before pushing
the final TranscriptionFrame.
* Add changelog for PR #4066
* Change default encoding to pcm_16000
* Decouple encoding from sample_rate in Gradium STT
The encoding parameter now takes just the base type (pcm, wav, opus)
and the sample rate is derived from the pipeline audio_in_sample_rate,
assembled dynamically via input_format_from_encoding(). This fixes the
mismatch where SAMPLE_RATE=24000 was passed to the base class while
encoding defaulted to pcm_16000.
Set store=False in Responses API calls since we send full conversation
history as input items and don't use previous_response_id.
Add 5 run_inference tests for OpenAIResponsesLLMService using real
LLMContext and adapter (only HTTP client mocked).
Uses real LLMContext and adapter (only HTTP client is mocked) to test
basic inference, client exception propagation, system_instruction
override, empty context fallback, and max_tokens override.
Add OpenAIResponsesLLMService using the Responses API, with a dedicated
adapter that converts LLMContext messages to Responses API input items
(system→developer, tool_calls→function_call, tool→function_call_output,
multimodal content conversion, and tools schema flattening).
- New adapter: open_ai_responses_adapter.py
- New service: openai/responses/llm.py
- Examples: 07-interruptible and 14-function-calling variants
- 19 unit tests for adapter conversion logic
- Eval entries for both examples
List-valued settings like keyterm, keywords, search, redact, and replace
were being converted to strings before being passed to the SDK connect()
method. The SDK expects lists so its encode_query can produce repeated
query params (keyterm=a&keyterm=b).
When an LLM returns a tool call with no arguments (arguments=null in
the streaming chunks), the tool call is silently dropped because:
1. `tool_call.function.arguments` is None, so nothing is accumulated
and `arguments` stays as "" (empty string)
2. `if function_name and arguments:` treats "" as falsy, skipping the
entire tool call execution
OpenAI always sends arguments="{}" even for parameterless tools,
masking this bug. But vLLM, Ollama, and other OpenAI-compatible
providers may omit arguments entirely when the tool schema has no
required parameters, causing tool calls to be silently ignored.
Fix: check only `function_name` (not `arguments`) and default empty
arguments to "{}" so `json.loads` produces an empty dict. Apply the
same fallback for intermediate tool calls in multi-tool responses.
Raw strings like "de-DE" passed as the language parameter to TTS/STT services
were bypassing the Language enum resolution logic, causing silent failures
(e.g. ElevenLabs expects "de" not "de-DE"). Now raw strings are first converted
to Language enums so they go through the same resolve_language() path, with a
warning logged for unrecognized strings.
Reset stop strategies at turn start (not just turn stop) so that late
transcriptions arriving between turns do not leave stale _text that
causes premature stops on the next turn. Also cancel pending timeout
tasks in reset() for both SpeechTimeout and TurnAnalyzer strategies.
Expose enable_dialout as a configure() parameter (default False) so
dial-out examples can opt in without needing to build DailyRoomProperties
manually.
Narrow misleading Optional type hints on parameters that never accept
None, extract the duplicated token_exp_duration * 60 * 60 calculation,
remove unnecessary forward-reference quotes on DailyMeetingTokenProperties,
and clarify why enable_dialout is explicitly set to False.
Handle Daily's on_dtmf_event callback, convert it to an
InputDTMFFrame pushed into the input transport. Also add __str__
methods to InputDTMFFrame and OutputDTMFFrame for better logging.
Refactor language_to_soniox_language to use resolve_language + LANGUAGE_MAP
pattern consistent with other services. Fix resolve_language fallback to use
str(language) instead of language.value so plain strings don't crash.
The Inworld WS TTS plugin previously relied on the base TTS service's 3-second AUDIO_CONTEXT_TIMEOUT to detect when audio was done, then sent close_context in on_audio_context_completed. This added unnecessary latency before TTSStoppedFrame was emitted.
The original implementation likely borrowed this idea from the 11labs' impelementation. But it's likely better to mirror the Cartesia plugin pattern where on_audio_context_completed is a no-op because the server signals completion directly.
Now close_context is sent in on_turn_context_completed (right after flush_context), so the server responds with contextClosed immediately after the last audio byte. The existing receive handler already calls remove_audio_context on contextClosed, which exits the audio context handler cleanly.
The base_url parameter previously forced wss:// and https:// schemes,
breaking air-gapped or private deployments that need ws:// or http://.
Extract URL derivation into _derive_deepgram_urls() helper that respects
the developers scheme choice while deriving the paired WebSocket and
HTTP URLs the Deepgram SDK requires.
Closes#4019
Now that the base TTSService and STTService handle Language enum
conversion at init time, subclasses no longer need to convert in their
own __init__ methods. Remove conversion calls from hardcoded defaults,
params paths, and deprecated direct arg paths across 22 service files.
Services just pass raw Language enums and let the base class convert
via language_to_service_language() polymorphic dispatch.
When a Language enum (e.g. Language.ES) is passed via
settings=Service.Settings(language=Language.ES), it gets stored as-is
without conversion to the service-specific code. The base
_update_settings() handles this for runtime updates, but at init time
apply_update() copies the raw enum. This causes API errors because
services send the unconverted enum value.
Add language conversion in TTSService.__init__ and STTService.__init__
after super().__init__(), using the subclass language_to_service_language()
via normal method resolution.
Both analyzers are superseded by LocalSmartTurnAnalyzerV3. Added
deprecation warnings and docstring notices following the existing
pattern from LocalSmartTurnAnalyzer.
Perplexity appears to have statefulness within a conversation, so
converting a system message to "user" in one call and then back to
"system" in the next (after more messages are appended) causes API
errors. Remove the trailing system→user conversion entirely — if the
context only has system messages, the API call will fail but the
mistake will be caught right away.
Add test exercising the step 3 ordering where stripping a trailing
assistant exposes a system message that then gets converted to user.
Move the reasoning about when a trailing system message can occur
into the docstring.
Perplexity allows multiple initial system messages, so don't merge them.
Instead, skip system-system pairs during the consecutive same-role merge
step. Broaden the trailing message fix to convert any trailing system
message to user (not just a lone system message), so contexts with only
system messages don't fail.
Perplexity's API is stricter than OpenAI about conversation history:
- Requires strict alternation between user/tool and assistant messages
- Disallows system messages except as the initial message
- Requires the last message to be user or tool
The new adapter transforms messages before sending to satisfy all three
constraints: merging consecutive initial system messages, converting
non-initial system to user, merging consecutive same-role messages, and
removing trailing assistant messages.
Also adds dual-system-instruction warnings to Cerebras, Fireworks,
Mistral, Perplexity, and SambaNova services (matching the existing
BaseOpenAILLMService pattern), and updates the warning text in
BaseOpenAILLMService to be more descriptive.
EndTaskFrame and StopTaskFrame are now ControlFrames instead of
SystemFrames, so they flow through the pipeline and queue behind
pending work. This prevents races where EndFrame could overtake
in-flight frames (e.g. function call responses).
CancelTaskFrame and InterruptionTaskFrame remain SystemFrames
(via new TaskSystemFrame base): since they need immediate propagation.
The sink now catches EndTaskFrame, StopTaskFrame and CancelTaskFrame
downstream and re-queues it upstream to the task, ensuring the full
pipeline drains before shutdown begins.
Wait for _audio_context_task to finish draining the contexts queue
before canceling _stop_frame_task, ensuring all pending audio
contexts are processed during shutdown.
Flush buffered frames before pushing the synchronization frame so
downstream processors see the buffered frames first. Switch to a
while-loop with pop(0) so frames added to the buffer during flush
are also drained.
Add convenience parameters to configure() so callers don't need to
manually construct DailyRoomProperties/DailyRoomSipParams for common
SIP provider and geo configuration.
When `service` is set and doesn't match, the service forwards the frame instead of consuming it. This allows targeting a specific service when multiple services of the same type exist in the pipeline.
Align Simli with HeyGen/Tavus by extending AIService instead of
FrameProcessor and using a ServiceSettings dataclass. InputParams is
preserved but deprecated; its fields are promoted to direct init params.
Lifecycle handling moves to start()/stop()/cancel() methods.
The default model for OpenAILLMService and AzureLLMService was still set
to gpt-4o. Restored it to gpt-4.1. Also, removed hardcoded gpt-4o/gpt-4o-mini
model references from examples so they pick up the new default.
Move the warning helper into AIService as _warn_init_param_moved_to_settings.
It now uses type(self).__name__ to produce messages like
"Use settings=AnthropicLLMService.Settings(model=...)" instead of the raw
settings class name "AnthropicLLMSettings(model=...)". Callers no longer need
to pass the settings class explicitly.
Replace direct references to settings class names (e.g. `FooSettings`) with the nested `Settings` alias form throughout all 87 service files:
- Type annotations: `Settings`
- Runtime code: `self.Settings`
- Docstrings: `ServiceClass.Settings`
- Cross-file inheritance: `ParentService.Settings`
This makes the `Settings` alias the canonical way to reference a service's settings, keeping only the class definition and alias assignment as the remaining hits for each raw settings class name.
* Add ServiceSwitcherStrategyFailover for automatic error-based service switching
Introduce a strategy hierarchy: ServiceSwitcherStrategy (base) →
ServiceSwitcherStrategyManual (handles ManuallySwitchServiceFrame) →
ServiceSwitcherStrategyFailover (adds error-based failover). ServiceSwitcher
now defaults to ServiceSwitcherStrategyManual with strategy_type optional.
Non-fatal ErrorFrames are forwarded to the strategy via handle_error().
* Move metadata request into _set_active_if_available
Requesting metadata is part of making a service active, so it belongs
alongside setting _active_service and firing on_service_switched. This
removes the duplicate queue_frame calls from ServiceSwitcher push_frame
and process_frame.
DailyTransportClient.start_transcription() accepted a settings
parameter but always used self._params.transcription_settings
instead, silently discarding any custom settings passed by callers.
Change transcription_settings to Optional[DailyTranscriptionSettings]
defaulting to None. The default settings are now applied at the call
site when transcription is started, and start_transcription receives
the serialized settings dict directly.
Use CustomVideoSource/CustomVideoTrack for the default camera output instead of
VirtualCameraDevice, mirroring how audio already uses CustomAudioSource/CustomAudioTrack.
Add support for custom video destinations (register_video_destination, add/remove
custom video tracks, routing in write_video_frame) so multiple video tracks can be
published simultaneously.
* Add system_instruction parameter to run_inference
Allow callers to provide a custom system instruction directly when calling
run_inference, without having to construct provider-specific context objects.
For OpenAI, the instruction is prepended as a system message (preserving
existing messages). For Anthropic, Google, and AWS Bedrock, it overrides the
single system field with a warning when an existing system instruction is
present in the context.
* Use system_instruction parameter in _generate_summary
Pass the summarization prompt via run_inference's system_instruction
parameter instead of embedding it as a system message in the context.
* Add changelog for #3968
Constructor/settings system_instruction now takes priority over the
context system message. Previously the context value would overwrite
the constructor value on every call. Warn when both are set.
After interruption, both _playing_context_id and _turn_context_id are
None. If a subclass calls append_to_audio_context(None, frame), the
recovery path matches (None == None) and creates a bogus audio context
that blocks the handler from ever processing the real context.
Early-return when context_id is falsy to prevent this.
The Deepgram TTS service was bypassing pipecats audio context management
system, pushing audio frames directly via push_frame() instead of routing
them through append_to_audio_context(). This caused stale audio to leak
into the pipeline after interruptions and missed ordered playback
guarantees.
- Route audio frames through append_to_audio_context() with context
availability checks to discard stale post-interruption frames
- Handle Flushed responses by appending TTSStoppedFrame and removing
the audio context to signal completion
- Replace _handle_interruption override with on_audio_context_interrupted
hook (the recommended pattern used by ElevenLabs and Cartesia)
- Remove redundant process_frame override that caused double-flush
(base class already flushes via on_turn_context_completed)
- Remove redundant start_tts_usage_metrics call (base class handles
aggregated usage metrics)
Make `region` optional so users can provide only `private_endpoint`.
Raise ValueError if neither is provided, and warn if both are given
(private_endpoint takes priority).
Enhanced the logic for extracting the system message in the traced_llm decorator to support LLMContext via adapter and handle exceptions gracefully. This improves compatibility with different context types and ensures better tracing information.
SpeechConfig does not accept both `region` and `endpoint` simultaneously —
they are mutually exclusive. The previous code always passed both, which
raises ValueError when a user supplies a private_endpoint URL. Now we
conditionally pass either `endpoint` or `region`, never both.
Update the tests in test_integration_unified_function_calling.py to not specify particular models but instead just use service defaults (the tests shouldn't be model-dependent anyway)
Forward the on_summary_applied event from the internal summarizer to
the aggregator so users can listen for it without accessing private
members. Update summarization examples to use the new public event.
_on_call_state_updated passes (self, state) to _call_event_handler,
but _run_handler already prepends self when invoking the handler.
This causes handlers to receive 3 positional arguments instead of 2,
making the on_call_state_updated event unusable.
This aligns with how _on_first_participant_joined correctly passes
only the data arg without self.
Turn completion instructions were being injected as a system message in
the LLM context, which caused warning spam when system_instruction was
also set, did not persist across full context updates, and broke LLMs
that do not support consecutive system messages.
Instead, compose the turn completion instructions into the LLM service
system_instruction field. This is managed via _base_system_instruction
which stores the original value for restoration when turn completion is
disabled.
- mcp_service.py: remove unnecessary try/except around debug log,
use len(available_tools.tools) to match actual iteration target
- bedrock_adapter.py, aws/llm.py: add AttributeError to except tuple
to handle None content (previously caught by bare except)
Wire up the existing settings update infrastructure to send a Configure
WebSocket message when keyterm, eot_threshold, eager_eot_threshold, or
eot_timeout_ms change mid-stream, avoiding a full reconnect.
Add a `Settings` class-level alias on every STT, LLM, TTS, image,
vision, and video service class pointing to its settings dataclass.
This lets developers discover the right settings class via the service
class itself (e.g. `GoogleSTTService.Settings(...)`) without needing
to know or import the separate settings class name.
Adds the explicit "no params object" step 3 comment to all
LLM services that skip from step 2 to step 4 in their
settings initialization sequence, matching the pattern
established in services that do have a params object.
Replace references to undefined `_params` with `self._settings` for
language and VAD config. Add missing `system_instruction` to default
settings to satisfy validate_complete(). Remove redundant line that
read language from the deprecated `params` arg.
- Speechmatics: move config build after super().__init__ and settings
delta so turn_detection_mode (e.g. ADAPTIVE) takes effect
- Google STT: fix example passing bare Language enum instead of list
- Google TTS: add missing explicit defaults for all custom settings fields
- Soniox: fix accidental tuple wrapping of STT service in example
- Speechmatics examples: fix system->user role in kick-off messages
- Deepgram Flux: move tag from settings to __init__ (billing metadata)
- ElevenLabs STT: default tag_audio_events to None (use API default)
- Fal STT: simplify language default handling
- Google TTS: rename GoogleStreamTTSSettings to GoogleTTSSettings
Use "AI service" language instead of listing specific types, add
ServiceSettings as a fallback for direct AIService subclasses, and
clarify delta mode description with a concrete frame example.
The audio fields (sample rates, sample sizes, channel counts) on the deprecated `Params` class had no non-deprecated equivalent. This adds an `AudioConfig` class and `audio_config` init arg so users can specify audio configuration without relying on the deprecated `params` parameter.
Move `session_properties` into `GrokRealtimeLLMSettings`, making `settings` the canonical way to configure Grok Realtime — matching the pattern used across the rest of the codebase. The `session_properties` init arg is now deprecated in favor of `settings=GrokRealtimeLLMSettings(session_properties=...)`.
`system_instruction` is synced bidirectionally between the top-level settings field and `session_properties.instructions`, with top-level taking precedence on conflict. (Unlike OpenAI Realtime, Grok's `SessionProperties` has no `model` field, so no model sync is needed.)
Replace the monolithic rtvi.py with a proper package split by concern
protocol version:
- models_v0.py: deprecated pre-1.0 Pydantic models
- models_v1.py: current RTVI protocol v1 message models
- frames.py: RTVI pipeline frame dataclasses
- observer.py: RTVIObserver and RTVIObserverParams
- processor.py: RTVIProcessor (now lean, imports from submodules)
- __init__.py: re-exports full public API for backward compatability
Move `session_properties` into `OpenAIRealtimeLLMSettings`, making `settings` the canonical way to configure OpenAI Realtime — matching the pattern used across the rest of the codebase. The `session_properties` init arg is now deprecated in favor of `settings=OpenAIRealtimeLLMSettings(session_properties=...)`.
`model` and `system_instruction` are synced bidirectionally between the top-level settings fields and `session_properties.model`/`.instructions`, with top-level taking precedence on conflict.
Add `system_instruction=None` to `default_settings` for OpenAIRealtimeLLMService, GrokRealtimeLLMService, UltravoxRealtimeLLMService, AWSNovaSonicLLMService (Azure inherits from OpenAI), and OpenAIRealtimeBetaLLMService (Azure Beta inherits from OpenAI Beta).
Deprecate `system_instruction` init arg in AWSNovaSonicLLMService in favor of `settings=AWSNovaSonicLLMSettings(system_instruction=...)`. Use `self._settings.system_instruction` directly instead of storing a separate `self._system_instruction`.
Deprecation of `params` and `session_properties` in favor of `settings` for realtime services will be tackled in future work.
Add `system_instruction` field to `LLMSettings` so it is runtime-updatable via settings.
For Google (GoogleLLMService, GoogleVertexLLMService), deprecate the init-time arg since it was already shipped. For Anthropic, AWS Bedrock, and OpenAI, remove the init-time arg entirely since it was never shipped.
Still need to handle realtime services (OpenAI Realtime, Grok Realtime, Gemini Live).
Broaden the "Dynamic Settings Updates" section into "Service Settings"
covering the complete settings pattern: defining a Settings subclass,
wiring it into __init__ with defaults + apply_update, and distinguishing
init-only config from runtime-updatable fields.
- Add dedicated Settings subclasses to 20 LLM services that were
borrowing parent Settings classes (e.g. AzureLLMSettings,
GroqLLMSettings) so users don't need cross-module imports
- Fix field defaults to NOT_GIVEN in BaseWhisperSTTSettings,
OpenAIRealtimeSTTSettings, and NvidiaSegmentedSTTSettings for
delta-mode safety
- Fix incomplete default_settings in AWS, Cartesia, ElevenLabs,
Fish, and Whisper services so validate_complete() passes
- Add auto-discovered tests that verify all Settings classes default
to NOT_GIVEN (delta safety) and all services initialize with
complete settings (store completeness)
Move output_container, output_encoding, output_sample_rate out of
CartesiaTTSSettings into plain instance attributes since they cannot
change at runtime without breaking the audio pipeline. Remove deprecated
speed/emotion fields and their dead references in _build_msg() and
run_tts(). Remove the from_mapping override that only existed to
destructure those now-removed output format fields.
Update all ~192 call sites across 84 service files to pass class references
(e.g. `CartesiaTTSSettings`) instead of string names (`"CartesiaTTSSettings"`)
to `_warn_deprecated_param()`. This enables better IDE refactoring support.
Also fix `from_mapping` return type annotations in 5 settings subclasses to
use `typing.Self` instead of forward reference strings.
ServiceSettings types were introduced for runtime updates via ServiceUpdateSettingsFrame, but there was tension between init-time and runtime APIs: overlapping-but-different InputParams vs ServiceSettings classes, and runtime-updatable fields like `model` and `voice` scattered as direct init args rather than living in a settings object. This unifies them so developers use the same settings type at both init and runtime, improving ergonomics and consistency.
Every concrete AIService subclass (LLM, TTS, STT, ImageGen, Vision, Video) now accepts a `settings` parameter for runtime-updatable config. Old init args (`model`, `voice_id`, `params`/`InputParams`) still work but emit DeprecationWarnings pointing to the new API. When both are provided, `settings` takes precedence. Leaf classes emit warnings; base classes do not, avoiding double warnings in inheritance chains.
system_instruction from the constructor always takes precedence. A
warning is now logged when the context also contains a system message
so users can spot the conflict.
Add vad_threshold parameter to AssemblyAIConnectionParams to support
voice activity detection threshold configuration for the u3-rt-pro model.
This parameter allows users to align AssemblyAI's VAD threshold with
their external VAD systems (e.g., Silero VAD) to avoid the "dead zone"
where AssemblyAI transcribes speech that the external VAD hasn't
detected yet, which can delay interruption handling.
- Range: 0.0 to 1.0 (lower = more sensitive)
- Default: 0.3 (API default when not sent)
- Only applicable to u3-rt-pro model
- Automatically included in WebSocket query parameters
Recommended usage: Set vad_threshold to match your VAD's activation
threshold (e.g., both at 0.3) for optimal performance.
Widen version ranges for stable packages (anthropic, azure, deepgram,
groq, livekit, nvidia-riva-client, fastapi, ormsgpack, opentelemetry,
faster-whisper) and add upper bounds to previously uncapped packages
(hume, pyjwt, livekit-api, camb).
Replace CartesiaHttpTTSService's internal use of the Cartesia SDK with
direct aiohttp calls, accepting an optional aiohttp_session parameter.
Replace fal-client SDK calls in FalSTTService and FalImageGenService
with direct HTTP to bypass the SDK's aggressive retry/backoff logic
that caused significant latency regressions.
Widen version ranges for stable packages (aiofiles, docstring_parser,
onnxruntime) while adding upper bounds to previously uncapped packages
(transformers, numba, wait_for2). Bump soxr to 1.0.0 and pyloudnorm
to 0.2.0. Move silero extra to empty since onnxruntime is now a core dep.
Add appropriate log levels to dial-in/dial-out, participant, transcription,
and recording event handlers. Move transcription error log from client
callback to transport handler to keep logging consistent at the transport
level.
The default function call timeout (10s) causes silent failures for
long-running tools. This adds an optional timeout_secs parameter to
register_function() and register_direct_function() so individual tools
can override the global function_call_timeout_secs. The warning message
now mentions both the per-tool and global timeout options.
Allow either threshold to be set to None to cleanly disable that trigger,
instead of requiring users to set a very large number as a workaround.
At least one of the two must remain set (validated at construction time).
- u3-rt-pro: Does not set parameter (not used)
- universal-streaming models: Set to 1.0 to maintain fast response
- This ensures fast response time matches previous implementation
Replace the round-trip push_interruption_task_frame_and_wait() mechanism
with broadcast_interruption(), which pushes an InterruptionFrame both
upstream and downstream directly from the calling processor.
This eliminates race conditions (transcription arriving before the
InterruptionFrame comes back), swallowed-event timeouts (frame blocked
before reaching the sink), and the complexity of _wait_for_interruption
flag / queue bypass / frame.complete() obligations.
- Add broadcast_interruption() to FrameProcessor
- Deprecate push_interruption_task_frame_and_wait() (delegates to new method)
- Remove event field and complete() from InterruptionFrame/InterruptionTaskFrame
- Remove _wait_for_interruption flag and all special-case logic
- Remove frame.complete() calls in stt_mute_filter and llm_response_universal
- Update all 17 call sites to use broadcast_interruption()
- Update tests
Enables .model_dump() serialization for Pipecat Cloud collection.
All metrics now include start_time (Unix timestamp) for timeline
plotting alongside duration_secs.
Add per-service latency breakdown metrics alongside existing user-to-bot
latency measurement. When enable_metrics=True, the observer now emits an
on_latency_breakdown event with TTFB, text aggregation, and user turn
duration metrics collected between VADUserStoppedSpeakingFrame and
BotStartedSpeakingFrame.
- Add LatencyBreakdown dataclass with ttfb, text_aggregation,
user_turn_secs fields
- Accumulate MetricsFrame data during user→bot cycles
- Reset accumulators on InterruptionFrame to discard stale metrics
- Measure user_turn_secs from actual user silence (VAD timestamp -
stop_secs) to turn release (UserStoppedSpeakingFrame)
- Filter zero-value TTFB entries from startup metric resets
- Add frame deduplication using bounded deque + set pattern
- Update example 29 with latency breakdown display
The ServiceSettings refactor (PR #3714) changed self._settings from
dicts to dataclass subclasses, but tracing code still used .items(),
in containment, and subscript access, causing AttributeError on
every traced call. Use given_fields() for iteration and attribute
access for named fields.
Replace the PipelineSink detection in StartupTimingObserver with an
on_pipeline_started() callback from PipelineTask via TaskObserver.
This fixes premature report emission when using ParallelPipeline,
which has its own inner PipelineSinks per branch.
Use start_offset_secs (offset from StartFrame) on ProcessorStartupTiming
instead of a wall-clock timestamp. Reports keep a single start_time
anchor for dashboard visualization. Remove _mono_to_wall conversion.
Switch ProcessorStartupTiming, StartupTimingReport, and
TransportTimingReport from dataclasses to Pydantic BaseModel. Add
start_time (Unix timestamp) fields and wall clock conversion for
monotonic observer timestamps.
Add BotConnectedFrame (SystemFrame) pushed by SFU transports (Daily,
LiveKit, HeyGen, Tavus) when the bot joins the room. Replace the
on_transport_readiness_measured event with on_transport_timing_report
which includes both bot_connected_secs and client_connected_secs.
Introduce ClientConnectedFrame (SystemFrame) pushed by all transports
when a client connects. StartupTimingObserver uses this to measure
transport readiness — the time from StartFrame to first client
connection — via a new on_transport_readiness_measured event.
Azure TTS _handle_canceled was putting None (the normal completion
signal) into the audio queue for all cancellation reasons, so run_tts
treated errors identically to success—silently producing no audio.
Now error cancellations put an Exception marker in the queue, which
run_tts converts to an ErrorFrame.
Azure STT had no canceled event handler at all, so auth failures,
network errors, and rate-limit cancellations were invisible. Added
_on_handle_canceled which pushes an ErrorFrame upstream via push_error.
Fixespipecat-ai/pipecat#3892
Tracks how long each processor start method takes during pipeline
startup by measuring StartFrame arrive/leave deltas. Emits a timing
report via the on_startup_timing_report event and auto-logs a summary.
Internal pipeline processors are excluded from reports by default.
The Whisper-based ONNX model expects 16 kHz audio, but the
_predict_endpoint method had five hardcoded references to 16000 without
checking the actual pipeline sample rate. When running at 8 kHz (e.g.
Twilio telephony), audio was fed to the feature extractor at the wrong
rate, causing the model to perceive speech at 2x speed with shifted
formant frequencies and produce incorrect end-of-turn predictions.
Add automatic resampling via numpy interpolation before feature
extraction and replace all hardcoded sample rate values with a
_MODEL_SAMPLE_RATE constant. Also fix the WAV debug logger to write
files with the correct sample rate header.
Fixes#3844
Snapshot the blocks into immutable bytes and trim the buffer BEFORE any await, so no memoryview is
held across async yield points. Without this, a concurrent filter() or stop() call could try to
extend() or clear() the bytearray while a memoryview still exports it, raising "Existing exports
of data: object cannot be re-sized".
When filter_incomplete_user_turns is enabled and an LLMMessagesUpdateFrame
replaces the context via set_messages(), the turn completion instructions
system message was lost. This caused the LLM to stop emitting turn
completion markers. Re-inject the instructions after set_messages() to
fix this.
- Keep old parameter name for backward compatibility
- Add deprecation warning when old parameter is used
- Automatically migrate old parameter value to new min_turn_silence parameter
- Exclude deprecated parameter from WebSocket URL to avoid sending it to API
- New parameter takes precedence if both are set
- Update 13d-assemblyai-transcription.py to explicitly use u3-rt-pro model
- Update 55d-update-settings-assemblyai-stt.py to demonstrate keyterms updates instead of language updates
- Add helpful logging to show before/after keyterms boosting effect
- Use difficult names (Xiomara, Saoirse, Krzystof) to demonstrate boosting effectiveness
- Add "beta feature" note to custom prompt warning
- Rename min_end_of_turn_silence_when_confident parameter to min_turn_silence across all AssemblyAI code
- Update documentation, examples, and test files to use new parameter name
Allow pushing frames upstream through the pipeline by passing
FrameDirection.UPSTREAM. Downstream frames use the existing push queue,
while upstream frames are pushed directly from the pipeline sink.
The ServiceSettings refactor (PR #3714) changed self._settings from
dicts to dataclass subclasses, but tracing code still used .items(),
in containment, and subscript access, causing AttributeError on
every traced call. Use given_fields() for iteration and attribute
access for named fields.
The update-docs workflow intermittently failed with "Input required and
not supplied: token" because pull_request events from fork PRs don't
have access to repository secrets. Switching to pull_request_target
runs the workflow in the base repo's context, ensuring secrets are
always available. This is safe since the workflow only runs on
already-merged PRs.
- 07o-interruptible-assemblyai.py: Basic example using Pipecat VAD mode
- 07o-interruptible-assemblyai-stt.py: Advanced example using STT-controlled
turn detection with comprehensive documentation on u3-rt-pro features
(turn detection tuning, prompt-based enhancement, speaker diarization)
The request_finalize() method in STTService is synchronous (sets a flag),
but was being called with await in the VAD turn endpoint handling code.
This caused "object NoneType can't be used in 'await' expression" errors.
Also includes automatic formatting improvements from ruff.
Replace _rtvi_external instance variable with a local prepend_rtvi flag
since it is only used during __init__ to decide whether to prepend the
RTVIProcessor to the pipeline.
When the user places an RTVIProcessor inside their pipeline and provides
a custom RTVIObserver subclass in observers, PipelineTask correctly
detects both and logs "skipping default ones." However it then
unconditionally prepends self._rtvi to the pipeline, causing the
processor to appear twice in the frame chain.
Track whether the RTVIProcessor was found externally (inside the user
pipeline) vs created internally. Only prepend it when created internally.
Fixes#3867
- Remove unused Mapping import
- Remove info logs at initialization (connection params)
- Remove info logs in _handle_transcription (transcript details, text sent to LLM)
- Remove info logs in _build_ws_url (WebSocket URL and params)
- Keep debug logs (less verbose, appropriate for development)
u3-rt-pro guarantees SpeechStarted is always sent before transcripts,
so the fallback UserStartedSpeakingFrame broadcast is never needed.
This ensures clean pairing of UserStarted/StoppedSpeakingFrame:
- Start: Always from _handle_speech_started
- Stop: Always from _handle_transcription on final turn
- Add request_finalize() before sending ForceEndpoint in Pipecat mode
- Keep confirm_finalize() when receiving formatted finals in Pipecat mode
- Remove confirm_finalize() from STT mode (use finalized=True instead)
This follows Pipecat's two-step finalization pattern where request_finalize()
is called when sending a finalize request to the STT service, and
confirm_finalize() is called when receiving confirmation back.
Even when summarization_timeout is explicitly set to None, use a
DEFAULT_SUMMARIZATION_TIMEOUT (120s) fallback so the LLM call can
never hang indefinitely. Applied in both LLMService and the dedicated
LLM path in LLMContextSummarizer.
The dedicated LLM logic lived in LLMAssistantAggregator, creating two
code paths and requiring the aggregator to call a private LLMService
method. Move it into the summarizer which already owns the config and
summarization lifecycle, keeping the aggregator handler as a single-line
upstream push.
Adds a configurable summarization_timeout (default 120s) that cancels
summary generation if the LLM hangs. On timeout, an error result is
returned so _summarization_in_progress resets and future
summarizations are unblocked.
Adds an field to LLMContextSummarizationConfig that allows
routing summarization to a separate LLM service (e.g., Gemini Flash)
instead of the pipeline's primary model. This avoids paying for
expensive inference when compressing context in long-running sessions.
Allows applications to customize how the summary is wrapped when
injected into context (e.g., XML tags, custom delimiters) so system
prompts can distinguish summaries from live conversation.
Add deprecation warnings to start_processing_metrics() and
stop_processing_metrics() on FrameProcessorMetrics and FrameProcessor.
Mark ProcessingMetricsData as deprecated in docstring. All existing
behavior is preserved — the warnings inform users that these will be
removed in a future version.
Runs Claude Code Action after PRs merge to main when source files
in services/transports/serializers/processors/audio/turns/observers/pipeline
are changed. Creates a docs PR on pipecat-ai/docs with targeted edits
following the existing update-docs skill instructions.
- Fix speaker diarization: Add field alias for speaker_label → speaker
mapping in TurnMessage model
- Add warning for non-optimal min_end_of_turn_silence_when_confident
values (recommends 100ms for best latency)
- Improve max_turn_silence override warning message clarity
- Update custom prompt warning (remove 88% accuracy claim)
- Add comprehensive logging for debugging:
- Log final connection params after modifications
- Log WebSocket URL and parsed parameters
- Log speaker field in transcripts
- Log text sent to LLM with speaker formatting
- Support dynamic configuration updates via STTUpdateSettingsFrame:
- keyterms_prompt (when AssemblyAI API supports it)
- prompt
- max_turn_silence
- min_end_of_turn_silence_when_confident
- Add InterruptionFrame handling with stop_all_metrics()
- Add processing metrics (start/stop) at response boundaries
- Fix agent transcript handling for voice and text modalities:
- Voice mode: push LLMTextFrame (append_to_context=False) and
TTSTextFrame for deltas, skip duplicated final text
- Text mode: push LLMTextFrame with proper response lifecycle,
no TTSTextFrame (downstream TTS handles audio)
- Add output_medium parameter to AgentInputParams and OneShotInputParams
- Improve TTFB measurement using VAD speech end time
- Update example with user turn strategies and transcript events
- Add text-only output example (50a-ultravox-realtime-text.py)
Move the sentence vs token aggregation concern into text aggregators
so all text flows through them regardless of mode. This enables
pattern detection and tag handling to work in TOKEN mode.
- Add TextAggregationMode enum (SENTENCE, TOKEN) as the user-facing
TTS setting, separate from the internal AggregationType
- Add TOKEN mode support to Simple, SkipTags, and PatternPair aggregators
- Add text_aggregation_mode parameter to TTSService and all TTS subclasses
- Deprecate aggregate_sentences in favor of text_aggregation_mode
- Merge TTSService._process_text_frame() into a single codepath
Add TextAggregationMetricsData measuring the time from the first LLM
token to the first complete sentence, representing the latency cost of
sentence aggregation in the TTS pipeline.
- Wire up passing speed setting to Groq, even though only a value of 1.0 is supported today
- Update the 55y example to switch voices instead of changing speed
- Add a 55zn example to exercise runtime updates of Groq STT
The only (rare) exception—where a service directly still needs to directly call `self._sync_model_name_to_metrics()`—is when the model name need to be "pulled" from another field (or nested field) in settings up to settings.model on a settings update. This only occurs in Deepgram services, where we use the voice as the model name.
This change has the side-effect of bringing model name to metrics for a number of services that were accidentally omitting it before.
This was added in 31daa889e8, but only
to `RimeTTSService`, not to `RimeNonJsonTTSService. Bringing these
to parity means that users switching between the two, with the same
inputs, have more consistent vocalization behaviors.
Introduce a generic TurnMetricsData class for turn detection metrics,
replacing the service-specific SmartTurnMetricsData (now deprecated).
Add end-to-end processing time measurement to KrispVivaTurn, tracking
the interval from VAD speech-to-silence transition to model threshold
crossing. Consume metrics in the strategy _handle_input_audio path
so they are pushed immediately when fresh.
The Krisp VIVA SDK v1.8.0 requires a license key in globalInit(). Add
api_key parameter to KrispVivaSDKManager, KrispVivaTurn, and
KrispVivaFilter with fallback to KRISP_API_KEY env var. Maintain
backwards compatibility with older SDK versions by catching TypeError
and falling back to the old 3-arg signature.
Brings back the 6 development workflow skills (changelog, cleanup,
code-review, docstring, pr-description, pr-submit) that were moved
to pipecat-ai/skills, and adds a .claude-plugin/marketplace.json so
other pipecat-ai repos can install them. Updates README contributing
section with installation instructions.
Audio filters like RNNoise, KrispViva, and AIC return empty bytes while
buffering audio to accumulate their required frame size. These empty
frames were flowing downstream, causing misleading "Empty audio frame
received for STT service" warnings.
Skip the frame in BaseInputTransport when audio is empty, preventing
unnecessary processing in VAD and downstream processors.
Fixes#3517
These frames were falling through to the else branch and being pushed
downstream, unlike TranscriptionFrame which is explicitly consumed.
This aligns with how the assistant aggregator already filters them.
PR #3776 replaced manual timestamp tracking with stop_ttfb_metrics() in
the timeout handler, but without an end_time it uses time.time() at
timeout expiry—producing TTFB = timeout + stop_secs (~2.2s) instead of
the actual transcript latency.
Restore _last_transcript_time tracking so the timeout handler measures
to when the transcript arrived, and skip reporting if none arrived.
Before this change, settings updates were often not applied. For example, a `TTSUpdateSettingsFrame` queued while the bot was speaking would only have an effect at the end of the bot's reply, and any interruption before the end of the reply would "cancel" the update.
Remove bundled Claude Code skills (changelog, cleanup, code-review,
docstring, pr-description, pr-submit) that now live in
https://github.com/pipecat-ai/skills. Add a section to the README
with installation instructions. The update-docs skill remains as
it is specific to this repository.
This makes each service-specific field individually visible to the delta/update mechanism (`apply_update`, `given_fields`) and removes the need for complex sync logic between `input_params` and top-level fields like `model`.
- Soniox: replace `input_params: SonioxInputParams` with 8 individual fields, simplify `_update_settings` by removing model sync logic, remove unused `is_given` import
- Gladia: replace `input_params: GladiaInputParams` with 11 individual fields, resolve deprecated `language` → `language_config` at init time rather than at `_prepare_settings` time
- Storage mode: for use in `self._settings`. All fields should be specified, i.e. should not be `NOT_GIVEN`.
- Delta mode: for use in `*UpdateSettingsFrame`.
In service of this, this commit:
- Adds a runtime check that all fields are specified in storage mode
- Updates all services to specify all fields in stored settings
- Updates all services to no longer check for `is_given` in stored settings (not necessary anymore)
- Updates relevant docstrings
- Renames `update` to `delta` in `*UpdateSettingsFrame`
- Updates community integrations guide
Move start_processing_metrics from run_stt (called per audio chunk,
producing noisy 0ms logs) to _receive_messages when the first final
token arrives for a new utterance. The existing stop_processing_metrics
in send_endpoint_transcript completes the pair, giving a meaningful
measurement of time from first recognition to finalized transcript.
Commit 859cd7c9 refactored STT TTFB measurement to use the base class
start_ttfb_metrics/stop_ttfb_metrics, which are gated behind
can_generate_metrics(). Soniox and AWS Transcribe never overrode this
method (default returns False), so TTFB was silently never reported.
Move speech detection tracking outside the per-frame loop in append_audio
since is_speech applies to the whole buffer. Add debug log in
analyze_end_of_turn to show state and probability at decision time. Update
the Krisp VIVA example to use Cartesia TTS and turn analyzer strategy.
Use wildcard `*UpdateSettingsFrame` to cover all frame types. Clarify that NOT_GIVEN only appears in update deltas, not in the service's current settings state.
Split the single "changed" entry into separate "added", "changed", and "deprecated" entries for clarity. Add a note about the subtle behavior change in the deprecated set_model/set_voice/set_language methods.
Simplify the reconnect example to show a common pattern (reconnect on any change) and improve the _warn_unhandled_updated_settings example to show selective handling of specific fields.
Update docstrings for ServiceSettings, LLMSettings, TTSSettings, and STTSettings to make clear these capture only the subset of service configuration that can be changed while the pipeline is running via UpdateSettingsFrame, not all constructor parameters.
Document the existing convention: use @dataclass for frames and
internal pipeline data, use Pydantic BaseModel for configuration,
parameters, metrics, and external API data.
Replace self-referential `pipecat-ai[local-smart-turn-v3]` extra in core
dependencies with the actual packages (`transformers`, `onnxruntime`).
Self-referential extras are not supported by Poetry and cause dependency
resolution failures. Since these are required by the default turn stop
strategy (LocalSmartTurnAnalyzerV3), they belong in core dependencies.
- Remove `local-smart-turn-v3` optional extra from pyproject.toml
- Remove try/except ModuleNotFoundError guard (now always installed)
- Remove `--extra local-smart-turn-v3` from CI workflows
When the InterruptionFrame does not complete within the timeout the
caller was stuck in an infinite loop logging warnings. Now the event
is set after the first timeout so the processor can continue.
Also adds a keyword timeout parameter so callers can customize the
wait duration.
- indicate clearly that it's not meant for public use
- make it clear the `self._settings` is the single source of truth for model information
- set the stage for an upcoming change where `AIService` subclasses won't have to ever worry about explicitly calling an `AIService` method to sync model name to metrics
Across all services, switch from accessing `self._model_name` or `self.model_name` in favor of `self._settings.model`.
Adds a TTS service that connects to Deepgram models deployed on AWS
SageMaker endpoints via HTTP/2 bidirectional streaming. Supports the
Deepgram TTS protocol (Speak, Flush, Clear, Close) over the BiDi
client, with interruption handling and per-turn TTFB metrics.
Updates the example and env.example with separate STT/TTS endpoint names.
- AWSNovaSonicLLMService: new `AWSNovaSonicLLMSettings` with `voice_id` and `endpointing_sensitivity`; remove `self._params` entirely, storing audio I/O config as plain instance variables
- NeuphonicHttpTTSService: reuse `NeuphonicTTSSettings`; use inherited `language` field instead of bespoke `lang_code`
- NvidiaTTSService: new `NvidiaTTSSettings` with `quality`
- PiperTTSService / PiperHttpTTSService: new `PiperTTSSettings` / `PiperHttpTTSSettings` (no extra fields)
- SpeechmaticsTTSService: new `SpeechmaticsTTSSettings` with `max_retries`
Also remove redundant `lang_code` from `NeuphonicTTSSettings` (both WS and HTTP services now use the inherited `TTSSettings.language` field, with automatic enum conversion via the base class).
HTTP services (Neuphonic HTTP, Piper HTTP, Speechmatics) don't override `_update_settings` since the base class applies changes to `self._settings` and subsequent requests read from it automatically.
Also:
- remove unnecessary pass-through `_update_settings` implementation in `FalSTTService`
- warn that `AsyncAITTSService` doesn't currently support runtime settings updates
- update how `GradiumTTSService._update_settings` checks for voice changes
- remove a couple of unnecessary args (because they specified defaults) in other examples
ThinkingConfig was defined as an inner class on the service but referenced in the Settings dataclass declared before the service class, causing a crash at import time. Move ThinkingConfig to a standalone class defined before Settings, and keep a class attribute alias for backward compatibility.
Eliminate custom _emit_stt_ttfb_metric and manual timestamp tracking in
STTService by reusing FrameProcessor's start_ttfb_metrics/stop_ttfb_metrics
with new start_time/end_time parameters. This keeps the chronological
start→stop ordering and removes _speech_end_time and _last_transcription_time
state from STTService.
Remove the deprecation warning and __post_init__ override. Also fix the
default value for remote_participants to use field(default_factory=dict)
instead of None.
Add write_transport_frame() hook to BaseOutputTransport so subclasses
can handle custom frame types that flow through the audio queue. Add
DailySIPTransferFrame and DailySIPReferFrame as DataFrame subclasses
that queue with audio, ensuring SIP operations execute only after the
bot finishes its current utterance. Override write_transport_frame in
DailyOutputTransport to dispatch these frames to the existing
sip_call_transfer() and sip_refer() client methods.
Also switch DailyOutputTransport.send_message error handling from
logger.error to push_error for consistency.
Every `*Settings` dataclass field whose default is `NOT_GIVEN` now carries `_NotGiven` in its type union so the type system accurately reflects the three-state semantics (real value, `None` where applicable, or not-yet-specified). Fields previously typed as bare `Any`, `str`, `float`, `bool`, `list`, `dict`, or `Optional[X]` are now narrowed to the specific type from the corresponding `InputParams` Pydantic model.
RTVIObserver previously filtered out all upstream frames to avoid
duplicate messages from broadcasted frames. This caused upstream-only
frames to be silently ignored. Instead, add a `broadcasted` field to
the Frame base class that is set by broadcast_frame() and
broadcast_frame_instance(), and only skip upstream copies of
broadcasted frames.
The CI was failing because the runner's package index was stale,
causing a 404 when fetching libasound2-dev (a dependency of
portaudio19-dev). Running apt-get update first refreshes the index.
- Move `CommitStrategy` up in the file so it could be used by `ElevenLabsRealtimeSTTSettings`
- Fix a bug where `run_tts` would erroneously try to reconnect if a reconnection was already in flight (like a reconnection triggered by `_update_settings`)
42 examples covering STT (13), TTS (21), LLM (4), and realtime (4) services. Each demonstrates updating service settings 10 seconds after client connects, verifying the typed settings machinery end-to-end for every provider.
HumeTTSService now stores its params (description, speed, trailing_silence) in a proper `HumeTTSSettings` dataclass instead of a separate `_params` Pydantic model, making it work with `TTSUpdateSettingsFrame(update=...)`. The old `update_setting(key, value)` method is kept but deprecated.
Also removes the unused no-op `TTSService.update_setting` base method, which was never called by the `TTSUpdateSettingsFrame` pipeline.
The dataclass-based API (`*UpdateSettingsFrame(update=*Settings(...))`) is the preferred path since 0.0.103. The dict path still works but now emits a `DeprecationWarning`.
Change `TTSSettings.language` and `STTSettings.language` from `Any` to `Language | str | _NotGiven`. Add `language_to_service_language` base method and centralized `isinstance`-guarded conversion in `STTService._update_settings` (mirroring TTS). Update the TTS guard from `is not None` to `isinstance(…, Language)` so raw strings pass through unchanged.
Remove now-redundant per-service language conversion from `_update_settings` overrides (ElevenLabs, Azure, Fal, Whisper). Add `language_to_service_language` to Azure STT so the centralized conversion picks it up. Fix AWS and NVIDIA STT `__init__` to convert language at construction time, then simplify their runtime accessors to read `_settings.language` directly.
Note that for services that previously handled applying updates (through methods like `set_model` and `set_language`), we're keeping the update-applying logic (some or most of which is already well-tested) and expanding it to cover all relevant settings fields. Services under this bucket are:
- Deepgram STT
- Deepgram Sagemaker STT
- Elevenlabs STT
- Google STT
- Gradium STT
- OpenAI STT
- Speechmatics STT
Change the version specifier from `>=0.2.8` to
`~=0.2.8` for the `speechmatics-voice` package.
This ensures compatibility with future patch
versions while preventing potential breaking
changes from minor updates.
Use client_req_id-based multiplexing instead of disconnecting and
reconnecting the websocket on every interruption. This follows the
same pattern used by Cartesia, ElevenLabs, and other services via
AudioContextWordTTSService.
Key changes:
- Base class: InterruptibleWordTTSService -> AudioContextWordTTSService
- Add close_ws_on_eos: False to setup message to keep connection alive
- Add client_req_id to text, end_of_stream messages for demultiplexing
- Route audio via append_to_audio_context() instead of push_frame()
- Silently drop messages for cancelled/unknown contexts on interruption
- Add _handle_interruption() that resets context without reconnecting
- Remove no-op push_frame() override
Always create UserIdleController (timeout=0 means disabled), removing
all Optional guards. Add UserIdleTimeoutUpdateFrame to allow changing
the idle timeout at runtime.
Replace the continuous heartbeat-based timer (UserSpeakingFrame/BotSpeakingFrame
+ asyncio.Event loop) with a simple one-shot timer that starts when
BotStoppedSpeakingFrame is received and cancels on UserStartedSpeakingFrame or
BotStartedSpeakingFrame. This eliminates false idle triggers caused by gaps
between the user finishing speaking and the bot starting to speak (LLM/TTS
latency).
Guard the timer start with two conditions to prevent false triggers:
- User turn in progress: during interruptions, BotStoppedSpeaking arrives
while the user is still speaking mid-turn.
- Function calls in progress: FunctionCallsStarted arrives before
BotStoppedSpeaking because the bot speaks concurrently with the function
call starting, so the timer must wait for the result and subsequent bot
response.
Now that all services use typed `ServiceSettings` objects, this removes the interim scaffolding that supported both dict-based and typed settings paths in parallel. Specifically: removes old dict-based `_update_settings(settings: Mapping)` methods from base classes, removes `isinstance(self._settings, ServiceSettings)` guards, simplifies `process_frame` branching, and renames `_update_settings_from_typed` to `_update_settings` across all ~30 service implementations. Also renames the no-arg `_update_settings()` helper on realtime services to `_send_session_update()` to avoid collision, adds `from_mapping` overrides on `GoogleLLMSettings` and `AnthropicLLMSettings` for ThinkingConfig dict-to-object conversion, and replaces a broken no-arg `_update_settings()` call in Gemini Live with a TODO.
- NvidiaSTTService.set_model: convert to proper DeprecationWarning (model can't change at runtime for Riva streaming STT)
- NvidiaTTSService.set_model: same treatment for Riva TTS
- NvidiaSegmentedSTTService.set_model: remove — base class now routes through _update_settings_from_typed which re-creates the recognition config
- GeminiTTSService.set_voice: remove — move AVAILABLE_VOICES validation into _update_settings_from_typed so it fires on both legacy and new paths
Standardize all STT, TTS, and LLM service classes to declare `_settings` with the narrowed Settings type as a class-level annotation. This gives editors and type checkers the specific type when hovering or autocompleting on `self._settings` in each service and its subclasses. Inline `self._settings: Type = ...` assignments are replaced with plain `self._settings = ...`.
`filter_incomplete_user_turns` and `user_turn_completion_config` were only handled in the legacy dict-based `_update_settings` code path. This adds them to `LLMSettings` and introduces `LLMService._update_settings_from_typed` so the typed path handles them too.
Does not (yet) touch `InputParams`, to avoid scope creep and touching something currently part of the public API. But there is a lot of overlap between `*Settings` object fields and `InputParams` fields.
Other than discoverability/typing, these are some other improvements brought by this refactor:
- There is now a single code path (see `_update_settings_from_typed`) where services can respond to settings changes (by, say, reconnecting if needed), improving maintainability and guaranteeing one and only one reconnection no matter which settings changed
- `set_language`/`set_model`/`set_voice`—which we're assuming are usable as public methods, though *not* recommended over `*UpdateSettingsFrame`—all use the same code path as settings updates. They're also now all consistent in that, if a service needs to respond to a change (by, say, reconnecting if needed), any of these methods will kick off that process. Note that this is technically a behavior change.
- Several services now properly react to changed settings by reconnecting:
- `AWSTranscribeSTTService`
- `AzureSTTService`
- `SonioxSTTService`
- `GladiaSTTService`
- `SpeechmaticsSTTService`
- `AssemblyAISTTService`
- `CartesiaSTTService`
- `FishAudioTTSService` (would previously only reconnect when `model` changed)
- `GoogleSTTService`
- `SpeechmaticsSTTService` (which previously only handled *some* settings updates through a nonstandard public `update_params` method)
- `GradiumSTTService`
- `NvidiaSegmentedSTTService` (which previously only handled changes to language)
- Bookkeeping across various services has been reduced, mostly by deduping ivars; the `self._settings` ivar is treated as the source of truth
NOTE: I pretty much guarantee that there are services missed in this PR in terms of bringing to consistency with how updates are handled (like whether changes in certain fields trigger reconnects when they need to). We can squash remaining inconsistencies as we stumble onto them, service by service. The goal here is to get things *mostly* in order, and establish the infrastructure and patterns we'll need going forward.
The outer try/except in each service decorator caught both tracing
setup errors and application errors from the wrapped function. If the
function itself raised (e.g. LLM rate limit, TTS timeout), the
exception was caught and the function was called a second time.
Fix by tracking whether the original function was called via a
fn_called flag. If the function was already called, re-raise the
exception instead of falling back to untraced re-execution.
Adds a Claude Code skill that analyzes the current branch diff against
main, maps changed source files to their doc pages, and makes targeted
updates to Configuration, InputParams, Usage, Notes, and Event Handlers
sections.
The class_decorators.py module (Traceable, @traceable, @traced) is not
used anywhere in the codebase. Mark it deprecated and fix the misleading
comment in service_decorators.py that referenced it as if it were active.
Consolidate _tracing_enabled and _tracing_context from LLMService,
STTService, and TTSService into the shared AIService base class.
Extract _get_turn_context() helper in service_decorators.py to
encapsulate the repeated pattern across all traced decorators.
Add model-specific params (arcana: repetition_penalty, temperature, top_p;
mistv2: no_text_normalization, save_oovs, segment) with dynamic query param
building via _build_settings(). Model/voice/param changes now trigger
WebSocket reconnection since all settings are URL query params.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This allows non-websocket STT services (like SarvamSTTService, which uses
the Sarvam Python SDK for connection management) to reuse the same keepalive
pattern. Subclasses override _send_keepalive() and _is_keepalive_ready() for
their specific protocol.
ConversationContextProvider and TurnContextProvider were singletons that
stored tracing context as class-level state. When two PipelineTask instances
ran concurrently, they would overwrite each other's context, causing service
spans to attach to the wrong pipeline's turn span.
Replace both singletons with a single TracingContext object owned by each
PipelineTask, threaded to services via StartFrame.
The Grok API now returns prefixed voice names (e.g. "human_Ara") in
session.updated events, causing Pydantic validation errors. Widen the
voice field type from GrokVoice to GrokVoice | str to accept both
user-facing names and server-returned values.
Extract the procedural PR workflow into an actionable skill that can be
invoked with /pr-submit. CLAUDE.md is better suited for project context
and conventions, not step-by-step procedures.
Expand the Pull Requests section with detailed step-by-step instructions
including branch naming, commit guidance, changelog generation, and PR
description updates.
OpenAI's AsyncStream uses close() while async generators (e.g. from
OpenPipe) use aclose(). Replace direct async-with on the stream with a
helper that handles both protocols.
Adds opt-in keepalive_timeout and keepalive_interval params to
WebsocketSTTService. When enabled, a background task sends silent audio
(or a service-specific protocol message) when the connection has been
idle, preventing server-side timeout disconnects.
Subclasses override _send_keepalive(silence) to wrap the silence in
their wire format. The default sends raw PCM bytes.
Enables keepalive for ElevenLabs (10s), Gladia (20s), and Soniox (1s),
replacing their per-service custom keepalive tasks.
Add a `service` field so the frame targets a specific service, allowing
ServiceSwitcher.push_frame to consume it only when the targeted service
matches the active service. STTService and test mocks now push the frame
downstream after handling instead of silently consuming it.
The default stop strategy changed to TurnAnalyzerUserTurnStopStrategy,
which requires actual audio analysis. Use SpeechTimeoutUserTurnStopStrategy
explicitly since this test is not testing turn detection.
Change the default user turn stop strategy from
TranscriptionUserTurnStopStrategy to TurnAnalyzerUserTurnStopStrategy
with LocalSmartTurnAnalyzerV3. Also reduce AUDIO_INPUT_TIMEOUT_SECS
from 1.0 to 0.5 and remove its debug log.
- Make ServiceSwitcherStrategy inherit from BaseObject with properties
for services and active_service, and move initial service selection
into the base class
- Add on_service_switched event to ServiceSwitcherStrategy
- handle_frame now returns the switched-to service (or None), allowing
ServiceSwitcher to swallow ManuallySwitchServiceFrame on switch and
request metadata from the new active service
- Override push_frame to suppress RequestMetadataFrame and
ServiceMetadataFrame from inactive services
- Remove ServiceSwitcherFilter and ServiceSwitcherFilterFrame in favor
of plain FunctionFilter instances with closures that check the
strategy's active service directly
- FunctionFilter: add FilterType alias
- FunctionFilter: when direction is None, frames in both directions
are filtered instead of just one
- Add docstrings to ServiceSwitcher and its components
Refactor TranscriptionUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy
to use VADUserStoppedSpeakingFrame as the ground truth for when speech ended,
rather than triggering timeouts from transcription frames.
RTVIObserver now skips upstream frames to prevent duplicate RTVI messages
when frames are broadcast in both directions. Also changed
FunctionCallCancelFrame to use broadcast_frame for consistency with
other function call frames.
Processors inside parallel sub-pipelines can push frames during
StartFrame/EndFrame/CancelFrame processing. Previously these frames
could escape the ParallelPipeline before all branches finished
processing the lifecycle frame. Now they are buffered and flushed
after synchronization completes.
Add tests for the event-based interruption completion: complete() sets
the event, complete() is safe without an event, the event fires at
the pipeline sink, and a warning is logged when the frame is blocked.
Also remove the unconditional await after the timeout so the function
returns instead of hanging when complete() is never called.
Move the interruption wait event from per-processor instance state to
the frame itself. The event is created in
push_interruption_task_frame_and_wait(), threaded through
InterruptionTaskFrame → InterruptionFrame, and set when the frame
reaches the pipeline sink. This scopes the event to each interruption
flow rather than sharing mutable state on the processor.
Also adds a 2s timeout warning to help diagnose cases where
InterruptionFrame.complete() is never called.
This adds user-to-bot response latency tracking to OpenTelemetry spans:
- Created UserBotLatencyObserver as a reusable component for tracking
user-to-bot response latency
- Records the value as an attribute on turn spans (turn.user_bot_latency_seconds)
- Updated TurnTraceObserver to use UserBotLatencyObserver, following the same pattern as TurnTrackingObserver
- Updated PipelineTask to automatically create and wire UserBotLatencyObserver
when tracing is enabled (same as TurnTrackingObserver)
Tests cover:
- No messages received (raises ValueError)
- One message received (logs warning, continues)
- Two messages received (normal operation)
- All telephony providers (Twilio, Telnyx, Plivo, Exotel)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Handle WebSocket disconnections gracefully when telephony providers send
fewer messages than expected. Adds explicit StopAsyncIteration handling
for both first and second message retrieval.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
NLTK's sent_tokenize() only supports ~15 European languages and defaults to
English. For Japanese, Chinese, Korean, Hindi, Arabic, and other non-Latin
languages, NLTK fails to recognize sentence boundaries like 。?! causing
text to accumulate until flush instead of being emitted sentence-by-sentence.
Add a fallback in match_endofsentence() that scans for unambiguous non-Latin
sentence-ending punctuation when NLTK fails to split the text. Latin
punctuation (. ! ? ; …) is excluded from the fallback since NLTK handles
those correctly and they can be ambiguous (abbreviations, decimals, etc.).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
broadcast_frame() expects a frame class and kwargs, but the three
websocket input transports (fastapi, client, server) were incorrectly
passing a frame instance. This would cause a TypeError at runtime when
an InputTransportMessageFrame was received.
Incorporates latest changes from main branch including:
- AIC filter and VAD updates
- STT service improvements
- Base serializer changes
- Various bug fixes
If `enable_rtvi` is enabled (enabled by default) and RTVI processor will be
added automatically to the pipeline. Also, and RTVI observer will be
registered.
Process audio as soon as we receive it from the generator. Previously, we were
reading from the generator and adding elements into a queue until there was no
more data, then we would process the queue.
If AIService subclasses implement start()/stop()/cancel() and exception are not
handled, execution will not continue and therefore the originator frames will
not be pushed. This would cause the pipeline to not be started (i.e. StartFrame
would not be pushed downstream) or stopped properly.
Issues:
- After disconnecting, we were prematurely sending audio messages using the new prompt and content names, before the new prompt and content were created
- We weren't properly sending system instruction and conversation history messages to Nova Sonic with `"interactive": false`
The underlying issue was related to the fact that we were sending audio to Grok before we had configured the Grok session with our default input sample rate (16000), so Grok was interpreting those initial audio chunks as having its default sample rate (24000). We didn't see this issue when using the Daily transport simply because in our test environments Daily took a smidge longer than a reflexive (localhost) pure WebRTC connection, so we would only send audio to Grok *after* we had configured the Grok session with the desired sample rate.
Extract dictionary value to local variable and check for None before
accessing cancel_on_interruption attribute, since the dictionary values
are typed as Optional[FunctionCallInProgressFrame].
If we aggregate transcriptions we will get incorrect interruptions. For example,
if we have a strategy with min_words=3 and we say "One" and pause, then "Two"
and pause and then "Three", this would trigger the start of the turn when it
shouldn't. We should only look at the incoming transcription text and don't
aggregate it with the previous.
- Replace aiohttp with camb SDK (AsyncCambAI client)
- Add support for passing existing SDK client instance
- Simplify API: no longer requires aiohttp_session parameter
- Update example to use simplified initialization
- Rewrite tests to mock SDK client instead of HTTP servers
- Add --voice-id CLI argument to example (default: 2681)
- Remove test_camb_quick.py from examples/ (tests belong in tests/)
- Update docstring with new usage
Gemini expects parallel function calls to be passed in as a single multi-part `Content` block. This is important because only one of the function calls in a batch of parallel function calls gets a thought signature—if they're passed in as separate `Content` blocks, there'd be one or more missing thought signatures, which would result in a Gemini error.
Add support for Gladia's speech_start/speech_end events to emit
UserStartedSpeakingFrame and UserStoppedSpeakingFrame frames.
When enable_vad=True in GladiaInputParams:
- speech_start triggers interruption and pushes UserStartedSpeakingFrame
- speech_end pushes UserStoppedSpeakingFrame
- Tracks speaking state to prevent duplicate events
This allows using Gladia's built-in VAD instead of a separate VAD
in the pipeline.
The end of turn is already handle with interruptions or with
LLMFullResponseEndFrame. LLMFullResponseEndFrame should never be blocked,
otherwise the assistant would not work.
* Krisp VIVA SDK Filter and Turn support.
* Reverted the krisp_filter.py as it's already deprectaed.
* enabled test with krisp_audio mock.
* More review comment fixes.
reverted the state logic in viva filter to be similar to the existing impl on main branch.
Fixed tests, ruff, etc.
* More review comments for Turn detection.
removed integration tests.
* Moved the SDK init/deinit into start/stop
Changed getattr with default value to use 'or' operator for fallback.
This ensures proper model name retrieval when model_name attribute exists but is None or empty.
This commit adds word boundary support to AzureTTSService and fixes
the race condition that causes scrambled TTS output across multiple
sentences.
## Features Added
- Change AzureTTSService to inherit from WordTTSService
- Subscribe to Azure SDK's synthesis_word_boundary event
- Emit word-level text with timing information via _words_queue
- Add synthesis lock for sequential sentence processing
## Race Condition Fix
Previously, each sentence's word boundary timestamps reset to 0,
causing downstream components to interleave words when reordering
frames by PTS. This resulted in scrambled output like:
'Hello ! I What am questions AI have assistant...'
The fix adds cumulative audio offset tracking to ensure monotonically
increasing PTS across all sentences:
Sentence 1: pts = 0.1s, 0.5s, 0.8s (cumulative at end: 0.8s)
Sentence 2: pts = 0.9s, 1.2s, 1.5s (0.8s + relative offset)
## Key Changes
- _cumulative_audio_offset: tracks total audio duration
- _handle_word_boundary: adds cumulative offset to timestamps
- _handle_completed: accumulates audio duration for next sentence
- flush_audio: resets cumulative offset at end of LLM response
- _handle_interruption: resets state on user interruption
- run_tts: uses synthesis lock for sequential processing
Fixes#2918🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for the language_hints_strict parameter in Soniox STT
configuration. When set to true, this parameter strictly enforces
language hints, restricting transcription to only the specified
languages.
Prior to this change, after the model generated an image the conversation would not be able to progress. It would stall out because we were never storing the image in context, so the model would never realize it already did the work of generating an image. We didn't run into issues with Gemini 2.5 Flash Image, because that model always followed up an image with a text message.
Adds support for using Ultravox Realtime as a speech-to-speech service.
Also removes the deprecated Ultravox speech-to-text vllm model integration to avoid confusion.
Changed the on_client_connected system message from a direct greeting to
an instruction that tells the AI to introduce itself, giving the LLM more
flexibility in how it starts the conversation.
Thinking, sometimes called "extended thinking" or "reasoning", is an LLM process where the model takes some additional time before giving an answer. It's useful for complex tasks that may require some level of planning and structured, step-by-step reasoning. The model can output its thoughts (or thought summaries, depending on the model) in addition to the answer. The thoughts are usually pretty granular and not really suitable for being spoken out loud in a conversation, but can be useful for logging or prompt debugging.
Here's what's added:
1. New typed input parameters for Google and Anthropic LLMs that control the models' thinking behavior (like how much thinking to do, and whether to output thoughts or thought summaries).
2. New frames for representing thoughts output by LLMs.
3. A generic mechanism for associating extra LLM-specific data with a function call in context, used specifically to support Google's function-call-related "thought signatures", which are necessary to ensure thinking continuity between function calls in a chain (where the model thinks, makes a function call, thinks some more, etc.)
4. A generic mechanism for recording LLM thoughts to context, used specifically to support Anthropic, whose thought signatures are expected to appear alongside the text of the thoughts within assistant context messages.
5. An expansion of `TranscriptProcessor` to process LLM thoughts in addition to user and assistant utterances.
Deepgram request IDs are necessary for investigating behavior at the
request level. This commit adds DEBUG logs that print Deepgram request
IDs when using Deepgram's STT or TTS.
a best effort version of the bot's output
- The `RTVIObserver` now emits `bot-output` messages based off
the new `AggregatedTextFrame`s (`bot-tts-text` and
`bot-llm-text` are still supported and generated, but
`bot-transcript` is now deprecated in lieu of this new, more
thorough, message).
- The new `RTVIBotOutputMessage` includes the fields:
- `spoken`: A boolean indicating whether the text was spoken by TTS
- `aggregated_by`: A string representing how the text was aggregated
("sentence", "word", "my custom aggregation")
- Introduced new fields to `RTVIObserver` to support the new
`bot-output` messaging:
- `bot_output_enabled`: Defaults to True. Set to false to disable
bot-output messages.
- `skip_aggregator_types`: Defaults to `None`. Set to a list of
strings that match aggregation types that should not be included
in bot-output messages. (Ex. `credit_card`)
`CartesiaTTSService`:
- Modified use of custom default text_aggregator to avoid deprecation warnings and push users
towards use of transformers or the `LLMTextProcessor`
- Added convenience methods for taking advantage of Cartesia's SSML tags: spell, emotion,
pauses, volume, and speed.
`RimeTTSService`:
- Modified use of custom default text_aggregator to avoid deprecation warnings and push users
towards use of transformers or the `LLMTextProcessor`
- Added convenience methods for taking advantage of Rime's customization options: spell,
pauses, pronunciations, and inline speed control.
Introduced `LLMTextProcessor`: A new processor meant to allow customization for how
LLMTextFrames should be aggregated and considered. It's purpose is to turn
`LLMTextFrame`s into `AggregatedTextFrame`s. By default, a TTSService will still
aggregate `LLMTextFrame`s by sentence for the service to consume. However, if you
wish to override how the llm text is aggregated, you should no longer override the
TTS's internal text_aggregator, but instead, insert this processor between your LLM
and TTS in the pipeline.
This frame introduces an `aggregated_by` field to describe the type of text included
in the frame and allows unspoken groupings of text to be pushed through the pipeline
and treated similar to TTSTextFrames.
Adding support for setting whether or not the text in the TextFrame
should be added to the LLM context (by the LLM assistant aggregator).
Defaults to `True`.
Modified the BaseTextAggregator type so that when text gets aggregated, metadata can
be associated with it. Currently, that just means a `type`, so that the aggregation
can be classified or described. Changes made to support this:
- **IMPORTANT**: Aggregators are now expected to strip leading/trailing white space
characters before returning their aggregation from `aggregation()` or `.text`. This
way all aggregators have a consistent contract allowing downstream use to know how
to stitch aggregations back together
- Introduced a new `Aggregation` dataclass to represent both the aggregated `text` and
a string identifying the `type` of aggregation (ex. "sentence", "word", "my custom
aggregation")
- **BREAKING**: `BaseTextAggregator.text` now returns an `Aggregation` (instead of `str`).
To update: `aggregated_text = myAggregator.text` -> `aggregated_text = myAggregator.text.text`
- **BREAKING**: `BaseTextAggregator.aggregate()` now returns `Optional[Aggregation]`
(instead of `Optional[str]`). To update:
```
aggregation = myAggregator.aggregate(text)
if (aggregation):
print(f"successfully aggregated text: {aggregation.text}") // instead of {aggregation}
```
- `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator` updated to
produce/consume `Aggregation` objects.
- All uses of the above Aggregators have been updated accordingly.
This update ensures that audio is flushed immediately after sending bare text to the WebSocket, improving the responsiveness of the Text-to-Speech service.
This commit introduces the RimeNonJsonTTSService class, enabling Text-to-Speech synthesis over WebSocket endpoints that require plain text messages. The service includes configuration parameters for language, segmentation, and audio settings, and handles WebSocket connections for raw audio byte transmission. Limitations include the lack of support for word-level timestamps and context IDs.
Now:
- For TTS word-by-word output and `TTSSpeakFrames`: `TTSTextFrame`s' have `includes_inter_frame_spaces=False`.
- For all other TTS output: `TTSTextFrame` pass through the received text frames' `includes_inter_frame_spaces` value. So far, this value has always been `True`: LLMs send text chunks already containing all necessary spaces.
- `LLMTextFrame`s set `includes_inter_frame_spaces=False` at init time, per the aforementioned assumption.
* feat: Add ErrorFrame emission to TTS/STT services for pipeline error detection
- Add ErrorFrame emission to all major TTS/STT services during initialization and runtime failures
- Services updated: Cartesia, ElevenLabs, Deepgram, AssemblyAI, Rime, Azure
- ErrorFrame objects emitted with fatal=False for graceful degradation
- Enables on_pipeline_error event handler to detect service failures programmatically
- Add comprehensive pytest test suite to verify ErrorFrame emission
- Fixes issue where services failed gracefully but didn't emit ErrorFrame objects
This allows developers to implement real-time error monitoring and alerting
using the on_pipeline_error event handler introduced in v0.0.90.
* Update STT and TTS services to use consistent error handling pattern
- Improves error handling consistency across all services
* Add changelog entry for STT/TTS error handling improvements
* Linting issues Resolved
* Azure STT ErrorFrames added with consistent patterns
* Cartesia STT and Deepgram STT; additional fixes made
* Removed Fatal Flags across services, removed duplication
* Moving the changelog entry to the correct place.
* Refactoring some classes to use yield instead of push_error directly.
* Fixing ruff format.
---------
Co-authored-by: Filipi Fuchter <filipi87@gmail.com>
- `GoogleHttpTTSService`
- `OpenAITTSService`
The reason I skipped this work in an earlier PR was because these services seemed to be emitting long, punctuation-free text frames. It turns out that the issue was with the LLM prompt, though, resulting in the LLM nondeterministically excluding all punctuation. An upcoming commit will address that prompt issue.
Note that for `LLMTextFrame`s, the right behavior is pretty much always `includes_inter_frame_spaces = True`. I decided *not* to go ahead and make that the default for `LLMTextFrame`s, though, simply to not introduce a subtle behavior change for creative/unexpected use-cases that were relying on text in hand-crafted `LLMTextFrame`s being handled a certain way. Ditto for `TTSTextFrame`s.
Also, fix an issue in `NeuphonicTTSService` where it wasn't pushing `TTSTextFrame`s.
Also, fix the broken `SarvamHttpTTSService` example.
Also, add a couple of missing examples.
* Fix Langfuse tracing for GoogleLLMService with universal LLMContext
- Fixed issue where input appeared as null in Langfuse dashboard for GoogleLLMService
- Added fallback to use adapter's get_messages_for_logging() for universal LLMContext
- Ensures proper message format conversion for Google/Gemini services
- Handles system message conversion to system_instruction format
- Also fixes serialization of empty message lists ([] now serializes correctly)
This fix ensures Langfuse tracing works correctly for Google services using
both OpenAILLMContext/GoogleLLMContext and the universal LLMContext.
* Add unit tests for Langfuse tracing with GoogleLLMService
- Test that tracing correctly captures messages with universal LLMContext
- Test that empty message lists are properly serialized
- Test that adapter's get_messages_for_logging is used instead of context method
- All tests verify that input is correctly added to Langfuse spans
* Fix test mocking to patch opentelemetry.trace.get_tracer correctly
The tests were failing in CI because they were trying to patch
'pipecat.utils.tracing.service_decorators.trace' which doesn't exist as
an attribute. The trace module is imported from opentelemetry, so we need
to patch 'opentelemetry.trace.get_tracer' instead.
* Skip tracing tests when opentelemetry is not installed
The tracing dependencies (opentelemetry) are optional in Pipecat and not
installed in the CI environment. Added a skipif marker to skip these tests
when opentelemetry is not available, preventing CI failures while still
allowing the tests to run when tracing dependencies are installed locally.
* Install tracing dependencies in GitHub Actions CI
Instead of skipping the tracing tests, install the 'tracing' extra
(opentelemetry) in the CI environment so the tests can run properly.
Removed the skipif condition from the tests since opentelemetry will
now be available in CI.
* Use the context type to determine which messages to use, fix tool_count and tools (#3032)
---------
Co-authored-by: Mark Backman <mark@daily.co>
I chose to go the somewhat hacky route of adding the `ToolsSchema` support into the `events.SessionProperties` model itself—even though we should never serialize that type when creating events—because the alternative seemed to be to create a new type for `OpenAIRealtimeLLMService` initialization parameters and then we'd have to contend with backward compatibility, which seemed like a bigger headache.
- If the context contained a system message, that message would be converted to a user message and the LLM would respond
- If the system message was provided as a constructor argument, though, no user messages would be sent to the LLM, and the LLM would therefore not respond
Not adding this fix to the CHANGELOG since `GeminiLiveLLMService`'s ability to properly handle context-provided tools and system instruction hasn't been published yet.
Not adding this fix to the CHANGELOG since `GeminiLiveLLMService`'s ability to properly handle context-provided tools and system instruction hasn't been published yet.
This allows folks to use `MCPClient` alongside the pattern of passing in tools at LLM init time, a pattern supported by speech-to-speech services such as `GeminiLiveLLMService`.
This does a couple of things:
- Makes the `MCPClient` LLM agnostic, setting us up for some upcoming improvements (like making it possible to use with `LLMSwitcher`)
- Makes `GeminiLLMAdapter` more robust, as the schema massaging that was previously only done in `MCPClient` is useful for all tools, not just for MCP-provided ones
Before, when requesting a user image from a function call we had to wait for a
random time before we could indicate the function call was done. This was to
given time to the aggregator to process the image before marking the function
call as completed.
To avoid this, we now wait for the requested image to be received by the LLM
assistant agrgegator (using an asyncio event). Then, we can successfully mark
the function call as completed.
Make `LLMUserAggregator` push the `LLMSetToolsFrame`s, in case a speech-to-speech service that needs to handle the frame itself—like `OpenAIRealtimeLLMService`—is downstream. As far as I can tell, pushing `LLMSetToolsFrame` should otherwise have no unwanted side effects.
Add `LLMContext.get_messages_for_persistent_storage()` for compatibility with `OpenAILLMContext`, to avoid tripping up users who we're unknowingly migrating to using `LLMContext`.
Add back file that was removed, when it should've just been deprecated.
Also, fix version numbers in deprecation messages to match the next expected release.
Update `create_context_aggregator()` (which we're keeping around for backward compatibility) to create a `LLMContextAggregatorPair` rather than OpenAI-Realtime-specific aggregators.
Implement sending tool call results to the OpenAI server based on reading context updates. This lets us use the normal assistant context aggregator and not a special OpenAI Realtime subclass that pushes up a special frame for function call results.
Receiving a new context (via a context frame) no longer serves as a signal to reset the conversation. That’s because we’re now receiving new contexts from the user aggregator every time new messages are added, and from the assistant aggregator when function call results come in. The code pattern we're heading towards, of “diffing” each new context with the previous on, sets us up for doing more sophisticated things in the future, like sending specific messages to OpenAI to edit its internally-tracked context.
Also, remove code that was directly modifying context.
Push `TranscriptionFrame`s upstream, to be handled by the user context aggregator. This will require at least a couple of other changes:
- Updating examples to put transcript processors upstream from `OpenAIRealtimeLLMService`
- Maybe figuring out a way to preserve backward compatibility with existing pipelines that put transcript processors downstream from `OpenAIRealtimeLLMService`
- Updating `OpenAIRealtimeLLMService` to ignore new received context frames, since the upstream user context aggregator will generate those after each newly-added user message; hopefully nobody was reliant on the old behavior of resetting the conversation upon receiving a new context!
Avoid pushing `LLMTextFrame` when `OpenAIRealtimeLLMService` is configured to output audio. This avoids duplicate text in assistant messages in context. Conceptually, a speech-to-speech service encapsulates TTS behavior; in a "traditional" pipeline, `LLMTextFrames` are swallowed by the TTS service, so they should similarly not be pushed by a speech-to-speech service. Only. `TTSTextFrame`s should be pushed.
Properly bookended responses now work with:
- AUDIO modality (validated with 26b example)
- TEXT modality (validated with 26d example)
- AUDIO modality with Vertex AI (validated with 26h example)
It doesn't seem that TEXT modality is supported with Vertex AI, hence the missing "quadrant" of validation.
- Change GenerationConfig from dataclass to Pydantic BaseModel for consistency
- Simplify _build_msg() to use model_dump(exclude_none=True) instead of manual field extraction
- Simplify HTTP run_tts() to use model_dump(exclude_none=True) instead of manual field extraction
This addresses feedback from code review and reduces code duplication.
Add GenerationConfig dataclass with volume, speed, and emotion parameters
for Cartesia Sonic-3 TTS models. This enables fine-grained control over
speech generation including volume (0.5-2.0), speed (0.6-1.5), and
emotion (60+ options).
Changes:
- Add GenerationConfig dataclass with proper Google-style docstrings
- Update CartesiaTTSService.InputParams to include generation_config
- Update CartesiaHttpTTSService.InputParams to include generation_config
- Modify _build_msg() to include generation_config in WebSocket messages
- Modify run_tts() to include generation_config in HTTP requests
- Maintain backward compatibility with existing speed and emotion parameters
The legacy speed (literal strings) and emotion (list) parameters remain
available for non-Sonic-3 models.
This wasn't really an issue before, when folks were *knowingly* migrating from `OpenAILLMContext` to `LLMContext`. But in the latest AWS Nova Sonic change, we're swapping it out from under folks, so this kind of compatibility is more important.
For context, the reason we *didn't* offer the `messages` property earlier was to aid in the development of `LLMContext`—we wanted to draw attention to all the places where messages were being read from context, so we could find the places where we might need to pass an argument to the read.
The reason for its `system_instruction` argument was to support usage with LLMs where you might pass the system instruction as a parameter to the `LLMService` rather than specifying it in the context.
But as I thought about it more I became unconvinced that the `system_instruction` argument was really beneficial:
- If you specified your system instruction in your context in the first place, it'll still be there when you read messages for persistent storage
- If you didn't specify your system instruction in the context and instead passed it in as an `LLMService` parameter, you most likely *don't* want it to be in the context when you read messages for persistent storage
- ...and if you really really do need to inject it at the start of the context, it's quite easy to do anyway
And if we remove the `system_instruction` argument from `get_messages_for_persistent_storage()`, then it's essentially just `get_messages()`.
Also fix the slightly wrong (but so far harmless) pattern of initializing `OpenAILLMService.InputParams()` in the `GoogleVertexLLMService` if `params` wasn't provided—we should be letting the superclass decide what to do if the argument isn't specified.
- Conceptually, these args comprise project-level setup, akin to credentials, whereas everything in `InputParams` is concerned with model configuration
- Providing a `project_id` when initializing `GoogleVertexLLMService` should not be optional, but prior to the change in this commit it was (erroneously) treated as optional by dint of `InputParams` being optional
This improvement was discussed [in this comment](https://github.com/pipecat-ai/pipecat/pull/2795#discussion_r2408279142).
To understand this fix, let's look exclusively at `pause_processing_frames()` (`pause_processing_system_frames()` works the same way).
`pause_processing_frames()` works by setting a `__should_block_frames` flag, which is then read each time through the loop in the long-running `__process_frame_task_handler`. if `__should_block_frames` is `True`, it pauses processing frames until it's resumed.
Prior to this fix, the check for `__should_block_frames` was before `await self.__process_queue.get()`. The problem is that a lot of the time spent in the loop is waiting for a frame from the process queue. So if `pause_processing_frames()` is set at any time other than within `process_frame()` itself, it actually won't have an effect by the next frame, only on the frame *after* the next, which is later than intended.
Because thus far in the Pipecat codebase we've only ever called `pause_processing_frames()` and `pause_processing_system_frames()` from within `process_frame()`, this change should have no behavioral effect. But it will be helpful if we ever need to call it from anywhere else. I noticed this issue while developing a feature that did exactly that (though I later abandoned that code).
It expected a WebSocket URL, but we're no longer (directly) using WebSockets to talk to Gemini. Instead of trying to (potentially erroneously) map a given custom WebSocket URL to an `HttpOptions` object (the new preferred way of customizing requests made by the Gemini API client), we're simply deprecating `base_url` and pointing users to the `http_options` argument instead.
We move the thread creation to the VADAnalyzer instead of the input
transport. This can potentially be useful if we need to analyze multiple audio
streams.
- Usage in classes that are already deprecated
- Usage related to realtime LLMs, which don't yet support `LLMContext`
- Usage in (soon-to-be-deprecated) code paths related to `OpenAILLMContext` itself and associated machinery
Note that `LLMContext` doesn't have a `get_messages_for_persistent_storage()`; the messages are already in the "standard" format so they can be used directly for storage.
NOTE: oops! Turns out some of these files had *already* been updated to use universal `LLMContext` even though they weren't yet using `ToolsSchema`. This commit should fix those examples.
With all these examples updated, we no longer need dedicated examples illustrating `LLMContext`, so they're removed.
Here’s where we *don’t* yet use `LLMContext` and associated machinery:
- Realtime services: OpenAI Realtime, Gemini Live, and AWS Nova Sonic (support coming soon)
- `GoogleLLMOpenAIBetaService` (it’s deprecated, so we didn’t bother adding support)
- `LLMLogObserver` (support coming soon)
- `GatedOpenAILLMContextAggregator` (support coming soon)
- `LangchainProcessor` (support coming soon)
- `Mem0MemoryService` (support coming soon)
- Examples that use LLM-specific tools definitions as opposed to `ToolsSchema` (these will be updated soon)
- Examples that rely `GoogleLLMContext.upgrade_to_google` (TBD what to do with these)
Examples that use `LLMLogObserver`:
- 30-
Examples that use `GatedOpenAILLMContextAggregator`:
- 22-
Examples that use `LangchainProcessor`:
- 07b-
Examples that use `Mem0MemoryService`:
- 37-
Examples that need updating to use `ToolsSchema`:
- 15-
- 15a-
- 20a-
- 20c-
- 20d-
- 22b-
- 22c-
- 33-
- 36-
Examples that use `GoogleLLMContext.upgrade_to_google`:
- 22d-
- 25-
- Document the new global location support in GoogleVertexLLMService
- Explain the difference between regional and global API hosts
- Follow Keep a Changelog format
Immediate is the "default", i.e. has the more obvious name (e.g. `ManuallySwitchServiceFrame` v `ManuallySwitchServiceControlFrame`), since that's *probably* what users will want to reach for. Also, the immediate frames are more likely to behave like what we had before the last few commits, where the service switch would always "jump the queue" by having an immediate effect once it hit the `ServiceSwitcher` in the pipeline, jumping ahead of frames in front of it destined for the service.
- A text frame
- A `ManuallySwitchServiceFrame` (which is a `ServiceSwitcherFrame`)
- Another text frame
And expect that the first text frame be handled by the initially active service and the second text frame be handled by the newly active one.
Previously, the `ManuallySwitchServiceFrame` would have an effect too early, causing both text frames to be handled by the newly active service. Why? Because the frame filtering condition was being updated *directly* by the `ServiceSwitcher`, which is upstream from the services it's switching between. It could therefore update the filters *before* the services received the prior frames.
- Update _get_base_url method to handle 'global' location case
- Use 'aiplatform.googleapis.com' for global locations
- Use '{location}-aiplatform.googleapis.com' for regional locations
- Maintains backward compatibility with existing regional endpoints
* Handle missing rawResponse in transcription messages
- Use message.get('rawResponse', {}) to safely access rawResponse field
- Default is_final to False when rawResponse is missing
- Add proper type annotations for better code clarity
- Minor import formatting cleanup
This prevents KeyError crashes when transcription messages from Daily's API
don't include the rawResponse field in edge cases.
* docs: add changelog line
Removing `VisionImageRawFrame` lets us simplify LLM services' logic, getting us closer to the idealized architecture where all they care about is handling context frames.
This change is in service of getting us closer to ready to deprecate usage of `OpenAILLMContext` and subclasses in favor of the universal `LLMContext`, at least for the traditional text-to-text LLMs.
Why remove `VisionImageRawFrame` rather than deprecate? It's "internal"—only created by `VisionImageFrameAggregator`—and never intended to be used directly by users (it would be difficult to use directly anyway).
Move the logic that was once in `VisionImageFrameAggregator` directly into the examples. Reasoning:
- If `UserImageRequester` is defined in the examples, it makes sense for `UserImageProcessor` to be too, as it’s the flip side of the same coin, so to speak
- The logic is now pretty trivial
- This kind of one-shot, history-less image-describing pipeline shouldn't be common at all; it's ok for it to live in examples rather than as a dedicated class
- In the short term, this enables us to create `LLMContext`s for services that support it and `OpenAILLMContext`s for services that don't yet (AWS)
This commit also adds missing translation from OpenAI-format image context messages to AWS format. Note that this isn't a wasted effort in the face of the upcoming migration to universal `LLMContext`—this work will be reused as it has to be implemented there too.
This lets a text frame bypass TTS while still being included in the LLM
context. Useful for cases like structured text that isn’t meant to be spoken but
should still contribute to context.
Some implementations were returing a list and some were returning a JSON
string. They should all return a list and the user would decide if it wants to
transform that into JSON.
* Added Sarvam TTS Websocket Implementation
* Addressed some of the comments on PR
* added change voice logic
* added changes from main
* pushing text frames and added flush audio
* updated docs string for better docs
* Addressed comments and added some improvements
* pushed optional args down
* removed new line
* made aiohttp session mandatory in http service
* added push frame and removed unused function
* removed pong message
* added disconnecting logic
---------
Co-authored-by: vinayak-sarvam <vinayak@sarvam.ai>
1. `ToolsSchema` has been supported in `LLMSetToolsFrame` for a while but wasn't properly reflected in these type hints
2. The new universal `LLMContext` expects tools to be either a `ToolsSchema` or `NOT_GIVEN`.
This abstraction will allow us to update Pipecat Flows to avoid reaching into LLM service internals to generate summaries.
In addition to being a helpful refactor to remove a fragile part of Pipecat Flows, this change helps set the stage for supporting the upcoming `LLMSwitcher`, where the “active” LLM will only be determined at runtime—today, Pipecat Flows needs to know ahead of time what type of LLM it’s working with, to load an LLM-specific “adapter” that does the work of generating summaries, among other things.
- Add support for LLM-specific messages in the universal `LLMContext`, to enable using LLM-specific functionality while still using the universal LLM context
Watchdog timers have been removed. They were introduced in 0.0.72 to help
diagnose pipeline freezes. Unfortunately, they proved ineffective since they
required developers to use Pipecat-specific queues, iterators, and events to
correctly reset the timer, which limited their usefulness and added friction.
This patch uses `wait_for2` package to implement `asyncio.wait_for()` for
Python < 3.12.
In Python 3.12, `asyncio.wait_for()` is implemented in terms of
`asyncio.timeout()` which fixed a bunch of issues. However, this was never
backported (because of the lack of `async.timeout()`) and there are still many
remainig issues, specially in Python 3.10, in `async.wait_for()`.
See https://github.com/python/cpython/pull/98518
We don't want to set `last_frame_time` on other frames like `HeartBeatFrame`, `LLMGeneratedTextFrame`, `InterruptionFrames` so that we can calculate `diff_time` and compare it against `vad_stop_secs` properly
We now force each inserted item in the priority queue to be a tuple and the
actual value to be last in the tuple. All the previous values in the tuple also
need to be numeric.
- Add timeout (default 5.0s) and retry_on_timeout parameters to constructor
- Implement timeout/retry logic in get_chat_completions using asyncio.wait_for
- Extract build_chat_completion_params() as public method for subclass customization
We need to increment the counters before the await otherwise we could go to a
different task that could add an item with the same counter.
Also, we need to handle non-frame items as well.
- `{PR_NUMBER}.added.2.md`, `{PR_NUMBER}.added.3.md` - for additional entries of the same type
- `{PR_NUMBER}.changed.md` - for changes to existing functionality
- `{PR_NUMBER}.fixed.md` - for bug fixes
- `{PR_NUMBER}.deprecated.md` - for deprecations
- `{PR_NUMBER}.removed.md` - for removed features
- `{PR_NUMBER}.security.md` - for security fixes
- `{PR_NUMBER}.performance.md` - for performance improvements
- `{PR_NUMBER}.other.md` - for other changes
4. Each changelog file should at least contain a main single line starting with `- ` followed by a clear description of the change. No line wrapping.
5. If the change is complicated, changelog files can have indented lines after the main line with additional details or code samples.
6. Use ⚠️ emoji prefix for breaking changes.
7. **Write changes in user-facing terms first.** Lead with what users of the framework will notice: new APIs, changed behavior, new parameters, fixed bugs they might have hit, etc. Implementation details (internal refactoring, how something is wired up under the hood) can be included as secondary context after the user-facing description, but should never be the *only* content of a changelog entry when there is a user-visible effect.
**Good** (user-facing first, implementation detail as context):
```
- Turn completion instructions now persist correctly across full context updates when using `system_instruction`. Previously they were injected as a context system message, which caused warning spam and didn't survive context updates.
```
**Bad** (implementation detail only, no user-facing framing):
```
- Fixed turn completion instructions being injected as a context system message instead of using `system_instruction`.
```
Ask yourself: "If I'm a developer building on Pipecat, what would I notice changed?" Start there.
## Example
For PR #3519 with a new feature and a bug fix:
`changelog/3519.added.md`:
```
- Added `SomeNewFeature` for doing something useful.
```
`changelog/3519.fixed.md`:
```
- Fixed an issue where something was not working correctly in some user-visible scenario. The root cause was an internal implementation detail.
description: Review, refactor, document, and validate code changes in the current branch
---
# Code Cleanup Skill
The **Code Cleanup Skill** reviews, refactors, and documents code changes in your current branch, ensuring alignment with **Pipecat's architecture, coding standards, and example patterns**.
It focuses on **readability, correctness, performance, and consistency**, while avoiding breaking changes.
---
## Skill Overview
This skill analyzes all changes introduced in your branch and performs the following actions:
1.**Analyze Branch Changes**
- Review uncommitted changes and outgoing commits
2.**Refactor for Readability**
- Improve clarity, naming, structure, and modern Python usage
**Agent assumptions (applies to all agents and subagents):**
- All tools are functional and will work without error. Do not test tools or make exploratory calls. Make sure this is clear to every subagent that is launched.
- Only call a tool if it is required to complete the task. Every tool call should have a clear purpose.
To do this, follow these steps precisely:
1. Launch a haiku agent to check if any of the following are true:
- The pull request is closed
- The pull request is a draft
- The pull request does not need code review (e.g. automated PR, trivial change that is obviously correct)
- Claude has already commented on this PR (check `gh pr view <PR> --comments` for comments left by claude)
If any condition is true, stop and do not proceed.
Note: Still review Claude generated PR's.
2. Launch a haiku agent to return a list of file paths (not their contents) for all relevant CLAUDE.md files including:
- The root CLAUDE.md file, if it exists
- Any CLAUDE.md files in directories containing files modified by the pull request
3. Launch a sonnet agent to view the pull request and return a summary of the changes
4. Launch 4 agents in parallel to independently review the changes. Each agent should return the list of issues, where each issue includes a description and the reason it was flagged (e.g. "CLAUDE.md adherence", "bug"). The agents should do the following:
Agents 1 + 2: CLAUDE.md compliance sonnet agents
Audit changes for CLAUDE.md compliance in parallel. Note: When evaluating CLAUDE.md compliance for a file, you should only consider CLAUDE.md files that share a file path with the file or parents.
Agent 3: Opus bug agent (parallel subagent with agent 4)
Scan for obvious bugs. Focus only on the diff itself without reading extra context. Flag only significant bugs; ignore nitpicks and likely false positives. Do not flag issues that you cannot validate without looking at context outside of the git diff.
Agent 4: Opus bug agent (parallel subagent with agent 3)
Look for problems that exist in the introduced code. This could be security issues, incorrect logic, etc. Only look for issues that fall within the changed code.
**CRITICAL: We only want HIGH SIGNAL issues.** Flag issues where:
- The code will fail to compile or parse (syntax errors, type errors, missing imports, unresolved references)
- The code will definitely produce wrong results regardless of inputs (clear logic errors)
- Clear, unambiguous CLAUDE.md violations where you can quote the exact rule being broken
Do NOT flag:
- Code style or quality concerns
- Potential issues that depend on specific inputs or state
- Subjective suggestions or improvements
If you are not certain an issue is real, do not flag it. False positives erode trust and waste reviewer time.
In addition to the above, each subagent should be told the PR title and description. This will help provide context regarding the author's intent.
5. For each issue found in the previous step by agents 3 and 4, launch parallel subagents to validate the issue. These subagents should get the PR title and description along with a description of the issue. The agent's job is to review the issue to validate that the stated issue is truly an issue with high confidence. For example, if an issue such as "variable is not defined" was flagged, the subagent's job would be to validate that is actually true in the code. Another example would be CLAUDE.md issues. The agent should validate that the CLAUDE.md rule that was violated is scoped for this file and is actually violated. Use Opus subagents for bugs and logic issues, and sonnet agents for CLAUDE.md violations.
6. Filter out any issues that were not validated in step 5. This step will give us our list of high signal issues for our review.
7. If issues were found, skip to step 8 to post comments.
If NO issues were found, post a summary comment using `gh pr comment` (if `--comment` argument is provided):
"No issues found. Checked for bugs and CLAUDE.md compliance."
8. Create a list of all comments that you plan on leaving. This is only for you to make sure you are comfortable with the comments. Do not post this list anywhere.
9. Post inline comments for each issue using `gh pr review` with inline comments. For each comment:
- Provide a brief description of the issue
- For small, self-contained fixes, include a committable suggestion block
- For larger fixes (6+ lines, structural changes, or changes spanning multiple locations), describe the issue and suggested fix without a suggestion block
- Never post a committable suggestion UNLESS committing the suggestion fixes the issue entirely. If follow up steps are required, do not leave a committable suggestion.
**IMPORTANT: Only post ONE comment per unique issue. Do not post duplicate comments.**
Use this list when evaluating issues in Steps 4 and 5 (these are false positives, do NOT flag):
- Pre-existing issues
- Something that appears to be a bug but is actually correct
- Pedantic nitpicks that a senior engineer would not flag
- Issues that a linter will catch (do not run the linter to verify)
- General code quality concerns (e.g., lack of test coverage, general security issues) unless explicitly required in CLAUDE.md
- Issues mentioned in CLAUDE.md but explicitly silenced in the code (e.g., via a lint ignore comment)
Notes:
- Use gh CLI to interact with GitHub (e.g., fetch pull requests, create comments). Do not use web fetch.
- Create a todo list before starting.
- You must cite and link each issue in inline comments (e.g., if referring to a CLAUDE.md, include a link to it).
- If no issues are found, post a comment with the following format:
---
## Code review
No issues found. Checked for bugs and CLAUDE.md compliance.
---
- When linking to code in inline comments, follow the following format precisely, otherwise the Markdown preview won't render correctly: `https://github.com/OWNER/REPO/blob/FULL_SHA/path/to/file.py#L10-L15`
- Requires full git sha
- You must provide the full sha. Commands like `https://github.com/owner/repo/blob/$(git rev-parse HEAD)/foo/bar` will not work, since your comment will be directly rendered in Markdown.
- Repo name must match the repo you're code reviewing
- # sign after the file name
- Line range format is L[start]-L[end]
- Provide at least 1 line of context before and after, centered on the line you are commenting about (eg. if you are commenting about lines 5-6, you should link to `L4-7`)
description: Document a Python module and its classes using Google style
---
Document a Python module or class using Google-style docstrings following project conventions. The argument can be a class name or a module path.
## Instructions
1. Determine what to document based on the argument:
**If a module path is provided** (e.g. `src/pipecat/audio/vad/vad_analyzer.py`):
- Use that file directly
**If a class name is provided** (e.g. `VADAnalyzer`):
- Search for `class ClassName` in `src/pipecat/`
- If multiple files contain that class name, list all matches with their file paths, ask the user which one they want to document, and wait for confirmation
2. Once the file is identified, read the module to understand its structure:
- Identify all classes, functions, and important type aliases
- **Already documented code** - If a class, method, or function already has a complete docstring that follows the project style, do not modify it. A docstring is complete if it has:
- A one-line summary
- Args section (if it has parameters)
- Returns section (if it returns something meaningful)
- Only add or improve documentation where it is missing or incomplete
## Module Docstring Format
```python
"""[One-line description of module purpose].
[Optional: Longer explanation of functionality, key classes, or use cases.]
"""
```
Example:
```python
"""Neuphonic text-to-speech service implementations.
This module provides WebSocket and HTTP-based integrations with Neuphonic's
text-to-speech API for real-time audio synthesis.
"""
```
## Class Docstring Format
```python
classClassName:
"""One-line summary describing what the class does.
[Longer description explaining purpose, behavior, and key features.
Use action-oriented language.]
[Optional: Event handlers, usage notes, or important caveats.]
"""
```
Example:
```python
classFrameProcessor(BaseObject):
"""Base class for all frame processors in the pipeline.
Frame processors are the building blocks of Pipecat pipelines, they can be
linked to form complex processing pipelines. They receive frames, process
them, and pass them to the next or previous processor in the chain.
Event handlers available:
- on_before_process_frame: Called before a frame is processed
- on_after_process_frame: Called after a frame is processed
Note: When listing event handlers, do NOT use backticks. Include an `Example::` section (with double colon for Sphinx) showing the decorator pattern and function signature for each event.
4. Generate or update the PR description with these sections:
## PR Description Format
### Summary (always include)
Brief bullet points describing what changed and why. Focus on the *purpose* and *impact*, not implementation details.
```markdown
## Summary
- Added X to enable Y
- Fixed bug where Z would happen
- Refactored W for better maintainability
```
### Breaking Changes (include only if applicable)
Document any changes that affect existing users or APIs.
```markdown
## Breaking Changes
-`ClassName.method()` now requires a `param` argument
- Removed deprecated `old_function()` - use `new_function()` instead
```
### Testing (include when non-obvious)
How to verify the changes work. Skip for trivial changes.
```markdown
## Testing
- Run `uv run pytest tests/test_feature.py` to verify the fix
- Example usage: `uv run examples/new_feature.py`
```
### Fixes (include if issues are provided or found in commits)
List issues this PR fixes. GitHub will automatically close these issues when the PR is merged.
```markdown
## Fixes
- Fixes #123
- Fixes #456
```
Note: Use "Fixes #X" format (not "Closes" or "Resolves") for consistency. Each issue should be on its own line with "Fixes" to ensure GitHub auto-closes them.
## Guidelines
- **Be concise** - Reviewers should understand the PR in 30 seconds
- **Focus on why** - The diff shows *what* changed, explain *why*
- **Skip empty sections** - Only include sections that have content
- **Use bullet points** - Easier to scan than paragraphs
- **Don't duplicate the diff** - Avoid listing every file or line changed
## Example Output
```markdown
## Summary
- Added `/docstring` skill for documenting Python modules with Google-style docstrings
- Skill finds classes by name and handles conflicts when multiple matches exist
- Skips already-documented code to avoid unnecessary changes
description: Update documentation pages to match source code changes on the current branch
---
Update documentation pages to reflect source code changes on the current branch. Analyzes the diff against main, maps changed source files to their corresponding doc pages, and makes targeted edits.
## Arguments
```
/update-docs [DOCS_PATH]
```
-`DOCS_PATH` (optional): Path to the docs repository root. If not provided, ask the user.
Examples:
-`/update-docs /Users/me/src/docs`
-`/update-docs`
## Instructions
### Step 1: Resolve docs path
If `DOCS_PATH` was provided as an argument, use it. Otherwise, ask the user for the path to their docs repository.
Verify the path exists and contains `server/services/` subdirectory.
### Step 2: Create docs branch
Get the current pipecat branch name:
```bash
git rev-parse --abbrev-ref HEAD
```
In the docs repo, create a new branch off main with a matching name:
```bash
cd DOCS_PATH && git checkout main && git pull && git checkout -b {branch-name}-docs
```
For example, if the pipecat branch is `feat/new-service`, the docs branch becomes `feat/new-service-docs`.
All doc edits in subsequent steps are made on this branch.
Ignore `__init__.py`, `__pycache__`, test files, and files that only contain type re-exports.
### Step 4: Map source files to doc pages
For each changed source file, find the corresponding doc page. Read the mapping file at `.claude/skills/update-docs/SOURCE_DOC_MAPPING.md` and apply its tiered lookup: tier 1 (known exceptions) → tier 2 (pattern matching) → tier 3 (search fallback). **First match wins.**
### Step 5: Analyze each source-doc pair
For each mapped pair:
1.**Read the full source file** to understand current state
2.**Read the diff** for that file: `git diff main..HEAD -- <source_file>`
3.**Read the current doc page** in full
Identify what changed by comparing source to docs:
- **Constructor parameters**: Compare `__init__` signature to the Configuration section's `<ParamField>` entries
- **InputParams fields**: Compare `InputParams(BaseModel)` class fields to the InputParams table
- **Event handlers**: Compare `_register_event_handler` calls and event handler definitions to Event Handlers section
- **Behavioral changes**: Check if Notes section needs updating
### Step 6: Make targeted edits
For each doc page that needs updates, edit **only the sections that need changes**. Preserve all other content exactly as-is.
#### Rules
- **Never remove content** unless the corresponding source code was removed
- **Never rewrite sections** that are already accurate
- **Match existing formatting** — if the page uses `<ParamField>` tags, use them; if it uses tables, use tables
- **Keep descriptions concise** — match the tone and length of surrounding content
- **Preserve CardGroup, links, and examples** unless they reference removed functionality
- **Don't touch frontmatter** unless the class was renamed
#### Section-specific guidance
**Configuration** (constructor params):
- Use `<ParamField path="name" type="type" default="value">` format if the page already uses it
- Add new params in logical order (required first, then optional)
- Remove params that no longer exist in source
- Update types/defaults that changed
**InputParams** (runtime settings):
- Use markdown table format: `| Parameter | Type | Default | Description |`
- Match the field names and types from the `InputParams(BaseModel)` class
- Include the default values from the source
**Usage** (code examples):
- Update import paths, class names, and parameter names
- Only modify examples if they would break or be misleading with the new API
- Don't rewrite working examples just to add new optional params
**Notes**:
- Add notes for new behavioral gotchas or breaking changes
- Remove notes about limitations that were fixed
- Keep existing notes that are still accurate
**Event Handlers**:
- Update the event table and example code
- Add new events, remove deleted ones
- Update handler signatures if they changed
**Overview / Key Features / Prerequisites**:
- Only update if the PR fundamentally changes what the service does (new capability, removed capability, renamed class)
- Most PRs will NOT need changes to these sections
### Step 7: Update guides
Guides at `DOCS_PATH/guides/` reference specific class names, parameters, imports, and code patterns. After completing reference doc edits, check if any guides need updates too.
For each changed source file, collect the class names, renamed parameters, and changed imports from the diff. Search the guides directory:
After processing all mapped pairs, check for two kinds of gaps:
**Missing pages**: Source files that had no doc page mapping (neither tier 1, 2, nor 3) and are not marked as "(skip)". For each, tell the user:
- The source file path
- The main class(es) it defines
- Whether a new doc page should be created
**Missing sections**: Mapped doc pages that are missing standard sections compared to the source. For example, a transport page with no Configuration section, or a service page with no InputParams table when the source defines `InputParams(BaseModel)`. Flag these and offer to add the missing sections.
If the user wants a new page, do all three of the following:
#### 8a: Create the doc page
Create the new `.mdx` file using this template structure:
```
---
title: "Service Name"
description: "Brief description"
---
## Overview
[Description from class docstring or source analysis]
<CardGroup cols={2}>
[Cards for API reference and examples if available]
</CardGroup>
## Installation
```bash
pip install "pipecat-ai[package-name]"
```
## Prerequisites
[Environment variables and account setup]
## Configuration
[ParamField entries for constructor params]
## InputParams
[Table of InputParams fields, if the service has them]
## Usage
### Basic Setup
```python
[Minimalworkingexample]
```
## Notes
[Important caveats]
## Event Handlers
[Event table and example code]
```
#### 8b: Add to docs.json
Add the new page path to `DOCS_PATH/docs.json` in the correct navigation group. The path format is `server/services/{category}/{provider}` (without the `.mdx` extension).
Find the matching group in the navigation structure:
- **STT** → `"group": "Speech-to-Text"` under Services
- **TTS** → `"group": "Text-to-Speech"` under Services
- **LLM** → `"group": "LLM"` under Services
- **S2S** → `"group": "Speech-to-Speech"` under Services
- **Transport** → `"group": "Transport"` under Services
- **Serializer** → `"group": "Serializers"` under Services
- **Image generation** → `"group": "Image Generation"` under Services
- **Video** → `"group": "Video"` under Services
- **Memory** → `"group": "Memory"` under Services
- **Vision** → `"group": "Vision"` under Services
- **Analytics** → `"group": "Analytics & Monitoring"` under Services
Insert the new entry **alphabetically** within the group's `pages` array. For example, adding a new STT service "foo":
```json
{
"group": "Speech-to-Text",
"pages": [
"server/services/stt/assemblyai",
"server/services/stt/aws",
...
"server/services/stt/foo",
...
]
}
```
#### 8c: Add to supported-services.mdx
Add a new row to the correct category table in `DOCS_PATH/server/services/supported-services.mdx`.
- **DisplayName**: Use the service's human-readable name (e.g., "ElevenLabs", "AWS Polly", "Google Gemini")
- **package**: Look at the service's `pyproject.toml` extras or the import pattern in the source code. For example, if the service is in `src/pipecat/services/foo/`, the package is typically `foo`.
- If no pip dependencies are required, use `No dependencies required` instead.
Insert the new row **alphabetically** within the table. Match the column alignment of the existing rows.
- `guides/learn/speech-to-text.mdx` — Updated code example (renamed `old_param` → `new_param`)
### New service pages
- `server/services/tts/newprovider.mdx` — Created page, added to docs.json (Text-to-Speech), added to supported-services.mdx
### Unmapped source files
- `src/pipecat/services/newprovider/tts.py` — NewProviderTTSService (no doc page exists)
### Skipped files
- `src/pipecat/services/ai_service.py` — internal base class
```
## Guidelines
- **Be conservative** — only change what the diff warrants. Don't "improve" docs beyond what changed in source.
- **Read before editing** — always read the full doc page before making changes so you understand the existing structure.
- **Preserve voice** — match the writing style of the existing doc page, don't impose a different tone.
- **One PR at a time** — this skill operates on the current branch's diff against main. Don't look at other branches.
- **Parallel analysis** — when multiple source files map to different doc pages, analyze and edit them in parallel for efficiency.
- **Shared source files** — files like `services/google/google.py` are shared bases. Check which services import from them and update all affected doc pages.
## Checklist
Before finishing, verify:
- [ ] All changed source files were checked against the mapping table
- [ ] Each doc page edit matches the actual source code change (not guessed)
- [ ] No content was removed unless the corresponding source was removed
- [ ] New parameters have accurate types and defaults from source
- [ ] Formatting matches the existing page style
- [ ] Guides referencing changed APIs were checked and updated
- [ ] New service pages were added to `docs.json` in the correct group, alphabetically
- [ ] New service pages were added to `supported-services.mdx` in the correct table, alphabetically
This file provides guidance to AI coding agents when working with code in this repository.
## Project Overview
Pipecat is an open-source Python framework for building real-time voice and multimodal conversational AI agents. It orchestrates audio/video, AI services, transports, and conversation pipelines using a frame-based architecture.
## Common Commands
```bash
# Setup development environment
uv sync --group dev --all-extras --no-extra gstreamer --no-extra local
# Install pre-commit hooks
uv run pre-commit install
# Run all tests
uv run pytest
# Run a single test file
uv run pytest tests/test_name.py
# Run a specific test
uv run pytest tests/test_name.py::test_function_name
# Preview changelog
uv run towncrier build --draft --version Unreleased
All data flows as **Frame** objects through a pipeline of **FrameProcessors**:
```
[Processor1] → [Processor2] → ... → [ProcessorN]
```
**Key components:**
- **Frames** (`src/pipecat/frames/frames.py`): Data units (audio, text, video) and control signals. Flow DOWNSTREAM (input→output) or UPSTREAM (acknowledgments/errors).
- **FrameProcessor** (`src/pipecat/processors/frame_processor.py`): Base processing unit. Each processor receives frames, processes them, and pushes results downstream.
- **ParallelPipeline** (`src/pipecat/pipeline/parallel_pipeline.py`): Runs multiple pipelines in parallel.
- **Transports** (`src/pipecat/transports/`): Transports are frame processors used for external I/O layer (Daily WebRTC, LiveKit WebRTC, WebSocket, Local). Abstract interface via `BaseTransport`, `BaseInputTransport` and `BaseOutputTransport`.
- **Pipeline Task (`src/pipecat/pipeline/task.py`)**: Runs and manages a pipeline. Pipeline tasks send the first frame, `StartFrame`, to the pipeline in order for processors to know they can start processing and pushing frames. Pipeline tasks internally create a pipeline with two additional processors, a source processor before the user-defined pipeline and a sink processor at the end. Those are used for multiple things: error handling, pipeline task level events, heartbeat monitoring, etc.
- **Pipeline Runner (`src/pipecat/pipeline/runner.py`)**: High-level entry point for executing pipeline tasks. Handles signal management (SIGINT/SIGTERM) for graceful shutdown and optional garbage collection. Run a single pipeline task with `await runner.run(task)` or multiple concurrently with `await asyncio.gather(runner.run(task1), runner.run(task2))`.
- **Services** (`src/pipecat/services/`): 60+ AI provider integrations (STT, TTS, LLM, etc.). Extend base classes: `AIService`, `LLMService`, `STTService`, `TTSService`, `VisionService`.
- **Serializers** (`src/pipecat/serializers/`): Convert frames to/from wire formats for WebSocket transports. `FrameSerializer` base class defines `serialize()` and `deserialize()`. Telephony serializers (Twilio, Plivo, Vonage, Telnyx, Exotel, Genesys) handle provider-specific protocols and audio encoding (e.g., μ-law).
- **RTVI** (`src/pipecat/processors/frameworks/rtvi.py`): Real-Time Voice Interface protocol bridging clients and the pipeline. `RTVIProcessor` handles incoming client messages (text input, audio, function call results). `RTVIObserver` converts pipeline frames to outgoing messages: user/bot speaking events, transcriptions, LLM/TTS lifecycle, function calls, metrics, and audio levels.
- **Observers** (`src/pipecat/observers/`): Monitor frame flow without modifying the pipeline. Passed to `PipelineTask` via the `observers` parameter. Implement `on_process_frame()` and `on_push_frame()` callbacks.
### Important Patterns
- **Context Aggregation**: `LLMContext` accumulates messages for LLM calls; `UserResponse` aggregates user input
- **Turn Management**: Turn management is done through `LLMUserAggregator` and
`LLMAssistantAggregator`, created with `LLMContextAggregatorPair`
- **User turn strategies**: Detection of when the user starts and stops speaking is done via user turn start/stop strategies. They push `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` respectively.
- **Interruptions**: Interruptions are usually triggered by a user turn start strategy (e.g. `VADUserTurnStartStrategy`) but they can be triggered by other processors as well, in which case the user turn start strategies don't need to. An `InterruptionFrame` carries an optional `asyncio.Event` that is set when the frame reaches the pipeline sink. If a processor stops an `InterruptionFrame` from propagating downstream (i.e., doesn't push it), it **must** call `frame.complete()` to avoid stalling `push_interruption_task_frame_and_wait()` callers.
- **Uninterruptible Frames**: These are frames that will not be removed from internal queues even if there's an interruption. For example, `EndFrame` and `StopFrame`.
- **Events**: Most classes in Pipecat have `BaseObject` as the very base class. `BaseObject` has support for events. Events can run in the background in an async task (default) or synchronously (`sync=True`) if we want immediate action. Synchronous event handlers need to execute fast.
- **Async Task Management**: Always use `self.create_task(coroutine, name)` instead of raw `asyncio.create_task()`. The `TaskManager` automatically tracks tasks and cleans them up on processor shutdown. Use `await self.cancel_task(task, timeout)` for cancellation.
- **Error Handling**: Use `await self.push_error(msg, exception, fatal)` to push errors upstream. Services should use `fatal=False` (the default) so application code can handle errors and take action (e.g. switch to another service).
- **Docstrings**: Google-style. Classes describe purpose; `__init__` has `Args:` section; dataclasses use `Parameters:` section.
- **Deprecations**: Use the `.. deprecated:: <version>` Sphinx directive in docstrings (never inline tags like `[DEPRECATED]`), and pair it with a runtime `warnings.warn(..., DeprecationWarning)` at the call site. See `CONTRIBUTING.md` for full conventions.
- **Type hints**: Required for complex async code.
- **Dataclass vs Pydantic**: Use `@dataclass` for frames and internal pipeline data (high-frequency, no validation needed). Use Pydantic `BaseModel` for configuration, parameters, metrics, and external API data (benefits from validation and serialization). Specifically:
-`@dataclass`: Frame types, context aggregator pairs, internal data containers
-`BaseModel`: Service `InputParams`, transport/VAD/turn params, metrics data, API request/response models, serializer params
### Docstring Example
```python
classMyService(LLMService):
"""Description of what the service does.
More detailed description.
Event handlers available:
- on_connected: Called when we are connected
Example::
@service.event_handler("on_connected")
async def on_connected(service, frame):
...
"""
def__init__(self,param1:str,**kwargs):
"""Initialize the service.
Args:
param1: Description of param1.
**kwargs: Additional arguments passed to parent.
"""
super().__init__(**kwargs)
# Pydantic params class with a deprecated field
classMyParams(BaseModel):
"""Configuration parameters for MyService.
Parameters:
new_setting: Replacement for ``old_setting``.
old_setting: Legacy setting, no longer used.
.. deprecated:: 1.2.0
Use ``new_setting`` instead. Will be removed in 2.0.0.
"""
new_setting:str="default"
old_setting:str|None=None
```
## Service Implementation
When adding a new service:
1. Extend the appropriate base class (`STTService`, `TTSService`, `LLMService`, etc.)
2. Implement required abstract methods
3. Handle necessary frames
4. By default, all frames should be pushed in the direction they came
5. Push `ErrorFrame` on failures
6. Add metrics tracking via `MetricsData` if relevant
7. Follow the pattern of existing services in `src/pipecat/services/`
## Testing
Test utilities live in `src/pipecat/tests/utils.py`. Use `run_test()` to send frames through a pipeline and assert expected output frames in each direction. Use `SleepFrame(sleep=N)` to add delays between frames.
Pipecat welcomes community-maintained integrations! As our ecosystem grows, we've established a process for any developer to create and maintain their own service integrations while ensuring discoverability for the Pipecat community.
## Overview
**What we support:** Community-maintained integrations that live in separate repositories and are maintained by their authors.
**What we don't do:** The Pipecat team does not code review, test, or maintain community integrations. We provide guidance and list approved integrations for discoverability.
**Why this approach:** This allows the community to move quickly while keeping the Pipecat core team focused on maintaining the framework itself.
## Submitting your Integration
To be listed as an official community integration, follow these steps:
### Step 1: Build Your Integration
Create your integration following the patterns and examples shown in the "Integration Patterns and Examples" section below.
### Step 2: Set Up Your Repository
Your repository must contain these components:
- **Source code** - Complete implementation following Pipecat patterns
- **Foundational example** - Single file example showing basic usage (see [Pipecat examples](https://github.com/pipecat-ai/pipecat/tree/main/examples))
- **README.md** - Must include:
- Introduction and explanation of your integration
- Installation instructions
- Usage instructions with Pipecat Pipeline
- How to run your example
- Pipecat version compatibility (e.g., "Tested with Pipecat v0.0.86")
- Company attribution: If you work for the company providing the service, please mention this in your README. This helps build confidence that the integration will be actively maintained.
- **LICENSE** - Permissive license (BSD-2 like Pipecat, or equivalent open source terms)
- **Code documentation** - Source code with docstrings (we recommend following [Pipecat's docstring conventions](https://github.com/pipecat-ai/pipecat/blob/main/CONTRIBUTING.md#docstring-conventions))
- **Changelog** - Maintain a changelog for version updates
### Step 3: Join Discord
Join our Discord: https://discord.gg/pipecat
### Step 4: Submit for Listing
Submit a pull request to add your integration to our [Community Integrations documentation page](https://docs.pipecat.ai/server/services/community-integrations).
**To submit:**
1. Fork the [Pipecat docs repository](https://github.com/pipecat-ai/docs)
2. Edit the file `server/services/community-integrations.mdx`
3. Add your integration to the appropriate service category table with:
- Service name
- Link to your repository
- Maintainer GitHub username(s)
4. Include a link to your demo video (approx 30-60 seconds) in your PR description showing:
- Core functionality of your integration
- Handling of an interruption (if applicable to service type)
5. Submit your pull request
Once your PR is submitted, post in the `#community-integrations` Discord channel to let us know.
## Integration Patterns and Examples
### STT (Speech-to-Text) Services
#### Websocket-based Services
**Base class:**`WebsocketSTTService`
**Use for:** Services where you manage the websocket connection directly. Combines `STTService` with `WebsocketService` for automatic reconnection and keepalive support.
- **`_process_context(self, context: LLMContext)`** — The main method that processes an LLM context and generates a response. Each LLM service overrides `process_frame` to extract context from `LLMContextFrame` and calls `_process_context`.
- **`adapter_class`** — Class attribute pointing to a `BaseLLMAdapter` subclass. Defaults to `OpenAILLMAdapter`. Non-OpenAI services must implement their own adapter (see `src/pipecat/adapters/base_llm_adapter.py`) with methods:
-`get_llm_invocation_params(context)` — Extract provider-specific params from universal context
-`to_provider_tools_format(tools_schema)` — Convert standard tools to provider format
-`get_messages_for_logging(context)` — Format messages for logging
- **Frame sequence:** Output must follow this frame sequence pattern:
-`LLMFullResponseStartFrame` — Signals the start of an LLM response
-`LLMTextFrame` — Contains LLM content, typically streamed as tokens
-`LLMFullResponseEndFrame` — Signals the end of an LLM response
- **Thought frames (reasoning models):** If the model supports extended thinking / chain-of-thought, emit thought frames alongside the response:
-`LLMThoughtStartFrame` — Signals the start of a thought
-`LLMThoughtTextFrame` — Contains thought content, streamed as tokens
-`LLMThoughtEndFrame` — Signals the end of a thought
- **Context aggregation** is handled by the framework via `LLMContext` + `LLMContextAggregatorPair`. The LLM service just processes context it receives — no need to implement aggregators.
### TTS (Text-to-Speech) Services
#### WebsocketTTSService
**Use for:** Websocket-based streaming services (with or without word timestamps)
- For websocket services, use asyncio WebSocket implementation
- Handle idle service timeouts with keepalives
- TTS services push both audio (`TTSAudioRawFrame`) and text (`TTSTextFrame`) frames
### Telephony Serializers
Pipecat supports telephony provider integration using websocket connections to exchange MediaStreams. These services use a FrameSerializer to serialize and deserialize inputs from the FastAPIWebsocketTransport.
- Must implement `run_vision` method that takes a `UserImageRawFrame` and returns an `AsyncGenerator[Frame, None]`
- The method processes the image frame and yields frames with analysis results
- Must yield the frame sequence: `VisionFullResponseStartFrame`, `VisionTextFrame`, `VisionFullResponseEndFrame`
## Implementation Guidelines
### Naming Conventions
#### Package and Repository Naming
Use the `pipecat-{vendor}` naming convention for your PyPI package and repository:
-`pipecat-{vendor}` — for single-service integrations (e.g., `pipecat-deepdub`)
-`pipecat-{vendor}-{type}` — when a vendor offers multiple service types (e.g., `pipecat-upliftai-stt`, `pipecat-upliftai-tts`)
This convention makes community packages easily discoverable via PyPI search and clearly identifies them as part of the Pipecat ecosystem.
#### Class Naming
- **STT:** `VendorSTTService`
- **LLM:** `VendorLLMService`
- **TTS:**
- Websocket: `VendorTTSService`
- HTTP: `VendorHttpTTSService`
- **Image:** `VendorImageGenService`
- **Vision:** `VendorVisionService`
- **Telephony:** `VendorFrameSerializer`
### Metrics Support
Enable metrics in your service:
```python
defcan_generate_metrics(self)->bool:
"""Check if this service can generate processing metrics.
Returns:
True, as this service supports metrics.
"""
returnTrue
```
### Service Settings
Every AI service (STT, LLM, TTS, image generation, etc.) exposes a **Settings dataclass** that serves two roles:
1.**Store mode** — the service's `self._settings` holds the current value of every runtime-updatable field.
2.**Delta mode** — an update frame (e.g. `TTSUpdateSettingsFrame`) specifies only the fields that should change; unspecified fields remain `NOT_GIVEN`.
#### Defining your Settings class
Extend `STTSettings`, `TTSSettings`, `LLMSettings`, or `ImageGenSettings` (or, if your service directly subclasses `AIService`, `ServiceSettings`). The base classes already provide common fields (e.g. `model`, `voice`, `language`). You only need to add **service-specific knobs that should be runtime-updatable**:
| Anything users may want to change mid-session | Audio encoding, sample format |
| | Connection parameters (timeouts, retries) |
The rule of thumb: if a caller might send an update frame to change it at runtime, it belongs in Settings. Everything else is init-only config stored as `self._xxx`.
#### Wiring settings into `__init__`
Accept an **optional**`settings` parameter. Build a `default_settings` object with all fields set to real values, then merge any caller overrides with `apply_update`.
Add a `Settings`**class attribute** that points to your settings dataclass. This lets callers access the settings class through the service itself (e.g. `MyTTSService.Settings(...)`) without a separate import:
```python
fromtypingimportOptional
classMyTTSService(TTSService):
Settings=MyTTSSettings
_settings:Settings
def__init__(
self,
*,
api_key:str,
settings:Optional[Settings]=None,
**kwargs,
):
# 1. Defaults — every field has a real value (store mode).
default_settings=self.Settings(
model="my-model-v1",
voice="default-voice",
language="en",
speaking_rate=1.0,
)
# 2. Merge caller overrides (only given fields win).
ifsettingsisnotNone:
default_settings.apply_update(settings)
# 3. Pass the fully-populated settings to the base class.
AI services support runtime configuration changes via `*UpdateSettingsFrame`s (e.g. `STTUpdateSettingsFrame`, `TTSUpdateSettingsFrame`, `LLMUpdateSettingsFrame`).
To react to runtime setting changes, override `_update_settings`. The base implementation applies the delta to `self._settings` and returns a `dict` mapping each changed field name to its **pre-update** value. Your override should call `super()` first, then act on the changed fields. A common implementation might look like:
"""Apply a settings update, reconfiguring the connection if needed."""
changed=awaitsuper()._update_settings(update)
ifnotchanged:
returnchanged
awaitself._disconnect()
awaitself._connect()
returnchanged
```
The dict keys work like a set for membership tests (`"language" in changed`) and truthiness (`if changed`). Use `changed.keys() - {"language"}` for set difference, or `changed["language"]` to inspect the previous value of a field.
Note that, in this example, the service requires a reconnect to apply the new language. Consider, for each setting, whether your service requires reconnection or can apply changes in-place.
If your service can't yet apply certain settings at runtime, call `self._warn_unhandled_updated_settings(changed)` with any unhandled field names so users get a clear log message:
Sample rates are set via PipelineParams and passed to each frame processor at initialization. The pattern is to _not_ set the sample rate value in the constructor of a given service. Instead, use the `start()` method to initialize sample rates from the frame:
Note that `self.sample_rate` is a `@property` set in the TTSService base class, which provides access to the private sample rate value obtained from the StartFrame.
### Tracing Decorators
Use Pipecat's tracing decorators:
- **STT:** `@traced_stt` - decorate `_handle_transcription(self, transcript, is_final, language)` (the standard method name convention)
- **LLM:** `@traced_llm` - decorate the `_process_context()` method
- **TTS:** `@traced_tts` - decorate the `run_tts()` method
## Best Practices
### Packaging and Distribution
- Name your package `pipecat-{vendor}` (see [Naming Conventions](#naming-conventions))
- Use [uv](https://docs.astral.sh/uv/) for packaging (encouraged)
- Publish to PyPI for easier installation
- Follow semantic versioning principles
- Maintain a changelog
### HTTP Communication
For REST-based communication, use aiohttp. Pipecat includes this as a required dependency, so using it prevents adding an additional dependency to your integration.
### Error Handling
- Wrap API calls in appropriate try/catch blocks
- Handle rate limits and network failures gracefully
- Provide meaningful error messages
- When errors occur, raise exceptions AND push errors to notify the pipeline:
- Your foundational example serves as a valuable integration-level test
- Unit tests are nice to have. As the Pipecat teams provides better guidance, we will encourage unit testing more
## Disclaimer
Community integrations are community-maintained and not officially supported by the Pipecat team. Users should evaluate these integrations independently. The Pipecat team reserves the right to remove listings that become unmaintained or problematic.
## Staying Up to Date
Pipecat evolves rapidly to support the latest AI technologies and patterns. While we strive to minimize breaking changes, they do occur as the framework matures.
**We strongly recommend:**
- Join our Discord at https://discord.gg/pipecat and monitor the `#announcements` channel for release notifications
We encourage community-maintained integrations! Please see our [Community Integration Guide](COMMUNITY_INTEGRATIONS.md) for the process and requirements.
**Want to contribute to Pipecat core?**
We welcome contributions of all kinds! Your help is appreciated. Follow these steps to get involved:
1.**Fork this repository**: Start by forking the Pipecat Documentation repository to your GitHub account.
@@ -13,24 +17,137 @@ We welcome contributions of all kinds! Your help is appreciated. Follow these st
git checkout -b your-branch-name
```
4. **Make your changes**: Edit or add files as necessary.
5. **Test your changes**: Ensure that your changes look correct and follow the style set in the codebase.
6. **Commit your changes**: Once you're satisfied with your changes, commit them with a meaningful message.
5. **Add a changelog entry**: Create a changelog fragment file (see [Changelog Entries](#changelog-entries) below).
6. **Test your changes**: Ensure that your changes look correct and follow the style set in the codebase.
7. **Commit your changes**: Once you're satisfied with your changes, commit them with a meaningful message.
```bash
git commit -m "Description of your changes"
```
7. **Push your changes**: Push your branch to your forked repository.
8. **Push your changes**: Push your branch to your forked repository.
```bash
git push origin your-branch-name
```
8. **Submit a Pull Request (PR)**: Open a PR from your forked repository to the main branch of this repo.
9. **Submit a Pull Request (PR)**: Open a PR from your forked repository to the main branch of this repo.
> Important: Describe the changes you've made clearly!
Our maintainers will review your PR, and once everything is good, your contributions will be merged!
## Changelog Entries
Every pull request that makes a user-facing change should include a changelog entry. We use a changelog fragment system to avoid merge conflicts.
### Creating a Changelog Fragment
1. Create a new file in the `changelog/` directory with this naming pattern:
```
<PR_number>.<type>.md
```
2. Choose the appropriate type:
- `added.md` - New features
- `changed.md` - Changes in existing functionality
- `deprecated.md` - Soon-to-be removed features
- `removed.md` - Removed features
- `fixed.md` - Bug fixes
- `performance.md` - Performance improvements
- `security.md` - Security fixes
- `other.md` - Other changes (documentation, dependencies, etc.)
3. Write your changelog entry as a Markdown bullet point. Include the `-` at the start:
**Example files:**
`changelog/1234.added.md`:
```markdown
- Added support for Anthropic Claude 3.5 Sonnet with improved streaming performance.
```
`changelog/5678.fixed.md`:
```markdown
- Fixed an issue where audio frames were dropped during high-load scenarios.
```
**For entries with nested bullets:**
`changelog/1234.changed.md`:
```markdown
- Updated service configuration:
- Changed default timeout to 30 seconds
- Added retry logic for failed connections
```
### Multiple Changes in One PR
**Different types of changes:** Create separate fragment files for each type:
```
changelog/1234.added.md
changelog/1234.fixed.md
```
**Multiple changes of the same type:** Create numbered fragment files:
```
changelog/1234.changed.md
changelog/1234.changed.2.md
```
**Related changes:** Use nested bullets in a single fragment:
```markdown
- Updated service configuration:
- Changed default timeout to 30 seconds
- Added retry logic for failed connections
```
**Rule of thumb:** One logical change per fragment file. If changes are unrelated, use separate files.
### Preview Your Changes
To see what your changelog entry will look like:
```bash
towncrier build --draft --version Unreleased
```
This won't modify any files, just show you a preview.
### When to Skip Changelog Entries
You can skip adding a changelog entry for:
- Documentation-only changes
- Internal refactoring with no user-facing impact
- Test-only changes
- CI/build configuration changes
If you're unsure whether your change needs a changelog entry, ask in your PR!
## Dependency Management
This project uses [uv](https://docs.astral.sh/uv/) for dependency management. The `uv.lock` file is committed to ensure reproducible builds.
### Adding or Updating Dependencies
1. Edit `pyproject.toml` to add/update dependencies
2. Run `uv lock` to update the lockfile with new dependency resolution
3. Run `uv sync` to install the updated dependencies locally
4. Always commit both files together:
```bash
git add pyproject.toml uv.lock
git commit -m "feat: add new dependency for feature X"
```
**Important:** Never manually edit `uv.lock`. It's auto-generated by `uv lock`.
# 🎙️ Pipecat: Real-Time Voice & Multimodal AI Agents
**Pipecat** is an open-source Python framework for building real-time voice and multimodal conversational agents. Orchestrate audio and video, AI services, different transports, and conversation pipelines effortlessly—so you can focus on what makes your agent unique.
> Want to dive right in? Try the [quickstart](https://docs.pipecat.ai/getting-started/quickstart).
> Want to dive right in? Run `pipecat init quickstart` or follow the [quickstart guide](https://docs.pipecat.ai/getting-started/quickstart).
## 🚀 What You Can Build
@@ -19,8 +19,6 @@
- **Business Agents** – customer intake, support bots, guided flows
- **Complex Dialog Systems** – design logic with structured conversations
🧭 Looking to build structured conversations? Check out [Pipecat Flows](https://github.com/pipecat-ai/pipecat-flows) for managing complex conversational states and transitions.
## 🧠 Why Pipecat?
- **Voice-first**: Integrates speech recognition, text-to-speech, and conversation handling
@@ -28,44 +26,85 @@
- **Composable Pipelines**: Build complex behavior from modular components
- **Real-Time**: Ultra-low latency interaction with different transports (e.g. WebSockets or WebRTC)
## 🌐 Pipecat Ecosystem
### 🧩 Multi-agent systems
Need multiple AI agents working together? [Pipecat Subagents](https://github.com/pipecat-ai/pipecat-subagents) lets you build distributed multi-agent systems where each agent runs its own pipeline and communicates through a shared message bus. Hand off conversations between specialists, dispatch background tasks, and scale agents across processes or machines.
### 📱 Client SDKs
Building client applications? You can connect to Pipecat from any platform using our official SDKs:
Looking to build structured conversations? Check out [Pipecat Flows](https://github.com/pipecat-ai/pipecat-flows) for managing complex conversational states and transitions.
### 🪄 Beautiful UIs
Want to build beautiful and engaging experiences? Checkout the [Voice UI Kit](https://github.com/pipecat-ai/voice-ui-kit), a collection of components, hooks and templates for building voice AI applications quickly.
### 🛠️ Create and deploy projects
Create a new project in under a minute with the [Pipecat CLI](https://github.com/pipecat-ai/pipecat-cli). Then use the CLI to monitor and deploy your agent to production.
### 🔍 Debugging
Looking for help debugging your pipeline and processors? Check out [Whisker](https://github.com/pipecat-ai/whisker), a real-time Pipecat debugger.
### 🖥️ Terminal
Love terminal applications? Check out [Tail](https://github.com/pipecat-ai/tail), a terminal dashboard for Pipecat.
### 🤖 Claude Code Skills
Use [Pipecat Skills](https://github.com/pipecat-ai/skills) with [Claude Code](https://claude.ai/code) to scaffold projects, deploy to Pipecat Cloud, and more. Install the marketplace with:
```
claude plugin marketplace add pipecat-ai/skills
```
and install any of the available plugins.
### 🧩 Community Integrations
Build and share your own Pipecat service integrations! Browse existing [community integrations](https://docs.pipecat.ai/api-reference/server/services/community-integrations) or check out our [guide](COMMUNITY_INTEGRATIONS.md) to create your own.
### 📺️ Pipecat TV Channel
Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.youtube.com/playlist?list=PLzU2zoMTQIHjqC3v4q2XVSR3hGSzwKFwH) channel.
| Community | [Browse community integrations →](https://docs.pipecat.ai/api-reference/server/services/community-integrations) |
📚 [View full services documentation →](https://docs.pipecat.ai/server/services/supported-services)
📚 [View full services documentation →](https://docs.pipecat.ai/api-reference/server/services/supported-services)
## ⚡ Getting started
@@ -107,14 +146,15 @@ You can get started with Pipecat running on your local machine, then move your a
## 🧪 Code examples
- [Foundational](https://github.com/pipecat-ai/pipecat/tree/main/examples/foundational) — small snippets that build on each other, introducing one or two concepts at a time
- [Foundational](https://github.com/pipecat-ai/pipecat/tree/main/examples) — small snippets that build on each other, introducing one or two concepts at a time
- [Example apps](https://github.com/pipecat-ai/pipecat-examples) — complete applications that you can use as starting points for development
## 🛠️ Contributing to the framework
### Prerequisites
**Python Version:** 3.10+
**Minimum Python Version:** 3.11
**Recommended Python Version:** >= 3.12
### Setup Steps
@@ -128,7 +168,9 @@ You can get started with Pipecat running on your local machine, then move your a
2. Install development and testing dependencies:
```bash
uv sync --group dev --all-extras --no-extra gstreamer --no-extra krisp --no-extra local
uv sync --group dev --all-extras \
--no-extra gstreamer \
--no-extra local \
```
3. Install the git pre-commit hooks:
@@ -137,25 +179,17 @@ You can get started with Pipecat running on your local machine, then move your a
uv run pre-commit install
```
### Python 3.13+ Compatibility
Some features require PyTorch, which doesn't yet support Python 3.13+. Install using:
```bash
uv sync --group dev --all-extras \
--no-extra gstreamer \
--no-extra krisp \
--no-extra local \
--no-extra local-smart-turn \
--no-extra mlx-whisper \
--no-extra moondream \
--no-extra ultravox
```
> **Tip:** For full compatibility, use Python 3.12: `uv python pin 3.12`
> **Note**: Some extras (local, gstreamer) require system dependencies. See documentation if you encounter build errors.
### Claude Code Skills
Install development workflow skills for contributing to Pipecat with [Claude Code](https://claude.ai/code):
```
claude plugin marketplace add pipecat-ai/pipecat
claude plugin install pipecat-dev@pipecat-dev-skills
```
### Running tests
To run all tests, from the root directory:
@@ -170,54 +204,6 @@ Run a specific test suite:
uv run pytest tests/test_name.py
```
### Setting up your editor
This project uses strict [PEP 8](https://peps.python.org/pep-0008/) formatting via [Ruff](https://github.com/astral-sh/ruff).
#### Emacs
You can use [use-package](https://github.com/jwiegley/use-package) to install [emacs-lazy-ruff](https://github.com/christophermadsen/emacs-lazy-ruff) package and configure `ruff` arguments:
`ruff` was installed in the `venv` environment described before, so you should be able to use [pyvenv-auto](https://github.com/ryotaro612/pyvenv-auto) to automatically load that environment inside Emacs.
```elisp
(use-package pyvenv-auto
:ensure t
:defer t
:hook ((python-mode . pyvenv-auto-run)))
```
#### Visual Studio Code
Install the
[Ruff](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) extension. Then edit the user settings (_Ctrl-Shift-P_ `Open User Settings (JSON)`) and set it as the default Python formatter, and enable formatting on save:
```json
"[python]": {
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.formatOnSave": true
}
```
#### PyCharm
`ruff` was installed in the `venv` environment described before, now to enable autoformatting on save, go to `File` -> `Settings` -> `Tools` -> `File Watchers` and add a new watcher with the following settings:
1. **Name**: `Ruff formatter`
2. **File type**: `Python`
3. **Working directory**: `$ContentRoot$`
4. **Arguments**: `format $FilePath$`
5. **Program**: `$PyInterpreterDirectory$/ruff`
## 🤝 Contributing
We welcome contributions from the community! Whether you're fixing bugs, improving documentation, or adding new features, here's how you can help:
Frames can represent discrete chunks of data, for instance a chunk of text, a chunk of audio, or an image. They can also be used to as control flow, for instance a frame that indicates that there is no more data available, or that a user started or stopped talking. They can also represent more complex data structures, such as a message array used for an LLM completion.
## FrameProcessors
Frame processors operate on frames. Every frame processor implements a `process_frame` method that consumes one frame and produces zero or more frames. Frame processors can do simple transforms, such as concatenating text fragments into sentences, or they can treat frames as input for an AI Service, and emit chat completions based on message arrays or transform text into audio or images.
## Pipelines
Pipelines are lists of frame processors linked together. Frame processors can push frames upstream or downstream to their peers. A very simple pipeline might chain an LLM frame processor to a text-to-speech frame processor, with a transport as an output.
## Transports
Transports provide input and output frame processors to receive or send frames respectively. For example, the `DailyTransport` does this with a WebRTC session joined to a Daily.co room.
2. The Transport places a Transcription frame in the Pipeline’s source queue.

3. The Pipeline passes the Transcription frame to the first Frame Processor in its list, the LLM User Message Aggregator.

4. The LLM User Message Aggregator updates the LLM Context with a `{“user”: “Hello LLM”}` message.

5. The LLM User Message Aggregator yields an LLM Message Frame, containing the updated LLM Context. The Pipeline passes this frame to the LLM Frame Processor.

6. The LLM Frame Processor creates a streaming chat completion based on the LLM context and yields the first chunk of a response, Text Frame with the value “Hi, “. The Pipeline passes this frame to the TTS Frame Processor. The TTS Frame Processor aggregates this response but doesn’t yield anything, yet, because it’s waiting for a full sentence.

7. The LLM Frame Processor yields another Text Frame with the value “there.”. The Pipeline passes this frame to the TTS Frame Processor.

8. The TTS Frame Processor now has a full sentence, so it starts streaming audio based on “Hi, there.” It yields the first chunk of streaming audio as an Audio frame, which the Pipeline passes to the LLM Assistant Message Aggregator.

9. The LLM Assistant Message Aggregator doesn’t do anything with Audio frames, so it immediately yields the frame, unchanged. This is the convention for all Frame Processors: frames that the processor doesn’t process should be immediately yielded.

10. The Pipeline places the first Audio frame in its sink queue, which is being watched by the Transport. Since the frame is now in a queue, the Pipeline can continue processing other frames. Note that the source and sink queues form a sort of “boundary of concurrent processing” between a Pipeline and the outside world. In a Pipeline, Frames are processed sequentially; once a Frame is on a queue it can be processed in parallel with the frames being processed by the Pipeline. TODO: link to a more in-depth section about this.

11. The TTS Frame Processor yields another Audio frame as the Transport transmits the first Audio frame.

12. As before, the LLM Assistant Message Aggregator immediately yields the Audio frame and the Pipeline places the Audio frame in the sink queue.

13. The TTS Frame Processor has no more frames to yield. The LLM Frame Processor emits an LLM Response End Frame, which the Pipeline passes to the TTS Frame Processor.

14. The TTS Frame Processor immediately yields the LLM Response End Frame, so the Pipeline passes it along to the LLM Assistant Message Aggregator. The LLM Assistant Message Aggregator updates the LLM Context with the full response from the LLM. TODO TODO: I realized I forgot that the TSS Frame Processor also yields the Text frames that the LLM emitted so that the LLM Assistant Message Aggregator could accumulate them, arrggh.

15. The system is quiet, and waiting for the next message from the Transport.
# Understanding Different Frame Types in the Pipecat System
In the Pipecat system, frames are used to represent different types of data and control signals that flow through the pipeline. Understanding these frame types is crucial for working with the system effectively. This tutorial will cover the main categories of frames and their specific uses.
## 1. Base Frame Classes
### Frame
The `Frame` class is the base class for all frames. It includes:
-`id`: A unique identifier
-`name`: A descriptive name
-`pts`: Presentation timestamp (optional)
### DataFrame
`DataFrame` is a subclass of `Frame` and serves as a base for most data-carrying frames.
## 2. Audio Frames
### AudioRawFrame
Represents a chunk of audio with properties:
-`audio`: Raw audio data
-`sample_rate`: Audio sample rate
-`num_channels`: Number of audio channels
Subclasses include:
-`InputAudioRawFrame`: For audio from input sources
-`OutputAudioRawFrame`: For audio to be played by output devices
-`TTSAudioRawFrame`: For audio generated by Text-to-Speech services
## 3. Image Frames
### ImageRawFrame
Represents an image with properties:
-`image`: Raw image data
-`size`: Image dimensions
-`format`: Image format (e.g., JPEG, PNG)
Subclasses include:
-`InputImageRawFrame`: For images from input sources
-`OutputImageRawFrame`: For images to be displayed
-`UserImageRawFrame`: For images associated with a specific user
-`VisionImageRawFrame`: For images with associated text for description
-`URLImageRawFrame`: For images with an associated URL
### SpriteFrame
Represents an animated sprite, containing a list of `ImageRawFrame` objects.
## 4. Text and Transcription Frames
### TextFrame
Represents a chunk of text, used for various purposes in the pipeline.
### TranscriptionFrame
A specialized `TextFrame` for speech transcriptions, including:
-`user_id`: ID of the speaking user
-`timestamp`: When the transcription was generated
-`language`: Detected language of the speech
### InterimTranscriptionFrame
Similar to `TranscriptionFrame`, but for interim (not final) transcriptions.
## 5. LLM (Language Model) Frames
### LLMMessagesFrame
Contains a list of messages for an LLM service to process.
### LLMMessagesAppendFrame and LLMMessagesUpdateFrame
Used to modify the current context of LLM messages.
### LLMSetToolsFrame
Specifies tools (functions) available for the LLM to use.
### LLMEnablePromptCachingFrame
Controls prompt caching in certain LLMs.
## 6. System and Control Frames
### SystemFrame
Base class for system-level frames.
Important system frames include:
-`StartFrame`: Initiates a pipeline
-`CancelFrame`: Stops a pipeline immediately
-`ErrorFrame`: Notifies of errors (with `FatalErrorFrame` for unrecoverable errors)
-`EndTaskFrame` and `CancelTaskFrame`: Control pipeline tasks
-`StartInterruptionFrame` and `StopInterruptionFrame`: Indicate user speech for interruptions
### ControlFrame
Base class for control-flow frames.
Notable control frames:
-`EndFrame`: Signals the end of a pipeline
-`LLMFullResponseStartFrame` and `LLMFullResponseEndFrame`: Bracket LLM responses
-`UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame`: Indicate user speech activity
-`BotStartedSpeakingFrame` and `BotStoppedSpeakingFrame`: Indicate bot speech activity
-`TTSStartedFrame` and `TTSStoppedFrame`: Bracket Text-to-Speech responses
## 7. Special Purpose Frames
### MetricsFrame
Contains performance metrics data.
### FunctionCallInProgressFrame and FunctionCallResultFrame
Used for handling LLM function (tool) calls.
### ServiceUpdateSettingsFrame
Base class for updating service settings, with specific subclasses for LLM, TTS, and STT services.
## Conclusion
Understanding these frame types is essential for working with the Pipecat system. Each frame type serves a specific purpose in the pipeline, whether it's carrying data (like audio or images), controlling the flow of the pipeline, or managing system-level operations. By using the appropriate frame types, you can effectively process and transmit various kinds of information through your pipeline.
This directory contains examples to help you learn how to build with Pipecat.
This directory contains examples showing how to build voice and multimodal agents with Pipecat.
## Getting Started
## Setup
New to Pipecat? Start here:
1. Follow the [README](https://github.com/pipecat-ai/pipecat/blob/main/README.md#%EF%B8%8F-contributing-to-the-framework) steps to get your local environment configured.
- **[Quickstart](quickstart/)** - Get your first voice AI bot running in 5 minutes _(coming soon)_
- **[Client/Server Web](client-server-web/)** - Learn to build web applications with Pipecat's client SDKs _(coming soon)_
- **[Phone Bot with Twilio](phone-bot-twilio/)** - Connect your bot to a phone number _(coming soon)_
> **Run from root directory**: Make sure you are running the steps from the root directory.
## Foundational Examples
> **Using local audio?**: The `LocalAudioTransport` requires a system dependency for `portaudio`. Install the dependency to use the transport.
Single-file examples that introduce core Pipecat concepts one at a time. These examples:
2. Copy the [`env.example`](../env.example) file and add API keys for services you plan to use:
- Build on each other progressively
- Focus on specific features or integrations
- Are used for testing with every Pipecat release
```bash
cp env.example .env
# Edit .env with your API keys
```
See the **[Foundational Examples README](foundational/)** for the complete list.
3. Run any example:
## More Advanced Examples
```bash
uv run python getting-started/01-say-one-thing.py
```
Ready to explore complex use cases? Visit **[pipecat-examples](https://github.com/pipecat-ai/pipecat-examples)** for:
4. Open the web interface at http://localhost:7860/client/ and click "Connect"
- Production-ready applications
- Multi-platform client implementations
- Telephony integrations
- Multimodal and creative applications
- Deployment and monitoring examples
## Running examples with other transports
Most examples support running with other transports, like Twilio or Daily.
### Daily
You need to create a Daily account at https://dashboard.daily.co/u/signup. Once signed up, you can create your own room from the dashboard and set the environment variables `DAILY_ROOM_URL` and `DAILY_API_KEY`. Alternatively, you can let the example create a room for you (still needs `DAILY_API_KEY` environment variable). Then, start any example with `-t daily`:
```bash
uv run getting-started/06-voice-agent.py -t daily
```
### Twilio
It is also possible to run the example through a Twilio phone number. You will need to setup a few things:
1. Install and run [ngrok](https://ngrok.com/download).
```bash
ngrok http 7860
```
2. Configure your Twilio phone number. One way is to setup a TwiML app and set the request URL to the ngrok URL from step (1). Then, set your phone number to use the new TwiML app.
Then, run the example with:
```bash
uv run getting-started/06-voice-agent.py -t twilio -x NGROK_HOST_NAME
```
## Directory Structure
### [`getting-started/`](./getting-started/)
Progressive introduction to Pipecat, from minimal TTS to a full voice agent with function calling.
### [`voice/`](./voice/)
Full STT + LLM + TTS voice agent pipelines showcasing different speech service providers (Deepgram, ElevenLabs, Cartesia, etc.)
### [`function-calling/`](./function-calling/)
Function calling with different LLM providers (OpenAI, Anthropic, Google, etc.)
### [`transcription/`](./transcription/)
Speech-to-text examples with various STT providers.
### [`vision/`](./vision/)
Image description and vision capabilities with different multimodal LLMs.
### [`realtime/`](./realtime/)
Realtime and multimodal live APIs (OpenAI Realtime, Gemini Live, AWS Nova Sonic, Ultravox, Grok).
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
messages=[
{
"role":"system",
"content":"You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
# Create audio buffer processor
audiobuffer=AudioBufferProcessor()
messages=[
{
"role":"system",
"content":"You are a helpful assistant demonstrating audio recording capabilities. Keep your responses brief and clear.",
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
messages=[
{
"role":"system",
"content":"You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio. Respond to what the user said in a creative and helpful way.",
},
]
tts=CartesiaTTSService(
api_key=os.environ["CARTESIA_API_KEY"],
settings=CartesiaTTSService.Settings(
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=GoogleLLMService(
api_key=os.environ["GOOGLE_API_KEY"],
settings=GoogleLLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way. You have access to tools to get the current weather - use them when relevant.",
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way. You have access to tools to get the current weather - use them when relevant.",
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAIResponsesLLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAIResponsesLLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
# You can also register a function_name of None to get all functions
# sent to the same callback with an additional function_name parameter.
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
openai_llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
groq_llm=GroqLLMService(
api_key=os.environ["GROQ_API_KEY"],
settings=GroqLLMService.Settings(
system_instruction="You are a very helpful assistant. Your goal is to demonstrate your capabilities in detail in a creative and helpful way.",
),
)
openai_context=LLMContext()
groq_context=LLMContext()
# We use an external VADProcessor because the UserTurnProcessor is shared
# across multiple parallel aggregators. The VADProcessor emits
# VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame which the
# UserTurnProcessor needs to manage turn lifecycle.
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
# Main LLM — drives the conversation. Its RTVI events reach the client.
main_llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
# Evaluator LLM — silently grades the user's message in the background.
# Its RTVI events will be suppressed so the client is unaware of this branch.
evaluator_llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
name="EvaluatorLLM",
settings=OpenAILLMService.Settings(
system_instruction="You are a silent quality evaluator. When given a user message, respond with a single JSON object: {'score': <1-5>, 'reason': '<brief reason>'}. Do not respond conversationally.",
),
)
main_context=LLMContext()
evaluator_context=LLMContext()
# We use an external VADProcessor because the UserTurnProcessor is shared
# across multiple parallel aggregators. The VADProcessor emits
# VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame which the
# UserTurnProcessor needs to manage turn lifecycle.
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
base_url="http://0.0.0.0:8000/v1",
)
messages=[
{
"role":"system",
"content":"You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
voice="d4db5fb9-f44b-4bd1-85fa-192e0f0d75f9",# Spanish-speaking Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a live translation assistant. Your sole purpose is to translate English text into Spanish. When you receive English text from the user, immediately translate it into natural, fluent Spanish. Do not add explanations, commentary, or extra information—only provide the Spanish translation of the text you receive.",
),
)
context=LLMContext()
# We use the TranscriptionUserTurnStartStrategy to start a new user turn
# every time a transcription is received. We disable interruptions, so the
# user can continue speaking while the bot is transcribing, without
system_prompt="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way."
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.