Adds a `session_id: str | None` field to `RunnerArguments` so bots can
log/trace a per-session identifier in local development the same way
they can in Pipecat Cloud (where it is provided via the
`x-daily-session-id` header).
The local runner now mints a UUID at every `*RunnerArguments`
construction site. For paths that already returned a `sessionId` to the
caller (Daily `/start`, dial-in webhook), a single UUID is now generated
and shared between `runner_args.session_id` and the response body
instead of being thrown away. The SmallWebRTC `/api/offer` endpoint
accepts an optional `session_id` so the `/sessions/{session_id}/...`
proxy can thread it through.
This is the prerequisite step for collapsing pipecat-cloud's
`SessionArguments` / `*SessionArguments` hierarchy onto the upstream
runner types.
So the rendered changelog has the (PR [...]) line aligned as a list
continuation under its bullet. Verified with both short and wrapped
entries via `towncrier build --draft`.
Introduce SonioxTTSService, a WebSocket TTS provider that streams text and
receives audio over a persistent connection, multiplexing up to 5 concurrent
streams per socket via Soniox's `stream_id`. Also updates the README service
table and the Soniox voice example to use the new TTS end-to-end.
Replaces the hardcoded camera publishing send settings in
DailyTransport with a new DailyParams.camera_out_send_settings dict that
applications can pass through verbatim to the Daily client. This makes
the encoding/codec/bitrate configuration user-controllable instead of
being driven solely by the generic TransportParams fields.
As a consequence, TransportParams.video_out_bitrate is deprecated for
the Daily transport (now configured via camera_out_send_settings) and
its default is changed to None.
Adds a dedicated screen video track alongside the existing camera track
so applications can publish to Daily's built-in "screenVideo" destination
via video_out_destinations. The track is created at join time and wired
into the client settings (inputs and publishing) when "screenVideo" is
configured; write_video_frame routes frames to the appropriate track
based on the frame's transport_destination.
Bound methods are created fresh on each attribute access, so
'self._missing_function_call_handler is self._missing_function_call_handler'
is always False. Using 'is' meant the placeholder branch never fired and
both warnings logged when a function was missing at queue time.
Switch to == so equality compares the underlying function and instance.
Strengthen the missing-at-queue-time test to assert the second warning
does not fire.
Address review feedback: a function may be unregistered between when
run_function_calls queues it and when _run_function_call executes it.
Restore the live lookup, falling back to the missing-function handler
when the entry is gone, so the call still terminates with a normal
tool result. Factor the missing-handler item construction into a
helper since it's now built in two places.
Runner-created Daily rooms previously had no expiration when callers
posted partial `dailyRoomProperties` (e.g. `{"start_video_off": true}`).
The model-default `exp=None` and `eject_at_room_exp=False` meant Daily's
cron never cleaned them up, so rooms accumulated indefinitely.
Encode the policy in the runner: define `PIPECAT_CLOUD_ROOM_EXP_HOURS=4.0`,
inject `exp` and `eject_at_room_exp=True` into user-supplied properties via
`setdefault` (so explicit caller values still win), and pass
`room_exp_duration` to all four `configure()` call sites.
* VIVA SDK TT v3 support
* Format fix.
* Renamed the API naming, removed '3' from the name.
* Implementation of User turn start strategy using Krisp VIVA Interruption Prediction in scope of TT v3 support.
* TT demo tool
* Some improvements for demo scripts, audio recordin, etc.
* Enhance demo scripts with VAD selection and audio embedding features. Updated HTML report to include annotated audio players and improved response time metrics in summary formatting. Added README for setup and usage instructions.
* Refactor interrupt prediction demo to compare multiple interruption strategies (Krisp IP vs VAD). Updated README with usage instructions and output details. Enhanced audio processing with new helper functions for generating beeps and mixing audio.
* Refactor demo scripts to improve latency metrics by introducing total_delay property in TurnEvent. Update formatting in reports and visualizations to reflect accurate speech end times, including VAD wait times. Enhance HTML report with detailed latency information and adjust audio processing to account for VAD stop seconds.
* Add audio resampling functionality and update demo scripts for improved audio processing
- Introduced `resample_audio` function to handle audio resampling with linear interpolation.
- Updated `demo_audio_recorder.py` to utilize the new resampling feature, ensuring audio is saved at the requested sample rate.
- Modified `demo_interrupt_prediction.py` and `demo_turn_taking.py` to resample audio to 16 kHz for compatibility with Silero VAD.
- Adjusted imports in demo scripts to include the new resampling function.
- Enhanced error handling for sample rate discrepancies in audio recording.
* Enhance demo_interrupt_prediction.py with VAD type selection and improved processing logic
- Added support for selecting between "silero" and "krisp" VAD engines in the demo script.
- Introduced a new create_vad function to configure VAD analyzers based on the selected type.
- Updated audio processing logic to handle VAD type-specific resampling and state management.
- Modified the KrispVivaIPUserTurnStartStrategy to utilize a separate vad_flag for per-frame VAD input, improving interruption detection accuracy.
* Refactor audio processing scripts for improved readability and consistency
- Updated type hinting in `resample_audio` function to use `tuple` instead of `Tuple`.
- Simplified print statements in `demo_audio_recorder.py`, `demo_formatting.py`, and `demo_interrupt_prediction.py` for better readability.
- Adjusted argument formatting in `demo_audio_recorder.py` and `demo_formatting.py` for consistency.
- Cleaned up list comprehensions in `demo_formatting.py`, `demo_html_report.py`, and `demo_interrupt_prediction.py` for clarity.
- Enhanced error handling in `__init__.py` for the KrispVivaIPUserTurnStartStrategy import.
* Refactor VAD handling in KrispVivaIPUserTurnStartStrategy and update tests for clarity
- Simplified the argument formatting in the _handle_vad_started method for improved readability.
- Updated test assertions to reflect changes in VAD processing logic, ensuring that the vad_flag is correctly set to False during continuous state processing.
- Enhanced test cases to verify that the process method is called appropriately under different conditions.
* more format fixes.
* removed demo scripts.
* reverted wrongly removed file.
* Corrected the IP integration logic.
* style fix.
* Refactor audio processing and state management in KrispVivaIPUserTurnStartStrategy
- Removed the unused _vad_flag attribute to streamline state tracking.
- Updated the reset method to clear the audio buffer instead of resetting the vad_flag.
- Adjusted the process_frame method to use _speech_active for VAD input, enhancing clarity in the logic.
- Modified tests to reflect changes in state management and ensure proper functionality of the reset method and audio buffer handling.
* FIxed formatting
---------
Co-authored-by: Aram Poghosyan <apoghosyan@krisp.ai>
Pipecat 1.0.8 hard-required protobuf 6.x via the base `protobuf>=6.31.1,<7`
pin, blocking users whose dependency graph already constrains protobuf to
the 5.x line. The original bump (PR #4136) was only needed because
`nvidia-riva-client>=2.25.1` ships gencode compiled with protoc 6.31.1.
Changes:
- Widen base pin to `protobuf>=5.29.6,<7`.
- Regenerate `frames_pb2.py` with `grpcio-tools~=1.67.1` (protoc 5.x). Per
Google's cross-version runtime guarantee, 5.x gencode runs on both 5.x
and 6.x runtimes, so this single artifact serves all users.
- Loosen the dev pin `grpcio-tools` to `>=1.67.1,<2` so contributors can
install `pipecat[dev,nvidia]` without resolver conflict. Comment in
`frames.proto` documents the 1.67.x requirement for regeneration.
- Add an explicit `protobuf>=6.31.1,<7` to the `nvidia` extra. This
compensates for nvidia-riva-client's missing `protobuf` install
requirement (upstream packaging gap, see
https://github.com/nvidia-riva/python-clients/issues/172). When that
issue is resolved, the explicit protobuf entry in the `nvidia` extra
can be removed.
Verified: pipecat imports cleanly on both protobuf 5.29.6 and 6.33.6;
`tests/test_protobuf_serializer.py` passes; `import riva.client` succeeds
when `pipecat[nvidia]` is installed.
Nova Sonic sessions have an AWS-imposed ~8-minute time limit. This adds
transparent session continuation that rotates sessions in the background
before the limit is reached, preserving conversation context with no
user-perceptible interruption.
Implementation follows the AWS reference architecture:
- Monitor loop detects when session age exceeds threshold
- On assistant AUDIO contentStart: start buffering user audio, create next
session (sessionStart + promptStart + system instruction)
- Track SPECULATIVE/FINAL text counts as completion signal
- On completion signal: send conversation history + audioInputStart +
buffered audio to next session, then promote immediately
- Close old session in background (non-blocking)
- Dead session detection: recreate next session if idle >30s
Key design decisions:
- Session continuation enabled by default (fundamental for long conversations)
- Conversation history tracked in real-time via _sc_conversation_history
(independent of pipeline context aggregator which updates asynchronously)
- Completion signal check in _handle_content_end_event (after history update)
to ensure latest text is included in handoff
- Rolling audio buffer (default 3s) captures user audio during transition
- transition_threshold_seconds capped at 420s (7min) for safety margin
- Unified event methods (_send_text_event, _send_client_event, etc.) accept
optional stream/prompt_name params, eliminating duplicate SC methods
Also adds:
- SessionContinuationParams config (enabled, threshold, buffer, timeout)
LLMContext's NotGiven, LLMContextToolChoice, and LLMStandardMessage are
currently aliased to their OpenAI equivalents, so passing values
between the two sides type-checks implicitly. That works today but
obscures the fact that these are meant to be conceptually distinct —
if LLMContext ever diverges from OpenAI's types, every implicit
crossing would silently break.
Introduce two module-private cast helpers in open_ai_adapter.py:
- _openai_from_llm_context_tool_choice(tool_choice)
- _openai_from_llm_standard_message(message)
Both are typed no-ops today (implemented with typing.cast) but each
carries a docstring explaining why the cast is present, and every
boundary crossing now routes through a named function. Future readers
(and future greps) can find the crossings; a later divergence becomes
a mechanical find-and-update rather than hunting through adapter code.
No behavior change, no pyright error delta.
After widening TTSSettings.voice to str | None | _NotGiven (so other
TTS services can opt into None as a valid "no voice" state), pyright
flagged Speechmatics' URL builder receiving str | None where it
required str.
Speechmatics has no "no voice" mode (the URL path includes the voice
name), so override the inherited field in SpeechmaticsTTSSettings to
str | _NotGiven. The call site stays as a plain assert_given(...)
without an extra None check.
Three LLM services initialize certain Settings fields with the SDK's
NOT_GIVEN (openai.NOT_GIVEN or anthropic.NOT_GIVEN) so the value
flows unmodified into SDK API calls. The inherited field types from
LLMSettings only admit pipecat's _NotGiven, so pyright flagged each
constructor call as a flavor mismatch.
Widen the field types in each service-specific Settings subclass so
they accept both pipecat's _NotGiven (for delta-mode defaults) and
the corresponding SDK NotGiven (for store-mode passthrough):
- OpenAILLMSettings: frequency_penalty, presence_penalty, seed,
temperature, top_p, max_tokens, max_completion_tokens.
- OpenAIResponsesLLMSettings: temperature, top_p,
max_completion_tokens.
- AnthropicLLMSettings: temperature, top_k, top_p, thinking.
Every overridden field is genuinely read from self._settings and
passed directly to the SDK, so none of the overrides are vestigial.
Clears 21 pyright errors and restores test_service_settings_complete
parity with the pre-NOT_GIVEN-swap state.
asyncai/tts and google/vertex/llm are now clean after the missing-None
sweep (both benefited from the TTSSettings.voice / LLMSettings
cascades).
- src/pipecat/services/asyncai/tts.py
- src/pipecat/services/google/vertex/llm.py
Service-specific Settings subclasses declared fields as T | _NotGiven
(no None), but the services routinely pass None to those fields during
init to mean "don't override — use the vendor's default". The field
type just didn't reflect that a None value is valid, so pyright
flagged every None at the call sites.
Change the declarations to T | None | _NotGiven, matching the pattern
already used by ServiceSettings.model and TTSSettings.language. No
constructor-call changes; the default_factory stays NOT_GIVEN.
Fields touched across 11 files:
- services/settings.py: TTSSettings.voice (base class; covers
asyncai, cartesia, elevenlabs, fish, hume, kokoro, lmnt, mistral,
neuphonic, piper, resembleai, rime, xtts TTS services).
- services/aws/llm.py: latency.
- services/aws/tts.py: engine, pitch, rate, volume, lexicon_names.
- services/azure/tts.py: emphasis, pitch, rate, role, style,
style_degree, volume.
- services/google/gemini_live/llm.py: vad.
- services/google/llm.py: thinking.
- services/google/stt.py: language_codes.
- services/inworld/tts.py: speaking_rate, temperature.
- services/openai/tts.py: instructions, speed.
- services/speechmatics/stt.py: 13 fields (domain, operating_point,
max_delay, end_of_utterance_*, punctuation_overrides, *_partials,
split_sentences, enable_diarization, speaker_*, max_speakers,
prefer_current_speaker, extra_params).
- services/ultravox/llm.py: output_medium.
Clears 94 pyright errors (1035 -> 941).
Three files no longer have pyright errors after the is_given /
assert_given sweep — remove them from the ignore list (which serves as
a live todo of files with remaining type errors).
- src/pipecat/processors/gstreamer/pipeline_source.py
- src/pipecat/services/camb/tts.py
- src/pipecat/services/speechmatics/tts.py
Apply assert_given across service modules to narrow reads from
store-mode settings fields (self._settings.X, default_settings.X),
where _NotGiven is declared in the field type but should never appear
at runtime (enforced by validate_complete()).
Two idioms used:
- Inline wrap for single uses:
func(assert_given(self._settings.enable_prompt_caching), ...)
- Extract-and-reuse when the same value is used multiple times:
thinking = assert_given(self._settings.thinking)
if thinking:
params["thinking"] = thinking.model_dump(...)
43 service files touched. Cleared ~172 pyright errors; remaining
_NotGiven-related errors are in adjacent categories (flavor mismatch
between openai/anthropic NotGiven and pipecat _NotGiven, settings
field types that should allow None but don't) that need different
fixes.
In store-mode settings objects, _NotGiven should never appear (the
invariant enforced by validate_complete). But the declared field types
still include _NotGiven because the same class doubles as delta mode,
so every field read is typed X | None | _NotGiven and pyright flags
operations that assume X | None.
assert_given is a one-line extractor that narrows away _NotGiven and
raises loudly if the invariant is violated — preferable to scattering
is_given guards that defend against something that can't occur in
practice.
resolved_model = assert_given(self._settings.model) # str | None
Replace direct identity checks against NOT_GIVEN with is_given() at
sites where pyright's inability to narrow on non-singleton sentinels
was causing type errors.
- adapters/services/anthropic_adapter.py: narrow converted.system for
_resolve_system_instruction.
- services/openai/llm.py: narrow params.service_tier using OpenAI's
is_given.
- services/sarvam/llm.py: narrow tools / tool_choice using OpenAI's
is_given (aliased as openai_is_given alongside the existing
settings.is_given import).
- services/sarvam/tts.py: narrow settings.voice using settings.is_given.
Pyright can't narrow identity checks against module-level NotGiven
sentinels (they aren't typed as singletons), which leaves many
NotGiven-bearing unions stuck as unnarrowed types throughout the
codebase. Introduce is_given TypeGuard helpers so narrowing works via
isinstance under the hood.
Each helper is co-located with the NotGiven flavor it guards:
- services/settings.py: upgrade the existing is_given to a TypeGuard.
- processors/aggregators/llm_context.py: add an is_given for
LLMContext's NotGiven. Treat LLMContext's re-exported types
(LLMStandardMessage, LLMContextToolChoice, NOT_GIVEN, NotGiven) as
LLMContext's own — independent definitions that happen to coincide
with OpenAI's as an implementation detail.
- adapters/services/anthropic_adapter.py: add is_given for anthropic's
NotGiven.
- adapters/services/open_ai_adapter.py: add is_given for openai's
NotGiven.
TypedDict types are not subtypes of dict[...] in the type system
(per PEP 589), so TypedDict-based invocation param classes could not
satisfy the TypeVar bound. Mapping[str, Any] accepts TypedDicts while
preserving the "string-keyed mapping" constraint.
The original contributor's PR (#4328) landed as #4355. Rename the fragment
so the rendered changelog links to the merged PR, and add the leading `- `
bullet prefix that towncrier expects.
Extends the reconnect re-seeding fix to work cleanly on Gemini Live 2.5,
which has stricter seed requirements than 3.x and a documented audio-input /
history-recall limitation. Both initial connection and reconnect now share a
single code path (`_create_initial_response(for_reconnect=...)`), with four
well-documented cases.
On Gemini 2.5 reconnect, `turn_complete=True` is now forced on the seed so
the model produces a recap-style response immediately instead of briefly
acting "forgetful" on the user's next utterance — the latter being
especially jarring mid-conversation. When a 2.5 seed doesn't already end
with a user turn (e.g. the bot had finished speaking before the disconnect),
a blank user turn is appended to satisfy the server's seed-shape
requirement. Gemini 3.x needs neither workaround.
Tkinter's `Label` only stores `PhotoImage` references at the C level, so
Python GC eats them unless something on the Python side keeps a
reference. The canonical fix is to stash the reference on the widget
itself: `label.image = photo`. Tkinter widgets are plain Python objects,
so the assignment works at runtime, but the stub declares no `image`
attribute (correctly — there isn't one; we're adding it).
Narrow the suppression to `# type: ignore[attr-defined]` on the one
line. The existing comment above the assignment already documents why.
Mistral imposes three conversation-history quirks on top of the
OpenAI-compatible wire format: tool messages must be followed by an
assistant message; non-initial system messages are rejected; trailing
assistant messages require `prefix=True`. These rules were applied
inline in `MistralLLMService.build_chat_completion_params`, which is the
wrong layer — every other provider with OpenAI-compatible-but-quirky
shape (Perplexity, etc.) owns its transformations in a
`BaseLLMAdapter` subclass that runs during `get_llm_invocation_params`.
Create `MistralLLMAdapter(OpenAILLMAdapter)` on the Perplexity template
and wire it in via the existing `adapter_class` dispatch. The service
now only handles Mistral-specific request-level mapping (`random_seed`
in place of `seed`), and the message shape concerns live with other
provider format logic.
No behavior change. The transform function casts to `list[dict[str,
Any]]` internally because mutating `role` and attaching Mistral's
non-standard `prefix` field both step outside OpenAI's TypedDict
contract; the cast at the return boundary encodes that we're emitting
Mistral's extended schema, not OpenAI's.
`inspect.getdoc()` returns `str | None`, but `docstring_parser.parse()`
requires `str`. Functions without a docstring produced `None`, which
the type checker correctly flagged.
Coerce to `""` at the call site. `docstring_parser.parse("")` returns
an empty docstring whose `.description` and `.params` are already
handled by the surrounding `or ""` fallbacks, so runtime behavior is
unchanged.
`ToolsSchema.__init__` declared `standard_tools: list[FunctionSchema |
DirectFunction]`. Callers (`BaseLLMAdapter`, `MCPService`) pass in
`list[FunctionSchema]`, which is not assignable to the union list
because `list` is invariant in its element type.
Widen the parameter to `Sequence[...]` (covariant) so `list[X]` and
`list[X | Y]` both fit. A narrower `list[FunctionSchema]` is still
accepted, and nothing in this class mutates the argument — the
constructor immediately copies it via `_map_standard_tools`.
Also correct the `custom_tools` property return type to include
`None`, matching the stored `_custom_tools` field.
This single edit clears the pyright errors for three ignore-list
entries: `tools_schema.py`, `base_llm_adapter.py`, and `mcp_service.py`.
Two services were reading `_settings.model` (typed `str | _NotGiven |
None` because NOT_GIVEN is the default) and coercing it with `or ""`
or similar. `_NotGiven.__bool__` returns False, so the runtime
behavior happened to work, but the type was a lie — pyright saw
`str | _NotGiven` flowing into APIs that required `str` or `str | None`.
- `AIService._sync_model_name_to_metrics`: use `isinstance(model, str)`
narrowing with an empty-string fallback. Equivalent runtime behavior,
honest type, no truthiness dependency on a sentinel.
- `SarvamLLMService.__init__`: validate the model is a real string
before handing it to `_validate_model(str)`. A non-string model at
this point is a configuration bug; raise `ValueError` so the error
is clear and survives `python -O` (unlike an assert).
Three spots had the same shape: a field starts None, a later method
populates it, a read site later reads it. Pyright can't track the
cross-method invariant. Rather than spray assertions at the read
sites, fix each site at the structural level:
- `FastAPIWebsocketInputTransport._monitor_websocket` now takes the
session timeout as an argument. The task-creation site already
guards on truthiness, so the call can pass the non-None value
directly and the method's signature tells the truth.
- `FrameProcessorMetrics.task_manager` raises `RuntimeError` instead
of asserting. Asserts are stripped under `python -O`; a real raise
keeps the runtime safety net and still narrows the type for pyright.
- `SOXRStreamAudioResampler._maybe_initialize_sox_stream` returns the
initialized stream. Callers use the return value and never touch
the Optional `_soxr_stream` attribute, so narrowing stays inside
the init method where the invariant is established.
`ImageGenService.run_image_gen` and `VisionService.run_vision` were
declared `async def ... -> AsyncGenerator[Frame, None]` with `pass`
bodies. Without a `yield` anywhere in the body, Python treats the
function as a coroutine returning an `AsyncGenerator`, not as an async
generator itself, so callers got a coroutine where they expected an
iterator.
Add `raise NotImplementedError; yield` so the body contains a yield
(making this a real async generator) while still raising cleanly if a
subclass ever calls `super().run_*` by mistake.
Deepgram STT, Gradium TTS, Smallest STT, and xAI STT/TTS had exactly
one pyright error each, all of them the AsyncGenerator return-type
mismatch resolved in 08fe9157c. Remove them from the ignore list.
AssemblyAI, Cartesia, Gradium, and Soniox STT services sent audio over
the WebSocket without catching transient send failures, so a single
network hiccup could propagate an exception up through process_frame
and end the pipeline. Other push-based STT services (Deepgram, xAI,
Azure, Smallest, etc.) already guard their sends.
Follow the deepgram/stt.py pattern: log a warning and continue. The
existing connection-state check at the top of each call handles
recovery on the next invocation.
The push-based STT/TTS implementations send audio/text over a socket and
receive results via a separate receive task, so there is nothing to
yield inline. They yield `None` by design. The previous declaration of
`AsyncGenerator[Frame, None]` disagreed with that, while the consumer
(`AIService.process_generator`) already accepted `Frame | None`. Widen
the producer side (abstract base and every subclass) so the type honestly
describes the contract.
Pure annotation change; no runtime behavior difference.
Previously, six modules (adapters, audio, processors, serializers,
services, transports) were ignored wholesale. Many files in those
modules already pass type checking, but we had no way to protect them
from regressions or make the remaining work visible.
Switch the include list to src/pipecat so any new module is checked by
default, and replace directory-level ignores with the 140 specific
files that still fail. This puts 189 previously-untyped files under
type checking immediately and turns the remaining work into a concrete,
shrinking TODO list.
Moves src/pipecat/serializers into pyright's include list. Narrows
self._params to each subclass's InputParams in exotel, vonage, plivo,
twilio, genesys, and telnyx. In protobuf.py, renames the reassigned
frame local to avoid clobbering its Frame type and silences two dynamic
attribute accesses on the generated frames_pb2 module.
Also aligns telnyx and plivo hangup validation with twilio: if
auto_hang_up=True (the default) but required credentials are missing,
__init__ now raises ValueError instead of silently logging a warning
at call-end time. Previously a misconfigured serializer would construct
fine and fail to hang up the call later, leaving a phantom billable
session.
Collapse the separate fallback timer into the existing user_speech_timeout
timer, restarted when a transcript arrives without a VAD stop. stt_timeout
has no meaning on the fallback path, so the stt wait is marked done
immediately. This drops the _fallback_timeout_task / _fallback_expired
bookkeeping and the branched trigger condition.
Adds XAITTSService in the existing xai/tts.py module, alongside the
existing XAIHttpTTSService. Connects to xAI's streaming endpoint at
wss://api.x.ai/v1/tts, streams text.delta chunks up and base64 audio.delta
chunks down on the same connection so audio starts flowing before the full
utterance is synthesized.
Extends InterruptibleTTSService since xAI's protocol is strictly sequential
per connection and exposes neither a cancel verb nor a context ID — the
only way to stop an in-flight utterance is to tear down the WebSocket,
which is exactly what InterruptibleTTSService does on interruption when
the bot is speaking.
Voice, language, codec, and sample_rate are passed as query-string params
at connect time; runtime setting changes reconnect the socket. Defaults to
raw PCM so emitted TTSAudioRawFrame objects need no decoding downstream.
Splits the existing example into voice-xai.py (WebSocket) and
voice-xai-http.py (batch HTTP) so each variant has its own entry point.
Promotes the xai extra to depend on pipecat-ai[websockets-base] since the
new service imports the websockets library.
Remove `examples/` from the `pyrightconfig.json` ignore list and fix
the resulting type errors across all example files. Common fixes:
- Required API keys: `os.getenv("X")` -> `os.environ["X"]` so the
return type is `str` rather than `str | None`, and misconfiguration
fails fast.
- Narrow `LLMContextMessage` union members with `isinstance(..., dict)`
before dict-style access.
- `assert isinstance(params.llm, ...)` before calling service-specific
methods that aren't on the base `LLMService`.
- Guard optional frame fields (e.g. `LLMSearchResponseFrame.search_result`)
before use.
If the WebSocket handshake is cancelled or fails before `keepalive_task`
is assigned (e.g. an STTUpdateSettingsFrame triggers a reconnect during
initial connect), the `finally` block tried to cancel an unbound local.
Initialize `keepalive_task = None` before the try and guard the cancel.
New `XAISTTService` wraps xAI's real-time speech-to-text WebSocket
(`wss://api.x.ai/v1/stt`). It extends `WebsocketSTTService`, authenticates
with the `XAI_API_KEY` as a Bearer token on the WS handshake, and streams
raw audio (PCM/mu-law/A-law) with configurable interim results, endpointing,
language, multichannel, and diarization settings.
- `src/pipecat/services/xai/stt.py`: new service, settings dataclass, and
`language_to_xai_stt_language` helper.
- `src/pipecat/services/stt_latency.py`: `XAI_TTFS_P99` default.
- `pyproject.toml` / `uv.lock`: `xai` extra now pulls in `websockets-base`.
- `README.md`: link to xAI STT in the services table.
- `examples/voice/voice-xai.py`: swap DeepgramSTTService for XAISTTService so
the xAI voice example is fully xAI.
- `examples/transcription/transcription-xai.py`: new transcription-only
example using the new service.
SpeechTimeoutUserTurnStopStrategy previously collapsed two waits into
max(stt_timeout, user_speech_timeout), which over-waited for finalizing
STT services and could also end the turn early in a legacy code path.
Run them as independent timers instead:
- user_speech_timeout: policy floor, always runs to completion.
- stt_timeout: latency safety net, short-circuited by a finalized
transcript since STT has signaled it has nothing more to send.
The no-VAD fallback now waits only user_speech_timeout rather than
max(stt_timeout, user_speech_timeout); stt_timeout is defined relative
to VAD stop and has no meaning when no VAD event occurred. This
shortens the fallback wait for users who set stt_timeout greater than
user_speech_timeout.
* Fix Smallest AI TTS WebSocket endpoint URL to match API documentation
Update base URL from waves-api.smallest.ai to api.smallest.ai and
fix path prefix from /api/v1/ to /waves/v1/ per the v4.0.0 docs.
* Update keepalive using silent space message instead of unsupported flush
Pylance analyzes open files even when they're outside the `include`
set, producing noise in the editor. Adding these paths to `ignore`
suppresses diagnostics without affecting import resolution.
Some TTS providers (e.g. Inworld) return verbatim tokens where spaces and
punctuation are already embedded in the token text. When downstream consumers
join these tokens with an extra space they produce "hello , world" instead of
"hello, world".
Add an opt-in `includes_inter_frame_spaces: bool = False` parameter to
`add_word_timestamps` / `_add_word_timestamps`. The flag is threaded through
`_WordTimestampEntry` and stamped onto every emitted `TTSTextFrame`.
Defaults to `False` — no behaviour change for existing services.
`InworldTTSService` passes `includes_inter_frame_spaces=True` and stops
pre-processing tokens in `_calculate_word_times`, returning them verbatim.
Tests added to `test_tts_frame_ordering.py` covering both HTTP and WebSocket
delivery paths: verbatim text preservation, PTS ordering, text-before-audio
ordering, and the Inworld punctuation-token scenario.
Made-with: Cursor
The two logger.error lines in krisp_instance.py fired at module-load time
whenever anything transitively imported it (e.g. pipecat.turns.user_start
pulling in krisp_viva_ip_user_turn_start_strategy), producing noisy output
for users who never asked for Krisp. Drop the log calls and raise a more
informative ImportError that names the affected classes so direct
importers still get clear guidance.
- Fall back to Language.EN in _primary_detected_language when model is
flux-general-en, preserving prior behavior on the default model.
- Standardize example on DeepgramFluxSTTService.Settings and drop the
now-redundant DeepgramFluxSTTSettings import.
- Narrow the changed-behavior changelog to reflect that flux-general-en
frames still carry Language.EN.
Enables the flux-general-multi model with one or more language_hints.
Hints are sent as repeatable URL params at connect time and via a
Configure control message when updated mid-stream (detect-then-lock).
TranscriptionFrame.language now reflects the language Flux detected
for each turn via the TurnInfo `languages` field.
Add changelog entries for the pyright introduction and the
LiveKitRunnerArguments.token signature tightening. Restore the
indented multi-line format for the WhatsApp missing-env error,
now listing only the vars that are actually missing.
Make required parameters non-optional: LiveKitRunnerArguments.token,
_create_telephony_transport args. Use os.environ[] instead of
os.getenv() for required WhatsApp env vars. Guard spec/loader None
in module loading. Tighten sip_caller_phone guard in daily.py.
* VIVA SDK TT v3 support
* Format fix.
* Renamed the API naming, removed '3' from the name.
* Implementation of User turn start strategy using Krisp VIVA Interruption Prediction in scope of TT v3 support.
* Typo fix in voice-krisp-viva example to use KrispVivaFilter class
* style fix.
* test run error fixes.
* some test related changes.
* Fixed tests
* Stule fixes.
SentryMetrics.stop_ttfb_metrics and stop_processing_metrics called the
base FrameProcessorMetrics implementation but discarded its return
value (implicit `return None`). FrameProcessorMetrics.stop_ttfb_metrics
/ stop_processing_metrics build and return a MetricsFrame, which
FrameProcessor.stop_ttfb_metrics / stop_processing_metrics then pushes
downstream so observers (e.g. UserBotLatencyObserver,
MetricsLogObserver) can see TTFB / processing metrics.
Because SentryMetrics returned None, the FrameProcessor never pushed
the MetricsFrame, so any pipeline using metrics=SentryMetrics() on STT
/ LLM / TTS services silently lost all downstream TTFB and processing
MetricsFrames. The metrics were still calculated and logged
internally, and Sentry transactions still finished correctly, but
observers never saw them.
Forward the MetricsFrame returned by the base class so FrameProcessor
can push it into the pipeline.
Use Sequence[FrameProcessor] instead of list[FrameProcessor] in Pipeline,
ServiceSwitcher, and ServiceSwitcherStrategy parameters to accept subtype
lists. Add cast() in LLMSwitcher for narrowed return types. Guard against
None in task_observer._send_to_proxy and replace hasattr with truthiness
check in task._cleanup.
Widen base strategy process_frame return types to ProcessFrameResult |
None to match actual behavior (None treated as CONTINUE). Give
UserTurnCompletionLLMServiceMixin a FrameProcessor base class so pyright
can see create_task, cancel_task, process_frame, and push_frame.
Tighten LLMMessagesAppendFrame and LLMMessagesUpdateFrame message fields
from list[dict] to list[LLMContextMessage] to match actual usage. Add
type annotations on inline message lists in IVR navigator and voicemail
detector.
In token-streaming mode, _push_tts_frames previously stripped only
leading newlines and dropped any pure-whitespace frame. That silently
discarded meaningful inter-token whitespace (e.g. a standalone "\n"
token between "hello" and "world"), losing prosody cues and any
downstream sentence-boundary semantics.
Track whether a non-whitespace character has been sent in the current
context. While the flag is false, strip all leading whitespace; once
true, let whitespace tokens flow through. Reset the flag on
LLMFullResponseEndFrame/EndFrame and on interruption, and save/restore
it around TTSSpeakFrame since each utterance is its own context.
Sentence-aggregation mode preserves the existing behavior.
Group three co-assigned fields (_start_frame_id, _start_frame_arrival_ns,
_start_wall_clock) into a single _StartFrameInfo dataclass. This makes
the "always set together" invariant structural rather than implicit, and
fixes the incorrect str | None annotation on _start_frame_id (Frame.id
is int).
Add pyrightconfig.json with basic type checking for zero-error modules
(clocks, metrics, transcriptions, frames) and enforce via CI. The
include list will expand as modules are fixed.
* Improve HeyGen LiveAvatar plugin reliability and performance
- Add WebSocket ready gate: wait for session.state_updated connected
event before sending commands (prevents silently dropped messages)
- Add keep-alive mechanism: send session.keep_alive every 2.5 min to
prevent 5-minute inactivity timeout
- Optimize audio chunking: 600ms first chunk for faster initial
response, 1s subsequent chunks for efficient streaming
- Fix audio buffer flush: send remaining buffered audio on utterance
end instead of discarding it
- Fix WS state cleanup: properly reset connected/ready state when
WebSocket drops unexpectedly
- Add livekit_config passthrough in LiveAvatar session token creation
- Replace stray print() with logger.debug()
* Fix HeyGenOutputTransport.start() signature and use 400ms first chunk
- Update transport.py to match new client.start() signature (no
audio_chunk_size param)
- Change first chunk size from 600ms to 400ms per feedback
* Fix transport audio resampling and client.start() error propagation
- Add audio resampling in HeyGenOutputTransport.write_audio_frame() to
ensure audio is always 24kHz before sending to HeyGen (was sending
at pipeline sample rate, causing garbled audio)
- Raise exception on WS ready timeout instead of silently returning,
preventing transport from appearing ready when WS connection failed
* Fix session readiness gate to work with LITE mode
LITE mode does not send session.state_updated WS events. Instead,
use a dual-signal _session_ready event that fires on either:
- WS session.state_updated connected (FULL mode)
- LiveKit participant connected (LITE mode)
Also reorder start() to connect both WS and LiveKit before waiting,
since the WS events may depend on LiveKit being connected.
Verified with live sandbox session - all tests pass.
* Simplify session readiness to use only WS ready gate
Remove _session_ready dual-signal and use only _ws_ready, which fires
on the session.state_updated connected WS event. Increase timeout to
30s. LiveKit is connected before waiting so the WS event can arrive.
* Reduce WS ready gate timeout back to 10s
* Remove WS ready gate (session.state_updated not reliably received)
The session.state_updated connected event is not reliably received
via the websockets library. Remove the gate for now and assume the
session is ready after WS + LiveKit connect. Keep-alive, chunking,
buffer flush, state cleanup, and other improvements remain.
Mirrors the existing `from_string` classmethod and lets callers
turn a frame's `buttons` list back into a dial string like `"123#"`.
`__str__` and the Daily transport's native DTMF path reuse it.
The single-key `button` field on `OutputDTMFFrame` and
`OutputDTMFUrgentFrame` is kept as a first-class ergonomic shortcut
for the common single-keypress case, equivalent to
`buttons=[button]`. `buttons` takes precedence when both are set.
Replaces the string-based `tones` field with a type-safe
`buttons: list[KeypadEntry]` on `OutputDTMFFrame` and
`OutputDTMFUrgentFrame`, matching the existing singular `button`
field on `InputDTMFFrame`. A `from_string` classmethod builds the
list from a dial string like `"123#"` (invalid characters raise
ValueError from the `KeypadEntry` constructor).
The base output audio fallback now iterates `frame.buttons`
directly, LiveKit sends `frame.buttons[0].value`, and the Daily
transport joins the button values into the single string Daily's
`send_dtmf` expects.
Introduces a new `tones` field on `OutputDTMFFrame` and
`OutputDTMFUrgentFrame` for sending multi-digit DTMF sequences and
deprecates the existing single-key `button` field. When only `button`
is set, it is used as a single-character `tones` string for backward
compatibility.
`DTMFFrame` is kept as an empty marker class so both input and output
DTMF frames can still be identified via isinstance. `InputDTMFFrame`
keeps its required `button` field (single keypress semantics).
The Daily-specific `DailyOutputDTMFFrame` and
`DailyOutputDTMFUrgentFrame` frames no longer need to override
`button` and simply add `session_id` and `digit_duration_ms`, which
are forwarded to Daily's `send_dtmf` as `sessionId` and
`digitDurationMs`.
The base output audio fallback now iterates `tones` and generates a
tone per character; LiveKit's native DTMF path sends `tones[0]` since
its API is single-tone.
Introduces Daily-specific DTMF output frames that carry explicit
`tones`, `session_id` and `digit_duration_ms` fields, forwarded to
Daily's `send_dtmf` as `tones`, `sessionId` and `digitDurationMs`.
The inherited `button` and `transport_destination` fields are
ignored for these frames in the Daily transport.
When the LLM returned zero text tokens (e.g. it was interrupted before producing
tokens or about to push tokens), push_aggregation() returned an empty string and
on_assistant_turn_stopped was never emitted. This left consumers waiting for an
event that would never arrive.
Now on_assistant_turn_stopped always fires, with an empty content string when
the LLM produced no text tokens.
Fixes#4292
Only treat messages[0] as the initial system prompt when determining the
summarization range. Previously, the code scanned the entire context for
the first system-role message, which caused failures when the only system
message was a mid-conversation injection (e.g. "The user has been quiet").
In that case summary_start exceeded summary_end, producing an empty range
and "No messages to summarize" errors.
Fixes#4286
The enable_logging and enable_ssml_parsing URL params used truthy checks,
so False was treated the same as None (both skipped). Also, Python's
str(False) produces "False" but the API expects lowercase "false".
Additionally, add enable_logging support to ElevenLabsHttpTTSService
which was missing entirely.
When the STT p99 timeout fires without a transcript, the turn stop
strategy previously did nothing — falling through to the 5-second
user_turn_stop_timeout. Now, a _timeout_expired flag tracks when the
timeout has elapsed so that a late transcript triggers the turn stop
immediately instead of waiting for the fallback.
Previously settings updates were ignored with a TODO comment. Now when
model/language changes via STTUpdateSettingsFrame the service disconnects
and reconnects with the new query parameters.
Key changes:
- Implement _update_settings to disconnect/reconnect on changes
- Check `is not State.OPEN` in run_stt to catch CLOSING state
- Send `done` command before closing for clean session shutdown
- Capture websocket reference in _disconnect_websocket to prevent a
concurrent _connect from having its new connection nulled by a stale
finally block
The strategy schedules background tasks during setup. Fast-running
tests could observe state before those tasks had a chance to run;
yielding once via asyncio.sleep(0) ensures they do.
Enable callers to get a compact version of context messages suitable
for serialization, logging, and debugging tools. For standard
messages, known binary data (base64 images, audio) is fully elided.
For LLM-specific messages, long string values are recursively
truncated. Adapter get_messages_for_logging() methods now use this.
Example files can live under subdirectories (e.g. foundational/01.py),
so the recording path needs its parent directory created before the
audio file is written.
Replaces the per-frame asyncio.Event signaling with a monotonic
timestamp updated on each audio frame. The handler sleeps until the
next deadline (last_audio_time + timeout), recomputing on each wake-up
to account for audio arriving during sleep.
This avoids waking the handler on every audio frame (~50/s at 20ms
chunks), and guarantees detection latency is bounded by timeout rather
than 2 * timeout.
Also renames audio_starvation_timeout to audio_idle_timeout and
associated identifiers for consistency with existing pipecat naming
(user_idle_timeout, etc.).
These are TypedDicts (plain dicts at runtime), so no behavioral change
— just more descriptive type hints for readers. Use ToolParam instead
of FunctionToolParam for the Responses adapter to reflect that custom
non-function tools are supported. Use ChatCompletionToolParam instead
of Any for the completions adapter return type. Update tests to use
typed params in expected values.
During pipeline shutdown, proxy tasks must be cancelled before observer
resources are cleaned up. Previously, stop() was called inside
_cancel_tasks() and start() was called in _start_tasks(), which could
lead to proxy tasks still consuming frames after observer resources
were closed.
Now the lifecycle is explicit in _handle_start_frame: start() after all
observers are loaded, and stop() before cleanup() on shutdown.
Also fixes misleading variable name in TaskObserver.cleanup() where
iterating self._proxies yields observer keys, not Proxy values.
Fixes#4195
Move event.clear() from finally block to success path in
IdleFrameProcessor and UserIdleProcessor._idle_task_handler().
The finally block unconditionally cleared signals set during
async timeout callbacks, causing false-positive idle detection.
Closes#3402
* Add Inworld Realtime LLM service
Adds a WebSocket-based realtime service for Inworld's cascade
STT/LLM/TTS API with semantic VAD, function calling, and streaming
transcription support.
New files:
- src/pipecat/services/inworld/realtime/ (service, events)
- src/pipecat/adapters/services/inworld_realtime_adapter.py
- examples/foundational/19zb-inworld-realtime.py
Also includes:
- websockets dependency for inworld extra in pyproject.toml
- Adapter and settings tests matching OpenAI/Grok realtime patterns
- Fix for double-response when server-side VAD is enabled
* Prefer init-provided system instruction in Inworld Realtime
Adopt _resolve_system_instruction() from BaseLLMAdapter, matching the
pattern applied to OpenAI Realtime, Grok Realtime, Gemini Live, and
Nova Sonic in the pk/realtime-services-init-v-context-system-instructions-cleanup
branch.
* Update changelog entry with PR number
* Fix changelog format to use bullet point
* Polish PR: default model, example cleanup, changelog update
- Change default model from gpt-4.1-nano to gpt-4.1-mini
- Add function calling demo to example
- Remove demo-testing artifact from system instruction
- Mention Router support in changelog
* Address PR review feedback for Inworld Realtime
- Move example to examples/realtime/realtime-inworld.py
- Change initial context role from "user" to "developer"
- Remove explicit sample rates from example; sync them in
_ensure_audio_config so Inworld gets the transport's actual rates
- Add audio race condition guard in _handle_evt_audio_delta (matches
OpenAI realtime pattern)
- Convert remaining "system"/"developer" messages to "user" in adapter
- Add clarifying comment for local-VAD vs server-VAD metrics paths
* Simplify example, add provider tracking, remove local VAD path
- Remove function calling from example, switch model to xai/grok-4-1-fast-non-reasoning
- Add pipecat-realtime session key prefix and provider_data metadata
for Inworld traffic attribution
- Remove local VAD code path (Inworld only supports server-side VAD)
- Use typed InputAudioBufferAppendEvent for audio sends
* Default TTS model to inworld-tts-1.5-max
* Remove dead shimmed tools code, set STT/VAD defaults
- Remove non-functional AdapterType.SHIM custom tools code from adapter
- Default STT model to assemblyai/u3-rt-pro
- Default VAD eagerness to low
Integrate with Mistral's Voxtral TTS API (voxtral-mini-tts-2603) using
HTTP streaming with Server-Sent Events. Converts base64-encoded float32
PCM chunks from the API to int16 for the Pipecat pipeline.
After a reconnect, _ready_for_realtime_input was never set back to True
because _create_initial_response (which sets the flag) is only called on
initial connection. This caused all audio/video/text to be silently
dropped after reconnecting, making the bot appear to hang.
Set the flag in _handle_session_ready when we detect a reconnect, either
via session_resumption_handle (server restores state) or via existing
context (rare case where connection drops before first resumption handle).
After a reconnect, _ready_for_realtime_input was never set back to True
because _create_initial_response (which sets the flag) is only called on
initial connection. This caused all audio/video/text to be silently
dropped after reconnecting, making the bot appear to hang.
Set the flag in _handle_session_ready when context already exists
(i.e. reconnect case) since we don't need to go through
_create_initial_response again.
Remove the deprecation proxy infrastructure that allowed old-style flat
imports (e.g. `from pipecat.services.openai import OpenAILLMService`).
Users must now import from specific submodules
(`from pipecat.services.openai.llm import OpenAILLMService`), which is
already the established pattern across all internal code and 179+ examples.
- Strip 32 proxy `__init__.py` files to empty
- Strip 3 non-proxy files with bare star imports (minimax, sambanova, sarvam)
- Strip google/gemini_live `__init__.py` re-exports
- Remove DeprecatedModuleProxy class and helpers from services/__init__.py
- Remove ruff per-file ignore for services/__init__.py
- Fix 2 examples using old-style imports
Patch Pydantic's DICT_TYPES check in conf.py to accept Union-wrapped
dict types, fixing the autodoc import failure for models using
ConfigDict(extra="allow").
Make -W (warnings as errors) opt-in via --strict flag instead of
default, and update README to reflect uv-based workflow and current
directory structure.
Add tests for LLMRunFrame, LLMMessagesAppendFrame, LLMMessagesUpdateFrame,
and LLMMessagesTransformFrame sent upstream to LLMAssistantAggregator,
mirroring the existing LLMUserAggregator downstream tests. Add
frames_to_send_direction param to run_test helper to support this.
The previous approach required the caller to directly grab a reference to the context object, grab a "snapshot" of its messages *at that point in time*, transform the messages, and then push an `LLMMessagesUpdateFrame` with the transformed messages. This approach can lead to problems: what if there had already been a change to the context queued in the pipeline? The transformed messages would simply overwrite it without consideration.
Napoleon's Attributes section creates class-level attribute docs that
duplicate the __init__ parameter docs when napoleon_include_init_with_doc
is enabled. Using Parameters avoids the duplication.
- Remove expect_stripped_words from LLMAssistantAggregatorParams and related warnings
- Remove old multi-parameter on_push_frame observer signature support in TaskObserver
- Remove deprecated context field from UserImageRequestFrame
- Remove deprecated LiveKitTransportMessageFrame and LiveKitTransportMessageUrgentFrame
- Remove deprecated pipecat.turns.mute shim module
Replace Markdown code blocks with RST syntax in genesys.py, fix
deprecated directive transitions in nvidia and summarization modules,
remove stray bullet prefix in whisper arg docs, restructure code block
in turn completion mixin, and add deepgram mock to Sphinx conf.
Remove stale riva mock imports from autodoc_mock_imports since the riva
service was removed and nvidia-riva-client is installed during doc builds.
Add pipecat.turns and pipecat.extensions to import_core_modules() and
add Turns to the index.rst toctree. Regenerate uv.lock to reflect the
riva extra removal from pyproject.toml.
Move the FastAPI instance to module level so other packages can import
it and register routes before main() is called. main() now configures
the existing app with transport-specific routes instead of creating a
new one.
Deepgram's built-in VAD events were deprecated in 0.0.99 in favor of
Silero VAD. This removes vad_events from settings and LiveOptions,
the should_interrupt parameter, the vad_enabled property,
_on_speech_started/_on_utterance_end handlers, and simplifies
_on_message and process_frame accordingly.
Remove the send_transcription_frames parameter from OpenAI Realtime LLM
(deprecated since 0.0.92). Also fix undefined _warn_deprecated_param
calls in both OpenAI and xAI realtime services, replacing them with the
existing _warn_init_param_moved_to_settings method.
UserBotLatencyLogObserver (deprecated 0.0.102) is replaced by
UserBotLatencyObserver. UserIdleProcessor (deprecated 0.0.100) is
replaced by LLMUserAggregator with user_idle_timeout.
The _self_queued_frames set and _internal_queue_frame wrapper were used
to prevent re-processing SpeechControlParamsFrame that the aggregator
queued to itself. Now that the frame is no longer special-cased, this
tracking is unnecessary. Also removes unused FrameCallback import.
Adds `enable_prompt_caching` setting to `AWSBedrockLLMSettings`. When
enabled, appends `cachePoint` markers to system prompts and tool
definitions in ConverseStream requests.
This can reduce TTFT by up to 85% for multi-turn conversations where
the system prompt stays constant (e.g. voice agents, chat assistants).
Follows the same pattern as `AnthropicLLMService.enable_prompt_caching`.
Usage:
```python
llm = AWSBedrockLLMService(
settings=AWSBedrockLLMSettings(
model="au.anthropic.claude-haiku-4-5-20251001-v1:0",
enable_prompt_caching=True,
),
)
```
See: https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
Remove EmulateUserStartedSpeakingFrame, EmulateUserStoppedSpeakingFrame
(deprecated since v0.0.99), and the emulated field from
UserStartedSpeakingFrame and UserStoppedSpeakingFrame. Clean up the
handling code in base_input.py and a stale comment in nova_sonic/llm.py.
The interruption_strategies mechanism was deprecated in v0.0.99 in favor
of LLMUserAggregator's user_turn_strategies. All evaluation logic was
already removed — this removes the remaining field definitions, property,
StartFrame propagation, conditional check in base_input.py, strategy
files, and test.
This field was deprecated in v0.0.99 in favor of LLMUserAggregator's
user_turn_strategies / user_mute_strategies parameters. Since the default
was True (interruptions allowed), removing the guards keeps the current
default behavior.
The location and project_id fields were deprecated since 0.0.90 in
favor of direct __init__ parameters. Now that InputParams is removed,
project_id is required and location defaults to "us-east4" directly
in the signature.
Override _update_settings in CartesiaTTSService to flush the current
audio context and assign a new turn context ID when voice, model, or
language settings change. This prevents Context has closed errors
from Cartesia API, which locks these parameters per context.
Remove the deprecated text_aggregator parameter from TTSService,
CartesiaTTSService, and RimeTTSService, and the deprecated text_filter
parameter from TTSService. Users should use LLMTextProcessor before
the TTS service instead. Update the voice-switching example to use
LLMTextProcessor with PatternPairAggregator.
Change the default from 10s to None so deferred function calls can run
indefinitely when no timeout is configured. Only create the timeout
task when a timeout is actually provided (per-call or service-level).
Add SmallestSTTService using the Pulse WebSocket API for real-time
transcription. Includes SmallestSTTSettings dataclass, 32-language
support with resolve_language fallback, VAD-driven finalize signal,
and SMALLEST_TTFS_P99 latency constant.
Also adds X-Source and X-Pipecat-Version headers to Smallest STT
and TTS WebSocket connections.
Example files like openai.py shadow installed packages when Python adds the
script directory to sys.path. Prepend the parent folder name to each example
file (e.g. openai.py -> function-calling-openai.py). Also split
thinking-and-mcp/ into separate mcp/ and thinking/ directories.
Now that LLMContextFrame is the only frame that provides a context,
remove the intermediate `context = None` / `if context:` pattern
and handle context processing directly in the isinstance branch.
Replace the nested services/speech/ and services/function-calling/ with
top-level voice/ and function-calling/ directories. Update eval script
paths and README to match.
Move 304 examples from a flat numbered directory into 14 descriptive
subfolders: getting-started, services (speech + function-calling),
transcription, vision, realtime, persistent-context,
context-summarization, update-settings (stt/tts/llm), turn-management,
thinking-and-mcp, transports, video-avatar, video-processing, and
features.
Strip numbered prefixes from filenames (e.g. 07c-interruptible-deepgram.py
becomes services/speech/deepgram.py) since the folder context makes them
redundant. Keep numbered prefixes only in getting-started/ where ordering
matters.
Update eval script paths and README to match the new structure.
Add WebsocketLLMService as a base class for WebSocket-based LLM services,
parallel to WebsocketTTSService/WebsocketSTTService but codifying a
transactional request-response model rather than a continuous background
receive loop.
WebsocketLLMService provides:
- Connection lifecycle (start/stop/cancel → connect/disconnect)
- _ws_send/_ws_recv with transparent ConnectionClosed handling
(auto-reconnect via exponential backoff → WebsocketReconnectedError)
- _ensure_connected with retry via _try_reconnect
OpenAIResponsesLLMService now inherits from WebsocketLLMService, removing
duplicated connection management code (_connect, _disconnect, _reconnect,
_ensure_connected, _ws_send, start, stop, cancel) and simplifying
_process_context from a loop with attempt tracking to a flat try/except
with a single retry.
When a user interruption causes the LLM chunk stream to exit early,
function call arguments may be incomplete JSON. Wrap json.loads() in
try/except JSONDecodeError to skip malformed function calls with a
warning instead of crashing. Fixes#2461.
Buffer raw bytes and only decode after splitting on newline boundaries,
preventing multi-byte UTF-8 characters from being split at chunk edges.
Fixes#3538
A single service failing to reconnect should not kill the entire
pipeline. Non-fatal errors flow through the pipeline so application
code (e.g. ServiceSwitcher) can handle failover to a backup service.
When a WebSocket server accepts the handshake but immediately closes the
connection (e.g. invalid API key returning close code 1008), the existing
exponential backoff does not help because the handshake keeps succeeding.
This tracks how long each connection survives and emits a non-fatal
ErrorFrame after 3 consecutive sub-5s failures, allowing ServiceSwitcher
failover instead of killing the pipeline.
Fixes#3711
- Use finally block in _disconnect to ensure state is always cleaned
up, even if websocket.close() throws — prevents stale cancellation
state (e.g. _cancel_pending_response) from polluting a new connection
- Catch ConnectionClosed in _drain_cancelled_response alongside
TimeoutError — prevents _needs_drain from staying True and bricking
the service on every subsequent inference attempt
- Fall back to OPENAI_API_KEY env var when api_key is not passed,
since the WebSocket connection uses raw websockets (not the
AsyncOpenAI client which handles this automatically)
- Use _clear_cancellation_state() instead of piecemeal resets where
appropriate
Instead of trying to filter stale events inline (unreliable — the API
doesn't provide a way to correlate events to a specific response),
drain remaining events from a cancelled response before starting the
next one. On cancellation, send response.cancel and set a drain flag.
At the start of the next _process_context, read and discard events
until a terminal event arrives, ensuring a clean connection. Falls
back to reconnecting if draining times out.
Over HTTP, previous_response_id requires store=True (30-day OpenAI-side
conversation storage). The WebSocket variant avoids this via a
connection-local in-memory cache that works with store=False. Add
comments explaining this in both class docstrings, at the store=False
parameter, and in the adapter's previous_response_id note.
Add detailed trace-level logging to _apply_previous_response_optimization
showing why the optimization was applied or fell back to full context,
including the relevant data for debugging.
Use append_to_context=False for the filler TTSSpeakFrame in the
function-calling example to avoid altering the conversation history
and breaking the previous_response_id prefix match.
When using previous_response_id, the server already knows its own
output from the previous response. Store the raw response output and,
on the next call, compare it against the items following the matched
input prefix — checking role and text content for messages, and call_id
for function calls. If the items match, skip them and send only truly
new input (user messages, tool results). Falls back to full context if
either the prefix or the output comparison fails.
Introduce a WebSocket variant of the OpenAI Responses API service that
maintains a persistent connection to wss://api.openai.com/v1/responses
for lower-latency inference. The WebSocket variant automatically uses
previous_response_id to send only incremental context when possible,
falling back to full context on reconnection or cache miss.
The WebSocket variant becomes the new default OpenAIResponsesLLMService,
and the HTTP variant is renamed to OpenAIResponsesHttpLLMService. Both
share a private base class with common settings, parameter building,
and run_inference (always HTTP) logic.
Update langchain 0.3→1.2, langchain-community 0.3→0.4, and
langchain-openai 0.3→1.1. This also unblocks openai>=2.26 which
was previously constrained by the now-removed openpipe package.
OpenPipe was acquired by CoreWeave in September 2025. The Python package
hasn't been updated since June 2025 and the repo since 2024. The openpipe
package caps openai<=1.97.1, creating dependency conflicts with other
extras. Remove the dead integration to clean up the codebase.
- Add Nebius LLM service wrapping OpenAI-compatible Token Factory API
- Set supports_developer_role = False (Nebius rejects developer role)
- Default to openai/gpt-oss-120b model (supports function calling)
- Add Nebius function-calling example and env.example entry
- Fix Sarvam developer role support
- Update examples to use developer role for intro messages
Adds an OpenAI-compatible LLM service for Nebius Token Factory, supporting
open-source models (Meta Llama, Qwen, DeepSeek) via their OpenAI-compatible
REST API at https://api.tokenfactory.nebius.com/v1/.
When the remote side disconnects while send() is in flight, send() was
setting _closing=True. This prevented the receive loop from firing
on_client_disconnected, causing the pipeline to hang waiting for a
disconnect signal that never came.
The fix removes _closing from send() (that flag means we initiated the
close) and instead checks Starlette application_state in _can_send()
to suppress subsequent sends after a failure.
Fixes#3912
Add `await asyncio.sleep(0)` after `create_task()` calls in
UserIdleController, SpeechTimeoutUserTurnStopStrategy,
TurnAnalyzerUserTurnStopStrategy, and UserTurnCompletionLLMServiceMixin
so the event loop schedules the newly created timer tasks before the
caller continues.
Gemini 3.1 Flash Live won't reliably report ending its turn until
after it says something following a tool call. Restructure the system
instruction so the model says goodbye *after* calling
end_conversation, and add a comment explaining the deferred EndFrame
behavior that makes this work.
The base serializer filters out RTVI protocol messages by default
(ignore_rtvi_messages=True) to prevent them from being sent over
telephony media streams. ProtobufFrameSerializer is used by WebSocket
transports, which are the delivery channel for these messages, so
disable the filter there.
All recent Gemini Live models (including the default
gemini-2.5-flash-native-audio-preview-12-2025, and going at least as
far back as gemini-2.5-flash-native-audio-preview-09-2025) only
support AUDIO as a response modality. We considered using
`modalities=TEXT` as a Pipecat-level signal to suppress audio output
frames (so developers could pair Gemini Live with an external TTS),
but the output transcription from the API arrives too late relative
to the audio to be useful for driving an external TTS service.
For now, just log a warning when a TEXT modality is configured
(at init or via set_model_modalities) and proceed as normal. The 26d
text-modality example is removed since it no longer represents a
viable configuration.
The heartbeat monitor timeout (`HEARTBEAT_MONITOR_SECS`) was a static
module-level constant that never derived from the user-configurable
`heartbeats_period_secs`. This meant overriding the heartbeat interval
had no effect on the monitor window, causing spurious warnings or
delayed detection depending on the configured interval.
Add a new `heartbeats_monitor_secs` parameter to `PipelineParams` so
the monitor timeout is independently configurable (defaults to 10s).
The monitor handler now reads from the instance param instead of the
hard-coded constant.
Made-with: Cursor
This fixes the 26c example when using Gemini 3.1 Flash Live, which seems to be more strict about not receiving real-time input (at least, video messages) before conversation history.
Rime's WebSocket API sends a done message when synthesis completes.
Handle it to stop TTFB metrics, push TTSStoppedFrame, and remove the
audio context immediately instead of relying on the 3-second
stop_frame_timeout_s fallback.
DailyCallbacks gained a required on_dtmf_event field in PR #4047.
PR #4079 fixed this for TavusTransportClient but
LemonSliceTransportClient.setup() was not updated, causing a pydantic
ValidationError at pipeline setup time.
Only trigger handle_error for ErrorFrames originating from the active
service, not any managed service. This prevents edge cases where errors
from a non-active service could incorrectly trigger failover.
Expose a public method for retrieving all stored memories outside the
pipeline, avoiding the need for callers to reimplement client branching,
OR filter construction, and asyncio.to_thread wrapping. Simplify the
example get_initial_greeting() to use it.
Move blocking Mem0 API calls off the event loop using asyncio.to_thread().
Store messages as a fire-and-forget background task via create_task() since
the result is not needed. Insert memory messages at the configured position
in the context instead of always appending.
Closes#1741
Deepgram SDK 6.x surfaces connection errors from send_media() instead
of silently swallowing them. This causes error floods when the WebSocket
disconnects since every queued audio frame hits the dead connection.
Wrap send_media() in try/except: on failure, log one warning and set
self._connection = None so subsequent frames skip until the existing
_connection_handler reconnects.
In some cases the openai provider could answer with a `chunk.choices[0].delta.audio = None`, so the process context fails with error:
```
pipecat/services/openai/base_llm.py:552): Error during completion: 'NoneType' object has no attribute 'get'
```
When an InterruptionFrame arrives, the Python-side audio task is
cancelled but frames already submitted to rtc.AudioSource continue
playing from its internal buffer. This causes the bot to keep speaking
for several seconds after being interrupted.
Fix by overriding process_frame in LiveKitOutputTransport to call
audio_source.clear_queue() on InterruptionFrame, immediately flushing
the buffered audio.
ErrorFrames propagating upstream from downstream processors (e.g. TTS) would
enter the ServiceSwitcher via process_frame, traverse the active service sub-pipeline,
and reach push_frame where they incorrectly triggered failover. Now only errors whose
processor is one of the managed services trigger handle_error. Also fix the log in
handle_error to attribute errors to the actual source processor rather than the
current active_service.
Closes#4139
Gemini 3.x can bundle multiple fields (e.g. model_turn and
output_transcription) on the same server_content message. The previous
elif chain would only process the first matching field and silently
drop the rest. Switch to independent if checks so every field is
handled.
When Gemini Live was configured with local VAD (server-side VAD disabled),
the service was listening for the wrong frame types and not sending
ActivityStart/ActivityEnd events to the server. Now it listens for
VADUserStartedSpeakingFrame/VADUserStoppedSpeakingFrame and sends the
appropriate activity signals when local VAD is in use.
Also removes the unnecessary local SileroVADAnalyzer from server-side VAD
examples and adds a new 26a example demonstrating local VAD configuration.
Set GRPC_VERBOSITY=ERROR by default so users do not see noisy fork
handler and abseil warnings from the gRPC C library. Users can still
override by setting GRPC_VERBOSITY themselves.
nvidia-riva-client 2.25.1 ships with gencode compiled against protobuf
6.31.1, which requires a runtime >= 6.31.1. Update protobuf from 5.29.6
to >=6.31.1,<7 and grpcio-tools from 1.67.1 to 1.78.0 to match.
Regenerate frames_pb2.py with the new compiler.
Add DeepgramFluxSageMakerSTTService that combines SageMaker's HTTP/2
transport with Flux's JSON turn detection protocol (StartOfTurn,
EndOfTurn, EagerEndOfTurn, TurnResumed). Includes mid-stream Configure
support, silence watchdog, and an example bot.
Both GrokLLMService and XAIHttpTTSService use the same xAI API (api.x.ai),
so move Grok source files into the xai module. Leave deprecation shims in
the old grok/ paths for backward compatibility.
- Rename XAITTSService → XAIHttpTTSService and XAITTSSettings → XAIHttpTTSSettings
- Add language_to_xai_language() with explicit LANGUAGE_MAP using resolve_language()
- Remove deprecated InputParams, params, voice, language init params
- Remove XAI_DEFAULT_SAMPLE_RATE and XAI_PCM_CODEC constants; add encoding param
- Set sample_rate=None default (picked up from PipelineParams or user)
- Use Language.EN enum instead of string "en" for default language
- Add changelog/4031.added.md
- Add 07e-interruptible-xai.py foundational example
- Update 14g-function-calling-grok.py to use XAIHttpTTSService
- Register 07e in run-release-evals.py
Test that OpenAI Realtime, Grok Realtime, and Nova Sonic adapters
prefer init-provided system_instruction over context-provided, warn
on conflicts, and don't warn for developer messages.
Add system_instruction parameter to the Grok Realtime adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Update the Grok Realtime example to use settings.system_instruction
instead of session_properties.instructions.
Add system_instruction parameter to the OpenAI Realtime adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Add system_instruction parameter to the Nova Sonic adapter's
get_llm_invocation_params() and call _resolve_system_instruction() to
prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Remove the service-side fallback logic, as the adapter now handles
resolution.
Pass self._system_instruction_from_init to the adapter's
get_llm_invocation_params(), which calls _resolve_system_instruction()
to prefer init-provided over context-provided system instructions and
warn on conflicts. Previously context-provided took precedence.
Also fix the reconnect check to only reconnect when the resolved
system instruction actually differs from what the initial connection
used, avoiding unnecessary reconnects.
Update documented event signatures to include transcript argument
where the code actually passes it. Remove stale on_speech_started
and on_utterance_end entries that were never registered.
Fires after the final transcript is pushed in both Pipecat and
AssemblyAI turn detection modes, giving users a reliable hook
that arrives after all transcript frames. Matches the existing
Deepgram Flux on_end_of_turn pattern.
The previous default (meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo) is
no longer available as a serverless Together.ai model and now requires a
custom deployment. The new default is openai/gpt-oss-20b, one of
Together's recommended models for small & fast use-cases.
OpenAI-compatible services that don't support the "developer" message
role can now set supports_developer_role = False on the service class.
BaseOpenAILLMService passes this as convert_developer_to_user to the
adapter, which converts developer messages to user messages before
sending them to the API. Applied to Cerebras and Perplexity.
Also removes the now-redundant developer→user conversion step from
PerplexityLLMAdapter (handled by the parent adapter via the flag).
_system_instruction_from_init was being set from the deprecated
`system_instruction` constructor parameter instead of
`self._settings.system_instruction`, so system instructions provided
via settings were silently ignored.
OpenAI Realtime, Grok Realtime, and AWS Nova Sonic adapters now convert
"developer" role messages to "user" (consistent with all other non-OpenAI
adapters). Previously these messages were silently dropped. Adds starter
unit tests for all three realtime adapters.
These messages are developer instructions to the assistant (e.g. "Please
introduce yourself to the user"), not simulated user input. The
"developer" role is semantically correct for this purpose.
Developer messages are now always converted to "user" in non-OpenAI
adapters, never promoted to the system instruction. This removes an
inconsistency where adding an unrelated message to context would change
whether a developer message got promoted.
Simplifications:
- Rename _extract_initial_system_or_developer → _extract_initial_system
- Return Optional[str] instead of Tuple (role is always "system")
- Drop initial_context_message_role from _resolve_system_instruction
- Drop system_role fields from all ConvertedMessages dataclasses
When the only message in context was a system message,
_extract_initial_system_or_developer would convert it to "user" (to
prevent empty history) without warning about the conflict with
system_instruction. Now warns inline before converting, with a message
explaining both the conflict and the user-role conversion.
Two goals:
1. Centralize system_instruction vs context system message resolution into
the LLM adapters. This eliminates duplication between in-pipeline and
out-of-band (run_inference) code paths across ~16 locations in service
llm.py files.
2. Add support for "developer" role messages in conversation context, which
is facilitated by the above centralization.
Shared helpers on BaseLLMAdapter:
- _extract_initial_system_or_developer: extracts/converts messages[0]
based on role and whether system_instruction is provided
- _resolve_system_instruction: warns on conflicts between system_instruction
and context system messages, returns the effective instruction
Developer message handling (new):
- Non-OpenAI adapters: an initial "developer" message is promoted to the
system instruction when no system_instruction is provided; otherwise it
is converted to "user". Subsequent "developer" messages are always
converted to "user". No conflict warning is emitted for developer
messages (unlike "system" messages).
- OpenAI adapter: "developer" messages pass through in conversation
history without triggering conflict warnings.
- OpenAI Responses adapter: "developer" messages are kept as "developer"
role (same as "system", which is also converted to "developer" for the
Responses API).
Other behavior changes:
- Gemini: "initial" system message detection now checks messages[0] only
(previously searched anywhere in the list)
- Bedrock: a lone system message is now converted to "user" instead of
being extracted to an empty message list (matches existing Anthropic
behavior)
Route LLMFullResponseEndFrame through the serialization queue instead
of pushing it directly downstream when push_text_frames is enabled.
This ensures the frame is emitted only after the audio context is
fully drained, preserving correct ordering relative to TTSTextFrames.
Previously, the final sentence TTSTextFrame would arrive at the
LLMAssistantAggregator after LLMFullResponseEndFrame, causing it to
be dropped from the conversation context (especially with RTVI text
input where no subsequent interruption would flush the orphaned text).
Cancel the deferral timeout task and clear the pending EndFrame during
disconnect, which could otherwise be left dangling after a
CancelFrame-triggered shutdown.
When an interruption arrives before any LLM text reaches run_tts, the
turn context ID exists but was never registered via create_audio_context.
Calling flush_audio for this unregistered context sends a message to the
provider (e.g. ElevenLabs) with a context_id it has never seen, which
implicitly creates a server-side context that is never closed. After
enough rapid interruptions these phantom contexts accumulate and exceed
the providers limit (ElevenLabs: 5 simultaneous contexts, 1008 policy
violation).
Guard the flush call with audio_context_available so it only fires when
the context was actually opened.
Fixes#4114
When an EndFrame arrives while the bot is mid-response, it is deferred
until turn_complete is received. If turn_complete never arrives, the
EndFrame gets stuck forever and the pipeline hangs indefinitely.
Add a 30-second timeout: if turn_complete hasn't arrived by then, the
deferred EndFrame is released anyway with a warning log. The timeout
is cancelled if turn_complete arrives normally.
We observed a case where a deferred EndFrame was never released in
Gemini Live, causing the pipeline to hang indefinitely. The EndFrame
deferral mechanism waits for _handle_msg_turn_complete to set
_bot_is_responding back to False, but turn_complete messages were only
processed if they also contained usage_metadata. If Gemini ever sent
turn_complete without usage_metadata, the message would be silently
dropped and the deferred EndFrame would never be released.
Now turn_complete is always handled regardless of usage_metadata
presence, with usage_metadata processing only when available.
Note: we have not actually observed a turn_complete without
usage_metadata in practice, so this is a theoretical fix for the
EndFrame-deferral hang. The actual root cause of the observed hang
may lie elsewhere.
- Route audio through audio contexts (append_to_audio_context) instead of
pushing frames directly, enabling proper turn management and interruptions
- Add push_stop_frames and push_start_frame so the base class handles
TTSStartedFrame/TTSStoppedFrame lifecycle
- Remove manual context_id tracking (self._context_id) in favor of
get_active_audio_context_id()
- Don't call remove_audio_context on "complete" — Smallest sends one
per request, not per turn; let the base class timeout handle cleanup
- Guard v2-only params (consistency, similarity, enhancement) so they
aren't sent to lightning-v3.1
- Remove request_id from request payload (not a documented request field)
- Add flush_audio override to send flush to WebSocket
Adds SmallestTTSService, a WebSocket-based TTS service using Smallest AI's
Lightning v3.1 model. Follows current Pipecat service conventions:
- SmallestTTSSettings dataclass with runtime-updatable settings (voice,
language, speed, etc.)
- Reconnects on model change; keepalive every 30s to prevent idle timeout
- TTS settings default to None so the API applies its own defaults
- Model enum: SmallestTTSModel.LIGHTNING_V3_1
Includes a foundational example (07zl-interruptible-smallest.py) using
Deepgram STT + Smallest TTS + OpenAI LLM.
STT integration will follow in a separate PR once the hallucination/finalize
behaviour is resolved.
Made-with: Cursor
Gets Gemini 3 support to the point where it works with:
- The "legacy" pattern from the previous (removed) 26- example
- inference_on_context_initialization=True (the default)
- inference_on_context_initialization=False
Add `domain` field to AssemblyAISTTSettings to support AssemblyAI's
streaming API `domain` query parameter, enabling specialized recognition
modes like Medical Mode (`medical-v1`).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add warnings in SpeechTimeoutUserTurnStopStrategy and
TurnAnalyzerUserTurnStopStrategy when stop_secs differs from the
recommended default (0.2s) or when stop_secs >= STT p99 latency,
which collapses the STT wait timeout to 0s. Document the stop_secs=0.2
assumption in stt_latency.py.
Send a setup message with client_req_id before the first text message
for each context, matching Gradium multiplexing protocol. This allows
Gradium to associate each session with its setup configuration when
using close_ws_on_eos=False.
The AudioHook protocol requires every message to carry a `parameters`
object. `_create_message` conditionally included it only when parameters
were truthy, so pong responses and closed responses without
outputVariables were sent without the field.
Clients that validate message structure (including the Genesys reference
implementation) rejected these messages, which broke server sequence
tracking and prevented outputVariables from reaching the Architect flow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Align with the OpenTelemetry GenAI semantic convention
gen_ai.system_instructions for system prompts. The old "system"
attribute name was unrelated to gen_ai.system (which is for
provider name).
Replace adapter-based extraction in traced_llm with direct reads from
_settings.system_instruction (priority) and context messages (fallback).
The old approach had three bugs: signature mismatch with Anthropic
adapter, key name inconsistency, and unnecessary overhead from full
message/tools conversion.
Also deduplicate the system instruction in spans -- it was appearing as
both "system" and "param.system_instruction".
These services were pushing audio frames directly via push_frame() in their
WebSocket receive loops, bypassing the base TTSService audio context
serialization queue. This causes incorrect frame ordering and broken
interruption handling.
Changes per service:
- Fish Audio: use append_to_audio_context(), replace _handle_interruption
with on_audio_context_interrupted()
- LMNT: use append_to_audio_context(), remove redundant push_frame override
- Neuphonic: use append_to_audio_context(), remove redundant push_frame and
process_frame overrides (base class handles pause/resume)
- Rime NonJson: use append_to_audio_context(), remove redundant push_frame
override
The LLMContext format already matches the expected Responses API
shape for input_audio, so no adapter conversion will be needed
once OpenAI enables support.
The API-provided full model name is more specific than the
user-provided model name (e.g. includes version/snapshot details).
Reorder the lookup in _get_model_name and add a comment where the
Responses service sets the field.
The override would re-add `instructions` after the adapter had
intentionally converted it to a developer message for empty contexts.
Added a regression test.
Rewrite docstrings to more clearly explain what SyncParallelPipeline
does: hold all output until every parallel branch finishes, so frames
produced in response to a single input are released together.
Adds a FrameOrder enum with ARRIVAL (default, existing behavior) and
PIPELINE (pushes frames in pipeline definition order). This lets callers
guarantee output ordering between parallel pipelines — e.g. ensuring
image frames precede audio frames — without needing a separate reordering
processor downstream.
Updates the 05-sync-speech-and-image example to use FrameOrder.PIPELINE,
removing the ImageBeforeAudioReorderer class entirely.
Add a processor after SyncParallelPipeline that ensures each image frame
precedes its corresponding TTS audio frames. SyncParallelPipeline batches
them together but doesn't guarantee branch ordering. The reorderer detects
when TTS frames arrive before their image (via context_id tracking) and
holds them until the image arrives.
Also rename ImageAudioSync to MarkImageForPlaybackSync for clarity.
Add a `sync_with_audio` field to `OutputImageRawFrame` that routes image
frames through the audio queue in the output transport, ensuring images
are only displayed after all preceding audio has been sent. This enables
proper audio/image synchronization in pipelines like the calendar month
narration example.
Update the 05-sync-speech-and-image example to use an `ImageAudioSync`
processor that sets this flag on image frames.
The FrameProcessor two-queue architecture processes SystemFrames and
non-SystemFrames on separate concurrent async tasks. Both paths called
SyncParallelPipeline.process_frame(), which used the same per-pipeline
sink queues. A SystemFrame's wait_for_sync could steal frames from a
concurrent non-SystemFrame's wait_for_sync, corrupting synchronization
and stalling the pipeline.
This was triggered by the auto-embedded RTVI processor (added in
v0.0.101) which floods OutputTransportMessageUrgentFrame SystemFrames
through the pipeline during LLM responses.
Fix: SystemFrames (except EndFrame) now take a fast path — passed
through internal pipelines and pushed downstream directly without
touching the sink queues or drain logic. EndFrame retains the full
drain behavior as a lifecycle frame.
- Add WakePhraseUserTurnStartStrategy for gating interaction behind wake
phrase detection, with timeout and single_activation modes
- Add default_user_turn_start_strategies() and
default_user_turn_stop_strategies() helper functions
- Deprecate WakeCheckFilter in favor of the new strategy
- Extend ProcessFrameResult to stop strategies for short-circuit evaluation
- Fix MinWordsUserTurnStartStrategy including filtered text in output
* Fix empty user transcription causing spurious interruption in Nova Sonic
Skip _report_user_transcription_ended() when _user_text_buffer is empty,
which happens when the initial prompt is text-only. Previously, an empty
TranscriptionFrame was pushed upstream, triggering a chain reaction:
on_user_turn_stopped → UserStartedSpeakingFrame → interruption →
premature BotStoppedSpeaking → multiple response start/stop cycles.
* Improve TextFrame and assistant end of turn logic
Now, SPECULATIVE text results are used to push the LLMTextFrame,
AggregatedTextFrame, and TTSTextFrame. Additionally, the TTSTextFrames
are push at the end of the corresponding audio segment.
* Remove BotStoppedSpeakingFrame fallback from Nova Sonic
Now that assistant response end is detected directly from Nova Sonic
contentEnd events (END_TURN and INTERRUPTED), the BotStoppedSpeakingFrame
handler is no longer needed. Inline the cleanup logic in reset_conversation.
* Remove duplicate reconnection logic from Gradium STT
The _receive_messages method had its own while-True reconnect loop,
duplicating the reconnection handling already provided by
WebsocketService._receive_task_handler (exponential backoff, max
retries, error reporting). Flatten to just the inner message loop
and let the base class handle reconnection.
* Align Gradium STT VAD handling with base class patterns
Replace the process_frame override with a _handle_vad_user_stopped_speaking
override, which is the proper hook provided by STTService. Move
start_processing_metrics() into run_stt (matching Gladia's pattern).
Remove unused FrameDirection and VADUserStartedSpeakingFrame imports.
* Add transcript aggregation delay after flushed to capture trailing tokens
Gradium flushed response can arrive before all text tokens have been
delivered. Instead of finalizing immediately on flushed, start a short
timer (100ms) that allows trailing tokens to accumulate before pushing
the final TranscriptionFrame.
* Add changelog for PR #4066
* Change default encoding to pcm_16000
* Decouple encoding from sample_rate in Gradium STT
The encoding parameter now takes just the base type (pcm, wav, opus)
and the sample rate is derived from the pipeline audio_in_sample_rate,
assembled dynamically via input_format_from_encoding(). This fixes the
mismatch where SAMPLE_RATE=24000 was passed to the base class while
encoding defaulted to pcm_16000.
Set store=False in Responses API calls since we send full conversation
history as input items and don't use previous_response_id.
Add 5 run_inference tests for OpenAIResponsesLLMService using real
LLMContext and adapter (only HTTP client mocked).
Uses real LLMContext and adapter (only HTTP client is mocked) to test
basic inference, client exception propagation, system_instruction
override, empty context fallback, and max_tokens override.
Add OpenAIResponsesLLMService using the Responses API, with a dedicated
adapter that converts LLMContext messages to Responses API input items
(system→developer, tool_calls→function_call, tool→function_call_output,
multimodal content conversion, and tools schema flattening).
- New adapter: open_ai_responses_adapter.py
- New service: openai/responses/llm.py
- Examples: 07-interruptible and 14-function-calling variants
- 19 unit tests for adapter conversion logic
- Eval entries for both examples
List-valued settings like keyterm, keywords, search, redact, and replace
were being converted to strings before being passed to the SDK connect()
method. The SDK expects lists so its encode_query can produce repeated
query params (keyterm=a&keyterm=b).
When an LLM returns a tool call with no arguments (arguments=null in
the streaming chunks), the tool call is silently dropped because:
1. `tool_call.function.arguments` is None, so nothing is accumulated
and `arguments` stays as "" (empty string)
2. `if function_name and arguments:` treats "" as falsy, skipping the
entire tool call execution
OpenAI always sends arguments="{}" even for parameterless tools,
masking this bug. But vLLM, Ollama, and other OpenAI-compatible
providers may omit arguments entirely when the tool schema has no
required parameters, causing tool calls to be silently ignored.
Fix: check only `function_name` (not `arguments`) and default empty
arguments to "{}" so `json.loads` produces an empty dict. Apply the
same fallback for intermediate tool calls in multi-tool responses.
Raw strings like "de-DE" passed as the language parameter to TTS/STT services
were bypassing the Language enum resolution logic, causing silent failures
(e.g. ElevenLabs expects "de" not "de-DE"). Now raw strings are first converted
to Language enums so they go through the same resolve_language() path, with a
warning logged for unrecognized strings.
Reset stop strategies at turn start (not just turn stop) so that late
transcriptions arriving between turns do not leave stale _text that
causes premature stops on the next turn. Also cancel pending timeout
tasks in reset() for both SpeechTimeout and TurnAnalyzer strategies.
Expose enable_dialout as a configure() parameter (default False) so
dial-out examples can opt in without needing to build DailyRoomProperties
manually.
Narrow misleading Optional type hints on parameters that never accept
None, extract the duplicated token_exp_duration * 60 * 60 calculation,
remove unnecessary forward-reference quotes on DailyMeetingTokenProperties,
and clarify why enable_dialout is explicitly set to False.
Handle Daily's on_dtmf_event callback, convert it to an
InputDTMFFrame pushed into the input transport. Also add __str__
methods to InputDTMFFrame and OutputDTMFFrame for better logging.
Refactor language_to_soniox_language to use resolve_language + LANGUAGE_MAP
pattern consistent with other services. Fix resolve_language fallback to use
str(language) instead of language.value so plain strings don't crash.
The Inworld WS TTS plugin previously relied on the base TTS service's 3-second AUDIO_CONTEXT_TIMEOUT to detect when audio was done, then sent close_context in on_audio_context_completed. This added unnecessary latency before TTSStoppedFrame was emitted.
The original implementation likely borrowed this idea from the 11labs' impelementation. But it's likely better to mirror the Cartesia plugin pattern where on_audio_context_completed is a no-op because the server signals completion directly.
Now close_context is sent in on_turn_context_completed (right after flush_context), so the server responds with contextClosed immediately after the last audio byte. The existing receive handler already calls remove_audio_context on contextClosed, which exits the audio context handler cleanly.
The base_url parameter previously forced wss:// and https:// schemes,
breaking air-gapped or private deployments that need ws:// or http://.
Extract URL derivation into _derive_deepgram_urls() helper that respects
the developers scheme choice while deriving the paired WebSocket and
HTTP URLs the Deepgram SDK requires.
Closes#4019
Now that the base TTSService and STTService handle Language enum
conversion at init time, subclasses no longer need to convert in their
own __init__ methods. Remove conversion calls from hardcoded defaults,
params paths, and deprecated direct arg paths across 22 service files.
Services just pass raw Language enums and let the base class convert
via language_to_service_language() polymorphic dispatch.
When a Language enum (e.g. Language.ES) is passed via
settings=Service.Settings(language=Language.ES), it gets stored as-is
without conversion to the service-specific code. The base
_update_settings() handles this for runtime updates, but at init time
apply_update() copies the raw enum. This causes API errors because
services send the unconverted enum value.
Add language conversion in TTSService.__init__ and STTService.__init__
after super().__init__(), using the subclass language_to_service_language()
via normal method resolution.
Both analyzers are superseded by LocalSmartTurnAnalyzerV3. Added
deprecation warnings and docstring notices following the existing
pattern from LocalSmartTurnAnalyzer.
Perplexity appears to have statefulness within a conversation, so
converting a system message to "user" in one call and then back to
"system" in the next (after more messages are appended) causes API
errors. Remove the trailing system→user conversion entirely — if the
context only has system messages, the API call will fail but the
mistake will be caught right away.
Add test exercising the step 3 ordering where stripping a trailing
assistant exposes a system message that then gets converted to user.
Move the reasoning about when a trailing system message can occur
into the docstring.
Perplexity allows multiple initial system messages, so don't merge them.
Instead, skip system-system pairs during the consecutive same-role merge
step. Broaden the trailing message fix to convert any trailing system
message to user (not just a lone system message), so contexts with only
system messages don't fail.
Perplexity's API is stricter than OpenAI about conversation history:
- Requires strict alternation between user/tool and assistant messages
- Disallows system messages except as the initial message
- Requires the last message to be user or tool
The new adapter transforms messages before sending to satisfy all three
constraints: merging consecutive initial system messages, converting
non-initial system to user, merging consecutive same-role messages, and
removing trailing assistant messages.
Also adds dual-system-instruction warnings to Cerebras, Fireworks,
Mistral, Perplexity, and SambaNova services (matching the existing
BaseOpenAILLMService pattern), and updates the warning text in
BaseOpenAILLMService to be more descriptive.
EndTaskFrame and StopTaskFrame are now ControlFrames instead of
SystemFrames, so they flow through the pipeline and queue behind
pending work. This prevents races where EndFrame could overtake
in-flight frames (e.g. function call responses).
CancelTaskFrame and InterruptionTaskFrame remain SystemFrames
(via new TaskSystemFrame base): since they need immediate propagation.
The sink now catches EndTaskFrame, StopTaskFrame and CancelTaskFrame
downstream and re-queues it upstream to the task, ensuring the full
pipeline drains before shutdown begins.
Wait for _audio_context_task to finish draining the contexts queue
before canceling _stop_frame_task, ensuring all pending audio
contexts are processed during shutdown.
Flush buffered frames before pushing the synchronization frame so
downstream processors see the buffered frames first. Switch to a
while-loop with pop(0) so frames added to the buffer during flush
are also drained.
Add convenience parameters to configure() so callers don't need to
manually construct DailyRoomProperties/DailyRoomSipParams for common
SIP provider and geo configuration.
When `service` is set and doesn't match, the service forwards the frame instead of consuming it. This allows targeting a specific service when multiple services of the same type exist in the pipeline.
Align Simli with HeyGen/Tavus by extending AIService instead of
FrameProcessor and using a ServiceSettings dataclass. InputParams is
preserved but deprecated; its fields are promoted to direct init params.
Lifecycle handling moves to start()/stop()/cancel() methods.
The default model for OpenAILLMService and AzureLLMService was still set
to gpt-4o. Restored it to gpt-4.1. Also, removed hardcoded gpt-4o/gpt-4o-mini
model references from examples so they pick up the new default.
Move the warning helper into AIService as _warn_init_param_moved_to_settings.
It now uses type(self).__name__ to produce messages like
"Use settings=AnthropicLLMService.Settings(model=...)" instead of the raw
settings class name "AnthropicLLMSettings(model=...)". Callers no longer need
to pass the settings class explicitly.
Replace direct references to settings class names (e.g. `FooSettings`) with the nested `Settings` alias form throughout all 87 service files:
- Type annotations: `Settings`
- Runtime code: `self.Settings`
- Docstrings: `ServiceClass.Settings`
- Cross-file inheritance: `ParentService.Settings`
This makes the `Settings` alias the canonical way to reference a service's settings, keeping only the class definition and alias assignment as the remaining hits for each raw settings class name.
* Add ServiceSwitcherStrategyFailover for automatic error-based service switching
Introduce a strategy hierarchy: ServiceSwitcherStrategy (base) →
ServiceSwitcherStrategyManual (handles ManuallySwitchServiceFrame) →
ServiceSwitcherStrategyFailover (adds error-based failover). ServiceSwitcher
now defaults to ServiceSwitcherStrategyManual with strategy_type optional.
Non-fatal ErrorFrames are forwarded to the strategy via handle_error().
* Move metadata request into _set_active_if_available
Requesting metadata is part of making a service active, so it belongs
alongside setting _active_service and firing on_service_switched. This
removes the duplicate queue_frame calls from ServiceSwitcher push_frame
and process_frame.
DailyTransportClient.start_transcription() accepted a settings
parameter but always used self._params.transcription_settings
instead, silently discarding any custom settings passed by callers.
Change transcription_settings to Optional[DailyTranscriptionSettings]
defaulting to None. The default settings are now applied at the call
site when transcription is started, and start_transcription receives
the serialized settings dict directly.
Use CustomVideoSource/CustomVideoTrack for the default camera output instead of
VirtualCameraDevice, mirroring how audio already uses CustomAudioSource/CustomAudioTrack.
Add support for custom video destinations (register_video_destination, add/remove
custom video tracks, routing in write_video_frame) so multiple video tracks can be
published simultaneously.
* Add system_instruction parameter to run_inference
Allow callers to provide a custom system instruction directly when calling
run_inference, without having to construct provider-specific context objects.
For OpenAI, the instruction is prepended as a system message (preserving
existing messages). For Anthropic, Google, and AWS Bedrock, it overrides the
single system field with a warning when an existing system instruction is
present in the context.
* Use system_instruction parameter in _generate_summary
Pass the summarization prompt via run_inference's system_instruction
parameter instead of embedding it as a system message in the context.
* Add changelog for #3968
Constructor/settings system_instruction now takes priority over the
context system message. Previously the context value would overwrite
the constructor value on every call. Warn when both are set.
After interruption, both _playing_context_id and _turn_context_id are
None. If a subclass calls append_to_audio_context(None, frame), the
recovery path matches (None == None) and creates a bogus audio context
that blocks the handler from ever processing the real context.
Early-return when context_id is falsy to prevent this.
The Deepgram TTS service was bypassing pipecats audio context management
system, pushing audio frames directly via push_frame() instead of routing
them through append_to_audio_context(). This caused stale audio to leak
into the pipeline after interruptions and missed ordered playback
guarantees.
- Route audio frames through append_to_audio_context() with context
availability checks to discard stale post-interruption frames
- Handle Flushed responses by appending TTSStoppedFrame and removing
the audio context to signal completion
- Replace _handle_interruption override with on_audio_context_interrupted
hook (the recommended pattern used by ElevenLabs and Cartesia)
- Remove redundant process_frame override that caused double-flush
(base class already flushes via on_turn_context_completed)
- Remove redundant start_tts_usage_metrics call (base class handles
aggregated usage metrics)
Make `region` optional so users can provide only `private_endpoint`.
Raise ValueError if neither is provided, and warn if both are given
(private_endpoint takes priority).
Enhanced the logic for extracting the system message in the traced_llm decorator to support LLMContext via adapter and handle exceptions gracefully. This improves compatibility with different context types and ensures better tracing information.
SpeechConfig does not accept both `region` and `endpoint` simultaneously —
they are mutually exclusive. The previous code always passed both, which
raises ValueError when a user supplies a private_endpoint URL. Now we
conditionally pass either `endpoint` or `region`, never both.
Update the tests in test_integration_unified_function_calling.py to not specify particular models but instead just use service defaults (the tests shouldn't be model-dependent anyway)
Forward the on_summary_applied event from the internal summarizer to
the aggregator so users can listen for it without accessing private
members. Update summarization examples to use the new public event.
_on_call_state_updated passes (self, state) to _call_event_handler,
but _run_handler already prepends self when invoking the handler.
This causes handlers to receive 3 positional arguments instead of 2,
making the on_call_state_updated event unusable.
This aligns with how _on_first_participant_joined correctly passes
only the data arg without self.
Turn completion instructions were being injected as a system message in
the LLM context, which caused warning spam when system_instruction was
also set, did not persist across full context updates, and broke LLMs
that do not support consecutive system messages.
Instead, compose the turn completion instructions into the LLM service
system_instruction field. This is managed via _base_system_instruction
which stores the original value for restoration when turn completion is
disabled.
- mcp_service.py: remove unnecessary try/except around debug log,
use len(available_tools.tools) to match actual iteration target
- bedrock_adapter.py, aws/llm.py: add AttributeError to except tuple
to handle None content (previously caught by bare except)
Wire up the existing settings update infrastructure to send a Configure
WebSocket message when keyterm, eot_threshold, eager_eot_threshold, or
eot_timeout_ms change mid-stream, avoiding a full reconnect.
Add a `Settings` class-level alias on every STT, LLM, TTS, image,
vision, and video service class pointing to its settings dataclass.
This lets developers discover the right settings class via the service
class itself (e.g. `GoogleSTTService.Settings(...)`) without needing
to know or import the separate settings class name.
Adds the explicit "no params object" step 3 comment to all
LLM services that skip from step 2 to step 4 in their
settings initialization sequence, matching the pattern
established in services that do have a params object.
Replace references to undefined `_params` with `self._settings` for
language and VAD config. Add missing `system_instruction` to default
settings to satisfy validate_complete(). Remove redundant line that
read language from the deprecated `params` arg.
- Speechmatics: move config build after super().__init__ and settings
delta so turn_detection_mode (e.g. ADAPTIVE) takes effect
- Google STT: fix example passing bare Language enum instead of list
- Google TTS: add missing explicit defaults for all custom settings fields
- Soniox: fix accidental tuple wrapping of STT service in example
- Speechmatics examples: fix system->user role in kick-off messages
- Deepgram Flux: move tag from settings to __init__ (billing metadata)
- ElevenLabs STT: default tag_audio_events to None (use API default)
- Fal STT: simplify language default handling
- Google TTS: rename GoogleStreamTTSSettings to GoogleTTSSettings
Use "AI service" language instead of listing specific types, add
ServiceSettings as a fallback for direct AIService subclasses, and
clarify delta mode description with a concrete frame example.
The audio fields (sample rates, sample sizes, channel counts) on the deprecated `Params` class had no non-deprecated equivalent. This adds an `AudioConfig` class and `audio_config` init arg so users can specify audio configuration without relying on the deprecated `params` parameter.
Move `session_properties` into `GrokRealtimeLLMSettings`, making `settings` the canonical way to configure Grok Realtime — matching the pattern used across the rest of the codebase. The `session_properties` init arg is now deprecated in favor of `settings=GrokRealtimeLLMSettings(session_properties=...)`.
`system_instruction` is synced bidirectionally between the top-level settings field and `session_properties.instructions`, with top-level taking precedence on conflict. (Unlike OpenAI Realtime, Grok's `SessionProperties` has no `model` field, so no model sync is needed.)
Replace the monolithic rtvi.py with a proper package split by concern
protocol version:
- models_v0.py: deprecated pre-1.0 Pydantic models
- models_v1.py: current RTVI protocol v1 message models
- frames.py: RTVI pipeline frame dataclasses
- observer.py: RTVIObserver and RTVIObserverParams
- processor.py: RTVIProcessor (now lean, imports from submodules)
- __init__.py: re-exports full public API for backward compatability
Move `session_properties` into `OpenAIRealtimeLLMSettings`, making `settings` the canonical way to configure OpenAI Realtime — matching the pattern used across the rest of the codebase. The `session_properties` init arg is now deprecated in favor of `settings=OpenAIRealtimeLLMSettings(session_properties=...)`.
`model` and `system_instruction` are synced bidirectionally between the top-level settings fields and `session_properties.model`/`.instructions`, with top-level taking precedence on conflict.
Add `system_instruction=None` to `default_settings` for OpenAIRealtimeLLMService, GrokRealtimeLLMService, UltravoxRealtimeLLMService, AWSNovaSonicLLMService (Azure inherits from OpenAI), and OpenAIRealtimeBetaLLMService (Azure Beta inherits from OpenAI Beta).
Deprecate `system_instruction` init arg in AWSNovaSonicLLMService in favor of `settings=AWSNovaSonicLLMSettings(system_instruction=...)`. Use `self._settings.system_instruction` directly instead of storing a separate `self._system_instruction`.
Deprecation of `params` and `session_properties` in favor of `settings` for realtime services will be tackled in future work.
Add `system_instruction` field to `LLMSettings` so it is runtime-updatable via settings.
For Google (GoogleLLMService, GoogleVertexLLMService), deprecate the init-time arg since it was already shipped. For Anthropic, AWS Bedrock, and OpenAI, remove the init-time arg entirely since it was never shipped.
Still need to handle realtime services (OpenAI Realtime, Grok Realtime, Gemini Live).
Broaden the "Dynamic Settings Updates" section into "Service Settings"
covering the complete settings pattern: defining a Settings subclass,
wiring it into __init__ with defaults + apply_update, and distinguishing
init-only config from runtime-updatable fields.
- Add dedicated Settings subclasses to 20 LLM services that were
borrowing parent Settings classes (e.g. AzureLLMSettings,
GroqLLMSettings) so users don't need cross-module imports
- Fix field defaults to NOT_GIVEN in BaseWhisperSTTSettings,
OpenAIRealtimeSTTSettings, and NvidiaSegmentedSTTSettings for
delta-mode safety
- Fix incomplete default_settings in AWS, Cartesia, ElevenLabs,
Fish, and Whisper services so validate_complete() passes
- Add auto-discovered tests that verify all Settings classes default
to NOT_GIVEN (delta safety) and all services initialize with
complete settings (store completeness)
Move output_container, output_encoding, output_sample_rate out of
CartesiaTTSSettings into plain instance attributes since they cannot
change at runtime without breaking the audio pipeline. Remove deprecated
speed/emotion fields and their dead references in _build_msg() and
run_tts(). Remove the from_mapping override that only existed to
destructure those now-removed output format fields.
Update all ~192 call sites across 84 service files to pass class references
(e.g. `CartesiaTTSSettings`) instead of string names (`"CartesiaTTSSettings"`)
to `_warn_deprecated_param()`. This enables better IDE refactoring support.
Also fix `from_mapping` return type annotations in 5 settings subclasses to
use `typing.Self` instead of forward reference strings.
ServiceSettings types were introduced for runtime updates via ServiceUpdateSettingsFrame, but there was tension between init-time and runtime APIs: overlapping-but-different InputParams vs ServiceSettings classes, and runtime-updatable fields like `model` and `voice` scattered as direct init args rather than living in a settings object. This unifies them so developers use the same settings type at both init and runtime, improving ergonomics and consistency.
Every concrete AIService subclass (LLM, TTS, STT, ImageGen, Vision, Video) now accepts a `settings` parameter for runtime-updatable config. Old init args (`model`, `voice_id`, `params`/`InputParams`) still work but emit DeprecationWarnings pointing to the new API. When both are provided, `settings` takes precedence. Leaf classes emit warnings; base classes do not, avoiding double warnings in inheritance chains.
system_instruction from the constructor always takes precedence. A
warning is now logged when the context also contains a system message
so users can spot the conflict.
Add vad_threshold parameter to AssemblyAIConnectionParams to support
voice activity detection threshold configuration for the u3-rt-pro model.
This parameter allows users to align AssemblyAI's VAD threshold with
their external VAD systems (e.g., Silero VAD) to avoid the "dead zone"
where AssemblyAI transcribes speech that the external VAD hasn't
detected yet, which can delay interruption handling.
- Range: 0.0 to 1.0 (lower = more sensitive)
- Default: 0.3 (API default when not sent)
- Only applicable to u3-rt-pro model
- Automatically included in WebSocket query parameters
Recommended usage: Set vad_threshold to match your VAD's activation
threshold (e.g., both at 0.3) for optimal performance.
Widen version ranges for stable packages (anthropic, azure, deepgram,
groq, livekit, nvidia-riva-client, fastapi, ormsgpack, opentelemetry,
faster-whisper) and add upper bounds to previously uncapped packages
(hume, pyjwt, livekit-api, camb).
Replace CartesiaHttpTTSService's internal use of the Cartesia SDK with
direct aiohttp calls, accepting an optional aiohttp_session parameter.
Replace fal-client SDK calls in FalSTTService and FalImageGenService
with direct HTTP to bypass the SDK's aggressive retry/backoff logic
that caused significant latency regressions.
Widen version ranges for stable packages (aiofiles, docstring_parser,
onnxruntime) while adding upper bounds to previously uncapped packages
(transformers, numba, wait_for2). Bump soxr to 1.0.0 and pyloudnorm
to 0.2.0. Move silero extra to empty since onnxruntime is now a core dep.
Add appropriate log levels to dial-in/dial-out, participant, transcription,
and recording event handlers. Move transcription error log from client
callback to transport handler to keep logging consistent at the transport
level.
The default function call timeout (10s) causes silent failures for
long-running tools. This adds an optional timeout_secs parameter to
register_function() and register_direct_function() so individual tools
can override the global function_call_timeout_secs. The warning message
now mentions both the per-tool and global timeout options.
Allow either threshold to be set to None to cleanly disable that trigger,
instead of requiring users to set a very large number as a workaround.
At least one of the two must remain set (validated at construction time).
- u3-rt-pro: Does not set parameter (not used)
- universal-streaming models: Set to 1.0 to maintain fast response
- This ensures fast response time matches previous implementation
Replace the round-trip push_interruption_task_frame_and_wait() mechanism
with broadcast_interruption(), which pushes an InterruptionFrame both
upstream and downstream directly from the calling processor.
This eliminates race conditions (transcription arriving before the
InterruptionFrame comes back), swallowed-event timeouts (frame blocked
before reaching the sink), and the complexity of _wait_for_interruption
flag / queue bypass / frame.complete() obligations.
- Add broadcast_interruption() to FrameProcessor
- Deprecate push_interruption_task_frame_and_wait() (delegates to new method)
- Remove event field and complete() from InterruptionFrame/InterruptionTaskFrame
- Remove _wait_for_interruption flag and all special-case logic
- Remove frame.complete() calls in stt_mute_filter and llm_response_universal
- Update all 17 call sites to use broadcast_interruption()
- Update tests
Enables .model_dump() serialization for Pipecat Cloud collection.
All metrics now include start_time (Unix timestamp) for timeline
plotting alongside duration_secs.
Add per-service latency breakdown metrics alongside existing user-to-bot
latency measurement. When enable_metrics=True, the observer now emits an
on_latency_breakdown event with TTFB, text aggregation, and user turn
duration metrics collected between VADUserStoppedSpeakingFrame and
BotStartedSpeakingFrame.
- Add LatencyBreakdown dataclass with ttfb, text_aggregation,
user_turn_secs fields
- Accumulate MetricsFrame data during user→bot cycles
- Reset accumulators on InterruptionFrame to discard stale metrics
- Measure user_turn_secs from actual user silence (VAD timestamp -
stop_secs) to turn release (UserStoppedSpeakingFrame)
- Filter zero-value TTFB entries from startup metric resets
- Add frame deduplication using bounded deque + set pattern
- Update example 29 with latency breakdown display
The ServiceSettings refactor (PR #3714) changed self._settings from
dicts to dataclass subclasses, but tracing code still used .items(),
in containment, and subscript access, causing AttributeError on
every traced call. Use given_fields() for iteration and attribute
access for named fields.
Replace the PipelineSink detection in StartupTimingObserver with an
on_pipeline_started() callback from PipelineTask via TaskObserver.
This fixes premature report emission when using ParallelPipeline,
which has its own inner PipelineSinks per branch.
Use start_offset_secs (offset from StartFrame) on ProcessorStartupTiming
instead of a wall-clock timestamp. Reports keep a single start_time
anchor for dashboard visualization. Remove _mono_to_wall conversion.
Switch ProcessorStartupTiming, StartupTimingReport, and
TransportTimingReport from dataclasses to Pydantic BaseModel. Add
start_time (Unix timestamp) fields and wall clock conversion for
monotonic observer timestamps.
Add BotConnectedFrame (SystemFrame) pushed by SFU transports (Daily,
LiveKit, HeyGen, Tavus) when the bot joins the room. Replace the
on_transport_readiness_measured event with on_transport_timing_report
which includes both bot_connected_secs and client_connected_secs.
Introduce ClientConnectedFrame (SystemFrame) pushed by all transports
when a client connects. StartupTimingObserver uses this to measure
transport readiness — the time from StartFrame to first client
connection — via a new on_transport_readiness_measured event.
Azure TTS _handle_canceled was putting None (the normal completion
signal) into the audio queue for all cancellation reasons, so run_tts
treated errors identically to success—silently producing no audio.
Now error cancellations put an Exception marker in the queue, which
run_tts converts to an ErrorFrame.
Azure STT had no canceled event handler at all, so auth failures,
network errors, and rate-limit cancellations were invisible. Added
_on_handle_canceled which pushes an ErrorFrame upstream via push_error.
Fixespipecat-ai/pipecat#3892
Tracks how long each processor start method takes during pipeline
startup by measuring StartFrame arrive/leave deltas. Emits a timing
report via the on_startup_timing_report event and auto-logs a summary.
Internal pipeline processors are excluded from reports by default.
The Whisper-based ONNX model expects 16 kHz audio, but the
_predict_endpoint method had five hardcoded references to 16000 without
checking the actual pipeline sample rate. When running at 8 kHz (e.g.
Twilio telephony), audio was fed to the feature extractor at the wrong
rate, causing the model to perceive speech at 2x speed with shifted
formant frequencies and produce incorrect end-of-turn predictions.
Add automatic resampling via numpy interpolation before feature
extraction and replace all hardcoded sample rate values with a
_MODEL_SAMPLE_RATE constant. Also fix the WAV debug logger to write
files with the correct sample rate header.
Fixes#3844
Snapshot the blocks into immutable bytes and trim the buffer BEFORE any await, so no memoryview is
held across async yield points. Without this, a concurrent filter() or stop() call could try to
extend() or clear() the bytearray while a memoryview still exports it, raising "Existing exports
of data: object cannot be re-sized".
When filter_incomplete_user_turns is enabled and an LLMMessagesUpdateFrame
replaces the context via set_messages(), the turn completion instructions
system message was lost. This caused the LLM to stop emitting turn
completion markers. Re-inject the instructions after set_messages() to
fix this.
- Keep old parameter name for backward compatibility
- Add deprecation warning when old parameter is used
- Automatically migrate old parameter value to new min_turn_silence parameter
- Exclude deprecated parameter from WebSocket URL to avoid sending it to API
- New parameter takes precedence if both are set
- Update 13d-assemblyai-transcription.py to explicitly use u3-rt-pro model
- Update 55d-update-settings-assemblyai-stt.py to demonstrate keyterms updates instead of language updates
- Add helpful logging to show before/after keyterms boosting effect
- Use difficult names (Xiomara, Saoirse, Krzystof) to demonstrate boosting effectiveness
- Add "beta feature" note to custom prompt warning
- Rename min_end_of_turn_silence_when_confident parameter to min_turn_silence across all AssemblyAI code
- Update documentation, examples, and test files to use new parameter name
Allow pushing frames upstream through the pipeline by passing
FrameDirection.UPSTREAM. Downstream frames use the existing push queue,
while upstream frames are pushed directly from the pipeline sink.
The ServiceSettings refactor (PR #3714) changed self._settings from
dicts to dataclass subclasses, but tracing code still used .items(),
in containment, and subscript access, causing AttributeError on
every traced call. Use given_fields() for iteration and attribute
access for named fields.
The update-docs workflow intermittently failed with "Input required and
not supplied: token" because pull_request events from fork PRs don't
have access to repository secrets. Switching to pull_request_target
runs the workflow in the base repo's context, ensuring secrets are
always available. This is safe since the workflow only runs on
already-merged PRs.
- 07o-interruptible-assemblyai.py: Basic example using Pipecat VAD mode
- 07o-interruptible-assemblyai-stt.py: Advanced example using STT-controlled
turn detection with comprehensive documentation on u3-rt-pro features
(turn detection tuning, prompt-based enhancement, speaker diarization)
The request_finalize() method in STTService is synchronous (sets a flag),
but was being called with await in the VAD turn endpoint handling code.
This caused "object NoneType can't be used in 'await' expression" errors.
Also includes automatic formatting improvements from ruff.
Replace _rtvi_external instance variable with a local prepend_rtvi flag
since it is only used during __init__ to decide whether to prepend the
RTVIProcessor to the pipeline.
When the user places an RTVIProcessor inside their pipeline and provides
a custom RTVIObserver subclass in observers, PipelineTask correctly
detects both and logs "skipping default ones." However it then
unconditionally prepends self._rtvi to the pipeline, causing the
processor to appear twice in the frame chain.
Track whether the RTVIProcessor was found externally (inside the user
pipeline) vs created internally. Only prepend it when created internally.
Fixes#3867
- Remove unused Mapping import
- Remove info logs at initialization (connection params)
- Remove info logs in _handle_transcription (transcript details, text sent to LLM)
- Remove info logs in _build_ws_url (WebSocket URL and params)
- Keep debug logs (less verbose, appropriate for development)
u3-rt-pro guarantees SpeechStarted is always sent before transcripts,
so the fallback UserStartedSpeakingFrame broadcast is never needed.
This ensures clean pairing of UserStarted/StoppedSpeakingFrame:
- Start: Always from _handle_speech_started
- Stop: Always from _handle_transcription on final turn
- Add request_finalize() before sending ForceEndpoint in Pipecat mode
- Keep confirm_finalize() when receiving formatted finals in Pipecat mode
- Remove confirm_finalize() from STT mode (use finalized=True instead)
This follows Pipecat's two-step finalization pattern where request_finalize()
is called when sending a finalize request to the STT service, and
confirm_finalize() is called when receiving confirmation back.
Even when summarization_timeout is explicitly set to None, use a
DEFAULT_SUMMARIZATION_TIMEOUT (120s) fallback so the LLM call can
never hang indefinitely. Applied in both LLMService and the dedicated
LLM path in LLMContextSummarizer.
The dedicated LLM logic lived in LLMAssistantAggregator, creating two
code paths and requiring the aggregator to call a private LLMService
method. Move it into the summarizer which already owns the config and
summarization lifecycle, keeping the aggregator handler as a single-line
upstream push.
Adds a configurable summarization_timeout (default 120s) that cancels
summary generation if the LLM hangs. On timeout, an error result is
returned so _summarization_in_progress resets and future
summarizations are unblocked.
Adds an field to LLMContextSummarizationConfig that allows
routing summarization to a separate LLM service (e.g., Gemini Flash)
instead of the pipeline's primary model. This avoids paying for
expensive inference when compressing context in long-running sessions.
Allows applications to customize how the summary is wrapped when
injected into context (e.g., XML tags, custom delimiters) so system
prompts can distinguish summaries from live conversation.
Add deprecation warnings to start_processing_metrics() and
stop_processing_metrics() on FrameProcessorMetrics and FrameProcessor.
Mark ProcessingMetricsData as deprecated in docstring. All existing
behavior is preserved — the warnings inform users that these will be
removed in a future version.
Runs Claude Code Action after PRs merge to main when source files
in services/transports/serializers/processors/audio/turns/observers/pipeline
are changed. Creates a docs PR on pipecat-ai/docs with targeted edits
following the existing update-docs skill instructions.
- Fix speaker diarization: Add field alias for speaker_label → speaker
mapping in TurnMessage model
- Add warning for non-optimal min_end_of_turn_silence_when_confident
values (recommends 100ms for best latency)
- Improve max_turn_silence override warning message clarity
- Update custom prompt warning (remove 88% accuracy claim)
- Add comprehensive logging for debugging:
- Log final connection params after modifications
- Log WebSocket URL and parsed parameters
- Log speaker field in transcripts
- Log text sent to LLM with speaker formatting
- Support dynamic configuration updates via STTUpdateSettingsFrame:
- keyterms_prompt (when AssemblyAI API supports it)
- prompt
- max_turn_silence
- min_end_of_turn_silence_when_confident
- Add InterruptionFrame handling with stop_all_metrics()
- Add processing metrics (start/stop) at response boundaries
- Fix agent transcript handling for voice and text modalities:
- Voice mode: push LLMTextFrame (append_to_context=False) and
TTSTextFrame for deltas, skip duplicated final text
- Text mode: push LLMTextFrame with proper response lifecycle,
no TTSTextFrame (downstream TTS handles audio)
- Add output_medium parameter to AgentInputParams and OneShotInputParams
- Improve TTFB measurement using VAD speech end time
- Update example with user turn strategies and transcript events
- Add text-only output example (50a-ultravox-realtime-text.py)
Move the sentence vs token aggregation concern into text aggregators
so all text flows through them regardless of mode. This enables
pattern detection and tag handling to work in TOKEN mode.
- Add TextAggregationMode enum (SENTENCE, TOKEN) as the user-facing
TTS setting, separate from the internal AggregationType
- Add TOKEN mode support to Simple, SkipTags, and PatternPair aggregators
- Add text_aggregation_mode parameter to TTSService and all TTS subclasses
- Deprecate aggregate_sentences in favor of text_aggregation_mode
- Merge TTSService._process_text_frame() into a single codepath
Add TextAggregationMetricsData measuring the time from the first LLM
token to the first complete sentence, representing the latency cost of
sentence aggregation in the TTS pipeline.
- Wire up passing speed setting to Groq, even though only a value of 1.0 is supported today
- Update the 55y example to switch voices instead of changing speed
- Add a 55zn example to exercise runtime updates of Groq STT
The only (rare) exception—where a service directly still needs to directly call `self._sync_model_name_to_metrics()`—is when the model name need to be "pulled" from another field (or nested field) in settings up to settings.model on a settings update. This only occurs in Deepgram services, where we use the voice as the model name.
This change has the side-effect of bringing model name to metrics for a number of services that were accidentally omitting it before.
This was added in 31daa889e8, but only
to `RimeTTSService`, not to `RimeNonJsonTTSService. Bringing these
to parity means that users switching between the two, with the same
inputs, have more consistent vocalization behaviors.
Introduce a generic TurnMetricsData class for turn detection metrics,
replacing the service-specific SmartTurnMetricsData (now deprecated).
Add end-to-end processing time measurement to KrispVivaTurn, tracking
the interval from VAD speech-to-silence transition to model threshold
crossing. Consume metrics in the strategy _handle_input_audio path
so they are pushed immediately when fresh.
The Krisp VIVA SDK v1.8.0 requires a license key in globalInit(). Add
api_key parameter to KrispVivaSDKManager, KrispVivaTurn, and
KrispVivaFilter with fallback to KRISP_API_KEY env var. Maintain
backwards compatibility with older SDK versions by catching TypeError
and falling back to the old 3-arg signature.
Brings back the 6 development workflow skills (changelog, cleanup,
code-review, docstring, pr-description, pr-submit) that were moved
to pipecat-ai/skills, and adds a .claude-plugin/marketplace.json so
other pipecat-ai repos can install them. Updates README contributing
section with installation instructions.
Audio filters like RNNoise, KrispViva, and AIC return empty bytes while
buffering audio to accumulate their required frame size. These empty
frames were flowing downstream, causing misleading "Empty audio frame
received for STT service" warnings.
Skip the frame in BaseInputTransport when audio is empty, preventing
unnecessary processing in VAD and downstream processors.
Fixes#3517
These frames were falling through to the else branch and being pushed
downstream, unlike TranscriptionFrame which is explicitly consumed.
This aligns with how the assistant aggregator already filters them.
PR #3776 replaced manual timestamp tracking with stop_ttfb_metrics() in
the timeout handler, but without an end_time it uses time.time() at
timeout expiry—producing TTFB = timeout + stop_secs (~2.2s) instead of
the actual transcript latency.
Restore _last_transcript_time tracking so the timeout handler measures
to when the transcript arrived, and skip reporting if none arrived.
Before this change, settings updates were often not applied. For example, a `TTSUpdateSettingsFrame` queued while the bot was speaking would only have an effect at the end of the bot's reply, and any interruption before the end of the reply would "cancel" the update.
Remove bundled Claude Code skills (changelog, cleanup, code-review,
docstring, pr-description, pr-submit) that now live in
https://github.com/pipecat-ai/skills. Add a section to the README
with installation instructions. The update-docs skill remains as
it is specific to this repository.
This makes each service-specific field individually visible to the delta/update mechanism (`apply_update`, `given_fields`) and removes the need for complex sync logic between `input_params` and top-level fields like `model`.
- Soniox: replace `input_params: SonioxInputParams` with 8 individual fields, simplify `_update_settings` by removing model sync logic, remove unused `is_given` import
- Gladia: replace `input_params: GladiaInputParams` with 11 individual fields, resolve deprecated `language` → `language_config` at init time rather than at `_prepare_settings` time
- Storage mode: for use in `self._settings`. All fields should be specified, i.e. should not be `NOT_GIVEN`.
- Delta mode: for use in `*UpdateSettingsFrame`.
In service of this, this commit:
- Adds a runtime check that all fields are specified in storage mode
- Updates all services to specify all fields in stored settings
- Updates all services to no longer check for `is_given` in stored settings (not necessary anymore)
- Updates relevant docstrings
- Renames `update` to `delta` in `*UpdateSettingsFrame`
- Updates community integrations guide
Move start_processing_metrics from run_stt (called per audio chunk,
producing noisy 0ms logs) to _receive_messages when the first final
token arrives for a new utterance. The existing stop_processing_metrics
in send_endpoint_transcript completes the pair, giving a meaningful
measurement of time from first recognition to finalized transcript.
Commit 859cd7c9 refactored STT TTFB measurement to use the base class
start_ttfb_metrics/stop_ttfb_metrics, which are gated behind
can_generate_metrics(). Soniox and AWS Transcribe never overrode this
method (default returns False), so TTFB was silently never reported.
Move speech detection tracking outside the per-frame loop in append_audio
since is_speech applies to the whole buffer. Add debug log in
analyze_end_of_turn to show state and probability at decision time. Update
the Krisp VIVA example to use Cartesia TTS and turn analyzer strategy.
Use wildcard `*UpdateSettingsFrame` to cover all frame types. Clarify that NOT_GIVEN only appears in update deltas, not in the service's current settings state.
Split the single "changed" entry into separate "added", "changed", and "deprecated" entries for clarity. Add a note about the subtle behavior change in the deprecated set_model/set_voice/set_language methods.
Simplify the reconnect example to show a common pattern (reconnect on any change) and improve the _warn_unhandled_updated_settings example to show selective handling of specific fields.
Update docstrings for ServiceSettings, LLMSettings, TTSSettings, and STTSettings to make clear these capture only the subset of service configuration that can be changed while the pipeline is running via UpdateSettingsFrame, not all constructor parameters.
Document the existing convention: use @dataclass for frames and
internal pipeline data, use Pydantic BaseModel for configuration,
parameters, metrics, and external API data.
Replace self-referential `pipecat-ai[local-smart-turn-v3]` extra in core
dependencies with the actual packages (`transformers`, `onnxruntime`).
Self-referential extras are not supported by Poetry and cause dependency
resolution failures. Since these are required by the default turn stop
strategy (LocalSmartTurnAnalyzerV3), they belong in core dependencies.
- Remove `local-smart-turn-v3` optional extra from pyproject.toml
- Remove try/except ModuleNotFoundError guard (now always installed)
- Remove `--extra local-smart-turn-v3` from CI workflows
When the InterruptionFrame does not complete within the timeout the
caller was stuck in an infinite loop logging warnings. Now the event
is set after the first timeout so the processor can continue.
Also adds a keyword timeout parameter so callers can customize the
wait duration.
- indicate clearly that it's not meant for public use
- make it clear the `self._settings` is the single source of truth for model information
- set the stage for an upcoming change where `AIService` subclasses won't have to ever worry about explicitly calling an `AIService` method to sync model name to metrics
Across all services, switch from accessing `self._model_name` or `self.model_name` in favor of `self._settings.model`.
Adds a TTS service that connects to Deepgram models deployed on AWS
SageMaker endpoints via HTTP/2 bidirectional streaming. Supports the
Deepgram TTS protocol (Speak, Flush, Clear, Close) over the BiDi
client, with interruption handling and per-turn TTFB metrics.
Updates the example and env.example with separate STT/TTS endpoint names.
- AWSNovaSonicLLMService: new `AWSNovaSonicLLMSettings` with `voice_id` and `endpointing_sensitivity`; remove `self._params` entirely, storing audio I/O config as plain instance variables
- NeuphonicHttpTTSService: reuse `NeuphonicTTSSettings`; use inherited `language` field instead of bespoke `lang_code`
- NvidiaTTSService: new `NvidiaTTSSettings` with `quality`
- PiperTTSService / PiperHttpTTSService: new `PiperTTSSettings` / `PiperHttpTTSSettings` (no extra fields)
- SpeechmaticsTTSService: new `SpeechmaticsTTSSettings` with `max_retries`
Also remove redundant `lang_code` from `NeuphonicTTSSettings` (both WS and HTTP services now use the inherited `TTSSettings.language` field, with automatic enum conversion via the base class).
HTTP services (Neuphonic HTTP, Piper HTTP, Speechmatics) don't override `_update_settings` since the base class applies changes to `self._settings` and subsequent requests read from it automatically.
Also:
- remove unnecessary pass-through `_update_settings` implementation in `FalSTTService`
- warn that `AsyncAITTSService` doesn't currently support runtime settings updates
- update how `GradiumTTSService._update_settings` checks for voice changes
- remove a couple of unnecessary args (because they specified defaults) in other examples
ThinkingConfig was defined as an inner class on the service but referenced in the Settings dataclass declared before the service class, causing a crash at import time. Move ThinkingConfig to a standalone class defined before Settings, and keep a class attribute alias for backward compatibility.
Eliminate custom _emit_stt_ttfb_metric and manual timestamp tracking in
STTService by reusing FrameProcessor's start_ttfb_metrics/stop_ttfb_metrics
with new start_time/end_time parameters. This keeps the chronological
start→stop ordering and removes _speech_end_time and _last_transcription_time
state from STTService.
Remove the deprecation warning and __post_init__ override. Also fix the
default value for remote_participants to use field(default_factory=dict)
instead of None.
Add write_transport_frame() hook to BaseOutputTransport so subclasses
can handle custom frame types that flow through the audio queue. Add
DailySIPTransferFrame and DailySIPReferFrame as DataFrame subclasses
that queue with audio, ensuring SIP operations execute only after the
bot finishes its current utterance. Override write_transport_frame in
DailyOutputTransport to dispatch these frames to the existing
sip_call_transfer() and sip_refer() client methods.
Also switch DailyOutputTransport.send_message error handling from
logger.error to push_error for consistency.
Every `*Settings` dataclass field whose default is `NOT_GIVEN` now carries `_NotGiven` in its type union so the type system accurately reflects the three-state semantics (real value, `None` where applicable, or not-yet-specified). Fields previously typed as bare `Any`, `str`, `float`, `bool`, `list`, `dict`, or `Optional[X]` are now narrowed to the specific type from the corresponding `InputParams` Pydantic model.
RTVIObserver previously filtered out all upstream frames to avoid
duplicate messages from broadcasted frames. This caused upstream-only
frames to be silently ignored. Instead, add a `broadcasted` field to
the Frame base class that is set by broadcast_frame() and
broadcast_frame_instance(), and only skip upstream copies of
broadcasted frames.
The CI was failing because the runner's package index was stale,
causing a 404 when fetching libasound2-dev (a dependency of
portaudio19-dev). Running apt-get update first refreshes the index.
- Move `CommitStrategy` up in the file so it could be used by `ElevenLabsRealtimeSTTSettings`
- Fix a bug where `run_tts` would erroneously try to reconnect if a reconnection was already in flight (like a reconnection triggered by `_update_settings`)
42 examples covering STT (13), TTS (21), LLM (4), and realtime (4) services. Each demonstrates updating service settings 10 seconds after client connects, verifying the typed settings machinery end-to-end for every provider.
HumeTTSService now stores its params (description, speed, trailing_silence) in a proper `HumeTTSSettings` dataclass instead of a separate `_params` Pydantic model, making it work with `TTSUpdateSettingsFrame(update=...)`. The old `update_setting(key, value)` method is kept but deprecated.
Also removes the unused no-op `TTSService.update_setting` base method, which was never called by the `TTSUpdateSettingsFrame` pipeline.
The dataclass-based API (`*UpdateSettingsFrame(update=*Settings(...))`) is the preferred path since 0.0.103. The dict path still works but now emits a `DeprecationWarning`.
Change `TTSSettings.language` and `STTSettings.language` from `Any` to `Language | str | _NotGiven`. Add `language_to_service_language` base method and centralized `isinstance`-guarded conversion in `STTService._update_settings` (mirroring TTS). Update the TTS guard from `is not None` to `isinstance(…, Language)` so raw strings pass through unchanged.
Remove now-redundant per-service language conversion from `_update_settings` overrides (ElevenLabs, Azure, Fal, Whisper). Add `language_to_service_language` to Azure STT so the centralized conversion picks it up. Fix AWS and NVIDIA STT `__init__` to convert language at construction time, then simplify their runtime accessors to read `_settings.language` directly.
Note that for services that previously handled applying updates (through methods like `set_model` and `set_language`), we're keeping the update-applying logic (some or most of which is already well-tested) and expanding it to cover all relevant settings fields. Services under this bucket are:
- Deepgram STT
- Deepgram Sagemaker STT
- Elevenlabs STT
- Google STT
- Gradium STT
- OpenAI STT
- Speechmatics STT
Change the version specifier from `>=0.2.8` to
`~=0.2.8` for the `speechmatics-voice` package.
This ensures compatibility with future patch
versions while preventing potential breaking
changes from minor updates.
Use client_req_id-based multiplexing instead of disconnecting and
reconnecting the websocket on every interruption. This follows the
same pattern used by Cartesia, ElevenLabs, and other services via
AudioContextWordTTSService.
Key changes:
- Base class: InterruptibleWordTTSService -> AudioContextWordTTSService
- Add close_ws_on_eos: False to setup message to keep connection alive
- Add client_req_id to text, end_of_stream messages for demultiplexing
- Route audio via append_to_audio_context() instead of push_frame()
- Silently drop messages for cancelled/unknown contexts on interruption
- Add _handle_interruption() that resets context without reconnecting
- Remove no-op push_frame() override
Always create UserIdleController (timeout=0 means disabled), removing
all Optional guards. Add UserIdleTimeoutUpdateFrame to allow changing
the idle timeout at runtime.
Replace the continuous heartbeat-based timer (UserSpeakingFrame/BotSpeakingFrame
+ asyncio.Event loop) with a simple one-shot timer that starts when
BotStoppedSpeakingFrame is received and cancels on UserStartedSpeakingFrame or
BotStartedSpeakingFrame. This eliminates false idle triggers caused by gaps
between the user finishing speaking and the bot starting to speak (LLM/TTS
latency).
Guard the timer start with two conditions to prevent false triggers:
- User turn in progress: during interruptions, BotStoppedSpeaking arrives
while the user is still speaking mid-turn.
- Function calls in progress: FunctionCallsStarted arrives before
BotStoppedSpeaking because the bot speaks concurrently with the function
call starting, so the timer must wait for the result and subsequent bot
response.
Now that all services use typed `ServiceSettings` objects, this removes the interim scaffolding that supported both dict-based and typed settings paths in parallel. Specifically: removes old dict-based `_update_settings(settings: Mapping)` methods from base classes, removes `isinstance(self._settings, ServiceSettings)` guards, simplifies `process_frame` branching, and renames `_update_settings_from_typed` to `_update_settings` across all ~30 service implementations. Also renames the no-arg `_update_settings()` helper on realtime services to `_send_session_update()` to avoid collision, adds `from_mapping` overrides on `GoogleLLMSettings` and `AnthropicLLMSettings` for ThinkingConfig dict-to-object conversion, and replaces a broken no-arg `_update_settings()` call in Gemini Live with a TODO.
- NvidiaSTTService.set_model: convert to proper DeprecationWarning (model can't change at runtime for Riva streaming STT)
- NvidiaTTSService.set_model: same treatment for Riva TTS
- NvidiaSegmentedSTTService.set_model: remove — base class now routes through _update_settings_from_typed which re-creates the recognition config
- GeminiTTSService.set_voice: remove — move AVAILABLE_VOICES validation into _update_settings_from_typed so it fires on both legacy and new paths
Standardize all STT, TTS, and LLM service classes to declare `_settings` with the narrowed Settings type as a class-level annotation. This gives editors and type checkers the specific type when hovering or autocompleting on `self._settings` in each service and its subclasses. Inline `self._settings: Type = ...` assignments are replaced with plain `self._settings = ...`.
`filter_incomplete_user_turns` and `user_turn_completion_config` were only handled in the legacy dict-based `_update_settings` code path. This adds them to `LLMSettings` and introduces `LLMService._update_settings_from_typed` so the typed path handles them too.
Does not (yet) touch `InputParams`, to avoid scope creep and touching something currently part of the public API. But there is a lot of overlap between `*Settings` object fields and `InputParams` fields.
Other than discoverability/typing, these are some other improvements brought by this refactor:
- There is now a single code path (see `_update_settings_from_typed`) where services can respond to settings changes (by, say, reconnecting if needed), improving maintainability and guaranteeing one and only one reconnection no matter which settings changed
- `set_language`/`set_model`/`set_voice`—which we're assuming are usable as public methods, though *not* recommended over `*UpdateSettingsFrame`—all use the same code path as settings updates. They're also now all consistent in that, if a service needs to respond to a change (by, say, reconnecting if needed), any of these methods will kick off that process. Note that this is technically a behavior change.
- Several services now properly react to changed settings by reconnecting:
- `AWSTranscribeSTTService`
- `AzureSTTService`
- `SonioxSTTService`
- `GladiaSTTService`
- `SpeechmaticsSTTService`
- `AssemblyAISTTService`
- `CartesiaSTTService`
- `FishAudioTTSService` (would previously only reconnect when `model` changed)
- `GoogleSTTService`
- `SpeechmaticsSTTService` (which previously only handled *some* settings updates through a nonstandard public `update_params` method)
- `GradiumSTTService`
- `NvidiaSegmentedSTTService` (which previously only handled changes to language)
- Bookkeeping across various services has been reduced, mostly by deduping ivars; the `self._settings` ivar is treated as the source of truth
NOTE: I pretty much guarantee that there are services missed in this PR in terms of bringing to consistency with how updates are handled (like whether changes in certain fields trigger reconnects when they need to). We can squash remaining inconsistencies as we stumble onto them, service by service. The goal here is to get things *mostly* in order, and establish the infrastructure and patterns we'll need going forward.
The outer try/except in each service decorator caught both tracing
setup errors and application errors from the wrapped function. If the
function itself raised (e.g. LLM rate limit, TTS timeout), the
exception was caught and the function was called a second time.
Fix by tracking whether the original function was called via a
fn_called flag. If the function was already called, re-raise the
exception instead of falling back to untraced re-execution.
Adds a Claude Code skill that analyzes the current branch diff against
main, maps changed source files to their doc pages, and makes targeted
updates to Configuration, InputParams, Usage, Notes, and Event Handlers
sections.
The class_decorators.py module (Traceable, @traceable, @traced) is not
used anywhere in the codebase. Mark it deprecated and fix the misleading
comment in service_decorators.py that referenced it as if it were active.
Consolidate _tracing_enabled and _tracing_context from LLMService,
STTService, and TTSService into the shared AIService base class.
Extract _get_turn_context() helper in service_decorators.py to
encapsulate the repeated pattern across all traced decorators.
Add model-specific params (arcana: repetition_penalty, temperature, top_p;
mistv2: no_text_normalization, save_oovs, segment) with dynamic query param
building via _build_settings(). Model/voice/param changes now trigger
WebSocket reconnection since all settings are URL query params.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This allows non-websocket STT services (like SarvamSTTService, which uses
the Sarvam Python SDK for connection management) to reuse the same keepalive
pattern. Subclasses override _send_keepalive() and _is_keepalive_ready() for
their specific protocol.
ConversationContextProvider and TurnContextProvider were singletons that
stored tracing context as class-level state. When two PipelineTask instances
ran concurrently, they would overwrite each other's context, causing service
spans to attach to the wrong pipeline's turn span.
Replace both singletons with a single TracingContext object owned by each
PipelineTask, threaded to services via StartFrame.
The Grok API now returns prefixed voice names (e.g. "human_Ara") in
session.updated events, causing Pydantic validation errors. Widen the
voice field type from GrokVoice to GrokVoice | str to accept both
user-facing names and server-returned values.
Extract the procedural PR workflow into an actionable skill that can be
invoked with /pr-submit. CLAUDE.md is better suited for project context
and conventions, not step-by-step procedures.
Expand the Pull Requests section with detailed step-by-step instructions
including branch naming, commit guidance, changelog generation, and PR
description updates.
OpenAI's AsyncStream uses close() while async generators (e.g. from
OpenPipe) use aclose(). Replace direct async-with on the stream with a
helper that handles both protocols.
Adds opt-in keepalive_timeout and keepalive_interval params to
WebsocketSTTService. When enabled, a background task sends silent audio
(or a service-specific protocol message) when the connection has been
idle, preventing server-side timeout disconnects.
Subclasses override _send_keepalive(silence) to wrap the silence in
their wire format. The default sends raw PCM bytes.
Enables keepalive for ElevenLabs (10s), Gladia (20s), and Soniox (1s),
replacing their per-service custom keepalive tasks.
Add a `service` field so the frame targets a specific service, allowing
ServiceSwitcher.push_frame to consume it only when the targeted service
matches the active service. STTService and test mocks now push the frame
downstream after handling instead of silently consuming it.
The default stop strategy changed to TurnAnalyzerUserTurnStopStrategy,
which requires actual audio analysis. Use SpeechTimeoutUserTurnStopStrategy
explicitly since this test is not testing turn detection.
Change the default user turn stop strategy from
TranscriptionUserTurnStopStrategy to TurnAnalyzerUserTurnStopStrategy
with LocalSmartTurnAnalyzerV3. Also reduce AUDIO_INPUT_TIMEOUT_SECS
from 1.0 to 0.5 and remove its debug log.
- Make ServiceSwitcherStrategy inherit from BaseObject with properties
for services and active_service, and move initial service selection
into the base class
- Add on_service_switched event to ServiceSwitcherStrategy
- handle_frame now returns the switched-to service (or None), allowing
ServiceSwitcher to swallow ManuallySwitchServiceFrame on switch and
request metadata from the new active service
- Override push_frame to suppress RequestMetadataFrame and
ServiceMetadataFrame from inactive services
- Remove ServiceSwitcherFilter and ServiceSwitcherFilterFrame in favor
of plain FunctionFilter instances with closures that check the
strategy's active service directly
- FunctionFilter: add FilterType alias
- FunctionFilter: when direction is None, frames in both directions
are filtered instead of just one
- Add docstrings to ServiceSwitcher and its components
Refactor TranscriptionUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy
to use VADUserStoppedSpeakingFrame as the ground truth for when speech ended,
rather than triggering timeouts from transcription frames.
RTVIObserver now skips upstream frames to prevent duplicate RTVI messages
when frames are broadcast in both directions. Also changed
FunctionCallCancelFrame to use broadcast_frame for consistency with
other function call frames.
Processors inside parallel sub-pipelines can push frames during
StartFrame/EndFrame/CancelFrame processing. Previously these frames
could escape the ParallelPipeline before all branches finished
processing the lifecycle frame. Now they are buffered and flushed
after synchronization completes.
Add tests for the event-based interruption completion: complete() sets
the event, complete() is safe without an event, the event fires at
the pipeline sink, and a warning is logged when the frame is blocked.
Also remove the unconditional await after the timeout so the function
returns instead of hanging when complete() is never called.
Move the interruption wait event from per-processor instance state to
the frame itself. The event is created in
push_interruption_task_frame_and_wait(), threaded through
InterruptionTaskFrame → InterruptionFrame, and set when the frame
reaches the pipeline sink. This scopes the event to each interruption
flow rather than sharing mutable state on the processor.
Also adds a 2s timeout warning to help diagnose cases where
InterruptionFrame.complete() is never called.
This adds user-to-bot response latency tracking to OpenTelemetry spans:
- Created UserBotLatencyObserver as a reusable component for tracking
user-to-bot response latency
- Records the value as an attribute on turn spans (turn.user_bot_latency_seconds)
- Updated TurnTraceObserver to use UserBotLatencyObserver, following the same pattern as TurnTrackingObserver
- Updated PipelineTask to automatically create and wire UserBotLatencyObserver
when tracing is enabled (same as TurnTrackingObserver)
Tests cover:
- No messages received (raises ValueError)
- One message received (logs warning, continues)
- Two messages received (normal operation)
- All telephony providers (Twilio, Telnyx, Plivo, Exotel)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Handle WebSocket disconnections gracefully when telephony providers send
fewer messages than expected. Adds explicit StopAsyncIteration handling
for both first and second message retrieval.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
NLTK's sent_tokenize() only supports ~15 European languages and defaults to
English. For Japanese, Chinese, Korean, Hindi, Arabic, and other non-Latin
languages, NLTK fails to recognize sentence boundaries like 。?! causing
text to accumulate until flush instead of being emitted sentence-by-sentence.
Add a fallback in match_endofsentence() that scans for unambiguous non-Latin
sentence-ending punctuation when NLTK fails to split the text. Latin
punctuation (. ! ? ; …) is excluded from the fallback since NLTK handles
those correctly and they can be ambiguous (abbreviations, decimals, etc.).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
broadcast_frame() expects a frame class and kwargs, but the three
websocket input transports (fastapi, client, server) were incorrectly
passing a frame instance. This would cause a TypeError at runtime when
an InputTransportMessageFrame was received.
Incorporates latest changes from main branch including:
- AIC filter and VAD updates
- STT service improvements
- Base serializer changes
- Various bug fixes
If `enable_rtvi` is enabled (enabled by default) and RTVI processor will be
added automatically to the pipeline. Also, and RTVI observer will be
registered.
Process audio as soon as we receive it from the generator. Previously, we were
reading from the generator and adding elements into a queue until there was no
more data, then we would process the queue.
If AIService subclasses implement start()/stop()/cancel() and exception are not
handled, execution will not continue and therefore the originator frames will
not be pushed. This would cause the pipeline to not be started (i.e. StartFrame
would not be pushed downstream) or stopped properly.
Issues:
- After disconnecting, we were prematurely sending audio messages using the new prompt and content names, before the new prompt and content were created
- We weren't properly sending system instruction and conversation history messages to Nova Sonic with `"interactive": false`
The underlying issue was related to the fact that we were sending audio to Grok before we had configured the Grok session with our default input sample rate (16000), so Grok was interpreting those initial audio chunks as having its default sample rate (24000). We didn't see this issue when using the Daily transport simply because in our test environments Daily took a smidge longer than a reflexive (localhost) pure WebRTC connection, so we would only send audio to Grok *after* we had configured the Grok session with the desired sample rate.
Extract dictionary value to local variable and check for None before
accessing cancel_on_interruption attribute, since the dictionary values
are typed as Optional[FunctionCallInProgressFrame].
If we aggregate transcriptions we will get incorrect interruptions. For example,
if we have a strategy with min_words=3 and we say "One" and pause, then "Two"
and pause and then "Three", this would trigger the start of the turn when it
shouldn't. We should only look at the incoming transcription text and don't
aggregate it with the previous.
- Replace aiohttp with camb SDK (AsyncCambAI client)
- Add support for passing existing SDK client instance
- Simplify API: no longer requires aiohttp_session parameter
- Update example to use simplified initialization
- Rewrite tests to mock SDK client instead of HTTP servers
- Add --voice-id CLI argument to example (default: 2681)
- Remove test_camb_quick.py from examples/ (tests belong in tests/)
- Update docstring with new usage
Gemini expects parallel function calls to be passed in as a single multi-part `Content` block. This is important because only one of the function calls in a batch of parallel function calls gets a thought signature—if they're passed in as separate `Content` blocks, there'd be one or more missing thought signatures, which would result in a Gemini error.
Add support for Gladia's speech_start/speech_end events to emit
UserStartedSpeakingFrame and UserStoppedSpeakingFrame frames.
When enable_vad=True in GladiaInputParams:
- speech_start triggers interruption and pushes UserStartedSpeakingFrame
- speech_end pushes UserStoppedSpeakingFrame
- Tracks speaking state to prevent duplicate events
This allows using Gladia's built-in VAD instead of a separate VAD
in the pipeline.
The end of turn is already handle with interruptions or with
LLMFullResponseEndFrame. LLMFullResponseEndFrame should never be blocked,
otherwise the assistant would not work.
* Krisp VIVA SDK Filter and Turn support.
* Reverted the krisp_filter.py as it's already deprectaed.
* enabled test with krisp_audio mock.
* More review comment fixes.
reverted the state logic in viva filter to be similar to the existing impl on main branch.
Fixed tests, ruff, etc.
* More review comments for Turn detection.
removed integration tests.
* Moved the SDK init/deinit into start/stop
Changed getattr with default value to use 'or' operator for fallback.
This ensures proper model name retrieval when model_name attribute exists but is None or empty.
This commit adds word boundary support to AzureTTSService and fixes
the race condition that causes scrambled TTS output across multiple
sentences.
## Features Added
- Change AzureTTSService to inherit from WordTTSService
- Subscribe to Azure SDK's synthesis_word_boundary event
- Emit word-level text with timing information via _words_queue
- Add synthesis lock for sequential sentence processing
## Race Condition Fix
Previously, each sentence's word boundary timestamps reset to 0,
causing downstream components to interleave words when reordering
frames by PTS. This resulted in scrambled output like:
'Hello ! I What am questions AI have assistant...'
The fix adds cumulative audio offset tracking to ensure monotonically
increasing PTS across all sentences:
Sentence 1: pts = 0.1s, 0.5s, 0.8s (cumulative at end: 0.8s)
Sentence 2: pts = 0.9s, 1.2s, 1.5s (0.8s + relative offset)
## Key Changes
- _cumulative_audio_offset: tracks total audio duration
- _handle_word_boundary: adds cumulative offset to timestamps
- _handle_completed: accumulates audio duration for next sentence
- flush_audio: resets cumulative offset at end of LLM response
- _handle_interruption: resets state on user interruption
- run_tts: uses synthesis lock for sequential processing
Fixes#2918🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for the language_hints_strict parameter in Soniox STT
configuration. When set to true, this parameter strictly enforces
language hints, restricting transcription to only the specified
languages.
Prior to this change, after the model generated an image the conversation would not be able to progress. It would stall out because we were never storing the image in context, so the model would never realize it already did the work of generating an image. We didn't run into issues with Gemini 2.5 Flash Image, because that model always followed up an image with a text message.
Adds support for using Ultravox Realtime as a speech-to-speech service.
Also removes the deprecated Ultravox speech-to-text vllm model integration to avoid confusion.
Changed the on_client_connected system message from a direct greeting to
an instruction that tells the AI to introduce itself, giving the LLM more
flexibility in how it starts the conversation.
Thinking, sometimes called "extended thinking" or "reasoning", is an LLM process where the model takes some additional time before giving an answer. It's useful for complex tasks that may require some level of planning and structured, step-by-step reasoning. The model can output its thoughts (or thought summaries, depending on the model) in addition to the answer. The thoughts are usually pretty granular and not really suitable for being spoken out loud in a conversation, but can be useful for logging or prompt debugging.
Here's what's added:
1. New typed input parameters for Google and Anthropic LLMs that control the models' thinking behavior (like how much thinking to do, and whether to output thoughts or thought summaries).
2. New frames for representing thoughts output by LLMs.
3. A generic mechanism for associating extra LLM-specific data with a function call in context, used specifically to support Google's function-call-related "thought signatures", which are necessary to ensure thinking continuity between function calls in a chain (where the model thinks, makes a function call, thinks some more, etc.)
4. A generic mechanism for recording LLM thoughts to context, used specifically to support Anthropic, whose thought signatures are expected to appear alongside the text of the thoughts within assistant context messages.
5. An expansion of `TranscriptProcessor` to process LLM thoughts in addition to user and assistant utterances.
Deepgram request IDs are necessary for investigating behavior at the
request level. This commit adds DEBUG logs that print Deepgram request
IDs when using Deepgram's STT or TTS.
a best effort version of the bot's output
- The `RTVIObserver` now emits `bot-output` messages based off
the new `AggregatedTextFrame`s (`bot-tts-text` and
`bot-llm-text` are still supported and generated, but
`bot-transcript` is now deprecated in lieu of this new, more
thorough, message).
- The new `RTVIBotOutputMessage` includes the fields:
- `spoken`: A boolean indicating whether the text was spoken by TTS
- `aggregated_by`: A string representing how the text was aggregated
("sentence", "word", "my custom aggregation")
- Introduced new fields to `RTVIObserver` to support the new
`bot-output` messaging:
- `bot_output_enabled`: Defaults to True. Set to false to disable
bot-output messages.
- `skip_aggregator_types`: Defaults to `None`. Set to a list of
strings that match aggregation types that should not be included
in bot-output messages. (Ex. `credit_card`)
`CartesiaTTSService`:
- Modified use of custom default text_aggregator to avoid deprecation warnings and push users
towards use of transformers or the `LLMTextProcessor`
- Added convenience methods for taking advantage of Cartesia's SSML tags: spell, emotion,
pauses, volume, and speed.
`RimeTTSService`:
- Modified use of custom default text_aggregator to avoid deprecation warnings and push users
towards use of transformers or the `LLMTextProcessor`
- Added convenience methods for taking advantage of Rime's customization options: spell,
pauses, pronunciations, and inline speed control.
Introduced `LLMTextProcessor`: A new processor meant to allow customization for how
LLMTextFrames should be aggregated and considered. It's purpose is to turn
`LLMTextFrame`s into `AggregatedTextFrame`s. By default, a TTSService will still
aggregate `LLMTextFrame`s by sentence for the service to consume. However, if you
wish to override how the llm text is aggregated, you should no longer override the
TTS's internal text_aggregator, but instead, insert this processor between your LLM
and TTS in the pipeline.
This frame introduces an `aggregated_by` field to describe the type of text included
in the frame and allows unspoken groupings of text to be pushed through the pipeline
and treated similar to TTSTextFrames.
Adding support for setting whether or not the text in the TextFrame
should be added to the LLM context (by the LLM assistant aggregator).
Defaults to `True`.
Modified the BaseTextAggregator type so that when text gets aggregated, metadata can
be associated with it. Currently, that just means a `type`, so that the aggregation
can be classified or described. Changes made to support this:
- **IMPORTANT**: Aggregators are now expected to strip leading/trailing white space
characters before returning their aggregation from `aggregation()` or `.text`. This
way all aggregators have a consistent contract allowing downstream use to know how
to stitch aggregations back together
- Introduced a new `Aggregation` dataclass to represent both the aggregated `text` and
a string identifying the `type` of aggregation (ex. "sentence", "word", "my custom
aggregation")
- **BREAKING**: `BaseTextAggregator.text` now returns an `Aggregation` (instead of `str`).
To update: `aggregated_text = myAggregator.text` -> `aggregated_text = myAggregator.text.text`
- **BREAKING**: `BaseTextAggregator.aggregate()` now returns `Optional[Aggregation]`
(instead of `Optional[str]`). To update:
```
aggregation = myAggregator.aggregate(text)
if (aggregation):
print(f"successfully aggregated text: {aggregation.text}") // instead of {aggregation}
```
- `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator` updated to
produce/consume `Aggregation` objects.
- All uses of the above Aggregators have been updated accordingly.
This update ensures that audio is flushed immediately after sending bare text to the WebSocket, improving the responsiveness of the Text-to-Speech service.
This commit introduces the RimeNonJsonTTSService class, enabling Text-to-Speech synthesis over WebSocket endpoints that require plain text messages. The service includes configuration parameters for language, segmentation, and audio settings, and handles WebSocket connections for raw audio byte transmission. Limitations include the lack of support for word-level timestamps and context IDs.
Now:
- For TTS word-by-word output and `TTSSpeakFrames`: `TTSTextFrame`s' have `includes_inter_frame_spaces=False`.
- For all other TTS output: `TTSTextFrame` pass through the received text frames' `includes_inter_frame_spaces` value. So far, this value has always been `True`: LLMs send text chunks already containing all necessary spaces.
- `LLMTextFrame`s set `includes_inter_frame_spaces=False` at init time, per the aforementioned assumption.
* feat: Add ErrorFrame emission to TTS/STT services for pipeline error detection
- Add ErrorFrame emission to all major TTS/STT services during initialization and runtime failures
- Services updated: Cartesia, ElevenLabs, Deepgram, AssemblyAI, Rime, Azure
- ErrorFrame objects emitted with fatal=False for graceful degradation
- Enables on_pipeline_error event handler to detect service failures programmatically
- Add comprehensive pytest test suite to verify ErrorFrame emission
- Fixes issue where services failed gracefully but didn't emit ErrorFrame objects
This allows developers to implement real-time error monitoring and alerting
using the on_pipeline_error event handler introduced in v0.0.90.
* Update STT and TTS services to use consistent error handling pattern
- Improves error handling consistency across all services
* Add changelog entry for STT/TTS error handling improvements
* Linting issues Resolved
* Azure STT ErrorFrames added with consistent patterns
* Cartesia STT and Deepgram STT; additional fixes made
* Removed Fatal Flags across services, removed duplication
* Moving the changelog entry to the correct place.
* Refactoring some classes to use yield instead of push_error directly.
* Fixing ruff format.
---------
Co-authored-by: Filipi Fuchter <filipi87@gmail.com>
- `GoogleHttpTTSService`
- `OpenAITTSService`
The reason I skipped this work in an earlier PR was because these services seemed to be emitting long, punctuation-free text frames. It turns out that the issue was with the LLM prompt, though, resulting in the LLM nondeterministically excluding all punctuation. An upcoming commit will address that prompt issue.
Note that for `LLMTextFrame`s, the right behavior is pretty much always `includes_inter_frame_spaces = True`. I decided *not* to go ahead and make that the default for `LLMTextFrame`s, though, simply to not introduce a subtle behavior change for creative/unexpected use-cases that were relying on text in hand-crafted `LLMTextFrame`s being handled a certain way. Ditto for `TTSTextFrame`s.
Also, fix an issue in `NeuphonicTTSService` where it wasn't pushing `TTSTextFrame`s.
Also, fix the broken `SarvamHttpTTSService` example.
Also, add a couple of missing examples.
* Fix Langfuse tracing for GoogleLLMService with universal LLMContext
- Fixed issue where input appeared as null in Langfuse dashboard for GoogleLLMService
- Added fallback to use adapter's get_messages_for_logging() for universal LLMContext
- Ensures proper message format conversion for Google/Gemini services
- Handles system message conversion to system_instruction format
- Also fixes serialization of empty message lists ([] now serializes correctly)
This fix ensures Langfuse tracing works correctly for Google services using
both OpenAILLMContext/GoogleLLMContext and the universal LLMContext.
* Add unit tests for Langfuse tracing with GoogleLLMService
- Test that tracing correctly captures messages with universal LLMContext
- Test that empty message lists are properly serialized
- Test that adapter's get_messages_for_logging is used instead of context method
- All tests verify that input is correctly added to Langfuse spans
* Fix test mocking to patch opentelemetry.trace.get_tracer correctly
The tests were failing in CI because they were trying to patch
'pipecat.utils.tracing.service_decorators.trace' which doesn't exist as
an attribute. The trace module is imported from opentelemetry, so we need
to patch 'opentelemetry.trace.get_tracer' instead.
* Skip tracing tests when opentelemetry is not installed
The tracing dependencies (opentelemetry) are optional in Pipecat and not
installed in the CI environment. Added a skipif marker to skip these tests
when opentelemetry is not available, preventing CI failures while still
allowing the tests to run when tracing dependencies are installed locally.
* Install tracing dependencies in GitHub Actions CI
Instead of skipping the tracing tests, install the 'tracing' extra
(opentelemetry) in the CI environment so the tests can run properly.
Removed the skipif condition from the tests since opentelemetry will
now be available in CI.
* Use the context type to determine which messages to use, fix tool_count and tools (#3032)
---------
Co-authored-by: Mark Backman <mark@daily.co>
I chose to go the somewhat hacky route of adding the `ToolsSchema` support into the `events.SessionProperties` model itself—even though we should never serialize that type when creating events—because the alternative seemed to be to create a new type for `OpenAIRealtimeLLMService` initialization parameters and then we'd have to contend with backward compatibility, which seemed like a bigger headache.
- If the context contained a system message, that message would be converted to a user message and the LLM would respond
- If the system message was provided as a constructor argument, though, no user messages would be sent to the LLM, and the LLM would therefore not respond
Not adding this fix to the CHANGELOG since `GeminiLiveLLMService`'s ability to properly handle context-provided tools and system instruction hasn't been published yet.
Not adding this fix to the CHANGELOG since `GeminiLiveLLMService`'s ability to properly handle context-provided tools and system instruction hasn't been published yet.
This allows folks to use `MCPClient` alongside the pattern of passing in tools at LLM init time, a pattern supported by speech-to-speech services such as `GeminiLiveLLMService`.
This does a couple of things:
- Makes the `MCPClient` LLM agnostic, setting us up for some upcoming improvements (like making it possible to use with `LLMSwitcher`)
- Makes `GeminiLLMAdapter` more robust, as the schema massaging that was previously only done in `MCPClient` is useful for all tools, not just for MCP-provided ones
Before, when requesting a user image from a function call we had to wait for a
random time before we could indicate the function call was done. This was to
given time to the aggregator to process the image before marking the function
call as completed.
To avoid this, we now wait for the requested image to be received by the LLM
assistant agrgegator (using an asyncio event). Then, we can successfully mark
the function call as completed.
Make `LLMUserAggregator` push the `LLMSetToolsFrame`s, in case a speech-to-speech service that needs to handle the frame itself—like `OpenAIRealtimeLLMService`—is downstream. As far as I can tell, pushing `LLMSetToolsFrame` should otherwise have no unwanted side effects.
Add `LLMContext.get_messages_for_persistent_storage()` for compatibility with `OpenAILLMContext`, to avoid tripping up users who we're unknowingly migrating to using `LLMContext`.
Add back file that was removed, when it should've just been deprecated.
Also, fix version numbers in deprecation messages to match the next expected release.
Update `create_context_aggregator()` (which we're keeping around for backward compatibility) to create a `LLMContextAggregatorPair` rather than OpenAI-Realtime-specific aggregators.
Implement sending tool call results to the OpenAI server based on reading context updates. This lets us use the normal assistant context aggregator and not a special OpenAI Realtime subclass that pushes up a special frame for function call results.
Receiving a new context (via a context frame) no longer serves as a signal to reset the conversation. That’s because we’re now receiving new contexts from the user aggregator every time new messages are added, and from the assistant aggregator when function call results come in. The code pattern we're heading towards, of “diffing” each new context with the previous on, sets us up for doing more sophisticated things in the future, like sending specific messages to OpenAI to edit its internally-tracked context.
Also, remove code that was directly modifying context.
Push `TranscriptionFrame`s upstream, to be handled by the user context aggregator. This will require at least a couple of other changes:
- Updating examples to put transcript processors upstream from `OpenAIRealtimeLLMService`
- Maybe figuring out a way to preserve backward compatibility with existing pipelines that put transcript processors downstream from `OpenAIRealtimeLLMService`
- Updating `OpenAIRealtimeLLMService` to ignore new received context frames, since the upstream user context aggregator will generate those after each newly-added user message; hopefully nobody was reliant on the old behavior of resetting the conversation upon receiving a new context!
Avoid pushing `LLMTextFrame` when `OpenAIRealtimeLLMService` is configured to output audio. This avoids duplicate text in assistant messages in context. Conceptually, a speech-to-speech service encapsulates TTS behavior; in a "traditional" pipeline, `LLMTextFrames` are swallowed by the TTS service, so they should similarly not be pushed by a speech-to-speech service. Only. `TTSTextFrame`s should be pushed.
Properly bookended responses now work with:
- AUDIO modality (validated with 26b example)
- TEXT modality (validated with 26d example)
- AUDIO modality with Vertex AI (validated with 26h example)
It doesn't seem that TEXT modality is supported with Vertex AI, hence the missing "quadrant" of validation.
- Change GenerationConfig from dataclass to Pydantic BaseModel for consistency
- Simplify _build_msg() to use model_dump(exclude_none=True) instead of manual field extraction
- Simplify HTTP run_tts() to use model_dump(exclude_none=True) instead of manual field extraction
This addresses feedback from code review and reduces code duplication.
Add GenerationConfig dataclass with volume, speed, and emotion parameters
for Cartesia Sonic-3 TTS models. This enables fine-grained control over
speech generation including volume (0.5-2.0), speed (0.6-1.5), and
emotion (60+ options).
Changes:
- Add GenerationConfig dataclass with proper Google-style docstrings
- Update CartesiaTTSService.InputParams to include generation_config
- Update CartesiaHttpTTSService.InputParams to include generation_config
- Modify _build_msg() to include generation_config in WebSocket messages
- Modify run_tts() to include generation_config in HTTP requests
- Maintain backward compatibility with existing speed and emotion parameters
The legacy speed (literal strings) and emotion (list) parameters remain
available for non-Sonic-3 models.
This wasn't really an issue before, when folks were *knowingly* migrating from `OpenAILLMContext` to `LLMContext`. But in the latest AWS Nova Sonic change, we're swapping it out from under folks, so this kind of compatibility is more important.
For context, the reason we *didn't* offer the `messages` property earlier was to aid in the development of `LLMContext`—we wanted to draw attention to all the places where messages were being read from context, so we could find the places where we might need to pass an argument to the read.
The reason for its `system_instruction` argument was to support usage with LLMs where you might pass the system instruction as a parameter to the `LLMService` rather than specifying it in the context.
But as I thought about it more I became unconvinced that the `system_instruction` argument was really beneficial:
- If you specified your system instruction in your context in the first place, it'll still be there when you read messages for persistent storage
- If you didn't specify your system instruction in the context and instead passed it in as an `LLMService` parameter, you most likely *don't* want it to be in the context when you read messages for persistent storage
- ...and if you really really do need to inject it at the start of the context, it's quite easy to do anyway
And if we remove the `system_instruction` argument from `get_messages_for_persistent_storage()`, then it's essentially just `get_messages()`.
Also fix the slightly wrong (but so far harmless) pattern of initializing `OpenAILLMService.InputParams()` in the `GoogleVertexLLMService` if `params` wasn't provided—we should be letting the superclass decide what to do if the argument isn't specified.
- Conceptually, these args comprise project-level setup, akin to credentials, whereas everything in `InputParams` is concerned with model configuration
- Providing a `project_id` when initializing `GoogleVertexLLMService` should not be optional, but prior to the change in this commit it was (erroneously) treated as optional by dint of `InputParams` being optional
This improvement was discussed [in this comment](https://github.com/pipecat-ai/pipecat/pull/2795#discussion_r2408279142).
To understand this fix, let's look exclusively at `pause_processing_frames()` (`pause_processing_system_frames()` works the same way).
`pause_processing_frames()` works by setting a `__should_block_frames` flag, which is then read each time through the loop in the long-running `__process_frame_task_handler`. if `__should_block_frames` is `True`, it pauses processing frames until it's resumed.
Prior to this fix, the check for `__should_block_frames` was before `await self.__process_queue.get()`. The problem is that a lot of the time spent in the loop is waiting for a frame from the process queue. So if `pause_processing_frames()` is set at any time other than within `process_frame()` itself, it actually won't have an effect by the next frame, only on the frame *after* the next, which is later than intended.
Because thus far in the Pipecat codebase we've only ever called `pause_processing_frames()` and `pause_processing_system_frames()` from within `process_frame()`, this change should have no behavioral effect. But it will be helpful if we ever need to call it from anywhere else. I noticed this issue while developing a feature that did exactly that (though I later abandoned that code).
It expected a WebSocket URL, but we're no longer (directly) using WebSockets to talk to Gemini. Instead of trying to (potentially erroneously) map a given custom WebSocket URL to an `HttpOptions` object (the new preferred way of customizing requests made by the Gemini API client), we're simply deprecating `base_url` and pointing users to the `http_options` argument instead.
We move the thread creation to the VADAnalyzer instead of the input
transport. This can potentially be useful if we need to analyze multiple audio
streams.
- Usage in classes that are already deprecated
- Usage related to realtime LLMs, which don't yet support `LLMContext`
- Usage in (soon-to-be-deprecated) code paths related to `OpenAILLMContext` itself and associated machinery
Note that `LLMContext` doesn't have a `get_messages_for_persistent_storage()`; the messages are already in the "standard" format so they can be used directly for storage.
NOTE: oops! Turns out some of these files had *already* been updated to use universal `LLMContext` even though they weren't yet using `ToolsSchema`. This commit should fix those examples.
With all these examples updated, we no longer need dedicated examples illustrating `LLMContext`, so they're removed.
Here’s where we *don’t* yet use `LLMContext` and associated machinery:
- Realtime services: OpenAI Realtime, Gemini Live, and AWS Nova Sonic (support coming soon)
- `GoogleLLMOpenAIBetaService` (it’s deprecated, so we didn’t bother adding support)
- `LLMLogObserver` (support coming soon)
- `GatedOpenAILLMContextAggregator` (support coming soon)
- `LangchainProcessor` (support coming soon)
- `Mem0MemoryService` (support coming soon)
- Examples that use LLM-specific tools definitions as opposed to `ToolsSchema` (these will be updated soon)
- Examples that rely `GoogleLLMContext.upgrade_to_google` (TBD what to do with these)
Examples that use `LLMLogObserver`:
- 30-
Examples that use `GatedOpenAILLMContextAggregator`:
- 22-
Examples that use `LangchainProcessor`:
- 07b-
Examples that use `Mem0MemoryService`:
- 37-
Examples that need updating to use `ToolsSchema`:
- 15-
- 15a-
- 20a-
- 20c-
- 20d-
- 22b-
- 22c-
- 33-
- 36-
Examples that use `GoogleLLMContext.upgrade_to_google`:
- 22d-
- 25-
- Document the new global location support in GoogleVertexLLMService
- Explain the difference between regional and global API hosts
- Follow Keep a Changelog format
Immediate is the "default", i.e. has the more obvious name (e.g. `ManuallySwitchServiceFrame` v `ManuallySwitchServiceControlFrame`), since that's *probably* what users will want to reach for. Also, the immediate frames are more likely to behave like what we had before the last few commits, where the service switch would always "jump the queue" by having an immediate effect once it hit the `ServiceSwitcher` in the pipeline, jumping ahead of frames in front of it destined for the service.
- A text frame
- A `ManuallySwitchServiceFrame` (which is a `ServiceSwitcherFrame`)
- Another text frame
And expect that the first text frame be handled by the initially active service and the second text frame be handled by the newly active one.
Previously, the `ManuallySwitchServiceFrame` would have an effect too early, causing both text frames to be handled by the newly active service. Why? Because the frame filtering condition was being updated *directly* by the `ServiceSwitcher`, which is upstream from the services it's switching between. It could therefore update the filters *before* the services received the prior frames.
- Update _get_base_url method to handle 'global' location case
- Use 'aiplatform.googleapis.com' for global locations
- Use '{location}-aiplatform.googleapis.com' for regional locations
- Maintains backward compatibility with existing regional endpoints
* Handle missing rawResponse in transcription messages
- Use message.get('rawResponse', {}) to safely access rawResponse field
- Default is_final to False when rawResponse is missing
- Add proper type annotations for better code clarity
- Minor import formatting cleanup
This prevents KeyError crashes when transcription messages from Daily's API
don't include the rawResponse field in edge cases.
* docs: add changelog line
Removing `VisionImageRawFrame` lets us simplify LLM services' logic, getting us closer to the idealized architecture where all they care about is handling context frames.
This change is in service of getting us closer to ready to deprecate usage of `OpenAILLMContext` and subclasses in favor of the universal `LLMContext`, at least for the traditional text-to-text LLMs.
Why remove `VisionImageRawFrame` rather than deprecate? It's "internal"—only created by `VisionImageFrameAggregator`—and never intended to be used directly by users (it would be difficult to use directly anyway).
Move the logic that was once in `VisionImageFrameAggregator` directly into the examples. Reasoning:
- If `UserImageRequester` is defined in the examples, it makes sense for `UserImageProcessor` to be too, as it’s the flip side of the same coin, so to speak
- The logic is now pretty trivial
- This kind of one-shot, history-less image-describing pipeline shouldn't be common at all; it's ok for it to live in examples rather than as a dedicated class
- In the short term, this enables us to create `LLMContext`s for services that support it and `OpenAILLMContext`s for services that don't yet (AWS)
This commit also adds missing translation from OpenAI-format image context messages to AWS format. Note that this isn't a wasted effort in the face of the upcoming migration to universal `LLMContext`—this work will be reused as it has to be implemented there too.
This lets a text frame bypass TTS while still being included in the LLM
context. Useful for cases like structured text that isn’t meant to be spoken but
should still contribute to context.
Some implementations were returing a list and some were returning a JSON
string. They should all return a list and the user would decide if it wants to
transform that into JSON.
* Added Sarvam TTS Websocket Implementation
* Addressed some of the comments on PR
* added change voice logic
* added changes from main
* pushing text frames and added flush audio
* updated docs string for better docs
* Addressed comments and added some improvements
* pushed optional args down
* removed new line
* made aiohttp session mandatory in http service
* added push frame and removed unused function
* removed pong message
* added disconnecting logic
---------
Co-authored-by: vinayak-sarvam <vinayak@sarvam.ai>
1. `ToolsSchema` has been supported in `LLMSetToolsFrame` for a while but wasn't properly reflected in these type hints
2. The new universal `LLMContext` expects tools to be either a `ToolsSchema` or `NOT_GIVEN`.
This abstraction will allow us to update Pipecat Flows to avoid reaching into LLM service internals to generate summaries.
In addition to being a helpful refactor to remove a fragile part of Pipecat Flows, this change helps set the stage for supporting the upcoming `LLMSwitcher`, where the “active” LLM will only be determined at runtime—today, Pipecat Flows needs to know ahead of time what type of LLM it’s working with, to load an LLM-specific “adapter” that does the work of generating summaries, among other things.
- Add support for LLM-specific messages in the universal `LLMContext`, to enable using LLM-specific functionality while still using the universal LLM context
Watchdog timers have been removed. They were introduced in 0.0.72 to help
diagnose pipeline freezes. Unfortunately, they proved ineffective since they
required developers to use Pipecat-specific queues, iterators, and events to
correctly reset the timer, which limited their usefulness and added friction.
This patch uses `wait_for2` package to implement `asyncio.wait_for()` for
Python < 3.12.
In Python 3.12, `asyncio.wait_for()` is implemented in terms of
`asyncio.timeout()` which fixed a bunch of issues. However, this was never
backported (because of the lack of `async.timeout()`) and there are still many
remainig issues, specially in Python 3.10, in `async.wait_for()`.
See https://github.com/python/cpython/pull/98518
We don't want to set `last_frame_time` on other frames like `HeartBeatFrame`, `LLMGeneratedTextFrame`, `InterruptionFrames` so that we can calculate `diff_time` and compare it against `vad_stop_secs` properly
We now force each inserted item in the priority queue to be a tuple and the
actual value to be last in the tuple. All the previous values in the tuple also
need to be numeric.
- Add timeout (default 5.0s) and retry_on_timeout parameters to constructor
- Implement timeout/retry logic in get_chat_completions using asyncio.wait_for
- Extract build_chat_completion_params() as public method for subclass customization
We need to increment the counters before the await otherwise we could go to a
different task that could add an item with the same counter.
Also, we need to handle non-frame items as well.
Skipping over 07b-interruptible-langchain.py for now, as it requires deeper changes involving `LLMUserResponseAggregator` and `LLMAssistantResponseAggregator`.
The same functionality can be achieved using either:
- `LLMMessagesUpdateFrame` with the desired messages, with `run_llm` set to `True`
- `OpenAILLMContextFrame` with a new context initialized with the desired messages
Introduces a new example script demonstrating how to use OpenAI's function calling capabilities within a Pipecat pipeline. The example integrates OpenAI STT, TTS, and LLM services, registers a weather function, and sets up a pipeline for real-time audio interaction over WebRTC.
System frames are now queued. Before, system frames could be generated from any
task and would not guarantee any order which was causing undesired
behavior. Also, it was possible to get into some rare recursion issues because
of the way system frames were executed (they were executed in-place, meaning
calling `push_frame()` would finish after the system frame traversed all the
pipeline). This makes system frames more deterministic.
Before CancelFrames didn't need to be waited for because system frames were
processed in-place and therefore calling push_frame() would finalize after it
traversed all the pipeline. Now, system frames are queued so we need to wait
until CancelFrame reaches the end of the pipeline.
- Restore TextInputMessage.realtimeInput structure for correct API format
- Remove invalid turnComplete message from _send_user_text method
- turnComplete is only valid for clientContent, not realtimeInput messages
- realtimeInput text completion is automatically inferred by the API
This fixes WebSocket 1007 errors caused by mixing realtimeInput and
clientContent message types in violation of the Gemini Live API contract.
Fixed a line length issue in tavus.py where the on_transcription_stopped callback was exceeding the maximum line length. Split the partial() call across multiple lines for better readability and compliance with project style guidelines.
As suggested in PR review, removed the _on_transcription_stopped and
_on_transcription_error method definitions. Now using the consistent
partial(self._on_handle_callback, ...) pattern for these callbacks,
matching how all other callbacks are handled.
This simplifies the code while maintaining the same functionality.
This was sending a 1007 because it was wrapping RealtimeInput in the json.
- Updated the `TextInputMessage` class to directly store text input as a string.
- Modified the `from_text` class method to create an instance using the new `text` attribute.
TavusTransport was broken in Pipecat 0.0.77 due to PR #2292 adding required
callbacks (on_transcription_stopped, on_transcription_error) to DailyCallbacks.
This fix adds placeholder implementations of these callbacks to TavusTransportClient,
allowing TavusTransport to initialize properly. These callbacks are not used by
Tavus (which handles avatar video, not transcription) but are required by the
DailyCallbacks validation.
Fixes initialization error:
- 2 validation errors for DailyCallbacks
- on_transcription_stopped: Field required
- on_transcription_error: Field required
Changes
Split out module attributes to make engine settings clearer
Removed internal audio buffer to use latest Speechmatics python SDK (0.4.0)
Use diarization for improved VAD in multi-speaker situations
Support custom dictionary / vocabulary with attributes
Deprecated attributes superseded by re-organised attributes
Diarization Enhancements
Focus on specific speakers (using speaker labels)
Ignore specific speakers (using speaker labels)
Separate transcription formats for active and inactive speakers
Support for known speakers
* Adds pipecat.runner.run - FastAPI-based development server with automatic bot discovery
* Adds new RunnerArguments types for different transports
* Adds new runner utils for creating transports and parsing data
* Adds new Daily and LiveKit utils for setup
- Introduced `InputTextRawFrame` to represent raw text input from users or programs.
- Updated `GeminiMultimodalLiveLLMService` to process `InputTextRawFrame` and send user text via the Gemini Live API's realtime input stream.
- Enhanced `_send_user_text` method documentation for clarity on its functionality and usage.
- Updated `RealtimeInput` to include an optional `text` parameter.
- Introduced `TextInputMessage` class for encapsulating text input data.
- Implemented `_send_user_text` method to send text input to the Gemini Live API.
- Enhanced message processing to support text input alongside media chunks.
- `{PR_NUMBER}.added.2.md`, `{PR_NUMBER}.added.3.md` - for additional entries of the same type
- `{PR_NUMBER}.changed.md` - for changes to existing functionality
- `{PR_NUMBER}.fixed.md` - for bug fixes
- `{PR_NUMBER}.deprecated.md` - for deprecations
- `{PR_NUMBER}.removed.md` - for removed features
- `{PR_NUMBER}.security.md` - for security fixes
- `{PR_NUMBER}.performance.md` - for performance improvements
- `{PR_NUMBER}.other.md` - for other changes
4. Each changelog file should at least contain a main single line starting with `- ` followed by a clear description of the change. No line wrapping.
5. If the change is complicated, changelog files can have indented lines after the main line with additional details or code samples.
6. Use ⚠️ emoji prefix for breaking changes.
7. **Write changes in user-facing terms first.** Lead with what users of the framework will notice: new APIs, changed behavior, new parameters, fixed bugs they might have hit, etc. Implementation details (internal refactoring, how something is wired up under the hood) can be included as secondary context after the user-facing description, but should never be the *only* content of a changelog entry when there is a user-visible effect.
**Good** (user-facing first, implementation detail as context):
```
- Turn completion instructions now persist correctly across full context updates when using `system_instruction`. Previously they were injected as a context system message, which caused warning spam and didn't survive context updates.
```
**Bad** (implementation detail only, no user-facing framing):
```
- Fixed turn completion instructions being injected as a context system message instead of using `system_instruction`.
```
Ask yourself: "If I'm a developer building on Pipecat, what would I notice changed?" Start there.
## Example
For PR #3519 with a new feature and a bug fix:
`changelog/3519.added.md`:
```
- Added `SomeNewFeature` for doing something useful.
```
`changelog/3519.fixed.md`:
```
- Fixed an issue where something was not working correctly in some user-visible scenario. The root cause was an internal implementation detail.
The **Code Cleanup Skill** reviews, refactors, and documents code changes in your current branch, ensuring alignment with **Pipecat's architecture, coding standards, and example patterns**.
It focuses on **readability, correctness, performance, and consistency**, while avoiding breaking changes.
---
## Skill Overview
This skill analyzes all changes introduced in your branch and performs the following actions:
1.**Analyze Branch Changes**
- Review uncommitted changes and outgoing commits
2.**Refactor for Readability**
- Improve clarity, naming, structure, and modern Python usage
**Agent assumptions (applies to all agents and subagents):**
- All tools are functional and will work without error. Do not test tools or make exploratory calls. Make sure this is clear to every subagent that is launched.
- Only call a tool if it is required to complete the task. Every tool call should have a clear purpose.
To do this, follow these steps precisely:
1. Launch a haiku agent to check if any of the following are true:
- The pull request is closed
- The pull request is a draft
- The pull request does not need code review (e.g. automated PR, trivial change that is obviously correct)
- Claude has already commented on this PR (check `gh pr view <PR> --comments` for comments left by claude)
If any condition is true, stop and do not proceed.
Note: Still review Claude generated PR's.
2. Launch a haiku agent to return a list of file paths (not their contents) for all relevant CLAUDE.md files including:
- The root CLAUDE.md file, if it exists
- Any CLAUDE.md files in directories containing files modified by the pull request
3. Launch a sonnet agent to view the pull request and return a summary of the changes
4. Launch 4 agents in parallel to independently review the changes. Each agent should return the list of issues, where each issue includes a description and the reason it was flagged (e.g. "CLAUDE.md adherence", "bug"). The agents should do the following:
Agents 1 + 2: CLAUDE.md compliance sonnet agents
Audit changes for CLAUDE.md compliance in parallel. Note: When evaluating CLAUDE.md compliance for a file, you should only consider CLAUDE.md files that share a file path with the file or parents.
Agent 3: Opus bug agent (parallel subagent with agent 4)
Scan for obvious bugs. Focus only on the diff itself without reading extra context. Flag only significant bugs; ignore nitpicks and likely false positives. Do not flag issues that you cannot validate without looking at context outside of the git diff.
Agent 4: Opus bug agent (parallel subagent with agent 3)
Look for problems that exist in the introduced code. This could be security issues, incorrect logic, etc. Only look for issues that fall within the changed code.
**CRITICAL: We only want HIGH SIGNAL issues.** Flag issues where:
- The code will fail to compile or parse (syntax errors, type errors, missing imports, unresolved references)
- The code will definitely produce wrong results regardless of inputs (clear logic errors)
- Clear, unambiguous CLAUDE.md violations where you can quote the exact rule being broken
Do NOT flag:
- Code style or quality concerns
- Potential issues that depend on specific inputs or state
- Subjective suggestions or improvements
If you are not certain an issue is real, do not flag it. False positives erode trust and waste reviewer time.
In addition to the above, each subagent should be told the PR title and description. This will help provide context regarding the author's intent.
5. For each issue found in the previous step by agents 3 and 4, launch parallel subagents to validate the issue. These subagents should get the PR title and description along with a description of the issue. The agent's job is to review the issue to validate that the stated issue is truly an issue with high confidence. For example, if an issue such as "variable is not defined" was flagged, the subagent's job would be to validate that is actually true in the code. Another example would be CLAUDE.md issues. The agent should validate that the CLAUDE.md rule that was violated is scoped for this file and is actually violated. Use Opus subagents for bugs and logic issues, and sonnet agents for CLAUDE.md violations.
6. Filter out any issues that were not validated in step 5. This step will give us our list of high signal issues for our review.
7. If issues were found, skip to step 8 to post comments.
If NO issues were found, post a summary comment using `gh pr comment` (if `--comment` argument is provided):
"No issues found. Checked for bugs and CLAUDE.md compliance."
8. Create a list of all comments that you plan on leaving. This is only for you to make sure you are comfortable with the comments. Do not post this list anywhere.
9. Post inline comments for each issue using `gh pr review` with inline comments. For each comment:
- Provide a brief description of the issue
- For small, self-contained fixes, include a committable suggestion block
- For larger fixes (6+ lines, structural changes, or changes spanning multiple locations), describe the issue and suggested fix without a suggestion block
- Never post a committable suggestion UNLESS committing the suggestion fixes the issue entirely. If follow up steps are required, do not leave a committable suggestion.
**IMPORTANT: Only post ONE comment per unique issue. Do not post duplicate comments.**
Use this list when evaluating issues in Steps 4 and 5 (these are false positives, do NOT flag):
- Pre-existing issues
- Something that appears to be a bug but is actually correct
- Pedantic nitpicks that a senior engineer would not flag
- Issues that a linter will catch (do not run the linter to verify)
- General code quality concerns (e.g., lack of test coverage, general security issues) unless explicitly required in CLAUDE.md
- Issues mentioned in CLAUDE.md but explicitly silenced in the code (e.g., via a lint ignore comment)
Notes:
- Use gh CLI to interact with GitHub (e.g., fetch pull requests, create comments). Do not use web fetch.
- Create a todo list before starting.
- You must cite and link each issue in inline comments (e.g., if referring to a CLAUDE.md, include a link to it).
- If no issues are found, post a comment with the following format:
---
## Code review
No issues found. Checked for bugs and CLAUDE.md compliance.
---
- When linking to code in inline comments, follow the following format precisely, otherwise the Markdown preview won't render correctly: `https://github.com/OWNER/REPO/blob/FULL_SHA/path/to/file.py#L10-L15`
- Requires full git sha
- You must provide the full sha. Commands like `https://github.com/owner/repo/blob/$(git rev-parse HEAD)/foo/bar` will not work, since your comment will be directly rendered in Markdown.
- Repo name must match the repo you're code reviewing
- # sign after the file name
- Line range format is L[start]-L[end]
- Provide at least 1 line of context before and after, centered on the line you are commenting about (eg. if you are commenting about lines 5-6, you should link to `L4-7`)
description: Document a Python module and its classes using Google style
---
Document a Python module or class using Google-style docstrings following project conventions. The argument can be a class name or a module path.
## Instructions
1. Determine what to document based on the argument:
**If a module path is provided** (e.g. `src/pipecat/audio/vad/vad_analyzer.py`):
- Use that file directly
**If a class name is provided** (e.g. `VADAnalyzer`):
- Search for `class ClassName` in `src/pipecat/`
- If multiple files contain that class name, list all matches with their file paths, ask the user which one they want to document, and wait for confirmation
2. Once the file is identified, read the module to understand its structure:
- Identify all classes, functions, and important type aliases
- **Already documented code** - If a class, method, or function already has a complete docstring that follows the project style, do not modify it. A docstring is complete if it has:
- A one-line summary
- Args section (if it has parameters)
- Returns section (if it returns something meaningful)
- Only add or improve documentation where it is missing or incomplete
## Module Docstring Format
```python
"""[One-line description of module purpose].
[Optional: Longer explanation of functionality, key classes, or use cases.]
"""
```
Example:
```python
"""Neuphonic text-to-speech service implementations.
This module provides WebSocket and HTTP-based integrations with Neuphonic's
text-to-speech API for real-time audio synthesis.
"""
```
## Class Docstring Format
```python
classClassName:
"""One-line summary describing what the class does.
[Longer description explaining purpose, behavior, and key features.
Use action-oriented language.]
[Optional: Event handlers, usage notes, or important caveats.]
"""
```
Example:
```python
classFrameProcessor(BaseObject):
"""Base class for all frame processors in the pipeline.
Frame processors are the building blocks of Pipecat pipelines, they can be
linked to form complex processing pipelines. They receive frames, process
them, and pass them to the next or previous processor in the chain.
Event handlers available:
- on_before_process_frame: Called before a frame is processed
- on_after_process_frame: Called after a frame is processed
Note: When listing event handlers, do NOT use backticks. Include an `Example::` section (with double colon for Sphinx) showing the decorator pattern and function signature for each event.
4. Generate or update the PR description with these sections:
## PR Description Format
### Summary (always include)
Brief bullet points describing what changed and why. Focus on the *purpose* and *impact*, not implementation details.
```markdown
## Summary
- Added X to enable Y
- Fixed bug where Z would happen
- Refactored W for better maintainability
```
### Breaking Changes (include only if applicable)
Document any changes that affect existing users or APIs.
```markdown
## Breaking Changes
-`ClassName.method()` now requires a `param` argument
- Removed deprecated `old_function()` - use `new_function()` instead
```
### Testing (include when non-obvious)
How to verify the changes work. Skip for trivial changes.
```markdown
## Testing
- Run `uv run pytest tests/test_feature.py` to verify the fix
- Example usage: `uv run examples/new_feature.py`
```
### Fixes (include if issues are provided or found in commits)
List issues this PR fixes. GitHub will automatically close these issues when the PR is merged.
```markdown
## Fixes
- Fixes #123
- Fixes #456
```
Note: Use "Fixes #X" format (not "Closes" or "Resolves") for consistency. Each issue should be on its own line with "Fixes" to ensure GitHub auto-closes them.
## Guidelines
- **Be concise** - Reviewers should understand the PR in 30 seconds
- **Focus on why** - The diff shows *what* changed, explain *why*
- **Skip empty sections** - Only include sections that have content
- **Use bullet points** - Easier to scan than paragraphs
- **Don't duplicate the diff** - Avoid listing every file or line changed
## Example Output
```markdown
## Summary
- Added `/docstring` skill for documenting Python modules with Google-style docstrings
- Skill finds classes by name and handles conflicts when multiple matches exist
- Skips already-documented code to avoid unnecessary changes
description: Update documentation pages to match source code changes on the current branch
---
Update documentation pages to reflect source code changes on the current branch. Analyzes the diff against main, maps changed source files to their corresponding doc pages, and makes targeted edits.
## Arguments
```
/update-docs [DOCS_PATH]
```
-`DOCS_PATH` (optional): Path to the docs repository root. If not provided, ask the user.
Examples:
-`/update-docs /Users/me/src/docs`
-`/update-docs`
## Instructions
### Step 1: Resolve docs path
If `DOCS_PATH` was provided as an argument, use it. Otherwise, ask the user for the path to their docs repository.
Verify the path exists and contains `server/services/` subdirectory.
### Step 2: Create docs branch
Get the current pipecat branch name:
```bash
git rev-parse --abbrev-ref HEAD
```
In the docs repo, create a new branch off main with a matching name:
```bash
cd DOCS_PATH && git checkout main && git pull && git checkout -b {branch-name}-docs
```
For example, if the pipecat branch is `feat/new-service`, the docs branch becomes `feat/new-service-docs`.
All doc edits in subsequent steps are made on this branch.
Ignore `__init__.py`, `__pycache__`, test files, and files that only contain type re-exports.
### Step 4: Map source files to doc pages
For each changed source file, find the corresponding doc page. Read the mapping file at `.claude/skills/update-docs/SOURCE_DOC_MAPPING.md` and apply its tiered lookup: tier 1 (known exceptions) → tier 2 (pattern matching) → tier 3 (search fallback). **First match wins.**
### Step 5: Analyze each source-doc pair
For each mapped pair:
1.**Read the full source file** to understand current state
2.**Read the diff** for that file: `git diff main..HEAD -- <source_file>`
3.**Read the current doc page** in full
Identify what changed by comparing source to docs:
- **Constructor parameters**: Compare `__init__` signature to the Configuration section's `<ParamField>` entries
- **InputParams fields**: Compare `InputParams(BaseModel)` class fields to the InputParams table
- **Event handlers**: Compare `_register_event_handler` calls and event handler definitions to Event Handlers section
- **Behavioral changes**: Check if Notes section needs updating
### Step 6: Make targeted edits
For each doc page that needs updates, edit **only the sections that need changes**. Preserve all other content exactly as-is.
#### Rules
- **Never remove content** unless the corresponding source code was removed
- **Never rewrite sections** that are already accurate
- **Match existing formatting** — if the page uses `<ParamField>` tags, use them; if it uses tables, use tables
- **Keep descriptions concise** — match the tone and length of surrounding content
- **Preserve CardGroup, links, and examples** unless they reference removed functionality
- **Don't touch frontmatter** unless the class was renamed
#### Section-specific guidance
**Configuration** (constructor params):
- Use `<ParamField path="name" type="type" default="value">` format if the page already uses it
- Add new params in logical order (required first, then optional)
- Remove params that no longer exist in source
- Update types/defaults that changed
**InputParams** (runtime settings):
- Use markdown table format: `| Parameter | Type | Default | Description |`
- Match the field names and types from the `InputParams(BaseModel)` class
- Include the default values from the source
**Usage** (code examples):
- Update import paths, class names, and parameter names
- Only modify examples if they would break or be misleading with the new API
- Don't rewrite working examples just to add new optional params
**Notes**:
- Add notes for new behavioral gotchas or breaking changes
- Remove notes about limitations that were fixed
- Keep existing notes that are still accurate
**Event Handlers**:
- Update the event table and example code
- Add new events, remove deleted ones
- Update handler signatures if they changed
**Overview / Key Features / Prerequisites**:
- Only update if the PR fundamentally changes what the service does (new capability, removed capability, renamed class)
- Most PRs will NOT need changes to these sections
### Step 7: Update guides
Guides at `DOCS_PATH/guides/` reference specific class names, parameters, imports, and code patterns. After completing reference doc edits, check if any guides need updates too.
For each changed source file, collect the class names, renamed parameters, and changed imports from the diff. Search the guides directory:
After processing all mapped pairs, check for two kinds of gaps:
**Missing pages**: Source files that had no doc page mapping (neither tier 1, 2, nor 3) and are not marked as "(skip)". For each, tell the user:
- The source file path
- The main class(es) it defines
- Whether a new doc page should be created
**Missing sections**: Mapped doc pages that are missing standard sections compared to the source. For example, a transport page with no Configuration section, or a service page with no InputParams table when the source defines `InputParams(BaseModel)`. Flag these and offer to add the missing sections.
If the user wants a new page, do all three of the following:
#### 8a: Create the doc page
Create the new `.mdx` file using this template structure:
```
---
title: "Service Name"
description: "Brief description"
---
## Overview
[Description from class docstring or source analysis]
<CardGroup cols={2}>
[Cards for API reference and examples if available]
</CardGroup>
## Installation
```bash
pip install "pipecat-ai[package-name]"
```
## Prerequisites
[Environment variables and account setup]
## Configuration
[ParamField entries for constructor params]
## InputParams
[Table of InputParams fields, if the service has them]
## Usage
### Basic Setup
```python
[Minimalworkingexample]
```
## Notes
[Important caveats]
## Event Handlers
[Event table and example code]
```
#### 8b: Add to docs.json
Add the new page path to `DOCS_PATH/docs.json` in the correct navigation group. The path format is `server/services/{category}/{provider}` (without the `.mdx` extension).
Find the matching group in the navigation structure:
- **STT** → `"group": "Speech-to-Text"` under Services
- **TTS** → `"group": "Text-to-Speech"` under Services
- **LLM** → `"group": "LLM"` under Services
- **S2S** → `"group": "Speech-to-Speech"` under Services
- **Transport** → `"group": "Transport"` under Services
- **Serializer** → `"group": "Serializers"` under Services
- **Image generation** → `"group": "Image Generation"` under Services
- **Video** → `"group": "Video"` under Services
- **Memory** → `"group": "Memory"` under Services
- **Vision** → `"group": "Vision"` under Services
- **Analytics** → `"group": "Analytics & Monitoring"` under Services
Insert the new entry **alphabetically** within the group's `pages` array. For example, adding a new STT service "foo":
```json
{
"group": "Speech-to-Text",
"pages": [
"server/services/stt/assemblyai",
"server/services/stt/aws",
...
"server/services/stt/foo",
...
]
}
```
#### 8c: Add to supported-services.mdx
Add a new row to the correct category table in `DOCS_PATH/server/services/supported-services.mdx`.
- **DisplayName**: Use the service's human-readable name (e.g., "ElevenLabs", "AWS Polly", "Google Gemini")
- **package**: Look at the service's `pyproject.toml` extras or the import pattern in the source code. For example, if the service is in `src/pipecat/services/foo/`, the package is typically `foo`.
- If no pip dependencies are required, use `No dependencies required` instead.
Insert the new row **alphabetically** within the table. Match the column alignment of the existing rows.
- `guides/learn/speech-to-text.mdx` — Updated code example (renamed `old_param` → `new_param`)
### New service pages
- `server/services/tts/newprovider.mdx` — Created page, added to docs.json (Text-to-Speech), added to supported-services.mdx
### Unmapped source files
- `src/pipecat/services/newprovider/tts.py` — NewProviderTTSService (no doc page exists)
### Skipped files
- `src/pipecat/services/ai_service.py` — internal base class
```
## Guidelines
- **Be conservative** — only change what the diff warrants. Don't "improve" docs beyond what changed in source.
- **Read before editing** — always read the full doc page before making changes so you understand the existing structure.
- **Preserve voice** — match the writing style of the existing doc page, don't impose a different tone.
- **One PR at a time** — this skill operates on the current branch's diff against main. Don't look at other branches.
- **Parallel analysis** — when multiple source files map to different doc pages, analyze and edit them in parallel for efficiency.
- **Shared source files** — files like `services/google/google.py` are shared bases. Check which services import from them and update all affected doc pages.
## Checklist
Before finishing, verify:
- [ ] All changed source files were checked against the mapping table
- [ ] Each doc page edit matches the actual source code change (not guessed)
- [ ] No content was removed unless the corresponding source was removed
- [ ] New parameters have accurate types and defaults from source
- [ ] Formatting matches the existing page style
- [ ] Guides referencing changed APIs were checked and updated
- [ ] New service pages were added to `docs.json` in the correct group, alphabetically
- [ ] New service pages were added to `supported-services.mdx` in the correct table, alphabetically
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
Pipecat is an open-source Python framework for building real-time voice and multimodal conversational AI agents. It orchestrates audio/video, AI services, transports, and conversation pipelines using a frame-based architecture.
## Common Commands
```bash
# Setup development environment
uv sync --group dev --all-extras --no-extra gstreamer
# Install pre-commit hooks
uv run pre-commit install
# Run all tests
uv run pytest
# Run a single test file
uv run pytest tests/test_name.py
# Run a specific test
uv run pytest tests/test_name.py::test_function_name
# Preview changelog
uv run towncrier build --draft --version Unreleased
All data flows as **Frame** objects through a pipeline of **FrameProcessors**:
```
[Processor1] → [Processor2] → ... → [ProcessorN]
```
**Key components:**
- **Frames** (`src/pipecat/frames/frames.py`): Data units (audio, text, video) and control signals. Flow DOWNSTREAM (input→output) or UPSTREAM (acknowledgments/errors).
- **FrameProcessor** (`src/pipecat/processors/frame_processor.py`): Base processing unit. Each processor receives frames, processes them, and pushes results downstream.
- **ParallelPipeline** (`src/pipecat/pipeline/parallel_pipeline.py`): Runs multiple pipelines in parallel.
- **Transports** (`src/pipecat/transports/`): Transports are frame processors used for external I/O layer (Daily WebRTC, LiveKit WebRTC, WebSocket, Local). Abstract interface via `BaseTransport`, `BaseInputTransport` and `BaseOutputTransport`.
- **Pipeline Task (`src/pipecat/pipeline/task.py`)**: Runs and manages a pipeline. Pipeline tasks send the first frame, `StartFrame`, to the pipeline in order for processors to know they can start processing and pushing frames. Pipeline tasks internally create a pipeline with two additional processors, a source processor before the user-defined pipeline and a sink processor at the end. Those are used for multiple things: error handling, pipeline task level events, heartbeat monitoring, etc.
- **Pipeline Runner (`src/pipecat/pipeline/runner.py`)**: High-level entry point for executing pipeline tasks. Handles signal management (SIGINT/SIGTERM) for graceful shutdown and optional garbage collection. Run a single pipeline task with `await runner.run(task)` or multiple concurrently with `await asyncio.gather(runner.run(task1), runner.run(task2))`.
- **Services** (`src/pipecat/services/`): 60+ AI provider integrations (STT, TTS, LLM, etc.). Extend base classes: `AIService`, `LLMService`, `STTService`, `TTSService`, `VisionService`.
- **Serializers** (`src/pipecat/serializers/`): Convert frames to/from wire formats for WebSocket transports. `FrameSerializer` base class defines `serialize()` and `deserialize()`. Telephony serializers (Twilio, Plivo, Vonage, Telnyx, Exotel, Genesys) handle provider-specific protocols and audio encoding (e.g., μ-law).
- **RTVI** (`src/pipecat/processors/frameworks/rtvi.py`): Real-Time Voice Interface protocol bridging clients and the pipeline. `RTVIProcessor` handles incoming client messages (text input, audio, function call results). `RTVIObserver` converts pipeline frames to outgoing messages: user/bot speaking events, transcriptions, LLM/TTS lifecycle, function calls, metrics, and audio levels.
- **Observers** (`src/pipecat/observers/`): Monitor frame flow without modifying the pipeline. Passed to `PipelineTask` via the `observers` parameter. Implement `on_process_frame()` and `on_push_frame()` callbacks.
### Important Patterns
- **Context Aggregation**: `LLMContext` accumulates messages for LLM calls; `UserResponse` aggregates user input
- **Turn Management**: Turn management is done through `LLMUserAggregator` and
`LLMAssistantAggregator`, created with `LLMContextAggregatorPair`
- **User turn strategies**: Detection of when the user starts and stops speaking is done via user turn start/stop strategies. They push `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` respectively.
- **Interruptions**: Interruptions are usually triggered by a user turn start strategy (e.g. `VADUserTurnStartStrategy`) but they can be triggered by other processors as well, in which case the user turn start strategies don't need to. An `InterruptionFrame` carries an optional `asyncio.Event` that is set when the frame reaches the pipeline sink. If a processor stops an `InterruptionFrame` from propagating downstream (i.e., doesn't push it), it **must** call `frame.complete()` to avoid stalling `push_interruption_task_frame_and_wait()` callers.
- **Uninterruptible Frames**: These are frames that will not be removed from internal queues even if there's an interruption. For example, `EndFrame` and `StopFrame`.
- **Events**: Most classes in Pipecat have `BaseObject` as the very base class. `BaseObject` has support for events. Events can run in the background in an async task (default) or synchronously (`sync=True`) if we want immediate action. Synchronous event handlers need to execute fast.
- **Async Task Management**: Always use `self.create_task(coroutine, name)` instead of raw `asyncio.create_task()`. The `TaskManager` automatically tracks tasks and cleans them up on processor shutdown. Use `await self.cancel_task(task, timeout)` for cancellation.
- **Error Handling**: Use `await self.push_error(msg, exception, fatal)` to push errors upstream. Services should use `fatal=False` (the default) so application code can handle errors and take action (e.g. switch to another service).
- **Type hints**: Required for complex async code.
- **Dataclass vs Pydantic**: Use `@dataclass` for frames and internal pipeline data (high-frequency, no validation needed). Use Pydantic `BaseModel` for configuration, parameters, metrics, and external API data (benefits from validation and serialization). Specifically:
-`@dataclass`: Frame types, context aggregator pairs, internal data containers
-`BaseModel`: Service `InputParams`, transport/VAD/turn params, metrics data, API request/response models, serializer params
### Docstring Example
```python
classMyService(LLMService):
"""Description of what the service does.
More detailed description.
Event handlers available:
- on_connected: Called when we are connected
Example::
@service.event_handler("on_connected")
async def on_connected(service, frame):
...
"""
def__init__(self,param1:str,**kwargs):
"""Initialize the service.
Args:
param1: Description of param1.
**kwargs: Additional arguments passed to parent.
"""
super().__init__(**kwargs)
```
## Service Implementation
When adding a new service:
1. Extend the appropriate base class (`STTService`, `TTSService`, `LLMService`, etc.)
2. Implement required abstract methods
3. Handle necessary frames
4. By default, all frames should be pushed in the direction they came
5. Push `ErrorFrame` on failures
6. Add metrics tracking via `MetricsData` if relevant
7. Follow the pattern of existing services in `src/pipecat/services/`
## Testing
Test utilities live in `src/pipecat/tests/utils.py`. Use `run_test()` to send frames through a pipeline and assert expected output frames in each direction. Use `SleepFrame(sleep=N)` to add delays between frames.
Pipecat welcomes community-maintained integrations! As our ecosystem grows, we've established a process for any developer to create and maintain their own service integrations while ensuring discoverability for the Pipecat community.
## Overview
**What we support:** Community-maintained integrations that live in separate repositories and are maintained by their authors.
**What we don't do:** The Pipecat team does not code review, test, or maintain community integrations. We provide guidance and list approved integrations for discoverability.
**Why this approach:** This allows the community to move quickly while keeping the Pipecat core team focused on maintaining the framework itself.
## Submitting your Integration
To be listed as an official community integration, follow these steps:
### Step 1: Build Your Integration
Create your integration following the patterns and examples shown in the "Integration Patterns and Examples" section below.
### Step 2: Set Up Your Repository
Your repository must contain these components:
- **Source code** - Complete implementation following Pipecat patterns
- **Foundational example** - Single file example showing basic usage (see [Pipecat examples](https://github.com/pipecat-ai/pipecat/tree/main/examples))
- **README.md** - Must include:
- Introduction and explanation of your integration
- Installation instructions
- Usage instructions with Pipecat Pipeline
- How to run your example
- Pipecat version compatibility (e.g., "Tested with Pipecat v0.0.86")
- Company attribution: If you work for the company providing the service, please mention this in your README. This helps build confidence that the integration will be actively maintained.
- **LICENSE** - Permissive license (BSD-2 like Pipecat, or equivalent open source terms)
- **Code documentation** - Source code with docstrings (we recommend following [Pipecat's docstring conventions](https://github.com/pipecat-ai/pipecat/blob/main/CONTRIBUTING.md#docstring-conventions))
- **Changelog** - Maintain a changelog for version updates
### Step 3: Join Discord
Join our Discord: https://discord.gg/pipecat
### Step 4: Submit for Listing
Submit a pull request to add your integration to our [Community Integrations documentation page](https://docs.pipecat.ai/server/services/community-integrations).
**To submit:**
1. Fork the [Pipecat docs repository](https://github.com/pipecat-ai/docs)
2. Edit the file `server/services/community-integrations.mdx`
3. Add your integration to the appropriate service category table with:
- Service name
- Link to your repository
- Maintainer GitHub username(s)
4. Include a link to your demo video (approx 30-60 seconds) in your PR description showing:
- Core functionality of your integration
- Handling of an interruption (if applicable to service type)
5. Submit your pull request
Once your PR is submitted, post in the `#community-integrations` Discord channel to let us know.
## Integration Patterns and Examples
### STT (Speech-to-Text) Services
#### Websocket-based Services
**Base class:**`WebsocketSTTService`
**Use for:** Services where you manage the websocket connection directly. Combines `STTService` with `WebsocketService` for automatic reconnection and keepalive support.
- **`_process_context(self, context: LLMContext)`** — The main method that processes an LLM context and generates a response. Each LLM service overrides `process_frame` to extract context from `LLMContextFrame` and calls `_process_context`.
- **`adapter_class`** — Class attribute pointing to a `BaseLLMAdapter` subclass. Defaults to `OpenAILLMAdapter`. Non-OpenAI services must implement their own adapter (see `src/pipecat/adapters/base_llm_adapter.py`) with methods:
-`get_llm_invocation_params(context)` — Extract provider-specific params from universal context
-`to_provider_tools_format(tools_schema)` — Convert standard tools to provider format
-`get_messages_for_logging(context)` — Format messages for logging
- **Frame sequence:** Output must follow this frame sequence pattern:
-`LLMFullResponseStartFrame` — Signals the start of an LLM response
-`LLMTextFrame` — Contains LLM content, typically streamed as tokens
-`LLMFullResponseEndFrame` — Signals the end of an LLM response
- **Thought frames (reasoning models):** If the model supports extended thinking / chain-of-thought, emit thought frames alongside the response:
-`LLMThoughtStartFrame` — Signals the start of a thought
-`LLMThoughtTextFrame` — Contains thought content, streamed as tokens
-`LLMThoughtEndFrame` — Signals the end of a thought
- **Context aggregation** is handled by the framework via `LLMContext` + `LLMContextAggregatorPair`. The LLM service just processes context it receives — no need to implement aggregators.
### TTS (Text-to-Speech) Services
#### WebsocketTTSService
**Use for:** Websocket-based streaming services (with or without word timestamps)
- For websocket services, use asyncio WebSocket implementation
- Handle idle service timeouts with keepalives
- TTS services push both audio (`TTSAudioRawFrame`) and text (`TTSTextFrame`) frames
### Telephony Serializers
Pipecat supports telephony provider integration using websocket connections to exchange MediaStreams. These services use a FrameSerializer to serialize and deserialize inputs from the FastAPIWebsocketTransport.
- Must implement `run_vision` method that takes a `UserImageRawFrame` and returns an `AsyncGenerator[Frame, None]`
- The method processes the image frame and yields frames with analysis results
- Must yield the frame sequence: `VisionFullResponseStartFrame`, `VisionTextFrame`, `VisionFullResponseEndFrame`
## Implementation Guidelines
### Naming Conventions
#### Package and Repository Naming
Use the `pipecat-{vendor}` naming convention for your PyPI package and repository:
-`pipecat-{vendor}` — for single-service integrations (e.g., `pipecat-deepdub`)
-`pipecat-{vendor}-{type}` — when a vendor offers multiple service types (e.g., `pipecat-upliftai-stt`, `pipecat-upliftai-tts`)
This convention makes community packages easily discoverable via PyPI search and clearly identifies them as part of the Pipecat ecosystem.
#### Class Naming
- **STT:** `VendorSTTService`
- **LLM:** `VendorLLMService`
- **TTS:**
- Websocket: `VendorTTSService`
- HTTP: `VendorHttpTTSService`
- **Image:** `VendorImageGenService`
- **Vision:** `VendorVisionService`
- **Telephony:** `VendorFrameSerializer`
### Metrics Support
Enable metrics in your service:
```python
defcan_generate_metrics(self)->bool:
"""Check if this service can generate processing metrics.
Returns:
True, as this service supports metrics.
"""
returnTrue
```
### Service Settings
Every AI service (STT, LLM, TTS, image generation, etc.) exposes a **Settings dataclass** that serves two roles:
1.**Store mode** — the service's `self._settings` holds the current value of every runtime-updatable field.
2.**Delta mode** — an update frame (e.g. `TTSUpdateSettingsFrame`) specifies only the fields that should change; unspecified fields remain `NOT_GIVEN`.
#### Defining your Settings class
Extend `STTSettings`, `TTSSettings`, `LLMSettings`, or `ImageGenSettings` (or, if your service directly subclasses `AIService`, `ServiceSettings`). The base classes already provide common fields (e.g. `model`, `voice`, `language`). You only need to add **service-specific knobs that should be runtime-updatable**:
| Anything users may want to change mid-session | Audio encoding, sample format |
| | Connection parameters (timeouts, retries) |
The rule of thumb: if a caller might send an update frame to change it at runtime, it belongs in Settings. Everything else is init-only config stored as `self._xxx`.
#### Wiring settings into `__init__`
Accept an **optional**`settings` parameter. Build a `default_settings` object with all fields set to real values, then merge any caller overrides with `apply_update`.
Add a `Settings`**class attribute** that points to your settings dataclass. This lets callers access the settings class through the service itself (e.g. `MyTTSService.Settings(...)`) without a separate import:
```python
fromtypingimportOptional
classMyTTSService(TTSService):
Settings=MyTTSSettings
_settings:Settings
def__init__(
self,
*,
api_key:str,
settings:Optional[Settings]=None,
**kwargs,
):
# 1. Defaults — every field has a real value (store mode).
default_settings=self.Settings(
model="my-model-v1",
voice="default-voice",
language="en",
speaking_rate=1.0,
)
# 2. Merge caller overrides (only given fields win).
ifsettingsisnotNone:
default_settings.apply_update(settings)
# 3. Pass the fully-populated settings to the base class.
AI services support runtime configuration changes via `*UpdateSettingsFrame`s (e.g. `STTUpdateSettingsFrame`, `TTSUpdateSettingsFrame`, `LLMUpdateSettingsFrame`).
To react to runtime setting changes, override `_update_settings`. The base implementation applies the delta to `self._settings` and returns a `dict` mapping each changed field name to its **pre-update** value. Your override should call `super()` first, then act on the changed fields. A common implementation might look like:
"""Apply a settings update, reconfiguring the connection if needed."""
changed=awaitsuper()._update_settings(update)
ifnotchanged:
returnchanged
awaitself._disconnect()
awaitself._connect()
returnchanged
```
The dict keys work like a set for membership tests (`"language" in changed`) and truthiness (`if changed`). Use `changed.keys() - {"language"}` for set difference, or `changed["language"]` to inspect the previous value of a field.
Note that, in this example, the service requires a reconnect to apply the new language. Consider, for each setting, whether your service requires reconnection or can apply changes in-place.
If your service can't yet apply certain settings at runtime, call `self._warn_unhandled_updated_settings(changed)` with any unhandled field names so users get a clear log message:
Sample rates are set via PipelineParams and passed to each frame processor at initialization. The pattern is to _not_ set the sample rate value in the constructor of a given service. Instead, use the `start()` method to initialize sample rates from the frame:
Note that `self.sample_rate` is a `@property` set in the TTSService base class, which provides access to the private sample rate value obtained from the StartFrame.
### Tracing Decorators
Use Pipecat's tracing decorators:
- **STT:** `@traced_stt` - decorate `_handle_transcription(self, transcript, is_final, language)` (the standard method name convention)
- **LLM:** `@traced_llm` - decorate the `_process_context()` method
- **TTS:** `@traced_tts` - decorate the `run_tts()` method
## Best Practices
### Packaging and Distribution
- Name your package `pipecat-{vendor}` (see [Naming Conventions](#naming-conventions))
- Use [uv](https://docs.astral.sh/uv/) for packaging (encouraged)
- Publish to PyPI for easier installation
- Follow semantic versioning principles
- Maintain a changelog
### HTTP Communication
For REST-based communication, use aiohttp. Pipecat includes this as a required dependency, so using it prevents adding an additional dependency to your integration.
### Error Handling
- Wrap API calls in appropriate try/catch blocks
- Handle rate limits and network failures gracefully
- Provide meaningful error messages
- When errors occur, raise exceptions AND push errors to notify the pipeline:
- Your foundational example serves as a valuable integration-level test
- Unit tests are nice to have. As the Pipecat teams provides better guidance, we will encourage unit testing more
## Disclaimer
Community integrations are community-maintained and not officially supported by the Pipecat team. Users should evaluate these integrations independently. The Pipecat team reserves the right to remove listings that become unmaintained or problematic.
## Staying Up to Date
Pipecat evolves rapidly to support the latest AI technologies and patterns. While we strive to minimize breaking changes, they do occur as the framework matures.
**We strongly recommend:**
- Join our Discord at https://discord.gg/pipecat and monitor the `#announcements` channel for release notifications
We encourage community-maintained integrations! Please see our [Community Integration Guide](COMMUNITY_INTEGRATIONS.md) for the process and requirements.
**Want to contribute to Pipecat core?**
We welcome contributions of all kinds! Your help is appreciated. Follow these steps to get involved:
1.**Fork this repository**: Start by forking the Pipecat Documentation repository to your GitHub account.
@@ -13,24 +17,137 @@ We welcome contributions of all kinds! Your help is appreciated. Follow these st
git checkout -b your-branch-name
```
4. **Make your changes**: Edit or add files as necessary.
5. **Test your changes**: Ensure that your changes look correct and follow the style set in the codebase.
6. **Commit your changes**: Once you're satisfied with your changes, commit them with a meaningful message.
5. **Add a changelog entry**: Create a changelog fragment file (see [Changelog Entries](#changelog-entries) below).
6. **Test your changes**: Ensure that your changes look correct and follow the style set in the codebase.
7. **Commit your changes**: Once you're satisfied with your changes, commit them with a meaningful message.
```bash
git commit -m "Description of your changes"
```
7. **Push your changes**: Push your branch to your forked repository.
8. **Push your changes**: Push your branch to your forked repository.
```bash
git push origin your-branch-name
```
8. **Submit a Pull Request (PR)**: Open a PR from your forked repository to the main branch of this repo.
9. **Submit a Pull Request (PR)**: Open a PR from your forked repository to the main branch of this repo.
> Important: Describe the changes you've made clearly!
Our maintainers will review your PR, and once everything is good, your contributions will be merged!
## Changelog Entries
Every pull request that makes a user-facing change should include a changelog entry. We use a changelog fragment system to avoid merge conflicts.
### Creating a Changelog Fragment
1. Create a new file in the `changelog/` directory with this naming pattern:
```
<PR_number>.<type>.md
```
2. Choose the appropriate type:
- `added.md` - New features
- `changed.md` - Changes in existing functionality
- `deprecated.md` - Soon-to-be removed features
- `removed.md` - Removed features
- `fixed.md` - Bug fixes
- `performance.md` - Performance improvements
- `security.md` - Security fixes
- `other.md` - Other changes (documentation, dependencies, etc.)
3. Write your changelog entry as a Markdown bullet point. Include the `-` at the start:
**Example files:**
`changelog/1234.added.md`:
```markdown
- Added support for Anthropic Claude 3.5 Sonnet with improved streaming performance.
```
`changelog/5678.fixed.md`:
```markdown
- Fixed an issue where audio frames were dropped during high-load scenarios.
```
**For entries with nested bullets:**
`changelog/1234.changed.md`:
```markdown
- Updated service configuration:
- Changed default timeout to 30 seconds
- Added retry logic for failed connections
```
### Multiple Changes in One PR
**Different types of changes:** Create separate fragment files for each type:
```
changelog/1234.added.md
changelog/1234.fixed.md
```
**Multiple changes of the same type:** Create numbered fragment files:
```
changelog/1234.changed.md
changelog/1234.changed.2.md
```
**Related changes:** Use nested bullets in a single fragment:
```markdown
- Updated service configuration:
- Changed default timeout to 30 seconds
- Added retry logic for failed connections
```
**Rule of thumb:** One logical change per fragment file. If changes are unrelated, use separate files.
### Preview Your Changes
To see what your changelog entry will look like:
```bash
towncrier build --draft --version Unreleased
```
This won't modify any files, just show you a preview.
### When to Skip Changelog Entries
You can skip adding a changelog entry for:
- Documentation-only changes
- Internal refactoring with no user-facing impact
- Test-only changes
- CI/build configuration changes
If you're unsure whether your change needs a changelog entry, ask in your PR!
## Dependency Management
This project uses [uv](https://docs.astral.sh/uv/) for dependency management. The `uv.lock` file is committed to ensure reproducible builds.
### Adding or Updating Dependencies
1. Edit `pyproject.toml` to add/update dependencies
2. Run `uv lock` to update the lockfile with new dependency resolution
3. Run `uv sync` to install the updated dependencies locally
4. Always commit both files together:
```bash
git add pyproject.toml uv.lock
git commit -m "feat: add new dependency for feature X"
```
**Important:** Never manually edit `uv.lock`. It's auto-generated by `uv lock`.
# 🎙️ Pipecat: Real-Time Voice & Multimodal AI Agents
**Pipecat** is an open-source Python framework for building real-time voice and multimodal conversational agents. Orchestrate audio and video, AI services, different transports, and conversation pipelines effortlessly—so you can focus on what makes your agent unique.
> Want to dive right in? [Install Pipecat](https://docs.pipecat.ai/getting-started/installation) then try the [quickstart](https://docs.pipecat.ai/getting-started/quickstart).
> Want to dive right in? Run `pipecat init quickstart` or follow the [quickstart guide](https://docs.pipecat.ai/getting-started/quickstart).
## 🚀 What You Can Build
@@ -19,8 +19,6 @@
- **Business Agents** – customer intake, support bots, guided flows
- **Complex Dialog Systems** – design logic with structured conversations
🧭 Looking to build structured conversations? Check out [Pipecat Flows](https://github.com/pipecat-ai/pipecat-flows) for managing complex conversational states and transitions.
## 🧠 Why Pipecat?
- **Voice-first**: Integrates speech recognition, text-to-speech, and conversation handling
@@ -28,171 +26,184 @@
- **Composable Pipelines**: Build complex behavior from modular components
- **Real-Time**: Ultra-low latency interaction with different transports (e.g. WebSockets or WebRTC)
## 🌐 Pipecat Ecosystem
### 🧩 Multi-agent systems
Need multiple AI agents working together? [Pipecat Subagents](https://github.com/pipecat-ai/pipecat-subagents) lets you build distributed multi-agent systems where each agent runs its own pipeline and communicates through a shared message bus. Hand off conversations between specialists, dispatch background tasks, and scale agents across processes or machines.
### 📱 Client SDKs
Building client applications? You can connect to Pipecat from any platform using our official SDKs:
Looking to build structured conversations? Check out [Pipecat Flows](https://github.com/pipecat-ai/pipecat-flows) for managing complex conversational states and transitions.
### 🪄 Beautiful UIs
Want to build beautiful and engaging experiences? Checkout the [Voice UI Kit](https://github.com/pipecat-ai/voice-ui-kit), a collection of components, hooks and templates for building voice AI applications quickly.
### 🛠️ Create and deploy projects
Create a new project in under a minute with the [Pipecat CLI](https://github.com/pipecat-ai/pipecat-cli). Then use the CLI to monitor and deploy your agent to production.
### 🔍 Debugging
Looking for help debugging your pipeline and processors? Check out [Whisker](https://github.com/pipecat-ai/whisker), a real-time Pipecat debugger.
### 🖥️ Terminal
Love terminal applications? Check out [Tail](https://github.com/pipecat-ai/tail), a terminal dashboard for Pipecat.
### 🤖 Claude Code Skills
Use [Pipecat Skills](https://github.com/pipecat-ai/skills) with [Claude Code](https://claude.ai/code) to scaffold projects, deploy to Pipecat Cloud, and more. Install the marketplace with:
```
claude plugin marketplace add pipecat-ai/skills
```
and install any of the available plugins.
### 🧩 Community Integrations
Build and share your own Pipecat service integrations! Browse existing [community integrations](https://docs.pipecat.ai/api-reference/server/services/community-integrations) or check out our [guide](COMMUNITY_INTEGRATIONS.md) to create your own.
### 📺️ Pipecat TV Channel
Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.youtube.com/playlist?list=PLzU2zoMTQIHjqC3v4q2XVSR3hGSzwKFwH) channel.
| Community | [Browse community integrations →](https://docs.pipecat.ai/api-reference/server/services/community-integrations) |
📚 [View full services documentation →](https://docs.pipecat.ai/server/services/supported-services)
📚 [View full services documentation →](https://docs.pipecat.ai/api-reference/server/services/supported-services)
## ⚡ Getting started
You can get started with Pipecat running on your local machine, then move your agent processes to the cloud when you’re ready.
You can get started with Pipecat running on your local machine, then move your agent processes to the cloud when you're ready.
```shell
# Install the module
pip install pipecat-ai
1. Install uv
# Set up your environment
cp dot-env.template .env
```
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
To keep things lightweight, only the core framework is included by default. If you need support for third-party AI services, you can add the necessary dependencies with:
> **Need help?** Refer to the [uv install documentation](https://docs.astral.sh/uv/getting-started/installation/).
```shell
pip install "pipecat-ai[option,...]"
```
2. Install the module
```bash
# For new projects
uv init my-pipecat-app
cd my-pipecat-app
uv add pipecat-ai
# Or for existing projects
uv add pipecat-ai
```
3. Set up your environment
```bash
cp env.example .env
```
4. To keep things lightweight, only the core framework is included by default. If you need support for third-party AI services, you can add the necessary dependencies with:
```bash
uv add "pipecat-ai[option,...]"
```
> **Using pip?** You can still use `pip install pipecat-ai` and `pip install "pipecat-ai[option,...]"` to get set up.
## 🧪 Code examples
- [Foundational](https://github.com/pipecat-ai/pipecat/tree/main/examples/foundational) — small snippets that build on each other, introducing one or two concepts at a time
- [Foundational](https://github.com/pipecat-ai/pipecat/tree/main/examples) — small snippets that build on each other, introducing one or two concepts at a time
- [Example apps](https://github.com/pipecat-ai/pipecat-examples) — complete applications that you can use as starting points for development
## 🛠️ Hacking on the framework itself
## 🛠️ Contributing to the framework
1. Set up a virtual environment before following these instructions. From the root of the repo:
6. (Optional) If you want to use this package from another directory:
```shell
pip install "path_to_this_repo[option,...]"
```
```
claude plugin marketplace add pipecat-ai/pipecat
claude plugin install pipecat-dev@pipecat-dev-skills
```
### Running tests
Install the test dependencies:
To run all tests, from the root directory:
```shell
pip install -r test-requirements.txt
```bash
uv run pytest
```
From the root directory, run:
Run a specific test suite:
```shell
pytest
```bash
uv run pytest tests/test_name.py
```
### Setting up your editor
This project uses strict [PEP 8](https://peps.python.org/pep-0008/) formatting via [Ruff](https://github.com/astral-sh/ruff).
#### Emacs
You can use [use-package](https://github.com/jwiegley/use-package) to install [emacs-lazy-ruff](https://github.com/christophermadsen/emacs-lazy-ruff) package and configure `ruff` arguments:
`ruff` was installed in the `venv` environment described before, so you should be able to use [pyvenv-auto](https://github.com/ryotaro612/pyvenv-auto) to automatically load that environment inside Emacs.
```elisp
(use-package pyvenv-auto
:ensure t
:defer t
:hook ((python-mode . pyvenv-auto-run)))
```
#### Visual Studio Code
Install the
[Ruff](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) extension. Then edit the user settings (_Ctrl-Shift-P_ `Open User Settings (JSON)`) and set it as the default Python formatter, and enable formatting on save:
```json
"[python]": {
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.formatOnSave": true
}
```
#### PyCharm
`ruff` was installed in the `venv` environment described before, now to enable autoformatting on save, go to `File` -> `Settings` -> `Tools` -> `File Watchers` and add a new watcher with the following settings:
1. **Name**: `Ruff formatter`
2. **File type**: `Python`
3. **Working directory**: `$ContentRoot$`
4. **Arguments**: `format $FilePath$`
5. **Program**: `$PyInterpreterDirectory$/ruff`
## 🤝 Contributing
We welcome contributions from the community! Whether you're fixing bugs, improving documentation, or adding new features, here's how you can help:
- Added a `session_id` field to `RunnerArguments` so bots can log or trace a per-session identifier in local development the same way they can in Pipecat Cloud. The development runner now mints a UUID at every construction site, and paths that already returned a `sessionId` to the caller (Daily `/start`, dial-in webhook) share that same UUID with the runner args instead of generating two. The SmallWebRTC `/api/offer` endpoint also accepts an optional `session_id` query parameter so the `/sessions/{session_id}/...` proxy can thread it through.
Frames can represent discrete chunks of data, for instance a chunk of text, a chunk of audio, or an image. They can also be used to as control flow, for instance a frame that indicates that there is no more data available, or that a user started or stopped talking. They can also represent more complex data structures, such as a message array used for an LLM completion.
## FrameProcessors
Frame processors operate on frames. Every frame processor implements a `process_frame` method that consumes one frame and produces zero or more frames. Frame processors can do simple transforms, such as concatenating text fragments into sentences, or they can treat frames as input for an AI Service, and emit chat completions based on message arrays or transform text into audio or images.
## Pipelines
Pipelines are lists of frame processors linked together. Frame processors can push frames upstream or downstream to their peers. A very simple pipeline might chain an LLM frame processor to a text-to-speech frame processor, with a transport as an output.
## Transports
Transports provide input and output frame processors to receive or send frames respectively. For example, the `DailyTransport` does this with a WebRTC session joined to a Daily.co room.
2. The Transport places a Transcription frame in the Pipeline’s source queue.

3. The Pipeline passes the Transcription frame to the first Frame Processor in its list, the LLM User Message Aggregator.

4. The LLM User Message Aggregator updates the LLM Context with a `{“user”: “Hello LLM”}` message.

5. The LLM User Message Aggregator yields an LLM Message Frame, containing the updated LLM Context. The Pipeline passes this frame to the LLM Frame Processor.

6. The LLM Frame Processor creates a streaming chat completion based on the LLM context and yields the first chunk of a response, Text Frame with the value “Hi, “. The Pipeline passes this frame to the TTS Frame Processor. The TTS Frame Processor aggregates this response but doesn’t yield anything, yet, because it’s waiting for a full sentence.

7. The LLM Frame Processor yields another Text Frame with the value “there.”. The Pipeline passes this frame to the TTS Frame Processor.

8. The TTS Frame Processor now has a full sentence, so it starts streaming audio based on “Hi, there.” It yields the first chunk of streaming audio as an Audio frame, which the Pipeline passes to the LLM Assistant Message Aggregator.

9. The LLM Assistant Message Aggregator doesn’t do anything with Audio frames, so it immediately yields the frame, unchanged. This is the convention for all Frame Processors: frames that the processor doesn’t process should be immediately yielded.

10. The Pipeline places the first Audio frame in its sink queue, which is being watched by the Transport. Since the frame is now in a queue, the Pipeline can continue processing other frames. Note that the source and sink queues form a sort of “boundary of concurrent processing” between a Pipeline and the outside world. In a Pipeline, Frames are processed sequentially; once a Frame is on a queue it can be processed in parallel with the frames being processed by the Pipeline. TODO: link to a more in-depth section about this.

11. The TTS Frame Processor yields another Audio frame as the Transport transmits the first Audio frame.

12. As before, the LLM Assistant Message Aggregator immediately yields the Audio frame and the Pipeline places the Audio frame in the sink queue.

13. The TTS Frame Processor has no more frames to yield. The LLM Frame Processor emits an LLM Response End Frame, which the Pipeline passes to the TTS Frame Processor.

14. The TTS Frame Processor immediately yields the LLM Response End Frame, so the Pipeline passes it along to the LLM Assistant Message Aggregator. The LLM Assistant Message Aggregator updates the LLM Context with the full response from the LLM. TODO TODO: I realized I forgot that the TSS Frame Processor also yields the Text frames that the LLM emitted so that the LLM Assistant Message Aggregator could accumulate them, arrggh.

15. The system is quiet, and waiting for the next message from the Transport.
# Understanding Different Frame Types in the Pipecat System
In the Pipecat system, frames are used to represent different types of data and control signals that flow through the pipeline. Understanding these frame types is crucial for working with the system effectively. This tutorial will cover the main categories of frames and their specific uses.
## 1. Base Frame Classes
### Frame
The `Frame` class is the base class for all frames. It includes:
-`id`: A unique identifier
-`name`: A descriptive name
-`pts`: Presentation timestamp (optional)
### DataFrame
`DataFrame` is a subclass of `Frame` and serves as a base for most data-carrying frames.
## 2. Audio Frames
### AudioRawFrame
Represents a chunk of audio with properties:
-`audio`: Raw audio data
-`sample_rate`: Audio sample rate
-`num_channels`: Number of audio channels
Subclasses include:
-`InputAudioRawFrame`: For audio from input sources
-`OutputAudioRawFrame`: For audio to be played by output devices
-`TTSAudioRawFrame`: For audio generated by Text-to-Speech services
## 3. Image Frames
### ImageRawFrame
Represents an image with properties:
-`image`: Raw image data
-`size`: Image dimensions
-`format`: Image format (e.g., JPEG, PNG)
Subclasses include:
-`InputImageRawFrame`: For images from input sources
-`OutputImageRawFrame`: For images to be displayed
-`UserImageRawFrame`: For images associated with a specific user
-`VisionImageRawFrame`: For images with associated text for description
-`URLImageRawFrame`: For images with an associated URL
### SpriteFrame
Represents an animated sprite, containing a list of `ImageRawFrame` objects.
## 4. Text and Transcription Frames
### TextFrame
Represents a chunk of text, used for various purposes in the pipeline.
### TranscriptionFrame
A specialized `TextFrame` for speech transcriptions, including:
-`user_id`: ID of the speaking user
-`timestamp`: When the transcription was generated
-`language`: Detected language of the speech
### InterimTranscriptionFrame
Similar to `TranscriptionFrame`, but for interim (not final) transcriptions.
## 5. LLM (Language Model) Frames
### LLMMessagesFrame
Contains a list of messages for an LLM service to process.
### LLMMessagesAppendFrame and LLMMessagesUpdateFrame
Used to modify the current context of LLM messages.
### LLMSetToolsFrame
Specifies tools (functions) available for the LLM to use.
### LLMEnablePromptCachingFrame
Controls prompt caching in certain LLMs.
## 6. System and Control Frames
### SystemFrame
Base class for system-level frames.
Important system frames include:
-`StartFrame`: Initiates a pipeline
-`CancelFrame`: Stops a pipeline immediately
-`ErrorFrame`: Notifies of errors (with `FatalErrorFrame` for unrecoverable errors)
-`EndTaskFrame` and `CancelTaskFrame`: Control pipeline tasks
-`StartInterruptionFrame` and `StopInterruptionFrame`: Indicate user speech for interruptions
### ControlFrame
Base class for control-flow frames.
Notable control frames:
-`EndFrame`: Signals the end of a pipeline
-`LLMFullResponseStartFrame` and `LLMFullResponseEndFrame`: Bracket LLM responses
-`UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame`: Indicate user speech activity
-`BotStartedSpeakingFrame` and `BotStoppedSpeakingFrame`: Indicate bot speech activity
-`TTSStartedFrame` and `TTSStoppedFrame`: Bracket Text-to-Speech responses
## 7. Special Purpose Frames
### MetricsFrame
Contains performance metrics data.
### FunctionCallInProgressFrame and FunctionCallResultFrame
Used for handling LLM function (tool) calls.
### ServiceUpdateSettingsFrame
Base class for updating service settings, with specific subclasses for LLM, TTS, and STT services.
## Conclusion
Understanding these frame types is essential for working with the Pipecat system. Each frame type serves a specific purpose in the pipeline, whether it's carrying data (like audio or images), controlling the flow of the pipeline, or managing system-level operations. By using the appropriate frame types, you can effectively process and transmit various kinds of information through your pipeline.
This directory contains examples to help you learn how to build with Pipecat.
This directory contains examples showing how to build voice and multimodal agents with Pipecat.
## Getting Started
## Setup
New to Pipecat? Start here:
1. Follow the [README](https://github.com/pipecat-ai/pipecat/blob/main/README.md#%EF%B8%8F-contributing-to-the-framework) steps to get your local environment configured.
- **[Quickstart](quickstart/)** - Get your first voice AI bot running in 5 minutes _(coming soon)_
- **[Client/Server Web](client-server-web/)** - Learn to build web applications with Pipecat's client SDKs _(coming soon)_
- **[Phone Bot with Twilio](phone-bot-twilio/)** - Connect your bot to a phone number _(coming soon)_
> **Run from root directory**: Make sure you are running the steps from the root directory.
## Foundational Examples
> **Using local audio?**: The `LocalAudioTransport` requires a system dependency for `portaudio`. Install the dependency to use the transport.
Single-file examples that introduce core Pipecat concepts one at a time. These examples:
2. Copy the [`env.example`](../env.example) file and add API keys for services you plan to use:
- Build on each other progressively
- Focus on specific features or integrations
- Are used for testing with every Pipecat release
```bash
cp env.example .env
# Edit .env with your API keys
```
See the **[Foundational Examples README](foundational/)** for the complete list.
3. Run any example:
## More Advanced Examples
```bash
uv run python getting-started/01-say-one-thing.py
```
Ready to explore complex use cases? Visit **[pipecat-examples](https://github.com/pipecat-ai/pipecat-examples)** for:
4. Open the web interface at http://localhost:7860/client/ and click "Connect"
- Production-ready applications
- Multi-platform client implementations
- Telephony integrations
- Multimodal and creative applications
- Deployment and monitoring examples
## Running examples with other transports
Most examples support running with other transports, like Twilio or Daily.
### Daily
You need to create a Daily account at https://dashboard.daily.co/u/signup. Once signed up, you can create your own room from the dashboard and set the environment variables `DAILY_ROOM_URL` and `DAILY_API_KEY`. Alternatively, you can let the example create a room for you (still needs `DAILY_API_KEY` environment variable). Then, start any example with `-t daily`:
```bash
uv run getting-started/06-voice-agent.py -t daily
```
### Twilio
It is also possible to run the example through a Twilio phone number. You will need to setup a few things:
1. Install and run [ngrok](https://ngrok.com/download).
```bash
ngrok http 7860
```
2. Configure your Twilio phone number. One way is to setup a TwiML app and set the request URL to the ngrok URL from step (1). Then, set your phone number to use the new TwiML app.
Then, run the example with:
```bash
uv run getting-started/06-voice-agent.py -t twilio -x NGROK_HOST_NAME
```
## Directory Structure
### [`getting-started/`](./getting-started/)
Progressive introduction to Pipecat, from minimal TTS to a full voice agent with function calling.
### [`voice/`](./voice/)
Full STT + LLM + TTS voice agent pipelines showcasing different speech service providers (Deepgram, ElevenLabs, Cartesia, etc.)
### [`function-calling/`](./function-calling/)
Function calling with different LLM providers (OpenAI, Anthropic, Google, etc.)
### [`transcription/`](./transcription/)
Speech-to-text examples with various STT providers.
### [`vision/`](./vision/)
Image description and vision capabilities with different multimodal LLMs.
### [`realtime/`](./realtime/)
Realtime and multimodal live APIs (OpenAI Realtime, Gemini Live, AWS Nova Sonic, Ultravox, Grok).
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
messages=[
{
"role":"system",
"content":"You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
# Create audio buffer processor
audiobuffer=AudioBufferProcessor()
messages=[
{
"role":"system",
"content":"You are a helpful assistant demonstrating audio recording capabilities. Keep your responses brief and clear.",
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
messages=[
{
"role":"system",
"content":"You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio. Respond to what the user said in a creative and helpful way.",
},
]
tts=CartesiaTTSService(
api_key=os.environ["CARTESIA_API_KEY"],
settings=CartesiaTTSService.Settings(
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=GoogleLLMService(
api_key=os.environ["GOOGLE_API_KEY"],
settings=GoogleLLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way. You have access to tools to get the current weather - use them when relevant.",
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way. You have access to tools to get the current weather - use them when relevant.",
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
openai_llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
groq_llm=GroqLLMService(
api_key=os.environ["GROQ_API_KEY"],
settings=GroqLLMService.Settings(
system_instruction="You are a very helpful assistant. Your goal is to demonstrate your capabilities in detail in a creative and helpful way.",
),
)
openai_context=LLMContext()
groq_context=LLMContext()
# We use an external VADProcessor because the UserTurnProcessor is shared
# across multiple parallel aggregators. The VADProcessor emits
# VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame which the
# UserTurnProcessor needs to manage turn lifecycle.
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
# Main LLM — drives the conversation. Its RTVI events reach the client.
main_llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
)
# Evaluator LLM — silently grades the user's message in the background.
# Its RTVI events will be suppressed so the client is unaware of this branch.
evaluator_llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
name="EvaluatorLLM",
settings=OpenAILLMService.Settings(
system_instruction="You are a silent quality evaluator. When given a user message, respond with a single JSON object: {'score': <1-5>, 'reason': '<brief reason>'}. Do not respond conversationally.",
),
)
main_context=LLMContext()
evaluator_context=LLMContext()
# We use an external VADProcessor because the UserTurnProcessor is shared
# across multiple parallel aggregators. The VADProcessor emits
# VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame which the
# UserTurnProcessor needs to manage turn lifecycle.
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
),
base_url="http://0.0.0.0:8000/v1",
)
messages=[
{
"role":"system",
"content":"You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
voice="d4db5fb9-f44b-4bd1-85fa-192e0f0d75f9",# Spanish-speaking Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a live translation assistant. Your sole purpose is to translate English text into Spanish. When you receive English text from the user, immediately translate it into natural, fluent Spanish. Do not add explanations, commentary, or extra information—only provide the Spanish translation of the text you receive.",
),
)
context=LLMContext()
# We use the TranscriptionUserTurnStartStrategy to start a new user turn
# every time a transcription is received. We disable interruptions, so the
# user can continue speaking while the bot is transcribing, without
system_prompt="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way."
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way. You can speak the following languages: 'English' and 'Spanish'.",
"content":f"Please introduce yourself to the user and let them know the languages you speak. Your initial responses should be in {tts.current_language}.",
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative and helpful way. You can do the following voices: 'News Lady', 'British Lady' and 'Barbershop Man'.",
"content":f"Please introduce yourself to the user and let them know the voices you can do. Your initial responses should be as if you were a {tts.current_voice}.",
# Cartesia offers a `<spell></spell>` tags that we can use to ask the user
# to confirm the emails.
# (see https://docs.cartesia.ai/build-with-sonic/formatting-text-for-sonic/spelling-out-input-text)
tts=CartesiaTTSService(
api_key=os.environ["CARTESIA_API_KEY"],
settings=CartesiaTTSService.Settings(
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
# Rime offers a function `spell()` that we can use to ask the user
# to confirm the emails.
# (see https://docs.rime.ai/api-reference/spell)
# tts = RimeHttpTTSService(
# api_key=os.getenv("RIME_API_KEY", ""),
# settings=RimeTTSSettings(
# voice="eva",
# ),
# aiohttp_session=session,
# )
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You need to gather a valid email or emails from the user. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. If the user provides one or more email addresses confirm them with the user. Enclose all emails with <spell> tags, for example <spell>a@a.com</spell>.",
),
)
# You can aslo register a function_name of None to get all functions
# sent to the same callback with an additional function_name parameter.
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
voice="71a7ad14-091c-4e8e-a314-022ece01c121",# British Reading Lady
),
)
llm=OpenAILLMService(
api_key=os.environ["OPENAI_API_KEY"],
settings=OpenAILLMService.Settings(
system_instruction="You are a helpful assistant in a voice conversation. Your responses will be spoken aloud, so avoid emojis, bullet points, or other formatting that can't be spoken. Respond to what the user said in a creative, helpful, and brief way.",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.