pipecat/CHANGELOG.md
2026-05-15 22:18:46 +00:00

Changelog

All notable changes to Pipecat will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[1.2.1] - 2026-05-15

Changed

  • Changed the default WebSocket endpoints for GradiumSTTService and GradiumTTSService to the region-neutral wss://api.gradium.ai/api/speech/asr and wss://api.gradium.ai/api/speech/tts. Gradium now automatically routes traffic to the nearest endpoint. Override the url to pin to a specific region. (PR #4500)

Fixed

  • Fixed bot hangs when filter_incomplete_user_turns was enabled and the LLM responded by calling a tool. The user turn never finalized, so the assistant aggregator gated the tool-result context push and the LLM continuation never ran. Tool calls now finalize the turn the moment they start, before the function dispatches. (PR #4501)

[1.2.0] - 2026-05-14

Added

  • Added a session_id field to RunnerArguments so bots can log or trace a per-session identifier in local development the same way they can in Pipecat Cloud. The development runner now mints a UUID at every construction site, and paths that already returned a sessionId to the caller (Daily /start, dial-in webhook) share that same UUID with the runner args instead of generating two. The SmallWebRTC /api/offer endpoint also accepts an optional session_id query parameter so the /sessions/{session_id}/... proxy can thread it through. (PR #4385)

  • Added a max_buffer_delay_ms constructor argument to CartesiaTTSService for controlling Cartesia's server-side text buffering. When unset, Pipecat picks a default based on text_aggregation_mode: 0 in SENTENCE mode (custom buffering, which avoids stacking client-side aggregation on top of Cartesia's default 3000ms server buffer) and unset in TOKEN mode (Cartesia's managed buffering applies). Pass an explicit value (0-5000ms) to override. (PR #4390)

  • Added a mip_opt_out constructor argument to DeepgramTTSService and DeepgramHttpTTSService so callers can opt out of the Deepgram Model Improvement Program. When set, the value is forwarded to Deepgram as a query parameter on the speak request. Defaults to None, which preserves the existing behavior. See https://dpgr.am/deepgram-mip for pricing implications before enabling. (PR #4400)

  • Added an opt-in add_tool_change_messages flag to the LLM aggregators (set via LLMContextAggregatorPair(..., add_tool_change_messages=True)) that appends a developer-role message to the context whenever LLMSetToolsFrame changes the set of advertised standard tools. Helps the LLM stay coherent across mid-conversation tool changes, mitigating several flavors of tool-call-related hallucination: calling tools that have been removed, avoiding tools that have been re-added, and hallucinating output (made-up answers or tool-call-shaped non-tool-calls) when tools are unavailable. (PR #4404)

  • Added deferred(strategy) and DeferredUserTurnStopStrategy in pipecat.turns.user_stop. Wraps a stop strategy so it fires only the inference-triggered event and suppresses on_user_turn_stopped, leaving finalization to another strategy in the chain such as LLMTurnCompletionUserTurnStopStrategy. (PR #4405)

  • Added ExternalUserTurnCompletionStopStrategy in pipecat.turns.user_stop — a generic stop strategy that finalizes the user turn whenever a UserTurnInferenceCompletedFrame arrives, regardless of which component produced it. LLMTurnCompletionUserTurnStopStrategy now extends this base; future producers (Flux, custom end-of-turn classifiers, etc.) can use the base directly or subclass it to add producer-specific setup. (PR #4405)

  • Added on_user_turn_inference_triggered, a new event on the user turn controller, processor, aggregator and stop strategies that fires when a strategy has enough signal to start LLM inference. By default it fires together with on_user_turn_stopped; a gating strategy can fire only the inference-triggered event and defer finalization to a peer. (PR #4405)

  • Added FilterIncompleteUserTurnStrategies in pipecat.turns.user_turn_strategies — a UserTurnStrategies specialization that wraps the detector chain with deferred(...) and appends LLMTurnCompletionUserTurnStopStrategy as the finalizer. Common case: user_turn_strategies=FilterIncompleteUserTurnStrategies(). Pass config=UserTurnCompletionConfig(...) to customize timeouts and prompts. (PR #4405)

  • Added LLMTurnCompletionUserTurnStopStrategy in pipecat.turns.user_stop. When installed, the strategy gates on_user_turn_stopped on a UserTurnInferenceCompletedFrame, a new fieldless system frame emitted by any component that can judge turn completeness (e.g. the UserTurnCompletionLLMServiceMixin). A finalization_timeout provides a safety net if no completion frame ever arrives. (PR #4405)

  • Added first-class RTVI support for the UI Agent Protocol:

    • Adds ui-event, ui-snapshot, and ui-cancel-task client-to-server messages, plus ui-command and ui-task server-to-client messages, with paired *Data / *Message pydantic models.
    • Adds built-in command payload models for Toast, Navigate, ScrollTo, Highlight, Focus, Click, SetInputValue, and SelectText; matching default handlers live in @pipecat-ai/client-react.
    • Adds RTVIProcessor.on_ui_message for inbound ui-event, ui-snapshot, and ui-cancel-task messages.
    • Adds five UI pipeline frames, mirroring the client-message frame-and-event pattern: downstream code pushes RTVIUICommandFrame / RTVIUITaskFrame for the observer to wrap into outbound UICommandMessage / UITaskMessage envelopes, while the processor pushes inbound RTVIUIEventFrame, RTVIUISnapshotFrame, and RTVIUICancelTaskFrame alongside on_ui_message.
    • Bumps the RTVI PROTOCOL_VERSION from 1.2.0 to 1.3.0. (PR #4407)
  • AWS Transcribe STT, Polly TTS, Bedrock LLM, and the Bedrock AgentCore processor now resolve credentials via the standard boto3 provider chain (EC2 instance profiles, EKS pod roles / IRSA, ECS task roles, SSO, ~/.aws/credentials) when explicit credentials and AWS_* environment variables are absent. Services running with IAM roles no longer need to export static credentials. (PR #4416)

  • Added keyterms support to ElevenLabs STT services so Scribe V2 callers can bias recognition toward specific terms in both file-based and realtime transcription. (PR #4426)

  • Added a watchdog_min_timeout parameter to DeepgramFluxSTT and DeepgramFluxSageMakerSTT (default 0.5 seconds) to control the minimum silence duration before the watchdog sends a silence packet to prevent dangling turns. The actual threshold is max(chunk_duration * 2, watchdog_min_timeout), so it also adapts automatically to the audio chunk size in use. (PR #4430)

  • Added cancel_on_interruption=False support for GeminiLiveLLMService on models that support Gemini's NON_BLOCKING tool mechanism (currently Gemini 2.x); the conversation now continues while the tool runs. On models that don't yet support NON_BLOCKING (Gemini 3.x), the service surfaces a one-time warning explaining the limitation. (Note: an intermittent 1008 error can occasionally fire on Gemini 2.5 during long-running tool calls; we auto-reconnect.) (PR #4448)

  • Added NvidiaSageMakerWebsocketSTTService for streaming speech recognition using NVIDIA Nemotron ASR via an AWS SageMaker bidirectional-stream endpoint. Produces InterimTranscriptionFrame and TranscriptionFrame frames, is VAD-aware, and automatically reconnects on error. (PR #4464)

  • Added NVIDIA Magpie TTS services via AWS SageMaker: NvidiaSageMakerHTTPTTSService (single HTTP invocation, streams raw PCM back) and NvidiaSageMakerWebsocketTTSService (persistent HTTP/2 bidi-stream with full interruption support via InterruptibleTTSService). (PR #4464)

  • Added support for reasoning configuration on OpenAIRealtimeLLMService, for use with reasoning-capable Realtime models such as gpt-realtime-2. (PR #4470)

  • Inworld TTS updates:

    • Added delivery_mode setting (STABLE/BALANCED/CREATIVE) to InworldTTSService and InworldHttpTTSService, enabling the stability-vs-creativity tradeoff in inworld-tts-2.
    • Added language support to InworldTTSService and InworldHttpTTSService. The language setting is now forwarded to the API, and a new language_to_inworld_language() helper normalizes Pipecat Language enums to Inworld's BCP-47 locale tags. (PR #4473)
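
The default selection described above for max_buffer_delay_ms (PR #4390) can be pictured with a minimal sketch. This is not Pipecat's implementation: AggregationMode and resolve_max_buffer_delay_ms are hypothetical names introduced for illustration, and only the SENTENCE / TOKEN / explicit-value behavior is taken from the entry.

```python
from enum import Enum
from typing import Optional


class AggregationMode(Enum):
    """Hypothetical stand-in for Pipecat's text aggregation modes."""

    SENTENCE = "sentence"
    TOKEN = "token"


def resolve_max_buffer_delay_ms(
    mode: AggregationMode, explicit: Optional[int] = None
) -> Optional[int]:
    """Return the value to send to the server, or None to omit the field."""
    # An explicit value always wins over the mode-based default.
    if explicit is not None:
        return explicit
    # SENTENCE mode already aggregates client-side, so server buffering is set to 0.
    if mode is AggregationMode.SENTENCE:
        return 0
    # TOKEN mode: leave the field unset so the provider's managed buffering applies.
    return None
```

Returning None rather than a number for TOKEN mode keeps "unset" distinct from "zero", which matters because 0 disables server-side buffering while an absent field defers to the provider.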

Changed

  • Updated the default SonioxTTSService model from tts-rt-v1-preview to the generally available tts-rt-v1. (PR #4386)

  • Default cartesia_version for CartesiaTTSService bumped from 2025-04-16 to 2026-03-01, matching CartesiaHttpTTSService and unlocking the use_normalized_timestamps and max_buffer_delay_ms fields. (PR #4390)

  • ⚠️ CartesiaTTSService now sends use_normalized_timestamps: true instead of the deprecated use_original_timestamps field. Word timestamps now reflect what was actually spoken (post text-normalization and pronunciation-dictionary substitution), matching the convention Pipecat uses for ElevenLabs. This is a behavior change for sonic-3 users, who were previously receiving timestamps tied to the input transcript. (PR #4390)

  • Broadened tool_resources to app_resources for easy access not just in tool handlers but in other places like custom FrameProcessors. Three changes: a rename (tool_resources → app_resources), a new app_resources property on PipelineTask, and a new pipeline_task property on FrameProcessor. Tool handlers now read params.app_resources; custom processors read self.pipeline_task.app_resources. The previous tool_resources aliases (on PipelineTask, FunctionCallParams, and FrameProcessorSetup) keep working but are deprecated as of 1.2.0 and emit DeprecationWarnings. (PR #4395)

  • Lowered the per-message log in SmallWebRTCInputTransport._handle_app_message from debug to trace. App messages can be high-frequency and were noisy at debug level; set the loguru level to TRACE to see them again. (PR #4397)

  • Changed the default model for GrokRealtimeLLMService to grok-voice-think-fast-1.0, xAI's recommended Voice Agent model. The previous default of grok-voice-fast-1.0 has been deprecated by xAI and is being removed. (PR #4401)

  • Changed the default Inworld TTS model from inworld-tts-1.5-max to inworld-tts-2 (Realtime TTS-2) across InworldHttpTTSService, InworldTTSService, and the InworldRealtimeLLMService cascade. Existing users can pin the prior model explicitly via the model/tts_model argument; both inworld-tts-1.5-max and inworld-tts-1.5-mini remain valid model IDs. (PR #4422)

  • Changed the default model for GrokLLMService from grok-3 to grok-4.20-non-reasoning. xAI is retiring grok-3 on May 15, 2026. (PR #4429)

  • DeepgramFluxSTT watchdog silence threshold is now dynamic: max(chunk_duration * 2, watchdog_min_timeout) instead of a fixed 500 ms. This prevents false silence injections when large audio chunks are sent at lower frequency. (PR #4430)

  • ElevenLabsTTSService now sends close_context to the server as soon as the turn is complete (on on_turn_context_completed) rather than waiting until all audio has finished playing back. The isFinal message from ElevenLabs is now used to signal TTSStoppedFrame and clean up the audio context, improving turn transition timing. (PR #4433)

  • Updated InworldHttpTTSService and InworldTTSService to use PCM audio encoding by default, which returns audio bytes without headers. (PR #4446)

  • Moved create_task, cancel_task, the task_manager property, and setup(task_manager) up from FrameProcessor to BaseObject. Custom BaseObject subclasses (turn strategies, controllers, etc.) now inherit these methods directly instead of reimplementing the task manager wiring. Owners propagate the task manager to their child BaseObjects via await child.setup(task_manager). (PR #4449)

  • Changed the default OpenAI Realtime input audio transcription model from gpt-4o-transcribe to gpt-realtime-whisper for both OpenAIRealtimeSTTService and OpenAIRealtimeLLMService. The new model does not accept the prompt parameter; if a prompt is supplied alongside gpt-realtime-whisper, it is dropped automatically and a warning is logged. To keep using prompt hints, explicitly pin model="gpt-4o-transcribe" (or "gpt-4o-mini-transcribe"). (PR #4450)

  • Updated the default model for CartesiaTTSService and CartesiaHttpTTSService from sonic-3 to sonic-3.5. (PR #4462)

  • Changed the default model for OpenAIRealtimeLLMService from gpt-realtime-1.5 to gpt-realtime-2. (PR #4472)
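
To illustrate the dynamic DeepgramFluxSTT watchdog threshold from PR #4430, the sketch below evaluates max(chunk_duration * 2, watchdog_min_timeout) for two chunk sizes. The function names are hypothetical, and deriving chunk_duration from a sample count and sample rate is an assumption made for the example, not something the changelog states.

```python
def chunk_duration_secs(num_samples: int, sample_rate: int) -> float:
    """Duration of one audio chunk in seconds (assumed derivation)."""
    return num_samples / sample_rate


def watchdog_threshold(chunk_duration: float, watchdog_min_timeout: float = 0.5) -> float:
    """Silence threshold formula quoted in the changelog entry."""
    return max(chunk_duration * 2, watchdog_min_timeout)


# 20 ms chunks at 16 kHz: twice the chunk is 0.04 s, so the 0.5 s floor applies.
small = watchdog_threshold(chunk_duration_secs(320, 16000))

# 500 ms chunks at 16 kHz: twice the chunk is 1.0 s, which exceeds the floor.
large = watchdog_threshold(chunk_duration_secs(8000, 16000))
```

With the previous fixed 500 ms threshold, the second case would treat the normal gap between large chunks as silence; scaling with the chunk size avoids that.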

Deprecated

  • Deprecated LLMUserAggregatorParams.filter_incomplete_user_turns. Use user_turn_strategies=FilterIncompleteUserTurnStrategies() (or add LLMTurnCompletionUserTurnStopStrategy to a custom user_turn_strategies.stop) instead. Setting the legacy flag still works for one release: the aggregator emits a DeprecationWarning and rewires the strategies as if you had passed FilterIncompleteUserTurnStrategies directly. (PR #4405)

  • Deprecated ResampyResampler in favor of SOXRAudioResampler (or the create_file_resampler() / create_stream_resampler() factories). Instantiating ResampyResampler now emits a DeprecationWarning. The class will be removed in Pipecat 2.0 along with the default resampy and numba dependencies. (PR #4428)
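
The compatibility behavior described for filter_incomplete_user_turns (warn, then behave as if the new strategies had been passed) follows a common shim pattern, sketched below. The classes and the resolve_user_turn_strategies function are hypothetical stand-ins, not Pipecat's code, and letting the legacy flag override an explicitly supplied strategy is an assumption.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


class FilterIncompleteStrategies:
    """Hypothetical stand-in for FilterIncompleteUserTurnStrategies."""


@dataclass
class AggregatorParams:
    """Hypothetical stand-in for LLMUserAggregatorParams."""

    filter_incomplete_user_turns: bool = False
    user_turn_strategies: Optional[object] = None


def resolve_user_turn_strategies(params: AggregatorParams) -> Optional[object]:
    """Map the legacy flag onto the new strategies object, warning the caller."""
    if params.filter_incomplete_user_turns:
        warnings.warn(
            "filter_incomplete_user_turns is deprecated; pass "
            "user_turn_strategies=FilterIncompleteUserTurnStrategies() instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the legacy flag takes precedence over any supplied strategies.
        return FilterIncompleteStrategies()
    return params.user_turn_strategies
```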

Fixed

  • Fixed CartesiaTTSService surfacing flush_done messages from Cartesia as ErrorFrames. The latest API emits a flush_done per transcript when server-side buffering is disabled; Pipecat now consumes them silently since each turn already has its own context_id. (PR #4390)

  • Fixed Cartesia tag helpers (SPELL, EMOTION_TAG, PAUSE_TAG, VOLUME_TAG, SPEED_TAG) raising TypeError when called on an instance (e.g. tts.SPELL("hi")). They're now @staticmethod and callable from both the class and an instance. (PR #4390)

  • Fixed CartesiaHttpTTSService pushing two ErrorFrames on a non-200 response — one with the API's error text and a second, less informative "Unknown error" frame from the outer exception handler. It now pushes a single frame that includes the HTTP status code and returns cleanly. (PR #4390)

  • Fixed an issue where LocalSmartTurnAnalyzerV3 was imported unconditionally for user turn stop strategies. It is now only imported when default_user_turn_stop_strategies() is called. This improves startup time and removes the transformers "PyTorch/TensorFlow/Flax not found" warning when the default stop strategies are not used. (PR #4393)

  • Fixed GrokRealtimeLLMService ignoring the configured model. The model was stored in Settings but never sent to xAI, so every session silently fell back to xAI's server-side default. The model is now passed via the ?model= query parameter on the WebSocket URL as xAI's Voice Agent API requires. (PR #4401)

  • Fixed on_user_turn_stopped firing prematurely when filter_incomplete_user_turns was enabled. The event now fires only after the LLM confirms the user turn is complete; previously the smart-turn detector's tentative stop bubbled up before the LLM had a chance to veto it, causing observers, transcript appenders and UI indicators to receive an early (and sometimes duplicated) signal. (PR #4405)

  • Fixed TTSSpeakFrame(append_to_context=True) greetings sometimes splitting across two assistant messages in the LLM context and not surfacing in on_assistant_turn_stopped. The LLMAssistantPushAggregationFrame emitted at the end of a TTS context now carries a PTS just past the last word so it can't overtake clock-queued TTSTextFrames in the transport's output, and LLMAssistantAggregator now triggers on_assistant_turn_started/on_assistant_turn_stopped when it receives the frame outside an LLM response cycle (restoring v0.0.104 behavior for greeting transcripts). (PR #4414)

  • Fixed ElevenLabsTTSService and ElevenLabsHttpTTSService producing merged words (e.g. bookLook) when using Flash models. Flash often splits sentences mid-stream into alignment chunks that begin with a real inter-word space, but the previous fix unconditionally stripped that space from every chunk. Leading spaces are now stripped only on the first alignment chunk of an utterance, so subsequent chunks correctly flush partial words across boundaries. (PR #4415)

  • Fixed AWS Polly TTS, Bedrock LLM, and the Bedrock AgentCore processor erroring out when only one of AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY was set in the environment. The half-populated kwargs are no longer forwarded to aioboto3; partial env-var configurations now fall through to the boto3 credential chain like fully-unset configurations do. (PR #4416)

  • Fixed ElevenLabsTTSService and ElevenLabsHttpTTSService writing romanized/normalized text to the LLM context. With non-Latin input (e.g., Chinese), the assistant transcript was getting populated with pinyin (Ni Hao ! instead of 你好!), which then degraded subsequent LLM turns. The services now consume alignment by default and only switch to normalizedAlignment / normalized_alignment when pronunciation_dictionary_locators is configured (where alignment has overlapping restarts that produce duplicated/garbled words, per #4316). Both fields are read with preferred-with-fallback semantics since each is nullable per the API schema. (PR #4424)

  • Fixed a deadlock in TTSService that could permanently stall pipeline processing when all three conditions occurred together: pause_frame_processing=True, an interruption arrived before any TTS audio was played, and an UninterruptibleFrame (e.g. TTSUpdateSettingsFrame, FunctionCallResultFrame) was in the processing queue at that moment. The process task would block on __process_event.wait() indefinitely because BotStoppedSpeakingFrame never arrives (no audio was played) and the interruption handler did not resume processing. Affects services using pause_frame_processing=True such as ElevenLabs, Rime, AsyncAI, Gradium, and ResembleAI. (PR #4431)

  • Fixed interruptions being delayed when a slow non-uninterruptible frame was processing and an uninterruptible frame was waiting in the queue. The bot would stall until the slow frame finished instead of cancelling it immediately on interruption. (PR #4434)

  • Fixed TTSService dropping uninterruptible frames (e.g. FunctionCallResultFrame) from its internal serialization queue when an interruption occurs. Previously, the queue was recreated on every interruption, silently discarding any queued frames. The queue is now reset instead of recreated, preserving uninterruptible frames so they are always delivered downstream. (PR #4435)

  • Fixed a race condition in the Daily transport that caused AttributeError: 'NoneType' object has no attribute 'send_app_message' when tearing down a pipeline. Both DailyInputTransport and DailyOutputTransport share the same DailyTransportClient and both call cleanup(), which was releasing the underlying CallClient on the first call — leaving the second caller with a None client. (PR #4440)

  • Restored cancel_on_interruption=False support for AWSNovaSonicLLMService and OpenAIRealtimeLLMService. These services previously honored the flag by simply not cancelling in-flight function calls on interruption; the introduction of the new async-tool mechanism (which threads started/intermediate/final messages through the LLM context) broke that path because the realtime services didn't know how to interpret those messages. Note that new-style streamed intermediate results (FunctionCallResultProperties(is_final=False)) are not supported on these realtime services. Similar fixes for other impacted realtime services are forthcoming. (PR #4441)

  • Fixed two misspelled Gemini TTS voice names in GeminiTTSService.AVAILABLE_VOICES. (PR #4443)

  • Extended the cancel_on_interruption=False regression fix to GrokRealtimeLLMService, AzureRealtimeLLMService, and UltravoxRealtimeLLMService. Grok and Azure use the same approach as in #4441 (each service detects async-tool messages in the LLM context and routes the final result to its formal tool-result channel; Azure inherits transitively from OpenAIRealtimeLLMService). Ultravox needed a different approach because its API freezes the conversation between client_tool_invocation and the matching client_tool_result — for async-registered functions it now ships a placeholder client_tool_result immediately when the function is invoked (to unfreeze the conversation), then injects the real result as user-side text once the tool finishes. Streamed intermediate results (FunctionCallResultProperties(is_final=False)) are still not supported on any of these realtime services. GeminiLiveLLMService and InworldRealtimeLLMService are excluded for now: Gemini Live's async-tool path needs deeper investigation, and Inworld tool calling needs to be sorted out first. (PR #4447)

  • Fixed OpenAIRealtimeLLMService handling of multi-output-item responses (observed with gpt-realtime-2). A single response can now contain more than one audio item, and the first item's audio.done may arrive after the second item's deltas have started. Deltas still arrive strictly in playback order, so we continue to forward them as received (matching OpenAI's reference implementation). The fix removes spurious warnings, ensures truncation always targets the latest audio item, and emits a single bracketing TTSStartedFrame/TTSStoppedFrame pair per assistant turn (the Stopped is now pushed on response.done). (PR #4465)

  • Fixed missing output attribute on LLM OpenTelemetry spans when the LLM call is interrupted mid-stream. (PR #4467)

  • Fixed incorrect metrics.ttfb on STT OpenTelemetry spans, and parented them to the current turn span. (PR #4467)

  • Fixed incorrect metrics.ttfb on TTS OpenTelemetry spans for streaming services. (PR #4467)

  • Extended the cancel_on_interruption=False regression fix to InworldRealtimeLLMService. Uses the same approach as in #4441 (the service detects async-tool messages in the LLM context and routes the final result to its formal tool-result channel). Note: as of this writing, Inworld Realtime doesn't appear to handle the resulting delayed tool result reliably — the routing is best-effort and the service surfaces a one-time warning when async-tool messages are seen. Streamed intermediate results (FunctionCallResultProperties(is_final=False)) are still not supported on this realtime service. (Inworld was excluded from #4447 pending resolution of an unrelated tool-calling issue, which turned out to be an account-level matter.) (PR #4474)

  • Fixed Cartesia TTS Korean word timestamps to use normal spacing rules, preserving word boundaries and per-word timestamp alignment during downstream aggregation. (PR #4475)

  • Fixed Cartesia TTS Chinese and Japanese timestamp grouping to preserve provider text spacing, avoiding artificial spaces when timestamp groups are reassembled downstream. (PR #4475)

  • Fixed SonioxSTTService final transcription frames missing detected language metadata when Soniox returns token-level language annotations. (PR #4482)

  • Fixed Soniox final transcription language detection to use the most common recognized token language, avoiding mislabeling an utterance when the last token is tagged with a different language. (PR #4495)

  • Fixed dropped audio in streaming TTS services whose wire protocol doesn't echo context_id back on incoming audio (Sarvam, Smallest, Soniox, Inworld, and others). Previously, audio that arrived between contexts or at the very start of a turn was tagged with context_id=None and silently dropped with an "unable to append audio to context: no context ID provided" debug log. TTSService.get_active_audio_context_id() now falls back to the synthesis-side _turn_context_id when the playback cursor isn't set yet. (PR #4497)
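
The Soniox language-detection fix above (PR #4495) picks the most common token language instead of the last token's. A minimal sketch of that majority vote follows; dominant_language is a hypothetical helper, and representing token languages as plain strings (with None for untagged tokens) is an assumption about the data shape.

```python
from collections import Counter
from typing import Optional, Sequence


def dominant_language(token_languages: Sequence[Optional[str]]) -> Optional[str]:
    """Return the language tag that appears on the most tokens, if any."""
    # Ignore tokens that carry no language annotation.
    counts = Counter(lang for lang in token_languages if lang)
    if not counts:
        return None
    # most_common(1) yields a single (language, count) pair.
    return counts.most_common(1)[0][0]
```

For an utterance tagged ["en", "en", "en", "es"], taking the last token would label it "es", while the majority vote labels it "en".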

Security

  • Fixed a path traversal issue in the development runner's /files/{filename:path} download endpoint. Previously, when the runner was started with --folder, a request like /files/..%2F..%2Fetc%2Fpasswd could escape the configured folder because %2F-encoded separators bypassed Starlette's path normalisation. The endpoint now resolves the joined path and rejects any filename that escapes the allowed base with a 403, and also returns 404 (instead of an implicit null 200) when --folder is unset. (PR #4417)
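
The resolve-then-check approach described in the security fix can be sketched with the standard library alone. safe_join is a hypothetical helper, not the runner's actual code, and it assumes the filename has already been URL-decoded by the web framework; a caller would map a None result to a 403 response.

```python
from pathlib import Path
from typing import Optional


def safe_join(base_dir: str, filename: str) -> Optional[Path]:
    """Join filename onto base_dir, returning None if the result escapes base_dir."""
    base = Path(base_dir).resolve()
    # Resolving collapses ".." segments (and symlinks) before the containment check.
    candidate = (base / filename).resolve()
    try:
        # relative_to raises ValueError when candidate is not inside base.
        candidate.relative_to(base)
    except ValueError:
        return None
    return candidate
```

Checking the resolved path, rather than inspecting the raw string for "..", also rejects absolute filenames, because joining an absolute path replaces the base entirely.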

[1.1.0] - 2026-04-27

Added

  • Added MistralSTTService for real-time speech-to-text using Mistral's Voxtral Realtime API (voxtral-mini-transcribe-realtime-2602). Supports streaming transcription with interim results, automatic language detection, and VAD-driven utterance lifecycle. (PR #4253)

  • Added a buttons field to OutputDTMFFrame and OutputDTMFUrgentFrame for sending multi-key DTMF sequences as a list[KeypadEntry]. Use OutputDTMFFrame.from_string("123#") (or the equivalent on OutputDTMFUrgentFrame) to build one from a dial string, and to_string() to convert back. (PR #4313)

  • Added DailyTransport.send_dtmf() to expose the Daily call client's DTMF sending capability, enabling applications to send tones during a call (e.g. IVR navigation). (PR #4313)

  • Added DailyOutputDTMFFrame and DailyOutputDTMFUrgentFrame frames. In addition to the inherited buttons, they accept session_id, digit_duration_ms and method, which are forwarded to Daily's send_dtmf as sessionId, digitDurationMs and method. (PR #4313)

  • Added incremental pyright type checking. A pyrightconfig.json at the repo root uses typeCheckingMode: "basic" with an explicit include list of modules that pass cleanly (clocks, metrics, transcriptions, frames, observers, extensions, turns, pipeline, runner). Remaining modules will be added in subsequent PRs. CI enforces the checked set via uv run pyright in the format workflow. (PR #4324)

  • Added multilingual support to DeepgramFluxSTTService via a new language_hints: list[Language] setting. Works with Deepgram's new flux-general-multi model to bias transcription across English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. Omit the hints to use auto-detection, or pass a subset to bias toward expected languages. Hints can be updated mid-stream via STTUpdateSettingsFrame (sent as a Deepgram Configure control message, no reconnect) to support detect-then-lock flows. (PR #4326)

  • Added fine-grained server-side VAD tuning options to SarvamSTTService.Settings for the saaras:v3 model, including speech thresholds, frame-count controls, pre-speech padding, interruption sensitivity, and initial-frame skipping. (PR #4334)

  • Added XAISTTService for real-time speech-to-text using xAI's voice STT WebSocket API (wss://api.x.ai/v1/stt). Streams raw audio (PCM, µ-law, or A-law) and emits interim and final transcription frames driven by the server's is_final / speech_final flags. Settings expose interim_results, endpointing, language, multichannel, channels, and diarize. Requires the xai optional extra (pip install "pipecat-ai[xai]"). (PR #4340)

  • Added XAITTSService for streaming text-to-speech using xAI's WebSocket TTS endpoint (wss://api.x.ai/v1/tts). Streams text.delta chunks up and base64 audio.delta chunks down on the same connection so audio begins flowing before the full utterance finishes synthesizing; complements the batch-HTTP XAIHttpTTSService. Defaults to raw PCM output so TTSAudioRawFrame needs no decoding. The xai optional extra now pulls in pipecat-ai[websockets-base]. (PR #4341)

  • Added SonioxTTSService, a real-time WebSocket TTS service that streams text in and audio out over a persistent connection. Install with pip install "pipecat-ai[soniox]". (PR #4360)

  • Added support for Daily's built-in screenVideo destination in DailyTransport. When "screenVideo" is included in the video_out_destinations transport parameter, a dedicated screen video track is created at join time and frames with transport_destination="screenVideo" are routed to it.

    params = DailyParams(
        video_out_enabled=True,
        video_out_is_live=True,
        video_out_width=1280,
        video_out_height=720,
        video_out_destinations=["screenVideo"],
    )
    
    ...
    
    frame = OutputImageRawFrame(...)
    frame.transport_destination = "screenVideo"
    

    (PR #4370)

  • Added camera_out_send_settings to DailyParams. This dict is passed verbatim to the Daily client's camera publishing settings, allowing applications to fully control encoding, codec, bitrate, and framerate.

    params = DailyParams(
        camera_out_send_settings={
            "maxQuality": "high",
            "encodings": {
                "high": {"maxBitrate": 2_000_000, "maxFramerate": 30}
            },
        },
    )
    

    (PR #4370)

  • Added tool_resources to PipelineTask and FunctionCallParams. Pass an application-defined object (DB handles, clients, state, etc.) to PipelineTask(..., tool_resources=...) and access it from any tool handler via params.tool_resources. Passed by reference; the caller retains their handle and can read mutations after the task finishes. Resolves #4256. (PR #4371)
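
The from_string / to_string round trip described for the DTMF frames (PR #4313) can be pictured with a small stand-alone sketch. KeypadKey and DTMFSequence are hypothetical stand-ins for Pipecat's KeypadEntry and OutputDTMFFrame; the real classes may differ in member names and validation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class KeypadKey(str, Enum):
    """Hypothetical stand-in for KeypadEntry: one member per keypad button."""

    ZERO = "0"
    ONE = "1"
    TWO = "2"
    THREE = "3"
    FOUR = "4"
    FIVE = "5"
    SIX = "6"
    SEVEN = "7"
    EIGHT = "8"
    NINE = "9"
    POUND = "#"
    STAR = "*"


@dataclass
class DTMFSequence:
    """Hypothetical stand-in for a frame carrying a list of keypad buttons."""

    buttons: List[KeypadKey] = field(default_factory=list)

    @classmethod
    def from_string(cls, dial_string: str) -> "DTMFSequence":
        # KeypadKey(ch) raises ValueError for characters that are not keypad keys.
        return cls(buttons=[KeypadKey(ch) for ch in dial_string])

    def to_string(self) -> str:
        return "".join(key.value for key in self.buttons)
```

Under these assumptions, DTMFSequence.from_string("123#") holds four buttons and to_string() returns the original dial string.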

Changed

  • Updated NVIDIA STT services to align with Nemotron Speech defaults and configuration: api_key is now optional for local deployments, additional recognition settings are available (including alternatives, word offsets, and diarization), and streaming/segmented docs now reflect Nemotron Speech APIs.

    • NVIDIA streaming STT now sets TranscriptionFrame.finalized=True when the provider marks a result as final, and preserves language on both TranscriptionFrame and InterimTranscriptionFrame. (PR #4269)
  • Updated NvidiaLLMService to emit model reasoning as LLMThought*Frames (from both reasoning_content and <think>...</think> output), avoid mixing reasoning text into normal assistant content, and allow keyless local NIM endpoints while warning when the cloud endpoint is used without an API key. (PR #4270)

  • STT services now reconnect safely when settings change: reconnection is deferred until the current user turn ends (i.e., until UserStoppedSpeakingFrame is received) rather than interrupting an active speech session. Audio frames received while the reconnect is in progress are buffered and replayed once the new connection is ready. CartesiaSTTService and DeepgramSTTService both use this new behavior. (PR #4311)

  • Reduced debug log noise for LLM services. The system instruction is now logged once when composed (e.g. when turn completion is enabled) instead of on every LLM call. Per-call logs now show only the conversation messages, consistent across Google, Anthropic, AWS, and OpenAI services. (PR #4314)

  • LiveKitRunnerArguments.token is now a required str (previously str | None with a default of None). LiveKit requires a token to join a room, so the type now reflects reality. This only affects custom runners that construct LiveKitRunnerArguments directly; code consuming the argument from the standard runner is unaffected. (PR #4324)

  • TranscriptionFrame.language and InterimTranscriptionFrame.language emitted by DeepgramFluxSTTService now reflect the language Deepgram detected for each turn (read from the languages field on Flux's TurnInfo event). On flux-general-multi this gives per-turn accuracy for downstream consumers (e.g. TTS voice selection). flux-general-en continues to emit Language.EN. (PR #4326)

  • Added an includes_inter_frame_spaces parameter to TTSService.add_word_timestamps and _add_word_timestamps (default None). When True, downstream consumers will not inject additional spaces between tokens; None leaves each frame's own default unchanged.

    • InworldTTSService now passes includes_inter_frame_spaces=True when reporting word timestamps, since Inworld tokens already include inter-word spacing. (PR #4330)
  • SarvamSTTService now uses saaras:v3 as its default model instead of saarika:v2.5. Applications that relied on the previous default should set settings=SarvamSTTService.Settings(model="saarika:v2.5") explicitly. (PR #4334)

  • SpeechTimeoutUserTurnStopStrategy now waits only user_speech_timeout when a transcript arrives without a VAD stop event, rather than max(ttfs_p99_latency, user_speech_timeout). If you had ttfs_p99_latency > user_speech_timeout, turn detection in that path is slightly faster than before. (PR #4337)

  • If you use an STT service that emits finalized transcripts (Speechmatics, Soniox, Deepgram Flux, AssemblyAI) with SpeechTimeoutUserTurnStopStrategy, user turns now end as soon as user_speech_timeout elapses after VAD stop. Previously the strategy also waited for the STT P99 latency (ttfs_p99_latency) even when the transcript was already marked final. user_speech_timeout is still honored as a floor — STT finalization never shortens it. (PR #4337)

  • ⚠️ PlivoFrameSerializer and TelnyxFrameSerializer now raise ValueError at construction when auto_hang_up=True (the default) but required credentials are missing, matching TwilioFrameSerializer. Previously they constructed successfully and the hangup failed silently at call-end, leaving phantom billable sessions on the provider. If you relied on the old silent behavior, pass auto_hang_up=False explicitly or provide the credentials. The specific fields checked are call_id/auth_id/auth_token for Plivo and call_control_id/api_key for Telnyx. (PR #4349)
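
The fail-fast check can be sketched as follows. `TelnyxLikeSerializerConfig` is a hypothetical stand-in, not the real serializer; only the field names `call_control_id`, `api_key`, and `auto_hang_up` come from the entry above, and the error message wording is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TelnyxLikeSerializerConfig:
    """Hypothetical stand-in that validates credentials at construction."""

    call_control_id: Optional[str] = None
    api_key: Optional[str] = None
    auto_hang_up: bool = True

    def __post_init__(self) -> None:
        if not self.auto_hang_up:
            return  # nothing to hang up, so no credentials are needed
        missing = [
            name
            for name in ("call_control_id", "api_key")
            if not getattr(self, name)
        ]
        if missing:
            raise ValueError(
                f"auto_hang_up=True requires: {', '.join(missing)}"
            )
```

Raising at construction surfaces the misconfiguration when the bot starts, rather than at call end when the hangup would otherwise fail.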

  • ToolsSchema(standard_tools=...) now accepts any Sequence[FunctionSchema | DirectFunction] rather than requiring an exact list of the union. Callers can pass a narrower list[FunctionSchema] (or any other Sequence) without the type checker complaining about list invariance. (PR #4352)
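
The invariance problem is general to Python typing and can be reproduced without Pipecat. In this sketch, `FunctionSchema` and `DirectFunction` are simplified stand-ins for the real types, and the two `count_*` helpers are hypothetical.

```python
from typing import Callable, List, Sequence, Union


class FunctionSchema:
    """Simplified stand-in for Pipecat's FunctionSchema."""


DirectFunction = Callable[..., object]  # simplified stand-in alias
Tool = Union[FunctionSchema, DirectFunction]


def count_tools_list(tools: List[Tool]) -> int:
    return len(tools)


def count_tools_sequence(tools: Sequence[Tool]) -> int:
    return len(tools)


schemas: List[FunctionSchema] = [FunctionSchema(), FunctionSchema()]

# Accepted by type checkers: Sequence is covariant in its element type.
n_from_list = count_tools_sequence(schemas)
n_from_tuple = count_tools_sequence(tuple(schemas))

# Rejected by type checkers (List is invariant), although it runs fine:
# count_tools_list(schemas)
```

The difference is purely static; at runtime both helpers accept the same values.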

  • Updated aic-sdk dependency to ~=2.2.0. The AIC_LICENSE_KEY environment variable replaces the previous AICOUSTICS_LICENSE_KEY. (PR #4362)

  • Loosened the protobuf dependency to >=5.29.6,<7, so projects pinned to protobuf 5.x can install pipecat-ai again. The previous >=6.31.1,<7 pin (introduced in 1.0.8 alongside the nvidia-riva-client 2.25.1 upgrade) silently blocked any environment whose dependency graph already constrained protobuf to the 5.x line. The bundled frames_pb2.py is now compiled with protoc 5.x so it imports cleanly on both 5.x and 6.x runtimes.

    Installing the nvidia extra still pulls protobuf 6.x: nvidia-riva-client 2.25.1 ships gencode that requires a 6.x runtime, so pipecat-ai[nvidia] now declares protobuf>=6.31.1,<7 explicitly to cover an upstream packaging gap (https://github.com/nvidia-riva/python-clients/issues/172). (PR #4372)

  • Daily rooms created by the development runner (pipecat.runner.run) now expire after 4 hours with eject_at_room_exp=True, mirroring Pipecat Cloud's max session limit. Previously, runner-created rooms inherited a 2-hour expiration on the default code paths and had no expiration at all when callers posted partial dailyRoomProperties (e.g. {"start_video_off": true}) to /start, causing rooms to accumulate indefinitely. Explicit exp and eject_at_room_exp values in dailyRoomProperties are still respected. (PR #4374)
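
One way to picture the new defaulting is a merge in which caller-supplied properties win over the 4-hour defaults. This is a sketch under assumptions: `with_default_expiry` is a hypothetical helper, and the runner's actual merge logic may differ.

```python
import time
from typing import Optional

FOUR_HOURS_SECS = 4 * 60 * 60


def with_default_expiry(props: Optional[dict], now: Optional[float] = None) -> dict:
    """Apply the 4-hour expiry defaults unless the caller set them explicitly."""
    now = time.time() if now is None else now
    defaults = {"exp": now + FOUR_HOURS_SECS, "eject_at_room_exp": True}
    # Caller-supplied keys override the defaults; partial dicts keep the rest.
    return {**defaults, **(props or {})}


partial = with_default_expiry({"start_video_off": True}, now=1000)
explicit = with_default_expiry({"exp": 5000, "eject_at_room_exp": False}, now=1000)
```

With this shape, a partial `dailyRoomProperties` payload such as `{"start_video_off": true}` still receives an expiration, which is the case that previously produced rooms with no expiry.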

  • Updated daily-python dependency to ~=0.28.0. (PR #4379)

Deprecated

  • Deprecated TransportParams.video_out_bitrate for the Daily transport. Use DailyParams.camera_out_send_settings instead to configure camera publishing encodings (bitrate, framerate, codec, etc.). (PR #4370)

Fixed

  • Fixed handling of tool calls that have no registered handler. Such calls now fail with a normal final tool result instead of leaving tool-call state hanging. (PR #4301)

  • Fixed pipecat-ai[tavus] not installing the required daily-python dependency. Installing the tavus extra now correctly pulls in pipecat-ai[daily]. (PR #4304)

  • Fixed audio loss and potential errors when STT settings were updated mid-speech. Previously, CartesiaSTTService and DeepgramSTTService would immediately disconnect and reconnect when settings changed, dropping any in-flight audio. Reconnection is now deferred until the user stops speaking, and audio arriving during the reconnect window is buffered and replayed. (PR #4311)

  • Fixed SmallestTTSService WebSocket endpoint URL to match Smallest AI v4.0.0 API (wss://waves-api.smallest.ai → wss://api.smallest.ai) and restored keepalive using a silent space message instead of the unsupported flush command. (PR #4320)

  • Fixed whitespace handling in TTS token streaming mode. Inter-token whitespace (e.g., spaces between words) is now preserved for correct prosody, while leading whitespace before the first non-whitespace token is still stripped to avoid issues with TTS models that are sensitive to leading spaces. (PR #4323)

  • Fixed SentryMetrics silently dropping MetricsFrames from stop_ttfb_metrics and stop_processing_metrics. SentryMetrics called the base FrameProcessorMetrics implementation but discarded its return value, so FrameProcessor never pushed the MetricsFrame downstream. This prevented observers (e.g. UserBotLatencyObserver, MetricsLogObserver) from seeing TTFB and processing metrics for any service using metrics=SentryMetrics(). The metrics were still calculated and Sentry transactions still completed — only the downstream frame push was affected. (PR #4325)
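
The bug pattern is an override that calls the base method but drops its return value. The classes below are hypothetical stand-ins (a dict plays the role of the MetricsFrame), intended only to show the shape of the problem and the fix.

```python
class BaseMetrics:
    def stop_ttfb_metrics(self):
        # Stands in for the MetricsFrame returned by the base implementation.
        return {"ttfb_secs": 0.12}


class BuggyMetrics(BaseMetrics):
    def stop_ttfb_metrics(self):
        super().stop_ttfb_metrics()  # result discarded
        # (tracing work would happen here)
        # Falls off the end, so the caller receives None.


class FixedMetrics(BaseMetrics):
    def stop_ttfb_metrics(self):
        frame = super().stop_ttfb_metrics()
        # (tracing work would happen here)
        return frame  # caller can now push the frame downstream
```

Because the caller only pushes a frame when it receives one, the missing `return` is enough to hide the metrics from every downstream observer.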

  • Fixed ElevenLabsTTSService and ElevenLabsHttpTTSService emitting word timestamps and TTSTextFrame content that matched the input text instead of the spoken audio when a pronunciation dictionary (pronunciation_dictionary_locators) or text normalization rewrote the input. Both services now consume ElevenLabs' normalized alignment, so downstream consumers (captions, transcripts, context aggregation) reflect what the listener actually hears. (PR #4344)

  • Fixed a crash in DeepgramSTTService when an STTUpdateSettingsFrame arrived before the WebSocket handshake completed (for example, when pushing an update upstream on StartFrame). The settings-triggered reconnect cancelled the in-flight connection task before its keepalive task was created, causing an UnboundLocalError: cannot access local variable 'keepalive_task' in the handler's finally block. (PR #4347)
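
The failure mode is a `finally` block that reads a local variable bound only after a step that can be interrupted. The sketch below reproduces it with plain functions; `connect_buggy`, `connect_fixed`, and `_stop` are hypothetical names, and a `ConnectionAbortedError` stands in for the task cancellation.

```python
stopped = []


def _stop(task: str) -> None:
    stopped.append(task)


def connect_buggy(cancel_early: bool) -> str:
    try:
        if cancel_early:
            raise ConnectionAbortedError("cancelled before the handshake finished")
        keepalive_task = "keepalive"  # only bound after a successful handshake
        return "connected"
    finally:
        # Raises UnboundLocalError when the early exit skipped the assignment.
        _stop(keepalive_task)


def connect_fixed(cancel_early: bool) -> str:
    keepalive_task = None  # bound before anything can fail
    try:
        if cancel_early:
            raise ConnectionAbortedError("cancelled before the handshake finished")
        keepalive_task = "keepalive"
        return "connected"
    finally:
        if keepalive_task is not None:
            _stop(keepalive_task)
```

In the buggy version the `UnboundLocalError` replaces the original exception, which is why the crash surfaced from the handler's `finally` block.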

  • Fixed direct-function registration crashing for functions without a docstring. DirectFunctionWrapper passed inspect.getdoc()'s result to docstring_parser.parse(), which raises when the docstring is None. Functions now register cleanly whether or not they have a docstring; an empty docstring produces empty description and parameter metadata as expected. (PR #4352)
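
`inspect.getdoc()` returns `None` for a function with no docstring, so any parser that expects a string needs a guard. The sketch below uses only the standard library; `describe` is a hypothetical helper and does not reproduce DirectFunctionWrapper's metadata extraction.

```python
import inspect


def describe(fn) -> dict:
    """Build minimal tool metadata, tolerating a missing docstring."""
    doc = inspect.getdoc(fn) or ""  # None becomes an empty string
    summary = doc.splitlines()[0] if doc else ""
    return {"name": fn.__name__, "description": summary}


def documented(city: str):
    """Look up the weather for a city."""


def undocumented(city: str):
    pass
```

Here `describe(undocumented)` yields an empty description instead of raising.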

  • Fixed AssemblyAISTTService, CartesiaSTTService, GradiumSTTService, and SonioxSTTService crashing the pipeline on transient WebSocket send failures. Each run_stt sent audio directly without catching errors, so a single network hiccup mid-stream raised an uncaught exception through process_frame. The guards now log a warning and let the connection-state check on the next call handle recovery, matching the pattern used by Deepgram, xAI, Azure, and other push-based STTs. (PR #4352)

  • Fixed Gemini Live losing conversation history in the (rare) case of a WebSocket reconnect before any session resumption handle is received. When the session reconnects (e.g. on system instruction change), conversation history is now re-seeded into the new session before it is marked ready for input. (PR #4355)

  • Fixed SmallWebRTC data channel silently stalling on networks with a 1280-byte MTU (IPv6, Tailscale overlays, many consumer VPNs). aiortc's default SCTP chunk size of 1200 bytes produces ~1305-byte UDP datagrams after headers, which the kernel rejects with EMSGSIZE; aiortc has no path-MTU discovery so it retransmits forever at the same oversized size. The chunk size is now clamped to 1100 bytes (~1205-byte datagrams, ~75 bytes of slack). Override with PIPECAT_SCTP_MAX_CHUNK_SIZE if your path MTU requires a different value. (PR #4358)
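
The arithmetic behind the clamp can be checked directly. The per-datagram overhead below is derived from the approximate figures in this entry (1305 - 1200), not measured, and `chunk_size_from_env` is a hypothetical helper: how Pipecat actually parses `PIPECAT_SCTP_MAX_CHUNK_SIZE` is an assumption.

```python
import os

PATH_MTU = 1280
# Approximate header overhead implied by the figures above.
OVERHEAD = 1305 - 1200


def datagram_size(chunk_size: int) -> int:
    return chunk_size + OVERHEAD


def fits(chunk_size: int, mtu: int = PATH_MTU) -> bool:
    return datagram_size(chunk_size) <= mtu


def chunk_size_from_env(default: int = 1100) -> int:
    # Assumed parsing: a plain integer number of bytes.
    return int(os.environ.get("PIPECAT_SCTP_MAX_CHUNK_SIZE", default))


slack = PATH_MTU - datagram_size(1100)
```

On a path with an even smaller MTU, the same calculation gives the largest chunk size that still fits: `mtu - OVERHEAD`.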

[1.0.0] - 2026-04-14

Migration guide: https://docs.pipecat.ai/pipecat/migration/migration-1.0

Added

  • Updated LemonSlice transport:

    • Added on_avatar_connected and on_avatar_disconnected events triggered when the avatar joins and leaves the room.
    • Added api_url parameter to LemonSliceNewSessionRequest to allow overriding the LemonSlice API endpoint.
    • Added support for passing arbitrary named parameters to the LemonSlice API endpoint. (PR #3995)
  • Added Inworld Realtime LLM service with WebSocket-based cascade STT/LLM/TTS, semantic VAD, function calling, and Router support. (PR #4140)

  • ⚠️ Added WebSocket-based OpenAIResponsesLLMService as the new default for the OpenAI Responses API. It maintains a persistent connection to wss://api.openai.com/v1/responses and automatically uses previous_response_id to send only incremental context, falling back to full context on reconnection or cache miss. The previous HTTP-based implementation is now available as OpenAIResponsesHttpLLMService. (PR #4141)

  • Added group_parallel_tools parameter to LLMService (default True). When True, all function calls from the same LLM response batch share a group ID and the LLM is triggered exactly once after the last call completes. Set to False to trigger inference independently for each function call result as it arrives. (PR #4217)

  • Added async function call support to register_function() and register_direct_function() via cancel_on_interruption=False. When set to False, the LLM continues the conversation immediately without waiting for the function result. The result is injected back into the context as a developer message once available, triggering a new LLM inference at that point. (PR #4217)

  • Added enable_prompt_caching setting to AWSBedrockLLMService for Bedrock ConverseStream prompt caching. (PR #4219)

  • Added support for streaming intermediate results from async function calls. Call result_callback multiple times with properties=FunctionCallResultProperties(is_final=False) to push incremental updates, then call it once more (with is_final=True, the default) to deliver the final result. Only valid for functions registered with cancel_on_interruption=False. (PR #4230)

  • Added LLMMessagesTransformFrame to facilitate programmatically editing context in a frame-based way.

    The previous approach required the caller to directly grab a reference to the context object, grab a "snapshot" of its messages at that point in time, transform the messages, and then push an LLMMessagesUpdateFrame with the transformed messages. This approach can lead to problems: what if there had already been a change to the context queued in the pipeline? The transformed messages would simply overwrite it without consideration. (PR #4231)
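
The race can be illustrated with plain lists. In this sketch each queued context change is modeled as a function applied in order; the helper names are hypothetical and the real frames carry more than a message list.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Op = Callable[[List[Message]], List[Message]]


def apply_in_order(messages: List[Message], ops: List[Op]) -> List[Message]:
    for op in ops:
        messages = op(messages)
    return messages


def redact(message: Message) -> Message:
    return {**message, "content": message["content"].replace("1234", "****")}


context = [{"role": "user", "content": "my pin is 1234"}]

# Snapshot approach: the transformed list is computed up front.
snapshot_result = [redact(m) for m in context]


def queued_append(msgs):  # a change already queued in the pipeline
    return msgs + [{"role": "user", "content": "call me back"}]


def overwrite_with_snapshot(msgs):  # update-style: ignores current state
    return snapshot_result


def transform_in_place(msgs):  # transform-style: runs on current state
    return [redact(m) for m in msgs]


lost = apply_in_order(context, [queued_append, overwrite_with_snapshot])
kept = apply_in_order(context, [queued_append, transform_in_place])
```

The overwrite discards the queued append (`lost` has one message), while the transform sees it (`kept` has two).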

  • The development runner now exports a module-level app FastAPI instance (from pipecat.runner.run import app) so you can register custom routes before calling main(). (PR #4234)

  • ToolsSchema now accepts custom_tools for OpenAI LLM services (OpenAILLMService, OpenAIResponsesLLMService, OpenAIResponsesHttpLLMService, and OpenAIRealtimeLLMService), letting you pass provider-specific tools like tool_search alongside standard function tools. (PR #4248)

  • Added enhancements to NvidiaTTSService:

    • Cross-sentence stitching: multiple sentences within an LLM turn are fed into a single SynthesizeOnline gRPC stream for seamless audio across sentence boundaries (requires Magpie TTS model v1.7.0+).
    • custom_dictionary and encoding parameters for IPA-based custom pronunciation and output audio encoding.
    • Metrics generation (can_generate_metrics returns true) and stop_all_metrics() when an audio context is interrupted.
    • gRPC error handling around synthesis config retrieval (GetRivaSynthesisConfig). (PR #4249)
  • Added MistralTTSService for streaming text-to-speech using Mistral's Voxtral TTS API (voxtral-mini-tts-2603). Supports SSE-based audio streaming with automatic resampling from the API's native 24kHz to any requested sample rate. Requires the mistral optional extra (pip install pipecat-ai[mistral]). (PR #4251)

  • Added truncate_large_values parameter to LLMContext.get_messages(). When True, returns compact deep copies of messages with binary data (base64 images, audio) replaced by short placeholders and long string values in LLM-specific messages recursively truncated. Useful for serialization, logging, and debugging tools. (PR #4272)

  • CartesiaSTTService now supports runtime settings updates (e.g. changing language or model via STTUpdateSettingsFrame). The service automatically reconnects with the new parameters. Previously, settings updates were silently ignored. (PR #4282)

  • Added pcm_32000 and pcm_48000 sample rate support to ElevenLabs TTS services. (PR #4293)

  • Added enable_logging parameter to ElevenLabsHttpTTSService. Set to False to enable zero retention mode (enterprise only). (PR #4293)

Changed

  • Updated onnxruntime from 1.23.2 to 1.24.3, adding support for Python 3.14. (PR #3984)

  • MCPClient now requires async with MCPClient(...) as mcp: or explicit start()/close() calls to manage the connection lifecycle. (PR #4034)

  • ⚠️ Updated langchain extra to require langchain 1.x (from 0.3.x), langchain-community 0.4.x (from 0.3.x), and langchain-openai 1.x (from 0.3.x). If you pin these packages in your project, update your pins accordingly. (PR #4192)

  • WebsocketService reconnection errors are now non-fatal. When a websocket service exhausts its reconnection attempts (either via exponential backoff or quick failure detection), it emits a non-fatal ErrorFrame instead of a fatal one. This allows application-level failover (e.g. ServiceSwitcher) to handle the failure instead of killing the entire pipeline. (PR #4201)

  • Changed GrokLLMService default model from grok-3-beta to grok-3, now that the model is generally available. (PR #4209)

  • GoogleImageGenService now defaults to imagen-4.0-generate-001 (previously imagen-3.0-generate-002). (PR #4213)

  • ⚠️ BaseOpenAILLMService.get_chat_completions() now accepts an LLMContext instead of OpenAILLMInvocationParams. If you override this method, update your signature accordingly. (PR #4215)

  • When multiple function calls are returned in a single LLM response, by default (when group_parallel_tools=True) the LLM is now triggered exactly once after the last call in the batch completes, rather than waiting for all function calls. (PR #4217)

  • ⚠️ LLMService.function_call_timeout_secs now defaults to None instead of 10.0. Deferred function calls will run indefinitely unless a timeout is explicitly set at the service level or per-call. If you relied on the previous 10-second default, pass function_call_timeout_secs=10.0 explicitly. (PR #4224)
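
The practical effect of a `None` timeout can be seen with `asyncio.wait_for`, which waits indefinitely when `timeout=None`. This is only an analogy under assumptions: `run_tool` is a hypothetical helper, and Pipecat's own timeout handling may be implemented differently.

```python
import asyncio
from typing import Optional


async def run_tool(delay_secs: float, timeout_secs: Optional[float]) -> str:
    async def tool() -> str:
        await asyncio.sleep(delay_secs)
        return "done"

    try:
        # timeout=None means "no limit": the call runs until it finishes.
        return await asyncio.wait_for(tool(), timeout=timeout_secs)
    except asyncio.TimeoutError:
        return "timed out"


no_limit = asyncio.run(run_tool(0.05, None))
with_limit = asyncio.run(run_tool(0.05, 0.01))
```

Code that depended on slow tools being cut off after 10 seconds needs to set the timeout explicitly to keep that behavior.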

  • Updated NvidiaTTSService:

    • Made api_key optional for local NIM deployments.
    • Voice, language, and quality can be updated without reconnecting the gRPC client; new values take effect on the next synthesis turn, not for the current turn's in-flight requests.
    • Replaced per-sentence synchronous synthesize_online calls with async queue-backed gRPC streaming.
    • Streaming now uses asyncio tasks with explicit gRPC cancellation on interruption and stale-response filtering when a stream is aborted or replaced.
    • Renamed Riva references to Nemotron Speech in docs and messages.
    • Disabled automatic TTS start frames at the service level (push_start_frame=False) and emit TTSStartedFrame when a stitched synthesis stream is started for a context. (PR #4249)

Removed

  • ⚠️ Removed OpenPipeLLMService and the openpipe extra. OpenPipe was acquired by CoreWeave and the package is no longer maintained. If you were using openpipe as an LLM provider, switch to the underlying provider directly (e.g. openai). The OpenPipe interface can still be used with OpenAILLMService by specifying a base_url. (PR #4191)

  • ⚠️ Removed NoisereduceFilter. Use system-level noise reduction or a service-based alternative instead. (PR #4204)

  • ⚠️ Removed deprecated vad_enabled and vad_audio_passthrough transport params. (PR #4204)

  • ⚠️ Removed deprecated camera_in_enabled, camera_in_is_live, camera_in_width, camera_in_height, camera_out_enabled, camera_out_is_live, camera_out_width, camera_out_height, and camera_out_color transport params. Use the video_in_* and video_out_* equivalents instead. (PR #4204)

  • ⚠️ Removed FrameProcessor.wait_for_task(). Use create_task() and manage tasks with the built-in TaskManager instead. (PR #4204)

  • ⚠️ Removed deprecated transport frames: TransportMessageFrame, TransportMessageUrgentFrame, InputTransportMessageUrgentFrame, DailyTransportMessageFrame, and DailyTransportMessageUrgentFrame. Use OutputTransportMessageFrame, OutputTransportMessageUrgentFrame, InputTransportMessageFrame, DailyOutputTransportMessageFrame, and DailyOutputTransportMessageUrgentFrame instead. (PR #4204)

  • ⚠️ Removed create_default_resampler() from pipecat.audio.utils. (PR #4204)

  • ⚠️ Removed DailyRunner.configure_with_args(). Use PipelineRunner with RunnerArguments instead. (PR #4204)

  • ⚠️ Removed deprecated on_pipeline_ended, on_pipeline_cancelled, and on_pipeline_stopped events from PipelineTask. Use on_pipeline_finished instead. (PR #4204)

  • ⚠️ Removed single-argument function call support from LLMService. Functions must use named parameters instead of a single arguments parameter. (PR #4204)

  • ⚠️ Removed FalSmartTurnAnalyzer and LocalSmartTurnAnalyzer. (PR #4204)

  • ⚠️ Removed RTVIObserver.errors_enabled parameter. (PR #4204)

  • ⚠️ Removed deprecated RTVI models, frames, and processor methods including RTVIConfig, RTVIServiceConfig, RTVIServiceOptionConfig, various RTVI*Data models, RTVIActionFrame, and RTVIProcessor.handle_function_call/handle_function_call_start. Use the updated RTVI processor API instead. (PR #4204)

  • ⚠️ Removed deprecated KeypadEntryFrame alias. (PR #4204)

  • ⚠️ Removed deprecated interruption frames: StartInterruptionFrame and BotInterruptionFrame. Use InterruptionFrame and InterruptionTaskFrame instead. (PR #4204)

  • ⚠️ Removed LLMService.request_image_frame(). Push a UserImageRequestFrame instead. (PR #4204)

  • ⚠️ Removed TTSService.say(). Push a TTSSpeakFrame into the pipeline instead. (PR #4204)

  • ⚠️ Removed KrispFilter. The krisp extra has been removed from pyproject.toml. (PR #4204)

  • ⚠️ Removed AudioBufferProcessor.user_continuous_stream parameter. Use user_audio_passthrough instead. (PR #4204)

  • ⚠️ Removed LLMService.start_callback parameter. Register an on_llm_response_start event handler instead. (PR #4204)

  • ⚠️ Removed deprecated observers field from PipelineParams. Pass observers directly to PipelineTask constructor instead. (PR #4204)

  • ⚠️ Removed deprecated pipecat.services.openai_realtime package. Use pipecat.services.openai.realtime instead. (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.google.llm_vertex module. Use pipecat.services.google.vertex.llm instead. (PR #4208)

  • ⚠️ Removed deprecated GoogleLLMOpenAIBetaService from pipecat.services.google.openai. Use GoogleLLMService from pipecat.services.google.llm instead. (PR #4208)

  • ⚠️ Removed deprecated OpenAIRealtimeBetaLLMService and AzureRealtimeBetaLLMService. Use OpenAIRealtimeLLMService and AzureRealtimeLLMService from pipecat.services.openai.realtime and pipecat.services.azure.realtime instead. (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.ai_services module. Import from pipecat.services.ai_service, pipecat.services.llm_service, pipecat.services.stt_service, pipecat.services.tts_service, etc. instead. (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.gemini_multimodal_live package. Use pipecat.services.google.gemini_live instead. Note that class names no longer include "Multimodal" (e.g. GeminiMultimodalLiveLLMService → GeminiLiveLLMService). (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.google.gemini_live.llm_vertex module. Use pipecat.services.google.gemini_live.vertex.llm instead. (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.nim package. Use pipecat.services.nvidia.llm instead (NimLLMService → NvidiaLLMService). (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.deepgram.stt_sagemaker and pipecat.services.deepgram.tts_sagemaker modules. Use pipecat.services.deepgram.sagemaker.stt and pipecat.services.deepgram.sagemaker.tts instead. (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.aws_nova_sonic package. Use pipecat.services.aws.nova_sonic instead. (PR #4208)

  • ⚠️ Removed deprecated pipecat.services.riva package. Use pipecat.services.nvidia.stt and pipecat.services.nvidia.tts instead (RivaSTTService → NvidiaSTTService, RivaTTSService → NvidiaTTSService). (PR #4208)

  • ⚠️ Removed deprecated compatibility modules: pipecat.services.openai_realtime_beta (use pipecat.services.openai.realtime), pipecat.services.openai_realtime.context, pipecat.services.openai_realtime.frames, pipecat.services.openai.realtime.context, pipecat.services.openai.realtime.frames, pipecat.services.gemini_multimodal_live (use pipecat.services.google.gemini_live), pipecat.services.aws_nova_sonic.context (use pipecat.services.aws.nova_sonic), pipecat.services.google.openai and pipecat.services.google.llm_openai (use pipecat.services.google.llm). (PR #4215)

  • ⚠️ Removed VisionImageFrameAggregator (from pipecat.processors.aggregators.vision_image_frame). Vision/image handling is now built into LLMContext (from pipecat.processors.aggregators.llm_context). See the 12* examples for the recommended replacement pattern. (PR #4215)

  • ⚠️ Removed OpenAILLMContext, OpenAILLMContextFrame, and OpenAILLMContext.from_messages(). Use LLMContext (from pipecat.processors.aggregators.llm_context) and LLMContextFrame (from pipecat.frames.frames) instead. All services now exclusively use the universal LLMContext.

    From the developer's point of view, migrating will usually be a matter of going from this:

    context = OpenAILLMContext(messages, tools)
    context_aggregator = llm.create_context_aggregator(context)
    

    To this:

    from pipecat.processors.aggregators.llm_context import LLMContext
    from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
    
    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    (PR #4215)

  • ⚠️ Removed deprecated frame types LLMMessagesFrame and OpenAILLMContextAssistantTimestampFrame from pipecat.frames.frames. Instead of LLMMessagesFrame, use LLMContextFrame with the new messages, or LLMMessagesUpdateFrame with run_llm=True. (PR #4215)

  • ⚠️ Removed GatedOpenAILLMContextAggregator (from pipecat.processors.aggregators.gated_open_ai_llm_context). Use GatedLLMContextAggregator (from pipecat.processors.aggregators.gated_llm_context) instead. (PR #4215)

  • ⚠️ Removed deprecated service-specific context and aggregator machinery, which was superseded by the universal LLMContext system.

    Service-specific classes removed: AnthropicLLMContext, AnthropicContextAggregatorPair, AWSBedrockLLMContext, AWSBedrockContextAggregatorPair, OpenAIContextAggregatorPair, and their user/assistant aggregators. Also removed create_context_aggregator() from LLMService, OpenAILLMService, AnthropicLLMService, and AWSBedrockLLMService.

    Base aggregator classes removed (from pipecat.processors.aggregators.llm_response): BaseLLMResponseAggregator, LLMContextResponseAggregator, LLMUserContextAggregator, LLMAssistantContextAggregator, LLMUserResponseAggregator, LLMAssistantResponseAggregator.

    From the developer's point of view, migrating will usually be a matter of going from this:

    context = OpenAILLMContext(messages, tools)
    context_aggregator = llm.create_context_aggregator(context)
    

    To this:

    from pipecat.processors.aggregators.llm_context import LLMContext
    from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
    
    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    (PR #4215)

  • ⚠️ Removed deprecated service parameters and shims that have been replaced by the settings=Service.Settings(...) pattern or direct __init__ parameters:

    • PollyTTSService alias (use AWSTTSService)
    • TTSService: text_aggregator, text_filter init params
    • AWSNovaSonicLLMService: send_transcription_frames init param
    • DeepgramSTTService: url init param (use base_url)
    • FishAudioTTSService: model init param (use reference_id or settings)
    • GladiaSTTService: language and confidence from GladiaInputParams, InputParams class alias
    • GeminiTTSService: api_key init param
    • GeminiLiveLLMService: base_url init param (use http_options)
    • GoogleVertexLLMService: InputParams class with location/project_id fields (use direct init params); project_id is now required, location defaults to "us-east4"
    • MiniMaxHttpTTSService: english_normalization from InputParams (use text_normalization)
    • SimliVideoService: simli_config init param (use api_key/face_id), use_turn_server init param; api_key and face_id are now required
    • AnthropicLLMService: enable_prompt_caching_beta from InputParams (use enable_prompt_caching) (PR #4220)
  • ⚠️ Removed deprecated pipecat.transports.services and pipecat.transports.network module aliases. Update imports to use pipecat.transports.daily.transport, pipecat.transports.livekit.transport, pipecat.transports.websocket.*, pipecat.transports.webrtc.*, and pipecat.transports.daily.utils respectively. (PR #4225)

  • ⚠️ Removed deprecated pipecat.sync package. Use pipecat.utils.sync instead. (PR #4225)

  • ⚠️ Removed deprecated TranscriptionMessage, ThoughtTranscriptionMessage, and TranscriptionUpdateFrame from pipecat.frames.frames. (PR #4228)

  • ⚠️ Removed deprecated allow_interruptions parameter from PipelineParams, StartFrame, and FrameProcessor. Interruptions are now always allowed by default. Use LLMUserAggregator's user_turn_strategies / user_mute_strategies parameters to control interruption behavior. (PR #4228)

  • ⚠️ Removed deprecated STTMuteFilter, STTMuteConfig, and STTMuteStrategy from pipecat.processors.filters.stt_mute_filter. Use pipecat.turns.user_mute strategies with LLMUserAggregator's user_mute_strategies parameter instead. (PR #4228)

  • ⚠️ Removed deprecated pipecat.processors.transcript_processor module (TranscriptProcessor, TranscriptProcessorConfig). Use pipeline observers instead. (PR #4228)

  • ⚠️ Removed deprecated EmulateUserStartedSpeakingFrame and EmulateUserStoppedSpeakingFrame frames, and the emulated field from UserStartedSpeakingFrame / UserStoppedSpeakingFrame. (PR #4228)

  • ⚠️ Removed deprecated interruption_strategies parameter from PipelineParams, StartFrame, and FrameProcessor. Use LLMUserAggregator's user_turn_strategies parameter instead. (PR #4228)

  • ⚠️ Removed deprecated pipecat.audio.interruptions module (BaseInterruptionStrategy, MinWordsInterruptionStrategy). Use pipecat.turns.user_start.MinWordsUserTurnStartStrategy with LLMUserAggregator's user_turn_strategies parameter instead. (PR #4228)

  • ⚠️ Removed deprecated pipecat.utils.tracing.class_decorators module. Use pipecat.utils.tracing.service_decorators instead. (PR #4228)

  • ⚠️ Removed deprecated add_pattern_pair method from PatternPairAggregator. Use add_pattern instead. (PR #4228)

  • ⚠️ Removed deprecated UserResponseAggregator class from pipecat.processors.aggregators.user_response. Use LLMUserAggregator instead. (PR #4228)

  • ⚠️ Removed ExternalUserTurnStrategies and the automatic fallback to it in LLMUserAggregator when a SpeechControlParamsFrame was received from the transport. (PR #4229)

  • ⚠️ Removed vad_analyzer and turn_analyzer parameters from TransportParams and all transport input classes, along with all deprecated VAD/turn analysis logic in BaseInputTransport. VAD and turn detection are now handled entirely by LLMUserAggregator. (PR #4229)

  • ⚠️ Removed deprecated TranscriptionUserTurnStopStrategy alias (deprecated in 0.0.102). Use SpeechTimeoutUserTurnStopStrategy instead. (PR #4232)

  • ⚠️ Removed deprecated vad_events setting and should_interrupt parameter from DeepgramSTTService (deprecated in 0.0.99). Use Silero VAD for voice activity detection instead. (PR #4232)

  • ⚠️ Removed deprecated send_transcription_frames parameter from OpenAIRealtimeLLMService (deprecated in 0.0.92). Transcription frames are always sent. (PR #4232)

  • ⚠️ Removed deprecated UserIdleProcessor (deprecated in 0.0.100). Use LLMUserAggregator with the user_idle_timeout parameter instead. (PR #4232)

  • ⚠️ Removed deprecated UserBotLatencyLogObserver (deprecated in 0.0.102). Use UserBotLatencyObserver with its on_latency_measured event handler instead. (PR #4232)

  • ⚠️ Removed the riva install extra. Use nvidia instead (pip install "pipecat-ai[nvidia]"). (PR #4235)

  • Removed the empty remote-smart-turn install extra (was already a no-op). (PR #4235)

  • ⚠️ Removed DeprecatedModuleProxy and all service __init__.py re-export shims. Flat imports like from pipecat.services.openai import OpenAILLMService no longer work. Use the full submodule path instead: from pipecat.services.openai.llm import OpenAILLMService. This is already the established pattern across all examples and internal code. (PR #4239)

  • ⚠️ Removed deprecated PIPECAT_OBSERVER_FILES environment variable support. Use PIPECAT_SETUP_FILES instead. (PR #4267)

Fixed

  • Fixed IdleFrameProcessor where asyncio.Event was unconditionally cleared in a finally block instead of only on the success path. (PR #3796)

  • Fixed MCPClient opening a new connection for every tool call instead of reusing the session. (PR #4034)

  • GoogleLLMService now applies a low-latency thinking default (thinking_level="minimal") for Gemini 3+ Flash models. (PR #4067)

  • Fixed WebsocketService entering an infinite reconnection loop when a server accepts the WebSocket handshake but immediately closes the connection (e.g. invalid API key, close code 1008). The service now detects connections that fail repeatedly within seconds of being established and stops retrying after 3 consecutive quick failures. (PR #4201)
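
Quick-failure detection can be sketched as a counter that resets whenever a connection survives long enough. The limit of 3 consecutive failures comes from this entry; the 5-second window and the class `QuickFailureTracker` are assumptions, since the entry only says "within seconds".

```python
class QuickFailureTracker:
    """Hypothetical tracker for connections that close right after opening."""

    def __init__(self, quick_window_secs: float = 5.0, max_quick_failures: int = 3):
        self.quick_window_secs = quick_window_secs  # assumed threshold
        self.max_quick_failures = max_quick_failures
        self.consecutive = 0

    def should_retry(self, connected_for_secs: float) -> bool:
        """Record one closed connection and report whether to reconnect."""
        if connected_for_secs < self.quick_window_secs:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a healthy connection resets the streak
        return self.consecutive < self.max_quick_failures
```

A server that accepts the handshake and closes immediately (for example with close code 1008) trips the limit on the third attempt, while an occasional drop after minutes of uptime does not.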

  • Fixed InworldHttpTTSService streaming responses crashing with UnicodeDecodeError when multi-byte UTF-8 characters were split across chunk boundaries. This caused TTS audio to cut off mid-sentence intermittently. (PR #4202)
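
The underlying issue is general: decoding each chunk independently fails when a multi-byte character straddles a chunk boundary. One standard-library way to handle it is an incremental decoder, sketched below; whether the service uses this exact approach is an assumption.

```python
import codecs
from typing import Iterable


def decode_stream(chunks: Iterable[bytes]) -> str:
    """Decode UTF-8 chunks, carrying partial characters across boundaries."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush and validate the tail
    return "".join(parts)


data = "héllo".encode("utf-8")
# Split in the middle of the two-byte "é".
chunks = [data[:2], data[2:]]
text = decode_stream(chunks)
```

Calling `chunks[0].decode("utf-8")` on its own raises `UnicodeDecodeError`, which is the crash described above.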

  • Fixed a crash (JSONDecodeError) when a user interruption occurs while the LLM is streaming function call arguments. Previously, the incomplete JSON arguments were passed directly to json.loads(), causing an unhandled exception. Affected services: OpenAI, Google (OpenAI-compatible), and SambaNova. (PR #4203)
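
A guarded parse is enough to keep truncated arguments from raising. In this sketch `parse_tool_arguments` is a hypothetical helper, and returning `None` for incomplete input is an assumption about how a caller might choose to skip the interrupted call.

```python
import json
from typing import Optional


def parse_tool_arguments(raw: str) -> Optional[dict]:
    """Return parsed arguments, or None if the JSON was cut off mid-stream."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


complete = parse_tool_arguments('{"city": "Paris"}')
interrupted = parse_tool_arguments('{"city": "Par')
```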

  • Fixed BaseOutputTransport discarding pending UninterruptibleFrame items (e.g. function-call context updates) when an interruption arrived. The audio task is now kept alive and only interruptible frames are drained when uninterruptible frames are present in the queue. (PR #4217)

  • Fixed spurious LLM inference being triggered when a function call result arrived while the user was actively speaking. The context frame is now suppressed until the user stops speaking. (PR #4217)

  • Fixed CartesiaTTSService failing with "Context has closed" errors when switching voice, model, or language via TTSUpdateSettingsFrame. The service now automatically flushes the current audio context and opens a fresh one when these settings change. (PR #4220)

  • Fixed duplicate LLM replies that could occur when multiple async function call results arrived while an LLM request was already queued. (PR #4230)

  • Fixed undefined _warn_deprecated_param calls in OpenAIRealtimeLLMService and GrokRealtimeLLMService for the deprecated session_properties init parameter. (PR #4232)

  • Fixed Gemini Live bot hanging after a session resumption reconnect. Audio, video, and text input were silently dropped after reconnecting because the internal _ready_for_realtime_input flag was not being reset. (PR #4242)

  • Fixed VADController getting stuck in the SPEAKING state when audio frames stop arriving mid-speech (e.g. user mutes mic). A new audio_idle_timeout parameter (default 1s, set to 0 to disable) forces a transition back to QUIET and emits on_speech_stopped when no audio is received while speaking. (PR #4244)

  • Fixed PipelineRunner._gc_collect() blocking the event loop by running gc.collect() synchronously. Now offloaded via asyncio.to_thread to avoid stalling concurrent pipeline tasks. (PR #4255)
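
The offloading pattern uses `asyncio.to_thread` (Python 3.9+). The sketch below is standalone; `collect_off_loop` and `main` are hypothetical names, and the ticker task exists only to show that other coroutines keep running while the collection is in flight.

```python
import asyncio
import gc


async def collect_off_loop() -> int:
    # Run the blocking call in a worker thread; the event loop stays free.
    return await asyncio.to_thread(gc.collect)


async def main() -> tuple:
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        for _ in range(3):
            ticks += 1
            await asyncio.sleep(0)

    # The ticker and the collection are scheduled concurrently.
    unreachable, _ = await asyncio.gather(collect_off_loop(), ticker())
    return unreachable, ticks


unreachable_count, tick_count = asyncio.run(main())
```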

  • Fixed ElevenLabsTTSService incorrectly enabling auto_mode when using TextAggregationMode.TOKEN. Auto mode disables server-side buffering and is designed for complete sentences — enabling it with token streaming degraded speech quality. The default is now derived automatically from the aggregation strategy: auto_mode=True for SENTENCE, auto_mode=False for TOKEN. Callers can still override by passing auto_mode explicitly. (PR #4265)

  • Fixed ValueError: write to closed file during pipeline shutdown when observers were active. Observer proxy tasks are now cancelled before observer resources are cleaned up. (PR #4267)

  • Fixed delayed turn completion when STT transcripts arrive after the p99 timeout. Previously, a late transcript (beyond the p99 window) would fall through to the 5-second user_turn_stop_timeout fallback. Now the turn stop triggers immediately when the late transcript arrives. (PR #4283)

  • Fixed ElevenLabsTTSService ignoring enable_logging=False and enable_ssml_parsing=False. The truthy check treated False the same as None (both skipped), and Python's str(False) produced "False" instead of the lowercase "false" expected by the API. (PR #4293)

  • Fixed on_assistant_turn_stopped not resetting internal state when the LLM returned no text tokens. Added interrupted field to AssistantTurnStoppedMessage to indicate whether the assistant turn was interrupted. (PR #4294)

  • Fixed LLMContextSummarizer failing with "No messages to summarize" when using system_instruction instead of a system-role message at the start of the context. The summarizer previously scanned the entire context for the first system message, which could match a mid-conversation injection (e.g. idle notifications) instead of the initial prompt, causing the summarization range to be empty. (PR #4295)
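
The ElevenLabsTTSService entry above (PR #4293) comes down to two Python pitfalls: a truthiness check cannot distinguish False from None, and str(False) yields "False" rather than "false". The sketch below shows the pattern under stated assumptions; build_query is a hypothetical helper, not Pipecat's actual code.

```python
from typing import Optional


def build_query(enable_logging: Optional[bool] = None,
                enable_ssml_parsing: Optional[bool] = None) -> dict:
    """Hypothetical helper: turn optional boolean flags into query parameters."""
    params = {}
    flags = {
        "enable_logging": enable_logging,
        "enable_ssml_parsing": enable_ssml_parsing,
    }
    for name, value in flags.items():
        # `if value:` would skip an explicit False; compare against None instead.
        if value is not None:
            # str(False) is "False"; lowercase it to get "false".
            params[name] = str(value).lower()
    return params


print(build_query(enable_logging=False))  # {'enable_logging': 'false'}
print(build_query())                      # {}
```

Leaving a flag unset still omits it from the query, so a default of None keeps the provider's own default in effect.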

[0.0.108] - 2026-03-27

Added

  • Added SarvamLLMService with support for sarvam-30b, sarvam-30b-16k, sarvam-105b and sarvam-105b-32k. (PR #3978)

  • Added on_turn_context_created(context_id) hook to TTSService. Override this to perform provider-specific setup (e.g. eagerly opening a server-side context) before text starts flowing. Called each time a new turn context ID is created. (PR #4013)

  • Added XAIHttpTTSService for text-to-speech using xAI's HTTP TTS API. (PR #4031)

  • Added support for "developer" role messages in conversation context across all LLM adapters. For non-OpenAI services (Anthropic, Google, AWS Bedrock), "developer" messages are converted to "user" messages (use system_instruction to set the system instruction). For OpenAI services, "developer" messages pass through in conversation history. For the Responses API, they are kept as "developer" role (matching the existing "system" → "developer" conversion). (PR #4089)

  • Added SmallestTTSService, a WebSocket-based TTS service integration with Smallest AI's Waves API. Supports the Lightning v2 and v3.1 models with configurable voice, language, speed, consistency, similarity, and enhancement settings. (PR #4092)

  • Added warnings in turn stop strategies when VADParams.stop_secs differs from the recommended default (0.2s) or when stop_secs >= STT p99 latency, which collapses the STT wait timeout to 0s and may cause delayed turn detection. The warnings guide developers to re-run the stt-benchmark with their VAD settings. (PR #4115)

  • Added domain parameter to AssemblyAISTTSettings for specialized recognition modes such as Medical Mode (domain="medical-v1"). (PR #4117)

  • Added NovitaLLMService for using Novita AI's LLM models via their OpenAI-compatible API. (PR #4119)

  • Added cleanup() method to VADAnalyzer and VADController so VAD analyzer resources are properly released when no longer needed. Custom VADAnalyzer subclasses can override cleanup() to free any held resources. (PR #4120)

  • Added on_end_of_turn event handler to AssemblyAISTTService. This fires after the final transcript is pushed, providing a reliable hook for end-of-turn logic that doesn't race with TranscriptionFrame. Works in both Pipecat and AssemblyAI turn detection modes. (PR #4128)

  • Added DeepgramFluxSageMakerSTTService for running Deepgram Flux speech-to-text on AWS SageMaker endpoints. Use with ExternalUserTurnStrategies to take advantage of Flux's turn detection. (PR #4143)

  • Added Mem0MemoryService.get_memories() convenience method for retrieving all stored memories outside the pipeline (e.g. to build a personalized greeting at connection time). This avoids the need to manually handle client type branching, filter construction, and async wrapping. (PR #4156)
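
The stop_secs warnings above (PR #4115) can be pictured with a small calculation. The sketch assumes the STT wait timeout is the STT p99 latency minus stop_secs, floored at zero; that relationship is inferred from the wording of the entry, and stt_wait_timeout is a hypothetical name rather than a Pipecat function.

```python
import logging

logger = logging.getLogger("turn_stop_sketch")

RECOMMENDED_STOP_SECS = 0.2  # recommended default named in the changelog entry


def stt_wait_timeout(stop_secs: float, stt_p99_secs: float) -> float:
    """Assumed relationship: time left to wait for STT after VAD reports a stop."""
    if stop_secs != RECOMMENDED_STOP_SECS:
        logger.warning("stop_secs=%s differs from the recommended %s",
                       stop_secs, RECOMMENDED_STOP_SECS)
    if stop_secs >= stt_p99_secs:
        logger.warning("stop_secs=%s >= STT p99=%s; STT wait timeout collapses to 0s",
                       stop_secs, stt_p99_secs)
    # Never return a negative wait.
    return max(0.0, stt_p99_secs - stop_secs)


print(stt_wait_timeout(1.0, 0.5))  # 0.0
```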

Changed

  • Added context prewarming path for InworldTTSService to improve first audio latency. (PR #4013)

  • Added KrispVivaVadAnalyzer for Voice Activity Detection using the Krisp VIVA SDK (requires krisp_audio). (PR #4022)

  • Modified InworldTTSService to close context at end of turn instead of relying on idle timeout. (PR #4028)

  • Added Gemini 3 support to the Gemini Live service. (PR #4078)

  • TTSService: the default stop_frame_timeout_s (idle time before an automatic TTSStoppedFrame is pushed when push_stop_frames=True) has changed from 2.0 to 3.0 seconds. (PR #4084)

  • ⚠️ GeminiLLMAdapter now only treats messages[0] as the initial system message, matching all other adapters. Previously it searched for the first "system" message anywhere in the conversation history. A "system" message appearing later in the list will now be converted to "user" instead of being extracted as the system instruction. (PR #4089)

  • Fixed InworldTTSService to fall back to full text when TTS timestamps are not received. (PR #4113)

  • ⚠️ Realtime services (Gemini Live, OpenAI Realtime, Grok Realtime, Nova Sonic) now prefer system_instruction from service settings over an initial system message in the LLM context, matching the behavior of non-realtime services. Previously, context-provided system instructions took precedence. A warning is now logged when both are set. (PR #4130)

  • Bumped nvidia-riva-client minimum version to >=2.25.1. (PR #4136)

  • Upgraded protobuf from 5.x to 6.x (>=6.31.1,<7). (PR #4136)

  • Unrecognized language strings (e.g. Deepgram's "multi") no longer produce a warning at startup. The log message has been downgraded to debug level since these are valid service-specific values that are passed through correctly. (PR #4137)

  • GrokLLMService and GrokRealtimeLLMService now live in the pipecat.services.xai module alongside XAIHttpTTSService, since all three use the same xAI API. Update imports from pipecat.services.grok.* to pipecat.services.xai.* (e.g. from pipecat.services.xai.llm import GrokLLMService). (PR #4142)

  • ⚠️ Bumped mem0ai dependency from ~=0.1.94 to >=1.0.8,<2. Users of the mem0 extra will need to update their mem0ai package. (PR #4156)
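
The GeminiLLMAdapter change above (PR #4089) is easier to reason about with a concrete message list. The following is a minimal sketch of the described rule only; split_system_instruction is a hypothetical helper and does not reproduce the adapter's real code.

```python
from typing import Optional


def split_system_instruction(messages: list) -> tuple:
    """Hypothetical helper mirroring the rule described in the entry."""
    system_instruction: Optional[str] = None
    rest = list(messages)
    # Only the very first message can become the system instruction.
    if rest and rest[0].get("role") == "system":
        system_instruction = rest.pop(0)["content"]
    # A "system" message anywhere else is converted to "user".
    converted = [
        {**m, "role": "user"} if m.get("role") == "system" else m
        for m in rest
    ]
    return system_instruction, converted


history = [
    {"role": "user", "content": "Hi"},
    {"role": "system", "content": "The user has been idle."},
]
print(split_system_instruction(history))
```

With this history, no system instruction is extracted and the mid-conversation "system" message comes back with the "user" role.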

Deprecated

  • pipecat.services.grok.llm, pipecat.services.grok.realtime.llm, and pipecat.services.grok.realtime.events are deprecated. The old import paths still work but emit a DeprecationWarning; use pipecat.services.xai.llm, pipecat.services.xai.realtime.llm, and pipecat.services.xai.realtime.events instead. (PR #4142)

Removed

  • ⚠️ TTSService.add_word_timestamps() no longer supports the "Reset" and "TTSStoppedFrame" sentinel strings. If you have a custom TTS service that called await self.add_word_timestamps([("Reset", 0)]) or await self.add_word_timestamps([("TTSStoppedFrame", 0), ("Reset", 0)], ctx_id), replace them with await self.append_to_audio_context(ctx_id, TTSStoppedFrame(context_id=ctx_id)) and let _handle_audio_context manage the word-timestamp reset automatically. (PR #4145)

  • Removed SambaNovaSTTService. SambaNova no longer offers speech-to-text audio models. Use another STT provider instead. (PR #4154)

Fixed

  • Fixed Gemini Live (GoogleGeminiLiveLLMService) not honoring settings.system_instruction. The system instruction was being read from a deprecated constructor parameter instead of the settings object, causing it to be silently ignored. (PR #4089)

  • Fixed AWSBedrockLLMAdapter sending an empty message list to the API when the only message in context was a system message. The lone system message is now converted to "user" role instead of being extracted, matching the existing Anthropic adapter behavior. (PR #4089)

  • Fixed Gemini Live pipeline hanging indefinitely when an EndFrame was deferred while waiting for the bot to finish responding and turn_complete never arrived. As a possible root-cause fix, turn_complete messages are now handled even if they lack usage_metadata. As a fallback, the deferred EndFrame now has a 30-second safety timeout. (PR #4125)

  • Fixed ElevenLabs WebSocket disconnections (1008 "Maximum simultaneous contexts exceeded") caused by rapid user interruptions. When interruptions arrived before any TTS text was generated, phantom contexts were created on the ElevenLabs server that were never closed, eventually exceeding the 5-context limit. (PR #4126)

  • Fixed the final sentence being dropped from the conversation context when using RTVI text input with non-word-timestamp TTS services. The LLMFullResponseEndFrame was racing ahead of the last TTSTextFrame, causing the LLMAssistantAggregator to finalize the context before the final sentence arrived. (PR #4127)

  • Fixed audio crackling and popping in recordings when both user and bot are speaking. AudioBufferProcessor no longer injects silence into a track's buffer while that track is actively producing audio, preventing mid-utterance interruptions in the recorded output. (PR #4135)

  • Fixed websocket TTS word timestamps so interrupted contexts cannot leak stale words or backward PTS values into later turns. (PR #4145)

  • Fixed a race condition in InterruptibleTTSService where, if run_tts had been invoked but BotStartedSpeakingFrame had not yet been received, a user interruption could allow stale audio to leak through. (PR #4145)

  • Fixed Gemini Live local VAD mode (GeminiVADParams(disabled=True) with external VAD) not working. The bot now correctly detects user speech and signals turn boundaries to the Gemini API. (PR #4146)

  • Fixed Gemini Live message handling to process all server_content fields independently. Gemini 3.x can bundle multiple fields (e.g. model_turn and output_transcription) on the same message, but the previous elif chain only processed the first match, silently dropping the rest. (PR #4147)

  • Fixed ServiceSwitcher with ServiceSwitcherStrategyFailover incorrectly triggering failover when ErrorFrames from other pipeline stages (e.g. TTS) propagated upstream through the switcher. Previously, any non-fatal error passing through would be misattributed to the active service and trigger an unwanted service switch. Now only errors originating from the switcher's own managed services trigger failover. (PR #4149)

  • Fixed LiveKitOutputTransport not clearing the rtc.AudioSource internal buffer on interruption, causing the bot to continue speaking for several seconds after being interrupted. (PR #4151)

  • Fixed a crash in OpenAI LLM processing when the provider returns chunk.choices[0].delta.audio = None, which caused 'NoneType' object has no attribute 'get' errors during audio transcript handling. (PR #4152)

  • Fixed error floods in DeepgramSTTService when the WebSocket connection drops. With Deepgram SDK 6.x, send_media() raises exceptions on a dead connection instead of silently failing, causing every queued audio frame to log an error. Now send_media() failures are caught gracefully — a single warning is logged and audio frames are skipped until the existing reconnection logic restores the connection. (PR #4153)

  • Mem0MemoryService no longer blocks the event loop during memory storage and retrieval. All Mem0 API calls now run in a background thread, and message storage is fire-and-forget so it doesn't delay downstream processing. (PR #4156)

  • Fixed Mem0MemoryService failing to store messages when the context contained system or developer role messages. The Mem0 API only accepts user and assistant roles, so other roles are now filtered out before storing. (PR #4156)

  • Added missing on_dtmf_event callback to LemonSliceTransportClient.setup() DailyCallbacks construction, fixing a ValidationError at pipeline setup time. (PR #4161)

  • Fixed an issue in InworldTTSService where, in cases of fast interruption, we would continue receiving audio from the previous context. (PR #4167)

  • Fixed a word timestamp interleaving issue in InworldTTSService when processing multiple sentences. (PR #4167)

  • Fixed duplicate TTSStoppedFrame being pushed in TTS services using push_stop_frames=True. When the stop-frame timeout fired, a second TTSStoppedFrame could be pushed after the normal one at context completion. (PR #4172)

  • ⚠️ Fixed DeepgramSTTService compatibility with deepgram-sdk 6.1.0. The SDK now requires explicit message objects for send_keep_alive(), send_close_stream(), and send_finalize(). The minimum deepgram-sdk version is now 6.1.0. (PR #4174)

  • Fixed RTVI events not being delivered to clients when using WebSocket transports. ProtobufFrameSerializer now sets ignore_rtvi_messages=False by default. (PR #4176)

  • Fixed a timing issue where turn detection timer tasks (idle controller, speech timeout, turn analyzer, and turn completion) could miss their first tick because the newly created asyncio task was not yet scheduled when the caller continued. (PR #4183)

  • Fixed FastAPIWebsocketTransport intermittently hanging on shutdown when the remote side (e.g. Twilio) disconnects while audio is being sent. A race condition between the send and receive paths could cause the on_client_disconnected callback to be skipped, leaving the pipeline waiting for a disconnect signal that never came. (PR #4186)
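
The Gemini Live message-handling fix above (PR #4147) hinges on the difference between an elif chain and independent if checks. The sketch below uses a stand-in ServerContent dataclass; its field names follow the changelog entries, but the class and both handlers are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerContent:
    """Stand-in for a server message that may carry several fields at once."""
    model_turn: Optional[str] = None
    output_transcription: Optional[str] = None
    turn_complete: bool = False


def handle_with_elif(content: ServerContent) -> list:
    # Only the first matching branch runs; later fields are dropped.
    handled = []
    if content.model_turn:
        handled.append("model_turn")
    elif content.output_transcription:
        handled.append("output_transcription")
    elif content.turn_complete:
        handled.append("turn_complete")
    return handled


def handle_independently(content: ServerContent) -> list:
    # Every populated field is processed, whatever else is present.
    handled = []
    if content.model_turn:
        handled.append("model_turn")
    if content.output_transcription:
        handled.append("output_transcription")
    if content.turn_complete:
        handled.append("turn_complete")
    return handled


bundled = ServerContent(model_turn="audio-bytes", output_transcription="Hello")
print(handle_with_elif(bundled))      # ['model_turn']
print(handle_independently(bundled))  # ['model_turn', 'output_transcription']
```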

Performance

  • RimeTTSService now handles Rime's done WebSocket message to complete audio contexts immediately, eliminating the 3-second idle timeout that previously added latency at the end of each utterance. (PR #4172)

[0.0.107] - 2026-03-23

Added

  • Added frame_order parameter to SyncParallelPipeline. Set frame_order=FrameOrder.PIPELINE to push synchronized output frames in pipeline definition order (all frames from the first pipeline, then the second, etc.) instead of the default arrival order. (PR #4029)

  • Added sync_with_audio field to OutputImageRawFrame. When set to True, the output transport queues image frames with audio so they are displayed only after all preceding audio has been sent, enabling synchronized audio/image playback. (PR #4029)

  • Added OpenAIResponsesLLMService, a new LLM service that uses the OpenAI Responses API. Supports streaming text, function calling, usage metrics, and out-of-band inference. Works with the universal LLMContext and LLMContextAggregatorPair. See examples/foundational/07-interruptible-openai-responses.py and 14-function-calling-openai-responses.py. (PR #4074)

  • Added audio_out_auto_silence parameter to TransportParams (defaults to True). When set to False, the transport waits for audio data instead of inserting silence when the output queue is empty, which is useful for scenarios that require uninterrupted audio playback without artificial gaps. (PR #4104)

Changed

  • Renamed tracing span attributes to align with OpenTelemetry GenAI semantic conventions: gen_ai.system to gen_ai.provider.name, system to gen_ai.system_instructions, gen_ai.usage.cache_read_input_tokens to gen_ai.usage.cache_read.input_tokens, and gen_ai.usage.cache_creation_input_tokens to gen_ai.usage.cache_creation.input_tokens. (PR #3449)

  • DeepgramSageMakerTTSService now correctly routes audio through the base TTSService audio context queue. Audio frames are delivered via append_to_audio_context() instead of being pushed directly, enabling proper ordering, interruption handling, and start/stop frame lifecycle management. Interruptions now trigger a Clear message to Deepgram (flushing its text buffer) at the right time via on_audio_context_interrupted. (PR #4083)

  • GradiumTTSService now sends a per-context setup message with client_req_id before the first text message for each TTS context, following Gradium's multiplexing protocol. Previously, a single setup message was sent at connection time without a client_req_id, which prevented Gradium from associating requests with their sessions when using close_ws_on_eos=False. (PR #4091)
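
Dashboards or queries that still use the old span attribute names from the tracing rename above (PR #3449) may need a translation step. This is a small sketch for downstream tooling; the mapping comes from the entry itself, while migrate_attributes is a hypothetical helper.

```python
# Old attribute name -> new attribute name, as listed in the changelog entry.
RENAMED_ATTRIBUTES = {
    "gen_ai.system": "gen_ai.provider.name",
    "system": "gen_ai.system_instructions",
    "gen_ai.usage.cache_read_input_tokens": "gen_ai.usage.cache_read.input_tokens",
    "gen_ai.usage.cache_creation_input_tokens": "gen_ai.usage.cache_creation.input_tokens",
}


def migrate_attributes(attributes: dict) -> dict:
    """Return a copy with renamed keys; unknown keys pass through unchanged."""
    return {RENAMED_ATTRIBUTES.get(key, key): value for key, value in attributes.items()}


old_span = {"gen_ai.system": "openai", "gen_ai.usage.cache_read_input_tokens": 12}
print(migrate_attributes(old_span))
```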

Fixed

  • Fixed stale system_instruction in LLM tracing spans by reading from _settings.system_instruction instead of the removed _system_instruction attribute. (PR #3449)

  • Fixed SyncParallelPipeline breaking the Whisker debugger. (PR #4029)

  • Fixed SyncParallelPipeline race condition where concurrent SystemFrame processing (e.g. from RTVI) could corrupt sink queues and cause deadlocks. SystemFrames now take a fast path that passes them through without draining queued output. (PR #4029)

  • Fixed TTS frame ordering so that non-system frames always arrive in correct order relative to the TTSStartedFrame/TTSAudioRawFrame/TTSStoppedFrame sequence. Previously these frames could race ahead of or behind audio context frames, producing out-of-order output downstream. (PR #4075)

  • Fixed SarvamTTSService to route audio and error frames through append_to_audio_context() instead of push_frame(), ensuring correct behavior with audio contexts and interruptions. (PR #4082)

  • Fixed audio frame ordering and interruption handling in Fish Audio, LMNT, Neuphonic, and Rime NonJson TTS services. These services were bypassing the base TTSService audio context serialization queue by pushing audio frames directly, which could cause out-of-order frames and broken interruptions during speech. (PR #4090)

  • Fixed Genesys AudioHook serializer to always include the parameters field in protocol messages. The AudioHook protocol requires every message to carry a parameters object (even if empty), but _create_message omitted it when no parameters were provided. This caused clients that validate message structure (including the Genesys reference implementation) to reject pong and parameter-less closed responses, breaking server sequence tracking and preventing outputVariables from reaching the Architect flow. (PR #4093)
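
The essence of the AudioHook fix above is that the parameters key is emitted even when there is nothing to send. A minimal sketch follows; create_message here is a hypothetical stand-in for the serializer's private method, and every field other than parameters is illustrative rather than a full protocol message.

```python
import json
from typing import Optional


def create_message(msg_type: str, seq: int, parameters: Optional[dict] = None) -> str:
    """Hypothetical builder; only the always-present `parameters` key matters here."""
    message = {
        "type": msg_type,
        "seq": seq,
        # Always include the object, even when it is empty.
        "parameters": parameters if parameters is not None else {},
    }
    return json.dumps(message)


print(create_message("pong", 3))
```

A client that validates message structure then sees an empty object instead of a missing key.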

[0.0.106] - 2026-03-18

Added

  • Added optional service field to ServiceUpdateSettingsFrame (and its subclasses LLMUpdateSettingsFrame, TTSUpdateSettingsFrame, STTUpdateSettingsFrame) to target a specific service instance. When service is set, only the matching service applies the settings; others forward the frame unchanged. This enables updating a single service when multiple services of the same type exist in the pipeline. (PR #4004)

  • Added sip_provider and room_geo parameters to configure() in the Daily runner. These convenience parameters let callers specify a SIP provider name and geographic region directly without manually constructing DailyRoomProperties and DailyRoomSipParams. (PR #4005)

  • Added PerplexityLLMAdapter that automatically transforms conversation messages to satisfy Perplexity's stricter API constraints (strict role alternation, no non-initial system messages, last message must be user/tool). Previously, certain conversation histories could cause Perplexity API errors that didn't occur with OpenAI (PerplexityLLMService subclasses OpenAILLMService since Perplexity uses an OpenAI-compatible API). (PR #4009)

  • Added DTMF input event support to the Daily transport. Incoming DTMF tones are now received via Daily's on_dtmf_event callback and pushed into the pipeline as InputDTMFFrame, enabling bots to react to keypad presses from phone callers. (PR #4047)

  • Added WakePhraseUserTurnStartStrategy for triggering user turns based on wake phrases, with support for single_activation mode. Deprecates WakeCheckFilter. (PR #4064)

  • Added default_user_turn_start_strategies() and default_user_turn_stop_strategies() helper functions for composing custom strategy lists. (PR #4064)
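
The targeting rule for the new service field (PR #4004) can be sketched without any Pipecat imports. UpdateSettings and Service below are stand-in classes, not the real frame or service types. The entry does not say whether the matching service also forwards the frame, so this sketch forwards it in every case.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class UpdateSettings:
    """Stand-in for a settings-update frame with an optional target."""
    settings: dict
    service: Optional[Any] = None


@dataclass
class Service:
    name: str
    applied: dict = field(default_factory=dict)

    def process(self, frame: UpdateSettings) -> UpdateSettings:
        # Apply when the frame is untargeted, or when this instance is the target.
        if frame.service is None or frame.service is self:
            self.applied.update(frame.settings)
        # Forward the frame unchanged (an assumption for the matching case).
        return frame


primary = Service("primary")
backup = Service("backup")
frame = UpdateSettings({"voice": "alto"}, service=backup)
for svc in (primary, backup):
    frame = svc.process(frame)
print(primary.applied, backup.applied)  # {} {'voice': 'alto'}
```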

Changed

  • Changed tool result JSON serialization to use ensure_ascii=False, preserving UTF-8 characters instead of escaping them. This reduces context size and token usage for non-English languages. (PR #3457)

  • OpenAIRealtimeSTTService's noise_reduction parameter is now part of OpenAIRealtimeSTTSettings, making it runtime-updatable via STTUpdateSettingsFrame. The direct noise_reduction init argument is deprecated as of 0.0.106. (PR #3991)

  • Updated sarvamai dependency from 0.1.26a2 (alpha) to 0.1.26 (stable release). (PR #3997)

  • SimliVideoService now extends AIService instead of FrameProcessor, aligning it with the HeyGen and Tavus video services. It supports SimliVideoService.Settings(...) for configuration and uses start()/stop()/cancel() lifecycle methods. Existing constructor usage (api_key, face_id, etc.) remains unchanged. (PR #4001)

  • Updated pipecat-ai-small-webrtc-prebuilt to 2.4.0. (PR #4023)

  • Nova Sonic assistant text transcripts are now delivered in real-time using speculative text events instead of delayed final text events. Previously, assistant text only arrived after all audio had finished playing, causing laggy transcripts in client UIs. Speculative text arrives before each audio chunk, providing text synchronized with what the bot is saying. This also simplifies the internal text handling by removing the interruption re-push hack and assistant text buffer. (PR #4042)

  • Updated daily-python dependency to 0.25.0. (PR #4047)

  • Added enable_dialout parameter to configure() in pipecat.runner.daily to support dial-out rooms. Also narrowed misleading Optional type hints and deduplicated token expiry calculation. (PR #4048)

  • Extended ProcessFrameResult to stop strategies, allowing a stop strategy to short-circuit evaluation of subsequent strategies by returning STOP. (PR #4064)

  • GradiumSTTService now takes encoding and sample_rate constructor arguments, which are combined in the class to form the input_format. PCM accepts 8000, 16000, and 24000 Hz sample rates. (PR #4066)

  • Improved GradiumSTTService transcription accuracy by reworking how text fragments are accumulated and finalized. Previously, trailing words could be dropped when the server's flushed response arrived before all text tokens were delivered. The service now uses a short aggregation delay after flush to capture trailing tokens, producing complete utterances. (PR #4066)
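
The effect of the ensure_ascii=False change above (PR #3457) is easy to see with the standard library alone. The payload below is made up for illustration.

```python
import json

# Hypothetical tool result containing non-ASCII text.
result = {"city": "München", "greeting": "こんにちは"}

escaped = json.dumps(result)                        # default: ensure_ascii=True
preserved = json.dumps(result, ensure_ascii=False)  # keeps UTF-8 characters

print(escaped)
print(preserved)
print(len(escaped), len(preserved))
```

Both strings decode to the same object; the preserved form is shorter because each non-ASCII character is no longer expanded into a \uXXXX escape.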

Deprecated

  • SimliVideoService.InputParams is deprecated. Use the direct constructor parameters max_session_length, max_idle_time, and enable_logging instead. (PR #4001)

  • Deprecated LocalSmartTurnAnalyzerV2 and LocalCoreMLSmartTurnAnalyzer. Use LocalSmartTurnAnalyzerV3 instead. Instantiating these analyzers will now emit a DeprecationWarning. (PR #4012)

  • Deprecated WakeCheckFilter in favor of WakePhraseUserTurnStartStrategy. (PR #4064)

Fixed

  • Fixed an issue where the default model for OpenAILLMService and AzureLLMService was mistakenly reverted to gpt-4o. The defaults are now restored to gpt-4.1. (PR #4000)

  • Fixed a race condition where EndTaskFrame could cause the pipeline to shut down before in-flight frames (e.g. LLM function call responses) finished processing. EndTaskFrame and StopTaskFrame now flow through the pipeline as ControlFrames, ensuring all pending work is flushed before shutdown begins. CancelTaskFrame and InterruptionTaskFrame remain immediate (SystemFrame). (PR #4006)

  • Fixed ParallelPipeline dropping or misordering frames during lifecycle synchronization. Buffered frames are now flushed in the correct order relative to synchronization frames (StartFrame goes first, EndFrame/CancelFrame go after), and frames added to the buffer during flush are also drained. (PR #4007)

  • Fixed TTSService potentially canceling in-flight audio during shutdown. The stop sequence now waits for all queued audio contexts to finish processing before canceling the stop frame task. (PR #4007)

  • Fixed Language enum values (e.g. Language.ES) not being converted to service-specific codes when passed via settings=Service.Settings(language=Language.ES) at init time. This caused API errors (e.g. 400 from Rime) because the raw enum was sent instead of the expected language code (e.g. "spa"). Runtime updates via UpdateSettingsFrame were unaffected. The fix centralizes conversion in the base TTSService and STTService classes so all services handle this consistently. (PR #4024)

  • Fixed DeepgramSTTService ignoring the base_url scheme when using ws:// or http://. Previously these were silently overwritten with wss:// / https://, breaking air-gapped or private deployments that don't use TLS. All scheme choices (wss://, https://, ws://, http://, or bare hostname) are now respected. (PR #4026)

  • Fixed LLMSwitcher.register_function() and register_direct_function() not accepting or forwarding the timeout_secs parameter. (PR #4037)

  • Fixed empty user transcriptions in Nova Sonic causing spurious interruptions. Previously, an empty transcription could trigger an interruption of the assistant's response even though the user hadn't actually spoken. (PR #4042)

  • Fixed SonioxSTTService and OpenAIRealtimeSTTService crash when language parameters contain plain strings instead of Language enum values. (PR #4046)

  • Fixed premature user turn stops caused by late transcriptions arriving between turns. A stale transcript from the previous turn could persist into the next turn and trigger a stop before the current turn's real transcript arrived. Stop strategies are now reset at both turn start and turn stop to prevent state from leaking across turn boundaries. (PR #4057)

  • Fixed raw language strings like "de-DE" silently failing when passed to TTS/STT services (e.g. ElevenLabs producing no audio). Raw strings now go through the same Language enum resolution as enum values, so regional codes like "de-DE" are properly converted to service-expected formats like "de". Unrecognized strings log a warning instead of failing silently. (PR #4058)

  • Fixed Deepgram STT list-type settings (keyterm, keywords, search, redact, replace) being stringified instead of passed as lists to the SDK, which caused them to be sent as literal strings (e.g. "['pipecat']") in the WebSocket query params. (PR #4063)

  • Fixed MinWordsUserTurnStartStrategy including text below the word threshold in the output by resetting aggregation when the minimum word count is not met. (PR #4064)

  • Fixed audio overlap and potential dropped TTS content when multiple assistant turns occur in quick succession. TTSService now flushes remaining text before pausing frame processing on LLMFullResponseEndFrame/EndFrame, instead of pausing first. (PR #4071)
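
The Deepgram list-settings bug above (PR #4063) is an instance of a general pitfall: calling str() on a list sends its Python repr as a single value. The sketch below uses urllib only to make the difference visible. How the Deepgram SDK encodes lists internally is not stated in the entry, so treat the doseq=True form as an analogy and not as the SDK's behavior.

```python
from urllib.parse import parse_qs, urlencode

keyterms = ["pipecat", "daily"]

# Buggy pattern: the whole list becomes one literal string value.
stringified = urlencode({"keyterm": str(keyterms)})

# Passing the list through: doseq=True repeats the key once per element.
repeated = urlencode({"keyterm": keyterms}, doseq=True)

print(parse_qs(stringified))  # {'keyterm': ["['pipecat', 'daily']"]}
print(parse_qs(repeated))     # {'keyterm': ['pipecat', 'daily']}
```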

Security

  • Bumped PyJWT minimum version from 2.10.1 to 2.12.0 in the livekit extra to address CVE-2026-32597 (GHSA-752w-5fwx-jx9f), where PyJWT <= 2.11.0 accepted unknown crit header extensions. (PR #4035)

[0.0.105] - 2026-03-10

Added

  • Added concurrent audio context support: CartesiaTTSService can now synthesize the next sentence while the previous one is still playing, by setting pause_frame_processing=False and routing each sentence through its own audio context queue. (PR #3804)

  • Added custom video track support to Daily transport. Use video_out_destinations in DailyParams to publish multiple video tracks simultaneously, mirroring the existing audio_out_destinations feature. (PR #3831)

  • Added ServiceSwitcherStrategyFailover that automatically switches to the next service when the active service reports a non-fatal error. Recovery policies can be implemented via the on_service_switched event handler. (PR #3861)

  • Added optional timeout_secs parameter to register_function() and register_direct_function() for per-tool function call timeout control, overriding the global function_call_timeout_secs default. (PR #3915)

  • Added cloud-audio-only recording option to Daily transport's enable_recording property. (PR #3916)

  • Wired up system_instruction in BaseOpenAILLMService, AnthropicLLMService, and AWSBedrockLLMService so it works as a default system prompt, matching the behavior of the Google services. This enables sharing a single LLMContext across multiple LLM services, where each service provides its own system instruction independently.

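    import os

    from pipecat.frames.frames import LLMRunFrame
    from pipecat.processors.aggregators.llm_context import LLMContext
    from pipecat.services.openai.llm import OpenAILLMService

    # The service supplies its own default system prompt.
    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        system_instruction="You are a helpful assistant.",
    )

    # The shared context carries no system message of its own.
    context = LLMContext()

    # `transport` and `task` are assumed to be created elsewhere in the bot.
    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        context.add_message({"role": "user", "content": "Please introduce yourself."})
        await task.queue_frames([LLMRunFrame()])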

    (PR #3918)

  • Added vad_threshold parameter to AssemblyAIConnectionParams for configuring voice activity detection sensitivity in U3 Pro. Aligning this with external VAD thresholds (e.g., Silero VAD) prevents the "dead zone" where AssemblyAI transcribes speech that VAD hasn't detected yet. (PR #3927)

  • Added push_empty_transcripts parameter to BaseWhisperSTTService and OpenAISTTService to allow empty transcripts to be pushed downstream as TranscriptionFrame instead of discarding them (the default behavior). This is intended for situations where VAD fires even though the user did not speak. In these cases, it is useful to know that nothing was transcribed so that the agent can resume speaking, instead of waiting longer for a transcription. (PR #3930)

  • LLM services (BaseOpenAILLMService, AnthropicLLMService, AWSBedrockLLMService) now log a warning when both system_instruction and a system message in the context are set. The constructor's system_instruction takes precedence. (PR #3932)

  • Runtime settings updates (via STTUpdateSettingsFrame) now work for AWS Transcribe, Azure, Cartesia, Deepgram, ElevenLabs Realtime, Gradium, and Soniox STT services. Previously, changing settings at runtime only stored the new values without reconnecting. (PR #3946)

  • Exposed on_summary_applied event on LLMAssistantAggregator, allowing users to listen for context summarization events without accessing private members. (PR #3947)

  • Deepgram Flux STT settings (keyterm, eot_threshold, eager_eot_threshold, eot_timeout_ms) can now be updated mid-stream via STTUpdateSettingsFrame without triggering a reconnect. The new values are sent to Deepgram as a Configure WebSocket message on the existing connection. (PR #3953)

  • Added system_instruction parameter to run_inference across all LLM services, allowing callers to override the system prompt for one-shot inference calls. Used by _generate_summary to pass the summarization prompt cleanly. (PR #3968)
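
The per-tool timeout_secs override above (PR #3915) follows a common pattern: use the per-call value when given, otherwise fall back to a global default. This sketch shows that pattern with asyncio.wait_for. GLOBAL_TIMEOUT_SECS, run_tool, and slow_tool are hypothetical names, and Pipecat's internal handling may differ.

```python
import asyncio
from typing import Awaitable, Callable, Optional

GLOBAL_TIMEOUT_SECS = 10.0  # stand-in for function_call_timeout_secs


async def run_tool(handler: Callable[[], Awaitable[str]],
                   timeout_secs: Optional[float] = None) -> str:
    """Hypothetical runner: a per-tool timeout wins over the global default."""
    effective = timeout_secs if timeout_secs is not None else GLOBAL_TIMEOUT_SECS
    try:
        return await asyncio.wait_for(handler(), timeout=effective)
    except asyncio.TimeoutError:
        return "timed out"


async def slow_tool() -> str:
    await asyncio.sleep(0.2)
    return "done"


async def main() -> None:
    print(await run_tool(slow_tool))                     # global default applies
    print(await run_tool(slow_tool, timeout_secs=0.01))  # per-tool override


if __name__ == "__main__":
    asyncio.run(main())
```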

Changed

  • Audio context management (previously in AudioContextTTSService) is now built into TTSService. All WebSocket providers (cartesia, elevenlabs, asyncai, inworld, rime, gradium, resembleai) now inherit from WebsocketTTSService directly. Word-timestamp baseline is set automatically on the first audio chunk of each context instead of requiring each provider to call start_word_timestamps() in their receive loop. (PR #3804)

  • Daily transport now uses CustomVideoSource/CustomVideoTrack instead of VirtualCameraDevice for the default camera output, mirroring how audio already works with CustomAudioSource/CustomAudioTrack. (PR #3831)

  • ⚠️ Updated DeepgramSTTService to use deepgram-sdk v6. The LiveOptions class was removed from the SDK and is now provided by pipecat directly; import it from pipecat.services.deepgram.stt instead of deepgram. (PR #3848)

  • ServiceSwitcherStrategy base class now provides a handle_error() hook for subclasses to implement error-based switching. ServiceSwitcher defaults to ServiceSwitcherStrategyManual and strategy_type is now optional. (PR #3861)

  • Support for Voice Focus 2.0 models.

    • Updated aic-sdk to ~=2.1.0 to support Voice Focus 2.0 models.
    • Cleaned unused ParameterFixedError exception handling in AICFilter parameter setup. (PR #3889)
  • max_context_tokens and max_unsummarized_messages in LLMAutoContextSummarizationConfig (and deprecated LLMContextSummarizationConfig) can now be set to None independently to disable that summarization threshold. At least one must remain set. (PR #3914)
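
The "either threshold may be None, but not both" rule can be sketched as follows. This is a minimal illustration, not Pipecat's implementation: `SummarizationThresholds`, its default values, and `should_summarize()` are hypothetical stand-ins for the two fields named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SummarizationThresholds:
    """Hypothetical stand-in for the two auto-summarization thresholds."""

    # Illustrative defaults only; the real defaults are not stated in this entry.
    max_context_tokens: Optional[int] = 8000
    max_unsummarized_messages: Optional[int] = 20

    def __post_init__(self) -> None:
        # At least one threshold must stay enabled.
        if self.max_context_tokens is None and self.max_unsummarized_messages is None:
            raise ValueError("Set max_context_tokens or max_unsummarized_messages")

    def should_summarize(self, context_tokens: int, unsummarized_messages: int) -> bool:
        # A threshold set to None is skipped entirely.
        over_tokens = (
            self.max_context_tokens is not None
            and context_tokens >= self.max_context_tokens
        )
        over_messages = (
            self.max_unsummarized_messages is not None
            and unsummarized_messages >= self.max_unsummarized_messages
        )
        return over_tokens or over_messages
```

With `max_context_tokens=None`, only the message count can trigger summarization, however large the context grows.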

  • ⚠️ Removed formatted_finals and word_finalization_max_wait_time from AssemblyAIConnectionParams as these were v2 API parameters not supported in v3. Clarified that format_turns only applies to Universal-Streaming models; U3 Pro has automatic formatting built-in. (PR #3927)

  • Changed DeepgramTTSService to send a Clear message on interruption instead of disconnecting and reconnecting the WebSocket, allowing the connection to persist throughout the session. (PR #3958)

  • Re-added enhancement_level support to AICFilter with runtime FilterEnableFrame control, applying ProcessorParameter.Bypass and ProcessorParameter.EnhancementLevel together. (PR #3961)

  • Updated daily-python dependency from ~=0.23.0 to ~=0.24.0. (PR #3970)

  • Updated FishAudioTTSService default model from s1 to s2-pro, matching Fish Audio's latest recommended model for improved quality and speed. (PR #3973)

  • AzureSTTService region parameter is now optional when private_endpoint is provided. A ValueError is raised if neither is given, and a warning is logged if both are provided (private_endpoint takes priority). (PR #3974)
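
The validation rules above can be sketched as a small helper. `resolve_speech_target()` is a hypothetical name used for illustration and is not part of `AzureSTTService`; the returned dict only shows which single setting would be forwarded.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def resolve_speech_target(
    region: Optional[str] = None, private_endpoint: Optional[str] = None
) -> Dict[str, str]:
    """Hypothetical helper: pick exactly one of endpoint or region."""
    if not region and not private_endpoint:
        raise ValueError("Either region or private_endpoint must be provided")
    if private_endpoint:
        if region:
            logger.warning("Both region and private_endpoint given; using private_endpoint")
        # Forward only the endpoint; region and endpoint are never sent together.
        return {"endpoint": private_endpoint}
    return {"region": str(region)}
```

Returning a single-key dict keeps the "never both" constraint in one place, so the caller cannot accidentally pass both values on.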

Deprecated

  • Deprecated AudioContextTTSService and AudioContextWordTTSService. Subclass WebsocketTTSService directly instead; audio context management is now part of the base TTSService.

    • Deprecated WordTTSService, WebsocketWordTTSService, and InterruptibleWordTTSService. Word timestamp logic is now always active in TTSService and no longer needs to be opted into via a subclass. (PR #3804)
  • Deprecated pipecat.services.google.llm_vertex, pipecat.services.google.llm_openai, and pipecat.services.google.gemini_live.llm_vertex modules. Use pipecat.services.google.vertex.llm, pipecat.services.google.openai.llm, and pipecat.services.google.gemini_live.vertex.llm instead. The old import paths still work but will emit a DeprecationWarning. (PR #3980)

Removed

  • ⚠️ Removed supports_word_timestamps parameter from TTSService.__init__(). Word timestamp logic is now always active. Remove this argument from any custom subclass super().__init__() calls. (PR #3804)

Fixed

  • Fixed DeepgramSTTService keepalive ping timeout disconnections. The deepgram-sdk v6 removed automatic keepalive; pipecat now sends explicit KeepAlive messages every 5 seconds, within the recommended 3-5 second interval before Deepgram's 10-second inactivity timeout. (PR #3848)
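
A periodic keepalive of this kind can be sketched with plain asyncio. This is an illustration under assumptions, not the service's code: `keepalive_loop()` and `demo()` are hypothetical names, and `send` stands in for whatever coroutine writes a text message to the open WebSocket. The `{"type": "KeepAlive"}` payload is the text message Deepgram's streaming API documents for this purpose.

```python
import asyncio
import json
from typing import Awaitable, Callable, List

KEEPALIVE_MESSAGE = json.dumps({"type": "KeepAlive"})


async def keepalive_loop(
    send: Callable[[str], Awaitable[None]], interval: float = 5.0
) -> None:
    """Send a KeepAlive text message every `interval` seconds until cancelled."""
    while True:
        await asyncio.sleep(interval)
        await send(KEEPALIVE_MESSAGE)


async def demo() -> List[str]:
    """Run the loop briefly against a fake sender and return what was sent."""
    sent: List[str] = []

    async def fake_send(message: str) -> None:
        sent.append(message)

    # A short interval keeps the demo fast; the entry above uses 5 seconds.
    task = asyncio.create_task(keepalive_loop(fake_send, interval=0.01))
    await asyncio.sleep(0.2)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sent
```

Cancelling the task on disconnect is what stops the loop; nothing else needs to be torn down.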

  • Fixed BufferError: Existing exports of data: object cannot be re-sized in AICFilter caused by holding a memoryview on the mutable audio buffer across async yield points. (PR #3889)
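
The underlying Python behavior can be reproduced without any async code. The snippet below is a generic illustration, not the AICFilter code: a `bytearray` cannot be resized while a `memoryview` on it is still alive, and copying the bytes out avoids holding an export at all.

```python
buffer = bytearray(b"\x00" * 4)

# Holding a memoryview "exports" the buffer and blocks any resize.
view = memoryview(buffer)
try:
    buffer.extend(b"\x01\x02")
    resize_failed = False
except BufferError:
    resize_failed = True
finally:
    view.release()

# Safer pattern: copy the bytes out so no export outlives the statement.
chunk = bytes(buffer[:4])
buffer.extend(b"\x01\x02")
```

In async code the risk is that another task appends to the buffer while the view is held across an `await`, so taking a copy (or releasing the view) before yielding removes the hazard.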

  • Fixed TTS context not being appended to the assistant message history when using TTSSpeakFrame with append_to_context=True with some TTS providers. (PR #3936)

  • Fixed context summarization leaving orphaned tool responses in the kept context when tool calls were moved to the summarized portion. (PR #3937)
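
One way to avoid orphaned tool responses is sketched below, assuming OpenAI-style message dicts (`tool_calls` on assistant messages, `tool_call_id` on tool messages). `split_without_orphans()` is a hypothetical helper, and the actual fix in Pipecat may take a different approach.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def split_without_orphans(
    messages: List[Message], keep_from: int
) -> Tuple[List[Message], List[Message]]:
    """Split into (to_summarize, kept), moving orphaned tool responses to the summary side."""
    to_summarize = list(messages[:keep_from])
    kept: List[Message] = []
    known_call_ids = set()
    for message in messages[keep_from:]:
        # Remember every tool call that is issued inside the kept portion.
        for call in message.get("tool_calls") or []:
            known_call_ids.add(call["id"])
        is_orphan = (
            message.get("role") == "tool"
            and message.get("tool_call_id") not in known_call_ids
        )
        if is_orphan:
            # Its tool call was summarized, so the response goes with it.
            to_summarize.append(message)
        else:
            kept.append(message)
    return to_summarize, kept
```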

  • Fixed turn completion state not resetting at end of LLM responses. LLMFullResponseEndFrame is pushed (not received) by the LLM service, so the mixin now handles it in push_frame instead of process_frame. (PR #3956)

  • Fixed turn completion instructions being injected as a context system message instead of using system_instruction. This caused warning spam when system_instruction was also set and didn't persist across full context updates. (PR #3957)

  • Fixed TTSService audio context queue getting blocked when append_to_audio_context() was called with a None context ID, which prevented subsequent audio from being delivered. (PR #3958)

  • Fixed on_call_state_updated event handler in LiveKit transport receiving incorrect number of arguments due to redundant self passed to _call_event_handler. (PR #3959)

  • Fixed OpenAI Realtime, OpenAI Realtime Beta, and Grok realtime services treating conversation_already_has_active_response as a fatal error. These services now log it as a non-fatal debug event when a response is already in progress. (PR #3960)
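
Classifying such errors can be sketched as a lookup on the error code. `is_fatal_error()` and the constant name are hypothetical; the event shape assumes the `{"type": "error", "error": {"code": ...}}` layout used by OpenAI Realtime error events, and the second code comes from the 0.0.104 entry for PR #3795.

```python
from typing import Any, Dict

# Codes that signal a harmless race rather than a broken session.
NON_FATAL_ERROR_CODES = {
    "conversation_already_has_active_response",
    "response_cancel_not_active",
}


def is_fatal_error(event: Dict[str, Any]) -> bool:
    """Return False for known benign error codes, True for everything else."""
    code = (event.get("error") or {}).get("code")
    return code not in NON_FATAL_ERROR_CODES
```

Unknown or missing codes are treated as fatal, so new server errors are not silently ignored.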

  • Fixed SmallWebRTCConnection silently discarding messages sent before the data channel is open by queuing them and flushing once the channel is ready. A bounded queue (MAX_MESSAGE_QUEUE_SIZE = 50) prevents unbounded memory growth, and a 10-second timeout after connection clears the queue and falls back to discard mode if the data channel never opens. (PR #3962)
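
The queue-then-flush behavior can be sketched as below. `PendingMessageQueue` is a hypothetical class, not the `SmallWebRTCConnection` code; dropping the oldest message when the queue is full is an assumption (the entry only says the queue is bounded), and the caller is expected to invoke `on_open_timeout()` from its own 10-second timer.

```python
from collections import deque
from typing import Callable, Deque

MAX_MESSAGE_QUEUE_SIZE = 50


class PendingMessageQueue:
    """Hypothetical sketch: buffer messages until the data channel opens."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._open = False
        self._discard = False
        # Assumption: once full, the oldest queued message is dropped.
        self._pending: Deque[str] = deque(maxlen=MAX_MESSAGE_QUEUE_SIZE)

    def send(self, message: str) -> None:
        if self._open:
            self._send(message)
        elif not self._discard:
            self._pending.append(message)
        # In discard mode, messages sent before the channel opens are dropped.

    def on_channel_open(self) -> None:
        # Flush queued messages in their original order.
        self._open = True
        while self._pending:
            self._send(self._pending.popleft())

    def on_open_timeout(self) -> None:
        # The channel never opened in time: clear the queue and stop buffering.
        if not self._open:
            self._pending.clear()
            self._discard = True
```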

  • Fixed AzureSTTService failing to initialize when private_endpoint is provided. The Azure Speech SDK's SpeechConfig does not accept both region and endpoint simultaneously, so they are now passed conditionally. (PR #3967)

  • Fixed GoogleLLMService ignoring the system_instruction set via constructor or GoogleLLMSettings when a system message was also present in the context. The settings value now correctly takes priority, and a warning is logged when both are set. (PR #3976)

Other

  • Updated foundational examples to use system_instruction on LLM services instead of adding system messages to LLMContext. (PR #3918)

  • Updated AssemblyAI turn detection example to use keyterms_prompt list format instead of prompt string for improved clarity. (PR #3929)

  • Updated foundational examples and eval scripts to use "user" role instead of "system" when adding messages to LLMContext, since system prompts should be set via system_instruction on the LLM service. (PR #3931)

[0.0.104] - 2026-03-02

Added

  • Added TextAggregationMetricsData metric measuring the time from the first LLM token to the first complete sentence, representing the latency cost of sentence aggregation in the TTS pipeline. (PR #3696)

  • Added support for using strongly-typed objects instead of dicts for updating service settings at runtime.

    Instead of, say:

    await task.queue_frame(
        STTUpdateSettingsFrame(settings={"language": Language.ES})
    )
    

    you'd do:

    await task.queue_frame(
        STTUpdateSettingsFrame(delta=DeepgramSTTSettings(language=Language.ES))
    )
    

    Each service now vends strongly-typed classes like DeepgramSTTSettings representing the service's runtime-updatable settings. (PR #3714)

  • Added support for specifying private endpoints for Azure Speech-to-Text, enabling use in private networks behind firewalls. (PR #3764)

  • Added LemonSliceTransport and LemonSliceApi to support adding real-time LemonSlice Avatars to any Daily room. (PR #3791)

  • Added output_medium parameter to AgentInputParams and OneShotInputParams in Ultravox service to control initial output medium (text or voice) at call creation time. (PR #3806)

  • Added TurnMetricsData as a generic metrics class for turn detection, with e2e processing time measurement. KrispVivaTurn now emits TurnMetricsData with e2e_processing_time_ms tracking the interval from VAD speech-to-silence transition to turn completion. (PR #3809)

  • Added on_audio_context_interrupted() and on_audio_context_completed() callbacks to AudioContextTTSService. Subclasses can override these to perform provider-specific cleanup instead of overriding _handle_interruption(). (PR #3814)

  • Added on_summary_applied event to LLMContextSummarizer for observability, providing message counts before and after context summarization. (PR #3855)

  • Added summary_message_template to LLMContextSummarizationConfig for customizing how summaries are formatted when injected into context (e.g., wrapping in XML tags). (PR #3855)

  • Added summarization_timeout to LLMContextSummarizationConfig (default 120s) to prevent hung LLM calls from permanently blocking future summarizations. (PR #3855)

  • Added optional llm field to LLMContextSummarizationConfig for routing summarization to a dedicated LLM service (e.g., a cheaper/faster model) instead of the pipeline's primary model. (PR #3855)

  • Added AssemblyAI u3-rt-pro model support with a built-in turn detection mode. (PR #3856)

  • Added LLMSummarizeContextFrame to trigger on-demand context summarization from anywhere in the pipeline (e.g. a function call tool). Accepts an optional config: LLMContextSummaryConfig to override summary generation settings per request. (PR #3863)

  • Added LLMContextSummaryConfig (summary generation params: target_context_tokens, min_messages_after_summary, summarization_prompt) and LLMAutoContextSummarizationConfig (auto-trigger thresholds: max_context_tokens, max_unsummarized_messages, plus a nested summary_config). These replace the monolithic LLMContextSummarizationConfig. (PR #3863)

  • Added support for the speed_alpha parameter to the arcana model in RimeTTSService. (PR #3873)

  • Added ClientConnectedFrame, a new SystemFrame pushed by all transports (Daily, LiveKit, FastAPI WebSocket, WebSocket Server, SmallWebRTC, HeyGen, Tavus) when a client connects. Enables observers to track transport readiness timing. (PR #3881)

  • Added StartupTimingObserver for measuring how long each processor's start() method takes during pipeline startup. Also measures transport readiness — the time from StartFrame to first client connection — via the on_transport_timing_report event. (PR #3881)

  • Added BotConnectedFrame for SFU transports and on_transport_timing_report event to StartupTimingObserver with bot and client connection timing. (PR #3881)

  • Added optional direction parameter to PipelineTask.queue_frame() and PipelineTask.queue_frames(), allowing frames to be pushed upstream from the end of the pipeline. (PR #3883)

  • Added on_latency_breakdown event to UserBotLatencyObserver providing per-service TTFB, text aggregation, user turn duration, and function call latency metrics for each user-to-bot response cycle. (PR #3885)

  • Added on_first_bot_speech_latency event to UserBotLatencyObserver measuring the time from client connection to first bot speech. An on_latency_breakdown is also emitted for this first speech event. (PR #3885)

  • Added broadcast_interruption() to FrameProcessor. This method pushes an InterruptionFrame both upstream and downstream directly from the calling processor, avoiding the round-trip through the pipeline task that push_interruption_task_frame_and_wait() required. (PR #3896)

Changed

  • Added text_aggregation_mode parameter to TTSService and all TTS subclasses with a new TextAggregationMode enum (SENTENCE, TOKEN). All text now flows through text aggregators regardless of mode, enabling pattern detection and tag handling in TOKEN mode. (PR #3696)

  • ⚠️ Refactored runtime-updatable service settings to use strongly-typed classes (TTSSettings, STTSettings, LLMSettings, and service-specific subclasses) instead of plain dicts. Each service's _settings now holds these strongly-typed objects. For service maintainers, see changes in COMMUNITY_INTEGRATIONS.md. (PR #3714)

  • Word timestamp support has been moved from WordTTSService into TTSService via a new supports_word_timestamps parameter. Services that previously extended WordTTSService, AudioContextWordTTSService, or WebsocketWordTTSService now pass supports_word_timestamps=True to their parent __init__ instead. (PR #3786)

  • Improved Ultravox TTFB measurement accuracy by using VAD speech end time instead of UserStoppedSpeakingFrame timing. (PR #3806)

  • Aligned UltravoxRealtimeLLMService frame handling with OpenAI/Gemini realtime services: added InterruptionFrame handling with metrics cleanup, processing metrics at response boundaries, and improved agent transcript handling for both voice and text output modalities. (PR #3806)

  • Updated OpenAIRealtimeLLMService default model to gpt-realtime-1.5. (PR #3807)

  • Added api_key parameter to KrispVivaSDKManager, KrispVivaTurn, and KrispVivaFilter for Krisp SDK v1.6.1+ licensing. Falls back to KRISP_VIVA_API_KEY environment variable. (PR #3809)

  • Bumped nltk minimum version from 3.9.1 to 3.9.3 to resolve a security vulnerability. (PR #3811)

  • ServiceSettingsUpdateFrames are now UninterruptibleFrames. Generally speaking, you don't want a user interruption to prevent a service setting change from going into effect. Note that you usually don't use ServiceSettingsUpdateFrame directly, you use one of its subclasses:

    • LLMUpdateSettingsFrame
    • TTSUpdateSettingsFrame
    • STTUpdateSettingsFrame (PR #3819)
  • Updated context summarization to use user role instead of assistant for summary messages. (PR #3855)

  • Renamed the AssemblyAISTTService parameter min_end_of_turn_silence_when_confident to min_turn_silence. The old name is still supported with a deprecation warning. (PR #3856)

  • ⚠️ Renamed LLMAssistantAggregatorParams fields: enable_context_summarization → enable_auto_context_summarization and context_summarization_config → auto_context_summarization_config (now accepts LLMAutoContextSummarizationConfig). The old names still work with a DeprecationWarning for one release cycle. (PR #3863)

  • ElevenLabsRealtimeSTTService now sets TranscriptionFrame.finalized to True when using CommitStrategy.MANUAL. (PR #3865)

  • Updated numba version pin from == to >=0.61.2 (PR #3868)

  • Updated tracing code to use ServiceSettings dataclass API (given_fields(), attribute access) instead of dict-style access (.items(), in, subscript). (PR #3879)

  • ⚠️ Removed event field and complete() method from InterruptionFrame. Removed event field from InterruptionTaskFrame. These are no longer needed since broadcast_interruption() does not require a round-trip completion signal. (PR #3896)

  • Moved pipecat.services.deepgram.stt_sagemaker and pipecat.services.deepgram.tts_sagemaker to pipecat.services.deepgram.sagemaker.stt and pipecat.services.deepgram.sagemaker.tts. The old import paths still work but emit a DeprecationWarning. (PR #3902)

Deprecated

  • ⚠️ Deprecated aggregate_sentences parameter on TTSService and all TTS subclasses. Use text_aggregation_mode=TextAggregationMode.SENTENCE or text_aggregation_mode=TextAggregationMode.TOKEN instead. (PR #3696)

  • Deprecated set_model(), set_voice(), and set_language() on AI services in favor of runtime updates via TTSUpdateSettingsFrame, STTUpdateSettingsFrame, and LLMUpdateSettingsFrame.

    ⚠️ Note, too, a subtle behavior change in these deprecated methods. Previously, only set_language() caused the service to actually react to the update (e.g. by reconnecting to a remote service so it can pick up the change); now all of these methods do. This change was made as part of a refactor that makes them all work the same way under the hood. (PR #3714)

  • Dict-based *UpdateSettingsFrame(settings={...}) is deprecated in favor of passing typed settings delta objects with *UpdateSettingsFrame(delta={...}). (PR #3714)

  • Deprecated WordTTSService, WebsocketWordTTSService, AudioContextWordTTSService, and InterruptibleWordTTSService. Use their non-word counterparts with supports_word_timestamps=True instead:

    • WordTTSService → TTSService(supports_word_timestamps=True)
    • WebsocketWordTTSService → WebsocketTTSService(supports_word_timestamps=True)
    • AudioContextWordTTSService → AudioContextTTSService(supports_word_timestamps=True)
    • InterruptibleWordTTSService → InterruptibleTTSService(supports_word_timestamps=True) (PR #3786)
  • Deprecated SmartTurnMetricsData in favor of TurnMetricsData. BaseSmartTurn now emits TurnMetricsData directly. (PR #3809)

  • Deprecated LLMContextSummarizationConfig. Use LLMAutoContextSummarizationConfig with a nested LLMContextSummaryConfig instead. The old class emits a DeprecationWarning. (PR #3863)

  • Deprecated push_interruption_task_frame_and_wait() in FrameProcessor. Use broadcast_interruption() instead. The old method now delegates to broadcast_interruption() and logs a deprecation warning. (PR #3896)

Removed

  • Removed the local-smart-turn-v3 optional extra from pyproject.toml. The transformers and onnxruntime packages are now always installed as core dependencies, since they are required by the default turn stop strategy, TurnAnalyzerUserTurnStopStrategy, which uses LocalSmartTurnAnalyzerV3. (PR #3803)

  • ⚠️ Removed PlayHTTTSService and PlayHTHttpTTSService. PlayHT has been shut down and is no longer available. (PR #3838)

Fixed

  • Added LLMSpecificMessage handling in LLMContextSummarizationUtil to skip provider-specific messages during context summarization. (PR #3794)

  • Treated response_cancel_not_active as a non-fatal error in realtime services (OpenAIRealtimeLLMService, GrokRealtimeLLMService, OpenAIRealtimeBetaLLMService) to prevent WebSocket disconnection when cancelling an inactive response. (PR #3795)

  • Fixed Poetry compatibility by inlining local-smart-turn-v3 dependencies (transformers, onnxruntime) into core dependencies instead of using a self-referential extra. (PR #3803)

  • Fixed SentryMetrics method signatures to match updated FrameProcessorMetrics base class, resolving TypeError when using start_time/end_time keyword arguments. (PR #3808)

  • Fixed STT TTFB metrics not being reported for SonioxSTTService and AWSTranscribeSTTService due to missing can_generate_metrics() override. (PR #3813)

  • Fixed an issue where AudioContextTTSService-based providers (AsyncAI, ElevenLabs, Inworld, Rime) did not close or clean up their server-side audio contexts after normal speech completion, only on interruption. (PR #3814)

  • Fixed STT TTFB metrics measuring timeout expiry time instead of actual transcript arrival time. (PR #3822)

  • Fixed InterimTranscriptionFrame and TranslationFrame being unintentionally pushed downstream in LLMUserAggregator. They are now consumed like TranscriptionFrame. (PR #3825)

  • Fixed misleading "Empty audio frame received for STT service" warnings when using audio filters (e.g. RNNoiseFilter, KrispVivaFilter, AICFilter) that buffer audio internally. (PR #3828)

  • Fixed an issue in RimeNonJsonTTSService where trailing punctuation was sometimes vocalized. (PR #3837)

  • Fixed TTSSpeakFrame not committing spoken text to the conversation context when used outside of an LLM response (e.g., bot greetings or injected speech). (PR #3845)

  • Removed verbose per-chunk audio logging from GenesysAudioHookSerializer that flooded production logs. (PR #3850)

  • Added a beta feature warning when using custom prompts with AssemblyAI. (PR #3856)

  • Fixed LocalSmartTurnAnalyzerV3 producing incorrect end-of-turn predictions at non-16kHz sample rates (e.g. 8kHz Twilio telephony) by adding automatic resampling to 16kHz before Whisper feature extraction. (PR #3857)

  • Fixed PipelineTask double-inserting RTVIProcessor into the frame chain when the user provides both an RTVIProcessor in the pipeline and a custom RTVIObserver subclass in observers. (PR #3867)

  • Fixed turn completion instructions being lost when LLMMessagesUpdateFrame replaces the LLM context. When filter_incomplete_user_turns is enabled, the turn completion system message is now re-injected after context replacement. (PR #3888)

  • Fixed Azure TTS and STT services silently swallowing cancellation errors (invalid API key, network failures, rate limiting) instead of propagating them as ErrorFrames to the pipeline. (PR #3893)

Performance

  • Switched GradiumTTSService from InterruptibleWordTTSService to AudioContextWordTTSService, eliminating websocket disconnect/reconnect on every interruption by using client_req_id-based multiplexing. (PR #3759)

Other

  • Standardized Sarvam STT/TTS User-Agent header handling to consistently send Pipecat SDK identity in websocket requests. (PR #3886)

[0.0.103] - 2026-02-20

Added

  • Added "timestampTransportStrategy": "ASYNC" to InworldAITTSService. This lets timestamp information trail the arrival of audio chunks, which improves first-audio-chunk latency. (PR #3625)

  • Added model-specific InputParams to RimeTTSService: arcana params (repetition_penalty, temperature, top_p) and mistv2 params (no_text_normalization, save_oovs, segment). Model, voice, and param changes now trigger WebSocket reconnection. (PR #3642)

  • Added write_transport_frame() hook to BaseOutputTransport allowing transport subclasses to handle custom frame types that flow through the audio queue. (PR #3719)

  • Added DailySIPTransferFrame and DailySIPReferFrame to the Daily transport. These frames queue SIP transfer and SIP REFER operations with audio, so the operation executes only after the bot finishes its current utterance. (PR #3719)

  • Added keepalive support to SarvamSTTService to prevent idle connection timeouts (e.g. when used behind a ServiceSwitcher). (PR #3730)

  • Added UserIdleTimeoutUpdateFrame to enable or disable user idle detection at runtime by updating the timeout dynamically. (PR #3748)

  • Added broadcast_sibling_id field to the base Frame class. This field is automatically set by broadcast_frame() and broadcast_frame_instance() to the ID of the paired frame pushed in the opposite direction, allowing receivers to identify broadcast pairs. (PR #3774)

  • Added ignored_sources parameter to RTVIObserverParams and add_ignored_source()/remove_ignored_source() methods to RTVIObserver to suppress RTVI messages from specific pipeline processors (e.g. a silent evaluation LLM). (PR #3779)

  • Added DeepgramSageMakerTTSService for running Deepgram TTS models deployed on AWS SageMaker endpoints via HTTP/2 bidirectional streaming. Supports the Deepgram TTS protocol (Speak, Flush, Clear, Close), interruption handling, and per-turn TTFB metrics. (PR #3785)

Changed

  • ⚠️ RimeTTSService now defaults to model="arcana" and the wss://users-ws.rime.ai/ws3 endpoint. InputParams defaults changed from mistv2-specific values to None — only explicitly-set params are sent as query params. (PR #3642)

  • AICFilter now shares read-only AIC models via a singleton AICModelManager in aic_filter.py.

    • Multiple filters using the same model path or (model_id, model_download_dir) share one loaded model, with reference counting and concurrent load deduplication.
    • Model file I/O runs off the event loop so the filter does not block. (PR #3684)
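
Reference counting with concurrent load deduplication can be sketched as follows. `SharedModelCache` is a hypothetical class written for illustration and is not the `AICModelManager` API; it assumes Python 3.9+ for `asyncio.to_thread`, which keeps the blocking load off the event loop.

```python
import asyncio
from typing import Any, Callable, Dict


class SharedModelCache:
    """Hypothetical cache: one load per key, dropped when the last user releases it."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # blocking function, run in a worker thread
        self._tasks: Dict[str, "asyncio.Future[Any]"] = {}
        self._refs: Dict[str, int] = {}

    async def acquire(self, key: str) -> Any:
        self._refs[key] = self._refs.get(key, 0) + 1
        if key not in self._tasks:
            # The first caller starts the load; later callers await the same task.
            self._tasks[key] = asyncio.ensure_future(
                asyncio.to_thread(self._loader, key)
            )
        try:
            return await self._tasks[key]
        except Exception:
            # A failed load must not leave a dangling reference behind.
            self.release(key)
            raise

    def release(self, key: str) -> None:
        self._refs[key] -= 1
        if self._refs[key] == 0:
            # Last user gone: forget the model so it can be garbage collected.
            del self._refs[key]
            del self._tasks[key]
```

Two filters that call `acquire()` with the same key at the same time share a single load and receive the same object.
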
  • Added X-User-Agent and X-Request-Id headers to InworldTTSService for better traceability. (PR #3706)

  • DailyUpdateRemoteParticipantsFrame is no longer deprecated and is now queued with audio like other transport frames. (PR #3719)

  • Bumped Pillow dependency upper bound from <12 to <13 to allow Pillow 12.x. (PR #3728)

  • Moved STT keepalive mechanism from WebsocketSTTService to the STTService base class, allowing any STT service (not just websocket-based ones) to use idle-connection keepalive via the keepalive_timeout and keepalive_interval parameters. (PR #3730)

  • Improved audio context management in AudioContextTTSService by moving context ID tracking to the base class and adding reuse_context_id_within_turn parameter to control concurrent TTS request handling.

    • Added helper methods: has_active_audio_context(), get_active_audio_context_id(), remove_active_audio_context(), reset_active_audio_context()
    • Simplified Cartesia, ElevenLabs, Inworld, Rime, AsyncAI, and Gradium TTS implementations by removing duplicate context management code (PR #3732)
  • UserIdleController is now always created with a default timeout of 0 (disabled). The user_idle_timeout parameter changed from Optional[float] = None to float = 0 in UserTurnProcessor, LLMUserAggregatorParams, and UserIdleController. (PR #3748)

  • Changed the version specifier for the speechmatics-voice package from >=0.2.8 to ~=0.2.8 to ensure compatibility with future patch versions. (PR #3761)

  • Updated InworldTTSService and InworldHttpTTSService to use ASYNC timestamp transport strategy by default (PR #3765)

  • Added start_time and end_time parameters to start_ttfb_metrics(), stop_ttfb_metrics(), start_processing_metrics(), and stop_processing_metrics() in FrameProcessor and FrameProcessorMetrics, allowing custom timestamps for metrics measurement. STTService now uses these instead of custom TTFB tracking. (PR #3776)

  • Updated default Anthropic model from claude-sonnet-4-5-20250929 to claude-sonnet-4-6. (PR #3792)

Deprecated

  • Deprecated unused Traceable, @traceable, @traced, and AttachmentStrategy in pipecat.utils.tracing.class_decorators. This module will be removed in a future release. (PR #3733)

Fixed

  • Fixed a race condition where RTVIObserver could send messages before the DailyTransport join completed. Outbound messages are now queued and delivered after the transport is ready. (PR #3615)

  • Fixed async generator cleanup in OpenAI LLM streaming to prevent AttributeError with uvloop on Python 3.12+ (MagicStack/uvloop#699). (PR #3698)

  • Fixed SmallWebRTCTransport input audio resampling to properly handle all sample rates, including 8kHz audio. (PR #3713)

  • Fixed a race condition in RTVIObserver where bot output messages could be sent before the bot-started-speaking event. (PR #3718)

  • Fixed Grok Realtime session.updated event parsing failure caused by the API returning prefixed voice names (e.g. "human_Ara" instead of "Ara"). (PR #3720)

  • Fixed context ID reuse issue in ElevenLabsTTSService, InworldTTSService, RimeTTSService, CartesiaTTSService, AsyncAITTSService, and PlayHTTTSService. Services now properly reuse the same context ID across multiple run_tts() invocations within a single LLM turn, preventing context tracking issues and incorrect lifecycle signaling. (PR #3729)

  • Fixed word timestamp interleaving issue in ElevenLabsTTSService when processing multiple sentences within a single LLM turn. (PR #3729)

  • Fixed tracing service decorators executing the wrapped function twice when the function itself raised an exception (e.g., LLM rate limit, TTS timeout). (PR #3735)
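
One way a decorator can end up calling the wrapped function twice is a fallback path that catches every exception, including those raised by the function itself, and then retries without tracing. Whether that was the exact cause here is an assumption. The sketch below shows the safer shape, guarding only the tracing setup; `RecordingSpan` and `traced()` are hypothetical and not Pipecat's tracing API.

```python
import functools
from typing import Any, Callable, List


class RecordingSpan:
    """Tiny stand-in for a tracing span (hypothetical, not a real tracing API)."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self._log = log
        self._log.append(f"start:{name}")

    def end(self) -> None:
        self._log.append(f"end:{self.name}")


def traced(start_span: Callable[[str], Any]) -> Callable:
    """Wrap a function in a span without ever re-running the function."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                # Only the tracing setup is guarded against failure.
                span = start_span(fn.__name__)
            except Exception:
                span = None
            try:
                # The wrapped call runs exactly once; its errors propagate as-is.
                return fn(*args, **kwargs)
            finally:
                if span is not None:
                    span.end()

        return wrapper

    return decorator
```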

  • Fixed LLMUserAggregator broadcasting mute events before StartFrame reaches downstream processors. (PR #3737)

  • Fixed UserIdleController false idle triggers caused by gaps between user and bot activity frames. The idle timer now starts only after BotStoppedSpeakingFrame and is suppressed during active user turns and function calls. (PR #3744)

  • Fixed incorrect sample_rate assignment in TavusInputTransport._on_participant_audio_data (was using audio.audio_frames instead of audio.sample_rate). (PR #3768)

  • Fixed RTVIObserver not processing upstream-only frames. Previously, all upstream frames were filtered out to avoid duplicate messages from broadcasted frames. Now only upstream copies of broadcasted frames are skipped. (PR #3774)

  • Fixed mutable default arguments in LLMContextAggregatorPair.__init__() that could cause shared state across instances. (PR #3782)
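
The general Python pitfall behind this kind of fix is shown below with two hypothetical classes, which are not the `LLMContextAggregatorPair` code. A default value is evaluated once, when the function is defined, so a mutable default is shared by every call that relies on it.

```python
from typing import List, Optional


class SharedDefault:
    def __init__(self, items: List[str] = []) -> None:
        # Bug: the same list object is reused by every instance.
        self.items = items


class FreshDefault:
    def __init__(self, items: Optional[List[str]] = None) -> None:
        # Fix: build a new list per instance when none is supplied.
        self.items = items if items is not None else []


first, second = SharedDefault(), SharedDefault()
first.items.append("x")
leaked = second.items  # also contains "x"

third, fourth = FreshDefault(), FreshDefault()
third.items.append("x")
isolated = fourth.items  # still empty
```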

  • Fixed DeepgramSageMakerSTTService to properly track finalize lifecycle using request_finalize() / confirm_finalize() and use is_final (instead of is_final and speech_final) for final transcription detection, matching DeepgramSTTService behavior. (PR #3784)

  • Fixed a race condition in AudioContextTTSService where the audio context could time out between consecutive TTS requests within the same turn, causing audio to be discarded. (PR #3787)

  • Fixed push_interruption_task_frame_and_wait() hanging indefinitely when the InterruptionFrame does not reach the pipeline sink within the timeout. Added a timeout keyword argument to customize the wait duration. (PR #3789)

[0.0.102] - 2026-02-10

Added

  • Added ResembleAITTSService for text-to-speech using Resemble AI's streaming WebSocket API with word-level timestamps and jitter buffering for smooth audio playback. (PR #3134)

  • Added UserBotLatencyObserver for tracking user-to-bot response latency. When tracing is enabled, latency measurements are automatically recorded as turn.user_bot_latency_seconds attributes on OpenTelemetry turn spans. (PR #3355)

  • Added append_to_context parameter to TTSSpeakFrame for conditional LLM context addition.

    • Allows fine-grained control over whether text should be added to conversation context
    • Defaults to True to maintain backward compatibility (PR #3584)
  • Added TTS context tracking system with context_id field to trace audio generation through the pipeline.

    • TTSAudioRawFrame, TTSStartedFrame, TTSStoppedFrame now include context_id
    • AggregatedTextFrame and TTSTextFrame now include context_id
    • Enables tracking which TTS request generated specific audio chunks (PR #3584)
  • Added support for Inworld TTS Websocket Auto Mode for improved latency (PR #3593)

  • Added new frames for context summarization: LLMContextSummaryRequestFrame and LLMContextSummaryResultFrame. (PR #3621)

  • Added context summarization feature to automatically compress conversation history when conversation length limits (by token or message count) are reached, enabling efficient long-running conversations.

    • Configure via enable_context_summarization=True in LLMAssistantAggregatorParams
    • Customize behavior with LLMContextSummarizationConfig (max tokens, thresholds, etc.)
    • Automatically preserves incomplete function call sequences during summarization
    • See new examples: examples/foundational/54-context-summarization-openai.py and examples/foundational/54a-context-summarization-google.py (PR #3621)
  • Added RTVI function call lifecycle events (llm-function-call-started, llm-function-call-in-progress, llm-function-call-stopped) with configurable security levels via RTVIObserverParams.function_call_report_level. Supports per-function control over what information is exposed (DISABLED, NONE, NAME, or FULL). (PR #3630)

  • Added RequestMetadataFrame and metadata handling for ServiceSwitcher to ensure STT services correctly emit STTMetadataFrame when switching between services. Only the active service's metadata is propagated downstream, switching services triggers the newly active service to re-emit its metadata, and proper frame ordering is maintained at startup. (PR #3637)

  • Added STTMetadataFrame to broadcast STT service latency information at pipeline start.

    • STT services broadcast P99 time-to-final-segment (ttfs_p99_latency) to downstream processors
    • Turn stop strategies automatically configure their STT timeout from this metadata
    • Developers can override ttfs_p99_latency via constructor argument for custom deployments
    • Added measured P99 values for STT providers.
    • See stt-benchmark to measure latency for your configuration (PR #3637)
  • Added support for is_sandbox parameter in LiveAvatarNewSessionRequest to enable sandbox mode for HeyGen LiveAvatar sessions. (PR #3653)

  • Added support for video_settings parameter in LiveAvatarNewSessionRequest to configure video encoding (H264/VP8) and quality levels. (PR #3653)

  • Added OpenAIRealtimeSTTService for real-time streaming speech-to-text using OpenAI's Realtime API WebSocket transcription sessions. Supports local VAD and server-side VAD modes, noise reduction, and automatic reconnection. (PR #3656)

  • Added bulbul:v3-beta TTS model support for Sarvam AI with temperature control and 25 new speaker voices. (PR #3671)

  • Added saaras:v3 STT model support for Sarvam AI with new mode parameter (transcribe, translate, verbatim, translit, codemix) and prompt support. (PR #3671)

  • Added new OpenAI TTS voice options marin and cedar. (PR #3682)

  • Added UserMuteStartedFrame and UserMuteStoppedFrame system frames, and corresponding user-mute-started / user-mute-stopped RTVI messages, so clients can observe when mute strategies activate or deactivate. (PR #3687)

Changed

  • Updated all 30+ TTS service implementations to support context tracking with context_id.

    • Services now generate and propagate context IDs through TTS frames
    • Enables end-to-end tracing of TTS requests through the pipeline (PR #3584)
  • ⚠️ TTSService.run_tts() now requires a context_id parameter for context tracking.

    • Custom TTS service implementations must update their run_tts() signature
    • Before: async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
    • After: async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]: (PR #3584)
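A custom service migrating to the new signature might look like the sketch below. To stay runnable without Pipecat installed, it uses stand-in `Frame` and `AudioFrame` classes and a `MyTTSService` class; all three are hypothetical. Only the `run_tts(self, text, context_id)` signature comes from the entry above.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator


@dataclass
class Frame:
    """Stand-in for a pipeline frame (hypothetical, not Pipecat's Frame)."""


@dataclass
class AudioFrame(Frame):
    """Stand-in audio frame that records the context it belongs to."""

    audio: bytes = b""
    context_id: str = ""


class MyTTSService:
    """Hypothetical custom TTS service using the two-argument signature."""

    async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
        # Tag every frame produced for this request with the same context_id
        # so the request can be traced through the pipeline.
        for word in text.split():
            yield AudioFrame(audio=word.encode("utf-8"), context_id=context_id)


async def collect(text: str, context_id: str) -> list:
    """Drain run_tts() into a list for inspection."""
    return [frame async for frame in MyTTSService().run_tts(text, context_id)]
```

A real service would yield Pipecat's own frame types instead of these stand-ins.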
  • Simplified context aggregators to use frame.append_to_context flag instead of tracking internal state.

    • Cleaner logic in LLMResponseAggregator and LLMResponseUniversalAggregator
    • More consistent behavior across aggregator implementations (PR #3584)
  • Updated timestamps to be cumulative within an agent turn, using flushCompleted message as an indication of when timestamps from the server are reset to 0 (PR #3593)

  • Changed KokoroTTSService to use kokoro-onnx instead of kokoro as the underlying TTS engine. (PR #3612)

  • Improved user turn stop timing in TranscriptionUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy.

    • Timeout now starts on VADUserStoppedSpeakingFrame for tighter, more predictable timing
    • Added support for finalized transcripts (TranscriptionFrame.finalized=True) to trigger earlier
    • Added fallback timeout for edge cases where transcripts arrive without VAD events
    • Removed InterimTranscriptionFrame handling (no longer affects timing) (PR #3637)
  • Improved the accuracy of the UserBotLatencyObserver and UserBotLatencyLogObserver by measuring from the time when the user actually starts speaking. (PR #3637)

  • ⚠️ Renamed timeout parameter to user_speech_timeout in TranscriptionUserTurnStopStrategy. (PR #3637)

  • Updated the VADUserStartedSpeakingFrame to include start_secs and timestamp and VADUserStoppedSpeakingFrame to include stop_secs and timestamp, removing the need to separately handle the SpeechControlParamsFrame for VADParams values. (PR #3637)

  • ⚠️ Renamed TranscriptionUserTurnStopStrategy to SpeechTimeoutUserTurnStopStrategy. The old name is deprecated and will be removed in a future release. (PR #3637)

  • AssemblyAISTTService now automatically configures optimal settings for manual turn detection when vad_force_turn_endpoint=True. This sets end_of_turn_confidence_threshold=1.0 and max_turn_silence=2000 by default, which disables model-based turn detection and reduces latency by relying on external VAD for turn endpoints. Warnings are logged if conflicting settings are detected. (PR #3644)

  • Upgraded the pipecat-ai-small-webrtc-prebuilt package to v2.1.0. (PR #3652)

  • Changed default session mode from "CUSTOM" to "LITE" in HeyGen LiveAvatar integration, with VP8 as the default video encoding. (PR #3653)

  • ⚠️ The VADParams stop_secs default changed from 0.8 seconds to 0.2 seconds. This change both simplifies the developer experience and improves the performance of STT services. With a shorter stop_secs value, STT services using a local VAD can finalize sooner, resulting in faster transcription.

    • SpeechTimeoutUserTurnStopStrategy: control how long to wait for additional user speech using user_speech_timeout (default: 0.6 sec).
    • TurnAnalyzerUserTurnStopStrategy: the turn analyzer automatically adjusts the user wait time based on the audio input. (PR #3659)
  • Moved interruption wait event from per-processor instance state to InterruptionFrame itself. Added InterruptionFrame.complete() to signal when the interruption has fully traversed the pipeline. Custom processors that block or consume an InterruptionFrame before it reaches the pipeline sink must call frame.complete() to avoid stalling push_interruption_task_frame_and_wait(). A warning is logged if completion does not happen within 2 seconds. (PR #3660)
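The idea of a frame that carries its own completion signal can be modelled with a plain `asyncio.Event`. The sketch below is an illustration under assumptions, not Pipecat's implementation: the class is a stand-in, and `wait()`, `consuming_processor()`, and `demo()` are hypothetical names. It shows why a processor that consumes the frame has to call `complete()` itself.

```python
import asyncio


class InterruptionFrame:
    """Stand-in frame that owns its completion event (hypothetical)."""

    def __init__(self) -> None:
        self._done = asyncio.Event()

    def complete(self) -> None:
        # Signal that the interruption has been fully handled.
        self._done.set()

    async def wait(self, timeout: float) -> bool:
        # Return True if complete() was called before the timeout elapsed.
        try:
            await asyncio.wait_for(self._done.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def consuming_processor(frame: InterruptionFrame) -> None:
    """A processor that swallows the frame must complete it itself."""
    await asyncio.sleep(0)  # pretend to do some work
    frame.complete()


async def demo(call_complete: bool, timeout: float = 0.05) -> bool:
    """Wait on a frame that is, or is not, completed by a processor."""
    frame = InterruptionFrame()
    task = asyncio.create_task(consuming_processor(frame)) if call_complete else None
    finished = await frame.wait(timeout)
    if task is not None:
        await task
    return finished
```

When nothing calls `complete()`, the waiter only returns once the timeout expires, which is the stall the entry warns about.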

  • Updated the default model to scribe_v2 for ElevenLabsSTTService. (PR #3664)

  • Changed the DeepgramSTTService default setting for smart_format to False, as agents don't need smart formatting. Disabling this setting also provides a small performance improvement. (PR #3666)

  • Changed FunctionCallCancelFrame to broadcast in both directions for consistency with other function call frames. (PR #3672)

  • Changed default user turn stop strategy from TranscriptionUserTurnStopStrategy to TurnAnalyzerUserTurnStopStrategy with LocalSmartTurnAnalyzerV3. (PR #3689)

  • Renamed RequestMetadataFrame to ServiceSwitcherRequestMetadataFrame and added a service field to target a specific service. The frame is now pushed downstream by services after handling instead of being silently consumed. (PR #3692)

  • Updated SonioxSTTService to set vad_force_turn_endpoint to True. This setting disables the turn detection logic available natively in Soniox; instead, Soniox relies on a local VAD to finalize the transcript. This configuration meaningfully reduces the time to final segment: with it enabled, Soniox outputs a transcript in ~250ms (median). Pipecat enables smart-turn detection by default using the LocalSmartTurnAnalyzerV3. To use the native turn detection logic in Soniox, set vad_force_turn_endpoint to False. (PR #3697)

  • Updated the SonioxSTTService default model to stt-rt-v4. (PR #3697)

  • Updated the default model to async_flash_v1.0 and base URL to https://api.async.com for AsyncAITTSService. (PR #3701)

Deprecated

  • Deprecated UserBotLatencyLogObserver. Use UserBotLatencyObserver directly with its on_latency_measured event handler instead. (PR #3355)

  • Deprecated RTVILLMFunctionCallMessage, RTVILLMFunctionCallMessageData, and RTVIProcessor.handle_function_call(). Use the new llm-function-call-in-progress event sent automatically by RTVIObserver instead. (PR #3630)

Removed

  • ⚠️ Removed timeout parameter from TurnAnalyzerUserTurnStopStrategy. The timeout is now managed internally based on STT latency. (PR #3637)

Fixed

  • Fixed pipeline freeze when InterruptionFrame discards EndFrame or StopFrame by making terminal frames uninterruptible. (PR #3542)

  • Fixed OpenAI LLM stream not being closed on cancellation/exception, which could leak sockets. (PR #3589)

  • Fixed PipelineTask adding duplicate RTVIProcessor and RTVIObserver when they were already provided in the pipeline or observers list. They are now detected and skipped, with appropriate warnings and errors logged for mismatched configurations. (PR #3610)

  • Fixed function call timeout task not being cancelled when the handler completes without calling result_callback or is cancelled externally, which caused RuntimeWarning: coroutine was never awaited. (PR #3616)

  • Fixed sentence splitting for Japanese, Chinese, Korean, and other non-Latin languages in the TTS pipeline. NLTK's sentence tokenizer does not support CJK languages, causing text to accumulate until flush instead of being split at sentence boundaries. Added fallback detection for unambiguous non-Latin sentence-ending punctuation (e.g., 。, ?, !). (PR #3617)
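A fallback of this kind can be sketched in a few lines. This is a minimal illustration, not Pipecat's code: the function name is hypothetical, and the punctuation set (ideographic full stop plus full-width question and exclamation marks) is an assumption.

```python
from typing import List

# Assumed set: ideographic full stop (U+3002), full-width question mark
# (U+FF1F), and full-width exclamation mark (U+FF01).
CJK_SENTENCE_ENDINGS = "\u3002\uff1f\uff01"


def split_cjk_sentences(text: str) -> List[str]:
    """Split text after each CJK sentence-ending mark, keeping the mark."""
    sentences: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in CJK_SENTENCE_ENDINGS:
            sentences.append(current)
            current = ""
    if current:
        # Trailing text with no terminator yet stays as its own chunk.
        sentences.append(current)
    return sentences
```

These marks are unambiguous because, unlike the ASCII period, they do not double as decimal points or abbreviation markers.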

  • Fixed PipelineTask to also call set_bot_ready() when an external RTVIProcessor is provided. (PR #3623)

  • Fixed VADController not broadcasting SpeechControlParamsFrame on startup, which prevented STT services from receiving VAD params needed for TTFB measurement. (PR #3628)

  • Fixed StopAsyncIteration exceptions in parse_telephony_websocket() when WebSocket connections close before sending expected messages. (PR #3629)

  • Fixed WebSocket transport error when broadcasting InputTransportMessageFrame by correctly instantiating the frame with its message parameter. (PR #3635)

  • Fixed orphan OpenTelemetry spans during flow initialization and transitions in tracing. (PR #3649)

  • Fixed SambaNovaLLMService and GoogleLLMOpenAIBetaService streams not being closed on cancellation/exception, which could leak sockets. (PR #3663)

  • Fixed an issue in InworldTTSService where punctuation was pronounced. Now, the InworldTTSService ensures proper spacing between sentences, resolving pronunciation issues. (PR #3667)

  • Fixed ParallelPipeline allowing frames pushed by internal processors to escape during lifecycle frame (StartFrame/EndFrame/CancelFrame) synchronization. These frames are now buffered and flushed after all branches complete. (PR #3668)

  • Fixed issues in Sarvam STT and TTS services: missing event handler registration for VAD signals, Optional[bool] type annotations, WebSocket state cleanup on API errors, and TTS disconnect/reconnection state management. (PR #3671)

  • Fixed RTVIObserver sending duplicate client messages for frames that are broadcast in both directions (e.g. UserStartedSpeakingFrame, FunctionCallResultFrame). (PR #3672)

  • Fixed WebSocket STT services (ElevenLabs, Cartesia, Gladia, Soniox) disconnecting due to idle timeout when no audio is being sent (e.g. when inactive behind a ServiceSwitcher). WebsocketSTTService now provides opt-in silence-based keepalive via keepalive_timeout and keepalive_interval parameters. (PR #3675)

[0.0.101] - 2026-01-30

Added

  • Additions for AICFilter and AICVADAnalyzer:

    • Added model downloading support to AICFilter with model_id and model_download_dir parameters.
    • Added model_path parameter to AICFilter for loading local .aicmodel files.
    • Added unit tests for AICFilter and AICVADAnalyzer. (PR #3408)
  • Added handling for server_content.interrupted signal in the Gemini Live service for faster interruption response in the case where there isn't already turn tracking in the pipeline, e.g. local VAD + context aggregators. When there is already turn tracking in the pipeline, the additional interruption does no harm. (PR #3429)

  • Added new GenesysFrameSerializer for the Genesys AudioHook WebSocket protocol, enabling bidirectional audio streaming between Pipecat pipelines and Genesys Cloud contact center. (PR #3500)

  • Added reached_upstream_types and reached_downstream_types read-only properties to PipelineTask for inspecting current frame filters. (PR #3510)

  • Added add_reached_upstream_filter() and add_reached_downstream_filter() methods to PipelineTask for appending frame types. (PR #3510)

  • Added UserTurnCompletionLLMServiceMixin for LLM services to detect and filter incomplete user turns. When enabled via filter_incomplete_user_turns in LLMUserAggregatorParams, the LLM outputs a turn completion marker at the start of each response: ✓ (complete), ○ (incomplete short), or ◐ (incomplete long). Incomplete turns are suppressed, and configurable timeouts automatically re-prompt the user. (PR #3518)
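Handling such a leading marker could look roughly like the sketch below. The three marker characters come from the entry above; the function name, the status labels, and the choice to treat an unmarked response as complete are assumptions made for illustration.

```python
from typing import Tuple

# Marker characters from the changelog entry; status names are illustrative.
MARKERS = {
    "\u2713": "complete",          # check mark
    "\u25cb": "incomplete_short",  # white circle
    "\u25d0": "incomplete_long",   # circle with left half black
}


def parse_turn_marker(response: str) -> Tuple[str, str]:
    """Return (status, remaining_text) for a response that leads with a marker."""
    stripped = response.lstrip()
    if stripped and stripped[0] in MARKERS:
        return MARKERS[stripped[0]], stripped[1:].lstrip()
    # No marker found: this sketch treats the response as a complete turn.
    return "complete", response
```

A caller would suppress the response text whenever the status is one of the two incomplete values.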

  • Added FrameProcessor.broadcast_frame_instance(frame) method to broadcast a frame instance by extracting its fields and creating new instances for each direction. (PR #3519)

  • PipelineTask now automatically adds RTVIProcessor and registers RTVIObserver when enable_rtvi=True (default), simplifying pipeline setup. (PR #3519)

  • Added RTVIProcessor.create_rtvi_observer() factory method for creating RTVI observers. (PR #3519)

  • Added video_out_codec parameter to TransportParams allowing configuration of the preferred video codec (e.g., "VP8", "H264", "H265") for video output in DailyTransport. (PR #3520)

  • Added location parameter to Google TTS services (GoogleHttpTTSService, GoogleTTSService, GeminiTTSService) for regional endpoint support. (PR #3523)

  • Added new PIPECAT_SMART_TURN_LOG_DATA environment variable, which causes Smart Turn input data to be saved to disk (PR #3525)

  • Added result_callback parameter to UserImageRequestFrame to support deferred function call results. (PR #3571)

  • Added function_call_timeout_secs parameter to LLMService to configure timeout for deferred function calls (defaults to 10.0 seconds). (PR #3571)

  • Added vad_analyzer parameter to LLMUserAggregatorParams. VAD analysis is now handled inside the LLMUserAggregator rather than in the transport, keeping voice activity detection closer to where it is consumed. The vad_analyzer on BaseInputTransport is now deprecated.

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )
    

    (PR #3583)

  • Added VADProcessor for detecting speech in audio streams within a pipeline. Pushes VADUserStartedSpeakingFrame, VADUserStoppedSpeakingFrame, and UserSpeakingFrame downstream based on VAD state changes. (PR #3583)

  • Added VADController for managing voice activity detection state and emitting speech events independently of transport or pipeline processors. (PR #3583)

  • Added local PiperTTSService for offline text-to-speech using Piper voice models. The existing HTTP-based service has been renamed to PiperHttpTTSService. (PR #3585)

  • main() in pipecat.runner.run now accepts an optional argparse.ArgumentParser, allowing bots to define custom CLI arguments accessible via runner_args.cli_args. (PR #3590)

  • Added KokoroTTSService for local text-to-speech synthesis using the Kokoro-82M model. (PR #3595)

Changed

  • Updated AICFilter and AICVADAnalyzer to use aic-sdk ~= 2.0.1. (PR #3408)

  • Improved the STT TTFB (Time To First Byte) measurement, reporting the delay between when the user stops speaking and when the final transcription is received. Note: Unlike traditional TTFB which measures from a discrete request, STT services receive continuous audio input—so we measure from speech end to final transcript, which captures the latency that matters for voice AI applications. In support of this change, added finalized field to TranscriptionFrame to indicate when a transcript is the final result for an utterance. (PR #3495)

  • SarvamSTTService now defaults vad_signals and high_vad_sensitivity to None (omitted from connection parameters), improving latency by ~300ms compared to the previous defaults. (PR #3495)

  • Changed frame filter storage from tuples to sets in PipelineTask. (PR #3510)

  • Changed default Inworld TTS model from inworld-tts-1 to inworld-tts-1.5-max. (PR #3531)

  • FrameSerializer now subclasses from BaseObject to enable event support. (PR #3560)

  • Added support for TTFS in SpeechmaticsSTTService and set the default mode to EXTERNAL to support Pipecat-controlled VAD.

    • Changed dependency to speechmatics-voice[smart]>=0.2.8 (PR #3562)
  • ⚠️ Changed function call handling to use timeout-based completion instead of immediate callback execution.

    • Function calls that defer their results (e.g., UserImageRequestFrame) now use a timeout mechanism
    • The result_callback is invoked automatically when the deferred operation completes or after timeout
    • This change affects examples using UserImageRequestFrame - the result_callback should now be passed to the frame instead of being called immediately (PR #3571)
  • Pipecat runner now uses DAILY_ROOM_URL instead of DAILY_SAMPLE_ROOM_URL. (PR #3582)

  • Updates to GradiumSTTService:

    • Now flushes pending transcriptions when VAD detects the user stopped speaking, improving response latency.
    • GradiumSTTService now supports InputParams for configuring language and delay_in_frames settings. (PR #3587)

Deprecated

  • ⚠️ Deprecated vad_analyzer parameter on BaseInputTransport. Pass vad_analyzer to LLMUserAggregatorParams instead or use VADProcessor in the pipeline. (PR #3583)

Removed

  • Removed deprecated AICFilter parameters: enhancement_level, voice_gain, noise_gate_enable. (PR #3408)

Fixed

  • Fixed an issue where if you were using OpenRouterLLMService with a Gemini model, it wouldn't handle multiple "system" messages as expected (and as we do in GoogleLLMService), which is to convert subsequent ones into "user" messages. Instead, the latest "system" message would overwrite the previous ones. (PR #3406)

  • Transports now properly broadcast InputTransportMessageFrame frames both upstream and downstream instead of only pushing downstream. (PR #3519)

  • Fixed FrameProcessor.broadcast_frame() to deep copy kwargs, preventing shared mutable references between the downstream and upstream frame instances. (PR #3519)
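The reason for the deep copy is easiest to see with a mutable payload. In this sketch, `MessageFrame` and the free function `broadcast_frame()` are hypothetical stand-ins rather than Pipecat's classes; the point is only that each direction gets its own copy of the keyword arguments.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple, Type


@dataclass
class MessageFrame:
    """Stand-in frame with a mutable payload (hypothetical)."""

    message: Dict[str, Any] = field(default_factory=dict)


def broadcast_frame(frame_cls: Type, **kwargs: Any) -> Tuple[Any, Any]:
    """Build one frame per direction, each from its own deep copy of kwargs."""
    downstream = frame_cls(**copy.deepcopy(kwargs))
    upstream = frame_cls(**copy.deepcopy(kwargs))
    return downstream, upstream
```

Without the copies, mutating the downstream frame's payload would also change the upstream frame's payload, because both would reference the same dictionary.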

  • Fixed OpenAI LLM services to emit ErrorFrame on completion timeout, enabling proper error handling and LLMSwitcher failover. (PR #3529)

  • Fixed a logging issue where non-ASCII characters (e.g., Japanese, Chinese, etc.) were being unnecessarily escaped to Unicode sequences when function call occurred. (PR #3536)

  • Fixed how audio tracks are synchronized inside the AudioBufferProcessor to fix timing issues where silence and audio were misaligned between user and bot buffers. (PR #3541)

  • Fixed race condition in OpenAIRealtimeBetaLLMService that could cause an error when truncating the conversation. (PR #3567)

  • Fixed an infinite loop in WebsocketService that blocked the event loop when a remote server closed the connection gracefully. (PR #3574)

  • Fixed LLMUserAggregator and LLMAssistantAggregator not emitting pending transcripts via on_user_turn_stopped and on_assistant_turn_stopped events when the conversation ends (EndFrame) or is cancelled (CancelFrame). (PR #3575)

  • Added missing LiveKitRunnerArguments and LiveKitTransport support in runner utilities to enable LiveKit transport configuration. (PR #3580)

  • Fixed race condition in OpenAIRealtimeLLMService that could cause an error when truncating the conversation. (PR #3581)

  • Fixed PiperHttpTTSService (formerly PiperTTSService) to resample audio output based on the model's sample rate parsed from the WAV header. (PR #3585)

  • Fixed UserTurnController to reset user turn timeout when interim transcriptions are received. (PR #3594)

  • Fixed an issue in the IVRNavigator where the TextFrames pushed had incorrect spacing. Now, the internal IVRProcessor pushes AggregatedTextFrames when in conversation mode. This allows for controlling spacing of the outputted, aggregated text. (PR #3604)

  • Fixed GeminiLiveLLMService transcription timeout handler not being scheduled by yielding to the event loop after task creation. (PR #3605)

[0.0.100] - 2026-01-20

Added

  • Added Hathora service to support Hathora-hosted TTS and STT models (only non-streaming) (PR #3169)

  • Added CambTTSService, using Camb.ai's TTS integration with MARS models (mars-flash, mars-pro, mars-instruct) for high-quality text-to-speech synthesis. (PR #3349)

  • Added the additional_headers param to WebsocketClientParams, allowing WebsocketClientTransport to send custom headers on connect, for cases such as authentication. (PR #3461)

  • Added UserIdleController for detecting user idle state, integrated into LLMUserAggregator and UserTurnProcessor via optional user_idle_timeout parameter. Emits on_user_turn_idle event for application-level handling. Deprecated UserIdleProcessor in favor of the new compositional approach. (PR #3482)

  • Added on_user_mute_started and on_user_mute_stopped event handlers to LLMUserAggregator for tracking user mute state changes. (PR #3490)

Changed

  • Enhanced interruption handling in AsyncAITTSService by supporting multi-context WebSocket sessions for more robust context management. (PR #3287)

  • Throttle UserSpeakingFrame to broadcast at most every 200ms instead of on every audio chunk, reducing frame processing overhead during user speech. (PR #3483)
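Throttling of this sort can be expressed as a small gate keyed on the time of the last emission. The class below is a generic sketch with hypothetical names, not the code Pipecat uses; timestamps are passed in as integer milliseconds so the behavior is easy to verify.

```python
from typing import Optional


class Throttle:
    """Allow an action at most once per interval_ms milliseconds."""

    def __init__(self, interval_ms: int) -> None:
        self._interval_ms = interval_ms
        self._last_ms: Optional[int] = None  # time of the last allowed action

    def allow(self, now_ms: int) -> bool:
        """Return True if enough time has passed since the last allowed action."""
        if self._last_ms is None or now_ms - self._last_ms >= self._interval_ms:
            self._last_ms = now_ms
            return True
        return False
```

With a 200ms interval and audio chunks arriving every 20ms, only about one chunk in ten would trigger a broadcast.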

Deprecated

  • Deprecated pipecat.turns.mute (introduced in Pipecat 0.0.99) in favor of pipecat.turns.user_mute, for consistency with other package names. (PR #3479)

Fixed

  • Corrected TTFB metric calculation in AsyncAIHttpTTSService. (PR #3287)

  • Fixed an issue where the "bot-llm-text" RTVI event would not fire for realtime (speech-to-speech) services:

    • AWSNovaSonicLLMService
    • GeminiLiveLLMService
    • OpenAIRealtimeLLMService
    • GrokRealtimeLLMService

    The issue was that these services weren't pushing LLMTextFrames. Now they do. (PR #3446)

  • Fixed an issue where on_user_turn_stop_timeout could fire while a user is talking when using ExternalUserTurnStrategies. (PR #3454)

  • Fixed an issue where user turn start strategies were not being reset after a user turn started, causing incorrect strategy behavior. (PR #3455)

  • Fixed MinWordsUserTurnStartStrategy to not aggregate transcriptions, preventing incorrect turn starts when words are spoken with pauses between them. (PR #3462)

  • Fixed an issue where Grok Realtime would error out when running with SmallWebRTC transport. (PR #3480)

  • Fixed a Mem0MemoryService issue where passing async_mode: true was causing an error. See https://docs.mem0.ai/platform/features/async-mode-default-change. (PR #3484)

  • Fixed AWSNovaSonicLLMService.reset_conversation(), which would previously error out. Now it successfully reconnects and "rehydrates" from the context object. (PR #3486)

  • Fixed AzureTTSService transcript formatting issues:

    • Punctuation now appears without extra spaces (e.g., "Hello!" instead of "Hello !")
    • CJK languages (Chinese, Japanese, Korean) no longer have unwanted spaces between characters (PR #3489)
  • Fixed an issue where UninterruptibleFrame frames would not be preserved in some cases. (PR #3494)

  • Fixed memory leak in LiveKitTransport when video_in_enabled is False. (PR #3499)

  • Fixed an issue in AIService where unhandled exceptions in start(), stop(), or cancel() implementations would prevent process_frame() from continuing, so StartFrame, EndFrame, or CancelFrame was never pushed downstream and the pipeline did not start or stop properly. (PR #3503)

  • Moved NVIDIATTSService and NVIDIASTTService client initialization from constructor to start() for better error handling. (PR #3504)

  • Optimized NVIDIATTSService to process incoming audio frames immediately. (PR #3509)

  • Optimized NVIDIASTTService by removing unnecessary queue and task. (PR #3509)

  • Fixed a CambTTSService issue where client was being initialized in the constructor which wouldn't allow for proper Pipeline error handling. (PR #3511)

[0.0.99] - 2026-01-13

Added

  • Introducing user turn strategies. User turn strategies indicate when the user turn starts or stops. In conversational agents, these are often referred to as start/stop speaking or turn-taking plans or policies.

    User turn start strategies indicate when the user starts speaking (e.g. using VAD events or when a user says one or more words).

    User turn stop strategies indicate when the user stops speaking (e.g. using an end-of-turn detection model or by observing incoming transcriptions).

    A list of strategies can be specified for both start and stop; strategies are evaluated in order until one evaluates to true.

    Available user turn start strategies:

    • VADUserTurnStartStrategy
    • TranscriptionUserTurnStartStrategy
    • MinWordsUserTurnStartStrategy
    • ExternalUserTurnStartStrategy

    Available user turn stop strategies:

    • TranscriptionUserTurnStopStrategy
    • TurnAnalyzerUserTurnStopStrategy
    • ExternalUserTurnStopStrategy

    The default strategies are:

    • start: [VADUserTurnStartStrategy, TranscriptionUserTurnStartStrategy]
    • stop: [TranscriptionUserTurnStopStrategy]

    Turn strategies are configured when setting up LLMContextAggregatorPair. For example:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_turn_strategies=UserTurnStrategies(
                stop=[
                    TurnAnalyzerUserTurnStopStrategy(
                        turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams())
                    )
                ],
            )
        ),
    )
    

    In order to use the user turn strategies you must update to the new universal LLMContext and LLMContextAggregatorPair. (PR #3045)
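The ordered, first-match evaluation described above can be sketched independently of Pipecat. Here strategies are modelled as plain predicates over an event name; `first_matching()`, `vad_started()`, and `transcription_received()` are hypothetical names, and real strategies would inspect frames rather than strings.

```python
from typing import Callable, List, Optional


def vad_started(event: str) -> bool:
    """Hypothetical start strategy: fires on a VAD start event."""
    return event == "vad_start"


def transcription_received(event: str) -> bool:
    """Hypothetical start strategy: fires when a transcription arrives."""
    return event == "transcription"


def first_matching(strategies: List[Callable[[str], bool]], event: str) -> Optional[int]:
    """Return the index of the first strategy that fires, or None."""
    for index, strategy in enumerate(strategies):
        if strategy(event):
            return index  # later strategies are not consulted
    return None
```

Because evaluation stops at the first match, list order acts as priority.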

  • Added RNNoiseFilter for real-time noise suppression using RNNoise neural network via pyrnnoise library. (PR #3205)

  • Added GrokRealtimeLLMService for xAI's Grok Voice Agent API with real-time voice conversations:

    • Support for real-time audio streaming with WebSocket connection
    • Built-in server-side VAD (Voice Activity Detection)
    • Multiple voice options: Ara, Rex, Sal, Eve, Leo
    • Built-in tools support: web_search, x_search, file_search
    • Custom function calling with standard Pipecat tools schema
    • Configurable audio formats (PCM at 8kHz-48kHz) (PR #3267)
  • Added an approximation of TTFB for Ultravox. (PR #3268)

  • Added a new AudioContextTTSService to the TTS service base classes. The AudioContextWordTTSService now inherits from AudioContextTTSService and WebsocketWordTTSService. (PR #3289)

  • LLMUserAggregator now exposes the following events:

    • on_user_turn_started: triggered when a user turn starts
    • on_user_turn_stopped: triggered when a user turn ends
    • on_user_turn_stop_timeout: triggered when a user turn does not stop and times out (PR #3291)
  • Introducing user mute strategies. User mute strategies indicate when user input should be muted based on the current system state.

    In conversational agents, user mute strategies are used to prevent user input from interrupting bot speech, tool execution, or other critical system operations.

    A list of strategies can be specified; all strategies are evaluated for every frame so that each strategy can maintain its internal state. A user frame is muted if any of the configured strategies indicates it should be muted.

    Available user mute strategies:

    • FirstSpeechUserMuteStrategy
    • MuteUntilFirstBotCompleteUserMuteStrategy
    • AlwaysUserMuteStrategy
    • FunctionCallUserMuteStrategy

    User mute strategies replace the legacy STTMuteFilter and provide a more flexible and composable approach to muting user input.

    User mute strategies are configured when setting up the LLMContextAggregatorPair. For example:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_mute_strategies=[
                FirstSpeechUserMuteStrategy(),
            ]
        ),
    )
    

    In order to use user mute strategies you should update to the new universal LLMContext and LLMContextAggregatorPair. (PR #3292)
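Mute strategies differ from turn strategies in that every strategy sees every event, even when an earlier one already says "mute". A minimal sketch of that rule follows; the two strategy classes and `should_mute()` are hypothetical stand-ins that operate on event names instead of frames.

```python
from typing import List


class MuteUntilBotSpoke:
    """Hypothetical strategy: mute until the bot finishes speaking once."""

    def __init__(self) -> None:
        self.bot_done = False

    def process(self, event: str) -> bool:
        if event == "bot_stopped":
            self.bot_done = True
        return not self.bot_done


class MuteDuringFunctionCall:
    """Hypothetical strategy: mute while a function call is in progress."""

    def __init__(self) -> None:
        self.in_call = False

    def process(self, event: str) -> bool:
        if event == "call_start":
            self.in_call = True
        elif event == "call_end":
            self.in_call = False
        return self.in_call


def should_mute(strategies: List, event: str) -> bool:
    """Feed the event to every strategy, then mute if any of them says so."""
    # Build the full list first: any() over a generator would short-circuit
    # and leave later strategies with stale state.
    results = [strategy.process(event) for strategy in strategies]
    return any(results)
```

Evaluating all strategies first keeps each strategy's internal state current, which is why the function call strategy still tracks `call_start` while the first strategy is already muting.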

  • Added use_ssl parameter to NvidiaSTTService, NvidiaSegmentedSTTService and NvidiaTTSService. (PR #3300)

  • Added enable_interruptions constructor argument to all user turn strategies. This tells the LLMUserAggregator to push or not push an InterruptionFrame. (PR #3316)

  • Added split_sentences parameter to SpeechmaticsSTTService to control sentence splitting behavior for finals on sentence boundaries. (PR #3328)

  • Added word-level timestamp support to AzureTTSService for accurate text-to-audio synchronization. (PR #3334)

  • Added pronunciation_dict_id parameter to CartesiaTTSService.InputParams and CartesiaHttpTTSService.InputParams to support Cartesia's pronunciation dictionary feature for custom pronunciations. (PR #3346)

  • Added support for using the HeyGen LiveAvatar API with the HeyGenTransport (see https://www.liveavatar.com/). (PR #3357)

  • Added image support to OpenAIRealtimeLLMService via InputImageRawFrame:

    • New start_video_paused parameter to control initial video input state
    • New video_frame_detail parameter to set image processing quality ("auto", "low", or "high"). This corresponds to OpenAI Realtime's image_detail parameter.
    • set_video_input_paused() method to pause/resume video input at runtime
    • set_video_frame_detail() method to adjust video frame quality dynamically
    • Automatic rate limiting (1 frame per second) to prevent API overload (PR #3360)
  • Added UserTurnProcessor, a frame processor built on UserTurnController that pushes UserStartedSpeakingFrame and UserStoppedSpeakingFrame frames and interruptions based on the controller's user turn strategies. (PR #3372)

  • Added UserTurnController to manage user turns. It emits on_user_turn_started, on_user_turn_stopped, and on_user_turn_stop_timeout events, and can be integrated into processors to detect and handle user turns. LLMUserAggregator and UserTurnProcessor are implemented using this controller. (PR #3372)

  • Added should_interrupt property to DeepgramFluxSTTService, DeepgramSTTService, and SpeechmaticsSTTService to configure whether the bot should be interrupted when the external service detects user speech. (PR #3374)

  • LLMAssistantAggregator now exposes the following events:

    • on_assistant_turn_started: triggered when the assistant turn starts
    • on_assistant_turn_stopped: triggered when the assistant turn ends
    • on_assistant_thought: triggered when there's an assistant thought available (PR #3385)
  • Added KrispVivaTurn analyzer for end of turn detection using the Krisp VIVA SDK (requires krisp_audio). (PR #3391)

  • Added support for setting up a pipeline task from external files. You can now register custom pipeline task setup files by setting the PIPECAT_SETUP_FILES environment variable. This variable should contain a colon-separated list of Python files (e.g. export PIPECAT_SETUP_FILES="setup1.py:setup.py:..."). Each file must define a function with the following signature:

    async def setup_pipeline_task(task: PipelineTask):
        ...
    

    (PR #3397)
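For intuition, a loader for such setup files could be written with `importlib` as below. This is a sketch under assumptions, not Pipecat's loader: `split_setup_files()` and `run_setup_files()` are hypothetical names, and `task` can be any object here. Only the colon-separated format and the `setup_pipeline_task(task)` signature come from the entry above.

```python
import asyncio
import importlib.util
from typing import Any, List


def split_setup_files(value: str) -> List[str]:
    """Split a colon-separated PIPECAT_SETUP_FILES value into file paths."""
    return [path for path in value.split(":") if path]


async def run_setup_files(paths: List[str], task: Any) -> None:
    """Import each file and await its setup_pipeline_task(task) coroutine."""
    for index, path in enumerate(paths):
        # Give each file a unique module name so the files do not collide.
        spec = importlib.util.spec_from_file_location(f"_setup_file_{index}", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        await module.setup_pipeline_task(task)
```

Files run in the order listed, so a later setup file can build on what an earlier one configured.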

  • Added a keepalive task for InworldTTSService to keep the service connected in the event of no generations for longer periods of time. (PR #3403)

  • Added enable_vad to Params for use in the GladiaSTTService. When enabled, GladiaSTTService acts as the turn controller, emitting UserStartedSpeakingFrame, UserStoppedSpeakingFrame, and optionally InterruptionFrame. (PR #3404)

  • Added should_interrupt property to GladiaSTTService to configure whether the bot should be interrupted when the external service detects user speech. (PR #3404)

  • Added VonageFrameSerializer for the Vonage Video API Audio Connector WebSocket protocol. (PR #3410)

  • Added append_trailing_space parameter to TTSService to automatically append a trailing space to text before sending to TTS, helping prevent some services from vocalizing trailing punctuation. (PR #3424)

Changed

  • Updated ElevenLabsRealtimeSTTService to accept the include_language_detection parameter to detect language.

      stt = ElevenLabsRealtimeSTTService(
          api_key=os.getenv("ELEVENLABS_API_KEY"),
          include_language_detection=True
      )
    

    (PR #3216)

  • Updated SpeechmaticsSTTService to use the new Python Voice SDK, which adds improved VAD and Smart Turn capabilities and brings dramatic latency improvements without any impact on accuracy. Use the turn_detection_mode parameter to control the endpointing of speech, with TurnDetectionMode.EXTERNAL (default), TurnDetectionMode.ADAPTIVE, or TurnDetectionMode.SMART_TURN.

        stt = SpeechmaticsSTTService(
            api_key=os.getenv("SPEECHMATICS_API_KEY"),
            params=SpeechmaticsSTTService.InputParams(
                language=Language.EN,
                turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
                speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
            ),
        )
    

    (PR #3225)

  • daily-python updated to 0.23.0. (PR #3257)

  • TranscriptionFrame and InterimTranscriptionFrame produced by DailyTransport now include the transport source (i.e., the originating audio track). (PR #3257)

  • Updates to Inworld TTS services:

    • Improved InworldTTSService's websocket implementation to flush and close contexts more reliably, which improves handling of long inputs.
    • Improved docstrings for InworldTTSService and InworldHttpTTSService. (PR #3288)
  • Improved the error handling and reconnection logic for WebsocketServer by distinguishing between errors when disconnecting and websocket communication errors. (PR #3392)

  • Updated DeepgramSTTService to push user started/stopped speaking and interruption frames when vad_enabled is set to true. This centralizes the frames into the service, removing the need to have your application code handle Deepgram's events and push these frames. (PR #3314)

  • Added encoding validation to DeepgramTTSService to prevent unsupported encodings from reaching the API. The service now raises ValueError at initialization with a clear error message. (PR #3329)

  • Updated read_audio_frame & read_video_frame methods in SmallWebRTCClient to check if the track is enabled before logging a warning. (PR #3336)

  • Updated CartesiaTTSService to support setting language=None, resulting in Cartesia auto-detecting the language of the conversation. (PR #3366)

  • The bundled Smart Turn weights are now updated to v3.2, which has better handling of short utterances, and is more robust against background noise. (PR #3367)

  • Updated SpeechmaticsSTTService dependency to speechmatics-voice[smart]>=0.2.6 (PR #3371)

  • Smart Turn now takes into account vad_start_seconds when buffering audio, meaning that the start of the turn audio is not cut off. This improves accuracy for short utterances.

  • The default value of pre_speech_ms is now set to 500ms for Smart Turn. (PR #3377)

  • Improved Krisp SDK management to allow KrispVivaTurn and KrispVivaFilter to share a single SDK instance within the same process. (PR #3391)

  • Updated default model for GroqTTSService to canopylabs/orpheus-v1-english and voice ID to autumn. (PR #3399)

  • Enhanced FastAPIWebsocketTransport with optional protocol-level audio packetization via the fixed_audio_packet_size parameter to support media endpoints requiring strict framing and real-time pacing. (PR #3410)
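
    As a rough illustration of what fixed-size packetization involves, the sketch below buffers incoming audio and releases packets of exactly one size. FixedSizePacketizer and its methods are hypothetical names invented for this example and are not part of FastAPIWebsocketTransport; real-time pacing between packets is left out.

```python
class FixedSizePacketizer:
    """Hypothetical helper: buffers audio and releases packets of exactly packet_size bytes."""

    def __init__(self, packet_size: int):
        if packet_size <= 0:
            raise ValueError("packet_size must be positive")
        self._packet_size = packet_size
        self._buffer = bytearray()

    def feed(self, audio: bytes) -> list:
        """Add audio and return every complete packet that is now available."""
        self._buffer.extend(audio)
        packets = []
        while len(self._buffer) >= self._packet_size:
            packets.append(bytes(self._buffer[: self._packet_size]))
            del self._buffer[: self._packet_size]
        return packets

    def pending(self) -> int:
        """Number of bytes still waiting for enough audio to fill a packet."""
        return len(self._buffer)


# 320 bytes is 20 ms of 8 kHz, 16-bit mono audio.
packetizer = FixedSizePacketizer(packet_size=320)
first = packetizer.feed(b"\x00" * 500)   # one full packet, 180 bytes left over
second = packetizer.feed(b"\x00" * 500)  # 680 bytes buffered, two more packets
```

    Feeding 500 bytes twice yields one packet and then two, with 40 bytes held back until more audio arrives.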

  • DeepgramTTSService and RimeTTSService now set append_trailing_space to True to prevent punctuation (e.g., “dot”) from being pronounced. (PR #3424)

  • Updated GeminiLiveLLMService to push LLMThoughtStartFrame, LLMThoughtTextFrame, and LLMThoughtEndFrame when the model returns thought content. (PR #3431)

Deprecated

  • pipecat.audio.interruptions.MinWordsInterruptionStrategy is deprecated. Use pipecat.turns.user_start.MinWordsUserTurnStartStrategy with LLMUserAggregator's new user_turn_strategies parameter instead. (PR #3045)

  • FrameProcessor.interruption_strategies is deprecated, use LLMUserAggregator's new user_turn_strategies parameter instead. (PR #3045)

  • The LLMUserAggregatorParams and LLMAssistantAggregatorParams classes in pipecat.processors.aggregators.llm_response are now deprecated. Use the new universal LLMContext and LLMContextAggregatorPair instead. (PR #3045)

  • Deprecated the emulated field in the UserStartedSpeakingFrame and UserStoppedSpeakingFrame frames. (PR #3045)

  • EmulateUserStartedSpeakingFrame and EmulateUserStoppedSpeakingFrame frames are deprecated. (PR #3045)

  • ⚠️ TransportParams.turn_analyzer is deprecated and might result in unexpected behavior, use LLMUserAggregator's new user_turn_strategies parameter instead. (PR #3045)

  • For SpeechmaticsSTTService, the end_of_utterance_mode parameter is deprecated. Use the new turn_detection_mode parameter instead, with TurnDetectionMode.EXTERNAL, TurnDetectionMode.ADAPTIVE, or TurnDetectionMode.SMART_TURN. The enable_vad parameter is also deprecated and is inferred from the turn_detection_mode. (PR #3225)

  • OpenAILLMContext and its associated machinery (context aggregators, etc.) are now deprecated in favor of the universal LLMContext and its associated machinery.

    From the developer's point of view, switching to using LLMContext machinery will usually be a matter of going from this:

    context = OpenAILLMContext(messages, tools)
    context_aggregator = llm.create_context_aggregator(context)
    

    To this:

    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    (PR #3263)

  • STTMuteFilter is deprecated and will be removed in a future version. Use LLMUserAggregator's new user_mute_strategies instead. (PR #3292)

  • FrameProcessor.interruptions_allowed is now deprecated, use LLMUserAggregator's new parameter user_mute_strategies instead. (PR #3297)

  • PipelineParams.allow_interruptions is now deprecated, use LLMUserAggregator's new parameter user_turn_strategies instead. For example, to disable interruptions but still get user turns you can do:

    context_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_turn_strategies=UserTurnStrategies(
                start=[TranscriptionUserTurnStartStrategy(enable_interruptions=False)],
            ),
        ),
    )
    

    (PR #3297)

  • TranscriptProcessor and related data classes and frames (TranscriptionMessage, ThoughtTranscriptionMessage, TranscriptionUpdateFrame) are deprecated. Use LLMUserAggregator's and LLMAssistantAggregator's new events (on_user_turn_stopped and on_assistant_turn_stopped) instead. (PR #3385)

  • Deprecated support for the vad_events LiveOptions in DeepgramSTTService. Instead, use a local Silero VAD for VAD events. Additionally, deprecated should_interrupt which will be removed along with vad_events support in a future release. (PR #3386)

  • Loading external observers from files is deprecated, use the new pipeline task setup files and PIPECAT_SETUP_FILES environment variable instead. (PR #3397)

Fixed

  • Improved error handling in ElevenLabsRealtimeSTTService (PR #3233)

  • Fixed an issue in ElevenLabsRealtimeSTTService causing an infinite loop that blocks the process if the websocket disconnects due to an error (PR #3233)

  • Fixed a bug in STTMuteFilter where the user was not always muted during function calls, especially when there were multiple simultaneous calls. (PR #3292)

  • Fixed a RNNoiseFilter issue that would cause a "[Errno 12] Cannot allocate memory" error when processing silence audio frames. (PR #3322)

  • Updated SpeechmaticsSTTService for version 0.0.99+:

    • Fixed SpeechmaticsSTTService to listen for VADUserStoppedSpeakingFrame in order to finalize transcription.
    • Default to TurnDetectionMode.FIXED for Pipecat-controlled end of turn detection.
    • Only emit VAD + interruption frames if VAD is enabled within the plugin (modes other than TurnDetectionMode.FIXED or TurnDetectionMode.EXTERNAL). (PR #3328)
  • Fixed an issue with function calling where a handler failing to invoke its result callback could leave the context stuck in IN_PROGRESS, causing LLM inference for subsequent function call results to block while waiting on the unresolved call. (PR #3343)

  • Fixed an issue with DeepgramTTSService where the model would output "Dot" instead of a period in some circumstances. (PR #3345)

  • Fixed an issue in traced_stt where model_name in OpenTelemetry appears as unknown. (PR #3351)

  • Fixed an issue in GeminiLiveLLMService where TranscriptionFrames were occasionally not pushed. (PR #3356)

  • Fixed potential memory leaks and initialization issues in KrispVivaFilter by improving SDK lifecycle management. (PR #3391)

  • Fixed timing issue in BaseOutputTransport where the bot speaking flag was set after awaiting, allowing the event loop to re-enter the method before the guard was set. (PR #3400)
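
    The general pattern behind this kind of fix can be sketched with plain asyncio. SpeakingTracker and its methods are hypothetical stand-ins, not the BaseOutputTransport code; the point is only that a guard flag set after an await lets a second task slip past the check.

```python
import asyncio


class SpeakingTracker:
    """Hypothetical component that should announce 'started speaking' only once."""

    def __init__(self):
        self.speaking = False
        self.announcements = 0

    async def _announce(self):
        await asyncio.sleep(0)  # yields to the event loop, as pushing a frame might
        self.announcements += 1

    async def start_racy(self):
        if self.speaking:
            return
        await self._announce()  # another task can pass the check while we wait here
        self.speaking = True    # guard is set too late

    async def start_safe(self):
        if self.speaking:
            return
        self.speaking = True    # guard is set before the first await
        await self._announce()


async def count_announcements(method_name: str) -> int:
    """Call the chosen method from two concurrent tasks and count announcements."""
    tracker = SpeakingTracker()
    method = getattr(tracker, method_name)
    await asyncio.gather(method(), method())
    return tracker.announcements


racy_count = asyncio.run(count_announcements("start_racy"))
safe_count = asyncio.run(count_announcements("start_safe"))
```

    With two concurrent callers, the racy variant announces twice while the safe variant announces once.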

  • Fixed parallel function calling when using Gemini thinking. (PR #3420)

  • Fixed an issue in traced_llm where model_name in OpenTelemetry appears as unknown. (PR #3422)

  • Fixed an issue in traced_tts, traced_gemini_live, and traced_openai_realtime where model_name in OpenTelemetry appears as unknown. (PR #3428)

  • Fixed request_image_frame (for backwards compatibility) and restored function-call-related fields in UserImageRequestFrame and UserImageRawFrame, preventing a case where adding a non-LLM message to the context could trigger duplicate LLM inferences (on image arrival and on function-call result), potentially causing an infinite inference loop. (PR #3430)

  • Fixed LLMContext.create_audio_message() by correcting an internal helper that was incorrectly declared async while being run in asyncio.to_thread(). (PR #3435)
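
    The class of bug described here is easy to reproduce with the standard library alone. In the sketch below, encode_async and encode_sync are hypothetical helpers, not Pipecat internals: asyncio.to_thread() calls the function in a worker thread, so an async function only produces an un-awaited coroutine object there.

```python
import asyncio
import base64
import inspect


async def encode_async(data: bytes) -> str:
    # Declared async by mistake: calling it only creates a coroutine object.
    return base64.b64encode(data).decode("ascii")


def encode_sync(data: bytes) -> str:
    # A plain function does the work in the worker thread and returns the result.
    return base64.b64encode(data).decode("ascii")


async def main():
    wrong = await asyncio.to_thread(encode_async, b"audio")
    got_coroutine = inspect.iscoroutine(wrong)
    wrong.close()  # avoid a "coroutine was never awaited" warning
    right = await asyncio.to_thread(encode_sync, b"audio")
    return got_coroutine, right


got_coroutine, encoded = asyncio.run(main())
```

    The first call hands back a coroutine object instead of encoded data; the second returns the base64 string as intended.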

Other

  • Added 52-live-transcription.py foundational example demonstrating live transcription and translation from English to Spanish. In this example, the bot is not interruptible: as the user continues speaking, English transcriptions are queued, and the bot continuously translates and speaks each queued sentence in Spanish without being interrupted by new user speech. (PR #3316)

  • Added a new foundational example 53-concurrent-llm-evaluation.py that shows how to use UserTurnProcessor. (PR #3372)

  • Added a new foundational example 28-user-assistant-turns.py that shows how to use the new LLMUserAggregator and LLMAssistantAggregator events to gather a conversation transcript. (PR #3385)

[0.0.98] - 2025-12-17

Added

  • Added RimeNonJsonTTSService which supports non-JSON streaming mode. This new class supports websocket streaming for the Arcana model. (PR #3085)

  • Added additional functionality related to "thinking", for Google and Anthropic LLMs.

    1. New typed parameters for Google and Anthropic LLMs that control the models' thinking behavior (like how much thinking to do, and whether to output thoughts or thought summaries):
      • AnthropicLLMService.ThinkingConfig
      • GoogleLLMService.ThinkingConfig
    2. New frames for representing thoughts output by LLMs:
      • LLMThoughtStartFrame
      • LLMThoughtTextFrame
      • LLMThoughtEndFrame
    3. A generic mechanism for recording LLM thoughts to context, used specifically to support Anthropic, whose thought signatures are expected to appear alongside the text of the thoughts within assistant context messages. See:
      • LLMThoughtEndFrame.signature
      • LLMAssistantAggregator handling of the above field
      • AnthropicLLMAdapter handling of "thought" context messages
    4. Google-specific logic for inserting thought signatures into the context, to help maintain thinking continuity in a chain of LLM calls. See:
      • GoogleLLMService sending LLMMessagesAppendFrames to add LLM-specific "thought_signature" messages to context
      • GeminiLLMAdapter handling of "thought_signature" messages
    5. An expansion of TranscriptProcessor to process LLM thoughts in addition to user and assistant utterances. See:
      • TranscriptProcessor(process_thoughts=True) (defaults to False)
      • ThoughtTranscriptionMessage, which is now also emitted with the "on_transcript_update" event (PR #3175)
  • Data and control frames can now be marked as non-interruptible by using the UninterruptibleFrame mixin. Frames marked as UninterruptibleFrame will not be interrupted during processing, and any queued frames of this type will be retained in the internal queues. This is useful when you need ordered frames (data or control) that should not be discarded or cancelled due to interruptions. (PR #3189)
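
    A minimal sketch of the idea, assuming a queue that is flushed on interruption. Frame, Uninterruptible, ResultFrame, and flush_on_interruption are hypothetical names for this example and do not reproduce Pipecat's frame classes or queue handling.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    name: str


class Uninterruptible:
    """Marker mixin: frames carrying it survive an interruption."""


@dataclass
class ResultFrame(Frame, Uninterruptible):
    pass


def flush_on_interruption(queue: deque) -> deque:
    """Drop queued frames, keeping (in order) only the uninterruptible ones."""
    return deque(f for f in queue if isinstance(f, Uninterruptible))


queue = deque([Frame("text-1"), ResultFrame("tool-result"), Frame("text-2")])
remaining = flush_on_interruption(queue)
```

    After the flush, only the frame marked with the mixin is still queued.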

  • Added on_conversation_detected event to VoicemailDetector. (PR #3207)

  • Added x-goog-api-client header with Pipecat's version to all Google services' requests. (PR #3208)

  • Added support for the HeyGen LiveAvatar API (see https://www.liveavatar.com/). (PR #3210)

  • Added to AWSNovaSonicLLMService functionality related to the new (and now default) Nova 2 Sonic model ("amazon.nova-2-sonic-v1:0"):

    • Added the endpointing_sensitivity parameter to control how quickly the model decides the user has stopped speaking.
    • Made the assistant-response-trigger hack a no-op. It's only needed for the older Nova Sonic model. (PR #3212)
  • Ultravox Realtime is now a supported speech-to-speech service.

    • Added UltravoxRealtimeLLMService for the integration.
    • Added 49-ultravox-realtime.py example (with tool calling). (PR #3227)
  • Added Daily PSTN dial-in support to the development runner with --dialin flag. This includes:

    • /daily-dialin-webhook endpoint that handles incoming Daily PSTN webhooks
    • Automatic Daily room creation with SIP configuration
    • DialinSettings and DailyDialinRequest types in pipecat.runner.types for type-safe dial-in data
    • The runner now mimics Pipecat Cloud's dial-in webhook handling for local development (PR #3235)
  • Added Gladia session id to logs for GladiaSTTService. (PR #3236)

  • Added InworldHttpTTSService which uses Inworld's HTTP based TTS service in either streaming or non-streaming mode. Note: This class was previously named InworldTTSService. (PR #3239)

  • Added language_hints_strict parameter to SonioxSTTService to strictly enforce language hints. This ensures that transcription occurs in the specified language. (PR #3245)

  • Added Pipecat library version info to the about field in the bot-ready RTVI message. (PR #3248)

  • Added VisionFullResponseStartFrame, VisionFullResponseEndFrame and VisionTextFrame. These are used by vision services similar to LLM services. (PR #3252)

Changed

  • FunctionCallInProgressFrame and FunctionCallResultFrame have changed from system frames to a control frame and a data frame, respectively, and are now both marked as UninterruptibleFrame. (PR #3189)

  • UserBotLatencyLogObserver now uses VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame to determine latency from user stopped speaking to bot started speaking. (PR #3206)

  • Updated HeyGenVideoService and HeyGenTransport to support both HeyGen APIs (Interactive Avatar and Live Avatar). Using them is as simple as specifying the service_type when creating the HeyGenVideoService and the HeyGenTransport:

    heyGen = HeyGenVideoService(
        api_key=os.getenv("HEYGEN_LIVE_AVATAR_API_KEY"),
        service_type=ServiceType.LIVE_AVATAR,
        session=session,
    )
    

    (PR #3210)

  • Made "amazon.nova-2-sonic-v1:0" the new default model for AWSNovaSonicLLMService. (PR #3212)

  • Updated the run_inference methods in the LLM service classes (AnthropicLLMService, AWSBedrockLLMService, GoogleLLMService, and OpenAILLMService and its base classes) to use the provided LLM configuration parameters. (PR #3214)

  • Updated default models for:

    • GeminiLiveLLMService to gemini-2.5-flash-native-audio-preview-12-2025.
    • GeminiLiveVertexLLMService to gemini-live-2.5-flash-native-audio. (PR #3228)
  • Changed the reason field in EndFrame, CancelFrame, EndTaskFrame, and CancelTaskFrame from str to Any to indicate that it can hold values other than strings. (PR #3231)

  • Updated websocket STT services to use the WebsocketSTTService base class. This base class manages the websocket connection and handles reconnects. Updated services:

    • AssemblyAISTTService
    • AWSTranscribeSTTService
    • GladiaSTTService
    • SonioxSTTService (PR #3236)
  • Changed Inworld's TTS service implementations:

    • Previously, the HTTP implementation was named InworldTTSService. That has been moved to InworldHttpTTSService. This service now supports word-timestamp alignment data in both streaming and non-streaming modes.
    • Updated the InworldTTSService class to use Inworld's Websocket API. This class now has support for word-timestamp alignment data and tracks contexts for each user turn. (PR #3239)
  • ⚠️ Breaking change: WordTTSService.start_word_timestamps() and WordTTSService.reset_word_timestamps() are now async. (PR #3240)

  • Updated the current RTVI version to 1.1.0 to reflect recent additions and deprecations.

    • New RTVI Messages: send-text and bot-output
    • Deprecated Messages: append-to-context and bot-transcription (PR #3248)
  • MoondreamService now pushes VisionFullResponseStartFrame, VisionFullResponseEndFrame and VisionTextFrame. (PR #3252)

Deprecated

  • FalSmartTurnAnalyzer and LocalSmartTurnAnalyzer are deprecated and will be removed in a future version. Use LocalSmartTurnAnalyzerV3 instead. (PR #3219)

Removed

  • Removed the deprecated VLLM-based open source Ultravox STT service. (PR #3227)

Fixed

  • Fixed a bug in AWSNovaSonicLLMService where we would mishandle cancelled tool calls in the context, resulting in errors. (PR #3212)

  • Better support conversation history with Gemini 2.5 Flash Image (model "gemini-2.5-flash-image"). Prior to this fix, the model had no memory of previous images it had generated, so it wouldn't be able to iterate on them. (PR #3224)

  • Support conversations with Gemini 3 Pro Image (model "gemini-3-pro-image-preview"). Prior to this fix, after the model generated an image the conversation would not be able to progress. (PR #3224)

  • Fixed an issue where ElevenLabsHttpTTSService was not updating voice settings when receiving a TTSUpdateSettingsFrame. (PR #3226)

  • Fixed the return type for SmallWebRTCRequestHandler.handle_web_request() function. (PR #3230)

  • Fixed a bug in LLM context audio content handling. (PR #3234)

  • In GladiaSTTService, reset the _bytes_sent counter on connecting the websocket. This avoids unnecessary audio buffer trimming. (PR #3236)

  • Fixed a TTS service word-timestamp issue that could cause generated TTSTextFrame instances to have an incorrect pts (pts = -1). (PR #3240)

  • Fixed an issue in SimpleTextAggregator where spaces were not being stripped before returning the aggregation. This resulted in an extra space for TTS services that don't support word-timestamp alignment data. (PR #3247)

[0.0.97] - 2025-12-05

Added

  • Added new Gradium services, GradiumSTTService and GradiumTTSService, for speech-to-text and text-to-speech functionality using Gradium's API.

  • Additions for AsyncAITTSService and AsyncAIHttpTTSService:

    • Added new languages: pt, nl, ar, ru, ro, ja, he, hy, tr, hi, zh.
    • Updated the default model to asyncflow_multilingual_v1.0 for improved accuracy and broader language coverage.
  • Added optional tool and tool output filters for MCP services.

Changed

  • Updated Deepgram logging to include Deepgram request IDs for improved debugging.

  • Text Aggregation Improvements:

    • Breaking Change: BaseTextAggregator.aggregate() now returns AsyncIterator[Aggregation] instead of Optional[Aggregation]. This enables the aggregator to return multiple results based on the provided text.
    • Refactored text aggregators to use inheritance: SkipTagsAggregator and PatternPairAggregator now inherit from SimpleTextAggregator, reusing the base class's sentence detection logic.
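
    To show why an async iterator helps here, the sketch below yields one result per complete sentence found in a single chunk of text. The Aggregation dataclass and SentenceAggregator are simplified stand-ins written for this example, with a deliberately naive sentence boundary; they are not the library's implementations.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Aggregation:
    text: str
    type: str


class SentenceAggregator:
    """Toy aggregator: yields one Aggregation per complete sentence."""

    def __init__(self):
        self._buffer = ""

    async def aggregate(self, text: str) -> AsyncIterator[Aggregation]:
        self._buffer += text
        # Naive boundary: '.', '!' or '?' followed by whitespace.
        while (match := re.search(r"[.!?]\s", self._buffer)):
            sentence = self._buffer[: match.end()].strip()
            self._buffer = self._buffer[match.end():]
            yield Aggregation(text=sentence, type="sentence")


async def collect(chunk: str) -> list:
    aggregator = SentenceAggregator()
    return [a.text async for a in aggregator.aggregate(chunk)]


sentences = asyncio.run(collect("Hello there. How are you? I am"))
```

    One call produces two sentences, while the trailing "I am" stays buffered until more text arrives. A return type of Optional[Aggregation] could only have reported one of them.
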
  • Improved interruption handling to prevent bots from repeating themselves. Responses from LLM services that return multiple sentences at once (e.g., GoogleLLMService) are now split into individual sentences before being sent to TTS. This ensures interruptions occur at sentence boundaries, preventing the bot from repeating content after being interrupted during long responses.

  • Updated AICFilter to use Quail STT as the default model (AICModelType.QUAIL_STT). Quail STT is optimized for human-to-machine interaction (e.g., voice agents, speech-to-text) and operates at a native sample rate of 16 kHz with fixed enhancement parameters.

  • If an unexpected exception is caught, or if FrameProcessor.push_error() is called with an exception, the file name and line number where the exception occurred are now logged.

  • Updated Smart Turn model weights to v3.1.

  • Smart Turn analyzer now uses the full context of the turn rather than just the audio since VAD last triggered.

  • Updated CartesiaSTTService to return the full transcription result in the TranscriptionFrame and InterimTranscriptionFrame. This provides access to word timestamp data.

  • HumeTTSService changes:

    • Added tracking headers (X-Hume-Client-Name and X-Hume-Client-Version) to all requests made by HumeTTSService to the Hume API for better usage tracking and analytics.
    • Added stop() and cancel() cleanup methods to HumeTTSService to properly close the HTTP client and prevent resource leaks.

Deprecated

  • NVIDIA Services name changes (all functionality is unchanged):

    • NimLLMService is now deprecated, use NvidiaLLMService instead.
    • RivaSTTService is now deprecated, use NvidiaSTTService instead.
    • RivaTTSService is now deprecated, use NvidiaTTSService instead.
    • Use uv pip install pipecat-ai[nvidia] instead of uv pip install pipecat-ai[riva]
  • The noise_gate_enable parameter in AICFilter is deprecated and no longer has any effect. Noise gating is now handled automatically by the AIC VAD system. Use AICFilter.create_vad_analyzer() for VAD functionality instead.

  • Package pipecat.sync is deprecated, use pipecat.utils.sync instead.

Fixed

  • Fixed bug in PatternPairAggregator where pattern handlers could be called multiple times for KEEP or AGGREGATE patterns.

  • Fixed sentence aggregation to correctly handle ambiguous punctuation in streaming text, such as currency ("$29.95") and abbreviations ("Mr. Smith").
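
    One way to handle these cases in streaming text is to wait for a lookahead character before committing to a boundary. The sketch below is an assumption about the general technique, not the aggregator's actual logic; find_sentence_end and the ABBREVIATIONS set are hypothetical.

```python
from typing import Optional

ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st"}  # illustrative, not exhaustive


def find_sentence_end(buffer: str) -> Optional[int]:
    """Return the index just past the first confirmed sentence end, or None."""
    for i, ch in enumerate(buffer):
        if ch not in ".!?":
            continue
        if i + 1 >= len(buffer):
            return None  # no lookahead yet: "$29." might still become "$29.95"
        if not buffer[i + 1].isspace():
            continue  # "29.95" and similar: punctuation inside a token
        words = buffer[:i].split()
        last_word = words[-1].lower() if words else ""
        if ch == "." and last_word in ABBREVIATIONS:
            continue  # "Mr. Smith": the period belongs to the abbreviation
        return i + 1
    return None


price = "It costs $29.95 today. Next"
name = "Mr. Smith arrived. Ok"
price_sentence = price[: find_sentence_end(price)]
name_sentence = name[: find_sentence_end(name)]
undecided = find_sentence_end("It costs $29.")
```

    The decimal point and the abbreviation are skipped, and a buffer ending in "$29." is left undecided until the next characters arrive.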

  • Fixed an issue in AWSTranscribeSTTService where the region arg was always set to us-east-1 when providing an AWS_REGION env var.

  • Fixed an issue in SarvamTTSService where the last sentence was not being spoken. Now, audio is flushed when the TTS service receives the LLMFullResponseEndFrame or EndFrame.

  • Fixed an issue in DeepgramTTSService where a TTSStoppedFrame was incorrectly pushed after a function call. This caused an issue with the voice-ui-kit's conversational panel rendering of the LLM output after a function call.

  • Fixed an issue where LLMTextFrame.skip_tts was being overwritten by LLM services.

  • Fixed an issue that caused WebsocketService instances to attempt reconnection during shutdown.

  • Fixed an issue in ElevenLabsTTSService where character usage metrics were only reported on the first TTS generation per turn.

[0.0.96] - 2025-11-26 🦃 "Happy Thanksgiving!" 🦃

Added

  • Added AWSBedrockAgentCoreProcessor to support invoking an AgentCore-hosted agent in a Pipecat pipeline.

  • Enhanced error handling across the framework:

    • Added on_error callback to FrameProcessor for centralized error handling.

    • Renamed push_error(error: ErrorFrame) to push_error_frame(error: ErrorFrame) for clarity.

    • Added new push_error method for simplified error reporting:

      async def push_error(error_msg: str,
                           exception: Optional[Exception] = None,
                           fatal: bool = False)
      
    • Standardized error logging by replacing logger.exception calls with logger.error throughout the codebase.

  • Added cache_read_input_tokens, cache_creation_input_tokens and reasoning_tokens to OTel spans for LLM calls.

  • Added LiveKitRESTHelper utility class for managing LiveKit rooms via REST API.

  • Added DeepgramSageMakerSTTService which connects to a SageMaker hosted Deepgram STT model. Added 07c-interruptible-deepgram-sagemaker.py foundational example.

  • Added SageMakerBidiClient to connect to SageMaker hosted BiDi compatible services.

  • Added support for include_timestamps and enable_logging in ElevenLabsRealtimeSTTService. When include_timestamps is enabled, timestamp data is included in the TranscriptionFrame's result parameter.

  • Added optional speaking rate control to InworldTTSService.

  • Introduced a new AggregatedTextFrame type to support passing text along with an aggregated_by field to describe the type of text included. TTSTextFrames now inherit from AggregatedTextFrame. With this inheritance, an observer can watch for AggregatedTextFrames to accumulate the perceived output and determine whether or not the text was spoken based on whether that frame is also a TTSTextFrame.

    With this frame, the LLM token stream can be transformed into custom composable chunks, allowing for aggregation outside the TTS service. This makes it possible to listen for or handle those aggregations and sets the stage for doing things like composing a best effort of the perceived LLM output in a more digestible form, whether or not it is processed by a TTS service and even if no TTS service exists.

  • Introduced LLMTextProcessor: A new processor meant to allow customization for how LLMTextFrames should be aggregated and considered. Its purpose is to turn LLMTextFrames into AggregatedTextFrames. By default, a TTSService will still aggregate LLMTextFrames by sentence for the service to consume. However, if you wish to override how the LLM text is aggregated, you should no longer override the TTS's internal text_aggregator, but instead, insert this processor between your LLM and TTS in the pipeline.

  • New bot-output RTVI message to represent what the bot actually "says".

    • The RTVIObserver now emits bot-output messages based off the new AggregatedTextFrames (bot-tts-text and bot-llm-text are still supported and generated, but bot-transcript is now deprecated in lieu of this new, more thorough, message).

    • The new RTVIBotOutputMessage includes the fields:

      • spoken: A boolean indicating whether the text was spoken by TTS

      • aggregated_by: A string representing how the text was aggregated ("sentence", "word", "my custom aggregation")

    • Introduced new fields to RTVIObserver to support the new bot-output messaging:

      • bot_output_enabled: Defaults to True. Set to false to disable bot-output messages.

      • skip_aggregator_types: Defaults to None. Set to a list of strings that match aggregation types that should not be included in bot-output messages. (Ex. credit_card)

    • Introduced new methods, add_text_transformer() and remove_text_transformer(), to RTVIObserver to support providing (and subsequently removing) callbacks for various types of aggregations (or all aggregations with *) that can modify the text before being sent as a bot-output or tts-text message. (Think obscuring the credit card or inserting extra detail the client might want that the context doesn't need.)
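
    The callback mechanism can be pictured as a registry keyed by aggregation type, with "*" as a catch-all. The sketch below is a standalone illustration under that assumption: TransformerRegistry, its method names, and the callback signature are hypothetical, and the real add_text_transformer() signature may differ.

```python
from collections import defaultdict


class TransformerRegistry:
    """Hypothetical registry of text transformers keyed by aggregation type."""

    def __init__(self):
        self._by_type = defaultdict(list)

    def add(self, fn, aggregation_type: str = "*") -> None:
        self._by_type[aggregation_type].append(fn)

    def remove(self, fn, aggregation_type: str = "*") -> None:
        self._by_type[aggregation_type].remove(fn)

    def apply(self, text: str, aggregation_type: str) -> str:
        # Type-specific transformers run first, then the "*" catch-alls.
        specific = self._by_type.get(aggregation_type, [])
        catch_all = self._by_type.get("*", [])
        for fn in specific + catch_all:
            text = fn(text, aggregation_type)
        return text


def mask_card(text: str, aggregation_type: str) -> str:
    """Obscure everything except the last four characters."""
    return "**** **** **** " + text[-4:]


def tidy(text: str, aggregation_type: str) -> str:
    """Collapse runs of whitespace."""
    return " ".join(text.split())


registry = TransformerRegistry()
registry.add(mask_card, "credit_card")
registry.add(tidy)  # applies to every aggregation type
masked = registry.apply("4111 1111 1111 1234", "credit_card")
plain = registry.apply("hello   world", "sentence")
registry.remove(mask_card, "credit_card")
unmasked = registry.apply("4111 1111 1111 1234", "credit_card")
```

    Only the credit_card aggregation is masked, the catch-all runs for both types, and removing the callback restores the original text.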

  • In MiniMaxHttpTTSService:

    • Added support for speech-2.6-hd and speech-2.6-turbo models

    • Added languages: Afrikaans, Bulgarian, Catalan, Danish, Persian, Filipino, Hebrew, Croatian, Hungarian, Malay, Norwegian, Nynorsk, Slovak, Slovenian, Swedish, and Tamil

    • Added new emotions: calm and fluent

  • Added enable_logging to SimliVideoService input parameters. It's disabled by default.

Changed

  • Updated FishAudioTTSService default model to s1.

  • Updated DeepgramTTSService to use Deepgram's TTS websocket API. ⚠️ This is a potential breaking change, which only affects you if you're self-hosting DeepgramTTSService. The new service uses Websockets and improves TTFB latency.

  • Updated daily-python to 0.22.0.

  • BaseTextAggregator changes:

    Modified the BaseTextAggregator type so that when text gets aggregated, metadata can be associated with it. Currently, that just means a type, so that the aggregation can be classified or described. Changes made to support this:

    • ⚠️ IMPORTANT: Aggregators are now expected to strip leading/trailing whitespace characters before returning their aggregation from aggregate() or .text. This way all aggregators have a consistent contract allowing downstream use to know how to stitch aggregations back together.

    • Introduced a new Aggregation dataclass to represent both the aggregated text and a string identifying the type of aggregation (ex. "sentence", "word", "my custom aggregation")

    • ⚠️ Breaking change: BaseTextAggregator.text now returns an Aggregation (instead of str).

      Before:

      aggregated_text = myAggregator.text
      

      Now:

      aggregated_text = myAggregator.text.text
      
    • ⚠️ Breaking change: BaseTextAggregator.aggregate() now returns Optional[Aggregation] (instead of Optional[str]).

      Before:

      aggregation = myAggregator.aggregate(text)
      print(f"successfully aggregated text: {aggregation}")
      

      Now:

      aggregation = myAggregator.aggregate(text)
      if aggregation:
        print(f"successfully aggregated text: {aggregation.text}")
      
    • SimpleTextAggregator, SkipTagsAggregator, PatternPairAggregator updated to produce/consume Aggregation objects.

    • All uses of the above Aggregators have been updated accordingly.

  • Augmented the PatternPairAggregator so that matched patterns can be treated as their own aggregation, taking advantage of the new Aggregation type. To that end:

    • Introduced a new, preferred version of add_pattern to support a new option for treating a match as a separate aggregation returned from aggregate(). This replaces the now deprecated add_pattern_pair method and you provide a MatchAction in lieu of the remove_match field.

      • MatchAction enum: REMOVE, KEEP, AGGREGATE, allowing customization for how a match should be handled.

        • REMOVE: The text along with its delimiters will be removed from the streaming text. Sentence aggregation will continue on as if this text did not exist.

        • KEEP: The delimiters will be removed, but the content between them will be kept. Sentence aggregation will continue on with the internal text included.

        • AGGREGATE: The delimiters will be removed and the content between will be treated as a separate aggregation. Any text before the start of the pattern will be returned early, whether or not a complete sentence was found. Then the pattern will be returned. Then the aggregation will continue on sentence matching after the closing delimiter is found. The content between the delimiters is not aggregated by sentence. It is aggregated as one single block of text.

      • PatternMatch now extends Aggregation and provides richer info to handlers.

    • ⚠️ Breaking change: The PatternMatch type returned to handlers registered via on_pattern_match has been updated to subclass from the new Aggregation type, which means that content has been replaced with text and pattern_id has been replaced with type:

      async def on_match_tag(match: PatternMatch):
         pattern = match.type # instead of match.pattern_id
         text = match.text # instead of match.content
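
    The three actions can be compared side by side with a simplified, non-streaming sketch. The MatchAction enum below mirrors the names in this entry, but apply_pattern is a hypothetical helper that works on complete text and returns the surrounding text as single chunks tagged "sentence" instead of performing real sentence splitting.

```python
import re
from enum import Enum


class MatchAction(Enum):
    REMOVE = "remove"
    KEEP = "keep"
    AGGREGATE = "aggregate"


def apply_pattern(text, start, end, type_, action):
    """Return (text, type) pairs for one delimiter pair over complete text."""
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    if action is MatchAction.REMOVE:
        # Delimiters and content disappear from the text.
        return [(" ".join(pattern.sub("", text).split()), "sentence")]
    if action is MatchAction.KEEP:
        # Delimiters disappear, content stays inline.
        return [(" ".join(pattern.sub(r"\1", text).split()), "sentence")]
    # AGGREGATE: text before the match is returned early, then the match itself.
    results, cursor = [], 0
    for match in pattern.finditer(text):
        before = text[cursor:match.start()].strip()
        if before:
            results.append((before, "sentence"))
        results.append((match.group(1).strip(), type_))
        cursor = match.end()
    tail = text[cursor:].strip()
    if tail:
        results.append((tail, "sentence"))
    return results


source = "Your card is <card>4111 1111</card> on file."
args = (source, "<card>", "</card>", "credit_card")
removed = apply_pattern(*args, MatchAction.REMOVE)
kept = apply_pattern(*args, MatchAction.KEEP)
aggregated = apply_pattern(*args, MatchAction.AGGREGATE)
```

    REMOVE and KEEP each return one chunk, while AGGREGATE returns three: the text before the match, the match tagged credit_card, and the text after it.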
      
  • TextFrame now includes the field append_to_context to support setting whether or not the encompassing text should be added to the LLM context (by the LLM assistant aggregator). It defaults to True.

  • TTSService base class updates:

    • TTSServices now accept a new skip_aggregator_types to avoid speaking certain aggregation types (now determined/returned by the aggregator)

    • Introduced the ability to do a just-in-time transform of text before it gets sent to the TTS service via callbacks you can set up via a new init field, text_transforms or a new method add_text_transformer(). This makes it possible to do things like introduce TTS-specific tags for spelling or emotion or change the pronunciation of something on the fly. remove_text_transformer has also been added to support removing a registered transform callback.

    • TTS services push AggregatedTextFrame in addition to TTSTextFrames when either an aggregation occurs that should not be spoken or when the TTS service supports word-by-word timestamping. In the latter case, the TTSService preliminarily generates an AggregatedTextFrame, aggregated by sentence to generate the full sentence content as early as possible.

  • Updated CartesiaTTSService:

    • Modified use of custom default text_aggregator to avoid deprecation warnings and push users towards use of transformers or the LLMTextProcessor

    • Added convenience methods for taking advantage of Cartesia's SSML tags: spell, emotion, pauses, volume, and speed.

  • Updated RimeTTSService:

    • Modified use of custom default text_aggregator to avoid deprecation warnings and push users towards use of transformers or the LLMTextProcessor

    • Added convenience methods for taking advantage of Rime's customization options: spell, pauses, pronunciations, and inline speed control.

Deprecated

  • The TTS constructor field, text_aggregator is deprecated in favor of the new LLMTextProcessor. TTSServices still have an internal aggregator for support of default behavior, but if you want to override the aggregation behavior, you should use the new processor.

  • The RTVI bot-transcription event is deprecated in favor of the new bot-output message which is the canonical representation of bot output (spoken or not). The code still emits a transcription message for backwards compatibility while the transition occurs.

  • Deprecated add_pattern_pair in the PatternPairAggregator which takes a pattern_id and remove_match field in favor of the new add_pattern method which takes a type and an action

  • english_normalization input parameter for MiniMaxHttpTTSService is deprecated, use text_normalization instead.

Fixed

  • Fixed an issue in AWSBedrockLLMService where the aws_region arg was always set to us-east-1 when providing an AWS_REGION env var.

  • Fixed an issue with DeepgramFluxSTTService where it sometimes failed to reconnect.

  • Fixed an issue in ElevenLabsRealtimeSTTService where dynamic language updates were not working.

  • Fixed an issue in ElevenLabsRealtimeSTTService where setting the sample rate would result in transcripts failing.

  • Fixed InworldTTSService audio config payload to use camelCase keys expected by the Inworld API.

[0.0.95] - 2025-11-18

Added

  • Added ai-coustics integrated VAD (AICVADAnalyzer) with AICFilter factory and example wiring; leverages the enhancement model for robust detection with no ONNX dependency or added processing complexity.

  • Added a watchdog to DeepgramFluxSTTService to prevent dangling tasks in case the user was speaking and we stop receiving audio.

  • Introduced a minimum confidence parameter in DeepgramFluxSTTService to avoid generating transcriptions below a defined threshold.

  • Added ElevenLabsRealtimeSTTService which implements the Realtime STT service from ElevenLabs.

  • Added word-level timestamps support to Hume TTS service

Changed

  • ⚠️ Breaking change: LLMContext.create_image_message(), LLMContext.create_audio_message(), LLMContext.add_image_frame_message() and LLMContext.add_audio_frames_message() are now async methods. This fixes an issue where the asyncio event loop would be blocked while encoding audio or images.

  • ConsumerProcessor now queues frames from the producer internally instead of pushing them directly. This allows us to subclass consumer processors and manipulate frames before they are pushed.

  • BaseTextFilter only requires subclasses to implement the filter() method.

  • Extracted the logic for retrying connections and created a new send_with_retry method inside WebSocketService.

  • Refactored DeepgramFluxSTTService to automatically reconnect if sending a message fails.

  • Updated all STT and TTS services to use a consistent error-handling pattern based on the push_error() method, for better integration with pipeline error events.

  • Added support for maybe_capture_participant_camera() and maybe_capture_participant_screen() for SmallWebRTCTransport in the runner utils.

  • Added Hindi support for Rime TTS services.

  • Updated GeminiTTSService to use Google Cloud Text-to-Speech streaming API instead of the deprecated Gemini API. Now uses credentials / credentials_path for authentication. The api_key parameter is deprecated. Also, added support for prompt parameter for style instructions and expressive markup tags. Significantly improved latency with streaming synthesis.

  • Updated language mappings for the Google and Gemini TTS services to match official documentation.
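
The breaking change to LLMContext.create_image_message() and related methods in this release exists because encoding large payloads inline stalls the asyncio event loop. The sketch below shows the general technique of moving such work off the loop. It is not Pipecat's implementation; encode_image_async is a hypothetical helper used only for illustration.

```python
import asyncio
import base64


def _encode(data: bytes) -> str:
    # CPU-bound work that would block the event loop if run inline.
    return base64.b64encode(data).decode("utf-8")


async def encode_image_async(data: bytes) -> str:
    # Hypothetical helper: run the encoding in a worker thread so other
    # coroutines keep running while it completes.
    return await asyncio.to_thread(_encode, data)


async def main() -> str:
    # Callers now have to await the result, which is why the change is breaking.
    return await encode_image_async(b"\x89PNG fake image bytes")


encoded = asyncio.run(main())
```

Any call site that previously used the return value directly needs an added await after upgrading.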

Deprecated

  • The api_key parameter in GeminiTTSService is deprecated. Use credentials or credentials_path instead for Google Cloud authentication.

Fixed

  • Fixed a SimliVideoService connection issue.

  • Fixed an issue in the Runner where, when using SmallWebRTCTransport, the request_data was not being passed to the SmallWebRTCRunnerArguments body.

  • Fixed a subtle issue where assistant context messages ended up with double spaces between words or sentences.

  • Fixed an issue where NeuphonicTTSService wasn't pushing TTSTextFrames, meaning assistant messages weren't being written to context.

  • Fixed an issue with OpenTelemetry where tracing wasn't correctly displaying LLM completions and tools when using the universal LLMContext.

  • Fixed an issue where DeepgramFluxSTTService failed to connect if passing a keyterm or tag containing a space.

  • Prevented HeyGenVideoService from automatically disconnecting after 5 minutes.
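
The keyterm and tag connection failure listed above is a typical symptom of query values being placed in a URL without percent-encoding. A minimal sketch of encoding such values with the standard library follows. The base URL is a placeholder and the parameter names are assumptions made for illustration; the real Flux endpoint and its parameters may differ.

```python
from typing import List
from urllib.parse import quote, urlencode

# Placeholder endpoint, for illustration only.
BASE_URL = "wss://example.invalid/listen"


def build_url(keyterms: List[str], tags: List[str]) -> str:
    # Repeated parameters are expressed as a list of (name, value) pairs.
    params = [("keyterm", k) for k in keyterms] + [("tag", t) for t in tags]
    # quote_via=quote encodes spaces as %20 rather than "+".
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


url = build_url(["voice agent"], ["demo run"])
```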

[0.0.94] - 2025-11-10

Changed

  • Added support for retrying SpeechmaticsTTSService when it returns a 503 error. The default retry values are defined in InputParams.

Deprecated

  • The KrispFilter is deprecated and will be removed in a future version. Use the KrispVivaFilter instead.

Removed

  • LivekitFrameSerializer has been removed. Use LiveKitTransport instead.

Fixed

  • Fixed a bug related to LLMAssistantAggregator where spaces were sometimes missing from assistant messages in context.

[0.0.93] - 2025-11-07

Added

  • Added support for Sarvam Speech-to-Text service (SarvamSTTService) with streaming WebSocket support for saarika (STT) and saaras (STT-translate) models.

  • Added support for passing in a ToolsSchema in lieu of a list of provider-specific dicts when initializing OpenAIRealtimeLLMService or when updating it using LLMUpdateSettingsFrame.

  • Added TransportParams.audio_out_silence_secs, which specifies how many seconds of silence to output when an EndFrame reaches the output transport. This can help ensure that all audio data is fully delivered to clients.

  • Added new FrameProcessor.broadcast_frame() method. This will push two instances of a given frame class, one upstream and the other downstream.

    await self.broadcast_frame(UserSpeakingFrame)
    
  • Added MetricsLogObserver for logging performance metrics from MetricsFrame instances. Supports filtering via include_metrics parameter to control which metrics types are logged (TTFB, processing time, LLM token usage, TTS usage, smart turn metrics).

  • Added pronunciation_dictionary_locators to ElevenLabsTTSService and ElevenLabsHttpTTSService.

  • Added support for loading external observers. You can now register custom pipeline observers by setting the PIPECAT_OBSERVER_FILES environment variable. This variable should contain a colon-separated list of Python files (e.g. export PIPECAT_OBSERVER_FILES="observer1.py:observer2.py:..."). Each file must define a function with the following signature:

    async def create_observers(task: PipelineTask) -> Iterable[BaseObserver]:
        ...
    
  • Added support for new sonic-3 languages in CartesiaTTSService and CartesiaHttpTTSService.

  • EndFrame and EndTaskFrame have an optional reason field to indicate why the pipeline is being ended.

  • CancelFrame and CancelTaskFrame have an optional reason field to indicate why the pipeline is being canceled. This can also be specified when you cancel a task with PipelineTask.cancel(reason="cancellation reason").

  • Added include_prob_metrics parameter to Whisper STT services to enable access to probability metrics from transcription results.

  • Added utility functions extract_whisper_probability(), extract_openai_gpt4o_probability(), and extract_deepgram_probability() to extract probability metrics from TranscriptionFrame objects for Whisper-based, OpenAI GPT-4o-transcribe, and Deepgram STT services respectively.

  • Added LLMSwitcher.register_direct_function(). It works much like LLMSwitcher.register_function() in that it's a shorthand for registering a function on all LLMs in the switcher, except this new method takes a direct function (a FunctionSchema-less function).

  • Added MCPClient.get_tools_schema() and MCPClient.register_tools_schema() as a two-step alternative to MCPClient.register_tools(), to allow users to pass MCP tools to, say, GeminiLiveLLMService (as well as other speech-to-speech services) in the constructor.

  • Added support for passing in an LLMSwitcher to MCPClient.register_tools() (as well as the new MCPClient.register_tools_schema()).

  • Added cpu_count parameter to LocalSmartTurnAnalyzerV3. This is set to 1 by default for more predictable performance on low-CPU systems.
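
For the PIPECAT_OBSERVER_FILES entry above, the following sketch shows one way a colon-separated list of Python files could be loaded and each file's create_observers() function called. It is an illustration of the mechanism, not Pipecat's actual loader: load_module and load_observers are hypothetical helpers, the observer is a plain string standing in for a BaseObserver, and splitting on ":" assumes POSIX-style paths.

```python
import asyncio
import importlib.util
import os
import tempfile
from typing import Any, List


def load_module(path: str):
    # Import a Python file by path under a unique module name.
    name = f"observer_file_{abs(hash(path))}"
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


async def load_observers(env_value: str, task: Any) -> List[Any]:
    # Split the colon-separated list and collect observers from each file.
    observers: List[Any] = []
    for path in filter(None, env_value.split(":")):
        module = load_module(path)
        observers.extend(await module.create_observers(task))
    return observers


# Demo: write a file that follows the documented signature, then load it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "observer1.py")
    with open(path, "w") as f:
        f.write("async def create_observers(task):\n    return ['demo-observer']\n")
    loaded = asyncio.run(load_observers(path, task=None))
```

In a real observer file, create_observers() would return BaseObserver instances built for the given PipelineTask.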

Changed

  • Updated simli-ai to 0.1.25.

  • STTMuteFilter no longer sends STTMuteFrame to the STT service. The filter now blocks frames locally without instructing the STT service to stop processing audio. This prevents inactivity-related errors (such as 409 errors from Google STT) while maintaining the same muting behavior at the application level. Important: The STTMuteFilter should be placed after the STT service itself.

  • Improved GoogleSTTService error handling to properly catch gRPC Aborted exceptions (corresponding to 409 errors) caused by stream inactivity. These exceptions are now logged at DEBUG level instead of ERROR level, since they indicate expected behavior when no audio is sent for 10+ seconds (e.g., during long silences or when audio input is blocked). The service automatically reconnects when this occurs.

  • Bumped the fastapi dependency's upper bound to <0.122.0.

  • Updated the default model for GoogleVertexLLMService to gemini-2.5-flash.

  • Updated the GoogleVertexLLMService to use the GoogleLLMService as a base class instead of the OpenAILLMService.

  • Updated STT and TTS services to pass through unverified language codes with a warning instead of returning None. This allows developers to use newly supported languages before Pipecat's service classes are updated, while still providing guidance on verified languages.
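
The language-code change above replaces a lookup that returned None with one that falls back to the caller's value. A minimal sketch of that pattern, assuming a hypothetical VERIFIED table and resolve_language helper (neither is part of Pipecat):

```python
import logging
from typing import Dict

logger = logging.getLogger("language-demo")

# Hypothetical table of verified codes for an imaginary service.
VERIFIED: Dict[str, str] = {"en": "en-US", "es": "es-ES"}


def resolve_language(code: str) -> str:
    if code in VERIFIED:
        return VERIFIED[code]
    # Instead of returning None, warn and pass the code through unchanged.
    logger.warning("Language %r is unverified; passing it through as-is", code)
    return code


verified = resolve_language("en")
unverified = resolve_language("xx-new")
```

The warning keeps the guidance about verified languages visible while still letting a newly supported code reach the provider.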

Removed

  • Removed needs_mcp_alternate_schema() from LLMService. The mechanism that relied on it went away.

Fixed

  • Restored backwards compatibility for vision/image features (broken in 0.0.92) when using non-universal context and assistant aggregators.

  • Fixed DeepgramSTTService._disconnect() to properly await is_connected() method call, which is an async coroutine in the Deepgram SDK.

  • Fixed an issue where the SmallWebRTCRequest dataclass in runner would scrub arbitrary request data from client due to camelCase typing. This fixes data passthrough for JS clients where APIRequest is used.

  • Fixed a bug in GeminiLiveLLMService where in some circumstances it wouldn't respond after a tool call.

  • Fixed GeminiLiveLLMService session resumption after a connection timeout.

  • GeminiLiveLLMService now properly supports context-provided system instruction and tools.

  • Fixed GoogleLLMService token counting to avoid double-counting tokens when Gemini sends usage metadata across multiple streaming chunks.
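
The token-counting fix above concerns usage metadata that arrives on more than one streaming chunk. The sketch below contrasts summing every report with keeping only the latest one. It assumes each report is a running total, which is an assumption about the API's behavior, and the functions are hypothetical; the actual fix in GoogleLLMService may differ.

```python
from typing import Dict, Iterable, Optional

Usage = Dict[str, int]


def summed_usage(chunks: Iterable[Optional[Usage]]) -> int:
    # Naive approach: adds every report, which over-counts running totals.
    return sum(u["total_tokens"] for u in chunks if u)


def final_usage(chunks: Iterable[Optional[Usage]]) -> int:
    # Keep only the most recent report instead of accumulating.
    latest = 0
    for u in chunks:
        if u:
            latest = u["total_tokens"]
    return latest


# Hypothetical stream: three chunks, two of which carry usage metadata.
stream = [{"total_tokens": 40}, None, {"total_tokens": 55}]
naive = summed_usage(stream)
fixed = final_usage(stream)
```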

[0.0.92] - 2025-10-31 🎃 "The Haunted Edition" 👻

Added

  • Added a new DeepgramHttpTTSService, which delivers a meaningful reduction in latency when compared to the DeepgramTTSService.

  • Added support for the speaking_rate input parameter in GoogleHttpTTSService.

  • Added enable_speaker_diarization and enable_language_identification to SonioxSTTService.

  • Added SpeechmaticsTTSService, which uses Speechmatics' TTS API. Updated examples 07a* to use the new TTS service.

  • Added support for including images or audio in LLM context messages using LLMContext.create_image_message() or LLMContext.create_image_url_message() (not all LLMs support URLs) and LLMContext.create_audio_message(). For example, when creating LLMMessagesAppendFrame:

    message = LLMContext.create_image_message(image=..., size= ...)
    await self.push_frame(LLMMessagesAppendFrame(messages=[message], run_llm=True))
    
  • Added new event handlers for the DeepgramFluxSTTService: on_start_of_turn, on_turn_resumed, on_end_of_turn, on_eager_end_of_turn, on_update.

  • Added generation_config parameter support to CartesiaTTSService and CartesiaHttpTTSService for Cartesia Sonic-3 models. Includes a new GenerationConfig class with volume (0.5-2.0), speed (0.6-1.5), and emotion (60+ options) parameters for fine-grained speech generation control.

  • Expanded support for universal LLMContext to OpenAIRealtimeLLMService. As a reminder, the context-setup pattern when using LLMContext is:

    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    (Note that even though OpenAIRealtimeLLMService now supports the universal LLMContext, it is not meant to be swapped out for another LLM service at runtime with LLMSwitcher.)

    Note: TranscriptionFrames and InterimTranscriptionFrames now go upstream from OpenAIRealtimeLLMService, so if you're using TranscriptProcessor, say, you'll want to adjust accordingly:

    pipeline = Pipeline(
      [
        transport.input(),
        context_aggregator.user(),
    
        # BEFORE
        llm,
        transcript.user(),
    
        # AFTER
        transcript.user(),
        llm,
    
        transport.output(),
        transcript.assistant(),
        context_aggregator.assistant(),
      ]
    )
    

    Also worth noting: whether or not you use the new context-setup pattern with OpenAIRealtimeLLMService, some types have changed under the hood:

    ## BEFORE:
    
    # Context aggregator type
    context_aggregator: OpenAIContextAggregatorPair
    
    # Context frame type
    frame: OpenAILLMContextFrame
    
    # Context type
    context: OpenAIRealtimeLLMContext
    # or
    context: OpenAILLMContext
    
    ## AFTER:
    
    # Context aggregator type
    context_aggregator: LLMContextAggregatorPair
    
    # Context frame type
    frame: LLMContextFrame
    
    # Context type
    context: LLMContext
    

    Also note that RealtimeMessagesUpdateFrame and RealtimeFunctionCallResultFrame have been deprecated, since they're no longer used by OpenAIRealtimeLLMService. OpenAI Realtime now works more like other LLM services in Pipecat, relying on updates to its context, pushed by context aggregators, to update its internal state. Listen for LLMContextFrames for context updates.

    Finally, LLMTextFrames are no longer pushed from OpenAIRealtimeLLMService when it's configured with output_modalities=['audio']. If you need to process its output, listen for TTSTextFrames instead.

  • Expanded support for universal LLMContext to GeminiLiveLLMService. As a reminder, the context-setup pattern when using LLMContext is:

    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    (Note that even though GeminiLiveLLMService now supports the universal LLMContext, it is not meant to be swapped out for another LLM service at runtime with LLMSwitcher.)

    Worth noting: whether or not you use the new context-setup pattern with GeminiLiveLLMService, some types have changed under the hood:

    ## BEFORE:
    
    # Context aggregator type
    context_aggregator: GeminiLiveContextAggregatorPair
    
    # Context frame type
    frame: OpenAILLMContextFrame
    
    # Context type
    context: GeminiLiveLLMContext
    # or
    context: OpenAILLMContext
    
    ## AFTER:
    
    # Context aggregator type
    context_aggregator: LLMContextAggregatorPair
    
    # Context frame type
    frame: LLMContextFrame
    
    # Context type
    context: LLMContext
    

    Also note that LLMTextFrames are no longer pushed from GeminiLiveLLMService when it's configured with modalities=GeminiModalities.AUDIO. If you need to process its output, listen for TTSTextFrames instead.

Changed

  • The development runner's /start endpoint now supports passing dailyRoomProperties and dailyMeetingTokenProperties in the request body when createDailyRoom is true. Properties are validated against the DailyRoomProperties and DailyMeetingTokenProperties types respectively and passed to Daily's room and token creation APIs.

  • UserImageRawFrame has new fields append_to_context and text. The append_to_context field indicates whether this image and text should be added to the LLM context (by the LLM assistant aggregator). The text field, if set, might also guide the LLM or the vision service on how to analyze the image.

  • UserImageRequestFrame has new fields append_to_context and text. Both fields are used to set the same fields on the captured UserImageRawFrame.

  • UserImageRequestFrame no longer requires a function call name and ID.

  • Updated MoondreamService to process UserImageRawFrame.

  • VisionService expects UserImageRawFrame in order to analyze images.

  • DailyTransport triggers on_error event if transcription can't be started or stopped.

  • DailyTransport updates: start_dialout() now returns two values: session_id and error. start_recording() now returns two values: stream_id and error.

  • Updated daily-python to 0.21.0.

  • SimliVideoService now accepts api_key and face_id parameters directly, with optional params for max_session_length and max_idle_time configuration, aligning with other Pipecat service patterns.

  • Updated the default model to sonic-3 for CartesiaTTSService and CartesiaHttpTTSService.

  • FunctionFilter now has a filter_system_frames arg, which controls whether or not SystemFrames are filtered.

  • Upgraded aws_sdk_bedrock_runtime to v0.1.1 to resolve potential CPU issues when running AWSNovaSonicLLMService.

Deprecated

  • The expect_stripped_words parameter of LLMAssistantAggregatorParams is ignored when used with the newer LLMAssistantAggregator, which now handles word spacing automatically.

  • LLMService.request_image_frame() is deprecated, push a UserImageRequestFrame instead.

  • UserResponseAggregator is deprecated and will be removed in a future version.

  • The send_transcription_frames argument to OpenAIRealtimeLLMService is deprecated. Transcription frames are now always sent. They go upstream, to be handled by the user context aggregator. See "Added" section for details.

  • Types in pipecat.services.openai.realtime.context and pipecat.services.openai.realtime.frames are deprecated, as they're no longer used by OpenAIRealtimeLLMService. See "Added" section for details.

  • SimliVideoService simli_config parameter is deprecated. Use api_key and face_id parameters instead.

Removed

  • Removed enable_non_final_tokens and max_non_final_tokens_duration_ms from SonioxSTTService.

  • Removed the aiohttp_session arg from SarvamTTSService as it's no longer used.

Fixed

  • Fixed a PipelineTask issue that was causing an idle timeout for frames that were being generated but not reaching the end of the pipeline. Since the exact point when frames are discarded is unknown, we now monitor pipeline frames using an observer. If the observer detects frames are being generated, it will prevent the pipeline from being considered idle.

  • Fixed an issue in HumeTTSService that was only using Octave 2, which does not support the description field. Now, if a description is provided, it switches to Octave 1.

  • Fixed an issue where DailyTransport would time out prematurely on join and on leave.

  • Fixed an issue in the runner where starting a DailyTransport room via /start didn't support using the DAILY_SAMPLE_ROOM_URL env var.

  • Fixed an issue in ServiceSwitcher where, when used with STTServices, all STT services would produce TranscriptionFrames.
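
The PipelineTask idle-timeout fix in this list relies on observing frames anywhere in the pipeline rather than only at its end. A minimal sketch of that idea, using a hypothetical IdleMonitor class (not Pipecat's observer API) in which any observed frame resets the idle clock:

```python
import time
from typing import Optional


class IdleMonitor:
    """Hypothetical helper: any observed frame resets the idle clock."""

    def __init__(self, idle_timeout_secs: float) -> None:
        self._timeout = idle_timeout_secs
        self._last_activity = time.monotonic()

    def on_frame(self, now: Optional[float] = None) -> None:
        # Called for every frame seen, wherever it is in the pipeline.
        self._last_activity = time.monotonic() if now is None else now

    def is_idle(self, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        return current - self._last_activity >= self._timeout


monitor = IdleMonitor(idle_timeout_secs=5.0)
monitor.on_frame(now=100.0)
idle_soon_after = monitor.is_idle(now=103.0)
idle_much_later = monitor.is_idle(now=106.0)
```

The explicit now argument exists only to make the example deterministic; a real monitor would rely on the clock.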

Other

  • Updated all vision 12-series foundational examples to load images from a file.

  • Added 14-series video examples for different services. These new examples request an image from the user camera through a function call.

[0.0.91] - 2025-10-21

Added

  • It is now possible to start a bot from the /start endpoint when using the runner's Daily transport. This follows the Pipecat Cloud format with createDailyRoom and body fields in the POST request body.

  • Added an ellipsis character (…) to the end-of-sentence detection in the string utils.

  • Expanded support for universal LLMContext to AWSNovaSonicLLMService. As a reminder, the context-setup pattern when using LLMContext is:

    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    (Note that even though AWSNovaSonicLLMService now supports the universal LLMContext, it is not meant to be swapped out for another LLM service at runtime with LLMSwitcher.)

    Worth noting: whether or not you use the new context-setup pattern with AWSNovaSonicLLMService, some types have changed under the hood:

    ## BEFORE:
    
    # Context aggregator type
    context_aggregator: AWSNovaSonicContextAggregatorPair
    
    # Context frame type
    frame: OpenAILLMContextFrame
    
    # Context type
    context: AWSNovaSonicLLMContext
    # or
    context: OpenAILLMContext
    
    ## AFTER:
    
    # Context aggregator type
    context_aggregator: LLMContextAggregatorPair
    
    # Context frame type
    frame: LLMContextFrame
    
    # Context type
    context: LLMContext
    
  • Added support for bulbul:v3 model in SarvamTTSService and SarvamHttpTTSService.

  • Added keyterms_prompt parameter to AssemblyAIConnectionParams.

  • Added speech_model parameter to AssemblyAIConnectionParams to access the multilingual model.

  • Added support for trickle ICE to the SmallWebRTCTransport.

  • Added support for updating OpenAITTSService settings (instructions and speed) at runtime via TTSUpdateSettingsFrame.

  • Added --whatsapp flag to runner to better surface WhatsApp transport logs.

  • Added on_connected and on_disconnected events to TTS and STT websocket-based services.

  • Added an aggregate_sentences arg in ElevenLabsHttpTTSService, where the default value is True.

  • Added a room_properties arg to the Daily runner's configure() method, allowing DailyRoomProperties to be provided.

  • The runner --folder argument now supports downloading files from subdirectories.

Changed

  • RunnerArguments now include the body field, so there's no need to add it to subclasses. Also, all RunnerArguments fields are now keyword-only.

  • CartesiaSTTService now inherits from WebsocketSTTService.

  • Package upgrades:

    • daily-python upgraded to 0.20.0.
    • openai upgraded to support up to 2.x.x.
    • openpipe upgraded to support up to 5.x.x.
  • SpeechmaticsSTTService updated dependencies for speechmatics-rt>=0.5.0.

Deprecated

  • The send_transcription_frames argument to AWSNovaSonicLLMService is deprecated. Transcription frames are now always sent. They go upstream, to be handled by the user context aggregator. See "Added" section for details.

  • Types in pipecat.services.aws.nova_sonic.context are deprecated, as they're no longer used by AWSNovaSonicLLMService. See "Added" section for details.

Fixed

  • Fixed an issue where the RTVIProcessor was sending duplicate UserStartedSpeakingFrame and UserStoppedSpeakingFrame messages.

  • Fixed an issue in AWSBedrockLLMService where both temperature and top_p were always sent together, causing conflicts with models like Claude Sonnet 4.5 that don't allow both parameters simultaneously. The service now only includes inference parameters that are explicitly set, and InputParams defaults have been changed to None to rely on AWS Bedrock's built-in model defaults.

  • Fixed an issue in RivaSegmentedSTTService where a runtime error occurred due to a mismatch in the _handle_transcription method's signature.

  • Fixed multiple pipeline task cancellation issues. asyncio.CancelledError is now handled properly in PipelineTask, making it possible to cleanly cancel an asyncio task that is executing a PipelineRunner. Also, PipelineTask.cancel() no longer blocks waiting for the CancelFrame to reach the end of the pipeline (going back to the behavior in < 0.0.83).

  • Fixed an issue in ElevenLabsTTSService and ElevenLabsHttpTTSService where the Flash models would split words, resulting in a space being inserted between words.

  • Fixed an issue where audio filters' stop() would not be called when using CancelFrame.

  • Fixed an issue in ElevenLabsHttpTTSService, where apply_text_normalization was incorrectly set as a query parameter. It's now being added as a request parameter.

  • Fixed an issue where RimeHttpTTSService and PiperTTSService could generate audio frames that were not 16-bit aligned, potentially leading to internal errors or static audio.

  • Fixed an issue in SpeechmaticsSTTService where AdditionalVocabEntry items needed to have sounds_like for the session to start.
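
The AWSBedrockLLMService fix in this list comes down to sending only the inference parameters that were explicitly set. A minimal sketch of that filtering, assuming a hypothetical build_inference_config helper (not the service's actual code); the key names follow the Bedrock Converse inferenceConfig shape:

```python
from typing import Any, Dict, Optional


def build_inference_config(
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> Dict[str, Any]:
    # Hypothetical helper: include only the parameters that were explicitly set,
    # so unset values fall back to the model's own defaults.
    candidates = {"maxTokens": max_tokens, "temperature": temperature, "topP": top_p}
    return {key: value for key, value in candidates.items() if value is not None}


only_temperature = build_inference_config(temperature=0.7)
nothing_set = build_inference_config()
```

With defaults of None, a model that rejects temperature and top_p together only receives whichever one the caller chose.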

Other

  • Added foundational example 47-sentry-metrics.py, demonstrating how to use the SentryMetrics processor.

  • Added foundational example 14x-function-calling-openpipe.py.

[0.0.90] - 2025-10-10

Added

  • Added audio filter KrispVivaFilter using the Krisp VIVA SDK.

  • Added --folder argument to the runner, allowing files saved in that folder to be downloaded from http://HOST:PORT/file/FILE.

  • Added GeminiLiveVertexLLMService, for accessing Gemini Live via Google Vertex AI.

  • Added some new configuration options to GeminiLiveLLMService:

    • thinking
    • enable_affective_dialog
    • proactivity

    Note that these new configuration options require using a newer model than the default, like "gemini-2.5-flash-native-audio-preview-09-2025". The last two require specifying http_options=HttpOptions(api_version="v1alpha").

  • Added on_pipeline_error event to PipelineTask. This event will get fired when an ErrorFrame is pushed (use FrameProcessor.push_error()).

    @task.event_handler("on_pipeline_error")
    async def on_pipeline_error(task: PipelineTask, frame: ErrorFrame):
        ...
    
  • Added a service_tier InputParam to the BaseOpenAILLMService. This parameter can influence the latency of the response. For example "priority" will result in faster completions, but in exchange for a higher price.

Changed

  • Updated GeminiLiveLLMService to use the google-genai library rather than use WebSockets directly.

Deprecated

  • LivekitFrameSerializer is now deprecated. Use LiveKitTransport instead.

  • pipecat.service.openai_realtime is now deprecated, use pipecat.services.openai.realtime instead or pipecat.services.azure.realtime for Azure Realtime.

  • pipecat.service.aws_nova_sonic is now deprecated, use pipecat.services.aws.nova_sonic instead.

  • GeminiMultimodalLiveLLMService is now deprecated, use GeminiLiveLLMService.

Fixed

  • Fixed a GoogleVertexLLMService issue that would generate an error if no token information was returned.

  • GeminiLiveLLMService will now end gracefully (i.e. after the bot has finished) upon receiving an EndFrame.

  • GeminiLiveLLMService will try to seamlessly reconnect when it loses its connection.

[0.0.89] - 2025-10-07

Fixed

  • Reverted a change introduced in 0.0.88 that was causing pipelines to be frozen when using interruption strategies and processors that block interruption frames (e.g. STTMuteFilter).

[0.0.88] - 2025-10-07

Added

  • Added support for Nano Banana models to GoogleLLMService. For example, you can now use the gemini-2.5-flash-image model to generate images.

  • Added HumeTTSService for text-to-speech synthesis using Hume AI's expressive voice models. Provides high-quality, emotionally expressive speech synthesis with support for various voice models. Includes example in examples/foundational/07ad-interruptible-hume.py. Use with: uv pip install pipecat-ai[hume].

Changed

  • Updated default GoogleLLMService model to gemini-2.5-flash.

Deprecated

  • PlayHT is shutting down their API on December 31st, 2025. As a result, PlayHTTTSService and PlayHTHttpTTSService are deprecated and will be removed in a future version.

Fixed

  • Fixed an issue with AWSNovaSonicLLMService where the client wouldn't connect due to a breaking change in the AWS dependency chain.

  • PermissionError is now caught if NLTK's punkt_tab can't be downloaded.

  • Fixed an issue that would cause wrong user/assistant context ordering when using interruption strategies.

  • Fixed RTVI incoming message handling, broken in 0.0.87.

[0.0.87] - 2025-10-02

Added

  • Added WebsocketSTTService base class for websocket-based STT services. Combines STT functionality with websocket connectivity, providing automatic error handling and reconnection capabilities with exponential backoff.

  • Added DeepgramFluxSTTService for real-time speech recognition using Deepgram's Flux WebSocket API. Flux understands conversational flow and automatically handles turn-taking.

  • Added RTVI messages for user/bot audio levels and system logs.

  • Added cached tokens from OpenAI-based LLM services to MetricsFrame.
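
The reconnection behavior mentioned for WebsocketSTTService uses exponential backoff. The sketch below illustrates that technique in isolation. connect_with_backoff and its parameters are hypothetical, and the delays are kept tiny so the example runs quickly; it is not the base class's implementation.

```python
import asyncio
from typing import Awaitable, Callable, List


async def connect_with_backoff(
    connect: Callable[[], Awaitable[None]],
    max_attempts: int = 4,
    base_delay: float = 0.01,
    max_delay: float = 0.1,
) -> List[float]:
    """Retry `connect`, doubling the wait after each failure. Returns the delays used."""
    delays: List[float] = []
    for attempt in range(max_attempts):
        try:
            await connect()
            return delays
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = min(base_delay * (2 ** attempt), max_delay)
            delays.append(delay)
            await asyncio.sleep(delay)
    return delays


state = {"attempts": 0}


async def flaky_connect() -> None:
    # Simulated connection that fails twice, then succeeds.
    state["attempts"] += 1
    if state["attempts"] < 3:
        raise ConnectionError("simulated failure")


delays = asyncio.run(connect_with_backoff(flaky_connect))
```

Capping the delay with max_delay keeps a long outage from producing unbounded waits between attempts.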

Changed

  • Updated the default model for AnthropicLLMService to claude-sonnet-4-5-20250929.

Deprecated

  • DailyTransportMessageFrame and DailyTransportMessageUrgentFrame are deprecated, use DailyOutputTransportMessageFrame and DailyOutputTransportMessageUrgentFrame respectively instead.

  • LiveKitTransportMessageFrame and LiveKitTransportMessageUrgentFrame are deprecated, use LiveKitOutputTransportMessageFrame and LiveKitOutputTransportMessageUrgentFrame respectively instead.

  • TransportMessageFrame and TransportMessageUrgentFrame are deprecated, use OutputTransportMessageFrame and OutputTransportMessageUrgentFrame respectively instead.

  • InputTransportMessageUrgentFrame is deprecated, use InputTransportMessageFrame instead.

  • DailyUpdateRemoteParticipantsFrame is deprecated and will be removed in a future version. Instead, create your own custom frame and handle it in the @transport.output().event_handler("on_after_push_frame") event handler or a custom processor.

Fixed

  • Fixed an issue in AWSBedrockLLMService where timeout exceptions weren't being detected.

  • Fixed a PipelineTask issue that could prevent the application from exiting if task.cancel() was called when the task was already finished.

  • Fixed an issue where local SmartTurn was not being run in a separate thread.

[0.0.86] - 2025-09-24

Added

  • Added HeyGenTransport, an integration for HeyGen Interactive Avatar: a video service that handles audio streaming and requests HeyGen to generate avatar video responses (see https://www.heygen.com/). When used, the Pipecat bot joins the same virtual room as the HeyGen Avatar and the user.

  • Added support to TwilioFrameSerializer for region and edge settings.

  • Added support for using universal LLMContext with:

    • LLMLogObserver
    • GatedLLMContextAggregator (formerly GatedOpenAILLMContextAggregator)
    • LangchainProcessor
    • Mem0MemoryService
  • Added StrandsAgentProcessor which allows you to use the Strands Agents framework to build your voice agents. See https://strandsagents.com

  • Added ElevenLabsSTTService for speech-to-text transcription.

  • Added a peer connection monitor to the SmallWebRTCConnection that automatically disconnects if the connection fails to establish within the timeout (1 minute by default).

  • Added memory cleanup improvements to reduce memory peaks.

  • Added on_before_process_frame, on_after_process_frame, on_before_push_frame and on_after_push_frame. These are synchronous events that get called before and after a frame is processed or pushed. Because these events are synchronous, they should ideally perform only lightweight tasks so as not to block the pipeline. See examples/foundational/45-before-and-after-events.py.

  • Added on_before_leave synchronous event to DailyTransport.

  • Added on_before_disconnect synchronous event to LiveKitTransport.

  • It is now possible to register synchronous event handlers. By default, all event handlers are executed in a separate task. However, in some cases we want to guarantee order of execution, for example, executing something before disconnecting a transport.

    self._register_event_handler("on_event_name", sync=True)
    
  • Added support for global location in GoogleVertexLLMService. The service now supports both regional locations (e.g., "us-east4") and the "global" location for Vertex AI endpoints. When using "global" location, the service will use aiplatform.googleapis.com as the API host instead of the regional format.

  • Added on_pipeline_finished event to PipelineTask. This event will get fired when the pipeline is done running. This can be the result of a StopFrame, CancelFrame or EndFrame.

    @task.event_handler("on_pipeline_finished")
    async def on_pipeline_finished(task: PipelineTask, frame: Frame):
        ...
    
  • Added support for new RTVI send-text event, along with the ability to toggle the audio response off (skip tts) while handling the new context.
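
The difference between synchronous event handlers and the default task-based handlers described above can be sketched as follows. EventEmitter here is a hypothetical stand-in written for illustration, not Pipecat's implementation: handlers of a sync event are awaited in order before the caller continues, while handlers of other events are scheduled as separate tasks.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Handler = Callable[[], Awaitable[None]]


class EventEmitter:
    """Hypothetical stand-in, not Pipecat's implementation."""

    def __init__(self) -> None:
        self._sync: Dict[str, bool] = {}
        self._handlers: Dict[str, List[Handler]] = {}
        self._tasks: List[asyncio.Task] = []

    def register_event(self, name: str, sync: bool = False) -> None:
        self._sync[name] = sync
        self._handlers[name] = []

    def add_handler(self, name: str, handler: Handler) -> None:
        self._handlers[name].append(handler)

    async def emit(self, name: str) -> None:
        for handler in self._handlers[name]:
            if self._sync[name]:
                # Synchronous event: finish the handler before continuing.
                await handler()
            else:
                # Default: run the handler in its own task.
                self._tasks.append(asyncio.create_task(handler()))

    async def wait_for_tasks(self) -> None:
        await asyncio.gather(*self._tasks)


async def demo() -> List[str]:
    log: List[str] = []
    emitter = EventEmitter()
    emitter.register_event("on_before_disconnect", sync=True)
    emitter.register_event("on_disconnected")

    async def before() -> None:
        await asyncio.sleep(0.01)
        log.append("before-disconnect handler done")

    async def after() -> None:
        await asyncio.sleep(0.01)
        log.append("disconnected handler done")

    emitter.add_handler("on_before_disconnect", before)
    emitter.add_handler("on_disconnected", after)

    await emitter.emit("on_before_disconnect")
    log.append("disconnecting")
    await emitter.emit("on_disconnected")
    log.append("emit returned")
    await emitter.wait_for_tasks()
    return log


order = asyncio.run(demo())
```

The synchronous handler completes before "disconnecting" is logged, which is the ordering guarantee; the task-based handler finishes only after emit() has already returned.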

Changed

  • Updated aiortc to 1.13.0.

  • Updated sentry to 2.38.0.

  • BaseOutputTransport methods write_audio_frame and write_video_frame now return a boolean to indicate if the transport implementation was able to write the given frame or not.

  • Updated Silero VAD model to v6.

  • Updated livekit to 1.0.13.

  • torch and torchaudio are no longer required for running Smart Turn locally. This avoids gigabytes of dependencies being installed.

  • Updated websockets dependency to support version 15.0. Removed deprecated usage of ConnectionClosed.code and ConnectionClosed.reason attributes in AWSTranscribeSTTService for compatibility.

  • Refactored pyproject.toml to reduce websockets dependency repetition using self-referencing extras. All websockets-dependent services now reference a shared websockets-base extra.

Deprecated

  • GladiaSTTService's confidence arg is deprecated. confidence is no longer needed to determine which transcription or translation frames to emit.

  • PipelineTask events on_pipeline_stopped, on_pipeline_ended and on_pipeline_cancelled are now deprecated. Use on_pipeline_finished instead.

  • Support for the RTVI append-to-context event is deprecated in favor of the new send-text event, making way for future events like send-image.

Fixed

  • Fixed an issue where the pipeline could freeze if a task cancellation never completed because a third-party library swallowed asyncio.CancelledError. We now apply a timeout to task cancellations to prevent these freezes. If the timeout is reached, the system logs warnings and leaves dangling tasks behind, which can help diagnose where cancellation is being blocked.
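
The idea of bounding a cancellation with a timeout can be sketched with plain asyncio. This is only an illustration of the technique; `cancel_with_timeout` and the worker functions are hypothetical names, and the timeout values are arbitrary.

```python
import asyncio
import logging

logger = logging.getLogger("cancel-demo")


async def stubborn_worker():
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # Bad behaviour: swallow the cancellation and keep running.
        await asyncio.sleep(0.5)


async def polite_worker():
    # Lets CancelledError propagate, so cancellation completes promptly.
    await asyncio.sleep(3600)


async def cancel_with_timeout(task, timeout_secs):
    """Cancel a task but give up waiting after timeout_secs."""
    task.cancel()
    # asyncio.wait() neither raises nor cancels on timeout; it only reports
    # which tasks are still pending.
    _, pending = await asyncio.wait({task}, timeout=timeout_secs)
    if pending:
        logger.warning("task %s did not cancel within %ss", task.get_name(), timeout_secs)
        return False
    return True


async def main():
    results = []
    for worker in (polite_worker, stubborn_worker):
        task = asyncio.create_task(worker(), name=worker.__name__)
        await asyncio.sleep(0)  # let the task start running
        results.append(await cancel_with_timeout(task, 0.05))
    return results
```

Here `main()` reports success for the well-behaved task and a timeout (plus a warning naming the task) for the one that swallows `CancelledError`.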

  • Fixed an AudioBufferProcessor issue that was causing user audio to be missing in stereo recordings, which made bot and user audio overlap.

  • Fixed a BaseOutputTransport issue that could produce large saved AudioBufferProcessor files when using an audio mixer.

  • Fixed a PipelineRunner issue on Windows where setting up SIGINT and SIGTERM was raising an exception.

  • Fixed an issue where multiple handlers for an event would not run in parallel.

  • Fixed DailyTransport.sip_call_transfer() to automatically use the session ID from the on_dialin_connected event, when not explicitly provided. Now supports cold transfers (from incoming dial-in calls) by automatically tracking session IDs from connection events.

  • Fixed a memory leak in SmallWebRTCTransport. In aiortc, when you receive a MediaStreamTrack (audio or video), frames are produced asynchronously. If the code never consumes these frames, they are queued in memory, causing a memory leak.

  • Fixed an issue in AsyncAITTSService, where TTSTextFrames were not being pushed.

  • Fixed an issue that would cause push_interruption_task_frame_and_wait() to not wait if a previous interruption had already happened.

  • Fixed a couple of bugs in ServiceSwitcher:

    • Using multiple ServiceSwitchers in a pipeline would result in an error.
    • ServiceSwitcherFrames (such as ManuallySwitchServiceFrames) were having an effect too early, essentially "jumping the queue" in terms of pipeline frame ordering.
  • Fixed a self-cancellation deadlock in UserIdleProcessor when returning False from an idle callback. The task now terminates naturally instead of attempting to cancel itself.

  • Fixed an issue in AudioBufferProcessor where a recording is not created when a bot speaks and user input is blocked.

  • Fixed a FastAPIWebsocketTransport and SmallWebRTCTransport issue where on_client_disconnected would be triggered when the bot ends the conversation. That is, on_client_disconnected should only be triggered when the remote client actually disconnects.

  • Fixed an issue in HeyGenVideoService where the BotStartedSpeakingFrame was blocked from moving through the Pipeline.

[0.0.85] - 2025-09-12

Added

  • AzureSTTService now pushes interim transcriptions.

  • Added voice_cloning_key to GoogleTTSService to support custom cloned voices.

  • Added speaking_rate to GoogleTTSService.InputParams to control the speaking rate.

  • Added a speed arg to OpenAITTSService to control the speed of the voice response.

  • Added FrameProcessor.push_interruption_task_frame_and_wait(). Use this method to programmatically interrupt the bot from any part of the pipeline. This guarantees that all the processors in the pipeline are interrupted in order (from upstream to downstream). Internally, this works by first pushing an InterruptionTaskFrame upstream until it reaches the pipeline task. The pipeline task then generates an InterruptionFrame, which flows downstream through all processors. Once the InterruptionFrame reaches the processor waiting for the interruption, the function returns and execution continues after the call. Think of it as sending an upstream request for interruption and waiting until the acknowledgment flows back downstream.

  • Added new base TaskFrame (which is a system frame). This is the base class for all task frames (EndTaskFrame, CancelTaskFrame, etc.) that are meant to be pushed upstream to reach the pipeline task.

  • Expanded support for universal LLMContext to the AWS Bedrock LLM service. Using the universal LLMContext and associated LLMContextAggregatorPair is a pre-requisite for using LLMSwitcher to switch between LLMs at runtime.

  • Added new fields to the development runner's parse_telephony_websocket method in support of providing dynamic data to a bot.

    • Twilio: Added a new body parameter, which parses the websocket message for customParameters. Provide data via the Parameter nouns in your TwiML to use this feature.
    • Telnyx & Exotel: Both providers make the to and from phone numbers available in the websocket messages. You can now access these numbers as call_data["to"] and call_data["from"].

    Note: Each telephony provider offers different features. Refer to the corresponding example in pipecat-examples to see how to pass custom data to your bot.

  • Added body to the WebsocketRunnerArguments as an optional parameter. Custom body information can be passed from the server into the bot file via the bot() method using this new parameter.

  • Added video streaming support to LiveKitTransport.

  • Added OpenAIRealtimeLLMService and AzureRealtimeLLMService which provide access to OpenAI Realtime.

Changed

  • pipeline.tests.utils.run_test() now allows passing PipelineParams instead of individual parameters.

Removed

  • Remove VisionImageRawFrame in favor of context frames (LLMContextFrame or OpenAILLMContextFrame).

Deprecated

  • BotInterruptionFrame is now deprecated, use InterruptionTaskFrame instead.

  • StartInterruptionFrame is now deprecated, use InterruptionFrame instead.

  • Deprecate VisionImageFrameAggregator because VisionImageRawFrame has been removed. See the 12* examples for the new recommended replacement pattern.

  • NoisereduceFilter is now deprecated and will be removed in a future version. Use other audio filters like KrispFilter or AICFilter.

  • Deprecated OpenAIRealtimeBetaLLMService and AzureRealtimeBetaLLMService. Use OpenAIRealtimeLLMService and AzureRealtimeLLMService, respectively. Each service will be removed in an upcoming version, 1.0.0.

Fixed

  • Fixed a BaseOutputTransport issue that caused incorrect detection of when the bot stopped talking while using an audio mixer.

  • Fixed a LiveKitTransport issue where RTVI messages were not properly encoded.

  • Add additional fixups to Mistral context messages to ensure they meet Mistral-specific requirements, avoiding Mistral "invalid request" errors.

  • Fixed DailyTransport transcription handling to gracefully handle missing rawResponse field in transcription messages, preventing KeyError crashes.

[0.0.84] - 2025-09-05

Added

  • Add the ability to send DTMF to LiveKitTransport.

  • Expanded support for universal LLMContext to the Anthropic LLM service. Using the universal LLMContext and associated LLMContextAggregatorPair is a pre-requisite for using LLMSwitcher to switch between LLMs at runtime.

Changed

  • Updated daily-python to 0.19.9.

  • Restored DailyTransport's native DTMF support using Daily's send_dtmf() method instead of generated audio tones.

Fixed

  • Fixed an AWSBedrockLLMService crash caused by an extra await.

  • Fixed an OpenAIImageGenService issue where it was not creating URLImageRawFrame correctly.

[0.0.83] - 2025-09-03

Added

  • Added multilingual support for AsyncAI in AsyncAITTSService and AsyncAIHttpTTSService.

    • New languages: es, fr, de, it.
  • Added new frames InputTransportMessageUrgentFrame and DailyInputTransportMessageUrgentFrame for transport messages received from external sources.

  • Added UserSpeakingFrame. This will be sent upstream and downstream while VAD detects the user is speaking.

  • Expanded support for universal LLMContext to more LLM services. Using the universal LLMContext and associated LLMContextAggregatorPair is a pre-requisite for using LLMSwitcher to switch between LLMs at runtime. Here are the newly-supported services:

    • Azure
    • Cerebras
    • Deepseek
    • Fireworks AI
    • Google Vertex AI
    • Grok
    • Groq
    • Mistral
    • NVIDIA NIM
    • Ollama
    • OpenPipe
    • OpenRouter
    • Perplexity
    • Qwen
    • SambaNova
    • Together.ai
  • Added support for WhatsApp User-initiated Calls.

  • Added a new audio filter, AICFilter, which provides speech enhancement to improve VAD/STT performance and has no ONNX dependency. See https://ai-coustics.com/sdk/

  • Added a timeout around cancelling input tasks to prevent indefinite hangs when cancellation is swallowed by third-party code.

  • Added pipecat.extensions.ivr for automated IVR system navigation with configurable goals and conversation handling. Supports DTMF input, verbal responses, and intelligent menu traversal.

    Basic usage:

    from pipecat.extensions.ivr.ivr_navigator import IVRNavigator
    
    # Create IVR navigator with your goal
    ivr_navigator = IVRNavigator(
        llm=llm_service,
        ivr_prompt="Navigate to billing department to dispute a charge"
    )
    
    # Handle different outcomes
    @ivr_navigator.event_handler("on_conversation_detected")
    async def on_conversation(processor, conversation_history):
        # Switch to normal conversation mode
        pass
    
    @ivr_navigator.event_handler("on_ivr_status_changed")
    async def on_ivr_status(processor, status):
        if status == IVRStatus.COMPLETED:
            # End pipeline, transfer call, or start bot conversation
            pass
        elif status == IVRStatus.STUCK:
            # Handle navigation failure
            pass
    
  • BaseOutputTransport now implements write_dtmf() by loading DTMF audio and sending it through the transport. This makes sending DTMF generic across all output transports.

  • Added new config parameters to GladiaSTTService.

    • PreProcessingConfig > audio_enhancer to enhance audio quality.
    • CustomVocabularyItem > pronunciations and language to specify special pronunciations and in which language it will be pronounced.

Changed

  • UserStartedSpeakingFrame and UserStoppedSpeakingFrame are also pushed upstream.

  • ParallelPipeline now waits for CancelFrame to finish in all branches before pushing it downstream.

  • Added sip_codecs to the DailyRoomSipParams.

  • Updated the configure() function in pipecat.runner.daily to include new args to create SIP-enabled rooms. Additionally, added new args to control the room and token expiration durations.

  • pipecat.frames.frames.KeypadEntry is deprecated and has been moved to pipecat.audio.dtmf.types.KeypadEntry.

  • Updated RimeTTSService's flush_audio message to conform with Rime's official API.

  • Updated the default model for CerebrasLLMService to GPT-OSS-120B.

Removed

  • Remove StopInterruptionFrame. This was a legacy frame that was not really being used anywhere and didn't carry any useful meaning. It was only pushed after UserStoppedSpeakingFrame, so developers can just use UserStoppedSpeakingFrame.

  • DailyTransport.write_dtmf() has been removed in favor of the generic BaseOutputTransport.write_dtmf().

  • Remove deprecated DailyTransport.send_dtmf().

Deprecated

  • Transports have been re-organized.

    pipecat.transports.network.small_webrtc        -> pipecat.transports.smallwebrtc.transport
    pipecat.transports.network.webrtc_connection   -> pipecat.transports.smallwebrtc.connection
    pipecat.transports.network.websocket_client    -> pipecat.transports.websocket.client
    pipecat.transports.network.websocket_server    -> pipecat.transports.websocket.server
    pipecat.transports.network.fastapi_websocket   -> pipecat.transports.websocket.fastapi
    pipecat.transports.services.daily              -> pipecat.transports.daily.transport
    pipecat.transports.services.helpers.daily_rest -> pipecat.transports.daily.utils
    pipecat.transports.services.livekit            -> pipecat.transports.livekit.transport
    pipecat.transports.services.tavus              -> pipecat.transports.tavus.transport
    
  • pipecat.frames.frames.KeypadEntry is deprecated; use pipecat.audio.dtmf.types.KeypadEntry instead.

Fixed

  • Fixed an issue where messages received from the transport were always being resent.

  • Fixed SmallWebRTCTransport to not use mid to decide if the transceiver should be sendrecv or not.

  • Fixed an issue where Deepgram swallowed asyncio.CancelledError during disconnect, preventing tasks from being cancelled.

  • Fixed an issue where PipelineTask was not cleaning up the observers.

Performance

  • Reduced latency and improved memory performance in Mem0MemoryService.

[0.0.82] - 2025-08-28

Added

  • Added a new LLMRunFrame to trigger an LLM response:

    await task.queue_frames([LLMRunFrame()])
    

    This replaces OpenAILLMContextFrame, which you'd typically have used like this:

    await task.queue_frames([context_aggregator.user().get_context_frame()])
    

    Use this way of kicking off your conversation when you've already initialized your context and are simply instructing the bot when to go:

    context = OpenAILLMContext(messages, tools)
    context_aggregator = llm.create_context_aggregator(context)
    
    # ...
    
    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        # Kick off the conversation.
        await task.queue_frames([LLMRunFrame()])
    

    Note that if you want to add new messages when kicking off the conversation, you could use LLMMessagesAppendFrame with run_llm=True instead:

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        # Kick off the conversation.
        await task.queue_frames([LLMMessagesAppendFrame(new_messages, run_llm=True)])
    

    In the rare case you don't have a context aggregator in your pipeline, then you may continue using a context frame.

  • Added support for switching between audio+text and text-only modes within the same pipeline. This is done by pushing LLMConfigureOutputFrame(skip_tts=True) to enter text-only mode, and disabling it to return to audio+text. The LLM will still generate tokens and add them to the context, but they will not be sent to TTS.

  • Added skip_tts field to TextFrame. This lets a text frame bypass TTS while still being included in the LLM context. Useful for cases like structured text that isn't meant to be spoken but should still contribute to context.

  • Added a cancel_timeout_secs argument to PipelineTask which defines how long the pipeline has to complete cancellation. When PipelineTask.cancel() is called, a CancelFrame is pushed through the pipeline and must reach the end. If it does not reach the end within the specified time, a warning is shown and the wait is aborted.

  • Added a new "universal" (LLM-agnostic) LLMContext and accompanying LLMContextAggregatorPair, which will eventually replace OpenAILLMContext (and the other under-the-hood contexts) and the other context aggregators. The new universal LLMContext machinery allows a single context to be shared between different LLMs, enabling runtime LLM switching and scenarios like failover.

    From the developer's point of view, switching to using the new universal context machinery will usually be a matter of going from this:

    context = OpenAILLMContext(messages, tools)
    context_aggregator = llm.create_context_aggregator(context)
    

    To this:

    context = LLMContext(messages, tools)
    context_aggregator = LLMContextAggregatorPair(context)
    

    To start, the universal LLMContext is supported with the following LLM services:

    • OpenAILLMService
    • GoogleLLMService
  • Added a new LLMSwitcher class to enable runtime LLM switching, built atop a new generic ServiceSwitcher.

    Switchers take a switching strategy. The first available strategy is ServiceSwitcherStrategyManual.

    To switch LLMs at runtime, the LLMs must be sharing one instance of the new universal LLMContext (see above bullet).

    # Instantiate your LLM services
    llm_openai = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
    llm_google = GoogleLLMService(api_key=os.getenv("GOOGLE_API_KEY"))
    
    # Instantiate a switcher
    # (ServiceSwitcherStrategyManual defaults to OpenAI, as it's first in the list)
    llm_switcher = LLMSwitcher(
        llms=[llm_openai, llm_google], strategy_type=ServiceSwitcherStrategyManual
    )
    
    # Create your pipeline
    pipeline = Pipeline(
      [
          transport.input(),
          stt,
          context_aggregator.user(),
          llm_switcher,
          tts,
          transport.output(),
          context_aggregator.assistant(),
      ]
    )
    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    
    # ...
    # Whenever is appropriate, switch LLMs!
    await task.queue_frames([ManuallySwitchServiceFrame(service=llm_google)])
    
  • Added an LLMService.run_inference() method to LLM services to enable direct, out-of-band (i.e. out-of-pipeline) inference.

Changed

  • Updated daily-python to 0.19.8.

  • PipelineTask now waits for StartFrame to reach the end of the pipeline before pushing any other frames.

  • Updated CartesiaTTSService and CartesiaHttpTTSService to align with Cartesia's changes for the speed parameter. It now takes only an enum of slow, normal, or fast.

  • Added support to AWSBedrockLLMService for setting authentication credentials through environment variables.

  • Updated SarvamTTSService to use WebSocket streaming for real-time audio generation with multiple Indian languages, with HTTP support still available via SarvamHttpTTSService.

Fixed

  • Fixed an RTVI issue that was causing frames to be pushed before pipeline was properly initialized.

  • Fixed some get_messages_for_logging() implementations that were returning a JSON string instead of a list.

  • Fixed a DailyTransport issue that prevented DTMF tones from being sent.

  • Fixed a missing import in SentryMetrics.

  • Fixed AWSPollyTTSService to support AWS credential provider chain (IAM roles, IRSA, instance profiles) instead of requiring explicit environment variables.

  • Fixed a CartesiaTTSService issue that was causing the application to hang after Cartesia's 5-minute timeout elapsed.

  • Fixed an issue preventing SpeechmaticsSTTService from transcribing audio.

[0.0.81] - 2025-08-25

Added

  • Added pipecat.extensions.voicemail, a module for detecting voicemail vs. live conversation, primarily intended for use in outbound calling scenarios. The voicemail module is optimized for text LLMs only.

  • Added new frames to the idle_timeout_frames arg: TranscriptionFrame, InterimTranscriptionFrame, UserStartedSpeakingFrame, and UserStoppedSpeakingFrame. These additions serve as indicators of user activity in the pipeline idle detection logic.

  • Allow passing custom pipeline sink and source processors to a Pipeline. Pipeline source and sink processors are used to know and control what's coming in and out of a Pipeline processor.

  • Added FrameProcessor.pause_processing_system_frames() and FrameProcessor.resume_processing_system_frames(). These allow pausing and resuming the processing of system frames.

  • Added new on_process_frame() observer method which makes it possible to know when a frame is being processed.

  • Added new FrameProcessor.entry_processor() method. This allows you to access the first non-compound processor in a pipeline.

  • Added FrameProcessor properties processors, next and previous.

  • ElevenLabsTTSService now supports additional runtime changes to the model, language, and voice_settings parameters.

  • Added apply_text_normalization support to ElevenLabsTTSService and ElevenLabsHttpTTSService.

  • Added MistralLLMService, using Mistral's chat completion API.

  • Added the ability to retry executing a chat completion after a timeout period for OpenAILLMService and its subclasses, AnthropicLLMService, and AWSBedrockLLMService. The LLM services accept new args: retry_timeout_secs and retry_on_timeout. This feature is disabled by default.
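
A minimal sketch of the retry-on-timeout pattern, using only asyncio. The helper `run_with_retry` and the `flaky_completion` coroutine are hypothetical; only the argument names `retry_timeout_secs` and `retry_on_timeout` come from the entry above, and the choice of a single retry without a second timeout is an assumption made for illustration.

```python
import asyncio


async def run_with_retry(make_call, retry_timeout_secs=5.0, retry_on_timeout=False):
    """Run make_call(); optionally retry once if the first attempt times out."""
    if not retry_on_timeout:
        # Feature disabled: behave exactly like a plain call.
        return await make_call()
    try:
        return await asyncio.wait_for(make_call(), timeout=retry_timeout_secs)
    except asyncio.TimeoutError:
        # Assumption: one retry, this time without a timeout.
        return await make_call()


async def demo():
    attempts = 0

    async def flaky_completion():
        nonlocal attempts
        attempts += 1
        if attempts == 1:
            await asyncio.sleep(1.0)  # first attempt is too slow
        return f"response from attempt {attempts}"

    result = await run_with_retry(
        flaky_completion, retry_timeout_secs=0.05, retry_on_timeout=True
    )
    return result, attempts
```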

Changed

  • Updated daily-python to 0.19.7.

Deprecated

  • FrameProcessor.wait_for_task() is deprecated. Use await task or await asyncio.wait_for(task, timeout) instead.
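
For reference, the two suggested replacements look like this in plain asyncio (the `work` coroutine and the timeout values are placeholders chosen for the example):

```python
import asyncio


async def work():
    await asyncio.sleep(0.01)
    return "done"


async def main():
    # Wait without a deadline: simply await the task.
    task = asyncio.create_task(work())
    first = await task

    # Wait with a deadline: asyncio.wait_for() cancels the task on timeout.
    slow = asyncio.create_task(asyncio.sleep(10))
    try:
        await asyncio.wait_for(slow, timeout=0.05)
        second = "finished"
    except asyncio.TimeoutError:
        second = "timed out"
    return first, second, slow.cancelled()
```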

Removed

  • Watchdog timers have been removed. They were introduced in 0.0.72 to help diagnose pipeline freezes. Unfortunately, they proved ineffective since they required developers to use Pipecat-specific queues, iterators, and events to correctly reset the timer, which limited their usefulness and added friction.

  • Removed unused FrameProcessor.set_parent() and FrameProcessor.get_parent().

Fixed

  • Fixed an issue that would cause PipelineRunner and PipelineTask to not handle external asyncio task cancellation properly.

  • Added SpeechmaticsSTTService exception handling on connection and sending.

  • Replaced asyncio.wait_for() with wait_for2.wait_for() for Python < 3.12 because of task cancellation issues (i.e. cancellation is never propagated). See https://bugs.python.org/issue42130

  • Fixed an AudioBufferProcessor issue that would cause audio overlap when setting a max buffer size.

  • Fixed an issue where AsyncAITTSService had very high latency in responding by adding force=true when sending the flush command.

Performance

  • Improve PipelineTask performance by using direct mode processors and by removing unnecessary tasks.

  • Improve ParallelPipeline performance by using direct mode, by not creating a task for each frame and every sub-pipeline and also by removing other unnecessary tasks.

  • Pipeline performance improvements by using direct mode.

Other

  • Added 14w-function-calling-mistal.py using MistralLLMService.

  • Added 13j-azure-transcription.py using AzureSTTService.

[0.0.80] - 2025-08-13

Added

  • Added GeminiTTSService which uses Google Gemini to generate TTS output. The Gemini model can be prompted to insert styled speech to control the TTS output.

  • Added Exotel support to Pipecat's development runner. You can now connect using the runner with uv run bot.py -t exotel and an ngrok connection to HTTP port 7860.

  • Added enable_direct_mode argument to FrameProcessor. The direct mode is for processors which require very little I/O or compute resources, that is, processors that can perform their task almost immediately. These types of processors don't need any of the internal tasks and queues usually created by frame processors, which means overall application performance might be slightly increased. Use with care.

  • Added TTFB metrics for HeyGenVideoService and TavusVideoService.

  • Added endpoint_id parameter to AzureSTTService. (Custom EndpointId)

Changed

  • WatchdogPriorityQueue now requires the items to be inserted to always be tuples and the size of the tuple needs to be specified in the constructor when creating the queue with the tuple_size argument.

  • Updated Moondream to revision 2025-01-09.

  • Updated PlayHTHttpTTSService to no longer use the pyht client to remove compatibility issues with other packages. Now you can use the PlayHT HTTP service with other services, like GoogleLLMService.

  • Updated pyproject.toml to once again pin numba to ==0.61.2 in order to resolve package versioning issues.

  • Updated the STTMuteFilter to include VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame in the list of frames to filter when the filtering is on.

Performance

  • Improved the latency of the HeyGenVideoService.

  • Improved some frame processors performance by using the new frame processor direct mode. In direct mode a frame processor will process frames right away avoiding the need for internal queues and tasks. This is useful for some simple processors. For example, in processors that wrap other processors (e.g. Pipeline, ParallelPipeline), we add one processor before and one after the wrapped processors (internally, you will see them as sources and sinks). These sources and sinks don't do any special processing and they basically forward frames. So, for these simple processors we now enable the new direct mode which avoids creating any internal tasks (and queues) and therefore improves performance.

Fixed

  • Fixed an issue with the BaseWhisperSTTService where the language was specified as an enum and not a string.

  • Fixed an issue where SmallWebRTCTransport ended before TTS finished.

  • Fixed an issue in OpenAIRealtimeBetaLLMService where specifying the text modality didn't result in text being output from the model.

  • Added SSML reserved character escaping to AzureBaseTTSService to properly handle special characters in text sent to Azure TTS. This fixes an issue where characters like &, <, >, ", and ' in LLM-generated text would cause TTS failures.
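
A small sketch of this kind of escaping using the Python standard library. It illustrates the technique only; the helper name `escape_for_ssml` is hypothetical and the service's own implementation may differ.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; quotes need explicit entities.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_for_ssml(text: str) -> str:
    """Escape the five XML reserved characters before embedding text in SSML."""
    return escape(text, _QUOTE_ENTITIES)
```

For example, `escape_for_ssml('Tom & "Jerry" <3')` yields `Tom &amp; &quot;Jerry&quot; &lt;3`.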

  • Fixed a WatchdogPriorityQueue issue that could cause an exception when comparing watchdog cancel sentinel items with other items in the queue.

  • Fixed an issue that would cause system frames to not be processed with higher priority than other frames. This could cause slower interruption times.
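
Prioritizing system frames can be pictured with a standard asyncio.PriorityQueue. This is a simplified sketch, not the frame processor's real queue: the priority constants, the `put` helper and the string "frames" are all made up for the example.

```python
import asyncio
import itertools

SYSTEM_PRIORITY = 0  # lower value is dequeued first
NORMAL_PRIORITY = 1


async def demo():
    queue = asyncio.PriorityQueue()
    counter = itertools.count()

    def put(frame, is_system):
        priority = SYSTEM_PRIORITY if is_system else NORMAL_PRIORITY
        # The counter keeps insertion order within a priority level and
        # ensures the frames themselves are never compared.
        queue.put_nowait((priority, next(counter), frame))

    put("TextFrame(hello)", False)
    put("TextFrame(world)", False)
    put("InterruptionFrame", True)

    order = []
    while not queue.empty():
        _, _, frame = queue.get_nowait()
        order.append(frame)
    return order
```

Even though the interruption is queued last, it is dequeued first, which is what keeps interruption times low.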

  • Fixed an issue where retrying a websocket connection error would result in an error.

Other

  • Add foundational example 19b-openai-realtime-beta-text.py, showing how to use OpenAIRealtimeBetaLLMService to output text to a TTS service.

  • Add vision support to release evals so we can run the foundational examples 12 series.

  • Added foundational example 15a-switch-languages.py to release evals. It is able to detect if we switched the language properly.

  • Updated foundational examples to show how to enclose complex logic (e.g. ParallelPipeline) into a single processor so the main pipeline becomes simpler.

  • Added 07n-interruptible-gemini.py, demonstrating how to use GeminiTTSService.

[0.0.79] - 2025-08-07

Changed

  • Changed pipecat-ai's openai dependency to >=1.74.0,<=1.99.1 due to a breaking change in openai 1.99.2 (commit)

Deprecated

  • TTSService.say() is deprecated, push a TTSSpeakFrame instead. Calling functions directly is a discouraged pattern in Pipecat because, for example, it might cause issues with frame ordering.

  • LLMMessagesFrame is deprecated, in favor of either:

    • LLMMessagesUpdateFrame with run_llm=True
    • OpenAILLMContextFrame with desired messages in a new context
  • LLMUserResponseAggregator and LLMAssistantResponseAggregator are deprecated, as they depended on the now-deprecated LLMMessagesFrame. Use LLMUserContextAggregator and LLMAssistantContextAggregator (or LLM-specific subclasses thereof) instead.

[0.0.78] - 2025-08-07

Added

  • Added SonioxSTTService using Soniox's STT websocket API.

  • Added enable_emulated_vad_interruptions to LLMUserAggregatorParams. When user speech is emulated (e.g. when a transcription is received but VAD doesn't detect speech), this parameter controls whether the emulated speech can interrupt the bot. Default is False (emulated speech is ignored while the bot is speaking).

  • Added new handle_sigint and handle_sigterm to RunnerArguments. This allows applications to know what settings they should use for the environment they are running on. Also, added pipeline_idle_timeout_secs to be able to control the PipelineTask idle timeout.

  • Added processor field to ErrorFrame to indicate the FrameProcessor that generated the error.

  • Added new language support for AWSTranscribeSTTService. All languages supporting streaming data input are now supported: https://docs.aws.amazon.com/transcribe/latest/dg/supported-languages.html

  • Added support for Simli Trinity Avatars. A new is_trinity_avatar parameter has been introduced to specify whether the provided faceId corresponds to a Trinity avatar, which is required for optimal Trinity avatar performance.

  • The development runner now handles custom body data for DailyTransport. The body data is passed to the Pipecat client. You can POST to the /start endpoint with a request body of:

    {
        "createDailyRoom": true,
        "dailyRoomProperties": { "start_video_off": true },
        "body": { "custom_data": "value" }
    }
    

    The body information is parsed and used in the application. The dailyRoomProperties are currently not handled.
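
One way to build such a request from Python, using only the standard library. The base URL `http://localhost:7860` is an assumption about a locally running development runner, and `build_start_request` is a hypothetical helper; the sending step is left commented out because it needs a live server.

```python
import json
import urllib.request


def build_start_request(base_url, custom_body):
    """Build a POST request for the runner's /start endpoint."""
    payload = {
        "createDailyRoom": True,
        "dailyRoomProperties": {"start_video_off": True},
        "body": custom_body,  # custom data handed to the bot
    }
    return urllib.request.Request(
        f"{base_url}/start",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_start_request("http://localhost:7860", {"custom_data": "value"})
# To actually send it (requires a running development runner):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```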

  • Added detailed latency logging to UserBotLatencyLogObserver, capturing average response time between user stop and bot start, as well as minimum and maximum response latency.
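
The summary described here amounts to simple statistics over the per-turn latencies. A rough sketch follows, where `summarize_latencies` is a hypothetical helper and each value is assumed to be the number of seconds between the user stopping and the bot starting to speak:

```python
from statistics import mean


def summarize_latencies(latencies_secs):
    """Return average, minimum and maximum latency, or None if there is no data."""
    if not latencies_secs:
        return None
    return {
        "avg": mean(latencies_secs),
        "min": min(latencies_secs),
        "max": max(latencies_secs),
    }
```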

  • Added Chinese, Japanese, Korean word timestamp support to CartesiaTTSService.

  • Added region parameter to GladiaSTTService. Accepted values: eu-west (default), us-west.

Changed

  • System frames are now queued. Before, system frames could be generated from any task and would not guarantee any order which was causing undesired behavior. Also, it was possible to get into some rare recursion issues because of the way system frames were executed (they were executed in-place, meaning calling push_frame() would finish after the system frame traversed all the pipeline). This makes system frames more deterministic.

  • Changed the default model for both ElevenLabsTTSService and ElevenLabsHttpTTSService to eleven_turbo_v2_5. The rationale for this change is that the Turbo v2.5 model exhibits the most stable voice quality along with very low latency TTFB; latencies are on par with the Flash v2.5 model. Also, the Turbo v2.5 model outputs word/timestamp alignment data with correct spacing.

  • The development runner's /connect and /start endpoints now both return dailyRoom and dailyToken in place of the previous room_url and token.

  • Updated the pipecat.runner.daily utility to only take DAILY_API_URL and DAILY_SAMPLE_ROOM_URL environment variables instead of parsing the -u and -k arguments, respectively.

  • Updated daily-python to 0.19.6.

  • Changed TavusVideoService to send audio or video frames only after the transport is ready, preventing warning messages at startup.

  • The development runner now strips any provided protocol (e.g. https://) from the proxy address and issues a warning. It also strips trailing /.
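
The normalization can be sketched as follows. This is an illustration of the described behavior, not the runner's code: `normalize_proxy` is a hypothetical name and the warning text is made up.

```python
import re
import warnings


def normalize_proxy(address: str) -> str:
    """Strip a leading scheme (e.g. https://) and any trailing slashes."""
    stripped = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", address)
    if stripped != address:
        # A protocol was provided; drop it but let the user know.
        warnings.warn(f"Removed protocol from proxy address: {address!r}")
    return stripped.rstrip("/")
```

For example, `normalize_proxy("https://example.ngrok.io/")` returns `example.ngrok.io`, while an address that is already bare is returned unchanged.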

Deprecated

  • In the pipecat.runner.daily, the configure_with_args() function is deprecated. Use the configure() function instead.

  • The development runner's /connect endpoint is deprecated and will be removed in a future version. Use the /start endpoint in its place. In the meantime, both endpoints work and deliver equivalent functionality.

Fixed

  • Fixed a DailyTransport issue that would result in an unhandled concurrent.futures.CancelledError when a future is cancelled.

  • Fixed a RivaSTTService issue that would result in an unhandled concurrent.futures.CancelledError when a future was cancelled while reading audio chunks from the incoming audio stream.

  • Fixed an issue in the BaseOutputTransport, mainly reproducible with FastAPIWebsocketOutputTransport when the audio mixer was enabled, where the loop could consume 100% CPU by continuously returning without delay, preventing other asyncio tasks (such as cancellation or shutdown signals) from being processed.

  • Fixed an issue where BotStartedSpeakingFrame and BotStoppedSpeakingFrame were not emitted when using TavusVideoService or HeyGenVideoService.

  • Fixed an issue in LiveKitTransport where empty AudioRawFrames were pushed down the pipeline. This resulted in warnings by the STT processor.

  • Fixed PiperTTSService to send text as a JSON object in the request body, resolving compatibility with Piper's HTTP API.

  • Fixed an issue with the TavusVideoService where an error was thrown due to missing transcription callbacks.

  • Fixed an issue in SpeechmaticsSTTService where the user_id was set to None when diarization is not enabled.

Performance

  • Fixed an issue in TaskObserver (a proxy to all observers) that was degrading global performance.

Other

  • Added 07aa-interruptible-soniox.py, 07ab-interruptible-inworld-http.py, 07ac-interruptible-asyncai.py and 07ac-interruptible-asyncai-http.py release evals.

[0.0.77] - 2025-07-31

Added

  • Added InputTextRawFrame frame type to handle user text input with Gemini Multimodal Live.

  • Added HeyGenVideoService. This is an integration for HeyGen Interactive Avatar. A video service that handles audio streaming and requests HeyGen to generate avatar video responses. (see https://www.heygen.com/)

  • Added the ability to switch voices to RimeTTSService.

  • Added unified development runner for building voice AI bots across multiple transports

    • pipecat.runner.run FastAPI-based development server with automatic bot discovery
    • pipecat.runner.types Runner session argument types (DailyRunnerArguments, SmallWebRTCRunnerArguments, WebSocketRunnerArguments)
    • pipecat.runner.utils.create_transport() Factory function for creating transports from session arguments
    • pipecat.runner.daily and pipecat.runner.livekit Configuration utilities for Daily and LiveKit setups
    • Support for all transport types: Daily, WebRTC, Twilio, Telnyx, Plivo
    • Automatic telephony provider detection and serializer configuration
    • ESP32 WebRTC compatibility with SDP munging
    • Environment detection (ENV=local) for conditional features
  • Added Async.ai TTS integration (https://async.ai/)

    • AsyncAITTSService WebSocket-based streaming TTS with interruption support
    • AsyncAIHttpTTSService HTTP-based streaming TTS service
    • Example scripts:
      • examples/foundational/07ac-interruptible-asyncai.py (WebSocket demo)
      • examples/foundational/07ac-interruptible-asyncai-http.py (HTTP demo)
  • Added transcription_bucket params support to the DailyRESTHelper.

  • Added a new TTS service, InworldTTSService. This service provides low-latency, high-quality speech generation using Inworld's streaming API.

  • Added a new field handle_sigterm to PipelineRunner. It defaults to False. This field handles SIGTERM signals. The handle_sigint field still defaults to True, but now it handles only SIGINT signals.

  • Added foundational example 14u-function-calling-ollama.py for Ollama function calling.

  • Added LocalSmartTurnAnalyzerV2, which supports local on-device inference with the new smart-turn-v2 turn detection model.

  • Added set_log_level to DailyTransport, allowing setting the logging level for Daily's internal logging system.

  • Added on_transcription_stopped and on_transcription_error to Daily callbacks.

Changed

  • Changed the default url for NeuphonicTTSService to wss://api.neuphonic.com as it provides better global performance. You can set the URL to other URLs, such as the previous default: wss://eu-west-1.api.neuphonic.com.

  • Updated daily-python to 0.19.5.

  • STTMuteFilter now pushes the STTMuteFrame upstream and downstream, to allow for more flexible STTMuteFilter placement.

  • Play delayed messages from ElevenLabsTTSService if they still belong to the current context.

  • Dependency compatibility improvements: Relaxed version constraints for core dependencies to support broader version ranges while maintaining stability:

    • aiohttp, Markdown, nltk, numpy, Pillow, pydantic, openai, numba: Now support up to the next major version (e.g. numpy>=1.26.4,<3)
    • pyht: Relaxed to >=0.1.6 to resolve grpcio conflicts with nvidia-riva-client
    • fastapi: Updated to support versions >=0.115.6,<0.117.0
    • torch/torchaudio: Changed from exact pinning (==2.5.0) to compatible range (~=2.5.0)
    • aws_sdk_bedrock_runtime: Added Python 3.12+ constraint via environment marker
    • numba: Reduced minimum version to 0.60.0 for better compatibility
  • Changed NeuphonicHttpTTSService to use a POST based request instead of the pyneuphonic package. This removes a package requirement, allowing Neuphonic to work with more services.

  • Updated ElevenLabsTTSService to handle the case where allow_interruptions=False. Now, when interruptions are disabled, the same context ID will be used throughout the conversation.

  • Updated the deepgram optional dependency to 4.7.0, which downgrades the tasks cancelled error to a debug log. This removes the log from appearing in Pipecat logs upon leaving.

  • Upgraded the websockets implementation to the new asyncio implementation. Along with this change, we're updating support for versions >=13.1.0 and <15.0.0. All services have been updated to use the asyncio implementation.

  • Updated MiniMaxHttpTTSService with a base_url arg where you can specify the Global endpoint (default) or Mainland China.

  • Replaced regex-based sentence detection in match_endofsentence with NLTK's punkt_tab tokenizer for more reliable sentence boundary detection.

  • Changed the livekit optional dependency for tenacity to tenacity>=8.2.3,<10.0.0 in order to support the google-genai package.

  • For LmntTTSService, changed the default model to blizzard, LMNT's recommended model.

  • Updated SpeechmaticsSTTService:

    • Added support for additional diarization options.
    • Added foundational example 07a-interruptible-speechmatics-vad.py, which uses VAD detection provided by SpeechmaticsSTTService.

Fixed

  • Fixed a LLMUserResponseAggregator issue where interruptions were not being handled properly.

  • Fixed PiperTTSService to work with newer Piper GPL.

  • Fixed a race condition in FastAPIWebsocketClient that occurred when attempting to send a message while the client was disconnecting.

  • Fixed an issue in GoogleLLMService where interruptions did not work when an interruption strategy was used.

  • Fixed an issue in the TranscriptProcessor where newline characters could cause the transcript output to be corrupted (e.g. missing all spaces).

  • Fixed an issue in AudioBufferProcessor when using SmallWebRTCTransport where, if the microphone was muted, track timing was not respected.

  • Fixed an error that occurs when pushing an LLMMessagesFrame. Only some LLM services, like Grok, are impacted by this issue. The fix is to remove the optional name property that was being added to the message.

  • Fixed an issue in AudioBufferProcessor that caused garbled audio when enable_turn_audio was enabled and audio resampling was required.

  • Fixed a dependency issue for uv users where an llvmlite version required python 3.9.

  • Fixed an issue in MiniMaxHttpTTSService where the pitch param was the incorrect type.

  • Fixed an issue with OpenTelemetry tracing where the enable_tracing flag did not disable the internal tracing decorator functions.

  • Fixed an issue in OLLamaLLMService where kwargs were not passed correctly to the parent class.

  • Fixed an issue in ElevenLabsTTSService where the word/timestamp pairs were calculating word boundaries incorrectly.

  • Fixed an issue where, in some edge cases, the EmulateUserStartedSpeakingFrame could be created even if we didn't have a transcription.

  • Fixed an issue in GoogleLLMContext where it would inject the system_message as a "user" message into cases where it was not meant to; it was only meant to do that when there were no "regular" (non-function-call) messages in the context, to ensure that inference would run properly.

  • Fixed an issue in LiveKitTransport where the on_audio_track_subscribed was never emitted.

Other

  • Added new quickstart demos:

    • examples/quickstart: voice AI bot quickstart
    • examples/client-server-web: client/server starter example
    • examples/phone-bot-twilio: twilio starter example
  • Removed most of the examples from the pipecat repo. Examples can now be found in: https://github.com/pipecat-ai/pipecat-examples.

[0.0.76] - 2025-07-11

Added

  • Added SpeechControlParamsFrame, a new SystemFrame that notifies downstream processors of the VAD and Turn analyzer params. This frame is pushed by the BaseInputTransport at Start and any time a VADParamsUpdateFrame is received.

Changed

  • Two package dependencies have been updated:
    • numpy now supports 1.26.0 and newer
    • transformers now supports 4.48.0 and newer

Fixed

  • Fixed an issue with RTVI's handling of append-to-context.

  • Fixed an issue where using audio input with a sample rate requiring resampling could result in empty audio being passed to STT services, causing errors.

  • Fixed the VAD analyzer to process the full audio buffer as long as it contains more than the minimum required bytes per iteration, instead of only analyzing the first chunk.
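The chunking behavior described in the VAD entry above can be pictured with a small sketch. It assumes a fixed chunk size in bytes; the function name `split_into_chunks` is hypothetical and this is not the analyzer's real implementation.

```python
def split_into_chunks(buffer: bytes, chunk_bytes: int) -> tuple[list[bytes], bytes]:
    """Split `buffer` into full chunks of `chunk_bytes`; return (chunks, leftover).

    Hypothetical helper for illustration; not part of Pipecat's API.
    """
    chunks = []
    # Keep consuming while a full chunk is available, rather than
    # analyzing only the first chunk and ignoring the rest.
    while len(buffer) >= chunk_bytes:
        chunks.append(buffer[:chunk_bytes])
        buffer = buffer[chunk_bytes:]
    # Whatever is left is smaller than one chunk and waits for more audio.
    return chunks, buffer
```

Each returned chunk would then be passed to the analyzer in turn, and the leftover bytes stay buffered until the next audio frame arrives.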

  • Fixed an issue in ParallelPipeline that caused errors when attempting to drain the queues.

  • Fixed an issue with emulated VAD timeout inconsistency in LLMUserContextAggregator. Previously, emulated VAD scenarios (where transcription is received without VAD detection) used a hardcoded aggregation_timeout (default 0.5s) instead of matching the VAD's stop_secs parameter (default 0.8s). This created different user experiences between real VAD and emulated VAD scenarios. Now, emulated VAD timeouts automatically synchronize with the VAD's stop_secs parameter.

  • Fixed a pipeline freeze when using AWS Nova Sonic, which would occur if the user started speaking early, while the bot was still working through trigger_assistant_response().

[0.0.75] - 2025-07-08 [YANKED]

This release has been yanked due to resampling issues affecting audio output quality and critical bugs impacting ParallelPipelines functionality.

Please upgrade to version 0.0.76 or later.

Added

  • Added an aggregate_sentences arg in CartesiaTTSService, ElevenLabsTTSService, NeuphonicTTSService and RimeTTSService, where the default value is True. When aggregate_sentences is True, the TTSService aggregates the LLM streamed tokens into sentences by default. Note: setting the value to False requires a custom processor before the TTSService to aggregate LLM tokens.
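To make the idea of sentence aggregation concrete, here is a deliberately naive sketch of turning streamed tokens into sentences. It is not how the TTS services implement it; the function name `sentences_from_tokens` and the simple punctuation rule are assumptions chosen for brevity.

```python
import re
from typing import Iterable, Iterator

# Naive rule: ".", "!" or "?" followed by whitespace or the end of the buffer.
_SENTENCE_END = re.compile(r"[.!?](?:\s|$)")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as streamed tokens accumulate.

    Hypothetical helper for illustration; not part of Pipecat's API.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        match = _SENTENCE_END.search(buffer)
        while match:
            # Emit everything up to the sentence end, keep the remainder.
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            match = _SENTENCE_END.search(buffer)
    # Flush any trailing text that never reached a sentence end.
    if buffer.strip():
        yield buffer.strip()
```

A rule this simple splits incorrectly on abbreviations and decimals, which is one reason a custom aggregation processor needs care when aggregate_sentences is set to False.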

  • Added kwargs to the OLLamaLLMService to allow for configuration args to be passed to Ollama.

  • Added call hang-up error handling in TwilioFrameSerializer, which handles the case where the user has hung up before the TwilioFrameSerializer hangs up the call.

Changed

  • Updated RTVIObserver and RTVIProcessor to match the new RTVI 1.0.0 protocol. This includes:

    • Deprecating support for all messages related to service configuration and actions.
    • Adding support for obtaining and logging data about the client, including its RTVI version and optionally included system information (OS/browser/etc.)
    • Adding support for handling the new client-message RTVI message through either an on_client_message event handler or listening for a new RTVIClientMessageFrame
    • Adding support for responding to a client-message with a server-response via either a direct call on the RTVIProcessor or via pushing a new RTVIServerResponseFrame
    • Adding built-in support for handling the new append-to-context RTVI message which allows a client to add to the user or assistant LLM context. No extra code is required for supporting this behavior.
    • Updating all JavaScript and React client RTVI examples to use version 1.0.0 of the clients.

    Get started migrating to RTVI protocol 1.0.0 by following the migration guide: https://docs.pipecat.ai/client/migration-guide

  • Refactored AWSBedrockLLMService and AWSPollyTTSService to work asynchronously using aioboto3 instead of the boto3 library.

  • The UserIdleProcessor now handles the scenario where function calls take longer than the idle timeout duration. This allows you to use the UserIdleProcessor in conjunction with function calls that take a while to return a result.

Fixed

  • Updated the NeuphonicTTSService to work with the updated websocket API.

  • Fixed an issue with RivaSTTService where the watchdog feature was causing an error on initialization.

Performance

  • Removed an unnecessary push task in each FrameProcessor.

[0.0.74] - 2025-07-03 [YANKED]

This release has been yanked due to resampling issues affecting audio output quality and critical bugs impacting ParallelPipelines functionality.

Please upgrade to version 0.0.76 or later.

Added

  • Added a new STT service, SpeechmaticsSTTService. This service provides real-time speech-to-text transcription using the Speechmatics API. It supports partial and final transcriptions, multiple languages, various audio formats, and speaker diarization.

  • Added normalize and model_id to FishAudioTTSService.

  • Added http_options argument to GoogleLLMService.

  • Added run_llm field to LLMMessagesAppendFrame and LLMMessagesUpdateFrame frames. If true, a context frame will be pushed triggering the LLM to respond.

  • Added a new SOXRStreamAudioResampler for processing audio in chunks or streams. If you write your own processor and need to use an audio resampler, use the new create_stream_resampler().

  • Added new DailyParams.audio_in_user_tracks to allow receiving one track per user (default) or a single track from the room (all participants mixed).

  • Added support for providing "direct" functions, which don't need an accompanying FunctionSchema or function definition dict. Instead, metadata (i.e. name, description, properties, and required) are automatically extracted from a combination of the function signature and docstring.

    Usage:

    # "Direct" function
    # `params` must be the first parameter
    async def do_something(params: FunctionCallParams, foo: int, bar: str = ""):
      """
      Do something interesting.
    
      Args:
        foo (int): The foo to do something interesting with.
        bar (string): The bar to do something interesting with.
      """
    
      result = await process(foo, bar)
      await params.result_callback({"result": result})
    
    # ...
    
    llm.register_direct_function(do_something)
    
    # ...
    
    tools = ToolsSchema(standard_tools=[do_something])
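As a rough picture of how metadata could be derived from a signature and docstring, the following sketch uses only the standard `inspect` module. It is an assumption-laden illustration, not Pipecat's extraction logic: `describe_function` is a hypothetical helper, type names are taken straight from the annotations, and per-argument descriptions are not parsed.

```python
import inspect


def describe_function(func) -> dict:
    """Build a rough tool description from a function's signature and docstring.

    Hypothetical helper for illustration; not part of Pipecat's API.
    """
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    # Use the first docstring paragraph as the description.
    description = doc.split("\n\n", 1)[0].strip()

    # Skip the first parameter (`params`), which carries call context.
    parameters = list(signature.parameters.values())[1:]
    properties = {}
    required = []
    for param in parameters:
        annotation = param.annotation
        type_name = getattr(annotation, "__name__", str(annotation))
        properties[param.name] = {"type": type_name}
        # Parameters without a default value are treated as required.
        if param.default is inspect.Parameter.empty:
            required.append(param.name)

    return {
        "name": func.__name__,
        "description": description,
        "properties": properties,
        "required": required,
    }


async def do_something(params, foo: int, bar: str = ""):
    """Do something interesting.

    Args:
        foo (int): The foo to do something interesting with.
        bar (str): The bar to do something interesting with.
    """
```

Calling `describe_function(do_something)` yields the name, the first docstring paragraph as the description, one property per argument after `params`, and `foo` as the only required argument.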
    
  • user_id is now populated in the TranscriptionFrame and InterimTranscriptionFrame when using a transport that provides a user_id, like DailyTransport or LiveKitTransport.

  • Added watchdog_coroutine(). This is a watchdog helper for coroutines. If you have a coroutine that waits a long time for a result, wrap it with watchdog_coroutine() so the watchdog timers are reset regularly.

  • Added session_token parameter to AWSNovaSonicLLMService.

  • Added Gemini Multimodal Live File API for uploading, fetching, listing, and deleting files. See 26f-gemini-live-files-api.py for example usage.

Changed

  • Updated all the services to use the new SOXRStreamAudioResampler, ensuring smooth transitions and eliminating clicks.

  • Upgraded daily-python to 0.19.4.

  • Updated google optional dependency to use google-genai version 1.24.0.

Fixed

  • Fixed an issue where audio would get stuck in the queue when an interrupt occurs during Azure TTS synthesis.

  • Fixed a race condition that occurs in Python 3.10+ where the task could miss the CancelledError and continue running indefinitely, freezing the pipeline.

  • Fixed an AWSNovaSonicLLMService issue introduced in 0.0.72.

Deprecated

  • In FishAudioTTSService, deprecated model and replaced with reference_id. This change is to better align with Fish Audio's variable naming and to reduce confusion about what functionality the variable controls.

[0.0.73] - 2025-06-26

Fixed

  • Fixed an issue introduced in 0.0.72 that would cause ElevenLabsTTSService, GladiaSTTService, NeuphonicTTSService and OpenAIRealtimeBetaLLMService to throw an error.

[0.0.72] - 2025-06-26

Added

  • Added logging and improved error handling to help diagnose and prevent potential Pipeline freezes.

  • Added WatchdogQueue, WatchdogPriorityQueue, WatchdogEvent and WatchdogAsyncIterator. These helper utilities reset watchdog timers appropriately before they expire. When watchdog timers are disabled, the utilities behave as standard counterparts without side effects.

  • Introduced task watchdog timers. Watchdog timers are used to detect if a Pipecat task is taking longer than expected (by default 5 seconds). Watchdog timers are disabled by default and can be enabled globally by passing the enable_watchdog_timers argument to the PipelineTask constructor. The default watchdog timeout can be changed with the watchdog_timeout argument. You can also log how long it takes to reset the watchdog timers with the enable_watchdog_logging argument. All of these settings can be controlled per frame processor or even per task. That is, you can set enable_watchdog_timers, enable_watchdog_logging and watchdog_timeout when creating any frame processor through its constructor arguments or when you create a task with FrameProcessor.create_task(). Note that watchdog timers only work with Pipecat tasks and will not work if you use asyncio.create_task() or similar.
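The general watchdog idea (a timer that must be reset before it expires) can be sketched with plain asyncio. This is a conceptual illustration under simplified assumptions, not Pipecat's implementation; `WatchdogSketch` and `demo` are hypothetical names and the timeouts are shortened so the example runs quickly.

```python
import asyncio


class WatchdogSketch:
    """Flags a stall unless reset() is called within `timeout` seconds.

    Hypothetical class for illustration; not part of Pipecat's API.
    """

    def __init__(self, timeout: float):
        self._timeout = timeout
        self._reset_event = asyncio.Event()
        self.expired = False

    def reset(self) -> None:
        # Called by the monitored task to signal that it is still alive.
        self._reset_event.set()

    async def run(self) -> None:
        while True:
            try:
                await asyncio.wait_for(self._reset_event.wait(), self._timeout)
                self._reset_event.clear()
            except asyncio.TimeoutError:
                # No reset arrived in time: report the stall and stop.
                self.expired = True
                return


async def demo() -> tuple[bool, bool]:
    watchdog = WatchdogSketch(timeout=0.05)
    monitor = asyncio.create_task(watchdog.run())
    # A healthy task resets the timer more often than the timeout.
    for _ in range(3):
        await asyncio.sleep(0.01)
        watchdog.reset()
    healthy = not watchdog.expired
    # A stalled task stops resetting, so the watchdog eventually fires.
    await asyncio.sleep(0.15)
    await monitor
    return healthy, watchdog.expired
```

Running `asyncio.run(demo())` shows the timer staying quiet while resets keep arriving and firing once they stop.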

  • Added lexicon_names parameter to AWSPollyTTSService.InputParams.

  • Added reconnection logic and audio buffer management to GladiaSTTService.

  • The TurnTrackingObserver now ends a turn upon observing an EndFrame or CancelFrame.

  • Added Polish support to AWSTranscribeSTTService.

  • Added new frames FrameProcessorPauseFrame and FrameProcessorResumeFrame which allow pausing and resuming frame processing for a given frame processor. These are control frames, so they are ordered. Pausing frame processor will keep old frames in the internal queues until resume takes place. Frames being pushed while a frame processor is paused will be pushed to the queues. When frame processing is resumed all queued frames will be processed in order. Also added FrameProcessorPauseUrgentFrame and FrameProcessorResumeUrgentFrame which are system frames and therefore they have high priority.
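The queue-while-paused behavior described above can be illustrated with a small synchronous sketch. It ignores frame priorities and async processing entirely; `PausableProcessorSketch` and its methods are hypothetical and only mirror the ordering guarantee.

```python
from collections import deque


class PausableProcessorSketch:
    """Holds frames while paused and releases them in order on resume.

    Hypothetical class for illustration; not part of Pipecat's API.
    """

    def __init__(self):
        self._paused = False
        self._pending = deque()
        self.processed = []

    def pause(self) -> None:
        self._paused = True

    def resume(self) -> None:
        self._paused = False
        # Drain frames queued while paused, preserving arrival order.
        while self._pending:
            self.processed.append(self._pending.popleft())

    def push(self, frame) -> None:
        if self._paused:
            # Frames pushed during a pause wait in the internal queue.
            self._pending.append(frame)
        else:
            self.processed.append(frame)
```

Frames pushed before the pause are processed immediately, frames pushed during it are held, and everything comes out in the original order once `resume()` is called.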

  • Added a property called has_function_calls_in_progress in LLMAssistantContextAggregator that exposes whether a function call is in progress.

  • Added SambaNovaLLMService, which provides LLM API integration with an OpenAI-compatible interface.

  • Added SambaNovaSTTService, which provides speech-to-text functionality using SambaNova's (Whisper) API.

  • Added foundational examples for function calling and transcription: 14s-function-calling-sambanova.py and 13g-sambanova-transcription.py.

Changed

  • HeartbeatFrames are now control frames. This will make it easier to detect pipeline freezes. Previously, heartbeat frames were system frames, which meant they were not queued with other frames, making it difficult to detect pipeline stalls.

  • Updated OpenAIRealtimeBetaLLMService to accept language in the InputAudioTranscription class for all models.

  • Updated the default model for OpenAIRealtimeBetaLLMService to gpt-4o-realtime-preview-2025-06-03.

  • The PipelineParams arg allow_interruptions now defaults to True.

  • TavusTransport and TavusVideoService now send audio to Tavus using WebRTC audio tracks instead of app-messages over WebSocket. This should improve the overall audio quality.

  • Upgraded daily-python to 0.19.3.

Fixed

  • Fixed an issue that would cause heartbeat frames to be sent before processors were started.

  • Fixed an event loop blocking issue when using SentryMetrics.

  • Fixed an issue in FastAPIWebsocketClient to ensure proper disconnection when the websocket is already closed.

  • Fixed an issue where the UserStoppedSpeakingFrame was not received if the transport was not receiving new audio frames.

  • Fixed an edge case where if the user interrupted the bot but no new aggregation was received, the bot would not resume speaking.

  • Fixed an issue with TelnyxFrameSerializer where it would throw an exception when the user hung up the call.

  • Fixed an issue with ElevenLabsTTSService where the context was not being closed.

  • Fixed function calling in AWSNovaSonicLLMService.

  • Fixed an issue that would cause multiple PipelineTask.on_idle_timeout events to be triggered repeatedly.

  • Fixed an issue that was causing user and bot speech to not be synchronized during recordings.

  • Fixed an issue where voice settings weren't applied to ElevenLabsTTSService.

  • Fixed an issue with GroqTTSService where it was not properly parsing the WAV file header.

  • Fixed an issue with GoogleSTTService where it was constantly reconnecting before starting to receive audio from the user.

  • Fixed an issue where GoogleLLMService's TTFB value was incorrect.

Deprecated

  • AudioBufferProcessor parameter user_continuous_stream is deprecated.

Other

  • Renamed 14e-function-calling-gemini.py to 14e-function-calling-google.py.

[0.0.71] - 2025-06-10

Added

  • Added a parameter called additional_span_attributes to PipelineTask that lets you add any additional attributes you'd like to the conversation span.

Fixed

  • Fixed an issue with CartesiaSTTService initialization.

[0.0.70] - 2025-06-10

Added

  • Added ExotelFrameSerializer to handle telephony calls via Exotel.

  • Added the option informal to TranslationConfig in the Gladia config, which forces informal language forms when available.

  • Added CartesiaSTTService, a websocket-based implementation for transcribing audio. Added a foundational example in 13f-cartesia-transcription.py.

  • Added a websocket example, showing how to use the new Pipecat client WebsocketTransport to connect with Pipecat FastAPIWebsocketTransport or WebsocketServerTransport.

  • Added language support to RimeHttpTTSService. Extended languages to include German and French for both RimeTTSService and RimeHttpTTSService.

Changed

  • Upgraded daily-python to 0.19.2.

  • Made PipelineTask.add_observer() synchronous. This allows callers to call it before doing the work of running the PipelineTask (i.e. without invoking PipelineTask.set_event_loop() first).

  • Pipecat 0.0.69 forced the uvloop event loop on Linux and macOS. Unfortunately, this is causing issues on some systems, so uvloop is no longer enabled by default. If you want to use uvloop, set the asyncio event loop policy before starting your agent with:

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

Fixed

  • Fixed an issue with various TTS services that would cause audio glitches at the start of every bot turn.

  • Fixed an ElevenLabsTTSService issue where a context warning was printed when pushing a TTSSpeakFrame.

  • Fixed an AssemblyAISTTService issue that could cause unexpected behavior when yielding empty Frame()s.

  • Fixed an issue where OutputAudioRawFrame.transport_destination was being reset to None instead of retaining its intended value before sending the audio frame to write_audio_frame.

  • Fixed a typo in Livekit transport that prevented initialization.

[0.0.69] - 2025-06-02 "AI Engineer World's Fair release"

Added

  • Added a new frame FunctionCallsStartedFrame. This frame is pushed both upstream and downstream from the LLM service to indicate that one or more function calls are going to be executed.

  • Added an on_function_calls_started event to LLM services. This event will be triggered when the LLM service receives function calls from the model and is going to start executing them.

  • Function calls can now be executed sequentially (in the order received in the completion) by passing run_in_parallel=False when creating your LLM service. By default, if the LLM completion returns 2 or more function calls they run concurrently. In both cases, concurrently and sequentially, a new LLM completion will run when the last function call finishes.
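The difference between concurrent and sequential execution can be seen in a small asyncio sketch. It does not use any LLM service; `fake_call` and `run_calls` are hypothetical helpers, and the `run_in_parallel` parameter here only mirrors the name of the argument mentioned above.

```python
import asyncio


async def fake_call(name: str, delay: float, log: list[str]) -> str:
    """Stand-in for a function call that takes `delay` seconds to finish."""
    await asyncio.sleep(delay)
    log.append(name)
    return name


async def run_calls(calls, run_in_parallel: bool) -> list[str]:
    """Run (name, delay) calls and return the order in which they finished."""
    log: list[str] = []
    if run_in_parallel:
        # All calls start together; faster ones finish first.
        await asyncio.gather(*(fake_call(n, d, log) for n, d in calls))
    else:
        # Each call starts only after the previous one has finished.
        for n, d in calls:
            await fake_call(n, d, log)
    return log
```

With a slow call listed before a fast one, the concurrent run finishes the fast call first, while the sequential run preserves the order received.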

  • Added OpenTelemetry tracing for GeminiMultimodalLiveLLMService and OpenAIRealtimeBetaLLMService.

  • Added initial support for interruption strategies, which determine if the user should interrupt the bot while the bot is speaking. Interruption strategies can be based on factors such as audio volume or the number of words spoken by the user. These can be specified via the new interruption_strategies field in PipelineParams. A new MinWordsInterruptionStrategy strategy has been introduced which triggers an interruption if the user has spoken a minimum number of words. If no interruption strategies are specified, the normal interruption behavior applies. If multiple strategies are provided, the first one that evaluates to true will trigger the interruption.
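The evaluation rule described above (the first strategy that evaluates to true triggers the interruption) can be sketched as follows. The class and function names are hypothetical and do not reflect Pipecat's interfaces; the sketch also assumes that "normal interruption behavior" means any user speech interrupts.

```python
class MinWordsSketch:
    """Interrupt only once the user has spoken at least `min_words` words.

    Hypothetical class for illustration; not part of Pipecat's API.
    """

    def __init__(self, min_words: int):
        self._min_words = min_words

    def should_interrupt(self, text: str) -> bool:
        return len(text.split()) >= self._min_words


def user_should_interrupt(strategies, text: str) -> bool:
    if not strategies:
        # Assumption: with no strategies, any user speech interrupts.
        return True
    # any() stops at the first strategy that evaluates to true.
    return any(s.should_interrupt(text) for s in strategies)
```

With a three-word minimum, a short backchannel such as "uh huh" would not interrupt the bot, while a longer utterance would.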

  • BaseInputTransport now handles StopFrame. When a StopFrame is received the transport will pause sending frames downstream until a new StartFrame is received. This allows the transport to be reused (keeping the same connection) in a different pipeline.

  • Updated AssemblyAI STT service to support their latest streaming speech-to-text model with improved transcription latency and endpointing.

  • You can now access STT service results through the new TranscriptionFrame.result and InterimTranscriptionFrame.result fields. This is useful if you use specific STT settings and want to access the raw STT results.

  • The examples runner is now public from the pipecat.examples package. This allows everyone to build their own examples and run them easily.

  • It is now possible to push OutputDTMFFrame or OutputDTMFUrgentFrame with DailyTransport. This will be sent properly if a Daily dial-out connection has been established.

  • Added OutputDTMFUrgentFrame to send a DTMF keypress quickly. The previous OutputDTMFFrame queues the keypress with the rest of data frames.

  • Added DTMFAggregator, which aggregates keypad presses into TranscriptionFrames. Aggregation occurs after a timeout, termination key press, or user interruption. You can specify the prefix of the TranscriptionFrame.
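A minimal picture of keypad aggregation is sketched below. It is not the DTMFAggregator implementation: the class name, the default prefix, the choice of "#" as the termination key, and dropping that key from the output are all assumptions made for illustration, and timeouts are represented by an explicit `flush()` call.

```python
from typing import Optional


class DTMFAggregatorSketch:
    """Collects key presses and emits them as one prefixed string.

    Hypothetical class for illustration; not part of Pipecat's API.
    """

    def __init__(self, termination_key: str = "#", prefix: str = "DTMF: "):
        self._termination_key = termination_key
        self._prefix = prefix
        self._digits = ""

    def press(self, key: str) -> Optional[str]:
        """Record one key press; return the aggregated text on the termination key."""
        if key == self._termination_key:
            return self.flush()
        self._digits += key
        return None

    def flush(self) -> Optional[str]:
        """Emit and clear the pending digits (call on timeout or interruption)."""
        if not self._digits:
            return None
        text = self._prefix + self._digits
        self._digits = ""
        return text
```

Pressing 1, 2 and then # returns a single prefixed string, and a later `flush()` emits whatever digits were entered without a termination key.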

  • Added new functions DailyTransport.start_transcription() and DailyTransport.stop_transcription() to be able to start and stop Daily transcription dynamically (maybe with different settings).

Changed

  • Reverted the default model for GeminiMultimodalLiveLLMService back to models/gemini-2.0-flash-live-001. gemini-2.5-flash-preview-native-audio-dialog has inconsistent performance. You can opt in to using this model by setting the model arg.

  • Function calls are now cancelled by default if there's an interruption. To disable this behavior you can set cancel_on_interruption=False when registering the function call. Since function calls are executed as tasks you can tell if a function call has been cancelled by catching the asyncio.CancelledError exception (and don't forget to raise it again!).
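The catch-and-re-raise pattern mentioned above looks like this in plain asyncio. The sketch stands in for a long-running function call handler and does not use Pipecat; `slow_function_call` and `demo` are hypothetical names.

```python
import asyncio


async def slow_function_call(events: list[str]) -> None:
    """Stand-in for a long-running function call handler."""
    try:
        await asyncio.sleep(10)
        events.append("finished")
    except asyncio.CancelledError:
        # Do any cleanup here, then re-raise so the task is marked cancelled.
        events.append("cancelled")
        raise


async def demo() -> list[str]:
    events: list[str] = []
    task = asyncio.create_task(slow_function_call(events))
    await asyncio.sleep(0.01)  # let the task start running
    task.cancel()  # simulates an interruption cancelling the call
    try:
        await task
    except asyncio.CancelledError:
        events.append("task reported cancellation")
    return events
```

If the handler swallowed the exception instead of re-raising it, the awaiting code would not see the cancellation and the task would appear to have completed normally.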

  • Updated OpenTelemetry tracing attribute metrics.ttfb_ms to metrics.ttfb. The attribute reports TTFB in seconds.

Deprecated

  • DailyTransport.send_dtmf() is deprecated, push an OutputDTMFFrame or an OutputDTMFUrgentFrame instead.

Fixed

  • Fixed an issue with ElevenLabsTTSService where long responses would continue generating output even after an interruption.

  • Fixed an issue with the OpenAILLMContext where non-Roman characters were being incorrectly encoded as Unicode escape sequences. This was a logging issue and did not impact the actual conversation.

  • In AWSBedrockLLMService, worked around a possible bug in AWS Bedrock where a toolConfig is required if there has been previous tool use in the messages array. As a workaround, a no_op factory function call is used to satisfy the requirement.

  • Fixed WebsocketClientTransport to use FrameProcessorSetup.task_manager instead of StartFrame.task_manager.

Performance

  • Use uvloop as the new event loop on Linux and macOS systems.

[0.0.68] - 2025-05-28

Added

  • Added GoogleHttpTTSService which uses Google's HTTP TTS API.

  • Added TavusTransport, a new transport implementation compatible with any Pipecat pipeline. When using the TavusTransport, the Pipecat bot will connect to the same room as the Tavus Avatar and the user.

  • Added PlivoFrameSerializer to support Plivo calls. A full running example has also been added to examples/plivo-chatbot.

  • Added UserBotLatencyLogObserver. This is an observer that logs the latency between when the user stops speaking and when the bot starts speaking. This gives you an initial idea on how quickly the AI services respond.

  • Added SarvamTTSService, which implements Sarvam AI's TTS API: https://docs.sarvam.ai/api-reference-docs/text-to-speech/convert.

  • Added PipelineTask.add_observer() and PipelineTask.remove_observer() to allow managing observers at runtime. This is useful for cases where the task is passed around to other code components that might want to observe the pipeline dynamically.

  • Added user_id field to TranscriptionMessage. This allows identifying the user in a multi-user scenario. Note that this requires that TranscriptionFrame has the user_id properly set.

  • Added new PipelineTask event handlers on_pipeline_started, on_pipeline_stopped, on_pipeline_ended and on_pipeline_cancelled, which correspond to the StartFrame, StopFrame, EndFrame and CancelFrame respectively.

  • Added additional languages to LmntTTSService. Languages include: hi, id, it, ja, nl, pl, ru, sv, th, tr, uk, vi.

  • Added a model parameter to the LmntTTSService constructor, allowing switching between LMNT models.

  • Added MiniMaxHttpTTSService, which implements MiniMax's T2A API for TTS. Learn more: https://www.minimax.io/platform_overview

  • A new function FrameProcessor.setup() has been added to allow setting up frame processors before receiving a StartFrame. This is what's happening internally: FrameProcessor.setup() is called, StartFrame is pushed from the beginning of the pipeline, your regular pipeline operations, EndFrame or CancelFrame are pushed from the beginning of the pipeline and finally FrameProcessor.cleanup() is called.

  • Added support for OpenTelemetry tracing in Pipecat. This initial implementation includes:

    • A setup_tracing method where you can specify your OpenTelemetry exporter
    • Service decorators for STT (@traced_stt), LLM (@traced_llm), and TTS (@traced_tts) which trace the execution and collect properties and metrics (TTFB, token usage, character counts, etc.)
    • Class decorators that provide execution tracking; these are generic and can be used for service tracking as needed
    • Spans that help track traces on a per-conversation and per-turn basis:
    conversation-uuid
    ├── turn-1
    │   ├── stt_deepgramsttservice
    │   ├── llm_openaillmservice
    │   └── tts_cartesiattsservice
    ...
    └── turn-n
        └── ...
    

    By default, Pipecat has implemented service decorators to trace execution of STT, LLM, and TTS services. You can enable tracing by setting enable_tracing to True in the PipelineTask.

  • Added TurnTrackingObserver, which tracks the start and end of a user/bot turn pair and emits events on_turn_started and on_turn_stopped corresponding to the start and end of a turn, respectively.

  • Allow passing observers to run_test() while running unit tests.

Changed

  • Upgraded daily-python to 0.19.1.

  • ⚠️ Updated SmallWebRTCTransport to align with how other transports handle on_client_disconnected. Now, when the connection is closed and no reconnection is attempted, on_client_disconnected is called instead of on_client_close. The on_client_close callback is no longer used, use on_client_disconnected instead.

  • Check if PipelineTask has already been cancelled.

  • Don't raise an exception if event handler is not registered.

  • Upgraded deepgram-sdk to 4.1.0.

  • Updated GoogleTTSService to use Google's streaming TTS API. The default voice was also updated to en-US-Chirp3-HD-Charon.

  • ⚠️ Refactored the TavusVideoService, so it acts like a proxy, sending audio to Tavus and receiving both audio and video. This will make TavusVideoService usable with any Pipecat pipeline and with any transport. This is a breaking change, check the examples/foundational/21a-tavus-layer-small-webrtc.py to see how to use it.

  • DailyTransport now uses custom microphone audio tracks instead of virtual microphones. Now, multiple Daily transports can be used in the same process.

  • DailyTransport now captures audio from individual participants instead of the whole room. This allows identifying audio frames per participant.

  • Updated the default model for AnthropicLLMService to claude-sonnet-4-20250514.

  • Updated the default model for GeminiMultimodalLiveLLMService to models/gemini-2.5-flash-preview-native-audio-dialog.

  • BaseTextFilter methods filter(), update_settings(), handle_interruption() and reset_interruption() are now async.

  • BaseTextAggregator methods aggregate(), handle_interruption() and reset() are now async.

  • The API version for CartesiaTTSService and CartesiaHttpTTSService has been updated. Also, the cartesia dependency has been updated to 2.x.

  • CartesiaTTSService and CartesiaHttpTTSService now support Cartesia's new speed parameter which accepts values of slow, normal, and fast.

  • GeminiMultimodalLiveLLMService now uses the user transcription and usage metrics provided by Gemini Live.

  • GoogleLLMService has been updated to use google-genai instead of the deprecated google-generativeai.

Deprecated

  • In CartesiaTTSService and CartesiaHttpTTSService, emotion has been deprecated by Cartesia. Pipecat is following suit and deprecating emotion as well.

Removed

  • Since GeminiMultimodalLiveLLMService now transcribes its own audio, the transcribe_user_audio arg has been removed. Audio is now transcribed automatically.

  • Removed the SileroVAD frame processor; use SileroVADAnalyzer instead. Also removed the 07a-interruptible-vad.py example.

Fixed

  • Fixed a DailyTransport issue that did not allow capturing video frames if framerate was greater than zero.

  • Fixed a DeepgramSTTService connection issue when the user provided their own LiveOptions.

  • Fixed a DailyTransport issue that would cause images needing resize to block the event loop.

  • Fixed an issue with ElevenLabsTTSService where changing the model or voice while the service is running wasn't working.

  • Fixed an issue that would cause multiple instances of the same class to behave incorrectly if any of the given constructor arguments defaulted to a mutable value (e.g. lists, dictionaries, objects).

  • Fixed an issue with CartesiaTTSService where TTSTextFrame messages weren't being emitted when the model was set to sonic. This resulted in the assistant context not being updated with assistant messages.
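
The mutable-default fix listed above concerns a general Python pitfall rather than anything specific to Pipecat. A minimal sketch of how a mutable default leaks state between instances and how a `None` sentinel avoids it; the class names `SharedDefault` and `SafeDefault` are hypothetical and not part of Pipecat:

```python
from typing import List, Optional


class SharedDefault:
    # The default list is created once, when the function is defined,
    # so every instance that relies on it shares the same object.
    def __init__(self, items: List[str] = []):
        self.items = items


class SafeDefault:
    # A None sentinel makes each instance build its own list.
    def __init__(self, items: Optional[List[str]] = None):
        self.items = items if items is not None else []


a, b = SharedDefault(), SharedDefault()
a.items.append("x")
shared_leak = b.items  # b sees the item appended through a

c, d = SafeDefault(), SafeDefault()
c.items.append("x")
safe_result = d.items  # d keeps its own empty list
```

The same reasoning applies to dictionaries and arbitrary objects used as defaults.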

Performance

  • DailyTransport: process audio, video and events in separate tasks.

  • Don't create event handler tasks if no user event handlers have been registered.

Other

  • It is now possible to run all (or most) foundational examples with multiple transports. By default, they run with P2P (Peer-To-Peer) WebRTC so you can try everything locally. You can also run them with Daily or even with a Twilio phone number.

  • Added foundational examples 07y-interruptible-minimax.py and 07z-interruptible-sarvam.py to show how to use the MiniMaxHttpTTSService and SarvamTTSService, respectively.

  • Added an open-telemetry-tracing example, showing how to set up tracing. The example also includes Jaeger as an open source OpenTelemetry client to review traces from the example runs.

  • Added foundational example 29-turn-tracking-observer.py to show how to use the TurnTrackingObserver.

[0.0.67] - 2025-05-07

Added

  • Added DebugLogObserver for detailed frame logging with configurable filtering by frame type and endpoint. This observer automatically extracts and formats all frame data fields for debug logging.

  • UserImageRequestFrame.video_source field has been added to request an image from the desired video source.

  • Added support for the AWS Nova Sonic speech-to-speech model with the new AWSNovaSonicLLMService. See https://docs.aws.amazon.com/nova/latest/userguide/speech.html. Note that it requires Python >= 3.12 and pip install pipecat-ai[aws-nova-sonic].

  • Added new AWS services AWSBedrockLLMService and AWSTranscribeSTTService.

  • Added on_active_speaker_changed event handler to the DailyTransport class.

  • Added enable_ssml_parsing and enable_logging to InputParams in ElevenLabsTTSService.

  • Added support to RimeHttpTTSService for the arcana model.

Changed

  • Updated ElevenLabsTTSService to use the beta websocket API (multi-stream-input). This new API supports context_ids and cancelling those contexts, which greatly improves interruption handling.

  • Observers on_push_frame() now take a single argument FramePushed instead of multiple arguments.

  • Updated the default voice for DeepgramTTSService to aura-2-helena-en.

Deprecated

  • PollyTTSService is now deprecated, use AWSPollyTTSService instead.

  • Observer on_push_frame(src, dst, frame, direction, timestamp) is now deprecated, use on_push_frame(data: FramePushed) instead.
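
The observer signature change above can be sketched as follows. This is an illustration under stated assumptions, not Pipecat's implementation: the `FramePushed` dataclass here is a local stand-in whose field names are assumed from the deprecated argument list, and `LoggingObserver` is a hypothetical class.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FramePushed:
    """Stand-in for Pipecat's FramePushed; field names are assumptions."""

    source: Any
    destination: Any
    frame: Any
    direction: str
    timestamp: int


class LoggingObserver:
    """Hypothetical observer that records every pushed frame."""

    def __init__(self) -> None:
        self.log: List[str] = []

    # New style: one argument bundles what used to be five.
    async def on_push_frame(self, data: FramePushed) -> None:
        self.log.append(
            f"{data.source} -> {data.destination}: {data.frame} ({data.direction})"
        )


observer = LoggingObserver()
asyncio.run(
    observer.on_push_frame(
        FramePushed("stt", "llm", "TranscriptionFrame", "downstream", 0)
    )
)
```

Bundling the arguments into one object means new fields can be added later without changing every observer's signature.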

Fixed

  • Fixed a DailyTransport issue that was causing problems when multiple audio or video sources were being captured.

  • Fixed a UltravoxSTTService issue that would cause the service to generate all tokens as one word.

  • Fixed a PipelineTask issue that would cause tasks to not be cancelled if task was cancelled from outside of Pipecat.

  • Fixed a TaskManager issue that was causing dangling tasks to be reported.

  • Fixed an issue that could cause data to be sent to the transports when they were still not ready.

  • Remove custom audio tracks from DailyTransport before leaving.

Removed

  • Removed CanonicalMetricsService as it's no longer maintained.

[0.0.66] - 2025-05-02

Added

  • Added two new input parameters to RimeTTSService: pause_between_brackets and phonemize_between_brackets.

  • Added support for cross-platform local smart turn detection. You can use LocalSmartTurnAnalyzer for on-device inference using Torch.

  • BaseOutputTransport now allows multiple destinations if the transport implementation supports it (e.g. Daily's custom tracks). With multiple destinations it is possible to send different audio or video tracks with a single transport simultaneously. To do that, you need to set the new Frame.transport_destination field with your desired transport destination (e.g. custom track name), tell the transport you want a new destination with TransportParams.audio_out_destinations or TransportParams.video_out_destinations and the transport should take care of the rest.

  • Similar to the new Frame.transport_destination, there's a new Frame.transport_source field which is set by the BaseInputTransport if the incoming data comes from a non-default source (e.g. custom tracks).

  • TTSService has a new transport_destination constructor parameter. This parameter will be used to update the Frame.transport_destination field for each generated TTSAudioRawFrame. This allows sending multiple bots' audio to multiple destinations in the same pipeline.

  • Added DailyTransportParams.camera_out_enabled and DailyTransportParams.microphone_out_enabled which allows you to enable/disable the main output camera or microphone tracks. This is useful if you only want to use custom tracks and not send the main tracks. Note that you still need audio_out_enabled=True or video_out_enabled.

  • Added DailyTransport.capture_participant_audio() which allows you to capture an audio source (e.g. "microphone", "screenAudio" or a custom track name) from a remote participant.

  • Added DailyTransport.update_publishing() which allows you to update the call video and audio publishing settings (e.g. audio and video quality).

  • Added RTVIObserverParams which allows you to configure what RTVI messages are sent to the clients.

  • Added a context_window_compression InputParam to GeminiMultimodalLiveLLMService which allows you to enable a sliding context window for the session as well as set the token limit of the sliding window.

  • Updated SmallWebRTCConnection to support ice_servers with credentials.

  • Added VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame, indicating when the VAD detected that the user started and stopped speaking. These events are helpful when using smart turn detection, as the user's stop time can differ from when their turn ends (signified by UserStoppedSpeakingFrame).

  • Added TranslationFrame, a new frame type that contains a translated transcription.

  • Added TransportParams.audio_in_passthrough. If set (the default), incoming audio will be pushed downstream.

  • Added MCPClient; a way to connect to MCP servers and use the MCP servers' tools.

  • Added Mem0 OSS support. In addition to Mem0 cloud, the OSS version is now also available.

Changed

  • TransportParams.audio_out_mixer now supports a string and also a dictionary to provide a mixer per destination. For example:
  audio_out_mixer={
      "track-1": SoundfileMixer(...),
      "track-2": SoundfileMixer(...),
      "track-N": SoundfileMixer(...),
  },
  • The STTMuteFilter now mutes InterimTranscriptionFrame and TranscriptionFrame which allows the STTMuteFilter to be used in conjunction with transports that generate transcripts, e.g. DailyTransport.

  • Function calls now receive a single parameter FunctionCallParams instead of (function_name, tool_call_id, args, llm, context, result_callback) which is now deprecated.

  • Changed the user aggregator timeout for late transcriptions from 1.0s to 0.5s (LLMUserAggregatorParams.aggregation_timeout). Sometimes, the STT services might give us more than one transcription which could come after the user stopped speaking. We still want to include these additional transcriptions with the first one because it's part of the user turn. This is what this timeout is helpful with.

  • Short utterances not detected by VAD while the bot is speaking are now ignored. This significantly reduces the number of bot interruptions, providing a more natural conversation experience.

  • Updated GladiaSTTService to output a TranslationFrame when specifying a translation and translation_config.

  • STT services now passthrough audio frames by default. This allows you to add audio recording without worrying about what's wrong in your pipeline when it doesn't work the first time.

  • Input transports now always push audio downstream unless disabled with TransportParams.audio_in_passthrough. After many Pipecat releases, we realized this is the common use case. There are use cases where the input transport already provides STT and you also don't want recordings, in which case there's no need to push audio to the rest of the pipeline, but this is not a very common case.

  • Added RivaSegmentedSTTService, which allows Riva offline/batch models, such as "canary-1b-asr", to be used in Pipecat.

Deprecated

  • Function calls with parameters (function_name, tool_call_id, args, llm, context, result_callback) are deprecated, use a single FunctionCallParams parameter instead.

  • TransportParams.camera_* parameters are now deprecated, use TransportParams.video_* instead.

  • TransportParams.vad_enabled parameter is now deprecated, use TransportParams.audio_in_enabled and TransportParams.vad_analyzer instead.

  • TransportParams.vad_audio_passthrough parameter is now deprecated, use TransportParams.audio_in_passthrough instead.

  • ParakeetSTTService is now deprecated, use RivaSTTService instead, which uses the model "parakeet-ctc-1.1b-asr" by default.

  • FastPitchTTSService is now deprecated, use RivaTTSService instead, which uses the model "magpie-tts-multilingual" by default.
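
The move to a single FunctionCallParams argument, deprecated above in its multi-argument form, can be sketched as follows. The dataclass is a local stand-in, not Pipecat's class: its field names are taken from the deprecated argument list and are assumptions, and the `collect` callback is hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List


@dataclass
class FunctionCallParams:
    """Stand-in for Pipecat's FunctionCallParams; field names follow the
    deprecated argument list and are assumptions."""

    function_name: str
    tool_call_id: str
    args: Dict[str, Any]
    llm: Any
    context: Any
    result_callback: Callable[[Any], Awaitable[None]]


# New style: the handler takes one FunctionCallParams argument.
async def get_current_weather(params: FunctionCallParams) -> None:
    location = params.args.get("location", "unknown")
    await params.result_callback({"location": location, "conditions": "sunny"})


results: List[Any] = []


async def collect(result: Any) -> None:
    # Hypothetical callback that just stores the result for inspection.
    results.append(result)


asyncio.run(
    get_current_weather(
        FunctionCallParams(
            function_name="get_current_weather",
            tool_call_id="call-1",
            args={"location": "San Francisco, CA"},
            llm=None,
            context=None,
            result_callback=collect,
        )
    )
)
```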

Fixed

  • Fixed an issue with SimliVideoService where the bot was continuously outputting audio, which prevents the BotStoppedSpeakingFrame from being emitted.

  • Fixed an issue where OpenAIRealtimeBetaLLMService would add two assistant messages to the context.

  • Fixed an issue with GeminiMultimodalLiveLLMService where the context contained tokens instead of words.

  • Fixed an issue with HTTP Smart Turn handling, where the service returns a 500 error. Previously, this would cause an unhandled exception. Now, a 500 error is treated as an incomplete response.

  • Fixed a TTS services issue that could cause assistant output not to be aggregated to the context when also using TTSSpeakFrames.

  • Fixed an issue where the SmartTurnMetricsData was reporting 0ms for inference and processing time when using the FalSmartTurnAnalyzer.

Other

  • Added examples/daily-custom-tracks to show how to send and receive Daily custom tracks.

  • Added examples/daily-multi-translation to showcase how to send multiple simultaneous translations with the same transport.

  • Added 04 foundational examples for client/server transports. Also, renamed 29-livekit-audio-chat.py to 04b-transports-livekit.py.

  • Added foundational example 13c-gladia-translation.py showing how to use TranscriptionFrame and TranslationFrame.

[0.0.65] - 2025-04-23 "Sant Jordi's release" 🌹📕

https://en.wikipedia.org/wiki/Saint_George%27s_Day_in_Catalonia

Added

  • Added automatic hangup logic to the Telnyx serializer. This feature hangs up the Telnyx call when an EndFrame or CancelFrame is received. It is enabled by default and is configurable via the auto_hang_up InputParam.

  • Added a keepalive task to GladiaSTTService to prevent the websocket from disconnecting after 30 seconds of no audio input.

Changed

  • The InputParams for ElevenLabsTTSService and ElevenLabsHttpTTSService no longer require that stability and similarity_boost be set. You can individually set each param.

  • In TwilioFrameSerializer, call_sid is Optional so as to avoid a breaking change. call_sid is required to automatically hang up.

Fixed

  • Fixed an issue where TwilioFrameSerializer would send two hang up commands: one for the EndFrame and one for the CancelFrame.

[0.0.64] - 2025-04-22

Added

  • Added automatic hangup logic to the Twilio serializer. This feature hangs up the Twilio call when an EndFrame or CancelFrame is received. It is enabled by default and is configurable via the auto_hang_up InputParam.

  • Added SmartTurnMetricsData, which contains end-of-turn prediction metrics, to the MetricsFrame. Using MetricsFrame, you can now retrieve prediction confidence scores and processing time metrics from the smart turn analyzers.

  • Added support for Application Default Credentials in Google services, GoogleSTTService, GoogleTTSService, and GoogleVertexLLMService.

  • Added support for Smart Turn Detection via the turn_analyzer transport parameter. You can now choose between HttpSmartTurnAnalyzer() or FalSmartTurnAnalyzer() for remote inference or LocalCoreMLSmartTurnAnalyzer() for on-device inference using Core ML.

  • DeepgramTTSService accepts base_url argument again, allowing you to connect to an on-prem service.

  • Added LLMUserAggregatorParams and LLMAssistantAggregatorParams which allow you to control aggregator settings. You can now pass these arguments when creating aggregator pairs with create_context_aggregator().

  • Added previous_text context support to ElevenLabsHttpTTSService, improving speech consistency across sentences within an LLM response.

  • Added word/timestamp pairs to ElevenLabsHttpTTSService.

  • It is now possible to disable SoundfileMixer when created. You can then use MixerEnableFrame to dynamically enable it when necessary.

  • Added on_client_connected and on_client_disconnected event handlers to the DailyTransport class. These handlers map to the same underlying Daily events as on_participant_joined and on_participant_left, respectively. This makes it easier to write a single bot pipeline that can also use other transports like SmallWebRTCTransport and FastAPIWebsocketTransport.

Changed

  • GrokLLMService now uses grok-3-beta as its default model.

  • Daily's REST helpers now include an eject_at_token_exp param, which ejects the user when their token expires. This new parameter defaults to False. Also, the default value for enable_prejoin_ui changed to False and eject_at_room_exp changed to False.

  • OpenAILLMService and OpenPipeLLMService now use gpt-4.1 as their default model.

  • SoundfileMixer constructor arguments need to be keywords.

Deprecated

  • DeepgramSTTService parameter url is now deprecated, use base_url instead.

Removed

  • Parameters user_kwargs and assistant_kwargs when creating a context aggregator pair using create_context_aggregator() have been removed. Use user_params and assistant_params instead.

Fixed

  • Fixed an issue that would cause TTS websocket-based services to not cleanup resources properly when disconnecting.

  • Fixed a TavusVideoService issue that was causing audio choppiness.

  • Fixed an issue in SmallWebRTCTransport where an error was thrown if the client did not create a video transceiver.

  • Fixed an issue where LLM input parameters were not being applied correctly in GoogleVertexLLMService, causing unexpected behavior during inference.

Other

  • Updated the twilio-chatbot example to use the auto-hangup feature.

[0.0.63] - 2025-04-11

Added

  • Added media resolution control to GeminiMultimodalLiveLLMService with GeminiMediaResolution enum, allowing configuration of token usage for image processing (LOW: 64 tokens, MEDIUM: 256 tokens, HIGH: zoomed reframing with 256 tokens).

  • Added Gemini's Voice Activity Detection (VAD) configuration to GeminiMultimodalLiveLLMService with GeminiVADParams, allowing fine control over speech detection sensitivity and timing, including:

    • Start sensitivity (how quickly speech is detected)
    • End sensitivity (how quickly turns end after pauses)
    • Prefix padding (milliseconds of audio to keep before speech is detected)
    • Silence duration (milliseconds of silence required to end a turn)
  • Added comprehensive language support to GeminiMultimodalLiveLLMService, supporting over 30 languages via the language parameter, with proper mapping between Pipecat's Language enum and Gemini's language codes.

  • Added support in SmallWebRTCTransport to detect when remote tracks are muted.

  • Added support for image capture from a video stream to the SmallWebRTCTransport.

  • Added a new iOS client option to the SmallWebRTCTransport video-transform example.

  • Added new processors ProducerProcessor and ConsumerProcessor. The producer processor processes frames from the pipeline and decides whether the consumers should consume it or not. If so, the same frame that is received by the producer is sent to the consumer. There can be multiple consumers per producer. These processors can be useful to push frames from one part of a pipeline to a different one (e.g. when using ParallelPipeline).

  • Improvements for the SmallWebRTCTransport:

    • Wait until the pipeline is ready before triggering the connected event.
    • Queue messages if the data channel is not ready.
    • Update the aiortc dependency to fix an issue where the 'video/rtx' MIME type was incorrectly handled as a codec retransmission.
    • Avoid initial video delays.

Changed

  • In GeminiMultimodalLiveLLMService, removed the transcribe_model_audio parameter in favor of Gemini Live's native output transcription support. Now text transcriptions are produced directly by the model. No configuration is required.

  • Updated GeminiMultimodalLiveLLMService's default model to models/gemini-2.0-flash-live-001 and base_url to the v1beta websocket URL.

Fixed

  • Updated daily-python to 0.17.0 to fix an issue that was preventing it from running on older platforms.

  • Fixed an issue where CartesiaTTSService's spell feature would result in the spelled word in the context appearing as "F,O,O,B,A,R" instead of "FOOBAR".

  • Fixed an issue in the Azure TTS services where the language was being set incorrectly.

  • Fixed SmallWebRTCTransport to support dynamic values for TransportParams.audio_out_10ms_chunks. Previously, it only worked with 20ms chunks.

  • Fixed an issue with GeminiMultimodalLiveLLMService where the assistant context messages had no space between words.

  • Fixed an issue where LLMAssistantContextAggregator would prevent a BotStoppedSpeakingFrame from moving through the pipeline.

[0.0.62] - 2025-04-01 "An April Fools' release"

Added

  • Added TransportParams.audio_out_10ms_chunks parameter to allow controlling the amount of audio being sent by the output transport. It defaults to 4, so 40ms audio chunks are sent.

  • Added QwenLLMService for Qwen integration with an OpenAI-compatible interface. Added foundational example 14q-function-calling-qwen.py.

  • Added Mem0MemoryService. Mem0 is a self-improving memory layer for LLM applications. Learn more at: https://mem0.ai/.

  • Added WhisperSTTServiceMLX for Whisper transcription on Apple Silicon. See example in examples/foundational/13e-whisper-mlx.py. Latency of completed transcription using Whisper large-v3-turbo on an M4 macbook is ~500ms.

  • Added SmallWebRTCTransport, a new P2P WebRTC transport.

    • Created two examples in p2p-webrtc:
      • video-transform: Demonstrates sending and receiving audio/video with SmallWebRTCTransport using TypeScript. Includes video frame processing with OpenCV.
      • voice-agent: A minimal example of creating a voice agent with SmallWebRTCTransport.
  • GladiaSTTService now has comprehensive support for the latest API config options, including model, language detection, preprocessing, custom vocabulary, custom spelling, translation, and message filtering options.

  • Added support to ProtobufFrameSerializer to send the messages from TransportMessageFrame and TransportMessageUrgentFrame.

  • Added support for a new TTS service, PiperTTSService. (see https://github.com/rhasspy/piper/)

  • It is now possible to tell whether UserStartedSpeakingFrame or UserStoppedSpeakingFrame have been generated because of emulation frames.

Changed

  • FunctionCallResultFrames are now system frames. This prevents function call results from being discarded during interruptions.

  • Pipecat services have been reorganized into packages. Each package can have one or more of the following modules (in the future new module names might be needed) depending on the services implemented:

    • image: for image generation services
    • llm: for LLM services
    • memory: for memory services
    • stt: for Speech-To-Text services
    • tts: for Text-To-Speech services
    • video: for video generation services
    • vision: for video recognition services
  • Base classes for AI services have been reorganized into modules. They can now be found in pipecat.services.[ai_service,image_service,llm_service,stt_service,vision_service].

  • GladiaSTTService now uses the solaria-1 model by default. Other params use Gladia's default values. Added support for more language codes.

Deprecated

  • All Pipecat services imports have been deprecated and a warning will be shown when using the old import. The new import should be pipecat.services.[service].[image,llm,memory,stt,tts,video,vision]. For example, from pipecat.services.openai.llm import OpenAILLMService.

  • Import for AI services base classes from pipecat.services.ai_services is now deprecated, use one of pipecat.services.[ai_service,image_service,llm_service,stt_service,vision_service].

  • Deprecated the language parameter in GladiaSTTService.InputParams in favor of language_config, which better aligns with Gladia's API.

  • Deprecated using GladiaSTTService.InputParams directly. Use the new GladiaInputParams class instead.

Fixed

  • Fixed a FastAPIWebsocketTransport and WebsocketClientTransport issue that would cause the transport to be closed prematurely, preventing the internally queued audio from being sent. The same issue could also cause an infinite loop while using an output mixer and when sending an EndFrame, preventing the bot from finishing.

  • Fixed an issue that could cause the TranscriptionUpdateFrame being pushed because of an interruption to be discarded.

  • Fixed an issue that would cause SegmentedSTTService based services (e.g. OpenAISTTService) to try to transcribe non-spoken audio, causing invalid transcriptions.

  • Fixed an issue where GoogleTTSService was emitting two TTSStoppedFrames.

Performance

  • Output transports now send 40ms audio chunks instead of 20ms. This should improve performance.

  • BotSpeakingFrames are now sent every 200ms. If the output transport audio chunks are longer than 200ms, then they will be sent at every audio chunk.
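
The chunk sizes and intervals described above reduce to simple arithmetic. A minimal sketch, assuming 16-bit PCM samples and a 16 kHz mono stream (neither is stated in this changelog), with hypothetical helper names:

```python
def chunk_bytes(sample_rate: int, num_10ms_chunks: int = 4, channels: int = 1) -> int:
    """Bytes in one output chunk, assuming 16-bit (2-byte) PCM samples."""
    samples_per_10ms = sample_rate // 100
    return samples_per_10ms * num_10ms_chunks * channels * 2


def bot_speaking_interval_ms(num_10ms_chunks: int) -> int:
    """Interval between BotSpeakingFrames: 200ms, or one chunk if longer."""
    return max(200, num_10ms_chunks * 10)


default_chunk = chunk_bytes(16000)  # 40ms of 16 kHz mono audio
default_interval = bot_speaking_interval_ms(4)  # 40ms chunks
long_interval = bot_speaking_interval_ms(30)  # 300ms chunks
```

With the default of four 10ms chunks, each write carries 40ms of audio, which is 1280 bytes under the assumptions above.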

Other

  • Added foundational example 37-mem0.py demonstrating how to use the Mem0MemoryService.

  • Added foundational example 13e-whisper-mlx.py demonstrating how to use the WhisperSTTServiceMLX.

[0.0.61] - 2025-03-26

Added

  • Added a new frame, LLMSetToolChoiceFrame, which provides a mechanism for modifying the tool_choice in the context.

  • Added GroqTTSService which provides text-to-speech functionality using Groq's API.

  • Added support in DailyTransport for updating remote participants' canReceive permission via the update_remote_participants() method, by bumping the daily-python dependency to >= 0.16.0.

  • ElevenLabs TTS services now support a sample rate of 8000.

  • Added support for instructions in OpenAITTSService.

  • Added support for base_url in OpenAIImageGenService and OpenAITTSService.

Fixed

  • Fixed an issue in RTVIObserver that prevented handling of Google LLM context messages. The observer now processes both OpenAI-style and Google-style contexts.

  • Fixed an issue in Daily involving switching virtual devices, by bumping the daily-python dependency to >= 0.16.1.

  • Fixed a GoogleAssistantContextAggregator issue where function call placeholders were not being updated when the function call result was not a string.

  • Fixed an issue that would cause LLMAssistantContextAggregator to block processing more frames while processing a function call result.

  • Fixed an issue where the RTVIObserver would report two bot started and stopped speaking events for each bot turn.

  • Fixed an issue in UltravoxSTTService that caused improper audio processing and incorrect LLM frame output.

Other

  • Added examples/foundational/07x-interruptible-local.py to show how a local transport can be used.

[0.0.60] - 2025-03-20

Added

  • Added default_headers parameter to BaseOpenAILLMService constructor.

Changed

  • Rolled back to deepgram-sdk 3.8.0 since 3.10.1 was causing connection issues.

  • Changed the default InputAudioTranscription model to gpt-4o-transcribe for OpenAIRealtimeBetaLLMService.

Other

  • Update the 19-openai-realtime-beta.py and 19a-azure-realtime-beta.py examples to use the FunctionSchema format.

[0.0.59] - 2025-03-20

Added

  • When registering a function call it is now possible to indicate if you want the function call to be cancelled if there's a user interruption via cancel_on_interruption (defaults to False). This is now possible because function calls are executed concurrently.

  • Added support for detecting idle pipelines. By default, if no activity has been detected during 5 minutes, the PipelineTask will be automatically cancelled. It is possible to override this behavior by passing cancel_on_idle_timeout=False. It is also possible to change the default timeout with idle_timeout_secs or the frames that prevent the pipeline from being idle with idle_timeout_frames. Finally, an on_idle_timeout event handler will be triggered if the idle timeout is reached (whether the pipeline task is cancelled or not).

  • Added FalSTTService, which provides STT for Fal's Wizper API.

  • Added a reconnect_on_error parameter to websocket-based TTS services as well as an on_connection_error event handler. The reconnect_on_error parameter indicates whether the TTS service should reconnect on error. The on_connection_error handler is always called if there's any error, regardless of the value of reconnect_on_error. This makes it possible, for example, to fall back to a different TTS provider if something goes wrong with the current one.

  • Added new SkipTagsAggregator that extends BaseTextAggregator to aggregate text and skips end of sentence matching if aggregated text is between start/end tags.

  • Added new PatternPairAggregator that extends BaseTextAggregator to identify content between matching pattern pairs in streamed text. This allows for detection and processing of structured content like XML-style tags that may span across multiple text chunks or sentence boundaries.

  • Added new BaseTextAggregator. Text aggregators are used by the TTS service to aggregate LLM tokens and decide when the aggregated text should be pushed to the TTS service. They also allow for the text to be manipulated while it's being aggregated. A text aggregator can be passed via text_aggregator to the TTS service.

  • Added new sample_rate constructor parameter to TavusVideoService to allow changing the output sample rate.

  • Added new NeuphonicTTSService. (see https://neuphonic.com)

  • Added new UltravoxSTTService. (see https://github.com/fixie-ai/ultravox)

  • Added on_frame_reached_upstream and on_frame_reached_downstream event handlers to PipelineTask. Those events will be called when a frame reaches the beginning or end of the pipeline respectively. Note that by default, the event handlers will not be called unless a filter is set with PipelineTask.set_reached_upstream_filter() or PipelineTask.set_reached_downstream_filter().

  • Added support for Chirp voices in GoogleTTSService.

  • Added a flush_audio() method to FishTTSService and LmntTTSService.

  • Added a set_language convenience method for GoogleSTTService, allowing you to set a single language. This is in addition to the set_languages method which allows you to set a list of languages.

  • Added on_user_turn_audio_data and on_bot_turn_audio_data to AudioBufferProcessor. This gives the ability to grab the audio of only that turn for both the user and the bot.

  • Added new base class BaseObject which is now the base class of FrameProcessor, PipelineRunner, PipelineTask and BaseTransport. The new BaseObject adds support for event handlers.

  • Added support for a unified format for specifying function calling across all LLM services.

  weather_function = FunctionSchema(
      name="get_current_weather",
      description="Get the current weather",
      properties={
          "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA",
          },
          "format": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"],
              "description": "The temperature unit to use. Infer this from the user's location.",
          },
      },
      required=["location"],
  )
  tools = ToolsSchema(standard_tools=[weather_function])
  • Added speech_threshold parameter to GladiaSTTService.

  • Allow passing user (user_kwargs) and assistant (assistant_kwargs) context aggregator parameters when using create_context_aggregator(). The values are passed as a mapping that will then be converted to arguments.

  • Added speed as an InputParam for both ElevenLabsTTSService and ElevenLabsHttpTTSService.

  • Added new LLMFullResponseAggregator to aggregate full LLM completions. At every completion the on_completion event handler is triggered.

  • Added a new frame, RTVIServerMessageFrame, and RTVI message RTVIServerMessage which provides a generic mechanism for sending custom messages from server to client. The RTVIServerMessageFrame is processed by the RTVIObserver and will be delivered to the client's onServerMessage callback or ServerMessage event.

  • Added GoogleLLMOpenAIBetaService for Google LLM integration with an OpenAI-compatible interface. Added foundational example 14o-function-calling-gemini-openai-format.py.

  • Added AzureRealtimeBetaLLMService to support Azure's OpenAI Realtime API. Added foundational example 19a-azure-realtime-beta.py.

  • Introduced GoogleVertexLLMService, a new class for integrating with Vertex AI Gemini models. Added foundational example 14p-function-calling-gemini-vertex-ai.py.

  • Added support in OpenAIRealtimeBetaLLMService for a slate of new features:

    • The 'gpt-4o-transcribe' input audio transcription model, along with new language and prompt options specific to that model.

    • The input_audio_noise_reduction session property.

      session_properties = SessionProperties(
        # ...
        input_audio_noise_reduction=InputAudioNoiseReduction(
          type="near_field" # also supported: "far_field"
        )
        # ...
      )
      
    • The 'semantic_vad' turn_detection session property value, a more sophisticated model for detecting when the user has stopped speaking.

    • on_conversation_item_created and on_conversation_item_updated events to OpenAIRealtimeBetaLLMService.

      @llm.event_handler("on_conversation_item_created")
      async def on_conversation_item_created(llm, item_id, item):
        ...  # handle the newly created item

      @llm.event_handler("on_conversation_item_updated")
      async def on_conversation_item_updated(llm, item_id, item):
        # `item` may not always be available here
        ...  # handle the updated item
      
    • The retrieve_conversation_item(item_id) method for introspecting a conversation item on the server.

      item = await llm.retrieve_conversation_item(item_id)
      

Changed

  • Updated OpenAISTTService to use gpt-4o-transcribe as the default transcription model.

  • Updated OpenAITTSService to use gpt-4o-mini-tts as the default TTS model.

  • Function calls are now executed in tasks. This means that the pipeline will not be blocked while the function call is being executed.

  • ⚠️ PipelineTask will now be automatically cancelled if no bot activity is happening in the pipeline. There are a few settings to configure this behavior, see PipelineTask documentation for more details.

  • All event handlers are now executed in separate tasks in order to prevent blocking the pipeline. Previously, an event handler that took some time to execute would block the pipeline until it completed.
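
The effect of running a handler in its own task can be sketched with plain asyncio. This is only an illustration of the general idea, not Pipecat's implementation; `slow_handler` and `run_pipeline` are hypothetical names.

```python
import asyncio


async def slow_handler(log):
    # Simulates an event handler that takes a while to finish.
    await asyncio.sleep(0.05)
    log.append("handler done")


async def run_pipeline():
    log = []
    # Schedule the handler as its own task instead of awaiting it inline.
    handler_task = asyncio.create_task(slow_handler(log))
    # The "pipeline" keeps processing frames while the handler runs.
    for frame in ("frame 1", "frame 2"):
        log.append(frame)
        await asyncio.sleep(0)
    await handler_task  # awaited here only so the sketch exits cleanly
    return log


log = asyncio.run(run_pipeline())
```

Both frames are logged before the handler finishes, which is the non-blocking behavior the entry describes.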

  • Updated TranscriptProcessor to support text output from OpenAIRealtimeBetaLLMService.

  • OpenAIRealtimeBetaLLMService and GeminiMultimodalLiveLLMService now push a TTSTextFrame.

  • Updated the default model for CartesiaTTSService and CartesiaHttpTTSService to sonic-2.

Deprecated

  • Passing a start_callback to LLMService.register_function() is now deprecated. Simply move the code from the start callback to the function call.

  • TTSService parameter text_filter is now deprecated, use text_filters instead which is now a list. This allows passing multiple filters that will be executed in order.
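
Running several filters in order amounts to feeding each filter the output of the previous one. The sketch below assumes filters are plain callables from `str` to `str`, which is a simplification chosen for this example. `apply_filters`, `strip_asterisks` and `collapse_spaces` are hypothetical names and not part of Pipecat.

```python
from typing import Callable, List

TextFilter = Callable[[str], str]  # hypothetical filter type for this sketch


def apply_filters(text: str, filters: List[TextFilter]) -> str:
    # Run each filter on the output of the previous one, in list order.
    for text_filter in filters:
        text = text_filter(text)
    return text


def strip_asterisks(text: str) -> str:
    return text.replace("*", "")


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


result = apply_filters("**Hello**   world", [strip_asterisks, collapse_spaces])
```

Order matters whenever one filter's output changes what a later filter sees.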

Removed

  • Removed deprecated audio.resample_audio(), use create_default_resampler() instead.

  • Removed deprecated stt_service parameter from STTMuteFilter.

  • Removed deprecated RTVI processors, use an RTVIObserver instead.

  • Removed deprecated AWSTTSService, use PollyTTSService instead.

  • Removed deprecated field tier from DailyTranscriptionSettings, use model instead.

  • Removed deprecated pipecat.vad package, use pipecat.audio.vad instead.

Fixed

  • Fixed an assistant aggregator issue that could cause assistant text to be split into multiple chunks during function calls.

  • Fixed an assistant aggregator issue that was causing assistant text to not be added to the context during function calls. This could lead to duplications.

  • Fixed a SegmentedSTTService issue that was causing audio to be sent prematurely to the STT service. Instead of analyzing the volume in this service we rely on VAD events which use both VAD and volume.

  • Fixed a GeminiMultimodalLiveLLMService issue that was causing messages to be duplicated in the context when pushing LLMMessagesAppendFrame frames.

  • Fixed an issue with SegmentedSTTService based services (e.g. GroqSTTService) that was not allowing audio to pass through downstream.

  • Fixed a CartesiaTTSService and RimeTTSService issue that would treat text between spelling-out tags as an end of sentence.

  • Fixed a match_endofsentence issue that would result in floating point numbers being considered an end of sentence.

  • Fixed a match_endofsentence issue that would result in emails being considered an end of sentence.
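
One simple way to avoid treating decimals and email addresses as sentence ends is to accept a terminator only when whitespace or the end of the text follows it. The sketch below uses a regex lookahead for that. It is a simplified stand-in and does not reproduce Pipecat's match_endofsentence; `first_sentence_end` is a hypothetical name.

```python
import re

# A terminator counts only if whitespace or the end of the text follows it,
# so the periods inside "3.14" and "jo@example.com" are skipped.
_TERMINATOR = re.compile(r"[.?!](?=\s|$)")


def first_sentence_end(text: str) -> int:
    """Return the index just past the first sentence terminator, or 0 if none."""
    match = _TERMINATOR.search(text)
    return match.end() if match else 0


text = "Pi is about 3.14 today. Next"
first_sentence = text[: first_sentence_end(text)]
```

With streamed text a trailing period is still ambiguous (more digits may follow), so a real implementation needs more context than this lookahead provides.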

  • Fixed an issue where the RTVI message disconnect-bot was pushing an EndFrame, resulting in the pipeline not shutting down. It now pushes an EndTaskFrame upstream to shut down the pipeline.

  • Fixed an issue with the GoogleSTTService where stream timeouts during periods of inactivity were causing connection failures. The service now properly detects timeout errors and handles reconnection gracefully, ensuring continuous operation even after periods of silence or when using an STTMuteFilter.

  • Fixed an issue in RimeTTSService where the last line of text sent didn't result in an audio output being generated.

  • Fixed OpenAIRealtimeBetaLLMService by adding proper handling for:

    • The conversation.item.input_audio_transcription.delta server message, which was added server-side at some point and not handled client-side.
    • Errors reported by the response.done server message.

Other

  • Added foundational example 07w-interruptible-fal.py, showing FalSTTService.

  • Added a new Ultravox example examples/foundational/07u-interruptible-ultravox.py.

  • Added new Neuphonic examples examples/foundational/07v-interruptible-neuphonic.py and examples/foundational/07v-interruptible-neuphonic-http.py.

  • Added a new example examples/foundational/36-user-email-gathering.py to show how to gather user emails. The example uses Cartesia's <spell></spell> tags and Rime's spell() function to spell out the emails for confirmation.

  • Updated the 34-audio-recording.py example to include an STT processor.

  • Added foundational example 35-voice-switching.py showing how to use the new PatternPairAggregator. This example shows how to encode information for the LLM to instruct TTS voice changes, but this can be used to encode any information into the LLM response, which you want to parse and use in other parts of your application.

  • Added a Pipecat Cloud deployment example to the examples directory.

  • Removed foundational examples 28b and 28c as the TranscriptProcessor no longer has an LLM dependency. Renamed foundational example 28a to 28-transcript-processor.py.

[0.0.58] - 2025-02-26

Added

  • Added track-specific audio event on_track_audio_data to AudioBufferProcessor for accessing separate input and output audio tracks.

  • Pipecat version will now be logged on every application startup. This will help us identify what version we are running in case of any issues.

  • Added a new StopFrame which can be used to stop a pipeline task while keeping the frame processors running. The frame processors could then be used in a different pipeline. The difference between a StopFrame and a StopTaskFrame is that, as with EndFrame and EndTaskFrame, the StopFrame is pushed from the task and the StopTaskFrame is pushed upstream inside the pipeline by any processor.

  • Added a new PipelineTask parameter observers that replaces the previous PipelineParams.observers.

  • Added a new PipelineTask parameter check_dangling_tasks to enable or disable checking for frame processors' dangling tasks when the Pipeline finishes running.

  • Added new on_completion_timeout event for LLM services (all OpenAI-based services, Anthropic and Google). Note that this event will only get triggered if LLM timeouts are set up and if the timeout was reached. It can be useful to retrigger another completion and see if the timeout was just a blip.

  • Added new log observers LLMLogObserver and TranscriptionLogObserver that can be useful for debugging your pipelines.

  • Added room_url property to DailyTransport.

  • Added addons argument to DeepgramSTTService.

  • Added exponential_backoff_time() to utils.network module.
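
For readers unfamiliar with the technique, exponential backoff doubles the wait before each retry up to a cap. The sketch below is a generic illustration. `backoff_delay` and its `base` and `max_wait` parameters are assumptions made for this example and do not describe the signature of exponential_backoff_time().

```python
def backoff_delay(attempt: int, base: float = 1.0, max_wait: float = 10.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based), doubling each time."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # 1st retry waits `base`, 2nd waits 2 * base, and so on, capped at `max_wait`.
    return min(base * 2 ** (attempt - 1), max_wait)


delays = [backoff_delay(n) for n in range(1, 6)]
```

The cap keeps a long outage from pushing the delay between reconnection attempts to unreasonable values.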

Changed

  • ⚠️ PipelineTask now requires keyword arguments (except for the first one for the pipeline).

  • Updated PlayHTHttpTTSService to take a voice_engine and protocol input in the constructor. The previous method of providing a voice_engine input that contains the engine and protocol is deprecated by PlayHT.

  • The base TTSService class now strips leading newlines before sending text to the TTS provider. This change is to solve issues where some TTS providers, like Azure, would not output text due to newlines.

  • GrokLLMService now uses grok-2 as the default model.

  • AnthropicLLMService now uses claude-3-7-sonnet-20250219 as the default model.

  • RimeHttpTTSService now needs an aiohttp.ClientSession to be passed to the constructor, like all the other HTTP-based services.

  • RimeHttpTTSService doesn't use a default voice anymore.

  • DeepgramSTTService now uses the new nova-3 model by default. If you want to use the previous model you can pass LiveOptions(model="nova-2-general"). (see https://deepgram.com/learn/introducing-nova-3-speech-to-text-api)

  stt = DeepgramSTTService(..., live_options=LiveOptions(model="nova-2-general"))

Deprecated

  • PipelineParams.observers is now deprecated, use the new PipelineTask parameter observers.

Removed

  • Removed TransportParams.audio_out_is_live since it was not being used at all.

Fixed

  • Fixed an issue that would cause undesired interruptions via EmulateUserStartedSpeakingFrame.

  • Fixed a GoogleLLMService issue that was causing an exception when sending inline audio in some cases.

  • Fixed an AudioContextWordTTSService issue that would cause an EndFrame to disconnect from the TTS service before audio from all the contexts was received. This affected services like Cartesia and Rime.

  • Fixed an issue that prevented passing an OpenAILLMContext to create GoogleLLMService's context aggregators.

  • Fixed an ElevenLabsTTSService, FishAudioTTSService, LMNTTTSService and PlayHTTTSService issue that was resulting in audio requested before an interruption being played after the interruption.

  • Fixed match_endofsentence support for ellipses.

  • Fixed an issue where EndTaskFrame was not triggering on_client_disconnected or closing the WebSocket in FastAPI.

  • Fixed an issue in DeepgramSTTService where the sample_rate passed to the LiveOptions was not being used, causing the service to use the pipeline's default sample rate.

  • Fixed a context aggregator issue that would not append the LLM text response to the context if a function call happened in the same LLM turn.

  • Fixed an issue that was causing HTTP TTS services to push TTSStoppedFrame more than once.

  • Fixed a FishAudioTTSService issue where TTSStoppedFrame was not being pushed.

  • Fixed an issue where start_callback was not invoked for some LLM services.

  • Fixed an issue that would cause DeepgramSTTService to stop working after an error occurred (e.g. sudden network loss). If the network recovered we would not reconnect.

  • Fixed an STTMuteFilter issue that would not mute user audio frames, causing transcriptions to still be generated by the STT service.

Other

  • Added Gemini support to examples/phone-chatbot.

  • Added foundational example 34-audio-recording.py showing how to use the AudioBufferProcessor callbacks to save merged and track recordings.

[0.0.57] - 2025-02-14

Added

  • Added new AudioContextWordTTSService. This is a TTS base class for TTS services that handle multiple separate audio requests.

  • Added new frames EmulateUserStartedSpeakingFrame and EmulateUserStoppedSpeakingFrame which can be used to emulate VAD behavior when VAD is not present or is not being triggered.

  • Added a new audio_in_stream_on_start field to TransportParams.

  • Added a new method start_audio_in_streaming in the BaseInputTransport.

    • This method should be used to start receiving the input audio in case the field audio_in_stream_on_start is set to false.

  • Added support for the RTVIProcessor to handle buffered audio in base64 format, converting it into InputAudioRawFrame for transport.

  • Added support for the RTVIProcessor to trigger start_audio_in_streaming only after the client-ready message.

  • Added new MUTE_UNTIL_FIRST_BOT_COMPLETE strategy to STTMuteStrategy. This strategy starts muted and remains muted until the first bot speech completes, ensuring the bot's first response cannot be interrupted. This complements the existing FIRST_SPEECH strategy which only mutes during the first detected bot speech.

  • Added support for Google Cloud Speech-to-Text V2 through GoogleSTTService.

  • Added RimeTTSService, a new WordTTSService. Updated the foundational example 07q-interruptible-rime.py to use RimeTTSService.

  • Added support for Groq's Whisper API through the new GroqSTTService and OpenAI's Whisper API through the new OpenAISTTService. Introduced a new base class BaseWhisperSTTService to handle common Whisper API functionality.

  • Added PerplexityLLMService for Perplexity NIM API integration, with an OpenAI-compatible interface. Also, added foundational example 14n-function-calling-perplexity.py.

  • Added DailyTransport.update_remote_participants(). This allows you to update remote participant's settings, like their permissions or which of their devices are enabled. Requires that the local participant have participant admin permission.

Changed

  • We don't consider a colon : an end of sentence any more.

  • Updated DailyTransport to respect the audio_in_stream_on_start field, ensuring it only starts receiving the audio input if it is enabled.

  • Updated FastAPIWebsocketOutputTransport to send TransportMessageFrame and TransportMessageUrgentFrame to the serializer.

  • Updated WebsocketServerOutputTransport to send TransportMessageFrame and TransportMessageUrgentFrame to the serializer.

  • Enhanced STTMuteConfig to validate strategy combinations, preventing MUTE_UNTIL_FIRST_BOT_COMPLETE and FIRST_SPEECH from being used together as they handle first bot speech differently.
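
The kind of validation described above can be sketched with a dataclass that rejects the conflicting pair at construction time. `MuteStrategy` and `MuteConfig` are hypothetical stand-ins for this example. They are not the real STTMuteStrategy and STTMuteConfig, whose fields and error type may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class MuteStrategy(Enum):  # hypothetical stand-in for STTMuteStrategy
    FIRST_SPEECH = "first_speech"
    MUTE_UNTIL_FIRST_BOT_COMPLETE = "mute_until_first_bot_complete"


@dataclass
class MuteConfig:  # hypothetical stand-in for STTMuteConfig
    strategies: Set[MuteStrategy] = field(default_factory=set)

    def __post_init__(self):
        # Both strategies govern the first bot speech, so reject the pair.
        conflicting = {
            MuteStrategy.FIRST_SPEECH,
            MuteStrategy.MUTE_UNTIL_FIRST_BOT_COMPLETE,
        }
        if conflicting <= self.strategies:
            raise ValueError(
                "FIRST_SPEECH and MUTE_UNTIL_FIRST_BOT_COMPLETE cannot be combined"
            )


ok_config = MuteConfig({MuteStrategy.FIRST_SPEECH})
```

Failing at construction surfaces the misconfiguration before the pipeline starts rather than as odd muting behavior mid-call.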

  • Updated foundational example 07n-interruptible-google.py to use all Google services.

  • RimeHttpTTSService now uses the mistv2 model by default.

  • Improved error handling in AzureTTSService to properly detect and log synthesis cancellation errors.

  • Enhanced WhisperSTTService with full language support and improved model documentation.

  • Updated foundational example 14f-function-calling-groq.py to use GroqSTTService for transcription.

  • Updated GroqLLMService to use llama-3.3-70b-versatile as the default model.

  • RTVIObserver doesn't handle LLMSearchResponseFrame frames anymore. For now, to handle those frames you need to create a GoogleRTVIObserver instead.

Deprecated

  • STTMuteFilter constructor's stt_service parameter is now deprecated and will be removed in a future version. The filter now manages mute state internally instead of querying the STT service.

  • RTVI.observer() is now deprecated, instantiate an RTVIObserver directly instead.

  • All RTVI frame processors (e.g. RTVISpeakingProcessor, RTVIBotLLMProcessor) are now deprecated, instantiate an RTVIObserver instead.

Fixed

  • Fixed a FalImageGenService issue that was causing the event loop to be blocked while loading the downloaded image.

  • Fixed a CartesiaTTSService issue that would cause audio overlapping in some cases.

  • Fixed a websocket-based service issue (e.g. CartesiaTTSService) that was preventing a reconnection after the server disconnected cleanly, which was causing an infinite loop instead.

  • Fixed a BaseOutputTransport issue that was causing upstream frames to not be pushed upstream.

  • Fixed multiple issues where user transcriptions were not being handled properly. It was possible for short utterances to not trigger VAD which would cause user transcriptions to be ignored. It was also possible for one or more transcriptions to be generated after VAD in which case they would also be ignored.

  • Fixed an issue that was causing BotStoppedSpeakingFrame to be generated too late. This could then cause issues unblocking STTMuteFilter later than desired.

  • Fixed an issue that was causing AudioBufferProcessor to not record synchronized audio.

  • Fixed an RTVI issue that was causing bot-tts-text messages to be sent before being processed by the output transport.

  • Fixed an issue (#1192) in the ElevenLabs service where we were trying to reconnect/disconnect the websocket connection even when the connection was already closed.

  • Fixed an issue where has_regular_messages condition was always true in GoogleLLMContext due to Part having function_call & function_response with None values.

Other

  • Added new instant-voice example. This example showcases how to enable instant voice communication as soon as a user connects.

  • Added new local-input-select-stt example. This example allows you to play with local audio inputs by selecting them through a nice text interface.

[0.0.56] - 2025-02-06

Changed

  • Use gemini-2.0-flash-001 as the default model for GoogleLLMService.

  • Improved foundational examples 22b, 22c, and 22d to support function calling. With these base examples, FunctionCallInProgressFrame and FunctionCallResultFrame will no longer be blocked by the gates.

Fixed

  • Fixed TkLocalTransport and LocalAudioTransport issues that were causing errors on cleanup.

  • Fixed an issue that was causing tests.utils import to fail because of logging setup.

  • Fixed a SentryMetrics issue that was preventing any metrics from being sent to Sentry and was also preventing metrics frames from being pushed to the pipeline.

  • Fixed an issue in BaseOutputTransport where incoming audio would not be resampled to the desired output sample rate.

  • Fixed an issue with the TwilioFrameSerializer and TelnyxFrameSerializer where twilio_sample_rate and telnyx_sample_rate were incorrectly initialized to audio_in_sample_rate. Those values currently default to 8000 and should be set manually from the serializer constructor if a different value is needed.

Other

  • Added a new sentry-metrics example.

[0.0.55] - 2025-02-05

Added

  • Added a new start_metadata field to PipelineParams. The provided metadata will be set to the initial StartFrame being pushed from the PipelineTask.

  • Added new fields to PipelineParams to control audio input and output sample rates for the whole pipeline. This allows controlling sample rates from a single place instead of having to specify sample rates in each service. Setting a sample rate to a service is still possible and will override the value from PipelineParams.

  • Introduced audio resamplers (BaseAudioResampler). This is just a base class to implement audio resamplers. Currently, two implementations are provided: SOXRAudioResampler and ResampyResampler. A new create_default_resampler() has been added (replacing the now deprecated resample_audio()).

  • It is now possible to specify the asyncio event loop that a PipelineTask and all the processors should run on by passing it as a new argument to the PipelineRunner. This could allow running pipelines in multiple threads each one with its own event loop.

  • Added a new utils.TaskManager. Instead of a global task manager we now have a task manager per PipelineTask. In the previous version the task manager was global, so running multiple simultaneous PipelineTasks could result in dangling task warnings which were not actually true. In order for all the processors to know about the task manager, we pass it through the StartFrame. This means that processors should create tasks when they receive a StartFrame but not before (because they don't have a task manager yet).

  • Added TelnyxFrameSerializer to support Telnyx calls. A full running example has also been added to examples/telnyx-chatbot.

  • Allow pushing silence audio frames before TTSStoppedFrame. This might be useful for testing purposes, for example, passing bot audio to an STT service which usually needs additional audio data to detect the utterance stopped.

  • TwilioSerializer now supports transport message frames. With this we can create Twilio emulators.

  • Added a new transport: WebsocketClientTransport.

  • Added a metadata field to Frame which makes it possible to pass custom data to all frames.

  • Added test/utils.py inside of pipecat package.

Changed

  • GatedOpenAILLMContextAggregator now requires keyword arguments. Also, a new start_open argument has been added to set the initial state of the gate.

  • Added organization and project level authentication to OpenAILLMService.

  • Improved the language checking logic in ElevenLabsTTSService and ElevenLabsHttpTTSService to properly handle language codes based on model compatibility, with appropriate warnings when language codes cannot be applied.

  • Updated GoogleLLMContext to support pushing LLMMessagesUpdateFrames that contain a combination of function calls, function call responses, system messages, or just messages.

  • InputDTMFFrame is now based on DTMFFrame. There's also a new OutputDTMFFrame frame.

Deprecated

  • resample_audio() is now deprecated, use create_default_resampler() instead.

Removed

  • AudioBufferProcessor.reset_audio_buffers() has been removed, use AudioBufferProcessor.start_recording() and AudioBufferProcessor.stop_recording() instead.

Fixed

  • Fixed an AudioBufferProcessor issue that would cause crackling in some recordings.

  • Fixed an issue in AudioBufferProcessor where user callback would not be called on task cancellation.

  • Fixed an issue in AudioBufferProcessor that would cause wrong silence padding in some cases.

  • Fixed an issue where ElevenLabsTTSService messages would return a 1009 websocket error by increasing the max message size limit to 16MB.

  • Fixed a DailyTransport issue that would cause events to be triggered before join finished.

  • Fixed a PipelineTask issue that was preventing processors from being cleaned up after cancelling the task.

  • Fixed an issue where queuing a CancelFrame to a pipeline task would not cause the task to finish. However, using PipelineTask.cancel() is still the recommended way to cancel a task.

Other

  • Improved Unit Test run_test() to use PipelineTask and PipelineRunner. There's now also some control around StartFrame and EndFrame. The EndTaskFrame has been removed since it doesn't seem necessary with this new approach.

  • Updated twilio-chatbot with a few new features: use 8000 sample rate and avoid resampling, a new client useful for stress testing and testing locally without the need to make phone calls. Also, added audio recording on both the client and the server to make sure the audio sounds good.

  • Updated examples to use task.cancel() to immediately exit the example when a participant leaves or disconnects, instead of pushing an EndFrame. Pushing an EndFrame causes the bot to run through everything that is internally queued (which could take some seconds). Note that using task.cancel() might not always be the best option and pushing an EndFrame could still be desirable to make sure all the pipeline is flushed.

[0.0.54] - 2025-01-27

Added

  • In order to create tasks in Pipecat frame processors it is now recommended to use FrameProcessor.create_task() (which uses the new utils.asyncio.create_task()). It takes care of uncaught exceptions, task cancellation handling and task management. To cancel or wait for a task there is FrameProcessor.cancel_task() and FrameProcessor.wait_for_task(). All of Pipecat's processors have been updated accordingly. Also, when a pipeline runner finishes, a warning about dangling tasks might appear, which indicates that one or more of the created tasks was never cancelled or awaited (using these new functions).
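
The idea of tracking tasks so that leftovers can be reported can be sketched with plain asyncio. `TaskTracker` is a hypothetical helper written for this example. It does not reproduce FrameProcessor.create_task() or Pipecat's task management, and it omits the uncaught-exception handling mentioned above.

```python
import asyncio


class TaskTracker:  # hypothetical helper, not Pipecat's implementation
    def __init__(self):
        self._tasks = set()

    def create_task(self, coro, name):
        # Remember every task so leftovers can be reported later.
        task = asyncio.create_task(coro, name=name)
        self._tasks.add(task)
        return task

    async def cancel_task(self, task):
        # Cancel, wait for the cancellation to finish, then forget the task.
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        self._tasks.discard(task)

    def dangling(self):
        # Names of tasks that were created but never cancelled via the tracker.
        return sorted(task.get_name() for task in self._tasks)


async def main():
    tracker = TaskTracker()
    worker = tracker.create_task(asyncio.sleep(10), "worker")
    tracker.create_task(asyncio.sleep(10), "forgotten")
    await tracker.cancel_task(worker)
    return tracker.dangling()


dangling_names = asyncio.run(main())
```

Here only the task that was never cancelled is reported, which mirrors the purpose of the dangling-task warning.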

  • It is now possible to specify the period of the PipelineTask heartbeat frames with heartbeats_period_secs.

  • Added DailyMeetingTokenProperties and DailyMeetingTokenParams Pydantic models for meeting token creation in get_token method of DailyRESTHelper.

  • Added enable_recording and geo parameters to DailyRoomProperties.

  • Added RecordingsBucketConfig to DailyRoomProperties to upload recordings to a custom AWS bucket.

Changed

  • Enhanced UserIdleProcessor with retry functionality and control over idle monitoring via new callback signature (processor, retry_count) -> bool. Updated the 17-detect-user-idle.py example to show how to use the retry_count.
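
A callback with the (processor, retry_count) -> bool shape might look like the sketch below. It assumes the callback is async, that returning True keeps idle monitoring active, and that retry_count starts at 1. None of these details is stated in this entry. `FakeIdleProcessor` is a hypothetical driver that exists only to exercise the callback.

```python
import asyncio


async def handle_user_idle(processor, retry_count: int) -> bool:
    # Assumed convention: True keeps monitoring for idleness, False stops it.
    if retry_count < 3:
        processor.prompts.append(f"Are you still there? (attempt {retry_count})")
        return True
    processor.prompts.append("Ending the call.")
    return False


class FakeIdleProcessor:  # hypothetical stand-in that only drives the callback
    def __init__(self, callback):
        self.callback = callback
        self.prompts = []

    async def simulate_idle_periods(self) -> int:
        retry_count = 0
        keep_monitoring = True
        while keep_monitoring:
            retry_count += 1
            keep_monitoring = await self.callback(self, retry_count)
        return retry_count


processor = FakeIdleProcessor(handle_user_idle)
attempts = asyncio.run(processor.simulate_idle_periods())
```

The retry count lets the bot escalate from a gentle reminder to ending the session.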

  • Add defensive error handling for OpenAIRealtimeBetaLLMService's audio truncation. Audio truncation errors during interruptions now log a warning and allow the session to continue instead of throwing an exception.

  • Modified TranscriptProcessor to use TTS text frames for more accurate assistant transcripts. Assistant messages are now aggregated based on bot speaking boundaries rather than LLM context, providing better handling of interruptions and partial utterances.

  • Updated foundational examples 28a-transcription-processor-openai.py, 28b-transcript-processor-anthropic.py, and 28c-transcription-processor-gemini.py to use the updated TranscriptProcessor.

Fixed

  • Fixed a GeminiMultimodalLiveLLMService issue that was preventing the user from pushing initial LLM assistant messages (using LLMMessagesAppendFrame).

  • Added missing FrameProcessor.cleanup() calls to Pipeline, ParallelPipeline and UserIdleProcessor.

  • Fixed a type error when using voice_settings in ElevenLabsHttpTTSService.

  • Fixed an issue where OpenAIRealtimeBetaLLMService function calling resulted in an error.

  • Fixed an issue in AudioBufferProcessor where the last audio buffer was not being processed, in cases where the _user_audio_buffer was smaller than the buffer size.

Performance

  • Replaced audio resampling library resampy with soxr. Resampling a 2:21s audio file from 24 kHz to 16 kHz took 1.41s with resampy and 0.031s with soxr with similar audio quality.
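
To put those timings in perspective, the short calculation below derives the speedup and the real-time factor from the figures quoted in the entry. The variable names are illustrative only.

```python
audio_secs = 2 * 60 + 21  # the 2:21 audio file, in seconds
resampy_secs = 1.41       # resampling time reported for resampy
soxr_secs = 0.031         # resampling time reported for soxr

# How many times faster soxr was on this file (roughly 45x).
speedup = resampy_secs / soxr_secs

# Seconds of audio resampled per second of compute with soxr.
soxr_realtime_factor = audio_secs / soxr_secs
```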

Other

  • Added initial unit test infrastructure.

[0.0.53] - 2025-01-18

Added

  • Added ElevenLabsHttpTTSService which uses ElevenLabs' HTTP API instead of the websocket one.

  • Introduced pipeline frame observers. Observers can view all the frames that go through the pipeline without the need to inject processors in the pipeline. This can be useful, for example, to implement frame loggers or debuggers among other things. The example examples/foundational/30-observer.py shows how to add an observer to a pipeline for debugging.

  • Introduced heartbeat frames. The pipeline task can now push periodic heartbeats down the pipeline when enable_heartbeats=True. Heartbeats are system frames that are supposed to make it all the way to the end of the pipeline. When a heartbeat frame is received the traversing time (i.e. the time it took to go through the whole pipeline) will be displayed (with TRACE logging) otherwise a warning will be shown. The example examples/foundational/31-heartbeats.py shows how to enable heartbeats and forces warnings to be displayed.
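
Measuring traversal time can be sketched with a chain of asyncio queues: the heartbeat records when it entered the pipeline, and the elapsed time is computed when it arrives at the far end. This is a generic illustration and not Pipecat's heartbeat implementation. `stage`, `measure_heartbeat` and the dictionary frame format are assumptions for this example. A timeout in `wait_for` plays the role of the missing-heartbeat warning.

```python
import asyncio
import time


async def stage(inbox: asyncio.Queue, outbox: asyncio.Queue):
    # A pipeline stage that forwards one frame after some simulated work.
    frame = await inbox.get()
    await asyncio.sleep(0.01)
    await outbox.put(frame)


async def measure_heartbeat(num_stages: int = 3) -> float:
    queues = [asyncio.Queue() for _ in range(num_stages + 1)]
    stages = [
        asyncio.create_task(stage(queues[i], queues[i + 1]))
        for i in range(num_stages)
    ]
    # The heartbeat carries the time at which it entered the pipeline.
    await queues[0].put({"type": "heartbeat", "sent_at": time.monotonic()})
    # If the heartbeat never arrives, wait_for raises a timeout error.
    heartbeat = await asyncio.wait_for(queues[-1].get(), timeout=1.0)
    await asyncio.gather(*stages)
    return time.monotonic() - heartbeat["sent_at"]


traversal_secs = asyncio.run(measure_heartbeat())
```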

  • Added LLMTextFrame and TTSTextFrame which should be pushed by LLM and TTS services respectively instead of TextFrames.

  • Added OpenRouter for OpenRouter integration with an OpenAI-compatible interface. Added foundational example 14m-function-calling-openrouter.py.

  • Added a new WebsocketService based class for TTS services, containing base functions and retry logic.

  • Added DeepSeekLLMService for DeepSeek integration with an OpenAI-compatible interface. Added foundational example 14l-function-calling-deepseek.py.

  • Added FunctionCallResultProperties dataclass to provide a structured way to control function call behavior, including:

    • run_llm: Controls whether to trigger LLM completion
    • on_context_updated: Optional callback triggered after context update

  • Added a new foundational example 07e-interruptible-playht-http.py for easy testing of PlayHTHttpTTSService.

  • Added support for Google TTS Journey voices in GoogleTTSService.

  • Added 29-livekit-audio-chat.py as a new foundational example for LiveKitTransportLayer.

  • Added enable_prejoin_ui, max_participants and start_video_off params to DailyRoomProperties.

  • Added session_timeout to FastAPIWebsocketTransport and WebsocketServerTransport for configuring session timeouts (in seconds). Triggers on_session_timeout for custom timeout handling. See examples/websocket-server/bot.py.

  • Added the new modalities option and helper function to set Gemini output modalities.

  • Added examples/foundational/26d-gemini-live-text.py which uses Gemini with the TEXT modality and another TTS provider for speech synthesis.

Changed

  • Modified UserIdleProcessor to start monitoring only after first conversation activity (UserStartedSpeakingFrame or BotStartedSpeakingFrame) instead of immediately.

  • Modified OpenAIAssistantContextAggregator to support controlled completions and to emit context update callbacks via FunctionCallResultProperties.

  • Added aws_session_token to the PollyTTSService.

  • Changed the default model for PlayHTHttpTTSService to Play3.0-mini-http.

  • api_key, aws_access_key_id and region are no longer required parameters for the PollyTTSService (AWSTTSService)

  • Added session_timeout example in examples/websocket-server/bot.py to handle session timeout event.

  • Changed InputParams in src/pipecat/services/gemini_multimodal_live/gemini.py to support different modalities.

  • Changed DeepgramSTTService to send finalize event whenever VAD detects UserStoppedSpeakingFrame. This helps in faster transcriptions and clearing the Deepgram audio buffer.

Fixed

  • Fixed an issue where DeepgramSTTService was not generating metrics using pipeline's VAD.

  • Fixed UserIdleProcessor not properly propagating EndFrames through the pipeline.

  • Fixed an issue where websocket based TTS services could incorrectly terminate their connection due to a retry counter not resetting.

  • Fixed a PipelineTask issue that would cause a dangling task after stopping the pipeline with an EndFrame.

  • Fixed an import issue for PlayHTHttpTTSService.

  • Fixed an issue where languages couldn't be used with the PlayHTHttpTTSService.

  • Fixed an issue where OpenAIRealtimeBetaLLMService audio chunks were hitting an error when truncating audio content.

  • Fixed an issue where setting the voice and model for RimeHttpTTSService wasn't working.

  • Fixed an issue where IdleFrameProcessor and UserIdleProcessor were getting initialized before the start of the pipeline.

[0.0.52] - 2024-12-24

Added

  • Constructor arguments for GoogleLLMService to directly set tools and tool_config.

  • Smart turn detection example (22d-natural-conversation-gemini-audio.py) that leverages Gemini 2.0 capabilities. (see https://x.com/kwindla/status/1870974144831275410)

  • Added DailyTransport.send_dtmf() to send dial-out DTMF tones.

  • Added DailyTransport.sip_call_transfer() to forward SIP and PSTN calls to another address or number. For example, transfer a SIP call to a different SIP address or transfer a PSTN phone number to a different PSTN phone number.

  • Added DailyTransport.sip_refer() to transfer incoming SIP/PSTN calls from outside Daily to another SIP/PSTN address.

  • Added an auto_mode input parameter to ElevenLabsTTSService. auto_mode is set to True by default. Enabling this setting disables the chunk schedule and all buffers, which reduces latency.

  • Added KoalaFilter which implements on-device noise reduction using Koala Noise Suppression. (see https://picovoice.ai/platform/koala/)

  • Added CerebrasLLMService for Cerebras integration with an OpenAI-compatible interface. Added foundational example 14k-function-calling-cerebras.py.

  • Pipecat now supports Python 3.13. We had a dependency on the audioop package which was deprecated and has now been removed in Python 3.13. We are now using audioop-lts (https://github.com/AbstractUmbra/audioop) to provide the same functionality.

  • Added timestamped conversation transcript support:

    • New TranscriptProcessor factory provides access to user and assistant transcript processors.
    • UserTranscriptProcessor processes user speech with timestamps from transcription.
    • AssistantTranscriptProcessor processes assistant responses with LLM context timestamps.
    • Messages emitted with ISO 8601 timestamps indicating when they were spoken.
    • Supports all LLM formats (OpenAI, Anthropic, Google) via standard message format.
    • New examples: 28a-transcription-processor-openai.py, 28b-transcription-processor-anthropic.py, and 28c-transcription-processor-gemini.py.
  • Added support for more languages to ElevenLabs (Arabic, Croatian, Filipino, Tamil) and PlayHT (Afrikaans, Albanian, Amharic, Arabic, Bengali, Croatian, Galician, Hebrew, Mandarin, Serbian, Tagalog, Urdu, Xhosa).

Changed

  • PlayHTTTSService uses the new v4 websocket API, which also fixes an issue where text inputted to the TTS didn't return audio.

  • The default model for ElevenLabsTTSService is now eleven_flash_v2_5.

  • OpenAIRealtimeBetaLLMService now takes a model parameter in the constructor.

  • Updated the default model for the OpenAIRealtimeBetaLLMService.

  • Room expiration (exp) in DailyRoomProperties is now optional (None) by default instead of automatically setting a 5-minute expiration time. You must explicitly set expiration time if desired.
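
    Because rooms no longer expire by default, callers who relied on the old behavior need to compute the expiration themselves. A minimal sketch, assuming exp is a Unix timestamp in seconds (as in Daily's REST API); the helper name room_expiration is hypothetical and not part of Pipecat:

```python
import time
from typing import Optional


def room_expiration(minutes: float, now: Optional[float] = None) -> float:
    """Return a Unix timestamp `minutes` from now (hypothetical helper)."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    base = time.time() if now is None else now
    return base + minutes * 60


# Reproduces the previous 5-minute default; pass the result as `exp`.
exp = room_expiration(5)
```

    The optional now argument exists only to make the helper deterministic in tests.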

Deprecated

  • AWSTTSService is now deprecated, use PollyTTSService instead.

Fixed

  • Fixed token counting in GoogleLLMService. Tokens were summed incorrectly (double-counted in many cases).

  • Fixed an issue that could cause the bot to stop talking if there was a user interruption before getting any audio from the TTS service.

  • Fixed an issue that would cause ParallelPipeline to handle EndFrame incorrectly causing the main pipeline to not terminate or terminate too early.

  • Fixed an audio stuttering issue in FastPitchTTSService.

  • Fixed a BaseOutputTransport issue that was causing non-audio frames to be processed before the previous audio frames were played. This allows, for example, sending a frame A after a TTSSpeakFrame; frame A will only be pushed downstream after the audio generated from the TTSSpeakFrame has been spoken.

  • Fixed a DeepgramSTTService issue that was causing language to be passed as an object instead of a string, which made the connection fail.

[0.0.51] - 2024-12-16

Fixed

  • Fixed an issue in websocket-based TTS services that was causing infinite reconnections (Cartesia, ElevenLabs, PlayHT and LMNT).

[0.0.50] - 2024-12-11

Added

  • Added GeminiMultimodalLiveLLMService. This is an integration for Google's Gemini Multimodal Live API, supporting:

    • Real-time audio and video input processing
    • Streaming text responses with TTS
    • Audio transcription for both user and bot speech
    • Function calling
    • System instructions and context management
    • Dynamic parameter updates (temperature, top_p, etc.)
  • Added AudioTranscriber utility class for handling audio transcription with Gemini models.

  • Added new context classes for Gemini:

    • GeminiMultimodalLiveContext
    • GeminiMultimodalLiveUserContextAggregator
    • GeminiMultimodalLiveAssistantContextAggregator
    • GeminiMultimodalLiveContextAggregatorPair
  • Added new foundational examples for GeminiMultimodalLiveLLMService:

    • 26-gemini-multimodal-live.py
    • 26a-gemini-live-transcription.py
    • 26b-gemini-live-video.py
    • 26c-gemini-live-video.py
  • Added SimliVideoService. This is an integration for Simli AI avatars. (see https://www.simli.com)

  • Added NVIDIA Riva's FastPitchTTSService and ParakeetSTTService. (see https://www.nvidia.com/en-us/ai-data-science/products/riva/)

  • Added IdentityFilter. This is the simplest frame filter that lets through all incoming frames.

  • New STTMuteStrategy called FUNCTION_CALL which mutes the STT service during LLM function calls.

  • DeepgramSTTService now exposes two event handlers on_speech_started and on_utterance_end that could be used to implement interruptions. See new example examples/foundational/07c-interruptible-deepgram-vad.py.

  • Added GroqLLMService, GrokLLMService, and NimLLMService for Groq, Grok, and NVIDIA NIM API integration, with an OpenAI-compatible interface.

  • New examples demonstrating function calling with Groq, Grok, Azure OpenAI, Fireworks, and NVIDIA NIM: 14f-function-calling-groq.py, 14g-function-calling-grok.py, 14h-function-calling-azure.py, 14i-function-calling-fireworks.py, and 14j-function-calling-nvidia.py.

  • In order to obtain the audio stored by the AudioBufferProcessor you can now also register an on_audio_data event handler. The on_audio_data handler will be called every time buffer_size (a new constructor argument) is reached. If buffer_size is 0 (default) you need to manually get the audio as before using AudioBufferProcessor.merge_audio_buffers().

@audiobuffer.event_handler("on_audio_data")
async def on_audio_data(processor, audio, sample_rate, num_channels):
    await save_audio(audio, sample_rate, num_channels)
  • Added a new RTVI message called disconnect-bot, which when handled pushes an EndFrame to trigger the pipeline to stop.

Changed

  • STTMuteFilter now supports multiple simultaneous muting strategies.

  • XTTSService language now defaults to Language.EN.

  • SoundfileMixer doesn't resample input files anymore to avoid startup delays. The sample rate of the provided sound files now needs to match the sample rate of the output transport.

  • Input frames (audio, image and transport messages) are now system frames. This means they are processed immediately by all processors instead of being queued internally.

  • Expanded the transcriptions.language module to support a superset of languages.

  • Updated STT and TTS services with language options that match the supported languages for each service.

  • Updated the AzureLLMService to use the OpenAILLMService. Updated the api_version to 2024-09-01-preview.

  • Updated the FireworksLLMService to use the OpenAILLMService. Updated the default model to accounts/fireworks/models/firefunction-v2.

  • Updated the simple-chatbot example to include JavaScript and React client examples, using RTVI JS and React.

Removed

  • Removed AppFrame. This was used as a special user custom frame, but there's actually no use case for that.

Fixed

  • Fixed a ParallelPipeline issue that would cause system frames to be queued.

  • Fixed FastAPIWebsocketTransport so it can work with binary data (e.g. using the protobuf serializer).

  • Fixed an issue in CartesiaTTSService that could cause previous audio to be received after an interruption.

  • Fixed Cartesia, ElevenLabs, LMNT and PlayHT TTS websocket reconnection. Before, if an error occurred no reconnection was happening.

  • Fixed a BaseOutputTransport issue that was causing audio to be discarded after an EndFrame was received.

  • Fixed an issue in WebsocketServerTransport and FastAPIWebsocketTransport that would cause a busy loop when using audio mixer.

  • Fixed a DailyTransport and LiveKitTransport issue where connections were being closed in the input transport prematurely. This was causing frames queued inside the pipeline to be discarded.

  • Fixed an issue in DailyTransport that would cause some internal callbacks to not be executed.

  • Fixed an issue where other frames were being processed while a CancelFrame was being pushed down the pipeline.

  • AudioBufferProcessor now handles interruptions properly.

  • Fixed a WebsocketServerTransport issue that would prevent interruptions with TwilioSerializer from working.

  • DailyTransport.capture_participant_video now allows capturing user's screen share by simply passing video_source="screenVideo".

  • Fixed Google Gemini message handling to properly convert appended messages to Gemini's required format.

  • Fixed an issue with FireworksLLMService where chat completions were failing by removing the stream_options from the chat completion options.

[0.0.49] - 2024-11-17

Added

  • Added RTVI on_bot_started event which is useful in a single turn interaction.

  • Added DailyTransport events dialin-connected, dialin-stopped, dialin-error and dialin-warning. Needs daily-python >= 0.13.0.

  • Added RimeHttpTTSService and the 07q-interruptible-rime.py foundational example.

  • Added STTMuteFilter, a general-purpose processor that combines STT muting and interruption control. When active, it prevents both transcription and interruptions during bot speech. The processor supports multiple strategies: FIRST_SPEECH (mute only during bot's first speech), ALWAYS (mute during all bot speech), or CUSTOM (using provided callback).
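
    The three strategies described above boil down to a small decision function. The following is only an illustrative sketch: MuteStrategy and should_mute are hypothetical names, and the real STTMuteFilter may track state differently.

```python
from enum import Enum
from typing import Callable, Optional


class MuteStrategy(Enum):  # hypothetical stand-in for the real strategy enum
    FIRST_SPEECH = "first_speech"
    ALWAYS = "always"
    CUSTOM = "custom"


def should_mute(
    strategy: MuteStrategy,
    bot_speaking: bool,
    first_speech_done: bool,
    custom: Optional[Callable[[], bool]] = None,
) -> bool:
    """Decide whether transcription and interruptions should be muted now."""
    if strategy is MuteStrategy.ALWAYS:
        return bot_speaking
    if strategy is MuteStrategy.FIRST_SPEECH:
        return bot_speaking and not first_speech_done
    # CUSTOM: defer entirely to the caller's callback (unmuted if none given).
    return bool(custom()) if custom is not None else False
```

    For example, should_mute(MuteStrategy.FIRST_SPEECH, bot_speaking=True, first_speech_done=True) is False, so the user can interrupt every bot utterance except the first.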

  • Added STTMuteFrame, a control frame that enables/disables speech transcription in STT services.

[0.0.48] - 2024-11-10 "Antonio release"

Added

  • There's now an input queue in each frame processor. When you call FrameProcessor.push_frame() this will internally call FrameProcessor.queue_frame() on the next processor (upstream or downstream) and the frame will be internally queued (except system frames). Then, the queued frames will get processed. With this input queue it is also possible for FrameProcessors to block processing more frames by calling FrameProcessor.pause_processing_frames(). The way to resume processing frames is by calling FrameProcessor.resume_processing_frames().
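
    The queueing and pause/resume behavior can be pictured with a toy processor. This is a sketch under simplifying assumptions, not Pipecat's implementation: QueuedProcessor, its system flag and the demo coroutine are hypothetical, and frames are plain strings.

```python
import asyncio


class QueuedProcessor:
    """Toy processor: system frames bypass the queue, others wait in it."""

    def __init__(self):
        self.handled = []
        self._queue = asyncio.Queue()
        self._resume = asyncio.Event()
        self._resume.set()  # processing is enabled by default

    async def queue_frame(self, frame, system=False):
        if system:
            self.handled.append(frame)  # out-of-band: handled immediately
        else:
            await self._queue.put(frame)

    def pause_processing_frames(self):
        self._resume.clear()

    def resume_processing_frames(self):
        self._resume.set()

    async def run(self):
        while True:
            frame = await self._queue.get()
            await self._resume.wait()  # blocks here while paused
            self.handled.append(frame)
            self._queue.task_done()


async def demo():
    proc = QueuedProcessor()
    worker = asyncio.create_task(proc.run())
    proc.pause_processing_frames()
    await proc.queue_frame("text-1")
    await proc.queue_frame("cancel", system=True)
    await asyncio.sleep(0.01)  # give the worker a chance to run
    paused_view = list(proc.handled)  # only the system frame got through
    proc.resume_processing_frames()
    await proc._queue.join()  # wait until the queued frame is handled
    worker.cancel()
    return paused_view, proc.handled
```

    While paused, only the system frame is handled; the queued frame is processed once resume_processing_frames() is called.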

  • Added audio filter NoisereduceFilter.

  • Introduce input transport audio filters (BaseAudioFilter). Audio filters can be used to remove background noises before audio is sent to VAD.

  • Introduce output transport audio mixers (BaseAudioMixer). Output transport audio mixers can be used, for example, to add background sounds or any other audio mixing functionality before the output audio is actually written to the transport.

  • Added GatedOpenAILLMContextAggregator. This aggregator keeps the last received OpenAI LLM context frame and it doesn't let it through until the notifier is notified.

  • Added WakeNotifierFilter. This processor expects a list of frame types and will execute a given callback predicate when a frame of any of those types is being processed. If the callback returns true the notifier will be notified.

  • Added NullFilter. A null filter doesn't push any frames upstream or downstream. This is usually used to disable one of the pipelines in ParallelPipeline.

  • Added EventNotifier. This can be used as a very simple synchronization feature between processors.
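
    The notifier and the gated aggregator fit together roughly as follows. This is a hedged sketch of the idea only: Notifier and Gate are hypothetical classes built on asyncio.Event, not the actual EventNotifier or GatedOpenAILLMContextAggregator.

```python
import asyncio


class Notifier:  # hypothetical, mirrors the idea behind EventNotifier
    def __init__(self):
        self._event = asyncio.Event()

    async def notify(self):
        self._event.set()

    async def wait(self):
        await self._event.wait()
        self._event.clear()  # re-arm for the next notification


class Gate:
    """Keeps only the latest item and releases it once notified."""

    def __init__(self, notifier):
        self._notifier = notifier
        self._last = None

    def hold(self, item):
        self._last = item

    async def release(self):
        await self._notifier.wait()
        item, self._last = self._last, None
        return item


async def demo():
    notifier = Notifier()
    gate = Gate(notifier)
    gate.hold("context-1")
    gate.hold("context-2")  # replaces context-1
    waiter = asyncio.create_task(gate.release())
    await asyncio.sleep(0)  # let the waiter start and block
    still_blocked = not waiter.done()
    await notifier.notify()
    return still_blocked, await waiter
```

    Only the most recent context is released, and only after another processor calls notify().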

  • Added TavusVideoService. This is an integration for Tavus digital twins. (see https://www.tavus.io/)

  • Added DailyTransport.update_subscriptions(). This gives you fine-grained control over which media subscriptions you want for each participant in a room.

  • Added audio filter KrispFilter.

Changed

  • The following DailyTransport functions are now async which means they need to be awaited: start_dialout, stop_dialout, start_recording, stop_recording, capture_participant_transcription and capture_participant_video.

  • Changed default output sample rate to 24000. This changes all TTS services to output at 24000 and also changes the default output transport sample rate. This improves audio quality at the cost of some extra bandwidth.
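
    To get a feel for the extra bandwidth, the raw PCM data rate can be estimated as below. The numbers assume 16-bit mono PCM and a previous default of 16000 Hz; both are assumptions for illustration, and transports that encode audio will use less than the raw figures.

```python
def pcm_bytes_per_second(sample_rate: int, channels: int = 1, sample_width: int = 2) -> int:
    """Raw (unencoded) PCM data rate in bytes per second."""
    return sample_rate * channels * sample_width


old = pcm_bytes_per_second(16000)  # 32000 bytes/s (assumed previous default)
new = pcm_bytes_per_second(24000)  # 48000 bytes/s
increase = new / old  # 1.5, i.e. 50% more raw audio data
```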

  • AzureTTSService now uses Azure websockets instead of HTTP requests.

  • The previous AzureTTSService HTTP implementation is now AzureHttpTTSService.

Fixed

  • Websocket transports (FastAPI and Websocket) now synchronize with time before sending data. This allows for interruptions to just work out of the box.

  • Improved bot speaking detection for all TTS services by using actual bot audio.

  • Fixed an issue that was generating constant bot started/stopped speaking frames for HTTP TTS services.

  • Fixed an issue that was causing stuttering with AWS TTS service.

  • Fixed an issue with PlayHTTTSService, where the TTFB metrics were reporting very small time values.

  • Fixed an issue where AzureTTSService wasn't initializing the specified language.

Other

  • Add 23-bot-background-sound.py foundational example.

  • Added a new foundational example 22-natural-conversation.py. This example shows how to achieve a more natural conversation by detecting when the user ends a statement.

[0.0.47] - 2024-10-22

Added

  • Added AssemblyAISTTService and corresponding foundational examples 07o-interruptible-assemblyai.py and 13d-assemblyai-transcription.py.

  • Added a foundational example for Gladia transcription: 13c-gladia-transcription.py

Changed

  • Updated GladiaSTTService to use the V2 API.

  • Changed DailyTransport transcription model to nova-2-general.

Fixed

  • Fixed an issue that would cause an import error when importing SileroVADAnalyzer from the old package pipecat.vad.silero.

  • Fixed enable_usage_metrics to control LLM/TTS usage metrics separately from enable_metrics.

[0.0.46] - 2024-10-19

Added

  • Added audio_passthrough parameter to STTService. If enabled it allows audio frames to be pushed downstream in case other processors need them.

  • Added input parameter options for PlayHTTTSService and PlayHTHttpTTSService.

Changed

  • Changed DeepgramSTTService model to nova-2-general.

  • Moved SileroVAD audio processor to processors.audio.vad.

  • Module utils.audio is now audio.utils. A new resample_audio function has been added.

  • PlayHTTTSService now uses PlayHT websockets instead of HTTP requests.

  • The previous PlayHTTTSService HTTP implementation is now PlayHTHttpTTSService.

  • PlayHTTTSService and PlayHTHttpTTSService now use a voice_engine of PlayHT3.0-mini, which allows for multi-lingual support.

  • Renamed OpenAILLMServiceRealtimeBeta to OpenAIRealtimeBetaLLMService to match other services.

Deprecated

  • LLMUserResponseAggregator and LLMAssistantResponseAggregator are mostly deprecated, use OpenAILLMContext instead.

  • The vad package is now deprecated and audio.vad should be used instead. The vad package will be removed in a future release.

Fixed

  • Fixed an issue that would cause an error if no VAD analyzer was passed to LiveKitTransport params.

  • Fixed SileroVAD processor to support interruptions properly.

Other

  • Added examples/foundational/07-interruptible-vad.py. This is the same as 07-interruptible.py but using the SileroVAD processor instead of passing the VADAnalyzer in the transport.

[0.0.45] - 2024-10-16

Changed

  • Metrics messages have moved out from the transport's base output into RTVI.

[0.0.44] - 2024-10-15

Added

  • Added support for OpenAI Realtime API with the new OpenAILLMServiceRealtimeBeta processor. (see https://platform.openai.com/docs/guides/realtime/overview)

  • Added RTVIBotTranscriptionProcessor which will send the RTVI bot-transcription protocol message. These are TTS text aggregated (into sentences) messages.

  • Added new input params to the MarkdownTextFilter utility. You can set filter_code to filter code from text and filter_tables to filter tables from text.

  • Added CanonicalMetricsService. This processor uses the new AudioBufferProcessor to capture conversation audio and later send it to Canonical AI. (see https://canonical.chat/)

  • Added AudioBufferProcessor. This processor can be used to buffer mixed user and bot audio. This can later be saved into an audio file or processed by some audio analyzer.

  • Added on_first_participant_joined event to LiveKitTransport.

Changed

  • LLM text responses are now logged properly as unicode characters.

  • UserStartedSpeakingFrame, UserStoppedSpeakingFrame, BotStartedSpeakingFrame, BotStoppedSpeakingFrame, BotSpeakingFrame and UserImageRequestFrame are now based on SystemFrame.

Fixed

  • Merge RTVIBotLLMProcessor/RTVIBotLLMTextProcessor and RTVIBotTTSProcessor/RTVIBotTTSTextProcessor to avoid out of order issues.

  • Fixed an issue in RTVI protocol that could cause a bot-llm-stopped or bot-tts-stopped message to be sent before a bot-llm-text or bot-tts-text message.

  • Fixed DeepgramSTTService constructor settings not being merged with default ones.

  • Fixed an issue in Daily transport that would cause tasks to be hanging if urgent transport messages were being sent from a transport event handler.

  • Fixed an issue in BaseOutputTransport that would cause EndFrame to be pushed down too early and call FrameProcessor.cleanup() before letting the transport stop properly.

[0.0.43] - 2024-10-10

Added

  • Added a new util called MarkdownTextFilter which is a subclass of a new base class called BaseTextFilter. This is a configurable utility which is intended to filter text received by TTS services.

  • Added new RTVIUserLLMTextProcessor. This processor will send an RTVI user-llm-text message with the user content that was sent to the LLM.

Changed

  • TransportMessageFrame doesn't have an urgent field anymore. Instead, there's now a TransportMessageUrgentFrame, which is a SystemFrame and therefore skips all internal queuing.

  • For TTS services, convert inputted languages to match each service's language format

Fixed

  • Fixed an issue where changing a language with the Deepgram STT service wouldn't apply the change. This was fixed by disconnecting and reconnecting when the language changes.

[0.0.42] - 2024-10-02

Added

  • SentryMetrics has been added to report frame processor metrics to Sentry. This is now possible because FrameProcessorMetrics can now be passed to FrameProcessor.

  • Added Google TTS service and corresponding foundational example 07n-interruptible-google.py

  • Added AWS Polly TTS support and 07m-interruptible-aws.py as an example.

  • Added InputParams to Azure TTS service.

  • Added LivekitTransport (audio-only for now).

  • RTVI 0.2.0 is now supported.

  • All FrameProcessors can now register event handlers.

tts = SomeTTSService(...)

@tts.event_handler("on_connected")
async def on_connected(processor):
  ...
  • Added AsyncGeneratorProcessor. This processor can be used together with a FrameSerializer as an async generator. It provides a generator() function that returns an AsyncGenerator and that yields serialized frames.

  • Added EndTaskFrame and CancelTaskFrame. These are new frames that are meant to be pushed upstream to tell the pipeline task to stop nicely or immediately respectively.

  • Added configurable LLM parameters (e.g., temperature, top_p, max_tokens, seed) for OpenAI, Anthropic, and Together AI services along with corresponding setter functions.

  • Added sample_rate as a constructor parameter for TTS services.

  • Pipecat has a pipeline-based architecture. The pipeline consists of frame processors linked to each other. The elements traveling across the pipeline are called frames.

    To have a deterministic behavior the frames traveling through the pipeline should always be ordered, except system frames which are out-of-band frames. To achieve that, each frame processor should only output frames from a single task.

    In this version all the frame processors have their own task to push frames. That is, when push_frame() is called the given frame will be put into an internal queue (with the exception of system frames) and a frame processor task will push it out.

  • Added pipeline clocks. A pipeline clock is used by the output transport to know when a frame needs to be presented. For that, all frames now have an optional pts field (presentation timestamp). There's currently just one clock implementation, SystemClock, and the pts field is currently only used for TextFrames (audio and image frames will be next).

  • A clock can now be specified to PipelineTask (defaults to SystemClock). This clock will be passed to each frame processor via the StartFrame.
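
    The relationship between a clock and presentation timestamps can be sketched with a small priority queue. Everything here is illustrative: MonotonicClock and due_frames are hypothetical names, and nanosecond pts units are an assumption rather than something stated above.

```python
import heapq
import time


class MonotonicClock:  # hypothetical stand-in for a pipeline clock
    def __init__(self):
        self._start = None

    def start(self):
        self._start = time.monotonic_ns()

    def get_time(self) -> int:
        """Nanoseconds elapsed since start()."""
        return time.monotonic_ns() - self._start


def due_frames(pending, now_ns):
    """Pop every (pts, frame) pair whose pts is at or before now_ns."""
    ready = []
    while pending and pending[0][0] <= now_ns:
        ready.append(heapq.heappop(pending)[1])
    return ready


# Frames may arrive out of order; the heap keeps them sorted by pts.
pending = []
for pts, text in [(300, "world"), (100, "hello"), (900, "later")]:
    heapq.heappush(pending, (pts, text))
```

    An output transport following this idea would call due_frames(pending, clock.get_time()) periodically and present whatever comes back, in pts order.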

  • Added CartesiaHttpTTSService.

  • DailyTransport now supports setting the audio bitrate to improve audio quality through the DailyParams.audio_out_bitrate parameter. The new default is 96kbps.

  • DailyTransport now uses the number of audio output channels (1 or 2) to set mono or stereo audio when needed.

  • Interruptions support has been added to TwilioFrameSerializer when using FastAPIWebsocketTransport.

  • Added new LmntTTSService text-to-speech service. (see https://www.lmnt.com/)

  • Added TTSModelUpdateFrame, TTSLanguageUpdateFrame, STTModelUpdateFrame, and STTLanguageUpdateFrame frames to allow you to switch models, language and voices in TTS and STT services.

  • Added new transcriptions.Language enum.

Changed

  • Context frames are now pushed downstream from assistant context aggregators.

  • Removed Silero VAD torch dependency.

  • Updated individual update settings frame classes into a single ServiceUpdateSettingsFrame class.

  • We now distinguish between input and output audio and image frames. We introduce InputAudioRawFrame, OutputAudioRawFrame, InputImageRawFrame and OutputImageRawFrame (and other subclasses of those). The input frames usually come from an input transport and are meant to be processed inside the pipeline to generate new frames. However, the input frames will not be sent through an output transport. The output frames can also be processed by any frame processor in the pipeline and they are allowed to be sent by the output transport.

  • ParallelTask has been renamed to SyncParallelPipeline. A SyncParallelPipeline is a frame processor that contains a list of different pipelines to be executed concurrently. The difference between a SyncParallelPipeline and a ParallelPipeline is that, given an input frame, the SyncParallelPipeline will wait for all the internal pipelines to complete. This is achieved by making sure the last processor in each of the pipelines is synchronous (e.g. an HTTP-based service that waits for the response).
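
    The "wait for all internal pipelines" behavior is similar in spirit to asyncio.gather. The sketch below only illustrates that synchronization idea; run_branch and sync_parallel are hypothetical helpers, and each branch is reduced to a single awaitable.

```python
import asyncio


async def run_branch(name, delay, frame):
    await asyncio.sleep(delay)  # stands in for e.g. an HTTP round trip
    return f"{name}({frame})"


async def sync_parallel(frame, branches):
    """Run all branches on one input and return only when all are done."""
    return await asyncio.gather(*(run_branch(n, d, frame) for n, d in branches))


# The slower "tts" branch still comes first: gather preserves argument order.
results = asyncio.run(sync_parallel("hi", [("tts", 0.02), ("image", 0.01)]))
```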

  • StartFrame is a system frame again to make sure it's processed immediately by all processors. EndFrame stays a control frame since it needs to be ordered, allowing the frames in the pipeline to be processed.

  • Updated MoondreamService revision to 2024-08-26.

  • CartesiaTTSService and ElevenLabsTTSService now add presentation timestamps to their text output. This allows the output transport to push the text frames downstream at almost the same time the words are spoken. We say "almost" because currently the audio frames don't have presentation timestamp but they should be played at roughly the same time.

  • DailyTransport.on_joined event now returns the full session data instead of just the participant.

  • CartesiaTTSService is now a subclass of TTSService.

  • DeepgramSTTService is now a subclass of STTService.

  • WhisperSTTService is now a subclass of SegmentedSTTService. A SegmentedSTTService is an STTService where the provided audio is given in a big chunk (i.e. from when the user starts speaking until the user stops speaking) instead of a continuous stream.

Fixed

  • Fixed OpenAI multiple function calls.

  • Fixed a Cartesia TTS issue that would cause audio to be truncated in some cases.

  • Fixed a BaseOutputTransport issue that would stop audio and video rendering tasks (after receiving an EndFrame) before the internal queue was emptied, causing the pipeline to finish prematurely.

  • StartFrame should be the first frame every processor receives to avoid situations where things are not initialized (because initialization happens on StartFrame) and other frames come in resulting in undesired behavior.

Performance

  • obj_id() and obj_count() now use itertools.count avoiding the need of threading.Lock.
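
    A lock-free counter along these lines might look like the sketch below. It is an approximation of the idea, not necessarily the exact Pipecat code, and it relies on next() on an itertools.count being a single C-level call in CPython.

```python
import itertools
from collections import defaultdict

_ids = itertools.count()
# One independent counter per class name, created on first use.
_counts = defaultdict(itertools.count)


def obj_id() -> int:
    """Return a process-wide unique, increasing id."""
    return next(_ids)


def obj_count(obj) -> int:
    """Return how many instances of obj's class were counted before this one."""
    return next(_counts[obj.__class__.__name__])
```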

Other

[0.0.41] - 2024-08-22

Added

  • Added LivekitFrameSerializer audio frame serializer.

Fixed

  • Fix FastAPIWebsocketOutputTransport variable name clash with subclass.

  • Fix an AnthropicLLMService issue with empty arguments in function calling.

Other

  • Fixed studypal example errors.

[0.0.40] - 2024-08-20

Added

  • VAD parameters can now be dynamically updated using the VADParamsUpdateFrame.

  • ErrorFrame now has a fatal field to indicate the bot should exit if a fatal error is pushed upstream (false by default). A new FatalErrorFrame that sets this flag to true has been added.
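
    The relationship between the two frames can be pictured with plain dataclasses. This is a simplified, hypothetical shape for illustration; the real frame classes carry more fields and should_exit is not a Pipecat function.

```python
from dataclasses import dataclass


@dataclass
class ErrorFrame:  # simplified, hypothetical shape
    error: str
    fatal: bool = False


@dataclass
class FatalErrorFrame(ErrorFrame):
    fatal: bool = True  # same field, different default


def should_exit(frame: ErrorFrame) -> bool:
    """A task receiving an error upstream would stop only on fatal errors."""
    return frame.fatal
```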

  • AnthropicLLMService now supports function calling and initial support for prompt caching. (see https://www.anthropic.com/news/prompt-caching)

  • ElevenLabsTTSService can now specify ElevenLabs input parameters such as output_format.

  • TwilioFrameSerializer can now specify Twilio's and Pipecat's desired sample rates to use.

  • Added new on_participant_updated event to DailyTransport.

  • Added DailyRESTHelper.delete_room_by_name() and DailyRESTHelper.delete_room_by_url().

  • Added LLM and TTS usage metrics. Those are enabled when PipelineParams.enable_usage_metrics is True.

  • AudioRawFrames are now pushed downstream from the base output transport. This allows capturing the exact words the bot says by adding an STT service at the end of the pipeline.

  • Added new GStreamerPipelineSource. This processor can generate image or audio frames from a GStreamer pipeline (e.g. reading an MP4 file, an RTP stream or anything supported by GStreamer).

  • Added TransportParams.audio_out_is_live. This flag is False by default and it is useful to indicate we should not synchronize audio with sporadic images.

  • Added new BotStartedSpeakingFrame and BotStoppedSpeakingFrame control frames. These frames are pushed upstream and they should wrap BotSpeakingFrame.

  • Transports now allow you to register event handlers without decorators.
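
    Decorator-based and direct registration can share one code path, as in this sketch. EventSource is a hypothetical registry written for illustration, with synchronous handlers for brevity; it is not the transport implementation.

```python
class EventSource:  # hypothetical registry, not Pipecat's implementation
    def __init__(self):
        self._handlers = {}

    def add_event_handler(self, name, handler):
        """Register a handler directly, without a decorator."""
        self._handlers.setdefault(name, []).append(handler)

    def event_handler(self, name):
        """Decorator form; delegates to add_event_handler."""
        def decorator(handler):
            self.add_event_handler(name, handler)
            return handler
        return decorator

    def emit(self, name, *args):
        return [handler(*args) for handler in self._handlers.get(name, [])]


source = EventSource()


@source.event_handler("on_joined")
def greet(participant):
    return f"hello {participant}"


# Same event, registered without a decorator.
source.add_event_handler("on_joined", lambda participant: participant.upper())
```

    Direct registration is handy when the handler is a bound method or is chosen at runtime.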

Changed

  • Support RTVI message protocol 0.1. This includes new messages, support for messages responses, support for actions, configuration, webhooks and a bunch of new cool stuff. (see https://docs.rtvi.ai/)

  • SileroVAD dependency is now imported via pip's silero-vad package.

  • ElevenLabsTTSService now uses eleven_turbo_v2_5 model by default.

  • BotSpeakingFrame is now a control frame.

  • StartFrame is now a control frame similar to EndFrame.

  • DeepgramTTSService now is more customizable. You can adjust the encoding and sample rate.

Fixed

  • TTSStartFrame and TTSStopFrame are now sent when TTS really starts and stops. This allows for knowing when the bot starts and stops speaking even with asynchronous services (like Cartesia).

  • Fixed AzureSTTService transcription frame timestamps.

  • Fixed an issue with DailyRESTHelper.create_room() expirations which would cause this function to stop working after the initial expiration elapsed.

  • Improved EndFrame and CancelFrame handling. EndFrame should end things gracefully while a CancelFrame should cancel all running tasks as soon as possible.

  • Fixed an issue in AIService that would cause a yielded None value to be processed.

  • RTVI's bot-ready message is now sent when the RTVI pipeline is ready and a first participant joins.

  • Fixed a BaseInputTransport issue that was causing incoming system frames to be queued instead of being pushed immediately.

  • Fixed a BaseInputTransport issue that was causing start/stop interruptions incoming frames to not cancel tasks and be processed properly.

Other

  • Added studypal example (from the Cartesia folks!).

  • Most examples now use Cartesia.

  • Added examples foundational/19a-tools-anthropic.py, foundational/19b-tools-video-anthropic.py and foundational/19a-tools-togetherai.py.

  • Added examples foundational/18-gstreamer-filesrc.py and foundational/18a-gstreamer-videotestsrc.py that show how to use GStreamerPipelineSource

  • Remove requests library usage.

  • Cleanup examples and use DailyRESTHelper.

[0.0.39] - 2024-07-23

Fixed

  • Fixed a regression introduced in 0.0.38 that would cause Daily transcription to stop the Pipeline.

[0.0.38] - 2024-07-23

Added

  • Added force_reload, skip_validation and trust_repo to SileroVAD and SileroVADAnalyzer. This allows caching and various GitHub repo validations.

  • Added send_initial_empty_metrics flag to PipelineParams to request for initial empty metrics (zero values). True by default.

Fixed

  • Fixed initial metrics format. It was using the wrong keys name/time instead of processor/value.

  • STT services should be using ISO 8601 time format for transcription frames.
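
    An ISO 8601 timestamp of this kind can be produced with the standard library alone. The helper name and the choice of UTC with millisecond precision are assumptions for illustration; the services may format timestamps differently.

```python
from datetime import datetime, timezone


def iso8601_now() -> str:
    """Current UTC time as an ISO 8601 string with millisecond precision."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")


# A fixed instant, to show the resulting shape.
example = datetime(2024, 7, 23, 12, 0, 0, tzinfo=timezone.utc).isoformat(
    timespec="milliseconds"
)
```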

  • Fixed an issue that would cause Daily transport to show a stop transcription error when actually none occurred.

[0.0.37] - 2024-07-22

Added

  • Added RTVIProcessor which implements the RTVI-AI standard. See https://github.com/rtvi-ai

  • Added BotInterruptionFrame which allows interrupting the bot while talking.

  • Added LLMMessagesAppendFrame which allows appending messages to the current LLM context.

  • Added LLMMessagesUpdateFrame which allows changing the LLM context for the one provided in this new frame.

  • Added LLMModelUpdateFrame which allows updating the LLM model.

  • Added TTSSpeakFrame which causes the bot to say some text. This text will not be part of the LLM context.

  • Added TTSVoiceUpdateFrame which allows updating the TTS voice.

Removed

  • We remove the LLMResponseStartFrame and LLMResponseEndFrame frames. These were added in the past to properly handle interruptions for the LLMAssistantContextAggregator. But the LLMContextAggregator is now based on LLMResponseAggregator which handles interruptions properly by just processing the StartInterruptionFrame, so there's no need for these extra frames any more.

Fixed

  • Fixed an issue with StatelessTextTransformer where it was pushing a string instead of a TextFrame.

  • TTSService end of sentence detection has been improved. It now works with acronyms, numbers, hours and others.

  • Fixed an issue in TTSService that would not properly flush the current aggregated sentence if an LLMFullResponseEndFrame was found.

Performance

  • CartesiaTTSService now uses websockets, which improves speed. It also leverages the new Cartesia contexts, which maintain generated audio prosody when multiple inputs are sent, therefore improving audio quality a lot.

[0.0.36] - 2024-07-02

Added

  • Added GladiaSTTService. See https://docs.gladia.io/chapters/speech-to-text-api/pages/live-speech-recognition

  • Added XTTSService. This is a local Text-To-Speech service. See https://github.com/coqui-ai/TTS

  • Added UserIdleProcessor. This processor can be used to wait for any interaction with the user. If the user doesn't say anything within a given timeout a provided callback is called.

  • Added IdleFrameProcessor. This processor can be used to wait for frames within a given timeout. If no frame is received within the timeout a provided callback is called.

  • Added new frame BotSpeakingFrame. This frame will be continuously pushed upstream while the bot is talking.

  • It is now possible to specify a Silero VAD version when using SileroVADAnalyzer or SileroVAD.

  • Added AsyncFrameProcessor and AsyncAIService. Some services like DeepgramSTTService need to process things asynchronously. For example, audio is sent to Deepgram but transcriptions are not returned immediately. In these cases we still require all frames (except system frames) to be pushed downstream from a single task. That's what AsyncFrameProcessor is for. It creates a task and all frames should be pushed from that task. So, whenever a new Deepgram transcription is ready that transcription will also be pushed from this internal task.

  • The MetricsFrame now includes processing metrics if metrics are enabled. The processing metrics indicate the time a processor needs to generate all its output. Note that not all processors generate this kind of metrics.

Changed

  • WhisperSTTService model can now also be a string.

  • Added missing * keyword separators in services.

Fixed

  • WebsocketServerTransport doesn't try to send frames anymore if the serializer returns None.

  • Fixed an issue where exceptions that occurred inside frame processors were being swallowed and not displayed.

  • Fixed an issue in FastAPIWebsocketTransport where it would still try to send data to the websocket after being closed.

Other

  • Added Fly.io deployment example in examples/deployment/flyio-example.

  • Added new 17-detect-user-idle.py example that shows how to use the new UserIdleProcessor.
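
The idle-timeout behavior behind UserIdleProcessor and IdleFrameProcessor can be approximated with asyncio.wait_for. The sketch below is an assumption about the general technique, not the code of either processor; IdleWatchdog, notify() and on_idle are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class IdleWatchdog:
    """Hypothetical helper: awaits `callback` whenever `timeout` seconds
    pass without notify() being called."""

    def __init__(self, callback: Callable[[], Awaitable[None]], timeout: float):
        self._callback = callback
        self._timeout = timeout
        self._event = asyncio.Event()
        self._task: Optional[asyncio.Task] = None

    def start(self) -> None:
        # Must be called from inside a running event loop.
        self._task = asyncio.create_task(self._run())

    def notify(self) -> None:
        # Any activity (a frame, user speech, ...) resets the timer.
        self._event.set()

    async def stop(self) -> None:
        if self._task:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass

    async def _run(self) -> None:
        while True:
            try:
                await asyncio.wait_for(self._event.wait(), timeout=self._timeout)
                self._event.clear()  # activity seen, start a new wait
            except asyncio.TimeoutError:
                await self._callback()  # nothing happened within the timeout


async def demo() -> int:
    calls = 0

    async def on_idle() -> None:
        nonlocal calls
        calls += 1

    watchdog = IdleWatchdog(on_idle, timeout=0.1)
    watchdog.start()
    await asyncio.sleep(0.25)  # no activity at all
    await watchdog.stop()
    return calls
```

Calling notify() before the timeout expires restarts the wait, so the callback only runs after a full quiet period.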

[0.0.35] - 2024-06-28

Changed

  • FastAPIWebsocketParams now requires a serializer.

  • TwilioFrameSerializer now requires a streamSid.

Fixed

  • Silero VAD number of frames needs to be 512 for 16000 sample rate or 256 for 8000 sample rate.
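
Both frame counts above correspond to the same window length in time. The short sketch below only restates that arithmetic; SILERO_NUM_FRAMES, silero_num_frames() and window_ms() are hypothetical names, not part of Pipecat or Silero.

```python
# Frame counts taken from the changelog entry above.
SILERO_NUM_FRAMES = {16000: 512, 8000: 256}


def silero_num_frames(sample_rate: int) -> int:
    """Return the number of frames per VAD window for a sample rate."""
    try:
        return SILERO_NUM_FRAMES[sample_rate]
    except KeyError:
        raise ValueError(f"unsupported sample rate: {sample_rate}") from None


def window_ms(sample_rate: int) -> float:
    """Duration of one VAD window in milliseconds."""
    return 1000 * silero_num_frames(sample_rate) / sample_rate
```

For both supported rates the window works out to 32 ms (512 / 16000 s and 256 / 8000 s).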

[0.0.34] - 2024-06-25

Fixed

  • Fixed an issue with asynchronous STT services (Deepgram and Azure) that could cause interruptions to ignore transcriptions.

  • Fixed an issue introduced in 0.0.33 that would cause the LLM to generate shorter output.

[0.0.33] - 2024-06-25

Changed

  • Upgraded to Cartesia's new Python library 1.0.0. CartesiaTTSService now expects a voice ID instead of a voice name (you can get the voice ID from Cartesia's playground). You can also specify the audio sample_rate and encoding instead of the previous output_format.

Fixed

  • Fixed an issue with asynchronous STT services (Deepgram and Azure) that could cause static audio issues and interruptions to not work properly when dealing with multiple LLM sentences.

  • Fixed an issue that could mix new LLM responses with previous ones when handling interruptions.

  • Fixed a Daily transport blocking situation that occurred while reading audio frames after a participant left the room. Needs daily-python >= 0.10.1.

[0.0.32] - 2024-06-22

Added

  • Allow specifying a DeepgramSTTService url which allows using on-prem Deepgram.

  • Added new FastAPIWebsocketTransport. This is a new websocket transport that can be integrated with FastAPI websockets.

  • Added new TwilioFrameSerializer. This is a new serializer that knows how to serialize and deserialize audio frames from Twilio.

  • Added Daily transport event: on_dialout_answered. See https://reference-python.daily.co/api_reference.html#daily.EventHandler

  • Added new AzureSTTService. This allows you to use Azure Speech-To-Text.

Performance

  • Convert BaseInputTransport and BaseOutputTransport to fully use asyncio and remove the use of threads.

Other

  • Added twilio-chatbot. This is an example that shows how to integrate Twilio phone numbers with a Pipecat bot.

  • Updated 07f-interruptible-azure.py to use AzureLLMService, AzureSTTService and AzureTTSService.

[0.0.31] - 2024-06-13

Performance

  • Break long audio frames into 20ms chunks instead of 10ms.
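
Splitting a long audio buffer into fixed-duration chunks is simple byte arithmetic. The sketch below assumes 16-bit PCM audio; chunk_audio() and its parameters are hypothetical and are not Pipecat's transport code.

```python
from typing import List


def chunk_audio(
    audio: bytes, sample_rate: int, num_channels: int = 1, chunk_ms: int = 20
) -> List[bytes]:
    """Split raw PCM audio into chunks of `chunk_ms` milliseconds."""
    bytes_per_sample = 2  # assumption: 16-bit PCM samples
    samples_per_chunk = sample_rate * chunk_ms // 1000
    chunk_size = samples_per_chunk * num_channels * bytes_per_sample
    # The last chunk may be shorter if the audio is not an exact multiple.
    return [audio[i : i + chunk_size] for i in range(0, len(audio), chunk_size)]
```

One second of 16 kHz mono audio (32000 bytes) yields 50 chunks of 640 bytes at 20 ms, compared with 100 chunks of 320 bytes at 10 ms.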

[0.0.30] - 2024-06-13

Added

  • Added report_only_initial_ttfb to PipelineParams. This will make it so only the initial TTFB metrics after the user stops talking are reported.

  • Added OpenPipeLLMService. This service will let you run OpenAI through OpenPipe's SDK.

  • Allow specifying frame processors' name through a new name constructor argument.

  • Added DeepgramSTTService. This service has an ongoing websocket connection. To handle this, it subclasses AIService instead of STTService. The output of this service will be pushed from the same task, except system frames like StartFrame, CancelFrame or StartInterruptionFrame.

Changed

  • FrameSerializer.deserialize() can now return None in case it is not possible to deserialize the given data.

  • daily_rest.DailyRoomProperties now allows extra unknown parameters.

Fixed

  • Fixed an issue where DailyRoomProperties.exp always had the same old timestamp unless set by the user.

  • Fixed a couple of issues with WebsocketServerTransport. It needed to use push_audio_frame() and also VAD was not working properly.

  • Fixed an issue that would cause LLM aggregator to fail with small VADParams.stop_secs values.

  • Fixed an issue where BaseOutputTransport would send longer audio frames preventing interruptions.

Other

  • Added new 07h-interruptible-openpipe.py example. This example shows how to use OpenPipe to run OpenAI LLMs and get the logs stored in OpenPipe.

  • Added new dialin-chatbot example. This example shows how to call the bot using a phone number.

[0.0.29] - 2024-06-07

Added

  • Added a new FunctionFilter. This filter will let you filter frames based on a given function, except system messages which should never be filtered.

  • Added FrameProcessor.can_generate_metrics() method to indicate if a processor can generate metrics. In the future this might get an extra argument to ask for a specific type of metric.

  • Added BasePipeline. All pipeline classes should be based on this class. All subclasses should implement a processors_with_metrics() method that returns a list of all FrameProcessors in the pipeline that can generate metrics.

  • Added enable_metrics to PipelineParams.

  • Added MetricsFrame. The MetricsFrame will report different metrics in the system. Right now, it can report TTFB (Time To First Byte) values for different services, that is, the time between the arrival of a Frame at the processor/service and the moment the first DataFrame is pushed downstream. If metrics are enabled, an initial MetricsFrame with all the services in the pipeline will be sent.

  • Added TTFB metrics and debug logging for TTS services.
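
The TTFB definition above (time from the start of processing until the first output is produced) can be measured around any async generator. This is a minimal sketch under that definition only; fake_tts() and measure_ttfb() are hypothetical stand-ins and do not reflect how MetricsFrame is populated internally.

```python
import asyncio
import time
from typing import AsyncIterator, List, Optional, Tuple


async def fake_tts(text: str) -> AsyncIterator[bytes]:
    """Hypothetical service: waits a little, then yields two audio chunks."""
    await asyncio.sleep(0.05)  # simulated time before the first byte
    yield b"\x00" * 640
    yield b"\x00" * 640


async def measure_ttfb(
    stream: AsyncIterator[bytes],
) -> Tuple[Optional[float], List[bytes]]:
    """Consume `stream` and return (seconds until first chunk, all chunks)."""
    start = time.perf_counter()
    ttfb: Optional[float] = None
    chunks: List[bytes] = []
    async for chunk in stream:
        if ttfb is None:
            ttfb = time.perf_counter() - start  # first output seen
        chunks.append(chunk)
    return ttfb, chunks
```

Running asyncio.run(measure_ttfb(fake_tts("hello"))) returns a TTFB close to the simulated delay, plus the two chunks.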

Changed

  • Moved ParallelTask to pipecat.pipeline.parallel_task.

Fixed

  • Fixed PlayHT TTS service to work properly async.

[0.0.28] - 2024-06-05

Fixed

  • Fixed an issue with SileroVADAnalyzer that would cause memory to keep growing indefinitely.

[0.0.27] - 2024-06-05

Added

  • Added DailyTransport.participants() and DailyTransport.participant_counts().

[0.0.26] - 2024-06-05

Added

  • Added OpenAITTSService.

  • Allow passing output_format and model_id to CartesiaTTSService to change audio sample format and the model to use.

  • Added DailyRESTHelper which helps you create Daily rooms and tokens in an easy way.

  • PipelineTask now has a has_finished() method to indicate if the task has completed. If a task is never run, has_finished() will return False.

  • PipelineRunner now supports SIGTERM. If received, the runner will be cancelled.

Fixed

  • Fixed an issue where BaseInputTransport and BaseOutputTransport were stopping push tasks before pushing EndFrame frames, which could cause bots to get stuck.

  • Fixed an error closing local audio transports.

  • Fixed an issue with Deepgram TTS that was introduced in the previous release.

  • Fixed AnthropicLLMService interruptions. If an interruption occurred, a user message could be appended after the previous user message. Anthropic does not allow that because it requires alternating user and assistant messages.
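
The changelog does not say how the AnthropicLLMService fix was implemented. One common way to keep a message list strictly alternating is to merge consecutive messages that share a role, sketched below. merge_consecutive_roles() is a hypothetical helper and the plain-dict message shape is an assumption.

```python
from typing import Dict, List


def merge_consecutive_roles(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge adjacent messages with the same role so roles alternate."""
    merged: List[Dict[str, str]] = []
    for message in messages:
        if merged and merged[-1]["role"] == message["role"]:
            # Same role twice in a row: fold the text into the previous message.
            merged[-1] = {
                "role": message["role"],
                "content": merged[-1]["content"] + " " + message["content"],
            }
        else:
            merged.append(dict(message))
    return merged
```

With this approach, a user message that arrives right after an interrupted user message extends the previous one instead of creating a second consecutive user turn.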

Performance

  • The BaseInputTransport does not pull audio frames from sub-classes any more. Instead, sub-classes now push audio frames into a queue in the base class. Also, DailyInputTransport now pushes audio frames every 20ms instead of 10ms.

  • Remove redundant camera input thread from DailyInputTransport. This should improve performance a little bit when processing participant videos.

  • Load Cartesia voice on startup.

[0.0.25] - 2024-05-31

Added

  • Added WebsocketServerTransport. This will create a websocket server and will read messages coming from a client. The messages are serialized/deserialized with protobufs. See examples/websocket-server for a detailed example.

  • Added function calling (LLMService.register_function()). This will allow the LLM to call functions you have registered when needed. For example, if you register a function to get the weather in Los Angeles and ask the LLM about the weather in Los Angeles, the LLM will call your function. See https://platform.openai.com/docs/guides/function-calling

  • Added new LangchainProcessor.

  • Added Cartesia TTS support (https://cartesia.ai/)

Fixed

  • Fixed SileroVAD frame processor.

  • Fixed an issue where camera_out_enabled would cause high CPU usage if no image was provided.

Performance

  • Removed unnecessary audio input tasks.

[0.0.24] - 2024-05-29

Added

  • Exposed on_dialin_ready for Daily transport SIP endpoint handling. This notifies when the Daily room SIP endpoints are ready. This allows integrating with third-party services like Twilio.

  • Exposed Daily transport on_app_message event.

  • Added Daily transport on_call_state_updated event.

  • Added Daily transport start_recording(), stop_recording() and stop_dialout().

Changed

  • Added PipelineParams. This replaces the allow_interruptions argument in PipelineTask and will allow adding more parameters in the future.

  • Fixed Deepgram Aura TTS base_url and added ErrorFrame reporting.

  • GoogleLLMService api_key argument is now mandatory.

Fixed

  • Daily transport dialin-ready doesn't block anymore and it now handles timeouts.

  • Fixed AzureLLMService.

[0.0.23] - 2024-05-23

Fixed

  • Fixed an issue handling Daily transport dialin-ready event.

[0.0.22] - 2024-05-23

Added

[0.0.21] - 2024-05-22

Added

  • Added vision support to Anthropic service.

  • Added WakeCheckFilter which allows you to pass information downstream only if you say a certain phrase/word.

Changed

  • FrameSerializer.serialize() and FrameSerializer.deserialize() are now async.

  • Filter has been renamed to FrameFilter and it's now under processors/filters.

Fixed

  • Fixed Anthropic service to use new frame types.

  • Fixed an issue in LLMUserResponseAggregator and UserResponseAggregator that would cause frames after a brief pause to not be pushed to the LLM.

  • Clear the audio output buffer if we are interrupted.

  • Re-add exponential smoothing after volume calculation. This makes sure the volume value being used doesn't fluctuate so much.

[0.0.20] - 2024-05-22

Added

  • In order to improve interruptions we now compute a loudness level using pyloudnorm. The audio coming from WebRTC transports (e.g. Daily) has an Automatic Gain Control (AGC) algorithm applied to the signal; however, we don't do that on our local PyAudio signals. This means that currently incoming audio from PyAudio is kind of broken. We will fix it in future releases.

Fixed

  • Fixed an issue where StartInterruptionFrame would cause LLMUserResponseAggregator to push the accumulated text, causing the LLM to respond in the wrong task. The StartInterruptionFrame should not trigger any new LLM response because that would be spoken in a different task.

  • Fixed an issue where tasks and threads could be paused because the executor didn't have more tasks available. This was causing issues when cancelling and recreating tasks during interruptions.

[0.0.19] - 2024-05-20

Changed

  • LLMUserResponseAggregator and LLMAssistantResponseAggregator internal messages are now exposed through the messages property.

Fixed

  • Fixed an issue where LLMAssistantResponseAggregator was not accumulating the full response but short sentences instead. If there's an interruption during a long response, we also only accumulate what the bot has spoken so far.

[0.0.18] - 2024-05-20

Fixed

  • Fixed an issue in DailyOutputTransport where transport messages were not being sent.

[0.0.17] - 2024-05-19

Added

  • Added google.generativeai model support, including vision. This new google service defaults to using gemini-1.5-flash-latest. Example in examples/foundational/12a-describe-video-gemini-flash.py.

  • Added vision support to openai service. Example in examples/foundational/12a-describe-video-gemini-flash.py.

  • Added initial interruptions support. The assistant contexts (or aggregators) should now be placed after the output transport. This way, only the completed spoken context is added to the assistant context.

  • Added VADParams so you can control voice confidence level and others.

  • VADAnalyzer now uses an exponential smoothed volume to improve speech detection. This is useful when voice confidence is high (because there's someone talking near you) but volume is low.
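
Exponential smoothing of a volume signal, as mentioned for VADAnalyzer above, can be sketched in a few lines. The smoothing factor, the thresholds and the names exp_smoothing(), smooth() and is_speaking() are assumptions chosen for illustration; they are not the values or API Pipecat uses.

```python
from typing import Iterable, List


def exp_smoothing(value: float, prev_value: float, factor: float) -> float:
    """Move `prev_value` toward `value` by a fraction `factor` (0..1)."""
    return prev_value + factor * (value - prev_value)


def smooth(values: Iterable[float], factor: float = 0.5) -> List[float]:
    """Apply exponential smoothing to a sequence, starting from 0.0."""
    smoothed: List[float] = []
    prev = 0.0
    for value in values:
        prev = exp_smoothing(value, prev, factor)
        smoothed.append(prev)
    return smoothed


def is_speaking(
    confidence: float,
    smoothed_volume: float,
    min_confidence: float = 0.7,  # hypothetical threshold
    min_volume: float = 0.6,  # hypothetical threshold
) -> bool:
    """Require both a confident voice detection and enough volume."""
    return confidence >= min_confidence and smoothed_volume >= min_volume
```

Gating on the smoothed volume means a confident detection of a quiet, nearby voice does not count as the user speaking until the volume has stayed high for a few windows.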

Fixed

  • Fixed an issue where TTSService was not pushing TextFrames downstream.

  • Fixed issues with Ctrl-C program termination.

  • Fixed an issue that was causing StopTaskFrame to not actually exit the PipelineTask.

[0.0.16] - 2024-05-16

Fixed

  • DailyTransport: don't publish camera and audio tracks if not enabled.

  • Fixed an issue in BaseInputTransport that was causing frames to be pushed downstream in the wrong order.

[0.0.15] - 2024-05-15

Fixed

  • Quick hot fix for receiving DailyTransportMessage.

[0.0.14] - 2024-05-15

Added

  • Added DailyTransport event on_participant_left.

  • Added support for receiving DailyTransportMessage.

Fixed

  • Images are now resized to the size of the output camera. Previously, a size mismatch was causing images not to be displayed.

  • Fixed an issue in DailyTransport that would not allow the input processor to shut down if no participant ever joined the room.

  • Fixed base transports start and stop. In some situations processors would halt or not shut down properly.

[0.0.13] - 2024-05-14

Changed

  • MoondreamService argument model_id is now model.

  • VADAnalyzer arguments have been renamed for more clarity.

Fixed

  • Fixed an issue with DailyInputTransport and DailyOutputTransport that could cause some threads to not start properly.

  • Fixed STTService. Add max_silence_secs and max_buffer_secs to better handle what's being passed to the STT service. Also add exponential smoothing to the RMS.

  • Fixed WhisperSTTService. Add no_speech_prob to avoid garbage output text.

[0.0.12] - 2024-05-14

Added

  • Added DailyTranscriptionSettings to make it much easier to specify transcription settings (e.g. language).

Other

  • Updated simple-chatbot with Spanish.

  • Add missing dependencies in some of the examples.

[0.0.11] - 2024-05-13

Added

  • Allow stopping pipeline tasks with new StopTaskFrame.

Changed

  • TTS, STT and image generation services now use AsyncGenerator.

Fixed

  • DailyTransport: allow registering for participant transcriptions even if input transport is not initialized yet.

Other

  • Updated storytelling-chatbot.

[0.0.10] - 2024-05-13

Added

  • Added Intel GPU support to MoondreamService.

  • Added support for sending transport messages (e.g. to communicate with an app at the other end of the transport).

  • Added FrameProcessor.push_error() to easily send an ErrorFrame upstream.

Fixed

  • Fixed Azure services (TTS and image generation).

Other

  • Updated simple-chatbot, moondream-chatbot and translation-chatbot examples.

[0.0.9] - 2024-05-12

Changed

Many things have changed in this version. Many of the main ideas such as frames, processors, services and transports are still there but some things have changed a bit.

  • Frames describe the basic units for processing. For example, text, image or audio frames. Or control frames to indicate a user has started or stopped speaking.

  • FrameProcessors process frames (e.g. they convert a TextFrame to an ImageRawFrame) and push new frames downstream or upstream to their linked peers.

  • FrameProcessors can be linked together. The easiest way is to use the Pipeline, which is a container for processors. Linking processors allows frames to travel upstream or downstream easily.

  • Transports are a way to send or receive frames. There can be local transports (e.g. local audio or native apps), network transports (e.g. websocket) or service transports (e.g. https://daily.co).

  • Pipelines are just a processor container for other processors.

  • A PipelineTask knows how to run a pipeline.

  • A PipelineRunner can run one or more tasks and it is also used, for example, to capture Ctrl-C from the user.
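
The relationship between frames, processors and pipelines described above can be pictured with a small toy model. Nothing below is Pipecat code: ToyTextFrame, ToyProcessor, ToyPipeline and the other names are hypothetical, and only the downstream direction is modeled.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTextFrame:
    text: str


class ToyProcessor:
    """Processes a frame and pushes the result to the next linked processor."""

    def __init__(self) -> None:
        self._next: Optional["ToyProcessor"] = None

    def link(self, processor: "ToyProcessor") -> None:
        self._next = processor

    async def process_frame(self, frame: ToyTextFrame) -> None:
        await self.push_frame(frame)  # default: pass the frame through

    async def push_frame(self, frame: ToyTextFrame) -> None:
        if self._next:
            await self._next.process_frame(frame)


class Uppercase(ToyProcessor):
    async def process_frame(self, frame: ToyTextFrame) -> None:
        await self.push_frame(ToyTextFrame(frame.text.upper()))


class Collector(ToyProcessor):
    def __init__(self) -> None:
        super().__init__()
        self.frames: List[ToyTextFrame] = []

    async def process_frame(self, frame: ToyTextFrame) -> None:
        self.frames.append(frame)
        await self.push_frame(frame)


class ToyPipeline(ToyProcessor):
    """A processor that contains, and links together, other processors."""

    def __init__(self, processors: List[ToyProcessor]) -> None:
        super().__init__()
        self._processors = processors
        for current, following in zip(processors, processors[1:]):
            current.link(following)

    async def process_frame(self, frame: ToyTextFrame) -> None:
        await self._processors[0].process_frame(frame)


async def demo() -> List[ToyTextFrame]:
    collector = Collector()
    pipeline = ToyPipeline([Uppercase(), collector])
    await pipeline.process_frame(ToyTextFrame("hello"))
    return collector.frames
```

In this toy model the pipeline is itself a processor, which mirrors the statement that a pipeline is just a processor container for other processors.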

[0.0.8] - 2024-04-11

Added

  • Added FireworksLLMService.

  • Added InterimTranscriptionFrame and enable interim results in DailyTransport transcriptions.

Changed

  • FalImageGenService now uses new fal_client package.

Fixed

  • FalImageGenService: use asyncio.to_thread to not block main loop when generating images.

  • Allow TranscriptionFrame after an end frame (transcriptions can be delayed and received after UserStoppedSpeakingFrame).

[0.0.7] - 2024-04-10

Added

  • Add use_cpu argument to MoondreamService.

[0.0.6] - 2024-04-10

Added

  • Added FalImageGenService.InputParams.

  • Added URLImageFrame and UserImageFrame.

  • Added UserImageRequestFrame and allow requesting an image from a participant.

  • Added base VisionService and MoondreamService.

Changed

  • Don't pass image_size to ImageGenService, images should have their own size.

  • ImageFrame now receives a tuple(width,height) to specify the size.

  • on_first_other_participant_joined now gets a participant argument.

Fixed

  • Check if camera, speaker and microphone are enabled before writing to them.

Performance

  • DailyTransport now only subscribes to the desired participant's video track.

[0.0.5] - 2024-04-06

Changed

  • Use camera_bitrate and camera_framerate.

  • Increase camera_framerate to 30 by default.

Fixed

  • Fixed LocalTransport.read_audio_frames.

[0.0.4] - 2024-04-04

Added

  • Added project optional dependencies [silero,openai,...].

Changed

  • Moved transports to their own directory.

  • Use OPENAI_API_KEY instead of OPENAI_CHATGPT_API_KEY.

Fixed

  • Don't write to microphone/speaker if not enabled.

Other

  • Added live translation example.

  • Fix foundational examples.

[0.0.3] - 2024-03-13

Other

  • Added storybot and chatbot examples.

[0.0.2] - 2024-03-12

Initial public release.