pipecat

Author	SHA1	Message	Date
Paul Kompfner	bff741a647	Migrate realtime examples to RealtimeServiceModeConfig Pass realtime_service_mode=RealtimeServiceModeConfig() through every realtime LLM service example (base, async-tool, video, text-output, persistent-context, update-settings, MCP) so context aggregation uses the new realtime-mode semantics instead of relying on local VAD as a workaround. Where examples previously wired SileroVADAnalyzer into LLMUserAggregatorParams to coax turn frames out of services that don't emit them server-side (AWS Nova Sonic, Ultravox, Gemini Live), the local VAD is now removed. realtime_service_mode keeps context writes correct without it, and the Phase 1.5 server-side InterruptionFrame fixes for Nova Sonic and Ultravox keep the bot from talking past the user when they barge in. Transcript-logging event handlers move from on_user_turn_stopped / on_assistant_turn_stopped to on_user_message_added / on_assistant_message_added, which carry the finalized text in realtime mode (the turn-stopped events fire before the message is finalized, so their `content` is None in that mode). For services that don't emit user-turn frames (Gemini Live, AWS Nova Sonic, Ultravox) the example now carries a Tier 1 comment block that spells out which downstream processors won't activate, how to add local VAD if needed, and the caveat that locally-generated turn boundaries are a heuristic that may diverge from server-side ground truth. Adds examples/realtime/realtime-openai-local-vad.py, a new variant of the OpenAI Realtime example that disables OpenAI's server-side turn detection and drives turn boundaries locally — useful when you want a turn analyzer like LocalSmartTurnV3 to decide when the user is done speaking. Server-emitted turn frames are still preferred when available. The Gemini Live local-VAD variant already existed; it's been updated in place rather than rewritten.	2026-05-21 11:25:29 -04:00
Paul Kompfner	20d9bf4af6	Document user-turn-frame behavior in realtime service docstrings Each realtime LLM service docstring now states whether the service emits UserStartedSpeakingFrame / UserStoppedSpeakingFrame from server-side turn signals, and what that implies for the rest of the pipeline. For the services that don't (Gemini Live, AWS Nova Sonic, Ultravox), the docstring spells out which downstream processors won't activate (RTVI client speech events, TurnTrackingObserver, AudioBufferProcessor turn recording, UserIdleController, user mute strategies, voicemail detector), points at realtime_service_mode for correct context-write semantics, and notes the option of wiring local VAD plus the caveat that locally-generated turn boundaries are a heuristic that may not match the provider's server-side turn decisions. For the services that do (OpenAI Realtime, Inworld, Grok/xAI Realtime), the docstring confirms turn frames are emitted from server VAD and points at realtime_service_mode.	2026-05-21 11:25:29 -04:00
Paul Kompfner	a00211627f	Surface server-side interruption from Nova Sonic and Ultravox BaseOutputTransport only clears buffered audio mid-playback on InterruptionFrame. Realtime services stream audio downstream as fast as they produce it, and playback necessarily trails the buffer — so when the user interrupts, the bot keeps talking past the interruption unless the service surfaces the interruption to the pipeline. Two realtime services were missing this signal: - AWS Nova Sonic acknowledged the INTERRUPTED stop reason internally (closing its own response state) but never broadcast InterruptionFrame. - Ultravox's playback_clear_buffer message — the server's explicit "drop buffered output audio" signal for interruptions — was not handled at all. In both cases the latent bug was masked by enabling local VAD on the user aggregator, which produced UserStartedSpeakingFrame and triggered the aggregator-side interruption path. The realtime context aggregator work makes local VAD optional, so the underlying gap needs fixing first. Wire broadcast_interruption() into both services on the server-side interruption signal, firing before the response-end signal so the assistant aggregator marks the message interrupted=True before LLMFullResponseEndFrame closes the turn.	2026-05-21 11:25:29 -04:00
Paul Kompfner	11d7fcf174	Add changelog fragments for realtime service mode Fragments use the +<name> prefix so they show up under "Unreleased" without a PR-number suffix; rename to <PR#>.<type>.md before merge.	2026-05-21 11:25:29 -04:00
Paul Kompfner	1fe8cf5289	Add RealtimeServiceModeConfig to LLMContextAggregatorPair Decouple context management from turn frames and transcripts when a realtime LLM service drives the conversation. Three problems with today's behavior: - Some realtime services (Gemini Live, AWS Nova Sonic, Ultravox) emit no UserStarted/StoppedSpeakingFrame at all, so the aggregator, which writes user messages on those frames, doesn't write to context correctly without them. - The workaround (local VAD on the aggregator) generates turn boundaries that don't match the provider's server-side ground truth, and the per-service "do I need it?" rule is hard to keep straight. - When local turn detection is the intended driver, turn-end strategies still wait for transcripts on the latency critical path. Add a realtime_service_mode: RealtimeServiceModeConfig | None = None kwarg on LLMContextAggregatorPair. When set, the pair switches both halves to trailing context writes: user messages are flushed on the first assistant content frame, assistant messages on the next user transcript, both halves on EndFrame. Turn-end strategies stop waiting for transcripts by default. Two fine-grained boolean fields (context_writes_await_turns, turns_await_transcripts) let callers dial back to cascade-style behavior selectively; their invalid combination is rejected in __post_init__. The bifurcation is dispatch-only: seven branch points across the two halves, each at method entry, each delegating to a mode-pure private method. Cross-half coordination uses an asyncio.Lock and a back-reference shared by both halves; the assistant signals user.flush() on LLMFullResponseStartFrame, and the user signals assistant.flush() on the first new transcript after the assistant turn. The mechanism reuses the existing push_aggregation(), with no parallel write path. Two new events fire when messages are flushed to context: on_user_message_added and on_assistant_message_added. In cascade mode they coincide with the existing turn-stopped events; in realtime mode (where the turn-stopped event fires before the message is finalized) they're the canonical way to subscribe to "context just updated, here's the text." UserTurnStoppedMessage.content is now typed str | None to reflect that realtime mode fires the event with None. When a RealtimeServiceMetadataFrame arrives and realtime_service_mode is None, the aggregator logs a one-time INFO recommendation pointing users at the option.	2026-05-21 11:25:29 -04:00
Paul Kompfner	3247fd1188	Mark realtime LLM services with RealtimeServiceInfo + emit metadata at start Realtime (speech-to-speech) LLM services need to advertise themselves to the rest of the pipeline so downstream components can adapt. Add a new RealtimeServiceMetadataFrame subtype of ServiceMetadataFrame, following the STTMetadataFrame precedent. LLMService gains a single ClassVar, _realtime_service_info, typed RealtimeServiceInfo | None and defaulting to None. The presence of a populated instance is what marks a service as realtime, and the RealtimeServiceInfo dataclass carries the per-service knobs the rest of the pipeline needs (currently just emits_user_turn_frames). Keeping it all under one optional ClassVar avoids stranding realtime-only knobs on the generic LLMService surface; non-realtime services keep the default None and the realtime-specific machinery stays inert. When _realtime_service_info is set, the base service auto-broadcasts RealtimeServiceMetadataFrame right after StartFrame propagates downstream (same ordering as STT). When emits_user_turn_frames is False, a one-time INFO log at start explains which pipeline processors depend on those frames (RTVI client speech events, TurnTrackingObserver, AudioBufferProcessor turn recording, UserIdleController, user mute strategies, voicemail detector) and how to add local VAD if needed. Set the ClassVar on the seven realtime services: OpenAI Realtime, Azure Realtime (via inheritance), Inworld, Grok/xAI Realtime all emit user-turn frames; Gemini Live (and Gemini Live Vertex via inheritance), AWS Nova Sonic, Ultravox do not. In a follow-up commit, LLMContextAggregatorPair will consume RealtimeServiceMetadataFrame to surface a one-time recommendation when realtime_service_mode is not configured.	2026-05-20 15:08:40 -04:00
Paul Kompfner	9f0a60b995	Add wait_for_transcript flag on user-turn stop strategies SpeechTimeoutUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy both gate end-of-turn on a transcript arriving. That's the right default for cascade STT/LLM/TTS pipelines, but it puts transcripts on the latency critical path in pipelines where local turn detection is the intended driver of end-of-turn — typically realtime LLM services consuming audio directly. Closed PR #4480 explored this same fix in isolation. Add wait_for_transcript: bool = True to both strategies. False makes the strategy signal end-of-turn as soon as VAD / the turn analyzer reports end-of-speech, independent of transcripts. The default preserves existing behavior. LLMContextAggregatorPair will flip this in realtime mode in a follow-up commit.	2026-05-20 14:07:58 -04:00
Mark Backman	709a0ce839	Merge pull request #4527 from pipecat-ai/mb/fix-elevenlabs-keepalive-1008 Fix ElevenLabs keepalive racing context-init (1008 disconnects)	2026-05-20 11:21:17 -04:00
Mark Backman	be93350eae	Merge pull request #4522 from pipecat-ai/mb/stt-latency-smallest Add P99 latency for Smallest AI, Mistral, XAI STT	2026-05-20 11:21:00 -04:00
Mark Backman	4a96ab7073	Merge pull request #4524 from pipecat-ai/mb/fix-runner-imports Improve runner optional transport handling	2026-05-20 11:16:16 -04:00
Filipi da Silva Fuchter	bca337f97e	Merge pull request #4380 from pipecat-ai/filipi/smart_text Smart Text Handling	2026-05-20 10:18:30 -03:00
filipi87	5d9e8c5ac5	Removing debug log.	2026-05-20 10:13:46 -03:00
Mark Backman	70773bce0a	Add changelog for PR #4527	2026-05-20 09:08:47 -04:00
filipi87	8bdb49bd1a	chore: add changelogs for word-timestamp and frame-ordering fixes	2026-05-20 10:03:30 -03:00
filipi87	81bb81c1d0	test: add automated tests for word tracking, frame sequencing, and Cartesia TTS Adds tests for AggregatedFrameSequencer, WordCompletionTracker, and word_timestamp_utils (including CJK language scenarios). Updates existing Cartesia TTS and TTS frame ordering tests to cover the new behaviours.	2026-05-20 10:03:26 -03:00
filipi87	e1bdee598c	fix: preserve raw_text through TTS pipeline for correct LLM context attribution TTSTextFrame entries were losing their original text structure when word timestamps were enabled. AggregatedTextFrame now carries a raw_text field with the original LLM-produced text (including pattern delimiters such as <card>...</card>). The assistant context receives properly-tagged content rather than the cleaned words returned by the TTS provider. Also handles words that straddle two sentence boundaries by splitting and attributing each part to its correct source frame.	2026-05-20 10:03:21 -03:00
filipi87	185a89bb3b	fix: strip Cartesia SSML tags from word timestamp entries SSML markup (e.g. <spell>, <emotion>, <break>) was leaking into word entries returned by the Cartesia word-timestamps API. Tags are now stripped before processing so word-to-text attribution remains accurate when SSML is present in the TTS input.	2026-05-20 10:03:15 -03:00
filipi87	6b9deefbe3	fix: preserve frame insertion order in BaseOutputTransport for equal PTS values Frames sharing the same presentation timestamp were being reordered by the priority queue. Adds a monotonic counter as a tiebreaker so frames with equal PTS are always emitted in insertion order, preventing subtle audio/text sequencing bugs.	2026-05-20 10:03:08 -03:00
filipi87	deefc32faf	fix: hold skipped TTS frames in position until preceding spoken frames complete Skipped frames (e.g. code blocks filtered via skip_aggregator_types) were emitted to the assistant context immediately instead of waiting for preceding spoken frames to finish. Introduces AggregatedFrameSequencer to hold each frame's slot and flush only after all earlier spoken sentences are complete, keeping context ordering correct.	2026-05-20 10:03:03 -03:00
Mark Backman	a5e6886b80	Fix ElevenLabs keepalive racing context-init (1008 disconnects) The keepalive could fire for a new turn's context before that context's voice_settings context-init was sent, making the keepalive the context's first message (no voice_settings) and causing ElevenLabs to reject the later init with a 1008 policy violation. The keepalive now only targets a context once its context-init has been sent (tracked in _context_init_sent).	2026-05-20 08:59:01 -04:00
Mark Backman	d11a4ba0cd	Use shared telephony route availability checks	2026-05-20 08:57:48 -04:00
Mark Backman	38407e091d	Add p99 values for Mistral and XAI	2026-05-19 22:51:33 -04:00
Mark Backman	82cd931efa	Merge pull request #4306 from YFortin/fix/azure-tts-last-word-race fix(azure-tts): Route completion through word boundary queue to prevent last word from being missed	2026-05-19 22:27:50 -04:00
Mark Backman	33e5d1f89b	Add changelog for PR #4522	2026-05-19 18:33:58 -04:00
Mark Backman	861dd23873	Add changelog for runner updates	2026-05-19 17:31:07 -04:00
Mark Backman	b825dd779e	Clarify runner startup banner	2026-05-19 17:31:07 -04:00
Mark Backman	1487da53a9	Improve runner optional transport handling	2026-05-19 17:03:16 -04:00
Mark Backman	aff84a5d9e	Add P99 latency for Smallest AI STT	2026-05-19 11:05:15 -04:00
Mark Backman	c09f6d5adb	Merge pull request #4052 from Vonage/vonage_video_connector_transport Vonage WebRTC Transport Integration	2026-05-19 10:56:20 -04:00
asilvestre	e2d249e5d9	adding uv.lock	2026-05-19 16:33:38 +02:00
asilvestre	956b39b0dc	remove extraneous await in cleanup	2026-05-19 16:33:04 +02:00
asilvestre	bc769eaa82	Changing the example to use OpenAI	2026-05-18 14:40:56 +02:00
asilvestre	ee5aa4dc71	SubscribeSettings to be pydantic and comment fixes	2026-05-18 14:40:56 +02:00
asilvestre	dd38fbc735	add documentation entry	2026-05-18 14:40:56 +02:00
asilvestre	a1c40df471	add documentation entry	2026-05-18 14:40:56 +02:00
asilvestre	c4ff9300c9	fix linting and typechecking	2026-05-18 14:40:56 +02:00
asilvestre	cab4585cbb	added changelog	2026-05-18 14:40:56 +02:00
Antoni Silvestre	18368d047e	Linting and changes to adapt to v1.0	2026-05-18 14:40:56 +02:00
asilvestre	e3abb4b6d7	apply suggestions in PR	2026-05-18 14:40:56 +02:00
Antoni Silvestre	0fd971d59d	Update src/pipecat/runner/types.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-05-18 14:40:56 +02:00
asilvestre	c61672194d	Vonage Video Connector Transport	2026-05-18 14:40:49 +02:00
Filipi da Silva Fuchter	c51a817efa	Merge pull request #4442 from pipecat-ai/filipi/runner_all_transports Unified start route to make all transports available	2026-05-18 09:27:44 -03:00
Bismeet singh	d85eda6da8	Merge pull request #4507 from BismeetSingh/fix/elevenlabs-stt-service-crash-language Fix/elevenlabs stt service crash language	2026-05-17 10:17:07 -04:00
Aleix Conchillo Flaqué	71feb42711	Merge pull request #4503 from pipecat-ai/changelog-1.2.1 Release 1.2.1 - Changelog Update v1.2.1	2026-05-15 15:19:55 -07:00
aconchillo	6b93ca0cb6	Update changelog for version 1.2.1	2026-05-15 22:18:46 +00:00
Aleix Conchillo Flaqué	b6ecce754b	Merge pull request #4501 from pipecat-ai/aleix/fix-filter-incomplete-tool-calls Fix filter-incomplete + function-calling deadlock	2026-05-15 15:11:45 -07:00
Aleix Conchillo Flaqué	d39e6bf921	Add changelog for #4501	2026-05-15 14:54:51 -07:00
Aleix Conchillo Flaqué	63064860ef	Move OpenAITTSService instructions into Settings in the example Mirrors the deprecation in ``OpenAITTSService.__init__``: ``instructions`` is now a Settings field. The constructor still accepts it for backward compatibility but the canonical path is through ``Settings``.	2026-05-15 14:54:51 -07:00
Aleix Conchillo Flaqué	f5158d51e7	Add filter-incomplete + function-calling turn-management example A copy of ``turn-management-filter-incomplete-turns.py`` extended with a ``get_weather(location)`` direct function. Exercises the path where the LLM responds to a complete user turn by calling a tool — used to reproduce (and now verify the fix for) the ``_user_speaking`` gating bug between filter-incomplete and function calls.	2026-05-15 14:54:51 -07:00
Aleix Conchillo Flaqué	94dbd2fa68	Broadcast UserTurnInferenceCompletedFrame on tool calls in filter-incomplete With ``filter_incomplete_user_turns`` enabled, an LLM that responded to a user turn by calling a tool (without first emitting a ✓ marker) never finalized the user turn. ``UserStoppedSpeakingFrame`` stayed deferred, the assistant aggregator kept ``_user_speaking=True``, and when ``FunctionCallResultFrame`` arrived its ``not self._user_speaking`` gate dropped the context push — the LLM continuation never ran and the call hung silently. Broadcast ``UserTurnInferenceCompletedFrame`` on ``FunctionCallsStartedFrame`` (i.e. the moment the LLM commits to a tool call, before the function dispatches), gated by a new ``_turn_completion_broadcasted`` flag so the ✓ path and the tool-call path don't both fire. The flag resets in ``_turn_reset`` alongside the other per-turn state. Emitting on the start frame rather than ``LLMFullResponseEndFrame`` also shrinks the race window — ``UserStoppedSpeakingFrame`` (a ``SystemFrame``) has the maximum possible head start over the ``FunctionCallResultFrame`` (``DataFrame``) that follows.	2026-05-15 14:50:35 -07:00
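
Commit 6b9deefbe3 above describes adding a monotonic counter as a tiebreaker so that frames with equal PTS leave the priority queue in insertion order. A minimal sketch of that general technique follows, using only the Python standard library. The class name `StablePTSQueue` and the helper `drain` are hypothetical names chosen for illustration; this is not the code in `BaseOutputTransport`, whose actual implementation is not shown in this log.

```python
import heapq
import itertools
from typing import Any, Iterable, List, Tuple


class StablePTSQueue:
    """Hypothetical illustration of a PTS-ordered queue with a stable tiebreaker.

    Not pipecat's implementation: it only demonstrates the idea of pairing
    each timestamp with a monotonically increasing counter.
    """

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # Each call to next() yields a strictly larger integer.
        self._counter = itertools.count()

    def put(self, pts: int, frame: Any) -> None:
        # Entries compare by (pts, counter). The counter is unique, so the
        # comparison never falls through to the frame object itself, and
        # equal-PTS entries keep their insertion order.
        heapq.heappush(self._heap, (pts, next(self._counter), frame))

    def get(self) -> Any:
        _pts, _order, frame = heapq.heappop(self._heap)
        return frame

    def __len__(self) -> int:
        return len(self._heap)


def drain(items: Iterable[Tuple[int, Any]]) -> List[Any]:
    """Push (pts, frame) pairs, then pop everything in queue order."""
    queue = StablePTSQueue()
    for pts, frame in items:
        queue.put(pts, frame)
    return [queue.get() for _ in range(len(queue))]


if __name__ == "__main__":
    frames = [(10, "audio-a"), (10, "text-b"), (5, "early"), (10, "audio-c")]
    print(drain(frames))
```

Without the counter, a heap of `(pts, frame)` tuples would compare the frame objects whenever two timestamps are equal, which either reorders them arbitrarily or raises `TypeError` for frame types that do not define ordering.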


9527 Commits