
9527 Commits

Author SHA1 Message Date
Paul Kompfner
bff741a647 Migrate realtime examples to RealtimeServiceModeConfig
Pass realtime_service_mode=RealtimeServiceModeConfig() through every
realtime LLM service example (base, async-tool, video, text-output,
persistent-context, update-settings, MCP) so context aggregation uses
the new realtime-mode semantics instead of relying on local VAD as a
workaround.

Where examples previously wired SileroVADAnalyzer into
LLMUserAggregatorParams to coax turn frames out of services that don't
emit them server-side (AWS Nova Sonic, Ultravox, Gemini Live), the local
VAD is now removed. realtime_service_mode keeps context writes correct
without it, and the Phase 1.5 server-side InterruptionFrame fixes for
Nova Sonic and Ultravox keep the bot from talking past the user when
they barge in.

Transcript-logging event handlers move from on_user_turn_stopped /
on_assistant_turn_stopped to on_user_message_added /
on_assistant_message_added, which carry the finalized text in realtime
mode (the turn-stopped events fire before the message is finalized, so
their `content` is None in that mode).

For services that don't emit user-turn frames (Gemini Live, AWS Nova
Sonic, Ultravox) the example now carries a Tier 1 comment block that
spells out which downstream processors won't activate, how to add local
VAD if needed, and the caveat that locally-generated turn boundaries
are a heuristic that may diverge from server-side ground truth.

Adds examples/realtime/realtime-openai-local-vad.py, a new variant of
the OpenAI Realtime example that disables OpenAI's server-side turn
detection and drives turn boundaries locally — useful when you want a
turn analyzer like LocalSmartTurnV3 to decide when the user is done
speaking. Server-emitted turn frames are still preferred when available.

The Gemini Live local-VAD variant already existed; it's been updated in
place rather than rewritten.
2026-05-21 11:25:29 -04:00
Paul Kompfner
20d9bf4af6 Document user-turn-frame behavior in realtime service docstrings
Each realtime LLM service docstring now states whether the service emits
UserStartedSpeakingFrame / UserStoppedSpeakingFrame from server-side turn
signals, and what that implies for the rest of the pipeline.

For the services that don't (Gemini Live, AWS Nova Sonic, Ultravox), the
docstring spells out which downstream processors won't activate (RTVI
client speech events, TurnTrackingObserver, AudioBufferProcessor turn
recording, UserIdleController, user mute strategies, voicemail detector),
points at realtime_service_mode for correct context-write semantics, and
notes the option of wiring local VAD plus the caveat that locally-
generated turn boundaries are a heuristic that may not match the
provider's server-side turn decisions.

For the services that do (OpenAI Realtime, Inworld, Grok/xAI Realtime),
the docstring confirms turn frames are emitted from server VAD and
points at realtime_service_mode.
2026-05-21 11:25:29 -04:00
Paul Kompfner
a00211627f Surface server-side interruption from Nova Sonic and Ultravox
BaseOutputTransport only clears buffered audio mid-playback on
InterruptionFrame. Realtime services stream audio downstream as fast as
they produce it, and playback necessarily trails the buffer — so when the
user interrupts, the bot keeps talking past the interruption unless the
service surfaces the interruption to the pipeline.

Two realtime services were missing this signal:

  - AWS Nova Sonic acknowledged the INTERRUPTED stop reason internally
    (closing its own response state) but never broadcast InterruptionFrame.
  - Ultravox's playback_clear_buffer message — the server's explicit
    "drop buffered output audio" signal for interruptions — was not
    handled at all.

In both cases the latent bug was masked by enabling local VAD on the
user aggregator, which produced UserStartedSpeakingFrame and triggered
the aggregator-side interruption path. The realtime context aggregator
work makes local VAD optional, so the underlying gap needs fixing first.

Wire broadcast_interruption() into both services on the server-side
interruption signal, firing before the response-end signal so the
assistant aggregator marks the message interrupted=True before
LLMFullResponseEndFrame closes the turn.
2026-05-21 11:25:29 -04:00
Paul Kompfner
11d7fcf174 Add changelog fragments for realtime service mode
Fragments use the +<name> prefix so they show up under "Unreleased"
without a PR-number suffix; rename to <PR#>.<type>.md before merge.
2026-05-21 11:25:29 -04:00
Paul Kompfner
1fe8cf5289 Add RealtimeServiceModeConfig to LLMContextAggregatorPair
Decouple context management from turn frames and transcripts when a
realtime LLM service drives the conversation. Three problems with today's
behavior:

  - Some realtime services (Gemini Live, AWS Nova Sonic, Ultravox) emit
    no UserStarted/StoppedSpeakingFrame at all, so the aggregator — which
    writes user messages on those frames — doesn't write to context
    correctly without them.
  - The workaround (local VAD on the aggregator) generates turn
    boundaries that don't match the provider's server-side ground truth,
    and the per-service "do I need it?" rule is hard to keep straight.
  - When local turn detection is the intended driver, turn-end strategies
    still wait for transcripts on the latency critical path.

Add a realtime_service_mode: RealtimeServiceModeConfig | None = None
kwarg on LLMContextAggregatorPair. When set, the pair switches both
halves to trailing context writes: user messages are flushed on the first
assistant content frame, assistant messages on the next user transcript,
both halves on EndFrame. Turn-end strategies stop waiting for transcripts
by default. Two fine-grained boolean fields (context_writes_await_turns,
turns_await_transcripts) let callers dial back to cascade-style behavior
selectively; their invalid combination is rejected in __post_init__.

The bifurcation is dispatch-only: seven branch points across the two
halves, each at method entry, each delegating to a mode-pure private
method. Cross-half coordination uses an asyncio.Lock and a back-reference
shared by both halves; the assistant signals user.flush() on
LLMFullResponseStartFrame, and the user signals assistant.flush() on the
first new transcript after the assistant turn. The mechanism reuses the
existing push_aggregation() — no parallel write path.

Two new events fire when messages are flushed to context:
on_user_message_added and on_assistant_message_added. In cascade mode
they coincide with the existing turn-stopped events; in realtime mode
(where the turn-stopped event fires before the message is finalized)
they're the canonical way to subscribe to "context just updated, here's
the text."

UserTurnStoppedMessage.content is now typed str | None to reflect that
realtime mode fires the event with None.

When a RealtimeServiceMetadataFrame arrives and realtime_service_mode is
None, the aggregator logs a one-time INFO recommendation pointing users
at the option.
2026-05-21 11:25:29 -04:00
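The trailing-write behaviour described in 1fe8cf5289 can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyTrailingContext` and its method names are hypothetical stand-ins, not the real `LLMContextAggregatorPair` API, and the cross-half locking the commit mentions is left out.

```python
class ToyTrailingContext:
    """Hypothetical toy model of trailing context writes (not pipecat code)."""

    def __init__(self) -> None:
        self.context: list[dict[str, str]] = []
        self._pending_user: list[str] = []
        self._pending_assistant: list[str] = []

    def on_user_transcript(self, text: str) -> None:
        # A new user transcript closes out any pending assistant message.
        self._flush("assistant", self._pending_assistant)
        self._pending_user.append(text)

    def on_assistant_content(self, text: str) -> None:
        # The first assistant content closes out the pending user message.
        self._flush("user", self._pending_user)
        self._pending_assistant.append(text)

    def on_end(self) -> None:
        # At the end of the session both halves flush whatever is left.
        self._flush("user", self._pending_user)
        self._flush("assistant", self._pending_assistant)

    def _flush(self, role: str, pending: list[str]) -> None:
        if pending:
            self.context.append({"role": role, "content": " ".join(pending)})
            pending.clear()


ctx = ToyTrailingContext()
ctx.on_user_transcript("hi")
ctx.on_user_transcript("there")
ctx.on_assistant_content("hello")
ctx.on_user_transcript("bye")
ctx.on_end()
roles = [m["role"] for m in ctx.context]
```

In this toy, no turn frames are needed at all: each message is written when the other side starts producing content, which is the property the commit message is after.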
Paul Kompfner
3247fd1188 Mark realtime LLM services with RealtimeServiceInfo + emit metadata at start
Realtime (speech-to-speech) LLM services need to advertise themselves to
the rest of the pipeline so downstream components can adapt. Add a new
RealtimeServiceMetadataFrame subtype of ServiceMetadataFrame, following
the STTMetadataFrame precedent.

LLMService gains a single ClassVar, _realtime_service_info, typed
RealtimeServiceInfo | None and defaulting to None. The presence of a
populated instance is what marks a service as realtime, and the
RealtimeServiceInfo dataclass carries the per-service knobs the rest of
the pipeline needs — currently just emits_user_turn_frames. Keeping it
all under one optional ClassVar avoids stranding realtime-only knobs on
the generic LLMService surface; non-realtime services keep the default
None and the realtime-specific machinery stays inert.

When _realtime_service_info is set, the base service auto-broadcasts
RealtimeServiceMetadataFrame right after StartFrame propagates downstream
(same ordering as STT). When emits_user_turn_frames is False, a one-time
INFO log at start explains which pipeline processors depend on those
frames (RTVI client speech events, TurnTrackingObserver,
AudioBufferProcessor turn recording, UserIdleController, user mute
strategies, voicemail detector) and how to add local VAD if needed.

Set the ClassVar on the seven realtime services: OpenAI Realtime, Azure
Realtime (via inheritance), Inworld, Grok/xAI Realtime all emit
user-turn frames; Gemini Live (and Gemini Live Vertex via inheritance),
AWS Nova Sonic, Ultravox do not.

In a follow-up commit, LLMContextAggregatorPair will consume
RealtimeServiceMetadataFrame to surface a one-time recommendation when
realtime_service_mode is not configured.
2026-05-20 15:08:40 -04:00
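A rough sketch of the single-ClassVar marking described in 3247fd1188. The names `_realtime_service_info`, `RealtimeServiceInfo` and `emits_user_turn_frames` come from the commit message; the stand-in `LLMService` base, the helper methods and the `Toy*` subclasses are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass(frozen=True)
class RealtimeServiceInfo:
    """Per-service knobs; the commit names only this one field."""

    emits_user_turn_frames: bool


class LLMService:
    # Stand-in base class (assumption). None marks a non-realtime service.
    _realtime_service_info: ClassVar[Optional[RealtimeServiceInfo]] = None

    def is_realtime(self) -> bool:
        # Hypothetical helper: presence of a populated instance is the marker.
        return self._realtime_service_info is not None

    def needs_turn_frame_notice(self) -> bool:
        """Hypothetical helper: realtime service without user-turn frames."""
        info = self._realtime_service_info
        return info is not None and not info.emits_user_turn_frames


class ToyServerVadService(LLMService):
    _realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=True)


class ToyInheritingService(ToyServerVadService):
    """Picks up the marker through inheritance, with no extra code."""


class ToyNoTurnFrameService(LLMService):
    _realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=False)
```

Because the marker is a class attribute, a subclass inherits it unchanged, which matches the "via inheritance" cases listed in the commit.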
Paul Kompfner
9f0a60b995 Add wait_for_transcript flag on user-turn stop strategies
SpeechTimeoutUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy
both gate end-of-turn on a transcript arriving. That's the right default
for cascade STT/LLM/TTS pipelines, but it puts transcripts on the latency
critical path in pipelines where local turn detection is the intended
driver of end-of-turn — typically realtime LLM services consuming audio
directly. Closed PR #4480 explored this same fix in isolation.

Add wait_for_transcript: bool = True to both strategies. False makes the
strategy signal end-of-turn as soon as VAD / the turn analyzer reports
end-of-speech, independent of transcripts. The default preserves existing
behavior. LLMContextAggregatorPair will flip this in realtime mode in a
follow-up commit.
2026-05-20 14:07:58 -04:00
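The effect of the flag in 9f0a60b995 can be sketched with a toy strategy. Only the `wait_for_transcript: bool = True` field is taken from the commit; `ToyTurnStopStrategy` and its callbacks are hypothetical and much simpler than the real strategies.

```python
from dataclasses import dataclass


@dataclass
class ToyTurnStopStrategy:
    """Hypothetical toy strategy (not the pipecat classes)."""

    wait_for_transcript: bool = True
    _speech_ended: bool = False
    _transcript_seen: bool = False

    def on_speech_ended(self) -> bool:
        # Called when VAD or a turn analyzer reports end-of-speech.
        self._speech_ended = True
        return self.turn_ended()

    def on_transcript(self) -> bool:
        self._transcript_seen = True
        return self.turn_ended()

    def turn_ended(self) -> bool:
        if not self._speech_ended:
            return False
        # With the flag off, end-of-speech alone ends the turn.
        return self._transcript_seen or not self.wait_for_transcript


cascade = ToyTurnStopStrategy()  # default keeps the existing behaviour
cascade_after_speech = cascade.on_speech_ended()
cascade_after_transcript = cascade.on_transcript()

realtime = ToyTurnStopStrategy(wait_for_transcript=False)
realtime_after_speech = realtime.on_speech_ended()
```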
Mark Backman
709a0ce839 Merge pull request #4527 from pipecat-ai/mb/fix-elevenlabs-keepalive-1008
Fix ElevenLabs keepalive racing context-init (1008 disconnects)
2026-05-20 11:21:17 -04:00
Mark Backman
be93350eae Merge pull request #4522 from pipecat-ai/mb/stt-latency-smallest
Add P99 latency for Smallest AI, Mistral, XAI STT
2026-05-20 11:21:00 -04:00
Mark Backman
4a96ab7073 Merge pull request #4524 from pipecat-ai/mb/fix-runner-imports
Improve runner optional transport handling
2026-05-20 11:16:16 -04:00
Filipi da Silva Fuchter
bca337f97e Merge pull request #4380 from pipecat-ai/filipi/smart_text
Smart Text Handling
2026-05-20 10:18:30 -03:00
filipi87
5d9e8c5ac5 Removing debug log. 2026-05-20 10:13:46 -03:00
Mark Backman
70773bce0a Add changelog for PR #4527 2026-05-20 09:08:47 -04:00
filipi87
8bdb49bd1a chore: add changelogs for word-timestamp and frame-ordering fixes 2026-05-20 10:03:30 -03:00
filipi87
81bb81c1d0 test: add automated tests for word tracking, frame sequencing, and Cartesia TTS
Adds tests for AggregatedFrameSequencer, WordCompletionTracker, and
word_timestamp_utils (including CJK language scenarios). Updates existing
Cartesia TTS and TTS frame ordering tests to cover the new behaviours.
2026-05-20 10:03:26 -03:00
filipi87
e1bdee598c fix: preserve raw_text through TTS pipeline for correct LLM context attribution
TTSTextFrame entries were losing their original text structure when word
timestamps were enabled. AggregatedTextFrame now carries a raw_text field with
the original LLM-produced text (including pattern delimiters such as
<card>...</card>). The assistant context receives properly-tagged content
rather than the cleaned words returned by the TTS provider. Also handles words
that straddle two sentence boundaries by splitting and attributing each part
to its correct source frame.
2026-05-20 10:03:21 -03:00
filipi87
185a89bb3b fix: strip Cartesia SSML tags from word timestamp entries
SSML markup (e.g. <spell>, <emotion>, <break>) was leaking into word entries
returned by the Cartesia word-timestamps API. Tags are now stripped before
processing so word-to-text attribution remains accurate when SSML is present
in the TTS input.
2026-05-20 10:03:15 -03:00
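As a minimal sketch of the tag stripping described in 185a89bb3b: the regular expression and the helper names below are assumptions, not the code from the commit, and dropping entries that consisted only of markup is likewise an illustrative choice.

```python
import re

# Hypothetical pattern: any opening, closing, or self-closing tag.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_tags(word: str) -> str:
    """Remove SSML-style tags from a single word entry."""
    return _TAG_RE.sub("", word)


def clean_words(words: list[str]) -> list[str]:
    """Strip tags from each entry and drop entries that were only markup."""
    cleaned = (strip_tags(w) for w in words)
    return [w for w in cleaned if w]


raw = ["<spell>ABC</spell>", "is", '<break time="1s"/>', "ready"]
cleaned_words = clean_words(raw)
```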
filipi87
6b9deefbe3 fix: preserve frame insertion order in BaseOutputTransport for equal PTS values
Frames sharing the same presentation timestamp were being reordered by the
priority queue. Adds a monotonic counter as a tiebreaker so frames with equal
PTS are always emitted in insertion order, preventing subtle audio/text
sequencing bugs.
2026-05-20 10:03:08 -03:00
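The tiebreaker idea in 6b9deefbe3 is a standard priority-queue technique, sketched here with the standard library. `OrderedPtsQueue` is a hypothetical stand-in and not the queue used inside `BaseOutputTransport`.

```python
import heapq
import itertools
from typing import Any


class OrderedPtsQueue:
    """Toy priority queue: orders by PTS, then by insertion order."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, Any]] = []
        self._counter = itertools.count()  # monotonic tiebreaker

    def put(self, pts: int, frame: Any) -> None:
        # The counter sits between pts and frame, so equal-PTS entries are
        # compared by insertion order and frames themselves are never compared.
        heapq.heappush(self._heap, (pts, next(self._counter), frame))

    def get(self) -> Any:
        _pts, _seq, frame = heapq.heappop(self._heap)
        return frame


queue = OrderedPtsQueue()
for pts, name in [(20, "audio-b"), (10, "text-a"), (10, "audio-a"), (10, "text-b")]:
    queue.put(pts, name)

drained = [queue.get() for _ in range(4)]
```

Without the counter, entries with equal PTS would fall back to comparing the frames themselves, so their relative order would not be tied to insertion order.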
filipi87
deefc32faf fix: hold skipped TTS frames in position until preceding spoken frames complete
Skipped frames (e.g. code blocks filtered via skip_aggregator_types) were
emitted to the assistant context immediately instead of waiting for preceding
spoken frames to finish. Introduces AggregatedFrameSequencer to hold each
frame's slot and flush only after all earlier spoken sentences are complete,
keeping context ordering correct.
2026-05-20 10:03:03 -03:00
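The slot-holding behaviour in deefc32faf might look roughly like the following. This is a sketch under assumptions: `ToySequencer` and `Slot` are hypothetical, and the real `AggregatedFrameSequencer` may track completion quite differently.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Slot:
    text: str
    spoken: bool
    done: bool = False


class ToySequencer:
    """Hypothetical toy: emit frames to context strictly in arrival order."""

    def __init__(self) -> None:
        self._slots: deque[Slot] = deque()
        self.emitted: list[str] = []

    def add(self, text: str, spoken: bool) -> Slot:
        # Skipped frames have no playback to wait for, so they start done,
        # but they still queue behind earlier spoken frames.
        slot = Slot(text, spoken, done=not spoken)
        self._slots.append(slot)
        self._flush()
        return slot

    def complete(self, slot: Slot) -> None:
        slot.done = True
        self._flush()

    def _flush(self) -> None:
        # Only the finished prefix of the queue is released.
        while self._slots and self._slots[0].done:
            self.emitted.append(self._slots.popleft().text)


seq = ToySequencer()
first = seq.add("Hello.", spoken=True)
seq.add("<code block>", spoken=False)
last = seq.add("Bye.", spoken=True)
seq.complete(last)
emitted_before = list(seq.emitted)
seq.complete(first)
emitted_after = list(seq.emitted)
```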
Mark Backman
a5e6886b80 Fix ElevenLabs keepalive racing context-init (1008 disconnects)
The keepalive could fire for a new turn's context before that context's
voice_settings context-init was sent, making the keepalive the context's
first message (no voice_settings) and causing ElevenLabs to reject the
later init with a 1008 policy violation. The keepalive now only targets a
context once its context-init has been sent (tracked in _context_init_sent).
2026-05-20 08:59:01 -04:00
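The guard described in a5e6886b80 can be sketched as follows. Only the `_context_init_sent` name comes from the commit; storing it as a set of context ids, and the `ToyContextKeepalive` class itself, are assumptions for illustration.

```python
class ToyContextKeepalive:
    """Hypothetical toy sender (not the ElevenLabs service code)."""

    def __init__(self) -> None:
        # Assumed to be a set of context ids whose context-init was sent.
        self._context_init_sent: set[str] = set()
        self.sent: list[tuple[str, str]] = []

    def send_context_init(self, context_id: str) -> None:
        self.sent.append(("init", context_id))
        self._context_init_sent.add(context_id)

    def send_keepalive(self, context_id: str) -> bool:
        # Skip the keepalive so it can never be a context's first message.
        if context_id not in self._context_init_sent:
            return False
        self.sent.append(("keepalive", context_id))
        return True


sender = ToyContextKeepalive()
early = sender.send_keepalive("ctx-1")  # init not sent yet, so it is skipped
sender.send_context_init("ctx-1")
late = sender.send_keepalive("ctx-1")
```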
Mark Backman
d11a4ba0cd Use shared telephony route availability checks 2026-05-20 08:57:48 -04:00
Mark Backman
38407e091d Add p99 values for Mistral and XAI 2026-05-19 22:51:33 -04:00
Mark Backman
82cd931efa Merge pull request #4306 from YFortin/fix/azure-tts-last-word-race
fix(azure-tts): Route completion through word boundary queue to prevent last word from being missed
2026-05-19 22:27:50 -04:00
Mark Backman
33e5d1f89b Add changelog for PR #4522 2026-05-19 18:33:58 -04:00
Mark Backman
861dd23873 Add changelog for runner updates 2026-05-19 17:31:07 -04:00
Mark Backman
b825dd779e Clarify runner startup banner 2026-05-19 17:31:07 -04:00
Mark Backman
1487da53a9 Improve runner optional transport handling 2026-05-19 17:03:16 -04:00
Mark Backman
aff84a5d9e Add P99 latency for Smallest AI STT 2026-05-19 11:05:15 -04:00
Mark Backman
c09f6d5adb Merge pull request #4052 from Vonage/vonage_video_connector_transport
Vonage WebRTC Transport Integration
2026-05-19 10:56:20 -04:00
asilvestre
e2d249e5d9 adding uv.lock 2026-05-19 16:33:38 +02:00
asilvestre
956b39b0dc remove extraneous await in cleanup 2026-05-19 16:33:04 +02:00
asilvestre
bc769eaa82 Changing the example to use OpenAI 2026-05-18 14:40:56 +02:00
asilvestre
ee5aa4dc71 SubscribeSettings to be pydantic and comment fixes 2026-05-18 14:40:56 +02:00
asilvestre
dd38fbc735 add documentation entry 2026-05-18 14:40:56 +02:00
asilvestre
a1c40df471 add documentation entry 2026-05-18 14:40:56 +02:00
asilvestre
c4ff9300c9 fix linting and typechecking 2026-05-18 14:40:56 +02:00
asilvestre
cab4585cbb added changelog 2026-05-18 14:40:56 +02:00
Antoni Silvestre
18368d047e Linting and changes to adapt to v1.0 2026-05-18 14:40:56 +02:00
asilvestre
e3abb4b6d7 apply suggestions in PR 2026-05-18 14:40:56 +02:00
Antoni Silvestre
0fd971d59d Update src/pipecat/runner/types.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-05-18 14:40:56 +02:00
asilvestre
c61672194d Vonage Video Connector Transport 2026-05-18 14:40:49 +02:00
Filipi da Silva Fuchter
c51a817efa Merge pull request #4442 from pipecat-ai/filipi/runner_all_transports
Unified start route to make all transports available
2026-05-18 09:27:44 -03:00
Bismeet singh
d85eda6da8 Merge pull request #4507 from BismeetSingh/fix/elevenlabs-stt-service-crash-language
Fix/elevenlabs stt service crash language
2026-05-17 10:17:07 -04:00
Aleix Conchillo Flaqué
71feb42711 Merge pull request #4503 from pipecat-ai/changelog-1.2.1
Release 1.2.1 - Changelog Update
v1.2.1
2026-05-15 15:19:55 -07:00
aconchillo
6b93ca0cb6 Update changelog for version 1.2.1 2026-05-15 22:18:46 +00:00
Aleix Conchillo Flaqué
b6ecce754b Merge pull request #4501 from pipecat-ai/aleix/fix-filter-incomplete-tool-calls
Fix filter-incomplete + function-calling deadlock
2026-05-15 15:11:45 -07:00
Aleix Conchillo Flaqué
d39e6bf921 Add changelog for #4501 2026-05-15 14:54:51 -07:00
Aleix Conchillo Flaqué
63064860ef Move OpenAITTSService instructions into Settings in the example
Mirrors the deprecation in ``OpenAITTSService.__init__``: ``instructions``
is now a Settings field. The constructor still accepts it for backward
compatibility but the canonical path is through ``Settings``.
2026-05-15 14:54:51 -07:00
Aleix Conchillo Flaqué
f5158d51e7 Add filter-incomplete + function-calling turn-management example
A copy of ``turn-management-filter-incomplete-turns.py`` extended with
a ``get_weather(location)`` direct function. Exercises the path where
the LLM responds to a complete user turn by calling a tool — used to
reproduce (and now verify the fix for) the ``_user_speaking`` gating
bug between filter-incomplete and function calls.
2026-05-15 14:54:51 -07:00
Aleix Conchillo Flaqué
94dbd2fa68 Broadcast UserTurnInferenceCompletedFrame on tool calls in filter-incomplete
With ``filter_incomplete_user_turns`` enabled, an LLM that responded to
a user turn by calling a tool (without first emitting a ✓ marker)
never finalized the user turn. ``UserStoppedSpeakingFrame`` stayed
deferred, the assistant aggregator kept ``_user_speaking=True``, and
when ``FunctionCallResultFrame`` arrived its ``not self._user_speaking``
gate dropped the context push — the LLM continuation never ran and
the call hung silently.

Broadcast ``UserTurnInferenceCompletedFrame`` on
``FunctionCallsStartedFrame`` (i.e. the moment the LLM commits to a
tool call, before the function dispatches), gated by a new
``_turn_completion_broadcasted`` flag so the ✓ path and the tool-call
path don't both fire. The flag resets in ``_turn_reset`` alongside
the other per-turn state.

Emitting on the start frame rather than ``LLMFullResponseEndFrame``
also shrinks the race window — ``UserStoppedSpeakingFrame`` (a
``SystemFrame``) has the maximum possible head start over the
``FunctionCallResultFrame`` (``DataFrame``) that follows.
2026-05-15 14:50:35 -07:00
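The once-per-turn gating in 94dbd2fa68 can be sketched as follows. The `_turn_completion_broadcasted` flag and `_turn_reset` are named in the commit; the `ToyTurnCompletion` class, its other methods and the recorded strings are hypothetical.

```python
class ToyTurnCompletion:
    """Hypothetical toy: broadcast turn completion at most once per turn."""

    def __init__(self) -> None:
        self._turn_completion_broadcasted = False
        self.broadcasts: list[str] = []

    def _broadcast_once(self, source: str) -> None:
        if self._turn_completion_broadcasted:
            return  # the other path already fired for this turn
        self._turn_completion_broadcasted = True
        self.broadcasts.append(source)

    def on_completion_marker(self) -> None:
        self._broadcast_once("marker")

    def on_function_calls_started(self) -> None:
        self._broadcast_once("tool_call")

    def _turn_reset(self) -> None:
        # Per-turn state is cleared so the next turn can broadcast again.
        self._turn_completion_broadcasted = False


turn = ToyTurnCompletion()
turn.on_function_calls_started()
turn.on_completion_marker()  # ignored: this turn was already completed
turn._turn_reset()
turn.on_completion_marker()
```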