Pass realtime_service_mode=RealtimeServiceModeConfig() through every
realtime LLM service example (base, async-tool, video, text-output,
persistent-context, update-settings, MCP) so context aggregation uses
the new realtime-mode semantics instead of relying on local VAD as a
workaround.
Where examples previously wired SileroVADAnalyzer into
LLMUserAggregatorParams to coax turn frames out of services that don't
emit them server-side (AWS Nova Sonic, Ultravox, Gemini Live), the local
VAD is now removed. realtime_service_mode keeps context writes correct
without it, and the Phase 1.5 server-side InterruptionFrame fixes for
Nova Sonic and Ultravox keep the bot from talking past the user when
they barge in.
Transcript-logging event handlers move from on_user_turn_stopped /
on_assistant_turn_stopped to on_user_message_added /
on_assistant_message_added, which carry the finalized text in realtime
mode (the turn-stopped events fire before the message is finalized, so
their `content` is None in that mode).
For services that don't emit user-turn frames (Gemini Live, AWS Nova
Sonic, Ultravox), each example now carries a Tier 1 comment block that
spells out which downstream processors won't activate, how to add local
VAD if needed, and the caveat that locally-generated turn boundaries
are a heuristic that may diverge from server-side ground truth.
Adds examples/realtime/realtime-openai-local-vad.py, a new variant of
the OpenAI Realtime example that disables OpenAI's server-side turn
detection and drives turn boundaries locally — useful when you want a
turn analyzer like LocalSmartTurnV3 to decide when the user is done
speaking. Server-emitted turn frames are still preferred when available.
The Gemini Live local-VAD variant already existed; it's been updated in
place rather than rewritten.
Each realtime LLM service docstring now states whether the service emits
UserStartedSpeakingFrame / UserStoppedSpeakingFrame from server-side turn
signals, and what that implies for the rest of the pipeline.
For the services that don't (Gemini Live, AWS Nova Sonic, Ultravox), the
docstring spells out which downstream processors won't activate (RTVI
client speech events, TurnTrackingObserver, AudioBufferProcessor turn
recording, UserIdleController, user mute strategies, voicemail detector),
points at realtime_service_mode for correct context-write semantics, and
notes the option of wiring local VAD plus the caveat that locally-
generated turn boundaries are a heuristic that may not match the
provider's server-side turn decisions.
For the services that do (OpenAI Realtime, Inworld, Grok/xAI Realtime),
the docstring confirms turn frames are emitted from server VAD and
points at realtime_service_mode.
BaseOutputTransport only clears buffered audio mid-playback on
InterruptionFrame. Realtime services stream audio downstream as fast as
they produce it, and playback necessarily trails the buffer — so when the
user interrupts, the bot keeps talking past the interruption unless the
service surfaces the interruption to the pipeline.
Two realtime services were missing this signal:
- AWS Nova Sonic acknowledged the INTERRUPTED stop reason internally
  (closing its own response state) but never broadcast InterruptionFrame.
- Ultravox's playback_clear_buffer message, the server's explicit
  "drop buffered output audio" signal for interruptions, was not
  handled at all.
In both cases the latent bug was masked by enabling local VAD on the
user aggregator, which produced UserStartedSpeakingFrame and triggered
the aggregator-side interruption path. The realtime context aggregator
work makes local VAD optional, so the underlying gap needs fixing first.
Wire broadcast_interruption() into both services on the server-side
interruption signal, firing before the response-end signal so the
assistant aggregator marks the message interrupted=True before
LLMFullResponseEndFrame closes the turn.
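Why the ordering matters can be shown with a toy model. This is a sketch
only: ToyAssistantAggregator and ToyAssistantMessage are hypothetical
stand-ins, not the real aggregator, and they assume an interruption can
only mark a message that is still open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyAssistantMessage:
    text: str
    interrupted: bool = False


class ToyAssistantAggregator:
    """Hypothetical stand-in that tracks one open assistant message."""

    def __init__(self) -> None:
        self._current: Optional[ToyAssistantMessage] = None
        self.closed: List[ToyAssistantMessage] = []

    def on_response_start(self, text: str) -> None:
        self._current = ToyAssistantMessage(text)

    def on_interruption(self) -> None:
        # Only has an effect while a message is still open.
        if self._current is not None:
            self._current.interrupted = True

    def on_response_end(self) -> None:
        # Closing the turn freezes the message as it is right now.
        if self._current is not None:
            self.closed.append(self._current)
            self._current = None
```

In this model, an interruption delivered after on_response_end() finds no
open message and is lost, which is the case the ordering avoids.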
Decouple context management from turn frames and transcripts when a
realtime LLM service drives the conversation. Three problems with today's
behavior:
- Some realtime services (Gemini Live, AWS Nova Sonic, Ultravox) emit
  no UserStarted/StoppedSpeakingFrame at all, so the aggregator, which
  writes user messages on those frames, cannot write to context
  correctly.
- The workaround (local VAD on the aggregator) generates turn
  boundaries that don't match the provider's server-side ground truth,
  and the per-service "do I need it?" rule is hard to keep straight.
- When local turn detection is the intended driver, turn-end strategies
  still wait for transcripts on the latency critical path.
Add a realtime_service_mode: RealtimeServiceModeConfig | None = None
kwarg on LLMContextAggregatorPair. When set, the pair switches both
halves to trailing context writes: user messages are flushed on the first
assistant content frame, assistant messages on the next user transcript,
both halves on EndFrame. Turn-end strategies stop waiting for transcripts
by default. Two fine-grained boolean fields (context_writes_await_turns,
turns_await_transcripts) let callers dial back to cascade-style behavior
selectively; their invalid combination is rejected in __post_init__.
The bifurcation is dispatch-only: seven branch points across the two
halves, each at method entry, each delegating to a mode-pure private
method. Cross-half coordination uses an asyncio.Lock and a back-reference
shared by both halves; the assistant signals user.flush() on
LLMFullResponseStartFrame, and the user signals assistant.flush() on the
first new transcript after the assistant turn. The mechanism reuses the
existing push_aggregation() — no parallel write path.
Two new events fire when messages are flushed to context:
on_user_message_added and on_assistant_message_added. In cascade mode
they coincide with the existing turn-stopped events; in realtime mode
(where the turn-stopped event fires before the message is finalized)
they're the canonical way to subscribe to "context just updated, here's
the text."
UserTurnStoppedMessage.content is now typed str | None to reflect that
realtime mode fires the event with None.
When a RealtimeServiceMetadataFrame arrives and realtime_service_mode is
None, the aggregator logs a one-time INFO recommendation pointing users
at the option.
Realtime (speech-to-speech) LLM services need to advertise themselves to
the rest of the pipeline so downstream components can adapt. Add
RealtimeServiceMetadataFrame, a new subtype of ServiceMetadataFrame that
follows the STTMetadataFrame precedent.
LLMService gains a single ClassVar, _realtime_service_info, typed
RealtimeServiceInfo | None and defaulting to None. The presence of a
populated instance is what marks a service as realtime, and the
RealtimeServiceInfo dataclass carries the per-service knobs the rest of
the pipeline needs — currently just emits_user_turn_frames. Keeping it
all under one optional ClassVar avoids stranding realtime-only knobs on
the generic LLMService surface; non-realtime services keep the default
None and the realtime-specific machinery stays inert.
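The marker pattern can be sketched in isolation. This is a minimal
stand-in, not the real classes: only the names _realtime_service_info,
RealtimeServiceInfo, and emits_user_turn_frames come from this change.
The is_realtime() helper and the three example subclasses are
hypothetical.

```python
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass(frozen=True)
class RealtimeServiceInfo:
    """Per-service knobs for realtime services (stand-in dataclass)."""

    emits_user_turn_frames: bool


class LLMService:
    """Stand-in base class: None means "not a realtime service"."""

    _realtime_service_info: ClassVar[Optional[RealtimeServiceInfo]] = None

    @classmethod
    def is_realtime(cls) -> bool:  # hypothetical helper, illustration only
        return cls._realtime_service_info is not None


class ServerVADRealtimeService(LLMService):
    # Hypothetical service whose server emits user-turn signals.
    _realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=True)


class InheritingRealtimeService(ServerVADRealtimeService):
    # Picks up the parent's info through normal class-attribute lookup.
    pass


class NoTurnFramesRealtimeService(LLMService):
    # Hypothetical service that never emits user-turn frames.
    _realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=False)
```

Because the attribute lives on the class, a subclass that sets nothing
inherits its parent's value, which is how the "via inheritance" services
are covered.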
When _realtime_service_info is set, the base service auto-broadcasts
RealtimeServiceMetadataFrame right after StartFrame propagates downstream
(same ordering as STT). When emits_user_turn_frames is False, a one-time
INFO log at start explains which pipeline processors depend on those
frames (RTVI client speech events, TurnTrackingObserver,
AudioBufferProcessor turn recording, UserIdleController, user mute
strategies, voicemail detector) and how to add local VAD if needed.
Set the ClassVar on the seven realtime services: OpenAI Realtime, Azure
Realtime (via inheritance), Inworld, and Grok/xAI Realtime all emit
user-turn frames; Gemini Live (and Gemini Live Vertex via inheritance),
AWS Nova Sonic, and Ultravox do not.
In a follow-up commit, LLMContextAggregatorPair will consume
RealtimeServiceMetadataFrame to surface a one-time recommendation when
realtime_service_mode is not configured.
SpeechTimeoutUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy
both gate end-of-turn on a transcript arriving. That's the right default
for cascade STT/LLM/TTS pipelines, but it puts transcripts on the latency
critical path in pipelines where local turn detection is the intended
driver of end-of-turn — typically realtime LLM services consuming audio
directly. Closed PR #4480 explored this same fix in isolation.
Add wait_for_transcript: bool = True to both strategies. False makes the
strategy signal end-of-turn as soon as VAD / the turn analyzer reports
end-of-speech, independent of transcripts. The default preserves existing
behavior. LLMContextAggregatorPair will flip this in realtime mode in a
follow-up commit.
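The effect of the flag can be illustrated with a toy strategy. This is a
sketch under assumptions: ToyTurnStopStrategy and its method names are
hypothetical and do not mirror the real strategy classes. Only the
wait_for_transcript name and its default come from this change.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTurnStopStrategy:
    """Illustrative stand-in for a user turn stop strategy."""

    wait_for_transcript: bool = True
    _speech_ended: bool = field(default=False, init=False)
    _transcript_seen: bool = field(default=False, init=False)

    def on_speech_ended(self) -> bool:
        """Called when VAD or the turn analyzer reports end-of-speech."""
        self._speech_ended = True
        return self.should_end_turn()

    def on_transcript(self, text: str) -> bool:
        """Called when a transcript arrives for the current turn."""
        self._transcript_seen = self._transcript_seen or bool(text.strip())
        return self.should_end_turn()

    def should_end_turn(self) -> bool:
        if not self._speech_ended:
            return False
        # With the flag off, end-of-speech alone is enough.
        return self._transcript_seen or not self.wait_for_transcript
```

With the default, on_speech_ended() returns False until a transcript
arrives. With wait_for_transcript=False it returns True immediately.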
Adds tests for AggregatedFrameSequencer, WordCompletionTracker, and
word_timestamp_utils (including CJK language scenarios). Updates existing
Cartesia TTS and TTS frame ordering tests to cover the new behaviors.
TTSTextFrame entries were losing their original text structure when word
timestamps were enabled. AggregatedTextFrame now carries a raw_text field with
the original LLM-produced text (including pattern delimiters such as
<card>...</card>). The assistant context receives properly-tagged content
rather than the cleaned words returned by the TTS provider. Also handles words
that straddle the boundary between two sentences by splitting and attributing
each part to its correct source frame.
SSML markup (e.g. <spell>, <emotion>, <break>) was leaking into word entries
returned by the Cartesia word-timestamps API. Tags are now stripped before
processing so word-to-text attribution remains accurate when SSML is present
in the TTS input.
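A minimal sketch of tag stripping, assuming a regex is sufficient for the
tags involved. The strip_ssml_tags helper is hypothetical and is not the
code used in the Cartesia service. It only matches a "<" followed by a
letter or "/", so a bare comparison such as "a < b" is left alone.

```python
import re

# Matches XML-style tags such as <spell>, </spell>, <break time="1s"/>.
_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_ssml_tags(text: str) -> str:
    """Remove tags but keep their inner text, then collapse whitespace."""
    without_tags = _TAG_RE.sub("", text)
    return " ".join(without_tags.split())
```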
Frames sharing the same presentation timestamp were being reordered by the
priority queue. Adds a monotonic counter as a tiebreaker so frames with equal
PTS are always emitted in insertion order, preventing subtle audio/text
sequencing bugs.
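The tiebreaker technique can be sketched with the standard library. This
is an illustration, not the transport's actual queue: StablePTSQueue is a
hypothetical name, and heapq stands in for whatever priority queue the
real code uses.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class StablePTSQueue:
    """Min-heap keyed on (pts, insertion_index)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()

    def push(self, pts: int, frame: Any) -> None:
        # The counter is unique, so comparison never falls through to `frame`.
        heapq.heappush(self._heap, (pts, next(self._counter), frame))

    def pop(self) -> Any:
        _pts, _index, frame = heapq.heappop(self._heap)
        return frame

    def __len__(self) -> int:
        return len(self._heap)
```

A side benefit of the counter is that frames never need to be comparable
themselves, since tuple comparison is always decided by the first two
elements.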
Skipped frames (e.g. code blocks filtered via skip_aggregator_types) were
emitted to the assistant context immediately instead of waiting for preceding
spoken frames to finish. Introduces AggregatedFrameSequencer to hold each
frame's slot and flush only after all earlier spoken sentences are complete,
keeping context ordering correct.
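The slot-holding idea can be sketched as follows. ToyFrameSequencer is a
hypothetical simplification and does not reflect the real
AggregatedFrameSequencer interface: it assumes each entry is either
spoken (completed later) or skipped (complete on arrival), and it
releases only the contiguous run of completed entries at the front.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class _Slot:
    text: str
    done: bool


class ToyFrameSequencer:
    """Releases entries in arrival order, once all earlier ones are done."""

    def __init__(self) -> None:
        self._slots: Deque[_Slot] = deque()

    def add_spoken(self, text: str) -> _Slot:
        """Reserve a slot for text that is still being spoken."""
        slot = _Slot(text, done=False)
        self._slots.append(slot)
        return slot

    def add_skipped(self, text: str) -> List[str]:
        """Skipped text is complete immediately but still waits its turn."""
        self._slots.append(_Slot(text, done=True))
        return self._flush()

    def mark_spoken_done(self, slot: _Slot) -> List[str]:
        slot.done = True
        return self._flush()

    def _flush(self) -> List[str]:
        # Pop the contiguous run of finished slots at the front.
        released: List[str] = []
        while self._slots and self._slots[0].done:
            released.append(self._slots.popleft().text)
        return released
```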
The keepalive could fire for a new turn's context before that context's
voice_settings context-init was sent, making the keepalive the context's
first message (no voice_settings) and causing ElevenLabs to reject the
later init with a 1008 policy violation. The keepalive now only targets a
context once its context-init has been sent (tracked in _context_init_sent).
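The gating can be sketched as a small membership check. This assumes
_context_init_sent is a set of context IDs, which is a guess at its
shape. ToyKeepaliveGate and its methods are hypothetical and are not the
ElevenLabs service code.

```python
from typing import Optional, Set


class ToyKeepaliveGate:
    """Tracks which contexts have had their init message sent."""

    def __init__(self) -> None:
        # Assumed shape: a set of context IDs.
        self._context_init_sent: Set[str] = set()

    def mark_init_sent(self, context_id: str) -> None:
        self._context_init_sent.add(context_id)

    def forget(self, context_id: str) -> None:
        self._context_init_sent.discard(context_id)

    def keepalive_target(self, current_context_id: Optional[str]) -> Optional[str]:
        """Return the context to keep alive, or None if it is not initialized."""
        if current_context_id in self._context_init_sent:
            return current_context_id
        return None
```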
Mirrors the deprecation in ``OpenAITTSService.__init__``: ``instructions``
is now a ``Settings`` field. The constructor still accepts it for backward
compatibility but the canonical path is through ``Settings``.
Adds a copy of ``turn-management-filter-incomplete-turns.py`` extended with
a ``get_weather(location)`` direct function. It exercises the path where
the LLM responds to a complete user turn by calling a tool, and is used to
reproduce (and now verify the fix for) the ``_user_speaking`` gating
bug in the interaction between ``filter_incomplete_user_turns`` and
function calls.
With ``filter_incomplete_user_turns`` enabled, an LLM that responded to
a user turn by calling a tool (without first emitting a ✓ marker)
never finalized the user turn. ``UserStoppedSpeakingFrame`` stayed
deferred, the assistant aggregator kept ``_user_speaking=True``, and
when ``FunctionCallResultFrame`` arrived its ``not self._user_speaking``
gate dropped the context push — the LLM continuation never ran and
the call hung silently.
Broadcast ``UserTurnInferenceCompletedFrame`` on
``FunctionCallsStartedFrame`` (i.e. the moment the LLM commits to a
tool call, before the function dispatches), gated by a new
``_turn_completion_broadcasted`` flag so the ✓ path and the tool-call
path don't both fire. The flag resets in ``_turn_reset`` alongside
the other per-turn state.
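The once-per-turn gate can be sketched on its own. ``ToyTurnCompletionGate``
and its handler names are hypothetical. Only ``_turn_completion_broadcasted``
and ``_turn_reset`` are names from this change, and the recorded strings
stand in for broadcasting ``UserTurnInferenceCompletedFrame``.

```python
from typing import List


class ToyTurnCompletionGate:
    """Ensures the turn-completed signal is sent at most once per turn."""

    def __init__(self) -> None:
        self._turn_completion_broadcasted = False
        self.broadcasts: List[str] = []  # stand-in for real frame broadcasts

    def _broadcast_once(self, source: str) -> None:
        if self._turn_completion_broadcasted:
            return
        self._turn_completion_broadcasted = True
        self.broadcasts.append(source)

    def on_complete_marker(self) -> None:
        # Path 1: the LLM emitted the completion marker.
        self._broadcast_once("marker")

    def on_function_calls_started(self) -> None:
        # Path 2: the LLM committed to a tool call.
        self._broadcast_once("function_calls_started")

    def _turn_reset(self) -> None:
        self._turn_completion_broadcasted = False
```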
Emitting on the start frame rather than ``LLMFullResponseEndFrame``
also shrinks the race window — ``UserStoppedSpeakingFrame`` (a
``SystemFrame``) has the maximum possible head start over the
``FunctionCallResultFrame`` (``DataFrame``) that follows.