The OpenAI Realtime story didn't add any service-level code — just a
new example. The original 4480.added.md entry already describes the
feature as "a realtime service like Gemini Live," which generalizes
to OpenAI Realtime.
Mirrors the Gemini Live local-VAD example for OpenAI Realtime, showing
that `wait_for_transcript_to_end_user_turn=False` composes cleanly
with `turn_detection=False`. The OpenAI Realtime service already wires
`UserStoppedSpeakingFrame` to `input_audio_buffer.commit` +
`response.create` when `turn_detection=False`, so the example is the
only new code needed.
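With server-side turn detection off, the client has to close the turn itself. A minimal sketch of the two client events named above, written as plain JSON payloads; the helper name `end_of_turn_events` is hypothetical and this is not the service's actual code:

```python
import json


def end_of_turn_events() -> list:
    """Serialize the two client events sent once local VAD reports end of speech."""
    events = [
        {"type": "input_audio_buffer.commit"},  # close the buffered user audio
        {"type": "response.create"},  # ask the model to start responding
    ]
    return [json.dumps(event) for event in events]


messages = end_of_turn_events()
```

Order matters here: the commit has to precede `response.create` so the response is generated against the committed audio.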
Switch the vocabulary for the timer-driven phase that runs when
`wait_for_transcript_to_end_user_turn=False`. "Transcript gather" was
too vague to be self-documenting; "post-turn transcript wait" names
when it happens (after the user turn ends) and what it's for (waiting
for late-arriving transcripts).
Renames the internal property to `_wait_for_post_turn_transcripts`
and the supporting state/method names to match
(`_post_turn_transcript_wait_task`, `_complete_post_turn_transcript_wait`,
etc.). Updates docstrings, comments, log messages, the example
inline doc, and the test prose to use the new vocabulary consistently.
The aggregator's transcript-gather timer (used when
`wait_for_transcript_to_end_user_turn=False`) was hardcoded to
`DEFAULT_TTFS_P99`. Capture `STTMetadataFrame.ttfs_p99_latency` as
it flows through the user aggregator and prefer that value, just
like the stop strategies already do. Falls back to
`DEFAULT_TTFS_P99` when no STT service has reported a value.
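The prefer-reported-value-else-default rule can be sketched as follows. The classes are stand-ins, not the library's: `TranscriptWaitTimeout` is a hypothetical name, the frame models only the field named above, and the value of `DEFAULT_TTFS_P99` is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_TTFS_P99 = 1.0  # placeholder; the real constant is defined by the library


@dataclass
class STTMetadataFrame:
    """Stand-in frame carrying only the latency field named in the text."""

    ttfs_p99_latency: Optional[float] = None


class TranscriptWaitTimeout:
    """Remembers the latest STT-reported p99 and falls back to the default."""

    def __init__(self) -> None:
        self._stt_ttfs_p99: Optional[float] = None

    def observe(self, frame: STTMetadataFrame) -> None:
        # Only overwrite when the STT service actually reported a value.
        if frame.ttfs_p99_latency is not None:
            self._stt_ttfs_p99 = frame.ttfs_p99_latency

    @property
    def seconds(self) -> float:
        if self._stt_ttfs_p99 is not None:
            return self._stt_ttfs_p99
        return DEFAULT_TTFS_P99
```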
Reframe comments, docstrings, identifiers, changelog, and example
around a single explanation of the option: (1) turn strategies do not
consider user transcripts, letting the user turn end sooner, and (2)
the aggregator gathers user transcripts on its own after the turn
ends via a simple timer, then emits `on_user_turn_message_finalized`
with the new user context message.
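Point (2) amounts to "collect whatever arrives until a deadline, then finalize". A rough asyncio sketch of that idea, assuming transcripts arrive on a queue; `gather_then_finalize` is a hypothetical helper and the aggregator's real timer task may be structured differently:

```python
import asyncio


async def gather_then_finalize(transcripts: asyncio.Queue, wait_seconds: float) -> str:
    """Collect transcripts until the deadline passes, then return the joined text."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + wait_seconds
    parts = []
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            parts.append(await asyncio.wait_for(transcripts.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # timer elapsed; finalize with what we have
    return " ".join(parts)


async def demo() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    queue.put_nowait("hello")
    queue.put_nowait("world")
    return await gather_then_finalize(queue, wait_seconds=0.05)
```

The returned string is what would be written to context before `on_user_turn_message_finalized` fires.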
The mechanism is generic, so internal aggregator vocabulary stays
generic ("transcript-gather", "after the user turn ends"); the
public-facing param docstring is the one place that explains the
"local turn detection drives a realtime service" use case. The stop
strategies' `wait_for_transcript` flag is now described as something
that's "usually flipped indirectly" by the aggregator param, rather
than something to set alongside it.
Renames internal state to match: `_expect_delayed_transcripts` →
`_aggregator_gathers_transcripts`, `_pending_finalization_*` →
`_transcript_gather_*`, `_finalize_delayed_user_message` →
`_finalize_user_message`, etc.
Adds two tests for the strategy's transcripts-without-VAD fallback
path — one in default mode (both events fire with the aggregated
content) and one in delayed-transcript mode (only
``on_user_turn_message_finalized`` fires; no end-of-turn event is
emitted since no turn ever started in the controller).
Updates existing tests for the vocabulary refactor: assertions now
expect ``content=None`` (not ``""``) for the end-of-turn event in
delayed-transcript mode; comments and docstrings use the
standardized terms (end of turn, user message finalization,
pending-finalization timer, plural "transcripts").
Splits ``_maybe_emit_user_turn_stopped`` into three focused methods —
``_flush_user_message_to_context`` (push aggregation, return content +
timestamp), ``_finalize_user_turn`` (default-mode flow, emits both
events), and ``_finalize_delayed_user_message`` (delayed-mode flow,
emits only ``on_user_turn_message_finalized``). Fixes a side-issue
where ``on_user_turn_stopped`` could fire from non-end-of-turn paths
in delayed-transcript mode; that event now has a single origin (the
end-of-turn handler).
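A toy model of the split, to show which path emits which event. Only the three method names come from the change; the class, the `handle_*` entry points, and the event list are illustrative assumptions, and the timestamp returned by the real flush is omitted.

```python
from typing import Optional


class ToyUserAggregator:
    def __init__(self, delayed: bool) -> None:
        self._delayed = delayed
        self._aggregation: list = []
        self.context: list = []
        self.events: list = []  # (event name, content) pairs

    def add_transcript(self, text: str) -> None:
        self._aggregation.append(text)

    def _flush_user_message_to_context(self) -> str:
        content = " ".join(self._aggregation)
        self._aggregation.clear()
        if content:
            self.context.append(content)
        return content

    def _finalize_user_turn(self) -> None:
        # Default mode: both events fire together with the same content.
        content = self._flush_user_message_to_context()
        self.events.append(("on_user_turn_stopped", content))
        self.events.append(("on_user_turn_message_finalized", content))

    def _finalize_delayed_user_message(self) -> None:
        # Delayed mode: only the finalization event fires from this path.
        content = self._flush_user_message_to_context()
        self.events.append(("on_user_turn_message_finalized", content))

    def handle_end_of_turn(self) -> None:
        if self._delayed:
            content: Optional[str] = None  # transcript has not arrived yet
            self.events.append(("on_user_turn_stopped", content))
        else:
            self._finalize_user_turn()

    def handle_timer_elapsed(self) -> None:
        if self._delayed:
            self._finalize_delayed_user_message()
```

In this model `on_user_turn_stopped` is appended only inside `handle_end_of_turn` or the default-mode path it calls, which is the single-origin property the fix is after.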
Standardizes vocabulary across docstrings and comments:
- "Default mode" / "Delayed-transcript mode" (with
``_expect_delayed_transcripts == False/True``)
- "End of turn" (not "audible stop" or "audible end of turn")
- "User message finalization" (the moment user-text is flushed to
context + ``on_user_turn_message_finalized`` fires)
- "Pending finalization" (the in-between state in delayed mode)
- Transcripts (plural — the aggregator combines multiple per turn)
The timer that triggers user message finalization is no longer
described as a "backstop" — it's the sole trigger for finalization
in delayed-transcript mode, not a fallback. Renamed accordingly:
``_pending_finalization_task``, ``_pending_finalization_handler``,
``_run_pending_finalization``, ``_discard_pending_finalization``.
Gives the two events separate message classes:
``UserTurnStoppedMessage.content`` is now ``str | None`` (``None``
at end-of-turn in delayed-transcript mode), and a new
``UserMessageFinalizedMessage`` carries the always-populated
``content`` for the finalization event.
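A sketch of the two shapes, modelling only the `content` field; the real classes may carry additional fields, and `describe` is a hypothetical handler added for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class UserTurnStoppedMessage:
    content: Optional[str]  # None at end-of-turn in delayed-transcript mode


@dataclass
class UserMessageFinalizedMessage:
    content: str  # always a string, in both modes


def describe(message: Union[UserTurnStoppedMessage, UserMessageFinalizedMessage]) -> str:
    """Handlers of the end-of-turn event need to tolerate a missing transcript."""
    if message.content is None:
        return "turn ended; transcript pending"
    return f"user said: {message.content}"
```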
Reframes the strategy mutations as part of configuring the flag
(not an "also" aside), and the ordering invariant in the test
docstring as flush-timing (not arrival-timing).
Adds five tests for the delayed-transcript flow on
`LLMUserAggregator`:
- basic flow: `on_user_turn_stopped` fires fast with empty content;
`on_user_turn_message_finalized` fires later with the populated
transcript; user message lands in context.
- backstop with no transcript: backstop timer still finalizes the
turn; message_finalized fires with empty content; no user message
added to context.
- next-turn precondition violation: a new VAD start fires while the
previous turn is still pending; the previous turn is force-flushed
before the new turn begins.
- context-order with assistant response: paired aggregators with a
late user transcript arriving before the assistant content streams;
verifies the user message lands in context before the assistant
message (the conversational-order invariant the design relies on).
- strategy mutation: explicit start/stop strategies are mutated by
the bundle: `TranscriptionUserTurnStartStrategy` is dropped from
start, and `wait_for_transcript` is flipped to False on the stop
strategy that had it explicitly set to True.
Tests patch `DEFAULT_TTFS_P99` to keep the backstop fast.
Add the configuration surface to drive a realtime service like Gemini
Live from local turn detection without paying user-transcript latency.
Cascaded pipelines wait for a transcript before ending the user's turn
because the downstream LLM needs the user's words recorded in context
— but that wait is pure latency in pipelines using local turn
detection to drive a realtime service, which consumes user audio
directly.
Set `wait_for_transcript_to_end_user_turn=False` on
`LLMUserAggregatorParams` to turn this on. With that single flag the
aggregator:
- drops `TranscriptionUserTurnStartStrategy` from the start strategies
(so late-arriving realtime transcripts don't trigger new turns),
- sets `wait_for_transcript=False` on any stop strategy that supports
it (so the turn ends on the audible end of the turn, without
waiting for a transcript),
- fires `on_user_turn_stopped` on the audible end of the turn with
empty `content` (since the transcript hasn't arrived), and
- defers the context flush until the transcript arrives or a backstop
timer fires.
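The first two bullets can be sketched as a pure function over strategy lists. Everything here is a stand-in: the classes reuse names from the text but model only what the bullets mention, and `OtherStartStrategy` and `apply_no_transcript_wait` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TranscriptionUserTurnStartStrategy:
    """Stand-in: starts a turn when a transcript arrives."""


@dataclass
class OtherStartStrategy:
    """Hypothetical placeholder for any start strategy that is kept."""


@dataclass
class SpeechTimeoutUserTurnStopStrategy:
    """Stand-in: only the flag discussed above is modelled."""

    wait_for_transcript: bool = True


def apply_no_transcript_wait(start: list, stop: list):
    # Drop transcript-driven start strategies so late transcripts cannot open a turn.
    kept_start = [s for s in start if not isinstance(s, TranscriptionUserTurnStartStrategy)]
    # Turn off the transcript wait on every stop strategy that exposes the flag.
    for strategy in stop:
        if hasattr(strategy, "wait_for_transcript"):
            strategy.wait_for_transcript = False
    return kept_start, stop
```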
A new `on_user_turn_message_finalized` event fires when the user's
message has been written to context. In the default mode it
coincides with `on_user_turn_stopped`; in the delayed-transcript mode
it fires later. Consumers that want the populated transcript should
subscribe to `on_user_turn_message_finalized` — it's the event that
always carries the user message, regardless of mode.
Strategy mutations are logged: loudly when the user passed their own
strategies (we're overwriting parts of their config), quietly
otherwise. The strategy-level `wait_for_transcript` parameter on
`TurnAnalyzerUserTurnStopStrategy` and `SpeechTimeoutUserTurnStopStrategy`
remains exposed for advanced cases.
The example `realtime-gemini-live-local-vad.py` demonstrates the full
pattern.
Until now, both TurnAnalyzerUserTurnStopStrategy and
SpeechTimeoutUserTurnStopStrategy waited for at least one transcript
before ending the user turn. That's the right behavior for cascaded
pipelines, where the downstream LLM can't respond until the user's
words are recorded in its context — but it's pure latency in pipelines
using local turn detection to drive a realtime service like Gemini
Live.
Add a `require_transcript: bool | None = None` parameter to both
strategies. When None (the default), each strategy infers the value
from whether an STTMetadataFrame has been seen, a proxy for "does the
downstream LLM need the transcript in context?". An explicit True or
False overrides the heuristic.
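The tri-state resolution reads roughly like this; `resolve_require_transcript` and `stt_metadata_seen` are illustrative names, not the strategies' actual internals.

```python
from typing import Optional


def resolve_require_transcript(
    require_transcript: Optional[bool], stt_metadata_seen: bool
) -> bool:
    """Explicit True/False wins; None defers to whether STT metadata was seen."""
    if require_transcript is not None:
        return require_transcript
    return stt_metadata_seen
```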
When a transcript isn't required, the strategies also skip the
STT-waiting timeout in the VAD-stopped handler, so the user turn ends
as soon as the analyzer (or speech timer) concludes the turn is
complete.
A single Realtime API response can now contain more than one audio item
(observed with gpt-realtime-2), and the first item's audio.done can
arrive after deltas from the second have started arriving. Deltas still
arrive strictly in playback order across items, so we keep forwarding
them as received — matching OpenAI's reference implementation.
Adjusted OpenAIRealtimeLLMService so a multi-item response is treated as
one continuous TTS turn:
- _handle_evt_audio_delta: on item switch, advance the tracked item in
place (reset total_size) without emitting another TTSStartedFrame.
Truncation now always targets the latest item.
- _handle_evt_audio_done: debug-trace only; no longer pushes
TTSStoppedFrame.
- _handle_evt_response_done: pushes a single TTSStoppedFrame per turn,
bookending the audio with the Started pushed on the first delta.
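A reduced model of the three handlers, using strings in place of frames. `MultiItemAudioTracker` and its method names are hypothetical; only the behavior described in the bullets above is modelled.

```python
from typing import Optional


class MultiItemAudioTracker:
    def __init__(self) -> None:
        self.frames: list = []
        self.current_item_id: Optional[str] = None
        self.total_size = 0

    def on_audio_delta(self, item_id: str, audio: bytes) -> None:
        if self.current_item_id is None:
            # First delta of the response: open the TTS turn once.
            self.frames.append("TTSStartedFrame")
            self.current_item_id = item_id
        elif item_id != self.current_item_id:
            # Item switch: advance in place, no second Started frame.
            self.current_item_id = item_id
            self.total_size = 0
        self.total_size += len(audio)
        self.frames.append("audio")

    def on_audio_done(self, item_id: str) -> None:
        # Trace only; a late "done" for an earlier item must not stop the turn.
        pass

    def on_response_done(self) -> None:
        if self.current_item_id is not None:
            self.frames.append("TTSStoppedFrame")
        self.current_item_id = None
        self.total_size = 0
```

Because `current_item_id` always points at the most recent item, a truncation computed from `total_size` would target the latest item, matching the last-item-wins behavior the tests cover.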
Added tests covering single-item, overlapping multi-item, non-overlapping
multi-item, and interrupt-during-multi-item (last-item-wins truncation).