Document the local-VAD-plus-server-VAD duplicate-frames caveat

Realtime services that emit their own UserStartedSpeakingFrame /
UserStoppedSpeakingFrame (OpenAI Realtime, Azure Realtime, Inworld,
Grok/xAI Realtime) also call broadcast_interruption() from server VAD
events. Wiring local VAD on top — without first disabling the service's
server-side turn detection — causes the aggregator's VAD-driven
strategies to broadcast the same frames again, producing duplicates
downstream (TurnTrackingObserver, RTVI, AudioBufferProcessor would see
doubled events).
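
The duplication can be reproduced with a toy model. The sketch below is
not Pipecat code: ToyObserver, run_turn, and the string frame names are
hypothetical stand-ins. It assumes only that each active source
broadcasts its own start/stop pair for the same utterance.

```python
from collections import Counter


class ToyObserver:
    """Hypothetical stand-in for a downstream consumer such as a turn tracker."""

    def __init__(self):
        self.seen = Counter()

    def on_frame(self, frame_name):
        self.seen[frame_name] += 1


def run_turn(observer, server_vad_enabled, local_vad_enabled):
    """Simulate one user utterance with the given VAD sources active."""
    frames = ("UserStartedSpeakingFrame", "UserStoppedSpeakingFrame")
    if server_vad_enabled:
        # The realtime service broadcasts frames from its server VAD events.
        for name in frames:
            observer.on_frame(name)
    if local_vad_enabled:
        # The aggregator's VAD-driven strategies broadcast the same frames.
        for name in frames:
            observer.on_frame(name)


both = ToyObserver()
run_turn(both, server_vad_enabled=True, local_vad_enabled=True)

local_only = ToyObserver()
run_turn(local_only, server_vad_enabled=False, local_vad_enabled=True)

print(both.seen["UserStartedSpeakingFrame"])        # 2: doubled event
print(local_only.seen["UserStartedSpeakingFrame"])  # 1: single source
```

With both sources on, every downstream consumer sees each event twice.
Turning one source off restores a single start/stop pair per utterance.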

This is pre-existing behavior on main, not introduced by this PR. But
the realtime_service_mode "with local VAD" example invites the
question, so call out the intended pattern explicitly in four
docstrings:

  - RealtimeServiceModeConfig docstring: a Note section explaining
    that local VAD is intended for services without server-emitted
    turn frames OR services with server-side turn detection disabled,
    not for "both VADs on".
  - OpenAI Realtime, Inworld, Grok/xAI service docstrings: a one-line
    note that wiring local VAD requires disabling server-side turn
    detection first (with a pointer to the *-local-vad.py example for
    OpenAI Realtime).

No code change: the duplicate behavior is documented as
not recommended rather than auto-suppressed. Auto-suppression via
RealtimeServiceMetadataFrame.emits_user_turn_frames was considered but
rejected because it would be surprising: users who add local VAD
probably expect their VAD-driven frames to fire.
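
To make the trade-off concrete, here is a minimal sketch of the two
policies. It is illustrative only: ToyServiceMetadata and
should_broadcast_local_turn_frames are hypothetical names, and only the
emits_user_turn_frames flag is taken from the message above. It is not
the rejected implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyServiceMetadata:
    """Hypothetical stand-in for RealtimeServiceMetadataFrame."""

    emits_user_turn_frames: bool


def should_broadcast_local_turn_frames(metadata, local_vad_configured, auto_suppress):
    """Decide whether local-VAD-driven user-turn frames are broadcast."""
    if not local_vad_configured:
        return False
    if auto_suppress and metadata.emits_user_turn_frames:
        # Rejected policy: silently drop local frames when the service emits its own.
        return False
    # Documented policy: local VAD frames fire whenever local VAD is configured.
    return True


server_emits = ToyServiceMetadata(emits_user_turn_frames=True)

documented = should_broadcast_local_turn_frames(
    server_emits, local_vad_configured=True, auto_suppress=False
)
rejected = should_broadcast_local_turn_frames(
    server_emits, local_vad_configured=True, auto_suppress=True
)
```

Under the documented policy the local frames still fire, so avoiding
duplicates is left to the user's configuration. Under the rejected
policy a configured VAD analyzer would go quiet without any signal.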
Paul Kompfner
2026-05-21 12:19:24 -04:00
parent 92ced43300
commit be218e1941
4 changed files with 26 additions and 2 deletions

@@ -282,6 +282,18 @@ class RealtimeServiceModeConfig:
path when local turn detection drives a realtime conversation.
When True, turn-end strategies wait for transcripts to arrive
before signalling end-of-turn.
+Note:
+Local VAD (via ``LLMUserAggregatorParams.vad_analyzer``) is intended
+for use with realtime services that either don't emit
+``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``
+themselves (Gemini Live, AWS Nova Sonic, Ultravox) or have their
+server-side turn detection disabled (e.g. OpenAI Realtime with
+``turn_detection=False``). Wiring local VAD on top of a service
+whose server-side turn detection is also active produces duplicate
+user-turn frames from both sources — the service broadcasts them,
+and the aggregator's local-VAD-driven strategies broadcast them
+again. Pick one source.
"""
context_writes_await_turns: bool = False

@@ -204,7 +204,10 @@ class InworldRealtimeLLMService(LLMService[InworldRealtimeLLMAdapter]):
Emits ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` from
Inworld's server-side VAD events. Pair with
``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
-so context writes are decoupled from those frames.
+so context writes are decoupled from those frames. If you wire local
+VAD (``LLMUserAggregatorParams.vad_analyzer``) on top of this
+service, disable Inworld's server-side turn detection first;
+otherwise both sources broadcast duplicate user-turn frames.
Example::

@@ -213,6 +213,12 @@ class OpenAIRealtimeLLMService(LLMService[OpenAIRealtimeLLMAdapter]):
``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
so context writes are decoupled from those frames; see the
``examples/realtime/realtime-openai.py`` example.
+If you wire local VAD (``LLMUserAggregatorParams.vad_analyzer``) on
+top of this service, disable OpenAI's server-side turn detection
+first (``turn_detection=False``); otherwise both sources broadcast
+duplicate user-turn frames. See
+``examples/realtime/realtime-openai-local-vad.py``.
"""
Settings = OpenAIRealtimeLLMSettings

@@ -199,7 +199,10 @@ class GrokRealtimeLLMService(LLMService[GrokRealtimeLLMAdapter]):
Emits ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` from
Grok's server-side VAD events. Pair with
``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
-so context writes are decoupled from those frames.
+so context writes are decoupled from those frames. If you wire local
+VAD (``LLMUserAggregatorParams.vad_analyzer``) on top of this
+service, disable Grok's server-side turn detection first; otherwise
+both sources broadcast duplicate user-turn frames.
"""
Settings = GrokRealtimeLLMSettings