Document the local-VAD-plus-server-VAD duplicate-frames caveat

Realtime services that emit their own UserStartedSpeakingFrame /
UserStoppedSpeakingFrame (OpenAI Realtime, Azure Realtime, Inworld,
Grok/xAI Realtime) also call broadcast_interruption() from server VAD
events. Wiring local VAD on top, without first disabling the service's
server-side turn detection, causes the aggregator's VAD-driven
strategies to broadcast the same frames again, so downstream consumers
(TurnTrackingObserver, RTVI, AudioBufferProcessor) see doubled events.

This is pre-existing behavior on main, not introduced by this PR. But
the realtime_service_mode "with local VAD" example invites the
question, so call out the intended pattern explicitly. Update four
docstrings:

- RealtimeServiceModeConfig docstring: a Note section explaining
  that local VAD is intended for services without server-emitted
  turn frames OR services with server-side turn detection disabled,
  not for "both VADs on".
- OpenAI Realtime, Inworld, Grok/xAI service docstrings: a short
  note that wiring local VAD requires disabling server-side turn
  detection first (with a pointer to the *-local-vad.py example for
  OpenAI Realtime).

No code change: the duplicate behavior is documented as not
recommended rather than auto-suppressed. Auto-suppression via
RealtimeServiceMetadataFrame.emits_user_turn_frames was considered but
rejected because it would be surprising (users adding local VAD
probably expect their VAD-driven frames to fire).
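
The duplication described above can be illustrated with a small stand-alone sketch. It does not use the real framework: `Frame`, `ToyPipeline`, and `count_frames` are hypothetical names invented for this illustration, and the model assumes only what the message states, namely that the service and the aggregator each broadcast user-turn frames independently when their own detection is active.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Frame:
    """Stand-in for a user-turn frame; `source` records who broadcast it."""

    name: str
    source: str


@dataclass
class ToyPipeline:
    """Hypothetical model of two independent user-turn frame sources."""

    server_turn_detection: bool  # service-side VAD / turn detection enabled
    local_vad: bool  # aggregator-side VAD strategies enabled
    downstream: list = field(default_factory=list)

    def user_speaks(self) -> None:
        # One utterance: each active source broadcasts both frames on its own.
        for name in ("UserStartedSpeakingFrame", "UserStoppedSpeakingFrame"):
            if self.server_turn_detection:
                self.downstream.append(Frame(name, "service"))
            if self.local_vad:
                self.downstream.append(Frame(name, "aggregator"))


def count_frames(pipeline: ToyPipeline, name: str) -> int:
    """Count how many frames with the given name reached downstream."""
    return sum(1 for frame in pipeline.downstream if frame.name == name)


# Both sources active: every user-turn frame arrives twice.
both = ToyPipeline(server_turn_detection=True, local_vad=True)
both.user_speaks()

# Intended pattern: server-side turn detection disabled, local VAD only.
local_only = ToyPipeline(server_turn_detection=False, local_vad=True)
local_only.user_speaks()

print(count_frames(both, "UserStartedSpeakingFrame"))        # 2
print(count_frames(local_only, "UserStartedSpeakingFrame"))  # 1
```

With one source disabled, each downstream consumer sees exactly one start and one stop frame per utterance, which is the "pick one source" pattern the docstrings recommend.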
@@ -282,6 +282,18 @@ class RealtimeServiceModeConfig:
 path when local turn detection drives a realtime conversation.
 When True, turn-end strategies wait for transcripts to arrive
 before signalling end-of-turn.
+
+Note:
+Local VAD (via ``LLMUserAggregatorParams.vad_analyzer``) is intended
+for use with realtime services that either don't emit
+``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``
+themselves (Gemini Live, AWS Nova Sonic, Ultravox) or have their
+server-side turn detection disabled (e.g. OpenAI Realtime with
+``turn_detection=False``). Wiring local VAD on top of a service
+whose server-side turn detection is also active produces duplicate
+user-turn frames from both sources — the service broadcasts them,
+and the aggregator's local-VAD-driven strategies broadcast them
+again. Pick one source.
 """

 context_writes_await_turns: bool = False
@@ -204,7 +204,10 @@ class InworldRealtimeLLMService(LLMService[InworldRealtimeLLMAdapter]):
 Emits ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` from
 Inworld's server-side VAD events. Pair with
 ``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
-so context writes are decoupled from those frames.
+so context writes are decoupled from those frames. If you wire local
+VAD (``LLMUserAggregatorParams.vad_analyzer``) on top of this
+service, disable Inworld's server-side turn detection first;
+otherwise both sources broadcast duplicate user-turn frames.

 Example::
@@ -213,6 +213,12 @@ class OpenAIRealtimeLLMService(LLMService[OpenAIRealtimeLLMAdapter]):
 ``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
 so context writes are decoupled from those frames; see the
 ``examples/realtime/realtime-openai.py`` example.
+
+If you wire local VAD (``LLMUserAggregatorParams.vad_analyzer``) on
+top of this service, disable OpenAI's server-side turn detection
+first (``turn_detection=False``); otherwise both sources broadcast
+duplicate user-turn frames. See
+``examples/realtime/realtime-openai-local-vad.py``.
 """

 Settings = OpenAIRealtimeLLMSettings
@@ -199,7 +199,10 @@ class GrokRealtimeLLMService(LLMService[GrokRealtimeLLMAdapter]):
 Emits ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` from
 Grok's server-side VAD events. Pair with
 ``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
-so context writes are decoupled from those frames.
+so context writes are decoupled from those frames. If you wire local
+VAD (``LLMUserAggregatorParams.vad_analyzer``) on top of this
+service, disable Grok's server-side turn detection first; otherwise
+both sources broadcast duplicate user-turn frames.
 """

 Settings = GrokRealtimeLLMSettings