Show commented-out local-VAD opt-in in no-turn-frames examples
For services that don't emit UserStarted/StoppedSpeakingFrame (Nova Sonic, Gemini Live, Ultravox), the absence of those frames means downstream consumers (including the Pipecat Prebuilt UI) can't group user transcripts into discrete turns. The Tier 1 comment block already called this out, but the fix required users to know to add the SileroVADAnalyzer import and the LLMUserAggregatorParams kwarg themselves.

Make it a copy-paste: include the relevant imports and the `user_params=` argument as commented-out code, with a comment explaining that they're not strictly necessary for context aggregation but enable RTVI / turn-dependent processors when needed. Mirror the wording used in the LLMService startup log.

Also fix line wrapping in the llm_service.py startup log for the no-turn-frames case (a manual edit to that message left the last line over-length).
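To illustrate why the missing frames matter to a turn-grouping consumer, here is a minimal, self-contained sketch. The `Started`, `Stopped`, and `Transcript` classes and the `group_turns` helper are hypothetical stand-ins invented for this example; they are not Pipecat's frame classes or the Prebuilt UI's actual logic.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for turn-boundary and transcript events.
# These are NOT Pipecat classes; they only illustrate the grouping idea.
@dataclass
class Started:
    pass


@dataclass
class Stopped:
    pass


@dataclass
class Transcript:
    text: str


def group_turns(events):
    """Group transcripts into turns using Started/Stopped boundaries.

    Transcripts seen outside any Started/Stopped pair are returned
    separately, because there is no boundary to attach them to.
    """
    turns, ungrouped, current = [], [], None
    for event in events:
        if isinstance(event, Started):
            current = []
        elif isinstance(event, Stopped):
            if current is not None:
                turns.append(current)
            current = None
        elif isinstance(event, Transcript):
            (current if current is not None else ungrouped).append(event.text)
    return turns, ungrouped


with_frames = [
    Started(), Transcript("hi"), Transcript("there"), Stopped(),
    Started(), Transcript("bye"), Stopped(),
]
without_frames = [Transcript("hi"), Transcript("there"), Transcript("bye")]

print(group_turns(with_frames))     # ([['hi', 'there'], ['bye']], [])
print(group_turns(without_frames))  # ([], ['hi', 'there', 'bye'])
```

Without boundary events, the sketch has nothing to split on, so every transcript lands in a single undifferentiated list. That is the situation the commented-out local-VAD opt-in is meant to address.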
@@ -148,24 +148,30 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
 
     # Set up context and context management.
     #
-    # AWS Nova Sonic drives the conversation server-side. It does NOT emit
-    # UserStartedSpeakingFrame / UserStoppedSpeakingFrame, so pipeline
-    # processors that depend on those frames — RTVI client speech events,
+    # AWS Nova Sonic drives the conversation server-side and does not emit
+    # UserStartedSpeakingFrame / UserStoppedSpeakingFrame. Context
+    # aggregation still works with realtime_service_mode, but pipeline
+    # processors that depend on those frames (RTVI client speech events,
     # TurnTrackingObserver, AudioBufferProcessor turn recording,
-    # UserIdleController, user mute strategies, voicemail detector — won't
-    # activate with the default server-VAD-only setup. Context aggregation
-    # still works with realtime_service_mode.
+    # UserIdleController, user mute strategies, voicemail detector) won't
+    # activate. The Pipecat Prebuilt UI is one such consumer — without
+    # these frames it can't group user transcripts into discrete turns
+    # visually.
     #
-    # To produce these frames locally, wire a VAD analyzer (e.g.
-    # SileroVADAnalyzer) into LLMUserAggregatorParams. Caveat: locally-
-    # generated turn boundaries are a heuristic and may not match Nova
-    # Sonic's server-side turn decisions, which is what drives the
-    # conversation; the two can drift apart in subtle ways especially
-    # around interruptions and overlapping speech.
+    # If you need those frames, uncomment the SileroVADAnalyzer import
+    # above and the `user_params=` argument below. Note: local turn
+    # detection may not match Nova Sonic's actual server-side turn
+    # decisions and can desynchronize in subtle ways.
+    #
+    # from pipecat.audio.vad.silero import SileroVADAnalyzer
+    # from pipecat.processors.aggregators.llm_response_universal import (
+    #     LLMUserAggregatorParams,
+    # )
     context = LLMContext(tools=tools)
     user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
         context,
         realtime_service_mode=RealtimeServiceModeConfig(),
+        # user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
     )
 
     # Build the pipeline
@@ -131,22 +131,32 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
     llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
 
     context = LLMContext()
-    # Gemini Live drives the conversation server-side. It does NOT emit
-    # UserStartedSpeakingFrame / UserStoppedSpeakingFrame, so pipeline
-    # processors that depend on those frames — RTVI client speech events,
+    # Gemini Live drives the conversation server-side and does not emit
+    # UserStartedSpeakingFrame / UserStoppedSpeakingFrame. Context
+    # aggregation still works with realtime_service_mode, but pipeline
+    # processors that depend on those frames (RTVI client speech events,
     # TurnTrackingObserver, AudioBufferProcessor turn recording,
-    # UserIdleController, user mute strategies, voicemail detector — won't
-    # activate with the default server-VAD-only setup. Context aggregation
-    # still works with realtime_service_mode.
+    # UserIdleController, user mute strategies, voicemail detector) won't
+    # activate. The Pipecat Prebuilt UI is one such consumer — without
+    # these frames it can't group user transcripts into discrete turns
+    # visually.
     #
-    # To produce these frames locally, see `realtime-gemini-live-local-vad.py`.
-    # Caveat: locally-generated turn boundaries are a heuristic and may not
-    # match Gemini Live's server-side turn decisions, which is what drives the
-    # conversation; the two can drift apart in subtle ways especially around
-    # interruptions and overlapping speech.
+    # If you need those frames, uncomment the SileroVADAnalyzer import
+    # above and the `user_params=` argument below. Note: local turn
+    # detection may not match Gemini Live's actual server-side turn
+    # decisions and can desynchronize in subtle ways.
+    #
+    # For local VAD driving the conversation (server VAD disabled), see
+    # `realtime-gemini-live-local-vad.py` instead.
+    #
+    # from pipecat.audio.vad.silero import SileroVADAnalyzer
+    # from pipecat.processors.aggregators.llm_response_universal import (
+    #     LLMUserAggregatorParams,
+    # )
     user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
         context,
         realtime_service_mode=RealtimeServiceModeConfig(),
+        # user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
     )
 
     pipeline = Pipeline(
@@ -175,23 +175,29 @@ There is also a secret menu that changes daily. If the user asks about it, use t
 
     context = LLMContext([])
 
-    # Ultravox drives the conversation server-side. It does NOT emit
-    # UserStartedSpeakingFrame / UserStoppedSpeakingFrame, so pipeline
-    # processors that depend on those frames — RTVI client speech events,
+    # Ultravox drives the conversation server-side and does not emit
+    # UserStartedSpeakingFrame / UserStoppedSpeakingFrame. Context
+    # aggregation still works with realtime_service_mode, but pipeline
+    # processors that depend on those frames (RTVI client speech events,
     # TurnTrackingObserver, AudioBufferProcessor turn recording,
-    # UserIdleController, user mute strategies, voicemail detector — won't
-    # activate with this default setup. Context aggregation still works
-    # with realtime_service_mode.
+    # UserIdleController, user mute strategies, voicemail detector) won't
+    # activate. The Pipecat Prebuilt UI is one such consumer — without
+    # these frames it can't group user transcripts into discrete turns
+    # visually.
     #
-    # To produce these frames locally, wire a VAD analyzer (e.g.
-    # SileroVADAnalyzer) into LLMUserAggregatorParams. Caveat: locally-
-    # generated turn boundaries are a heuristic and may not match
-    # Ultravox's server-side turn decisions, which is what drives the
-    # conversation; the two can drift apart in subtle ways especially
-    # around interruptions and overlapping speech.
+    # If you need those frames, uncomment the SileroVADAnalyzer import
+    # above and the `user_params=` argument below. Note: local turn
+    # detection may not match Ultravox's actual server-side turn
+    # decisions and can desynchronize in subtle ways.
+    #
+    # from pipecat.audio.vad.silero import SileroVADAnalyzer
+    # from pipecat.processors.aggregators.llm_response_universal import (
+    #     LLMUserAggregatorParams,
+    # )
     user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
         context,
         realtime_service_mode=RealtimeServiceModeConfig(),
+        # user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
     )
 
     # Build the pipeline
@@ -410,9 +410,9 @@ class LLMService(UserTurnCompletionLLMServiceMixin, AIService, Generic[TAdapter]
                 "AudioBufferProcessor turn recording, UserIdleController, user "
                 "mute strategies, voicemail detector) will not activate. To "
                 "produce them locally, add `vad_analyzer=` to "
-                "LLMUserAggregatorParams. Note: local turn detection is a "
-                "heuristic; its boundaries may not match the provider's actual "
-                "server-side turn decisions and can desynchronize in subtle ways."
+                "LLMUserAggregatorParams. Note: local turn detection may not "
+                "match the provider's actual server-side turn decisions and "
+                "can desynchronize in subtle ways."
             )
 
     async def stop(self, frame: EndFrame):
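The startup-log message in the last hunk is built from adjacent string literals, which Python joins into a single string. As a small sketch of what that means for re-wrapping: `wrapped_a` copies the new fragments from the hunk, while `wrapped_b` and `fused` are illustrative variants that do not appear in the commit.

```python
# Fragments copied from the new startup-log text in the last hunk.
wrapped_a = (
    "LLMUserAggregatorParams. Note: local turn detection may not "
    "match the provider's actual server-side turn decisions and "
    "can desynchronize in subtle ways."
)

# The same text with the breaks moved (illustrative re-wrap, not from the commit).
wrapped_b = (
    "LLMUserAggregatorParams. Note: local turn detection "
    "may not match the provider's actual server-side turn "
    "decisions and can desynchronize in subtle ways."
)

# Adjacent literals are concatenated, so the wrap points do not affect the text.
print(wrapped_a == wrapped_b)  # True

# Dropping a fragment's trailing space silently fuses two words.
fused = (
    "local turn detection may not"
    "match the provider's decisions"
)
print("notmatch" in fused)  # True
```

When re-wrapping a message like this, the thing to check is that every fragment except the last still ends with a space.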