Compare commits

14 Commits

Author SHA1 Message Date
Paul Kompfner
ef46156c1b Rename *-local-vad.py example variants to *-locally-driven-turns.py
The "-local-vad" suffix was ambiguous now that local VAD has two
meanings in the realtime context: supplementary user-turn frames
broadcast alongside server-driven turns (commented-out opt-in in the
base examples), vs. local turn detection driving the conversation
end-to-end (server-side turn detection disabled, what these variant
files actually demonstrate). The new "-locally-driven-turns" suffix
matches the latter intent unambiguously.

Renames:

  realtime-openai-local-vad.py       → realtime-openai-locally-driven-turns.py
  realtime-gemini-live-local-vad.py  → realtime-gemini-live-locally-driven-turns.py
  realtime-grok-local-vad.py         → realtime-grok-locally-driven-turns.py
  realtime-inworld-local-vad.py      → realtime-inworld-locally-driven-turns.py

Plus the matching changelog fragments. Service docstrings and base
examples that referenced the old filenames now point at the new ones.
2026-05-21 15:26:27 -04:00
Paul Kompfner
86f9ad0c07 Show commented-out local-VAD opt-in in no-turn-frames examples
For services that don't emit UserStarted/StoppedSpeakingFrame (Nova
Sonic, Gemini Live, Ultravox), the absence of those frames means
downstream consumers — including the Pipecat Prebuilt UI — can't group
user transcripts into discrete turns. The Tier 1 comment block already
called this out, but the fix required users to know to add the
SileroVADAnalyzer import + LLMUserAggregatorParams kwarg themselves.

Make it a copy-paste: include the relevant imports and `user_params=`
argument as commented-out code, with a comment explaining that they're
not strictly necessary for context aggregation but enable RTVI / turn-
dependent processors when needed. Mirror the wording used in the
LLMService startup log.

Also fix line wrapping in the llm_service.py startup log for the no-
turn-frames case (manual edit to that message left the last line over-
length).
2026-05-21 15:13:52 -04:00
Paul Kompfner
cb9fe04e0b Wire Inworld manual-mode turn detection + add local-VAD example
Inworld Realtime's session properties accept turn_detection=None to put
the service into manual mode (matching OpenAI Realtime's
turn_detection=False), but the Pipecat integration hardcoded
_handle_user_stopped_speaking and _handle_interruption to assume
server-side VAD: both were no-ops on the client side because Inworld's
server normally handles commit/cancel/response.create automatically. In
manual mode the server doesn't, so local-VAD-driven turns stalled —
the bot never responded after the user stopped speaking, and
interruptions left the in-flight response running.

Mirror the OpenAI Realtime pattern: on user-stopped-speaking in manual
mode, send InputAudioBufferCommitEvent + ResponseCreateEvent; on
interruption in manual mode, send InputAudioBufferClearEvent +
ResponseCancelEvent. Gate both on a new _is_manual_turn_detection()
helper.

Add examples/realtime/realtime-inworld-local-vad.py, the matching
*-local-vad.py variant for parity with the OpenAI Realtime and Grok
Realtime variants, and point the Inworld service docstring at it.
2026-05-21 14:14:13 -04:00
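
The manual-mode gating described in cb9fe04e0b above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the Inworld service code: `ToyRealtimeClient` and the `"server_vad"` placeholder are hypothetical, and events are recorded as plain strings named after the event classes mentioned in the commit message rather than sent over a real connection.

```python
class ToyRealtimeClient:
    """Hypothetical stand-in for a realtime service client (illustration only)."""

    def __init__(self, turn_detection):
        # None means manual mode, per the commit message; any other value
        # stands for server-side turn detection in this sketch.
        self.turn_detection = turn_detection
        self.sent = []  # names of the events we would have sent

    def _is_manual_turn_detection(self):
        return self.turn_detection is None

    def handle_user_stopped_speaking(self):
        # With server-side VAD the server commits and responds on its own,
        # so the client does nothing. In manual mode the client must do it.
        if self._is_manual_turn_detection():
            self.sent += ["InputAudioBufferCommitEvent", "ResponseCreateEvent"]

    def handle_interruption(self):
        # Likewise, only manual mode needs an explicit clear + cancel.
        if self._is_manual_turn_detection():
            self.sent += ["InputAudioBufferClearEvent", "ResponseCancelEvent"]


manual = ToyRealtimeClient(turn_detection=None)
manual.handle_user_stopped_speaking()
manual.handle_interruption()

server_driven = ToyRealtimeClient(turn_detection="server_vad")  # placeholder value
server_driven.handle_user_stopped_speaking()
server_driven.handle_interruption()
```

In the sketch, `manual.sent` ends up holding all four event names in order, while `server_driven.sent` stays empty, which mirrors the no-op behavior the commit keeps for server-driven turns.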
Paul Kompfner
58027484b2 Add realtime-grok-local-vad.py example
Grok Realtime supports manual mode (turn_detection=None) which disables
its server-side VAD and lets local VAD drive turn boundaries — same
pattern as OpenAI Realtime's turn_detection=False. Add the matching
*-local-vad.py variant for parity, and point the Grok service docstring
at it.
2026-05-21 13:00:34 -04:00
Paul Kompfner
3b668dc937 Broadcast Nova Sonic interruption on FINAL TEXT contentEnd unconditionally
The TEXT INTERRUPTED branch gated broadcast_interruption() on
_assistant_is_responding, but Nova Sonic's mid-audio barge-in sequence
fires AUDIO contentEnd with stopReason=END_TURN first (per AWS docs),
which already flips _assistant_is_responding=False. By the time FINAL
TEXT contentEnd with stopReason=INTERRUPTED arrives — the actual
interruption notification — the guard skipped the broadcast and the
output transport's buffered audio kept playing.

Always broadcast on TEXT INTERRUPTED; keep the guard around
_report_assistant_response_ended() so we don't double-close the response
when AUDIO contentEnd already did it.
2026-05-21 12:37:04 -04:00
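
The ordering problem fixed in 3b668dc937 above is easier to see in a toy model. The sketch below is an assumption-laden illustration, not the Nova Sonic service: `ToyNovaSonicState` and its method names are hypothetical, and pipeline activity is recorded as strings. It shows the interruption broadcast running unconditionally while the response-ended report stays guarded.

```python
class ToyNovaSonicState:
    """Hypothetical model of the contentEnd handling (illustration only)."""

    def __init__(self):
        self.assistant_is_responding = False
        self.events = []  # what would have been pushed into the pipeline

    def start_response(self):
        self.assistant_is_responding = True

    def _report_assistant_response_ended(self):
        self.assistant_is_responding = False
        self.events.append("response_ended")

    def on_audio_content_end(self, stop_reason):
        # During mid-audio barge-in this arrives first with END_TURN and
        # already closes the response.
        if self.assistant_is_responding:
            self._report_assistant_response_ended()

    def on_final_text_content_end(self, stop_reason):
        if stop_reason == "INTERRUPTED":
            # Always broadcast: this is the actual interruption notification.
            self.events.append("interruption")
            # Keep the guard here so the response is not closed twice.
            if self.assistant_is_responding:
                self._report_assistant_response_ended()


barge_in = ToyNovaSonicState()
barge_in.start_response()
barge_in.on_audio_content_end("END_TURN")
barge_in.on_final_text_content_end("INTERRUPTED")

text_only = ToyNovaSonicState()
text_only.start_response()
text_only.on_final_text_content_end("INTERRUPTED")
```

With the old guard around the broadcast, the `barge_in` sequence would have produced no `"interruption"` entry at all, because the audio contentEnd had already cleared the responding flag.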
Paul Kompfner
be218e1941 Document the local-VAD-plus-server-VAD duplicate-frames caveat
Realtime services that emit their own UserStartedSpeakingFrame /
UserStoppedSpeakingFrame (OpenAI Realtime, Azure Realtime, Inworld,
Grok/xAI Realtime) also call broadcast_interruption() from server VAD
events. Wiring local VAD on top — without first disabling the service's
server-side turn detection — causes the aggregator's VAD-driven
strategies to broadcast the same frames again, producing duplicates
downstream (TurnTrackingObserver, RTVI, AudioBufferProcessor would see
doubled events).

This is pre-existing behavior on main, not introduced by this PR. But
the realtime_service_mode "with local VAD" example invites the
question, so call out the intended pattern explicitly. Update three
places:

  - RealtimeServiceModeConfig docstring: a Note section explaining
    that local VAD is intended for services without server-emitted
    turn frames OR services with server-side turn detection disabled,
    not for "both VADs on".
  - OpenAI Realtime, Inworld, Grok/xAI service docstrings: a one-line
    note that wiring local VAD requires disabling server-side turn
    detection first (with a pointer to the *-local-vad.py example for
    OpenAI Realtime).

No code change — the duplicate behavior is documented as
not-recommended rather than auto-suppressed. Auto-suppression via
RealtimeServiceMetadataFrame.emits_user_turn_frames was considered but
rejected for surprise-factor (users adding local VAD probably expect
their VAD-driven frames to fire).
2026-05-21 12:19:24 -04:00
Paul Kompfner
92ced43300 Add Phase 2 changelog fragments for example migration
2026-05-21 11:25:29 -04:00
Paul Kompfner
bff741a647 Migrate realtime examples to RealtimeServiceModeConfig
Pass realtime_service_mode=RealtimeServiceModeConfig() through every
realtime LLM service example (base, async-tool, video, text-output,
persistent-context, update-settings, MCP) so context aggregation uses
the new realtime-mode semantics instead of relying on local VAD as a
workaround.

Where examples previously wired SileroVADAnalyzer into
LLMUserAggregatorParams to coax turn frames out of services that don't
emit them server-side (AWS Nova Sonic, Ultravox, Gemini Live), the local
VAD is now removed. realtime_service_mode keeps context writes correct
without it, and the Phase 1.5 server-side InterruptionFrame fixes for
Nova Sonic and Ultravox keep the bot from talking past the user when
they barge in.

Transcript-logging event handlers move from on_user_turn_stopped /
on_assistant_turn_stopped to on_user_message_added /
on_assistant_message_added, which carry the finalized text in realtime
mode (the turn-stopped events fire before the message is finalized, so
their `content` is None in that mode).

For services that don't emit user-turn frames (Gemini Live, AWS Nova
Sonic, Ultravox) the example now carries a Tier 1 comment block that
spells out which downstream processors won't activate, how to add local
VAD if needed, and the caveat that locally-generated turn boundaries
are a heuristic that may diverge from server-side ground truth.

Adds examples/realtime/realtime-openai-local-vad.py, a new variant of
the OpenAI Realtime example that disables OpenAI's server-side turn
detection and drives turn boundaries locally — useful when you want a
turn analyzer like LocalSmartTurnV3 to decide when the user is done
speaking. Server-emitted turn frames are still preferred when available.

The Gemini Live local-VAD variant already existed; it's been updated in
place rather than rewritten.
2026-05-21 11:25:29 -04:00
Paul Kompfner
20d9bf4af6 Document user-turn-frame behavior in realtime service docstrings
Each realtime LLM service docstring now states whether the service emits
UserStartedSpeakingFrame / UserStoppedSpeakingFrame from server-side turn
signals, and what that implies for the rest of the pipeline.

For the services that don't (Gemini Live, AWS Nova Sonic, Ultravox), the
docstring spells out which downstream processors won't activate (RTVI
client speech events, TurnTrackingObserver, AudioBufferProcessor turn
recording, UserIdleController, user mute strategies, voicemail detector),
points at realtime_service_mode for correct context-write semantics, and
notes the option of wiring local VAD plus the caveat that locally-
generated turn boundaries are a heuristic that may not match the
provider's server-side turn decisions.

For the services that do (OpenAI Realtime, Inworld, Grok/xAI Realtime),
the docstring confirms turn frames are emitted from server VAD and
points at realtime_service_mode.
2026-05-21 11:25:29 -04:00
Paul Kompfner
a00211627f Surface server-side interruption from Nova Sonic and Ultravox
BaseOutputTransport only clears buffered audio mid-playback on
InterruptionFrame. Realtime services stream audio downstream as fast as
they produce it, and playback necessarily trails the buffer — so when the
user interrupts, the bot keeps talking past the interruption unless the
service surfaces the interruption to the pipeline.

Two realtime services were missing this signal:

  - AWS Nova Sonic acknowledged the INTERRUPTED stop reason internally
    (closing its own response state) but never broadcast InterruptionFrame.
  - Ultravox's playback_clear_buffer message — the server's explicit
    "drop buffered output audio" signal for interruptions — was not
    handled at all.

In both cases the latent bug was masked by enabling local VAD on the
user aggregator, which produced UserStartedSpeakingFrame and triggered
the aggregator-side interruption path. The realtime context aggregator
work makes local VAD optional, so the underlying gap needs fixing first.

Wire broadcast_interruption() into both services on the server-side
interruption signal, firing before the response-end signal so the
assistant aggregator marks the message interrupted=True before
LLMFullResponseEndFrame closes the turn.
2026-05-21 11:25:29 -04:00
Paul Kompfner
11d7fcf174 Add changelog fragments for realtime service mode
Fragments use the +<name> prefix so they show up under "Unreleased"
without a PR-number suffix; rename to <PR#>.<type>.md before merge.
2026-05-21 11:25:29 -04:00
Paul Kompfner
1fe8cf5289 Add RealtimeServiceModeConfig to LLMContextAggregatorPair
Decouple context management from turn frames and transcripts when a
realtime LLM service drives the conversation. Three problems with today's
behavior:

  - Some realtime services (Gemini Live, AWS Nova Sonic, Ultravox) emit
    no UserStarted/StoppedSpeakingFrame at all, so the aggregator — which
    writes user messages on those frames — doesn't write to context
    correctly without them.
  - The workaround (local VAD on the aggregator) generates turn
    boundaries that don't match the provider's server-side ground truth,
    and the per-service "do I need it?" rule is hard to keep straight.
  - When local turn detection is the intended driver, turn-end strategies
    still wait for transcripts on the latency critical path.

Add a realtime_service_mode: RealtimeServiceModeConfig | None = None
kwarg on LLMContextAggregatorPair. When set, the pair switches both
halves to trailing context writes: user messages are flushed on the first
assistant content frame, assistant messages on the next user transcript,
both halves on EndFrame. Turn-end strategies stop waiting for transcripts
by default. Two fine-grained boolean fields (context_writes_await_turns,
turns_await_transcripts) let callers dial back to cascade-style behavior
selectively; their invalid combination is rejected in __post_init__.

The bifurcation is dispatch-only: seven branch points across the two
halves, each at method entry, each delegating to a mode-pure private
method. Cross-half coordination uses an asyncio.Lock and a back-reference
shared by both halves; the assistant signals user.flush() on
LLMFullResponseStartFrame, and the user signals assistant.flush() on the
first new transcript after the assistant turn. The mechanism reuses the
existing push_aggregation() — no parallel write path.

Two new events fire when messages are flushed to context:
on_user_message_added and on_assistant_message_added. In cascade mode
they coincide with the existing turn-stopped events; in realtime mode
(where the turn-stopped event fires before the message is finalized)
they're the canonical way to subscribe to "context just updated, here's
the text."

UserTurnStoppedMessage.content is now typed str | None to reflect that
realtime mode fires the event with None.

When a RealtimeServiceMetadataFrame arrives and realtime_service_mode is
None, the aggregator logs a one-time INFO recommendation pointing users
at the option.
2026-05-21 11:25:29 -04:00
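
The "trailing context writes" described in 1fe8cf5289 above can be approximated with a small model. This is a sketch under stated assumptions, not the `LLMContextAggregatorPair` implementation: `ToyRealtimeContext` is hypothetical, it collapses both halves into one class, and it leaves out the `asyncio.Lock` and the events the commit mentions.

```python
class ToyRealtimeContext:
    """Hypothetical single-class model of both aggregator halves."""

    def __init__(self):
        self.messages = []        # the shared context
        self._user_buf = []       # user transcripts not yet written
        self._assistant_buf = []  # assistant text not yet written

    def _flush(self, role, buf):
        if buf:
            self.messages.append({"role": role, "content": " ".join(buf)})
            buf.clear()

    def on_user_transcript(self, text):
        # A new user transcript closes out any pending assistant message.
        self._flush("assistant", self._assistant_buf)
        self._user_buf.append(text)

    def on_assistant_response_start(self):
        # The assistant starting to respond closes out the user message,
        # so no user-turn frames are needed.
        self._flush("user", self._user_buf)

    def on_assistant_text(self, text):
        self._assistant_buf.append(text)

    def on_end(self):
        # End of pipeline: write whatever is still buffered.
        self._flush("user", self._user_buf)
        self._flush("assistant", self._assistant_buf)


ctx = ToyRealtimeContext()
ctx.on_user_transcript("what's")
ctx.on_user_transcript("the weather?")
ctx.on_assistant_response_start()
ctx.on_assistant_text("Sunny.")
ctx.on_user_transcript("thanks")
ctx.on_end()
```

After this sequence the context holds three messages in order: the joined user question, the assistant reply, and the final user message flushed at the end.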
Paul Kompfner
3247fd1188 Mark realtime LLM services with RealtimeServiceInfo + emit metadata at start
Realtime (speech-to-speech) LLM services need to advertise themselves to
the rest of the pipeline so downstream components can adapt. Add a new
RealtimeServiceMetadataFrame subtype of ServiceMetadataFrame, following
the STTMetadataFrame precedent.

LLMService gains a single ClassVar, _realtime_service_info, typed
RealtimeServiceInfo | None and defaulting to None. The presence of a
populated instance is what marks a service as realtime, and the
RealtimeServiceInfo dataclass carries the per-service knobs the rest of
the pipeline needs — currently just emits_user_turn_frames. Keeping it
all under one optional ClassVar avoids stranding realtime-only knobs on
the generic LLMService surface; non-realtime services keep the default
None and the realtime-specific machinery stays inert.

When _realtime_service_info is set, the base service auto-broadcasts
RealtimeServiceMetadataFrame right after StartFrame propagates downstream
(same ordering as STT). When emits_user_turn_frames is False, a one-time
INFO log at start explains which pipeline processors depend on those
frames (RTVI client speech events, TurnTrackingObserver,
AudioBufferProcessor turn recording, UserIdleController, user mute
strategies, voicemail detector) and how to add local VAD if needed.

Set the ClassVar on the seven realtime services: OpenAI Realtime, Azure
Realtime (via inheritance), Inworld, Grok/xAI Realtime all emit
user-turn frames; Gemini Live (and Gemini Live Vertex via inheritance),
AWS Nova Sonic, Ultravox do not.

In a follow-up commit, LLMContextAggregatorPair will consume
RealtimeServiceMetadataFrame to surface a one-time recommendation when
realtime_service_mode is not configured.
2026-05-20 15:08:40 -04:00
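
The single-optional-ClassVar pattern from 3247fd1188 above might look roughly like this. All `Toy*` names are hypothetical stand-ins rather than Pipecat classes, and frames are represented as strings so the block runs on its own.

```python
from dataclasses import dataclass
from typing import ClassVar, List, Optional


@dataclass(frozen=True)
class ToyRealtimeServiceInfo:
    """Hypothetical per-service knobs; the commit lists one so far."""

    emits_user_turn_frames: bool


class ToyLLMService:
    # None marks a non-realtime service; a populated instance marks realtime.
    _realtime_service_info: ClassVar[Optional[ToyRealtimeServiceInfo]] = None

    def startup_frames(self) -> List[str]:
        frames = ["StartFrame"]
        info = self._realtime_service_info
        if info is not None:
            # The metadata frame follows StartFrame only for realtime services.
            frames.append(
                f"RealtimeServiceMetadataFrame(emits_user_turn_frames={info.emits_user_turn_frames})"
            )
        return frames


class ToyCascadeLLM(ToyLLMService):
    pass  # keeps the default None, so the realtime machinery stays inert


class ToyServerVADRealtimeLLM(ToyLLMService):
    _realtime_service_info = ToyRealtimeServiceInfo(emits_user_turn_frames=True)


class ToyDerivedRealtimeLLM(ToyServerVADRealtimeLLM):
    pass  # inherits the ClassVar, as the commit describes for Azure Realtime


cascade_frames = ToyCascadeLLM().startup_frames()
derived_frames = ToyDerivedRealtimeLLM().startup_frames()
```

Keeping every realtime-only setting inside one optional object means a subclass opts in with a single assignment, and derived services pick it up through ordinary attribute lookup.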
Paul Kompfner
9f0a60b995 Add wait_for_transcript flag on user-turn stop strategies
SpeechTimeoutUserTurnStopStrategy and TurnAnalyzerUserTurnStopStrategy
both gate end-of-turn on a transcript arriving. That's the right default
for cascade STT/LLM/TTS pipelines, but it puts transcripts on the latency
critical path in pipelines where local turn detection is the intended
driver of end-of-turn — typically realtime LLM services consuming audio
directly. Closed PR #4480 explored this same fix in isolation.

Add wait_for_transcript: bool = True to both strategies. False makes the
strategy signal end-of-turn as soon as VAD / the turn analyzer reports
end-of-speech, independent of transcripts. The default preserves existing
behavior. LLMContextAggregatorPair will flip this in realtime mode in a
follow-up commit.
2026-05-20 14:07:58 -04:00
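
The effect of `wait_for_transcript` in 9f0a60b995 above can be illustrated with a toy strategy. This is a simplified sketch, assuming a strategy that only tracks two booleans; `ToyTurnStopStrategy` is hypothetical and omits the timeouts and turn-analyzer logic of the real strategies.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTurnStopStrategy:
    """Hypothetical stand-in for a user-turn stop strategy."""

    wait_for_transcript: bool = True
    _speech_ended: bool = field(default=False, init=False)
    _has_transcript: bool = field(default=False, init=False)

    def on_speech_ended(self) -> bool:
        """Called when VAD or the turn analyzer reports end-of-speech."""
        self._speech_ended = True
        return self._turn_ended()

    def on_transcript(self, text: str) -> bool:
        """Called when a transcript arrives; empty text does not count."""
        self._has_transcript = self._has_transcript or bool(text.strip())
        return self._turn_ended()

    def _turn_ended(self) -> bool:
        if not self._speech_ended:
            return False
        # With wait_for_transcript=False, end-of-speech alone is enough.
        return self._has_transcript or not self.wait_for_transcript


cascade = ToyTurnStopStrategy()  # default keeps the existing behavior
realtime = ToyTurnStopStrategy(wait_for_transcript=False)

cascade_on_vad = cascade.on_speech_ended()
realtime_on_vad = realtime.on_speech_ended()
cascade_on_text = cascade.on_transcript("hello")
```

Here the default strategy does not end the turn until the transcript arrives, while the `wait_for_transcript=False` instance ends it as soon as end-of-speech is reported.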
62 changed files with 2226 additions and 206 deletions

View File

@@ -0,0 +1 @@
- Fixed `InworldRealtimeLLMService` not supporting manual-mode turn detection (`session_properties.audio.input.turn_detection=None`). Previously `_handle_user_stopped_speaking` and `_handle_interruption` assumed Inworld's server-side VAD handled commit/cancel/response.create automatically and were no-ops on the client side. In manual mode the server doesn't, so local-VAD-driven turns stalled: the bot never responded after the user stopped speaking, and interruptions didn't cancel the in-flight response. Wire the explicit `InputAudioBufferCommitEvent` + `ResponseCreateEvent` on user-stopped-speaking and `InputAudioBufferClearEvent` + `ResponseCancelEvent` on interruption, gated on a new `_is_manual_turn_detection()` check (mirroring the pattern in `OpenAIRealtimeLLMService`).

View File

@@ -0,0 +1 @@
- Fixed AWS Nova Sonic not surfacing server-side interruption. When the user interrupted the bot mid-response, the `INTERRUPTED` stop reason was acknowledged internally but no `InterruptionFrame` was emitted, so `BaseOutputTransport` kept draining its audio buffer and the bot kept talking past the interruption. Nova Sonic now broadcasts `InterruptionFrame` on both `INTERRUPTED` paths (text-stage and audio-stage). This was previously masked by enabling local VAD on the user aggregator, which generated `UserStartedSpeakingFrame` and triggered the aggregator-side interruption path; the fix makes the behavior correct without local VAD as a workaround.

View File

@@ -0,0 +1 @@
- Migrated all realtime LLM service examples (OpenAI Realtime, Azure Realtime, Inworld, Grok/xAI Realtime, Gemini Live, Gemini Live Vertex, AWS Nova Sonic, Ultravox) — base examples, `persistent-context-*`, `update-settings/llm/*`, and the Gemini Live MCP example — to use `LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())`. Where examples previously wired `SileroVADAnalyzer` into `LLMUserAggregatorParams` as a workaround for missing turn frames, the local VAD has been removed; the realtime service mode + the Phase 1.5 interruption fixes for Nova Sonic and Ultravox make this safe. Transcript-logging event handlers have moved from `on_user_turn_stopped` / `on_assistant_turn_stopped` to the new `on_user_message_added` / `on_assistant_message_added` events, which carry the finalized message text. Examples for services without server-side user-turn frames (Gemini Live, AWS Nova Sonic, Ultravox) include a Tier 1 comment block explaining what doesn't activate without those frames and how to add local VAD if needed; the corresponding service docstrings have the same warning.

View File

@@ -0,0 +1 @@
- Added `examples/realtime/realtime-grok-locally-driven-turns.py`, a variant of the base Grok Realtime example that disables Grok's server-side turn detection (`turn_detection=None`, manual mode) and instead drives turn boundaries locally with `SileroVADAnalyzer` wired into the user aggregator. Mirrors the OpenAI Realtime locally-driven-turns variant. Server-emitted turn frames are preferred when available.

View File

@@ -0,0 +1 @@
- Added `examples/realtime/realtime-inworld-locally-driven-turns.py`, a variant of the base Inworld Realtime example that disables Inworld's server-side turn detection (`turn_detection=None`, manual mode) and instead drives turn boundaries locally with `SileroVADAnalyzer` wired into the user aggregator. Mirrors the OpenAI Realtime and Grok Realtime locally-driven-turns variants. Server-emitted turn frames are preferred when available.

View File

@@ -0,0 +1 @@
- Added a startup INFO log on realtime LLM services that don't emit `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` (Gemini Live, AWS Nova Sonic, Ultravox). The log spells out which downstream processors depend on those frames (RTVI client speech events, `TurnTrackingObserver`, `AudioBufferProcessor` turn recording, `UserIdleController`, user mute strategies, voicemail detector) and how to opt into local VAD when needed.

View File

@@ -0,0 +1 @@
- Added `examples/realtime/realtime-openai-locally-driven-turns.py`, a variant of the base OpenAI Realtime example that disables OpenAI's server-side turn detection (`turn_detection=False`) and instead drives turn boundaries locally with `SileroVADAnalyzer` wired into the user aggregator. Use this variant if you need a turn analyzer like `LocalSmartTurnV3` to decide when the user is done speaking, or if you need `UserStartedSpeakingFrame` / `UserStoppedSpeakingFrame` to fire from the same source as `InterruptionFrame`. Server-emitted turn frames are preferred when available.

View File

@@ -0,0 +1 @@
- Added `RealtimeServiceMetadataFrame`, broadcast at pipeline start by realtime LLM services (OpenAI Realtime, Azure Realtime, Inworld, Grok/xAI Realtime, Gemini Live, AWS Nova Sonic, Ultravox). The context aggregator pair listens for it and, when `realtime_service_mode` isn't configured, logs a one-time INFO recommendation pointing users at the option and the `on_user_turn_stopped` timing change it implies.

View File

@@ -0,0 +1 @@
- Added `RealtimeServiceModeConfig` and a new `realtime_service_mode` kwarg on `LLMContextAggregatorPair`, opting the pair into realtime (speech-to-speech) LLM behavior. When set, user messages are written to context when the assistant response starts rather than on user-turn-end frames — so context stays correct even when the realtime service emits no turn frames at all — and, by default, turn-end strategies stop waiting for transcripts before signalling end-of-turn, keeping transcript latency off the critical path in local-VAD-driven realtime pipelines. Both behaviors are individually controllable via the `context_writes_await_turns` and `turns_await_transcripts` fields. Cascade (non-realtime) behavior is unchanged when the kwarg is omitted.

View File

@@ -0,0 +1 @@
- Added `on_user_message_added` and `on_assistant_message_added` event handlers on `LLMUserAggregator` and `LLMAssistantAggregator`. Each fires when its respective message is flushed to context and carries the finalized content. In cascade mode they coincide with `on_user_turn_stopped` / `on_assistant_turn_stopped`; in realtime mode (where turn-stop fires before the message is finalized) they're the canonical way to subscribe to "context just updated, here's the text."

View File

@@ -0,0 +1 @@
- Fixed Ultravox Realtime not surfacing server-side interruption. The server sends a `playback_clear_buffer` message when the user interrupts the bot mid-speech, instructing clients to drop buffered output audio; this was previously unhandled, so `BaseOutputTransport` kept playing the buffered audio and the bot kept talking past the interruption. Ultravox now broadcasts `InterruptionFrame` on `playback_clear_buffer`. This was previously masked by enabling local VAD on the user aggregator, which generated `UserStartedSpeakingFrame` and triggered the aggregator-side interruption path; the fix makes the behavior correct without local VAD as a workaround.

View File

@@ -0,0 +1 @@
- `UserTurnStoppedMessage.content` is now typed `str | None`. In realtime mode (`RealtimeServiceModeConfig(context_writes_await_turns=False)`) the user message isn't finalized at turn-stop time, so `content` is `None`; subscribers wanting the finalized text should use the new `on_user_message_added` event. Cascade behavior is unchanged.

View File

@@ -0,0 +1 @@
- `SpeechTimeoutUserTurnStopStrategy` and `TurnAnalyzerUserTurnStopStrategy` now accept a `wait_for_transcript: bool = True` kwarg. When set to `False`, the strategy signals end-of-turn as soon as VAD / the turn analyzer reports end-of-speech rather than waiting for a transcript — useful when local turn detection is the intended driver of a realtime conversation. `LLMContextAggregatorPair` flips this for you when `realtime_service_mode` is configured with the default `turns_await_transcripts=False`.

View File

@@ -11,7 +11,6 @@ from dotenv import load_dotenv
 from loguru import logger
 from mcp.client.session_group import StreamableHttpParameters
-from pipecat.audio.vad.silero import SileroVADAnalyzer
 from pipecat.frames.frames import LLMRunFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
@@ -19,7 +18,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.llm_context import LLMContext
 from pipecat.processors.aggregators.llm_response_universal import (
     LLMContextAggregatorPair,
-    LLMUserAggregatorParams,
+    RealtimeServiceModeConfig,
 )
 from pipecat.runner.types import RunnerArguments
 from pipecat.runner.utils import create_transport
@@ -84,7 +83,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
     context = LLMContext([{"role": "user", "content": "Please introduce yourself."}])
     user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
         context,
-        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+        realtime_service_mode=RealtimeServiceModeConfig(),
     )
     pipeline = Pipeline(

View File

@@ -15,7 +15,6 @@ from loguru import logger
 from pipecat.adapters.schemas.function_schema import FunctionSchema
 from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
 from pipecat.frames.frames import LLMRunFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
@@ -23,7 +22,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.llm_context import LLMContext
 from pipecat.processors.aggregators.llm_response_universal import (
     LLMContextAggregatorPair,
-    LLMUserAggregatorParams,
+    RealtimeServiceModeConfig,
 )
 from pipecat.runner.types import RunnerArguments
 from pipecat.runner.utils import create_transport
@@ -241,7 +240,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
     context = LLMContext(tools=tools)
     user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
         context,
-        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+        realtime_service_mode=RealtimeServiceModeConfig(),
     )
     pipeline = Pipeline(

View File

@@ -33,6 +33,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.llm_context import LLMContext
 from pipecat.processors.aggregators.llm_response_universal import (
     LLMContextAggregatorPair,
+    RealtimeServiceModeConfig,
 )
 from pipecat.runner.types import RunnerArguments
 from pipecat.runner.utils import create_transport
@@ -203,7 +204,10 @@ Remember, your responses should be short - just one or two sentences usually."""
     llm.register_function("load_conversation", load_conversation)
     context = LLMContext([{"role": "developer", "content": "Say hello!"}], tools)
-    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
+    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
+        context,
+        realtime_service_mode=RealtimeServiceModeConfig(),
+    )
     pipeline = Pipeline(
         [

View File

@@ -15,7 +15,6 @@ from loguru import logger
 from pipecat.adapters.schemas.function_schema import FunctionSchema
 from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
 from pipecat.frames.frames import LLMRunFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
@@ -23,7 +22,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.llm_context import LLMContext
 from pipecat.processors.aggregators.llm_response_universal import (
     LLMContextAggregatorPair,
-    LLMUserAggregatorParams,
+    RealtimeServiceModeConfig,
 )
 from pipecat.runner.types import RunnerArguments
 from pipecat.runner.utils import create_transport
@@ -217,7 +216,7 @@ Remember, your responses should be short. Just one or two sentences, usually."""
     context = LLMContext([{"role": "developer", "content": "Say hello!"}], tools)
     user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
         context,
-        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+        realtime_service_mode=RealtimeServiceModeConfig(),
     )
     pipeline = Pipeline(

View File

@@ -24,7 +24,6 @@ from loguru import logger
 from pipecat.adapters.schemas.function_schema import FunctionSchema
 from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
 from pipecat.frames.frames import LLMRunFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
@@ -32,7 +31,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.llm_context import LLMContext
 from pipecat.processors.aggregators.llm_response_universal import (
     LLMContextAggregatorPair,
-    LLMUserAggregatorParams,
+    RealtimeServiceModeConfig,
 )
 from pipecat.runner.types import RunnerArguments
 from pipecat.runner.utils import create_transport
@@ -133,7 +132,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
     context = LLMContext(tools=tools)
     user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
         context,
-        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+        realtime_service_mode=RealtimeServiceModeConfig(),
     )
     pipeline = Pipeline(

View File

@@ -15,7 +15,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -24,7 +23,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -148,10 +147,31 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm.register_function("get_current_weather", fetch_weather_from_api)
# Set up context and context management.
#
# AWS Nova Sonic drives the conversation server-side and does not emit
# UserStartedSpeakingFrame / UserStoppedSpeakingFrame. Context
# aggregation still works with realtime_service_mode, but pipeline
# processors that depend on those frames (RTVI client speech events,
# TurnTrackingObserver, AudioBufferProcessor turn recording,
# UserIdleController, user mute strategies, voicemail detector) won't
# activate. The Pipecat Prebuilt UI is one such consumer — without
# these frames it can't group user transcripts into discrete turns
# visually.
#
# If you need those frames, uncomment the SileroVADAnalyzer import
# above and the `user_params=` argument below. Note: local turn
# detection may not match Nova Sonic's actual server-side turn
# decisions and can desynchronize in subtle ways.
#
# from pipecat.audio.vad.silero import SileroVADAnalyzer
# from pipecat.processors.aggregators.llm_response_universal import (
# LLMUserAggregatorParams,
# )
context = LLMContext(tools=tools)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
realtime_service_mode=RealtimeServiceModeConfig(),
# user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
)
# Build the pipeline
@@ -195,14 +215,18 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Client disconnected")
await task.cancel()
@user_aggregator.event_handler("on_user_turn_stopped")
async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
# Nova Sonic doesn't emit user-turn frames so on_user_turn_stopped
# would never fire. The *_message_added events fire when messages are
# written to context and carry the finalized content; use those for
# transcript logging.
@user_aggregator.event_handler("on_user_message_added")
async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")
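
The two transcript handlers above share the same formatting logic. A minimal sketch of that logic as a standalone helper, assuming only that a message exposes `content` and an optional `timestamp`. The `TranscriptMessage` dataclass and `format_transcript_line` function are hypothetical stand-ins for illustration, not Pipecat types:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptMessage:
    """Hypothetical stand-in for the message objects passed to the handlers."""

    content: str
    timestamp: Optional[str] = None


def format_transcript_line(role: str, message: TranscriptMessage) -> str:
    """Build the same "[timestamp] role: content" line the handlers log."""
    # Prefix with "[timestamp] " only when a timestamp is present.
    timestamp = f"[{message.timestamp}] " if message.timestamp else ""
    return f"{timestamp}{role}: {message.content}"
```

Both `on_user_message_added` and `on_assistant_message_added` could delegate to a helper like this and differ only in the `role` argument.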

@@ -24,7 +24,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -32,7 +31,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -144,7 +143,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext(tools=tools)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

@@ -13,7 +13,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -21,7 +20,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -174,7 +173,7 @@ Remember, your responses should be short. Just one or two sentences, usually. Re
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

@@ -28,7 +28,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.google.gemini_live.llm import GeminiLiveLLMService
@@ -125,7 +128,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext()
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
[

@@ -15,7 +15,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.google.gemini_live.llm import GeminiLiveLLMService
@@ -158,7 +161,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
)
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
# Build the pipeline
pipeline = Pipeline(

@@ -15,7 +15,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.google.gemini_live.llm import GeminiLiveLLMService
@@ -84,7 +87,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
],
)
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
[

@@ -17,7 +17,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -148,7 +151,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
[{"role": "developer", "content": "Say hello."}],
)
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
[

@@ -9,7 +9,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -115,7 +118,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# Set up conversation context and management
context = LLMContext()
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
[

@@ -4,6 +4,29 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Gemini Live with locally-driven turn detection.

By default Gemini Live drives the conversation with its own server-side VAD
(see `realtime-gemini-live.py`). That setup doesn't surface
``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``, so pipeline
processors that depend on those frames (RTVI client speech events,
``TurnTrackingObserver``, ``AudioBufferProcessor`` turn recording,
``UserIdleController``, user mute strategies, voicemail detector) don't
activate.

This variant disables Gemini Live's server-side VAD
(``GeminiVADParams(disabled=True)``) and instead drives turn boundaries
locally with ``SileroVADAnalyzer`` wired into the user aggregator. Use this
variant if you need those downstream processors, or if you want a turn
analyzer like ``LocalSmartTurnV3`` to decide when the user is done speaking.

Caveat: locally-generated turn boundaries are a heuristic and may not match
the provider's actual server-side turn decisions, which is what really
drives the conversation. The two can drift apart in subtle, hard-to-debug
ways, especially around interruptions and overlapping speech. Prefer
server-emitted turn frames (i.e. the base `realtime-gemini-live.py` example)
unless you have a specific reason to drive turn detection locally.
"""
import os
@@ -20,6 +43,7 @@ from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -72,6 +96,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
user_params=LLMUserAggregatorParams(
vad_analyzer=SileroVADAnalyzer(),
),
@@ -107,14 +132,17 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info("Client disconnected")
await task.cancel()
@user_aggregator.event_handler("on_user_turn_stopped")
async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
# The *_message_added events fire when messages are written to context
# and carry the finalized content. In realtime mode the turn-stopped
# events fire before the message text is finalized.
@user_aggregator.event_handler("on_user_message_added")
async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")
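
The comment in this hunk describes an ordering: in realtime mode a turn-stopped event can fire before the transcript text is final, while a message-added event fires once the text is written to context. A toy model of that ordering follows. The `ToyAggregator` class and its event names are hypothetical and are not Pipecat's implementation; the sketch only illustrates why a handler bound to the later event sees the content:

```python
from typing import Callable, Dict, List, Optional, Tuple

Handler = Callable[[Optional[str]], None]


class ToyAggregator:
    """Hypothetical event source; illustrates event ordering only."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def on(self, event: str, handler: Handler) -> None:
        self._handlers.setdefault(event, []).append(handler)

    def _emit(self, event: str, content: Optional[str]) -> None:
        for handler in self._handlers.get(event, []):
            handler(content)

    def end_turn(self) -> None:
        # The turn boundary is known, but the transcript has not arrived yet.
        self._emit("turn_stopped", None)

    def finalize_transcript(self, text: str) -> None:
        # The text is final at the moment it is written to context.
        self._emit("message_added", text)


seen: List[Tuple[str, Optional[str]]] = []
aggregator = ToyAggregator()
aggregator.on("turn_stopped", lambda content: seen.append(("turn_stopped", content)))
aggregator.on("message_added", lambda content: seen.append(("message_added", content)))

aggregator.end_turn()
aggregator.finalize_transcript("hello there")
```

In this model, a transcript logger attached to `turn_stopped` would log `None`, which matches the reason the diff moves logging to the `*_message_added` events.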

@@ -18,7 +18,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.google.gemini_live.vertex.llm import GeminiLiveVertexLLMService
@@ -124,7 +127,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext([{"role": "developer", "content": "Say hello."}])
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
[

@@ -16,7 +16,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import (
create_transport,
@@ -64,7 +67,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
],
)
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
[

@@ -21,6 +21,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -130,8 +131,33 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
context = LLMContext()
# Server-side VAD is enabled by default; no local VAD is added.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
# Gemini Live drives the conversation server-side and does not emit
# UserStartedSpeakingFrame / UserStoppedSpeakingFrame. Context
# aggregation still works with realtime_service_mode, but pipeline
# processors that depend on those frames (RTVI client speech events,
# TurnTrackingObserver, AudioBufferProcessor turn recording,
# UserIdleController, user mute strategies, voicemail detector) won't
# activate. The Pipecat Prebuilt UI is one such consumer — without
# these frames it can't group user transcripts into discrete turns
# visually.
#
# If you need those frames, uncomment the SileroVADAnalyzer and
# LLMUserAggregatorParams imports and the `user_params=` argument
# below. Note: local turn detection may not match Gemini Live's actual
# server-side turn decisions and can desynchronize in subtle ways.
#
# For local VAD driving the conversation (server VAD disabled), see
# `realtime-gemini-live-locally-driven-turns.py` instead.
#
# from pipecat.audio.vad.silero import SileroVADAnalyzer
# from pipecat.processors.aggregators.llm_response_universal import (
# LLMUserAggregatorParams,
# )
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
# user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
)
pipeline = Pipeline(
[
@@ -166,14 +192,19 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info("Client disconnected")
await task.cancel()
@user_aggregator.event_handler("on_user_turn_stopped")
async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
# Gemini Live doesn't emit user-turn frames so on_user_turn_stopped
# would never fire. The *_message_added events fire when messages are
# written to context and carry the finalized content; use those for
# transcript logging regardless of whether the service emits turn
# frames.
@user_aggregator.event_handler("on_user_message_added")
async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")
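
The comment block earlier in this file's diff warns that supplementary local turn detection can desynchronize from the server's turn decisions. One way to observe that is to log turn-end times from both sources and compare them. A sketch, assuming you have collected both lists of timestamps (in seconds) yourself; the function name, the index-based pairing, and the 0.5 s tolerance are illustrative assumptions, not part of Pipecat:

```python
from typing import List, Tuple


def find_turn_drift(
    local_ends: List[float],
    server_ends: List[float],
    tolerance_secs: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Return (local, server, delta) for turn ends that differ by more than the tolerance.

    Turn ends are paired by position, so this assumes both sources detected
    the same number of turns in the same order.
    """
    drifted = []
    for local, server in zip(local_ends, server_ends):
        delta = server - local
        if abs(delta) > tolerance_secs:
            drifted.append((local, server, delta))
    return drifted


# Second turn: the server closed the turn a full second after local VAD did.
drift = find_turn_drift([1.0, 5.0, 9.0], [1.2, 6.0, 9.1])
```

Pairing by index is the simplest choice and breaks down when one side detects an extra turn, which is itself a form of the desynchronization the comment describes.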

@@ -29,7 +29,10 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.llm_service import FunctionCallParams
@@ -129,7 +132,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
)
context = LLMContext(tools=tools)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
[

@@ -0,0 +1,262 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Grok Realtime with locally-driven turn detection.

By default Grok Realtime drives the conversation with its own server-side
VAD (see `realtime-grok.py`). This variant disables server-side turn
detection (``turn_detection=None``, the "manual" mode in Grok's session
properties) and instead drives turn boundaries locally with
``SileroVADAnalyzer`` wired into the user aggregator. Use this variant if
you want a turn analyzer like ``LocalSmartTurnV3`` to decide when the user
is done speaking, or if you need ``UserStartedSpeakingFrame`` /
``UserStoppedSpeakingFrame`` to fire from the same source as
``InterruptionFrame``.

Caveat: locally-generated turn boundaries are a heuristic and may not match
the provider's actual server-side turn decisions. Prefer server-emitted
turn frames (i.e. the base `realtime-grok.py` example) unless you have a
specific reason to drive turn detection locally.
"""
import os
from datetime import datetime
from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.observers.loggers.transcription_log_observer import (
TranscriptionLogObserver,
)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.llm_service import FunctionCallParams
from pipecat.services.xai.realtime.events import SessionProperties
from pipecat.services.xai.realtime.llm import GrokRealtimeLLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
async def fetch_weather_from_api(params: FunctionCallParams):
"""Handle weather function calls."""
temperature = 75 if params.arguments.get("format") == "fahrenheit" else 24
await params.result_callback(
{
"conditions": "nice",
"temperature": temperature,
"format": params.arguments.get("format", "celsius"),
"timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
}
)
async def get_current_time(params: FunctionCallParams):
"""Handle time function calls."""
await params.result_callback(
{
"time": datetime.now().strftime("%H:%M:%S"),
"date": datetime.now().strftime("%Y-%m-%d"),
"timezone": "local",
}
)
async def get_restaurant_recommendation(params: FunctionCallParams):
"""Handle restaurant recommendation function calls."""
location = params.arguments.get("location", "unknown")
await params.result_callback(
{
"name": "The Golden Dragon",
"cuisine": "Chinese",
"location": location,
"rating": 4.5,
}
)
weather_function = FunctionSchema(
name="get_current_weather",
description="Get the current weather for a location",
properties={
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use.",
},
},
required=["location", "format"],
)
time_function = FunctionSchema(
name="get_current_time",
description="Get the current time and date",
properties={},
required=[],
)
restaurant_function = FunctionSchema(
name="get_restaurant_recommendation",
description="Get a restaurant recommendation for a location",
properties={
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
},
required=["location"],
)
tools = ToolsSchema(standard_tools=[weather_function, time_function, restaurant_function])
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
}
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info("Starting Grok Voice Agent bot")
session_properties = SessionProperties(
voice="Ara",
# Disable Grok's server-side turn detection (manual mode). This
# example drives turn boundaries locally via the SileroVADAnalyzer
# wired into the user aggregator below.
turn_detection=None,
)
llm = GrokRealtimeLLMService(
api_key=os.environ["XAI_API_KEY"],
settings=GrokRealtimeLLMService.Settings(
system_instruction="""You are a helpful and friendly AI assistant powered by Grok.
You have access to several tools:
- Weather information
- Current time
- Restaurant recommendations
- Web search (built-in)
- X/Twitter search (built-in)
Your voice and personality should be warm and engaging. Keep your responses
concise and conversational since this is a voice interaction.
If the user asks about current events or news, use web search.
If they ask about what people are saying on social media, use X search.
Always be helpful and proactive in offering assistance.""",
session_properties=session_properties,
),
)
llm.register_function("get_current_weather", fetch_weather_from_api)
llm.register_function("get_current_time", get_current_time)
llm.register_function("get_restaurant_recommendation", get_restaurant_recommendation)
context = LLMContext(
[{"role": "developer", "content": "Say hello and introduce yourself!"}],
tools,
)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
# Drive turn detection locally via SileroVADAnalyzer wired into the
# user aggregator. realtime_service_mode keeps context-write semantics
# correct and (by default) drops the transcript wait on turn-end so
# local VAD can drive turn boundaries on the latency-critical path.
realtime_service_mode=RealtimeServiceModeConfig(),
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
)
pipeline = Pipeline(
[
transport.input(),
user_aggregator,
llm,
transport.output(),
assistant_aggregator,
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
observers=[TranscriptionLogObserver()],
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info("Client connected")
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info("Client disconnected")
await task.cancel()
@user_aggregator.event_handler("on_user_message_added")
async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()
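
The function-call handlers in this file only touch `params.arguments` and `params.result_callback`, so they can be exercised without a live Grok session. A minimal sketch, assuming a hypothetical `FakeParams` stand-in (not a Pipecat class) that records whatever the handler passes to the callback; the handler body is copied from the file above:

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeParams:
    """Hypothetical stand-in exposing the two attributes the handlers use."""

    arguments: Dict[str, Any]
    results: List[Dict[str, Any]] = field(default_factory=list)

    async def result_callback(self, result: Dict[str, Any]) -> None:
        # Record the result instead of sending it back to a model.
        self.results.append(result)


async def get_restaurant_recommendation(params: FakeParams) -> None:
    """Handle restaurant recommendation function calls."""
    location = params.arguments.get("location", "unknown")
    await params.result_callback(
        {
            "name": "The Golden Dragon",
            "cuisine": "Chinese",
            "location": location,
            "rating": 4.5,
        }
    )


params = FakeParams(arguments={"location": "Austin, TX"})
asyncio.run(get_restaurant_recommendation(params))
```

Because the handler reads `location` with `.get(..., "unknown")`, calling it with empty arguments is also safe and yields `"unknown"` as the location.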

@@ -33,9 +33,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
# Note: Grok has built-in server-side VAD, so we don't need local VAD
# from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.observers.loggers.transcription_log_observer import (
TranscriptionLogObserver,
@@ -47,6 +44,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -212,7 +210,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
tools,
)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
realtime_service_mode=RealtimeServiceModeConfig(),
)
# Build the pipeline
# Note: In realtime mode, transcription comes from Grok (upstream),
@@ -248,15 +249,19 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info("Client disconnected")
await task.cancel()
# Log transcript updates
@user_aggregator.event_handler("on_user_turn_stopped")
async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
# Log transcript updates. In realtime mode the turn-stopped events
# fire before the message text is finalized (UserTurnStoppedMessage
# content is None), so subscribe to the *_message_added events
# instead — they fire when the message is written to context and
# carry the finalized content.
@user_aggregator.event_handler("on_user_message_added")
async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")

@@ -0,0 +1,235 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Inworld Realtime with locally-driven turn detection.

By default Inworld Realtime drives the conversation with its own
server-side semantic VAD (see `realtime-inworld.py`). This variant
disables server-side turn detection (``turn_detection=None``, the
"manual" mode in Inworld's session properties) and instead drives turn
boundaries locally with ``SileroVADAnalyzer`` wired into the user
aggregator. Use this variant if you want a turn analyzer like
``LocalSmartTurnV3`` to decide when the user is done speaking, or if you
need ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` to fire
from the same source as ``InterruptionFrame``.

Caveat: locally-generated turn boundaries are a heuristic and may not
match the provider's actual server-side turn decisions. Prefer
server-emitted turn frames (i.e. the base `realtime-inworld.py` example)
unless you have a specific reason to drive turn detection locally.
"""
import os
import random
from datetime import datetime
from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.observers.loggers.transcription_log_observer import (
TranscriptionLogObserver,
)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.inworld.realtime.events import (
AudioConfiguration,
AudioInput,
AudioOutput,
InputTranscription,
PCMAudioFormat,
SessionProperties,
)
from pipecat.services.inworld.realtime.llm import InworldRealtimeLLMService
from pipecat.services.llm_service import FunctionCallParams
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
async def fetch_weather_from_api(params: FunctionCallParams):
temperature = (
random.randint(60, 85)
if params.arguments["format"] == "fahrenheit"
else random.randint(15, 30)
)
await params.result_callback(
{
"conditions": "nice",
"temperature": temperature,
"format": params.arguments["format"],
"timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
}
)
weather_function = FunctionSchema(
name="get_current_weather",
description="Get the current weather",
properties={
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use.",
},
},
required=["location", "format"],
)
tools = ToolsSchema(standard_tools=[weather_function])
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
}


async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    logger.info("Starting Inworld Realtime bot (local VAD)")

    model = "openai/gpt-4.1-mini"
    voice = "Sarah"
    tts_model = "inworld-tts-2"
    stt_model = "assemblyai/u3-rt-pro"

    # Setting session_properties here replaces Inworld's defaults wholesale,
    # so we provide a complete SessionProperties, with turn_detection=None
    # (manual mode) so local VAD drives turn boundaries instead.
    session_properties = SessionProperties(
        model=model,
        output_modalities=["audio", "text"],
        audio=AudioConfiguration(
            input=AudioInput(
                format=PCMAudioFormat(rate=24000),
                transcription=InputTranscription(model=stt_model),
                turn_detection=None,
            ),
            output=AudioOutput(
                format=PCMAudioFormat(rate=24000),
                model=tts_model,
                voice=voice,
            ),
        ),
    )

    llm = InworldRealtimeLLMService(
        api_key=os.environ["INWORLD_API_KEY"],
        settings=InworldRealtimeLLMService.Settings(
            system_instruction="""You are a helpful and friendly AI assistant powered by Inworld.
Your voice and personality should be warm and engaging. Keep your responses
concise and conversational since this is a voice interaction.
Always be helpful and proactive in offering assistance.""",
            session_properties=session_properties,
        ),
    )

    # Note: function calling requires a paid Inworld account and a
    # function-calling-capable model
    llm.register_function("get_current_weather", fetch_weather_from_api)

    context = LLMContext(
        [{"role": "developer", "content": "Say hello and introduce yourself!"}],
        tools,
    )
    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
        context,
        # Drive turn detection locally via SileroVAD wired into the user
        # aggregator. realtime_service_mode keeps context-write semantics
        # correct and (by default) drops the transcript wait on turn-end so
        # local VAD can drive turn boundaries on the latency-critical path.
        realtime_service_mode=RealtimeServiceModeConfig(),
        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
    )

    pipeline = Pipeline(
        [
            transport.input(),
            user_aggregator,
            llm,
            transport.output(),
            assistant_aggregator,
        ]
    )

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            enable_metrics=True,
            enable_usage_metrics=True,
        ),
        idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
        observers=[TranscriptionLogObserver()],
    )

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        logger.info("Client connected")
        await task.queue_frames([LLMRunFrame()])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        logger.info("Client disconnected")
        await task.cancel()

    @user_aggregator.event_handler("on_user_message_added")
    async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
        timestamp = f"[{message.timestamp}] " if message.timestamp else ""
        logger.info(f"Transcript: {timestamp}user: {message.content}")

    @assistant_aggregator.event_handler("on_assistant_message_added")
    async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
        timestamp = f"[{message.timestamp}] " if message.timestamp else ""
        logger.info(f"Transcript: {timestamp}assistant: {message.content}")

    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
    await runner.run(task)


async def bot(runner_args: RunnerArguments):
    """Main bot entry point compatible with Pipecat Cloud."""
    transport = await create_transport(runner_args, transport_params)
    await run_bot(transport, runner_args)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()

View File

@@ -47,6 +47,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
+RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -149,7 +150,10 @@ Always be helpful and proactive in offering assistance.""",
tools,
)
-user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context)
+user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
+context,
+realtime_service_mode=RealtimeServiceModeConfig(),
+)
# Build the pipeline
pipeline = Pipeline(
@@ -182,13 +186,16 @@ Always be helpful and proactive in offering assistance.""",
logger.info("Client disconnected")
await task.cancel()
-@user_aggregator.event_handler("on_user_turn_stopped")
-async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
+# In realtime mode the turn-stopped events fire before the message
+# text is finalized; subscribe to the *_message_added events for the
+# finalized content.
+@user_aggregator.event_handler("on_user_message_added")
+async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
logger.info(f"Transcript: {timestamp}user: {message.content}")
-@assistant_aggregator.event_handler("on_assistant_turn_stopped")
-async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
+@assistant_aggregator.event_handler("on_assistant_message_added")
+async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
logger.info(f"Transcript: {timestamp}assistant: {message.content}")

View File

@@ -24,7 +24,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -32,7 +31,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -147,7 +146,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext(tools=tools)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

View File

@@ -10,7 +10,6 @@ import os
from dotenv import load_dotenv
from loguru import logger
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.observers.loggers.transcription_log_observer import TranscriptionLogObserver
from pipecat.pipeline.pipeline import Pipeline
@@ -19,7 +18,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import (
@@ -106,7 +105,7 @@ Remember, your responses should be short. Just one or two sentences, usually. Re
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

View File

@@ -0,0 +1,267 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""OpenAI Realtime with locally-driven turn detection.

By default OpenAI Realtime drives the conversation with its own server-side
VAD (see `realtime-openai.py`). This variant disables server-side turn
detection (``turn_detection=False``) and instead drives turn boundaries
locally with ``SileroVADAnalyzer`` wired into the user aggregator. This is
the path to take if you want a turn analyzer like ``LocalSmartTurnV3`` to
decide when the user is done speaking, or if you need ``UserStartedSpeakingFrame``
/ ``UserStoppedSpeakingFrame`` to fire from the same source as
``InterruptionFrame``.

Caveat: locally-generated turn boundaries are a heuristic and may not match
the provider's actual server-side turn decisions. With OpenAI Realtime,
server-side turn detection is generally what the service expects to drive
the conversation, and disabling it puts the responsibility on you. Prefer
server-emitted turn frames (i.e. the base `realtime-openai.py` example)
unless you have a specific reason to drive turn detection locally.
"""

import asyncio
import os
from datetime import datetime

from dotenv import load_dotenv
from loguru import logger

from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMSetToolsFrame
from pipecat.observers.loggers.transcription_log_observer import TranscriptionLogObserver
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    AssistantTurnStoppedMessage,
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
    RealtimeServiceModeConfig,
    UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.llm_service import FunctionCallParams
from pipecat.services.openai.realtime.events import (
    AudioConfiguration,
    AudioInput,
    InputAudioNoiseReduction,
    InputAudioTranscription,
    SessionProperties,
)
from pipecat.services.openai.realtime.llm import OpenAIRealtimeLLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams

load_dotenv(override=True)


async def fetch_weather_from_api(params: FunctionCallParams):
    temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
    await params.result_callback(
        {
            "conditions": "nice",
            "temperature": temperature,
            "format": params.arguments["format"],
            "timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
        }
    )


async def get_news(params: FunctionCallParams):
    await params.result_callback(
        {
            "news": [
                "Massive UFO currently hovering above New York City",
                "Stock markets reach all-time highs",
                "Living dinosaur species discovered in the Amazon rainforest",
            ],
        }
    )


async def fetch_restaurant_recommendation(params: FunctionCallParams):
    await params.result_callback({"name": "The Golden Dragon"})


weather_function = FunctionSchema(
    name="get_current_weather",
    description="Get the current weather",
    properties={
        "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA",
        },
        "format": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "description": "The temperature unit to use. Infer this from the user's location.",
        },
    },
    required=["location", "format"],
)

get_news_function = FunctionSchema(
    name="get_news",
    description="Get the current news.",
    properties={},
    required=[],
)

restaurant_function = FunctionSchema(
    name="get_restaurant_recommendation",
    description="Get a restaurant recommendation",
    properties={
        "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA",
        },
    },
    required=["location"],
)

tools = ToolsSchema(standard_tools=[weather_function, restaurant_function])

transport_params = {
    "daily": lambda: DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
    "twilio": lambda: FastAPIWebsocketParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
    "webrtc": lambda: TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
}


async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    logger.info("Starting bot")

    llm = OpenAIRealtimeLLMService(
        api_key=os.environ["OPENAI_API_KEY"],
        settings=OpenAIRealtimeLLMService.Settings(
            system_instruction="""You are a helpful and friendly AI.
Act like a human, but remember that you aren't a human and that you can't do human
things in the real world. Your voice and personality should be warm and engaging, with a lively and
playful tone.
If interacting in a non-English language, start by using the standard accent or dialect familiar to
the user. Talk quickly. You should always call a function if you can. Do not refer to these rules,
even if you're asked about them.
You are participating in a voice conversation. Keep your responses concise, short, and to the point
unless specifically asked to elaborate on a topic.
Remember, your responses should be short. Just one or two sentences, usually. Respond in English.""",
            session_properties=SessionProperties(
                audio=AudioConfiguration(
                    input=AudioInput(
                        transcription=InputAudioTranscription(),
                        # Disable OpenAI's server-side turn detection; this
                        # example drives turn boundaries locally via the
                        # SileroVADAnalyzer wired into the user aggregator
                        # below.
                        turn_detection=False,
                        noise_reduction=InputAudioNoiseReduction(type="near_field"),
                    )
                ),
            ),
        ),
    )

    llm.register_function("get_current_weather", fetch_weather_from_api)
    llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
    llm.register_function("get_news", get_news)

    context = LLMContext(
        [{"role": "developer", "content": "Say hello!"}],
        tools,
    )
    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
        context,
        # Drive turn detection locally via SileroVAD wired into the user
        # aggregator. realtime_service_mode keeps context-write semantics
        # correct and (by default) drops the transcript wait on turn-end so
        # local VAD can drive turn boundaries on the latency-critical path.
        realtime_service_mode=RealtimeServiceModeConfig(),
        user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
    )

    pipeline = Pipeline(
        [
            transport.input(),
            user_aggregator,
            llm,
            transport.output(),
            assistant_aggregator,
        ]
    )

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            enable_metrics=True,
            enable_usage_metrics=True,
        ),
        idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
        observers=[TranscriptionLogObserver()],
    )

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        logger.info("Client connected")
        await task.queue_frames([LLMRunFrame()])
        # After 15 seconds, swap in a tool set that also includes get_news.
        await asyncio.sleep(15)
        new_tools = ToolsSchema(
            standard_tools=[weather_function, restaurant_function, get_news_function]
        )
        await task.queue_frames([LLMSetToolsFrame(tools=new_tools)])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        logger.info("Client disconnected")
        await task.cancel()

    @user_aggregator.event_handler("on_user_message_added")
    async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
        timestamp = f"[{message.timestamp}] " if message.timestamp else ""
        line = f"{timestamp}user: {message.content}"
        logger.info(f"Transcript: {line}")

    @assistant_aggregator.event_handler("on_assistant_message_added")
    async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
        timestamp = f"[{message.timestamp}] " if message.timestamp else ""
        line = f"{timestamp}assistant: {message.content}"
        logger.info(f"Transcript: {line}")

    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
    await runner.run(task)


async def bot(runner_args: RunnerArguments):
    """Main bot entry point compatible with Pipecat Cloud."""
    transport = await create_transport(runner_args, transport_params)
    await run_bot(transport, runner_args)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()
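
The locally-driven variant hands turn boundaries to a local VAD. As a rough illustration of what "enough trailing silence ends the turn" means, here is a toy detector over per-frame speech flags. It is neither Silero nor Pipecat's implementation: `ToyTurnDetector`, `frame_ms`, and `stop_ms` are hypothetical names, and `stop_ms=500` only loosely mirrors the `stop_secs=0.5` setting that appears elsewhere in this diff.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTurnDetector:
    """Hypothetical stand-in for a local VAD; not Pipecat's or Silero's API."""

    frame_ms: int = 20  # assumed duration of one audio frame
    stop_ms: int = 500  # trailing silence required before the turn ends
    speaking: bool = field(default=False, init=False)
    silence_ms: int = field(default=0, init=False)
    events: list = field(default_factory=list, init=False)

    def push(self, is_speech: bool) -> None:
        """Feed one frame's speech/non-speech decision."""
        if is_speech:
            # Any speech resets the silence counter and may open a turn.
            self.silence_ms = 0
            if not self.speaking:
                self.speaking = True
                self.events.append("user_started_speaking")
        elif self.speaking:
            # Count silence only while a turn is open.
            self.silence_ms += self.frame_ms
            if self.silence_ms >= self.stop_ms:
                self.speaking = False
                self.silence_ms = 0
                self.events.append("user_stopped_speaking")


detector = ToyTurnDetector()
# 200 ms speech, 200 ms pause (too short to end the turn), 100 ms speech, 500 ms silence.
frames = [True] * 10 + [False] * 10 + [True] * 5 + [False] * 25
for is_speech in frames:
    detector.push(is_speech)
print(detector.events)
```

The short pause does not close the turn, which is the kind of heuristic judgment the docstring's caveat refers to: a server-side detector could reasonably decide differently on the same audio.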

View File

@@ -13,7 +13,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -21,7 +20,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -177,7 +176,7 @@ Remember, your responses should be short. Just one or two sentences, usually. Re
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

View File

@@ -14,7 +14,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMSetToolsFrame
from pipecat.observers.loggers.transcription_log_observer import TranscriptionLogObserver
from pipecat.pipeline.pipeline import Pipeline
@@ -24,7 +23,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -187,7 +186,13 @@ Remember, your responses should be short. Just one or two sentences, usually. Re
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+# OpenAI Realtime drives the conversation server-side and emits its
+# own UserStarted/StoppedSpeakingFrame from server VAD events, so
+# local VAD on the aggregator is unnecessary. realtime_service_mode
+# decouples context writes from turn frames and transcript-bound
+# turn-end. See `realtime-openai-locally-driven-turns.py` for the
+# variant that disables server VAD and drives turn detection locally.
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
@@ -251,15 +256,19 @@ Remember, your responses should be short. Just one or two sentences, usually. Re
logger.info(f"Client disconnected")
await task.cancel()
-# Log transcript updates
-@user_aggregator.event_handler("on_user_turn_stopped")
-async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
+# Log transcript updates. In realtime mode the turn-stopped events
+# fire before the message text is finalized (UserTurnStoppedMessage
+# content is None), so subscribe to the *_message_added events
+# instead; they fire when the message is written to context and
+# carry the finalized content.
+@user_aggregator.event_handler("on_user_message_added")
+async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
-@assistant_aggregator.event_handler("on_assistant_turn_stopped")
-async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
+@assistant_aggregator.event_handler("on_assistant_message_added")
+async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")
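
The comment in this hunk describes an ordering: in realtime mode the turn-stopped event fires before the transcript is final (content is `None`), and the `*_message_added` event fires later with the finalized text. A toy model of that ordering, assuming nothing about Pipecat's internals (`ToyAggregator` and `simulate_turn` are hypothetical names; only the two event names come from the diff), might look like this:

```python
import asyncio


class ToyAggregator:
    """Hypothetical event emitter; not Pipecat's aggregator."""

    def __init__(self):
        self._handlers = {}

    def event_handler(self, name):
        """Register a coroutine handler for the named event."""

        def decorator(fn):
            self._handlers.setdefault(name, []).append(fn)
            return fn

        return decorator

    async def _emit(self, name, *args):
        for fn in self._handlers.get(name, []):
            await fn(self, *args)

    async def simulate_turn(self, final_text):
        # The turn ends first; the transcript is not available yet.
        await self._emit("on_user_turn_stopped", None)
        # Later the transcript arrives and the message is written to context.
        await self._emit("on_user_message_added", final_text)


async def demo():
    log = []
    aggregator = ToyAggregator()

    @aggregator.event_handler("on_user_turn_stopped")
    async def on_user_turn_stopped(agg, content):
        log.append(("turn_stopped", content))

    @aggregator.event_handler("on_user_message_added")
    async def on_user_message_added(agg, content):
        log.append(("message_added", content))

    await aggregator.simulate_turn("what's the weather?")
    return log


log = asyncio.run(demo())
print(log)
```

A handler that logs transcripts from the turn-stopped event would only ever see `None` in this model, which is why the examples in this diff move their logging to the message-added events.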

View File

@@ -26,14 +26,13 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -42,8 +41,6 @@ from pipecat.services.ultravox.llm import OneShotInputParams, UltravoxRealtimeLL
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
-from pipecat.turns.user_stop import SpeechTimeoutUserTurnStopStrategy
-from pipecat.turns.user_turn_strategies import UserTurnStrategies
load_dotenv(override=True)
@@ -134,12 +131,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext([])
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(
-user_turn_strategies=UserTurnStrategies(
-stop=[SpeechTimeoutUserTurnStopStrategy()],
-),
-vad_analyzer=SileroVADAnalyzer(),
-),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

View File

@@ -12,8 +12,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
-from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
@@ -21,7 +19,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -32,8 +30,6 @@ from pipecat.services.ultravox.llm import OneShotInputParams, UltravoxRealtimeLL
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
-from pipecat.turns.user_stop import SpeechTimeoutUserTurnStopStrategy
-from pipecat.turns.user_turn_strategies import UserTurnStrategies
# Load environment variables
load_dotenv(override=True)
@@ -188,17 +184,9 @@ There is also a secret menu that changes daily. If the user asks about it, use t
context = LLMContext([])
-# Necessary to complete the function call lifecycle in Pipecat and
-# to produce user and assistant turn stopped events.
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(
-user_turn_strategies=UserTurnStrategies(
-stop=[SpeechTimeoutUserTurnStopStrategy()],
-),
-# Set the VAD analyzer to emulate timing of the model.
-vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.5)),
-),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
# Build the pipeline
@@ -234,14 +222,16 @@ There is also a secret menu that changes daily. If the user asks about it, use t
logger.info(f"Client disconnected")
await task.cancel()
-@user_aggregator.event_handler("on_user_turn_stopped")
-async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
+# Ultravox doesn't emit user-turn frames; subscribe to the
+# *_message_added events for the finalized message text.
+@user_aggregator.event_handler("on_user_message_added")
+async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
-@assistant_aggregator.event_handler("on_assistant_turn_stopped")
-async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
+@assistant_aggregator.event_handler("on_assistant_message_added")
+async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")

View File

@@ -12,7 +12,6 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
@@ -20,7 +19,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.runner.types import RunnerArguments
@@ -30,8 +29,6 @@ from pipecat.services.ultravox.llm import OneShotInputParams, UltravoxRealtimeLL
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
-from pipecat.turns.user_stop import SpeechTimeoutUserTurnStopStrategy
-from pipecat.turns.user_turn_strategies import UserTurnStrategies
# Load environment variables
load_dotenv(override=True)
@@ -178,18 +175,29 @@ There is also a secret menu that changes daily. If the user asks about it, use t
context = LLMContext([])
-# Necessary to complete the function call lifecycle in Pipecat and
-# to produce user and assistant turn stopped events.
+# Ultravox drives the conversation server-side and does not emit
+# UserStartedSpeakingFrame / UserStoppedSpeakingFrame. Context
+# aggregation still works with realtime_service_mode, but pipeline
+# processors that depend on those frames (RTVI client speech events,
+# TurnTrackingObserver, AudioBufferProcessor turn recording,
+# UserIdleController, user mute strategies, voicemail detector) won't
+# activate. The Pipecat Prebuilt UI is one such consumer: without
+# these frames it can't group user transcripts into discrete turns
+# visually.
+#
+# If you need those frames, uncomment the SileroVADAnalyzer import
+# above and the `user_params=` argument below. Note: local turn
+# detection may not match Ultravox's actual server-side turn
+# decisions and can desynchronize in subtle ways.
+#
+# from pipecat.audio.vad.silero import SileroVADAnalyzer
+# from pipecat.processors.aggregators.llm_response_universal import (
+# LLMUserAggregatorParams,
+# )
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(
-user_turn_strategies=UserTurnStrategies(
-stop=[SpeechTimeoutUserTurnStopStrategy()],
-),
-# Set the VAD analyzer to create reliable TTFB measurements and
-# user stop events.
-vad_analyzer=SileroVADAnalyzer(),
-),
+realtime_service_mode=RealtimeServiceModeConfig(),
+# user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
)
# Build the pipeline
@@ -224,14 +232,18 @@ There is also a secret menu that changes daily. If the user asks about it, use t
logger.info(f"Client disconnected")
await task.cancel()
-@user_aggregator.event_handler("on_user_turn_stopped")
-async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMessage):
+# Ultravox doesn't emit user-turn frames so on_user_turn_stopped
+# would never fire. The *_message_added events fire when messages are
+# written to context and carry the finalized content; use those for
+# transcript logging.
+@user_aggregator.event_handler("on_user_message_added")
+async def on_user_message_added(aggregator, message: UserTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}user: {message.content}"
logger.info(f"Transcript: {line}")
-@assistant_aggregator.event_handler("on_assistant_turn_stopped")
-async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
+@assistant_aggregator.event_handler("on_assistant_message_added")
+async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")

View File

@@ -10,7 +10,6 @@ import os
from dotenv import load_dotenv
from loguru import logger
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMUpdateSettingsFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -18,7 +17,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -60,7 +59,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext()
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

View File

@@ -11,7 +11,6 @@ from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.base_llm_adapter import LLMContextMessage
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMUpdateSettingsFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -20,7 +19,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -66,7 +65,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext(messages)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
@@ -88,8 +87,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
-@assistant_aggregator.event_handler("on_assistant_turn_stopped")
-async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
+@assistant_aggregator.event_handler("on_assistant_message_added")
+async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")

View File

@@ -10,7 +10,6 @@ import os
from dotenv import load_dotenv
from loguru import logger
-from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMUpdateSettingsFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -18,7 +17,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
-LLMUserAggregatorParams,
+RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -60,7 +59,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext()
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
-user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
+realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

@@ -10,7 +10,6 @@ import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMUpdateSettingsFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -18,7 +17,7 @@ from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -58,7 +57,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext()
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(

@@ -11,7 +11,6 @@ from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.base_llm_adapter import LLMContextMessage
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMUpdateSettingsFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -20,7 +19,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -63,7 +62,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext(messages)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
@@ -85,8 +84,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")

@@ -11,7 +11,6 @@ from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.base_llm_adapter import LLMContextMessage
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMUpdateSettingsFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -20,7 +19,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -63,7 +62,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext(messages)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
@@ -85,8 +84,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")

@@ -13,7 +13,6 @@ from loguru import logger
from pipecat.adapters.base_llm_adapter import LLMContextMessage
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMUpdateSettingsFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -22,7 +21,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
AssistantTurnStoppedMessage,
LLMContextAggregatorPair,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
@@ -74,7 +73,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context = LLMContext(messages)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
realtime_service_mode=RealtimeServiceModeConfig(),
)
pipeline = Pipeline(
@@ -96,8 +95,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@assistant_aggregator.event_handler("on_assistant_turn_stopped")
async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMessage):
@assistant_aggregator.event_handler("on_assistant_message_added")
async def on_assistant_message_added(aggregator, message: AssistantTurnStoppedMessage):
timestamp = f"[{message.timestamp}] " if message.timestamp else ""
line = f"{timestamp}assistant: {message.content}"
logger.info(f"Transcript: {line}")

@@ -1439,6 +1439,27 @@ class STTMetadataFrame(ServiceMetadataFrame):
ttfs_p99_latency: float
@dataclass
class RealtimeServiceMetadataFrame(ServiceMetadataFrame):
"""Metadata announcing a realtime (speech-to-speech) LLM service.
Broadcast by realtime LLM services at pipeline start so downstream
processors — notably ``LLMContextAggregatorPair`` — can detect that
a realtime service is in the pipeline. The aggregator uses this to
surface a one-time recommendation to opt in to
``RealtimeServiceModeConfig`` when it hasn't been configured.
Parameters:
emits_user_turn_frames: Whether this service emits
``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``
from server-side turn signals. False for services with no
server-side turn signals (e.g. Gemini Live, AWS Nova Sonic,
Ultravox).
"""
emits_user_turn_frames: bool = True
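
A minimal standalone sketch of how a downstream consumer might act on the `emits_user_turn_frames` flag. The `RealtimeServiceMetadata` dataclass, the `needs_local_vad` helper, and the service names used with them are hypothetical stand-ins, not Pipecat APIs; only the field name and its default of `True` come from the diff above.

```python
from dataclasses import dataclass


@dataclass
class RealtimeServiceMetadata:
    """Hypothetical stand-in for RealtimeServiceMetadataFrame."""

    service_name: str
    # Mirrors the default in the diff: most services emit turn frames.
    emits_user_turn_frames: bool = True


def needs_local_vad(meta: RealtimeServiceMetadata, want_turn_frames: bool) -> bool:
    """Local VAD is only worth wiring when turn frames are wanted
    and the service does not emit them itself."""
    return want_turn_frames and not meta.emits_user_turn_frames


# Illustrative service names only.
no_turn_frames = RealtimeServiceMetadata("gemini-live", emits_user_turn_frames=False)
with_turn_frames = RealtimeServiceMetadata("openai-realtime")
print(needs_local_vad(no_turn_frames, want_turn_frames=True))
print(needs_local_vad(with_turn_frames, want_turn_frames=True))
```

Keeping the default at `True` means a service has to opt out explicitly, which matches the docstring's framing of the no-turn-frames services as the exception.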
@dataclass
class ServiceSwitcherRequestMetadataFrame(ControlFrame):
"""Request a service to re-emit its metadata frames.

@@ -55,6 +55,7 @@ from pipecat.frames.frames import (
LLMThoughtEndFrame,
LLMThoughtStartFrame,
LLMThoughtTextFrame,
RealtimeServiceMetadataFrame,
StartFrame,
TextFrame,
TranscriptionFrame,
@@ -83,7 +84,11 @@ from pipecat.processors.aggregators.llm_context_summarizer import (
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.turns.user_idle_controller import UserIdleController
from pipecat.turns.user_mute import BaseUserMuteStrategy
from pipecat.turns.user_start import BaseUserTurnStartStrategy, UserTurnStartedParams
from pipecat.turns.user_start import (
BaseUserTurnStartStrategy,
TranscriptionUserTurnStartStrategy,
UserTurnStartedParams,
)
from pipecat.turns.user_stop import BaseUserTurnStopStrategy, UserTurnStoppedParams
from pipecat.turns.user_turn_completion_mixin import UserTurnCompletionConfig
from pipecat.turns.user_turn_controller import UserTurnController
@@ -258,6 +263,55 @@ class LLMAssistantAggregatorParams:
self.context_summarization_config = None
@dataclass
class RealtimeServiceModeConfig:
"""Configure an ``LLMContextAggregatorPair`` for use with a realtime LLM service.
Both fields default to False (the recommended realtime behavior, dropping
transcript-related waits at both points in the flow). Override individual
fields to dial back to cascade-style behavior selectively.
Parameters:
context_writes_await_turns: When False (default), context writes are
triggered by the content stream itself (transcripts and assistant
text frames), making writes independent of turn-frame availability
and timing. When True, user messages are written to context on
user-turn-end frames (cascade behavior).
turns_await_transcripts: When False (default), turn-end fires as soon
as VAD signals end of speech, avoiding latency on the critical
path when local turn detection drives a realtime conversation.
When True, turn-end strategies wait for transcripts to arrive
before signalling end-of-turn.
Note:
Local VAD (via ``LLMUserAggregatorParams.vad_analyzer``) is intended
for use with realtime services that either don't emit
``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``
themselves (Gemini Live, AWS Nova Sonic, Ultravox) or have their
server-side turn detection disabled (e.g. OpenAI Realtime with
``turn_detection=False``). Wiring local VAD on top of a service
whose server-side turn detection is also active produces duplicate
user-turn frames from both sources — the service broadcasts them,
and the aggregator's local-VAD-driven strategies broadcast them
again. Pick one source.
"""
context_writes_await_turns: bool = False
turns_await_transcripts: bool = False
def __post_init__(self):
"""Validate the field combination."""
if not self.turns_await_transcripts and self.context_writes_await_turns:
raise ValueError(
"Invalid combination: turns fire early (without transcripts) "
"but context writes wait on those turn frames — context would "
"be written with incomplete user messages. Either set "
"turns_await_transcripts=True (preserve transcript-aware "
"turn-end timing) or context_writes_await_turns=False "
"(decouple writes from turn frames)."
)
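
The validation rule above leaves three of the four flag combinations usable. The sketch below reproduces the rule in a standalone dataclass and enumerates all four; `ModeConfig` and `is_valid` are illustrative names introduced here, and the error text is shortened.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ModeConfig:
    """Standalone copy of the two flags and the validation rule."""

    context_writes_await_turns: bool = False
    turns_await_transcripts: bool = False

    def __post_init__(self):
        # Same condition as in the diff: early turns plus turn-gated writes.
        if not self.turns_await_transcripts and self.context_writes_await_turns:
            raise ValueError("writes wait on turn frames that fire before transcripts")


def is_valid(writes_await_turns: bool, turns_await_transcripts: bool) -> bool:
    """Return True when the combination passes validation."""
    try:
        ModeConfig(writes_await_turns, turns_await_transcripts)
        return True
    except ValueError:
        return False


# Keys are (context_writes_await_turns, turns_await_transcripts).
valid = {combo: is_valid(*combo) for combo in product([False, True], repeat=2)}
for (writes, turns), ok in valid.items():
    print(f"context_writes_await_turns={writes!s:5} turns_await_transcripts={turns!s:5} valid={ok}")
```

Only `context_writes_await_turns=True` with `turns_await_transcripts=False` is rejected, since that pairing would write user messages on turn frames that fire before the transcript has arrived.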
@dataclass
class UserTurnStoppedMessage:
"""A user turn stopped message containing a user transcript update.
@@ -266,13 +320,18 @@ class UserTurnStoppedMessage:
the aggregated transcript that is then used in the context.
Parameters:
content: The message content/text.
content: The message content/text. ``None`` in realtime mode
(``RealtimeServiceModeConfig(context_writes_await_turns=False)``)
when fired from a user-turn-stop frame, since the user message
hasn't been finalized at that point. Subscribers that need the
finalized text should listen to ``on_user_message_added``
instead.
timestamp: When the user turn started.
user_id: Optional identifier for the user.
"""
content: str
content: str | None
timestamp: str
user_id: str | None = None
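
Because `content` may now be `None`, existing `on_user_turn_stopped` handlers that format the text directly need a guard. A minimal sketch, using a local mirror of the message shape rather than the Pipecat class; `format_turn_stopped` is a hypothetical handler body.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserTurnStoppedMessage:
    """Local mirror of the message shape shown in the diff."""

    content: Optional[str]
    timestamp: str
    user_id: Optional[str] = None


def format_turn_stopped(message: UserTurnStoppedMessage) -> str:
    """Turn-stop handler body: content may be None in realtime mode."""
    if message.content is None:
        # The finalized text arrives later via on_user_message_added.
        return f"[{message.timestamp}] user turn ended (transcript pending)"
    return f"[{message.timestamp}] user: {message.content}"


print(format_turn_stopped(UserTurnStoppedMessage(None, "t0")))
print(format_turn_stopped(UserTurnStoppedMessage("hello", "t0")))
```

Handlers that only need the finalized text can skip the guard by subscribing to `on_user_message_added` instead, as the docstring recommends.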
@@ -567,6 +626,9 @@ class LLMUserAggregator(LLMContextAggregator):
context: LLMContext,
*,
params: LLMUserAggregatorParams | None = None,
_realtime_service_mode: RealtimeServiceModeConfig | None = None,
_paired_half: "LLMAssistantAggregator | None" = None,
_pair_lock: asyncio.Lock | None = None,
**kwargs,
):
"""Initialize the user context aggregator.
@@ -574,6 +636,14 @@ class LLMUserAggregator(LLMContextAggregator):
Args:
context: The LLM context for conversation storage.
params: Configuration parameters for aggregation behavior.
_realtime_service_mode: Pair-internal. Realtime-mode
configuration propagated from
``LLMContextAggregatorPair``. Not intended for direct use —
construct the aggregators via the pair.
_paired_half: Pair-internal. Back-reference to the paired
assistant aggregator for cross-half coordination.
_pair_lock: Pair-internal. Shared asyncio lock serializing
cross-half flushes.
**kwargs: Additional arguments.
"""
params = params or LLMUserAggregatorParams()
@@ -590,9 +660,23 @@ class LLMUserAggregator(LLMContextAggregator):
self._register_event_handler("on_user_turn_stop_timeout")
self._register_event_handler("on_user_turn_idle")
self._register_event_handler("on_user_turn_inference_triggered")
self._register_event_handler("on_user_message_added")
self._register_event_handler("on_user_mute_started")
self._register_event_handler("on_user_mute_stopped")
# Realtime-mode wiring. Defaults (no config) preserve cascade
# behavior: context writes happen on turn frames, turns wait
# for transcripts.
self._realtime_service_mode = _realtime_service_mode
self._paired_half = _paired_half
self._pair_lock = _pair_lock
if _realtime_service_mode is not None:
self._context_writes_await_turns = _realtime_service_mode.context_writes_await_turns
self._turns_await_transcripts = _realtime_service_mode.turns_await_transcripts
else:
self._context_writes_await_turns = True
self._turns_await_transcripts = True
user_turn_strategies = self._params.user_turn_strategies or UserTurnStrategies()
# Deprecated path: translate filter_incomplete_user_turns into
@@ -606,8 +690,19 @@ class LLMUserAggregator(LLMContextAggregator):
)
self._params.user_turn_strategies = user_turn_strategies
# Realtime mutation: when turns shouldn't wait for transcripts,
# drop the transcription-based start strategy and flip the
# wait_for_transcript flag on stop strategies that expose it. The
# set of strategies that support it intentionally stays narrow —
# the flag was reintroduced specifically for this realtime path.
if not self._turns_await_transcripts:
self._apply_realtime_strategy_mutations(user_turn_strategies)
self._user_is_muted = False
self._user_turn_start_timestamp = ""
# Tracks whether the one-time realtime-mode recommendation log has
# already fired for this session; see _handle_realtime_service_metadata.
self._realtime_recommendation_logged = False
# Full transcript across the user turn. Each
# `_on_user_turn_inference_triggered` push captures only the
# new segment since the previous push (push_aggregation resets
@@ -717,6 +812,9 @@ class LLMUserAggregator(LLMContextAggregator):
await self.push_frame(frame, direction)
elif isinstance(frame, LLMSetToolChoiceFrame):
self.set_tool_choice(frame.tool_choice)
elif isinstance(frame, RealtimeServiceMetadataFrame):
await self._handle_realtime_service_metadata(frame)
await self.push_frame(frame, direction)
else:
await self.push_frame(frame, direction)
@@ -734,9 +832,16 @@ class LLMUserAggregator(LLMContextAggregator):
self._context.add_message({"role": self.role, "content": aggregation})
await self.push_context_frame()
message = UserTurnStoppedMessage(
content=aggregation, timestamp=self._user_turn_start_timestamp
)
await self._call_event_handler("on_user_message_added", message)
return aggregation
async def _start(self, frame: StartFrame):
self._validate_realtime_pairing()
if self._vad_controller:
await self._vad_controller.setup(self.task_manager)
@@ -748,13 +853,138 @@ class LLMUserAggregator(LLMContextAggregator):
await s.setup(self.task_manager)
async def _stop(self, frame: EndFrame):
await self._maybe_emit_user_turn_stopped(on_session_end=True)
if not self._context_writes_await_turns:
# Realtime: flush trailing user content directly. The
# on_user_turn_stopped event already fired (if turn frames
# were emitted), so don't re-fire it from session end.
await self.push_aggregation()
else:
await self._maybe_emit_user_turn_stopped(on_session_end=True)
await self._cleanup()
async def _cancel(self, frame: CancelFrame):
await self._maybe_emit_user_turn_stopped(on_session_end=True)
if not self._context_writes_await_turns:
await self.push_aggregation()
else:
await self._maybe_emit_user_turn_stopped(on_session_end=True)
await self._cleanup()
def _validate_realtime_pairing(self):
"""Validate the realtime-mode wiring set by ``LLMContextAggregatorPair``.
Realtime mode requires both halves to be paired through the
``LLMContextAggregatorPair`` so cross-half flushes can find each
other. Direct construction of a half with the private realtime
kwargs is not supported.
"""
if not self._context_writes_await_turns:
if self._paired_half is None:
raise RuntimeError(
f"{self}: realtime_service_mode is configured but this user "
"aggregator has no paired assistant aggregator. Construct "
"the pair via LLMContextAggregatorPair("
"context, realtime_service_mode=RealtimeServiceModeConfig())."
)
if self._paired_half is not None:
if (
self._context_writes_await_turns != self._paired_half._context_writes_await_turns
or self._turns_await_transcripts != self._paired_half._turns_await_transcripts
):
raise RuntimeError(
f"{self}: realtime-mode config mismatch between user and "
"assistant halves. Use LLMContextAggregatorPair to construct "
"the pair so both halves share the same configuration."
)
def _apply_realtime_strategy_mutations(self, user_turn_strategies: UserTurnStrategies) -> None:
"""Mutate turn strategies for the realtime ``turns_await_transcripts=False`` path.
Drops ``TranscriptionUserTurnStartStrategy`` from the start strategies
(transcripts shouldn't start a turn when the realtime service drives
the conversation) and flips ``wait_for_transcript=False`` on stop
strategies that expose the flag, so end-of-turn fires as soon as VAD /
the turn analyzer reports end-of-speech.
"""
custom_strategies = self._params.user_turn_strategies is not None
start_strategies = user_turn_strategies.start or []
dropped: list[str] = []
kept_start: list[BaseUserTurnStartStrategy] = []
for s in start_strategies:
if isinstance(s, TranscriptionUserTurnStartStrategy):
dropped.append(s.__class__.__name__)
else:
kept_start.append(s)
user_turn_strategies.start = kept_start
flipped: list[str] = []
for s in user_turn_strategies.stop or []:
if hasattr(s, "wait_for_transcript"):
try:
s.wait_for_transcript = False
flipped.append(s.__class__.__name__)
except AttributeError:
# Strategy exposes the property but no setter — skip.
pass
if not dropped and not flipped:
return
msg = (
f"{self}: realtime_service_mode(turns_await_transcripts=False) — "
f"dropped {dropped or 'no'} start strategy(ies); set "
f"wait_for_transcript=False on {flipped or 'no'} stop strategy(ies)."
)
if custom_strategies:
logger.warning(msg)
else:
logger.debug(msg)
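
The drop-and-flip logic can be exercised in isolation. The sketch below uses hypothetical stand-in classes (`TranscriptionStart`, `VADStart`, `SpeechTimeoutStop`, `ReadOnlyStop`) in place of the real strategy types, and returns the results instead of logging them. It also shows why the `try`/`except AttributeError` is needed: `hasattr` is true for a read-only property, but assignment to it fails.

```python
class TranscriptionStart:
    """Stand-in for a transcript-driven start strategy."""


class VADStart:
    """Stand-in for a VAD-driven start strategy."""


class SpeechTimeoutStop:
    """Stand-in stop strategy exposing a writable flag."""

    def __init__(self):
        self.wait_for_transcript = True


class ReadOnlyStop:
    """Stand-in stop strategy whose flag is a property without a setter."""

    @property
    def wait_for_transcript(self) -> bool:
        return True


def apply_realtime_mutations(start, stop):
    """Return (kept_start, dropped_names, flipped_names)."""
    kept = [s for s in start if not isinstance(s, TranscriptionStart)]
    dropped = [type(s).__name__ for s in start if isinstance(s, TranscriptionStart)]
    flipped = []
    for s in stop:
        if hasattr(s, "wait_for_transcript"):
            try:
                s.wait_for_transcript = False
                flipped.append(type(s).__name__)
            except AttributeError:
                pass  # read-only property: leave the strategy unchanged
    return kept, dropped, flipped


writable, read_only = SpeechTimeoutStop(), ReadOnlyStop()
kept, dropped, flipped = apply_realtime_mutations(
    [VADStart(), TranscriptionStart()], [writable, read_only]
)
print(dropped, flipped)
```

Note that the real method mutates the caller's strategy objects in place, which is why it logs at warning level when the strategies were user-supplied.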
async def _handle_realtime_service_metadata(self, frame: RealtimeServiceMetadataFrame):
"""Handle a ``RealtimeServiceMetadataFrame`` broadcast by a realtime LLM service.
When ``realtime_service_mode`` isn't configured, log a one-time INFO
recommendation pointing the user at the option and warning about the
timing change on ``on_user_turn_stopped``. When it is configured, log
a confirming debug message. Fires at most once per session.
"""
if self._realtime_recommendation_logged:
return
self._realtime_recommendation_logged = True
if self._realtime_service_mode is None:
logger.info(
f"{self}: detected realtime service `{frame.service_name}` in the "
"pipeline. For correct context-write semantics with realtime "
"services, consider passing "
"realtime_service_mode=RealtimeServiceModeConfig() to "
"LLMContextAggregatorPair. Note: this changes when user messages "
"are written to context — they're written when the assistant "
"response starts rather than when the user-turn-end frame fires. "
"Subscribe to `on_user_message_added` instead of "
"`on_user_turn_stopped` if you need post-write semantics."
)
else:
logger.debug(
f"{self}: detected realtime service `{frame.service_name}`; "
"realtime_service_mode is configured."
)
async def _realtime_handoff_flush(self) -> None:
"""Flush pending user aggregation to context.
Called by the paired assistant half from
``_realtime_handle_llm_start`` (i.e. on ``LLMFullResponseStartFrame``)
to commit the in-flight user message before the assistant starts
its own turn. No-op when there's no pending content.
"""
if not self._aggregation:
return
# push_aggregation writes the message to context, pushes
# LLMContextFrame, and emits on_user_message_added.
await self.push_aggregation()
self._user_turn_start_timestamp = ""
async def _cleanup(self):
if self._vad_controller:
await self._vad_controller.cleanup()
@@ -826,6 +1056,10 @@ class LLMUserAggregator(LLMContextAggregator):
await self.push_context_frame()
async def _handle_transcription(self, frame: TranscriptionFrame):
if not self._context_writes_await_turns:
await self._realtime_handle_transcription(frame)
return
text = frame.text
# Make sure we really have some text.
@@ -839,6 +1073,30 @@ class LLMUserAggregator(LLMContextAggregator):
)
)
async def _realtime_handle_transcription(self, frame: TranscriptionFrame):
"""Realtime variant: signal the paired assistant half to flush, then append.
The first new user transcript after an assistant turn ends is what
commits the assistant's pending message to context. The flush is
idempotent (no-op when nothing pending), so it's safe to call on
every chunk.
"""
if not frame.text.strip():
return
if self._paired_half is not None and self._pair_lock is not None:
async with self._pair_lock:
await self._paired_half._realtime_handoff_flush()
if not self._user_turn_start_timestamp:
self._user_turn_start_timestamp = time_now_iso8601()
self._aggregation.append(
TextPartForConcatenation(
frame.text, includes_inter_part_spaces=frame.includes_inter_frame_spaces
)
)
async def _queued_broadcast_frame(self, frame_cls: type[Frame], **kwargs):
"""Broadcasts a frame upstream and queues it for internal processing.
@@ -903,6 +1161,17 @@ class LLMUserAggregator(LLMContextAggregator):
controller: UserTurnController,
strategy: BaseUserTurnStopStrategy,
):
if not self._context_writes_await_turns:
# Realtime: turn frames are supplemental — they don't drive
# context writes. Fire the event without pushing aggregation;
# the trailing-write path commits the user message instead.
logger.debug(
f"{self}: User turn inference triggered (strategy: {strategy}) "
"[realtime: event-only, no context push]"
)
await self._call_event_handler("on_user_turn_inference_triggered", strategy)
return
logger.debug(f"{self}: User turn inference triggered (strategy: {strategy})")
# Push aggregation now: this writes the user message segment to
@@ -935,6 +1204,17 @@ class LLMUserAggregator(LLMContextAggregator):
await self._user_idle_controller.process_frame(UserStoppedSpeakingFrame())
if not self._context_writes_await_turns:
# Realtime: turn frames are supplemental. The user message
# isn't finalized at turn-stop time — content is None.
# Subscribers wanting the finalized text use
# on_user_message_added instead.
message = UserTurnStoppedMessage(
content=None, timestamp=self._user_turn_start_timestamp
)
await self._call_event_handler("on_user_turn_stopped", strategy, message)
return
await self._maybe_emit_user_turn_stopped(strategy)
async def _on_reset_aggregation(
@@ -1030,6 +1310,9 @@ class LLMAssistantAggregator(LLMContextAggregator):
context: LLMContext,
*,
params: LLMAssistantAggregatorParams | None = None,
_realtime_service_mode: RealtimeServiceModeConfig | None = None,
_paired_half: "LLMUserAggregator | None" = None,
_pair_lock: asyncio.Lock | None = None,
**kwargs,
):
"""Initialize the assistant context aggregator.
@@ -1037,6 +1320,14 @@ class LLMAssistantAggregator(LLMContextAggregator):
Args:
context: The LLM context for conversation storage.
params: Configuration parameters for aggregation behavior.
_realtime_service_mode: Pair-internal. Realtime-mode
configuration propagated from
``LLMContextAggregatorPair``. Not intended for direct use —
construct the aggregators via the pair.
_paired_half: Pair-internal. Back-reference to the paired
user aggregator for cross-half coordination.
_pair_lock: Pair-internal. Shared asyncio lock serializing
cross-half flushes.
**kwargs: Additional arguments.
"""
params = params or LLMAssistantAggregatorParams()
@@ -1048,6 +1339,24 @@ class LLMAssistantAggregator(LLMContextAggregator):
)
self._params = params
# Realtime-mode wiring. Defaults (no config) preserve cascade
# behavior: write to context on LLMFullResponseEndFrame.
self._realtime_service_mode = _realtime_service_mode
self._paired_half = _paired_half
self._pair_lock = _pair_lock
if _realtime_service_mode is not None:
self._context_writes_await_turns = _realtime_service_mode.context_writes_await_turns
self._turns_await_transcripts = _realtime_service_mode.turns_await_transcripts
else:
self._context_writes_await_turns = True
self._turns_await_transcripts = True
# Realtime mode only. Holds the assistant turn's content between
# LLMFullResponseEndFrame (the moment we mark it ready to flush)
# and the next user transcript (the moment we actually write it
# to context).
self._pending_assistant_message_to_flush: dict | None = None
self._function_calls_in_progress: dict[str, FunctionCallInProgressFrame | None] = {}
self._function_calls_image_results: dict[str, UserImageRawFrame] = {}
self._context_updated_tasks: set[asyncio.Task] = set()
@@ -1084,6 +1393,7 @@ class LLMAssistantAggregator(LLMContextAggregator):
self._register_event_handler("on_assistant_turn_started")
self._register_event_handler("on_assistant_turn_stopped")
self._register_event_handler("on_assistant_message_added")
self._register_event_handler("on_assistant_thought")
self._register_event_handler("on_summary_applied")
@@ -1184,6 +1494,10 @@ class LLMAssistantAggregator(LLMContextAggregator):
if self._push_context_on_bot_stopped_speaking and not self._user_speaking:
logger.debug(f"{self}: Bot stopped speaking — pushing deferred context frame!")
await self.push_context_frame(FrameDirection.UPSTREAM)
elif isinstance(frame, RealtimeServiceMetadataFrame):
# The user half logs the one-time realtime-mode recommendation; the
# assistant half just passes the frame through.
await self.push_frame(frame, direction)
else:
await self.push_frame(frame, direction)
@@ -1192,9 +1506,37 @@ class LLMAssistantAggregator(LLMContextAggregator):
await self._summarizer.process_frame(frame)
async def _start(self, frame: StartFrame):
self._validate_realtime_pairing()
if self._summarizer:
await self._summarizer.setup(self.task_manager)
def _validate_realtime_pairing(self):
"""Validate the realtime-mode wiring set by ``LLMContextAggregatorPair``.
Realtime mode requires both halves to be paired through the
``LLMContextAggregatorPair`` so cross-half flushes can find each
other. Direct construction of a half with the private realtime
kwargs is not supported.
"""
if not self._context_writes_await_turns:
if self._paired_half is None:
raise RuntimeError(
f"{self}: realtime_service_mode is configured but this assistant "
"aggregator has no paired user aggregator. Construct the pair "
"via LLMContextAggregatorPair("
"context, realtime_service_mode=RealtimeServiceModeConfig())."
)
if self._paired_half is not None:
if (
self._context_writes_await_turns != self._paired_half._context_writes_await_turns
or self._turns_await_transcripts != self._paired_half._turns_await_transcripts
):
raise RuntimeError(
f"{self}: realtime-mode config mismatch between user and "
"assistant halves. Use LLMContextAggregatorPair to construct "
"the pair so both halves share the same configuration."
)
async def push_aggregation(self) -> str:
"""Push the current assistant aggregation with timestamp."""
if not self._aggregation:
@@ -1247,6 +1589,12 @@ class LLMAssistantAggregator(LLMContextAggregator):
async def _handle_end_or_cancel(self, frame: Frame):
await self._trigger_assistant_turn_stopped(interrupted=isinstance(frame, CancelFrame))
if not self._context_writes_await_turns:
# Flush any pending assistant content parked by
# _realtime_trigger_assistant_turn_stopped (i.e. the bot
# finished its last reply but no follow-up user transcript
# arrived before the session ended).
await self._realtime_handoff_flush()
if self._summarizer:
await self._summarizer.cleanup()
@@ -1349,26 +1697,7 @@ class LLMAssistantAggregator(LLMContextAggregator):
run_llm = True
if run_llm and not self._user_speaking:
if self.has_queued_frame(FunctionCallResultFrame):
# Another FunctionCallResultFrame is already queued. Defer the context push
# to bundle all results into a single LLM call instead of triggering one
# inference pass per result. The context will be pushed once the last
# function call in the queue is processed.
logger.debug(
f"{self}: More FunctionCallResultFrames queued — deferring context frame push."
)
elif self._bot_speaking:
# Defer the context frame push until the bot finishes speaking. If multiple
# function call results arrive while the bot is speaking, they all accumulate
# in the context and a single push is performed once speaking stops, preventing
# the LLM from running multiple times and producing duplicated responses.
# This should be an edge case, since it would require a FunctionCallResultFrame
# being queued between an LLM response start and end frame.
logger.debug(f"{self}: Bot is speaking — deferring context frame push.")
self._push_context_on_bot_stopped_speaking = True
else:
logger.debug(f"{self}: Pushing context frame!")
await self.push_context_frame(FrameDirection.UPSTREAM)
await self._maybe_push_context_after_function_result()
# Call the `on_context_updated` callback once the function call result
# is added to the context. Also, run this in a separate task to make
@@ -1379,6 +1708,42 @@ class LLMAssistantAggregator(LLMContextAggregator):
self._context_updated_tasks.add(task)
task.add_done_callback(self._context_updated_task_finished)
async def _maybe_push_context_after_function_result(self) -> None:
"""Decide whether to push a context frame after a function-call result.
Dispatched by mode. Cascade re-runs LLM inference by pushing an
``LLMContextFrame`` upstream (with care to avoid duplicate pushes
while results are queued or the bot is still speaking). Realtime
services consume function results directly via
``FunctionCallResultFrame``, so the context-driven re-inference
cycle is unnecessary.
"""
if not self._context_writes_await_turns:
# Realtime: the realtime service has the result via
# FunctionCallResultFrame. No context push needed.
return
if self.has_queued_frame(FunctionCallResultFrame):
# Another FunctionCallResultFrame is already queued. Defer the context push
# to bundle all results into a single LLM call instead of triggering one
# inference pass per result. The context will be pushed once the last
# function call in the queue is processed.
logger.debug(
f"{self}: More FunctionCallResultFrames queued — deferring context frame push."
)
elif self._bot_speaking:
# Defer the context frame push until the bot finishes speaking. If multiple
# function call results arrive while the bot is speaking, they all accumulate
# in the context and a single push is performed once speaking stops, preventing
# the LLM from running multiple times and producing duplicated responses.
# This should be an edge case, since it would require a FunctionCallResultFrame
# being queued between an LLM response start and end frame.
logger.debug(f"{self}: Bot is speaking — deferring context frame push.")
self._push_context_on_bot_stopped_speaking = True
else:
logger.debug(f"{self}: Pushing context frame!")
await self.push_context_frame(FrameDirection.UPSTREAM)
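
The branch order in this method can be restated as a pure decision function, which makes the precedence easy to test. This is a sketch only: `PushDecision` and `decide_context_push` are hypothetical names, and the three boolean inputs stand in for `self._context_writes_await_turns`, `self.has_queued_frame(FunctionCallResultFrame)`, and `self._bot_speaking`.

```python
from enum import Enum


class PushDecision(Enum):
    SKIP_REALTIME = "skip_realtime"
    DEFER_QUEUED = "defer_queued"
    DEFER_BOT_SPEAKING = "defer_bot_speaking"
    PUSH_NOW = "push_now"


def decide_context_push(*, context_writes_await_turns, more_results_queued, bot_speaking):
    """Mirror the branch order of the method above, without side effects."""
    if not context_writes_await_turns:
        # Realtime: the service already has the function result.
        return PushDecision.SKIP_REALTIME
    if more_results_queued:
        # Bundle all queued results into a single inference pass.
        return PushDecision.DEFER_QUEUED
    if bot_speaking:
        # Push once the bot stops speaking.
        return PushDecision.DEFER_BOT_SPEAKING
    return PushDecision.PUSH_NOW


print(decide_context_push(
    context_writes_await_turns=True, more_results_queued=True, bot_speaking=True
))
```

The realtime check comes first, so queue state and bot-speaking state are never consulted in realtime mode.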
async def _handle_function_call_intermediate_result(
self, frame: FunctionCallResultFrame, in_progress_frame: FunctionCallInProgressFrame
):
@@ -1469,6 +1834,20 @@ class LLMAssistantAggregator(LLMContextAggregator):
)
async def _handle_llm_start(self, _: LLMFullResponseStartFrame):
if not self._context_writes_await_turns:
await self._realtime_handle_llm_start()
return
await self._trigger_assistant_turn_started()
async def _realtime_handle_llm_start(self):
"""Realtime: flush the paired user half before starting the assistant turn.
The start of an assistant response (``LLMFullResponseStartFrame``) is
the trigger to commit any in-flight user transcript to context.
"""
if self._paired_half is not None and self._pair_lock is not None:
async with self._pair_lock:
await self._paired_half._realtime_handoff_flush()
await self._trigger_assistant_turn_started()
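
The cross-half handoff is easier to follow as a toy model. The sketch below is a simplification under stated assumptions: `Half` is a hypothetical class, not a Pipecat aggregator, and it collapses the real triggers (`LLMFullResponseStartFrame` on the assistant side, `TranscriptionFrame` on the user side) into a single `on_content` call. What it keeps is the core idea: new content on one half first flushes the other half, under a shared lock, and the flush is a no-op when nothing is pending.

```python
import asyncio


class Half:
    """Hypothetical stand-in for one aggregator half."""

    def __init__(self, role, context, lock):
        self.role = role
        self.context = context  # shared list of {"role", "content"} dicts
        self.lock = lock        # shared lock serializing cross-half flushes
        self.pending = []       # text not yet written to context
        self.peer = None

    async def handoff_flush(self):
        """Commit pending text to context; no-op when nothing is pending."""
        if not self.pending:
            return
        self.context.append({"role": self.role, "content": " ".join(self.pending)})
        self.pending = []

    async def on_content(self, text):
        """New content on this half first flushes the other half."""
        async with self.lock:
            await self.peer.handoff_flush()
        self.pending.append(text)


async def demo():
    context, lock = [], asyncio.Lock()
    user, assistant = Half("user", context, lock), Half("assistant", context, lock)
    user.peer, assistant.peer = assistant, user

    await user.on_content("what's the")
    await user.on_content("weather?")
    await assistant.on_content("Sunny.")  # flushes the user message
    await user.on_content("thanks")       # flushes the assistant message
    await user.handoff_flush()            # session end: flush trailing content
    return context


context = asyncio.run(demo())
for message in context:
    print(message)
```

Each message lands in context only when the other side starts producing content or the session ends, which is why writes no longer depend on turn frames.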
async def _handle_llm_end(self, _: LLMFullResponseEndFrame):
@@ -1606,6 +1985,10 @@ class LLMAssistantAggregator(LLMContextAggregator):
await self._call_event_handler("on_assistant_turn_started")
async def _trigger_assistant_turn_stopped(self, *, interrupted: bool = False):
if not self._context_writes_await_turns:
await self._realtime_trigger_assistant_turn_stopped(interrupted=interrupted)
return
if not self._assistant_turn_start_timestamp:
return
@@ -1620,9 +2003,86 @@ class LLMAssistantAggregator(LLMContextAggregator):
timestamp=self._assistant_turn_start_timestamp,
)
await self._call_event_handler("on_assistant_turn_stopped", message)
if aggregation:
await self._call_event_handler("on_assistant_message_added", message)
self._assistant_turn_start_timestamp = ""
async def _realtime_trigger_assistant_turn_stopped(self, *, interrupted: bool):
"""Realtime variant: defer the context write or flush on interruption.
Normal end-of-turn (``interrupted=False``, from
``LLMFullResponseEndFrame``) parks the message text in a pending
slot — it isn't written to context until the next user transcript
arrives or the session ends. Interruption (``interrupted=True``)
commits immediately, matching today's
``AssistantTurnStoppedMessage.interrupted`` semantics.
"""
if not self._assistant_turn_start_timestamp:
return
timestamp = self._assistant_turn_start_timestamp
self._assistant_turn_start_timestamp = ""
if interrupted:
aggregation = await self.push_aggregation()
if aggregation:
aggregation = self._maybe_strip_turn_completion_markers(aggregation)
message = AssistantTurnStoppedMessage(
content=aggregation, interrupted=True, timestamp=timestamp
)
await self._call_event_handler("on_assistant_turn_stopped", message)
if aggregation:
await self._call_event_handler("on_assistant_message_added", message)
return
# Normal end. Park the message for trailing write.
raw_aggregation = self.aggregation_string()
if raw_aggregation:
self._pending_assistant_message_to_flush = {
"raw": raw_aggregation,
"timestamp": timestamp,
}
await self.reset()
stripped = (
self._maybe_strip_turn_completion_markers(raw_aggregation) if raw_aggregation else ""
)
message = AssistantTurnStoppedMessage(
content=stripped, interrupted=False, timestamp=timestamp
)
await self._call_event_handler("on_assistant_turn_stopped", message)
async def _realtime_handoff_flush(self) -> None:
"""Flush pending assistant aggregation to context.
Called by the paired user half from
``_realtime_handle_transcription`` when a new transcript arrives,
committing the assistant's deferred message before the user
starts a new turn. No-op when nothing is pending.
"""
if self._pending_assistant_message_to_flush is None:
return
pending = self._pending_assistant_message_to_flush
self._pending_assistant_message_to_flush = None
raw = pending["raw"]
timestamp = pending["timestamp"]
# Mirror push_aggregation: write the raw aggregation (with any
# turn-completion markers intact) to context, emit LLMContextFrame
# and the timestamp frame. Markers are stripped only from the
# event-carried text.
self._context.add_message({"role": "assistant", "content": raw})
await self.push_context_frame()
timestamp_frame = LLMContextAssistantTimestampFrame(timestamp=time_now_iso8601())
await self.push_frame(timestamp_frame)
stripped = self._maybe_strip_turn_completion_markers(raw)
message = AssistantTurnStoppedMessage(
content=stripped, interrupted=False, timestamp=timestamp
)
await self._call_event_handler("on_assistant_message_added", message)
def _maybe_strip_turn_completion_markers(self, text: str) -> str:
"""Strip turn completion markers from assistant transcript.
@@ -1685,6 +2145,7 @@ class LLMContextAggregatorPair:
user_params: LLMUserAggregatorParams | None = None,
assistant_params: LLMAssistantAggregatorParams | None = None,
add_tool_change_messages: bool | None = None,
realtime_service_mode: RealtimeServiceModeConfig | None = None,
):
"""Initialize the LLM context aggregator pair.
@@ -1702,14 +2163,38 @@ class LLMContextAggregatorPair:
announcement is added exactly once (the second aggregator's
diff is empty by the time it sees the frame). Leave as
``None`` to respect per-params settings.
realtime_service_mode: When provided, configures the pair for
use with a realtime (speech-to-speech) LLM service.
Context writes become trailing — driven by the content
stream itself (transcripts, ``LLMFullResponseStartFrame``)
rather than turn frames — and, by default, turn-end
strategies stop waiting for transcripts. Both halves share
this configuration via a private channel; mismatched
halves are rejected at ``StartFrame``. Defaults to
``None``, which preserves cascade behavior.
"""
user_params = user_params or LLMUserAggregatorParams()
assistant_params = assistant_params or LLMAssistantAggregatorParams()
if add_tool_change_messages is not None:
user_params.add_tool_change_messages = add_tool_change_messages
assistant_params.add_tool_change_messages = add_tool_change_messages
-self._user = LLMUserAggregator(context, params=user_params)
-self._assistant = LLMAssistantAggregator(context, params=assistant_params)
+pair_lock = asyncio.Lock() if realtime_service_mode is not None else None
+self._user = LLMUserAggregator(
+context,
+params=user_params,
+_realtime_service_mode=realtime_service_mode,
+_pair_lock=pair_lock,
+)
+self._assistant = LLMAssistantAggregator(
+context,
+params=assistant_params,
+_realtime_service_mode=realtime_service_mode,
+_pair_lock=pair_lock,
+)
+# Wire the cross-half back-references after both halves exist.
+self._user._paired_half = self._assistant
+self._assistant._paired_half = self._user
def user(self) -> LLMUserAggregator:
"""Get the user context aggregator.

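Reviewer note: the trailing-write behaviour added to the assistant aggregator (park the message on a normal end of turn, commit immediately on interruption, flush when the next user transcript arrives) can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the aggregator's code: `TrailingWriteSketch`, its `context` list, and its method names are hypothetical stand-ins, and frames, the pair lock, marker stripping, and event handlers are all omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrailingWriteSketch:
    """Hypothetical stand-in for the assistant half; not a pipecat class."""

    context: list = field(default_factory=list)
    pending: Optional[str] = None

    def assistant_turn_stopped(self, text: str, *, interrupted: bool) -> None:
        if not text:
            return
        if interrupted:
            # Interruption commits the partial message right away.
            self.context.append({"role": "assistant", "content": text})
        else:
            # Normal end of turn: park the text instead of writing it.
            self.pending = text

    def flush(self) -> None:
        # No-op when nothing is parked.
        if self.pending is None:
            return
        parked, self.pending = self.pending, None
        self.context.append({"role": "assistant", "content": parked})

    def user_transcript(self, text: str) -> None:
        # Commit the parked assistant message before the new user message
        # so the context keeps conversational order.
        self.flush()
        self.context.append({"role": "user", "content": text})
```

In this sketch a parked message stays out of `context` until `user_transcript` or an explicit `flush` (for example at session end) is called, which mirrors the ordering guarantee described in the `_realtime_handoff_flush` docstring.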

@@ -56,7 +56,7 @@ from pipecat.services.aws.nova_sonic.session_continuation import (
SessionContinuationHelper,
SessionContinuationParams,
)
-from pipecat.services.llm_service import LLMService
+from pipecat.services.llm_service import LLMService, RealtimeServiceInfo
from pipecat.services.settings import NOT_GIVEN, LLMSettings, _NotGiven, assert_given
from pipecat.utils.time import time_now_iso8601
@@ -241,6 +241,17 @@ class AWSNovaSonicLLMService(LLMService[AWSNovaSonicLLMAdapter]):
Provides bidirectional audio streaming, real-time transcription, text generation,
and function calling capabilities using AWS Nova Sonic model.
Does NOT emit ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``,
so pipeline processors that depend on those frames — RTVI client
speech events, ``TurnTrackingObserver``, ``AudioBufferProcessor`` turn
recording, ``UserIdleController``, user mute strategies, voicemail
detector — won't activate with the default server-VAD-only setup. Pair
with ``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
so context writes are correct anyway. To produce the turn frames
locally, wire ``vad_analyzer=SileroVADAnalyzer()`` (or similar) into
``LLMUserAggregatorParams``; locally-generated turn boundaries are a
heuristic and may not match Nova Sonic's server-side turn decisions.
"""
Settings = AWSNovaSonicLLMSettings
@@ -249,6 +260,10 @@ class AWSNovaSonicLLMService(LLMService[AWSNovaSonicLLMAdapter]):
# Override the default adapter to use the AWSNovaSonicLLMAdapter one
adapter_class = AWSNovaSonicLLMAdapter
# Realtime (speech-to-speech) service. Does NOT emit
# UserStarted/StoppedSpeakingFrame from server-side turn signals.
_realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=False)
def __init__(
self,
*,
@@ -1428,9 +1443,15 @@ class AWSNovaSonicLLMService(LLMService[AWSNovaSonicLLMAdapter]):
if self._sc.on_content_end_assistant_final_text(content.text_content):
self.create_task(self._run_sc_handoff(), name="sc_handoff")
else:
# FINAL TEXT INTERRUPTED is the canonical barge-in
# signal. The AUDIO branch usually closed the
# response already (AUDIO contentEnd arrives with
# END_TURN on barge-in, before this), but the
# output transport's audio buffer is still draining
# — broadcast unconditionally to clear it.
await self.broadcast_interruption()
if self._assistant_is_responding:
-# TEXT INTERRUPTED before audio started means no AUDIO
-# contentEnd will arrive — end the response here.
+# No AUDIO contentEnd will arrive — close here.
self._assistant_is_responding = False
await self._report_assistant_response_ended()
# Session continuation: TEXT INTERRUPTED is a completion
@@ -1443,6 +1464,18 @@ class AWSNovaSonicLLMService(LLMService[AWSNovaSonicLLMAdapter]):
if stop_reason in ("END_TURN", "INTERRUPTED"):
# END_TURN: normal completion. INTERRUPTED: user interrupted
# mid-audio. Both mean no more audio for this turn.
if stop_reason == "INTERRUPTED":
# Emit InterruptionFrame upstream so the assistant
# aggregator marks the message interrupted=True, and
# downstream so BaseOutputTransport clears the audio
# buffer (without this the bot keeps talking past the
# interruption while the buffer drains, since Nova
# Sonic doesn't surface server-side interruption any
# other way). Must fire before
# _report_assistant_response_ended so the aggregator
# handles InterruptionFrame before LLMFullResponseEndFrame
# closes the turn.
await self.broadcast_interruption()
self._assistant_is_responding = False
await self._report_assistant_response_ended()
elif content.role == Role.USER:


@@ -62,7 +62,7 @@ from pipecat.processors.aggregators.llm_context import LLMContext, LLMSpecificMe
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.google.frames import LLMSearchOrigin, LLMSearchResponseFrame, LLMSearchResult
from pipecat.services.google.utils import update_google_client_http_options
-from pipecat.services.llm_service import FunctionCallFromLLM, LLMService
+from pipecat.services.llm_service import FunctionCallFromLLM, LLMService, RealtimeServiceInfo
from pipecat.services.settings import NOT_GIVEN, LLMSettings, _NotGiven, assert_given
from pipecat.transcriptions.language import Language, resolve_language
from pipecat.utils.string import match_endofsentence
@@ -361,6 +361,18 @@ class GeminiLiveLLMService(LLMService[GeminiLLMAdapter]):
This service enables real-time conversations with Gemini, supporting both
text and audio modalities. It handles voice transcription, streaming audio
responses, and tool usage.
Does NOT emit ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``
(the API exposes an ``interrupted`` event but no turn-start/-end), so
pipeline processors that depend on those frames — RTVI client speech
events, ``TurnTrackingObserver``, ``AudioBufferProcessor`` turn
recording, ``UserIdleController``, user mute strategies, voicemail
detector — won't activate with the default server-VAD-only setup. Pair
with ``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
so context writes are correct anyway. To produce the turn frames
locally, see ``examples/realtime/realtime-gemini-live-locally-driven-turns.py``;
note that locally-generated turn boundaries are a heuristic and may
not match Gemini Live's server-side turn decisions.
"""
Settings = GeminiLiveLLMSettings
@@ -369,6 +381,11 @@ class GeminiLiveLLMService(LLMService[GeminiLLMAdapter]):
# Overriding the default adapter to use the Gemini one.
adapter_class = GeminiLLMAdapter
# Realtime (speech-to-speech) service. Does NOT emit
# UserStarted/StoppedSpeakingFrame from server-side turn signals —
# the API exposes an `interrupted` event but no turn-start/-end.
_realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=False)
@property
def _is_gemini_3(self) -> bool:
"""Check if the current model is a Gemini 3.x model."""


@@ -51,7 +51,7 @@ from pipecat.metrics.metrics import LLMTokenUsage
from pipecat.processors.aggregators import async_tool_messages
from pipecat.processors.aggregators.llm_context import LLMContext, LLMSpecificMessage
from pipecat.processors.frame_processor import FrameDirection
-from pipecat.services.llm_service import FunctionCallFromLLM, LLMService
+from pipecat.services.llm_service import FunctionCallFromLLM, LLMService, RealtimeServiceInfo
from pipecat.services.settings import (
NOT_GIVEN,
LLMSettings,
@@ -201,6 +201,16 @@ class InworldRealtimeLLMService(LLMService[InworldRealtimeLLMAdapter]):
Supports function calling, conversation management, and real-time
transcription.
Emits ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` from
Inworld's server-side VAD events. Pair with
``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
so context writes are decoupled from those frames. If you wire local
VAD (``LLMUserAggregatorParams.vad_analyzer``) on top of this
service, disable Inworld's server-side turn detection first via
``turn_detection=None`` (manual mode); otherwise both sources
broadcast duplicate user-turn frames. See
``examples/realtime/realtime-inworld-locally-driven-turns.py``.
Example::
llm = InworldRealtimeLLMService(
@@ -245,6 +255,10 @@ class InworldRealtimeLLMService(LLMService[InworldRealtimeLLMAdapter]):
adapter_class = InworldRealtimeLLMAdapter
# Realtime (speech-to-speech) service. Emits UserStarted/Stopped
# speaking frames from server-side VAD events.
_realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=True)
# Target ~60ms audio chunks when sending to Inworld (16-bit mono).
_AUDIO_CHUNK_TARGET_MS = 60
@@ -417,12 +431,25 @@ class InworldRealtimeLLMService(LLMService[InworldRealtimeLLMAdapter]):
return rate
return getattr(self, "_output_sample_rate", 24000)
def _is_manual_turn_detection(self) -> bool:
"""Whether server-side turn detection is disabled (manual mode)."""
session_properties = assert_given(self._settings.session_properties)
return bool(
session_properties.audio
and session_properties.audio.input
and session_properties.audio.input.turn_detection is None
)
async def _handle_interruption(self):
"""Handle user interruption of assistant speech.
-Inworld's server-side VAD handles response cancellation and buffer
-cleanup automatically, so we only need to clean up local state.
+Server-side VAD handles response cancellation and buffer cleanup
+automatically; in manual mode the client must send the cancel
+and clear events explicitly.
"""
if self._is_manual_turn_detection():
await self.send_client_event(events.InputAudioBufferClearEvent())
await self.send_client_event(events.ResponseCancelEvent())
await self._truncate_current_audio_response()
await self.stop_all_metrics()
@@ -437,10 +464,16 @@ class InworldRealtimeLLMService(LLMService[InworldRealtimeLLMAdapter]):
async def _handle_user_stopped_speaking(self, frame):
"""Handle user stopped speaking event.
-Inworld's server-side VAD handles commit and response creation,
-so this is a no-op. Metrics are started in _handle_evt_speech_stopped.
+Server-side VAD handles commit and response creation
+automatically; in manual mode the client must send them
+explicitly. Metrics are started in _handle_evt_speech_stopped
+in the server-VAD path.
"""
-pass
+if self._is_manual_turn_detection():
+await self.start_ttfb_metrics()
+await self.start_processing_metrics()
+await self.send_client_event(events.InputAudioBufferCommitEvent())
+await self.send_client_event(events.ResponseCreateEvent())
async def _handle_bot_stopped_speaking(self):
"""Handle bot stopped speaking event."""


@@ -16,6 +16,7 @@ from collections.abc import Awaitable, Callable, Mapping, Sequence
from dataclasses import dataclass
from typing import (
Any,
ClassVar,
Generic,
Protocol,
cast,
@@ -48,6 +49,7 @@ from pipecat.frames.frames import (
LLMFullResponseStartFrame,
LLMTextFrame,
LLMUpdateSettingsFrame,
RealtimeServiceMetadataFrame,
StartFrame,
)
from pipecat.processors.aggregators.llm_context import (
@@ -97,6 +99,31 @@ class FunctionCallResultCallback(Protocol):
...
@dataclass(frozen=True)
class RealtimeServiceInfo:
"""Per-service metadata for realtime (speech-to-speech) LLM services.
Realtime LLM subclasses set ``LLMService._realtime_service_info`` to a
populated instance; the presence of a non-None value is what marks a
service as realtime. Non-realtime services keep the default ``None``.
Carries the configuration ``LLMService`` and
``LLMContextAggregatorPair`` need to wire up realtime behavior:
auto-broadcasting ``RealtimeServiceMetadataFrame`` at start, the
startup INFO log for services with no server-side turn signals, and
the aggregator's one-time recommendation log.
Parameters:
emits_user_turn_frames: Whether the service emits
``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``
from server-side turn signals. False for services with no
server-side turn signals (e.g. Gemini Live, AWS Nova Sonic,
Ultravox).
"""
emits_user_turn_frames: bool = True
@dataclass
class FunctionCallParams:
"""Parameters for a function call.
@@ -244,6 +271,15 @@ class LLMService(UserTurnCompletionLLMServiceMixin, AIService, Generic[TAdapter]
# However, subclasses should override this with a more specific adapter when necessary.
adapter_class: type[BaseLLMAdapter] = OpenAILLMAdapter
# Marker + per-service config for realtime (speech-to-speech) LLM
# services. Realtime subclasses override this with a populated
# ``RealtimeServiceInfo`` instance — the presence of a non-None value
# is what marks the service as realtime. Non-realtime services keep
# the default ``None`` and the realtime-specific machinery
# (auto-broadcast of ``RealtimeServiceMetadataFrame``, startup INFO
# log for services without server-side turn signals) stays inert.
_realtime_service_info: ClassVar[RealtimeServiceInfo | None] = None
# Returned to the LLM as the tool result when an unavailable function is
# called. Deliberately neutral about future availability so the LLM can
# pick the function up again if it returns (e.g. via the
@@ -363,6 +399,21 @@ class LLMService(UserTurnCompletionLLMServiceMixin, AIService, Generic[TAdapter]
await self._create_sequential_runner_task()
if self._enable_async_tool_cancellation and self._has_async_tools():
self._setup_async_tool_cancellation()
if (
self._realtime_service_info is not None
and not self._realtime_service_info.emits_user_turn_frames
):
logger.info(
f"{self} does not emit UserStartedSpeakingFrame/"
"UserStoppedSpeakingFrame. Pipeline processors that depend on "
"these frames (RTVI client speech events, TurnTrackingObserver, "
"AudioBufferProcessor turn recording, UserIdleController, user "
"mute strategies, voicemail detector) will not activate. To "
"produce them locally, add `vad_analyzer=` to "
"LLMUserAggregatorParams. Note: local turn detection may not "
"match the provider's actual server-side turn decisions and "
"can desynchronize in subtle ways."
)
async def stop(self, frame: EndFrame):
"""Stop the LLM service.
@@ -495,6 +546,23 @@ class LLMService(UserTurnCompletionLLMServiceMixin, AIService, Generic[TAdapter]
await super().push_frame(frame, direction)
# Broadcast realtime-service metadata immediately after the
# StartFrame propagates downstream, mirroring the order STT
# services use for STTMetadataFrame. The aggregator (upstream)
# already received its own StartFrame and is ready to process
# the broadcast; downstream processors see StartFrame then the
# metadata in their queues.
if (
self._realtime_service_info is not None
and isinstance(frame, StartFrame)
and direction == FrameDirection.DOWNSTREAM
):
await self.broadcast_frame(
RealtimeServiceMetadataFrame,
service_name=self.name,
emits_user_turn_frames=self._realtime_service_info.emits_user_turn_frames,
)
async def _push_llm_text(self, text: str):
"""Push LLM text, using turn completion detection if enabled.

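Reviewer note: the marker pattern used here (a class-level attribute that is `None` for ordinary services and a frozen dataclass instance for realtime ones) can be tried out without pipecat. The sketch below is illustrative only; every class name ending in `Sketch` and the `needs_local_turn_frames_hint` helper are hypothetical, and the condition simply mirrors the one guarding the startup INFO log in the hunk above.

```python
from dataclasses import dataclass
from typing import ClassVar, Optional


@dataclass(frozen=True)
class ServiceInfoSketch:
    """Hypothetical mirror of RealtimeServiceInfo."""

    emits_user_turn_frames: bool = True


class BaseServiceSketch:
    # None marks a non-realtime service; subclasses override with an instance.
    _realtime_service_info: ClassVar[Optional[ServiceInfoSketch]] = None

    def needs_local_turn_frames_hint(self) -> bool:
        # True only for realtime services without server-side turn frames.
        info = self._realtime_service_info
        return info is not None and not info.emits_user_turn_frames


class CascadeSketch(BaseServiceSketch):
    """Non-realtime service: keeps the default None marker."""


class ServerVadSketch(BaseServiceSketch):
    _realtime_service_info = ServiceInfoSketch(emits_user_turn_frames=True)


class NoTurnFramesSketch(BaseServiceSketch):
    _realtime_service_info = ServiceInfoSketch(emits_user_turn_frames=False)
```

Because the dataclass is frozen, one instance can be shared safely as a class attribute across all instances of a service.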

@@ -51,7 +51,7 @@ from pipecat.metrics.metrics import LLMTokenUsage
from pipecat.processors.aggregators import async_tool_messages
from pipecat.processors.aggregators.llm_context import LLMContext, LLMSpecificMessage
from pipecat.processors.frame_processor import FrameDirection
-from pipecat.services.llm_service import FunctionCallFromLLM, LLMService
+from pipecat.services.llm_service import FunctionCallFromLLM, LLMService, RealtimeServiceInfo
from pipecat.services.openai._constants import OPENAI_REALTIME_WHISPER_MODEL, OPENAI_SAMPLE_RATE
from pipecat.services.settings import (
NOT_GIVEN,
@@ -204,6 +204,21 @@ class OpenAIRealtimeLLMService(LLMService[OpenAIRealtimeLLMAdapter]):
Implements the OpenAI Realtime API with WebSocket communication for low-latency
bidirectional audio and text interactions. Supports function calling, conversation
management, and real-time transcription.
Emits ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` from
OpenAI's server-side VAD events, so pipeline processors that depend on
those frames (RTVI client speech events, ``TurnTrackingObserver``,
``AudioBufferProcessor`` turn recording, ``UserIdleController``, user
mute strategies, voicemail detector) work out of the box. Pair with
``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
so context writes are decoupled from those frames; see the
``examples/realtime/realtime-openai.py`` example.
If you wire local VAD (``LLMUserAggregatorParams.vad_analyzer``) on
top of this service, disable OpenAI's server-side turn detection
first (``turn_detection=False``); otherwise both sources broadcast
duplicate user-turn frames. See
``examples/realtime/realtime-openai-locally-driven-turns.py``.
"""
Settings = OpenAIRealtimeLLMSettings
@@ -212,6 +227,10 @@ class OpenAIRealtimeLLMService(LLMService[OpenAIRealtimeLLMAdapter]):
# Overriding the default adapter to use the OpenAIRealtimeLLMAdapter one.
adapter_class = OpenAIRealtimeLLMAdapter
# Realtime (speech-to-speech) service. Emits UserStarted/Stopped
# speaking frames from server-side VAD events.
_realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=True)
def __init__(
self,
*,


@@ -48,7 +48,7 @@ from pipecat.frames.frames import (
from pipecat.processors.aggregators import async_tool_messages
from pipecat.processors.aggregators.llm_context import LLMContext, LLMSpecificMessage
from pipecat.processors.frame_processor import FrameDirection
-from pipecat.services.llm_service import FunctionCallFromLLM, LLMService
+from pipecat.services.llm_service import FunctionCallFromLLM, LLMService, RealtimeServiceInfo
from pipecat.services.settings import NOT_GIVEN, LLMSettings, _NotGiven, assert_given
from pipecat.utils.time import time_now_iso8601
@@ -174,11 +174,26 @@ class UltravoxRealtimeLLMService(LLMService):
Note: Ultravox is an audio-native model, so voice transcriptions are not used
by the model and may not always align with its understanding of user input.
Does NOT emit ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame``,
so pipeline processors that depend on those frames — RTVI client
speech events, ``TurnTrackingObserver``, ``AudioBufferProcessor`` turn
recording, ``UserIdleController``, user mute strategies, voicemail
detector — won't activate with the default server-VAD-only setup. Pair
with ``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
so context writes are correct anyway. To produce the turn frames
locally, wire ``vad_analyzer=SileroVADAnalyzer()`` (or similar) into
``LLMUserAggregatorParams``; locally-generated turn boundaries are a
heuristic and may not match Ultravox's server-side turn decisions.
"""
Settings = UltravoxRealtimeLLMSettings
_settings: Settings
# Realtime (speech-to-speech) service. Does NOT emit
# UserStarted/StoppedSpeakingFrame from server-side turn signals.
_realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=False)
def __init__(
self,
*,
@@ -600,6 +615,18 @@ class UltravoxRealtimeLLMService(LLMService):
case "state":
if self._bot_responding and data.get("state") != "speaking":
await self._handle_response_end()
case "playback_clear_buffer":
# Server signals that the user interrupted the bot
# mid-speech and any buffered output audio should be
# dropped. Broadcast InterruptionFrame so the assistant
# aggregator records the message interrupted=True
# (upstream) and BaseOutputTransport clears its audio
# buffer (downstream). The subsequent "state" message
# transitioning off "speaking" is what closes the
# response via _handle_response_end; firing the
# interruption first ensures the aggregator handles
# InterruptionFrame before LLMFullResponseEndFrame.
await self.broadcast_interruption()
case "client_tool_invocation":
await self._handle_tool_invocation(
data.get("toolName"), data.get("invocationId"), data.get("parameters")


@@ -50,7 +50,7 @@ from pipecat.metrics.metrics import LLMTokenUsage
from pipecat.processors.aggregators import async_tool_messages
from pipecat.processors.aggregators.llm_context import LLMContext, LLMSpecificMessage
from pipecat.processors.frame_processor import FrameDirection
-from pipecat.services.llm_service import FunctionCallFromLLM, LLMService
+from pipecat.services.llm_service import FunctionCallFromLLM, LLMService, RealtimeServiceInfo
from pipecat.services.settings import (
NOT_GIVEN,
LLMSettings,
@@ -195,6 +195,16 @@ class GrokRealtimeLLMService(LLMService[GrokRealtimeLLMAdapter]):
- Built-in tools (web_search, x_search, file_search)
- Custom function calling
- Server-side VAD (Voice Activity Detection)
Emits ``UserStartedSpeakingFrame`` / ``UserStoppedSpeakingFrame`` from
Grok's server-side VAD events. Pair with
``LLMContextAggregatorPair(..., realtime_service_mode=RealtimeServiceModeConfig())``
so context writes are decoupled from those frames. If you wire local
VAD (``LLMUserAggregatorParams.vad_analyzer``) on top of this
service, disable Grok's server-side turn detection first via
``turn_detection=None`` (manual mode); otherwise both sources
broadcast duplicate user-turn frames. See
``examples/realtime/realtime-grok-locally-driven-turns.py``.
"""
Settings = GrokRealtimeLLMSettings
@@ -203,6 +213,10 @@ class GrokRealtimeLLMService(LLMService[GrokRealtimeLLMAdapter]):
# Use the Grok-specific adapter
adapter_class = GrokRealtimeLLMAdapter
# Realtime (speech-to-speech) service. Emits UserStarted/Stopped
# speaking frames from server-side VAD events.
_realtime_service_info = RealtimeServiceInfo(emits_user_turn_frames=True)
def __init__(
self,
*,


@@ -45,16 +45,35 @@ class SpeechTimeoutUserTurnStopStrategy(BaseUserTurnStopStrategy):
transcript — so the stt wait is marked done immediately.
"""
-def __init__(self, *, user_speech_timeout: float = 0.6, **kwargs):
+def __init__(
+self,
+*,
+user_speech_timeout: float = 0.6,
+wait_for_transcript: bool = True,
+**kwargs,
+):
"""Initialize the speech timeout-based user turn stop strategy.
Args:
user_speech_timeout: Time to wait for the user to potentially
say more after they pause speaking. Defaults to 0.6 seconds.
wait_for_transcript: Whether to require at least one transcript
before triggering end-of-turn. When True (default), turn-end
fires only after the user-speech timer expires *and* at least
one transcript has been received. When False, the strategy
signals turn-end as soon as VAD reports end of speech and the
user-speech timer has elapsed — independent of transcripts.
Set this to False when local turn detection is the intended
driver of the conversation (e.g. with a realtime LLM service
consuming audio directly), so transcripts are off the latency
critical path. ``LLMContextAggregatorPair`` flips this for
you when ``realtime_service_mode`` is configured with
``turns_await_transcripts=False``.
**kwargs: Additional keyword arguments.
"""
super().__init__(**kwargs)
self._user_speech_timeout = user_speech_timeout
self._wait_for_transcript = wait_for_transcript
self._stt_timeout: float = 0.0 # STT P99 latency from STTMetadataFrame
self._stop_secs: float = 0.0 # VAD stop_secs from VADUserStoppedSpeakingFrame
self._stop_secs_warned: bool = False
@@ -69,6 +88,15 @@ class SpeechTimeoutUserTurnStopStrategy(BaseUserTurnStopStrategy):
self._user_speech_wait_done: bool = False
self._stt_wait_done: bool = False
@property
def wait_for_transcript(self) -> bool:
"""Whether transcripts gate end-of-turn signalling."""
return self._wait_for_transcript
@wait_for_transcript.setter
def wait_for_transcript(self, value: bool) -> None:
self._wait_for_transcript = value
async def reset(self):
"""Reset the strategy to its initial state."""
await super().reset()
@@ -252,10 +280,14 @@ class SpeechTimeoutUserTurnStopStrategy(BaseUserTurnStopStrategy):
Both timers must be done (stt is marked done immediately on the
fallback path and when finalization short-circuits the safety net),
-the user must not be currently speaking, and at least one transcript
-must have been received.
+the user must not be currently speaking, and — when
+``wait_for_transcript`` is True — at least one transcript must
+have been received.
"""
-if self._vad_user_speaking or not self._text:
+if self._vad_user_speaking:
return
if self._wait_for_transcript and not self._text:
return
if self._user_speech_wait_done and self._stt_wait_done:

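Reviewer note: the revised gating in the speech-timeout strategy reduces to a small predicate. The sketch below restates it outside the strategy class so the effect of `wait_for_transcript` is easy to check; `StopGateSketch` and `should_stop` are hypothetical names, and the timers are modelled as plain booleans rather than asyncio tasks.

```python
from dataclasses import dataclass


@dataclass
class StopGateSketch:
    """Hypothetical snapshot of the strategy's state; not pipecat API."""

    vad_user_speaking: bool = False
    text: str = ""
    user_speech_wait_done: bool = False
    stt_wait_done: bool = False
    wait_for_transcript: bool = True

    def should_stop(self) -> bool:
        # Never end the turn while VAD still reports speech.
        if self.vad_user_speaking:
            return False
        # Transcripts gate the turn end only when explicitly requested.
        if self.wait_for_transcript and not self.text:
            return False
        # Both timers must have completed.
        return self.user_speech_wait_done and self.stt_wait_done
```

With `wait_for_transcript=False`, an empty `text` no longer blocks the turn end, which is the behaviour the docstring describes for locally driven turns with a realtime service.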

@@ -44,15 +44,35 @@ class TurnAnalyzerUserTurnStopStrategy(BaseUserTurnStopStrategy):
as a fallback.
"""
-def __init__(self, *, turn_analyzer: BaseTurnAnalyzer, **kwargs):
+def __init__(
+self,
+*,
+turn_analyzer: BaseTurnAnalyzer,
+wait_for_transcript: bool = True,
+**kwargs,
+):
"""Initialize the user turn stop strategy.
Args:
turn_analyzer: The turn detection analyzer instance to detect end of user turn.
wait_for_transcript: Whether to require a transcript before
triggering end-of-turn. When True (default), turn-end fires
only after the turn analyzer reports COMPLETE *and* either a
finalized transcript arrives or the STT safety-net timeout
elapses with text in hand. When False, the strategy signals
turn-end as soon as the turn analyzer reports COMPLETE —
independent of transcripts. Set this to False when local
turn detection is the intended driver of the conversation
(e.g. with a realtime LLM service consuming audio directly),
so transcripts are off the latency critical path.
``LLMContextAggregatorPair`` flips this for you when
``realtime_service_mode`` is configured with
``turns_await_transcripts=False``.
**kwargs: Additional keyword arguments.
"""
super().__init__(**kwargs)
self._turn_analyzer = turn_analyzer
self._wait_for_transcript = wait_for_transcript
self._stt_timeout: float = 0.0 # STT P99 latency from STTMetadataFrame
self._stop_secs: float = 0.0 # VAD stop_secs from VADUserStoppedSpeakingFrame
@@ -66,6 +86,15 @@ class TurnAnalyzerUserTurnStopStrategy(BaseUserTurnStopStrategy):
self._timeout_task: asyncio.Task | None = None
self._timeout_expired: bool = False
@property
def wait_for_transcript(self) -> bool:
"""Whether transcripts gate end-of-turn signalling."""
return self._wait_for_transcript
@wait_for_transcript.setter
def wait_for_transcript(self, value: bool) -> None:
self._wait_for_transcript = value
async def reset(self):
"""Reset the strategy to its initial state."""
await super().reset()
@@ -256,11 +285,25 @@ class TurnAnalyzerUserTurnStopStrategy(BaseUserTurnStopStrategy):
"""Trigger user turn stopped if conditions are met.
Conditions:
- We have transcription text
- Turn analyzer indicates turn is complete
- Either the timeout has elapsed OR we have a finalized transcript
- When ``wait_for_transcript`` is True (default): we have
transcription text *and* either the safety-net timeout has
elapsed or a finalized transcript arrived.
- When ``wait_for_transcript`` is False: fire as soon as the turn
analyzer reports COMPLETE — independent of transcripts.
"""
-if not self._text or not self._turn_complete:
+if not self._turn_complete:
return
if not self._wait_for_transcript:
# Turn-end is driven by the analyzer; transcripts are bookkeeping.
if self._timeout_task:
await self.task_manager.cancel_task(self._timeout_task)
self._timeout_task = None
await self.trigger_user_turn_stopped()
return
if not self._text:
return
# For finalized transcripts, trigger immediately

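Reviewer note: both stop-strategy docstrings refer to `turns_await_transcripts`, and the `RealtimeServiceModeConfig` unit tests in this change set accept three of the four flag combinations while expecting `ValueError` for the fourth (turns fire early but context writes wait for turns). The implementation of the real class is not part of this excerpt, so the following is only a sketch of one way to get that behaviour, assuming a frozen dataclass with `__post_init__` validation; `ModeConfigSketch` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeConfigSketch:
    """Hypothetical config mirroring the combinations the tests exercise."""

    turns_await_transcripts: bool = False
    context_writes_await_turns: bool = False

    def __post_init__(self) -> None:
        # Turn-gated context writes need the transcript to be present when
        # the turn ends; otherwise messages would be written incomplete.
        if self.context_writes_await_turns and not self.turns_await_transcripts:
            raise ValueError(
                "context_writes_await_turns=True requires "
                "turns_await_transcripts=True"
            )
```

Validating in `__post_init__` rejects the invalid combination at construction time, so a misconfigured pair never reaches the pipeline.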

@@ -4,6 +4,7 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
import json
import unittest
@@ -33,6 +34,7 @@ from pipecat.frames.frames import (
LLMThoughtEndFrame,
LLMThoughtStartFrame,
LLMThoughtTextFrame,
RealtimeServiceMetadataFrame,
SpeechControlParamsFrame,
StartFrame,
TextFrame,
@@ -55,6 +57,8 @@ from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
LLMUserAggregator,
LLMUserAggregatorParams,
RealtimeServiceModeConfig,
UserTurnStoppedMessage,
)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.tests.utils import SleepFrame, run_test
@@ -63,6 +67,10 @@ from pipecat.turns.user_mute import (
FunctionCallUserMuteStrategy,
MuteUntilFirstBotCompleteUserMuteStrategy,
)
from pipecat.turns.user_start import (
TranscriptionUserTurnStartStrategy,
VADUserTurnStartStrategy,
)
from pipecat.turns.user_stop import SpeechTimeoutUserTurnStopStrategy
from pipecat.turns.user_turn_strategies import (
FilterIncompleteUserTurnStrategies,
@@ -1651,5 +1659,314 @@ class TestToolChangeMessages(unittest.IsolatedAsyncioTestCase):
self.assertFalse(pair.assistant()._add_tool_change_messages)
class TestRealtimeServiceModeConfig(unittest.TestCase):
def test_default_fields_are_realtime(self):
cfg = RealtimeServiceModeConfig()
self.assertFalse(cfg.context_writes_await_turns)
self.assertFalse(cfg.turns_await_transcripts)
def test_keep_transcripts_keep_writes_on_turn(self):
cfg = RealtimeServiceModeConfig(
turns_await_transcripts=True, context_writes_await_turns=True
)
self.assertTrue(cfg.context_writes_await_turns)
self.assertTrue(cfg.turns_await_transcripts)
def test_keep_transcripts_trailing_writes(self):
# Valid third row: turns wait on transcripts but context writes
# are trailing. The plan calls this out as the explicit fine-grained
# case (downstream consumers of user-turn frames want transcripts).
cfg = RealtimeServiceModeConfig(turns_await_transcripts=True)
self.assertFalse(cfg.context_writes_await_turns)
self.assertTrue(cfg.turns_await_transcripts)
def test_invalid_combination_rejected(self):
# turns fire early but context writes wait → incomplete messages.
with self.assertRaises(ValueError):
RealtimeServiceModeConfig(
turns_await_transcripts=False, context_writes_await_turns=True
)
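
Review note: the four cases above pin down a small truth table for the two flags. The sketch below shows one way such validation could look. `SketchModeConfig` is a hypothetical stand-in written for illustration; the real `RealtimeServiceModeConfig` is not shown in this diff and may be implemented differently (for example, not as a dataclass).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchModeConfig:
    """Hypothetical stand-in for RealtimeServiceModeConfig (illustration only)."""

    turns_await_transcripts: bool = False
    context_writes_await_turns: bool = False

    def __post_init__(self) -> None:
        # If context writes wait for turn end but turns may end before the
        # transcript arrives, the written message could be incomplete.
        if self.context_writes_await_turns and not self.turns_await_transcripts:
            raise ValueError(
                "context_writes_await_turns=True requires turns_await_transcripts=True"
            )


# The three valid rows construct without error.
SketchModeConfig()
SketchModeConfig(turns_await_transcripts=True)
SketchModeConfig(turns_await_transcripts=True, context_writes_await_turns=True)
```

Validating at construction time means an invalid combination fails where the config is built, not later inside the pipeline.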
class TestRealtimeServiceModeAggregator(unittest.IsolatedAsyncioTestCase):
"""End-to-end tests for the trailing-write realtime mode."""
def _build_pair(
self,
*,
realtime_service_mode: RealtimeServiceModeConfig | None = None,
user_params: LLMUserAggregatorParams | None = None,
) -> tuple[LLMContext, LLMContextAggregatorPair]:
context = LLMContext()
pair = LLMContextAggregatorPair(
context,
user_params=user_params,
realtime_service_mode=realtime_service_mode,
)
return context, pair
async def test_pair_propagates_realtime_mode_to_halves(self):
_, pair = self._build_pair(realtime_service_mode=RealtimeServiceModeConfig())
# The pair wires shared state into both halves.
self.assertIs(pair.user()._paired_half, pair.assistant())
self.assertIs(pair.assistant()._paired_half, pair.user())
self.assertIs(pair.user()._pair_lock, pair.assistant()._pair_lock)
self.assertFalse(pair.user()._context_writes_await_turns)
self.assertFalse(pair.user()._turns_await_transcripts)
self.assertFalse(pair.assistant()._context_writes_await_turns)
self.assertFalse(pair.assistant()._turns_await_transcripts)
async def test_pair_omits_realtime_wiring_when_unset(self):
_, pair = self._build_pair()
# Backreferences are still created (harmless), but no shared lock
# is allocated when the realtime config is absent.
self.assertIsNone(pair.user()._pair_lock)
self.assertIsNone(pair.assistant()._pair_lock)
self.assertTrue(pair.user()._context_writes_await_turns)
self.assertTrue(pair.assistant()._context_writes_await_turns)
async def test_realtime_strategy_mutations_with_defaults(self):
_, pair = self._build_pair(realtime_service_mode=RealtimeServiceModeConfig())
# The mutated strategies live on the UserTurnController owned by
# the user aggregator.
strategies = pair.user()._user_turn_controller._user_turn_strategies
# TranscriptionUserTurnStartStrategy is dropped.
for s in strategies.start:
self.assertNotIsInstance(s, TranscriptionUserTurnStartStrategy)
# VAD start strategy is preserved.
self.assertTrue(any(isinstance(s, VADUserTurnStartStrategy) for s in strategies.start))
# Stop strategies that expose wait_for_transcript have it set to False.
for s in strategies.stop:
if hasattr(s, "wait_for_transcript"):
self.assertFalse(s.wait_for_transcript)
async def test_realtime_strategy_mutations_skipped_when_turns_await_transcripts(self):
_, pair = self._build_pair(
realtime_service_mode=RealtimeServiceModeConfig(turns_await_transcripts=True),
)
strategies = pair.user()._user_turn_controller._user_turn_strategies
# When turns still wait for transcripts, the transcript start
# strategy stays in the chain.
self.assertTrue(
any(isinstance(s, TranscriptionUserTurnStartStrategy) for s in strategies.start)
)
async def test_trailing_write_user_then_assistant_then_user(self):
_, pair = self._build_pair(realtime_service_mode=RealtimeServiceModeConfig())
user, assistant = pair
user_msg_added: list[UserTurnStoppedMessage] = []
assistant_msg_added: list[AssistantTurnStoppedMessage] = []
@user.event_handler("on_user_message_added")
async def _on_um(_a, msg):
user_msg_added.append(msg)
@assistant.event_handler("on_assistant_message_added")
async def _on_am(_a, msg):
assistant_msg_added.append(msg)
context = user.context
# Sequence: user transcript, assistant response starts (flushes
# user), assistant response ends (parks pending), new user
# transcript (flushes assistant), then EndFrame flushes the new
# user message.
frames_to_send = [
TranscriptionFrame(text="Hello!", user_id="", timestamp="now"),
SleepFrame(),
LLMFullResponseStartFrame(),
LLMTextFrame("Hi "),
LLMTextFrame("there!"),
LLMFullResponseEndFrame(),
SleepFrame(),
TranscriptionFrame(text="How are you?", user_id="", timestamp="now"),
SleepFrame(),
]
await run_test(
Pipeline([user, assistant]),
frames_to_send=frames_to_send,
)
# Context should contain: user("Hello!"), assistant("Hi there!"),
# user("How are you?").
messages = context.get_messages()
roles_contents = [(m["role"], m["content"]) for m in messages]
self.assertEqual(
roles_contents,
[
("user", "Hello!"),
("assistant", "Hi there!"),
("user", "How are you?"),
],
)
self.assertEqual([m.content for m in user_msg_added], ["Hello!", "How are you?"])
self.assertEqual([m.content for m in assistant_msg_added], ["Hi there!"])
for msg in assistant_msg_added:
self.assertFalse(msg.interrupted)
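
Review note: the ordering this test exercises (each side's message is written when the other side starts, or at end of stream) can be modelled without any pipeline machinery. The class below is a hypothetical toy model, not the aggregator's implementation: `TrailingWriteLog` and its method names are invented for illustration, and stripping whitespace from the assistant text is an assumption based on the expected values in these tests.

```python
class TrailingWriteLog:
    """Hypothetical toy model of trailing context writes (illustration only)."""

    def __init__(self):
        self.messages = []             # committed (role, content) pairs
        self._user_parts = []          # user transcripts not yet written
        self._assistant_parts = []     # text of the response in progress
        self._parked_assistant = None  # finished response awaiting a flush

    def _flush_user(self):
        if self._user_parts:
            self.messages.append(("user", " ".join(self._user_parts)))
            self._user_parts = []

    def _flush_assistant(self):
        if self._parked_assistant is not None:
            self.messages.append(("assistant", self._parked_assistant))
            self._parked_assistant = None

    def user_transcript(self, text):
        # A new user transcript flushes any parked assistant message first.
        self._flush_assistant()
        self._user_parts.append(text)

    def assistant_start(self):
        # The assistant starting to respond flushes the pending user message.
        self._flush_user()
        self._assistant_parts = []

    def assistant_text(self, text):
        self._assistant_parts.append(text)

    def assistant_end(self):
        # Park the finished response; it is written on the next user activity.
        self._parked_assistant = "".join(self._assistant_parts).strip()
        self._assistant_parts = []

    def interrupt(self):
        # An interruption writes the partial response immediately.
        text = "".join(self._assistant_parts).strip()
        self._assistant_parts = []
        if text:
            self.messages.append(("assistant", text))

    def end(self):
        # End of stream flushes whatever is still pending, assistant first.
        self._flush_assistant()
        self._flush_user()
```

Replaying the frame sequence from the test (user transcript, assistant start, two text chunks, assistant end, second user transcript, end of stream) through this model yields the same user, assistant, user ordering that the test asserts on the context.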
async def test_interruption_writes_assistant_immediately(self):
_, pair = self._build_pair(realtime_service_mode=RealtimeServiceModeConfig())
user, assistant = pair
assistant_messages: list[AssistantTurnStoppedMessage] = []
@assistant.event_handler("on_assistant_message_added")
async def _on_am(_a, msg):
assistant_messages.append(msg)
context = user.context
frames_to_send = [
TranscriptionFrame(text="Hi!", user_id="", timestamp="now"),
LLMFullResponseStartFrame(),
LLMTextFrame("Hello "),
SleepFrame(),
InterruptionFrame(),
]
await run_test(
Pipeline([user, assistant]),
frames_to_send=frames_to_send,
)
roles_contents = [(m["role"], m["content"]) for m in context.get_messages()]
# User message written when assistant started; assistant message
# written immediately on interruption with interrupted=True.
self.assertEqual(roles_contents, [("user", "Hi!"), ("assistant", "Hello")])
self.assertEqual(len(assistant_messages), 1)
self.assertTrue(assistant_messages[0].interrupted)
async def test_user_turn_stopped_in_realtime_mode_has_none_content(self):
# When VAD turn frames fire in realtime mode, the user-turn-stop
# message carries content=None — the message isn't finalized yet.
_, pair = self._build_pair(
realtime_service_mode=RealtimeServiceModeConfig(),
user_params=LLMUserAggregatorParams(
user_turn_strategies=UserTurnStrategies(
stop=[
SpeechTimeoutUserTurnStopStrategy(
user_speech_timeout=TRANSCRIPTION_TIMEOUT,
)
],
),
user_turn_stop_timeout=USER_TURN_STOP_TIMEOUT,
),
)
user, assistant = pair
stop_messages: list[UserTurnStoppedMessage] = []
@user.event_handler("on_user_turn_stopped")
async def _on_stop(_a, _s, msg):
stop_messages.append(msg)
frames_to_send = [
VADUserStartedSpeakingFrame(),
TranscriptionFrame(text="hey", user_id="", timestamp="now"),
VADUserStoppedSpeakingFrame(),
SleepFrame(sleep=TRANSCRIPTION_TIMEOUT + 0.05),
]
await run_test(
Pipeline([user, assistant]),
frames_to_send=frames_to_send,
)
self.assertEqual(len(stop_messages), 1)
self.assertIsNone(stop_messages[0].content)
async def test_realtime_metadata_recommendation_log_when_unconfigured(self):
# Cascade pair receiving a RealtimeServiceMetadataFrame logs the
# one-time recommendation. The user half records the fact via
# _realtime_recommendation_logged.
_, pair = self._build_pair()
user = pair.user()
frames_to_send = [
RealtimeServiceMetadataFrame(
service_name="FakeRealtimeLLM", emits_user_turn_frames=False
),
]
await run_test(Pipeline([pair.user(), pair.assistant()]), frames_to_send=frames_to_send)
self.assertTrue(user._realtime_recommendation_logged)
async def test_realtime_metadata_no_log_when_configured(self):
# When realtime mode is opted in, the metadata frame is consumed
# without firing the recommendation log (we still flag the
# one-shot bookkeeping).
_, pair = self._build_pair(realtime_service_mode=RealtimeServiceModeConfig())
user = pair.user()
frames_to_send = [
RealtimeServiceMetadataFrame(
service_name="FakeRealtimeLLM", emits_user_turn_frames=False
),
]
await run_test(Pipeline([pair.user(), pair.assistant()]), frames_to_send=frames_to_send)
self.assertTrue(user._realtime_recommendation_logged)
async def test_realtime_mode_requires_paired_half(self):
# Direct construction of a half with realtime mode set but no
# paired_half raises at StartFrame validation. We call the
# validation directly so the error isn't swallowed by the
# pipeline's exception handler.
context = LLMContext()
cfg = RealtimeServiceModeConfig()
user = LLMUserAggregator(context, _realtime_service_mode=cfg)
with self.assertRaises(RuntimeError):
user._validate_realtime_pairing()
assistant = LLMAssistantAggregator(context, _realtime_service_mode=cfg)
with self.assertRaises(RuntimeError):
assistant._validate_realtime_pairing()
async def test_realtime_mode_rejects_mismatched_halves(self):
# If a user code path constructs halves with mismatched configs,
# StartFrame validation catches it.
context = LLMContext()
lock = asyncio.Lock()
user = LLMUserAggregator(
context,
_realtime_service_mode=RealtimeServiceModeConfig(),
_pair_lock=lock,
)
assistant = LLMAssistantAggregator(
context,
_realtime_service_mode=RealtimeServiceModeConfig(turns_await_transcripts=True),
_pair_lock=lock,
)
user._paired_half = assistant
assistant._paired_half = user
with self.assertRaises(RuntimeError):
user._validate_realtime_pairing()
async def test_function_call_no_context_push_in_realtime_mode(self):
# Realtime services consume function results directly via
# FunctionCallResultFrame, so the aggregator should not push
# LLMContextFrame upstream after a function call result.
_, pair = self._build_pair(realtime_service_mode=RealtimeServiceModeConfig())
assistant = pair.assistant()
frames_to_send = [
FunctionCallInProgressFrame(
function_name="get_weather",
tool_call_id="1",
arguments={"location": "Los Angeles"},
cancel_on_interruption=True,
),
SleepFrame(),
FunctionCallResultFrame(
function_name="get_weather",
tool_call_id="1",
arguments={"location": "Los Angeles"},
result={"conditions": "Sunny"},
),
SleepFrame(),
]
_, up_frames = await run_test(
assistant,
frames_to_send=frames_to_send,
)
# No LLMContextFrame should have been pushed upstream in
# realtime mode (cascade would push one to re-run inference).
self.assertFalse(any(isinstance(f, LLMContextFrame) for f in up_frames))
if __name__ == "__main__":
unittest.main()