Commit Graph

9437 Commits

Author SHA1 Message Date
Paul Kompfner
3fee91ddec Drop redundant changelog entry for OpenAI Realtime example
The OpenAI Realtime story didn't add any service-level code — just a
new example. The original 4480.added.md entry already describes the
feature as "a realtime service like Gemini Live," which generalizes
to OpenAI Realtime.
2026-05-18 12:06:48 -04:00
Paul Kompfner
638294c1cc Add realtime-openai-local-vad example
Mirrors the Gemini Live local-VAD example for OpenAI Realtime, showing
that `wait_for_transcript_to_end_user_turn=False` composes cleanly
with `turn_detection=False`. The OpenAI Realtime service already wires
`UserStoppedSpeakingFrame` to `input_audio_buffer.commit` +
`response.create` when `turn_detection=False`, so the example is the
only new code needed.
2026-05-18 11:50:16 -04:00
Paul Kompfner
ea96b7aec7 Rename transcript-gather to post-turn transcript wait
Switch the vocabulary for the timer-driven phase that runs when
`wait_for_transcript_to_end_user_turn=False`. "Transcript gather" was
too vague to be self-documenting; "post-turn transcript wait" names
when it happens (after the user turn ends) and what it's for (waiting
for late-arriving transcripts).

Renames the internal property to `_wait_for_post_turn_transcripts`
and the supporting state/method names to match
(`_post_turn_transcript_wait_task`, `_complete_post_turn_transcript_wait`,
etc.). Updates docstrings, comments, log messages, the example
inline doc, and the test prose to use the new vocabulary consistently.
2026-05-18 10:51:14 -04:00
Paul Kompfner
666c619113 Size transcript-gather timer to STT-reported P99 TTFS
The aggregator's transcript-gather timer (used when
`wait_for_transcript_to_end_user_turn=False`) was hardcoded to
`DEFAULT_TTFS_P99`. Capture `STTMetadataFrame.ttfs_p99_latency` as
it flows through the user aggregator and prefer that value, just
like the stop strategies already do. Falls back to
`DEFAULT_TTFS_P99` when no STT service has reported a value.
2026-05-18 10:29:19 -04:00
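The preference order described in this commit can be sketched as follows. This is a minimal illustration, not the aggregator's actual code: the class name `TranscriptWaitTimeout`, its methods, and the numeric value given to `DEFAULT_TTFS_P99` are assumptions. Only the constant's name and the `ttfs_p99_latency` field come from the commit message.

```python
from typing import Optional

# Placeholder value for illustration only; the real constant's value is not
# stated in the commit message.
DEFAULT_TTFS_P99 = 1.0


class TranscriptWaitTimeout:
    """Hypothetical holder for the timer duration used after a user turn ends."""

    def __init__(self) -> None:
        self._stt_ttfs_p99: Optional[float] = None

    def on_stt_metadata(self, ttfs_p99_latency: Optional[float]) -> None:
        # Remember the latest value an STT service reported, if any.
        if ttfs_p99_latency is not None:
            self._stt_ttfs_p99 = ttfs_p99_latency

    @property
    def seconds(self) -> float:
        # Prefer the STT-reported value; fall back to the default otherwise.
        if self._stt_ttfs_p99 is not None:
            return self._stt_ttfs_p99
        return DEFAULT_TTFS_P99
```

Until `on_stt_metadata` receives a value, `seconds` returns the default, which mirrors the fallback the message describes.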
Paul Kompfner
797d09a1d5 Align vocabulary around wait_for_transcript_to_end_user_turn=False
Reframe comments, docstrings, identifiers, changelog, and example
around a single explanation of the option: (1) turn strategies do not
consider user transcripts, letting the user turn end sooner, and (2)
the aggregator gathers user transcripts on its own after the turn
ends via a simple timer, then emits `on_user_turn_message_finalized`
with the new user context message.

The mechanism is generic, so internal aggregator vocabulary stays
generic ("transcript-gather", "after the user turn ends"); the
public-facing param docstring is the one place that explains the
"local turn detection drives a realtime service" use case. The stop
strategies' `wait_for_transcript` flag is described as something
that's "usually flipped indirectly" by the aggregator param rather
than something to pair with it.

Renames internal state to match: `_expect_delayed_transcripts` →
`_aggregator_gathers_transcripts`, `_pending_finalization_*` →
`_transcript_gather_*`, `_finalize_delayed_user_message` →
`_finalize_user_message`, etc.
2026-05-18 10:18:22 -04:00
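The two-part behavior above (the turn ends without consulting transcripts, then a simple timer collects them) might look roughly like this. The sketch assumes plain `asyncio`; `PostTurnTranscriptCollector`, its methods, and the `on_finalized` callback are hypothetical names that stand in for the aggregator and its `on_user_turn_message_finalized` event.

```python
import asyncio
from typing import Callable, List, Optional


class PostTurnTranscriptCollector:
    """Hypothetical stand-in for the aggregator's timer-driven phase."""

    def __init__(self, wait_seconds: float, on_finalized: Callable[[str], None]) -> None:
        self._wait_seconds = wait_seconds
        self._on_finalized = on_finalized
        self._transcripts: List[str] = []
        self._task: Optional[asyncio.Task] = None

    def add_transcript(self, text: str) -> None:
        # Transcripts may arrive before or after the turn ends.
        self._transcripts.append(text)

    def user_turn_ended(self) -> None:
        # The turn ended without consulting transcripts; start the timer.
        self._task = asyncio.create_task(self._wait_then_finalize())

    async def _wait_then_finalize(self) -> None:
        await asyncio.sleep(self._wait_seconds)
        message = " ".join(self._transcripts)
        self._transcripts.clear()
        self._on_finalized(message)

    async def wait_closed(self) -> None:
        # Let callers await the pending timer, if one was started.
        if self._task is not None:
            await self._task


async def demo() -> List[str]:
    finalized: List[str] = []
    collector = PostTurnTranscriptCollector(0.05, finalized.append)
    collector.add_transcript("hello")
    collector.user_turn_ended()
    await asyncio.sleep(0.01)
    collector.add_transcript("there")  # late-arriving transcript
    await collector.wait_closed()
    return finalized
```

Running `asyncio.run(demo())` shows the late transcript being included in the single finalized message, because finalization is driven by the timer and not by transcript arrival.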
Paul Kompfner
ee1538d18e test: cover fallback path and align with vocabulary refactor
Adds two tests for the strategy's transcripts-without-VAD fallback
path — one in default mode (both events fire with the aggregated
content) and one in delayed-transcript mode (only
``on_user_turn_message_finalized`` fires; no end-of-turn event is
emitted since no turn ever started in the controller).

Updates existing tests for the vocabulary refactor: assertions now
expect ``content=None`` (not ``""``) for the end-of-turn event in
delayed-transcript mode; comments and docstrings use the
standardized terms (end of turn, user message finalization,
pending-finalization timer, plural "transcripts").
2026-05-18 09:55:42 -04:00
Paul Kompfner
8330c3487d Refactor delayed-transcript machinery; standardize vocabulary
Splits ``_maybe_emit_user_turn_stopped`` into three focused methods —
``_flush_user_message_to_context`` (push aggregation, return content +
timestamp), ``_finalize_user_turn`` (default-mode flow, emits both
events), and ``_finalize_delayed_user_message`` (delayed-mode flow,
emits only ``on_user_turn_message_finalized``). Fixes a side-issue
where ``on_user_turn_stopped`` could fire from non-end-of-turn paths
in delayed-transcript mode; that event now has a single origin (the
end-of-turn handler).

Standardizes vocabulary across docstrings and comments:

- "Default mode" / "Delayed-transcript mode" (with
  ``_expect_delayed_transcripts == False/True``)
- "End of turn" (not "audible stop" or "audible end of turn")
- "User message finalization" (the moment user-text is flushed to
  context + ``on_user_turn_message_finalized`` fires)
- "Pending finalization" (the in-between state in delayed mode)
- Transcripts (plural — the aggregator combines multiple per turn)

The timer that triggers user message finalization is no longer
described as a "backstop" — it's the sole trigger for finalization
in delayed-transcript mode, not a fallback. Renamed accordingly:
``_pending_finalization_task``, ``_pending_finalization_handler``,
``_run_pending_finalization``, ``_discard_pending_finalization``.

Gives the two events separate message classes:
``UserTurnStoppedMessage.content`` is now ``str | None`` (``None``
at end-of-turn in delayed-transcript mode), and a new
``UserMessageFinalizedMessage`` carries the always-populated
``content`` for the finalization event.
2026-05-18 09:55:11 -04:00
Paul Kompfner
4479a3a6af docs: tighten wait_for_transcript_to_end_user_turn docstring + test docstring
Reframes the strategy mutations as part of configuring the flag
(not an "also" aside), and the ordering invariant in the test
docstring as flush-timing (not arrival-timing).
2026-05-15 15:16:39 -04:00
Paul Kompfner
8631518388 test: cover wait_for_transcript_to_end_user_turn=False aggregator behavior
Adds five tests for the delayed-transcript flow on
`LLMUserAggregator`:

- basic flow: `on_user_turn_stopped` fires fast with empty content;
  `on_user_turn_message_finalized` fires later with the populated
  transcript; user message lands in context.
- backstop with no transcript: backstop timer still finalizes the
  turn; message_finalized fires with empty content; no user message
  added to context.
- next-turn precondition violation: a new VAD start fires while the
  previous turn is still pending; the previous turn is force-flushed
  before the new turn begins.
- context-order with assistant response: paired aggregators with a
  late user transcript arriving before the assistant content streams;
  verifies the user message lands in context before the assistant
  message (the conversational-order invariant the design relies on).
- strategy mutation: explicit start/stop strategies are mutated by
  the bundle — `TranscriptionUserTurnStartStrategy` is dropped from
  start, and `wait_for_transcript` is flipped to False on the stop
  strategy that had it explicitly set to True.

Tests patch `DEFAULT_TTFS_P99` to keep the backstop fast.
2026-05-15 14:08:50 -04:00
Paul Kompfner
47e2f7a037 realtime + local turn detection: drop the user-transcript wait
Add the configuration surface to drive a realtime service like Gemini
Live from local turn detection without paying user-transcript latency.
Cascaded pipelines wait for a transcript before ending the user's turn
because the downstream LLM needs the user's words recorded in context
— but that wait is pure latency in pipelines using local turn
detection to drive a realtime service, which consumes user audio
directly.

Set `wait_for_transcript_to_end_user_turn=False` on
`LLMUserAggregatorParams` to turn this on. With that single flag the
aggregator:

- drops `TranscriptionUserTurnStartStrategy` from the start strategies
  (so late-arriving realtime transcripts don't trigger new turns),
- sets `wait_for_transcript=False` on any stop strategy that supports
  it (so the turn ends on the audible end of the turn, without
  waiting for a transcript),
- fires `on_user_turn_stopped` on the audible end of the turn with
  empty `content` (since the transcript hasn't arrived), and
- defers the context flush until the transcript arrives or a backstop
  timer fires.

A new `on_user_turn_message_finalized` event fires when the user's
message has been written to context. In the default mode it
coincides with `on_user_turn_stopped`; in the delayed-transcript mode
it fires later. Consumers that want the populated transcript should
subscribe to `on_user_turn_message_finalized` — it's the event that
always carries the user message, regardless of mode.

Strategy mutations are logged: loudly when the user passed their own
strategies (we're overwriting parts of their config), quietly
otherwise. The strategy-level `wait_for_transcript` parameter on
`TurnAnalyzerUserTurnStopStrategy` and `SpeechTimeoutUserTurnStopStrategy`
remains exposed for advanced cases.

The example `realtime-gemini-live-local-vad.py` demonstrates the full
pattern.
2026-05-15 13:49:16 -04:00
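The first two strategy mutations listed in this commit can be illustrated with stand-in classes. Nothing here is the library's code: `TranscriptionStartStrategy`, `VADStartStrategy`, `StopStrategyWithWait`, `StopStrategyWithoutWait`, and `apply_no_transcript_wait` are hypothetical names chosen for the sketch, and the real strategies have their own constructors and behavior.

```python
from dataclasses import dataclass
from typing import Tuple


class TranscriptionStartStrategy:
    """Stand-in for a start strategy that begins a turn on a transcript."""


class VADStartStrategy:
    """Stand-in for a start strategy driven by voice activity."""


@dataclass
class StopStrategyWithWait:
    """Stand-in for a stop strategy exposing `wait_for_transcript`."""

    wait_for_transcript: bool = True


@dataclass
class StopStrategyWithoutWait:
    """Stand-in for a stop strategy with no such option."""

    name: str = "other"


def apply_no_transcript_wait(start: list, stop: list) -> Tuple[list, list]:
    # Drop transcript-driven start strategies so late transcripts
    # cannot open a new turn.
    start = [s for s in start if not isinstance(s, TranscriptionStartStrategy)]
    # Flip the flag only on stop strategies that support it.
    for s in stop:
        if hasattr(s, "wait_for_transcript"):
            s.wait_for_transcript = False
    return start, stop
```

Checking for the attribute instead of a specific class keeps the sketch faithful to the phrase "any stop strategy that supports it".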
Paul Kompfner
6d21507e95 user turn stop strategies: don't always wait for transcripts
Until now, both TurnAnalyzerUserTurnStopStrategy and
SpeechTimeoutUserTurnStopStrategy waited for at least one transcript
before ending the user turn. That's the right behavior for cascaded
pipelines, where the downstream LLM can't respond until the user's
words are recorded in its context — but it's pure latency in pipelines
using local turn detection to drive a realtime service like Gemini
Live.

Add a `require_transcript: bool | None = None` parameter to both
strategies. When None (default), the value is inferred from whether an
STTMetadataFrame has been seen — a proxy for "does the downstream LLM
need the transcript in context?". Explicit True/False overrides the
heuristic.

When a transcript isn't required, the strategies also skip the
STT-waiting timeout in the VAD-stopped handler, so the user turn ends
as soon as the analyzer (or speech timer) concludes the turn is
complete.
2026-05-13 15:45:51 -04:00
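The tri-state parameter described in this commit resolves in two steps, sketched below. The function name `transcript_required` and the `stt_metadata_seen` argument are assumptions made for illustration; the commit only names the `require_transcript` parameter and the `STTMetadataFrame` heuristic.

```python
from typing import Optional


def transcript_required(require_transcript: Optional[bool], stt_metadata_seen: bool) -> bool:
    # An explicit True/False always wins over the heuristic.
    if require_transcript is not None:
        return require_transcript
    # Otherwise infer: seeing STT metadata suggests a downstream LLM
    # needs the transcript in its context.
    return stt_metadata_seen
```

With `None`, the result tracks whether STT metadata was observed; with an explicit value, the observation is ignored.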
Mark Backman
5fef239b68 Merge pull request #4450 from pipecat-ai/mb/gpt-realtime-whisper
Default OpenAI Realtime transcription to gpt-realtime-whisper
2026-05-13 09:48:33 -04:00
Filipi da Silva Fuchter
9148e307cc Merge pull request #4464 from pipecat-ai/filipi/nvidia_sagemaker
NVIDIA SageMaker - TTS and STT services
2026-05-13 07:53:26 -03:00
Filipi da Silva Fuchter
703d23b658 Update examples/voice/voice-nvidia-sagemaker.py
Co-authored-by: Mark Backman <mark@daily.co>
2026-05-13 06:36:57 -04:00
Filipi da Silva Fuchter
227ba288da Update examples/voice/voice-nvidia-sagemaker.py
Co-authored-by: Mark Backman <mark@daily.co>
2026-05-13 06:36:45 -04:00
Mark Backman
3e8c5c08f4 Clarify realtime settings update condition
2026-05-12 17:48:53 -04:00
Mark Backman
644030584f Centralize OpenAI audio constants
2026-05-12 17:48:53 -04:00
filipi87
0740021ff4 Removing changelog for sanitize_text_for_tts
2026-05-12 18:29:35 -03:00
filipi87
68f265fa62 Fixing ruff format.
2026-05-12 18:28:14 -03:00
filipi87
b9f052079d Removing sanitize_text_for_tts
2026-05-12 18:22:15 -03:00
filipi87
130bb7371c Removing sanitize_text_for_tts
2026-05-12 18:21:47 -03:00
filipi87
5d61763987 Refactoring how we are reconnecting the STT.
2026-05-12 18:20:19 -03:00
filipi87
7984556692 Fixing typecheck.
2026-05-12 18:00:07 -03:00
filipi87
bea9e4b3ba New example voice-nvidia-sagemaker.py
2026-05-12 17:44:11 -03:00
Mark Backman
19df443500 Merge pull request #4471 from pipecat-ai/mb/fix-gstreamer-pyright-import
2026-05-12 16:34:48 -04:00
Mark Backman
07f241143b Merge pull request #4469 from pipecat-ai/mb/remove-vad-analyzer-runner-utils-docstring
2026-05-12 16:34:27 -04:00
Mark Backman
2fdb9bbf42 Merge pull request #4462 from pipecat-ai/mb/cartesia-sonic-3.5
2026-05-12 16:34:04 -04:00
filipi87
0146947b68 Addressing the comments left in the PR review.
2026-05-12 17:12:19 -03:00
Mark Backman
e2bfa6352f Add changelog for #4450
2026-05-12 15:20:57 -04:00
Mark Backman
abd28e2ac1 Update OpenAI realtime transcription default
2026-05-12 15:20:57 -04:00
kompfner
88deebbf5f Merge pull request #4472 from pipecat-ai/pk/default-gpt-realtime-2
Switch OpenAIRealtimeLLMService default model to gpt-realtime-2
2026-05-12 15:17:12 -04:00
filipi87
c2bdc1aada Fixing metrics and adding extra guard after sanitization.
2026-05-12 16:11:01 -03:00
Paul Kompfner
fc0589e8f1 Switch OpenAIRealtimeLLMService default model to gpt-realtime-2
2026-05-12 14:57:59 -04:00
kompfner
67f8d34e9f Merge pull request #4470 from pipecat-ai/pk/gpt-realtime-2-reasoning-effort
Add reasoning support to OpenAIRealtimeLLMService for gpt-realtime-2
2026-05-12 14:43:39 -04:00
kompfner
d3b8710720 Merge pull request #4465 from pipecat-ai/pk/gpt-realtime-2
Handle gpt-realtime-2 multi-output-item audio responses
2026-05-12 14:30:15 -04:00
Mark Backman
86e2aa85d3 Fix GStreamer pipeline source pyright import
2026-05-12 14:16:36 -04:00
Paul Kompfner
b89500256d Drop debug logging added while investigating multi-output-item audio
2026-05-12 14:05:16 -04:00
Paul Kompfner
a52bdef32b Add reasoning support to OpenAIRealtimeLLMService for gpt-realtime-2
2026-05-12 13:55:19 -04:00
Mark Backman
afd9fc5fdf Remove vad_analyzer from create_transport docstring example
2026-05-12 13:50:17 -04:00
filipi87
7f98dba925 Changelog files for the new nvidia features.
2026-05-12 14:43:12 -03:00
filipi87
6a27ed35b1 Fixing the Bidi client to accept None.
2026-05-12 12:19:30 -03:00
filipi87
a34864d643 Fixed ruff, pyright, and test_service_init failures
2026-05-12 11:39:52 -03:00
Paul Kompfner
007fa3a3a8 Handle gpt-realtime-2 multi-output-item audio responses
A single Realtime API response can now contain more than one audio item
(observed with gpt-realtime-2), and the first item's audio.done can
arrive after deltas from the second have started arriving. Deltas still
arrive strictly in playback order across items, so we keep forwarding
them as received — matching OpenAI's reference implementation.

Adjusted OpenAIRealtimeLLMService so a multi-item response is treated as
one continuous TTS turn:

- _handle_evt_audio_delta: on item switch, advance the tracked item in
  place (reset total_size) without emitting another TTSStartedFrame.
  Truncation now always targets the latest item.
- _handle_evt_audio_done: debug-trace only; no longer pushes
  TTSStoppedFrame.
- _handle_evt_response_done: pushes a single TTSStoppedFrame per turn,
  bookending the audio with the Started pushed on the first delta.

Added tests covering single-item, overlapping multi-item, non-overlapping
multi-item, and interrupt-during-multi-item (last-item-wins truncation).
2026-05-12 10:34:50 -04:00
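The handler changes listed in this commit amount to a small state machine, sketched here under stated assumptions. `AudioTurnTracker` and its method names are hypothetical, and marker strings stand in for `TTSStartedFrame` and `TTSStoppedFrame`; the service's real handlers do more than this.

```python
from typing import List, Optional


class AudioTurnTracker:
    """Hypothetical tracker; emits marker strings instead of real frames."""

    def __init__(self) -> None:
        self.current_item_id: Optional[str] = None
        self.total_size = 0
        self.started = False
        self.emitted: List[str] = []

    def on_audio_delta(self, item_id: str, size: int) -> None:
        if not self.started:
            # First delta of the response: open the turn once.
            self.started = True
            self.emitted.append("started")
        if item_id != self.current_item_id:
            # Item switch: advance in place, with no second "started".
            self.current_item_id = item_id
            self.total_size = 0
        self.total_size += size

    def on_audio_done(self, item_id: str) -> None:
        # Trace only; the turn stays open because another item may follow.
        pass

    def on_response_done(self) -> None:
        # Close the turn exactly once per response.
        if self.started:
            self.emitted.append("stopped")
        self.started = False
        self.current_item_id = None
        self.total_size = 0
```

Because `current_item_id` always points at the most recent item, any truncation based on it would target the latest item, which matches the "last-item-wins" behavior named in the tests.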
filipi87
5dd7413c00 Nvidia Sagemaker Nemotron ASR STT service
2026-05-12 11:16:00 -03:00
filipi87
8e0a338d96 Nvidia Sagemaker Magpie TTS service
2026-05-12 11:15:42 -03:00
Mark Backman
d65aee9181 Add changelog for #4462
2026-05-11 17:34:00 -04:00
Mark Backman
1755016679 Update default Cartesia TTS model to sonic-3.5
2026-05-11 17:33:40 -04:00
Mark Backman
b7f6298601 Merge pull request #4461 from pipecat-ai/mb/security-vuln-2025-05-11
Update uv.lock for urllib3 and langchain-core
2026-05-11 15:58:05 -04:00
Mark Backman
396873ac7e Merge pull request #4460 from pipecat-ai/mb/codex-skills
Add Codex skills and AGENTS.md
2026-05-11 15:57:49 -04:00
Mark Backman
5b33964a1b Update uv.lock for urllib3 and langchain-core
2026-05-11 15:51:01 -04:00