pipecat

Author	SHA1	Message	Date
Paul Kompfner	3fee91ddec	Drop redundant changelog entry for OpenAI Realtime example The OpenAI Realtime story didn't add any service-level code — just a new example. The original 4480.added.md entry already describes the feature as "a realtime service like Gemini Live," which generalizes to OpenAI Realtime.	2026-05-18 12:06:48 -04:00
Paul Kompfner	638294c1cc	Add realtime-openai-local-vad example Mirrors the Gemini Live local-VAD example for OpenAI Realtime, showing that `wait_for_transcript_to_end_user_turn=False` composes cleanly with `turn_detection=False`. The OpenAI Realtime service already wires `UserStoppedSpeakingFrame` to `input_audio_buffer.commit` + `response.create` when `turn_detection=False`, so the example is the only new code needed.	2026-05-18 11:50:16 -04:00
Paul Kompfner	ea96b7aec7	Rename transcript-gather to post-turn transcript wait Switch the vocabulary for the timer-driven phase that runs when `wait_for_transcript_to_end_user_turn=False`. "Transcript gather" was too vague to be self-documenting; "post-turn transcript wait" names when it happens (after the user turn ends) and what it's for (waiting for late-arriving transcripts). Renames the internal property to `_wait_for_post_turn_transcripts` and the supporting state/method names to match (`_post_turn_transcript_wait_task`, `_complete_post_turn_transcript_wait`, etc.). Updates docstrings, comments, log messages, the example inline doc, and the test prose to use the new vocabulary consistently.	2026-05-18 10:51:14 -04:00
Paul Kompfner	666c619113	Size transcript-gather timer to STT-reported P99 TTFS The aggregator's transcript-gather timer (used when `wait_for_transcript_to_end_user_turn=False`) was hardcoded to `DEFAULT_TTFS_P99`. Capture `STTMetadataFrame.ttfs_p99_latency` as it flows through the user aggregator and prefer that value, just like the stop strategies already do. Falls back to `DEFAULT_TTFS_P99` when no STT service has reported a value.	2026-05-18 10:29:19 -04:00
Paul Kompfner	797d09a1d5	Align vocabulary around `wait_for_transcript_to_end_user_turn=False` Reframe comments, docstrings, identifiers, changelog, and example around a single explanation of the option: (1) turn strategies do not consider user transcripts, letting the user turn end sooner, and (2) the aggregator gathers user transcripts on its own after the turn ends via a simple timer, then emits `on_user_turn_message_finalized` with the new user context message. The mechanism is generic, so internal aggregator vocabulary stays generic ("transcript-gather", "after the user turn ends"); the public-facing param docstring is the one place that explains the "local turn detection drives a realtime service" use case. The stop strategies' `wait_for_transcript` flag is pointed at as something that's "usually flipped indirectly" by the aggregator param rather than something to pair with it. Renames internal state to match: `_expect_delayed_transcripts` → `_aggregator_gathers_transcripts`, `_pending_finalization_` → `_transcript_gather_`, `_finalize_delayed_user_message` → `_finalize_user_message`, etc.	2026-05-18 10:18:22 -04:00
Paul Kompfner	ee1538d18e	test: cover fallback path and align with vocabulary refactor Adds two tests for the strategy's transcripts-without-VAD fallback path — one in default mode (both events fire with the aggregated content) and one in delayed-transcript mode (only ``on_user_turn_message_finalized`` fires; no end-of-turn event is emitted since no turn ever started in the controller). Updates existing tests for the vocabulary refactor: assertions now expect ``content=None`` (not ``""``) for the end-of-turn event in delayed-transcript mode; comments and docstrings use the standardized terms (end of turn, user message finalization, pending-finalization timer, plural "transcripts").	2026-05-18 09:55:42 -04:00
Paul Kompfner	8330c3487d	Refactor delayed-transcript machinery; standardize vocabulary Splits ``_maybe_emit_user_turn_stopped`` into three focused methods — ``_flush_user_message_to_context`` (push aggregation, return content + timestamp), ``_finalize_user_turn`` (default-mode flow, emits both events), and ``_finalize_delayed_user_message`` (delayed-mode flow, emits only ``on_user_turn_message_finalized``). Fixes a side-issue where ``on_user_turn_stopped`` could fire from non-end-of-turn paths in delayed-transcript mode; that event now has a single origin (the end-of-turn handler). Standardizes vocabulary across docstrings and comments: - "Default mode" / "Delayed-transcript mode" (with ``_expect_delayed_transcripts == False/True``) - "End of turn" (not "audible stop" or "audible end of turn") - "User message finalization" (the moment user-text is flushed to context + ``on_user_turn_message_finalized`` fires) - "Pending finalization" (the in-between state in delayed mode) - Transcripts (plural — the aggregator combines multiple per turn) The timer that triggers user message finalization is no longer described as a "backstop" — it's the sole trigger for finalization in delayed-transcript mode, not a fallback. Renamed accordingly: ``_pending_finalization_task``, ``_pending_finalization_handler``, ``_run_pending_finalization``, ``_discard_pending_finalization``. Adds a separate message class for the two events: ``UserTurnStoppedMessage.content`` is now ``str | None`` (``None`` at end-of-turn in delayed-transcript mode), and a new ``UserMessageFinalizedMessage`` carries the always-populated ``content`` for the finalization event.	2026-05-18 09:55:11 -04:00
Paul Kompfner	4479a3a6af	docs: tighten wait_for_transcript_to_end_user_turn docstring + test docstring Reframes the strategy mutations as part of configuring the flag (not an "also" aside), and the ordering invariant in the test docstring as flush-timing (not arrival-timing).	2026-05-15 15:16:39 -04:00
Paul Kompfner	8631518388	test: cover wait_for_transcript_to_end_user_turn=False aggregator behavior Adds five tests for the delayed-transcript flow on `LLMUserAggregator`: - basic flow: `on_user_turn_stopped` fires fast with empty content; `on_user_turn_message_finalized` fires later with the populated transcript; user message lands in context. - backstop with no transcript: backstop timer still finalizes the turn; message_finalized fires with empty content; no user message added to context. - next-turn precondition violation: a new VAD start fires while the previous turn is still pending; the previous turn is force-flushed before the new turn begins. - context-order with assistant response: paired aggregators with a late user transcript arriving before the assistant content streams; verifies the user message lands in context before the assistant message (the conversational-order invariant the design relies on). - strategy mutation: explicit start/stop strategies are mutated by the bundle — `TranscriptionUserTurnStartStrategy` is dropped from start, `wait_for_transcript=False` is flipped on the stop strategy that had it explicitly set to True. Tests patch `DEFAULT_TTFS_P99` to keep the backstop fast.	2026-05-15 14:08:50 -04:00
Paul Kompfner	47e2f7a037	realtime + local turn detection: drop the user-transcript wait Add the configuration surface to drive a realtime service like Gemini Live from local turn detection without paying user-transcript latency. Cascaded pipelines wait for a transcript before ending the user's turn because the downstream LLM needs the user's words recorded in context — but that wait is pure latency in pipelines using local turn detection to drive a realtime service, which consumes user audio directly. Set `wait_for_transcript_to_end_user_turn=False` on `LLMUserAggregatorParams` to turn this on. With that single flag the aggregator: - drops `TranscriptionUserTurnStartStrategy` from the start strategies (so late-arriving realtime transcripts don't trigger new turns), - sets `wait_for_transcript=False` on any stop strategy that supports it (so the turn ends on the audible end of the turn, without waiting for a transcript), - fires `on_user_turn_stopped` on the audible end of the turn with empty `content` (since the transcript hasn't arrived), and - defers the context flush until the transcript arrives or a backstop timer fires. A new `on_user_turn_message_finalized` event fires when the user's message has been written to context. In the default mode it coincides with `on_user_turn_stopped`; in the delayed-transcript mode it fires later. Consumers that want the populated transcript should subscribe to `on_user_turn_message_finalized` — it's the event that always carries the user message, regardless of mode. Strategy mutations are logged: loudly when the user passed their own strategies (we're overwriting parts of their config), quietly otherwise. The strategy-level `wait_for_transcript` parameter on `TurnAnalyzerUserTurnStopStrategy` and `SpeechTimeoutUserTurnStopStrategy` remains exposed for advanced cases. The example `realtime-gemini-live-local-vad.py` demonstrates the full pattern.	2026-05-15 13:49:16 -04:00
Paul Kompfner	6d21507e95	user turn stop strategies: don't always wait for transcripts Until now, both TurnAnalyzerUserTurnStopStrategy and SpeechTimeoutUserTurnStopStrategy waited for at least one transcript before ending the user turn. That's the right behavior for cascaded pipelines, where the downstream LLM can't respond until the user's words are recorded in its context — but it's pure latency in pipelines using local turn detection to drive a realtime service like Gemini Live. Add a `require_transcript: bool | None = None` parameter to both strategies. When None (default), it infers from whether an STTMetadataFrame has been seen — a proxy for "does the downstream LLM need the transcript in context?". Explicit True/False overrides the heuristic. When a transcript isn't required, the strategies also skip the STT-waiting timeout in the VAD-stopped handler, so the user turn ends as soon as the analyzer (or speech timer) concludes the turn is complete.	2026-05-13 15:45:51 -04:00
Mark Backman	5fef239b68	Merge pull request #4450 from pipecat-ai/mb/gpt-realtime-whisper Default OpenAI Realtime transcription to gpt-realtime-whisper	2026-05-13 09:48:33 -04:00
Filipi da Silva Fuchter	9148e307cc	Merge pull request #4464 from pipecat-ai/filipi/nvidia_sagemaker NVidia sagemaker - TTS and STT services	2026-05-13 07:53:26 -03:00
Filipi da Silva Fuchter	703d23b658	Update examples/voice/voice-nvidia-sagemaker.py Co-authored-by: Mark Backman <mark@daily.co>	2026-05-13 06:36:57 -04:00
Filipi da Silva Fuchter	227ba288da	Update examples/voice/voice-nvidia-sagemaker.py Co-authored-by: Mark Backman <mark@daily.co>	2026-05-13 06:36:45 -04:00
Mark Backman	3e8c5c08f4	Clarify realtime settings update condition	2026-05-12 17:48:53 -04:00
Mark Backman	644030584f	Centralize OpenAI audio constants	2026-05-12 17:48:53 -04:00
filipi87	0740021ff4	Removing changelog for sanitize_text_for_tts	2026-05-12 18:29:35 -03:00
filipi87	68f265fa62	Fixing ruff format.	2026-05-12 18:28:14 -03:00
filipi87	b9f052079d	Removing sanitize_text_for_tts	2026-05-12 18:22:15 -03:00
filipi87	130bb7371c	Removing sanitize_text_for_tts	2026-05-12 18:21:47 -03:00
filipi87	5d61763987	Refactoring how we are reconnecting the STT.	2026-05-12 18:20:19 -03:00
filipi87	7984556692	Fixing typecheck.	2026-05-12 18:00:07 -03:00
filipi87	bea9e4b3ba	New example voice-nvidia-sagemaker.py	2026-05-12 17:44:11 -03:00
Mark Backman	19df443500	Merge pull request #4471 from pipecat-ai/mb/fix-gstreamer-pyright-import	2026-05-12 16:34:48 -04:00
Mark Backman	07f241143b	Merge pull request #4469 from pipecat-ai/mb/remove-vad-analyzer-runner-utils-docstring	2026-05-12 16:34:27 -04:00
Mark Backman	2fdb9bbf42	Merge pull request #4462 from pipecat-ai/mb/cartesia-sonic-3.5	2026-05-12 16:34:04 -04:00
filipi87	0146947b68	Addressing the comments left in the PR review.	2026-05-12 17:12:19 -03:00
Mark Backman	e2bfa6352f	Add changelog for #4450	2026-05-12 15:20:57 -04:00
Mark Backman	abd28e2ac1	Update OpenAI realtime transcription default	2026-05-12 15:20:57 -04:00
kompfner	88deebbf5f	Merge pull request #4472 from pipecat-ai/pk/default-gpt-realtime-2 Switch OpenAIRealtimeLLMService default model to gpt-realtime-2	2026-05-12 15:17:12 -04:00
filipi87	c2bdc1aada	Fixing metrics and adding extra guard after sanitization.	2026-05-12 16:11:01 -03:00
Paul Kompfner	fc0589e8f1	Switch OpenAIRealtimeLLMService default model to gpt-realtime-2	2026-05-12 14:57:59 -04:00
kompfner	67f8d34e9f	Merge pull request #4470 from pipecat-ai/pk/gpt-realtime-2-reasoning-effort Add reasoning support to OpenAIRealtimeLLMService for gpt-realtime-2	2026-05-12 14:43:39 -04:00
kompfner	d3b8710720	Merge pull request #4465 from pipecat-ai/pk/gpt-realtime-2 Handle gpt-realtime-2 multi-output-item audio responses	2026-05-12 14:30:15 -04:00
Mark Backman	86e2aa85d3	Fix GStreamer pipeline source pyright import	2026-05-12 14:16:36 -04:00
Paul Kompfner	b89500256d	Drop debug logging added while investigating multi-output-item audio	2026-05-12 14:05:16 -04:00
Paul Kompfner	a52bdef32b	Add reasoning support to OpenAIRealtimeLLMService for gpt-realtime-2	2026-05-12 13:55:19 -04:00
Mark Backman	afd9fc5fdf	Remove vad_analyzer from create_transport docstring example	2026-05-12 13:50:17 -04:00
filipi87	7f98dba925	Changelog files for the new nvidia features.	2026-05-12 14:43:12 -03:00
filipi87	6a27ed35b1	Fixing the Bidi client to accept None.	2026-05-12 12:19:30 -03:00
filipi87	a34864d643	Fixed ruff, pyright, and test_service_init failures	2026-05-12 11:39:52 -03:00
Paul Kompfner	007fa3a3a8	Handle gpt-realtime-2 multi-output-item audio responses A single Realtime API response can now contain more than one audio item (observed with gpt-realtime-2), and the first item's audio.done can arrive after deltas from the second have started arriving. Deltas still arrive strictly in playback order across items, so we keep forwarding them as received — matching OpenAI's reference implementation. Adjusted OpenAIRealtimeLLMService so a multi-item response is treated as one continuous TTS turn: - _handle_evt_audio_delta: on item switch, advance the tracked item in place (reset total_size) without emitting another TTSStartedFrame. Truncation now always targets the latest item. - _handle_evt_audio_done: debug-trace only; no longer pushes TTSStoppedFrame. - _handle_evt_response_done: pushes a single TTSStoppedFrame per turn, bookending the audio with the Started pushed on the first delta. Added tests covering single-item, overlapping multi-item, non-overlapping multi-item, and interrupt-during-multi-item (last-item-wins truncation).	2026-05-12 10:34:50 -04:00
filipi87	5dd7413c00	Nvidia Sagemaker Nemotron ASR STT service	2026-05-12 11:16:00 -03:00
filipi87	8e0a338d96	Nvidia Sagemaker Magpie TTS service	2026-05-12 11:15:42 -03:00
Mark Backman	d65aee9181	Add changelog for #4462	2026-05-11 17:34:00 -04:00
Mark Backman	1755016679	Update default Cartesia TTS model to sonic-3.5	2026-05-11 17:33:40 -04:00
Mark Backman	b7f6298601	Merge pull request #4461 from pipecat-ai/mb/security-vuln-2025-05-11 Update uv.lock for urllib3 and langchain-core	2026-05-11 15:58:05 -04:00
Mark Backman	396873ac7e	Merge pull request #4460 from pipecat-ai/mb/codex-skills Add Codex skills and AGENTS.md	2026-05-11 15:57:49 -04:00
Mark Backman	5b33964a1b	Update uv.lock for urllib3 and langchain-core	2026-05-11 15:51:01 -04:00

9437 Commits
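
The 2026-05-18 commits above describe a post-turn transcript wait that runs when `wait_for_transcript_to_end_user_turn=False`: the user turn ends without waiting for transcripts, `on_user_turn_stopped` carries `content=None`, and a timer sized to the STT-reported P99 TTFS (falling back to `DEFAULT_TTFS_P99`) later triggers `on_user_turn_message_finalized` with the gathered transcripts. The following is a minimal, self-contained sketch of that flow as the commit messages describe it, not pipecat's implementation. The class `PostTurnTranscriptWaitSketch`, its methods, and the numeric value given to `DEFAULT_TTFS_P99` are hypothetical and exist only for illustration; only the event names and the fallback rule come from the log.

```python
import asyncio

# Assumed placeholder value in seconds; the real constant's value is not in the log.
DEFAULT_TTFS_P99 = 1.0


class PostTurnTranscriptWaitSketch:
    """Hypothetical stand-in for the aggregator behavior; not pipecat's class."""

    def __init__(self, stt_ttfs_p99=None):
        # Prefer the STT-reported P99 TTFS; fall back to the default otherwise.
        self.wait_secs = stt_ttfs_p99 if stt_ttfs_p99 is not None else DEFAULT_TTFS_P99
        self.transcripts = []
        self.events = []
        self.wait_task = None

    def on_transcript(self, text):
        # Transcripts may arrive after the user turn has already ended.
        self.transcripts.append(text)

    def end_user_turn(self):
        # The turn ends without waiting for transcripts, so content is None.
        self.events.append(("on_user_turn_stopped", None))
        self.wait_task = asyncio.create_task(self._complete_wait())

    async def _complete_wait(self):
        # The timer is the sole trigger for finalization in this mode.
        await asyncio.sleep(self.wait_secs)
        content = " ".join(self.transcripts)
        self.transcripts.clear()
        self.events.append(("on_user_turn_message_finalized", content))


async def demo():
    agg = PostTurnTranscriptWaitSketch(stt_ttfs_p99=0.05)
    agg.end_user_turn()
    await asyncio.sleep(0.01)  # transcripts arrive late, during the wait
    agg.on_transcript("hello")
    agg.on_transcript("there")
    await agg.wait_task
    return agg.events


events = asyncio.run(demo())
```

In this sketch a consumer that needs the user's words would read them from the second event, which matches the guidance in commit 47e2f7a037 to subscribe to `on_user_turn_message_finalized` for the populated transcript.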