Commit Graph

525 Commits

Author SHA1 Message Date
Antoni Silvestre
18368d047e Linting and changes to adapt to v1.0 2026-05-18 14:40:56 +02:00
asilvestre
e3abb4b6d7 apply suggestions in PR 2026-05-18 14:40:56 +02:00
asilvestre
c61672194d Vonage Video Connector Transport 2026-05-18 14:40:49 +02:00
Mark Backman
73278d3309 Use majority language for Soniox transcripts 2026-05-14 15:18:43 -04:00
Mark Backman
49bda11ae8 Merge pull request #4482 from pipecat-ai/mb/soniox-stt-token-language
Propagate Soniox token language
2026-05-13 16:28:56 -04:00
Mark Backman
078af6969a Merge pull request #4473 from timofey-TK/inworld-tts-v2
Add support for Inworld TTS v2 fields
2026-05-13 15:32:16 -04:00
Mark Backman
82f0896d6a Propagate Soniox token language 2026-05-13 15:23:22 -04:00
Mark Backman
08680732f6 Merge pull request #4475 from pipecat-ai/mb/cartesia-korean-fix
Fix Cartesia CJK timestamp spacing
2026-05-13 13:20:42 -04:00
Mark Backman
064b68aa01 Fix Cartesia CJK timestamp spacing 2026-05-13 13:13:40 -04:00
Mark Backman
5fef239b68 Merge pull request #4450 from pipecat-ai/mb/gpt-realtime-whisper
Default OpenAI Realtime transcription to gpt-realtime-whisper
2026-05-13 09:48:33 -04:00
Timofey
39e7f9e354 Fix Inworld TTS v2 request fields 2026-05-13 11:17:31 +03:00
Mark Backman
644030584f Centralize OpenAI audio constants 2026-05-12 17:48:53 -04:00
Mark Backman
abd28e2ac1 Update OpenAI realtime transcription default 2026-05-12 15:20:57 -04:00
Paul Kompfner
a52bdef32b Add reasoning support to OpenAIRealtimeLLMService for gpt-realtime-2 2026-05-12 13:55:19 -04:00
Paul Kompfner
007fa3a3a8 Handle gpt-realtime-2 multi-output-item audio responses
A single Realtime API response can now contain more than one audio item
(observed with gpt-realtime-2), and the first item's audio.done can
arrive after deltas from the second have started arriving. Deltas still
arrive strictly in playback order across items, so we keep forwarding
them as received — matching OpenAI's reference implementation.

Adjusted OpenAIRealtimeLLMService so a multi-item response is treated as
one continuous TTS turn:

- _handle_evt_audio_delta: on item switch, advance the tracked item in
  place (reset total_size) without emitting another TTSStartedFrame.
  Truncation now always targets the latest item.
- _handle_evt_audio_done: debug-trace only; no longer pushes
  TTSStoppedFrame.
- _handle_evt_response_done: pushes a single TTSStoppedFrame per turn,
  bookending the audio with the Started pushed on the first delta.

Added tests covering single-item, overlapping multi-item, non-overlapping
multi-item, and interrupt-during-multi-item (last-item-wins truncation).
2026-05-12 10:34:50 -04:00
Paul Kompfner
72d0fb418a fix: restore cancel_on_interruption=False support in AWS Nova Sonic and OpenAI Realtime
Before the new async-tool mechanism landed, AWSNovaSonicLLMService and
OpenAIRealtimeLLMService honored cancel_on_interruption=False by simply
not cancelling in-flight function calls on interruption — the eventual
result then flowed through the same channel as any synchronous tool
result. The new mechanism (which appends started/intermediate/final
messages to the LLM context as the underlying task progresses) broke
that path: the realtime services didn't know how to interpret those
messages, and the eventual result was never delivered to the provider.

Restore the flag's behavior by teaching both services to detect
async-tool messages in the context and route them appropriately:

- started → skipped silently. The provider already issued the tool call
  and natively awaits a result; nothing to send for the started marker.
- final → delivered via the formal tool-result channel. Same path as a
  synchronous tool result, just delayed.

Streamed intermediate results (FunctionCallResultProperties(is_final=
False)) are not supported on these realtime services. An intermediate
result is logged as an error and surfaced via push_error, then dropped.
Use a non-realtime LLM service if a tool needs to stream intermediate
results. (Docstrings on register_function, register_direct_function, and
FunctionCallResultProperties.is_final updated to call this out.)

A new shared module pipecat.processors.aggregators.async_tool_messages
is the single source of truth for the on-the-wire payload shape: the
aggregator uses its build_*_message functions when injecting messages,
and the realtime services use parse_message when scanning the context.

Adds two example files exercising a network-delayed weather tool with
each service. The plain realtime-aws-nova-sonic.py example is also
reverted to a synchronous tool call now that the async variant lives in
its own file.

Similar fixes for other realtime services are forthcoming.
2026-05-08 09:33:06 -04:00
Aleix Conchillo Flaqué
b78cecf7b2 Rename UserTurnCompletedFrame to UserTurnInferenceCompletedFrame
The old name overlapped semantically with `UserStoppedSpeakingFrame`:
both could be read as "the user's turn is done." They're at different
layers — `UserStoppedSpeakingFrame` is the acoustic stop signal,
while this frame is the post-judgment "inference about the turn is
now complete (turn is semantically final)" signal emitted by the LLM
mixin (on ✓), an end-of-turn classifier, or a custom producer.

The new name pairs naturally with the existing
`on_user_turn_inference_triggered` event vocabulary and removes the
ambiguity with `UserStoppedSpeakingFrame`.
2026-05-07 17:47:41 -07:00
Aleix Conchillo Flaqué
952dddca8b Replace llm_completion_user_turn_stop_strategies() with FilterIncompleteUserTurnStrategies
Wrap the detector chain with `deferred(...)` and append the LLM
completion gate via a `UserTurnStrategies` specialization rather than
a free-standing helper, mirroring the existing
`ExternalUserTurnStrategies` pattern. The class lives next to other
strategy containers in `pipecat.turns.user_turn_strategies`, so users
discover it where they're already configuring `user_turn_strategies`.

The deprecated `filter_incomplete_user_turns` flag now rewires
through `FilterIncompleteUserTurnStrategies` under the hood, keeping
the migration path identical to before. `deferred(...)` stays public
as the explicit escape hatch for non-default compositions.
2026-05-07 17:47:39 -07:00
Aleix Conchillo Flaqué
e3e90d38aa Preserve full user transcript across multiple inferences in one turn
When a stop-strategy chain splits inference-triggered from
finalization (e.g. `LLMTurnCompletionUserTurnStopStrategy` gating a
deferred detector), more than one inference can fire inside a single
user turn — each adds the new transcription segment to the context.
Previously each inference overwrote `_pending_user_turn_aggregation`,
so the eventual `on_user_turn_stopped` event surfaced only the
segment from the last inference, dropping anything the user said
before it.

Concatenate each segment into `_full_user_turn_aggregation` instead
of overwriting, and combine that running buffer with any post-final-
inference segment when emitting the public event.
2026-05-07 17:46:15 -07:00
Aleix Conchillo Flaqué
d1c8162b0c Route turn-completion markers through LLMMarkerFrame
Add an `LLMMarkerFrame(DataFrame)` for sideband LLM markers that need
to be persisted to context but should not flow through the standard
text path (TTS, transcript). The frame carries an
`append_to_context_immediately` flag so the assistant aggregator can
either commit the marker as a stand-alone message (○ / ◐) or merge it
with the upcoming aggregation as a prefix on the response (✓).

`UserTurnCompletionLLMServiceMixin` now emits `LLMMarkerFrame` instead
of pushing the marker as `LLMTextFrame(skip_tts=True)`, which fixes
the case where an incomplete-turn marker (○ / ◐) was aggregated by
the assistant aggregator but never committed to the context because
the assistant turn lifecycle didn't run to completion (no spoken
response, no `LLMFullResponseEndFrame`-driven `push_aggregation`).

The frame is intentionally generic so other components — STT services
with built-in turn signals, end-of-turn classifiers, custom
annotations — can use the same mechanism to inject sideband signals
into the assistant context.
2026-05-07 17:46:15 -07:00
Aleix Conchillo Flaqué
480eca42f5 Split user-turn-stop into inference-triggered and finalized events
Fixes a real bug: with `filter_incomplete_user_turns` enabled, the
smart-turn detector's tentative stop was firing `on_user_turn_stopped`
before the LLM had a chance to veto it. Observers, transcript
appenders and UI indicators received an early — and sometimes
duplicated — signal.

Decomposes the single stop concern into two events:
- `on_user_turn_inference_triggered` fires when a stop strategy has
  enough signal to start LLM inference. The aggregator pushes the
  context here, kicking off the LLM call.
- `on_user_turn_stopped` fires only when the user turn is semantically
  final. Built-in strategies fire both events at the same call site,
  preserving today's behavior for the common case.

Adds `LLMTurnCompletionUserTurnStopStrategy`, which gates
finalization on a `UserTurnCompletedFrame` (a fieldless system frame
emitted by any component judging turn completeness — currently the
`UserTurnCompletionLLMServiceMixin` on `✓`).

Adds `deferred(strategy)` / `DeferredUserTurnStopStrategy`, a thin
wrapper that forwards an inner strategy's events except
`on_user_turn_stopped`. Use this to install a stop strategy as an
inference trigger only, leaving finalization to a peer (e.g. the LLM
completion strategy).

Adds `llm_completion_user_turn_stop_strategies()` for the common
case:

    UserTurnStrategies(
        stop=llm_completion_user_turn_stop_strategies(),
    )

Deprecates `LLMUserAggregatorParams.filter_incomplete_user_turns`.
The aggregator emits a `DeprecationWarning`, wraps existing stop
strategies with `deferred(...)`, and appends
`LLMTurnCompletionUserTurnStopStrategy` automatically.
2026-05-07 17:46:09 -07:00
Mark Backman
1073510574 Merge pull request #4407 from pipecat-ai/mb/ui-agent-wire-format
feat(rtvi): add UI Agent Protocol as first-class RTVI message types
2026-05-07 20:03:41 -04:00
Mark Backman
7a2cec2e45 Merge pull request #4426 from marcelodiaz558/feature/elevenlabs_stt_keyterms
Add ElevenLabs STT keyterms support
2026-05-07 18:44:09 -04:00
Marcelo Díaz
edfcd6948b Add ElevenLabs STT keyterms support 2026-05-07 21:00:26 +00:00
kompfner
991ee9e0e6 Merge pull request #4404 from pipecat-ai/pk/mitigate-calls-to-missing-tools
Mitigate tool-call-related hallucination
2026-05-07 15:05:13 -04:00
filipi87
cf22dac171 Refactoring TTSService to preserve uninterruptible frames. 2026-05-06 16:26:45 -03:00
Filipi da Silva Fuchter
90f0f7cd27 Merge pull request #4431 from pipecat-ai/filipi/tts_deadlock
Fixing TTSService deadlock.
2026-05-06 14:52:04 -03:00
filipi87
a23baf9de6 Fixing TTSService deadlock. 2026-05-06 13:32:26 -03:00
Mark Backman
d18fe7c39c feat(rtvi): type UI accessibility snapshots 2026-05-06 11:29:19 -04:00
Mark Backman
41124dc494 refactor(rtvi): clarify UI message names 2026-05-06 11:08:25 -04:00
Mark Backman
8c3521f2e4 chore(audio): deprecate ResampyResampler in favor of SOXR resamplers
Emits a DeprecationWarning on instantiation. ResampyResampler will be
removed in Pipecat 2.0 along with the default resampy and numba
dependencies.
2026-05-06 09:40:13 -04:00
Mark Backman
61a81ed87b fix(elevenlabs): use alignment by default, normalizedAlignment only with pronunciation dicts
PR #4344 unconditionally switched to normalizedAlignment to fix garbled
words with pronunciation dictionaries (#4316). But normalizedAlignment
returns the post-normalized form of what was spoken - including
romanization of non-Latin scripts (Chinese rendered as pinyin), which
ends up in the LLM context and degrades subsequent turns.

Gate the switch on pronunciation_dictionary_locators being configured.
Adds a _select_alignment helper with preferred-with-fallback (both
fields are nullable per the API schema), used by both the WebSocket
and HTTP services. Tests cover dictionary mode, default mode, fallback
when preferred is missing or null, and HTTP field-name variants.
2026-05-05 14:49:41 -04:00
Paul Kompfner
e06e0c0282 Mitigate tool-call-related hallucination
When tools change mid-conversation, LLMs can produce a few different
flavors of tool-call-related hallucination: calling tools that have
been removed, avoiding tools that have been re-added, or hallucinating
output (made-up answers or tool-call-shaped non-tool-calls) when tools
are unavailable.

This change introduces an opt-in ``add_tool_change_messages`` flag on
the LLM aggregators (preferred entry point: ``LLMContextAggregatorPair(
..., add_tool_change_messages=True)``) that appends a developer-role
message to the context whenever ``LLMSetToolsFrame`` changes the set
of advertised standard tools. Helps the LLM stay coherent across tool
changes by spelling out exactly what just became available or
unavailable. Both aggregators participate; whichever handles the
frame first wins, and the other (if any) sees an empty diff against
the shared context and stays silent — order-independent regardless of
whether the frame flows downstream or upstream.

Also tightens the existing missing-handler path (introduced in #4301):

- Reworded the terminal tool result to a neutral "The function
  ``X`` is not currently available." (overridable via
  ``LLMService.MISSING_FUNCTION_CALL_MESSAGE_TEMPLATE``). Previously
  read "Error: function 'X' is not registered."
- Logs at the call site now distinguish developer error (tool
  advertised but no handler registered → ``logger.error``) from
  hallucination (tool not advertised → ``logger.warning``).

Includes a manual validation harness
(``examples/features/features-add-tool-change-messages.py``) that
exercises the new ``add_tool_change_messages`` mitigation by flipping
tool availability on a turn counter so its effect can be observed
end-to-end with the flag on vs. off.
2026-05-05 13:02:43 -04:00
Mark Backman
fa31a2fd63 Merge pull request #4416 from pipecat-ai/mb/pr-4333-aws-credentials-review
feat(aws): add shared credential resolver with boto3 chain fallback
2026-05-04 21:48:33 -04:00
Mark Backman
83190d38e9 Merge pull request #4414 from pipecat-ai/mb/fix-ttsspeakframe-assistant-turn-stopped 2026-05-04 18:12:33 -04:00
Mark Backman
7519c26ac5 Merge pull request #4417 from pipecat-ai/mb/resolve-runner-filepath 2026-05-04 18:09:34 -04:00
Mark Backman
89f10dd9a1 test: drop webrtc-dependent test, remove webrtc extra from CI 2026-05-04 16:42:05 -04:00
Mark Backman
e780f759d0 fix: validate download path containment in runner
Resolve and contain the user-supplied filename before serving it from
the runner's /files endpoint. Also raise a 404 (instead of returning
None) when the downloads folder is unset, and use the resolved
basename for Content-Disposition.
2026-05-04 16:20:27 -04:00
Daniel Wirjo
35153de28e feat(aws): add shared credential resolver with boto3 chain fallback
AWS Transcribe STT previously only supported credentials via explicit
parameters or environment variables. Services running with IAM roles
(EKS pod roles, IRSA, ECS task roles, EC2 instance profiles) or SSO
couldn't use Transcribe without exporting static credentials.

Changes:
- Add resolve_credentials() to utils.py providing a standard fallback
  chain: explicit params → environment variables → boto3 credential
  provider chain (instance profiles, IRSA, pod roles, SSO, etc.)
- Add AWSCredentials dataclass for type-safe credential passing
- Update AWSTranscribeSTTService to use resolve_credentials() instead
  of manual os.getenv() calls
- The boto3 fallback is only attempted when both access key and secret
  key are unresolved, avoiding replacement of explicitly provided creds
- boto3 is imported lazily inside the function to avoid hard dependency
  for services that don't need the fallback chain
- Add 7 unit tests covering the credential resolution chain

The Bedrock LLM and Polly TTS services already support the full
credential chain via aioboto3.Session() and are not modified.

Related to #4197
2026-05-04 15:40:06 -04:00
Mark Backman
90e6b51acd Fix ElevenLabs alignment chunk spacing 2026-05-04 15:15:37 -04:00
Mark Backman
f1a3ee97de fix: surface TTSSpeakFrame greetings in on_assistant_turn_stopped
Two issues were causing TTSSpeakFrame(append_to_context=True) greetings to
silently lose their trailing words and never fire on_assistant_turn_stopped:

- LLMAssistantPushAggregationFrame was emitted without a PTS, so the
  transport routed it through the audio (sync) queue while word-level
  TTSTextFrames travel through the clock queue. The aggregation could reach
  the assistant aggregator before the final words, leaving them orphaned
  in the buffer. Stamp the frame with `_word_last_pts + 1` when there are
  word timestamps so it can't overtake them.

- The aggregator's LLMAssistantPushAggregationFrame handler called
  push_aggregation() directly, bypassing _trigger_assistant_turn_stopped.
  For TTS-only flows there is no LLMFullResponseStartFrame, so the turn
  start timestamp was never set and on_assistant_turn_stopped never fired.
  Open a turn (if needed) and trigger stopped from the handler.

Fixes #4264.
2026-05-04 10:41:22 -04:00
Mark Backman
43abca0b06 feat(rtvi): add UI Agent Protocol as first-class RTVI message types
The UI Agent Protocol lets server-side AI agents observe and drive
a GUI app on the client side through structured RTVI messages.
Five new top-level RTVI types in kebab-case, in line with the rest
of the protocol:

  ui-event         client → server  (named event with payload)
  ui-command       server → client  (named command with payload)
  ui-snapshot      client → server  (accessibility tree of the page)
  ui-cancel-task   client → server  (cancel an in-flight task group)
  ui-task          server → client  (task lifecycle envelope)

Each ships paired ``*Data`` / ``*Message`` pydantic models in
``rtvi.models``, following the existing RTVI envelope convention
(``BotReady`` / ``BotReadyData``, ``Error`` / ``ErrorData``, etc.).
Built-in command payload models (``Toast``, ``Navigate``,
``ScrollTo``, ``Highlight``, ``Focus``, ``Click``, ``SetInputValue``,
``SelectText``) ship alongside; matching default React handlers
live in ``@pipecat-ai/client-react``.

Bumps the RTVI ``PROTOCOL_VERSION`` from ``1.2.0`` to ``1.3.0``.
Purely additive: only new top-level message types are introduced;
no existing wire shapes are changed. The major-version
compatibility check on ``client-ready`` still passes for older
1.x clients, so old clients continue to connect without warning;
they simply will not exercise the new types.

The ``RTVIProcessor`` registers a new ``on_ui_message`` event
handler that fires for inbound ``ui-event`` / ``ui-snapshot`` /
``ui-cancel-task`` with the parsed Message envelope, mirroring how
``on_client_message`` works for ``client-message``.

Five new pipeline frames let pipeline observers and processors see
UI traffic the same way they see other RTVI messages, mirroring
the frame-and-event pattern used by ``client-message``:

  RTVIUICommandFrame(command_name, payload)
    Pushed by downstream code (e.g. ``pipecat-ai-subagents``'s
    bridge) to send a UI command to the client. Wrapped by the
    observer into a ``UICommandMessage`` envelope.

  RTVIUITaskFrame(data: UITaskData)
    Same shape but for ``ui-task``; wrapped into ``UITaskMessage``.
    ``UITaskData`` is a discriminated union of the four lifecycle
    kinds (group_started / task_update / task_completed /
    group_completed).

  RTVIUIEventFrame(msg_id, event_name, payload)
  RTVIUISnapshotFrame(msg_id, tree)
  RTVIUICancelTaskFrame(msg_id, task_id, reason)
    Pushed by ``RTVIProcessor._handle_message`` whenever the
    matching inbound message arrives, alongside firing
    ``on_ui_message``. Pipeline observers and processors can match
    on the frame; subscribers like the subagents bridge keep using
    the event handler.

The data layer is the canonical authority for the wire format:
higher-level frameworks like ``pipecat-ai-subagents`` build the
agent abstractions on top, and single-LLM Pipecat apps can target
the same wire format directly via custom tools that emit these
typed messages.
2026-05-02 12:09:01 -04:00
Paul Kompfner
c4f5f1ebbb test, refactor: follow-ups to LLMService generic refactor
Two follow-ups now that LLMService is generic over its adapter:

- Add an explicit backward-compat test verifying that an LLMService
  subclass with no generic parameter (the third-party-provider
  pattern) instantiates and returns a usable adapter. The existing
  MockLLMService (declared without brackets) already exercised this
  implicitly, but it's worth a named assertion.

- Drop the now-redundant `params: SomeLLMInvocationParams = ...`
  variable annotations on `adapter.get_llm_invocation_params()`
  results. Since `get_llm_adapter()` now returns the precise adapter
  type, and `BaseLLMAdapter` is generic in its invocation-params
  type, the call already infers the right TypedDict.
2026-05-01 09:36:14 -04:00
kompfner
6d66bbceeb Merge pull request #4395 from pipecat-ai/pk/app-resources-api-updates
Broaden tool_resources to app_resources
2026-04-30 21:19:05 -04:00
Paul Kompfner
1b5c4cfa2a feat: broaden tool_resources to app_resources
Broaden `tool_resources` to `app_resources` for easy access not just in
tool handlers but in other places like custom `FrameProcessor`s.

Involves 3 changes:

- A rename: `tool_resources` -> `app_resources`
- A new property on `PipelineTask`: `app_resources`
- A new property on `FrameProcessor`: `pipeline_task`

Usage in tool handler:

    async def get_weather(params: FunctionCallParams):
        resources = cast(MyAppResources, params.app_resources)
        ...

Usage in custom `FrameProcessor`:

    class MyProcessor(FrameProcessor):
        async def process_frame(self, frame, direction):
            await super().process_frame(frame, direction)
            if self.pipeline_task is not None:
                resources = cast(MyAppResources, self.pipeline_task.app_resources)
                ...

The previous `tool_resources` aliases (on `PipelineTask`,
`FunctionCallParams`, and `FrameProcessorSetup`) keep working but are
deprecated as of 1.2.0 and emit `DeprecationWarning`s.
2026-04-30 16:16:17 -04:00
Mark Backman
351105a975 test(krisp): scope importlib.metadata.version mock to imports only
The four krisp test files installed a process-wide mock of
importlib.metadata.version with `patch(...).start()` at module level and
never called .stop(). Once any of these files was collected, the mock
leaked across the rest of the test session, returning '0.0.0-dev' for
every version check. This corrupted unrelated tests that triggered
transformers' import-time dependency check (e.g. lazy imports of
LocalSmartTurnAnalyzerV3) — transformers saw tqdm=='0.0.0-dev' and
refused to load.

Wrap the pipecat imports in `with patch(...)` so the mock is active
during import (where pipecat's krisp version check needs it) and torn
down before any tests run.
2026-04-30 14:16:54 -04:00
kompfner
ce1311f6ba Merge pull request #4301 from bnovik0v/fix-4300-missing-tool-lifecycle
Fail missing tool calls cleanly
2026-04-27 11:54:43 -04:00
borislav
8869e25142 fix: compare bound method by equality, not identity
Bound methods are created fresh on each attribute access, so
'self._missing_function_call_handler is self._missing_function_call_handler'
is always False. Using 'is' meant the placeholder branch never fired and
both warnings logged when a function was missing at queue time.

Switch to == so equality compares the underlying function and instance.
Strengthen the missing-at-queue-time test to assert the second warning
does not fire.
2026-04-27 17:34:31 +02:00
borislav
822392b0d4 fix: re-resolve registry item at execution time
Address review feedback: a function may be unregistered between when
run_function_calls queues it and when _run_function_call executes it.
Restore the live lookup, falling back to the missing-function handler
when the entry is gone, so the call still terminates with a normal
tool result. Factor the missing-handler item construction into a
helper since it's now built in two places.
2026-04-27 17:22:30 +02:00
kompfner
bc29bdb95e Merge pull request #4371 from Stoic-Angel/feat-global-context
Add a global context for tool calls: tool_resources
2026-04-27 10:55:03 -04:00