Merge branch 'main' into filipi/deepgram

# Conflicts:
#	src/pipecat/services/deepgram/stt_sagemaker.py
This commit is contained in:
filipi87
2026-03-03 10:54:30 -03:00
124 changed files with 4793 additions and 1449 deletions

View File

@@ -59,6 +59,7 @@ jobs:
DOCS_SYNC_TOKEN: ${{ secrets.DOCS_SYNC_TOKEN }}
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
github_token: ${{ secrets.GITHUB_TOKEN }}
prompt: |
You are updating documentation for the pipecat-ai/docs repository based on
changes merged in PR #${{ steps.pr.outputs.number }} of pipecat-ai/pipecat.

View File

@@ -7,6 +7,389 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
<!-- towncrier release notes start -->
## [0.0.104] - 2026-03-02
### Added
- Added `TextAggregationMetricsData` metric measuring the time from the first
LLM token to the first complete sentence, representing the latency cost of
sentence aggregation in the TTS pipeline.
(PR [#3696](https://github.com/pipecat-ai/pipecat/pull/3696))
- Added support for using strongly-typed objects instead of dicts for updating
service settings at runtime.
Instead of, say:
```python
await task.queue_frame(
STTUpdateSettingsFrame(settings={"language": Language.ES})
)
```
you'd do:
```python
await task.queue_frame(
STTUpdateSettingsFrame(delta=DeepgramSTTSettings(language=Language.ES))
)
```
Each service now vends strongly-typed classes like `DeepgramSTTSettings`
representing the service's runtime-updatable settings.
(PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))
- Added support for specifying private endpoints for Azure Speech-to-Text,
enabling use in private networks behind firewalls.
(PR [#3764](https://github.com/pipecat-ai/pipecat/pull/3764))
- Added `LemonSliceTransport` and `LemonSliceApi` to support adding real-time
LemonSlice Avatars to any Daily room.
(PR [#3791](https://github.com/pipecat-ai/pipecat/pull/3791))
- Added `output_medium` parameter to `AgentInputParams` and
`OneShotInputParams` in Ultravox service to control initial output medium
(text or voice) at call creation time.
(PR [#3806](https://github.com/pipecat-ai/pipecat/pull/3806))
- Added `TurnMetricsData` as a generic metrics class for turn detection, with
e2e processing time measurement. `KrispVivaTurn` now emits `TurnMetricsData`
with `e2e_processing_time_ms` tracking the interval from VAD
speech-to-silence transition to turn completion.
(PR [#3809](https://github.com/pipecat-ai/pipecat/pull/3809))
- Added `on_audio_context_interrupted()` and `on_audio_context_completed()`
callbacks to `AudioContextTTSService`. Subclasses can override these to
perform provider-specific cleanup instead of overriding
`_handle_interruption()`.
(PR [#3814](https://github.com/pipecat-ai/pipecat/pull/3814))
- Added `on_summary_applied` event to `LLMContextSummarizer` for observability,
providing message counts before and after context summarization.
(PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))
- Added `summary_message_template` to `LLMContextSummarizationConfig` for
customizing how summaries are formatted when injected into context (e.g.,
wrapping in XML tags).
(PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))
- Added `summarization_timeout` to `LLMContextSummarizationConfig` (default
120s) to prevent hung LLM calls from permanently blocking future
summarizations.
(PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))
- Added optional `llm` field to `LLMContextSummarizationConfig` for routing
summarization to a dedicated LLM service (e.g., a cheaper/faster model)
instead of the pipeline's primary model.
(PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))
- Add AssemblyAI u3-rt-pro model support with built-in turn detection mode
(PR [#3856](https://github.com/pipecat-ai/pipecat/pull/3856))
- Added `LLMSummarizeContextFrame` to trigger on-demand context summarization
from anywhere in the pipeline (e.g. a function call tool). Accepts an
optional `config: LLMContextSummaryConfig` to override summary generation
settings per request.
(PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))
- Added `LLMContextSummaryConfig` (summary generation params:
`target_context_tokens`, `min_messages_after_summary`,
`summarization_prompt`) and `LLMAutoContextSummarizationConfig` (auto-trigger
thresholds: `max_context_tokens`, `max_unsummarized_messages`, plus a nested
`summary_config`). These replace the monolithic
`LLMContextSummarizationConfig`.
(PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))
- Added support for the `speed_alpha` parameter to the `arcana` model in
`RimeTTSService`.
(PR [#3873](https://github.com/pipecat-ai/pipecat/pull/3873))
- Added `ClientConnectedFrame`, a new `SystemFrame` pushed by all transports
(Daily, LiveKit, FastAPI WebSocket, WebSocket Server, SmallWebRTC, HeyGen,
Tavus) when a client connects. Enables observers to track transport readiness
timing.
(PR [#3881](https://github.com/pipecat-ai/pipecat/pull/3881))
- Added `StartupTimingObserver` for measuring how long each processor's
`start()` method takes during pipeline startup. Also measures transport
readiness — the time from `StartFrame` to first client connection — via the
`on_transport_timing_report` event.
(PR [#3881](https://github.com/pipecat-ai/pipecat/pull/3881))
- Added `BotConnectedFrame` for SFU transports and `on_transport_timing_report`
event to `StartupTimingObserver` with bot and client connection timing.
(PR [#3881](https://github.com/pipecat-ai/pipecat/pull/3881))
- Added optional `direction` parameter to `PipelineTask.queue_frame()` and
`PipelineTask.queue_frames()`, allowing frames to be pushed upstream from the
end of the pipeline.
(PR [#3883](https://github.com/pipecat-ai/pipecat/pull/3883))
- Added `on_latency_breakdown` event to `UserBotLatencyObserver` providing
per-service TTFB, text aggregation, user turn duration, and function call
latency metrics for each user-to-bot response cycle.
(PR [#3885](https://github.com/pipecat-ai/pipecat/pull/3885))
- Added `on_first_bot_speech_latency` event to `UserBotLatencyObserver`
measuring the time from client connection to first bot speech. An
`on_latency_breakdown` is also emitted for this first speech event.
(PR [#3885](https://github.com/pipecat-ai/pipecat/pull/3885))
- Added `broadcast_interruption()` to `FrameProcessor`. This method pushes an
`InterruptionFrame` both upstream and downstream directly from the calling
processor, avoiding the round-trip through the pipeline task that
`push_interruption_task_frame_and_wait()` required.
(PR [#3896](https://github.com/pipecat-ai/pipecat/pull/3896))
### Changed
- Added `text_aggregation_mode` parameter to `TTSService` and all TTS
subclasses with a new `TextAggregationMode` enum (`SENTENCE`, `TOKEN`). All
text now flows through text aggregators regardless of mode, enabling pattern
detection and tag handling in TOKEN mode.
(PR [#3696](https://github.com/pipecat-ai/pipecat/pull/3696))
- ⚠️ Refactored runtime-updatable service settings to use strongly-typed
classes (`TTSSettings`, `STTSettings`, `LLMSettings`, and service-specific
subclasses) instead of plain dicts. Each service's `_settings` now holds
these strongly-typed objects. For service maintainers, see changes in
COMMUNITY_INTEGRATIONS.md.
(PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))
- Word timestamp support has been moved from `WordTTSService` into `TTSService`
via a new `supports_word_timestamps` parameter. Services that previously
extended `WordTTSService`, `AudioContextWordTTSService`, or
`WebsocketWordTTSService` now pass `supports_word_timestamps=True` to their
parent `__init__` instead.
(PR [#3786](https://github.com/pipecat-ai/pipecat/pull/3786))
- Improved Ultravox TTFB measurement accuracy by using VAD speech end time
instead of `UserStoppedSpeakingFrame` timing.
(PR [#3806](https://github.com/pipecat-ai/pipecat/pull/3806))
- Aligned `UltravoxRealtimeLLMService` frame handling with OpenAI/Gemini
realtime services: added `InterruptionFrame` handling with metrics cleanup,
processing metrics at response boundaries, and improved agent transcript
handling for both voice and text output modalities.
(PR [#3806](https://github.com/pipecat-ai/pipecat/pull/3806))
- Updated `OpenAIRealtimeLLMService` default model to `gpt-realtime-1.5`.
(PR [#3807](https://github.com/pipecat-ai/pipecat/pull/3807))
- Added `api_key` parameter to `KrispVivaSDKManager`, `KrispVivaTurn`, and
`KrispVivaFilter` for Krisp SDK v1.6.1+ licensing. Falls back to
`KRISP_VIVA_API_KEY` environment variable.
(PR [#3809](https://github.com/pipecat-ai/pipecat/pull/3809))
- Bumped `nltk` minimum version from 3.9.1 to 3.9.3 to resolve a security
vulnerability.
(PR [#3811](https://github.com/pipecat-ai/pipecat/pull/3811))
- `ServiceSettingsUpdateFrame`s are now `UninterruptibleFrame`s. Generally
speaking, you don't want a user interruption to prevent a service setting
change from going into effect. Note that you usually don't use
`ServiceSettingsUpdateFrame` directly, you use one of its subclasses:
- `LLMUpdateSettingsFrame`
- `TTSUpdateSettingsFrame`
- `STTUpdateSettingsFrame`
(PR [#3819](https://github.com/pipecat-ai/pipecat/pull/3819))
- Updated context summarization to use `user` role instead of `assistant` for
summary messages.
(PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))
- Rename `AssemblyAISTTService` parameter
`min_end_of_turn_silence_when_confident` parameter to `min_turn_silence` (old
name still supported with deprecation warning)
(PR [#3856](https://github.com/pipecat-ai/pipecat/pull/3856))
- ⚠️ Renamed `LLMAssistantAggregatorParams` fields:
`enable_context_summarization` → `enable_auto_context_summarization` and
`context_summarization_config` → `auto_context_summarization_config` (now
accepts `LLMAutoContextSummarizationConfig`). The old names still work with a
`DeprecationWarning` for one release cycle.
(PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))
- `ElevenLabsRealtimeSTTService` now sets `TranscriptionFrame.finalized` to
`True` when using `CommitStrategy.MANUAL`.
(PR [#3865](https://github.com/pipecat-ai/pipecat/pull/3865))
- Updated numba version pin from == to >=0.61.2
(PR [#3868](https://github.com/pipecat-ai/pipecat/pull/3868))
- Updated tracing code to use `ServiceSettings` dataclass API
(`given_fields()`, attribute access) instead of dict-style access
(`.items()`, `in`, subscript).
(PR [#3879](https://github.com/pipecat-ai/pipecat/pull/3879))
- ⚠️ Removed `event` field and `complete()` method from `InterruptionFrame`.
Removed `event` field from `InterruptionTaskFrame`. These are no longer
needed since `broadcast_interruption()` does not require a round-trip
completion signal.
(PR [#3896](https://github.com/pipecat-ai/pipecat/pull/3896))
- Moved `pipecat.services.deepgram.stt_sagemaker` and
`pipecat.services.deepgram.tts_sagemaker` to
`pipecat.services.deepgram.sagemaker.stt` and
`pipecat.services.deepgram.sagemaker.tts`. The old import paths still work
but emit a `DeprecationWarning`.
(PR [#3902](https://github.com/pipecat-ai/pipecat/pull/3902))
### Deprecated
- ⚠️ Deprecated `aggregate_sentences` parameter on `TTSService` and all TTS
subclasses. Use `text_aggregation_mode=TextAggregationMode.SENTENCE` or
`text_aggregation_mode=TextAggregationMode.TOKEN` instead.
(PR [#3696](https://github.com/pipecat-ai/pipecat/pull/3696))
- Deprecated `set_model()`, `set_voice()`, and `set_language()` on AI services
in favor of runtime updates via `TTSUpdateSettingsFrame`,
`STTUpdateSettingsFrame`, and `LLMUpdateSettingsFrame`.
⚠️ Note, too, a subtle behavior change in these deprecated methods. Whereas
previously only `set_language()` caused the service to actually react to the
update (e.g. by reconnecting to a remote service so it an pick up the
change), now all these methods do. This change was made as part of a refactor
making them all work the same way under the hood.
(PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))
- Dict-based `*UpdateSettingsFrame(settings={...})` is deprecated in favor of
passing typed settings delta objects with
`*UpdateSettingsFrame(delta={...})`.
(PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))
- Deprecated `WordTTSService`, `WebsocketWordTTSService`,
`AudioContextWordTTSService`, and `InterruptibleWordTTSService`. Use their
non-word counterparts with `supports_word_timestamps=True` instead:
- `WordTTSService` → `TTSService(supports_word_timestamps=True)`
- `WebsocketWordTTSService` →
`WebsocketTTSService(supports_word_timestamps=True)`
- `AudioContextWordTTSService` →
`AudioContextTTSService(supports_word_timestamps=True)`
- `InterruptibleWordTTSService` →
`InterruptibleTTSService(supports_word_timestamps=True)`
(PR [#3786](https://github.com/pipecat-ai/pipecat/pull/3786))
- Deprecated `SmartTurnMetricsData` in favor of `TurnMetricsData`.
`BaseSmartTurn` now emits `TurnMetricsData` directly.
(PR [#3809](https://github.com/pipecat-ai/pipecat/pull/3809))
- Deprecated `LLMContextSummarizationConfig`. Use
`LLMAutoContextSummarizationConfig` with a nested `LLMContextSummaryConfig`
instead. The old class emits a `DeprecationWarning`.
(PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))
- Deprecated `push_interruption_task_frame_and_wait()` in `FrameProcessor`. Use
`broadcast_interruption()` instead. The old method now delegates to
`broadcast_interruption()` and logs a deprecation warning.
(PR [#3896](https://github.com/pipecat-ai/pipecat/pull/3896))
### Removed
- Removed `local-smart-turn-v3` optional extra from `pyproject.toml`. The
`transformers` and `onnxruntime` packages are now always installed as core
dependencies since they are required by the default turn stop strategy,
`TurnAnalyzerUserTurnStopStrategy` which uses `LocalSmartTurnAnalyzerV3`.
(PR [#3803](https://github.com/pipecat-ai/pipecat/pull/3803))
- ⚠️ Removed `PlayHTTTSService` and `PlayHTHttpTTSService`. PlayHT has been
shut down and is no longer available.
(PR [#3838](https://github.com/pipecat-ai/pipecat/pull/3838))
### Fixed
- Added `LLMSpecificMessage` handling in `LLMContextSummarizationUtil` to skip
provider-specific messages during context summarization.
(PR [#3794](https://github.com/pipecat-ai/pipecat/pull/3794))
- Treated `response_cancel_not_active` as a non-fatal error in realtime
services (`OpenAIRealtimeLLMService`, `GrokRealtimeLLMService`,
`OpenAIRealtimeBetaLLMService`) to prevent WebSocket disconnection when
cancelling an inactive response.
(PR [#3795](https://github.com/pipecat-ai/pipecat/pull/3795))
- Fixed Poetry compatibility by inlining `local-smart-turn-v3` dependencies
(`transformers`, `onnxruntime`) into core dependencies instead of using a
self-referential extra.
(PR [#3803](https://github.com/pipecat-ai/pipecat/pull/3803))
- Fixed `SentryMetrics` method signatures to match updated
`FrameProcessorMetrics` base class, resolving `TypeError` when using
`start_time`/`end_time` keyword arguments.
(PR [#3808](https://github.com/pipecat-ai/pipecat/pull/3808))
- Fixed STT TTFB metrics not being reported for `SonioxSTTService` and
`AWSTranscribeSTTService` due to missing `can_generate_metrics()` override.
(PR [#3813](https://github.com/pipecat-ai/pipecat/pull/3813))
- Fixed an issue where `AudioContextTTSService`-based providers (AsyncAI,
ElevenLabs, Inworld, Rime) did not close or clean up their server-side audio
contexts after normal speech completion, only on interruption.
(PR [#3814](https://github.com/pipecat-ai/pipecat/pull/3814))
- Fixed STT TTFB metrics measuring timeout expiry time instead of actual
transcript arrival time.
(PR [#3822](https://github.com/pipecat-ai/pipecat/pull/3822))
- Fixed `InterimTranscriptionFrame` and `TranslationFrame` being
unintentionally pushed downstream in `LLMUserAggregator`. They are now
consumed like `TranscriptionFrame`.
(PR [#3825](https://github.com/pipecat-ai/pipecat/pull/3825))
- Fixed misleading "Empty audio frame received for STT service" warnings when
using audio filters (e.g. `RNNoiseFilter`, `KrispVivaFilter`, `AICFilter`)
that buffer audio internally.
(PR [#3828](https://github.com/pipecat-ai/pipecat/pull/3828))
- Fixed issues with `RimeNonJsonTTSService` where trailing punctuation is
sometimes vocalized
(PR [#3837](https://github.com/pipecat-ai/pipecat/pull/3837))
- Fixed `TTSSpeakFrame` not committing spoken text to the conversation context
when used outside of an LLM response (e.g., bot greetings or injected
speech).
(PR [#3845](https://github.com/pipecat-ai/pipecat/pull/3845))
- Removed verbose per-chunk audio logging from `GenesysAudioHookSerializer`
that flooded production logs.
(PR [#3850](https://github.com/pipecat-ai/pipecat/pull/3850))
- Add beta feature warning when using custom prompts with AssemblyAI
(PR [#3856](https://github.com/pipecat-ai/pipecat/pull/3856))
- Fixed `LocalSmartTurnAnalyzerV3` producing incorrect end-of-turn predictions
at non-16kHz sample rates (e.g. 8kHz Twilio telephony) by adding automatic
resampling to 16kHz before Whisper feature extraction.
(PR [#3857](https://github.com/pipecat-ai/pipecat/pull/3857))
- Fixed `PipelineTask` double-inserting `RTVIProcessor` into the frame chain
when the user provides both an `RTVIProcessor` in the pipeline and a custom
`RTVIObserver` subclass in observers.
(PR [#3867](https://github.com/pipecat-ai/pipecat/pull/3867))
- Fixed turn completion instructions being lost when `LLMMessagesUpdateFrame`
replaces the LLM context. When `filter_incomplete_user_turns` is enabled, the
turn completion system message is now re-injected after context replacement.
(PR [#3888](https://github.com/pipecat-ai/pipecat/pull/3888))
- Fixed Azure TTS and STT services silently swallowing cancellation errors
(invalid API key, network failures, rate limiting) instead of propagating
them as `ErrorFrame`s to the pipeline.
(PR [#3893](https://github.com/pipecat-ai/pipecat/pull/3893))
### Performance
- Switched `GradiumTTSService` from `InterruptibleWordTTSService` to
`AudioContextWordTTSService`, eliminating websocket disconnect/reconnect on
every interruption by using `client_req_id`-based multiplexing.
(PR [#3759](https://github.com/pipecat-ai/pipecat/pull/3759))
### Other
- Standardized Sarvam STT/TTS User-Agent header handling to consistently send
Pipecat SDK identity in websocket requests.
(PR [#3886](https://github.com/pipecat-ai/pipecat/pull/3886))
## [0.0.103] - 2026-02-20
### Added

View File

@@ -89,7 +89,7 @@ Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.yout
| Speech-to-Speech | [AWS Nova Sonic](https://docs.pipecat.ai/server/services/s2s/aws), [Gemini Multimodal Live](https://docs.pipecat.ai/server/services/s2s/gemini), [Grok Voice Agent](https://docs.pipecat.ai/server/services/s2s/grok), [OpenAI Realtime](https://docs.pipecat.ai/server/services/s2s/openai), [Ultravox](https://docs.pipecat.ai/server/services/s2s/ultravox), |
| Transport | [Daily (WebRTC)](https://docs.pipecat.ai/server/services/transport/daily), [FastAPI Websocket](https://docs.pipecat.ai/server/services/transport/fastapi-websocket), [SmallWebRTCTransport](https://docs.pipecat.ai/server/services/transport/small-webrtc), [WebSocket Server](https://docs.pipecat.ai/server/services/transport/websocket-server), Local |
| Serializers | [Exotel](https://docs.pipecat.ai/server/utilities/serializers/exotel), [Plivo](https://docs.pipecat.ai/server/utilities/serializers/plivo), [Twilio](https://docs.pipecat.ai/server/utilities/serializers/twilio), [Telnyx](https://docs.pipecat.ai/server/utilities/serializers/telnyx), [Vonage](https://docs.pipecat.ai/server/utilities/serializers/vonage) |
| Video | [HeyGen](https://docs.pipecat.ai/server/services/video/heygen), [Tavus](https://docs.pipecat.ai/server/services/video/tavus), [Simli](https://docs.pipecat.ai/server/services/video/simli) |
| Video | [HeyGen](https://docs.pipecat.ai/server/services/video/heygen), [LemonSlice](https://docs.pipecat.ai/server/services/video/lemonslice), [Tavus](https://docs.pipecat.ai/server/services/video/tavus), [Simli](https://docs.pipecat.ai/server/services/video/simli) |
| Memory | [mem0](https://docs.pipecat.ai/server/services/memory/mem0) |
| Vision & Image | [fal](https://docs.pipecat.ai/server/services/image-generation/fal), [Google Imagen](https://docs.pipecat.ai/server/services/image-generation/google-imagen), [Moondream](https://docs.pipecat.ai/server/services/vision/moondream) |
| Audio Processing | [Silero VAD](https://docs.pipecat.ai/server/utilities/audio/silero-vad-analyzer), [Krisp](https://docs.pipecat.ai/server/utilities/audio/krisp-filter), [Koala](https://docs.pipecat.ai/server/utilities/audio/koala-filter), [ai-coustics](https://docs.pipecat.ai/server/utilities/audio/aic-filter) |

View File

@@ -1 +0,0 @@
- Added `TextAggregationMetricsData` metric measuring the time from the first LLM token to the first complete sentence, representing the latency cost of sentence aggregation in the TTS pipeline.

View File

@@ -1 +0,0 @@
- Added `text_aggregation_mode` parameter to `TTSService` and all TTS subclasses with a new `TextAggregationMode` enum (`SENTENCE`, `TOKEN`). All text now flows through text aggregators regardless of mode, enabling pattern detection and tag handling in TOKEN mode.

View File

@@ -1 +0,0 @@
- ⚠️ Deprecated `aggregate_sentences` parameter on `TTSService` and all TTS subclasses. Use `text_aggregation_mode=TextAggregationMode.SENTENCE` or `text_aggregation_mode=TextAggregationMode.TOKEN` instead.

View File

@@ -1,19 +0,0 @@
- Added support for using strongly-typed objects instead of dicts for updating service settings at runtime.
Instead of, say:
```python
await task.queue_frame(
STTUpdateSettingsFrame(settings={"language": Language.ES})
)
```
you'd do:
```python
await task.queue_frame(
STTUpdateSettingsFrame(delta=DeepgramSTTSettings(language=Language.ES))
)
```
Each service now vends strongly-typed classes like `DeepgramSTTSettings` representing the service's runtime-updatable settings.

View File

@@ -1 +0,0 @@
- ⚠️ Refactored runtime-updatable service settings to use strongly-typed classes (`TTSSettings`, `STTSettings`, `LLMSettings`, and service-specific subclasses) instead of plain dicts. Each service's `_settings` now holds these strongly-typed objects. For service maintainers, see changes in COMMUNITY_INTEGRATIONS.md.

View File

@@ -1 +0,0 @@
- Dict-based `*UpdateSettingsFrame(settings={...})` is deprecated in favor of passing typed settings delta objects with `*UpdateSettingsFrame(delta={...})`.

View File

@@ -1,3 +0,0 @@
- Deprecated `set_model()`, `set_voice()`, and `set_language()` on AI services in favor of runtime updates via `TTSUpdateSettingsFrame`, `STTUpdateSettingsFrame`, and `LLMUpdateSettingsFrame`.
⚠️ Note, too, a subtle behavior change in these deprecated methods. Whereas previously only `set_language()` caused the service to actually react to the update (e.g. by reconnecting to a remote service so it an pick up the change), now all these methods do. This change was made as part of a refactor making them all work the same way under the hood.

View File

@@ -1 +0,0 @@
- Switched `GradiumTTSService` from `InterruptibleWordTTSService` to `AudioContextWordTTSService`, eliminating websocket disconnect/reconnect on every interruption by using `client_req_id`-based multiplexing.

View File

@@ -1 +0,0 @@
- Added support for specifying private endpoints for Azure Speech-to-Text, enabling use in private networks behind firewalls.

View File

@@ -1 +0,0 @@
- Word timestamp support has been moved from `WordTTSService` into `TTSService` via a new `supports_word_timestamps` parameter. Services that previously extended `WordTTSService`, `AudioContextWordTTSService`, or `WebsocketWordTTSService` now pass `supports_word_timestamps=True` to their parent `__init__` instead.

View File

@@ -1,5 +0,0 @@
- Deprecated `WordTTSService`, `WebsocketWordTTSService`, `AudioContextWordTTSService`, and `InterruptibleWordTTSService`. Use their non-word counterparts with `supports_word_timestamps=True` instead:
- `WordTTSService``TTSService(supports_word_timestamps=True)`
- `WebsocketWordTTSService``WebsocketTTSService(supports_word_timestamps=True)`
- `AudioContextWordTTSService``AudioContextTTSService(supports_word_timestamps=True)`
- `InterruptibleWordTTSService``InterruptibleTTSService(supports_word_timestamps=True)`

View File

@@ -1 +0,0 @@
- Added `LLMSpecificMessage` handling in `LLMContextSummarizationUtil` to skip provider-specific messages during context summarization.

View File

@@ -1 +0,0 @@
- Treated `response_cancel_not_active` as a non-fatal error in realtime services (`OpenAIRealtimeLLMService`, `GrokRealtimeLLMService`, `OpenAIRealtimeBetaLLMService`) to prevent WebSocket disconnection when cancelling an inactive response.

View File

@@ -1 +0,0 @@
- Fixed Poetry compatibility by inlining `local-smart-turn-v3` dependencies (`transformers`, `onnxruntime`) into core dependencies instead of using a self-referential extra.

View File

@@ -1 +0,0 @@
- Removed `local-smart-turn-v3` optional extra from `pyproject.toml`. The `transformers` and `onnxruntime` packages are now always installed as core dependencies since they are required by the default turn stop strategy, `TurnAnalyzerUserTurnStopStrategy` which uses `LocalSmartTurnAnalyzerV3`.

View File

@@ -1 +0,0 @@
- Added `output_medium` parameter to `AgentInputParams` and `OneShotInputParams` in Ultravox service to control initial output medium (text or voice) at call creation time.

View File

@@ -1 +0,0 @@
- Improved Ultravox TTFB measurement accuracy by using VAD speech end time instead of `UserStoppedSpeakingFrame` timing.

View File

@@ -1 +0,0 @@
- Aligned `UltravoxRealtimeLLMService` frame handling with OpenAI/Gemini realtime services: added `InterruptionFrame` handling with metrics cleanup, processing metrics at response boundaries, and improved agent transcript handling for both voice and text output modalities.

View File

@@ -1 +0,0 @@
- Updated `OpenAIRealtimeLLMService` default model to `gpt-realtime-1.5`.

View File

@@ -1 +0,0 @@
- Fixed `SentryMetrics` method signatures to match updated `FrameProcessorMetrics` base class, resolving `TypeError` when using `start_time`/`end_time` keyword arguments.

View File

@@ -1 +0,0 @@
- Added `TurnMetricsData` as a generic metrics class for turn detection, with e2e processing time measurement. `KrispVivaTurn` now emits `TurnMetricsData` with `e2e_processing_time_ms` tracking the interval from VAD speech-to-silence transition to turn completion.

View File

@@ -1 +0,0 @@
- Added `api_key` parameter to `KrispVivaSDKManager`, `KrispVivaTurn`, and `KrispVivaFilter` for Krisp SDK v1.6.1+ licensing. Falls back to `KRISP_VIVA_API_KEY` environment variable.

View File

@@ -1 +0,0 @@
- Deprecated `SmartTurnMetricsData` in favor of `TurnMetricsData`. `BaseSmartTurn` now emits `TurnMetricsData` directly.

View File

@@ -1 +0,0 @@
- Bumped `nltk` minimum version from 3.9.1 to 3.9.3 to resolve a security vulnerability.

View File

@@ -1 +0,0 @@
- Fixed STT TTFB metrics not being reported for `SonioxSTTService` and `AWSTranscribeSTTService` due to missing `can_generate_metrics()` override.

View File

@@ -1 +0,0 @@
- Added `on_audio_context_interrupted()` and `on_audio_context_completed()` callbacks to `AudioContextTTSService`. Subclasses can override these to perform provider-specific cleanup instead of overriding `_handle_interruption()`.

View File

@@ -1 +0,0 @@
- Fixed an issue where `AudioContextTTSService`-based providers (AsyncAI, ElevenLabs, Inworld, Rime) did not close or clean up their server-side audio contexts after normal speech completion, only on interruption.

View File

@@ -1,4 +0,0 @@
- `ServiceSettingsUpdateFrame`s are now `UninterruptibleFrame`s. Generally speaking, you don't want a user interruption to prevent a service setting change from going into effect. Note that you usually don't use `ServiceSettingsUpdateFrame` directly, you use one of its subclasses:
- `LLMUpdateSettingsFrame`
- `TTSUpdateSettingsFrame`
- `STTUpdateSettingsFrame`

View File

@@ -1 +0,0 @@
- Fixed STT TTFB metrics measuring timeout expiry time instead of actual transcript arrival time.

View File

@@ -1 +0,0 @@
- Fixed `InterimTranscriptionFrame` and `TranslationFrame` being unintentionally pushed downstream in `LLMUserAggregator`. They are now consumed like `TranscriptionFrame`.

View File

@@ -1 +0,0 @@
- Fixed misleading "Empty audio frame received for STT service" warnings when using audio filters (e.g. `RNNoiseFilter`, `KrispVivaFilter`, `AICFilter`) that buffer audio internally.

View File

@@ -1 +0,0 @@
- Fixed issues with `RimeNonJsonTTSService` where trailing punctuation is sometimes vocalized

View File

@@ -1 +0,0 @@
- ⚠️ Removed `PlayHTTTSService` and `PlayHTHttpTTSService`. PlayHT has been shut down and is no longer available.

View File

@@ -1 +0,0 @@
- Fixed `TTSSpeakFrame` not committing spoken text to the conversation context when used outside of an LLM response (e.g., bot greetings or injected speech).

View File

@@ -1 +0,0 @@
- Removed verbose per-chunk audio logging from `GenesysAudioHookSerializer` that flooded production logs.

View File

@@ -1 +0,0 @@
- Deprecated `ProcessingMetricsData` and `start_processing_metrics()`/`stop_processing_metrics()` on `FrameProcessor` and `FrameProcessorMetrics`. These metrics don't accurately depict a service's performance. Instead, TTFB metrics are recommended. Processing metrics will be removed in the 1.0.0 version.

View File

@@ -1 +0,0 @@
- Added optional `llm` field to `LLMContextSummarizationConfig` for routing summarization to a dedicated LLM service (e.g., a cheaper/faster model) instead of the pipeline's primary model.

View File

@@ -1 +0,0 @@
- Added `summarization_timeout` to `LLMContextSummarizationConfig` (default 120s) to prevent hung LLM calls from permanently blocking future summarizations.

View File

@@ -1 +0,0 @@
- Added `on_summary_applied` event to `LLMContextSummarizer` for observability, providing message counts before and after context summarization.

View File

@@ -1 +0,0 @@
- Added `summary_message_template` to `LLMContextSummarizationConfig` for customizing how summaries are formatted when injected into context (e.g., wrapping in XML tags).

View File

@@ -1 +0,0 @@
- Updated context summarization to use `user` role instead of `assistant` for summary messages.

View File

@@ -1 +0,0 @@
- Fixed `LocalSmartTurnAnalyzerV3` producing incorrect end-of-turn predictions at non-16kHz sample rates (e.g. 8kHz Twilio telephony) by adding automatic resampling to 16kHz before Whisper feature extraction.

View File

@@ -1 +0,0 @@
- Added `LLMContextSummaryConfig` (summary generation params: `target_context_tokens`, `min_messages_after_summary`, `summarization_prompt`) and `LLMAutoContextSummarizationConfig` (auto-trigger thresholds: `max_context_tokens`, `max_unsummarized_messages`, plus a nested `summary_config`). These replace the monolithic `LLMContextSummarizationConfig`.

View File

@@ -1 +0,0 @@
- Added `LLMSummarizeContextFrame` to trigger on-demand context summarization from anywhere in the pipeline (e.g. a function call tool). Accepts an optional `config: LLMContextSummaryConfig` to override summary generation settings per request.

View File

@@ -1 +0,0 @@
- ⚠️ Renamed `LLMAssistantAggregatorParams` fields: `enable_context_summarization``enable_auto_context_summarization` and `context_summarization_config``auto_context_summarization_config` (now accepts `LLMAutoContextSummarizationConfig`). The old names still work with a `DeprecationWarning` for one release cycle.

View File

@@ -1 +0,0 @@
- Deprecated `LLMContextSummarizationConfig`. Use `LLMAutoContextSummarizationConfig` with a nested `LLMContextSummaryConfig` instead. The old class emits a `DeprecationWarning`.

View File

@@ -1 +0,0 @@
- `ElevenLabsRealtimeSTTService` now sets `TranscriptionFrame.finalized` to `True` when using `CommitStrategy.MANUAL`.

View File

@@ -1 +0,0 @@
- Fixed `PipelineTask` double-inserting `RTVIProcessor` into the frame chain when the user provides both an `RTVIProcessor` in the pipeline and a custom `RTVIObserver` subclass in observers.

View File

@@ -1 +0,0 @@
- Updated numba version pin from == to >=0.61.2

View File

@@ -1 +0,0 @@
- Updated tracing code to use `ServiceSettings` dataclass API (`given_fields()`, attribute access) instead of dict-style access (`.items()`, `in`, subscript).

View File

@@ -1 +0,0 @@
- Added optional `direction` parameter to `PipelineTask.queue_frame()` and `PipelineTask.queue_frames()`, allowing frames to be pushed upstream from the end of the pipeline.

View File

@@ -1 +0,0 @@
- Fixed turn completion instructions being lost when `LLMMessagesUpdateFrame` replaces the LLM context. When `filter_incomplete_user_turns` is enabled, the turn completion system message is now re-injected after context replacement.

View File

@@ -108,6 +108,10 @@ KRISP_VIVA_API_KEY=...
KRISP_VIVA_FILTER_MODEL_PATH=...
KRISP_VIVA_TURN_MODEL_PATH=...
# LemonSlice
LEMONSLICE_API_KEY=...
LEMONSLICE_AGENT_ID=...
# LiveKit
LIVEKIT_API_KEY=...
LIVEKIT_API_SECRET=...

View File

@@ -23,8 +23,8 @@ from pipecat.processors.aggregators.llm_response_universal import (
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.aws.llm import AWSBedrockLLMService
from pipecat.services.deepgram.stt_sagemaker import DeepgramSageMakerSTTService
from pipecat.services.deepgram.tts_sagemaker import DeepgramSageMakerTTSService
from pipecat.services.deepgram.sagemaker.stt import DeepgramSageMakerSTTService
from pipecat.services.deepgram.sagemaker.tts import DeepgramSageMakerTTSService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams

View File

@@ -0,0 +1,179 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
LLMUserAggregatorParams,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
from pipecat.turns.user_turn_strategies import ExternalUserTurnStrategies
load_dotenv(override=True)
# We use lambdas to defer transport parameter creation until the transport
# type is selected at runtime.
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
}
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
"""AssemblyAI u3-rt-pro with Built-in Turn Detection
This example demonstrates using AssemblyAI's u3-rt-pro Speech-to-Text model
with AssemblyAI's built-in turn detection for more natural conversation flow.
Key features:
1. AssemblyAI Turn Detection
- Set `vad_force_turn_endpoint=False` to use AssemblyAI's built-in turn detection
- AssemblyAI's model determines when user starts/stops speaking
- Uses `ExternalUserTurnStrategies` to delegate turn control to AssemblyAI
- More natural turn detection based on speech patterns and pauses
2. Advanced Turn Detection Tuning
- `min_turn_silence`: Minimum silence (ms) when confident about end-of-turn.
Lower values = faster responses. Default: 100ms
- `max_turn_silence`: Maximum silence (ms) before forcing end-of-turn.
Prevents long pauses. Default: 1000ms
3. Prompt-Based Transcription Enhancement
- Use `prompt` parameter to improve accuracy for specific names/terms
- Particularly useful for proper nouns, technical terms, domain vocabulary
- Example: "Names: Xiomara, Saoirse, Krzystof. Technical terms: API, OAuth."
4. Speaker Diarization (Optional)
- Enable with `speaker_labels=True`
- Automatically identifies different speakers in multi-party conversations
- TranscriptionFrame includes speaker_id field (e.g., "Speaker A", "Speaker B")
5. Language Detection (Optional, multilingual model only)
- Enable with `language_detection=True`
- Automatically detects spoken language
- Available with universal-streaming-multilingual model
For more information: https://www.assemblyai.com/docs/speech-to-text/streaming
"""
logger.info(f"Starting bot")
stt = AssemblyAISTTService(
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
vad_force_turn_endpoint=False, # Use AssemblyAI's built-in turn detection
connection_params=AssemblyAIConnectionParams(
speech_model="u3-rt-pro",
# Optional: Tune turn detection timing (defaults shown below)
# min_turn_silence=100, # Default
# max_turn_silence=1000, # Default
# Optional: Boost accuracy for specific names/terms
# prompt="Names: Xiomara, Saoirse, Krzystof. Technical terms: API, OAuth.",
# Optional: Enable speaker diarization
# speaker_labels=True,
),
)
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
)
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(
user_turn_strategies=ExternalUserTurnStrategies(),
vad_analyzer=SileroVADAnalyzer(),
),
)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt, # STT
user_aggregator, # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
assistant_aggregator, # Assistant spoken responses
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
# Kick off the conversation.
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info(f"Client disconnected")
await task.cancel()
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()

View File

@@ -55,7 +55,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
stt = NvidiaSTTService(api_key=os.getenv("NVIDIA_API_KEY"))
llm = NvidiaLLMService(
api_key=os.getenv("NVIDIA_API_KEY"), model="meta/llama-3.1-405b-instruct"
api_key=os.getenv("NVIDIA_API_KEY"),
model="meta/llama-3.3-70b-instruct",
)
tts = NvidiaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))

View File

@@ -16,6 +16,7 @@ from pipecat.pipeline.task import PipelineTask
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
@@ -49,6 +50,9 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
stt = AssemblyAISTTService(
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
connection_params=AssemblyAIConnectionParams(
speech_model="u3-rt-pro",
),
)
tl = TranscriptionLogger()

View File

@@ -5,13 +5,17 @@
#
import asyncio
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.observers.startup_timing_observer import StartupTimingObserver
from pipecat.observers.user_bot_latency_observer import UserBotLatencyObserver
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
@@ -25,6 +29,7 @@ from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.llm_service import FunctionCallParams
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
@@ -32,6 +37,17 @@ from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
async def fetch_weather_from_api(params: FunctionCallParams):
await asyncio.sleep(0.25)
await params.result_callback({"conditions": "nice", "temperature": "75"})
async def fetch_restaurant_recommendation(params: FunctionCallParams):
await asyncio.sleep(0.1)
await params.result_callback({"name": "The Golden Dragon"})
# We use lambdas to defer transport parameter creation until the transport
# type is selected at runtime.
transport_params = {
@@ -62,6 +78,38 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
llm.register_function("get_current_weather", fetch_weather_from_api)
llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
weather_function = FunctionSchema(
name="get_current_weather",
description="Get the current weather",
properties={
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"format": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The temperature unit to use. Infer this from the user's location.",
},
},
required=["location", "format"],
)
restaurant_function = FunctionSchema(
name="get_restaurant_recommendation",
description="Get a restaurant recommendation",
properties={
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
},
required=["location"],
)
tools = ToolsSchema(standard_tools=[weather_function, restaurant_function])
messages = [
{
"role": "system",
@@ -69,7 +117,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
},
]
context = LLMContext(messages)
context = LLMContext(messages, tools)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
@@ -87,8 +135,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
]
)
# Create latency tracking observer
latency_observer = UserBotLatencyObserver()
startup_observer = StartupTimingObserver()
task = PipelineTask(
pipeline,
@@ -97,14 +145,29 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
observers=[latency_observer],
observers=[latency_observer, startup_observer],
)
# Log latency measurements using the event handler
@latency_observer.event_handler("on_first_bot_speech_latency")
async def on_first_bot_speech_latency(observer, latency_seconds):
logger.info(f"First bot speech: {latency_seconds:.3f}s after client connected")
@latency_observer.event_handler("on_latency_measured")
async def on_latency_measured(observer, latency_seconds):
logger.info(f"⏱️ User-to-bot latency: {latency_seconds:.3f}s")
@startup_observer.event_handler("on_startup_timing_report")
async def on_startup_timing_report(observer, report):
logger.info(f"Total startup: {report.total_duration_secs:.3f}s")
for timing in report.processor_timings:
logger.info(f" {timing.processor_name}: {timing.duration_secs:.3f}s")
@startup_observer.event_handler("on_transport_timing_report")
async def on_transport_timing_report(observer, report):
if report.bot_connected_secs is not None:
logger.info(f"Bot connected: {report.bot_connected_secs:.3f}s")
logger.info(f"Client connected: {report.client_connected_secs:.3f}s")
turn_observer = task.turn_tracking_observer
if turn_observer:
@@ -119,6 +182,11 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
else:
logger.info(f"🏁 Turn {turn_number} completed in {duration:.2f}s")
@latency_observer.event_handler("on_latency_breakdown")
async def on_latency_breakdown(observer, breakdown):
for event in breakdown.chronological_events():
logger.info(f" {event}")
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")

View File

@@ -24,7 +24,7 @@ from pipecat.processors.aggregators.llm_response_universal import (
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt_sagemaker import (
from pipecat.services.deepgram.sagemaker.stt import (
DeepgramSageMakerSTTService,
DeepgramSageMakerSTTSettings,
)

View File

@@ -22,10 +22,10 @@ from pipecat.processors.aggregators.llm_response_universal import (
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAISTTSettings
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transcriptions.language import Language
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
@@ -51,7 +51,12 @@ transport_params = {
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
stt = AssemblyAISTTService(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
stt = AssemblyAISTTService(
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
connection_params=AssemblyAIConnectionParams(
speech_model="u3-rt-pro",
),
)
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
@@ -63,7 +68,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
"content": "You are a helpful LLM in a WebRTC call demonstrating dynamic keyterms updates. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Try saying difficult names like 'Xiomara', 'Saoirse', or 'Krzystof' to test transcription accuracy.",
},
]
@@ -97,14 +102,24 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
logger.info(
"Phase 1: No keyterms boosting - try saying 'Xiomara', 'Saoirse', or 'Krzystof'"
)
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMRunFrame()])
await asyncio.sleep(10)
logger.info("Updating AssemblyAI STT settings: language=es")
await asyncio.sleep(15)
logger.info("🔄 Updating keyterms: Adding difficult names for boosting")
await task.queue_frame(
STTUpdateSettingsFrame(delta=AssemblyAISTTSettings(language=Language.ES))
STTUpdateSettingsFrame(
delta=AssemblyAISTTSettings(
connection_params=AssemblyAIConnectionParams(
keyterms_prompt=["Xiomara", "Saoirse", "Krzystof", "Nguyen", "Pipecat"]
)
)
)
)
logger.info("Phase 2: Keyterms active - same names should transcribe better now!")
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):

View File

@@ -22,11 +22,11 @@ from pipecat.processors.aggregators.llm_response_universal import (
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.deepgram.tts_sagemaker import (
from pipecat.services.deepgram.sagemaker.tts import (
DeepgramSageMakerTTSService,
DeepgramSageMakerTTSSettings,
)
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams

View File

@@ -0,0 +1,123 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
import os
import sys
import aiohttp
from dotenv import load_dotenv
from loguru import logger
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
LLMUserAggregatorParams,
)
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
from pipecat.services.groq.llm import GroqLLMService
from pipecat.transports.lemonslice.transport import (
LemonSliceNewSessionRequest,
LemonSliceParams,
LemonSliceTransport,
)
load_dotenv(override=True)
logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
async def main():
async with aiohttp.ClientSession() as session:
transport = LemonSliceTransport(
bot_name="Pipecat",
api_key=os.getenv("LEMONSLICE_API_KEY"),
session=session,
session_request=LemonSliceNewSessionRequest(
agent_id=os.getenv("LEMONSLICE_AGENT_ID"),
),
params=LemonSliceParams(
audio_in_enabled=True,
audio_out_enabled=True,
microphone_out_enabled=False,
),
)
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
llm = GroqLLMService(api_key=os.getenv("GROQ_API_KEY"))
tts = ElevenLabsTTSService(
api_key=os.getenv("ELEVENLABS_API_KEY", ""),
voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
)
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages)
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
context,
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt, # STT
user_aggregator, # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
assistant_aggregator, # Assistant spoken responses
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
audio_in_sample_rate=16000,
audio_out_sample_rate=16000,
enable_metrics=True,
enable_usage_metrics=True,
),
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, participant):
logger.info("Client connected")
# Kick off the conversation.
messages.append(
{
"role": "system",
"content": "Start by greeting the user and ask how you can help.",
}
)
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, participant):
logger.info("Client disconnected")
await task.cancel()
runner = PipelineRunner()
await runner.run(task)
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -121,6 +121,7 @@ uv run 07-interruptible.py -t twilio -x NGROK_HOST_NAME
- **[19-openai-realtime-beta.py](./19-openai-realtime-beta.py)**: OpenAI Speech-to-Speech (Direct S2S, Function calls)
- **[21-tavus-layer-tavus-transport.py](./21-tavus-layer-tavus-transport.py)**: Tavus digital twin (Avatar integration)
- **[27-simli-layer.py](./27-simli-layer.py)**: Simli avatar integration (Video synchronization)
- **[56-lemonslice-transport.py](./56-lemonslice-transport.py)**: LemonSlice avatar integration (A/V Synced Avatar integration)
### Performance & Optimization

View File

@@ -82,6 +82,7 @@ koala = [ "pvkoala~=2.0.3" ]
kokoro = [ "kokoro-onnx>=0.5.0,<1", "requests>=2.32.5,<3" ]
krisp = [ "pipecat-ai-krisp~=0.4.0" ]
langchain = [ "langchain~=0.3.20", "langchain-community~=0.3.20", "langchain-openai~=0.3.9" ]
lemonslice = [ "pipecat-ai[daily]" ]
livekit = [ "livekit~=1.0.13", "livekit-api~=1.0.5", "tenacity>=8.2.3,<10.0.0", "pyjwt>=2.10.1" ]
lmnt = [ "pipecat-ai[websockets-base]" ]
local = [ "pyaudio~=0.2.14" ]

View File

@@ -368,7 +368,7 @@ class ClassificationProcessor(FrameProcessor):
await self._voicemail_notifier.notify() # Clear buffered TTS frames
# Interrupt the current pipeline to stop any ongoing processing
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
# Set the voicemail event to trigger the voicemail handler
self._voicemail_event.clear()

View File

@@ -11,7 +11,6 @@ including data frames, system frames, and control frames for audio, video, text,
and LLM processing.
"""
import asyncio
import time
from dataclasses import dataclass, field
from typing import (
@@ -1141,24 +1140,9 @@ class InterruptionFrame(SystemFrame):
This frame is used to interrupt the pipeline. For example, when a user
starts speaking to cancel any in-progress bot output. It can also be pushed
by any processor.
Parameters:
event: Optional event set when the frame has fully traversed the
pipeline.
"""
event: Optional[asyncio.Event] = None
def complete(self):
"""Signal that this interruption has been fully processed.
Called automatically when the frame reaches the pipeline sink, or
manually when the frame is consumed before reaching it (e.g. when
the user is muted).
"""
if self.event:
self.event.set()
pass
@dataclass
@@ -1825,16 +1809,11 @@ class InterruptionTaskFrame(TaskFrame):
"""Frame indicating the pipeline should be interrupted.
This frame should be pushed upstream to indicate the pipeline should be
interrupted. The pipeline task converts this into an `InterruptionFrame` and
sends it downstream. The `event` is passed to the `InterruptionFrame` so it
can signal when the interruption has fully traversed the pipeline.
Parameters:
event: Optional event passed to the corresponding `InterruptionFrame`.
interrupted. The pipeline task converts this into an `InterruptionFrame`
and sends it downstream.
"""
event: Optional[asyncio.Event] = None
pass
@dataclass
@@ -1910,6 +1889,29 @@ class StopFrame(ControlFrame, UninterruptibleFrame):
pass
@dataclass
class BotConnectedFrame(SystemFrame):
"""Frame indicating the bot has connected to the transport service.
Pushed downstream by SFU transports (Daily, LiveKit, HeyGen, Tavus)
when the bot successfully joins the room. Non-SFU transports do not
emit this frame.
"""
pass
@dataclass
class ClientConnectedFrame(SystemFrame):
"""Frame indicating that a client has connected to the transport.
Pushed downstream by the input transport when a client (participant)
connects. Used by observers to measure transport readiness timing.
"""
pass
@dataclass
class OutputTransportReadyFrame(ControlFrame):
"""Frame indicating that the output transport is ready.

View File

@@ -41,10 +41,6 @@ class TTFBMetricsData(MetricsData):
class ProcessingMetricsData(MetricsData):
"""General processing time metrics data.
.. deprecated:: 0.0.104
Processing metrics are deprecated and will be removed in a future version.
Use TTFB metrics instead.
Parameters:
value: Processing time measurement in seconds.
"""

View File

@@ -100,3 +100,11 @@ class BaseObserver(BaseObject):
data: The event data containing details about the frame transfer.
"""
pass
async def on_pipeline_started(self):
"""Called when the pipeline has fully started.
Fired after the ``StartFrame`` has been processed by all processors
in the pipeline, including nested ``ParallelPipeline`` branches.
"""
pass

View File

@@ -0,0 +1,328 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Observer for tracking pipeline startup timing.
This module provides an observer that measures how long each processor's
``start()`` method takes during pipeline startup. It works by tracking
when a ``StartFrame`` arrives at a processor (``on_process_frame``) versus
when it leaves (``on_push_frame``), giving the exact ``start()`` duration
for each processor in the pipeline.
It also measures transport timing — the time from ``StartFrame`` to the
first ``BotConnectedFrame`` (SFU transports only) and ``ClientConnectedFrame``
— via a separate ``on_transport_timing_report`` event.
Example::
observer = StartupTimingObserver()
@observer.event_handler("on_startup_timing_report")
async def on_report(observer, report):
for t in report.processor_timings:
print(f"{t.processor_name}: {t.duration_secs:.3f}s")
@observer.event_handler("on_transport_timing_report")
async def on_transport(observer, report):
if report.bot_connected_secs is not None:
print(f"Bot connected in {report.bot_connected_secs:.3f}s")
print(f"Client connected in {report.client_connected_secs:.3f}s")
task = PipelineTask(pipeline, observers=[observer])
"""
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Type
from pydantic import BaseModel, Field
from pipecat.frames.frames import BotConnectedFrame, ClientConnectedFrame, StartFrame
from pipecat.observers.base_observer import BaseObserver, FrameProcessed, FramePushed
from pipecat.pipeline.base_pipeline import BasePipeline
from pipecat.pipeline.pipeline import PipelineSource
from pipecat.processors.frame_processor import FrameProcessor
# Internal pipeline types excluded from tracking by default.
_INTERNAL_TYPES = (PipelineSource, BasePipeline)
@dataclass
class _ArrivalInfo:
"""Internal record of when a StartFrame arrived at a processor."""
processor: FrameProcessor
arrival_ts_ns: int
class ProcessorStartupTiming(BaseModel):
"""Startup timing for a single processor.
Parameters:
processor_name: The name of the processor.
start_offset_secs: Offset in seconds from the StartFrame to when this
processor's start() began.
duration_secs: How long the processor's start() took, in seconds.
"""
processor_name: str
start_offset_secs: float
duration_secs: float
class StartupTimingReport(BaseModel):
"""Report of startup timings for all measured processors.
Parameters:
start_time: Unix timestamp when the first processor began starting.
total_duration_secs: Total wall-clock time from first to last processor start.
processor_timings: Per-processor timing data, in pipeline order.
"""
start_time: float
total_duration_secs: float
processor_timings: List[ProcessorStartupTiming] = Field(default_factory=list)
class TransportTimingReport(BaseModel):
"""Time from pipeline start to transport connection milestones.
Parameters:
start_time: Unix timestamp of the StartFrame (pipeline start).
bot_connected_secs: Seconds from StartFrame to first BotConnectedFrame
(only set for SFU transports).
client_connected_secs: Seconds from StartFrame to first ClientConnectedFrame.
"""
start_time: float
bot_connected_secs: Optional[float] = None
client_connected_secs: Optional[float] = None
class StartupTimingObserver(BaseObserver):
"""Observer that measures processor startup times during pipeline initialization.
Tracks how long each processor's ``start()`` method takes by measuring the
time between when a ``StartFrame`` arrives at a processor and when it is
pushed downstream. This captures WebSocket connections, API authentication,
model loading, and other initialization work.
Also measures transport timing, the time from ``StartFrame`` to connection
milestones:
- ``bot_connected_secs``: When the bot joins the transport room
(SFU transports only, triggered by ``BotConnectedFrame``).
- ``client_connected_secs``: When a remote participant connects
(triggered by ``ClientConnectedFrame``).
By default, internal pipeline processors (``PipelineSource``, ``Pipeline``)
are excluded from the report. Pass ``processor_types`` to measure only
specific types.
Event handlers available:
- on_startup_timing_report: Called once after startup completes with the full
timing report.
- on_transport_timing_report: Called once when the first client connects with a
TransportTimingReport containing client_connected_secs and bot_connected_secs
(if available).
Example::
observer = StartupTimingObserver(
processor_types=(STTService, TTSService)
)
@observer.event_handler("on_startup_timing_report")
async def on_report(observer, report):
for t in report.processor_timings:
logger.info(f"{t.processor_name}: {t.duration_secs:.3f}s")
@observer.event_handler("on_transport_timing_report")
async def on_transport(observer, report):
if report.bot_connected_secs is not None:
logger.info(f"Bot connected in {report.bot_connected_secs:.3f}s")
logger.info(f"Client connected in {report.client_connected_secs:.3f}s")
task = PipelineTask(pipeline, observers=[observer])
Args:
processor_types: Optional tuple of processor types to measure. If None,
all non-internal processors are measured.
"""
def __init__(
self,
*,
processor_types: Optional[Tuple[Type[FrameProcessor], ...]] = None,
**kwargs,
):
"""Initialize the startup timing observer.
Args:
processor_types: Optional tuple of processor types to measure.
If None, all non-internal processors are measured.
**kwargs: Additional arguments passed to parent class.
"""
super().__init__(**kwargs)
self._processor_types = processor_types
# Map processor ID -> arrival info.
self._arrivals: Dict[int, _ArrivalInfo] = {}
# Collected timings in pipeline order.
self._timings: List[ProcessorStartupTiming] = []
# Lock onto the first StartFrame we see (by frame ID).
self._start_frame_id: Optional[str] = None
# Whether we've already emitted the startup timing report.
self._startup_timing_reported = False
# Whether we've already measured transport timing.
self._transport_timing_reported = False
# Timestamp (ns) when we first see a StartFrame arrive at a processor.
self._start_frame_arrival_ns: Optional[int] = None
# Bot connected timing (stored for inclusion in the transport report).
self._bot_connected_secs: Optional[float] = None
# Wall clock time when the StartFrame was first seen.
self._start_wall_clock: Optional[float] = None
self._register_event_handler("on_startup_timing_report")
self._register_event_handler("on_transport_timing_report")
def _should_track(self, processor: FrameProcessor) -> bool:
"""Check if a processor should be tracked for timing.
Args:
processor: The processor to check.
Returns:
True if the processor matches the filter or no filter is set.
"""
if self._processor_types is not None:
return isinstance(processor, self._processor_types)
# Default: exclude internal pipeline plumbing.
return not isinstance(processor, _INTERNAL_TYPES)
async def on_pipeline_started(self):
"""Emit the startup timing report when the pipeline has fully started.
Called by the ``PipelineTask`` after the ``StartFrame`` has been
processed by all processors, including nested ``ParallelPipeline``
branches.
"""
if self._timings:
await self._emit_report()
async def on_process_frame(self, data: FrameProcessed):
"""Record when a StartFrame arrives at a processor.
Args:
data: The frame processing event data.
"""
if self._startup_timing_reported:
return
if not isinstance(data.frame, StartFrame):
return
# Lock onto the first StartFrame.
if self._start_frame_id is None:
self._start_frame_id = data.frame.id
self._start_frame_arrival_ns = data.timestamp
self._start_wall_clock = time.time()
elif data.frame.id != self._start_frame_id:
return
if self._should_track(data.processor):
self._arrivals[data.processor.id] = _ArrivalInfo(
processor=data.processor, arrival_ts_ns=data.timestamp
)
async def on_push_frame(self, data: FramePushed):
"""Record when a StartFrame leaves a processor and compute the delta.
Also handles ``BotConnectedFrame`` and ``ClientConnectedFrame`` to
measure transport timing.
Args:
data: The frame push event data.
"""
if isinstance(data.frame, BotConnectedFrame):
self._handle_bot_connected(data)
return
if isinstance(data.frame, ClientConnectedFrame):
await self._handle_client_connected(data)
return
if self._startup_timing_reported:
return
if not isinstance(data.frame, StartFrame):
return
if self._start_frame_id is not None and data.frame.id != self._start_frame_id:
return
arrival = self._arrivals.pop(data.source.id, None)
if arrival is None:
return
duration_ns = data.timestamp - arrival.arrival_ts_ns
duration_secs = duration_ns / 1e9
start_offset_secs = (arrival.arrival_ts_ns - self._start_frame_arrival_ns) / 1e9
self._timings.append(
ProcessorStartupTiming(
processor_name=arrival.processor.name,
start_offset_secs=start_offset_secs,
duration_secs=duration_secs,
)
)
def _handle_bot_connected(self, data: FramePushed):
"""Record bot connected timing on first BotConnectedFrame."""
if self._bot_connected_secs is not None or self._start_frame_arrival_ns is None:
return
delta_ns = data.timestamp - self._start_frame_arrival_ns
self._bot_connected_secs = delta_ns / 1e9
async def _handle_client_connected(self, data: FramePushed):
"""Emit transport timing report on first ClientConnectedFrame."""
if self._transport_timing_reported or self._start_frame_arrival_ns is None:
return
self._transport_timing_reported = True
delta_ns = data.timestamp - self._start_frame_arrival_ns
client_connected_secs = delta_ns / 1e9
report = TransportTimingReport(
start_time=self._start_wall_clock or 0.0,
bot_connected_secs=self._bot_connected_secs,
client_connected_secs=client_connected_secs,
)
await self._call_event_handler("on_transport_timing_report", report)
async def _emit_report(self):
"""Build and emit the startup timing report."""
if self._startup_timing_reported:
return
self._startup_timing_reported = True
total = sum(t.duration_secs for t in self._timings)
report = StartupTimingReport(
start_time=self._start_wall_clock or 0.0,
total_duration_secs=total,
processor_timings=self._timings,
)
await self._call_event_handler("on_startup_timing_report", report)

View File

@@ -1,22 +1,146 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Observer for tracking user-to-bot response latency.
This module provides an observer that monitors the time between when a user
stops speaking and when the bot starts speaking, emitting events when latency
is measured.
is measured. Optionally collects per-service latency breakdown metrics
(TTFB, text aggregation) when ``enable_metrics=True``.
"""
import time
from typing import Optional, Set
from collections import deque
from typing import Dict, List, Optional
from pydantic import BaseModel, Field
from pipecat.frames.frames import (
BotStartedSpeakingFrame,
ClientConnectedFrame,
FunctionCallInProgressFrame,
FunctionCallResultFrame,
InterruptionFrame,
MetricsFrame,
UserStoppedSpeakingFrame,
VADUserStartedSpeakingFrame,
VADUserStoppedSpeakingFrame,
)
from pipecat.metrics.metrics import (
TextAggregationMetricsData,
TTFBMetricsData,
)
from pipecat.observers.base_observer import BaseObserver, FramePushed
from pipecat.processors.frame_processor import FrameDirection
class TTFBBreakdownMetrics(BaseModel):
"""TTFB measurement with timestamp for timeline placement.
Parameters:
processor: Name of the processor that reported the TTFB.
model: Optional model name associated with the metric.
start_time: Unix timestamp when the TTFB measurement started.
duration_secs: TTFB duration in seconds.
"""
processor: str
model: Optional[str] = None
start_time: float
duration_secs: float
class TextAggregationBreakdownMetrics(BaseModel):
"""Text aggregation measurement with timestamp for timeline placement.
Parameters:
processor: Name of the processor that reported the metric.
start_time: Unix timestamp when text aggregation started.
duration_secs: Aggregation duration in seconds.
"""
processor: str
start_time: float
duration_secs: float
class FunctionCallMetrics(BaseModel):
"""Latency for a single function call execution.
Parameters:
function_name: Name of the function that was called.
start_time: Unix timestamp when execution started.
duration_secs: Time in seconds from execution start to result.
"""
function_name: str
start_time: float
duration_secs: float
class LatencyBreakdown(BaseModel):
"""Per-service latency breakdown for a single user-to-bot cycle.
Collected between ``VADUserStoppedSpeakingFrame`` and
``BotStartedSpeakingFrame`` when ``enable_metrics=True`` in
:class:`~pipecat.pipeline.task.PipelineParams`.
Parameters:
ttfb: Time-to-first-byte metrics from each service in the pipeline.
text_aggregation: First text aggregation measurement, representing
the latency cost of sentence aggregation in the TTS pipeline.
user_turn_start_time: Unix timestamp when the user turn started
(actual user silence, adjusted for VAD stop_secs). ``None`` if
no ``VADUserStoppedSpeakingFrame`` was observed.
user_turn_secs: Duration in seconds of the user's turn, measured
from when the user actually stopped speaking to when the turn
was released (``UserStoppedSpeakingFrame``). This includes
VAD silence detection, STT finalization, and any turn analyzer
wait. ``None`` if no ``UserStoppedSpeakingFrame`` was observed
(e.g. no turn analyzer configured).
function_calls: Latency for each function call executed during
this cycle. Empty if no function calls occurred.
"""
ttfb: List[TTFBBreakdownMetrics] = Field(default_factory=list)
text_aggregation: Optional[TextAggregationBreakdownMetrics] = None
user_turn_start_time: Optional[float] = None
user_turn_secs: Optional[float] = None
function_calls: List[FunctionCallMetrics] = Field(default_factory=list)
def chronological_events(self) -> List[str]:
"""Return human-readable event labels sorted by start time.
Collects all sub-metrics into a flat list, sorts by ``start_time``,
and returns formatted strings suitable for logging.
Returns:
List of formatted strings, one per event, in chronological order.
"""
events: List[tuple] = []
if self.user_turn_start_time is not None and self.user_turn_secs is not None:
events.append((self.user_turn_start_time, f"User turn: {self.user_turn_secs:.3f}s"))
for t in self.ttfb:
events.append((t.start_time, f"{t.processor}: TTFB {t.duration_secs:.3f}s"))
for fc in self.function_calls:
events.append((fc.start_time, f"{fc.function_name}: {fc.duration_secs:.3f}s"))
if self.text_aggregation:
ta = self.text_aggregation
events.append(
(ta.start_time, f"{ta.processor}: text aggregation {ta.duration_secs:.3f}s")
)
events.sort(key=lambda e: e[0])
return [label for _, label in events]
class UserBotLatencyObserver(BaseObserver):
"""Observer that tracks user-to-bot response latency.
@@ -25,34 +149,66 @@ class UserBotLatencyObserver(BaseObserver):
latency is measured, allowing consumers to log, trace, or otherwise process
the latency data.
When ``enable_metrics=True`` in pipeline params, also collects per-service
latency breakdown (TTFB, text aggregation) and emits an
``on_latency_breakdown`` event alongside the existing latency measurement.
This observer follows the composition pattern used by TurnTrackingObserver,
acting as a reusable component for latency measurement.
Events:
on_latency_measured(observer, latency_seconds): Emitted when user-to-bot
latency is calculated. Includes the latency value in seconds as a float.
on_latency_measured(observer, latency_seconds): Emitted when
time-to-first-bot-speech is calculated. Measures the time from
when the user stopped speaking to when the bot starts speaking.
on_latency_breakdown(observer, breakdown): Emitted at each
``BotStartedSpeakingFrame`` with a :class:`LatencyBreakdown`
containing per-service metrics collected during the user→bot cycle.
on_first_bot_speech_latency(observer, latency_seconds): Emitted once,
the first time ``BotStartedSpeakingFrame`` arrives after
``ClientConnectedFrame``. Measures the time from client connection
to the first bot speech.
"""
def __init__(self, **kwargs):
def __init__(self, *, max_frames=100, **kwargs):
"""Initialize the user-bot latency observer.
Sets up tracking for processed frames and user speech timing
to calculate response latencies.
Args:
max_frames: Maximum number of frame IDs to keep in history for
duplicate detection. Defaults to 100.
**kwargs: Additional arguments passed to parent class.
"""
super().__init__(**kwargs)
self._user_stopped_time: Optional[float] = None
self._processed_frames: Set[str] = set()
self._user_turn_start_time: Optional[float] = None
self._user_turn: Optional[float] = None
# First bot speech tracking
self._client_connected_time: Optional[float] = None
self._first_bot_speech_measured: bool = False
# Frame deduplication (bounded deque + set pattern)
self._processed_frames: set = set()
self._frame_history: deque = deque(maxlen=max_frames)
# Per-cycle metric accumulators
self._ttfb: List[TTFBBreakdownMetrics] = []
self._text_aggregation: Optional[TextAggregationBreakdownMetrics] = None
self._function_call_starts: Dict[str, tuple[str, float]] = {}
self._function_call_metrics: List[FunctionCallMetrics] = []
self._register_event_handler("on_latency_measured")
self._register_event_handler("on_latency_breakdown")
self._register_event_handler("on_first_bot_speech_latency")
async def on_push_frame(self, data: FramePushed):
"""Process frames to track speech timing and calculate latency.
Tracks VAD events and bot speaking events to measure the time between
user stopping speech and bot starting speech.
user stopping speech and bot starting speech. Also accumulates metrics
from MetricsFrame for the latency breakdown.
Args:
data: Frame push event containing the frame and direction information.
@@ -61,23 +217,135 @@ class UserBotLatencyObserver(BaseObserver):
if data.direction != FrameDirection.DOWNSTREAM:
return
# Skip already processed frames
# Skip already processed frames (bounded deque + set)
if data.frame.id in self._processed_frames:
return
self._processed_frames.add(data.frame.id)
self._frame_history.append(data.frame.id)
# Track VAD and bot speaking events for latency
if len(self._processed_frames) > len(self._frame_history):
self._processed_frames = set(self._frame_history)
# Track client connection (first occurrence only)
if isinstance(data.frame, ClientConnectedFrame):
if self._client_connected_time is None:
self._client_connected_time = time.time()
return
# Track speech and pipeline events for latency
if isinstance(data.frame, VADUserStartedSpeakingFrame):
# Reset when user starts speaking
self._user_stopped_time = None
self._user_turn_start_time = None
self._user_turn = None
self._reset_accumulators()
# If user speaks before the bot's first speech, abandon the
# first-bot-speech measurement — it's only meaningful for greetings.
self._first_bot_speech_measured = True
elif isinstance(data.frame, VADUserStoppedSpeakingFrame):
# Record the actual time the user stopped speaking, which is
# the VAD determination time minus the stop_secs silence duration
# that had to elapse before the VAD confirmed speech ended.
self._user_stopped_time = data.frame.timestamp - data.frame.stop_secs
elif isinstance(data.frame, BotStartedSpeakingFrame) and self._user_stopped_time:
# Calculate and emit latency
self._user_turn_start_time = self._user_stopped_time
elif isinstance(data.frame, UserStoppedSpeakingFrame):
# Measure the user turn duration: from actual user silence to
# turn release. Includes VAD silence detection, STT finalization,
# and any turn analyzer wait.
if self._user_stopped_time is not None:
self._user_turn = time.time() - self._user_stopped_time
elif isinstance(data.frame, InterruptionFrame):
# Discard stale metrics from cancelled LLM/TTS cycles
self._reset_accumulators()
elif isinstance(data.frame, FunctionCallInProgressFrame):
self._function_call_starts[data.frame.tool_call_id] = (
data.frame.function_name,
time.time(),
)
elif isinstance(data.frame, FunctionCallResultFrame):
start = self._function_call_starts.pop(data.frame.tool_call_id, None)
if start is not None:
function_name, start_time = start
self._function_call_metrics.append(
FunctionCallMetrics(
function_name=function_name,
start_time=start_time,
duration_secs=time.time() - start_time,
)
)
elif isinstance(data.frame, MetricsFrame):
self._handle_metrics_frame(data.frame)
elif isinstance(data.frame, BotStartedSpeakingFrame):
await self._handle_bot_started_speaking()
async def _handle_bot_started_speaking(self):
"""Handle BotStartedSpeakingFrame to emit latency and breakdown."""
emit_breakdown = False
# One-time first bot speech measurement (client connect → first speech)
if self._client_connected_time is not None and not self._first_bot_speech_measured:
self._first_bot_speech_measured = True
latency = time.time() - self._client_connected_time
await self._call_event_handler("on_first_bot_speech_latency", latency)
emit_breakdown = True
if self._user_stopped_time is not None:
latency = time.time() - self._user_stopped_time
self._user_stopped_time = None
await self._call_event_handler("on_latency_measured", latency)
emit_breakdown = True
if emit_breakdown:
breakdown = LatencyBreakdown(
ttfb=list(self._ttfb),
text_aggregation=self._text_aggregation,
user_turn_start_time=self._user_turn_start_time,
user_turn_secs=self._user_turn,
function_calls=list(self._function_call_metrics),
)
await self._call_event_handler("on_latency_breakdown", breakdown)
self._reset_accumulators()
def _handle_metrics_frame(self, frame: MetricsFrame):
"""Extract latency metrics from a MetricsFrame.
Accumulates metrics when a measurement is in progress: either a
user→bot cycle (after ``VADUserStoppedSpeakingFrame``) or the
first-bot-speech window (after ``ClientConnectedFrame``).
"""
waiting_for_first_speech = (
self._client_connected_time is not None and not self._first_bot_speech_measured
)
if self._user_stopped_time is None and not waiting_for_first_speech:
return
now = time.time()
for metrics_data in frame.data:
if isinstance(metrics_data, TTFBMetricsData) and metrics_data.value > 0:
self._ttfb.append(
TTFBBreakdownMetrics(
processor=metrics_data.processor,
model=metrics_data.model,
start_time=now - metrics_data.value,
duration_secs=metrics_data.value,
)
)
elif isinstance(metrics_data, TextAggregationMetricsData):
# Only keep the first measurement — it's the one that
# impacts the initial speaking latency.
if self._text_aggregation is None:
self._text_aggregation = TextAggregationBreakdownMetrics(
processor=metrics_data.processor,
start_time=now - metrics_data.value,
duration_secs=metrics_data.value,
)
def _reset_accumulators(self):
"""Clear per-cycle metric accumulators."""
self._ttfb = []
self._text_aggregation = None
self._user_turn_start_time = None
self._user_turn = None
self._function_call_starts = {}
self._function_call_metrics = []

View File

@@ -892,7 +892,7 @@ class PipelineTask(BasePipelineTask):
# pipeline. This is in case the push task is blocked waiting for a
# pipeline-ending frame to finish traversing the pipeline.
logger.debug(f"{self}: received interruption task frame {frame}")
await self._pipeline.queue_frame(InterruptionFrame(event=frame.event))
await self._pipeline.queue_frame(InterruptionFrame())
elif isinstance(frame, ErrorFrame):
await self._call_event_handler("on_pipeline_error", frame)
if frame.fatal:
@@ -915,6 +915,7 @@ class PipelineTask(BasePipelineTask):
if isinstance(frame, StartFrame):
await self._call_event_handler("on_pipeline_started", frame)
await self._observer.on_pipeline_started()
# Start heartbeat tasks now that StartFrame has been processed
# by all processors in the pipeline
@@ -931,8 +932,6 @@ class PipelineTask(BasePipelineTask):
self._pipeline_end_event.set()
elif isinstance(frame, CancelFrame):
self._pipeline_end_event.set()
elif isinstance(frame, InterruptionFrame):
frame.complete()
elif isinstance(frame, HeartbeatFrame):
await self._heartbeat_queue.put(frame)

View File

@@ -39,6 +39,12 @@ class Proxy:
observer: BaseObserver
class _PipelineStartedSignal:
"""Internal sentinel queued to observers when the pipeline has started."""
pass
class TaskObserver(BaseObserver):
"""Proxy observer that manages multiple observers without blocking the pipeline.
@@ -129,6 +135,10 @@ class TaskObserver(BaseObserver):
for proxy in self._proxies:
await proxy.cleanup()
async def on_pipeline_started(self):
"""Forward pipeline started signal to all managed observers."""
await self._send_to_proxy(_PipelineStartedSignal())
async def on_process_frame(self, data: FrameProcessed):
"""Queue frame data for all managed observers.
@@ -186,7 +196,9 @@ class TaskObserver(BaseObserver):
while True:
data = await queue.get()
if isinstance(data, FramePushed):
if isinstance(data, _PipelineStartedSignal):
await observer.on_pipeline_started()
elif isinstance(data, FramePushed):
if on_push_frame_deprecated:
await observer.on_push_frame(
data.source, data.destination, data.frame, data.direction, data.timestamp

View File

@@ -104,7 +104,7 @@ class DTMFAggregator(FrameProcessor):
# For first digit, schedule interruption.
if is_first_digit:
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
# Check for immediate flush conditions
if frame.button == self._termination_digit:

View File

@@ -581,7 +581,7 @@ class LLMUserContextAggregator(LLMContextResponseAggregator):
logger.debug(
"Interruption conditions met - pushing interruption and aggregation"
)
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
await self._process_aggregation()
else:
logger.debug("Interruption conditions not met - not pushing aggregation")

View File

@@ -608,12 +608,6 @@ class LLMUserAggregator(LLMContextAggregator):
if should_mute_frame:
logger.trace(f"{frame.name} suppressed - user currently muted")
# When muted, the InterruptionFrame won't propagate further and
# will never reach the pipeline sink. Complete it here so
# push_interruption_task_frame_and_wait() doesn't hang.
if should_mute_frame and isinstance(frame, InterruptionFrame):
frame.complete()
should_mute_next_time = False
for s in self._params.user_mute_strategies:
should_mute_next_time |= await s.process_frame(frame)
@@ -737,7 +731,7 @@ class LLMUserAggregator(LLMContextAggregator):
await self._user_idle_controller.process_frame(UserStartedSpeakingFrame())
if params.enable_interruptions and self._allow_interruptions:
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
await self._call_event_handler("on_user_turn_started", strategy)

View File

@@ -234,12 +234,6 @@ class STTMuteFilter(FrameProcessor):
await self.push_frame(frame, direction)
else:
logger.trace(f"{frame.__class__.__name__} suppressed - STT currently muted")
# When muted, the InterruptionFrame won't propagate further
# and will never reach the pipeline sink. Complete it here so
# push_interruption_task_frame_and_wait() doesn't hang.
if isinstance(frame, InterruptionFrame):
frame.complete()
else:
# Pass all other frames through
await self.push_frame(frame, direction)

View File

@@ -41,7 +41,6 @@ from pipecat.frames.frames import (
FrameProcessorResumeFrame,
FrameProcessorResumeUrgentFrame,
InterruptionFrame,
InterruptionTaskFrame,
StartFrame,
SystemFrame,
UninterruptibleFrame,
@@ -240,10 +239,6 @@ class FrameProcessor(BaseObject):
self.__process_frame_task: Optional[asyncio.Task] = None
self.__process_current_frame: Optional[Frame] = None
# Set while awaiting push_interruption_task_frame_and_wait() so that
# _start_interruption() knows not to cancel the process task.
self._wait_for_interruption = False
# Frame processor events.
self._register_event_handler("on_before_process_frame", sync=True)
self._register_event_handler("on_after_process_frame", sync=True)
@@ -329,7 +324,7 @@ class FrameProcessor(BaseObject):
warnings.simplefilter("always")
warnings.warn(
"`FrameProcessor.interruptions_allowed` is deprecated. "
"Use `LLMUserAggregator`'s new `user_mute_strategies` parameter instead.",
"Use `LLMUserAggregator`'s new `user_mute_strategies` parameter instead.",
DeprecationWarning,
stacklevel=2,
)
@@ -441,35 +436,19 @@ class FrameProcessor(BaseObject):
if frame:
await self.push_frame(frame)
_processing_metrics_warned = False
async def start_processing_metrics(self, *, start_time: Optional[float] = None):
"""Start processing metrics collection.
.. deprecated:: 0.0.104
Processing metrics are deprecated and will be removed in a future version.
Use TTFB metrics instead.
Args:
start_time: Optional timestamp to use as the start time. If None,
uses the current time.
"""
if self.can_generate_metrics() and self.metrics_enabled:
if not FrameProcessor._processing_metrics_warned:
FrameProcessor._processing_metrics_warned = True
logger.warning(
"Processing metrics are deprecated and will be removed in a future version. "
"Use TTFB metrics instead."
)
await self._metrics.start_processing_metrics(start_time=start_time)
async def stop_processing_metrics(self, *, end_time: Optional[float] = None):
"""Stop processing metrics collection and push results.
.. deprecated:: 0.0.104
Processing metrics are deprecated and will be removed in a future version.
Use TTFB metrics instead.
Args:
end_time: Optional timestamp to use as the end time. If None, uses
the current time.
@@ -647,15 +626,6 @@ class FrameProcessor(BaseObject):
if self._cancelling:
return
# If we are waiting for an interruption, bypass all queued system frames
# and process the frame right away. This is because a previous system
# frame might be waiting for the interruption frame blocking the input
# task, so this InterruptionFrame would never be dequeued and we'd
# deadlock.
if self._wait_for_interruption and isinstance(frame, InterruptionFrame):
await self.__process_frame(frame, direction, callback)
return
if self._enable_direct_mode:
await self.__process_frame(frame, direction, callback)
else:
@@ -790,43 +760,32 @@ class FrameProcessor(BaseObject):
await self._call_event_handler("on_after_push_frame", frame)
async def broadcast_interruption(self):
"""Broadcast an `InterruptionFrame` both upstream and downstream."""
logger.debug(f"{self}: broadcasting interruption")
self.__reset_process_task()
await self.stop_all_metrics()
await self.broadcast_frame(InterruptionFrame)
async def push_interruption_task_frame_and_wait(self, *, timeout: float = 5.0):
"""Push an interruption task frame upstream and wait for the interruption.
This function sends an `InterruptionTaskFrame` upstream to the
pipeline task. The task creates a corresponding `InterruptionFrame`
and sends it downstream through the pipeline. An `asyncio.Event` is
attached to both frames so the caller can wait until the interruption
has fully traversed the pipeline. The event is set when the
`InterruptionFrame` reaches the pipeline sink. If the frame does
not complete within the given timeout, a warning is logged and the
event is forcibly set so the caller is unblocked.
Args:
timeout: Maximum seconds to wait for the interruption to complete.
.. deprecated:: 0.0.104
Use :meth:`broadcast_interruption` instead. This method now
delegates to ``broadcast_interruption()`` and ignores *timeout*.
"""
self._wait_for_interruption = True
import warnings
event = asyncio.Event()
with warnings.catch_warnings():
warnings.simplefilter("always")
warnings.warn(
"`FrameProcessor.push_interruption_task_frame_and_wait()` is deprecated. "
"Use `FrameProcessor.broadcast_interruption()` instead.",
DeprecationWarning,
stacklevel=2,
)
await self.push_frame(InterruptionTaskFrame(event=event), FrameDirection.UPSTREAM)
# Wait for the `InterruptionFrame` to complete and log a warning if it
# takes too long. If it does take too long make sure we unblock it,
# otherwise we will hang here forever.
while not event.is_set():
try:
await asyncio.wait_for(event.wait(), timeout=timeout)
except asyncio.TimeoutError:
logger.warning(
f"{self}: InterruptionFrame has not completed after"
f" {timeout}s. Make sure InterruptionFrame.complete()"
" is being called (e.g. if the frame is being blocked"
" or consumed before reaching the pipeline sink)."
)
event.set()
self._wait_for_interruption = False
await self.broadcast_interruption()
async def broadcast_frame(self, frame_cls: Type[Frame], **kwargs):
"""Broadcasts a frame of the specified class upstream and downstream.
@@ -933,15 +892,7 @@ class FrameProcessor(BaseObject):
async def _start_interruption(self):
"""Start handling an interruption by cancelling current tasks."""
try:
if self._wait_for_interruption:
# If we get here we know the process task was just waiting for
# an interruption (push_interruption_task_frame_and_wait()), so
# we can't cancel the task because it might still need to do
# more things (e.g. pushing a frame after the
# interruption). Instead we just drain the queue because this is
# an interruption.
self.__reset_process_task()
elif isinstance(self.__process_current_frame, UninterruptibleFrame):
if isinstance(self.__process_current_frame, UninterruptibleFrame):
# We don't want to cancel UninterruptibleFrame, so we simply
# cleanup the queue.
self.__reset_process_queue()

View File

@@ -1702,7 +1702,7 @@ class RTVIProcessor(FrameProcessor):
async def interrupt_bot(self):
"""Send a bot interruption frame upstream."""
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
async def send_server_message(self, data: Any):
"""Send a server message to the client."""

View File

@@ -150,10 +150,6 @@ class FrameProcessorMetrics(BaseObject):
async def start_processing_metrics(self, *, start_time: Optional[float] = None):
"""Start measuring processing time.
.. deprecated:: 0.0.104
Processing metrics are deprecated and will be removed in a future version.
Use TTFB metrics instead.
Args:
start_time: Optional timestamp to use as the start time. If None,
uses the current time.
@@ -163,10 +159,6 @@ class FrameProcessorMetrics(BaseObject):
async def stop_processing_metrics(self, *, end_time: Optional[float] = None):
"""Stop processing time measurement and generate metrics frame.
.. deprecated:: 0.0.104
Processing metrics are deprecated and will be removed in a future version.
Use TTFB metrics instead.
Args:
end_time: Optional timestamp to use as the end time. If None, uses
the current time.

View File

@@ -12,7 +12,8 @@ transcription WebSocket messages and connection configuration.
from typing import List, Literal, Optional
from pydantic import BaseModel, Field
from loguru import logger
from pydantic import BaseModel, ConfigDict, Field, model_validator
class Word(BaseModel):
@@ -68,8 +69,16 @@ class TurnMessage(BaseMessage):
transcript: The transcribed text for this turn.
end_of_turn_confidence: Confidence score for end-of-turn detection.
words: List of individual words with timing and confidence data.
language_code: Detected language code (e.g., "es", "fr"). Only present with
complete utterances or when end_of_turn is True.
language_confidence: Confidence score (0-1) for language detection. Only present
with complete utterances or when end_of_turn is True.
speaker: Speaker label (e.g., "A", "B"). Only present when speaker_labels is
enabled and end_of_turn is True. Maps to 'speaker_label' in JSON response.
"""
model_config = ConfigDict(populate_by_name=True)
type: Literal["Turn"] = "Turn"
turn_order: int
turn_is_formatted: bool
@@ -77,6 +86,21 @@ class TurnMessage(BaseMessage):
transcript: str
end_of_turn_confidence: float
words: List[Word]
language_code: Optional[str] = None
language_confidence: Optional[float] = None
speaker: Optional[str] = Field(default=None, alias="speaker_label")
class SpeechStartedMessage(BaseMessage):
"""Message sent when speech is first detected in the audio stream.
Parameters:
type: Always "SpeechStarted" for this message type.
timestamp: Audio timestamp in milliseconds when speech was detected.
"""
type: Literal["SpeechStarted"] = "SpeechStarted"
timestamp: int
class TerminationMessage(BaseMessage):
@@ -94,7 +118,7 @@ class TerminationMessage(BaseMessage):
# Union type for all possible message types
AnyMessage = BeginMessage | TurnMessage | TerminationMessage
AnyMessage = BeginMessage | TurnMessage | SpeechStartedMessage | TerminationMessage
class AssemblyAIConnectionParams(BaseModel):
@@ -106,10 +130,19 @@ class AssemblyAIConnectionParams(BaseModel):
formatted_finals: Whether to enable transcript formatting. Defaults to True.
word_finalization_max_wait_time: Maximum time to wait for word finalization in milliseconds.
end_of_turn_confidence_threshold: Confidence threshold for end-of-turn detection.
min_end_of_turn_silence_when_confident: Minimum silence duration when confident about end-of-turn.
min_turn_silence: Minimum silence duration when confident about end-of-turn.
min_end_of_turn_silence_when_confident: DEPRECATED. Use min_turn_silence instead.
max_turn_silence: Maximum silence duration before forcing end-of-turn.
keyterms_prompt: List of key terms to guide transcription. Will be JSON serialized before sending.
speech_model: Select between English and multilingual models. Defaults to "universal-streaming-english".
prompt: Optional text prompt to guide the transcription. Only used when speech_model is "u3-rt-pro".
speech_model: Select between English, multilingual, and u3-rt-pro models. Defaults to "u3-rt-pro".
language_detection: Enable automatic language detection. Only applicable to
universal-streaming-multilingual. When enabled, Turn messages include
language_code and language_confidence fields. Defaults to None (not sent).
format_turns: Whether to format transcript turns. Defaults to True.
speaker_labels: Enable speaker diarization. When enabled, final transcripts
(end_of_turn=True) include a speaker field identifying the speaker
(e.g., "Speaker A", "Speaker B"). Defaults to None (not sent).
"""
sample_rate: int = 16000
@@ -117,9 +150,27 @@ class AssemblyAIConnectionParams(BaseModel):
formatted_finals: bool = True
word_finalization_max_wait_time: Optional[int] = None
end_of_turn_confidence_threshold: Optional[float] = None
min_end_of_turn_silence_when_confident: Optional[int] = None
min_turn_silence: Optional[int] = None
min_end_of_turn_silence_when_confident: Optional[int] = None # Deprecated
max_turn_silence: Optional[int] = None
keyterms_prompt: Optional[List[str]] = None
speech_model: Literal["universal-streaming-english", "universal-streaming-multilingual"] = (
"universal-streaming-english"
)
prompt: Optional[str] = None
speech_model: Literal[
"universal-streaming-english", "universal-streaming-multilingual", "u3-rt-pro"
] = "u3-rt-pro"
language_detection: Optional[bool] = None
format_turns: bool = True
speaker_labels: Optional[bool] = None
@model_validator(mode="after")
def handle_deprecated_param(self):
"""Handle deprecated min_end_of_turn_silence_when_confident parameter."""
if self.min_end_of_turn_silence_when_confident is not None:
logger.warning(
"The 'min_end_of_turn_silence_when_confident' parameter is deprecated and will be "
"removed in a future version. Please use 'min_turn_silence' instead."
)
# If min_turn_silence is not set, use the deprecated value
if self.min_turn_silence is None:
self.min_turn_silence = self.min_end_of_turn_silence_when_confident
return self

View File

@@ -26,6 +26,8 @@ from pipecat.frames.frames import (
InterimTranscriptionFrame,
StartFrame,
TranscriptionFrame,
UserStartedSpeakingFrame,
UserStoppedSpeakingFrame,
VADUserStartedSpeakingFrame,
VADUserStoppedSpeakingFrame,
)
@@ -41,6 +43,7 @@ from .models import (
AssemblyAIConnectionParams,
BaseMessage,
BeginMessage,
SpeechStartedMessage,
TerminationMessage,
TurnMessage,
)
@@ -54,6 +57,28 @@ except ModuleNotFoundError as e:
raise Exception(f"Missing module: {e}")
def map_language_from_assemblyai(language_code: str) -> Language:
"""Map AssemblyAI language codes to Pipecat Language enum.
AssemblyAI returns simple language codes like "es", "fr", etc.
This function maps them to the corresponding Language enum values.
Args:
language_code: AssemblyAI language code (e.g., "es", "fr", "de")
Returns:
Corresponding Language enum value, defaulting to Language.EN if not found.
"""
try:
# Try to match the language code directly
return Language(language_code.lower())
except ValueError:
logger.warning(
f"Unknown language code from AssemblyAI: {language_code}, defaulting to English"
)
return Language.EN
@dataclass
class AssemblyAISTTSettings(STTSettings):
"""Settings for the AssemblyAI STT service.
@@ -87,6 +112,8 @@ class AssemblyAISTTService(WebsocketSTTService):
api_endpoint_base_url: str = "wss://streaming.assemblyai.com/v3/ws",
connection_params: AssemblyAIConnectionParams = AssemblyAIConnectionParams(),
vad_force_turn_endpoint: bool = True,
should_interrupt: bool = True,
speaker_format: Optional[str] = None,
ttfs_p99_latency: Optional[float] = ASSEMBLYAI_TTFS_P99,
**kwargs,
):
@@ -97,18 +124,66 @@ class AssemblyAISTTService(WebsocketSTTService):
language: Language code for transcription. Defaults to English (Language.EN).
api_endpoint_base_url: WebSocket endpoint URL. Defaults to AssemblyAI's streaming endpoint.
connection_params: Connection configuration parameters. Defaults to AssemblyAIConnectionParams().
vad_force_turn_endpoint: Whether to force turn endpoint on VAD stop. When True,
disables AssemblyAI's model-based turn detection and relies on external VAD
to trigger turn endpoints. Automatically sets end_of_turn_confidence_threshold=1.0
and max_turn_silence=2000 unless explicitly overridden. Defaults to True.
vad_force_turn_endpoint: Controls turn detection mode.
When True (Pipecat mode, default): Forces AssemblyAI to return finals ASAP
so Pipecat's turn detection (e.g., Smart Turn) decides when the user is done.
- min_turn_silence defaults to 100ms (user can override)
- max_turn_silence is ALWAYS set equal to min_turn_silence
- VAD stop sends ForceEndpoint as ceiling
- No UserStarted/StoppedSpeakingFrame emitted from STT
When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI's model
controls turn endings using built-in turn detection.
- Uses AssemblyAI API defaults for all parameters (unless user explicitly sets them)
- Respects all user-provided connection_params as-is
- Emits UserStarted/StoppedSpeakingFrame from STT
- No ForceEndpoint on VAD stop
should_interrupt: Whether to interrupt the bot when the user starts speaking
in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies
when using AssemblyAI's built-in turn detection. Defaults to True.
speaker_format: Optional format string for speaker labels when diarization is enabled.
Use {speaker} for speaker label and {text} for transcript text.
Example: "<{speaker}>{text}</{speaker}>" or "{speaker}: {text}"
If None, transcript text is not modified. Defaults to None.
ttfs_p99_latency: P99 latency from speech end to final transcript in seconds.
Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs: Additional arguments passed to parent STTService class.
"""
# When vad_force_turn_endpoint is enabled, configure connection params for manual
# turn detection mode (disable model-based turn detection)
# AssemblyAI turn detection mode (vad_force_turn_endpoint=False) requires the
# SpeechStarted event for reliable barge-in. Only u3-rt-pro supports
# this. Other models must use Pipecat turn detection.
is_u3_pro = connection_params.speech_model == "u3-rt-pro"
if not vad_force_turn_endpoint and not is_u3_pro:
raise ValueError(
f"AssemblyAI turn detection mode (vad_force_turn_endpoint=False) requires "
f"u3-rt-pro for SpeechStarted support. Either set "
f"vad_force_turn_endpoint=True for {connection_params.speech_model}, "
f"or use speech_model='u3-rt-pro'."
)
# Validate that prompt and keyterms_prompt are not both set
if connection_params.prompt is not None and connection_params.keyterms_prompt is not None:
raise ValueError(
"The prompt and keyterms_prompt parameters cannot be used in the same request. "
"Please choose either one or the other based on your use case. When you use "
"keyterms_prompt, your boosted words are appended to the default prompt automatically. "
"Or to boost within prompt: <prompt> + Make sure to boost the words <keyterms> in the audio. "
"For more info go to: https://www.assemblyai.com/docs/streaming/universal-3-pro"
)
# Warn if user sets a custom prompt (recommend testing without one first)
if connection_params.prompt is not None:
logger.warning(
"Custom prompt detected. Prompting is a beta feature. We recommend testing "
"with no prompt first, as this will use our optimized default prompt for "
"voice agents. Bad prompts may lead to bad results. If you'd like to create "
"your own prompt, check out our prompting guide at: "
"https://www.assemblyai.com/docs/streaming/prompting"
)
# When vad_force_turn_endpoint is enabled, configure connection params
# for Pipecat turn detection mode (fast finals for smart turn analyzer)
if vad_force_turn_endpoint:
connection_params = self._configure_manual_turn_mode(connection_params)
connection_params = self._configure_pipecat_turn_mode(connection_params, is_u3_pro)
super().__init__(
sample_rate=connection_params.sample_rate,
@@ -124,6 +199,8 @@ class AssemblyAISTTService(WebsocketSTTService):
self._api_key = api_key
self._api_endpoint_base_url = api_endpoint_base_url
self._vad_force_turn_endpoint = vad_force_turn_endpoint
self._should_interrupt = should_interrupt
self._speaker_format = speaker_format
self._termination_event = asyncio.Event()
self._received_termination = False
@@ -135,45 +212,64 @@ class AssemblyAISTTService(WebsocketSTTService):
self._chunk_size_ms = 50
self._chunk_size_bytes = 0
def _configure_manual_turn_mode(
self, connection_params: AssemblyAIConnectionParams
) -> AssemblyAIConnectionParams:
"""Configure connection params for manual turn detection mode.
self._user_speaking = False
When vad_force_turn_endpoint is enabled, we want to disable AssemblyAI's
model-based turn detection and rely on external VAD. This requires:
- end_of_turn_confidence_threshold=1.0 (disable semantic turn detection)
- max_turn_silence=2000 (high value since VAD handles turn endings)
def _configure_pipecat_turn_mode(
self, connection_params: AssemblyAIConnectionParams, is_u3_pro: bool
) -> AssemblyAIConnectionParams:
"""Configure connection params for Pipecat turn detection mode.
When vad_force_turn_endpoint is enabled, force AssemblyAI to return
finals as fast as possible so Pipecat's smart turn analyzer can decide
when the user is done speaking. VAD stop is the absolute ceiling.
u3-rt-pro:
- min_turn_silence defaults to 100ms (user can override)
- max_turn_silence is ALWAYS set equal to min_turn_silence
to avoid double turn detection (AssemblyAI + Pipecat both analyzing)
- If user sets max_turn_silence, it's ignored with a warning
- end_of_turn_confidence_threshold: not set (API default)
universal-streaming-*:
- end_of_turn_confidence_threshold=0.0 (disable semantic turn detection)
- min_turn_silence=160
- max_turn_silence: not set (API default)
Args:
connection_params: The user-provided connection parameters.
is_u3_pro: Whether using u3-rt-pro model.
Returns:
Updated connection parameters configured for manual turn mode.
Updated connection parameters configured for Pipecat turn mode.
"""
updates = {}
# Check end_of_turn_confidence_threshold
if connection_params.end_of_turn_confidence_threshold is None:
updates["end_of_turn_confidence_threshold"] = 1.0
elif connection_params.end_of_turn_confidence_threshold != 1.0:
logger.warning(
f"vad_force_turn_endpoint is enabled but end_of_turn_confidence_threshold "
f"is set to {connection_params.end_of_turn_confidence_threshold}. "
f"For manual turn detection mode, this should be 1.0 to disable "
f"model-based turn detection. The current value will be used."
)
if is_u3_pro:
# u3-rt-pro: Synchronize max_turn_silence with min_turn_silence
min_silence = connection_params.min_turn_silence
if min_silence is None:
min_silence = 100
# Check max_turn_silence
if connection_params.max_turn_silence is None:
updates["max_turn_silence"] = 2000
elif connection_params.max_turn_silence < 1000:
logger.warning(
f"vad_force_turn_endpoint is enabled but max_turn_silence is set to "
f"{connection_params.max_turn_silence}ms. With manual turn detection, "
f"a higher value (e.g., 2000ms) is recommended to avoid premature "
f"turn endings. The current value will be used."
)
# Warn if user set max_turn_silence (will be overridden)
if connection_params.max_turn_silence is not None:
logger.warning(
f"Your max_turn_silence value ({connection_params.max_turn_silence}ms) will be "
f"OVERRIDDEN in Pipecat mode (vad_force_turn_endpoint=True). It will be set to "
f"{min_silence}ms (matching min_turn_silence) and SENT to "
f"AssemblyAI to avoid double turn detection. To use your max_turn_silence as-is, "
f"switch to AssemblyAI turn detection mode (vad_force_turn_endpoint=False)."
)
updates = {
"min_turn_silence": min_silence,
"max_turn_silence": min_silence,
}
else:
# universal-streaming: Different configuration (works differently)
updates = {
"end_of_turn_confidence_threshold": 1.0,
"min_turn_silence": 160,
}
# Apply updates if any
if updates:
@@ -190,9 +286,14 @@ class AssemblyAISTTService(WebsocketSTTService):
return True
async def _update_settings(self, delta: STTSettings) -> dict[str, Any]:
"""Apply a settings delta.
"""Apply a settings delta and send UpdateConfiguration if connected.
Settings are stored but not applied to the active connection.
Stores settings changes and sends UpdateConfiguration message to AssemblyAI
without reconnecting. Supports updating:
- keyterms_prompt: List of terms to boost (can be empty array to clear)
- prompt: Custom prompt text (u3-rt-pro only)
- max_turn_silence: Maximum silence before forcing turn end
- min_turn_silence: Silence before EOT check
Args:
delta: A :class:`STTSettings` (or ``AssemblyAISTTSettings``) delta.
@@ -205,18 +306,72 @@ class AssemblyAISTTService(WebsocketSTTService):
if not changed:
return changed
# TODO: someday we could reconnect here to apply updated settings.
# Code might look something like the below:
# # Re-apply manual turn mode config if vad_force_turn_endpoint is active
# # and connection_params were updated.
# if self._vad_force_turn_endpoint and "connection_params" in changed:
# self._settings.connection_params = self._configure_manual_turn_mode(
# self._settings.connection_params
# )
# await self._disconnect()
# await self._connect()
# If websocket is connected, send UpdateConfiguration for supported params
if (
self._websocket
and self._websocket.state is State.OPEN
and "connection_params" in changed
):
# Build UpdateConfiguration message
update_config = {"type": "UpdateConfiguration"}
conn_params = self._settings.connection_params
self._warn_unhandled_updated_settings(changed)
# Get the old connection_params to see what changed
old_conn_params = changed.get("connection_params")
# Check each potentially changed parameter
if (
old_conn_params is None
or conn_params.keyterms_prompt != old_conn_params.keyterms_prompt
):
if conn_params.keyterms_prompt is not None:
update_config["keyterms_prompt"] = conn_params.keyterms_prompt
logger.info(f"Updating keyterms_prompt to: {conn_params.keyterms_prompt}")
if old_conn_params is None or conn_params.prompt != old_conn_params.prompt:
if conn_params.prompt is not None:
if conn_params.speech_model != "u3-rt-pro":
logger.warning(
f"prompt parameter is only supported with u3-rt-pro model, "
f"current model is {conn_params.speech_model}"
)
else:
update_config["prompt"] = conn_params.prompt
logger.info(f"Updating prompt")
if (
old_conn_params is None
or conn_params.max_turn_silence != old_conn_params.max_turn_silence
):
if conn_params.max_turn_silence is not None:
update_config["max_turn_silence"] = conn_params.max_turn_silence
logger.info(f"Updating max_turn_silence to: {conn_params.max_turn_silence}ms")
if (
old_conn_params is None
or conn_params.min_turn_silence != old_conn_params.min_turn_silence
):
if conn_params.min_turn_silence is not None:
update_config["min_turn_silence"] = conn_params.min_turn_silence
logger.info(f"Updating min_turn_silence to: {conn_params.min_turn_silence}ms")
# Send update if we have parameters to update
if len(update_config) > 1: # More than just "type"
try:
await self._websocket.send(json.dumps(update_config))
logger.info(f"Sent UpdateConfiguration: {update_config}")
except Exception as e:
logger.error(f"Failed to send UpdateConfiguration: {e}")
elif "connection_params" in changed:
logger.warning(
"Connection params changed but WebSocket not connected. "
"Settings will be applied on next connection."
)
# Warn about other settings that can't be changed dynamically
other_changes = {k: v for k, v in changed.items() if k not in ["connection_params"]}
if other_changes:
self._warn_unhandled_updated_settings(other_changes)
return changed
@@ -283,6 +438,7 @@ class AssemblyAISTTService(WebsocketSTTService):
and self._websocket
and self._websocket.state is State.OPEN
):
self.request_finalize()
await self._websocket.send(json.dumps({"type": "ForceEndpoint"}))
await self.start_processing_metrics()
@@ -295,6 +451,9 @@ class AssemblyAISTTService(WebsocketSTTService):
"""Build WebSocket URL with query parameters using urllib.parse.urlencode."""
params = {}
for k, v in self._settings.connection_params.model_dump().items():
# Skip deprecated parameter - it's been migrated to min_turn_silence
if k == "min_end_of_turn_silence_when_confident":
continue
if v is not None:
if k == "keyterms_prompt":
params[k] = json.dumps(v)
@@ -421,6 +580,9 @@ class AssemblyAISTTService(WebsocketSTTService):
async for message in self._get_websocket():
try:
data = json.loads(message)
# Log raw JSON for Turn messages to debug speaker_label
if data.get("type") == "Turn":
logger.trace(f"{self} RAW JSON from AssemblyAI: {json.dumps(data, indent=2)}")
await self._handle_message(data)
except json.JSONDecodeError:
logger.warning(f"Received non-JSON message: {message}")
@@ -433,6 +595,8 @@ class AssemblyAISTTService(WebsocketSTTService):
return BeginMessage.model_validate(message)
elif msg_type == "Turn":
return TurnMessage.model_validate(message)
elif msg_type == "SpeechStarted":
return SpeechStartedMessage.model_validate(message)
elif msg_type == "Termination":
return TerminationMessage.model_validate(message)
else:
@@ -449,11 +613,33 @@ class AssemblyAISTTService(WebsocketSTTService):
)
elif isinstance(parsed_message, TurnMessage):
await self._handle_transcription(parsed_message)
elif isinstance(parsed_message, SpeechStartedMessage):
await self._handle_speech_started(parsed_message)
elif isinstance(parsed_message, TerminationMessage):
await self._handle_termination(parsed_message)
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
async def _handle_speech_started(self, message: SpeechStartedMessage):
"""Handle SpeechStarted event — fast barge-in for AssemblyAI turn detection.
Broadcasts UserStartedSpeakingFrame to signal the start of user
speech, then pushes an interruption to cancel any bot audio.
SpeechStarted fires before any transcript arrives, so the turn
is cleanly started before any transcription frames are pushed.
Only applies when using AssemblyAI's built-in turn detection. When using
Pipecat turn detection, VAD + smart turn analyzer handle interruptions.
"""
if self._vad_force_turn_endpoint:
return # Pipecat mode: handled by aggregator
await self.start_processing_metrics()
await self.broadcast_frame(UserStartedSpeakingFrame)
if self._should_interrupt:
await self.push_interruption_task_frame_and_wait()
self._user_speaking = True
async def _handle_termination(self, message: TerminationMessage):
"""Handle termination message."""
self._received_termination = True
@@ -466,30 +652,109 @@ class AssemblyAISTTService(WebsocketSTTService):
await self.push_frame(EndFrame())
async def _handle_transcription(self, message: TurnMessage):
"""Handle transcription results."""
"""Handle transcription results with two turn detection modes.
Pipecat turn detection (vad_force_turn_endpoint=True):
- No UserStarted/StoppedSpeakingFrame from STT
- end_of_turn → TranscriptionFrame (finalized set by base class
if this is a ForceEndpoint response)
- else → InterimTranscriptionFrame
AssemblyAI turn detection (vad_force_turn_endpoint=False):
- UserStartedSpeakingFrame on first transcript
- end_of_turn → TranscriptionFrame + UserStoppedSpeakingFrame
- else → InterimTranscriptionFrame
"""
if not message.transcript:
return
if message.end_of_turn and (
not self._settings.connection_params.formatted_finals or message.turn_is_formatted
):
await self.push_frame(
TranscriptionFrame(
message.transcript,
self._user_id,
time_now_iso8601(),
self._settings.language,
message,
# Use detected language if available with sufficient confidence
language = Language.EN
if message.language_code and message.language_confidence:
if message.language_confidence >= 0.7:
language = map_language_from_assemblyai(message.language_code)
else:
logger.warning(
f"Low language detection confidence ({message.language_confidence:.2f}) "
f"for language '{message.language_code}', falling back to English"
)
# Handle speaker diarization
speaker_id = self._user_id
transcript_text = message.transcript
if message.speaker:
speaker_id = message.speaker
# Format transcript with speaker labels if format string provided
if self._speaker_format:
transcript_text = self._speaker_format.format(
speaker=message.speaker, text=message.transcript
)
# Determine if this is a final turn from AssemblyAI
is_final_turn = message.end_of_turn and (
not self._settings.connection_params.format_turns or message.turn_is_formatted
)
if self._vad_force_turn_endpoint:
# --- Pipecat turn detection mode ---
# No UserStarted/StoppedSpeakingFrame — VAD + smart turn analyzer handle this
if is_final_turn:
finalize_confirmed = bool(message.turn_is_formatted)
if finalize_confirmed:
self.confirm_finalize()
logger.debug(f'{self} Transcript: "{transcript_text}"')
await self.push_frame(
TranscriptionFrame(
transcript_text,
speaker_id,
time_now_iso8601(),
language,
message,
)
)
await self._trace_transcription(transcript_text, True, language)
await self.stop_processing_metrics()
else:
await self.push_frame(
InterimTranscriptionFrame(
transcript_text,
speaker_id,
time_now_iso8601(),
language,
message,
)
)
)
await self._trace_transcription(message.transcript, True, self._settings.language)
await self.stop_processing_metrics()
else:
await self.push_frame(
InterimTranscriptionFrame(
message.transcript,
self._user_id,
time_now_iso8601(),
self._settings.language,
message,
# --- AssemblyAI turn detection mode ---
# SpeechStarted always arrives before transcripts with u3-rt-pro,
# so UserStartedSpeakingFrame is guaranteed to be broadcast first.
if is_final_turn:
# AssemblyAI controls finalization, just mark as finalized
await self.push_frame(
TranscriptionFrame(
transcript_text,
speaker_id,
time_now_iso8601(),
language,
message,
finalized=True,
)
)
await self._trace_transcription(transcript_text, True, language)
await self.stop_processing_metrics()
# AAI is authoritative — emit UserStoppedSpeakingFrame immediately.
# broadcast_frame pushes downstream (same queue as TranscriptionFrame
# above, so ordering is preserved) and upstream.
await self.broadcast_frame(UserStoppedSpeakingFrame)
self._user_speaking = False
else:
await self.push_frame(
InterimTranscriptionFrame(
transcript_text,
speaker_id,
time_now_iso8601(),
language,
message,
)
)
)

View File

@@ -35,6 +35,7 @@ from pipecat.utils.tracing.service_decorators import traced_stt
try:
from azure.cognitiveservices.speech import (
CancellationReason,
ResultReason,
SpeechConfig,
SpeechRecognizer,
@@ -209,6 +210,7 @@ class AzureSTTService(STTService):
)
self._speech_recognizer.recognizing.connect(self._on_handle_recognizing)
self._speech_recognizer.recognized.connect(self._on_handle_recognized)
self._speech_recognizer.canceled.connect(self._on_handle_canceled)
self._speech_recognizer.start_continuous_recognition_async()
except Exception as e:
await self.push_error(
@@ -280,3 +282,13 @@ class AzureSTTService(STTService):
result=event,
)
asyncio.run_coroutine_threadsafe(self.push_frame(frame), self.get_event_loop())
def _on_handle_canceled(self, event):
details = event.result.cancellation_details
if details.reason == CancellationReason.Error:
error_msg = f"Azure STT recognition canceled: {details.reason}"
if details.error_details:
error_msg += f" - {details.error_details}"
asyncio.run_coroutine_threadsafe(
self.push_error(error_msg=error_msg), self.get_event_loop()
)

View File

@@ -561,9 +561,13 @@ class AzureTTSService(TTSService, AzureBaseTTSService):
# User cancellation (from interruption) is expected, not an error
if reason == CancellationReason.CancelledByUser:
logger.debug(f"{self}: Speech synthesis canceled by user (interruption)")
self._audio_queue.put_nowait(None)
else:
logger.warning(f"{self}: Speech synthesis canceled: {reason}")
self._audio_queue.put_nowait(None)
details = evt.result.cancellation_details
error_msg = f"Azure TTS synthesis canceled: {reason}"
if details.error_details:
error_msg += f" - {details.error_details}"
self._audio_queue.put_nowait(Exception(error_msg))
async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
"""Push a frame and handle state changes.
@@ -676,6 +680,9 @@ class AzureTTSService(TTSService, AzureBaseTTSService):
chunk = await self._audio_queue.get()
if chunk is None: # End of stream
break
if isinstance(chunk, Exception): # Error from _handle_canceled
yield ErrorFrame(error=str(chunk))
break
if self._first_chunk:
await self.stop_ttfb_metrics()

View File

@@ -9,6 +9,7 @@ import sys
from pipecat.services import DeprecatedModuleProxy
from .flux import *
from .sagemaker import *
from .stt import *
from .tts import *

View File

@@ -675,7 +675,7 @@ class DeepgramFluxSTTService(WebsocketSTTService):
self._user_is_speaking = True
await self.broadcast_frame(UserStartedSpeakingFrame)
if self._should_interrupt:
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
await self.start_metrics()
await self._call_event_handler("on_start_of_turn", transcript)
if transcript:

View File

@@ -0,0 +1,448 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Deepgram speech-to-text service for AWS SageMaker.
This module provides a Pipecat STT service that connects to Deepgram models
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
low-latency real-time transcription with support for interim results, multiple
languages, and various Deepgram features.
"""
import asyncio
import json
from dataclasses import dataclass
from typing import Any, AsyncGenerator, Dict, Optional
from loguru import logger
from pipecat.frames.frames import (
CancelFrame,
EndFrame,
ErrorFrame,
Frame,
InterimTranscriptionFrame,
StartFrame,
TranscriptionFrame,
VADUserStartedSpeakingFrame,
VADUserStoppedSpeakingFrame,
)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
from pipecat.services.deepgram.stt import _DeepgramSTTSettingsBase
from pipecat.services.settings import STTSettings
from pipecat.services.stt_latency import DEEPGRAM_SAGEMAKER_TTFS_P99
from pipecat.services.stt_service import STTService
from pipecat.transcriptions.language import Language
from pipecat.utils.time import time_now_iso8601
from pipecat.utils.tracing.service_decorators import traced_stt
try:
from deepgram import LiveOptions
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error(
"In order to use DeepgramSageMakerSTTService, you need to `pip install pipecat-ai[deepgram,sagemaker]`."
)
raise Exception(f"Missing module: {e}")
@dataclass
class DeepgramSageMakerSTTSettings(_DeepgramSTTSettingsBase):
"""Settings for the Deepgram SageMaker STT service.
See ``_DeepgramSTTSettingsBase`` for full documentation.
"""
pass
class DeepgramSageMakerSTTService(STTService):
"""Deepgram speech-to-text service for AWS SageMaker.
Provides real-time speech recognition using Deepgram models deployed on
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
transcription with support for interim results, speaker diarization, and
multiple languages.
Requirements:
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
- A deployed SageMaker endpoint with Deepgram model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
- Deepgram SDK for LiveOptions configuration
Example::
stt = DeepgramSageMakerSTTService(
endpoint_name="my-deepgram-endpoint",
region="us-east-2",
live_options=LiveOptions(
model="nova-3",
language="en",
interim_results=True,
punctuate=True,
),
)
"""
_settings: DeepgramSageMakerSTTSettings
def __init__(
self,
*,
endpoint_name: str,
region: str,
sample_rate: Optional[int] = None,
live_options: Optional[LiveOptions] = None,
ttfs_p99_latency: Optional[float] = DEEPGRAM_SAGEMAKER_TTFS_P99,
**kwargs,
):
"""Initialize the Deepgram SageMaker STT service.
Args:
endpoint_name: Name of the SageMaker endpoint with Deepgram model
deployed (e.g., "my-deepgram-nova-3-endpoint").
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
sample_rate: Audio sample rate in Hz. If None, uses value from
live_options or defaults to the value from StartFrame.
live_options: Deepgram LiveOptions configuration. Treated as a
delta from a set of sensible defaults — only the fields you
set are overridden; all others keep their default values.
ttfs_p99_latency: P99 latency from speech end to final transcript in seconds.
Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs: Additional arguments passed to the parent STTService.
"""
sample_rate = sample_rate or (live_options.sample_rate if live_options else None)
default_options = LiveOptions(
encoding="linear16",
language=Language.EN,
model="nova-3",
channels=1,
interim_results=True,
punctuate=True,
)
settings = DeepgramSageMakerSTTSettings(
model=default_options.model,
language=default_options.language,
live_options=default_options,
)
if live_options:
settings._merge_live_options_delta(live_options)
super().__init__(
sample_rate=sample_rate,
ttfs_p99_latency=ttfs_p99_latency,
settings=settings,
**kwargs,
)
self._endpoint_name = endpoint_name
self._region = region
self._client: Optional[SageMakerBidiClient] = None
self._response_task: Optional[asyncio.Task] = None
self._keepalive_task: Optional[asyncio.Task] = None
def can_generate_metrics(self) -> bool:
"""Check if this service can generate processing metrics.
Returns:
True, as Deepgram SageMaker service supports metrics generation.
"""
return True
async def _update_settings(self, delta: STTSettings) -> dict[str, Any]:
"""Apply a settings delta and warn about unhandled changes."""
changed = await super()._update_settings(delta)
if not changed:
return changed
# TODO: someday we could reconnect here to apply updated settings.
# Code might look something like the below:
# await self._disconnect()
# await self._connect()
self._warn_unhandled_updated_settings(changed)
return changed
async def start(self, frame: StartFrame):
"""Start the Deepgram SageMaker STT service.
Args:
frame: The start frame containing initialization parameters.
"""
await super().start(frame)
await self._connect()
async def stop(self, frame: EndFrame):
"""Stop the Deepgram SageMaker STT service.
Args:
frame: The end frame.
"""
await super().stop(frame)
await self._disconnect()
async def cancel(self, frame: CancelFrame):
"""Cancel the Deepgram SageMaker STT service.
Args:
frame: The cancel frame.
"""
await super().cancel(frame)
await self._disconnect()
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
"""Send audio data to Deepgram for transcription.
Args:
audio: Raw audio bytes to transcribe.
Yields:
Frame: None (transcription results come via BiDi stream callbacks).
"""
if self._client and self._client.is_active:
try:
await self._client.send_audio_chunk(audio)
except Exception as e:
yield ErrorFrame(error=f"Unknown error occurred: {e}")
yield None
async def _connect(self):
"""Connect to the SageMaker endpoint and start the BiDi session.
Builds the Deepgram query string from settings, creates the BiDi client,
starts the streaming session, and launches background tasks for processing
responses and sending KeepAlive messages.
"""
logger.debug("Connecting to Deepgram on SageMaker...")
live_options = LiveOptions(
**{**self._settings.live_options.to_dict(), "sample_rate": self.sample_rate}
)
# Build query string from live_options, converting booleans to strings
query_params = {}
for key, value in live_options.to_dict().items():
if value is not None:
# Convert boolean values to lowercase strings for Deepgram API
if isinstance(value, bool):
query_params[key] = str(value).lower()
else:
query_params[key] = str(value)
query_string = "&".join(f"{k}={v}" for k, v in query_params.items())
# Create BiDi client
self._client = SageMakerBidiClient(
endpoint_name=self._endpoint_name,
region=self._region,
model_invocation_path="v1/listen",
model_query_string=query_string,
)
try:
# Start the session
await self._client.start_session()
# Start processing responses in the background
self._response_task = self.create_task(self._process_responses())
# Start keepalive task to maintain connection
self._keepalive_task = self.create_task(self._send_keepalive())
logger.debug("Connected to Deepgram on SageMaker")
await self._call_event_handler("on_connected")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
await self._call_event_handler("on_connection_error", str(e))
async def _disconnect(self):
"""Disconnect from the SageMaker endpoint.
Sends a CloseStream message to Deepgram, cancels background tasks
(KeepAlive and response processing), and closes the BiDi session.
Safe to call multiple times.
"""
if self._client and self._client.is_active:
logger.debug("Disconnecting from Deepgram on SageMaker...")
# Send CloseStream message to Deepgram
try:
await self._client.send_json({"type": "CloseStream"})
except Exception as e:
logger.warning(f"Failed to send CloseStream message: {e}")
# Cancel keepalive task
if self._keepalive_task and not self._keepalive_task.done():
await self.cancel_task(self._keepalive_task)
# Cancel response processing task
if self._response_task and not self._response_task.done():
await self.cancel_task(self._response_task)
# Close the BiDi session
await self._client.close_session()
logger.debug("Disconnected from Deepgram on SageMaker")
await self._call_event_handler("on_disconnected")
async def _send_keepalive(self):
"""Send periodic KeepAlive messages to maintain the connection.
Sends a KeepAlive JSON message to Deepgram every 5 seconds while the
connection is active. This prevents the connection from timing out during
periods of silence.
"""
while self._client and self._client.is_active:
await asyncio.sleep(5)
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "KeepAlive"})
except Exception as e:
logger.warning(f"Failed to send KeepAlive: {e}")
async def _process_responses(self):
"""Process streaming responses from Deepgram on SageMaker.
Continuously receives responses from the BiDi stream, decodes the payload,
parses JSON responses from Deepgram, and processes transcription results.
Runs as a background task until the connection is closed or cancelled.
"""
try:
while self._client and self._client.is_active:
result = await self._client.receive_response()
if result is None:
break
# Check if this is a PayloadPart with bytes
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
if result.value.bytes_:
response_data = result.value.bytes_.decode("utf-8")
try:
# Parse JSON response from Deepgram
parsed = json.loads(response_data)
# Extract and process transcript if available
if "channel" in parsed:
await self._handle_transcript_response(parsed)
except json.JSONDecodeError:
logger.warning(f"Non-JSON response: {response_data}")
except asyncio.CancelledError:
logger.debug("Response processor cancelled")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
finally:
logger.debug("Response processor stopped")
async def _handle_transcript_response(self, parsed: dict):
"""Handle a transcript response from Deepgram.
Extracts the transcript text, determines if it's final or interim, extracts
language information, and pushes the appropriate frame (TranscriptionFrame
or InterimTranscriptionFrame) downstream.
Args:
parsed: The parsed JSON response from Deepgram containing channel,
alternatives, transcript, and metadata.
"""
alternatives = parsed.get("channel", {}).get("alternatives", [])
if not alternatives or not alternatives[0].get("transcript"):
return
transcript = alternatives[0]["transcript"]
if not transcript.strip():
return
is_final = parsed.get("is_final", False)
# Extract language if available
language = None
if alternatives[0].get("languages"):
language = alternatives[0]["languages"][0]
language = Language(language)
if is_final:
# Check if this response is from a finalize() call.
# Only mark as finalized when both we requested it AND Deepgram confirms it.
from_finalize = parsed.get("from_finalize", False)
if from_finalize:
self.confirm_finalize()
await self.push_frame(
TranscriptionFrame(
transcript,
self._user_id,
time_now_iso8601(),
language,
result=parsed,
)
)
await self._handle_transcription(transcript, is_final, language)
await self.stop_processing_metrics()
else:
# Interim transcription
await self.push_frame(
InterimTranscriptionFrame(
transcript,
self._user_id,
time_now_iso8601(),
language,
result=parsed,
)
)
@traced_stt
async def _handle_transcription(
self, transcript: str, is_final: bool, language: Optional[Language] = None
):
"""Handle a transcription result with tracing.
This method is decorated with @traced_stt for observability and tracing
integration. The actual transcription processing is handled by the parent
class and observers.
Args:
transcript: The transcribed text.
is_final: Whether this is a final transcription result.
language: The detected language of the transcription, if available.
"""
pass
async def _start_metrics(self):
"""Start processing metrics collection."""
await self.start_processing_metrics()
async def process_frame(self, frame: Frame, direction: FrameDirection):
"""Process frames with Deepgram SageMaker-specific handling.
Args:
frame: The frame to process.
direction: The direction of frame processing.
"""
await super().process_frame(frame, direction)
# Start metrics when user starts speaking (if VAD is not provided by Deepgram)
if isinstance(frame, VADUserStartedSpeakingFrame):
await self._start_metrics()
elif isinstance(frame, VADUserStoppedSpeakingFrame):
# https://developers.deepgram.com/docs/finalize
# Mark that we're awaiting a from_finalize response
self.request_finalize()
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "Finalize"})
except Exception as e:
logger.warning(f"Error sending Finalize message: {e}")
logger.trace(f"Triggered finalize event on: {frame.name=}, {direction=}")

View File

@@ -0,0 +1,360 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Deepgram text-to-speech service for AWS SageMaker.
This module provides a Pipecat TTS service that connects to Deepgram models
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
low-latency real-time speech synthesis with support for interruptions and
streaming audio output.
"""
import asyncio
import json
from dataclasses import dataclass, field
from typing import Any, AsyncGenerator, Optional
from loguru import logger
from pipecat.frames.frames import (
BotStoppedSpeakingFrame,
CancelFrame,
EndFrame,
ErrorFrame,
Frame,
InterruptionFrame,
LLMFullResponseEndFrame,
StartFrame,
TTSAudioRawFrame,
TTSStartedFrame,
)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
from pipecat.services.settings import NOT_GIVEN, TTSSettings, _NotGiven
from pipecat.services.tts_service import TTSService
from pipecat.utils.tracing.service_decorators import traced_tts
@dataclass
class DeepgramSageMakerTTSSettings(TTSSettings):
"""Settings for Deepgram SageMaker TTS service.
Parameters:
encoding: Audio encoding format (e.g. "linear16").
"""
encoding: str | _NotGiven = field(default_factory=lambda: NOT_GIVEN)
class DeepgramSageMakerTTSService(TTSService):
"""Deepgram text-to-speech service for AWS SageMaker.
Provides real-time speech synthesis using Deepgram models deployed on
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
audio generation with support for interruptions via the Clear message.
Requirements:
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
- A deployed SageMaker endpoint with Deepgram TTS model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
- ``pipecat-ai[sagemaker]`` installed
Example::
tts = DeepgramSageMakerTTSService(
endpoint_name="my-deepgram-tts-endpoint",
region="us-east-2",
voice="aura-2-helena-en",
)
"""
_settings: DeepgramSageMakerTTSSettings
def __init__(
self,
*,
endpoint_name: str,
region: str,
voice: str = "aura-2-helena-en",
sample_rate: Optional[int] = None,
encoding: str = "linear16",
**kwargs,
):
"""Initialize the Deepgram SageMaker TTS service.
Args:
endpoint_name: Name of the SageMaker endpoint with Deepgram TTS model
deployed (e.g., "my-deepgram-tts-endpoint").
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
voice: Voice model to use for synthesis. Defaults to "aura-2-helena-en".
sample_rate: Audio sample rate in Hz. If None, uses the value from StartFrame.
encoding: Audio encoding format. Defaults to "linear16".
**kwargs: Additional arguments passed to the parent TTSService.
"""
super().__init__(
sample_rate=sample_rate,
push_stop_frames=True,
pause_frame_processing=True,
append_trailing_space=True,
settings=DeepgramSageMakerTTSSettings(
model=voice,
voice=voice,
language=None,
encoding=encoding,
),
**kwargs,
)
self._endpoint_name = endpoint_name
self._region = region
self._client: Optional[SageMakerBidiClient] = None
self._response_task: Optional[asyncio.Task] = None
self._context_id: Optional[str] = None
self._ttfb_started: bool = False
def can_generate_metrics(self) -> bool:
"""Check if this service can generate processing metrics.
Returns:
True, as Deepgram SageMaker TTS service supports metrics generation.
"""
return True
async def start(self, frame: StartFrame):
"""Start the Deepgram SageMaker TTS service.
Args:
frame: The start frame containing initialization parameters.
"""
await super().start(frame)
await self._connect()
async def stop(self, frame: EndFrame):
"""Stop the Deepgram SageMaker TTS service.
Args:
frame: The end frame.
"""
await super().stop(frame)
await self._disconnect()
async def cancel(self, frame: CancelFrame):
"""Cancel the Deepgram SageMaker TTS service.
Args:
frame: The cancel frame.
"""
await super().cancel(frame)
await self._disconnect()
async def process_frame(self, frame: Frame, direction: FrameDirection):
"""Process frames with special handling for LLM response end.
Args:
frame: The frame to process.
direction: The direction of frame processing.
"""
await super().process_frame(frame, direction)
if isinstance(frame, (LLMFullResponseEndFrame, EndFrame)):
await self.flush_audio()
elif isinstance(frame, BotStoppedSpeakingFrame):
self._ttfb_started = False
async def _connect(self):
"""Connect to the SageMaker endpoint and start the BiDi session.
Builds the Deepgram TTS query string, creates the BiDi client,
starts the streaming session, and launches a background task for processing
responses.
"""
logger.debug("Connecting to Deepgram TTS on SageMaker...")
query_string = (
f"model={self._settings.voice}&encoding={self._settings.encoding}"
f"&sample_rate={self.sample_rate}"
)
self._client = SageMakerBidiClient(
endpoint_name=self._endpoint_name,
region=self._region,
model_invocation_path="v1/speak",
model_query_string=query_string,
)
try:
await self._client.start_session()
self._response_task = self.create_task(self._process_responses())
logger.debug("Connected to Deepgram TTS on SageMaker")
await self._call_event_handler("on_connected")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
await self._call_event_handler("on_connection_error", str(e))
async def _disconnect(self):
"""Disconnect from the SageMaker endpoint.
Sends a Close message to Deepgram, cancels the response processing task,
and closes the BiDi session. Safe to call multiple times.
"""
if self._client and self._client.is_active:
logger.debug("Disconnecting from Deepgram TTS on SageMaker...")
try:
await self._client.send_json({"type": "Close"})
except Exception as e:
logger.warning(f"Failed to send Close message: {e}")
if self._response_task and not self._response_task.done():
await self.cancel_task(self._response_task)
await self._client.close_session()
logger.debug("Disconnected from Deepgram TTS on SageMaker")
await self._call_event_handler("on_disconnected")
async def _update_settings(self, delta: TTSSettings) -> dict[str, Any]:
"""Apply a settings delta and reconnect if necessary.
Since all settings are part of the SageMaker session query string,
any setting change requires reconnecting to apply the new values.
"""
changed = await super()._update_settings(delta)
if not changed:
return changed
# Deepgram uses voice as the model, so keep them in sync for metrics
if "voice" in changed:
self._settings.model = self._settings.voice
self._sync_model_name_to_metrics()
# TODO: someday we could reconnect here to apply updated settings.
# Code might look something like the below:
# await self._disconnect()
# await self._connect()
self._warn_unhandled_updated_settings(changed)
return changed
async def _process_responses(self):
"""Process streaming responses from Deepgram TTS on SageMaker.
Continuously receives responses from the BiDi stream. Attempts to decode
each payload as UTF-8 JSON for control messages (Flushed, Cleared, Metadata,
Warning). If decoding fails, treats the payload as raw audio bytes and pushes
a TTSAudioRawFrame downstream.
"""
try:
while self._client and self._client.is_active:
result = await self._client.receive_response()
if result is None:
break
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
if result.value.bytes_:
payload = result.value.bytes_
# Try to decode as JSON control message first
try:
response_data = payload.decode("utf-8")
parsed = json.loads(response_data)
msg_type = parsed.get("type")
if msg_type == "Metadata":
logger.trace(f"Received metadata: {parsed}")
elif msg_type == "Flushed":
logger.trace(f"Received Flushed: {parsed}")
elif msg_type == "Cleared":
logger.trace(f"Received Cleared: {parsed}")
elif msg_type == "Warning":
logger.warning(
f"{self} warning: "
f"{parsed.get('description', 'Unknown warning')}"
)
else:
logger.debug(f"Received unknown message type: {parsed}")
except (UnicodeDecodeError, json.JSONDecodeError):
# Not JSON — treat as raw audio bytes
await self.stop_ttfb_metrics()
frame = TTSAudioRawFrame(
payload,
self.sample_rate,
1,
context_id=self._context_id,
)
await self.push_frame(frame)
except asyncio.CancelledError:
logger.debug("TTS response processor cancelled")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
finally:
logger.debug("TTS response processor stopped")
async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
"""Handle interruption by sending Clear message to Deepgram.
The Clear message will clear Deepgram's internal text buffer and stop
sending audio, allowing for a new response to be generated.
"""
await super()._handle_interruption(frame, direction)
self._ttfb_started = False
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "Clear"})
except Exception as e:
logger.error(f"{self} error sending Clear message: {e}")
async def flush_audio(self):
"""Flush any pending audio synthesis by sending Flush command.
This should be called when the LLM finishes a complete response to force
generation of audio from Deepgram's internal text buffer.
"""
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "Flush"})
except Exception as e:
logger.error(f"{self} error sending Flush message: {e}")
@traced_tts
async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
"""Generate speech from text using Deepgram TTS on SageMaker.
Args:
text: The text to synthesize into speech.
context_id: The context ID for tracking audio frames.
Yields:
Frame: TTSStartedFrame, then None (audio comes asynchronously via
the response processor).
"""
logger.debug(f"{self}: Generating TTS [{text}]")
try:
if not self._ttfb_started:
await self.start_ttfb_metrics()
self._ttfb_started = True
await self.start_tts_usage_metrics(text)
yield TTSStartedFrame(context_id=context_id)
self._context_id = context_id
await self._client.send_json({"type": "Speak", "text": text})
yield None
except Exception as e:
yield ErrorFrame(error=f"Unknown error occurred: {e}")

View File

@@ -558,7 +558,7 @@ class DeepgramSTTService(STTService):
await self._call_event_handler("on_speech_started", message)
await self.broadcast_frame(UserStartedSpeakingFrame)
if self._should_interrupt:
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
async def _on_utterance_end(self, message):
await self._call_event_handler("on_utterance_end", message)

View File

@@ -4,444 +4,15 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Deepgram speech-to-text service for AWS SageMaker.
"""Deprecated: use ``pipecat.services.deepgram.sagemaker.stt`` instead."""
This module provides a Pipecat STT service that connects to Deepgram models
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
low-latency real-time transcription with support for interim results, multiple
languages, and various Deepgram features.
"""
import warnings
import asyncio
import json
from dataclasses import dataclass
from typing import Any, AsyncGenerator, Dict, Optional
from loguru import logger
from pipecat.frames.frames import (
CancelFrame,
EndFrame,
ErrorFrame,
Frame,
InterimTranscriptionFrame,
StartFrame,
TranscriptionFrame,
VADUserStartedSpeakingFrame,
VADUserStoppedSpeakingFrame,
warnings.warn(
"Module `pipecat.services.deepgram.stt_sagemaker` is deprecated, "
"use `pipecat.services.deepgram.sagemaker.stt` instead.",
DeprecationWarning,
stacklevel=2,
)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
from pipecat.services.deepgram.stt import DeepgramSTTSettings
from pipecat.services.settings import STTSettings
from pipecat.services.stt_latency import DEEPGRAM_SAGEMAKER_TTFS_P99
from pipecat.services.stt_service import STTService
from pipecat.transcriptions.language import Language
from pipecat.utils.time import time_now_iso8601
from pipecat.utils.tracing.service_decorators import traced_stt
try:
from pipecat.services.deepgram.stt import LiveOptions
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error(
"In order to use DeepgramSageMakerSTTService, you need to `pip install pipecat-ai[deepgram,sagemaker]`."
)
raise Exception(f"Missing module: {e}")
@dataclass
class DeepgramSageMakerSTTSettings(DeepgramSTTSettings):
"""Settings for the Deepgram SageMaker STT service.
See ``DeepgramSTTSettings`` for full documentation.
"""
pass
class DeepgramSageMakerSTTService(STTService):
"""Deepgram speech-to-text service for AWS SageMaker.
Provides real-time speech recognition using Deepgram models deployed on
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
transcription with support for interim results, speaker diarization, and
multiple languages.
Requirements:
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
- A deployed SageMaker endpoint with Deepgram model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
- Deepgram SDK for LiveOptions configuration
Example::
stt = DeepgramSageMakerSTTService(
endpoint_name="my-deepgram-endpoint",
region="us-east-2",
live_options=LiveOptions(
model="nova-3",
language="en",
interim_results=True,
punctuate=True,
),
)
"""
_settings: DeepgramSageMakerSTTSettings
def __init__(
self,
*,
endpoint_name: str,
region: str,
sample_rate: Optional[int] = None,
live_options: Optional[LiveOptions] = None,
ttfs_p99_latency: Optional[float] = DEEPGRAM_SAGEMAKER_TTFS_P99,
**kwargs,
):
"""Initialize the Deepgram SageMaker STT service.
Args:
endpoint_name: Name of the SageMaker endpoint with Deepgram model
deployed (e.g., "my-deepgram-nova-3-endpoint").
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
sample_rate: Audio sample rate in Hz. If None, uses value from
live_options or defaults to the value from StartFrame.
live_options: Deepgram LiveOptions configuration. Treated as a
delta from a set of sensible defaults — only the fields you
set are overridden; all others keep their default values.
ttfs_p99_latency: P99 latency from speech end to final transcript in seconds.
Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs: Additional arguments passed to the parent STTService.
"""
sample_rate = sample_rate or (live_options.sample_rate if live_options else None)
settings = DeepgramSageMakerSTTSettings(
model="nova-3",
language=Language.EN,
encoding="linear16",
channels=1,
interim_results=True,
punctuate=True,
)
if live_options:
lo_dict = live_options.to_dict()
delta = DeepgramSageMakerSTTSettings.from_mapping(
{k: v for k, v in lo_dict.items() if k != "sample_rate"}
)
settings.apply_update(delta)
super().__init__(
sample_rate=sample_rate,
ttfs_p99_latency=ttfs_p99_latency,
settings=settings,
**kwargs,
)
self._endpoint_name = endpoint_name
self._region = region
self._client: Optional[SageMakerBidiClient] = None
self._response_task: Optional[asyncio.Task] = None
self._keepalive_task: Optional[asyncio.Task] = None
def can_generate_metrics(self) -> bool:
"""Check if this service can generate processing metrics.
Returns:
True, as Deepgram SageMaker service supports metrics generation.
"""
return True
async def _update_settings(self, delta: STTSettings) -> dict[str, Any]:
"""Apply a settings delta and warn about unhandled changes."""
changed = await super()._update_settings(delta)
if not changed:
return changed
# TODO: someday we could reconnect here to apply updated settings.
# Code might look something like the below:
# await self._disconnect()
# await self._connect()
self._warn_unhandled_updated_settings(changed)
return changed
async def start(self, frame: StartFrame):
"""Start the Deepgram SageMaker STT service.
Args:
frame: The start frame containing initialization parameters.
"""
await super().start(frame)
await self._connect()
async def stop(self, frame: EndFrame):
"""Stop the Deepgram SageMaker STT service.
Args:
frame: The end frame.
"""
await super().stop(frame)
await self._disconnect()
async def cancel(self, frame: CancelFrame):
"""Cancel the Deepgram SageMaker STT service.
Args:
frame: The cancel frame.
"""
await super().cancel(frame)
await self._disconnect()
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
"""Send audio data to Deepgram for transcription.
Args:
audio: Raw audio bytes to transcribe.
Yields:
Frame: None (transcription results come via BiDi stream callbacks).
"""
if self._client and self._client.is_active:
try:
await self._client.send_audio_chunk(audio)
except Exception as e:
yield ErrorFrame(error=f"Unknown error occurred: {e}")
yield None
async def _connect(self):
"""Connect to the SageMaker endpoint and start the BiDi session.
Builds the Deepgram query string from settings, creates the BiDi client,
starts the streaming session, and launches background tasks for processing
responses and sending KeepAlive messages.
"""
logger.debug("Connecting to Deepgram on SageMaker...")
# Reconstruct a LiveOptions from the flat settings to build the query string.
live_options = LiveOptions(**self._settings.given_fields())
# Build query string from live_options, converting booleans to strings
query_params = {}
for key, value in live_options.to_dict().items():
if value is not None:
# Convert boolean values to lowercase strings for Deepgram API
if isinstance(value, bool):
query_params[key] = str(value).lower()
else:
query_params[key] = str(value)
query_params["sample_rate"] = str(self.sample_rate)
query_string = "&".join(f"{k}={v}" for k, v in query_params.items())
# Create BiDi client
self._client = SageMakerBidiClient(
endpoint_name=self._endpoint_name,
region=self._region,
model_invocation_path="v1/listen",
model_query_string=query_string,
)
try:
# Start the session
await self._client.start_session()
# Start processing responses in the background
self._response_task = self.create_task(self._process_responses())
# Start keepalive task to maintain connection
self._keepalive_task = self.create_task(self._send_keepalive())
logger.debug("Connected to Deepgram on SageMaker")
await self._call_event_handler("on_connected")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
await self._call_event_handler("on_connection_error", str(e))
async def _disconnect(self):
"""Disconnect from the SageMaker endpoint.
Sends a CloseStream message to Deepgram, cancels background tasks
(KeepAlive and response processing), and closes the BiDi session.
Safe to call multiple times.
"""
if self._client and self._client.is_active:
logger.debug("Disconnecting from Deepgram on SageMaker...")
# Send CloseStream message to Deepgram
try:
await self._client.send_json({"type": "CloseStream"})
except Exception as e:
logger.warning(f"Failed to send CloseStream message: {e}")
# Cancel keepalive task
if self._keepalive_task and not self._keepalive_task.done():
await self.cancel_task(self._keepalive_task)
# Cancel response processing task
if self._response_task and not self._response_task.done():
await self.cancel_task(self._response_task)
# Close the BiDi session
await self._client.close_session()
logger.debug("Disconnected from Deepgram on SageMaker")
await self._call_event_handler("on_disconnected")
async def _send_keepalive(self):
"""Send periodic KeepAlive messages to maintain the connection.
Sends a KeepAlive JSON message to Deepgram every 5 seconds while the
connection is active. This prevents the connection from timing out during
periods of silence.
"""
while self._client and self._client.is_active:
await asyncio.sleep(5)
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "KeepAlive"})
except Exception as e:
logger.warning(f"Failed to send KeepAlive: {e}")
async def _process_responses(self):
"""Process streaming responses from Deepgram on SageMaker.
Continuously receives responses from the BiDi stream, decodes the payload,
parses JSON responses from Deepgram, and processes transcription results.
Runs as a background task until the connection is closed or cancelled.
"""
try:
while self._client and self._client.is_active:
result = await self._client.receive_response()
if result is None:
break
# Check if this is a PayloadPart with bytes
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
if result.value.bytes_:
response_data = result.value.bytes_.decode("utf-8")
try:
# Parse JSON response from Deepgram
parsed = json.loads(response_data)
# Extract and process transcript if available
if "channel" in parsed:
await self._handle_transcript_response(parsed)
except json.JSONDecodeError:
logger.warning(f"Non-JSON response: {response_data}")
except asyncio.CancelledError:
logger.debug("Response processor cancelled")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
finally:
logger.debug("Response processor stopped")
async def _handle_transcript_response(self, parsed: dict):
"""Handle a transcript response from Deepgram.
Extracts the transcript text, determines if it's final or interim, extracts
language information, and pushes the appropriate frame (TranscriptionFrame
or InterimTranscriptionFrame) downstream.
Args:
parsed: The parsed JSON response from Deepgram containing channel,
alternatives, transcript, and metadata.
"""
alternatives = parsed.get("channel", {}).get("alternatives", [])
if not alternatives or not alternatives[0].get("transcript"):
return
transcript = alternatives[0]["transcript"]
if not transcript.strip():
return
is_final = parsed.get("is_final", False)
# Extract language if available
language = None
if alternatives[0].get("languages"):
language = alternatives[0]["languages"][0]
language = Language(language)
if is_final:
# Check if this response is from a finalize() call.
# Only mark as finalized when both we requested it AND Deepgram confirms it.
from_finalize = parsed.get("from_finalize", False)
if from_finalize:
self.confirm_finalize()
await self.push_frame(
TranscriptionFrame(
transcript,
self._user_id,
time_now_iso8601(),
language,
result=parsed,
)
)
await self._handle_transcription(transcript, is_final, language)
await self.stop_processing_metrics()
else:
# Interim transcription
await self.push_frame(
InterimTranscriptionFrame(
transcript,
self._user_id,
time_now_iso8601(),
language,
result=parsed,
)
)
@traced_stt
async def _handle_transcription(
self, transcript: str, is_final: bool, language: Optional[Language] = None
):
"""Handle a transcription result with tracing.
This method is decorated with @traced_stt for observability and tracing
integration. The actual transcription processing is handled by the parent
class and observers.
Args:
transcript: The transcribed text.
is_final: Whether this is a final transcription result.
language: The detected language of the transcription, if available.
"""
pass
async def _start_metrics(self):
"""Start processing metrics collection."""
await self.start_processing_metrics()
async def process_frame(self, frame: Frame, direction: FrameDirection):
"""Process frames with Deepgram SageMaker-specific handling.
Args:
frame: The frame to process.
direction: The direction of frame processing.
"""
await super().process_frame(frame, direction)
# Start metrics when user starts speaking (if VAD is not provided by Deepgram)
if isinstance(frame, VADUserStartedSpeakingFrame):
await self._start_metrics()
elif isinstance(frame, VADUserStoppedSpeakingFrame):
# https://developers.deepgram.com/docs/finalize
# Mark that we're awaiting a from_finalize response
self.request_finalize()
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "Finalize"})
except Exception as e:
logger.warning(f"Error sending Finalize message: {e}")
logger.trace(f"Triggered finalize event on: {frame.name=}, {direction=}")
from pipecat.services.deepgram.sagemaker.stt import * # noqa: E402, F401, F403

View File

@@ -4,357 +4,15 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
"""Deepgram text-to-speech service for AWS SageMaker.
"""Deprecated: use ``pipecat.services.deepgram.sagemaker.tts`` instead."""
This module provides a Pipecat TTS service that connects to Deepgram models
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
low-latency real-time speech synthesis with support for interruptions and
streaming audio output.
"""
import warnings
import asyncio
import json
from dataclasses import dataclass, field
from typing import Any, AsyncGenerator, Optional
from loguru import logger
from pipecat.frames.frames import (
BotStoppedSpeakingFrame,
CancelFrame,
EndFrame,
ErrorFrame,
Frame,
InterruptionFrame,
LLMFullResponseEndFrame,
StartFrame,
TTSAudioRawFrame,
TTSStartedFrame,
warnings.warn(
"Module `pipecat.services.deepgram.tts_sagemaker` is deprecated, "
"use `pipecat.services.deepgram.sagemaker.tts` instead.",
DeprecationWarning,
stacklevel=2,
)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
from pipecat.services.settings import NOT_GIVEN, TTSSettings, _NotGiven
from pipecat.services.tts_service import TTSService
from pipecat.utils.tracing.service_decorators import traced_tts
@dataclass
class DeepgramSageMakerTTSSettings(TTSSettings):
"""Settings for Deepgram SageMaker TTS service.
Parameters:
encoding: Audio encoding format (e.g. "linear16").
"""
encoding: str | _NotGiven = field(default_factory=lambda: NOT_GIVEN)
class DeepgramSageMakerTTSService(TTSService):
"""Deepgram text-to-speech service for AWS SageMaker.
Provides real-time speech synthesis using Deepgram models deployed on
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
audio generation with support for interruptions via the Clear message.
Requirements:
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
- A deployed SageMaker endpoint with Deepgram TTS model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
- ``pipecat-ai[sagemaker]`` installed
Example::
tts = DeepgramSageMakerTTSService(
endpoint_name="my-deepgram-tts-endpoint",
region="us-east-2",
voice="aura-2-helena-en",
)
"""
_settings: DeepgramSageMakerTTSSettings
def __init__(
self,
*,
endpoint_name: str,
region: str,
voice: str = "aura-2-helena-en",
sample_rate: Optional[int] = None,
encoding: str = "linear16",
**kwargs,
):
"""Initialize the Deepgram SageMaker TTS service.
Args:
endpoint_name: Name of the SageMaker endpoint with Deepgram TTS model
deployed (e.g., "my-deepgram-tts-endpoint").
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
voice: Voice model to use for synthesis. Defaults to "aura-2-helena-en".
sample_rate: Audio sample rate in Hz. If None, uses the value from StartFrame.
encoding: Audio encoding format. Defaults to "linear16".
**kwargs: Additional arguments passed to the parent TTSService.
"""
super().__init__(
sample_rate=sample_rate,
push_stop_frames=True,
pause_frame_processing=True,
append_trailing_space=True,
settings=DeepgramSageMakerTTSSettings(
model=voice,
voice=voice,
language=None,
encoding=encoding,
),
**kwargs,
)
self._endpoint_name = endpoint_name
self._region = region
self._client: Optional[SageMakerBidiClient] = None
self._response_task: Optional[asyncio.Task] = None
self._context_id: Optional[str] = None
self._ttfb_started: bool = False
def can_generate_metrics(self) -> bool:
"""Check if this service can generate processing metrics.
Returns:
True, as Deepgram SageMaker TTS service supports metrics generation.
"""
return True
async def start(self, frame: StartFrame):
"""Start the Deepgram SageMaker TTS service.
Args:
frame: The start frame containing initialization parameters.
"""
await super().start(frame)
await self._connect()
async def stop(self, frame: EndFrame):
"""Stop the Deepgram SageMaker TTS service.
Args:
frame: The end frame.
"""
await super().stop(frame)
await self._disconnect()
async def cancel(self, frame: CancelFrame):
"""Cancel the Deepgram SageMaker TTS service.
Args:
frame: The cancel frame.
"""
await super().cancel(frame)
await self._disconnect()
async def process_frame(self, frame: Frame, direction: FrameDirection):
"""Process frames with special handling for LLM response end.
Args:
frame: The frame to process.
direction: The direction of frame processing.
"""
await super().process_frame(frame, direction)
if isinstance(frame, (LLMFullResponseEndFrame, EndFrame)):
await self.flush_audio()
elif isinstance(frame, BotStoppedSpeakingFrame):
self._ttfb_started = False
async def _connect(self):
"""Connect to the SageMaker endpoint and start the BiDi session.
Builds the Deepgram TTS query string, creates the BiDi client,
starts the streaming session, and launches a background task for processing
responses.
"""
logger.debug("Connecting to Deepgram TTS on SageMaker...")
query_string = (
f"model={self._settings.voice}&encoding={self._settings.encoding}"
f"&sample_rate={self.sample_rate}"
)
self._client = SageMakerBidiClient(
endpoint_name=self._endpoint_name,
region=self._region,
model_invocation_path="v1/speak",
model_query_string=query_string,
)
try:
await self._client.start_session()
self._response_task = self.create_task(self._process_responses())
logger.debug("Connected to Deepgram TTS on SageMaker")
await self._call_event_handler("on_connected")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
await self._call_event_handler("on_connection_error", str(e))
async def _disconnect(self):
"""Disconnect from the SageMaker endpoint.
Sends a Close message to Deepgram, cancels the response processing task,
and closes the BiDi session. Safe to call multiple times.
"""
if self._client and self._client.is_active:
logger.debug("Disconnecting from Deepgram TTS on SageMaker...")
try:
await self._client.send_json({"type": "Close"})
except Exception as e:
logger.warning(f"Failed to send Close message: {e}")
if self._response_task and not self._response_task.done():
await self.cancel_task(self._response_task)
await self._client.close_session()
logger.debug("Disconnected from Deepgram TTS on SageMaker")
await self._call_event_handler("on_disconnected")
async def _update_settings(self, delta: TTSSettings) -> dict[str, Any]:
"""Apply a settings delta and reconnect if necessary.
Since all settings are part of the SageMaker session query string,
any setting change requires reconnecting to apply the new values.
"""
changed = await super()._update_settings(delta)
if not changed:
return changed
# Deepgram uses voice as the model, so keep them in sync for metrics
if "voice" in changed:
self._settings.model = self._settings.voice
self._sync_model_name_to_metrics()
# TODO: someday we could reconnect here to apply updated settings.
# Code might look something like the below:
# await self._disconnect()
# await self._connect()
self._warn_unhandled_updated_settings(changed)
return changed
async def _process_responses(self):
"""Process streaming responses from Deepgram TTS on SageMaker.
Continuously receives responses from the BiDi stream. Attempts to decode
each payload as UTF-8 JSON for control messages (Flushed, Cleared, Metadata,
Warning). If decoding fails, treats the payload as raw audio bytes and pushes
a TTSAudioRawFrame downstream.
"""
try:
while self._client and self._client.is_active:
result = await self._client.receive_response()
if result is None:
break
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
if result.value.bytes_:
payload = result.value.bytes_
# Try to decode as JSON control message first
try:
response_data = payload.decode("utf-8")
parsed = json.loads(response_data)
msg_type = parsed.get("type")
if msg_type == "Metadata":
logger.trace(f"Received metadata: {parsed}")
elif msg_type == "Flushed":
logger.trace(f"Received Flushed: {parsed}")
elif msg_type == "Cleared":
logger.trace(f"Received Cleared: {parsed}")
elif msg_type == "Warning":
logger.warning(
f"{self} warning: "
f"{parsed.get('description', 'Unknown warning')}"
)
else:
logger.debug(f"Received unknown message type: {parsed}")
except (UnicodeDecodeError, json.JSONDecodeError):
# Not JSON — treat as raw audio bytes
await self.stop_ttfb_metrics()
frame = TTSAudioRawFrame(
payload,
self.sample_rate,
1,
context_id=self._context_id,
)
await self.push_frame(frame)
except asyncio.CancelledError:
logger.debug("TTS response processor cancelled")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
finally:
logger.debug("TTS response processor stopped")
async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
"""Handle interruption by sending Clear message to Deepgram.
The Clear message will clear Deepgram's internal text buffer and stop
sending audio, allowing for a new response to be generated.
"""
await super()._handle_interruption(frame, direction)
self._ttfb_started = False
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "Clear"})
except Exception as e:
logger.error(f"{self} error sending Clear message: {e}")
async def flush_audio(self):
"""Flush any pending audio synthesis by sending Flush command.
This should be called when the LLM finishes a complete response to force
generation of audio from Deepgram's internal text buffer.
"""
if self._client and self._client.is_active:
try:
await self._client.send_json({"type": "Flush"})
except Exception as e:
logger.error(f"{self} error sending Flush message: {e}")
@traced_tts
async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
"""Generate speech from text using Deepgram TTS on SageMaker.
Args:
text: The text to synthesize into speech.
context_id: The context ID for tracking audio frames.
Yields:
Frame: TTSStartedFrame, then None (audio comes asynchronously via
the response processor).
"""
logger.debug(f"{self}: Generating TTS [{text}]")
try:
if not self._ttfb_started:
await self.start_ttfb_metrics()
self._ttfb_started = True
await self.start_tts_usage_metrics(text)
yield TTSStartedFrame(context_id=context_id)
self._context_id = context_id
await self._client.send_json({"type": "Speak", "text": text})
yield None
except Exception as e:
yield ErrorFrame(error=f"Unknown error occurred: {e}")
from pipecat.services.deepgram.sagemaker.tts import * # noqa: E402, F401, F403

View File

@@ -613,7 +613,7 @@ class GladiaSTTService(WebsocketSTTService):
await self.broadcast_frame(UserStartedSpeakingFrame)
if self._should_interrupt:
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
async def _on_speech_ended(self):
"""Handle speech end event from Gladia.

View File

@@ -1265,7 +1265,7 @@ class GeminiLiveLLMService(LLMService):
# combination with the context aggregator default
# turn strategies.
logger.debug("Gemini VAD: interrupted signal received")
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
elif message.server_content and message.server_content.model_turn:
await self._handle_msg_model_turn(message)
elif (

View File

@@ -734,7 +734,7 @@ class GrokRealtimeLLMService(LLMService):
"""Handle speech started event from VAD."""
await self._truncate_current_audio_response()
await self.broadcast_frame(UserStartedSpeakingFrame)
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
async def _handle_evt_speech_stopped(self, evt):
"""Handle speech stopped event from VAD."""

View File

@@ -62,10 +62,12 @@ class HeyGenCallbacks(BaseModel):
"""Callback handlers for HeyGen events.
Parameters:
on_participant_connected: Called when a participant connects
on_participant_disconnected: Called when a participant disconnects
on_connected: Called when the bot connects to the LiveKit room.
on_participant_connected: Called when a participant connects.
on_participant_disconnected: Called when a participant disconnects.
"""
on_connected: Callable[[], Awaitable[None]]
on_participant_connected: Callable[[str], Awaitable[None]]
on_participant_disconnected: Callable[[str], Awaitable[None]]
@@ -251,6 +253,7 @@ class HeyGenClient:
logger.debug(f"HeyGenClient send_interval: {self._send_interval}")
await self._ws_connect()
await self._livekit_connect()
self._call_event_callback(self._callbacks.on_connected)
async def stop(self) -> None:
"""Stop the client and terminate all connections.

View File

@@ -128,6 +128,7 @@ class HeyGenVideoService(AIService):
session_request=self._session_request,
service_type=self._service_type,
callbacks=HeyGenCallbacks(
on_connected=self._on_connected,
on_participant_connected=self._on_participant_connected,
on_participant_disconnected=self._on_participant_disconnected,
),
@@ -144,6 +145,10 @@ class HeyGenVideoService(AIService):
await self._client.cleanup()
self._client = None
async def _on_connected(self):
"""Handle bot connected to LiveKit room."""
logger.info("HeyGen bot connected to LiveKit room")
async def _on_participant_connected(self, participant_id: str):
"""Handle participant connected events."""
logger.info(f"Participant connected {participant_id}")

View File

@@ -839,7 +839,7 @@ class OpenAIRealtimeLLMService(LLMService):
async def _handle_evt_speech_started(self, evt):
await self._truncate_current_audio_response()
await self.broadcast_frame(UserStartedSpeakingFrame)
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
async def _handle_evt_speech_stopped(self, evt):
await self.start_ttfb_metrics()

View File

@@ -639,7 +639,7 @@ class OpenAIRealtimeSTTService(WebsocketSTTService):
logger.debug("Server VAD: speech started")
await self.broadcast_frame(UserStartedSpeakingFrame)
if self._should_interrupt:
await self.push_interruption_task_frame_and_wait()
await self.broadcast_interruption()
await self.start_processing_metrics()
async def _handle_speech_stopped(self, evt: dict):

Some files were not shown because too many files have changed in this diff Show More