Merge branch 'main' into filipi/deepgram

# Conflicts:
#	src/pipecat/services/deepgram/stt_sagemaker.py
1
.github/workflows/update-docs.yml
vendored
@@ -59,6 +59,7 @@ jobs:
DOCS_SYNC_TOKEN: ${{ secrets.DOCS_SYNC_TOKEN }}
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
github_token: ${{ secrets.GITHUB_TOKEN }}
prompt: |
You are updating documentation for the pipecat-ai/docs repository based on
changes merged in PR #${{ steps.pr.outputs.number }} of pipecat-ai/pipecat.
383
CHANGELOG.md
@@ -7,6 +7,389 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

<!-- towncrier release notes start -->

## [0.0.104] - 2026-03-02

### Added

- Added `TextAggregationMetricsData` metric measuring the time from the first
  LLM token to the first complete sentence, representing the latency cost of
  sentence aggregation in the TTS pipeline.
  (PR [#3696](https://github.com/pipecat-ai/pipecat/pull/3696))

- Added support for using strongly-typed objects instead of dicts for updating
  service settings at runtime.

  Instead of, say:

  ```python
  await task.queue_frame(
      STTUpdateSettingsFrame(settings={"language": Language.ES})
  )
  ```

  you'd do:

  ```python
  await task.queue_frame(
      STTUpdateSettingsFrame(delta=DeepgramSTTSettings(language=Language.ES))
  )
  ```

  Each service now vends strongly-typed classes like `DeepgramSTTSettings`
  representing the service's runtime-updatable settings.
  (PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))

- Added support for specifying private endpoints for Azure Speech-to-Text,
  enabling use in private networks behind firewalls.
  (PR [#3764](https://github.com/pipecat-ai/pipecat/pull/3764))

- Added `LemonSliceTransport` and `LemonSliceApi` to support adding real-time
  LemonSlice Avatars to any Daily room.
  (PR [#3791](https://github.com/pipecat-ai/pipecat/pull/3791))

- Added `output_medium` parameter to `AgentInputParams` and
  `OneShotInputParams` in Ultravox service to control initial output medium
  (text or voice) at call creation time.
  (PR [#3806](https://github.com/pipecat-ai/pipecat/pull/3806))

- Added `TurnMetricsData` as a generic metrics class for turn detection, with
  e2e processing time measurement. `KrispVivaTurn` now emits `TurnMetricsData`
  with `e2e_processing_time_ms` tracking the interval from VAD
  speech-to-silence transition to turn completion.
  (PR [#3809](https://github.com/pipecat-ai/pipecat/pull/3809))

- Added `on_audio_context_interrupted()` and `on_audio_context_completed()`
  callbacks to `AudioContextTTSService`. Subclasses can override these to
  perform provider-specific cleanup instead of overriding
  `_handle_interruption()`.
  (PR [#3814](https://github.com/pipecat-ai/pipecat/pull/3814))

- Added `on_summary_applied` event to `LLMContextSummarizer` for observability,
  providing message counts before and after context summarization.
  (PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))

- Added `summary_message_template` to `LLMContextSummarizationConfig` for
  customizing how summaries are formatted when injected into context (e.g.,
  wrapping in XML tags).
  (PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))

- Added `summarization_timeout` to `LLMContextSummarizationConfig` (default
  120s) to prevent hung LLM calls from permanently blocking future
  summarizations.
  (PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))

- Added optional `llm` field to `LLMContextSummarizationConfig` for routing
  summarization to a dedicated LLM service (e.g., a cheaper/faster model)
  instead of the pipeline's primary model.
  (PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))

- Added AssemblyAI `u3-rt-pro` model support with built-in turn detection mode.
  (PR [#3856](https://github.com/pipecat-ai/pipecat/pull/3856))

- Added `LLMSummarizeContextFrame` to trigger on-demand context summarization
  from anywhere in the pipeline (e.g. a function call tool). Accepts an
  optional `config: LLMContextSummaryConfig` to override summary generation
  settings per request.
  (PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))

- Added `LLMContextSummaryConfig` (summary generation params:
  `target_context_tokens`, `min_messages_after_summary`,
  `summarization_prompt`) and `LLMAutoContextSummarizationConfig` (auto-trigger
  thresholds: `max_context_tokens`, `max_unsummarized_messages`, plus a nested
  `summary_config`). These replace the monolithic
  `LLMContextSummarizationConfig`.
  (PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))

- Added support for the `speed_alpha` parameter to the `arcana` model in
  `RimeTTSService`.
  (PR [#3873](https://github.com/pipecat-ai/pipecat/pull/3873))

- Added `ClientConnectedFrame`, a new `SystemFrame` pushed by all transports
  (Daily, LiveKit, FastAPI WebSocket, WebSocket Server, SmallWebRTC, HeyGen,
  Tavus) when a client connects. Enables observers to track transport readiness
  timing.
  (PR [#3881](https://github.com/pipecat-ai/pipecat/pull/3881))

- Added `StartupTimingObserver` for measuring how long each processor's
  `start()` method takes during pipeline startup. Also measures transport
  readiness (the time from `StartFrame` to first client connection) via the
  `on_transport_timing_report` event.
  (PR [#3881](https://github.com/pipecat-ai/pipecat/pull/3881))

- Added `BotConnectedFrame` for SFU transports and `on_transport_timing_report`
  event to `StartupTimingObserver` with bot and client connection timing.
  (PR [#3881](https://github.com/pipecat-ai/pipecat/pull/3881))

- Added optional `direction` parameter to `PipelineTask.queue_frame()` and
  `PipelineTask.queue_frames()`, allowing frames to be pushed upstream from the
  end of the pipeline.
  (PR [#3883](https://github.com/pipecat-ai/pipecat/pull/3883))

- Added `on_latency_breakdown` event to `UserBotLatencyObserver` providing
  per-service TTFB, text aggregation, user turn duration, and function call
  latency metrics for each user-to-bot response cycle.
  (PR [#3885](https://github.com/pipecat-ai/pipecat/pull/3885))

- Added `on_first_bot_speech_latency` event to `UserBotLatencyObserver`
  measuring the time from client connection to first bot speech. An
  `on_latency_breakdown` is also emitted for this first speech event.
  (PR [#3885](https://github.com/pipecat-ai/pipecat/pull/3885))

- Added `broadcast_interruption()` to `FrameProcessor`. This method pushes an
  `InterruptionFrame` both upstream and downstream directly from the calling
  processor, avoiding the round-trip through the pipeline task that
  `push_interruption_task_frame_and_wait()` required.
  (PR [#3896](https://github.com/pipecat-ai/pipecat/pull/3896))
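
As a rough illustration of what pushing a frame "both upstream and downstream directly from the calling processor" means, here is a minimal sketch. It is not Pipecat code: `SketchProcessor`, `receive()`, `broadcast()`, and `link()` are hypothetical names invented for this example, and frames are plain strings.

```python
from typing import List, Optional


class SketchProcessor:
    """Hypothetical stand-in for a pipeline processor (not the Pipecat class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.prev: Optional["SketchProcessor"] = None
        self.next: Optional["SketchProcessor"] = None
        self.seen: List[str] = []

    def receive(self, frame: str, direction: str) -> None:
        # Record the frame, then keep forwarding it in the same direction.
        self.seen.append(f"{frame}:{direction}")
        target = self.next if direction == "downstream" else self.prev
        if target is not None:
            target.receive(frame, direction)

    def broadcast(self, frame: str) -> None:
        # Push to both neighbours directly, with no detour through a task object.
        if self.prev is not None:
            self.prev.receive(frame, "upstream")
        if self.next is not None:
            self.next.receive(frame, "downstream")


def link(*processors: SketchProcessor) -> None:
    """Wire processors into a doubly linked chain, in the order given."""
    for left, right in zip(processors, processors[1:]):
        left.next = right
        right.prev = left


stt, llm, tts = SketchProcessor("stt"), SketchProcessor("llm"), SketchProcessor("tts")
link(stt, llm, tts)
llm.broadcast("interruption")
```

In this toy chain the middle processor reaches both ends in one step, which is the property the changelog entry attributes to `broadcast_interruption()`.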

### Changed

- Added `text_aggregation_mode` parameter to `TTSService` and all TTS
  subclasses with a new `TextAggregationMode` enum (`SENTENCE`, `TOKEN`). All
  text now flows through text aggregators regardless of mode, enabling pattern
  detection and tag handling in TOKEN mode.
  (PR [#3696](https://github.com/pipecat-ai/pipecat/pull/3696))

- ⚠️ Refactored runtime-updatable service settings to use strongly-typed
  classes (`TTSSettings`, `STTSettings`, `LLMSettings`, and service-specific
  subclasses) instead of plain dicts. Each service's `_settings` now holds
  these strongly-typed objects. For service maintainers, see changes in
  COMMUNITY_INTEGRATIONS.md.
  (PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))

- Word timestamp support has been moved from `WordTTSService` into `TTSService`
  via a new `supports_word_timestamps` parameter. Services that previously
  extended `WordTTSService`, `AudioContextWordTTSService`, or
  `WebsocketWordTTSService` now pass `supports_word_timestamps=True` to their
  parent `__init__` instead.
  (PR [#3786](https://github.com/pipecat-ai/pipecat/pull/3786))

- Improved Ultravox TTFB measurement accuracy by using VAD speech end time
  instead of `UserStoppedSpeakingFrame` timing.
  (PR [#3806](https://github.com/pipecat-ai/pipecat/pull/3806))

- Aligned `UltravoxRealtimeLLMService` frame handling with OpenAI/Gemini
  realtime services: added `InterruptionFrame` handling with metrics cleanup,
  processing metrics at response boundaries, and improved agent transcript
  handling for both voice and text output modalities.
  (PR [#3806](https://github.com/pipecat-ai/pipecat/pull/3806))

- Updated `OpenAIRealtimeLLMService` default model to `gpt-realtime-1.5`.
  (PR [#3807](https://github.com/pipecat-ai/pipecat/pull/3807))

- Added `api_key` parameter to `KrispVivaSDKManager`, `KrispVivaTurn`, and
  `KrispVivaFilter` for Krisp SDK v1.6.1+ licensing. Falls back to
  `KRISP_VIVA_API_KEY` environment variable.
  (PR [#3809](https://github.com/pipecat-ai/pipecat/pull/3809))

- Bumped `nltk` minimum version from 3.9.1 to 3.9.3 to resolve a security
  vulnerability.
  (PR [#3811](https://github.com/pipecat-ai/pipecat/pull/3811))

- `ServiceSettingsUpdateFrame`s are now `UninterruptibleFrame`s. Generally
  speaking, you don't want a user interruption to prevent a service setting
  change from going into effect. Note that you usually don't use
  `ServiceSettingsUpdateFrame` directly; you use one of its subclasses:
  - `LLMUpdateSettingsFrame`
  - `TTSUpdateSettingsFrame`
  - `STTUpdateSettingsFrame`
  (PR [#3819](https://github.com/pipecat-ai/pipecat/pull/3819))

- Updated context summarization to use `user` role instead of `assistant` for
  summary messages.
  (PR [#3855](https://github.com/pipecat-ai/pipecat/pull/3855))

- Renamed `AssemblyAISTTService` parameter
  `min_end_of_turn_silence_when_confident` to `min_turn_silence` (the old
  name is still supported with a deprecation warning).
  (PR [#3856](https://github.com/pipecat-ai/pipecat/pull/3856))

- ⚠️ Renamed `LLMAssistantAggregatorParams` fields:
  `enable_context_summarization` → `enable_auto_context_summarization` and
  `context_summarization_config` → `auto_context_summarization_config` (now
  accepts `LLMAutoContextSummarizationConfig`). The old names still work with a
  `DeprecationWarning` for one release cycle.
  (PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))

- `ElevenLabsRealtimeSTTService` now sets `TranscriptionFrame.finalized` to
  `True` when using `CommitStrategy.MANUAL`.
  (PR [#3865](https://github.com/pipecat-ai/pipecat/pull/3865))

- Updated `numba` version pin from `==` to `>=0.61.2`.
  (PR [#3868](https://github.com/pipecat-ai/pipecat/pull/3868))

- Updated tracing code to use `ServiceSettings` dataclass API
  (`given_fields()`, attribute access) instead of dict-style access
  (`.items()`, `in`, subscript).
  (PR [#3879](https://github.com/pipecat-ai/pipecat/pull/3879))

- ⚠️ Removed `event` field and `complete()` method from `InterruptionFrame`.
  Removed `event` field from `InterruptionTaskFrame`. These are no longer
  needed since `broadcast_interruption()` does not require a round-trip
  completion signal.
  (PR [#3896](https://github.com/pipecat-ai/pipecat/pull/3896))

- Moved `pipecat.services.deepgram.stt_sagemaker` and
  `pipecat.services.deepgram.tts_sagemaker` to
  `pipecat.services.deepgram.sagemaker.stt` and
  `pipecat.services.deepgram.sagemaker.tts`. The old import paths still work
  but emit a `DeprecationWarning`.
  (PR [#3902](https://github.com/pipecat-ai/pipecat/pull/3902))
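
The typed-settings refactor in this section can be pictured with a small dataclass sketch. This is an illustration under assumptions, not the library's implementation: `SketchSTTSettings`, `NOT_GIVEN`, and `apply_delta()` are hypothetical names, and although `given_fields()` is named in the tracing entry above, its behaviour here (returning only the fields that were explicitly set) is a guess.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


class _NotGiven:
    """Sentinel type: distinguishes 'field not set' from a real value such as None."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class SketchSTTSettings:
    """Hypothetical settings class; every field is optional so it can act as a delta."""

    model: Any = NOT_GIVEN
    language: Any = NOT_GIVEN

    def given_fields(self) -> Dict[str, Any]:
        # Only fields the caller explicitly set take part in an update.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not NOT_GIVEN
        }

    def apply_delta(self, delta: "SketchSTTSettings") -> Dict[str, Any]:
        # Merge the delta in place and return the previous values of changed fields.
        previous: Dict[str, Any] = {}
        for name, value in delta.given_fields().items():
            old = getattr(self, name)
            if old != value:
                previous[name] = old
                setattr(self, name, value)
        return previous


current = SketchSTTSettings(model="model-a", language="en")
changed = current.apply_delta(SketchSTTSettings(language="es"))
```

Compared with a plain dict, a misspelled field name fails at construction time, and the returned mapping tells a service which settings changed so it can decide whether it needs to react.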

### Deprecated

- ⚠️ Deprecated `aggregate_sentences` parameter on `TTSService` and all TTS
  subclasses. Use `text_aggregation_mode=TextAggregationMode.SENTENCE` or
  `text_aggregation_mode=TextAggregationMode.TOKEN` instead.
  (PR [#3696](https://github.com/pipecat-ai/pipecat/pull/3696))

- Deprecated `set_model()`, `set_voice()`, and `set_language()` on AI services
  in favor of runtime updates via `TTSUpdateSettingsFrame`,
  `STTUpdateSettingsFrame`, and `LLMUpdateSettingsFrame`.

  ⚠️ Note, too, a subtle behavior change in these deprecated methods. Whereas
  previously only `set_language()` caused the service to actually react to the
  update (e.g. by reconnecting to a remote service so it can pick up the
  change), now all these methods do. This change was made as part of a refactor
  making them all work the same way under the hood.
  (PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))

- Dict-based `*UpdateSettingsFrame(settings={...})` is deprecated in favor of
  passing typed settings delta objects with
  `*UpdateSettingsFrame(delta=...)`.
  (PR [#3714](https://github.com/pipecat-ai/pipecat/pull/3714))

- Deprecated `WordTTSService`, `WebsocketWordTTSService`,
  `AudioContextWordTTSService`, and `InterruptibleWordTTSService`. Use their
  non-word counterparts with `supports_word_timestamps=True` instead:
  - `WordTTSService` → `TTSService(supports_word_timestamps=True)`
  - `WebsocketWordTTSService` →
    `WebsocketTTSService(supports_word_timestamps=True)`
  - `AudioContextWordTTSService` →
    `AudioContextTTSService(supports_word_timestamps=True)`
  - `InterruptibleWordTTSService` →
    `InterruptibleTTSService(supports_word_timestamps=True)`
  (PR [#3786](https://github.com/pipecat-ai/pipecat/pull/3786))

- Deprecated `SmartTurnMetricsData` in favor of `TurnMetricsData`.
  `BaseSmartTurn` now emits `TurnMetricsData` directly.
  (PR [#3809](https://github.com/pipecat-ai/pipecat/pull/3809))

- Deprecated `LLMContextSummarizationConfig`. Use
  `LLMAutoContextSummarizationConfig` with a nested `LLMContextSummaryConfig`
  instead. The old class emits a `DeprecationWarning`.
  (PR [#3863](https://github.com/pipecat-ai/pipecat/pull/3863))

- Deprecated `push_interruption_task_frame_and_wait()` in `FrameProcessor`. Use
  `broadcast_interruption()` instead. The old method now delegates to
  `broadcast_interruption()` and logs a deprecation warning.
  (PR [#3896](https://github.com/pipecat-ai/pipecat/pull/3896))
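
The `LLMContextSummarizationConfig` deprecation in this section splits one config into "when to summarize" and "how to summarize". The sketch below mirrors that shape with plain dataclasses. The field names come from the changelog, but the class names, the default values, and the either-threshold trigger rule in `should_summarize()` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchSummaryConfig:
    """How a summary is generated (default values are invented for this sketch)."""

    target_context_tokens: int = 6000
    min_messages_after_summary: int = 4
    summarization_prompt: Optional[str] = None


@dataclass
class SketchAutoSummarizationConfig:
    """When a summary is triggered automatically, plus the nested generation settings."""

    max_context_tokens: int = 8000
    max_unsummarized_messages: int = 20
    summary_config: SketchSummaryConfig = field(default_factory=SketchSummaryConfig)


def should_summarize(
    cfg: SketchAutoSummarizationConfig,
    context_tokens: int,
    unsummarized_messages: int,
) -> bool:
    # Assumed rule: crossing either threshold triggers a summarization.
    return (
        context_tokens >= cfg.max_context_tokens
        or unsummarized_messages >= cfg.max_unsummarized_messages
    )


cfg = SketchAutoSummarizationConfig(max_context_tokens=100, max_unsummarized_messages=5)
```

Keeping the generation settings in their own object is what lets an on-demand request carry just that part, as the `LLMSummarizeContextFrame` entry describes with its optional `config`.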

### Removed

- Removed `local-smart-turn-v3` optional extra from `pyproject.toml`. The
  `transformers` and `onnxruntime` packages are now always installed as core
  dependencies since they are required by the default turn stop strategy,
  `TurnAnalyzerUserTurnStopStrategy`, which uses `LocalSmartTurnAnalyzerV3`.
  (PR [#3803](https://github.com/pipecat-ai/pipecat/pull/3803))

- ⚠️ Removed `PlayHTTTSService` and `PlayHTHttpTTSService`. PlayHT has been
  shut down and is no longer available.
  (PR [#3838](https://github.com/pipecat-ai/pipecat/pull/3838))

### Fixed

- Added `LLMSpecificMessage` handling in `LLMContextSummarizationUtil` to skip
  provider-specific messages during context summarization.
  (PR [#3794](https://github.com/pipecat-ai/pipecat/pull/3794))

- Treated `response_cancel_not_active` as a non-fatal error in realtime
  services (`OpenAIRealtimeLLMService`, `GrokRealtimeLLMService`,
  `OpenAIRealtimeBetaLLMService`) to prevent WebSocket disconnection when
  cancelling an inactive response.
  (PR [#3795](https://github.com/pipecat-ai/pipecat/pull/3795))

- Fixed Poetry compatibility by inlining `local-smart-turn-v3` dependencies
  (`transformers`, `onnxruntime`) into core dependencies instead of using a
  self-referential extra.
  (PR [#3803](https://github.com/pipecat-ai/pipecat/pull/3803))

- Fixed `SentryMetrics` method signatures to match updated
  `FrameProcessorMetrics` base class, resolving `TypeError` when using
  `start_time`/`end_time` keyword arguments.
  (PR [#3808](https://github.com/pipecat-ai/pipecat/pull/3808))

- Fixed STT TTFB metrics not being reported for `SonioxSTTService` and
  `AWSTranscribeSTTService` due to missing `can_generate_metrics()` override.
  (PR [#3813](https://github.com/pipecat-ai/pipecat/pull/3813))

- Fixed an issue where `AudioContextTTSService`-based providers (AsyncAI,
  ElevenLabs, Inworld, Rime) did not close or clean up their server-side audio
  contexts after normal speech completion, only on interruption.
  (PR [#3814](https://github.com/pipecat-ai/pipecat/pull/3814))

- Fixed STT TTFB metrics measuring timeout expiry time instead of actual
  transcript arrival time.
  (PR [#3822](https://github.com/pipecat-ai/pipecat/pull/3822))

- Fixed `InterimTranscriptionFrame` and `TranslationFrame` being
  unintentionally pushed downstream in `LLMUserAggregator`. They are now
  consumed like `TranscriptionFrame`.
  (PR [#3825](https://github.com/pipecat-ai/pipecat/pull/3825))

- Fixed misleading "Empty audio frame received for STT service" warnings when
  using audio filters (e.g. `RNNoiseFilter`, `KrispVivaFilter`, `AICFilter`)
  that buffer audio internally.
  (PR [#3828](https://github.com/pipecat-ai/pipecat/pull/3828))

- Fixed an issue in `RimeNonJsonTTSService` where trailing punctuation was
  sometimes vocalized.
  (PR [#3837](https://github.com/pipecat-ai/pipecat/pull/3837))

- Fixed `TTSSpeakFrame` not committing spoken text to the conversation context
  when used outside of an LLM response (e.g., bot greetings or injected
  speech).
  (PR [#3845](https://github.com/pipecat-ai/pipecat/pull/3845))

- Removed verbose per-chunk audio logging from `GenesysAudioHookSerializer`
  that flooded production logs.
  (PR [#3850](https://github.com/pipecat-ai/pipecat/pull/3850))

- Added beta feature warning when using custom prompts with AssemblyAI.
  (PR [#3856](https://github.com/pipecat-ai/pipecat/pull/3856))

- Fixed `LocalSmartTurnAnalyzerV3` producing incorrect end-of-turn predictions
  at non-16kHz sample rates (e.g. 8kHz Twilio telephony) by adding automatic
  resampling to 16kHz before Whisper feature extraction.
  (PR [#3857](https://github.com/pipecat-ai/pipecat/pull/3857))

- Fixed `PipelineTask` double-inserting `RTVIProcessor` into the frame chain
  when the user provides both an `RTVIProcessor` in the pipeline and a custom
  `RTVIObserver` subclass in observers.
  (PR [#3867](https://github.com/pipecat-ai/pipecat/pull/3867))

- Fixed turn completion instructions being lost when `LLMMessagesUpdateFrame`
  replaces the LLM context. When `filter_incomplete_user_turns` is enabled, the
  turn completion system message is now re-injected after context replacement.
  (PR [#3888](https://github.com/pipecat-ai/pipecat/pull/3888))

- Fixed Azure TTS and STT services silently swallowing cancellation errors
  (invalid API key, network failures, rate limiting) instead of propagating
  them as `ErrorFrame`s to the pipeline.
  (PR [#3893](https://github.com/pipecat-ai/pipecat/pull/3893))
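
The `LocalSmartTurnAnalyzerV3` fix in this section hinges on bringing audio to 16kHz before feature extraction. The changelog does not say which resampler is used, so the sketch below substitutes naive linear interpolation purely to show the idea; `resample_linear()` is a hypothetical helper, and a production resampler would normally also apply filtering.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (illustrative only, no anti-alias filtering)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)

    out_len = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1
    out: List[float] = []
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)  # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# 8kHz telephony audio upsampled to the 16kHz rate named in the entry above.
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], 8000, 16000)
```

Doubling the rate doubles the number of samples while preserving the duration, so a model that assumes 16kHz input sees speech at the tempo it expects.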

### Performance

- Switched `GradiumTTSService` from `InterruptibleWordTTSService` to
  `AudioContextWordTTSService`, eliminating websocket disconnect/reconnect on
  every interruption by using `client_req_id`-based multiplexing.
  (PR [#3759](https://github.com/pipecat-ai/pipecat/pull/3759))
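
One way to picture request-ID multiplexing on a single connection is sketched below. This is an assumption-laden illustration, not the `GradiumTTSService` implementation: `SketchAudioMultiplexer` and its methods are hypothetical, and the only idea taken from the entry above is that each request carries an ID so an interruption does not require closing the websocket.

```python
from typing import Optional


class SketchAudioMultiplexer:
    """Hypothetical filter that keeps one connection open and drops stale audio by ID."""

    def __init__(self) -> None:
        self._next_id = 0
        self._active_id: Optional[str] = None

    def start_request(self) -> str:
        # Each synthesis request gets a fresh ID that the server echoes back.
        self._next_id += 1
        self._active_id = f"req-{self._next_id}"
        return self._active_id

    def interrupt(self) -> None:
        # No reconnect: simply stop accepting audio for the current request.
        self._active_id = None

    def on_audio(self, req_id: str, chunk: bytes) -> Optional[bytes]:
        # Audio tagged with an ID that is no longer active is discarded.
        return chunk if req_id == self._active_id else None


mux = SketchAudioMultiplexer()
first = mux.start_request()
kept = mux.on_audio(first, b"hello")
mux.interrupt()
stale = mux.on_audio(first, b"late audio")
second = mux.start_request()
fresh = mux.on_audio(second, b"next")
```

Late chunks from the interrupted request are ignored while the next request proceeds on the same connection.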

### Other

- Standardized Sarvam STT/TTS User-Agent header handling to consistently send
  Pipecat SDK identity in websocket requests.
  (PR [#3886](https://github.com/pipecat-ai/pipecat/pull/3886))

## [0.0.103] - 2026-02-20

### Added

@@ -89,7 +89,7 @@ Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.yout
| Speech-to-Speech | [AWS Nova Sonic](https://docs.pipecat.ai/server/services/s2s/aws), [Gemini Multimodal Live](https://docs.pipecat.ai/server/services/s2s/gemini), [Grok Voice Agent](https://docs.pipecat.ai/server/services/s2s/grok), [OpenAI Realtime](https://docs.pipecat.ai/server/services/s2s/openai), [Ultravox](https://docs.pipecat.ai/server/services/s2s/ultravox), |
| Transport | [Daily (WebRTC)](https://docs.pipecat.ai/server/services/transport/daily), [FastAPI Websocket](https://docs.pipecat.ai/server/services/transport/fastapi-websocket), [SmallWebRTCTransport](https://docs.pipecat.ai/server/services/transport/small-webrtc), [WebSocket Server](https://docs.pipecat.ai/server/services/transport/websocket-server), Local |
| Serializers | [Exotel](https://docs.pipecat.ai/server/utilities/serializers/exotel), [Plivo](https://docs.pipecat.ai/server/utilities/serializers/plivo), [Twilio](https://docs.pipecat.ai/server/utilities/serializers/twilio), [Telnyx](https://docs.pipecat.ai/server/utilities/serializers/telnyx), [Vonage](https://docs.pipecat.ai/server/utilities/serializers/vonage) |
| Video | [HeyGen](https://docs.pipecat.ai/server/services/video/heygen), [Tavus](https://docs.pipecat.ai/server/services/video/tavus), [Simli](https://docs.pipecat.ai/server/services/video/simli) |
| Video | [HeyGen](https://docs.pipecat.ai/server/services/video/heygen), [LemonSlice](https://docs.pipecat.ai/server/services/video/lemonslice), [Tavus](https://docs.pipecat.ai/server/services/video/tavus), [Simli](https://docs.pipecat.ai/server/services/video/simli) |
| Memory | [mem0](https://docs.pipecat.ai/server/services/memory/mem0) |
| Vision & Image | [fal](https://docs.pipecat.ai/server/services/image-generation/fal), [Google Imagen](https://docs.pipecat.ai/server/services/image-generation/google-imagen), [Moondream](https://docs.pipecat.ai/server/services/vision/moondream) |
| Audio Processing | [Silero VAD](https://docs.pipecat.ai/server/utilities/audio/silero-vad-analyzer), [Krisp](https://docs.pipecat.ai/server/utilities/audio/krisp-filter), [Koala](https://docs.pipecat.ai/server/utilities/audio/koala-filter), [ai-coustics](https://docs.pipecat.ai/server/utilities/audio/aic-filter) |

@@ -1 +0,0 @@
- Added `TextAggregationMetricsData` metric measuring the time from the first LLM token to the first complete sentence, representing the latency cost of sentence aggregation in the TTS pipeline.
@@ -1 +0,0 @@
- Added `text_aggregation_mode` parameter to `TTSService` and all TTS subclasses with a new `TextAggregationMode` enum (`SENTENCE`, `TOKEN`). All text now flows through text aggregators regardless of mode, enabling pattern detection and tag handling in TOKEN mode.
@@ -1 +0,0 @@
- ⚠️ Deprecated `aggregate_sentences` parameter on `TTSService` and all TTS subclasses. Use `text_aggregation_mode=TextAggregationMode.SENTENCE` or `text_aggregation_mode=TextAggregationMode.TOKEN` instead.
@@ -1,19 +0,0 @@
- Added support for using strongly-typed objects instead of dicts for updating service settings at runtime.

  Instead of, say:

  ```python
  await task.queue_frame(
      STTUpdateSettingsFrame(settings={"language": Language.ES})
  )
  ```

  you'd do:

  ```python
  await task.queue_frame(
      STTUpdateSettingsFrame(delta=DeepgramSTTSettings(language=Language.ES))
  )
  ```

  Each service now vends strongly-typed classes like `DeepgramSTTSettings` representing the service's runtime-updatable settings.
@@ -1 +0,0 @@
- ⚠️ Refactored runtime-updatable service settings to use strongly-typed classes (`TTSSettings`, `STTSettings`, `LLMSettings`, and service-specific subclasses) instead of plain dicts. Each service's `_settings` now holds these strongly-typed objects. For service maintainers, see changes in COMMUNITY_INTEGRATIONS.md.
@@ -1 +0,0 @@
- Dict-based `*UpdateSettingsFrame(settings={...})` is deprecated in favor of passing typed settings delta objects with `*UpdateSettingsFrame(delta=...)`.
@@ -1,3 +0,0 @@
- Deprecated `set_model()`, `set_voice()`, and `set_language()` on AI services in favor of runtime updates via `TTSUpdateSettingsFrame`, `STTUpdateSettingsFrame`, and `LLMUpdateSettingsFrame`.

  ⚠️ Note, too, a subtle behavior change in these deprecated methods. Whereas previously only `set_language()` caused the service to actually react to the update (e.g. by reconnecting to a remote service so it can pick up the change), now all these methods do. This change was made as part of a refactor making them all work the same way under the hood.
@@ -1 +0,0 @@
- Switched `GradiumTTSService` from `InterruptibleWordTTSService` to `AudioContextWordTTSService`, eliminating websocket disconnect/reconnect on every interruption by using `client_req_id`-based multiplexing.
@@ -1 +0,0 @@
- Added support for specifying private endpoints for Azure Speech-to-Text, enabling use in private networks behind firewalls.
@@ -1 +0,0 @@
- Word timestamp support has been moved from `WordTTSService` into `TTSService` via a new `supports_word_timestamps` parameter. Services that previously extended `WordTTSService`, `AudioContextWordTTSService`, or `WebsocketWordTTSService` now pass `supports_word_timestamps=True` to their parent `__init__` instead.
@@ -1,5 +0,0 @@
- Deprecated `WordTTSService`, `WebsocketWordTTSService`, `AudioContextWordTTSService`, and `InterruptibleWordTTSService`. Use their non-word counterparts with `supports_word_timestamps=True` instead:
  - `WordTTSService` → `TTSService(supports_word_timestamps=True)`
  - `WebsocketWordTTSService` → `WebsocketTTSService(supports_word_timestamps=True)`
  - `AudioContextWordTTSService` → `AudioContextTTSService(supports_word_timestamps=True)`
  - `InterruptibleWordTTSService` → `InterruptibleTTSService(supports_word_timestamps=True)`
@@ -1 +0,0 @@
|
||||
- Added `LLMSpecificMessage` handling in `LLMContextSummarizationUtil` to skip provider-specific messages during context summarization.
|
||||
@@ -1 +0,0 @@
|
||||
- Treated `response_cancel_not_active` as a non-fatal error in realtime services (`OpenAIRealtimeLLMService`, `GrokRealtimeLLMService`, `OpenAIRealtimeBetaLLMService`) to prevent WebSocket disconnection when cancelling an inactive response.
|
||||
@@ -1 +0,0 @@
|
||||
- Fixed Poetry compatibility by inlining `local-smart-turn-v3` dependencies (`transformers`, `onnxruntime`) into core dependencies instead of using a self-referential extra.
|
||||
@@ -1 +0,0 @@
|
||||
- Removed `local-smart-turn-v3` optional extra from `pyproject.toml`. The `transformers` and `onnxruntime` packages are now always installed as core dependencies since they are required by the default turn stop strategy, `TurnAnalyzerUserTurnStopStrategy` which uses `LocalSmartTurnAnalyzerV3`.
|
||||
@@ -1 +0,0 @@
|
||||
- Added `output_medium` parameter to `AgentInputParams` and `OneShotInputParams` in Ultravox service to control initial output medium (text or voice) at call creation time.
|
||||
@@ -1 +0,0 @@
|
||||
- Improved Ultravox TTFB measurement accuracy by using VAD speech end time instead of `UserStoppedSpeakingFrame` timing.
|
||||
@@ -1 +0,0 @@
|
||||
- Aligned `UltravoxRealtimeLLMService` frame handling with OpenAI/Gemini realtime services: added `InterruptionFrame` handling with metrics cleanup, processing metrics at response boundaries, and improved agent transcript handling for both voice and text output modalities.
|
||||
@@ -1 +0,0 @@
|
||||
- Updated `OpenAIRealtimeLLMService` default model to `gpt-realtime-1.5`.
|
||||
@@ -1 +0,0 @@
|
||||
- Fixed `SentryMetrics` method signatures to match updated `FrameProcessorMetrics` base class, resolving `TypeError` when using `start_time`/`end_time` keyword arguments.
|
||||
@@ -1 +0,0 @@
|
||||
- Added `TurnMetricsData` as a generic metrics class for turn detection, with e2e processing time measurement. `KrispVivaTurn` now emits `TurnMetricsData` with `e2e_processing_time_ms` tracking the interval from VAD speech-to-silence transition to turn completion.
|
||||
@@ -1 +0,0 @@
|
||||
- Added `api_key` parameter to `KrispVivaSDKManager`, `KrispVivaTurn`, and `KrispVivaFilter` for Krisp SDK v1.6.1+ licensing. Falls back to `KRISP_VIVA_API_KEY` environment variable.
|
||||
@@ -1 +0,0 @@
|
||||
- Deprecated `SmartTurnMetricsData` in favor of `TurnMetricsData`. `BaseSmartTurn` now emits `TurnMetricsData` directly.
|
||||
@@ -1 +0,0 @@
|
||||
- Bumped `nltk` minimum version from 3.9.1 to 3.9.3 to resolve a security vulnerability.
@@ -1 +0,0 @@
- Fixed STT TTFB metrics not being reported for `SonioxSTTService` and `AWSTranscribeSTTService` due to a missing `can_generate_metrics()` override.
@@ -1 +0,0 @@
- Added `on_audio_context_interrupted()` and `on_audio_context_completed()` callbacks to `AudioContextTTSService`. Subclasses can override these to perform provider-specific cleanup instead of overriding `_handle_interruption()`.
@@ -1 +0,0 @@
- Fixed an issue where `AudioContextTTSService`-based providers (AsyncAI, ElevenLabs, Inworld, Rime) did not close or clean up their server-side audio contexts after normal speech completion, only on interruption.
@@ -1,4 +0,0 @@
- `ServiceSettingsUpdateFrame`s are now `UninterruptibleFrame`s. Generally speaking, you don't want a user interruption to prevent a service setting change from going into effect. Note that you usually don't use `ServiceSettingsUpdateFrame` directly; instead, you use one of its subclasses:
  - `LLMUpdateSettingsFrame`
  - `TTSUpdateSettingsFrame`
  - `STTUpdateSettingsFrame`
@@ -1 +0,0 @@
- Fixed STT TTFB metrics measuring timeout expiry time instead of actual transcript arrival time.
@@ -1 +0,0 @@
- Fixed `InterimTranscriptionFrame` and `TranslationFrame` being unintentionally pushed downstream in `LLMUserAggregator`. They are now consumed like `TranscriptionFrame`.
@@ -1 +0,0 @@
- Fixed misleading "Empty audio frame received for STT service" warnings when using audio filters (e.g. `RNNoiseFilter`, `KrispVivaFilter`, `AICFilter`) that buffer audio internally.
@@ -1 +0,0 @@
- Fixed an issue with `RimeNonJsonTTSService` where trailing punctuation was sometimes vocalized.
@@ -1 +0,0 @@
- ⚠️ Removed `PlayHTTTSService` and `PlayHTHttpTTSService`. PlayHT has been shut down and is no longer available.
@@ -1 +0,0 @@
- Fixed `TTSSpeakFrame` not committing spoken text to the conversation context when used outside of an LLM response (e.g., bot greetings or injected speech).
@@ -1 +0,0 @@
- Removed verbose per-chunk audio logging from `GenesysAudioHookSerializer` that flooded production logs.
@@ -1 +0,0 @@
- Deprecated `ProcessingMetricsData` and `start_processing_metrics()`/`stop_processing_metrics()` on `FrameProcessor` and `FrameProcessorMetrics`. These metrics don't accurately depict a service's performance; use TTFB metrics instead. Processing metrics will be removed in version 1.0.0.
@@ -1 +0,0 @@
- Added optional `llm` field to `LLMContextSummarizationConfig` for routing summarization to a dedicated LLM service (e.g., a cheaper/faster model) instead of the pipeline's primary model.
@@ -1 +0,0 @@
- Added `summarization_timeout` to `LLMContextSummarizationConfig` (default 120s) to prevent hung LLM calls from permanently blocking future summarizations.
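
The effect of a summarization timeout can be sketched with plain `asyncio`. This is only an illustration of the idea, not Pipecat's implementation: `slow_llm_call` and `summarize_with_timeout` are hypothetical names, and only the 120-second default comes from the entry above.

```python
import asyncio


async def slow_llm_call(delay: float) -> str:
    # Hypothetical stand-in for an LLM summarization request.
    await asyncio.sleep(delay)
    return "summary"


async def summarize_with_timeout(delay: float, timeout: float = 120.0):
    # Bound the call so a hung request cannot block later summarizations.
    try:
        return await asyncio.wait_for(slow_llm_call(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # Give up on this attempt; the caller is free to try again later.
        return None


fast = asyncio.run(summarize_with_timeout(delay=0.01, timeout=1.0))
hung = asyncio.run(summarize_with_timeout(delay=1.0, timeout=0.05))
```

Returning `None` on timeout is one possible design; raising or logging would work equally well, as long as the pending state is cleared.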
@@ -1 +0,0 @@
- Added `on_summary_applied` event to `LLMContextSummarizer` for observability, providing message counts before and after context summarization.
@@ -1 +0,0 @@
- Added `summary_message_template` to `LLMContextSummarizationConfig` for customizing how summaries are formatted when injected into context (e.g., wrapping in XML tags).
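
As a rough sketch of what a summary template might do, the snippet below wraps a summary in XML tags before it is placed into a chat message. The `{summary}` placeholder name, the tag name, and the `format_summary_message` helper are all assumptions made for illustration; the entry above does not specify the template syntax.

```python
# Hypothetical template; the "{summary}" placeholder name is an assumption.
SUMMARY_TEMPLATE = "<conversation_summary>\n{summary}\n</conversation_summary>"


def format_summary_message(summary: str, template: str = SUMMARY_TEMPLATE) -> dict:
    # Build a chat message in the common role/content dict shape.
    return {"role": "user", "content": template.format(summary=summary)}


message = format_summary_message("User asked about the weather in Paris.")
```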
@@ -1 +0,0 @@
- Updated context summarization to use `user` role instead of `assistant` for summary messages.
@@ -1 +0,0 @@
- Fixed `LocalSmartTurnAnalyzerV3` producing incorrect end-of-turn predictions at non-16kHz sample rates (e.g. 8kHz Twilio telephony) by adding automatic resampling to 16kHz before Whisper feature extraction.
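
To make the resampling step concrete, here is a minimal pure-Python sketch that converts 8 kHz audio to 16 kHz before it would be handed to a model. Linear interpolation is used only for brevity; the entry above does not say which resampler Pipecat uses, and a production resampler would apply proper low-pass filtering. `resample_linear` and `MODEL_RATE` are hypothetical names.

```python
def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation (illustrative only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        # Position of output sample i on the input sample axis.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


MODEL_RATE = 16000  # the rate the entry above says the model expects

telephony_audio = [0.0, 1.0, 0.0, -1.0]  # four samples captured at 8 kHz
model_audio = resample_linear(telephony_audio, 8000, MODEL_RATE)
```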
@@ -1 +0,0 @@
- Added `LLMContextSummaryConfig` (summary generation params: `target_context_tokens`, `min_messages_after_summary`, `summarization_prompt`) and `LLMAutoContextSummarizationConfig` (auto-trigger thresholds: `max_context_tokens`, `max_unsummarized_messages`, plus a nested `summary_config`). These replace the monolithic `LLMContextSummarizationConfig`.
@@ -1 +0,0 @@
- Added `LLMSummarizeContextFrame` to trigger on-demand context summarization from anywhere in the pipeline (e.g. a function call tool). Accepts an optional `config: LLMContextSummaryConfig` to override summary generation settings per request.
@@ -1 +0,0 @@
- ⚠️ Renamed `LLMAssistantAggregatorParams` fields: `enable_context_summarization` → `enable_auto_context_summarization` and `context_summarization_config` → `auto_context_summarization_config` (now accepts `LLMAutoContextSummarizationConfig`). The old names still work with a `DeprecationWarning` for one release cycle.
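
One common way to keep an old field name working while warning about it is sketched below. `AggregatorParams` is a hypothetical stand-in, not the real `LLMAssistantAggregatorParams`, and the forwarding logic is an assumption about how such a rename could be handled rather than a description of Pipecat's code.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class AggregatorParams:
    # Hypothetical stand-in, not the real LLMAssistantAggregatorParams.
    enable_auto_context_summarization: bool = False
    # Old name kept for one release cycle; None means "not provided".
    enable_context_summarization: Optional[bool] = None

    def __post_init__(self):
        if self.enable_context_summarization is not None:
            warnings.warn(
                "enable_context_summarization is deprecated; "
                "use enable_auto_context_summarization instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Forward the old value to the new field.
            self.enable_auto_context_summarization = self.enable_context_summarization


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    params = AggregatorParams(enable_context_summarization=True)
```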
@@ -1 +0,0 @@
- Deprecated `LLMContextSummarizationConfig`. Use `LLMAutoContextSummarizationConfig` with a nested `LLMContextSummaryConfig` instead. The old class emits a `DeprecationWarning`.
@@ -1 +0,0 @@
- `ElevenLabsRealtimeSTTService` now sets `TranscriptionFrame.finalized` to `True` when using `CommitStrategy.MANUAL`.
@@ -1 +0,0 @@
- Fixed `PipelineTask` double-inserting `RTVIProcessor` into the frame chain when the user provides both an `RTVIProcessor` in the pipeline and a custom `RTVIObserver` subclass in observers.
@@ -1 +0,0 @@
- Relaxed the `numba` version pin from `==` to `>=0.61.2`.
@@ -1 +0,0 @@
- Updated tracing code to use `ServiceSettings` dataclass API (`given_fields()`, attribute access) instead of dict-style access (`.items()`, `in`, subscript).
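
The difference between dict-style access and a dataclass API with a `given_fields()` helper can be illustrated with a small stand-in. Everything here is hypothetical: `ExampleSettings`, the `NOT_GIVEN` sentinel, and the dict return type of `given_fields()` are assumptions for the sketch, and the real `ServiceSettings` class may behave differently.

```python
from dataclasses import dataclass, fields
from typing import Any

NOT_GIVEN = object()  # hypothetical sentinel for "field was not set"


@dataclass
class ExampleSettings:
    # Hypothetical settings class; the real ServiceSettings may differ.
    model: Any = NOT_GIVEN
    language: Any = NOT_GIVEN

    def given_fields(self) -> dict:
        # Return only the fields that were explicitly provided.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not NOT_GIVEN
        }


settings = ExampleSettings(language="es")
# Tracing-style attributes built from the provided fields only.
span_attributes = {f"settings.{k}": v for k, v in settings.given_fields().items()}
```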
@@ -1 +0,0 @@
- Added optional `direction` parameter to `PipelineTask.queue_frame()` and `PipelineTask.queue_frames()`, allowing frames to be pushed upstream from the end of the pipeline.
@@ -1 +0,0 @@
- Fixed turn completion instructions being lost when `LLMMessagesUpdateFrame` replaces the LLM context. When `filter_incomplete_user_turns` is enabled, the turn completion system message is now re-injected after context replacement.
@@ -108,6 +108,10 @@ KRISP_VIVA_API_KEY=...
KRISP_VIVA_FILTER_MODEL_PATH=...
KRISP_VIVA_TURN_MODEL_PATH=...

# LemonSlice
LEMONSLICE_API_KEY=...
LEMONSLICE_AGENT_ID=...

# LiveKit
LIVEKIT_API_KEY=...
LIVEKIT_API_SECRET=...
@@ -23,8 +23,8 @@ from pipecat.processors.aggregators.llm_response_universal import (
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.aws.llm import AWSBedrockLLMService
|
||||
from pipecat.services.deepgram.stt_sagemaker import DeepgramSageMakerSTTService
|
||||
from pipecat.services.deepgram.tts_sagemaker import DeepgramSageMakerTTSService
|
||||
from pipecat.services.deepgram.sagemaker.stt import DeepgramSageMakerSTTService
|
||||
from pipecat.services.deepgram.sagemaker.tts import DeepgramSageMakerTTSService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
@@ -0,0 +1,179 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#

import os

from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
from pipecat.turns.user_turn_strategies import ExternalUserTurnStrategies

load_dotenv(override=True)


# We use lambdas to defer transport parameter creation until the transport
# type is selected at runtime.
transport_params = {
    "daily": lambda: DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
    "twilio": lambda: FastAPIWebsocketParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
    "webrtc": lambda: TransportParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
    ),
}


async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    """AssemblyAI u3-rt-pro with Built-in Turn Detection

    This example demonstrates using AssemblyAI's u3-rt-pro Speech-to-Text model
    with AssemblyAI's built-in turn detection for more natural conversation flow.

    Key features:

    1. AssemblyAI Turn Detection
       - Set `vad_force_turn_endpoint=False` to use AssemblyAI's built-in turn detection
       - AssemblyAI's model determines when user starts/stops speaking
       - Uses `ExternalUserTurnStrategies` to delegate turn control to AssemblyAI
       - More natural turn detection based on speech patterns and pauses

    2. Advanced Turn Detection Tuning
       - `min_turn_silence`: Minimum silence (ms) when confident about end-of-turn.
         Lower values = faster responses. Default: 100ms
       - `max_turn_silence`: Maximum silence (ms) before forcing end-of-turn.
         Prevents long pauses. Default: 1000ms

    3. Prompt-Based Transcription Enhancement
       - Use `prompt` parameter to improve accuracy for specific names/terms
       - Particularly useful for proper nouns, technical terms, domain vocabulary
       - Example: "Names: Xiomara, Saoirse, Krzystof. Technical terms: API, OAuth."

    4. Speaker Diarization (Optional)
       - Enable with `speaker_labels=True`
       - Automatically identifies different speakers in multi-party conversations
       - TranscriptionFrame includes speaker_id field (e.g., "Speaker A", "Speaker B")

    5. Language Detection (Optional, multilingual model only)
       - Enable with `language_detection=True`
       - Automatically detects spoken language
       - Available with universal-streaming-multilingual model

    For more information: https://www.assemblyai.com/docs/speech-to-text/streaming
    """
    logger.info(f"Starting bot")

    stt = AssemblyAISTTService(
        api_key=os.getenv("ASSEMBLYAI_API_KEY"),
        vad_force_turn_endpoint=False,  # Use AssemblyAI's built-in turn detection
        connection_params=AssemblyAIConnectionParams(
            speech_model="u3-rt-pro",
            # Optional: Tune turn detection timing (defaults shown below)
            # min_turn_silence=100,  # Default
            # max_turn_silence=1000,  # Default
            # Optional: Boost accuracy for specific names/terms
            # prompt="Names: Xiomara, Saoirse, Krzystof. Technical terms: API, OAuth.",
            # Optional: Enable speaker diarization
            # speaker_labels=True,
        ),
    )

    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="71a7ad14-091c-4e8e-a314-022ece01c121",  # British Reading Lady
    )

    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))

    messages = [
        {
            "role": "system",
            "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
        },
    ]

    context = LLMContext(messages)
    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
        context,
        user_params=LLMUserAggregatorParams(
            user_turn_strategies=ExternalUserTurnStrategies(),
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    pipeline = Pipeline(
        [
            transport.input(),  # Transport user input
            stt,  # STT
            user_aggregator,  # User responses
            llm,  # LLM
            tts,  # TTS
            transport.output(),  # Transport bot output
            assistant_aggregator,  # Assistant spoken responses
        ]
    )

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            enable_metrics=True,
            enable_usage_metrics=True,
        ),
        idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
    )

    @transport.event_handler("on_client_connected")
    async def on_client_connected(transport, client):
        logger.info(f"Client connected")
        # Kick off the conversation.
        messages.append({"role": "system", "content": "Please introduce yourself to the user."})
        await task.queue_frames([LLMRunFrame()])

    @transport.event_handler("on_client_disconnected")
    async def on_client_disconnected(transport, client):
        logger.info(f"Client disconnected")
        await task.cancel()

    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)

    await runner.run(task)


async def bot(runner_args: RunnerArguments):
    """Main bot entry point compatible with Pipecat Cloud."""
    transport = await create_transport(runner_args, transport_params)
    await run_bot(transport, runner_args)


if __name__ == "__main__":
    from pipecat.runner.run import main

    main()
@@ -55,7 +55,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
stt = NvidiaSTTService(api_key=os.getenv("NVIDIA_API_KEY"))
|
||||
|
||||
llm = NvidiaLLMService(
|
||||
api_key=os.getenv("NVIDIA_API_KEY"), model="meta/llama-3.1-405b-instruct"
|
||||
api_key=os.getenv("NVIDIA_API_KEY"),
|
||||
model="meta/llama-3.3-70b-instruct",
|
||||
)
|
||||
|
||||
tts = NvidiaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
|
||||
|
||||
@@ -16,6 +16,7 @@ from pipecat.pipeline.task import PipelineTask
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
|
||||
from pipecat.services.assemblyai.stt import AssemblyAISTTService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
@@ -49,6 +50,9 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
|
||||
stt = AssemblyAISTTService(
|
||||
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
|
||||
connection_params=AssemblyAIConnectionParams(
|
||||
speech_model="u3-rt-pro",
|
||||
),
|
||||
)
|
||||
|
||||
tl = TranscriptionLogger()
|
||||
|
||||
@@ -5,13 +5,17 @@
|
||||
#
|
||||
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.schemas.function_schema import FunctionSchema
|
||||
from pipecat.adapters.schemas.tools_schema import ToolsSchema
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.frames.frames import LLMRunFrame
|
||||
from pipecat.observers.startup_timing_observer import StartupTimingObserver
|
||||
from pipecat.observers.user_bot_latency_observer import UserBotLatencyObserver
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
@@ -25,6 +29,7 @@ from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
@@ -32,6 +37,17 @@ from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
async def fetch_weather_from_api(params: FunctionCallParams):
|
||||
await asyncio.sleep(0.25)
|
||||
await params.result_callback({"conditions": "nice", "temperature": "75"})
|
||||
|
||||
|
||||
async def fetch_restaurant_recommendation(params: FunctionCallParams):
|
||||
await asyncio.sleep(0.1)
|
||||
await params.result_callback({"name": "The Golden Dragon"})
|
||||
|
||||
|
||||
# We use lambdas to defer transport parameter creation until the transport
|
||||
# type is selected at runtime.
|
||||
transport_params = {
|
||||
@@ -62,6 +78,38 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
|
||||
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
|
||||
|
||||
llm.register_function("get_current_weather", fetch_weather_from_api)
|
||||
llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
|
||||
|
||||
weather_function = FunctionSchema(
|
||||
name="get_current_weather",
|
||||
description="Get the current weather",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["celsius", "fahrenheit"],
|
||||
"description": "The temperature unit to use. Infer this from the user's location.",
|
||||
},
|
||||
},
|
||||
required=["location", "format"],
|
||||
)
|
||||
restaurant_function = FunctionSchema(
|
||||
name="get_restaurant_recommendation",
|
||||
description="Get a restaurant recommendation",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
},
|
||||
required=["location"],
|
||||
)
|
||||
tools = ToolsSchema(standard_tools=[weather_function, restaurant_function])
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
@@ -69,7 +117,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
},
|
||||
]
|
||||
|
||||
context = LLMContext(messages)
|
||||
context = LLMContext(messages, tools)
|
||||
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
|
||||
@@ -87,8 +135,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
]
|
||||
)
|
||||
|
||||
# Create latency tracking observer
|
||||
latency_observer = UserBotLatencyObserver()
|
||||
startup_observer = StartupTimingObserver()
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
@@ -97,14 +145,29 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
observers=[latency_observer],
|
||||
observers=[latency_observer, startup_observer],
|
||||
)
|
||||
|
||||
# Log latency measurements using the event handler
|
||||
@latency_observer.event_handler("on_first_bot_speech_latency")
|
||||
async def on_first_bot_speech_latency(observer, latency_seconds):
|
||||
logger.info(f"First bot speech: {latency_seconds:.3f}s after client connected")
|
||||
|
||||
@latency_observer.event_handler("on_latency_measured")
|
||||
async def on_latency_measured(observer, latency_seconds):
|
||||
logger.info(f"⏱️ User-to-bot latency: {latency_seconds:.3f}s")
|
||||
|
||||
@startup_observer.event_handler("on_startup_timing_report")
|
||||
async def on_startup_timing_report(observer, report):
|
||||
logger.info(f"Total startup: {report.total_duration_secs:.3f}s")
|
||||
for timing in report.processor_timings:
|
||||
logger.info(f" {timing.processor_name}: {timing.duration_secs:.3f}s")
|
||||
|
||||
@startup_observer.event_handler("on_transport_timing_report")
|
||||
async def on_transport_timing_report(observer, report):
|
||||
if report.bot_connected_secs is not None:
|
||||
logger.info(f"Bot connected: {report.bot_connected_secs:.3f}s")
|
||||
logger.info(f"Client connected: {report.client_connected_secs:.3f}s")
|
||||
|
||||
turn_observer = task.turn_tracking_observer
|
||||
if turn_observer:
|
||||
|
||||
@@ -119,6 +182,11 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
else:
|
||||
logger.info(f"🏁 Turn {turn_number} completed in {duration:.2f}s")
|
||||
|
||||
@latency_observer.event_handler("on_latency_breakdown")
|
||||
async def on_latency_breakdown(observer, breakdown):
|
||||
for event in breakdown.chronological_events():
|
||||
logger.info(f" {event}")
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
|
||||
@@ -24,7 +24,7 @@ from pipecat.processors.aggregators.llm_response_universal import (
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
from pipecat.services.deepgram.stt_sagemaker import (
|
||||
from pipecat.services.deepgram.sagemaker.stt import (
|
||||
DeepgramSageMakerSTTService,
|
||||
DeepgramSageMakerSTTSettings,
|
||||
)
|
||||
|
||||
@@ -22,10 +22,10 @@ from pipecat.processors.aggregators.llm_response_universal import (
|
||||
)
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
|
||||
from pipecat.services.assemblyai.stt import AssemblyAISTTService, AssemblyAISTTSettings
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transcriptions.language import Language
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
@@ -51,7 +51,12 @@ transport_params = {
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
stt = AssemblyAISTTService(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
|
||||
stt = AssemblyAISTTService(
|
||||
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
|
||||
connection_params=AssemblyAIConnectionParams(
|
||||
speech_model="u3-rt-pro",
|
||||
),
|
||||
)
|
||||
|
||||
tts = CartesiaTTSService(
|
||||
api_key=os.getenv("CARTESIA_API_KEY"),
|
||||
@@ -63,7 +68,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
|
||||
"content": "You are a helpful LLM in a WebRTC call demonstrating dynamic keyterms updates. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Try saying difficult names like 'Xiomara', 'Saoirse', or 'Krzystof' to test transcription accuracy.",
|
||||
},
|
||||
]
|
||||
|
||||
@@ -97,14 +102,24 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
logger.info(
|
||||
"Phase 1: No keyterms boosting - try saying 'Xiomara', 'Saoirse', or 'Krzystof'"
|
||||
)
|
||||
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
await asyncio.sleep(10)
|
||||
logger.info("Updating AssemblyAI STT settings: language=es")
|
||||
await asyncio.sleep(15)
|
||||
logger.info("🔄 Updating keyterms: Adding difficult names for boosting")
|
||||
await task.queue_frame(
|
||||
STTUpdateSettingsFrame(delta=AssemblyAISTTSettings(language=Language.ES))
|
||||
STTUpdateSettingsFrame(
|
||||
delta=AssemblyAISTTSettings(
|
||||
connection_params=AssemblyAIConnectionParams(
|
||||
keyterms_prompt=["Xiomara", "Saoirse", "Krzystof", "Nguyen", "Pipecat"]
|
||||
)
|
||||
)
|
||||
)
|
||||
)
|
||||
logger.info("Phase 2: Keyterms active - same names should transcribe better now!")
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
|
||||
@@ -22,11 +22,11 @@ from pipecat.processors.aggregators.llm_response_universal import (
|
||||
)
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.deepgram.tts_sagemaker import (
|
||||
from pipecat.services.deepgram.sagemaker.tts import (
|
||||
DeepgramSageMakerTTSService,
|
||||
DeepgramSageMakerTTSSettings,
|
||||
)
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
|
||||
123
examples/foundational/56-lemonslice-transport.py
Normal file
@@ -0,0 +1,123 @@
#
# Copyright (c) 2024-2026, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#

import asyncio
import os
import sys

import aiohttp
from dotenv import load_dotenv
from loguru import logger

from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
    LLMContextAggregatorPair,
    LLMUserAggregatorParams,
)
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
from pipecat.services.groq.llm import GroqLLMService
from pipecat.transports.lemonslice.transport import (
    LemonSliceNewSessionRequest,
    LemonSliceParams,
    LemonSliceTransport,
)

load_dotenv(override=True)

logger.remove(0)
logger.add(sys.stderr, level="DEBUG")


async def main():
    async with aiohttp.ClientSession() as session:
        transport = LemonSliceTransport(
            bot_name="Pipecat",
            api_key=os.getenv("LEMONSLICE_API_KEY"),
            session=session,
            session_request=LemonSliceNewSessionRequest(
                agent_id=os.getenv("LEMONSLICE_AGENT_ID"),
            ),
            params=LemonSliceParams(
                audio_in_enabled=True,
                audio_out_enabled=True,
                microphone_out_enabled=False,
            ),
        )

        stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

        llm = GroqLLMService(api_key=os.getenv("GROQ_API_KEY"))

        tts = ElevenLabsTTSService(
            api_key=os.getenv("ELEVENLABS_API_KEY", ""),
            voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
        )

        messages = [
            {
                "role": "system",
                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
            },
        ]

        context = LLMContext(messages)
        user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
            context,
            user_params=LLMUserAggregatorParams(vad_analyzer=SileroVADAnalyzer()),
        )

        pipeline = Pipeline(
            [
                transport.input(),  # Transport user input
                stt,  # STT
                user_aggregator,  # User responses
                llm,  # LLM
                tts,  # TTS
                transport.output(),  # Transport bot output
                assistant_aggregator,  # Assistant spoken responses
            ]
        )

        task = PipelineTask(
            pipeline,
            params=PipelineParams(
                audio_in_sample_rate=16000,
                audio_out_sample_rate=16000,
                enable_metrics=True,
                enable_usage_metrics=True,
            ),
        )

        @transport.event_handler("on_client_connected")
        async def on_client_connected(transport, participant):
            logger.info("Client connected")
            # Kick off the conversation.
            messages.append(
                {
                    "role": "system",
                    "content": "Start by greeting the user and ask how you can help.",
                }
            )
            await task.queue_frames([LLMRunFrame()])

        @transport.event_handler("on_client_disconnected")
        async def on_client_disconnected(transport, participant):
            logger.info("Client disconnected")
            await task.cancel()

        runner = PipelineRunner()

        await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
@@ -121,6 +121,7 @@ uv run 07-interruptible.py -t twilio -x NGROK_HOST_NAME
|
||||
- **[19-openai-realtime-beta.py](./19-openai-realtime-beta.py)**: OpenAI Speech-to-Speech (Direct S2S, Function calls)
|
||||
- **[21-tavus-layer-tavus-transport.py](./21-tavus-layer-tavus-transport.py)**: Tavus digital twin (Avatar integration)
|
||||
- **[27-simli-layer.py](./27-simli-layer.py)**: Simli avatar integration (Video synchronization)
|
||||
- **[56-lemonslice-transport.py](./56-lemonslice-transport.py)**: LemonSlice avatar integration (A/V Synced Avatar integration)
|
||||
|
||||
### Performance & Optimization
|
||||
|
||||
|
||||
@@ -82,6 +82,7 @@ koala = [ "pvkoala~=2.0.3" ]
|
||||
kokoro = [ "kokoro-onnx>=0.5.0,<1", "requests>=2.32.5,<3" ]
|
||||
krisp = [ "pipecat-ai-krisp~=0.4.0" ]
|
||||
langchain = [ "langchain~=0.3.20", "langchain-community~=0.3.20", "langchain-openai~=0.3.9" ]
|
||||
lemonslice = [ "pipecat-ai[daily]" ]
|
||||
livekit = [ "livekit~=1.0.13", "livekit-api~=1.0.5", "tenacity>=8.2.3,<10.0.0", "pyjwt>=2.10.1" ]
|
||||
lmnt = [ "pipecat-ai[websockets-base]" ]
|
||||
local = [ "pyaudio~=0.2.14" ]
|
||||
|
||||
@@ -368,7 +368,7 @@ class ClassificationProcessor(FrameProcessor):
|
||||
await self._voicemail_notifier.notify() # Clear buffered TTS frames
|
||||
|
||||
# Interrupt the current pipeline to stop any ongoing processing
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
# Set the voicemail event to trigger the voicemail handler
|
||||
self._voicemail_event.clear()
|
||||
|
||||
@@ -11,7 +11,6 @@ including data frames, system frames, and control frames for audio, video, text,
|
||||
and LLM processing.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from typing import (
|
||||
@@ -1141,24 +1140,9 @@ class InterruptionFrame(SystemFrame):
|
||||
This frame is used to interrupt the pipeline. For example, when a user
|
||||
starts speaking to cancel any in-progress bot output. It can also be pushed
|
||||
by any processor.
|
||||
|
||||
Parameters:
|
||||
event: Optional event set when the frame has fully traversed the
|
||||
pipeline.
|
||||
|
||||
"""
|
||||
|
||||
event: Optional[asyncio.Event] = None
|
||||
|
||||
def complete(self):
|
||||
"""Signal that this interruption has been fully processed.
|
||||
|
||||
Called automatically when the frame reaches the pipeline sink, or
|
||||
manually when the frame is consumed before reaching it (e.g. when
|
||||
the user is muted).
|
||||
"""
|
||||
if self.event:
|
||||
self.event.set()
|
||||
pass
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -1825,16 +1809,11 @@ class InterruptionTaskFrame(TaskFrame):
     """Frame indicating the pipeline should be interrupted.
 
     This frame should be pushed upstream to indicate the pipeline should be
-    interrupted. The pipeline task converts this into an `InterruptionFrame` and
-    sends it downstream. The `event` is passed to the `InterruptionFrame` so it
-    can signal when the interruption has fully traversed the pipeline.
-
-    Parameters:
-        event: Optional event passed to the corresponding `InterruptionFrame`.
-
+    interrupted. The pipeline task converts this into an `InterruptionFrame`
+    and sends it downstream.
     """
 
-    event: Optional[asyncio.Event] = None
+    pass
 
 
 @dataclass
@@ -1910,6 +1889,29 @@ class StopFrame(ControlFrame, UninterruptibleFrame):
     pass
 
 
+@dataclass
+class BotConnectedFrame(SystemFrame):
+    """Frame indicating the bot has connected to the transport service.
+
+    Pushed downstream by SFU transports (Daily, LiveKit, HeyGen, Tavus)
+    when the bot successfully joins the room. Non-SFU transports do not
+    emit this frame.
+    """
+
+    pass
+
+
+@dataclass
+class ClientConnectedFrame(SystemFrame):
+    """Frame indicating that a client has connected to the transport.
+
+    Pushed downstream by the input transport when a client (participant)
+    connects. Used by observers to measure transport readiness timing.
+    """
+
+    pass
+
+
 @dataclass
 class OutputTransportReadyFrame(ControlFrame):
     """Frame indicating that the output transport is ready.
@@ -41,10 +41,6 @@ class TTFBMetricsData(MetricsData):
 class ProcessingMetricsData(MetricsData):
     """General processing time metrics data.
 
-    .. deprecated:: 0.0.104
-        Processing metrics are deprecated and will be removed in a future version.
-        Use TTFB metrics instead.
-
     Parameters:
         value: Processing time measurement in seconds.
     """
@@ -100,3 +100,11 @@ class BaseObserver(BaseObject):
             data: The event data containing details about the frame transfer.
         """
         pass
+
+    async def on_pipeline_started(self):
+        """Called when the pipeline has fully started.
+
+        Fired after the ``StartFrame`` has been processed by all processors
+        in the pipeline, including nested ``ParallelPipeline`` branches.
+        """
+        pass
328
src/pipecat/observers/startup_timing_observer.py
Normal file
@@ -0,0 +1,328 @@
+#
+# Copyright (c) 2024-2026, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+"""Observer for tracking pipeline startup timing.
+
+This module provides an observer that measures how long each processor's
+``start()`` method takes during pipeline startup. It works by tracking
+when a ``StartFrame`` arrives at a processor (``on_process_frame``) versus
+when it leaves (``on_push_frame``), giving the exact ``start()`` duration
+for each processor in the pipeline.
+
+It also measures transport timing — the time from ``StartFrame`` to the
+first ``BotConnectedFrame`` (SFU transports only) and ``ClientConnectedFrame``
+— via a separate ``on_transport_timing_report`` event.
+
+Example::
+
+    observer = StartupTimingObserver()
+
+    @observer.event_handler("on_startup_timing_report")
+    async def on_report(observer, report):
+        for t in report.processor_timings:
+            print(f"{t.processor_name}: {t.duration_secs:.3f}s")
+
+    @observer.event_handler("on_transport_timing_report")
+    async def on_transport(observer, report):
+        if report.bot_connected_secs is not None:
+            print(f"Bot connected in {report.bot_connected_secs:.3f}s")
+        print(f"Client connected in {report.client_connected_secs:.3f}s")
+
+    task = PipelineTask(pipeline, observers=[observer])
+"""
+
+import time
+from dataclasses import dataclass
+from typing import Dict, List, Optional, Tuple, Type
+
+from pydantic import BaseModel, Field
+
+from pipecat.frames.frames import BotConnectedFrame, ClientConnectedFrame, StartFrame
+from pipecat.observers.base_observer import BaseObserver, FrameProcessed, FramePushed
+from pipecat.pipeline.base_pipeline import BasePipeline
+from pipecat.pipeline.pipeline import PipelineSource
+from pipecat.processors.frame_processor import FrameProcessor
+
+# Internal pipeline types excluded from tracking by default.
+_INTERNAL_TYPES = (PipelineSource, BasePipeline)
+
+
+@dataclass
+class _ArrivalInfo:
+    """Internal record of when a StartFrame arrived at a processor."""
+
+    processor: FrameProcessor
+    arrival_ts_ns: int
+
+
+class ProcessorStartupTiming(BaseModel):
+    """Startup timing for a single processor.
+
+    Parameters:
+        processor_name: The name of the processor.
+        start_offset_secs: Offset in seconds from the StartFrame to when this
+            processor's start() began.
+        duration_secs: How long the processor's start() took, in seconds.
+    """
+
+    processor_name: str
+    start_offset_secs: float
+    duration_secs: float
+
+
+class StartupTimingReport(BaseModel):
+    """Report of startup timings for all measured processors.
+
+    Parameters:
+        start_time: Unix timestamp when the first processor began starting.
+        total_duration_secs: Total wall-clock time from first to last processor start.
+        processor_timings: Per-processor timing data, in pipeline order.
+    """
+
+    start_time: float
+    total_duration_secs: float
+    processor_timings: List[ProcessorStartupTiming] = Field(default_factory=list)
+
+
+class TransportTimingReport(BaseModel):
+    """Time from pipeline start to transport connection milestones.
+
+    Parameters:
+        start_time: Unix timestamp of the StartFrame (pipeline start).
+        bot_connected_secs: Seconds from StartFrame to first BotConnectedFrame
+            (only set for SFU transports).
+        client_connected_secs: Seconds from StartFrame to first ClientConnectedFrame.
+    """
+
+    start_time: float
+    bot_connected_secs: Optional[float] = None
+    client_connected_secs: Optional[float] = None
+
+
+class StartupTimingObserver(BaseObserver):
+    """Observer that measures processor startup times during pipeline initialization.
+
+    Tracks how long each processor's ``start()`` method takes by measuring the
+    time between when a ``StartFrame`` arrives at a processor and when it is
+    pushed downstream. This captures WebSocket connections, API authentication,
+    model loading, and other initialization work.
+
+    Also measures transport timing, the time from ``StartFrame`` to connection
+    milestones:
+
+    - ``bot_connected_secs``: When the bot joins the transport room
+      (SFU transports only, triggered by ``BotConnectedFrame``).
+    - ``client_connected_secs``: When a remote participant connects
+      (triggered by ``ClientConnectedFrame``).
+
+    By default, internal pipeline processors (``PipelineSource``, ``Pipeline``)
+    are excluded from the report. Pass ``processor_types`` to measure only
+    specific types.
+
+    Event handlers available:
+
+    - on_startup_timing_report: Called once after startup completes with the full
+      timing report.
+    - on_transport_timing_report: Called once when the first client connects with a
+      TransportTimingReport containing client_connected_secs and bot_connected_secs
+      (if available).
+
+    Example::
+
+        observer = StartupTimingObserver(
+            processor_types=(STTService, TTSService)
+        )
+
+        @observer.event_handler("on_startup_timing_report")
+        async def on_report(observer, report):
+            for t in report.processor_timings:
+                logger.info(f"{t.processor_name}: {t.duration_secs:.3f}s")
+
+        @observer.event_handler("on_transport_timing_report")
+        async def on_transport(observer, report):
+            if report.bot_connected_secs is not None:
+                logger.info(f"Bot connected in {report.bot_connected_secs:.3f}s")
+            logger.info(f"Client connected in {report.client_connected_secs:.3f}s")
+
+        task = PipelineTask(pipeline, observers=[observer])
+
+    Args:
+        processor_types: Optional tuple of processor types to measure. If None,
+            all non-internal processors are measured.
+    """
+
+    def __init__(
+        self,
+        *,
+        processor_types: Optional[Tuple[Type[FrameProcessor], ...]] = None,
+        **kwargs,
+    ):
+        """Initialize the startup timing observer.
+
+        Args:
+            processor_types: Optional tuple of processor types to measure.
+                If None, all non-internal processors are measured.
+            **kwargs: Additional arguments passed to parent class.
+        """
+        super().__init__(**kwargs)
+        self._processor_types = processor_types
+
+        # Map processor ID -> arrival info.
+        self._arrivals: Dict[int, _ArrivalInfo] = {}
+
+        # Collected timings in pipeline order.
+        self._timings: List[ProcessorStartupTiming] = []
+
+        # Lock onto the first StartFrame we see (by frame ID).
+        self._start_frame_id: Optional[str] = None
+
+        # Whether we've already emitted the startup timing report.
+        self._startup_timing_reported = False
+
+        # Whether we've already measured transport timing.
+        self._transport_timing_reported = False
+
+        # Timestamp (ns) when we first see a StartFrame arrive at a processor.
+        self._start_frame_arrival_ns: Optional[int] = None
+
+        # Bot connected timing (stored for inclusion in the transport report).
+        self._bot_connected_secs: Optional[float] = None
+
+        # Wall clock time when the StartFrame was first seen.
+        self._start_wall_clock: Optional[float] = None
+
+        self._register_event_handler("on_startup_timing_report")
+        self._register_event_handler("on_transport_timing_report")
+
+    def _should_track(self, processor: FrameProcessor) -> bool:
+        """Check if a processor should be tracked for timing.
+
+        Args:
+            processor: The processor to check.
+
+        Returns:
+            True if the processor matches the filter or no filter is set.
+        """
+        if self._processor_types is not None:
+            return isinstance(processor, self._processor_types)
+        # Default: exclude internal pipeline plumbing.
+        return not isinstance(processor, _INTERNAL_TYPES)
+
+    async def on_pipeline_started(self):
+        """Emit the startup timing report when the pipeline has fully started.
+
+        Called by the ``PipelineTask`` after the ``StartFrame`` has been
+        processed by all processors, including nested ``ParallelPipeline``
+        branches.
+        """
+        if self._timings:
+            await self._emit_report()
+
+    async def on_process_frame(self, data: FrameProcessed):
+        """Record when a StartFrame arrives at a processor.
+
+        Args:
+            data: The frame processing event data.
+        """
+        if self._startup_timing_reported:
+            return
+
+        if not isinstance(data.frame, StartFrame):
+            return
+
+        # Lock onto the first StartFrame.
+        if self._start_frame_id is None:
+            self._start_frame_id = data.frame.id
+            self._start_frame_arrival_ns = data.timestamp
+            self._start_wall_clock = time.time()
+        elif data.frame.id != self._start_frame_id:
+            return
+
+        if self._should_track(data.processor):
+            self._arrivals[data.processor.id] = _ArrivalInfo(
+                processor=data.processor, arrival_ts_ns=data.timestamp
+            )
+
+    async def on_push_frame(self, data: FramePushed):
+        """Record when a StartFrame leaves a processor and compute the delta.
+
+        Also handles ``BotConnectedFrame`` and ``ClientConnectedFrame`` to
+        measure transport timing.
+
+        Args:
+            data: The frame push event data.
+        """
+        if isinstance(data.frame, BotConnectedFrame):
+            self._handle_bot_connected(data)
+            return
+
+        if isinstance(data.frame, ClientConnectedFrame):
+            await self._handle_client_connected(data)
+            return
+
+        if self._startup_timing_reported:
+            return
+
+        if not isinstance(data.frame, StartFrame):
+            return
+
+        if self._start_frame_id is not None and data.frame.id != self._start_frame_id:
+            return
+
+        arrival = self._arrivals.pop(data.source.id, None)
+        if arrival is None:
+            return
+
+        duration_ns = data.timestamp - arrival.arrival_ts_ns
+        duration_secs = duration_ns / 1e9
+        start_offset_secs = (arrival.arrival_ts_ns - self._start_frame_arrival_ns) / 1e9
+
+        self._timings.append(
+            ProcessorStartupTiming(
+                processor_name=arrival.processor.name,
+                start_offset_secs=start_offset_secs,
+                duration_secs=duration_secs,
+            )
+        )
+
+    def _handle_bot_connected(self, data: FramePushed):
+        """Record bot connected timing on first BotConnectedFrame."""
+        if self._bot_connected_secs is not None or self._start_frame_arrival_ns is None:
+            return
+
+        delta_ns = data.timestamp - self._start_frame_arrival_ns
+        self._bot_connected_secs = delta_ns / 1e9
+
+    async def _handle_client_connected(self, data: FramePushed):
+        """Emit transport timing report on first ClientConnectedFrame."""
+        if self._transport_timing_reported or self._start_frame_arrival_ns is None:
+            return
+
+        self._transport_timing_reported = True
+        delta_ns = data.timestamp - self._start_frame_arrival_ns
+        client_connected_secs = delta_ns / 1e9
+        report = TransportTimingReport(
+            start_time=self._start_wall_clock or 0.0,
+            bot_connected_secs=self._bot_connected_secs,
+            client_connected_secs=client_connected_secs,
+        )
+        await self._call_event_handler("on_transport_timing_report", report)
+
+    async def _emit_report(self):
+        """Build and emit the startup timing report."""
+        if self._startup_timing_reported:
+            return
+        self._startup_timing_reported = True
+
+        total = sum(t.duration_secs for t in self._timings)
+
+        report = StartupTimingReport(
+            start_time=self._start_wall_clock or 0.0,
+            total_duration_secs=total,
+            processor_timings=self._timings,
+        )
+
+        await self._call_event_handler("on_startup_timing_report", report)
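The arrival/departure pairing used by the observer above can be sketched without any pipecat dependency. This is a minimal illustration, not the observer itself: `StageTiming` and `StartupTimer` are hypothetical names introduced here, and the timestamps are hand-written nanosecond values rather than values taken from real `FrameProcessed` / `FramePushed` events.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StageTiming:
    """Hypothetical stand-in for a per-processor timing record."""

    name: str
    start_offset_secs: float
    duration_secs: float


class StartupTimer:
    """Pairs an "arrived" timestamp with a "left" timestamp for each stage."""

    def __init__(self) -> None:
        self._first_arrival_ns: Optional[int] = None
        self._arrivals: Dict[str, int] = {}
        self.timings: List[StageTiming] = []

    def arrived(self, name: str, ts_ns: int) -> None:
        # The first arrival anchors every later start offset.
        if self._first_arrival_ns is None:
            self._first_arrival_ns = ts_ns
        self._arrivals[name] = ts_ns

    def left(self, name: str, ts_ns: int) -> None:
        arrival = self._arrivals.pop(name, None)
        if arrival is None or self._first_arrival_ns is None:
            return  # Stage was never tracked; ignore it.
        self.timings.append(
            StageTiming(
                name=name,
                start_offset_secs=(arrival - self._first_arrival_ns) / 1e9,
                duration_secs=(ts_ns - arrival) / 1e9,
            )
        )


timer = StartupTimer()
timer.arrived("stt", 1_000_000_000)
timer.left("stt", 1_250_000_000)
timer.arrived("tts", 1_250_000_000)
timer.left("tts", 1_750_000_000)
total_secs = sum(t.duration_secs for t in timer.timings)
```

Summing the per-stage durations, as in `total_secs`, equals the wall-clock span only when stages start strictly one after another; with parallel branches the sum can exceed the elapsed time.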
@@ -1,22 +1,146 @@
+#
+# Copyright (c) 2024-2026, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
 """Observer for tracking user-to-bot response latency.
 
 This module provides an observer that monitors the time between when a user
 stops speaking and when the bot starts speaking, emitting events when latency
-is measured.
+is measured. Optionally collects per-service latency breakdown metrics
+(TTFB, text aggregation) when ``enable_metrics=True``.
 """
 
 import time
-from typing import Optional, Set
+from collections import deque
+from typing import Dict, List, Optional
 
+from pydantic import BaseModel, Field
+
 from pipecat.frames.frames import (
     BotStartedSpeakingFrame,
+    ClientConnectedFrame,
+    FunctionCallInProgressFrame,
+    FunctionCallResultFrame,
+    InterruptionFrame,
+    MetricsFrame,
+    UserStoppedSpeakingFrame,
     VADUserStartedSpeakingFrame,
     VADUserStoppedSpeakingFrame,
 )
+from pipecat.metrics.metrics import (
+    TextAggregationMetricsData,
+    TTFBMetricsData,
+)
 from pipecat.observers.base_observer import BaseObserver, FramePushed
 from pipecat.processors.frame_processor import FrameDirection
 
 
+class TTFBBreakdownMetrics(BaseModel):
+    """TTFB measurement with timestamp for timeline placement.
+
+    Parameters:
+        processor: Name of the processor that reported the TTFB.
+        model: Optional model name associated with the metric.
+        start_time: Unix timestamp when the TTFB measurement started.
+        duration_secs: TTFB duration in seconds.
+    """
+
+    processor: str
+    model: Optional[str] = None
+    start_time: float
+    duration_secs: float
+
+
+class TextAggregationBreakdownMetrics(BaseModel):
+    """Text aggregation measurement with timestamp for timeline placement.
+
+    Parameters:
+        processor: Name of the processor that reported the metric.
+        start_time: Unix timestamp when text aggregation started.
+        duration_secs: Aggregation duration in seconds.
+    """
+
+    processor: str
+    start_time: float
+    duration_secs: float
+
+
+class FunctionCallMetrics(BaseModel):
+    """Latency for a single function call execution.
+
+    Parameters:
+        function_name: Name of the function that was called.
+        start_time: Unix timestamp when execution started.
+        duration_secs: Time in seconds from execution start to result.
+    """
+
+    function_name: str
+    start_time: float
+    duration_secs: float
+
+
+class LatencyBreakdown(BaseModel):
+    """Per-service latency breakdown for a single user-to-bot cycle.
+
+    Collected between ``VADUserStoppedSpeakingFrame`` and
+    ``BotStartedSpeakingFrame`` when ``enable_metrics=True`` in
+    :class:`~pipecat.pipeline.task.PipelineParams`.
+
+    Parameters:
+        ttfb: Time-to-first-byte metrics from each service in the pipeline.
+        text_aggregation: First text aggregation measurement, representing
+            the latency cost of sentence aggregation in the TTS pipeline.
+        user_turn_start_time: Unix timestamp when the user turn started
+            (actual user silence, adjusted for VAD stop_secs). ``None`` if
+            no ``VADUserStoppedSpeakingFrame`` was observed.
+        user_turn_secs: Duration in seconds of the user's turn, measured
+            from when the user actually stopped speaking to when the turn
+            was released (``UserStoppedSpeakingFrame``). This includes
+            VAD silence detection, STT finalization, and any turn analyzer
+            wait. ``None`` if no ``UserStoppedSpeakingFrame`` was observed
+            (e.g. no turn analyzer configured).
+        function_calls: Latency for each function call executed during
+            this cycle. Empty if no function calls occurred.
+    """
+
+    ttfb: List[TTFBBreakdownMetrics] = Field(default_factory=list)
+    text_aggregation: Optional[TextAggregationBreakdownMetrics] = None
+    user_turn_start_time: Optional[float] = None
+    user_turn_secs: Optional[float] = None
+    function_calls: List[FunctionCallMetrics] = Field(default_factory=list)
+
+    def chronological_events(self) -> List[str]:
+        """Return human-readable event labels sorted by start time.
+
+        Collects all sub-metrics into a flat list, sorts by ``start_time``,
+        and returns formatted strings suitable for logging.
+
+        Returns:
+            List of formatted strings, one per event, in chronological order.
+        """
+        events: List[tuple] = []
+
+        if self.user_turn_start_time is not None and self.user_turn_secs is not None:
+            events.append((self.user_turn_start_time, f"User turn: {self.user_turn_secs:.3f}s"))
+
+        for t in self.ttfb:
+            events.append((t.start_time, f"{t.processor}: TTFB {t.duration_secs:.3f}s"))
+
+        for fc in self.function_calls:
+            events.append((fc.start_time, f"{fc.function_name}: {fc.duration_secs:.3f}s"))
+
+        if self.text_aggregation:
+            ta = self.text_aggregation
+            events.append(
+                (ta.start_time, f"{ta.processor}: text aggregation {ta.duration_secs:.3f}s")
+            )
+
+        events.sort(key=lambda e: e[0])
+        return [label for _, label in events]
+
+
 class UserBotLatencyObserver(BaseObserver):
     """Observer that tracks user-to-bot response latency.
 
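The ordering step in `chronological_events` can be tried in isolation. The sketch below is a simplified stand-in that uses plain tuples instead of the pydantic models; `chronological_labels` and the sample timestamps are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Tuple


def chronological_labels(
    user_turn: Optional[Tuple[float, float]],
    ttfb: List[Tuple[str, float, float]],
) -> List[str]:
    """Flatten (start_time, duration) measurements and sort them by start time."""
    events: List[Tuple[float, str]] = []
    if user_turn is not None:
        start, secs = user_turn
        events.append((start, f"User turn: {secs:.3f}s"))
    for name, start, secs in ttfb:
        events.append((start, f"{name}: TTFB {secs:.3f}s"))
    # Sort on the timestamp only, so labels never take part in comparisons.
    events.sort(key=lambda e: e[0])
    return [label for _, label in events]


labels = chronological_labels(
    user_turn=(100.0, 0.8),
    ttfb=[("tts", 101.5, 0.2), ("llm", 100.9, 0.45)],
)
```

Here `labels` comes back in start-time order (user turn, then `llm`, then `tts`) even though the TTFB entries were supplied out of order.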
@@ -25,34 +149,66 @@ class UserBotLatencyObserver(BaseObserver):
     latency is measured, allowing consumers to log, trace, or otherwise process
     the latency data.
 
+    When ``enable_metrics=True`` in pipeline params, also collects per-service
+    latency breakdown (TTFB, text aggregation) and emits an
+    ``on_latency_breakdown`` event alongside the existing latency measurement.
+
     This observer follows the composition pattern used by TurnTrackingObserver,
     acting as a reusable component for latency measurement.
 
     Events:
-        on_latency_measured(observer, latency_seconds): Emitted when user-to-bot
-            latency is calculated. Includes the latency value in seconds as a float.
+        on_latency_measured(observer, latency_seconds): Emitted when
+            time-to-first-bot-speech is calculated. Measures the time from
+            when the user stopped speaking to when the bot starts speaking.
+        on_latency_breakdown(observer, breakdown): Emitted at each
+            ``BotStartedSpeakingFrame`` with a :class:`LatencyBreakdown`
+            containing per-service metrics collected during the user→bot cycle.
+        on_first_bot_speech_latency(observer, latency_seconds): Emitted once,
+            the first time ``BotStartedSpeakingFrame`` arrives after
+            ``ClientConnectedFrame``. Measures the time from client connection
+            to the first bot speech.
     """
 
-    def __init__(self, **kwargs):
+    def __init__(self, *, max_frames=100, **kwargs):
         """Initialize the user-bot latency observer.
 
         Sets up tracking for processed frames and user speech timing
         to calculate response latencies.
 
         Args:
+            max_frames: Maximum number of frame IDs to keep in history for
+                duplicate detection. Defaults to 100.
             **kwargs: Additional arguments passed to parent class.
         """
         super().__init__(**kwargs)
         self._user_stopped_time: Optional[float] = None
-        self._processed_frames: Set[str] = set()
+        self._user_turn_start_time: Optional[float] = None
+        self._user_turn: Optional[float] = None
+
+        # First bot speech tracking
+        self._client_connected_time: Optional[float] = None
+        self._first_bot_speech_measured: bool = False
+
+        # Frame deduplication (bounded deque + set pattern)
+        self._processed_frames: set = set()
+        self._frame_history: deque = deque(maxlen=max_frames)
+
+        # Per-cycle metric accumulators
+        self._ttfb: List[TTFBBreakdownMetrics] = []
+        self._text_aggregation: Optional[TextAggregationBreakdownMetrics] = None
+        self._function_call_starts: Dict[str, tuple[str, float]] = {}
+        self._function_call_metrics: List[FunctionCallMetrics] = []
 
         self._register_event_handler("on_latency_measured")
+        self._register_event_handler("on_latency_breakdown")
+        self._register_event_handler("on_first_bot_speech_latency")
 
     async def on_push_frame(self, data: FramePushed):
         """Process frames to track speech timing and calculate latency.
 
         Tracks VAD events and bot speaking events to measure the time between
-        user stopping speech and bot starting speech.
+        user stopping speech and bot starting speech. Also accumulates metrics
+        from MetricsFrame for the latency breakdown.
 
         Args:
             data: Frame push event containing the frame and direction information.
@@ -61,23 +217,135 @@ class UserBotLatencyObserver(BaseObserver):
         if data.direction != FrameDirection.DOWNSTREAM:
             return
 
-        # Skip already processed frames
+        # Skip already processed frames (bounded deque + set)
         if data.frame.id in self._processed_frames:
             return
 
         self._processed_frames.add(data.frame.id)
+        self._frame_history.append(data.frame.id)
 
-        # Track VAD and bot speaking events for latency
+        if len(self._processed_frames) > len(self._frame_history):
+            self._processed_frames = set(self._frame_history)
+
+        # Track client connection (first occurrence only)
+        if isinstance(data.frame, ClientConnectedFrame):
+            if self._client_connected_time is None:
+                self._client_connected_time = time.time()
+            return
+
+        # Track speech and pipeline events for latency
         if isinstance(data.frame, VADUserStartedSpeakingFrame):
             # Reset when user starts speaking
             self._user_stopped_time = None
+            self._user_turn_start_time = None
+            self._user_turn = None
+            self._reset_accumulators()
+            # If user speaks before the bot's first speech, abandon the
+            # first-bot-speech measurement — it's only meaningful for greetings.
+            self._first_bot_speech_measured = True
         elif isinstance(data.frame, VADUserStoppedSpeakingFrame):
             # Record the actual time the user stopped speaking, which is
             # the VAD determination time minus the stop_secs silence duration
             # that had to elapse before the VAD confirmed speech ended.
             self._user_stopped_time = data.frame.timestamp - data.frame.stop_secs
-        elif isinstance(data.frame, BotStartedSpeakingFrame) and self._user_stopped_time:
-            # Calculate and emit latency
+            self._user_turn_start_time = self._user_stopped_time
+        elif isinstance(data.frame, UserStoppedSpeakingFrame):
+            # Measure the user turn duration: from actual user silence to
+            # turn release. Includes VAD silence detection, STT finalization,
+            # and any turn analyzer wait.
+            if self._user_stopped_time is not None:
+                self._user_turn = time.time() - self._user_stopped_time
+        elif isinstance(data.frame, InterruptionFrame):
+            # Discard stale metrics from cancelled LLM/TTS cycles
+            self._reset_accumulators()
+        elif isinstance(data.frame, FunctionCallInProgressFrame):
+            self._function_call_starts[data.frame.tool_call_id] = (
+                data.frame.function_name,
+                time.time(),
+            )
+        elif isinstance(data.frame, FunctionCallResultFrame):
+            start = self._function_call_starts.pop(data.frame.tool_call_id, None)
+            if start is not None:
+                function_name, start_time = start
+                self._function_call_metrics.append(
+                    FunctionCallMetrics(
+                        function_name=function_name,
+                        start_time=start_time,
+                        duration_secs=time.time() - start_time,
+                    )
+                )
+        elif isinstance(data.frame, MetricsFrame):
+            self._handle_metrics_frame(data.frame)
+        elif isinstance(data.frame, BotStartedSpeakingFrame):
+            await self._handle_bot_started_speaking()
+
+    async def _handle_bot_started_speaking(self):
+        """Handle BotStartedSpeakingFrame to emit latency and breakdown."""
+        emit_breakdown = False
+
+        # One-time first bot speech measurement (client connect → first speech)
+        if self._client_connected_time is not None and not self._first_bot_speech_measured:
+            self._first_bot_speech_measured = True
+            latency = time.time() - self._client_connected_time
+            await self._call_event_handler("on_first_bot_speech_latency", latency)
+            emit_breakdown = True
+
+        if self._user_stopped_time is not None:
             latency = time.time() - self._user_stopped_time
             self._user_stopped_time = None
             await self._call_event_handler("on_latency_measured", latency)
+            emit_breakdown = True
+
+        if emit_breakdown:
+            breakdown = LatencyBreakdown(
+                ttfb=list(self._ttfb),
+                text_aggregation=self._text_aggregation,
+                user_turn_start_time=self._user_turn_start_time,
+                user_turn_secs=self._user_turn,
+                function_calls=list(self._function_call_metrics),
+            )
+            await self._call_event_handler("on_latency_breakdown", breakdown)
+            self._reset_accumulators()
+
+    def _handle_metrics_frame(self, frame: MetricsFrame):
+        """Extract latency metrics from a MetricsFrame.
+
+        Accumulates metrics when a measurement is in progress: either a
+        user→bot cycle (after ``VADUserStoppedSpeakingFrame``) or the
+        first-bot-speech window (after ``ClientConnectedFrame``).
+        """
+        waiting_for_first_speech = (
+            self._client_connected_time is not None and not self._first_bot_speech_measured
+        )
+        if self._user_stopped_time is None and not waiting_for_first_speech:
+            return
+
+        now = time.time()
+        for metrics_data in frame.data:
+            if isinstance(metrics_data, TTFBMetricsData) and metrics_data.value > 0:
+                self._ttfb.append(
+                    TTFBBreakdownMetrics(
+                        processor=metrics_data.processor,
+                        model=metrics_data.model,
+                        start_time=now - metrics_data.value,
+                        duration_secs=metrics_data.value,
+                    )
+                )
+            elif isinstance(metrics_data, TextAggregationMetricsData):
+                # Only keep the first measurement — it's the one that
+                # impacts the initial speaking latency.
+                if self._text_aggregation is None:
+                    self._text_aggregation = TextAggregationBreakdownMetrics(
+                        processor=metrics_data.processor,
+                        start_time=now - metrics_data.value,
+                        duration_secs=metrics_data.value,
+                    )
+
+    def _reset_accumulators(self):
+        """Clear per-cycle metric accumulators."""
+        self._ttfb = []
+        self._text_aggregation = None
+        self._user_turn_start_time = None
+        self._user_turn = None
+        self._function_call_starts = {}
+        self._function_call_metrics = []
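The "bounded deque + set" deduplication used in `on_push_frame` can be sketched on its own. `BoundedSeen` is a hypothetical helper written for this illustration; it mirrors the add, append, and rebuild steps shown in the hunk but is not part of pipecat.

```python
from collections import deque


class BoundedSeen:
    """Remembers at most `max_items` recent IDs for duplicate detection."""

    def __init__(self, max_items: int = 100) -> None:
        self._seen: set = set()
        self._history: deque = deque(maxlen=max_items)

    def first_time(self, item_id: str) -> bool:
        """Return True the first time an ID is seen within the window."""
        if item_id in self._seen:
            return False
        self._seen.add(item_id)
        # A full deque silently drops its oldest entry on append.
        self._history.append(item_id)
        # If the deque evicted something, the set is now larger than the
        # deque; rebuild it so both structures hold the same IDs.
        if len(self._seen) > len(self._history):
            self._seen = set(self._history)
        return True


seen = BoundedSeen(max_items=2)
results = [seen.first_time(x) for x in ["a", "a", "b", "c", "a"]]
```

With a window of two, the final `"a"` is reported as new again because it was evicted when `"c"` arrived. That is the trade-off of bounding memory: duplicates older than the window are no longer detected.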
@@ -892,7 +892,7 @@ class PipelineTask(BasePipelineTask):
                 # pipeline. This is in case the push task is blocked waiting for a
                 # pipeline-ending frame to finish traversing the pipeline.
                 logger.debug(f"{self}: received interruption task frame {frame}")
-                await self._pipeline.queue_frame(InterruptionFrame(event=frame.event))
+                await self._pipeline.queue_frame(InterruptionFrame())
             elif isinstance(frame, ErrorFrame):
                 await self._call_event_handler("on_pipeline_error", frame)
                 if frame.fatal:
@@ -915,6 +915,7 @@ class PipelineTask(BasePipelineTask):
 
             if isinstance(frame, StartFrame):
                 await self._call_event_handler("on_pipeline_started", frame)
+                await self._observer.on_pipeline_started()
 
                 # Start heartbeat tasks now that StartFrame has been processed
                 # by all processors in the pipeline
@@ -931,8 +932,6 @@ class PipelineTask(BasePipelineTask):
                 self._pipeline_end_event.set()
             elif isinstance(frame, CancelFrame):
                 self._pipeline_end_event.set()
-            elif isinstance(frame, InterruptionFrame):
-                frame.complete()
             elif isinstance(frame, HeartbeatFrame):
                 await self._heartbeat_queue.put(frame)
 
@@ -39,6 +39,12 @@ class Proxy:
     observer: BaseObserver
 
 
+class _PipelineStartedSignal:
+    """Internal sentinel queued to observers when the pipeline has started."""
+
+    pass
+
+
 class TaskObserver(BaseObserver):
     """Proxy observer that manages multiple observers without blocking the pipeline.
 
@@ -129,6 +135,10 @@ class TaskObserver(BaseObserver):
         for proxy in self._proxies:
             await proxy.cleanup()
 
+    async def on_pipeline_started(self):
+        """Forward pipeline started signal to all managed observers."""
+        await self._send_to_proxy(_PipelineStartedSignal())
+
     async def on_process_frame(self, data: FrameProcessed):
         """Queue frame data for all managed observers.
 
@@ -186,7 +196,9 @@ class TaskObserver(BaseObserver):
|
||||
while True:
|
||||
data = await queue.get()
|
||||
|
||||
if isinstance(data, FramePushed):
|
||||
if isinstance(data, _PipelineStartedSignal):
|
||||
await observer.on_pipeline_started()
|
||||
elif isinstance(data, FramePushed):
|
||||
if on_push_frame_deprecated:
|
||||
await observer.on_push_frame(
|
||||
data.source, data.destination, data.frame, data.direction, data.timestamp
|
||||
|
||||
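The hunks above add an internal `_PipelineStartedSignal` sentinel that `TaskObserver` places on each observer queue, so the started notification is delivered in order with frame events. A minimal standalone sketch of that pattern follows. `RecordingObserver`, the simplified `FramePushed` dataclass, and the `None` stop marker are assumptions made for illustration and are not Pipecat's actual classes.

```python
import asyncio
from dataclasses import dataclass


class _PipelineStartedSignal:
    """Sentinel queued to tell the observer the pipeline has started."""


@dataclass
class FramePushed:
    """Simplified stand-in for Pipecat's FramePushed event data."""

    frame: str


class RecordingObserver:
    """Hypothetical observer that records the callbacks it receives."""

    def __init__(self):
        self.calls = []

    async def on_pipeline_started(self):
        self.calls.append("started")

    async def on_push_frame(self, data: FramePushed):
        self.calls.append(f"pushed:{data.frame}")


async def proxy_task(queue: asyncio.Queue, observer: RecordingObserver):
    """Drain the queue and dispatch each item by type, in arrival order."""
    while True:
        data = await queue.get()
        if data is None:  # stop marker used only in this sketch
            break
        if isinstance(data, _PipelineStartedSignal):
            await observer.on_pipeline_started()
        elif isinstance(data, FramePushed):
            await observer.on_push_frame(data)


async def run_demo():
    queue: asyncio.Queue = asyncio.Queue()
    observer = RecordingObserver()
    for item in (_PipelineStartedSignal(), FramePushed("TextFrame"), None):
        await queue.put(item)
    await proxy_task(queue, observer)
    return observer.calls


calls = asyncio.run(run_demo())
```

Because the sentinel travels through the same queue as frame events, a slow observer cannot block the pipeline and still sees the start notification before any frame queued after it.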
@@ -104,7 +104,7 @@ class DTMFAggregator(FrameProcessor):
|
||||
|
||||
# For first digit, schedule interruption.
|
||||
if is_first_digit:
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
# Check for immediate flush conditions
|
||||
if frame.button == self._termination_digit:
|
||||
|
||||
@@ -581,7 +581,7 @@ class LLMUserContextAggregator(LLMContextResponseAggregator):
|
||||
logger.debug(
|
||||
"Interruption conditions met - pushing interruption and aggregation"
|
||||
)
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
await self._process_aggregation()
|
||||
else:
|
||||
logger.debug("Interruption conditions not met - not pushing aggregation")
|
||||
|
||||
@@ -608,12 +608,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
if should_mute_frame:
|
||||
logger.trace(f"{frame.name} suppressed - user currently muted")
|
||||
|
||||
# When muted, the InterruptionFrame won't propagate further and
|
||||
# will never reach the pipeline sink. Complete it here so
|
||||
# push_interruption_task_frame_and_wait() doesn't hang.
|
||||
if should_mute_frame and isinstance(frame, InterruptionFrame):
|
||||
frame.complete()
|
||||
|
||||
should_mute_next_time = False
|
||||
for s in self._params.user_mute_strategies:
|
||||
should_mute_next_time |= await s.process_frame(frame)
|
||||
@@ -737,7 +731,7 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
await self._user_idle_controller.process_frame(UserStartedSpeakingFrame())
|
||||
|
||||
if params.enable_interruptions and self._allow_interruptions:
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
await self._call_event_handler("on_user_turn_started", strategy)
|
||||
|
||||
|
||||
@@ -234,12 +234,6 @@ class STTMuteFilter(FrameProcessor):
|
||||
await self.push_frame(frame, direction)
|
||||
else:
|
||||
logger.trace(f"{frame.__class__.__name__} suppressed - STT currently muted")
|
||||
|
||||
# When muted, the InterruptionFrame won't propagate further
|
||||
# and will never reach the pipeline sink. Complete it here so
|
||||
# push_interruption_task_frame_and_wait() doesn't hang.
|
||||
if isinstance(frame, InterruptionFrame):
|
||||
frame.complete()
|
||||
else:
|
||||
# Pass all other frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
@@ -41,7 +41,6 @@ from pipecat.frames.frames import (
|
||||
FrameProcessorResumeFrame,
|
||||
FrameProcessorResumeUrgentFrame,
|
||||
InterruptionFrame,
|
||||
InterruptionTaskFrame,
|
||||
StartFrame,
|
||||
SystemFrame,
|
||||
UninterruptibleFrame,
|
||||
@@ -240,10 +239,6 @@ class FrameProcessor(BaseObject):
|
||||
self.__process_frame_task: Optional[asyncio.Task] = None
|
||||
self.__process_current_frame: Optional[Frame] = None
|
||||
|
||||
# Set while awaiting push_interruption_task_frame_and_wait() so that
|
||||
# _start_interruption() knows not to cancel the process task.
|
||||
self._wait_for_interruption = False
|
||||
|
||||
# Frame processor events.
|
||||
self._register_event_handler("on_before_process_frame", sync=True)
|
||||
self._register_event_handler("on_after_process_frame", sync=True)
|
||||
@@ -329,7 +324,7 @@ class FrameProcessor(BaseObject):
|
||||
warnings.simplefilter("always")
|
||||
warnings.warn(
|
||||
"`FrameProcessor.interruptions_allowed` is deprecated. "
|
||||
"Use `LLMUserAggregator`'s new `user_mute_strategies` parameter instead.",
|
||||
"Use `LLMUserAggregator`'s new `user_mute_strategies` parameter instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
@@ -441,35 +436,19 @@ class FrameProcessor(BaseObject):
|
||||
if frame:
|
||||
await self.push_frame(frame)
|
||||
|
||||
_processing_metrics_warned = False
|
||||
|
||||
async def start_processing_metrics(self, *, start_time: Optional[float] = None):
|
||||
"""Start processing metrics collection.
|
||||
|
||||
.. deprecated:: 0.0.104
|
||||
Processing metrics are deprecated and will be removed in a future version.
|
||||
Use TTFB metrics instead.
|
||||
|
||||
Args:
|
||||
start_time: Optional timestamp to use as the start time. If None,
|
||||
uses the current time.
|
||||
"""
|
||||
if self.can_generate_metrics() and self.metrics_enabled:
|
||||
if not FrameProcessor._processing_metrics_warned:
|
||||
FrameProcessor._processing_metrics_warned = True
|
||||
logger.warning(
|
||||
"Processing metrics are deprecated and will be removed in a future version. "
|
||||
"Use TTFB metrics instead."
|
||||
)
|
||||
await self._metrics.start_processing_metrics(start_time=start_time)
|
||||
|
||||
async def stop_processing_metrics(self, *, end_time: Optional[float] = None):
|
||||
"""Stop processing metrics collection and push results.
|
||||
|
||||
.. deprecated:: 0.0.104
|
||||
Processing metrics are deprecated and will be removed in a future version.
|
||||
Use TTFB metrics instead.
|
||||
|
||||
Args:
|
||||
end_time: Optional timestamp to use as the end time. If None, uses
|
||||
the current time.
|
||||
@@ -647,15 +626,6 @@ class FrameProcessor(BaseObject):
|
||||
if self._cancelling:
|
||||
return
|
||||
|
||||
# If we are waiting for an interruption, bypass all queued system frames
|
||||
# and process the frame right away. This is because a previous system
|
||||
# frame might be waiting for the interruption frame blocking the input
|
||||
# task, so this InterruptionFrame would never be dequeued and we'd
|
||||
# deadlock.
|
||||
if self._wait_for_interruption and isinstance(frame, InterruptionFrame):
|
||||
await self.__process_frame(frame, direction, callback)
|
||||
return
|
||||
|
||||
if self._enable_direct_mode:
|
||||
await self.__process_frame(frame, direction, callback)
|
||||
else:
|
||||
@@ -790,43 +760,32 @@ class FrameProcessor(BaseObject):
|
||||
|
||||
await self._call_event_handler("on_after_push_frame", frame)
|
||||
|
||||
async def broadcast_interruption(self):
|
||||
"""Broadcast an `InterruptionFrame` both upstream and downstream."""
|
||||
logger.debug(f"{self}: broadcasting interruption")
|
||||
self.__reset_process_task()
|
||||
await self.stop_all_metrics()
|
||||
await self.broadcast_frame(InterruptionFrame)
|
||||
|
||||
async def push_interruption_task_frame_and_wait(self, *, timeout: float = 5.0):
|
||||
"""Push an interruption task frame upstream and wait for the interruption.
|
||||
|
||||
This function sends an `InterruptionTaskFrame` upstream to the
|
||||
pipeline task. The task creates a corresponding `InterruptionFrame`
|
||||
and sends it downstream through the pipeline. An `asyncio.Event` is
|
||||
attached to both frames so the caller can wait until the interruption
|
||||
has fully traversed the pipeline. The event is set when the
|
||||
`InterruptionFrame` reaches the pipeline sink. If the frame does
|
||||
not complete within the given timeout, a warning is logged and the
|
||||
event is forcibly set so the caller is unblocked.
|
||||
|
||||
Args:
|
||||
timeout: Maximum seconds to wait for the interruption to complete.
|
||||
.. deprecated:: 0.0.104
|
||||
Use :meth:`broadcast_interruption` instead. This method now
|
||||
delegates to ``broadcast_interruption()`` and ignores *timeout*.
|
||||
"""
|
||||
self._wait_for_interruption = True
|
||||
import warnings
|
||||
|
||||
event = asyncio.Event()
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter("always")
|
||||
warnings.warn(
|
||||
"`FrameProcessor.push_interruption_task_frame_and_wait()` is deprecated. "
|
||||
"Use `FrameProcessor.broadcast_interruption()` instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
await self.push_frame(InterruptionTaskFrame(event=event), FrameDirection.UPSTREAM)
|
||||
|
||||
# Wait for the `InterruptionFrame` to complete and log a warning if it
|
||||
# takes too long. If it does take too long make sure we unblock it,
|
||||
# otherwise we will hang here forever.
|
||||
while not event.is_set():
|
||||
try:
|
||||
await asyncio.wait_for(event.wait(), timeout=timeout)
|
||||
except asyncio.TimeoutError:
|
||||
logger.warning(
|
||||
f"{self}: InterruptionFrame has not completed after"
|
||||
f" {timeout}s. Make sure InterruptionFrame.complete()"
|
||||
" is being called (e.g. if the frame is being blocked"
|
||||
" or consumed before reaching the pipeline sink)."
|
||||
)
|
||||
event.set()
|
||||
|
||||
self._wait_for_interruption = False
|
||||
await self.broadcast_interruption()
|
||||
|
||||
async def broadcast_frame(self, frame_cls: Type[Frame], **kwargs):
|
||||
"""Broadcasts a frame of the specified class upstream and downstream.
|
||||
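The hunk above turns `push_interruption_task_frame_and_wait()` into a deprecation shim that warns, ignores `timeout`, and delegates to `broadcast_interruption()`. A reduced sketch of that shape is shown below. The `Processor` class and its `broadcasts` counter are hypothetical stand-ins, not the real `FrameProcessor`.

```python
import asyncio
import warnings


class Processor:
    """Hypothetical processor used only to illustrate the deprecation shim."""

    def __init__(self):
        self.broadcasts = 0

    async def broadcast_interruption(self):
        # Stand-in for the real method, which broadcasts an InterruptionFrame
        # upstream and downstream.
        self.broadcasts += 1

    async def push_interruption_task_frame_and_wait(self, *, timeout: float = 5.0):
        # Deprecated: warn, ignore `timeout`, and delegate to the new method.
        with warnings.catch_warnings():
            warnings.simplefilter("always")
            warnings.warn(
                "`push_interruption_task_frame_and_wait()` is deprecated. "
                "Use `broadcast_interruption()` instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        await self.broadcast_interruption()


async def run_demo():
    processor = Processor()
    # Record warnings so the demo can inspect what the shim emitted.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        await processor.push_interruption_task_frame_and_wait()
    return processor.broadcasts, [w.category for w in caught]


broadcasts, categories = asyncio.run(run_demo())
```

Existing callers keep working through the shim, while new code can call `broadcast_interruption()` directly as the other hunks in this commit do.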
@@ -933,15 +892,7 @@ class FrameProcessor(BaseObject):
|
||||
async def _start_interruption(self):
|
||||
"""Start handling an interruption by cancelling current tasks."""
|
||||
try:
|
||||
if self._wait_for_interruption:
|
||||
# If we get here we know the process task was just waiting for
|
||||
# an interruption (push_interruption_task_frame_and_wait()), so
|
||||
# we can't cancel the task because it might still need to do
|
||||
# more things (e.g. pushing a frame after the
|
||||
# interruption). Instead we just drain the queue because this is
|
||||
# an interruption.
|
||||
self.__reset_process_task()
|
||||
elif isinstance(self.__process_current_frame, UninterruptibleFrame):
|
||||
if isinstance(self.__process_current_frame, UninterruptibleFrame):
|
||||
# We don't want to cancel UninterruptibleFrame, so we simply
|
||||
# cleanup the queue.
|
||||
self.__reset_process_queue()
|
||||
|
||||
@@ -1702,7 +1702,7 @@ class RTVIProcessor(FrameProcessor):
|
||||
|
||||
async def interrupt_bot(self):
|
||||
"""Send a bot interruption frame upstream."""
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
async def send_server_message(self, data: Any):
|
||||
"""Send a server message to the client."""
|
||||
|
||||
@@ -150,10 +150,6 @@ class FrameProcessorMetrics(BaseObject):
|
||||
async def start_processing_metrics(self, *, start_time: Optional[float] = None):
|
||||
"""Start measuring processing time.
|
||||
|
||||
.. deprecated:: 0.0.104
|
||||
Processing metrics are deprecated and will be removed in a future version.
|
||||
Use TTFB metrics instead.
|
||||
|
||||
Args:
|
||||
start_time: Optional timestamp to use as the start time. If None,
|
||||
uses the current time.
|
||||
@@ -163,10 +159,6 @@ class FrameProcessorMetrics(BaseObject):
|
||||
async def stop_processing_metrics(self, *, end_time: Optional[float] = None):
|
||||
"""Stop processing time measurement and generate metrics frame.
|
||||
|
||||
.. deprecated:: 0.0.104
|
||||
Processing metrics are deprecated and will be removed in a future version.
|
||||
Use TTFB metrics instead.
|
||||
|
||||
Args:
|
||||
end_time: Optional timestamp to use as the end time. If None, uses
|
||||
the current time.
|
||||
|
||||
@@ -12,7 +12,8 @@ transcription WebSocket messages and connection configuration.
|
||||
|
||||
from typing import List, Literal, Optional
|
||||
|
||||
from pydantic import BaseModel, Field
|
||||
from loguru import logger
|
||||
from pydantic import BaseModel, ConfigDict, Field, model_validator
|
||||
|
||||
|
||||
class Word(BaseModel):
|
||||
@@ -68,8 +69,16 @@ class TurnMessage(BaseMessage):
|
||||
transcript: The transcribed text for this turn.
|
||||
end_of_turn_confidence: Confidence score for end-of-turn detection.
|
||||
words: List of individual words with timing and confidence data.
|
||||
language_code: Detected language code (e.g., "es", "fr"). Only present with
|
||||
complete utterances or when end_of_turn is True.
|
||||
language_confidence: Confidence score (0-1) for language detection. Only present
|
||||
with complete utterances or when end_of_turn is True.
|
||||
speaker: Speaker label (e.g., "A", "B"). Only present when speaker_labels is
|
||||
enabled and end_of_turn is True. Maps to 'speaker_label' in JSON response.
|
||||
"""
|
||||
|
||||
model_config = ConfigDict(populate_by_name=True)
|
||||
|
||||
type: Literal["Turn"] = "Turn"
|
||||
turn_order: int
|
||||
turn_is_formatted: bool
|
||||
@@ -77,6 +86,21 @@ class TurnMessage(BaseMessage):
|
||||
transcript: str
|
||||
end_of_turn_confidence: float
|
||||
words: List[Word]
|
||||
language_code: Optional[str] = None
|
||||
language_confidence: Optional[float] = None
|
||||
speaker: Optional[str] = Field(default=None, alias="speaker_label")
|
||||
|
||||
|
||||
class SpeechStartedMessage(BaseMessage):
|
||||
"""Message sent when speech is first detected in the audio stream.
|
||||
|
||||
Parameters:
|
||||
type: Always "SpeechStarted" for this message type.
|
||||
timestamp: Audio timestamp in milliseconds when speech was detected.
|
||||
"""
|
||||
|
||||
type: Literal["SpeechStarted"] = "SpeechStarted"
|
||||
timestamp: int
|
||||
|
||||
|
||||
class TerminationMessage(BaseMessage):
|
||||
@@ -94,7 +118,7 @@ class TerminationMessage(BaseMessage):
|
||||
|
||||
|
||||
# Union type for all possible message types
|
||||
AnyMessage = BeginMessage | TurnMessage | TerminationMessage
|
||||
AnyMessage = BeginMessage | TurnMessage | SpeechStartedMessage | TerminationMessage
|
||||
|
||||
|
||||
class AssemblyAIConnectionParams(BaseModel):
|
||||
@@ -106,10 +130,19 @@ class AssemblyAIConnectionParams(BaseModel):
|
||||
formatted_finals: Whether to enable transcript formatting. Defaults to True.
|
||||
word_finalization_max_wait_time: Maximum time to wait for word finalization in milliseconds.
|
||||
end_of_turn_confidence_threshold: Confidence threshold for end-of-turn detection.
|
||||
min_end_of_turn_silence_when_confident: Minimum silence duration when confident about end-of-turn.
|
||||
min_turn_silence: Minimum silence duration when confident about end-of-turn.
|
||||
min_end_of_turn_silence_when_confident: DEPRECATED. Use min_turn_silence instead.
|
||||
max_turn_silence: Maximum silence duration before forcing end-of-turn.
|
||||
keyterms_prompt: List of key terms to guide transcription. Will be JSON serialized before sending.
|
||||
speech_model: Select between English and multilingual models. Defaults to "universal-streaming-english".
|
||||
prompt: Optional text prompt to guide the transcription. Only used when speech_model is "u3-rt-pro".
|
||||
speech_model: Select between English, multilingual, and u3-rt-pro models. Defaults to "u3-rt-pro".
|
||||
language_detection: Enable automatic language detection. Only applicable to
|
||||
universal-streaming-multilingual. When enabled, Turn messages include
|
||||
language_code and language_confidence fields. Defaults to None (not sent).
|
||||
format_turns: Whether to format transcript turns. Defaults to True.
|
||||
speaker_labels: Enable speaker diarization. When enabled, final transcripts
|
||||
(end_of_turn=True) include a speaker field identifying the speaker
|
||||
(e.g., "Speaker A", "Speaker B"). Defaults to None (not sent).
|
||||
"""
|
||||
|
||||
sample_rate: int = 16000
|
||||
@@ -117,9 +150,27 @@ class AssemblyAIConnectionParams(BaseModel):
|
||||
formatted_finals: bool = True
|
||||
word_finalization_max_wait_time: Optional[int] = None
|
||||
end_of_turn_confidence_threshold: Optional[float] = None
|
||||
min_end_of_turn_silence_when_confident: Optional[int] = None
|
||||
min_turn_silence: Optional[int] = None
|
||||
min_end_of_turn_silence_when_confident: Optional[int] = None # Deprecated
|
||||
max_turn_silence: Optional[int] = None
|
||||
keyterms_prompt: Optional[List[str]] = None
|
||||
speech_model: Literal["universal-streaming-english", "universal-streaming-multilingual"] = (
|
||||
"universal-streaming-english"
|
||||
)
|
||||
prompt: Optional[str] = None
|
||||
speech_model: Literal[
|
||||
"universal-streaming-english", "universal-streaming-multilingual", "u3-rt-pro"
|
||||
] = "u3-rt-pro"
|
||||
language_detection: Optional[bool] = None
|
||||
format_turns: bool = True
|
||||
speaker_labels: Optional[bool] = None
|
||||
|
||||
@model_validator(mode="after")
|
||||
def handle_deprecated_param(self):
|
||||
"""Handle deprecated min_end_of_turn_silence_when_confident parameter."""
|
||||
if self.min_end_of_turn_silence_when_confident is not None:
|
||||
logger.warning(
|
||||
"The 'min_end_of_turn_silence_when_confident' parameter is deprecated and will be "
|
||||
"removed in a future version. Please use 'min_turn_silence' instead."
|
||||
)
|
||||
# If min_turn_silence is not set, use the deprecated value
|
||||
if self.min_turn_silence is None:
|
||||
self.min_turn_silence = self.min_end_of_turn_silence_when_confident
|
||||
return self
|
||||
|
||||
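The `handle_deprecated_param` validator above copies the deprecated `min_end_of_turn_silence_when_confident` value into `min_turn_silence` only when the new field is unset. The sketch below reproduces that rule with the standard library only. It assumes a dataclass `__post_init__` in place of pydantic's `model_validator(mode="after")` and `warnings.warn` in place of the loguru logger, and `ConnectionParams` is a hypothetical stand-in for `AssemblyAIConnectionParams`.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionParams:
    """Hypothetical stand-in for AssemblyAIConnectionParams (stdlib only)."""

    min_turn_silence: Optional[int] = None
    min_end_of_turn_silence_when_confident: Optional[int] = None  # Deprecated

    def __post_init__(self):
        # Plays the role of the pydantic "after" validator in the diff.
        if self.min_end_of_turn_silence_when_confident is not None:
            warnings.warn(
                "'min_end_of_turn_silence_when_confident' is deprecated; "
                "use 'min_turn_silence' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Only fall back to the deprecated value when the new one is unset.
            if self.min_turn_silence is None:
                self.min_turn_silence = self.min_end_of_turn_silence_when_confident


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    migrated = ConnectionParams(min_end_of_turn_silence_when_confident=300)
    both_set = ConnectionParams(
        min_turn_silence=120, min_end_of_turn_silence_when_confident=300
    )
```

When both fields are supplied, the explicit `min_turn_silence` wins, which matches the precedence in the validator.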
@@ -26,6 +26,8 @@ from pipecat.frames.frames import (
|
||||
InterimTranscriptionFrame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
UserStoppedSpeakingFrame,
|
||||
VADUserStartedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
)
|
||||
@@ -41,6 +43,7 @@ from .models import (
|
||||
AssemblyAIConnectionParams,
|
||||
BaseMessage,
|
||||
BeginMessage,
|
||||
SpeechStartedMessage,
|
||||
TerminationMessage,
|
||||
TurnMessage,
|
||||
)
|
||||
@@ -54,6 +57,28 @@ except ModuleNotFoundError as e:
|
||||
raise Exception(f"Missing module: {e}")
|
||||
|
||||
|
||||
def map_language_from_assemblyai(language_code: str) -> Language:
|
||||
"""Map AssemblyAI language codes to Pipecat Language enum.
|
||||
|
||||
AssemblyAI returns simple language codes like "es", "fr", etc.
|
||||
This function maps them to the corresponding Language enum values.
|
||||
|
||||
Args:
|
||||
language_code: AssemblyAI language code (e.g., "es", "fr", "de")
|
||||
|
||||
Returns:
|
||||
Corresponding Language enum value, defaulting to Language.EN if not found.
|
||||
"""
|
||||
try:
|
||||
# Try to match the language code directly
|
||||
return Language(language_code.lower())
|
||||
except ValueError:
|
||||
logger.warning(
|
||||
f"Unknown language code from AssemblyAI: {language_code}, defaulting to English"
|
||||
)
|
||||
return Language.EN
|
||||
|
||||
|
||||
@dataclass
|
||||
class AssemblyAISTTSettings(STTSettings):
|
||||
"""Settings for the AssemblyAI STT service.
|
||||
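`map_language_from_assemblyai` above relies on looking up an enum member by value and falling back to English on `ValueError`. A small self-contained sketch of that lookup is given here. The three-member `Language` enum is a hypothetical subset of Pipecat's enum, and the logging call is omitted.

```python
from enum import Enum


class Language(str, Enum):
    """Tiny hypothetical subset of Pipecat's Language enum."""

    EN = "en"
    ES = "es"
    FR = "fr"


def map_language(language_code: str) -> Language:
    """Look up the enum member by value, falling back to English."""
    try:
        return Language(language_code.lower())
    except ValueError:
        # Unknown codes default to English, as in the function above.
        return Language.EN


spanish = map_language("ES")
unknown = map_language("xx")
```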
@@ -87,6 +112,8 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
api_endpoint_base_url: str = "wss://streaming.assemblyai.com/v3/ws",
|
||||
connection_params: AssemblyAIConnectionParams = AssemblyAIConnectionParams(),
|
||||
vad_force_turn_endpoint: bool = True,
|
||||
should_interrupt: bool = True,
|
||||
speaker_format: Optional[str] = None,
|
||||
ttfs_p99_latency: Optional[float] = ASSEMBLYAI_TTFS_P99,
|
||||
**kwargs,
|
||||
):
|
||||
@@ -97,18 +124,66 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
language: Language code for transcription. Defaults to English (Language.EN).
|
||||
api_endpoint_base_url: WebSocket endpoint URL. Defaults to AssemblyAI's streaming endpoint.
|
||||
connection_params: Connection configuration parameters. Defaults to AssemblyAIConnectionParams().
|
||||
vad_force_turn_endpoint: Whether to force turn endpoint on VAD stop. When True,
|
||||
disables AssemblyAI's model-based turn detection and relies on external VAD
|
||||
to trigger turn endpoints. Automatically sets end_of_turn_confidence_threshold=1.0
|
||||
and max_turn_silence=2000 unless explicitly overridden. Defaults to True.
|
||||
vad_force_turn_endpoint: Controls turn detection mode.
|
||||
When True (Pipecat mode, default): Forces AssemblyAI to return finals ASAP
|
||||
so Pipecat's turn detection (e.g., Smart Turn) decides when the user is done.
|
||||
- min_turn_silence defaults to 100ms (user can override)
|
||||
- max_turn_silence is ALWAYS set equal to min_turn_silence
|
||||
- VAD stop sends ForceEndpoint as ceiling
|
||||
- No UserStarted/StoppedSpeakingFrame emitted from STT
|
||||
When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI's model
|
||||
controls turn endings using built-in turn detection.
|
||||
- Uses AssemblyAI API defaults for all parameters (unless user explicitly sets them)
|
||||
- Respects all user-provided connection_params as-is
|
||||
- Emits UserStarted/StoppedSpeakingFrame from STT
|
||||
- No ForceEndpoint on VAD stop
|
||||
should_interrupt: Whether to interrupt the bot when the user starts speaking
|
||||
in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies
|
||||
when using AssemblyAI's built-in turn detection. Defaults to True.
|
||||
speaker_format: Optional format string for speaker labels when diarization is enabled.
|
||||
Use {speaker} for speaker label and {text} for transcript text.
|
||||
Example: "<{speaker}>{text}</{speaker}>" or "{speaker}: {text}"
|
||||
If None, transcript text is not modified. Defaults to None.
|
||||
ttfs_p99_latency: P99 latency from speech end to final transcript in seconds.
|
||||
Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
|
||||
**kwargs: Additional arguments passed to parent STTService class.
|
||||
"""
|
||||
# When vad_force_turn_endpoint is enabled, configure connection params for manual
|
||||
# turn detection mode (disable model-based turn detection)
|
||||
# AssemblyAI turn detection mode (vad_force_turn_endpoint=False) requires the
|
||||
# SpeechStarted event for reliable barge-in. Only u3-rt-pro supports
|
||||
# this. Other models must use Pipecat turn detection.
|
||||
is_u3_pro = connection_params.speech_model == "u3-rt-pro"
|
||||
if not vad_force_turn_endpoint and not is_u3_pro:
|
||||
raise ValueError(
|
||||
f"AssemblyAI turn detection mode (vad_force_turn_endpoint=False) requires "
|
||||
f"u3-rt-pro for SpeechStarted support. Either set "
|
||||
f"vad_force_turn_endpoint=True for {connection_params.speech_model}, "
|
||||
f"or use speech_model='u3-rt-pro'."
|
||||
)
|
||||
|
||||
# Validate that prompt and keyterms_prompt are not both set
|
||||
if connection_params.prompt is not None and connection_params.keyterms_prompt is not None:
|
||||
raise ValueError(
|
||||
"The prompt and keyterms_prompt parameters cannot be used in the same request. "
|
||||
"Please choose either one or the other based on your use case. When you use "
|
||||
"keyterms_prompt, your boosted words are appended to the default prompt automatically. "
|
||||
"Or to boost within prompt: <prompt> + Make sure to boost the words <keyterms> in the audio. "
|
||||
"For more info go to: https://www.assemblyai.com/docs/streaming/universal-3-pro"
|
||||
)
|
||||
|
||||
# Warn if user sets a custom prompt (recommend testing without one first)
|
||||
if connection_params.prompt is not None:
|
||||
logger.warning(
|
||||
"Custom prompt detected. Prompting is a beta feature. We recommend testing "
|
||||
"with no prompt first, as this will use our optimized default prompt for "
|
||||
"voice agents. Bad prompts may lead to bad results. If you'd like to create "
|
||||
"your own prompt, check out our prompting guide at: "
|
||||
"https://www.assemblyai.com/docs/streaming/prompting"
|
||||
)
|
||||
|
||||
# When vad_force_turn_endpoint is enabled, configure connection params
|
||||
# for Pipecat turn detection mode (fast finals for smart turn analyzer)
|
||||
if vad_force_turn_endpoint:
|
||||
connection_params = self._configure_manual_turn_mode(connection_params)
|
||||
connection_params = self._configure_pipecat_turn_mode(connection_params, is_u3_pro)
|
||||
|
||||
super().__init__(
|
||||
sample_rate=connection_params.sample_rate,
|
||||
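The `speaker_format` docstring above describes a template with `{speaker}` and `{text}` placeholders. The code that applies it is not part of this hunk, so the helper below is only an assumed illustration of how such a template could be applied with `str.format`. `apply_speaker_format` is a hypothetical name.

```python
from typing import Optional


def apply_speaker_format(
    text: str, speaker: Optional[str], speaker_format: Optional[str]
) -> str:
    """Wrap the transcript text using the template, if one is configured."""
    # With no template (or no speaker label) the text is left unchanged.
    if speaker_format is None or speaker is None:
        return text
    return speaker_format.format(speaker=speaker, text=text)


tagged = apply_speaker_format("hello", "A", "<{speaker}>{text}</{speaker}>")
prefixed = apply_speaker_format("hello", "A", "{speaker}: {text}")
untouched = apply_speaker_format("hello", "A", None)
```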
@@ -124,6 +199,8 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
self._api_key = api_key
|
||||
self._api_endpoint_base_url = api_endpoint_base_url
|
||||
self._vad_force_turn_endpoint = vad_force_turn_endpoint
|
||||
self._should_interrupt = should_interrupt
|
||||
self._speaker_format = speaker_format
|
||||
|
||||
self._termination_event = asyncio.Event()
|
||||
self._received_termination = False
|
||||
@@ -135,45 +212,64 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
self._chunk_size_ms = 50
|
||||
self._chunk_size_bytes = 0
|
||||
|
||||
def _configure_manual_turn_mode(
|
||||
self, connection_params: AssemblyAIConnectionParams
|
||||
) -> AssemblyAIConnectionParams:
|
||||
"""Configure connection params for manual turn detection mode.
|
||||
self._user_speaking = False
|
||||
|
||||
When vad_force_turn_endpoint is enabled, we want to disable AssemblyAI's
|
||||
model-based turn detection and rely on external VAD. This requires:
|
||||
- end_of_turn_confidence_threshold=1.0 (disable semantic turn detection)
|
||||
- max_turn_silence=2000 (high value since VAD handles turn endings)
|
||||
def _configure_pipecat_turn_mode(
|
||||
self, connection_params: AssemblyAIConnectionParams, is_u3_pro: bool
|
||||
) -> AssemblyAIConnectionParams:
|
||||
"""Configure connection params for Pipecat turn detection mode.
|
||||
|
||||
When vad_force_turn_endpoint is enabled, force AssemblyAI to return
|
||||
finals as fast as possible so Pipecat's smart turn analyzer can decide
|
||||
when the user is done speaking. VAD stop is the absolute ceiling.
|
||||
|
||||
u3-rt-pro:
|
||||
- min_turn_silence defaults to 100ms (user can override)
|
||||
- max_turn_silence is ALWAYS set equal to min_turn_silence
|
||||
to avoid double turn detection (AssemblyAI + Pipecat both analyzing)
|
||||
- If user sets max_turn_silence, it's ignored with a warning
|
||||
- end_of_turn_confidence_threshold: not set (API default)
|
||||
|
||||
universal-streaming-*:
|
||||
- end_of_turn_confidence_threshold=0.0 (disable semantic turn detection)
|
||||
- min_turn_silence=160
|
||||
- max_turn_silence: not set (API default)
|
||||
|
||||
Args:
|
||||
connection_params: The user-provided connection parameters.
|
||||
is_u3_pro: Whether using u3-rt-pro model.
|
||||
|
||||
Returns:
|
||||
Updated connection parameters configured for manual turn mode.
|
||||
Updated connection parameters configured for Pipecat turn mode.
|
||||
"""
|
||||
updates = {}
|
||||
|
||||
# Check end_of_turn_confidence_threshold
|
||||
if connection_params.end_of_turn_confidence_threshold is None:
|
||||
updates["end_of_turn_confidence_threshold"] = 1.0
|
||||
elif connection_params.end_of_turn_confidence_threshold != 1.0:
|
||||
logger.warning(
|
||||
f"vad_force_turn_endpoint is enabled but end_of_turn_confidence_threshold "
|
||||
f"is set to {connection_params.end_of_turn_confidence_threshold}. "
|
||||
f"For manual turn detection mode, this should be 1.0 to disable "
|
||||
f"model-based turn detection. The current value will be used."
|
||||
)
|
||||
if is_u3_pro:
|
||||
# u3-rt-pro: Synchronize max_turn_silence with min_turn_silence
|
||||
min_silence = connection_params.min_turn_silence
|
||||
if min_silence is None:
|
||||
min_silence = 100
|
||||
|
||||
# Check max_turn_silence
|
||||
if connection_params.max_turn_silence is None:
|
||||
updates["max_turn_silence"] = 2000
|
||||
elif connection_params.max_turn_silence < 1000:
|
||||
logger.warning(
|
||||
f"vad_force_turn_endpoint is enabled but max_turn_silence is set to "
|
||||
f"{connection_params.max_turn_silence}ms. With manual turn detection, "
|
||||
f"a higher value (e.g., 2000ms) is recommended to avoid premature "
|
||||
f"turn endings. The current value will be used."
|
||||
)
|
||||
# Warn if user set max_turn_silence (will be overridden)
|
||||
if connection_params.max_turn_silence is not None:
|
||||
logger.warning(
|
||||
f"Your max_turn_silence value ({connection_params.max_turn_silence}ms) will be "
|
||||
f"OVERRIDDEN in Pipecat mode (vad_force_turn_endpoint=True). It will be set to "
|
||||
f"{min_silence}ms (matching min_turn_silence) and SENT to "
|
||||
f"AssemblyAI to avoid double turn detection. To use your max_turn_silence as-is, "
|
||||
f"switch to AssemblyAI turn detection mode (vad_force_turn_endpoint=False)."
|
||||
)
|
||||
|
||||
updates = {
|
||||
"min_turn_silence": min_silence,
|
||||
"max_turn_silence": min_silence,
|
||||
}
|
||||
else:
|
||||
# universal-streaming-*: no min/max synchronization; set the confidence
# threshold and a fixed min_turn_silence instead
|
||||
updates = {
|
||||
"end_of_turn_confidence_threshold": 1.0,
|
||||
"min_turn_silence": 160,
|
||||
}
|
||||
|
||||
# Apply updates if any
|
||||
if updates:
|
||||
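The override rules in `_configure_pipecat_turn_mode` can be condensed into a pure function, sketched below under the assumption that only the override values matter (the warning about a user-supplied `max_turn_silence` is left out). `pipecat_turn_mode_updates` is a hypothetical name. Note that the docstring lists `end_of_turn_confidence_threshold=0.0` for `universal-streaming-*` models while the code assigns `1.0`; the sketch follows the code.

```python
from typing import Optional


def pipecat_turn_mode_updates(
    is_u3_pro: bool, min_turn_silence: Optional[int] = None
) -> dict:
    """Return the connection-param overrides chosen in the code above."""
    if is_u3_pro:
        # Default to 100 ms, then pin max_turn_silence to the same value.
        min_silence = 100 if min_turn_silence is None else min_turn_silence
        return {"min_turn_silence": min_silence, "max_turn_silence": min_silence}
    # universal-streaming-* models: values as they appear in the code.
    return {"end_of_turn_confidence_threshold": 1.0, "min_turn_silence": 160}


u3_default = pipecat_turn_mode_updates(is_u3_pro=True)
u3_custom = pipecat_turn_mode_updates(is_u3_pro=True, min_turn_silence=250)
universal = pipecat_turn_mode_updates(is_u3_pro=False)
```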
@@ -190,9 +286,14 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
return True
|
||||
|
||||
async def _update_settings(self, delta: STTSettings) -> dict[str, Any]:
|
||||
"""Apply a settings delta.
|
||||
"""Apply a settings delta and send UpdateConfiguration if connected.
|
||||
|
||||
Settings are stored but not applied to the active connection.
|
||||
Stores settings changes and sends UpdateConfiguration message to AssemblyAI
|
||||
without reconnecting. Supports updating:
|
||||
- keyterms_prompt: List of terms to boost (can be empty array to clear)
|
||||
- prompt: Custom prompt text (u3-rt-pro only)
|
||||
- max_turn_silence: Maximum silence before forcing turn end
|
||||
- min_turn_silence: Silence before the end-of-turn (EOT) check
|
||||
|
||||
Args:
|
||||
delta: A :class:`STTSettings` (or ``AssemblyAISTTSettings``) delta.
|
||||
@@ -205,18 +306,72 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
if not changed:
|
||||
return changed
|
||||
|
||||
# TODO: someday we could reconnect here to apply updated settings.
|
||||
# Code might look something like the below:
|
||||
# # Re-apply manual turn mode config if vad_force_turn_endpoint is active
|
||||
# # and connection_params were updated.
|
||||
# if self._vad_force_turn_endpoint and "connection_params" in changed:
|
||||
# self._settings.connection_params = self._configure_manual_turn_mode(
|
||||
# self._settings.connection_params
|
||||
# )
|
||||
# await self._disconnect()
|
||||
# await self._connect()
|
||||
# If websocket is connected, send UpdateConfiguration for supported params
|
||||
if (
|
||||
self._websocket
|
||||
and self._websocket.state is State.OPEN
|
||||
and "connection_params" in changed
|
||||
):
|
||||
# Build UpdateConfiguration message
|
||||
update_config = {"type": "UpdateConfiguration"}
|
||||
conn_params = self._settings.connection_params
|
||||
|
||||
self._warn_unhandled_updated_settings(changed)
|
||||
# Get the old connection_params to see what changed
|
||||
old_conn_params = changed.get("connection_params")
|
||||
|
||||
# Check each potentially changed parameter
|
||||
if (
|
||||
old_conn_params is None
|
||||
or conn_params.keyterms_prompt != old_conn_params.keyterms_prompt
|
||||
):
|
||||
if conn_params.keyterms_prompt is not None:
|
||||
update_config["keyterms_prompt"] = conn_params.keyterms_prompt
|
||||
logger.info(f"Updating keyterms_prompt to: {conn_params.keyterms_prompt}")
|
||||
|
||||
if old_conn_params is None or conn_params.prompt != old_conn_params.prompt:
|
||||
if conn_params.prompt is not None:
|
||||
if conn_params.speech_model != "u3-rt-pro":
|
||||
logger.warning(
|
||||
f"prompt parameter is only supported with u3-rt-pro model, "
|
||||
f"current model is {conn_params.speech_model}"
|
||||
)
|
||||
else:
|
||||
update_config["prompt"] = conn_params.prompt
|
||||
logger.info("Updating prompt")
|
||||
|
||||
if (
|
||||
old_conn_params is None
|
||||
or conn_params.max_turn_silence != old_conn_params.max_turn_silence
|
||||
):
|
||||
if conn_params.max_turn_silence is not None:
|
||||
update_config["max_turn_silence"] = conn_params.max_turn_silence
|
||||
logger.info(f"Updating max_turn_silence to: {conn_params.max_turn_silence}ms")
|
||||
|
||||
if (
|
||||
old_conn_params is None
|
||||
or conn_params.min_turn_silence != old_conn_params.min_turn_silence
|
||||
):
|
||||
if conn_params.min_turn_silence is not None:
|
||||
update_config["min_turn_silence"] = conn_params.min_turn_silence
|
||||
logger.info(f"Updating min_turn_silence to: {conn_params.min_turn_silence}ms")
|
||||
|
||||
# Send update if we have parameters to update
|
||||
if len(update_config) > 1: # More than just "type"
|
||||
try:
|
||||
await self._websocket.send(json.dumps(update_config))
|
||||
logger.info(f"Sent UpdateConfiguration: {update_config}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to send UpdateConfiguration: {e}")
|
||||
elif "connection_params" in changed:
|
||||
logger.warning(
|
||||
"Connection params changed but WebSocket not connected. "
|
||||
"Settings will be applied on next connection."
|
||||
)
|
||||
|
||||
# Warn about other settings that can't be changed dynamically
|
||||
other_changes = {k: v for k, v in changed.items() if k != "connection_params"}
|
||||
if other_changes:
|
||||
self._warn_unhandled_updated_settings(other_changes)
|
||||
|
||||
return changed
|
||||
|
||||
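The update logic in the hunk above compares the previous and current connection parameters and sends only fields that changed and are not `None`. A minimal sketch of that diffing step follows. `ConnParams`, `UPDATABLE_FIELDS`, `build_update_config`, and `should_send` are hypothetical names for illustration, not part of Pipecat or the AssemblyAI API, and the model-specific `prompt` check is left out.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Hypothetical stand-in for the service's connection params model.
@dataclass
class ConnParams:
    keyterms_prompt: Optional[list[str]] = None
    max_turn_silence: Optional[int] = None
    min_turn_silence: Optional[int] = None


# Fields assumed to be updatable on a live connection (illustrative only).
UPDATABLE_FIELDS = ("keyterms_prompt", "max_turn_silence", "min_turn_silence")


def build_update_config(new: ConnParams, old: Optional[ConnParams]) -> dict[str, Any]:
    """Return a payload holding only the changed, non-None fields."""
    payload: dict[str, Any] = {"type": "UpdateConfiguration"}
    for name in UPDATABLE_FIELDS:
        value = getattr(new, name)
        # With no previous params, every set field counts as changed.
        changed = old is None or value != getattr(old, name)
        if changed and value is not None:
            payload[name] = value
    return payload


def should_send(payload: dict[str, Any]) -> bool:
    """A payload with only the "type" key has nothing to update."""
    return len(payload) > 1
```

Keeping the diffing in a pure function like this makes the "nothing changed, send nothing" case easy to check without a live WebSocket.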
@@ -283,6 +438,7 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
and self._websocket
|
||||
and self._websocket.state is State.OPEN
|
||||
):
|
||||
self.request_finalize()
|
||||
await self._websocket.send(json.dumps({"type": "ForceEndpoint"}))
|
||||
await self.start_processing_metrics()
|
||||
|
||||
@@ -295,6 +451,9 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
"""Build WebSocket URL with query parameters using urllib.parse.urlencode."""
|
||||
params = {}
|
||||
for k, v in self._settings.connection_params.model_dump().items():
|
||||
# Skip deprecated parameter - it's been migrated to min_turn_silence
|
||||
if k == "min_end_of_turn_silence_when_confident":
|
||||
continue
|
||||
if v is not None:
|
||||
if k == "keyterms_prompt":
|
||||
params[k] = json.dumps(v)
|
||||
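The hunk above builds the WebSocket query string by skipping `None` values and the deprecated parameter, and by JSON-encoding `keyterms_prompt`. A rough, self-contained sketch of the same idea is shown here. `build_ws_url` and the example URL are hypothetical, and the lowercase boolean handling is an assumption, since the hunk is cut off before the remaining branches.

```python
import json
from typing import Any
from urllib.parse import urlencode

# Parameter that was migrated to min_turn_silence and should not be sent.
DEPRECATED_KEYS = {"min_end_of_turn_silence_when_confident"}


def build_ws_url(base_url: str, params: dict[str, Any]) -> str:
    """Build a URL whose query string carries only usable parameters."""
    query: dict[str, str] = {}
    for key, value in params.items():
        if key in DEPRECATED_KEYS or value is None:
            continue
        if key == "keyterms_prompt":
            # Lists are sent as a JSON-encoded string.
            query[key] = json.dumps(value)
        elif isinstance(value, bool):
            # Assumption: booleans are sent as lowercase "true"/"false".
            query[key] = str(value).lower()
        else:
            query[key] = str(value)
    # urlencode percent-escapes characters that are unsafe in a query string.
    return f"{base_url}?{urlencode(query)}" if query else base_url
```

For example, `build_ws_url("wss://example.invalid/ws", {"sample_rate": 16000, "prompt": None})` drops the unset `prompt` and keeps only `sample_rate`.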
@@ -421,6 +580,9 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
async for message in self._get_websocket():
|
||||
try:
|
||||
data = json.loads(message)
|
||||
# Log raw JSON for Turn messages to debug speaker_label
|
||||
if data.get("type") == "Turn":
|
||||
logger.trace(f"{self} RAW JSON from AssemblyAI: {json.dumps(data, indent=2)}")
|
||||
await self._handle_message(data)
|
||||
except json.JSONDecodeError:
|
||||
logger.warning(f"Received non-JSON message: {message}")
|
||||
@@ -433,6 +595,8 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
return BeginMessage.model_validate(message)
|
||||
elif msg_type == "Turn":
|
||||
return TurnMessage.model_validate(message)
|
||||
elif msg_type == "SpeechStarted":
|
||||
return SpeechStartedMessage.model_validate(message)
|
||||
elif msg_type == "Termination":
|
||||
return TerminationMessage.model_validate(message)
|
||||
else:
|
||||
@@ -449,11 +613,33 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
)
|
||||
elif isinstance(parsed_message, TurnMessage):
|
||||
await self._handle_transcription(parsed_message)
|
||||
elif isinstance(parsed_message, SpeechStartedMessage):
|
||||
await self._handle_speech_started(parsed_message)
|
||||
elif isinstance(parsed_message, TerminationMessage):
|
||||
await self._handle_termination(parsed_message)
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
|
||||
async def _handle_speech_started(self, message: SpeechStartedMessage):
|
||||
"""Handle SpeechStarted event — fast barge-in for AssemblyAI turn detection.
|
||||
|
||||
Broadcasts UserStartedSpeakingFrame to signal the start of user
|
||||
speech, then pushes an interruption to cancel any bot audio.
|
||||
SpeechStarted fires before any transcript arrives, so the turn
|
||||
is cleanly started before any transcription frames are pushed.
|
||||
|
||||
Only applies when using AssemblyAI's built-in turn detection. When using
|
||||
Pipecat turn detection, VAD + smart turn analyzer handle interruptions.
|
||||
"""
|
||||
if self._vad_force_turn_endpoint:
|
||||
return # Pipecat mode: handled by aggregator
|
||||
|
||||
await self.start_processing_metrics()
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
if self._should_interrupt:
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
self._user_speaking = True
|
||||
|
||||
async def _handle_termination(self, message: TerminationMessage):
|
||||
"""Handle termination message."""
|
||||
self._received_termination = True
|
||||
@@ -466,30 +652,109 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
await self.push_frame(EndFrame())
|
||||
|
||||
async def _handle_transcription(self, message: TurnMessage):
|
||||
"""Handle transcription results."""
|
||||
"""Handle transcription results with two turn detection modes.
|
||||
|
||||
Pipecat turn detection (vad_force_turn_endpoint=True):
|
||||
- No UserStarted/StoppedSpeakingFrame from STT
|
||||
- end_of_turn → TranscriptionFrame (finalized set by base class
|
||||
if this is a ForceEndpoint response)
|
||||
- else → InterimTranscriptionFrame
|
||||
|
||||
AssemblyAI turn detection (vad_force_turn_endpoint=False):
|
||||
- UserStartedSpeakingFrame on first transcript
|
||||
- end_of_turn → TranscriptionFrame + UserStoppedSpeakingFrame
|
||||
- else → InterimTranscriptionFrame
|
||||
"""
|
||||
if not message.transcript:
|
||||
return
|
||||
if message.end_of_turn and (
|
||||
not self._settings.connection_params.formatted_finals or message.turn_is_formatted
|
||||
):
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
message.transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
self._settings.language,
|
||||
message,
|
||||
|
||||
# Use detected language if available with sufficient confidence
|
||||
language = Language.EN
|
||||
if message.language_code and message.language_confidence:
|
||||
if message.language_confidence >= 0.7:
|
||||
language = map_language_from_assemblyai(message.language_code)
|
||||
else:
|
||||
logger.warning(
|
||||
f"Low language detection confidence ({message.language_confidence:.2f}) "
|
||||
f"for language '{message.language_code}', falling back to English"
|
||||
)
|
||||
|
||||
# Handle speaker diarization
|
||||
speaker_id = self._user_id
|
||||
transcript_text = message.transcript
|
||||
|
||||
if message.speaker:
|
||||
speaker_id = message.speaker
|
||||
# Format transcript with speaker labels if format string provided
|
||||
if self._speaker_format:
|
||||
transcript_text = self._speaker_format.format(
|
||||
speaker=message.speaker, text=message.transcript
|
||||
)
|
||||
|
||||
# Determine if this is a final turn from AssemblyAI
|
||||
is_final_turn = message.end_of_turn and (
|
||||
not self._settings.connection_params.format_turns or message.turn_is_formatted
|
||||
)
|
||||
|
||||
if self._vad_force_turn_endpoint:
|
||||
# --- Pipecat turn detection mode ---
|
||||
# No UserStarted/StoppedSpeakingFrame — VAD + smart turn analyzer handle this
|
||||
if is_final_turn:
|
||||
finalize_confirmed = bool(message.turn_is_formatted)
|
||||
if finalize_confirmed:
|
||||
self.confirm_finalize()
|
||||
logger.debug(f'{self} Transcript: "{transcript_text}"')
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
transcript_text,
|
||||
speaker_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
message,
|
||||
)
|
||||
)
|
||||
await self._trace_transcription(transcript_text, True, language)
|
||||
await self.stop_processing_metrics()
|
||||
else:
|
||||
await self.push_frame(
|
||||
InterimTranscriptionFrame(
|
||||
transcript_text,
|
||||
speaker_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
message,
|
||||
)
|
||||
)
|
||||
)
|
||||
await self._trace_transcription(message.transcript, True, self._settings.language)
|
||||
await self.stop_processing_metrics()
|
||||
else:
|
||||
await self.push_frame(
|
||||
InterimTranscriptionFrame(
|
||||
message.transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
self._settings.language,
|
||||
message,
|
||||
# --- AssemblyAI turn detection mode ---
|
||||
# SpeechStarted always arrives before transcripts with u3-rt-pro,
|
||||
# so UserStartedSpeakingFrame is guaranteed to be broadcast first.
|
||||
if is_final_turn:
|
||||
# AssemblyAI controls finalization here, so just mark the frame as finalized
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
transcript_text,
|
||||
speaker_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
message,
|
||||
finalized=True,
|
||||
)
|
||||
)
|
||||
await self._trace_transcription(transcript_text, True, language)
|
||||
await self.stop_processing_metrics()
|
||||
# AAI is authoritative — emit UserStoppedSpeakingFrame immediately.
|
||||
# broadcast_frame pushes downstream (same queue as TranscriptionFrame
|
||||
# above, so ordering is preserved) and upstream.
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
self._user_speaking = False
|
||||
else:
|
||||
await self.push_frame(
|
||||
InterimTranscriptionFrame(
|
||||
transcript_text,
|
||||
speaker_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
message,
|
||||
)
|
||||
)
|
||||
)
|
||||
|
||||
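Two small decisions in the transcription handler above are easy to isolate: falling back to a default language when detection confidence is below 0.7, and applying an optional speaker format string. The sketch below shows both as pure functions. The function names and the use of plain string language codes (rather than Pipecat's `Language` enum) are simplifications for illustration.

```python
from typing import Optional

DEFAULT_LANGUAGE = "en"
MIN_LANGUAGE_CONFIDENCE = 0.7  # threshold used in the handler above


def pick_language(code: Optional[str], confidence: Optional[float]) -> str:
    """Use the detected language only when confidence is high enough."""
    if code and confidence and confidence >= MIN_LANGUAGE_CONFIDENCE:
        return code
    return DEFAULT_LANGUAGE


def format_transcript(
    text: str, speaker: Optional[str], speaker_format: Optional[str]
) -> str:
    """Apply the speaker format string when both speaker and format exist."""
    if speaker and speaker_format:
        return speaker_format.format(speaker=speaker, text=text)
    return text
```

A format string such as `"<{speaker}>{text}</{speaker}>"` would wrap each transcript in a speaker tag, while transcripts without a speaker label pass through unchanged.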
@@ -35,6 +35,7 @@ from pipecat.utils.tracing.service_decorators import traced_stt
|
||||
|
||||
try:
|
||||
from azure.cognitiveservices.speech import (
|
||||
CancellationReason,
|
||||
ResultReason,
|
||||
SpeechConfig,
|
||||
SpeechRecognizer,
|
||||
@@ -209,6 +210,7 @@ class AzureSTTService(STTService):
|
||||
)
|
||||
self._speech_recognizer.recognizing.connect(self._on_handle_recognizing)
|
||||
self._speech_recognizer.recognized.connect(self._on_handle_recognized)
|
||||
self._speech_recognizer.canceled.connect(self._on_handle_canceled)
|
||||
self._speech_recognizer.start_continuous_recognition_async()
|
||||
except Exception as e:
|
||||
await self.push_error(
|
||||
@@ -280,3 +282,13 @@ class AzureSTTService(STTService):
|
||||
result=event,
|
||||
)
|
||||
asyncio.run_coroutine_threadsafe(self.push_frame(frame), self.get_event_loop())
|
||||
|
||||
    def _on_handle_canceled(self, event):
        details = event.result.cancellation_details
        if details.reason == CancellationReason.Error:
            error_msg = f"Azure STT recognition canceled: {details.reason}"
            if details.error_details:
                error_msg += f" - {details.error_details}"
            asyncio.run_coroutine_threadsafe(
                self.push_error(error_msg=error_msg), self.get_event_loop()
            )
|
||||
|
||||
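The canceled handler above uses `asyncio.run_coroutine_threadsafe`, which suggests it runs on an SDK callback thread rather than on the event loop. A minimal sketch of that handoff, using only the standard library, might look like this. `push_error` and `on_canceled` here are local stand-ins, not the Pipecat or Azure SDK functions.

```python
import asyncio
import threading


async def main() -> list[str]:
    errors: list[str] = []
    loop = asyncio.get_running_loop()

    async def push_error(error_msg: str) -> None:
        # Stand-in for a coroutine that must run on the event loop.
        errors.append(error_msg)

    def on_canceled(reason: str, details: str) -> None:
        # Runs on a plain thread, as an SDK callback might.
        error_msg = f"recognition canceled: {reason}"
        if details:
            error_msg += f" - {details}"
        future = asyncio.run_coroutine_threadsafe(push_error(error_msg), loop)
        # Block this worker thread (not the loop) until the coroutine ran.
        future.result(timeout=5)

    worker = threading.Thread(target=on_canceled, args=("Error", "bad key"))
    worker.start()
    # Join in a helper thread so the event loop stays free to run push_error.
    await asyncio.to_thread(worker.join)
    return errors


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling the coroutine directly from the callback thread would never schedule it. `run_coroutine_threadsafe` submits it to the loop and returns a `concurrent.futures.Future` that the thread can wait on or ignore.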
@@ -561,9 +561,13 @@ class AzureTTSService(TTSService, AzureBaseTTSService):
|
||||
# User cancellation (from interruption) is expected, not an error
|
||||
if reason == CancellationReason.CancelledByUser:
|
||||
logger.debug(f"{self}: Speech synthesis canceled by user (interruption)")
|
||||
self._audio_queue.put_nowait(None)
|
||||
else:
|
||||
logger.warning(f"{self}: Speech synthesis canceled: {reason}")
|
||||
self._audio_queue.put_nowait(None)
|
||||
details = evt.result.cancellation_details
|
||||
error_msg = f"Azure TTS synthesis canceled: {reason}"
|
||||
if details.error_details:
|
||||
error_msg += f" - {details.error_details}"
|
||||
self._audio_queue.put_nowait(Exception(error_msg))
|
||||
|
||||
async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
|
||||
"""Push a frame and handle state changes.
|
||||
@@ -676,6 +680,9 @@ class AzureTTSService(TTSService, AzureBaseTTSService):
|
||||
chunk = await self._audio_queue.get()
|
||||
if chunk is None: # End of stream
|
||||
break
|
||||
if isinstance(chunk, Exception): # Error from _handle_canceled
|
||||
yield ErrorFrame(error=str(chunk))
|
||||
break
|
||||
|
||||
if self._first_chunk:
|
||||
await self.stop_ttfb_metrics()
|
||||
|
||||
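The Azure TTS change above passes errors through the audio queue: `None` marks a normal end of stream, and an `Exception` instance marks a failed synthesis. The pattern can be sketched in isolation as follows. `drain` and `demo` are hypothetical names, and a string is yielded where the service would yield an `ErrorFrame`.

```python
import asyncio
from typing import AsyncGenerator, Union


async def drain(queue: asyncio.Queue) -> AsyncGenerator[Union[bytes, str], None]:
    """Yield audio chunks until a sentinel or an error item is seen."""
    while True:
        chunk = await queue.get()
        if chunk is None:  # end of stream
            break
        if isinstance(chunk, Exception):  # error reported by the producer
            yield f"error: {chunk}"
            break
        yield chunk


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    queue.put_nowait(b"\x00\x01")
    queue.put_nowait(Exception("synthesis canceled"))
    queue.put_nowait(b"never read")  # ignored: the consumer stops at the error
    return [item async for item in drain(queue)]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Sending the exception through the same queue keeps ordering intact: audio queued before the failure is still delivered, and the consumer learns about the error at the point where it occurred.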
@@ -9,6 +9,7 @@ import sys
|
||||
from pipecat.services import DeprecatedModuleProxy
|
||||
|
||||
from .flux import *
|
||||
from .sagemaker import *
|
||||
from .stt import *
|
||||
from .tts import *
|
||||
|
||||
|
||||
@@ -675,7 +675,7 @@ class DeepgramFluxSTTService(WebsocketSTTService):
|
||||
self._user_is_speaking = True
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
if self._should_interrupt:
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
await self.start_metrics()
|
||||
await self._call_event_handler("on_start_of_turn", transcript)
|
||||
if transcript:
|
||||
|
||||
448
src/pipecat/services/deepgram/sagemaker/stt.py
Normal file
@@ -0,0 +1,448 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Deepgram speech-to-text service for AWS SageMaker.
|
||||
|
||||
This module provides a Pipecat STT service that connects to Deepgram models
|
||||
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
|
||||
low-latency real-time transcription with support for interim results, multiple
|
||||
languages, and various Deepgram features.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from typing import Any, AsyncGenerator, Dict, Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
InterimTranscriptionFrame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
VADUserStartedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
|
||||
from pipecat.services.deepgram.stt import _DeepgramSTTSettingsBase
|
||||
from pipecat.services.settings import STTSettings
|
||||
from pipecat.services.stt_latency import DEEPGRAM_SAGEMAKER_TTFS_P99
|
||||
from pipecat.services.stt_service import STTService
|
||||
from pipecat.transcriptions.language import Language
|
||||
from pipecat.utils.time import time_now_iso8601
|
||||
from pipecat.utils.tracing.service_decorators import traced_stt
|
||||
|
||||
try:
|
||||
from deepgram import LiveOptions
|
||||
except ModuleNotFoundError as e:
|
||||
logger.error(f"Exception: {e}")
|
||||
logger.error(
|
||||
"In order to use DeepgramSageMakerSTTService, you need to `pip install pipecat-ai[deepgram,sagemaker]`."
|
||||
)
|
||||
raise Exception(f"Missing module: {e}")
|
||||
|
||||
|
||||
@dataclass
|
||||
class DeepgramSageMakerSTTSettings(_DeepgramSTTSettingsBase):
|
||||
"""Settings for the Deepgram SageMaker STT service.
|
||||
|
||||
See ``_DeepgramSTTSettingsBase`` for full documentation.
|
||||
"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class DeepgramSageMakerSTTService(STTService):
|
||||
"""Deepgram speech-to-text service for AWS SageMaker.
|
||||
|
||||
Provides real-time speech recognition using Deepgram models deployed on
|
||||
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
|
||||
transcription with support for interim results, speaker diarization, and
|
||||
multiple languages.
|
||||
|
||||
Requirements:
|
||||
|
||||
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
|
||||
- A deployed SageMaker endpoint with Deepgram model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
|
||||
- Deepgram SDK for LiveOptions configuration
|
||||
|
||||
Example::
|
||||
|
||||
stt = DeepgramSageMakerSTTService(
|
||||
endpoint_name="my-deepgram-endpoint",
|
||||
region="us-east-2",
|
||||
live_options=LiveOptions(
|
||||
model="nova-3",
|
||||
language="en",
|
||||
interim_results=True,
|
||||
punctuate=True,
|
||||
),
|
||||
)
|
||||
"""
|
||||
|
||||
_settings: DeepgramSageMakerSTTSettings
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
endpoint_name: str,
|
||||
region: str,
|
||||
sample_rate: Optional[int] = None,
|
||||
live_options: Optional[LiveOptions] = None,
|
||||
ttfs_p99_latency: Optional[float] = DEEPGRAM_SAGEMAKER_TTFS_P99,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
endpoint_name: Name of the SageMaker endpoint with Deepgram model
|
||||
deployed (e.g., "my-deepgram-nova-3-endpoint").
|
||||
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
|
||||
sample_rate: Audio sample rate in Hz. If None, uses value from
|
||||
live_options or defaults to the value from StartFrame.
|
||||
live_options: Deepgram LiveOptions configuration. Treated as a
|
||||
delta from a set of sensible defaults — only the fields you
|
||||
set are overridden; all others keep their default values.
|
||||
ttfs_p99_latency: P99 latency from speech end to final transcript in seconds.
|
||||
Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
|
||||
**kwargs: Additional arguments passed to the parent STTService.
|
||||
"""
|
||||
sample_rate = sample_rate or (live_options.sample_rate if live_options else None)
|
||||
|
||||
default_options = LiveOptions(
|
||||
encoding="linear16",
|
||||
language=Language.EN,
|
||||
model="nova-3",
|
||||
channels=1,
|
||||
interim_results=True,
|
||||
punctuate=True,
|
||||
)
|
||||
|
||||
settings = DeepgramSageMakerSTTSettings(
|
||||
model=default_options.model,
|
||||
language=default_options.language,
|
||||
live_options=default_options,
|
||||
)
|
||||
if live_options:
|
||||
settings._merge_live_options_delta(live_options)
|
||||
|
||||
super().__init__(
|
||||
sample_rate=sample_rate,
|
||||
ttfs_p99_latency=ttfs_p99_latency,
|
||||
settings=settings,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self._endpoint_name = endpoint_name
|
||||
self._region = region
|
||||
|
||||
self._client: Optional[SageMakerBidiClient] = None
|
||||
self._response_task: Optional[asyncio.Task] = None
|
||||
self._keepalive_task: Optional[asyncio.Task] = None
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True, as Deepgram SageMaker service supports metrics generation.
|
||||
"""
|
||||
return True
|
||||
|
||||
async def _update_settings(self, delta: STTSettings) -> dict[str, Any]:
|
||||
"""Apply a settings delta and warn about unhandled changes."""
|
||||
changed = await super()._update_settings(delta)
|
||||
|
||||
if not changed:
|
||||
return changed
|
||||
|
||||
# TODO: someday we could reconnect here to apply updated settings.
|
||||
# Code might look something like the below:
|
||||
# await self._disconnect()
|
||||
# await self._connect()
|
||||
|
||||
self._warn_unhandled_updated_settings(changed)
|
||||
|
||||
return changed
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
"""Start the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The start frame containing initialization parameters.
|
||||
"""
|
||||
await super().start(frame)
|
||||
await self._connect()
|
||||
|
||||
async def stop(self, frame: EndFrame):
|
||||
"""Stop the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The end frame.
|
||||
"""
|
||||
await super().stop(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def cancel(self, frame: CancelFrame):
|
||||
"""Cancel the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The cancel frame.
|
||||
"""
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
|
||||
"""Send audio data to Deepgram for transcription.
|
||||
|
||||
Args:
|
||||
audio: Raw audio bytes to transcribe.
|
||||
|
||||
Yields:
|
||||
Frame: None (transcription results come via BiDi stream callbacks).
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_audio_chunk(audio)
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
|
||||
yield None
|
||||
|
||||
async def _connect(self):
|
||||
"""Connect to the SageMaker endpoint and start the BiDi session.
|
||||
|
||||
Builds the Deepgram query string from settings, creates the BiDi client,
|
||||
starts the streaming session, and launches background tasks for processing
|
||||
responses and sending KeepAlive messages.
|
||||
"""
|
||||
logger.debug("Connecting to Deepgram on SageMaker...")
|
||||
|
||||
live_options = LiveOptions(
|
||||
**{**self._settings.live_options.to_dict(), "sample_rate": self.sample_rate}
|
||||
)
|
||||
|
||||
# Build query string from live_options, converting booleans to strings
|
||||
query_params = {}
|
||||
for key, value in live_options.to_dict().items():
|
||||
if value is not None:
|
||||
# Convert boolean values to lowercase strings for Deepgram API
|
||||
if isinstance(value, bool):
|
||||
query_params[key] = str(value).lower()
|
||||
else:
|
||||
query_params[key] = str(value)
|
||||
|
||||
query_string = "&".join(f"{k}={v}" for k, v in query_params.items())
|
||||
|
||||
# Create BiDi client
|
||||
self._client = SageMakerBidiClient(
|
||||
endpoint_name=self._endpoint_name,
|
||||
region=self._region,
|
||||
model_invocation_path="v1/listen",
|
||||
model_query_string=query_string,
|
||||
)
|
||||
|
||||
try:
|
||||
# Start the session
|
||||
await self._client.start_session()
|
||||
|
||||
# Start processing responses in the background
|
||||
self._response_task = self.create_task(self._process_responses())
|
||||
|
||||
# Start keepalive task to maintain connection
|
||||
self._keepalive_task = self.create_task(self._send_keepalive())
|
||||
|
||||
logger.debug("Connected to Deepgram on SageMaker")
|
||||
await self._call_event_handler("on_connected")
|
||||
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
await self._call_event_handler("on_connection_error", str(e))
|
||||
|
||||
async def _disconnect(self):
|
||||
"""Disconnect from the SageMaker endpoint.
|
||||
|
||||
Sends a CloseStream message to Deepgram, cancels background tasks
|
||||
(KeepAlive and response processing), and closes the BiDi session.
|
||||
Safe to call multiple times.
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
logger.debug("Disconnecting from Deepgram on SageMaker...")
|
||||
|
||||
# Send CloseStream message to Deepgram
|
||||
try:
|
||||
await self._client.send_json({"type": "CloseStream"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send CloseStream message: {e}")
|
||||
|
||||
# Cancel keepalive task
|
||||
if self._keepalive_task and not self._keepalive_task.done():
|
||||
await self.cancel_task(self._keepalive_task)
|
||||
|
||||
# Cancel response processing task
|
||||
if self._response_task and not self._response_task.done():
|
||||
await self.cancel_task(self._response_task)
|
||||
|
||||
# Close the BiDi session
|
||||
await self._client.close_session()
|
||||
|
||||
logger.debug("Disconnected from Deepgram on SageMaker")
|
||||
await self._call_event_handler("on_disconnected")
|
||||
|
||||
async def _send_keepalive(self):
|
||||
"""Send periodic KeepAlive messages to maintain the connection.
|
||||
|
||||
Sends a KeepAlive JSON message to Deepgram every 5 seconds while the
|
||||
connection is active. This prevents the connection from timing out during
|
||||
periods of silence.
|
||||
"""
|
||||
while self._client and self._client.is_active:
|
||||
await asyncio.sleep(5)
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "KeepAlive"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send KeepAlive: {e}")
|
||||
|
||||
async def _process_responses(self):
|
||||
"""Process streaming responses from Deepgram on SageMaker.
|
||||
|
||||
Continuously receives responses from the BiDi stream, decodes the payload,
|
||||
parses JSON responses from Deepgram, and processes transcription results.
|
||||
Runs as a background task until the connection is closed or cancelled.
|
||||
"""
|
||||
try:
|
||||
while self._client and self._client.is_active:
|
||||
result = await self._client.receive_response()
|
||||
|
||||
if result is None:
|
||||
break
|
||||
|
||||
# Check if this is a PayloadPart with bytes
|
||||
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
|
||||
if result.value.bytes_:
|
||||
response_data = result.value.bytes_.decode("utf-8")
|
||||
|
||||
try:
|
||||
# Parse JSON response from Deepgram
|
||||
parsed = json.loads(response_data)
|
||||
|
||||
# Extract and process transcript if available
|
||||
if "channel" in parsed:
|
||||
await self._handle_transcript_response(parsed)
|
||||
|
||||
except json.JSONDecodeError:
|
||||
logger.warning(f"Non-JSON response: {response_data}")
|
||||
|
||||
except asyncio.CancelledError:
|
||||
logger.debug("Response processor cancelled")
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
finally:
|
||||
logger.debug("Response processor stopped")
|
||||
|
||||
async def _handle_transcript_response(self, parsed: dict):
|
||||
"""Handle a transcript response from Deepgram.
|
||||
|
||||
Extracts the transcript text, determines if it's final or interim, extracts
|
||||
language information, and pushes the appropriate frame (TranscriptionFrame
|
||||
or InterimTranscriptionFrame) downstream.
|
||||
|
||||
Args:
|
||||
parsed: The parsed JSON response from Deepgram containing channel,
|
||||
alternatives, transcript, and metadata.
|
||||
"""
|
||||
alternatives = parsed.get("channel", {}).get("alternatives", [])
|
||||
if not alternatives or not alternatives[0].get("transcript"):
|
||||
return
|
||||
|
||||
transcript = alternatives[0]["transcript"]
|
||||
if not transcript.strip():
|
||||
return
|
||||
|
||||
is_final = parsed.get("is_final", False)
|
||||
|
||||
# Extract language if available
|
||||
language = None
|
||||
if alternatives[0].get("languages"):
|
||||
language = alternatives[0]["languages"][0]
|
||||
language = Language(language)
|
||||
|
||||
if is_final:
|
||||
# Check if this response is from a finalize() call.
|
||||
# Only mark as finalized when both we requested it AND Deepgram confirms it.
|
||||
from_finalize = parsed.get("from_finalize", False)
|
||||
if from_finalize:
|
||||
self.confirm_finalize()
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
result=parsed,
|
||||
)
|
||||
)
|
||||
await self._handle_transcription(transcript, is_final, language)
|
||||
await self.stop_processing_metrics()
|
||||
else:
|
||||
# Interim transcription
|
||||
await self.push_frame(
|
||||
InterimTranscriptionFrame(
|
||||
transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
result=parsed,
|
||||
)
|
||||
)
|
||||
|
||||
@traced_stt
|
||||
async def _handle_transcription(
|
||||
self, transcript: str, is_final: bool, language: Optional[Language] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing.
|
||||
|
||||
This method is decorated with @traced_stt for observability and tracing
|
||||
integration. The actual transcription processing is handled by the parent
|
||||
class and observers.
|
||||
|
||||
Args:
|
||||
transcript: The transcribed text.
|
||||
is_final: Whether this is a final transcription result.
|
||||
language: The detected language of the transcription, if available.
|
||||
"""
|
||||
pass
|
||||
|
||||
async def _start_metrics(self):
|
||||
"""Start processing metrics collection."""
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process frames with Deepgram SageMaker-specific handling.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame processing.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
# Start metrics when user starts speaking (if VAD is not provided by Deepgram)
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self._start_metrics()
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
# https://developers.deepgram.com/docs/finalize
|
||||
# Mark that we're awaiting a from_finalize response
|
||||
self.request_finalize()
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "Finalize"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Error sending Finalize message: {e}")
|
||||
logger.trace(f"Triggered finalize event on: {frame.name=}, {direction=}")
|
||||
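The response handling in the file above reads `channel.alternatives[0].transcript`, `is_final`, `from_finalize`, and an optional `languages` list from each Deepgram message. A minimal sketch of that extraction as a pure function is given below. `Transcript` and `parse_transcript` are hypothetical names, and the language is kept as a plain string rather than converted to Pipecat's `Language` enum.

```python
from typing import Any, NamedTuple, Optional


class Transcript(NamedTuple):
    text: str
    is_final: bool
    language: Optional[str]
    from_finalize: bool


def parse_transcript(parsed: dict[str, Any]) -> Optional[Transcript]:
    """Extract transcript fields, or return None when there is no text."""
    alternatives = parsed.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return None
    text = alternatives[0].get("transcript") or ""
    if not text.strip():
        return None
    languages = alternatives[0].get("languages") or []
    return Transcript(
        text=text,
        is_final=bool(parsed.get("is_final", False)),
        language=languages[0] if languages else None,
        from_finalize=bool(parsed.get("from_finalize", False)),
    )
```

Returning `None` for empty or whitespace-only transcripts mirrors the early returns in the handler, so a caller only has to decide between a final and an interim frame.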
360
src/pipecat/services/deepgram/sagemaker/tts.py
Normal file
@@ -0,0 +1,360 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Deepgram text-to-speech service for AWS SageMaker.
|
||||
|
||||
This module provides a Pipecat TTS service that connects to Deepgram models
|
||||
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
|
||||
low-latency real-time speech synthesis with support for interruptions and
|
||||
streaming audio output.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Any, AsyncGenerator, Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
BotStoppedSpeakingFrame,
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
InterruptionFrame,
|
||||
LLMFullResponseEndFrame,
|
||||
StartFrame,
|
||||
TTSAudioRawFrame,
|
||||
TTSStartedFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
|
||||
from pipecat.services.settings import NOT_GIVEN, TTSSettings, _NotGiven
|
||||
from pipecat.services.tts_service import TTSService
|
||||
from pipecat.utils.tracing.service_decorators import traced_tts
|
||||
|
||||
|
||||
@dataclass
|
||||
class DeepgramSageMakerTTSSettings(TTSSettings):
|
||||
"""Settings for Deepgram SageMaker TTS service.
|
||||
|
||||
Parameters:
|
||||
encoding: Audio encoding format (e.g. "linear16").
|
||||
"""
|
||||
|
||||
encoding: str | _NotGiven = field(default_factory=lambda: NOT_GIVEN)
|
||||
|
||||
|
||||
class DeepgramSageMakerTTSService(TTSService):
|
||||
"""Deepgram text-to-speech service for AWS SageMaker.
|
||||
|
||||
Provides real-time speech synthesis using Deepgram models deployed on
|
||||
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
|
||||
audio generation with support for interruptions via the Clear message.
|
||||
|
||||
Requirements:
|
||||
|
||||
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
|
||||
- A deployed SageMaker endpoint with Deepgram TTS model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
|
||||
- ``pipecat-ai[sagemaker]`` installed
|
||||
|
||||
Example::
|
||||
|
||||
tts = DeepgramSageMakerTTSService(
|
||||
endpoint_name="my-deepgram-tts-endpoint",
|
||||
region="us-east-2",
|
||||
voice="aura-2-helena-en",
|
||||
)
|
||||
"""
|
||||
|
||||
_settings: DeepgramSageMakerTTSSettings
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
endpoint_name: str,
|
||||
region: str,
|
||||
voice: str = "aura-2-helena-en",
|
||||
sample_rate: Optional[int] = None,
|
||||
encoding: str = "linear16",
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
endpoint_name: Name of the SageMaker endpoint with Deepgram TTS model
|
||||
deployed (e.g., "my-deepgram-tts-endpoint").
|
||||
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
|
||||
voice: Voice model to use for synthesis. Defaults to "aura-2-helena-en".
|
||||
sample_rate: Audio sample rate in Hz. If None, uses the value from StartFrame.
|
||||
encoding: Audio encoding format. Defaults to "linear16".
|
||||
**kwargs: Additional arguments passed to the parent TTSService.
|
||||
"""
|
||||
super().__init__(
|
||||
sample_rate=sample_rate,
|
||||
push_stop_frames=True,
|
||||
pause_frame_processing=True,
|
||||
append_trailing_space=True,
|
||||
settings=DeepgramSageMakerTTSSettings(
|
||||
model=voice,
|
||||
voice=voice,
|
||||
language=None,
|
||||
encoding=encoding,
|
||||
),
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self._endpoint_name = endpoint_name
|
||||
self._region = region
|
||||
|
||||
self._client: Optional[SageMakerBidiClient] = None
|
||||
self._response_task: Optional[asyncio.Task] = None
|
||||
self._context_id: Optional[str] = None
|
||||
self._ttfb_started: bool = False
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True, as Deepgram SageMaker TTS service supports metrics generation.
|
||||
"""
|
||||
return True
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
"""Start the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
frame: The start frame containing initialization parameters.
|
||||
"""
|
||||
await super().start(frame)
|
||||
await self._connect()
|
||||
|
||||
async def stop(self, frame: EndFrame):
|
||||
"""Stop the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
frame: The end frame.
|
||||
"""
|
||||
await super().stop(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def cancel(self, frame: CancelFrame):
|
||||
"""Cancel the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
frame: The cancel frame.
|
||||
"""
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process frames with special handling for LLM response end.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame processing.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, (LLMFullResponseEndFrame, EndFrame)):
|
||||
await self.flush_audio()
|
||||
elif isinstance(frame, BotStoppedSpeakingFrame):
|
||||
self._ttfb_started = False
|
||||
|
||||
async def _connect(self):
|
||||
"""Connect to the SageMaker endpoint and start the BiDi session.
|
||||
|
||||
Builds the Deepgram TTS query string, creates the BiDi client,
|
||||
starts the streaming session, and launches a background task for processing
|
||||
responses.
|
||||
"""
|
||||
logger.debug("Connecting to Deepgram TTS on SageMaker...")
|
||||
|
||||
query_string = (
|
||||
f"model={self._settings.voice}&encoding={self._settings.encoding}"
|
||||
f"&sample_rate={self.sample_rate}"
|
||||
)
|
||||
|
||||
self._client = SageMakerBidiClient(
|
||||
endpoint_name=self._endpoint_name,
|
||||
region=self._region,
|
||||
model_invocation_path="v1/speak",
|
||||
model_query_string=query_string,
|
||||
)
|
||||
|
||||
try:
|
||||
await self._client.start_session()
|
||||
|
||||
self._response_task = self.create_task(self._process_responses())
|
||||
|
||||
logger.debug("Connected to Deepgram TTS on SageMaker")
|
||||
await self._call_event_handler("on_connected")
|
||||
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
await self._call_event_handler("on_connection_error", str(e))
|
||||
|
||||
async def _disconnect(self):
|
||||
"""Disconnect from the SageMaker endpoint.
|
||||
|
||||
Sends a Close message to Deepgram, cancels the response processing task,
|
||||
and closes the BiDi session. Safe to call multiple times.
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
logger.debug("Disconnecting from Deepgram TTS on SageMaker...")
|
||||
|
||||
try:
|
||||
await self._client.send_json({"type": "Close"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send Close message: {e}")
|
||||
|
||||
if self._response_task and not self._response_task.done():
|
||||
await self.cancel_task(self._response_task)
|
||||
|
||||
await self._client.close_session()
|
||||
|
||||
logger.debug("Disconnected from Deepgram TTS on SageMaker")
|
||||
await self._call_event_handler("on_disconnected")
|
||||
|
||||
async def _update_settings(self, delta: TTSSettings) -> dict[str, Any]:
|
||||
"""Apply a settings delta and reconnect if necessary.
|
||||
|
||||
Since all settings are part of the SageMaker session query string,
|
||||
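any setting change requires reconnecting to take effect. Reconnecting is not
implemented yet (see the TODO below), so changes are stored and reported in
a warning but are not applied to the live session.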
|
||||
"""
|
||||
changed = await super()._update_settings(delta)
|
||||
|
||||
if not changed:
|
||||
return changed
|
||||
|
||||
# Deepgram uses voice as the model, so keep them in sync for metrics
|
||||
if "voice" in changed:
|
||||
self._settings.model = self._settings.voice
|
||||
self._sync_model_name_to_metrics()
|
||||
|
||||
# TODO: someday we could reconnect here to apply updated settings.
|
||||
# Code might look something like the below:
|
||||
# await self._disconnect()
|
||||
# await self._connect()
|
||||
|
||||
self._warn_unhandled_updated_settings(changed)
|
||||
|
||||
return changed
|
||||
|
||||
async def _process_responses(self):
|
||||
"""Process streaming responses from Deepgram TTS on SageMaker.
|
||||
|
||||
Continuously receives responses from the BiDi stream. Attempts to decode
|
||||
each payload as UTF-8 JSON for control messages (Flushed, Cleared, Metadata,
|
||||
Warning). If decoding fails, treats the payload as raw audio bytes and pushes
|
||||
a TTSAudioRawFrame downstream.
|
||||
"""
|
||||
try:
|
||||
while self._client and self._client.is_active:
|
||||
result = await self._client.receive_response()
|
||||
|
||||
if result is None:
|
||||
break
|
||||
|
||||
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
|
||||
if result.value.bytes_:
|
||||
payload = result.value.bytes_
|
||||
|
||||
# Try to decode as JSON control message first
|
||||
try:
|
||||
response_data = payload.decode("utf-8")
|
||||
parsed = json.loads(response_data)
|
||||
msg_type = parsed.get("type")
|
||||
|
||||
if msg_type == "Metadata":
|
||||
logger.trace(f"Received metadata: {parsed}")
|
||||
elif msg_type == "Flushed":
|
||||
logger.trace(f"Received Flushed: {parsed}")
|
||||
elif msg_type == "Cleared":
|
||||
logger.trace(f"Received Cleared: {parsed}")
|
||||
elif msg_type == "Warning":
|
||||
logger.warning(
|
||||
f"{self} warning: "
|
||||
f"{parsed.get('description', 'Unknown warning')}"
|
||||
)
|
||||
else:
|
||||
logger.debug(f"Received unknown message type: {parsed}")
|
||||
|
||||
except (UnicodeDecodeError, json.JSONDecodeError):
|
||||
# Not JSON — treat as raw audio bytes
|
||||
await self.stop_ttfb_metrics()
|
||||
frame = TTSAudioRawFrame(
|
||||
payload,
|
||||
self.sample_rate,
|
||||
1,
|
||||
context_id=self._context_id,
|
||||
)
|
||||
await self.push_frame(frame)
|
||||
|
||||
except asyncio.CancelledError:
|
||||
logger.debug("TTS response processor cancelled")
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
finally:
|
||||
logger.debug("TTS response processor stopped")
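The decode-or-audio dispatch in `_process_responses` can be sketched in isolation. This is a minimal illustration, assuming only what the method above shows: control messages arrive as UTF-8 JSON objects with a `type` field, and anything that fails to decode is audio. The `classify_payload` helper and its return shape are hypothetical and not part of Pipecat. The `isinstance(parsed, dict)` check is an addition of this sketch: a payload that happens to be valid JSON but not an object would otherwise reach `parsed.get(...)` and raise.

```python
import json
from typing import Any, Tuple, Union


def classify_payload(payload: bytes) -> Tuple[str, Union[dict, bytes]]:
    """Return ("control", message) for JSON control messages, else ("audio", payload).

    Hypothetical helper mirroring the try/except dispatch in the diff.
    """
    try:
        parsed: Any = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not valid UTF-8 JSON: treat the payload as raw audio bytes.
        return "audio", payload
    if not isinstance(parsed, dict):
        # Valid JSON but not an object (for example audio bytes that spell "12").
        return "audio", payload
    return "control", parsed
```

Keeping the classification separate from frame pushing makes the fallback path easy to unit test without a live SageMaker session.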
|
||||
|
||||
async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
|
||||
"""Handle interruption by sending Clear message to Deepgram.
|
||||
|
||||
The Clear message will clear Deepgram's internal text buffer and stop
|
||||
sending audio, allowing for a new response to be generated.
|
||||
"""
|
||||
await super()._handle_interruption(frame, direction)
|
||||
self._ttfb_started = False
|
||||
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "Clear"})
|
||||
except Exception as e:
|
||||
logger.error(f"{self} error sending Clear message: {e}")
|
||||
|
||||
async def flush_audio(self):
|
||||
"""Flush any pending audio synthesis by sending Flush command.
|
||||
|
||||
This should be called when the LLM finishes a complete response to force
|
||||
generation of audio from Deepgram's internal text buffer.
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "Flush"})
|
||||
except Exception as e:
|
||||
logger.error(f"{self} error sending Flush message: {e}")
|
||||
|
||||
@traced_tts
|
||||
async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
|
||||
"""Generate speech from text using Deepgram TTS on SageMaker.
|
||||
|
||||
Args:
|
||||
text: The text to synthesize into speech.
|
||||
context_id: The context ID for tracking audio frames.
|
||||
|
||||
Yields:
|
||||
Frame: TTSStartedFrame, then None (audio comes asynchronously via
|
||||
the response processor).
|
||||
"""
|
||||
logger.debug(f"{self}: Generating TTS [{text}]")
|
||||
|
||||
try:
|
||||
if not self._ttfb_started:
|
||||
await self.start_ttfb_metrics()
|
||||
self._ttfb_started = True
|
||||
await self.start_tts_usage_metrics(text)
|
||||
|
||||
yield TTSStartedFrame(context_id=context_id)
|
||||
self._context_id = context_id
|
||||
|
||||
await self._client.send_json({"type": "Speak", "text": text})
|
||||
|
||||
yield None
|
||||
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
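Taken together, the methods above send four kinds of JSON messages: `Speak` per text chunk, `Flush` when the LLM response ends, `Clear` on interruption, and `Close` on disconnect. The sketch below lists the messages for a single turn. It is a simplified illustration with a hypothetical `messages_for_turn` helper, and it assumes an interrupted turn ends with `Clear` rather than `Flush`; the real service sends these from separate frame handlers.

```python
from typing import Dict, List


def messages_for_turn(sentences: List[str], interrupted: bool = False) -> List[Dict[str, str]]:
    """Build the control messages for one bot turn (hypothetical helper)."""
    # One Speak message per aggregated sentence, as in run_tts().
    messages: List[Dict[str, str]] = [{"type": "Speak", "text": s} for s in sentences]
    # Flush forces synthesis of buffered text; Clear drops it on interruption.
    messages.append({"type": "Clear"} if interrupted else {"type": "Flush"})
    return messages
```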
|
||||
@@ -558,7 +558,7 @@ class DeepgramSTTService(STTService):
|
||||
await self._call_event_handler("on_speech_started", message)
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
if self._should_interrupt:
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
async def _on_utterance_end(self, message):
|
||||
await self._call_event_handler("on_utterance_end", message)
|
||||
|
||||
@@ -4,444 +4,15 @@
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Deepgram speech-to-text service for AWS SageMaker.
|
||||
"""Deprecated: use ``pipecat.services.deepgram.sagemaker.stt`` instead."""
|
||||
|
||||
This module provides a Pipecat STT service that connects to Deepgram models
|
||||
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
|
||||
low-latency real-time transcription with support for interim results, multiple
|
||||
languages, and various Deepgram features.
|
||||
"""
|
||||
import warnings
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from typing import Any, AsyncGenerator, Dict, Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
InterimTranscriptionFrame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
VADUserStartedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
warnings.warn(
|
||||
"Module `pipecat.services.deepgram.stt_sagemaker` is deprecated, "
|
||||
"use `pipecat.services.deepgram.sagemaker.stt` instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTSettings
|
||||
from pipecat.services.settings import STTSettings
|
||||
from pipecat.services.stt_latency import DEEPGRAM_SAGEMAKER_TTFS_P99
|
||||
from pipecat.services.stt_service import STTService
|
||||
from pipecat.transcriptions.language import Language
|
||||
from pipecat.utils.time import time_now_iso8601
|
||||
from pipecat.utils.tracing.service_decorators import traced_stt
|
||||
|
||||
try:
|
||||
from pipecat.services.deepgram.stt import LiveOptions
|
||||
except ModuleNotFoundError as e:
|
||||
logger.error(f"Exception: {e}")
|
||||
logger.error(
|
||||
"In order to use DeepgramSageMakerSTTService, you need to `pip install pipecat-ai[deepgram,sagemaker]`."
|
||||
)
|
||||
raise Exception(f"Missing module: {e}")
|
||||
|
||||
|
||||
@dataclass
|
||||
class DeepgramSageMakerSTTSettings(DeepgramSTTSettings):
|
||||
"""Settings for the Deepgram SageMaker STT service.
|
||||
|
||||
See ``DeepgramSTTSettings`` for full documentation.
|
||||
"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
class DeepgramSageMakerSTTService(STTService):
|
||||
"""Deepgram speech-to-text service for AWS SageMaker.
|
||||
|
||||
Provides real-time speech recognition using Deepgram models deployed on
|
||||
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
|
||||
transcription with support for interim results, speaker diarization, and
|
||||
multiple languages.
|
||||
|
||||
Requirements:
|
||||
|
||||
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
|
||||
- A deployed SageMaker endpoint with Deepgram model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
|
||||
- Deepgram SDK for LiveOptions configuration
|
||||
|
||||
Example::
|
||||
|
||||
stt = DeepgramSageMakerSTTService(
|
||||
endpoint_name="my-deepgram-endpoint",
|
||||
region="us-east-2",
|
||||
live_options=LiveOptions(
|
||||
model="nova-3",
|
||||
language="en",
|
||||
interim_results=True,
|
||||
punctuate=True,
|
||||
),
|
||||
)
|
||||
"""
|
||||
|
||||
_settings: DeepgramSageMakerSTTSettings
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
endpoint_name: str,
|
||||
region: str,
|
||||
sample_rate: Optional[int] = None,
|
||||
live_options: Optional[LiveOptions] = None,
|
||||
ttfs_p99_latency: Optional[float] = DEEPGRAM_SAGEMAKER_TTFS_P99,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
endpoint_name: Name of the SageMaker endpoint with Deepgram model
|
||||
deployed (e.g., "my-deepgram-nova-3-endpoint").
|
||||
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
|
||||
sample_rate: Audio sample rate in Hz. If None, uses value from
|
||||
live_options or defaults to the value from StartFrame.
|
||||
live_options: Deepgram LiveOptions configuration. Treated as a
|
||||
delta from a set of sensible defaults — only the fields you
|
||||
set are overridden; all others keep their default values.
|
||||
ttfs_p99_latency: P99 latency from speech end to final transcript in seconds.
|
||||
Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
|
||||
**kwargs: Additional arguments passed to the parent STTService.
|
||||
"""
|
||||
sample_rate = sample_rate or (live_options.sample_rate if live_options else None)
|
||||
|
||||
settings = DeepgramSageMakerSTTSettings(
|
||||
model="nova-3",
|
||||
language=Language.EN,
|
||||
encoding="linear16",
|
||||
channels=1,
|
||||
interim_results=True,
|
||||
punctuate=True,
|
||||
)
|
||||
|
||||
if live_options:
|
||||
lo_dict = live_options.to_dict()
|
||||
delta = DeepgramSageMakerSTTSettings.from_mapping(
|
||||
{k: v for k, v in lo_dict.items() if k != "sample_rate"}
|
||||
)
|
||||
settings.apply_update(delta)
|
||||
|
||||
super().__init__(
|
||||
sample_rate=sample_rate,
|
||||
ttfs_p99_latency=ttfs_p99_latency,
|
||||
settings=settings,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self._endpoint_name = endpoint_name
|
||||
self._region = region
|
||||
|
||||
self._client: Optional[SageMakerBidiClient] = None
|
||||
self._response_task: Optional[asyncio.Task] = None
|
||||
self._keepalive_task: Optional[asyncio.Task] = None
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True, as Deepgram SageMaker service supports metrics generation.
|
||||
"""
|
||||
return True
|
||||
|
||||
async def _update_settings(self, delta: STTSettings) -> dict[str, Any]:
|
||||
"""Apply a settings delta and warn about unhandled changes."""
|
||||
changed = await super()._update_settings(delta)
|
||||
|
||||
if not changed:
|
||||
return changed
|
||||
|
||||
# TODO: someday we could reconnect here to apply updated settings.
|
||||
# Code might look something like the below:
|
||||
# await self._disconnect()
|
||||
# await self._connect()
|
||||
|
||||
self._warn_unhandled_updated_settings(changed)
|
||||
|
||||
return changed
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
"""Start the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The start frame containing initialization parameters.
|
||||
"""
|
||||
await super().start(frame)
|
||||
await self._connect()
|
||||
|
||||
async def stop(self, frame: EndFrame):
|
||||
"""Stop the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The end frame.
|
||||
"""
|
||||
await super().stop(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def cancel(self, frame: CancelFrame):
|
||||
"""Cancel the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The cancel frame.
|
||||
"""
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
|
||||
"""Send audio data to Deepgram for transcription.
|
||||
|
||||
Args:
|
||||
audio: Raw audio bytes to transcribe.
|
||||
|
||||
Yields:
|
||||
Frame: None (transcription results come via BiDi stream callbacks).
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_audio_chunk(audio)
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
|
||||
yield None
|
||||
|
||||
async def _connect(self):
|
||||
"""Connect to the SageMaker endpoint and start the BiDi session.
|
||||
|
||||
Builds the Deepgram query string from settings, creates the BiDi client,
|
||||
starts the streaming session, and launches background tasks for processing
|
||||
responses and sending KeepAlive messages.
|
||||
"""
|
||||
logger.debug("Connecting to Deepgram on SageMaker...")
|
||||
|
||||
# Reconstruct a LiveOptions from the flat settings to build the query string.
|
||||
live_options = LiveOptions(**self._settings.given_fields())
|
||||
|
||||
# Build query string from live_options, converting booleans to strings
|
||||
query_params = {}
|
||||
for key, value in live_options.to_dict().items():
|
||||
if value is not None:
|
||||
# Convert boolean values to lowercase strings for Deepgram API
|
||||
if isinstance(value, bool):
|
||||
query_params[key] = str(value).lower()
|
||||
else:
|
||||
query_params[key] = str(value)
|
||||
query_params["sample_rate"] = str(self.sample_rate)
|
||||
|
||||
query_string = "&".join(f"{k}={v}" for k, v in query_params.items())
|
||||
|
||||
# Create BiDi client
|
||||
self._client = SageMakerBidiClient(
|
||||
endpoint_name=self._endpoint_name,
|
||||
region=self._region,
|
||||
model_invocation_path="v1/listen",
|
||||
model_query_string=query_string,
|
||||
)
|
||||
|
||||
try:
|
||||
# Start the session
|
||||
await self._client.start_session()
|
||||
|
||||
# Start processing responses in the background
|
||||
self._response_task = self.create_task(self._process_responses())
|
||||
|
||||
# Start keepalive task to maintain connection
|
||||
self._keepalive_task = self.create_task(self._send_keepalive())
|
||||
|
||||
logger.debug("Connected to Deepgram on SageMaker")
|
||||
await self._call_event_handler("on_connected")
|
||||
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
await self._call_event_handler("on_connection_error", str(e))
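The query-string construction in `_connect` can be sketched as a standalone function. This is a minimal illustration, not the module's code: `build_query_string` is a hypothetical name, and it swaps the plain `"&".join(...)` used above for `urllib.parse.urlencode`, which additionally percent-encodes values. Whether the SageMaker endpoint expects encoded values is an assumption to verify against your deployment.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def build_query_string(options: Dict[str, Any], sample_rate: int) -> str:
    """Turn an options mapping into a Deepgram-style query string (hypothetical helper)."""
    params: Dict[str, str] = {}
    for key, value in options.items():
        if value is None:
            continue  # unset options are omitted, as in the diff
        # Booleans become lowercase strings ("true"/"false"); everything else is str().
        params[key] = str(value).lower() if isinstance(value, bool) else str(value)
    # The sample rate is always appended last, taken from the service.
    params["sample_rate"] = str(sample_rate)
    return urlencode(params)
```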
|
||||
|
||||
async def _disconnect(self):
|
||||
"""Disconnect from the SageMaker endpoint.
|
||||
|
||||
Sends a CloseStream message to Deepgram, cancels background tasks
|
||||
(KeepAlive and response processing), and closes the BiDi session.
|
||||
Safe to call multiple times.
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
logger.debug("Disconnecting from Deepgram on SageMaker...")
|
||||
|
||||
# Send CloseStream message to Deepgram
|
||||
try:
|
||||
await self._client.send_json({"type": "CloseStream"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send CloseStream message: {e}")
|
||||
|
||||
# Cancel keepalive task
|
||||
if self._keepalive_task and not self._keepalive_task.done():
|
||||
await self.cancel_task(self._keepalive_task)
|
||||
|
||||
# Cancel response processing task
|
||||
if self._response_task and not self._response_task.done():
|
||||
await self.cancel_task(self._response_task)
|
||||
|
||||
# Close the BiDi session
|
||||
await self._client.close_session()
|
||||
|
||||
logger.debug("Disconnected from Deepgram on SageMaker")
|
||||
await self._call_event_handler("on_disconnected")
|
||||
|
||||
async def _send_keepalive(self):
|
||||
"""Send periodic KeepAlive messages to maintain the connection.
|
||||
|
||||
Sends a KeepAlive JSON message to Deepgram every 5 seconds while the
|
||||
connection is active. This prevents the connection from timing out during
|
||||
periods of silence.
|
||||
"""
|
||||
while self._client and self._client.is_active:
|
||||
await asyncio.sleep(5)
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "KeepAlive"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send KeepAlive: {e}")
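The keepalive loop can be exercised without a network connection. The sketch below is a hedged illustration: `FakeClient` is a hypothetical stand-in for the BiDi client that goes inactive after a fixed number of sends, and the interval is a parameter rather than the fixed 5 seconds used above.

```python
import asyncio
from typing import Dict, List


class FakeClient:
    """Hypothetical stand-in for the BiDi client; inactive after `max_sends` messages."""

    def __init__(self, max_sends: int) -> None:
        self.sent: List[Dict[str, str]] = []
        self._max_sends = max_sends

    @property
    def is_active(self) -> bool:
        return len(self.sent) < self._max_sends

    async def send_json(self, message: Dict[str, str]) -> None:
        self.sent.append(message)


async def send_keepalive(client: FakeClient, interval: float) -> None:
    """Send KeepAlive messages every `interval` seconds while the client is active."""
    while client.is_active:
        await asyncio.sleep(interval)
        # Re-check after sleeping: the session may have closed in the meantime.
        if client.is_active:
            await client.send_json({"type": "KeepAlive"})


def run_demo() -> List[Dict[str, str]]:
    client = FakeClient(max_sends=3)
    asyncio.run(send_keepalive(client, interval=0.001))
    return client.sent
```

The second `is_active` check is what keeps the loop from writing to a session that was closed during the sleep.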
|
||||
|
||||
async def _process_responses(self):
|
||||
"""Process streaming responses from Deepgram on SageMaker.
|
||||
|
||||
Continuously receives responses from the BiDi stream, decodes the payload,
|
||||
parses JSON responses from Deepgram, and processes transcription results.
|
||||
Runs as a background task until the connection is closed or cancelled.
|
||||
"""
|
||||
try:
|
||||
while self._client and self._client.is_active:
|
||||
result = await self._client.receive_response()
|
||||
|
||||
if result is None:
|
||||
break
|
||||
|
||||
# Check if this is a PayloadPart with bytes
|
||||
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
|
||||
if result.value.bytes_:
|
||||
response_data = result.value.bytes_.decode("utf-8")
|
||||
|
||||
try:
|
||||
# Parse JSON response from Deepgram
|
||||
parsed = json.loads(response_data)
|
||||
|
||||
# Extract and process transcript if available
|
||||
if "channel" in parsed:
|
||||
await self._handle_transcript_response(parsed)
|
||||
|
||||
except json.JSONDecodeError:
|
||||
logger.warning(f"Non-JSON response: {response_data}")
|
||||
|
||||
except asyncio.CancelledError:
|
||||
logger.debug("Response processor cancelled")
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
finally:
|
||||
logger.debug("Response processor stopped")
|
||||
|
||||
async def _handle_transcript_response(self, parsed: dict):
|
||||
"""Handle a transcript response from Deepgram.
|
||||
|
||||
Extracts the transcript text, determines if it's final or interim, extracts
|
||||
language information, and pushes the appropriate frame (TranscriptionFrame
|
||||
or InterimTranscriptionFrame) downstream.
|
||||
|
||||
Args:
|
||||
parsed: The parsed JSON response from Deepgram containing channel,
|
||||
alternatives, transcript, and metadata.
|
||||
"""
|
||||
alternatives = parsed.get("channel", {}).get("alternatives", [])
|
||||
if not alternatives or not alternatives[0].get("transcript"):
|
||||
return
|
||||
|
||||
transcript = alternatives[0]["transcript"]
|
||||
if not transcript.strip():
|
||||
return
|
||||
|
||||
is_final = parsed.get("is_final", False)
|
||||
|
||||
# Extract language if available
|
||||
language = None
|
||||
if alternatives[0].get("languages"):
|
||||
language = alternatives[0]["languages"][0]
|
||||
language = Language(language)
|
||||
|
||||
if is_final:
|
||||
# Check if this response is from a finalize() call.
|
||||
# Only mark as finalized when both we requested it AND Deepgram confirms it.
|
||||
from_finalize = parsed.get("from_finalize", False)
|
||||
if from_finalize:
|
||||
self.confirm_finalize()
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
result=parsed,
|
||||
)
|
||||
)
|
||||
await self._handle_transcription(transcript, is_final, language)
|
||||
await self.stop_processing_metrics()
|
||||
else:
|
||||
# Interim transcription
|
||||
await self.push_frame(
|
||||
InterimTranscriptionFrame(
|
||||
transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
result=parsed,
|
||||
)
|
||||
)
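The field extraction in `_handle_transcript_response` can be sketched as a pure function. This is a minimal illustration that assumes the response shape visible in the method above (`channel.alternatives[0].transcript`, optional `languages`, top-level `is_final`). The `extract_transcript` helper is hypothetical, and it returns the language as a plain string instead of converting it to `Language`.

```python
from typing import Any, Dict, Optional, Tuple


def extract_transcript(parsed: Dict[str, Any]) -> Optional[Tuple[str, bool, Optional[str]]]:
    """Return (transcript, is_final, language), or None when there is no text."""
    alternatives = parsed.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return None
    transcript = alternatives[0].get("transcript") or ""
    if not transcript.strip():
        return None  # empty or whitespace-only results are dropped
    languages = alternatives[0].get("languages") or []
    language = languages[0] if languages else None
    return transcript, bool(parsed.get("is_final", False)), language
```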
|
||||
|
||||
@traced_stt
|
||||
async def _handle_transcription(
|
||||
self, transcript: str, is_final: bool, language: Optional[Language] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing.
|
||||
|
||||
This method is decorated with @traced_stt for observability and tracing
|
||||
integration. The actual transcription processing is handled by the parent
|
||||
class and observers.
|
||||
|
||||
Args:
|
||||
transcript: The transcribed text.
|
||||
is_final: Whether this is a final transcription result.
|
||||
language: The detected language of the transcription, if available.
|
||||
"""
|
||||
pass
|
||||
|
||||
async def _start_metrics(self):
|
||||
"""Start processing metrics collection."""
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process frames with Deepgram SageMaker-specific handling.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame processing.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
# Start metrics when user starts speaking (if VAD is not provided by Deepgram)
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self._start_metrics()
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
# https://developers.deepgram.com/docs/finalize
|
||||
# Mark that we're awaiting a from_finalize response
|
||||
self.request_finalize()
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "Finalize"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Error sending Finalize message: {e}")
|
||||
logger.trace(f"Triggered finalize event on: {frame.name=}, {direction=}")
|
||||
from pipecat.services.deepgram.sagemaker.stt import * # noqa: E402, F401, F403
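The old module is reduced to a shim: it warns at import time and re-exports everything from the new location. The warning half of that pattern can be sketched and checked on its own. The module names below are hypothetical placeholders, and `warn_module_moved` and `capture_warning` are illustrative helpers rather than Pipecat functions.

```python
import warnings


def warn_module_moved(old: str, new: str) -> None:
    """Emit the kind of DeprecationWarning a shim module raises at import time."""
    warnings.warn(
        f"Module `{old}` is deprecated, use `{new}` instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )


def capture_warning() -> warnings.WarningMessage:
    """Trigger the warning and return the recorded entry."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is often filtered out by default, so force it on.
        warnings.simplefilter("always")
        warn_module_moved("old_pkg.stt_sagemaker", "old_pkg.sagemaker.stt")
    return caught[0]
```

In a test suite, the same effect is commonly achieved by running Python with `-W error::DeprecationWarning` so that imports of the old path fail loudly.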
|
||||
|
||||
@@ -4,357 +4,15 @@
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Deepgram text-to-speech service for AWS SageMaker.
|
||||
"""Deprecated: use ``pipecat.services.deepgram.sagemaker.tts`` instead."""
|
||||
|
||||
This module provides a Pipecat TTS service that connects to Deepgram models
|
||||
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
|
||||
low-latency real-time speech synthesis with support for interruptions and
|
||||
streaming audio output.
|
||||
"""
|
||||
import warnings
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Any, AsyncGenerator, Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
BotStoppedSpeakingFrame,
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
InterruptionFrame,
|
||||
LLMFullResponseEndFrame,
|
||||
StartFrame,
|
||||
TTSAudioRawFrame,
|
||||
TTSStartedFrame,
|
||||
warnings.warn(
|
||||
"Module `pipecat.services.deepgram.tts_sagemaker` is deprecated, "
|
||||
"use `pipecat.services.deepgram.sagemaker.tts` instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
|
||||
from pipecat.services.settings import NOT_GIVEN, TTSSettings, _NotGiven
|
||||
from pipecat.services.tts_service import TTSService
|
||||
from pipecat.utils.tracing.service_decorators import traced_tts
|
||||
|
||||
|
||||
@dataclass
|
||||
class DeepgramSageMakerTTSSettings(TTSSettings):
|
||||
"""Settings for Deepgram SageMaker TTS service.
|
||||
|
||||
Parameters:
|
||||
encoding: Audio encoding format (e.g. "linear16").
|
||||
"""
|
||||
|
||||
encoding: str | _NotGiven = field(default_factory=lambda: NOT_GIVEN)
|
||||
|
||||
|
||||
class DeepgramSageMakerTTSService(TTSService):
|
||||
"""Deepgram text-to-speech service for AWS SageMaker.
|
||||
|
||||
Provides real-time speech synthesis using Deepgram models deployed on
|
||||
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
|
||||
audio generation with support for interruptions via the Clear message.
|
||||
|
||||
Requirements:
|
||||
|
||||
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
|
||||
- A deployed SageMaker endpoint with Deepgram TTS model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
|
||||
- ``pipecat-ai[sagemaker]`` installed
|
||||
|
||||
Example::
|
||||
|
||||
tts = DeepgramSageMakerTTSService(
|
||||
endpoint_name="my-deepgram-tts-endpoint",
|
||||
region="us-east-2",
|
||||
voice="aura-2-helena-en",
|
||||
)
|
||||
"""
|
||||
|
||||
_settings: DeepgramSageMakerTTSSettings
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
endpoint_name: str,
|
||||
region: str,
|
||||
voice: str = "aura-2-helena-en",
|
||||
sample_rate: Optional[int] = None,
|
||||
encoding: str = "linear16",
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
endpoint_name: Name of the SageMaker endpoint with Deepgram TTS model
|
||||
deployed (e.g., "my-deepgram-tts-endpoint").
|
||||
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
|
||||
voice: Voice model to use for synthesis. Defaults to "aura-2-helena-en".
|
||||
sample_rate: Audio sample rate in Hz. If None, uses the value from StartFrame.
|
||||
encoding: Audio encoding format. Defaults to "linear16".
|
||||
**kwargs: Additional arguments passed to the parent TTSService.
|
||||
"""
|
||||
super().__init__(
|
||||
sample_rate=sample_rate,
|
||||
push_stop_frames=True,
|
||||
pause_frame_processing=True,
|
||||
append_trailing_space=True,
|
||||
settings=DeepgramSageMakerTTSSettings(
|
||||
model=voice,
|
||||
voice=voice,
|
||||
language=None,
|
||||
encoding=encoding,
|
||||
),
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
self._endpoint_name = endpoint_name
|
||||
self._region = region
|
||||
|
||||
self._client: Optional[SageMakerBidiClient] = None
|
||||
self._response_task: Optional[asyncio.Task] = None
|
||||
self._context_id: Optional[str] = None
|
||||
self._ttfb_started: bool = False
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True, as Deepgram SageMaker TTS service supports metrics generation.
|
||||
"""
|
||||
return True
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
"""Start the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
frame: The start frame containing initialization parameters.
|
||||
"""
|
||||
await super().start(frame)
|
||||
await self._connect()
|
||||
|
||||
async def stop(self, frame: EndFrame):
|
||||
"""Stop the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
frame: The end frame.
|
||||
"""
|
||||
await super().stop(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def cancel(self, frame: CancelFrame):
|
||||
"""Cancel the Deepgram SageMaker TTS service.
|
||||
|
||||
Args:
|
||||
frame: The cancel frame.
|
||||
"""
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process frames with special handling for LLM response end.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame processing.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, (LLMFullResponseEndFrame, EndFrame)):
|
||||
await self.flush_audio()
|
||||
elif isinstance(frame, BotStoppedSpeakingFrame):
|
||||
self._ttfb_started = False
|
||||
|
||||
async def _connect(self):
|
||||
"""Connect to the SageMaker endpoint and start the BiDi session.
|
||||
|
||||
Builds the Deepgram TTS query string, creates the BiDi client,
|
||||
starts the streaming session, and launches a background task for processing
|
||||
responses.
|
||||
"""
|
||||
logger.debug("Connecting to Deepgram TTS on SageMaker...")
|
||||
|
||||
query_string = (
|
||||
f"model={self._settings.voice}&encoding={self._settings.encoding}"
|
||||
f"&sample_rate={self.sample_rate}"
|
||||
)
|
||||
|
||||
self._client = SageMakerBidiClient(
|
||||
endpoint_name=self._endpoint_name,
|
||||
region=self._region,
|
||||
model_invocation_path="v1/speak",
|
||||
model_query_string=query_string,
|
||||
)
|
||||
|
||||
try:
|
||||
await self._client.start_session()
|
||||
|
||||
self._response_task = self.create_task(self._process_responses())
|
||||
|
||||
logger.debug("Connected to Deepgram TTS on SageMaker")
|
||||
await self._call_event_handler("on_connected")
|
||||
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
await self._call_event_handler("on_connection_error", str(e))
|
||||
|
||||
async def _disconnect(self):
|
||||
"""Disconnect from the SageMaker endpoint.
|
||||
|
||||
Sends a Close message to Deepgram, cancels the response processing task,
|
||||
and closes the BiDi session. Safe to call multiple times.
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
logger.debug("Disconnecting from Deepgram TTS on SageMaker...")
|
||||
|
||||
try:
|
||||
await self._client.send_json({"type": "Close"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send Close message: {e}")
|
||||
|
||||
if self._response_task and not self._response_task.done():
|
||||
await self.cancel_task(self._response_task)
|
||||
|
||||
await self._client.close_session()
|
||||
|
||||
logger.debug("Disconnected from Deepgram TTS on SageMaker")
|
||||
await self._call_event_handler("on_disconnected")
|
||||
|
||||
async def _update_settings(self, delta: TTSSettings) -> dict[str, Any]:
|
||||
"""Apply a settings delta (reconnecting to apply it is not yet implemented).
|
||||
|
||||
Since all settings are part of the SageMaker session query string,
|
||||
any setting change requires reconnecting to apply the new values.
|
||||
"""
|
||||
changed = await super()._update_settings(delta)
|
||||
|
||||
if not changed:
|
||||
return changed
|
||||
|
||||
# Deepgram uses voice as the model, so keep them in sync for metrics
|
||||
if "voice" in changed:
|
||||
self._settings.model = self._settings.voice
|
||||
self._sync_model_name_to_metrics()
|
||||
|
||||
# TODO: someday we could reconnect here to apply updated settings.
|
||||
# Code might look something like the below:
|
||||
# await self._disconnect()
|
||||
# await self._connect()
|
||||
|
||||
self._warn_unhandled_updated_settings(changed)
|
||||
|
||||
return changed
|
||||
|
||||
async def _process_responses(self):
|
||||
"""Process streaming responses from Deepgram TTS on SageMaker.
|
||||
|
||||
Continuously receives responses from the BiDi stream. Attempts to decode
|
||||
each payload as UTF-8 JSON for control messages (Flushed, Cleared, Metadata,
|
||||
Warning). If decoding fails, treats the payload as raw audio bytes and pushes
|
||||
a TTSAudioRawFrame downstream.
|
||||
"""
|
||||
try:
|
||||
while self._client and self._client.is_active:
|
||||
result = await self._client.receive_response()
|
||||
|
||||
if result is None:
|
||||
break
|
||||
|
||||
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
|
||||
if result.value.bytes_:
|
||||
payload = result.value.bytes_
|
||||
|
||||
# Try to decode as JSON control message first
|
||||
try:
|
||||
response_data = payload.decode("utf-8")
|
||||
parsed = json.loads(response_data)
|
||||
msg_type = parsed.get("type")
|
||||
|
||||
if msg_type == "Metadata":
|
||||
logger.trace(f"Received metadata: {parsed}")
|
||||
elif msg_type == "Flushed":
|
||||
logger.trace(f"Received Flushed: {parsed}")
|
||||
elif msg_type == "Cleared":
|
||||
logger.trace(f"Received Cleared: {parsed}")
|
||||
elif msg_type == "Warning":
|
||||
logger.warning(
|
||||
f"{self} warning: "
|
||||
f"{parsed.get('description', 'Unknown warning')}"
|
||||
)
|
||||
else:
|
||||
logger.debug(f"Received unknown message type: {parsed}")
|
||||
|
||||
except (UnicodeDecodeError, json.JSONDecodeError):
|
||||
# Not JSON — treat as raw audio bytes
|
||||
await self.stop_ttfb_metrics()
|
||||
frame = TTSAudioRawFrame(
|
||||
payload,
|
||||
self.sample_rate,
|
||||
1,
|
||||
context_id=self._context_id,
|
||||
)
|
||||
await self.push_frame(frame)
|
||||
|
||||
except asyncio.CancelledError:
|
||||
logger.debug("TTS response processor cancelled")
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
finally:
|
||||
logger.debug("TTS response processor stopped")
|
||||
|
||||
async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
|
||||
"""Handle interruption by sending Clear message to Deepgram.
|
||||
|
||||
The Clear message will clear Deepgram's internal text buffer and stop
|
||||
sending audio, allowing for a new response to be generated.
|
||||
"""
|
||||
await super()._handle_interruption(frame, direction)
|
||||
self._ttfb_started = False
|
||||
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "Clear"})
|
||||
except Exception as e:
|
||||
logger.error(f"{self} error sending Clear message: {e}")
|
||||
|
||||
async def flush_audio(self):
|
||||
"""Flush any pending audio synthesis by sending Flush command.
|
||||
|
||||
This should be called when the LLM finishes a complete response to force
|
||||
generation of audio from Deepgram's internal text buffer.
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "Flush"})
|
||||
except Exception as e:
|
||||
logger.error(f"{self} error sending Flush message: {e}")
|
||||
|
||||
@traced_tts
|
||||
async def run_tts(self, text: str, context_id: str) -> AsyncGenerator[Frame, None]:
|
||||
"""Generate speech from text using Deepgram TTS on SageMaker.
|
||||
|
||||
Args:
|
||||
text: The text to synthesize into speech.
|
||||
context_id: The context ID for tracking audio frames.
|
||||
|
||||
Yields:
|
||||
Frame: TTSStartedFrame, then None (audio comes asynchronously via
|
||||
the response processor).
|
||||
"""
|
||||
logger.debug(f"{self}: Generating TTS [{text}]")
|
||||
|
||||
try:
|
||||
if not self._ttfb_started:
|
||||
await self.start_ttfb_metrics()
|
||||
self._ttfb_started = True
|
||||
await self.start_tts_usage_metrics(text)
|
||||
|
||||
yield TTSStartedFrame(context_id=context_id)
|
||||
self._context_id = context_id
|
||||
|
||||
await self._client.send_json({"type": "Speak", "text": text})
|
||||
|
||||
yield None
|
||||
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
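The `_process_responses` docstring above describes a decode-first dispatch: try each payload as UTF-8 JSON, and fall back to raw audio if that fails. A minimal standalone sketch of that idea follows. The helper name `classify_payload` and its return shape are hypothetical and are not part of pipecat or the Deepgram API. The sketch also assumes that control messages are always JSON objects carrying a `type` field.

```python
import json
from typing import Any, Optional


def classify_payload(payload: bytes) -> tuple[str, Optional[dict[str, Any]]]:
    """Classify a stream payload as a JSON control message or raw audio.

    Hypothetical helper for illustration only. Returns ("control", message)
    for a JSON object with a "type" field, otherwise ("audio", None).
    """
    try:
        # Control messages are expected to be UTF-8 encoded JSON text.
        parsed = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Not decodable as JSON text: treat the bytes as audio.
        return "audio", None

    # Assumption: a control message is a JSON object with a "type" key.
    # Anything else that happens to parse (e.g. b"12") is treated as audio.
    if isinstance(parsed, dict) and "type" in parsed:
        return "control", parsed
    return "audio", None


if __name__ == "__main__":
    print(classify_payload(b'{"type": "Flushed", "sequence_id": 0}'))
    print(classify_payload(b"\xff\xfe\x00\x01"))
```

One design difference from the code in the diff: the sketch checks that the parsed value is a dict before reading `type`, so a payload that parses as a bare JSON number or string is routed to the audio path instead of raising on `.get`.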
|
||||
from pipecat.services.deepgram.sagemaker.tts import * # noqa: E402, F401, F403
|
||||
|
||||
@@ -613,7 +613,7 @@ class GladiaSTTService(WebsocketSTTService):
|
||||
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
if self._should_interrupt:
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
async def _on_speech_ended(self):
|
||||
"""Handle speech end event from Gladia.
|
||||
|
||||
@@ -1265,7 +1265,7 @@ class GeminiLiveLLMService(LLMService):
|
||||
# combination with the context aggregator default
|
||||
# turn strategies.
|
||||
logger.debug("Gemini VAD: interrupted signal received")
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
elif message.server_content and message.server_content.model_turn:
|
||||
await self._handle_msg_model_turn(message)
|
||||
elif (
|
||||
|
||||
@@ -734,7 +734,7 @@ class GrokRealtimeLLMService(LLMService):
|
||||
"""Handle speech started event from VAD."""
|
||||
await self._truncate_current_audio_response()
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
async def _handle_evt_speech_stopped(self, evt):
|
||||
"""Handle speech stopped event from VAD."""
|
||||
|
||||
@@ -62,10 +62,12 @@ class HeyGenCallbacks(BaseModel):
|
||||
"""Callback handlers for HeyGen events.
|
||||
|
||||
Parameters:
|
||||
on_participant_connected: Called when a participant connects
|
||||
on_participant_disconnected: Called when a participant disconnects
|
||||
on_connected: Called when the bot connects to the LiveKit room.
|
||||
on_participant_connected: Called when a participant connects.
|
||||
on_participant_disconnected: Called when a participant disconnects.
|
||||
"""
|
||||
|
||||
on_connected: Callable[[], Awaitable[None]]
|
||||
on_participant_connected: Callable[[str], Awaitable[None]]
|
||||
on_participant_disconnected: Callable[[str], Awaitable[None]]
|
||||
|
||||
@@ -251,6 +253,7 @@ class HeyGenClient:
|
||||
logger.debug(f"HeyGenClient send_interval: {self._send_interval}")
|
||||
await self._ws_connect()
|
||||
await self._livekit_connect()
|
||||
self._call_event_callback(self._callbacks.on_connected)
|
||||
|
||||
async def stop(self) -> None:
|
||||
"""Stop the client and terminate all connections.
|
||||
|
||||
@@ -128,6 +128,7 @@ class HeyGenVideoService(AIService):
|
||||
session_request=self._session_request,
|
||||
service_type=self._service_type,
|
||||
callbacks=HeyGenCallbacks(
|
||||
on_connected=self._on_connected,
|
||||
on_participant_connected=self._on_participant_connected,
|
||||
on_participant_disconnected=self._on_participant_disconnected,
|
||||
),
|
||||
@@ -144,6 +145,10 @@ class HeyGenVideoService(AIService):
|
||||
await self._client.cleanup()
|
||||
self._client = None
|
||||
|
||||
async def _on_connected(self):
|
||||
"""Handle bot connected to LiveKit room."""
|
||||
logger.info("HeyGen bot connected to LiveKit room")
|
||||
|
||||
async def _on_participant_connected(self, participant_id: str):
|
||||
"""Handle participant connected events."""
|
||||
logger.info(f"Participant connected {participant_id}")
|
||||
|
||||
@@ -839,7 +839,7 @@ class OpenAIRealtimeLLMService(LLMService):
|
||||
async def _handle_evt_speech_started(self, evt):
|
||||
await self._truncate_current_audio_response()
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
|
||||
async def _handle_evt_speech_stopped(self, evt):
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
@@ -639,7 +639,7 @@ class OpenAIRealtimeSTTService(WebsocketSTTService):
|
||||
logger.debug("Server VAD: speech started")
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
if self._should_interrupt:
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
await self.broadcast_interruption()
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def _handle_speech_stopped(self, evt: dict):
|
||||
|
||||