Compare commits
2 Commits
fix/event-
...
hush/nonIn
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
dbeb6dd4ab | ||
|
|
c2fdbc9e65 |
5
.gitignore
vendored
5
.gitignore
vendored
@@ -51,7 +51,4 @@ docs/api/_build/
|
||||
docs/api/api
|
||||
|
||||
# uv
|
||||
.python-version
|
||||
|
||||
# Pipecat
|
||||
whisker_setup.py
|
||||
.python-version
|
||||
253
CHANGELOG.md
253
CHANGELOG.md
@@ -24,40 +24,39 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
A list of strategies can be specified for both strategies; strategies are
|
||||
evaluated in order until one evaluates to true.
|
||||
|
||||
Available user turn start strategies:
|
||||
Available user turn start strategies:
|
||||
- VADUserTurnStartStrategy
|
||||
- TranscriptionUserTurnStartStrategy
|
||||
- MinWordsUserTurnStartStrategy
|
||||
- ExternalUserTurnStartStrategy
|
||||
|
||||
- VADUserTurnStartStrategy
|
||||
- TranscriptionUserTurnStartStrategy
|
||||
- MinWordsUserTurnStartStrategy
|
||||
- ExternalUserTurnStartStrategy
|
||||
Available user turn stop strategies:
|
||||
- TranscriptionUserTurnStopStrategy
|
||||
- TurnAnalyzerUserTurnStopStrategy
|
||||
- ExternalUserTurnStopStrategy
|
||||
|
||||
Available user turn stop strategies:
|
||||
The default strategies are:
|
||||
|
||||
- TranscriptionUserTurnStopStrategy
|
||||
- TurnAnalyzerUserTurnStopStrategy
|
||||
- ExternalUserTurnStopStrategy
|
||||
- start: [VADUserTurnStartStrategy, TranscriptionUserTurnStartStrategy]
|
||||
- stop: [TranscriptionUserTurnStopStrategy]
|
||||
|
||||
The default strategies are:
|
||||
|
||||
- start: [VADUserTurnStartStrategy, TranscriptionUserTurnStartStrategy]
|
||||
- stop: [TranscriptionUserTurnStopStrategy]
|
||||
|
||||
Turn strategies are configured when setting up `LLMContextAggregatorPair`.
|
||||
urn strategies are configured when setting up `LLMContextAggregatorPair`.
|
||||
For example:
|
||||
|
||||
```python
|
||||
context_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
stop=[
|
||||
TurnAnalyzerUserTurnStopStrategy(turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams())
|
||||
)
|
||||
],
|
||||
)
|
||||
),
|
||||
)
|
||||
```
|
||||
```python
|
||||
context_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
stop=[
|
||||
TurnAnalyzerUserTurnStopStrategy(
|
||||
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams())
|
||||
)
|
||||
],
|
||||
)
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
In order to use the user turn strategies you must update to the new
|
||||
universal `LLMContext` and `LLMContextAggregatorPair`.
|
||||
@@ -70,13 +69,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
- Added `GrokRealtimeLLMService` for xAI's Grok Voice Agent API with real-time
|
||||
voice conversations:
|
||||
|
||||
- Support for real-time audio streaming with WebSocket connection
|
||||
- Built-in server-side VAD (Voice Activity Detection)
|
||||
- Multiple voice options: Ara, Rex, Sal, Eve, Leo
|
||||
- Built-in tools support: web_search, x_search, file_search
|
||||
- Custom function calling with standard Pipecat tools schema
|
||||
- Configurable audio formats (PCM at 8kHz-48kHz)
|
||||
(PR [#3267](https://github.com/pipecat-ai/pipecat/pull/3267))
|
||||
- Support for real-time audio streaming with WebSocket connection
|
||||
- Built-in server-side VAD (Voice Activity Detection)
|
||||
- Multiple voice options: Ara, Rex, Sal, Eve, Leo
|
||||
- Built-in tools support: web_search, x_search, file_search
|
||||
- Custom function calling with standard Pipecat tools schema
|
||||
- Configurable audio formats (PCM at 8kHz-48kHz)
|
||||
(PR [#3267](https://github.com/pipecat-ai/pipecat/pull/3267))
|
||||
|
||||
- Added an approximation of TTFB for Ultravox.
|
||||
(PR [#3268](https://github.com/pipecat-ai/pipecat/pull/3268))
|
||||
@@ -87,12 +86,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
(PR [#3289](https://github.com/pipecat-ai/pipecat/pull/3289))
|
||||
|
||||
- `LLMUserAggregator` now exposes the following events:
|
||||
|
||||
- `on_user_turn_started`: triggered when a user turn starts
|
||||
- `on_user_turn_stopped`: triggered when a user turn ends
|
||||
- `on_user_turn_stop_timeout`: triggered when a user turn does not stop
|
||||
and times out
|
||||
(PR [#3291](https://github.com/pipecat-ai/pipecat/pull/3291))
|
||||
- `on_user_turn_started`: triggered when a user turn starts
|
||||
- `on_user_turn_stopped`: triggered when a user turn ends
|
||||
- `on_user_turn_stop_timeout`: triggered when a user turn does not stop
|
||||
and times out
|
||||
(PR [#3291](https://github.com/pipecat-ai/pipecat/pull/3291))
|
||||
|
||||
- Introducing user mute strategies. User mute strategies indicate when user
|
||||
input should be muted based on the current system state.
|
||||
@@ -106,12 +104,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
frame is muted if any of the configured strategies indicates it should be
|
||||
muted.
|
||||
|
||||
Available user mute strategies:
|
||||
Available user mute strategies:
|
||||
|
||||
- `FirstSpeechUserMuteStrategy`
|
||||
- `MuteUntilFirstBotCompleteUserMuteStrategy`
|
||||
- `AlwaysUserMuteStrategy`
|
||||
- `FunctionCallUserMuteStrategy`
|
||||
* `FirstSpeechUserMuteStrategy`
|
||||
* `MuteUntilFirstBotCompleteUserMuteStrategy`
|
||||
* `AlwaysUserMuteStrategy`
|
||||
* `FunctionCallUserMuteStrategy`
|
||||
|
||||
User mute strategies replace the legacy `STTMuteFilter` and provide a more
|
||||
flexible and composable approach to muting user input.
|
||||
@@ -119,16 +117,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
User mute strategies are configured when setting up the
|
||||
`LLMContextAggregatorPair`. For example:
|
||||
|
||||
```python
|
||||
context_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_mute_strategies=[
|
||||
FirstSpeechUserMuteStrategy(),
|
||||
]
|
||||
),
|
||||
)
|
||||
```
|
||||
```python
|
||||
context_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_mute_strategies=[
|
||||
FirstSpeechUserMuteStrategy(),
|
||||
]
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
In order to use user mute strategies you should update to the new universal
|
||||
`LLMContext` and `LLMContextAggregatorPair`.
|
||||
@@ -161,17 +159,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
(PR [#3357](https://github.com/pipecat-ai/pipecat/pull/3357))
|
||||
|
||||
- Added image support to `OpenAIRealtimeLLMService` via `InputImageRawFrame`:
|
||||
|
||||
- New `start_video_paused` parameter to control initial video input state
|
||||
- New `video_frame_detail` parameter to set image processing quality
|
||||
("auto",
|
||||
"low", or "high"). This corresponds to OpenAI Realtime's `image_detail`
|
||||
parameter.
|
||||
- `set_video_input_paused()` method to pause/resume video input at runtime
|
||||
- `set_video_frame_detail()` method to adjust video frame quality
|
||||
dynamically
|
||||
- Automatic rate limiting (1 frame per second) to prevent API overload
|
||||
(PR [#3360](https://github.com/pipecat-ai/pipecat/pull/3360))
|
||||
- New `start_video_paused` parameter to control initial video input state
|
||||
- New `video_frame_detail` parameter to set image processing quality
|
||||
("auto",
|
||||
"low", or "high"). This corresponds to OpenAI Realtime's `image_detail`
|
||||
parameter.
|
||||
- `set_video_input_paused()` method to pause/resume video input at runtime
|
||||
- `set_video_frame_detail()` method to adjust video frame quality
|
||||
dynamically
|
||||
- Automatic rate limiting (1 frame per second) to prevent API overload
|
||||
(PR [#3360](https://github.com/pipecat-ai/pipecat/pull/3360))
|
||||
|
||||
- Added `UserTurnProcessor`, a frame processor built on `UserTurnController`
|
||||
that pushes `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` frames
|
||||
@@ -191,12 +188,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
(PR [#3374](https://github.com/pipecat-ai/pipecat/pull/3374))
|
||||
|
||||
- `LLMAssistantAggregator` now exposes the following events:
|
||||
|
||||
- `on_assistant_turn_started`: triggered when the assistant turn starts
|
||||
- `on_assistant_turn_stopped`: triggered when the assistant turn ends
|
||||
- `on_assistant_thought`: triggered when there's an assistant thought
|
||||
available
|
||||
(PR [#3385](https://github.com/pipecat-ai/pipecat/pull/3385))
|
||||
- `on_assistant_turn_started`: triggered when the assistant turn starts
|
||||
- `on_assistant_turn_stopped`: triggered when the assistant turn ends
|
||||
- `on_assistant_thought`: triggered when there's an assistant thought
|
||||
available
|
||||
(PR [#3385](https://github.com/pipecat-ai/pipecat/pull/3385))
|
||||
|
||||
- Added `KrispVivaTurn` analyzer for end of turn detection using the Krisp VIVA
|
||||
SDK (requires `krisp_audio`).
|
||||
@@ -206,14 +202,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
register custom pipeline task setup files by setting the
|
||||
`PIPECAT_SETUP_FILES` environment variable. This variable should contain a
|
||||
colon-separated list of Python files (e.g. `export
|
||||
PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
function with the following signature:
|
||||
|
||||
```python
|
||||
async def setup_pipeline_task(task: PipelineTask):
|
||||
...
|
||||
```
|
||||
|
||||
```python
|
||||
async def setup_pipeline_task(task: PipelineTask):
|
||||
...
|
||||
```
|
||||
(PR [#3397](https://github.com/pipecat-ai/pipecat/pull/3397))
|
||||
|
||||
- Added a keepalive task for `InworldTTSService` to keep the service connected
|
||||
@@ -243,14 +238,12 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
|
||||
- Updated `ElevenLabsRealtimeSTTService` to accept the
|
||||
`include_language_detection` parameter to detect language.
|
||||
|
||||
```python
|
||||
stt = ElevenLabsRealtimeSTTService(
|
||||
api_key=os.getenv("ELEVENLABS_API_KEY"),
|
||||
include_language_detection=True
|
||||
)
|
||||
```
|
||||
|
||||
```python
|
||||
stt = ElevenLabsRealtimeSTTService(
|
||||
api_key=os.getenv("ELEVENLABS_API_KEY"),
|
||||
include_language_detection=True
|
||||
)
|
||||
```
|
||||
(PR [#3216](https://github.com/pipecat-ai/pipecat/pull/3216))
|
||||
|
||||
- Updated `SpeechmaticsSTTService` to use new Python Voice SDK with improved
|
||||
@@ -258,18 +251,16 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
without any impact on accuracy. Use the `turn_detection_mode` parameter to control
|
||||
the endpointing of speech, with `TurnDetectionMode.EXTERNAL` (default),
|
||||
`TurnDetectionMode.ADAPTIVE`, or `TurnDetectionMode.SMART_TURN`.
|
||||
|
||||
```python
|
||||
```python
|
||||
stt = SpeechmaticsSTTService(
|
||||
api_key=os.getenv("SPEECHMATICS_API_KEY"),
|
||||
params=SpeechmaticsSTTService.InputParams(
|
||||
language=Language.EN,
|
||||
turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
|
||||
turn_detection_mode=SpeechmaticsSTTService.TurnDetectionMode.ADAPTIVE,
|
||||
speaker_active_format="<{speaker_id}>{text}</{speaker_id}>",
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
```
|
||||
(PR [#3225](https://github.com/pipecat-ai/pipecat/pull/3225))
|
||||
|
||||
- `daily-python` updated to 0.23.0.
|
||||
@@ -282,15 +273,10 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
|
||||
- Updates to Inworld TTS services:
|
||||
|
||||
- Improved `InworldTTSService`'s websocket implementation to better flush
|
||||
and close context to better handle long inputs.
|
||||
- Improved docstrings for `InworldTTSService` and `InworldHttpTTSService`.
|
||||
(PR [#3288](https://github.com/pipecat-ai/pipecat/pull/3288))
|
||||
|
||||
- Improved the error handling and reconnection logic for `WebsocketServer` by
|
||||
distinguishing between errors when disconnecting and websocket communication
|
||||
errors.
|
||||
(PR [#3392](https://github.com/pipecat-ai/pipecat/pull/3392))
|
||||
- Improved `InworldTTSService`'s websocket implementation to better flush
|
||||
and close context to better handle long inputs.
|
||||
- Improved docstrings for `InworldTTSService` and `InworldHttpTTSService`.
|
||||
(PR [#3288](https://github.com/pipecat-ai/pipecat/pull/3288))
|
||||
|
||||
- Updated `DeepgramSTTService` to push user started/stopped speaking and
|
||||
interruption frames when `vad_enabled` is set to true. This centralizes the
|
||||
@@ -322,8 +308,7 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
- Smart Turn now takes into account `vad_start_seconds` when buffering audio,
|
||||
meaning that the start of the turn audio is not cut off. This improves
|
||||
accuracy for short utterances.
|
||||
|
||||
- The default value of `pre_speech_ms` is now set to 500ms for Smart Turn.
|
||||
- The default value of `pre_speech_ms` is now set to 500ms for Smart Turn.
|
||||
(PR [#3377](https://github.com/pipecat-ai/pipecat/pull/3377))
|
||||
|
||||
- Improved Krisp SDK management to allow `KrispVivaTurn` and `KrispVivaFilter`
|
||||
@@ -391,18 +376,17 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
From the developer's point of view, switching to using `LLMContext`
|
||||
machinery will usually be a matter of going from this:
|
||||
|
||||
```python
|
||||
context = OpenAILLMContext(messages, tools)
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
```
|
||||
```python
|
||||
context = OpenAILLMContext(messages, tools)
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
```
|
||||
|
||||
To this:
|
||||
|
||||
```
|
||||
context = LLMContext(messages, tools)
|
||||
context_aggregator = LLMContextAggregatorPair(context)
|
||||
```
|
||||
To this:
|
||||
|
||||
```
|
||||
context = LLMContext(messages, tools)
|
||||
context_aggregator = LLMContextAggregatorPair(context)
|
||||
```
|
||||
(PR [#3263](https://github.com/pipecat-ai/pipecat/pull/3263))
|
||||
|
||||
- `STTMuteFilter` is deprecated and will be removed in a future version. Use
|
||||
@@ -417,17 +401,16 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
`LLMUserAggregator`'s new parameter `user_turn_strategies` instead. For
|
||||
example, to disable interruptions but still get user turns you can do:
|
||||
|
||||
```python
|
||||
context_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
start=[TranscriptionUserTurnStartStrategy(enable_interruptions=False)],
|
||||
),
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
```python
|
||||
context_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
start=[TranscriptionUserTurnStartStrategy(enable_interruptions=False)],
|
||||
),
|
||||
),
|
||||
)
|
||||
```
|
||||
(PR [#3297](https://github.com/pipecat-ai/pipecat/pull/3297))
|
||||
|
||||
- `TranscriptProcessor` and related data classes and frames
|
||||
@@ -450,9 +433,7 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
### Fixed
|
||||
|
||||
- Improved error handling in `ElevenLabsRealtimeSTTService`
|
||||
(PR [#3233](https://github.com/pipecat-ai/pipecat/pull/3233))
|
||||
|
||||
- Fixed an issue in `ElevenLabsRealtimeSTTService` causing an infinite loop
|
||||
- Fixed an issue in `ElevenLabsRealtimeSTTService` causing an infinite loop
|
||||
that blocks the process if the websocket disconnects due to an error
|
||||
(PR [#3233](https://github.com/pipecat-ai/pipecat/pull/3233))
|
||||
|
||||
@@ -465,14 +446,13 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
(PR [#3322](https://github.com/pipecat-ai/pipecat/pull/3322))
|
||||
|
||||
- Updated `SpeechmaticsSTTService` for version `0.0.99+`:
|
||||
|
||||
- Fixed `SpeechmaticsSTTService` to listen for `VADUserStoppedSpeakingFrame`
|
||||
in order to finalize transcription.
|
||||
- Default to `TurnDetectionMode.FIXED` for Pipecat-controlled end of turn
|
||||
detection.
|
||||
- Only emit VAD + interruption frames if VAD is enabled within the plugin
|
||||
(modes other than `TurnDetectionMode.FIXED` or `TurnDetectionMode.EXTERNAL`).
|
||||
(PR [#3328](https://github.com/pipecat-ai/pipecat/pull/3328))
|
||||
- Fixed `SpeechmaticsSTTService` to listen for `VADUserStoppedSpeakingFrame`
|
||||
in order to finalize transcription.
|
||||
- Default to `TurnDetectionMode.FIXED` for Pipecat-controlled end of turn
|
||||
detection.
|
||||
- Only emit VAD + interruption frames if VAD is enabled within the plugin
|
||||
(modes other than `TurnDetectionMode.FIXED` or `TurnDetectionMode.EXTERNAL`).
|
||||
(PR [#3328](https://github.com/pipecat-ai/pipecat/pull/3328))
|
||||
|
||||
- Fixed an issue with function calling where a handler failing to invoke its
|
||||
result callback could leave the context stuck in IN_PROGRESS, causing LLM
|
||||
@@ -501,9 +481,6 @@ PIPECAT_SETUP_FILES="setup1.py:setup.py:..."`). Each file must define a
|
||||
guard was set.
|
||||
(PR [#3400](https://github.com/pipecat-ai/pipecat/pull/3400))
|
||||
|
||||
- Fixed parallel function calling when using Gemini thinking.
|
||||
(PR [3420](https://github.com/pipecat-ai/pipecat/pull/3420))
|
||||
|
||||
- Fixed an issue in `traced_llm` where `model_name` in OpenTelemetry appears as
|
||||
`unknown`.
|
||||
(PR [#3422](https://github.com/pipecat-ai/pipecat/pull/3422))
|
||||
|
||||
@@ -73,9 +73,9 @@ Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.yout
|
||||
|
||||
| Category | Services |
|
||||
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| Speech-to-Text | [AssemblyAI](https://docs.pipecat.ai/server/services/stt/assemblyai), [AWS](https://docs.pipecat.ai/server/services/stt/aws), [Azure](https://docs.pipecat.ai/server/services/stt/azure), [Cartesia](https://docs.pipecat.ai/server/services/stt/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/stt/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/stt/elevenlabs), [Fal Wizper](https://docs.pipecat.ai/server/services/stt/fal), [Gladia](https://docs.pipecat.ai/server/services/stt/gladia), [Google](https://docs.pipecat.ai/server/services/stt/google), [Gradium](https://docs.pipecat.ai/server/services/stt/gradium), [Groq (Whisper)](https://docs.pipecat.ai/server/services/stt/groq), [Hathora](https://docs.pipecat.ai/server/services/stt/hathora), [NVIDIA Riva](https://docs.pipecat.ai/server/services/stt/riva), [OpenAI (Whisper)](https://docs.pipecat.ai/server/services/stt/openai), [SambaNova (Whisper)](https://docs.pipecat.ai/server/services/stt/sambanova), [Sarvam](https://docs.pipecat.ai/server/services/stt/sarvam), [Soniox](https://docs.pipecat.ai/server/services/stt/soniox), [Speechmatics](https://docs.pipecat.ai/server/services/stt/speechmatics), [Whisper](https://docs.pipecat.ai/server/services/stt/whisper) |
|
||||
| Speech-to-Text | [AssemblyAI](https://docs.pipecat.ai/server/services/stt/assemblyai), [AWS](https://docs.pipecat.ai/server/services/stt/aws), [Azure](https://docs.pipecat.ai/server/services/stt/azure), [Cartesia](https://docs.pipecat.ai/server/services/stt/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/stt/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/stt/elevenlabs), [Fal Wizper](https://docs.pipecat.ai/server/services/stt/fal), [Gladia](https://docs.pipecat.ai/server/services/stt/gladia), [Google](https://docs.pipecat.ai/server/services/stt/google), [Gradium](https://docs.pipecat.ai/server/services/stt/gradium), [Groq (Whisper)](https://docs.pipecat.ai/server/services/stt/groq), [NVIDIA Riva](https://docs.pipecat.ai/server/services/stt/riva), [OpenAI (Whisper)](https://docs.pipecat.ai/server/services/stt/openai), [SambaNova (Whisper)](https://docs.pipecat.ai/server/services/stt/sambanova), [Sarvam](https://docs.pipecat.ai/server/services/stt/sarvam), [Soniox](https://docs.pipecat.ai/server/services/stt/soniox), [Speechmatics](https://docs.pipecat.ai/server/services/stt/speechmatics), [Whisper](https://docs.pipecat.ai/server/services/stt/whisper) |
|
||||
| LLMs | [Anthropic](https://docs.pipecat.ai/server/services/llm/anthropic), [AWS](https://docs.pipecat.ai/server/services/llm/aws), [Azure](https://docs.pipecat.ai/server/services/llm/azure), [Cerebras](https://docs.pipecat.ai/server/services/llm/cerebras), [DeepSeek](https://docs.pipecat.ai/server/services/llm/deepseek), [Fireworks AI](https://docs.pipecat.ai/server/services/llm/fireworks), [Gemini](https://docs.pipecat.ai/server/services/llm/gemini), [Grok](https://docs.pipecat.ai/server/services/llm/grok), [Groq](https://docs.pipecat.ai/server/services/llm/groq), [Mistral](https://docs.pipecat.ai/server/services/llm/mistral), [NVIDIA NIM](https://docs.pipecat.ai/server/services/llm/nim), [Ollama](https://docs.pipecat.ai/server/services/llm/ollama), [OpenAI](https://docs.pipecat.ai/server/services/llm/openai), [OpenRouter](https://docs.pipecat.ai/server/services/llm/openrouter), [Perplexity](https://docs.pipecat.ai/server/services/llm/perplexity), [Qwen](https://docs.pipecat.ai/server/services/llm/qwen), [SambaNova](https://docs.pipecat.ai/server/services/llm/sambanova) [Together AI](https://docs.pipecat.ai/server/services/llm/together) |
|
||||
| Text-to-Speech | [Async](https://docs.pipecat.ai/server/services/tts/asyncai), [AWS](https://docs.pipecat.ai/server/services/tts/aws), [Azure](https://docs.pipecat.ai/server/services/tts/azure), [Camb AI](https://docs.pipecat.ai/server/services/tts/camb), [Cartesia](https://docs.pipecat.ai/server/services/tts/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/tts/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/tts/elevenlabs), [Fish](https://docs.pipecat.ai/server/services/tts/fish), [Google](https://docs.pipecat.ai/server/services/tts/google), [Gradium](https://docs.pipecat.ai/server/services/tts/gradium), [Groq](https://docs.pipecat.ai/server/services/tts/groq), [Hathora](https://docs.pipecat.ai/server/services/tts/hathora), [Hume](https://docs.pipecat.ai/server/services/tts/hume), [Inworld](https://docs.pipecat.ai/server/services/tts/inworld), [LMNT](https://docs.pipecat.ai/server/services/tts/lmnt), [MiniMax](https://docs.pipecat.ai/server/services/tts/minimax), [Neuphonic](https://docs.pipecat.ai/server/services/tts/neuphonic), [NVIDIA Riva](https://docs.pipecat.ai/server/services/tts/riva), [OpenAI](https://docs.pipecat.ai/server/services/tts/openai), [Piper](https://docs.pipecat.ai/server/services/tts/piper), [PlayHT](https://docs.pipecat.ai/server/services/tts/playht), [Rime](https://docs.pipecat.ai/server/services/tts/rime), [Sarvam](https://docs.pipecat.ai/server/services/tts/sarvam), [Speechmatics](https://docs.pipecat.ai/server/services/tts/speechmatics), [XTTS](https://docs.pipecat.ai/server/services/tts/xtts) |
|
||||
| Text-to-Speech | [Async](https://docs.pipecat.ai/server/services/tts/asyncai), [AWS](https://docs.pipecat.ai/server/services/tts/aws), [Azure](https://docs.pipecat.ai/server/services/tts/azure), [Cartesia](https://docs.pipecat.ai/server/services/tts/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/tts/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/tts/elevenlabs), [Fish](https://docs.pipecat.ai/server/services/tts/fish), [Google](https://docs.pipecat.ai/server/services/tts/google), [Gradium](https://docs.pipecat.ai/server/services/tts/gradium), [Groq](https://docs.pipecat.ai/server/services/tts/groq), [Hume](https://docs.pipecat.ai/server/services/tts/hume), [Inworld](https://docs.pipecat.ai/server/services/tts/inworld), [LMNT](https://docs.pipecat.ai/server/services/tts/lmnt), [MiniMax](https://docs.pipecat.ai/server/services/tts/minimax), [Neuphonic](https://docs.pipecat.ai/server/services/tts/neuphonic), [NVIDIA Riva](https://docs.pipecat.ai/server/services/tts/riva), [OpenAI](https://docs.pipecat.ai/server/services/tts/openai), [Piper](https://docs.pipecat.ai/server/services/tts/piper), [PlayHT](https://docs.pipecat.ai/server/services/tts/playht), [Rime](https://docs.pipecat.ai/server/services/tts/rime), [Sarvam](https://docs.pipecat.ai/server/services/tts/sarvam), [Speechmatics](https://docs.pipecat.ai/server/services/tts/speechmatics), [XTTS](https://docs.pipecat.ai/server/services/tts/xtts) |
|
||||
| Speech-to-Speech | [AWS Nova Sonic](https://docs.pipecat.ai/server/services/s2s/aws), [Gemini Multimodal Live](https://docs.pipecat.ai/server/services/s2s/gemini), [Grok Voice Agent](https://docs.pipecat.ai/server/services/s2s/grok), [OpenAI Realtime](https://docs.pipecat.ai/server/services/s2s/openai), [Ultravox](https://docs.pipecat.ai/server/services/s2s/ultravox), |
|
||||
| Transport | [Daily (WebRTC)](https://docs.pipecat.ai/server/services/transport/daily), [FastAPI Websocket](https://docs.pipecat.ai/server/services/transport/fastapi-websocket), [SmallWebRTCTransport](https://docs.pipecat.ai/server/services/transport/small-webrtc), [WebSocket Server](https://docs.pipecat.ai/server/services/transport/websocket-server), Local |
|
||||
| Serializers | [Exotel](https://docs.pipecat.ai/server/utilities/serializers/exotel), [Plivo](https://docs.pipecat.ai/server/utilities/serializers/plivo), [Twilio](https://docs.pipecat.ai/server/utilities/serializers/twilio), [Telnyx](https://docs.pipecat.ai/server/utilities/serializers/telnyx), [Vonage](https://docs.pipecat.ai/server/utilities/serializers/vonage) |
|
||||
|
||||
@@ -1 +0,0 @@
|
||||
- Added Hathora service to support Hathora-hosted TTS and STT models (only non-streaming)
|
||||
@@ -1 +0,0 @@
|
||||
- Enhanced interruption handling in `AsyncAITTSService` by supporting multi-context WebSocket sessions for more robust context management.
|
||||
@@ -1 +0,0 @@
|
||||
- Corrected TTFB metric calculation in `AsyncAIHttpTTSService`.
|
||||
@@ -1 +0,0 @@
|
||||
- Added `CambTTSService`, using Camb.ai's TTS integration with MARS models (mars-flash, mars-pro, mars-instruct) for high-quality text-to-speech synthesis.
|
||||
@@ -1,8 +0,0 @@
|
||||
- Fixed an issue where the "bot-llm-text" RTVI event would not fire for realtime (speech-to-speech) services:
|
||||
|
||||
- `AWSNovaSonicLLMService`
|
||||
- `GeminiLiveLLMService`
|
||||
- `OpenAIRealtimeLLMService`
|
||||
- `GrokRealtimeLLMService`
|
||||
|
||||
The issue was that these services weren't pushing `LLMTextFrame`s. Now they do.
|
||||
@@ -1 +0,0 @@
|
||||
- Fixed an issue where `on_user_turn_stop_timeout` could fire while a user is talking when using `ExternalUserTurnStrategies`.
|
||||
@@ -1 +0,0 @@
|
||||
- Fixed an issue where user turn start strategies were not being reset after a user turn started, causing incorrect strategy behavior.
|
||||
@@ -1 +0,0 @@
|
||||
- Added the `additional_headers` param to `WebsocketClientParams`, allowing `WebsocketClientTransport` to send custom headers on connect, for cases such as authentication.
|
||||
@@ -1 +0,0 @@
|
||||
- Fixed `MinWordsUserTurnStartStrategy` to not aggregate transcriptions, preventing incorrect turn starts when words are spoken with pauses between them.
|
||||
@@ -1 +0,0 @@
|
||||
- For consistency with other package names, we just deprecated `pipecat.turns.mute` (introduced in Pipecat 0.0.99) in favor of `pipecat.turns.user_mute`.
|
||||
@@ -1 +0,0 @@
|
||||
- Fixed an issue where Grok Realtime would error out when running with SmallWebRTC transport.
|
||||
@@ -1 +0,0 @@
|
||||
- Added `UserIdleController` for detecting user idle state, integrated into `LLMUserAggregator` and `UserTurnProcessor` via optional `user_idle_timeout` parameter. Emits `on_user_turn_idle` event for application-level handling. Deprecated `UserIdleProcessor` in favor of the new compositional approach.
|
||||
@@ -1 +0,0 @@
|
||||
- Throttle `UserSpeakingFrame` to broadcast at most every 200ms instead of on every audio chunk, reducing frame processing overhead during user speech.
|
||||
@@ -1 +0,0 @@
|
||||
- Fixed a `Mem0MemoryService` issue where passing `async_mode: true` was causing an error. See https://docs.mem0.ai/platform/features/async-mode-default-change.
|
||||
@@ -1,3 +0,0 @@
|
||||
- Fixed `AzureTTSService` transcript formatting issues:
|
||||
- Punctuation now appears without extra spaces (e.g., "Hello!" instead of "Hello !")
|
||||
- CJK languages (Chinese, Japanese, Korean) no longer have unwanted spaces between characters
|
||||
@@ -1 +0,0 @@
|
||||
- Added `on_user_mute_started` and `on_user_mute_stopped` event handlers to `LLMUserAggregator` for tracking user mute state changes.
|
||||
@@ -91,25 +91,6 @@ autodoc_mock_imports = [
|
||||
# MLX dependencies (Apple Silicon specific)
|
||||
"mlx",
|
||||
"mlx_whisper", # Note: might need underscore format too
|
||||
# Pydantic v2 compatibility issues in third-party SDKs
|
||||
"hume",
|
||||
"hume.tts",
|
||||
"hume.tts.types",
|
||||
"cartesia",
|
||||
"camb",
|
||||
"sarvamai",
|
||||
"openpipe",
|
||||
"openai.types.beta.realtime",
|
||||
"langchain_core",
|
||||
"langchain_core.messages",
|
||||
# FastAPI - Pydantic v2 compatibility issues during Sphinx autodoc
|
||||
"fastapi",
|
||||
"fastapi.applications",
|
||||
"fastapi.routing",
|
||||
"fastapi.params",
|
||||
"fastapi.middleware",
|
||||
"fastapi.responses",
|
||||
"uvicorn",
|
||||
]
|
||||
|
||||
# HTML output settings
|
||||
|
||||
@@ -31,9 +31,6 @@ AZURE_DALLE_API_KEY=...
|
||||
AZURE_DALLE_ENDPOINT=https://...
|
||||
AZURE_DALLE_MODEL=...
|
||||
|
||||
# Camb.ai
|
||||
CAMB_API_KEY=...
|
||||
|
||||
# Cartesia
|
||||
CARTESIA_API_KEY=...
|
||||
CARTESIA_VOICE_ID=...
|
||||
@@ -85,9 +82,6 @@ GROK_API_KEY=...
|
||||
# Groq
|
||||
GROQ_API_KEY=...
|
||||
|
||||
# Hathora
|
||||
HATHORA_API_KEY=...
|
||||
|
||||
# Heygen
|
||||
HEYGEN_API_KEY=...
|
||||
HEYGEN_LIVE_AVATAR_API_KEY=...
|
||||
|
||||
217
examples/foundational/07a-non-interruptible.py
Normal file
217
examples/foundational/07a-non-interruptible.py
Normal file
@@ -0,0 +1,217 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""
|
||||
This example demonstrates how to dynamically toggle interruptions while still
|
||||
transcribing user speech. Every 5 seconds, the bot toggles between:
|
||||
- Interruptible mode: user speech will interrupt the bot
|
||||
- Non-interruptible mode: user speech is transcribed but won't interrupt
|
||||
|
||||
This is useful when you want to capture what the user says during bot speech
|
||||
without interrupting the bot's response, and then re-enable interruptions later.
|
||||
|
||||
The key mechanism is a custom frame (`EnableInterruptionsFrame`) that is queued
|
||||
into the pipeline. A custom `DynamicVADUserTurnStartStrategy` subclasses
|
||||
`VADUserTurnStartStrategy` and listens for this frame to update its
|
||||
`enable_interruptions` setting at runtime.
|
||||
|
||||
In both modes:
|
||||
- Voice Activity Detection (VAD) continues working
|
||||
- Speech-to-text transcription continues
|
||||
- User turns are aggregated into context
|
||||
|
||||
Watch the logs to see when interruptions are enabled/disabled, then try speaking
|
||||
while the bot talks to observe the different behaviors.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
from dataclasses import dataclass
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.audio.vad.vad_analyzer import VADParams
|
||||
from pipecat.frames.frames import Frame, LLMRunFrame, SystemFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.llm_response_universal import (
|
||||
LLMContextAggregatorPair,
|
||||
LLMUserAggregatorParams,
|
||||
)
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
from pipecat.turns.user_start import VADUserTurnStartStrategy
|
||||
from pipecat.turns.user_stop import TurnAnalyzerUserTurnStopStrategy
|
||||
from pipecat.turns.user_turn_strategies import UserTurnStrategies
|
||||
|
||||
|
||||
@dataclass
|
||||
class EnableInterruptionsFrame(SystemFrame):
|
||||
"""A custom frame to dynamically enable or disable interruptions.
|
||||
|
||||
Queue this frame into the pipeline to change the interruption behavior
|
||||
of DynamicVADUserTurnStartStrategy at runtime.
|
||||
"""
|
||||
|
||||
enable_interruptions: bool
|
||||
|
||||
|
||||
class DynamicVADUserTurnStartStrategy(VADUserTurnStartStrategy):
|
||||
"""A VAD-based user turn start strategy with dynamic interruption control.
|
||||
|
||||
This strategy extends VADUserTurnStartStrategy to listen for
|
||||
EnableInterruptionsFrame, allowing the enable_interruptions setting
|
||||
to be changed at runtime via the pipeline.
|
||||
|
||||
Example:
|
||||
# Create strategy with interruptions initially disabled
|
||||
strategy = DynamicVADUserTurnStartStrategy(enable_interruptions=False)
|
||||
|
||||
# Later, toggle interruptions by queuing a frame
|
||||
await task.queue_frame(EnableInterruptionsFrame(enable_interruptions=True))
|
||||
"""
|
||||
|
||||
async def process_frame(self, frame: Frame):
|
||||
"""Process frames, updating enable_interruptions when our custom frame arrives."""
|
||||
if isinstance(frame, EnableInterruptionsFrame):
|
||||
self._enable_interruptions = frame.enable_interruptions
|
||||
logger.info(f"Interruptions {'ENABLED' if frame.enable_interruptions else 'DISABLED'}")
|
||||
|
||||
await super().process_frame(frame)
|
||||
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
),
|
||||
"twilio": lambda: FastAPIWebsocketParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
tts = CartesiaTTSService(
|
||||
api_key=os.getenv("CARTESIA_API_KEY"),
|
||||
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
|
||||
)
|
||||
|
||||
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate toggling interruptible behavior. Give longer responses so the user can test speaking while you talk. Sometimes your speech can be interrupted, sometimes it cannot. The system will toggle every 5 seconds. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points.",
|
||||
},
|
||||
]
|
||||
|
||||
context = LLMContext(messages)
|
||||
|
||||
# Create a dynamic VAD strategy that can be toggled at runtime via frames.
|
||||
# This strategy extends VADUserTurnStartStrategy but listens for
|
||||
# EnableInterruptionsFrame to change enable_interruptions dynamically.
|
||||
dynamic_strategy = DynamicVADUserTurnStartStrategy(enable_interruptions=False)
|
||||
|
||||
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
start=[dynamic_strategy],
|
||||
stop=[TurnAnalyzerUserTurnStopStrategy(turn_analyzer=LocalSmartTurnAnalyzerV3())],
|
||||
),
|
||||
),
|
||||
)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
stt,
|
||||
user_aggregator, # User responses
|
||||
llm, # LLM
|
||||
tts, # TTS
|
||||
transport.output(), # Transport bot output
|
||||
assistant_aggregator, # Assistant spoken responses
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
|
||||
# Kick off the conversation.
|
||||
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
# Toggle interruptions every 5 seconds by queuing EnableInterruptionsFrame.
|
||||
# This is the idiomatic Pipecat way to control behavior via frames.
|
||||
interruptions_enabled = False
|
||||
for _ in range(10): # Toggle 10 times (50 seconds total)
|
||||
await asyncio.sleep(5)
|
||||
interruptions_enabled = not interruptions_enabled
|
||||
|
||||
# Queue a frame to toggle interruptions - the strategy will pick it up
|
||||
await task.queue_frame(
|
||||
EnableInterruptionsFrame(enable_interruptions=interruptions_enabled)
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
|
||||
@@ -1,138 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024–2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
import os
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.audio.vad.vad_analyzer import VADParams
|
||||
from pipecat.frames.frames import LLMRunFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.llm_response_universal import (
|
||||
LLMContextAggregatorPair,
|
||||
LLMUserAggregatorParams,
|
||||
)
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.camb.tts import CambTTSService
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
from pipecat.turns.user_stop import TurnAnalyzerUserTurnStopStrategy
|
||||
from pipecat.turns.user_turn_strategies import UserTurnStrategies
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
),
|
||||
"twilio": lambda: FastAPIWebsocketParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info("Starting Camb AI TTS bot")
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
tts = CambTTSService(
|
||||
api_key=os.getenv("CAMB_API_KEY"),
|
||||
model="mars-flash",
|
||||
)
|
||||
|
||||
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful voice assistant powered by Camb AI text-to-speech. "
|
||||
"Keep your responses concise and conversational since they will be spoken aloud. "
|
||||
"Avoid special characters, emojis, or bullet points.",
|
||||
},
|
||||
]
|
||||
|
||||
context = LLMContext(messages)
|
||||
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
stop=[TurnAnalyzerUserTurnStopStrategy(turn_analyzer=LocalSmartTurnAnalyzerV3())]
|
||||
),
|
||||
),
|
||||
)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(),
|
||||
stt,
|
||||
user_aggregator,
|
||||
llm,
|
||||
tts,
|
||||
transport.output(),
|
||||
assistant_aggregator,
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
audio_out_sample_rate=22050,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info("Client connected")
|
||||
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info("Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
|
||||
@@ -13,12 +13,7 @@ from loguru import logger
|
||||
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.audio.vad.vad_analyzer import VADParams
|
||||
from pipecat.frames.frames import (
|
||||
EndTaskFrame,
|
||||
LLMMessagesAppendFrame,
|
||||
LLMRunFrame,
|
||||
TTSSpeakFrame,
|
||||
)
|
||||
from pipecat.frames.frames import EndFrame, LLMMessagesAppendFrame, LLMRunFrame, TTSSpeakFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
@@ -27,7 +22,7 @@ from pipecat.processors.aggregators.llm_response_universal import (
|
||||
LLMContextAggregatorPair,
|
||||
LLMUserAggregatorParams,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.processors.user_idle_processor import UserIdleProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
@@ -41,43 +36,6 @@ from pipecat.turns.user_turn_strategies import UserTurnStrategies
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
class IdleHandler:
|
||||
"""Helper class to manage user idle retry logic."""
|
||||
|
||||
def __init__(self):
|
||||
self._retry_count = 0
|
||||
|
||||
def reset(self):
|
||||
"""Reset the retry count when user becomes active."""
|
||||
self._retry_count = 0
|
||||
|
||||
async def handle_idle(self, aggregator):
|
||||
"""Handle user idle event with escalating prompts."""
|
||||
self._retry_count += 1
|
||||
|
||||
if self._retry_count == 1:
|
||||
# First attempt: Add a gentle prompt to the conversation
|
||||
message = {
|
||||
"role": "system",
|
||||
"content": "The user has been quiet. Politely and briefly ask if they're still there.",
|
||||
}
|
||||
await aggregator.push_frame(LLMMessagesAppendFrame([message], run_llm=True))
|
||||
elif self._retry_count == 2:
|
||||
# Second attempt: More direct prompt
|
||||
message = {
|
||||
"role": "system",
|
||||
"content": "The user is still inactive. Ask if they'd like to continue our conversation.",
|
||||
}
|
||||
await aggregator.push_frame(LLMMessagesAppendFrame([message], run_llm=True))
|
||||
else:
|
||||
# Third attempt: End the conversation
|
||||
await aggregator.push_frame(
|
||||
TTSSpeakFrame("It seems like you're busy right now. Have a nice day!")
|
||||
)
|
||||
await aggregator.push_frame(EndTaskFrame(), FrameDirection.UPSTREAM)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
@@ -126,15 +84,42 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
stop=[TurnAnalyzerUserTurnStopStrategy(turn_analyzer=LocalSmartTurnAnalyzerV3())]
|
||||
),
|
||||
user_idle_timeout=5.0, # Detect user idle after 5 seconds
|
||||
),
|
||||
)
|
||||
|
||||
async def handle_user_idle(user_idle: UserIdleProcessor, retry_count: int) -> bool:
|
||||
if retry_count == 1:
|
||||
# First attempt: Add a gentle prompt to the conversation
|
||||
message = {
|
||||
"role": "system",
|
||||
"content": "The user has been quiet. Politely and briefly ask if they're still there.",
|
||||
}
|
||||
await user_idle.push_frame(LLMMessagesAppendFrame([message], run_llm=True))
|
||||
return True
|
||||
elif retry_count == 2:
|
||||
# Second attempt: More direct prompt
|
||||
message = {
|
||||
"role": "system",
|
||||
"content": "The user is still inactive. Ask if they'd like to continue our conversation.",
|
||||
}
|
||||
await user_idle.push_frame(LLMMessagesAppendFrame([message], run_llm=True))
|
||||
return True
|
||||
else:
|
||||
# Third attempt: End the conversation
|
||||
await user_idle.push_frame(
|
||||
TTSSpeakFrame("It seems like you're busy right now. Have a nice day!")
|
||||
)
|
||||
await task.queue_frame(EndFrame())
|
||||
return False
|
||||
|
||||
user_idle = UserIdleProcessor(callback=handle_user_idle, timeout=5.0)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
stt,
|
||||
user_aggregator, # User aggregator with built-in idle detection
|
||||
user_idle, # Idle user check-in
|
||||
user_aggregator,
|
||||
llm, # LLM
|
||||
tts, # TTS
|
||||
transport.output(), # Transport bot output
|
||||
@@ -151,17 +136,6 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
# Set up idle handling with retry logic
|
||||
idle_handler = IdleHandler()
|
||||
|
||||
@user_aggregator.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(aggregator):
|
||||
await idle_handler.handle_idle(aggregator)
|
||||
|
||||
@user_aggregator.event_handler("on_user_turn_started")
|
||||
async def on_user_turn_started(aggregator, strategy):
|
||||
idle_handler.reset()
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
|
||||
@@ -119,7 +119,7 @@ class CompletenessCheck(FrameProcessor):
|
||||
|
||||
if isinstance(frame, TextFrame) and frame.text == "YES":
|
||||
logger.debug("Completeness check YES")
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame())
|
||||
await self._notifier.notify()
|
||||
elif isinstance(frame, TextFrame) and frame.text == "NO":
|
||||
logger.debug("Completeness check NO")
|
||||
|
||||
@@ -322,7 +322,7 @@ class CompletenessCheck(FrameProcessor):
|
||||
|
||||
if isinstance(frame, TextFrame) and frame.text == "YES":
|
||||
logger.debug("!!! Completeness check YES")
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame())
|
||||
await self._notifier.notify()
|
||||
elif isinstance(frame, TextFrame) and frame.text == "NO":
|
||||
logger.debug("!!! Completeness check NO")
|
||||
|
||||
@@ -451,7 +451,7 @@ class CompletenessCheck(FrameProcessor):
|
||||
logger.debug("Completeness check YES")
|
||||
if self._idle_task:
|
||||
await self.cancel_task(self._idle_task)
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame())
|
||||
await self._audio_accumulator.reset()
|
||||
await self._notifier.notify()
|
||||
elif isinstance(frame, TextFrame):
|
||||
|
||||
@@ -1,14 +1,18 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.schemas.function_schema import FunctionSchema
|
||||
from pipecat.adapters.schemas.tools_schema import ToolsSchema
|
||||
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.audio.vad.vad_analyzer import VADParams
|
||||
@@ -21,10 +25,12 @@ from pipecat.processors.aggregators.llm_response_universal import (
|
||||
LLMContextAggregatorPair,
|
||||
LLMUserAggregatorParams,
|
||||
)
|
||||
from pipecat.processors.filters.stt_mute_filter import STTMuteConfig, STTMuteFilter, STTMuteStrategy
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.hathora.stt import HathoraSTTService
|
||||
from pipecat.services.hathora.tts import HathoraTTSService
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.deepgram.tts import DeepgramTTSService
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
@@ -34,6 +40,15 @@ from pipecat.turns.user_turn_strategies import UserTurnStrategies
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
async def fetch_weather_from_api(params: FunctionCallParams):
|
||||
# Add a delay to test interruption during function calls
|
||||
logger.info("Weather API call starting...")
|
||||
await asyncio.sleep(5) # 5-second delay
|
||||
logger.info("Weather API call completed")
|
||||
await params.result_callback({"conditions": "nice", "temperature": "75"})
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
@@ -59,30 +74,50 @@ transport_params = {
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
stt = HathoraSTTService(
|
||||
model="nvidia-parakeet-tdt-0.6b-v3",
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
# Configure the mute processor with both strategies
|
||||
stt_mute_processor = STTMuteFilter(
|
||||
config=STTMuteConfig(
|
||||
strategies={
|
||||
STTMuteStrategy.MUTE_UNTIL_FIRST_BOT_COMPLETE,
|
||||
STTMuteStrategy.FUNCTION_CALL,
|
||||
}
|
||||
),
|
||||
)
|
||||
|
||||
tts = HathoraTTSService(
|
||||
model="hexgrad-kokoro-82m",
|
||||
)
|
||||
tts = DeepgramTTSService(api_key=os.getenv("DEEPGRAM_API_KEY"), voice="aura-helios-en")
|
||||
|
||||
# See https://models.hathora.dev/model/qwen3-30b-a3b
|
||||
llm = OpenAILLMService(
|
||||
base_url="https://app-362f7ca1-6975-4e18-a605-ab202bf2c315.app.hathora.dev/v1",
|
||||
api_key=os.getenv("HATHORA_API_KEY"),
|
||||
model=None,
|
||||
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
|
||||
llm.register_function("get_current_weather", fetch_weather_from_api)
|
||||
|
||||
weather_function = FunctionSchema(
|
||||
name="get_current_weather",
|
||||
description="Get the current weather",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["celsius", "fahrenheit"],
|
||||
"description": "The temperature unit to use. Infer this from the user's location.",
|
||||
},
|
||||
},
|
||||
required=["location", "format"],
|
||||
)
|
||||
tools = ToolsSchema(standard_tools=[weather_function])
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
|
||||
"content": "You are a helpful assistant who can check the weather. Always check the weather when a location is mentioned. Respond concisely and naturally. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points.",
|
||||
},
|
||||
]
|
||||
|
||||
context = LLMContext(messages)
|
||||
context_aggregator = LLMContextAggregatorPair(
|
||||
context = LLMContext(messages, tools)
|
||||
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
@@ -94,12 +129,13 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
stt,
|
||||
context_aggregator.user(), # User responses
|
||||
stt, # STT
|
||||
stt_mute_processor, # Add the mute processor between STT and context aggregator
|
||||
user_aggregator, # User responses
|
||||
llm, # LLM
|
||||
tts, # TTS
|
||||
transport.output(), # Transport bot output
|
||||
context_aggregator.assistant(), # Assistant spoken responses
|
||||
assistant_aggregator, # Assistant spoken responses
|
||||
]
|
||||
)
|
||||
|
||||
@@ -115,8 +151,13 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
# Kick off the conversation.
|
||||
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
|
||||
# Kick off the conversation with a weather-related prompt
|
||||
messages.append(
|
||||
{
|
||||
"role": "system",
|
||||
"content": "Ask the user what city they'd like to know the weather for.",
|
||||
}
|
||||
)
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
@@ -34,7 +34,7 @@ from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
from pipecat.turns.user_mute import (
|
||||
from pipecat.turns.mute import (
|
||||
FunctionCallUserMuteStrategy,
|
||||
MuteUntilFirstBotCompleteUserMuteStrategy,
|
||||
)
|
||||
@@ -161,14 +161,6 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
@user_aggregator.event_handler("on_user_mute_started")
|
||||
async def on_user_mute_started(aggregator):
|
||||
logger.info(f"User mute started")
|
||||
|
||||
@user_aggregator.event_handler("on_user_mute_stopped")
|
||||
async def on_user_mute_stopped(aggregator):
|
||||
logger.info(f"User mute stopped")
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
# Copyright (c) 2024-2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
# Copyright (c) 2024-2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
@@ -44,11 +44,8 @@ from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.llm_response_universal import (
|
||||
LLMContextAggregatorPair,
|
||||
LLMUserAggregatorParams,
|
||||
)
|
||||
from pipecat.processors.frameworks.rtvi import RTVIObserver, RTVIProcessor
|
||||
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
|
||||
from pipecat.processors.frameworks.rtvi import RTVIConfig, RTVIObserver, RTVIProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
@@ -56,10 +53,6 @@ from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.openai.llm import OpenAILLMService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.turns.user_stop.turn_analyzer_user_turn_stop_strategy import (
|
||||
TurnAnalyzerUserTurnStopStrategy,
|
||||
)
|
||||
from pipecat.turns.user_turn_strategies import UserTurnStrategies
|
||||
|
||||
logger.info("✅ All components loaded successfully!")
|
||||
|
||||
@@ -86,27 +79,20 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
]
|
||||
|
||||
context = LLMContext(messages)
|
||||
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
|
||||
context,
|
||||
user_params=LLMUserAggregatorParams(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
stop=[TurnAnalyzerUserTurnStopStrategy(turn_analyzer=LocalSmartTurnAnalyzerV3())]
|
||||
),
|
||||
),
|
||||
)
|
||||
context_aggregator = LLMContextAggregatorPair(context)
|
||||
|
||||
rtvi = RTVIProcessor()
|
||||
rtvi = RTVIProcessor(config=RTVIConfig(config=[]))
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
rtvi, # RTVI processor
|
||||
stt,
|
||||
user_aggregator, # User responses
|
||||
context_aggregator.user(), # User responses
|
||||
llm, # LLM
|
||||
tts, # TTS
|
||||
transport.output(), # Transport bot output
|
||||
assistant_aggregator, # Assistant spoken responses
|
||||
context_aggregator.assistant(), # Assistant spoken responses
|
||||
]
|
||||
)
|
||||
|
||||
@@ -144,11 +130,13 @@ async def bot(runner_args: RunnerArguments):
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
turn_analyzer=LocalSmartTurnAnalyzerV3(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
turn_analyzer=LocalSmartTurnAnalyzerV3(),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
@@ -41,11 +41,8 @@ dependencies = [
|
||||
]
|
||||
|
||||
[project.urls]
|
||||
Homepage = "https://pipecat.ai"
|
||||
Documentation = "https://docs.pipecat.ai/"
|
||||
Source = "https://github.com/pipecat-ai/pipecat"
|
||||
Issues = "https://github.com/pipecat-ai/pipecat/issues"
|
||||
Changelog = "https://github.com/pipecat-ai/pipecat/blob/main/CHANGELOG.md"
|
||||
Website = "https://pipecat.ai"
|
||||
|
||||
[project.optional-dependencies]
|
||||
aic = [ "aic-sdk~=1.2.0" ]
|
||||
@@ -56,7 +53,6 @@ aws = [ "aioboto3~=15.5.0", "pipecat-ai[websockets-base]" ]
|
||||
aws-nova-sonic = [ "aws_sdk_bedrock_runtime~=0.2.0; python_version>='3.12'" ]
|
||||
azure = [ "azure-cognitiveservices-speech~=1.44.0"]
|
||||
cartesia = [ "cartesia~=2.0.3", "pipecat-ai[websockets-base]" ]
|
||||
camb = [ "camb-sdk>=1.5.4" ]
|
||||
cerebras = []
|
||||
daily = [ "daily-python~=0.23.0" ]
|
||||
deepgram = [ "deepgram-sdk~=4.7.0", "pipecat-ai[websockets-base]" ]
|
||||
@@ -100,7 +96,7 @@ qwen = []
|
||||
remote-smart-turn = []
|
||||
rime = [ "pipecat-ai[websockets-base]" ]
|
||||
riva = [ "pipecat-ai[nvidia]" ]
|
||||
runner = [ "python-dotenv>=1.0.0,<2.0.0", "uvicorn>=0.32.0,<1.0.0", "fastapi>=0.115.6,<0.128.0", "pipecat-ai-small-webrtc-prebuilt>=2.0.4"]
|
||||
runner = [ "python-dotenv>=1.0.0,<2.0.0", "uvicorn>=0.32.0,<1.0.0", "fastapi>=0.115.6,<0.122.0", "pipecat-ai-small-webrtc-prebuilt>=2.0.4"]
|
||||
sagemaker = ["aws_sdk_sagemaker_runtime_http2; python_version>='3.12'"]
|
||||
sambanova = []
|
||||
sarvam = [ "sarvamai==0.1.21", "pipecat-ai[websockets-base]" ]
|
||||
@@ -116,7 +112,7 @@ together = []
|
||||
tracing = [ "opentelemetry-sdk>=1.33.0", "opentelemetry-api>=1.33.0", "opentelemetry-instrumentation>=0.54b0" ]
|
||||
ultravox = [ "pipecat-ai[websockets-base]" ]
|
||||
webrtc = [ "aiortc>=1.14.0,<2", "opencv-python>=4.11.0.86,<5" ]
|
||||
websocket = [ "pipecat-ai[websockets-base]", "fastapi>=0.115.6,<0.128.0" ]
|
||||
websocket = [ "pipecat-ai[websockets-base]", "fastapi>=0.115.6,<0.122.0" ]
|
||||
websockets-base = [ "websockets>=13.1,<16.0" ]
|
||||
whisper = [ "faster-whisper~=1.1.1" ]
|
||||
|
||||
|
||||
@@ -97,6 +97,15 @@ TESTS_07 = [
|
||||
("07-interruptible-cartesia-http.py", EVAL_SIMPLE_MATH),
|
||||
("07a-interruptible-speechmatics.py", EVAL_SIMPLE_MATH),
|
||||
("07a-interruptible-speechmatics-vad.py", EVAL_SIMPLE_MATH),
|
||||
("07aa-interruptible-soniox.py", EVAL_SIMPLE_MATH),
|
||||
("07ab-interruptible-inworld.py", EVAL_SIMPLE_MATH),
|
||||
("07ab-interruptible-inworld-http.py", EVAL_SIMPLE_MATH),
|
||||
("07ac-interruptible-asyncai.py", EVAL_SIMPLE_MATH),
|
||||
("07ac-interruptible-asyncai-http.py", EVAL_SIMPLE_MATH),
|
||||
# Need license key to run
|
||||
# ("07ad-interruptible-aicoustics.py", EVAL_SIMPLE_MATH),
|
||||
("07ae-interruptible-hume.py", EVAL_SIMPLE_MATH),
|
||||
("07af-interruptible-gradium.py", EVAL_SIMPLE_MATH),
|
||||
("07b-interruptible-langchain.py", EVAL_SIMPLE_MATH),
|
||||
("07c-interruptible-deepgram.py", EVAL_SIMPLE_MATH),
|
||||
("07c-interruptible-deepgram-flux.py", EVAL_SIMPLE_MATH),
|
||||
@@ -128,16 +137,6 @@ TESTS_07 = [
|
||||
("07y-interruptible-minimax.py", EVAL_SIMPLE_MATH),
|
||||
("07z-interruptible-sarvam.py", EVAL_SIMPLE_MATH),
|
||||
("07z-interruptible-sarvam-http.py", EVAL_SIMPLE_MATH),
|
||||
("07za-interruptible-soniox.py", EVAL_SIMPLE_MATH),
|
||||
("07zb-interruptible-inworld.py", EVAL_SIMPLE_MATH),
|
||||
("07zb-interruptible-inworld-http.py", EVAL_SIMPLE_MATH),
|
||||
("07zc-interruptible-asyncai.py", EVAL_SIMPLE_MATH),
|
||||
("07zc-interruptible-asyncai-http.py", EVAL_SIMPLE_MATH),
|
||||
# Need license key to run
|
||||
# ("07zd-interruptible-aicoustics.py", EVAL_SIMPLE_MATH),
|
||||
("07ze-interruptible-hume.py", EVAL_SIMPLE_MATH),
|
||||
("07zf-interruptible-gradium.py", EVAL_SIMPLE_MATH),
|
||||
("07zh-interruptible-hathora.py", EVAL_SIMPLE_MATH),
|
||||
# Needs a local XTTS docker instance running.
|
||||
# ("07i-interruptible-xtts.py", EVAL_SIMPLE_MATH),
|
||||
# Needs a Krisp license.
|
||||
|
||||
@@ -61,7 +61,6 @@ class KrispFilter(BaseAudioFilter):
|
||||
Provides real-time noise reduction for audio streams using Krisp's
|
||||
proprietary noise suppression algorithms. Requires a Krisp model file
|
||||
for operation.
|
||||
|
||||
.. deprecated:: 0.0.94
|
||||
The KrispFilter is deprecated and will be removed in a future version.
|
||||
Use KrispVivaFilter instead.
|
||||
|
||||
@@ -1024,8 +1024,10 @@ class LLMAssistantContextAggregator(LLMContextResponseAggregator):
|
||||
logger.debug(
|
||||
f"{self} FunctionCallCancelFrame: [{frame.function_name}:{frame.tool_call_id}]"
|
||||
)
|
||||
function_call = self._function_calls_in_progress.get(frame.tool_call_id)
|
||||
if function_call and function_call.cancel_on_interruption:
|
||||
if frame.tool_call_id not in self._function_calls_in_progress:
|
||||
return
|
||||
|
||||
if self._function_calls_in_progress[frame.tool_call_id].cancel_on_interruption:
|
||||
await self.handle_function_call_cancel(frame)
|
||||
del self._function_calls_in_progress[frame.tool_call_id]
|
||||
|
||||
|
||||
@@ -62,8 +62,7 @@ from pipecat.processors.aggregators.llm_context import (
|
||||
NotGiven,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.turns.user_idle_controller import UserIdleController
|
||||
from pipecat.turns.user_mute import BaseUserMuteStrategy
|
||||
from pipecat.turns.mute import BaseUserMuteStrategy
|
||||
from pipecat.turns.user_start import BaseUserTurnStartStrategy, UserTurnStartedParams
|
||||
from pipecat.turns.user_stop import BaseUserTurnStopStrategy, UserTurnStoppedParams
|
||||
from pipecat.turns.user_turn_controller import UserTurnController
|
||||
@@ -81,16 +80,11 @@ class LLMUserAggregatorParams:
|
||||
user_mute_strategies: List of user mute strategies.
|
||||
user_turn_stop_timeout: Time in seconds to wait before considering the
|
||||
user's turn finished.
|
||||
user_idle_timeout: Optional timeout in seconds for detecting user idle state.
|
||||
If set, the aggregator will emit an `on_user_turn_idle` event when the user
|
||||
has been idle (not speaking) for this duration. Set to None to disable
|
||||
idle detection.
|
||||
"""
|
||||
|
||||
user_turn_strategies: Optional[UserTurnStrategies] = None
|
||||
user_mute_strategies: List[BaseUserMuteStrategy] = field(default_factory=list)
|
||||
user_turn_stop_timeout: float = 5.0
|
||||
user_idle_timeout: Optional[float] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -297,14 +291,11 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
- on_user_turn_started: Called when the user turn starts
|
||||
- on_user_turn_stopped: Called when the user turn ends
|
||||
- on_user_turn_stop_timeout: Called when no user turn stop strategy triggers
|
||||
- on_user_turn_idle: Called when the user has been idle for the configured timeout
|
||||
- on_user_mute_started: Called when the user becomes muted
|
||||
- on_user_mute_stopped: Called when the user becomes unmuted
|
||||
|
||||
Example::
|
||||
|
||||
@aggregator.event_handler("on_user_turn_started")
|
||||
async def on_user_turn_started(aggregator, strategy: BaseUserTurnStartStrategy):
|
||||
async def on_user_turn_started(aggregator, strategy: BaseUserTurnStartStrategy]):
|
||||
...
|
||||
|
||||
@aggregator.event_handler("on_user_turn_stopped")
|
||||
@@ -315,18 +306,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
async def on_user_turn_stop_timeout(aggregator):
|
||||
...
|
||||
|
||||
@aggregator.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(aggregator):
|
||||
...
|
||||
|
||||
@aggregator.event_handler("on_user_mute_started")
|
||||
async def on_user_mute_started(aggregator):
|
||||
...
|
||||
|
||||
@aggregator.event_handler("on_user_mute_stopped")
|
||||
async def on_user_mute_stopped(aggregator):
|
||||
...
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
@@ -349,9 +328,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
self._register_event_handler("on_user_turn_started")
|
||||
self._register_event_handler("on_user_turn_stopped")
|
||||
self._register_event_handler("on_user_turn_stop_timeout")
|
||||
self._register_event_handler("on_user_turn_idle")
|
||||
self._register_event_handler("on_user_mute_started")
|
||||
self._register_event_handler("on_user_mute_stopped")
|
||||
|
||||
user_turn_strategies = self._params.user_turn_strategies or UserTurnStrategies()
|
||||
|
||||
@@ -374,16 +350,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
"on_user_turn_stop_timeout", self._on_user_turn_stop_timeout
|
||||
)
|
||||
|
||||
# Optional user idle controller
|
||||
self._user_idle_controller: Optional[UserIdleController] = None
|
||||
if self._params.user_idle_timeout:
|
||||
self._user_idle_controller = UserIdleController(
|
||||
user_idle_timeout=self._params.user_idle_timeout
|
||||
)
|
||||
self._user_idle_controller.add_event_handler(
|
||||
"on_user_turn_idle", self._on_user_turn_idle
|
||||
)
|
||||
|
||||
async def cleanup(self):
|
||||
"""Clean up processor resources."""
|
||||
await super().cleanup()
|
||||
@@ -439,9 +405,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
|
||||
await self._user_turn_controller.process_frame(frame)
|
||||
|
||||
if self._user_idle_controller:
|
||||
await self._user_idle_controller.process_frame(frame)
|
||||
|
||||
async def push_aggregation(self) -> str:
|
||||
"""Push the current aggregation."""
|
||||
if len(self._aggregation) == 0:
|
||||
@@ -457,9 +420,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
async def _start(self, frame: StartFrame):
|
||||
await self._user_turn_controller.setup(self.task_manager)
|
||||
|
||||
if self._user_idle_controller:
|
||||
await self._user_idle_controller.setup(self.task_manager)
|
||||
|
||||
for s in self._params.user_mute_strategies:
|
||||
await s.setup(self.task_manager)
|
||||
|
||||
@@ -472,9 +432,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
async def _cleanup(self):
|
||||
await self._user_turn_controller.cleanup()
|
||||
|
||||
if self._user_idle_controller:
|
||||
await self._user_idle_controller.cleanup()
|
||||
|
||||
for s in self._params.user_mute_strategies:
|
||||
await s.cleanup()
|
||||
|
||||
@@ -504,12 +461,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
logger.debug(f"{self}: user is now {'muted' if should_mute_next_time else 'unmuted'}")
|
||||
self._user_is_muted = should_mute_next_time
|
||||
|
||||
# Emit mute state change events
|
||||
if self._user_is_muted:
|
||||
await self._call_event_handler("on_user_mute_started")
|
||||
else:
|
||||
await self._call_event_handler("on_user_mute_stopped")
|
||||
|
||||
return should_mute_frame
|
||||
|
||||
async def _handle_llm_run(self, frame: LLMRunFrame):
|
||||
@@ -614,9 +565,6 @@ class LLMUserAggregator(LLMContextAggregator):
|
||||
async def _on_user_turn_stop_timeout(self, controller):
|
||||
await self._call_event_handler("on_user_turn_stop_timeout")
|
||||
|
||||
async def _on_user_turn_idle(self, controller):
|
||||
await self._call_event_handler("on_user_turn_idle")
|
||||
|
||||
|
||||
class LLMAssistantAggregator(LLMContextAggregator):
|
||||
"""Assistant LLM aggregator that processes bot responses and function calls.
|
||||
@@ -910,8 +858,10 @@ class LLMAssistantAggregator(LLMContextAggregator):
|
||||
logger.debug(
|
||||
f"{self} FunctionCallCancelFrame: [{frame.function_name}:{frame.tool_call_id}]"
|
||||
)
|
||||
function_call = self._function_calls_in_progress.get(frame.tool_call_id)
|
||||
if function_call and function_call.cancel_on_interruption:
|
||||
if frame.tool_call_id not in self._function_calls_in_progress:
|
||||
return
|
||||
|
||||
if self._function_calls_in_progress[frame.tool_call_id].cancel_on_interruption:
|
||||
# Update context with the function call cancellation
|
||||
self._update_function_call_result(frame.function_name, frame.tool_call_id, "CANCELLED")
|
||||
del self._function_calls_in_progress[frame.tool_call_id]
|
||||
|
||||
@@ -8,7 +8,6 @@
|
||||
|
||||
import asyncio
|
||||
import inspect
|
||||
import warnings
|
||||
from typing import Awaitable, Callable, Union
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
@@ -27,10 +26,6 @@ from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
class UserIdleProcessor(FrameProcessor):
|
||||
"""Monitors user inactivity and triggers callbacks after timeout periods.
|
||||
|
||||
.. deprecated::
|
||||
UserIdleProcessor is deprecated in 0.0.100 and will be removed in a future version.
|
||||
Use LLMUserAggregator with user_idle_timeout parameter instead.
|
||||
|
||||
This processor tracks user activity and triggers configurable callbacks when
|
||||
users become idle. It starts monitoring only after the first conversation
|
||||
activity and supports both basic and retry-based callback patterns.
|
||||
@@ -75,14 +70,6 @@ class UserIdleProcessor(FrameProcessor):
|
||||
**kwargs: Additional arguments passed to FrameProcessor.
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
|
||||
warnings.warn(
|
||||
"UserIdleProcessor is deprecated in 0.0.100 and will be removed in a "
|
||||
"future version. Use LLMUserAggregator with user_idle_timeout parameter "
|
||||
"instead.",
|
||||
DeprecationWarning,
|
||||
)
|
||||
|
||||
self._callback = self._wrap_callback(callback)
|
||||
self._timeout = timeout
|
||||
self._retry_count = 0
|
||||
|
||||
@@ -34,7 +34,8 @@ class VonageFrameSerializer(FrameSerializer):
|
||||
WebSocket streaming protocol.
|
||||
|
||||
Note:
|
||||
Ref docs: https://developer.vonage.com/en/video/guides/audio-connector
|
||||
Ref docs:
|
||||
https://developer.vonage.com/en/video/guides/audio-connector
|
||||
"""
|
||||
|
||||
class InputParams(BaseModel):
|
||||
|
||||
@@ -9,7 +9,6 @@
|
||||
import asyncio
|
||||
import base64
|
||||
import json
|
||||
import uuid
|
||||
from typing import AsyncGenerator, Optional
|
||||
|
||||
import aiohttp
|
||||
@@ -28,7 +27,7 @@ from pipecat.frames.frames import (
|
||||
TTSStoppedFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.tts_service import AudioContextTTSService, TTSService
|
||||
from pipecat.services.tts_service import InterruptibleTTSService, TTSService
|
||||
from pipecat.transcriptions.language import Language, resolve_language
|
||||
from pipecat.utils.tracing.service_decorators import traced_tts
|
||||
|
||||
@@ -73,7 +72,7 @@ def language_to_async_language(language: Language) -> Optional[str]:
|
||||
return resolve_language(language, LANGUAGE_MAP, use_base_code=True)
|
||||
|
||||
|
||||
class AsyncAITTSService(AudioContextTTSService):
|
||||
class AsyncAITTSService(InterruptibleTTSService):
|
||||
"""Async TTS service with WebSocket streaming.
|
||||
|
||||
Provides text-to-speech using Async's streaming WebSocket API.
|
||||
@@ -149,7 +148,6 @@ class AsyncAITTSService(AudioContextTTSService):
|
||||
self._receive_task = None
|
||||
self._keepalive_task = None
|
||||
self._started = False
|
||||
self._context_id = None
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
@@ -170,8 +168,8 @@ class AsyncAITTSService(AudioContextTTSService):
|
||||
"""
|
||||
return language_to_async_language(language)
|
||||
|
||||
def _build_msg(self, text: str = "", context_id: str = "", force: bool = False) -> str:
|
||||
msg = {"transcript": text, "context_id": context_id, "force": force}
|
||||
def _build_msg(self, text: str = "", force: bool = False) -> str:
|
||||
msg = {"transcript": text, "force": force}
|
||||
return json.dumps(msg)
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
@@ -255,16 +253,11 @@ class AsyncAITTSService(AudioContextTTSService):
|
||||
|
||||
if self._websocket:
|
||||
logger.debug("Disconnecting from Async")
|
||||
# Close all contexts and the socket
|
||||
if self._context_id:
|
||||
await self._websocket.send(json.dumps({"terminate": True}))
|
||||
await self._websocket.close()
|
||||
logger.debug("Disconnected from Async")
|
||||
except Exception as e:
|
||||
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
|
||||
finally:
|
||||
self._websocket = None
|
||||
self._context_id = None
|
||||
self._started = False
|
||||
await self._call_event_handler("on_disconnected")
|
||||
|
||||
@@ -275,10 +268,10 @@ class AsyncAITTSService(AudioContextTTSService):
|
||||
|
||||
async def flush_audio(self):
|
||||
"""Flush any pending audio."""
|
||||
if not self._context_id or not self._websocket:
|
||||
if not self._websocket:
|
||||
return
|
||||
logger.trace(f"{self}: flushing audio")
|
||||
msg = self._build_msg(text=" ", context_id=self._context_id, force=True)
|
||||
msg = self._build_msg(text=" ", force=True)
|
||||
await self._websocket.send(msg)
|
||||
|
||||
async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
|
||||
@@ -298,75 +291,35 @@ class AsyncAITTSService(AudioContextTTSService):
|
||||
if not msg:
|
||||
continue
|
||||
|
||||
received_ctx_id = msg.get("context_id")
|
||||
# Handle final messages first, regardless of context availability
|
||||
# At the moment, this message is received AFTER the close_context message is
|
||||
# sent, so it doesn't serve any functional purpose. For now, we'll just log it.
|
||||
if msg.get("final") is True:
|
||||
logger.trace(f"Received final message for context {received_ctx_id}")
|
||||
continue
|
||||
|
||||
# Check if this message belongs to the current context.
|
||||
if not self.audio_context_available(received_ctx_id):
|
||||
if self._context_id == received_ctx_id:
|
||||
logger.debug(
|
||||
f"Received a delayed message, recreating the context: {self._context_id}"
|
||||
)
|
||||
await self.create_audio_context(self._context_id)
|
||||
else:
|
||||
# This can happen if a message is received _after_ we have closed a context
|
||||
# due to user interruption but _before_ the `isFinal` message for the context
|
||||
# is received.
|
||||
logger.debug(f"Ignoring message from unavailable context: {received_ctx_id}")
|
||||
continue
|
||||
|
||||
if msg.get("audio"):
|
||||
elif msg.get("audio"):
|
||||
await self.stop_ttfb_metrics()
|
||||
audio = base64.b64decode(msg["audio"])
|
||||
frame = TTSAudioRawFrame(audio, self.sample_rate, 1)
|
||||
await self.append_to_audio_context(received_ctx_id, frame)
|
||||
frame = TTSAudioRawFrame(
|
||||
audio=base64.b64decode(msg["audio"]),
|
||||
sample_rate=self.sample_rate,
|
||||
num_channels=1,
|
||||
)
|
||||
await self.push_frame(frame)
|
||||
elif msg.get("error_code"):
|
||||
await self.push_frame(TTSStoppedFrame())
|
||||
await self.stop_all_metrics()
|
||||
await self.push_error(error_msg=f"Error: {msg['message']}")
|
||||
else:
|
||||
await self.push_error(error_msg=f"Unknown message type: {msg}")
|
||||
|
||||
async def _keepalive_task_handler(self):
|
||||
"""Send periodic keepalive messages to maintain WebSocket connection."""
|
||||
KEEPALIVE_SLEEP = 10
|
||||
KEEPALIVE_SLEEP = 3
|
||||
while True:
|
||||
await asyncio.sleep(KEEPALIVE_SLEEP)
|
||||
try:
|
||||
if self._websocket and self._websocket.state is State.OPEN:
|
||||
if self._context_id:
|
||||
keepalive_message = {
|
||||
"transcript": " ",
|
||||
"context_id": self._context_id,
|
||||
}
|
||||
logger.trace("Sending keepalive message")
|
||||
else:
|
||||
# It's possible to have a user interruption which clears the context
|
||||
# without generating a new TTS response. In this case, we'll just send
|
||||
# an empty message to keep the connection alive.
|
||||
keepalive_message = {"transcript": " "}
|
||||
logger.trace("Sending keepalive without context")
|
||||
keepalive_message = {"transcript": " "}
|
||||
logger.trace("Sending keepalive message")
|
||||
await self._websocket.send(json.dumps(keepalive_message))
|
||||
except websockets.ConnectionClosed as e:
|
||||
logger.warning(f"{self} keepalive error: {e}")
|
||||
break
|
||||
|
||||
async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
|
||||
"""Handle interruption by closing the current context."""
|
||||
await super()._handle_interruption(frame, direction)
|
||||
|
||||
# Close the current context when interrupted without closing the websocket
|
||||
if self._context_id and self._websocket:
|
||||
try:
|
||||
await self._websocket.send(
|
||||
json.dumps(
|
||||
{"context_id": self._context_id, "close_context": True, "transcript": ""}
|
||||
)
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"Error closing context on interruption: {e}")
|
||||
self._context_id = None
|
||||
self._started = False
|
||||
|
||||
@traced_tts
|
||||
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
|
||||
"""Generate speech from text using Async API websocket endpoint.
|
||||
@@ -383,29 +336,21 @@ class AsyncAITTSService(AudioContextTTSService):
|
||||
if not self._websocket or self._websocket.state is State.CLOSED:
|
||||
await self._connect()
|
||||
|
||||
if not self._started:
|
||||
await self.start_ttfb_metrics()
|
||||
yield TTSStartedFrame()
|
||||
self._started = True
|
||||
|
||||
msg = self._build_msg(text=text, force=True)
|
||||
|
||||
try:
|
||||
if not self._started:
|
||||
await self.start_ttfb_metrics()
|
||||
yield TTSStartedFrame()
|
||||
self._started = True
|
||||
|
||||
if not self._context_id:
|
||||
self._context_id = str(uuid.uuid4())
|
||||
if not self.audio_context_available(self._context_id):
|
||||
await self.create_audio_context(self._context_id)
|
||||
|
||||
msg = self._build_msg(text=text, force=True, context_id=self._context_id)
|
||||
await self._get_websocket().send(msg)
|
||||
await self.start_tts_usage_metrics(text)
|
||||
else:
|
||||
if self._websocket and self._context_id:
|
||||
msg = self._build_msg(text=text, force=True, context_id=self._context_id)
|
||||
await self._get_websocket().send(msg)
|
||||
|
||||
await self._get_websocket().send(msg)
|
||||
await self.start_tts_usage_metrics(text)
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
|
||||
yield TTSStoppedFrame()
|
||||
self._started = False
|
||||
await self._disconnect()
|
||||
await self._connect()
|
||||
return
|
||||
yield None
|
||||
except Exception as e:
|
||||
@@ -545,14 +490,7 @@ class AsyncAIHttpTTSService(TTSService):
|
||||
await self.push_error(error_msg=f"Async API error: {error_text}")
|
||||
raise Exception(f"Async API returned status {response.status}: {error_text}")
|
||||
|
||||
# Read streaming bytes; stop TTFB on the *first* received chunk
|
||||
buffer = bytearray()
|
||||
async for chunk in response.content.iter_chunked(64 * 1024):
|
||||
if not chunk:
|
||||
continue
|
||||
await self.stop_ttfb_metrics()
|
||||
buffer.extend(chunk)
|
||||
audio_data = bytes(buffer)
|
||||
audio_data = await response.read()
|
||||
|
||||
await self.start_tts_usage_metrics(text)
|
||||
|
||||
|
||||
@@ -38,7 +38,6 @@ from pipecat.frames.frames import (
|
||||
LLMContextFrame,
|
||||
LLMFullResponseEndFrame,
|
||||
LLMFullResponseStartFrame,
|
||||
LLMTextFrame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
TTSAudioRawFrame,
|
||||
@@ -1078,7 +1077,9 @@ class AWSNovaSonicLLMService(LLMService):
|
||||
logger.debug(f"Assistant response text added: {text}")
|
||||
|
||||
# Report the text of the assistant response.
|
||||
await self._push_assistant_response_text_frames(text)
|
||||
frame = TTSTextFrame(text, aggregated_by=AggregationType.SENTENCE)
|
||||
frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(frame)
|
||||
|
||||
# HACK: here we're also buffering the assistant text ourselves as a
|
||||
# backup rather than relying solely on the assistant context aggregator
|
||||
@@ -1111,7 +1112,11 @@ class AWSNovaSonicLLMService(LLMService):
|
||||
# TTSTextFrame would be ignored otherwise (the interruption frame
|
||||
# would have cleared the assistant aggregator state).
|
||||
await self.push_frame(LLMFullResponseStartFrame())
|
||||
await self._push_assistant_response_text_frames(self._assistant_text_buffer)
|
||||
frame = TTSTextFrame(
|
||||
self._assistant_text_buffer, aggregated_by=AggregationType.SENTENCE
|
||||
)
|
||||
frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(frame)
|
||||
self._may_need_repush_assistant_text = False
|
||||
|
||||
# Report the end of the assistant response.
|
||||
@@ -1123,25 +1128,6 @@ class AWSNovaSonicLLMService(LLMService):
|
||||
# Clear out the buffered assistant text
|
||||
self._assistant_text_buffer = ""
|
||||
|
||||
async def _push_assistant_response_text_frames(self, text: str):
|
||||
# In a typical "cascade" LLM + TTS setup, LLMTextFrames would not
|
||||
# proceed beyond the TTS service. Therefore, since a speech-to-speech
|
||||
# service like Nova Sonic combines both LLM and TTS functionality, you
|
||||
# would think we wouldn't need to push LLMTextFrames at all. However,
|
||||
# RTVI relies on LLMTextFrames being pushed to trigger its
|
||||
# "bot-llm-text" event. So here we push an LLMTextFrame, too, but avoid
|
||||
# appending it to context to avoid context message duplication.
|
||||
|
||||
# Push LLMTextFrame
|
||||
llm_text_frame = LLMTextFrame(text)
|
||||
llm_text_frame.append_to_context = False
|
||||
await self.push_frame(llm_text_frame)
|
||||
|
||||
# Push TTSTextFrame
|
||||
tts_text_frame = TTSTextFrame(text, aggregated_by=AggregationType.SENTENCE)
|
||||
tts_text_frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(tts_text_frame)
|
||||
|
||||
#
|
||||
# user transcription reporting
|
||||
#
|
||||
@@ -1201,7 +1187,7 @@ class AWSNovaSonicLLMService(LLMService):
|
||||
logger.debug(
|
||||
"Wrapping assistant response trigger transcription with upstream UserStarted/StoppedSpeakingFrames"
|
||||
)
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
await self.push_frame(UserStartedSpeakingFrame(), direction=FrameDirection.UPSTREAM)
|
||||
|
||||
# Send the transcription upstream for the user context aggregator
|
||||
frame = TranscriptionFrame(
|
||||
@@ -1211,7 +1197,7 @@ class AWSNovaSonicLLMService(LLMService):
|
||||
|
||||
# Finish wrapping the upstream transcription in UserStarted/StoppedSpeakingFrames if needed
|
||||
if should_wrap_in_user_started_stopped_speaking_frames:
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame(), direction=FrameDirection.UPSTREAM)
|
||||
|
||||
# Clear out the buffered user text
|
||||
self._user_text_buffer = ""
|
||||
|
||||
@@ -277,8 +277,6 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
self._started = False
|
||||
self._first_chunk = True
|
||||
self._cumulative_audio_offset: float = 0.0 # Cumulative audio duration in seconds
|
||||
self._last_word: Optional[str] = None # Track last word for punctuation merging
|
||||
self._last_timestamp: Optional[float] = None # Track last timestamp
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
@@ -348,34 +346,9 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
await self.cancel_task(self._word_processor_task)
|
||||
self._word_processor_task = None
|
||||
|
||||
def _is_cjk_language(self) -> bool:
|
||||
"""Check if the configured language is CJK (Chinese, Japanese, Korean).
|
||||
|
||||
Returns:
|
||||
True if the language is CJK, False otherwise.
|
||||
"""
|
||||
language = self._settings.get("language", "").lower()
|
||||
# Check if language starts with CJK language codes
|
||||
return language.startswith(("zh", "ja", "ko", "cmn", "yue", "wuu"))
|
||||
|
||||
def _is_punctuation_only(self, text: str) -> bool:
|
||||
"""Check if text consists only of punctuation and whitespace.
|
||||
|
||||
Args:
|
||||
text: Text to check.
|
||||
|
||||
Returns:
|
||||
True if text is only punctuation/whitespace, False otherwise.
|
||||
"""
|
||||
return text and all(not c.isalnum() for c in text)
|
||||
|
||||
def _handle_word_boundary(self, evt):
|
||||
"""Handle word boundary events from Azure SDK.
|
||||
|
||||
Azure sends punctuation as separate word boundaries, and breaks CJK text
|
||||
into individual characters/particles. This method routes to language-specific
|
||||
handlers to properly merge and emit word boundaries.
|
||||
|
||||
Args:
|
||||
evt: SpeechSynthesisWordBoundaryEventArgs from Azure Speech SDK
|
||||
containing word text and audio offset timing.
|
||||
@@ -389,75 +362,13 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
# Add cumulative offset to get absolute timestamp across sentences
|
||||
absolute_seconds = self._cumulative_audio_offset + sentence_relative_seconds
|
||||
|
||||
if not word:
|
||||
return
|
||||
|
||||
# Route to language-specific handler
|
||||
if self._is_cjk_language():
|
||||
self._handle_cjk_word_boundary(word, absolute_seconds)
|
||||
else:
|
||||
self._handle_non_cjk_word_boundary(word, absolute_seconds)
|
||||
|
||||
def _emit_pending_word(self):
|
||||
"""Emit the currently buffered word if one exists."""
|
||||
if self._last_word is not None:
|
||||
self._word_boundary_queue.put_nowait((self._last_word, self._last_timestamp))
|
||||
self._last_word = None
|
||||
self._last_timestamp = None
|
||||
|
||||
def _handle_cjk_word_boundary(self, word: str, timestamp: float):
|
||||
"""Handle word boundaries for CJK languages (Chinese, Japanese, Korean).
|
||||
|
||||
CJK languages don't use spaces between words, so we merge characters together
|
||||
and only emit at natural break points (punctuation or whitespace boundaries).
|
||||
Without this logic, we don't get word output for CJK languages.
|
||||
|
||||
Args:
|
||||
word: The word/character from Azure.
|
||||
timestamp: Timestamp in seconds.
|
||||
"""
|
||||
# First word: just store it
|
||||
if self._last_word is None:
|
||||
self._last_word = word
|
||||
self._last_timestamp = timestamp
|
||||
return
|
||||
|
||||
# Punctuation: merge and emit (natural break)
|
||||
if self._is_punctuation_only(word):
|
||||
self._last_word += word
|
||||
self._emit_pending_word()
|
||||
return
|
||||
|
||||
# Whitespace: emit before boundary, start new segment
|
||||
if word.strip() != word:
|
||||
self._emit_pending_word()
|
||||
self._last_word = word
|
||||
self._last_timestamp = timestamp
|
||||
return
|
||||
|
||||
# Default: continue merging CJK characters
|
||||
self._last_word += word
|
||||
|
||||
def _handle_non_cjk_word_boundary(self, word: str, timestamp: float):
|
||||
"""Handle word boundaries for non-CJK languages.
|
||||
|
||||
Non-CJK languages use spaces between words, so we emit each word separately
|
||||
after merging any trailing punctuation.
|
||||
|
||||
Args:
|
||||
word: The word from Azure.
|
||||
timestamp: Timestamp in seconds.
|
||||
"""
|
||||
# Punctuation: merge with previous word (don't emit yet)
|
||||
if self._is_punctuation_only(word) and self._last_word is not None:
|
||||
self._last_word += word
|
||||
return
|
||||
|
||||
# Regular word: emit previous, store current
|
||||
if self._last_word is not None:
|
||||
self._word_boundary_queue.put_nowait((self._last_word, self._last_timestamp))
|
||||
self._last_word = word
|
||||
self._last_timestamp = timestamp
|
||||
# Queue word timestamp for async processing
|
||||
# Use thread-safe queue since this is called from Azure SDK thread
|
||||
if word:
|
||||
logger.trace(f"{self}: Word boundary - '{word}' at {absolute_seconds:.2f}s")
|
||||
# Put in temporary queue - will be processed by async task
|
||||
# Store as (word, timestamp_in_seconds) tuple
|
||||
self._word_boundary_queue.put_nowait((word, absolute_seconds))
|
||||
|
||||
async def _word_processor_task_handler(self):
|
||||
"""Process word timestamps from the queue and call add_word_timestamps."""
|
||||
@@ -486,12 +397,6 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
Args:
|
||||
evt: Completion event from Azure Speech SDK.
|
||||
"""
|
||||
# Flush any pending word before completing
|
||||
if self._last_word is not None:
|
||||
self._word_boundary_queue.put_nowait((self._last_word, self._last_timestamp))
|
||||
self._last_word = None
|
||||
self._last_timestamp = None
|
||||
|
||||
# Update cumulative audio offset for next sentence
|
||||
if evt.result and evt.result.audio_duration:
|
||||
self._cumulative_audio_offset += evt.result.audio_duration.total_seconds()
|
||||
@@ -530,8 +435,6 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
self._started = False
|
||||
self._first_chunk = True
|
||||
self._cumulative_audio_offset = 0.0
|
||||
self._last_word = None
|
||||
self._last_timestamp = None
|
||||
|
||||
async def flush_audio(self):
|
||||
"""Flush any pending audio data."""
|
||||
|
||||
@@ -1,5 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024–2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
@@ -1,323 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024–2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Camb.ai MARS text-to-speech service implementation.
|
||||
|
||||
This module provides TTS functionality using Camb.ai's MARS model family,
|
||||
offering high-quality text-to-speech synthesis with streaming support.
|
||||
|
||||
Features:
|
||||
- MARS models: mars-flash (fast), mars-pro (high quality)
|
||||
- 140+ languages supported
|
||||
- Real-time streaming via official SDK
|
||||
- Model-specific sample rates: mars-pro (48kHz), mars-flash (22.05kHz)
|
||||
"""
|
||||
|
||||
from typing import Any, AsyncGenerator, Dict, Optional
|
||||
|
||||
from camb import StreamTtsOutputConfiguration
|
||||
from camb.client import AsyncCambAI
|
||||
from loguru import logger
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
StartFrame,
|
||||
TTSAudioRawFrame,
|
||||
TTSStartedFrame,
|
||||
TTSStoppedFrame,
|
||||
)
|
||||
from pipecat.services.tts_service import TTSService
|
||||
from pipecat.transcriptions.language import Language, resolve_language
|
||||
from pipecat.utils.tracing.service_decorators import traced_tts
|
||||
|
||||
# Model-specific sample rates
|
||||
MODEL_SAMPLE_RATES: Dict[str, int] = {
|
||||
"mars-flash": 22050, # 22.05kHz
|
||||
"mars-pro": 48000, # 48kHz
|
||||
"mars-instruct": 22050, # 22.05kHz
|
||||
}
|
||||
|
||||
|
||||
def language_to_camb_language(language: Language) -> Optional[str]:
|
||||
"""Convert a Pipecat Language enum to Camb.ai language code.
|
||||
|
||||
Args:
|
||||
language: The Language enum value to convert.
|
||||
|
||||
Returns:
|
||||
The corresponding Camb.ai language code (BCP-47 format), or None if not supported.
|
||||
"""
|
||||
LANGUAGE_MAP = {
|
||||
Language.EN: "en-us",
|
||||
Language.EN_US: "en-us",
|
||||
Language.EN_GB: "en-gb",
|
||||
Language.EN_AU: "en-au",
|
||||
Language.ES: "es-es",
|
||||
Language.ES_ES: "es-es",
|
||||
Language.ES_MX: "es-mx",
|
||||
Language.FR: "fr-fr",
|
||||
Language.FR_FR: "fr-fr",
|
||||
Language.FR_CA: "fr-ca",
|
||||
Language.DE: "de-de",
|
||||
Language.DE_DE: "de-de",
|
||||
Language.IT: "it-it",
|
||||
Language.PT: "pt-pt",
|
||||
Language.PT_BR: "pt-br",
|
||||
Language.PT_PT: "pt-pt",
|
||||
Language.NL: "nl-nl",
|
||||
Language.PL: "pl-pl",
|
||||
Language.RU: "ru-ru",
|
||||
Language.JA: "ja-jp",
|
||||
Language.KO: "ko-kr",
|
||||
Language.ZH: "zh-cn",
|
||||
Language.ZH_CN: "zh-cn",
|
||||
Language.ZH_TW: "zh-tw",
|
||||
Language.AR: "ar-sa",
|
||||
Language.HI: "hi-in",
|
||||
Language.TR: "tr-tr",
|
||||
Language.VI: "vi-vn",
|
||||
Language.TH: "th-th",
|
||||
Language.ID: "id-id",
|
||||
Language.MS: "ms-my",
|
||||
Language.SV: "sv-se",
|
||||
Language.DA: "da-dk",
|
||||
Language.NO: "no-no",
|
||||
Language.FI: "fi-fi",
|
||||
Language.CS: "cs-cz",
|
||||
Language.EL: "el-gr",
|
||||
Language.HE: "he-il",
|
||||
Language.HU: "hu-hu",
|
||||
Language.RO: "ro-ro",
|
||||
Language.SK: "sk-sk",
|
||||
Language.UK: "uk-ua",
|
||||
Language.BG: "bg-bg",
|
||||
Language.HR: "hr-hr",
|
||||
Language.SR: "sr-rs",
|
||||
Language.SL: "sl-si",
|
||||
Language.CA: "ca-es",
|
||||
Language.EU: "eu-es",
|
||||
Language.GL: "gl-es",
|
||||
Language.AF: "af-za",
|
||||
Language.SW: "sw-ke",
|
||||
Language.TA: "ta-in",
|
||||
Language.TE: "te-in",
|
||||
Language.BN: "bn-in",
|
||||
Language.MR: "mr-in",
|
||||
Language.GU: "gu-in",
|
||||
Language.KN: "kn-in",
|
||||
Language.ML: "ml-in",
|
||||
Language.PA: "pa-in",
|
||||
Language.UR: "ur-pk",
|
||||
Language.FA: "fa-ir",
|
||||
Language.TL: "tl-ph",
|
||||
}
|
||||
|
||||
return resolve_language(language, LANGUAGE_MAP, use_base_code=True)
|
||||
|
||||
|
||||
def _get_aligned_audio(buffer: bytes) -> tuple[bytes, bytes]:
|
||||
"""Split buffer into aligned audio (2-byte samples) and remainder.
|
||||
|
||||
Args:
|
||||
buffer: Raw audio bytes to align.
|
||||
|
||||
Returns:
|
||||
Tuple of (aligned audio bytes, remaining bytes).
|
||||
"""
|
||||
aligned_size = (len(buffer) // 2) * 2
|
||||
return buffer[:aligned_size], buffer[aligned_size:]
|
||||
|
||||
|
||||
class CambTTSService(TTSService):
|
||||
"""Camb.ai MARS text-to-speech service using the official SDK.
|
||||
|
||||
Converts text to speech using Camb.ai's MARS TTS models with support for
|
||||
multiple languages.
|
||||
|
||||
Models:
|
||||
- mars-flash: Fast inference, 22.05kHz output (default)
|
||||
- mars-pro: High quality, 48kHz output
|
||||
|
||||
Example::
|
||||
|
||||
# Basic usage with mars-flash (fast)
|
||||
tts = CambTTSService(api_key="your-api-key", model="mars-flash")
|
||||
|
||||
# High quality with mars-pro
|
||||
tts = CambTTSService(
|
||||
api_key="your-api-key",
|
||||
voice_id=12345,
|
||||
model="mars-pro",
|
||||
)
|
||||
"""
|
||||
|
||||
class InputParams(BaseModel):
|
||||
"""Input parameters for Camb.ai TTS configuration.
|
||||
|
||||
Parameters:
|
||||
language: Language for synthesis (BCP-47 format). Defaults to English.
|
||||
user_instructions: Custom instructions for mars-instruct model only.
|
||||
Ignored for other models. Max 1000 characters.
|
||||
"""
|
||||
|
||||
language: Optional[Language] = Language.EN
|
||||
user_instructions: Optional[str] = Field(
|
||||
default=None,
|
||||
max_length=1000,
|
||||
description="Custom instructions for mars-instruct model only. "
|
||||
"Use to control tone, style, or pronunciation. Max 1000 characters.",
|
||||
)
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
api_key: str,
|
||||
voice_id: int = 147320,
|
||||
model: str = "mars-flash",
|
||||
timeout: float = 60.0,
|
||||
sample_rate: Optional[int] = None,
|
||||
params: Optional[InputParams] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Camb.ai TTS service.
|
||||
|
||||
Args:
|
||||
api_key: Camb.ai API key for authentication.
|
||||
voice_id: Voice ID to use. Defaults to 147320.
|
||||
model: TTS model to use. Options: "mars-flash" (fast), "mars-pro" (high quality).
|
||||
Defaults to "mars-flash".
|
||||
timeout: Request timeout in seconds. Defaults to 60.0 (minimum recommended
|
||||
by Camb.ai).
|
||||
sample_rate: Audio sample rate in Hz. If None, uses model-specific default.
|
||||
params: Additional voice parameters. If None, uses defaults.
|
||||
**kwargs: Additional arguments passed to parent TTSService.
|
||||
"""
|
||||
super().__init__(sample_rate=sample_rate, **kwargs)
|
||||
|
||||
params = params or CambTTSService.InputParams()
|
||||
|
||||
self._client = AsyncCambAI(api_key=api_key, timeout=timeout)
|
||||
|
||||
# Warn if sample rate doesn't match model's supported rate
|
||||
if sample_rate and sample_rate != MODEL_SAMPLE_RATES.get(model):
|
||||
logger.warning(
|
||||
f"Camb.ai's {model} model only supports {MODEL_SAMPLE_RATES.get(model)}Hz "
|
||||
f"sample rate. Current rate of {sample_rate}Hz may cause issues."
|
||||
)
|
||||
|
||||
# Build settings
|
||||
self._settings = {
|
||||
"language": (
|
||||
self.language_to_service_language(params.language) if params.language else "en-us"
|
||||
),
|
||||
"user_instructions": params.user_instructions,
|
||||
}
|
||||
|
||||
self.set_model_name(model)
|
||||
self.set_voice(str(voice_id))
|
||||
self._voice_id = voice_id
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True, as Camb.ai service supports metrics generation.
|
||||
"""
|
||||
return True
|
||||
|
||||
def language_to_service_language(self, language: Language) -> Optional[str]:
|
||||
"""Convert a Language enum to Camb.ai language format.
|
||||
|
||||
Args:
|
||||
language: The language to convert.
|
||||
|
||||
Returns:
|
||||
The Camb.ai-specific language code, or None if not supported.
|
||||
"""
|
||||
return language_to_camb_language(language)
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
"""Start the Camb.ai TTS service.
|
||||
|
||||
Args:
|
||||
frame: The start frame containing initialization parameters.
|
||||
"""
|
||||
await super().start(frame)
|
||||
|
||||
# Use model-specific sample rate if not explicitly specified
|
||||
if not self._init_sample_rate:
|
||||
self._sample_rate = MODEL_SAMPLE_RATES.get(self.model_name, 22050)
|
||||
|
||||
@traced_tts
|
||||
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
|
||||
"""Generate speech from text using Camb.ai's TTS API.
|
||||
|
||||
Args:
|
||||
text: The text to synthesize into speech (max 3000 characters).
|
||||
|
||||
Yields:
|
||||
Frame: Audio frames containing the synthesized speech.
|
||||
"""
|
||||
logger.debug(f"{self}: Generating TTS [{text}]")
|
||||
|
||||
# Validate text length
|
||||
if len(text) > 3000:
|
||||
logger.warning("Text too long for Camb.ai TTS (max 3000 chars), truncating")
|
||||
text = text[:3000]
|
||||
|
||||
try:
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
# Build SDK parameters
|
||||
tts_kwargs: Dict[str, Any] = {
|
||||
"text": text,
|
||||
"voice_id": self._voice_id,
|
||||
"language": self._settings["language"],
|
||||
"speech_model": self.model_name,
|
||||
"output_configuration": StreamTtsOutputConfiguration(format="pcm_s16le"),
|
||||
}
|
||||
|
||||
# Add user instructions if using mars-instruct model
|
||||
if self._model_name == "mars-instruct" and self._settings.get("user_instructions"):
|
||||
tts_kwargs["user_instructions"] = self._settings["user_instructions"]
|
||||
|
||||
await self.start_tts_usage_metrics(text)
|
||||
yield TTSStartedFrame()
|
||||
|
||||
# Buffer for aligning chunks to 2-byte boundaries (16-bit PCM)
|
||||
audio_buffer = b""
|
||||
|
||||
# Stream audio chunks from SDK
|
||||
async for chunk in self._client.text_to_speech.tts(**tts_kwargs):
|
||||
if chunk:
|
||||
await self.stop_ttfb_metrics()
|
||||
audio_buffer += chunk
|
||||
|
||||
# Only yield complete 16-bit samples (2 bytes per sample)
|
||||
aligned_audio, audio_buffer = _get_aligned_audio(audio_buffer)
|
||||
if aligned_audio:
|
||||
yield TTSAudioRawFrame(
|
||||
audio=aligned_audio,
|
||||
sample_rate=self.sample_rate,
|
||||
num_channels=1,
|
||||
)
|
||||
|
||||
# Yield any remaining complete samples
|
||||
if len(audio_buffer) >= 2:
|
||||
aligned_audio, _ = _get_aligned_audio(audio_buffer)
|
||||
if aligned_audio:
|
||||
yield TTSAudioRawFrame(
|
||||
audio=aligned_audio,
|
||||
sample_rate=self.sample_rate,
|
||||
num_channels=1,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Camb.ai TTS error: {e}")
|
||||
finally:
|
||||
yield TTSStoppedFrame()
|
||||
@@ -676,7 +676,8 @@ class DeepgramFluxSTTService(WebsocketSTTService):
|
||||
|
||||
await self._handle_transcription(transcript, True, self._language)
|
||||
await self.stop_processing_metrics()
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame(), FrameDirection.DOWNSTREAM)
|
||||
await self.push_frame(UserStoppedSpeakingFrame(), FrameDirection.UPSTREAM)
|
||||
await self._call_event_handler("on_end_of_turn", transcript)
|
||||
|
||||
async def _handle_eager_end_of_turn(self, transcript: str, data: Dict[str, Any]):
|
||||
|
||||
@@ -1710,26 +1710,11 @@ class GeminiLiveLLMService(LLMService):
|
||||
await self.push_frame(TTSStartedFrame())
|
||||
await self.push_frame(LLMFullResponseStartFrame())
|
||||
|
||||
await self._push_output_transcription_text_frames(text)
|
||||
frame = TTSTextFrame(text=text, aggregated_by=AggregationType.SENTENCE)
|
||||
# Gemini Live text already includes any necessary inter-chunk spaces
|
||||
frame.includes_inter_frame_spaces = True
|
||||
|
||||
async def _push_output_transcription_text_frames(self, text: str):
|
||||
# In a typical "cascade" LLM + TTS setup, LLMTextFrames would not
|
||||
# proceed beyond the TTS service. Therefore, since a speech-to-speech
|
||||
# service like Gemini Live combines both LLM and TTS functionality, you
|
||||
# might think we wouldn't need to push LLMTextFrames at all. However,
|
||||
# RTVI relies on LLMTextFrames being pushed to trigger its
|
||||
# "bot-llm-text" event. So here we push an LLMTextFrame, too, but avoid
|
||||
# appending it to context to avoid context message duplication.
|
||||
|
||||
# Push LLMTextFrame
|
||||
llm_text_frame = LLMTextFrame(text)
|
||||
llm_text_frame.append_to_context = False
|
||||
await self.push_frame(llm_text_frame)
|
||||
|
||||
# Push TTSTextFrame
|
||||
tts_text_frame = TTSTextFrame(text, aggregated_by=AggregationType.SENTENCE)
|
||||
tts_text_frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(tts_text_frame)
|
||||
await self.push_frame(frame)
|
||||
|
||||
async def _handle_msg_grounding_metadata(self, message: LiveServerMessage):
|
||||
"""Handle dedicated grounding metadata messages."""
|
||||
|
||||
@@ -33,7 +33,6 @@ from pipecat.frames.frames import (
|
||||
LLMFullResponseStartFrame,
|
||||
LLMMessagesAppendFrame,
|
||||
LLMSetToolsFrame,
|
||||
LLMTextFrame,
|
||||
LLMUpdateSettingsFrame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
@@ -620,26 +619,9 @@ class GrokRealtimeLLMService(LLMService):
|
||||
async def _handle_evt_audio_transcript_delta(self, evt):
|
||||
"""Handle audio transcript delta event."""
|
||||
if evt.delta:
|
||||
await self._push_output_transcript_text_frames(evt.delta)
|
||||
|
||||
async def _push_output_transcript_text_frames(self, text: str):
|
||||
# In a typical "cascade" LLM + TTS setup, LLMTextFrames would not
|
||||
# proceed beyond the TTS service. Therefore, since a speech-to-speech
|
||||
# service like Grok Realtime combines both LLM and TTS functionality,
|
||||
# you might think we wouldn't need to push LLMTextFrames at all.
|
||||
# However, RTVI relies on LLMTextFrames being pushed to trigger its
|
||||
# "bot-llm-text" event. So here we push an LLMTextFrame, too, but avoid
|
||||
# appending it to context to avoid context message duplication.
|
||||
|
||||
# Push LLMTextFrame
|
||||
llm_text_frame = LLMTextFrame(text)
|
||||
llm_text_frame.append_to_context = False
|
||||
await self.push_frame(llm_text_frame)
|
||||
|
||||
# Push TTSTextFrame
|
||||
tts_text_frame = TTSTextFrame(text, aggregated_by=AggregationType.SENTENCE)
|
||||
tts_text_frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(tts_text_frame)
|
||||
frame = TTSTextFrame(evt.delta, aggregated_by=AggregationType.SENTENCE)
|
||||
frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(frame)
|
||||
|
||||
async def _handle_evt_function_call_arguments_done(self, evt):
|
||||
"""Handle function call arguments done event."""
|
||||
@@ -677,7 +659,7 @@ class GrokRealtimeLLMService(LLMService):
|
||||
"""Handle speech stopped event from VAD."""
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame())
|
||||
|
||||
async def _handle_evt_error(self, evt):
|
||||
"""Handle error event."""
|
||||
@@ -752,14 +734,6 @@ class GrokRealtimeLLMService(LLMService):
|
||||
|
||||
async def _send_user_audio(self, frame):
|
||||
"""Send user audio to Grok."""
|
||||
# Don't send audio if conversation setup is still pending, as it can
|
||||
# lead to errors. For example: audio sent before conversation setup
|
||||
# will be interpreted as having Grok's default sample rate (24000),
|
||||
# and if that differs from the sample rate we eventually set through
|
||||
# the conversation setup, Grok will error out.
|
||||
if self._llm_needs_conversation_setup:
|
||||
return
|
||||
|
||||
payload = base64.b64encode(frame.audio).decode("utf-8")
|
||||
await self.send_client_event(events.InputAudioBufferAppendEvent(audio=payload))
|
||||
|
||||
|
||||
@@ -1,160 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""[Hathora-hosted](https://models.hathora.dev) speech-to-text services."""
|
||||
|
||||
import base64
|
||||
import os
|
||||
from typing import AsyncGenerator, Optional
|
||||
|
||||
import aiohttp
|
||||
from pydantic import BaseModel
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
TranscriptionFrame,
|
||||
)
|
||||
from pipecat.services.stt_service import SegmentedSTTService
|
||||
from pipecat.transcriptions.language import Language
|
||||
from pipecat.utils.time import time_now_iso8601
|
||||
from pipecat.utils.tracing.service_decorators import traced_stt
|
||||
|
||||
from .utils import ConfigOption
|
||||
|
||||
|
||||
class HathoraSTTService(SegmentedSTTService):
|
||||
"""This service supports several different speech-to-text models hosted by Hathora.
|
||||
|
||||
[Documentation](https://models.hathora.dev)
|
||||
"""
|
||||
|
||||
class InputParams(BaseModel):
|
||||
"""Optional input parameters for Hathora STT configuration.
|
||||
|
||||
Parameters:
|
||||
language: Language code (if supported by model).
|
||||
config: Some models support additional config, refer to
|
||||
[docs](https://models.hathora.dev) for each model to see
|
||||
what is supported.
|
||||
"""
|
||||
|
||||
language: Optional[str] = None
|
||||
config: Optional[list[ConfigOption]] = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
model: str,
|
||||
sample_rate: Optional[int] = None,
|
||||
api_key: Optional[str] = None,
|
||||
base_url: str = "https://api.models.hathora.dev/inference/v1/stt",
|
||||
params: Optional[InputParams] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Hathora STT service.
|
||||
|
||||
Args:
|
||||
model: Model to use; find available models
|
||||
[here](https://models.hathora.dev).
|
||||
sample_rate: The sample rate for audio input. If None, will be determined
|
||||
from the start frame.
|
||||
api_key: API key for authentication with the Hathora service;
|
||||
provision one [here](https://models.hathora.dev/tokens).
|
||||
base_url: Base API URL for the Hathora STT service.
|
||||
params: Configuration parameters.
|
||||
**kwargs: Additional arguments passed to the parent class.
|
||||
"""
|
||||
super().__init__(
|
||||
sample_rate=sample_rate,
|
||||
**kwargs,
|
||||
)
|
||||
self._model = model
|
||||
self._api_key = api_key or os.getenv("HATHORA_API_KEY")
|
||||
self._base_url = base_url
|
||||
|
||||
params = params or HathoraSTTService.InputParams()
|
||||
|
||||
self._settings = {
|
||||
"language": params.language,
|
||||
"config": params.config,
|
||||
}
|
||||
|
||||
self.set_model_name(model)
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True
|
||||
"""
|
||||
return True
|
||||
|
||||
@traced_stt
|
||||
async def _handle_transcription(
|
||||
self, transcript: str, is_final: bool, language: Optional[Language] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing."""
|
||||
pass
|
||||
|
||||
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
|
||||
"""Run speech-to-text on the provided audio data.
|
||||
|
||||
Args:
|
||||
audio: Raw audio bytes to transcribe.
|
||||
|
||||
Yields:
|
||||
Frame: Frames containing transcription results (typically TextFrame).
|
||||
"""
|
||||
try:
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
url = f"{self._base_url}"
|
||||
|
||||
payload = {
|
||||
"model": self._model,
|
||||
}
|
||||
|
||||
if self._settings["language"] is not None:
|
||||
payload["language"] = self._settings["language"]
|
||||
if self._settings["config"] is not None:
|
||||
payload["model_config"] = [
|
||||
{"name": option.name, "value": option.value}
|
||||
for option in self._settings["config"]
|
||||
]
|
||||
|
||||
base64_audio = base64.b64encode(audio).decode("utf-8")
|
||||
payload["audio"] = base64_audio
|
||||
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.post(
|
||||
url,
|
||||
headers={"Authorization": f"Bearer {self._api_key}"},
|
||||
json=payload,
|
||||
) as resp:
|
||||
response = await resp.json()
|
||||
|
||||
if response and "text" in response:
|
||||
text = response["text"].strip()
|
||||
if text: # Only yield non-empty text
|
||||
# Hathora's API currently doesn't return language info
|
||||
# so we default to the requested language or "en"
|
||||
response_language = self._settings["language"] or "en"
|
||||
await self._handle_transcription(text, True, response_language)
|
||||
yield TranscriptionFrame(
|
||||
text,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
Language(response_language),
|
||||
result=response,
|
||||
)
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
|
||||
@@ -1,173 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""[Hathora-hosted](https://models.hathora.dev) text-to-speech services."""
|
||||
|
||||
import io
|
||||
import os
|
||||
import wave
|
||||
from typing import AsyncGenerator, Optional, Tuple
|
||||
|
||||
import aiohttp
|
||||
from pydantic import BaseModel
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
TTSAudioRawFrame,
|
||||
TTSStartedFrame,
|
||||
TTSStoppedFrame,
|
||||
)
|
||||
from pipecat.services.tts_service import TTSService
|
||||
from pipecat.utils.tracing.service_decorators import traced_tts
|
||||
|
||||
from .utils import ConfigOption
|
||||
|
||||
|
||||
def _decode_audio_payload(
|
||||
audio_bytes: bytes,
|
||||
*,
|
||||
fallback_sample_rate: int = 24000,
|
||||
fallback_channels: int = 1,
|
||||
) -> Tuple[bytes, int, int]:
|
||||
"""Convert a WAV/PCM payload into raw PCM samples for TTSAudioRawFrame."""
|
||||
try:
|
||||
with wave.open(io.BytesIO(audio_bytes), "rb") as wav_reader:
|
||||
channels = wav_reader.getnchannels()
|
||||
sample_rate = wav_reader.getframerate()
|
||||
frames = wav_reader.readframes(wav_reader.getnframes())
|
||||
return frames, sample_rate, channels
|
||||
except (wave.Error, EOFError):
|
||||
# If the payload is already raw PCM, just pass it through.
|
||||
return audio_bytes, fallback_sample_rate, fallback_channels
|
||||
|
||||
|
||||
class HathoraTTSService(TTSService):
|
||||
"""This service supports several different text-to-speech models hosted by Hathora.
|
||||
|
||||
[Documentation](https://models.hathora.dev)
|
||||
"""
|
||||
|
||||
class InputParams(BaseModel):
|
||||
"""Optional input parameters for Hathora TTS configuration.
|
||||
|
||||
Parameters:
|
||||
speed: Speech speed multiplier (if supported by model).
|
||||
config: Some models support additional config, refer to
|
||||
[docs](https://models.hathora.dev) for each model to see
|
||||
what is supported.
|
||||
"""
|
||||
|
||||
speed: Optional[float] = None
|
||||
config: Optional[list[ConfigOption]] = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
model: str,
|
||||
voice_id: Optional[str] = None,
|
||||
sample_rate: Optional[int] = None,
|
||||
api_key: Optional[str] = None,
|
||||
base_url: str = "https://api.models.hathora.dev/inference/v1/tts",
|
||||
params: Optional[InputParams] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Hathora TTS service.
|
||||
|
||||
Args:
|
||||
model: Model to use; find available models
|
||||
[here](https://models.hathora.dev).
|
||||
voice_id: Voice to use for synthesis (if supported by model).
|
||||
sample_rate: Output sample rate for generated audio.
|
||||
api_key: API key for authentication with the Hathora service;
|
||||
provision one [here](https://models.hathora.dev/tokens).
|
||||
base_url: Base API URL for the Hathora TTS service.
|
||||
params: Configuration parameters.
|
||||
**kwargs: Additional arguments passed to the parent class.
|
||||
"""
|
||||
super().__init__(
|
||||
sample_rate=sample_rate,
|
||||
**kwargs,
|
||||
)
|
||||
self._model = model
|
||||
self._api_key = api_key or os.getenv("HATHORA_API_KEY")
|
||||
self._base_url = base_url
|
||||
|
||||
params = params or HathoraTTSService.InputParams()
|
||||
|
||||
self._settings = {
|
||||
"speed": params.speed,
|
||||
"config": params.config,
|
||||
}
|
||||
|
||||
self.set_model_name(model)
|
||||
self.set_voice(voice_id)
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True
|
||||
"""
|
||||
return True
|
||||
|
||||
@traced_tts
|
||||
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
|
||||
"""Run text-to-speech synthesis on the provided text.
|
||||
|
||||
Args:
|
||||
text: The text to synthesize into speech.
|
||||
|
||||
Yields:
|
||||
Frame: Audio frames containing the synthesized speech.
|
||||
"""
|
||||
try:
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
url = f"{self._base_url}"
|
||||
|
||||
payload = {"model": self._model, "text": text}
|
||||
|
||||
if self._voice_id is not None:
|
||||
payload["voice"] = self._voice_id
|
||||
if self._settings["speed"] is not None:
|
||||
payload["speed"] = self._settings["speed"]
|
||||
if self._settings["config"] is not None:
|
||||
payload["model_config"] = [
|
||||
{"name": option.name, "value": option.value}
|
||||
for option in self._settings["config"]
|
||||
]
|
||||
|
||||
yield TTSStartedFrame()
|
||||
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.post(
|
||||
url,
|
||||
headers={"Authorization": f"Bearer {self._api_key}"},
|
||||
json=payload,
|
||||
) as resp:
|
||||
audio_data = await resp.read()
|
||||
|
||||
pcm_audio, sample_rate, num_channels = _decode_audio_payload(
|
||||
audio_data,
|
||||
fallback_sample_rate=self.sample_rate,
|
||||
)
|
||||
|
||||
frame = TTSAudioRawFrame(
|
||||
audio=pcm_audio,
|
||||
sample_rate=self.sample_rate,
|
||||
num_channels=num_channels,
|
||||
)
|
||||
|
||||
yield frame
|
||||
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
|
||||
finally:
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
yield TTSStoppedFrame()
|
||||
@@ -1,22 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Utilities and types for [Hathora-hosted](https://models.hathora.dev) voice services."""
|
||||
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass
|
||||
class ConfigOption:
|
||||
"""Extra configuration option passed into model_config for Hathora (if supported by model).
|
||||
|
||||
Args:
|
||||
name: Name of the configuration option.
|
||||
value: Value of the configuration option.
|
||||
"""
|
||||
|
||||
name: str
|
||||
value: str
|
||||
@@ -121,6 +121,7 @@ class Mem0MemoryService(FrameProcessor):
|
||||
try:
|
||||
logger.debug(f"Storing {len(messages)} messages in Mem0")
|
||||
params = {
|
||||
"async_mode": True,
|
||||
"messages": messages,
|
||||
"metadata": {"platform": "pipecat"},
|
||||
"output_format": "v1.1",
|
||||
|
||||
@@ -1002,14 +1002,17 @@ class TokenDetails(BaseModel):
|
||||
image_tokens: Number of image tokens used (for input only).
|
||||
"""
|
||||
|
||||
model_config = ConfigDict(extra="allow")
|
||||
|
||||
cached_tokens: Optional[int] = 0
|
||||
text_tokens: Optional[int] = 0
|
||||
audio_tokens: Optional[int] = 0
|
||||
cached_tokens_details: Optional[CachedTokensDetails] = None
|
||||
image_tokens: Optional[int] = 0
|
||||
|
||||
class Config:
|
||||
"""Pydantic configuration for TokenDetails."""
|
||||
|
||||
extra = "allow"
|
||||
|
||||
|
||||
class Usage(BaseModel):
|
||||
"""Token usage statistics for a response.
|
||||
|
||||
@@ -724,26 +724,10 @@ class OpenAIRealtimeLLMService(LLMService):
|
||||
# We receive audio transcript deltas (as opposed to text deltas) when
|
||||
# the output modality is "audio" (the default)
|
||||
if evt.delta:
|
||||
await self._push_output_transcript_text_frames(evt.delta)
|
||||
|
||||
async def _push_output_transcript_text_frames(self, text: str):
|
||||
# In a typical "cascade" LLM + TTS setup, LLMTextFrames would not
|
||||
# proceed beyond the TTS service. Therefore, since a speech-to-speech
|
||||
# service like OpenAI Realtime combines both LLM and TTS functionality,
|
||||
# you might think we wouldn't need to push LLMTextFrames at all.
|
||||
# However, RTVI relies on LLMTextFrames being pushed to trigger its
|
||||
# "bot-llm-text" event. So here we push an LLMTextFrame, too, but avoid
|
||||
# appending it to context to avoid context message duplication.
|
||||
|
||||
# Push LLMTextFrame
|
||||
llm_text_frame = LLMTextFrame(text)
|
||||
llm_text_frame.append_to_context = False
|
||||
await self.push_frame(llm_text_frame)
|
||||
|
||||
# Push TTSTextFrame
|
||||
tts_text_frame = TTSTextFrame(text, aggregated_by=AggregationType.SENTENCE)
|
||||
tts_text_frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(tts_text_frame)
|
||||
frame = TTSTextFrame(evt.delta, aggregated_by=AggregationType.SENTENCE)
|
||||
# OpenAI Realtime text already includes any necessary inter-chunk spaces
|
||||
frame.includes_inter_frame_spaces = True
|
||||
await self.push_frame(frame)
|
||||
|
||||
async def _handle_evt_function_call_arguments_done(self, evt):
|
||||
"""Handle completion of function call arguments.
|
||||
@@ -792,7 +776,7 @@ class OpenAIRealtimeLLMService(LLMService):
|
||||
async def _handle_evt_speech_stopped(self, evt):
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame())
|
||||
|
||||
async def _maybe_handle_evt_retrieve_conversation_item_error(self, evt: events.ErrorEvent):
|
||||
"""Maybe handle an error event related to retrieving a conversation item.
|
||||
|
||||
@@ -877,12 +877,15 @@ class TokenDetails(BaseModel):
|
||||
audio_tokens: Number of audio tokens used. Defaults to 0.
|
||||
"""
|
||||
|
||||
model_config = ConfigDict(extra="allow")
|
||||
|
||||
cached_tokens: Optional[int] = 0
|
||||
text_tokens: Optional[int] = 0
|
||||
audio_tokens: Optional[int] = 0
|
||||
|
||||
class Config:
|
||||
"""Pydantic configuration for TokenDetails."""
|
||||
|
||||
extra = "allow"
|
||||
|
||||
|
||||
class Usage(BaseModel):
|
||||
"""Token usage statistics for a response.
|
||||
|
||||
@@ -662,7 +662,7 @@ class OpenAIRealtimeBetaLLMService(LLMService):
|
||||
async def _handle_evt_speech_stopped(self, evt):
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
await self.push_frame(UserStoppedSpeakingFrame())
|
||||
|
||||
async def _maybe_handle_evt_retrieve_conversation_item_error(self, evt: events.ErrorEvent):
|
||||
"""Maybe handle an error event related to retrieving a conversation item.
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
# Copyright (c) 2024-2025 Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
@@ -11,7 +11,6 @@ input processing, including VAD, turn analysis, and interruption management.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
from typing import Optional
|
||||
|
||||
from loguru import logger
|
||||
@@ -78,11 +77,6 @@ class BaseInputTransport(FrameProcessor):
|
||||
|
||||
# Track user speaking state for interruption logic
|
||||
self._user_speaking = False
|
||||
# Last time a UserSpeakingFrame was pushed.
|
||||
self._user_speaking_frame_time = 0
|
||||
# How often a UserSpeakingFrame should be pushed (value should be
|
||||
# greater than the audio chunks to have any effect).
|
||||
self._user_speaking_frame_period = 0.2
|
||||
|
||||
# Task to process incoming audio (VAD) and push audio frames downstream
|
||||
# if passthrough is enabled.
|
||||
@@ -429,7 +423,7 @@ class BaseInputTransport(FrameProcessor):
|
||||
await self._deprecated_run_turn_analyzer(frame, vad_state, previous_vad_state)
|
||||
|
||||
if vad_state == VADState.SPEAKING:
|
||||
await self._user_currently_speaking()
|
||||
await self.broadcast_frame(UserSpeakingFrame)
|
||||
|
||||
# Push audio downstream if passthrough is set.
|
||||
if self._params.audio_in_passthrough:
|
||||
@@ -450,13 +444,6 @@ class BaseInputTransport(FrameProcessor):
|
||||
else:
|
||||
await self.push_frame(VADUserStoppedSpeakingFrame())
|
||||
|
||||
async def _user_currently_speaking(self):
|
||||
"""Handle user speaking frame."""
|
||||
diff_time = time.time() - self._user_speaking_frame_time
|
||||
if diff_time >= self._user_speaking_frame_period:
|
||||
await self.broadcast_frame(UserSpeakingFrame)
|
||||
self._user_speaking_frame_time = time.time()
|
||||
|
||||
#
|
||||
# DEPRECATED.
|
||||
#
|
||||
|
||||
@@ -403,7 +403,7 @@ class BaseOutputTransport(FrameProcessor):
|
||||
# Last time a BotSpeakingFrame was pushed.
|
||||
self._bot_speaking_frame_time = 0
|
||||
# How often a BotSpeakingFrame should be pushed (value should be
|
||||
# greater than the audio chunks to have any effect).
|
||||
# lower than the audio chunks).
|
||||
self._bot_speaking_frame_period = 0.2
|
||||
# Last time the bot actually spoke.
|
||||
self._bot_speech_last_time = 0
|
||||
@@ -644,7 +644,8 @@ class BaseOutputTransport(FrameProcessor):
|
||||
|
||||
diff_time = time.time() - self._bot_speaking_frame_time
|
||||
if diff_time >= self._bot_speaking_frame_period:
|
||||
await self._transport.broadcast_frame(BotSpeakingFrame)
|
||||
await self._transport.push_frame(BotSpeakingFrame())
|
||||
await self._transport.push_frame(BotSpeakingFrame(), FrameDirection.UPSTREAM)
|
||||
self._bot_speaking_frame_time = time.time()
|
||||
|
||||
self._bot_speech_last_time = time.time()
|
||||
|
||||
@@ -10,11 +10,11 @@ Methods that wrap the Daily API to create rooms, check room URLs, and get meetin
|
||||
"""
|
||||
|
||||
import time
|
||||
from typing import Any, Dict, List, Literal, Optional
|
||||
from typing import Dict, List, Literal, Optional
|
||||
from urllib.parse import urlparse
|
||||
|
||||
import aiohttp
|
||||
from pydantic import BaseModel, ConfigDict, Field, ValidationError
|
||||
from pydantic import BaseModel, Field, ValidationError
|
||||
|
||||
|
||||
class DailyRoomSipParams(BaseModel):
|
||||
@@ -77,7 +77,7 @@ class TranscriptionBucketConfig(BaseModel):
|
||||
allow_api_access: bool = False
|
||||
|
||||
|
||||
class DailyRoomProperties(BaseModel):
|
||||
class DailyRoomProperties(BaseModel, extra="allow"):
|
||||
"""Properties for configuring a Daily room.
|
||||
|
||||
Reference: https://docs.daily.co/reference/rest-api/rooms/create-room#properties
|
||||
@@ -100,8 +100,6 @@ class DailyRoomProperties(BaseModel):
|
||||
start_video_off: Whether video is off by default.
|
||||
"""
|
||||
|
||||
model_config = ConfigDict(extra="allow")
|
||||
|
||||
exp: Optional[float] = None
|
||||
enable_chat: bool = False
|
||||
enable_prejoin_ui: bool = False
|
||||
@@ -115,7 +113,7 @@ class DailyRoomProperties(BaseModel):
|
||||
recordings_bucket: Optional[RecordingsBucketConfig] = None
|
||||
transcription_bucket: Optional[TranscriptionBucketConfig] = None
|
||||
sip: Optional[DailyRoomSipParams] = None
|
||||
sip_uri: Optional[Dict[str, Any]] = None
|
||||
sip_uri: Optional[dict] = None
|
||||
start_video_off: bool = False
|
||||
|
||||
@property
|
||||
@@ -205,7 +203,7 @@ class DailyMeetingTokenProperties(BaseModel):
|
||||
enable_recording: Optional[Literal["cloud", "local", "raw-tracks"]] = None
|
||||
enable_prejoin_ui: Optional[bool] = None
|
||||
start_cloud_recording: Optional[bool] = None
|
||||
permissions: Optional[Dict[str, Any]] = None
|
||||
permissions: Optional[dict] = None
|
||||
|
||||
|
||||
class DailyMeetingTokenParams(BaseModel):
|
||||
|
||||
@@ -50,7 +50,6 @@ class WebsocketClientParams(TransportParams):
|
||||
"""
|
||||
|
||||
add_wav_header: bool = True
|
||||
additional_headers: Optional[dict[str, str]] = None
|
||||
serializer: Optional[FrameSerializer] = None
|
||||
|
||||
|
||||
@@ -131,11 +130,7 @@ class WebsocketClientSession:
|
||||
return
|
||||
|
||||
try:
|
||||
self._websocket = await websocket_connect(
|
||||
uri=self._uri,
|
||||
open_timeout=10,
|
||||
additional_headers=self._params.additional_headers,
|
||||
)
|
||||
self._websocket = await websocket_connect(uri=self._uri, open_timeout=10)
|
||||
self._client_task = self.task_manager.create_task(
|
||||
self._client_task_handler(),
|
||||
f"{self._transport_name}::WebsocketClientSession::_client_task_handler",
|
||||
|
||||
@@ -4,21 +4,10 @@
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
import warnings
|
||||
|
||||
from pipecat.turns.user_mute.always_user_mute_strategy import AlwaysUserMuteStrategy
|
||||
from pipecat.turns.user_mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
from pipecat.turns.user_mute.first_speech_user_mute_strategy import FirstSpeechUserMuteStrategy
|
||||
from pipecat.turns.user_mute.function_call_user_mute_strategy import FunctionCallUserMuteStrategy
|
||||
from pipecat.turns.user_mute.mute_until_first_bot_complete_user_mute_strategy import (
|
||||
from pipecat.turns.mute.always_user_mute_strategy import AlwaysUserMuteStrategy
|
||||
from pipecat.turns.mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
from pipecat.turns.mute.first_speech_user_mute_strategy import FirstSpeechUserMuteStrategy
|
||||
from pipecat.turns.mute.function_call_user_mute_strategy import FunctionCallUserMuteStrategy
|
||||
from pipecat.turns.mute.mute_until_first_bot_complete_user_mute_strategy import (
|
||||
MuteUntilFirstBotCompleteUserMuteStrategy,
|
||||
)
|
||||
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter("always")
|
||||
warnings.warn(
|
||||
"Types in pipecat.turns.mute are deprecated. "
|
||||
"Please use the equivalent types from pipecat.turns.user_mute instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
@@ -7,7 +7,7 @@
|
||||
"""User mute strategy that always mutes the user while the bot is speaking."""
|
||||
|
||||
from pipecat.frames.frames import BotStartedSpeakingFrame, BotStoppedSpeakingFrame, Frame
|
||||
from pipecat.turns.user_mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
from pipecat.turns.mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
|
||||
|
||||
class AlwaysUserMuteStrategy(BaseUserMuteStrategy):
|
||||
@@ -7,7 +7,7 @@
|
||||
"""User mute strategy that mutes the user only during the bot’s first speech."""
|
||||
|
||||
from pipecat.frames.frames import BotStartedSpeakingFrame, BotStoppedSpeakingFrame, Frame
|
||||
from pipecat.turns.user_mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
from pipecat.turns.mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
|
||||
|
||||
class FirstSpeechUserMuteStrategy(BaseUserMuteStrategy):
|
||||
@@ -14,7 +14,7 @@ from pipecat.frames.frames import (
|
||||
FunctionCallResultFrame,
|
||||
FunctionCallsStartedFrame,
|
||||
)
|
||||
from pipecat.turns.user_mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
from pipecat.turns.mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
|
||||
|
||||
class FunctionCallUserMuteStrategy(BaseUserMuteStrategy):
|
||||
@@ -7,7 +7,7 @@
|
||||
"""User mute strategy that mutes the user until the bot completes its first speech."""
|
||||
|
||||
from pipecat.frames.frames import BotStoppedSpeakingFrame, Frame
|
||||
from pipecat.turns.user_mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
from pipecat.turns.mute.base_user_mute_strategy import BaseUserMuteStrategy
|
||||
|
||||
|
||||
class MuteUntilFirstBotCompleteUserMuteStrategy(BaseUserMuteStrategy):
|
||||
@@ -1,173 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""This module defines a controller for managing user idle detection."""
|
||||
|
||||
import asyncio
|
||||
from typing import Optional
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
BotSpeakingFrame,
|
||||
Frame,
|
||||
FunctionCallResultFrame,
|
||||
FunctionCallsStartedFrame,
|
||||
UserSpeakingFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
)
|
||||
from pipecat.utils.asyncio.task_manager import BaseTaskManager
|
||||
from pipecat.utils.base_object import BaseObject
|
||||
|
||||
|
||||
class UserIdleController(BaseObject):
|
||||
"""Controller for managing user idle detection.
|
||||
|
||||
This class monitors user activity and triggers an event when the user has been
|
||||
idle (not speaking) for a configured timeout period. It only starts monitoring
|
||||
after the first conversation activity and does not trigger while the bot is
|
||||
speaking or function calls are in progress.
|
||||
|
||||
The controller tracks activity using continuous frames (UserSpeakingFrame and
|
||||
BotSpeakingFrame) which are emitted repeatedly while speaking is happening, and
|
||||
state-based tracking for function calls (FunctionCallsStartedFrame and
|
||||
FunctionCallResultFrame) which are only sent at start and end.
|
||||
|
||||
Event handlers available:
|
||||
|
||||
- on_user_turn_idle: Emitted when the user has been idle for the timeout period.
|
||||
|
||||
Example::
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
# Handle user idle - send reminder, prompt, etc.
|
||||
...
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
user_idle_timeout: float,
|
||||
):
|
||||
"""Initialize the user idle controller.
|
||||
|
||||
Args:
|
||||
user_idle_timeout: Timeout in seconds before considering the user idle.
|
||||
"""
|
||||
super().__init__()
|
||||
|
||||
self._user_idle_timeout = user_idle_timeout
|
||||
|
||||
self._task_manager: Optional[BaseTaskManager] = None
|
||||
|
||||
self._conversation_started = False
|
||||
self._function_call_in_progress = False
|
||||
|
||||
self.user_idle_event = asyncio.Event()
|
||||
self.user_idle_task: Optional[asyncio.Task] = None
|
||||
|
||||
self._register_event_handler("on_user_turn_idle", sync=True)
|
||||
|
||||
@property
|
||||
def task_manager(self) -> BaseTaskManager:
|
||||
"""Returns the configured task manager."""
|
||||
if not self._task_manager:
|
||||
raise RuntimeError(f"{self} user idle controller was not properly setup")
|
||||
return self._task_manager
|
||||
|
||||
async def setup(self, task_manager: BaseTaskManager):
|
||||
"""Initialize the controller with the given task manager.
|
||||
|
||||
Args:
|
||||
task_manager: The task manager to be associated with this instance.
|
||||
"""
|
||||
self._task_manager = task_manager
|
||||
|
||||
if not self.user_idle_task:
|
||||
self.user_idle_task = self.task_manager.create_task(
|
||||
self.user_idle_task_handler(),
|
||||
f"{self}::user_idle_task_handler",
|
||||
)
|
||||
|
||||
async def cleanup(self):
|
||||
"""Cleanup the controller."""
|
||||
await super().cleanup()
|
||||
|
||||
if self.user_idle_task:
|
||||
await self.task_manager.cancel_task(self.user_idle_task)
|
||||
self.user_idle_task = None
|
||||
|
||||
async def process_frame(self, frame: Frame):
|
||||
"""Process an incoming frame to track user activity state.
|
||||
|
||||
Args:
|
||||
frame: The frame to be processed.
|
||||
"""
|
||||
# Start monitoring on first conversation activity
|
||||
if not self._conversation_started:
|
||||
if isinstance(frame, (UserStartedSpeakingFrame, BotSpeakingFrame)):
|
||||
self._conversation_started = True
|
||||
self.user_idle_event.set()
|
||||
else:
|
||||
return
|
||||
|
||||
# Reset idle timer on continuous activity frames
|
||||
if isinstance(frame, (UserSpeakingFrame, BotSpeakingFrame)):
|
||||
await self._handle_activity(frame)
|
||||
# Track function call state (start/end frames, not continuous)
|
||||
elif isinstance(frame, FunctionCallsStartedFrame):
|
||||
await self._handle_function_calls_started(frame)
|
||||
elif isinstance(frame, FunctionCallResultFrame):
|
||||
await self._handle_function_call_result(frame)
|
||||
|
||||
async def _handle_activity(self, _: UserSpeakingFrame | BotSpeakingFrame):
|
||||
"""Handle continuous activity frames that should reset the idle timer.
|
||||
|
||||
These frames are emitted continuously while the user or bot is speaking,
|
||||
so we simply reset the timer whenever we receive them.
|
||||
|
||||
Args:
|
||||
frame: The activity frame to process.
|
||||
"""
|
||||
self.user_idle_event.set()
|
||||
|
||||
async def _handle_function_calls_started(self, _: FunctionCallsStartedFrame):
|
||||
"""Handle function calls started event.
|
||||
|
||||
Function calls can take longer than the timeout, so we track their state
|
||||
to prevent idle callbacks while they're in progress.
|
||||
|
||||
Args:
|
||||
frame: The FunctionCallsStartedFrame to process.
|
||||
"""
|
||||
self._function_call_in_progress = True
|
||||
self.user_idle_event.set()
|
||||
|
||||
async def _handle_function_call_result(self, _: FunctionCallResultFrame):
|
||||
"""Handle function call result event.
|
||||
|
||||
Args:
|
||||
frame: The FunctionCallResultFrame to process.
|
||||
"""
|
||||
self._function_call_in_progress = False
|
||||
self.user_idle_event.set()
|
||||
|
||||
async def user_idle_task_handler(self):
|
||||
"""Monitors for idle timeout and triggers events.
|
||||
|
||||
Runs in a loop until cancelled. The idle timer is reset whenever activity
|
||||
frames are received (UserSpeakingFrame or BotSpeakingFrame). Function calls
|
||||
are tracked via state since they only send start/end frames. If no activity
|
||||
is detected for the configured timeout period and no function call is in
|
||||
progress, the on_user_turn_idle event is triggered.
|
||||
"""
|
||||
while True:
|
||||
try:
|
||||
await asyncio.wait_for(self.user_idle_event.wait(), timeout=self._user_idle_timeout)
|
||||
self.user_idle_event.clear()
|
||||
except asyncio.TimeoutError:
|
||||
# Only trigger if conversation has started and no function call is in progress
|
||||
if self._conversation_started and not self._function_call_in_progress:
|
||||
await self._call_event_handler("on_user_turn_idle")
|
||||
@@ -1,21 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
from .always_user_mute_strategy import AlwaysUserMuteStrategy
|
||||
from .base_user_mute_strategy import BaseUserMuteStrategy
|
||||
from .first_speech_user_mute_strategy import FirstSpeechUserMuteStrategy
|
||||
from .function_call_user_mute_strategy import FunctionCallUserMuteStrategy
|
||||
from .mute_until_first_bot_complete_user_mute_strategy import (
|
||||
MuteUntilFirstBotCompleteUserMuteStrategy,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"AlwaysUserMuteStrategy",
|
||||
"BaseUserMuteStrategy",
|
||||
"FirstSpeechUserMuteStrategy",
|
||||
"FunctionCallUserMuteStrategy",
|
||||
"MuteUntilFirstBotCompleteUserMuteStrategy",
|
||||
]
|
||||
@@ -4,17 +4,15 @@
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
from .base_user_turn_start_strategy import BaseUserTurnStartStrategy, UserTurnStartedParams
|
||||
from .external_user_turn_start_strategy import ExternalUserTurnStartStrategy
|
||||
from .min_words_user_turn_start_strategy import MinWordsUserTurnStartStrategy
|
||||
from .transcription_user_turn_start_strategy import TranscriptionUserTurnStartStrategy
|
||||
from .vad_user_turn_start_strategy import VADUserTurnStartStrategy
|
||||
|
||||
__all__ = [
|
||||
"BaseUserTurnStartStrategy",
|
||||
"ExternalUserTurnStartStrategy",
|
||||
"MinWordsUserTurnStartStrategy",
|
||||
"TranscriptionUserTurnStartStrategy",
|
||||
"UserTurnStartedParams",
|
||||
"VADUserTurnStartStrategy",
|
||||
]
|
||||
from pipecat.turns.user_start.base_user_turn_start_strategy import (
|
||||
BaseUserTurnStartStrategy,
|
||||
UserTurnStartedParams,
|
||||
)
|
||||
from pipecat.turns.user_start.external_user_turn_start_strategy import ExternalUserTurnStartStrategy
|
||||
from pipecat.turns.user_start.min_words_user_turn_start_strategy import (
|
||||
MinWordsUserTurnStartStrategy,
|
||||
)
|
||||
from pipecat.turns.user_start.transcription_user_turn_start_strategy import (
|
||||
TranscriptionUserTurnStartStrategy,
|
||||
)
|
||||
from pipecat.turns.user_start.vad_user_turn_start_strategy import VADUserTurnStartStrategy
|
||||
|
||||
@@ -41,10 +41,12 @@ class MinWordsUserTurnStartStrategy(BaseUserTurnStartStrategy):
|
||||
self._min_words = min_words
|
||||
self._use_interim = use_interim
|
||||
self._bot_speaking = False
|
||||
self._text = ""
|
||||
|
||||
async def reset(self):
|
||||
"""Reset the strategy to its initial state."""
|
||||
await super().reset()
|
||||
self._text = ""
|
||||
self._bot_speaking = False
|
||||
|
||||
async def process_frame(self, frame: Frame):
|
||||
@@ -65,7 +67,7 @@ class MinWordsUserTurnStartStrategy(BaseUserTurnStartStrategy):
|
||||
elif isinstance(frame, TranscriptionFrame):
|
||||
await self._handle_transcription(frame)
|
||||
elif isinstance(frame, InterimTranscriptionFrame) and self._use_interim:
|
||||
await self._handle_transcription(frame)
|
||||
await self._handle_interim_transcription(frame)
|
||||
|
||||
async def _handle_bot_started_speaking(self, frame: BotStartedSpeakingFrame):
|
||||
"""Handle bot started speaking frame.
|
||||
@@ -87,21 +89,41 @@ class MinWordsUserTurnStartStrategy(BaseUserTurnStartStrategy):
|
||||
"""
|
||||
self._bot_speaking = False
|
||||
|
||||
async def _handle_transcription(self, frame: TranscriptionFrame | InterimTranscriptionFrame):
|
||||
async def _handle_transcription(self, frame: TranscriptionFrame):
|
||||
"""Handle a completed transcription frame and check word count.
|
||||
|
||||
Args:
|
||||
frame: The transcription frame to be processed.
|
||||
"""
|
||||
self._text += frame.text
|
||||
|
||||
min_words = self._min_words if self._bot_speaking else 1
|
||||
|
||||
word_count = len(frame.text.split())
|
||||
word_count = len(self._text.split())
|
||||
should_trigger = word_count >= min_words
|
||||
is_interim = isinstance(frame, InterimTranscriptionFrame)
|
||||
|
||||
logger.debug(
|
||||
f"{self} should_trigger={should_trigger} num_spoken_words={word_count} "
|
||||
f"min_words={min_words} bot_speaking={self._bot_speaking} interim_transcription={is_interim}"
|
||||
f"min_words={min_words} bot_speaking={self._bot_speaking}"
|
||||
)
|
||||
|
||||
if should_trigger:
|
||||
await self.trigger_user_turn_started()
|
||||
|
||||
async def _handle_interim_transcription(self, frame: InterimTranscriptionFrame):
|
||||
"""Handle an interim transcription frame and check word count.
|
||||
|
||||
Args:
|
||||
frame: The interim transcription frame to be processed.
|
||||
"""
|
||||
min_words = self._min_words if self._bot_speaking else 1
|
||||
|
||||
word_count = len(frame.text.split())
|
||||
should_trigger = word_count >= min_words
|
||||
|
||||
logger.debug(
|
||||
f"{self} interim=True should_trigger={should_trigger} num_spoken_words={word_count} "
|
||||
f"min_words={min_words} bot_speaking={self._bot_speaking}"
|
||||
)
|
||||
|
||||
if should_trigger:
|
||||
|
||||
@@ -4,15 +4,14 @@
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
from .base_user_turn_stop_strategy import BaseUserTurnStopStrategy, UserTurnStoppedParams
|
||||
from .external_user_turn_stop_strategy import ExternalUserTurnStopStrategy
|
||||
from .transcription_user_turn_stop_strategy import TranscriptionUserTurnStopStrategy
|
||||
from .turn_analyzer_user_turn_stop_strategy import TurnAnalyzerUserTurnStopStrategy
|
||||
|
||||
__all__ = [
|
||||
"BaseUserTurnStopStrategy",
|
||||
"ExternalUserTurnStopStrategy",
|
||||
"UserTurnStoppedParams",
|
||||
"TranscriptionUserTurnStopStrategy",
|
||||
"TurnAnalyzerUserTurnStopStrategy",
|
||||
]
|
||||
from pipecat.turns.user_stop.base_user_turn_stop_strategy import (
|
||||
BaseUserTurnStopStrategy,
|
||||
UserTurnStoppedParams,
|
||||
)
|
||||
from pipecat.turns.user_stop.external_user_turn_stop_strategy import ExternalUserTurnStopStrategy
|
||||
from pipecat.turns.user_stop.transcription_user_turn_stop_strategy import (
|
||||
TranscriptionUserTurnStopStrategy,
|
||||
)
|
||||
from pipecat.turns.user_stop.turn_analyzer_user_turn_stop_strategy import (
|
||||
TurnAnalyzerUserTurnStopStrategy,
|
||||
)
|
||||
|
||||
@@ -12,8 +12,6 @@ from typing import Optional, Type
|
||||
from pipecat.frames.frames import (
|
||||
Frame,
|
||||
TranscriptionFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
UserStoppedSpeakingFrame,
|
||||
VADUserStartedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
)
|
||||
@@ -82,7 +80,7 @@ class UserTurnController(BaseObject):
|
||||
|
||||
self._task_manager: Optional[BaseTaskManager] = None
|
||||
|
||||
self._user_speaking = False
|
||||
self._vad_user_speaking = False
|
||||
|
||||
self._user_turn = False
|
||||
self._user_turn_stop_timeout_event = asyncio.Event()
|
||||
@@ -148,11 +146,7 @@ class UserTurnController(BaseObject):
|
||||
frame: The frame to be processed.
|
||||
|
||||
"""
|
||||
if isinstance(frame, UserStartedSpeakingFrame):
|
||||
await self._handle_user_started_speaking(frame)
|
||||
elif isinstance(frame, UserStoppedSpeakingFrame):
|
||||
await self._handle_user_stopped_speaking(frame)
|
||||
elif isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self._handle_vad_user_started_speaking(frame)
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
await self._handle_vad_user_stopped_speaking(frame)
|
||||
@@ -185,26 +179,14 @@ class UserTurnController(BaseObject):
|
||||
for s in self._user_turn_strategies.stop or []:
|
||||
await s.cleanup()
|
||||
|
||||
async def _handle_user_started_speaking(self, frame: UserStartedSpeakingFrame):
|
||||
self._user_speaking = True
|
||||
|
||||
# The user started talking, let's reset the user turn timeout.
|
||||
self._user_turn_stop_timeout_event.set()
|
||||
|
||||
async def _handle_user_stopped_speaking(self, frame: UserStoppedSpeakingFrame):
|
||||
self._user_speaking = False
|
||||
|
||||
# The user stopped talking, let's reset the user turn timeout.
|
||||
self._user_turn_stop_timeout_event.set()
|
||||
|
||||
async def _handle_vad_user_started_speaking(self, frame: VADUserStartedSpeakingFrame):
|
||||
self._user_speaking = True
|
||||
self._vad_user_speaking = True
|
||||
|
||||
# The user started talking, let's reset the user turn timeout.
|
||||
self._user_turn_stop_timeout_event.set()
|
||||
|
||||
async def _handle_vad_user_stopped_speaking(self, frame: VADUserStoppedSpeakingFrame):
|
||||
self._user_speaking = False
|
||||
self._vad_user_speaking = False
|
||||
|
||||
# The user stopped talking, let's reset the user turn timeout.
|
||||
self._user_turn_stop_timeout_event.set()
|
||||
@@ -251,10 +233,6 @@ class UserTurnController(BaseObject):
|
||||
self._user_turn = True
|
||||
self._user_turn_stop_timeout_event.set()
|
||||
|
||||
# Reset all user turn start strategies to start fresh.
|
||||
for s in self._user_turn_strategies.start or []:
|
||||
await s.reset()
|
||||
|
||||
await self._call_event_handler("on_user_turn_started", strategy, params)
|
||||
|
||||
async def _trigger_user_turn_stop(
|
||||
@@ -282,7 +260,7 @@ class UserTurnController(BaseObject):
|
||||
)
|
||||
self._user_turn_stop_timeout_event.clear()
|
||||
except asyncio.TimeoutError:
|
||||
if self._user_turn and not self._user_speaking:
|
||||
if self._user_turn and not self._vad_user_speaking:
|
||||
await self._call_event_handler("on_user_turn_stop_timeout")
|
||||
await self._trigger_user_turn_stop(
|
||||
None, UserTurnStoppedParams(enable_user_speaking_frames=True)
|
||||
|
||||
@@ -19,7 +19,6 @@ from pipecat.frames.frames import (
|
||||
UserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.turns.user_idle_controller import UserIdleController
|
||||
from pipecat.turns.user_start import BaseUserTurnStartStrategy, UserTurnStartedParams
|
||||
from pipecat.turns.user_stop import BaseUserTurnStopStrategy, UserTurnStoppedParams
|
||||
from pipecat.turns.user_turn_controller import UserTurnController
|
||||
@@ -39,7 +38,6 @@ class UserTurnProcessor(FrameProcessor):
|
||||
- on_user_turn_started: Emitted when a user turn starts.
|
||||
- on_user_turn_stopped: Emitted when a user turn stops.
|
||||
- on_user_turn_stop_timeout: Emitted if no stop strategy triggers before timeout.
|
||||
- on_user_turn_idle: Emitted when the user has been idle for the configured timeout.
|
||||
|
||||
Example::
|
||||
|
||||
@@ -55,10 +53,6 @@ class UserTurnProcessor(FrameProcessor):
|
||||
async def on_user_turn_stop_timeout(processor):
|
||||
...
|
||||
|
||||
@processor.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(processor):
|
||||
...
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
@@ -66,7 +60,6 @@ class UserTurnProcessor(FrameProcessor):
|
||||
*,
|
||||
user_turn_strategies: Optional[UserTurnStrategies] = None,
|
||||
user_turn_stop_timeout: float = 5.0,
|
||||
user_idle_timeout: Optional[float] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the user turn processor.
|
||||
@@ -75,10 +68,6 @@ class UserTurnProcessor(FrameProcessor):
|
||||
user_turn_strategies: Configured strategies for starting and stopping user turns.
|
||||
user_turn_stop_timeout: Timeout in seconds to automatically stop a user turn
|
||||
if no activity is detected.
|
||||
user_idle_timeout: Optional timeout in seconds for detecting user idle state.
|
||||
If set, the processor will emit an `on_user_turn_idle` event when the user
|
||||
has been idle (not speaking) for this duration. Set to None to disable
|
||||
idle detection.
|
||||
**kwargs: Additional keyword arguments.
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
@@ -86,7 +75,6 @@ class UserTurnProcessor(FrameProcessor):
|
||||
self._register_event_handler("on_user_turn_started")
|
||||
self._register_event_handler("on_user_turn_stopped")
|
||||
self._register_event_handler("on_user_turn_stop_timeout")
|
||||
self._register_event_handler("on_user_turn_idle")
|
||||
|
||||
self._user_turn_controller = UserTurnController(
|
||||
user_turn_strategies=user_turn_strategies or UserTurnStrategies(),
|
||||
@@ -104,14 +92,6 @@ class UserTurnProcessor(FrameProcessor):
|
||||
"on_user_turn_stop_timeout", self._on_user_turn_stop_timeout
|
||||
)
|
||||
|
||||
# Optional user idle controller
|
||||
self._user_idle_controller: Optional[UserIdleController] = None
|
||||
if user_idle_timeout:
|
||||
self._user_idle_controller = UserIdleController(user_idle_timeout=user_idle_timeout)
|
||||
self._user_idle_controller.add_event_handler(
|
||||
"on_user_turn_idle", self._on_user_turn_idle
|
||||
)
|
||||
|
||||
async def cleanup(self):
|
||||
"""Clean up processor resources."""
|
||||
await super().cleanup()
|
||||
@@ -149,15 +129,9 @@ class UserTurnProcessor(FrameProcessor):
|
||||
|
||||
await self._user_turn_controller.process_frame(frame)
|
||||
|
||||
if self._user_idle_controller:
|
||||
await self._user_idle_controller.process_frame(frame)
|
||||
|
||||
async def _start(self, frame: StartFrame):
|
||||
await self._user_turn_controller.setup(self.task_manager)
|
||||
|
||||
if self._user_idle_controller:
|
||||
await self._user_idle_controller.setup(self.task_manager)
|
||||
|
||||
async def _stop(self, frame: EndFrame):
|
||||
await self._cleanup()
|
||||
|
||||
@@ -167,9 +141,6 @@ class UserTurnProcessor(FrameProcessor):
|
||||
async def _cleanup(self):
|
||||
await self._user_turn_controller.cleanup()
|
||||
|
||||
if self._user_idle_controller:
|
||||
await self._user_idle_controller.cleanup()
|
||||
|
||||
async def _on_push_frame(
|
||||
self, controller, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM
|
||||
):
|
||||
@@ -209,6 +180,3 @@ class UserTurnProcessor(FrameProcessor):
|
||||
|
||||
async def _on_user_turn_stop_timeout(self, controller):
|
||||
await self._call_event_handler("on_user_turn_stop_timeout")
|
||||
|
||||
async def _on_user_turn_idle(self, controller):
|
||||
await self._call_event_handler("on_user_turn_idle")
|
||||
|
||||
@@ -16,15 +16,12 @@ import inspect
|
||||
import traceback
|
||||
from abc import ABC
|
||||
from dataclasses import dataclass
|
||||
from typing import Any, Callable, Dict, List, Optional, Set, Tuple, TypeVar
|
||||
from typing import Any, Dict, List, Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.utils.utils import obj_count, obj_id
|
||||
|
||||
# TypeVar for preserving function signatures in decorators
|
||||
F = TypeVar("F", bound=Callable[..., Any])
|
||||
|
||||
|
||||
@dataclass
|
||||
class EventHandler:
|
||||
@@ -102,7 +99,7 @@ class BaseObject(ABC):
|
||||
logger.debug(f"{self}: waiting on event handlers to finish {list(event_names)}...")
|
||||
await asyncio.wait(tasks)
|
||||
|
||||
def event_handler(self, event_name: str) -> Callable[[F], F]:
|
||||
def event_handler(self, event_name: str):
|
||||
"""Decorator for registering event handlers.
|
||||
|
||||
Args:
|
||||
@@ -112,7 +109,7 @@ class BaseObject(ABC):
|
||||
The decorator function that registers the handler.
|
||||
"""
|
||||
|
||||
def decorator(handler: F) -> F:
|
||||
def decorator(handler):
|
||||
self.add_event_handler(event_name, handler)
|
||||
return handler
|
||||
|
||||
|
||||
@@ -40,7 +40,7 @@ from pipecat.processors.aggregators.llm_response_universal import (
|
||||
LLMUserAggregatorParams,
|
||||
)
|
||||
from pipecat.tests.utils import SleepFrame, run_test
|
||||
from pipecat.turns.user_mute import FirstSpeechUserMuteStrategy, FunctionCallUserMuteStrategy
|
||||
from pipecat.turns.mute import FirstSpeechUserMuteStrategy, FunctionCallUserMuteStrategy
|
||||
from pipecat.turns.user_stop import TranscriptionUserTurnStopStrategy
|
||||
from pipecat.turns.user_turn_strategies import UserTurnStrategies
|
||||
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
# Copyright (c) 2024-2025 Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
# Copyright (c) 2024-2025 Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
@@ -9,7 +9,7 @@ import os
|
||||
import sys
|
||||
import tempfile
|
||||
import unittest
|
||||
from unittest.mock import MagicMock, patch
|
||||
from unittest.mock import AsyncMock, MagicMock, Mock, patch
|
||||
|
||||
import numpy as np
|
||||
|
||||
|
||||
@@ -1,216 +0,0 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
import asyncio
|
||||
import unittest
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
BotSpeakingFrame,
|
||||
FunctionCallResultFrame,
|
||||
FunctionCallsStartedFrame,
|
||||
UserSpeakingFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
)
|
||||
from pipecat.turns.user_idle_controller import UserIdleController
|
||||
from pipecat.utils.asyncio.task_manager import TaskManager, TaskManagerParams
|
||||
|
||||
USER_IDLE_TIMEOUT = 0.2
|
||||
|
||||
|
||||
class TestUserIdleController(unittest.IsolatedAsyncioTestCase):
|
||||
async def asyncSetUp(self):
|
||||
self.task_manager = TaskManager()
|
||||
self.task_manager.setup(TaskManagerParams(loop=asyncio.get_running_loop()))
|
||||
|
||||
async def test_basic_idle_detection(self):
|
||||
"""Test that idle event is triggered after timeout when no activity."""
|
||||
controller = UserIdleController(user_idle_timeout=USER_IDLE_TIMEOUT)
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
idle_triggered = False
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
nonlocal idle_triggered
|
||||
idle_triggered = True
|
||||
|
||||
# Start conversation
|
||||
await controller.process_frame(UserStartedSpeakingFrame())
|
||||
|
||||
# Wait for idle timeout
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
|
||||
self.assertTrue(idle_triggered)
|
||||
|
||||
await controller.cleanup()
|
||||
|
||||
async def test_user_speaking_resets_idle_timer(self):
|
||||
"""Test that continuous UserSpeakingFrame frames reset the idle timer."""
|
||||
controller = UserIdleController(user_idle_timeout=USER_IDLE_TIMEOUT)
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
idle_triggered = False
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
nonlocal idle_triggered
|
||||
idle_triggered = True
|
||||
|
||||
# Start conversation
|
||||
await controller.process_frame(UserStartedSpeakingFrame())
|
||||
|
||||
# Send UserSpeakingFrame continuously to reset timer
|
||||
for _ in range(5):
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT * 0.5) # 50% of timeout period
|
||||
await controller.process_frame(UserSpeakingFrame())
|
||||
|
||||
self.assertFalse(idle_triggered)
|
||||
|
||||
await controller.cleanup()
|
||||
|
||||
async def test_bot_speaking_resets_idle_timer(self):
|
||||
"""Test that BotSpeakingFrame frames reset the idle timer."""
|
||||
controller = UserIdleController(user_idle_timeout=USER_IDLE_TIMEOUT)
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
idle_triggered = False
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
nonlocal idle_triggered
|
||||
idle_triggered = True
|
||||
|
||||
# Start conversation
|
||||
await controller.process_frame(UserStartedSpeakingFrame())
|
||||
|
||||
# Bot speaking should reset timer
|
||||
for _ in range(5):
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT * 0.6) # 60% of timeout
|
||||
await controller.process_frame(BotSpeakingFrame())
|
||||
|
||||
self.assertFalse(idle_triggered)
|
||||
|
||||
await controller.cleanup()
|
||||
|
||||
async def test_function_call_prevents_idle(self):
|
||||
"""Test that function calls in progress prevent idle event."""
|
||||
controller = UserIdleController(user_idle_timeout=USER_IDLE_TIMEOUT)
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
idle_triggered = False
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
nonlocal idle_triggered
|
||||
idle_triggered = True
|
||||
|
||||
# Start conversation
|
||||
await controller.process_frame(UserStartedSpeakingFrame())
|
||||
|
||||
# Start function call
|
||||
await controller.process_frame(FunctionCallsStartedFrame(function_calls=[]))
|
||||
|
||||
# Wait longer than idle timeout
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
|
||||
# Should not trigger idle because function call is in progress
|
||||
self.assertFalse(idle_triggered)
|
||||
|
||||
# Complete function call
|
||||
await controller.process_frame(
|
||||
FunctionCallResultFrame(
|
||||
function_name="test",
|
||||
tool_call_id="123",
|
||||
arguments={},
|
||||
result=None,
|
||||
run_llm=False,
|
||||
)
|
||||
)
|
||||
|
||||
# Now idle should trigger
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
self.assertTrue(idle_triggered)
|
||||
|
||||
await controller.cleanup()
|
||||
|
||||
async def test_no_idle_before_conversation_starts(self):
|
||||
"""Test that idle monitoring doesn't start before first conversation activity."""
|
||||
controller = UserIdleController(user_idle_timeout=USER_IDLE_TIMEOUT)
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
idle_triggered = False
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
nonlocal idle_triggered
|
||||
idle_triggered = True
|
||||
|
||||
# Wait without starting conversation
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
|
||||
self.assertFalse(idle_triggered)
|
||||
|
||||
await controller.cleanup()
|
||||
|
||||
async def test_idle_starts_with_bot_speaking(self):
|
||||
"""Test that monitoring starts with BotSpeakingFrame, not just user speech."""
|
||||
controller = UserIdleController(user_idle_timeout=USER_IDLE_TIMEOUT)
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
idle_triggered = False
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
nonlocal idle_triggered
|
||||
idle_triggered = True
|
||||
|
||||
# Start conversation with bot speaking
|
||||
await controller.process_frame(BotSpeakingFrame())
|
||||
|
||||
# Wait for idle timeout
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
|
||||
self.assertTrue(idle_triggered)
|
||||
|
||||
await controller.cleanup()
|
||||
|
||||
async def test_multiple_idle_events(self):
|
||||
"""Test that idle event can trigger multiple times."""
|
||||
controller = UserIdleController(user_idle_timeout=USER_IDLE_TIMEOUT)
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
idle_count = 0
|
||||
|
||||
@controller.event_handler("on_user_turn_idle")
|
||||
async def on_user_turn_idle(controller):
|
||||
nonlocal idle_count
|
||||
idle_count += 1
|
||||
|
||||
# Start conversation
|
||||
await controller.process_frame(UserStartedSpeakingFrame())
|
||||
|
||||
# First idle
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
first_count = idle_count
|
||||
self.assertGreaterEqual(first_count, 1)
|
||||
|
||||
# Second idle
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
second_count = idle_count
|
||||
self.assertGreater(second_count, first_count)
|
||||
|
||||
# User activity resets timer
|
||||
await controller.process_frame(UserSpeakingFrame())
|
||||
|
||||
# Give a moment for the timer to reset
|
||||
await asyncio.sleep(0.1)
|
||||
|
||||
# Third idle
|
||||
await asyncio.sleep(USER_IDLE_TIMEOUT + 0.1)
|
||||
third_count = idle_count
|
||||
self.assertGreater(third_count, second_count)
|
||||
|
||||
await controller.cleanup()
|
||||
@@ -15,7 +15,7 @@ from pipecat.frames.frames import (
|
||||
FunctionCallsStartedFrame,
|
||||
InterruptionFrame,
|
||||
)
|
||||
from pipecat.turns.user_mute import (
|
||||
from pipecat.turns.mute import (
|
||||
AlwaysUserMuteStrategy,
|
||||
FirstSpeechUserMuteStrategy,
|
||||
FunctionCallUserMuteStrategy,
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2024-2026, Daily
|
||||
# Copyright (c) 2024-2026 Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
@@ -8,18 +8,12 @@ import asyncio
|
||||
import unittest
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
BotStartedSpeakingFrame,
|
||||
TranscriptionFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
UserStoppedSpeakingFrame,
|
||||
VADUserStartedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.turns.user_start.min_words_user_turn_start_strategy import (
|
||||
MinWordsUserTurnStartStrategy,
|
||||
)
|
||||
from pipecat.turns.user_turn_controller import UserTurnController
|
||||
from pipecat.turns.user_turn_strategies import ExternalUserTurnStrategies, UserTurnStrategies
|
||||
from pipecat.turns.user_turn_strategies import UserTurnStrategies
|
||||
from pipecat.utils.asyncio.task_manager import TaskManager, TaskManagerParams
|
||||
|
||||
USER_TURN_STOP_TIMEOUT = 0.2
|
||||
@@ -62,44 +56,6 @@ class TestUserTurnController(unittest.IsolatedAsyncioTestCase):
|
||||
self.assertTrue(should_start)
|
||||
self.assertTrue(should_stop)
|
||||
|
||||
async def test_user_turn_start_reset(self):
|
||||
controller = UserTurnController(
|
||||
user_turn_strategies=UserTurnStrategies(
|
||||
start=[MinWordsUserTurnStartStrategy(min_words=3)]
|
||||
),
|
||||
user_turn_stop_timeout=USER_TURN_STOP_TIMEOUT,
|
||||
)
|
||||
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
should_start = 0
|
||||
|
||||
@controller.event_handler("on_user_turn_started")
|
||||
async def on_user_turn_started(controller, strategy, params):
|
||||
nonlocal should_start
|
||||
should_start += 1
|
||||
|
||||
await controller.process_frame(BotStartedSpeakingFrame())
|
||||
await controller.process_frame(TranscriptionFrame(text="One", user_id="cat", timestamp=""))
|
||||
self.assertEqual(should_start, 0)
|
||||
|
||||
await controller.process_frame(
|
||||
TranscriptionFrame(text="One two three!", user_id="cat", timestamp="")
|
||||
)
|
||||
self.assertEqual(should_start, 1)
|
||||
|
||||
# Trigger user stop turn so we can trigger user start turn again.
|
||||
await asyncio.sleep(USER_TURN_STOP_TIMEOUT + 0.1)
|
||||
|
||||
await controller.process_frame(BotStartedSpeakingFrame())
|
||||
await controller.process_frame(TranscriptionFrame(text="Hi!", user_id="cat", timestamp=""))
|
||||
self.assertEqual(should_start, 1)
|
||||
|
||||
await controller.process_frame(
|
||||
TranscriptionFrame(text="How are you?", user_id="cat", timestamp="")
|
||||
)
|
||||
self.assertEqual(should_start, 2)
|
||||
|
||||
async def test_user_turn_stop_timeout_no_transcription(self):
|
||||
controller = UserTurnController(
|
||||
user_turn_strategies=UserTurnStrategies(),
|
||||
@@ -140,53 +96,3 @@ class TestUserTurnController(unittest.IsolatedAsyncioTestCase):
|
||||
self.assertTrue(should_start)
|
||||
self.assertTrue(should_stop)
|
||||
self.assertTrue(timeout)
|
||||
|
||||
async def test_external_user_turn_strategies_no_timeout_while_speaking(self):
|
||||
"""Test that timeout does not trigger when user is still speaking with external strategies."""
|
||||
controller = UserTurnController(
|
||||
user_turn_strategies=ExternalUserTurnStrategies(),
|
||||
user_turn_stop_timeout=USER_TURN_STOP_TIMEOUT,
|
||||
)
|
||||
|
||||
await controller.setup(self.task_manager)
|
||||
|
||||
should_start = None
|
||||
should_stop = None
|
||||
timeout = None
|
||||
|
||||
@controller.event_handler("on_user_turn_started")
|
||||
async def on_user_turn_started(controller, strategy, params):
|
||||
nonlocal should_start
|
||||
should_start = True
|
||||
|
||||
@controller.event_handler("on_user_turn_stopped")
|
||||
async def on_user_turn_stopped(controller, strategy, params):
|
||||
nonlocal should_stop
|
||||
should_stop = True
|
||||
|
||||
@controller.event_handler("on_user_turn_stop_timeout")
|
||||
async def on_user_turn_stop_timeout(controller):
|
||||
nonlocal timeout
|
||||
timeout = True
|
||||
|
||||
# Simulate external service (like Deepgram Flux) broadcasting UserStartedSpeakingFrame
|
||||
await controller.process_frame(UserStartedSpeakingFrame())
|
||||
self.assertTrue(should_start)
|
||||
self.assertFalse(should_stop)
|
||||
self.assertFalse(timeout)
|
||||
|
||||
# User is still speaking, timeout should not trigger
|
||||
await asyncio.sleep(USER_TURN_STOP_TIMEOUT + 0.1)
|
||||
self.assertTrue(should_start)
|
||||
self.assertFalse(should_stop)
|
||||
self.assertFalse(timeout)
|
||||
|
||||
# Now external service broadcasts UserStoppedSpeakingFrame
|
||||
await controller.process_frame(UserStoppedSpeakingFrame())
|
||||
|
||||
# But no transcription, so timeout should trigger
|
||||
await asyncio.sleep(USER_TURN_STOP_TIMEOUT + 0.1)
|
||||
|
||||
self.assertTrue(should_start)
|
||||
self.assertTrue(should_stop)
|
||||
self.assertTrue(timeout)
|
||||
|
||||
@@ -38,7 +38,7 @@ class TestMinWordsInterruptionStrategy(unittest.IsolatedAsyncioTestCase):
|
||||
self.assertFalse(should_start)
|
||||
|
||||
await strategy.process_frame(
|
||||
TranscriptionFrame(text="Hello there!", user_id="cat", timestamp="")
|
||||
TranscriptionFrame(text=" there!", user_id="cat", timestamp="")
|
||||
)
|
||||
self.assertTrue(should_start)
|
||||
|
||||
@@ -55,26 +55,6 @@ class TestMinWordsInterruptionStrategy(unittest.IsolatedAsyncioTestCase):
|
||||
)
|
||||
self.assertTrue(should_start)
|
||||
|
||||
async def test_bot_speaking_singlw_words(self):
|
||||
strategy = MinWordsUserTurnStartStrategy(min_words=3)
|
||||
|
||||
should_start = None
|
||||
|
||||
@strategy.event_handler("on_user_turn_started")
|
||||
async def on_user_turn_started(strategy, params):
|
||||
nonlocal should_start
|
||||
should_start = True
|
||||
|
||||
await strategy.process_frame(BotStartedSpeakingFrame())
|
||||
await strategy.process_frame(TranscriptionFrame(text="One", user_id="cat", timestamp=""))
|
||||
self.assertFalse(should_start)
|
||||
|
||||
await strategy.process_frame(TranscriptionFrame(text="Two", user_id="cat", timestamp=""))
|
||||
self.assertFalse(should_start)
|
||||
|
||||
await strategy.process_frame(TranscriptionFrame(text="Three", user_id="cat", timestamp=""))
|
||||
self.assertFalse(should_start)
|
||||
|
||||
async def test_bot_speaking_interim_transcriptions(self):
|
||||
strategy = MinWordsUserTurnStartStrategy(min_words=2)
|
||||
|
||||
|
||||
47
uv.lock
generated
47
uv.lock
generated
@@ -629,22 +629,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/84/c2/80633736cd183ee4a62107413def345f7e6e3c01563dbca1417363cf957e/build-1.2.2.post1-py3-none-any.whl", hash = "sha256:1d61c0887fa860c01971625baae8bdd338e517b836a2f70dd1f7aa3a6b2fc5b5", size = 22950, upload-time = "2024-10-06T17:22:23.299Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "camb-sdk"
|
||||
version = "1.5.4"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "httpx" },
|
||||
{ name = "pydantic" },
|
||||
{ name = "typing-extensions" },
|
||||
{ name = "websocket-client" },
|
||||
{ name = "websockets" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/68/44/b4161f5da381f1b0d35a1abc7d69588ff24af6dacc9f13bccfb6f9889c46/camb_sdk-1.5.4.tar.gz", hash = "sha256:e882fb1b32aa45243ead5a558fed4e4e7ff8542af9f1802565a5fb4b7114fb5c", size = 83074, upload-time = "2026-01-12T20:19:34.318Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/d6/c9/f718b6028d6e457391d1a6a8f93ae6073055731d77279ef097734b84bdcb/camb_sdk-1.5.4-py3-none-any.whl", hash = "sha256:b9ad43fddc0d749dd522839a06136df89d109510f57e68102414a8c0078e496a", size = 152143, upload-time = "2026-01-12T20:19:31.769Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "cartesia"
|
||||
version = "2.0.17"
|
||||
@@ -1446,7 +1430,7 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "fastapi"
|
||||
version = "0.127.1"
|
||||
version = "0.121.3"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "annotated-doc" },
|
||||
@@ -1454,9 +1438,9 @@ dependencies = [
|
||||
{ name = "starlette" },
|
||||
{ name = "typing-extensions" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/96/8a/6b9ba6eb8ff3817caae83120495965d9e70afb4d6348cb120e464ee199f4/fastapi-0.127.1.tar.gz", hash = "sha256:946a87ee5d931883b562b6bada787d6c8178becee2683cb3f9b980d593206359", size = 391876, upload-time = "2025-12-26T13:04:47.075Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/80/f0/086c442c6516195786131b8ca70488c6ef11d2f2e33c9a893576b2b0d3f7/fastapi-0.121.3.tar.gz", hash = "sha256:0055bc24fe53e56a40e9e0ad1ae2baa81622c406e548e501e717634e2dfbc40b", size = 344501, upload-time = "2025-11-19T16:53:39.243Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/d2/f3/a6858d147ed2645c095d11dc2440f94a5f1cd8f4df888e3377e6b5281a0f/fastapi-0.127.1-py3-none-any.whl", hash = "sha256:31d670a4f9373cc6d7994420f98e4dc46ea693145207abc39696746c83a44430", size = 112332, upload-time = "2025-12-26T13:04:45.329Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/98/b6/4f620d7720fc0a754c8c1b7501d73777f6ba43b57c8ab99671f4d7441eb8/fastapi-0.121.3-py3-none-any.whl", hash = "sha256:0c78fc87587fcd910ca1bbf5bc8ba37b80e119b388a7206b39f0ecc95ebf53e9", size = 109801, upload-time = "2025-11-19T16:53:37.918Z" },
|
||||
]
|
||||
|
||||
[package.optional-dependencies]
|
||||
@@ -2013,7 +1997,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/32/6a/33d1702184d94106d3cdd7bfb788e19723206fce152e303473ca3b946c7b/greenlet-3.3.0-cp310-cp310-macosx_11_0_universal2.whl", hash = "sha256:6f8496d434d5cb2dce025773ba5597f71f5410ae499d5dd9533e0653258cdb3d", size = 273658, upload-time = "2025-12-04T14:23:37.494Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/d6/b7/2b5805bbf1907c26e434f4e448cd8b696a0b71725204fa21a211ff0c04a7/greenlet-3.3.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:b96dc7eef78fd404e022e165ec55327f935b9b52ff355b067eb4a0267fc1cffb", size = 574810, upload-time = "2025-12-04T14:50:04.154Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/94/38/343242ec12eddf3d8458c73f555c084359883d4ddc674240d9e61ec51fd6/greenlet-3.3.0-cp310-cp310-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:73631cd5cccbcfe63e3f9492aaa664d278fda0ce5c3d43aeda8e77317e38efbd", size = 586248, upload-time = "2025-12-04T14:57:39.35Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f0/d0/0ae86792fb212e4384041e0ef8e7bc66f59a54912ce407d26a966ed2914d/greenlet-3.3.0-cp310-cp310-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:b299a0cb979f5d7197442dccc3aee67fce53500cd88951b7e6c35575701c980b", size = 597403, upload-time = "2025-12-04T15:07:10.831Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/b6/a8/15d0aa26c0036a15d2659175af00954aaaa5d0d66ba538345bd88013b4d7/greenlet-3.3.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:7dee147740789a4632cace364816046e43310b59ff8fb79833ab043aefa72fd5", size = 586910, upload-time = "2025-12-04T14:25:59.705Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/e1/9b/68d5e3b7ccaba3907e5532cf8b9bf16f9ef5056a008f195a367db0ff32db/greenlet-3.3.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:39b28e339fc3c348427560494e28d8a6f3561c8d2bcf7d706e1c624ed8d822b9", size = 1547206, upload-time = "2025-12-04T15:04:21.027Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/66/bd/e3086ccedc61e49f91e2cfb5ffad9d8d62e5dc85e512a6200f096875b60c/greenlet-3.3.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:b3c374782c2935cc63b2a27ba8708471de4ad1abaa862ffdb1ef45a643ddbb7d", size = 1613359, upload-time = "2025-12-04T14:27:26.548Z" },
|
||||
@@ -2021,7 +2004,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/1f/cb/48e964c452ca2b92175a9b2dca037a553036cb053ba69e284650ce755f13/greenlet-3.3.0-cp311-cp311-macosx_11_0_universal2.whl", hash = "sha256:e29f3018580e8412d6aaf5641bb7745d38c85228dacf51a73bd4e26ddf2a6a8e", size = 274908, upload-time = "2025-12-04T14:23:26.435Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/28/da/38d7bff4d0277b594ec557f479d65272a893f1f2a716cad91efeb8680953/greenlet-3.3.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:a687205fb22794e838f947e2194c0566d3812966b41c78709554aa883183fb62", size = 577113, upload-time = "2025-12-04T14:50:05.493Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/3c/f2/89c5eb0faddc3ff014f1c04467d67dee0d1d334ab81fadbf3744847f8a8a/greenlet-3.3.0-cp311-cp311-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:4243050a88ba61842186cb9e63c7dfa677ec146160b0efd73b855a3d9c7fcf32", size = 590338, upload-time = "2025-12-04T14:57:41.136Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/80/d7/db0a5085035d05134f8c089643da2b44cc9b80647c39e93129c5ef170d8f/greenlet-3.3.0-cp311-cp311-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:670d0f94cd302d81796e37299bcd04b95d62403883b24225c6b5271466612f45", size = 601098, upload-time = "2025-12-04T15:07:11.898Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/dc/a6/e959a127b630a58e23529972dbc868c107f9d583b5a9f878fb858c46bc1a/greenlet-3.3.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:6cb3a8ec3db4a3b0eb8a3c25436c2d49e3505821802074969db017b87bc6a948", size = 590206, upload-time = "2025-12-04T14:26:01.254Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/48/60/29035719feb91798693023608447283b266b12efc576ed013dd9442364bb/greenlet-3.3.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:2de5a0b09eab81fc6a382791b995b1ccf2b172a9fec934747a7a23d2ff291794", size = 1550668, upload-time = "2025-12-04T15:04:22.439Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/0a/5f/783a23754b691bfa86bd72c3033aa107490deac9b2ef190837b860996c9f/greenlet-3.3.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:4449a736606bd30f27f8e1ff4678ee193bc47f6ca810d705981cfffd6ce0d8c5", size = 1615483, upload-time = "2025-12-04T14:27:28.083Z" },
|
||||
@@ -2029,7 +2011,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/f8/0a/a3871375c7b9727edaeeea994bfff7c63ff7804c9829c19309ba2e058807/greenlet-3.3.0-cp312-cp312-macosx_11_0_universal2.whl", hash = "sha256:b01548f6e0b9e9784a2c99c5651e5dc89ffcbe870bc5fb2e5ef864e9cc6b5dcb", size = 276379, upload-time = "2025-12-04T14:23:30.498Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/43/ab/7ebfe34dce8b87be0d11dae91acbf76f7b8246bf9d6b319c741f99fa59c6/greenlet-3.3.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:349345b770dc88f81506c6861d22a6ccd422207829d2c854ae2af8025af303e3", size = 597294, upload-time = "2025-12-04T14:50:06.847Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/a4/39/f1c8da50024feecd0793dbd5e08f526809b8ab5609224a2da40aad3a7641/greenlet-3.3.0-cp312-cp312-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:e8e18ed6995e9e2c0b4ed264d2cf89260ab3ac7e13555b8032b25a74c6d18655", size = 607742, upload-time = "2025-12-04T14:57:42.349Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/77/cb/43692bcd5f7a0da6ec0ec6d58ee7cddb606d055ce94a62ac9b1aa481e969/greenlet-3.3.0-cp312-cp312-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:c024b1e5696626890038e34f76140ed1daf858e37496d33f2af57f06189e70d7", size = 622297, upload-time = "2025-12-04T15:07:13.552Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/75/b0/6bde0b1011a60782108c01de5913c588cf51a839174538d266de15e4bf4d/greenlet-3.3.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:047ab3df20ede6a57c35c14bf5200fcf04039d50f908270d3f9a7a82064f543b", size = 609885, upload-time = "2025-12-04T14:26:02.368Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/49/0e/49b46ac39f931f59f987b7cd9f34bfec8ef81d2a1e6e00682f55be5de9f4/greenlet-3.3.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:2d9ad37fc657b1102ec880e637cccf20191581f75c64087a549e66c57e1ceb53", size = 1567424, upload-time = "2025-12-04T15:04:23.757Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/05/f5/49a9ac2dff7f10091935def9165c90236d8f175afb27cbed38fb1d61ab6b/greenlet-3.3.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:83cd0e36932e0e7f36a64b732a6f60c2fc2df28c351bae79fbaf4f8092fe7614", size = 1636017, upload-time = "2025-12-04T14:27:29.688Z" },
|
||||
@@ -2037,7 +2018,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/02/2f/28592176381b9ab2cafa12829ba7b472d177f3acc35d8fbcf3673d966fff/greenlet-3.3.0-cp313-cp313-macosx_11_0_universal2.whl", hash = "sha256:a1e41a81c7e2825822f4e068c48cb2196002362619e2d70b148f20a831c00739", size = 275140, upload-time = "2025-12-04T14:23:01.282Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/2c/80/fbe937bf81e9fca98c981fe499e59a3f45df2a04da0baa5c2be0dca0d329/greenlet-3.3.0-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:9f515a47d02da4d30caaa85b69474cec77b7929b2e936ff7fb853d42f4bf8808", size = 599219, upload-time = "2025-12-04T14:50:08.309Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/c2/ff/7c985128f0514271b8268476af89aee6866df5eec04ac17dcfbc676213df/greenlet-3.3.0-cp313-cp313-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:7d2d9fd66bfadf230b385fdc90426fcd6eb64db54b40c495b72ac0feb5766c54", size = 610211, upload-time = "2025-12-04T14:57:43.968Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/79/07/c47a82d881319ec18a4510bb30463ed6891f2ad2c1901ed5ec23d3de351f/greenlet-3.3.0-cp313-cp313-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:30a6e28487a790417d036088b3bcb3f3ac7d8babaa7d0139edbaddebf3af9492", size = 624311, upload-time = "2025-12-04T15:07:14.697Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/fd/8e/424b8c6e78bd9837d14ff7df01a9829fc883ba2ab4ea787d4f848435f23f/greenlet-3.3.0-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:087ea5e004437321508a8d6f20efc4cfec5e3c30118e1417ea96ed1d93950527", size = 612833, upload-time = "2025-12-04T14:26:03.669Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/b5/ba/56699ff9b7c76ca12f1cdc27a886d0f81f2189c3455ff9f65246780f713d/greenlet-3.3.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:ab97cf74045343f6c60a39913fa59710e4bd26a536ce7ab2397adf8b27e67c39", size = 1567256, upload-time = "2025-12-04T15:04:25.276Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/1e/37/f31136132967982d698c71a281a8901daf1a8fbab935dce7c0cf15f942cc/greenlet-3.3.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:5375d2e23184629112ca1ea89a53389dddbffcf417dad40125713d88eb5f96e8", size = 1636483, upload-time = "2025-12-04T14:27:30.804Z" },
|
||||
@@ -2045,7 +2025,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/d7/7c/f0a6d0ede2c7bf092d00bc83ad5bafb7e6ec9b4aab2fbdfa6f134dc73327/greenlet-3.3.0-cp314-cp314-macosx_11_0_universal2.whl", hash = "sha256:60c2ef0f578afb3c8d92ea07ad327f9a062547137afe91f38408f08aacab667f", size = 275671, upload-time = "2025-12-04T14:23:05.267Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/44/06/dac639ae1a50f5969d82d2e3dd9767d30d6dbdbab0e1a54010c8fe90263c/greenlet-3.3.0-cp314-cp314-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:0a5d554d0712ba1de0a6c94c640f7aeba3f85b3a6e1f2899c11c2c0428da9365", size = 646360, upload-time = "2025-12-04T14:50:10.026Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/e0/94/0fb76fe6c5369fba9bf98529ada6f4c3a1adf19e406a47332245ef0eb357/greenlet-3.3.0-cp314-cp314-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:3a898b1e9c5f7307ebbde4102908e6cbfcb9ea16284a3abe15cab996bee8b9b3", size = 658160, upload-time = "2025-12-04T14:57:45.41Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/93/79/d2c70cae6e823fac36c3bbc9077962105052b7ef81db2f01ec3b9bf17e2b/greenlet-3.3.0-cp314-cp314-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:dcd2bdbd444ff340e8d6bdf54d2f206ccddbb3ccfdcd3c25bf4afaa7b8f0cf45", size = 671388, upload-time = "2025-12-04T15:07:15.789Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/b8/14/bab308fc2c1b5228c3224ec2bf928ce2e4d21d8046c161e44a2012b5203e/greenlet-3.3.0-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5773edda4dc00e173820722711d043799d3adb4f01731f40619e07ea2750b955", size = 660166, upload-time = "2025-12-04T14:26:05.099Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/4b/d2/91465d39164eaa0085177f61983d80ffe746c5a1860f009811d498e7259c/greenlet-3.3.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:ac0549373982b36d5fd5d30beb8a7a33ee541ff98d2b502714a09f1169f31b55", size = 1615193, upload-time = "2025-12-04T15:04:27.041Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/42/1b/83d110a37044b92423084d52d5d5a3b3a73cafb51b547e6d7366ff62eff1/greenlet-3.3.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:d198d2d977460358c3b3a4dc844f875d1adb33817f0613f663a656f463764ccc", size = 1683653, upload-time = "2025-12-04T14:27:32.366Z" },
|
||||
@@ -2053,7 +2032,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/a0/66/bd6317bc5932accf351fc19f177ffba53712a202f9df10587da8df257c7e/greenlet-3.3.0-cp314-cp314t-macosx_11_0_universal2.whl", hash = "sha256:d6ed6f85fae6cdfdb9ce04c9bf7a08d666cfcfb914e7d006f44f840b46741931", size = 282638, upload-time = "2025-12-04T14:25:20.941Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/30/cf/cc81cb030b40e738d6e69502ccbd0dd1bced0588e958f9e757945de24404/greenlet-3.3.0-cp314-cp314t-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:d9125050fcf24554e69c4cacb086b87b3b55dc395a8b3ebe6487b045b2614388", size = 651145, upload-time = "2025-12-04T14:50:11.039Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/9c/ea/1020037b5ecfe95ca7df8d8549959baceb8186031da83d5ecceff8b08cd2/greenlet-3.3.0-cp314-cp314t-manylinux_2_24_ppc64le.manylinux_2_28_ppc64le.whl", hash = "sha256:87e63ccfa13c0a0f6234ed0add552af24cc67dd886731f2261e46e241608bee3", size = 654236, upload-time = "2025-12-04T14:57:47.007Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/69/cc/1e4bae2e45ca2fa55299f4e85854606a78ecc37fead20d69322f96000504/greenlet-3.3.0-cp314-cp314t-manylinux_2_24_s390x.manylinux_2_28_s390x.whl", hash = "sha256:2662433acbca297c9153a4023fe2161c8dcfdcc91f10433171cf7e7d94ba2221", size = 662506, upload-time = "2025-12-04T15:07:16.906Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/57/b9/f8025d71a6085c441a7eaff0fd928bbb275a6633773667023d19179fe815/greenlet-3.3.0-cp314-cp314t-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:3c6e9b9c1527a78520357de498b0e709fb9e2f49c3a513afd5a249007261911b", size = 653783, upload-time = "2025-12-04T14:26:06.225Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/f6/c7/876a8c7a7485d5d6b5c6821201d542ef28be645aa024cfe1145b35c120c1/greenlet-3.3.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:286d093f95ec98fdd92fcb955003b8a3d054b4e2cab3e2707a5039e7b50520fd", size = 1614857, upload-time = "2025-12-04T15:04:28.484Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/4f/dc/041be1dff9f23dac5f48a43323cd0789cb798342011c19a248d9c9335536/greenlet-3.3.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:6c10513330af5b8ae16f023e8ddbfb486ab355d04467c4679c5cfe4659975dd9", size = 1676034, upload-time = "2025-12-04T14:27:33.531Z" },
|
||||
@@ -4281,9 +4259,6 @@ aws-nova-sonic = [
|
||||
azure = [
|
||||
{ name = "azure-cognitiveservices-speech" },
|
||||
]
|
||||
camb = [
|
||||
{ name = "camb-sdk" },
|
||||
]
|
||||
cartesia = [
|
||||
{ name = "cartesia" },
|
||||
{ name = "websockets" },
|
||||
@@ -4505,7 +4480,6 @@ requires-dist = [
|
||||
{ name = "aws-sdk-bedrock-runtime", marker = "python_full_version >= '3.12' and extra == 'aws-nova-sonic'", specifier = "~=0.2.0" },
|
||||
{ name = "aws-sdk-sagemaker-runtime-http2", marker = "python_full_version >= '3.12' and extra == 'sagemaker'" },
|
||||
{ name = "azure-cognitiveservices-speech", marker = "extra == 'azure'", specifier = "~=1.44.0" },
|
||||
{ name = "camb-sdk", marker = "extra == 'camb'", specifier = ">=1.5.4" },
|
||||
{ name = "cartesia", marker = "extra == 'cartesia'", specifier = "~=2.0.3" },
|
||||
{ name = "coremltools", marker = "extra == 'local-smart-turn'", specifier = ">=8.0" },
|
||||
{ name = "daily-python", marker = "extra == 'daily'", specifier = "~=0.23.0" },
|
||||
@@ -4513,8 +4487,8 @@ requires-dist = [
|
||||
{ name = "docstring-parser", specifier = "~=0.16" },
|
||||
{ name = "einops", marker = "extra == 'moondream'", specifier = "~=0.8.0" },
|
||||
{ name = "fal-client", marker = "extra == 'fal'", specifier = "~=0.5.9" },
|
||||
{ name = "fastapi", marker = "extra == 'runner'", specifier = ">=0.115.6,<0.128.0" },
|
||||
{ name = "fastapi", marker = "extra == 'websocket'", specifier = ">=0.115.6,<0.128.0" },
|
||||
{ name = "fastapi", marker = "extra == 'runner'", specifier = ">=0.115.6,<0.122.0" },
|
||||
{ name = "fastapi", marker = "extra == 'websocket'", specifier = ">=0.115.6,<0.122.0" },
|
||||
{ name = "faster-whisper", marker = "extra == 'whisper'", specifier = "~=1.1.1" },
|
||||
{ name = "google-cloud-speech", marker = "extra == 'google'", specifier = ">=2.33.0,<3" },
|
||||
{ name = "google-cloud-texttospeech", marker = "extra == 'google'", specifier = ">=2.31.0,<3" },
|
||||
@@ -4599,7 +4573,7 @@ requires-dist = [
|
||||
{ name = "wait-for2", marker = "python_full_version < '3.12'", specifier = ">=0.4.1" },
|
||||
{ name = "websockets", marker = "extra == 'websockets-base'", specifier = ">=13.1,<16.0" },
|
||||
]
|
||||
provides-extras = ["aic", "anthropic", "assemblyai", "asyncai", "aws", "aws-nova-sonic", "azure", "cartesia", "camb", "cerebras", "daily", "deepgram", "deepseek", "elevenlabs", "fal", "fireworks", "fish", "gladia", "google", "gradium", "grok", "groq", "gstreamer", "heygen", "hume", "inworld", "koala", "krisp", "langchain", "livekit", "lmnt", "local", "local-smart-turn", "local-smart-turn-v3", "mcp", "mem0", "mistral", "mlx-whisper", "moondream", "neuphonic", "noisereduce", "nvidia", "openai", "rnnoise", "openpipe", "openrouter", "perplexity", "playht", "qwen", "remote-smart-turn", "rime", "riva", "runner", "sagemaker", "sambanova", "sarvam", "sentry", "silero", "simli", "soniox", "soundfile", "speechmatics", "strands", "tavus", "together", "tracing", "ultravox", "webrtc", "websocket", "websockets-base", "whisper"]
|
||||
provides-extras = ["aic", "anthropic", "assemblyai", "asyncai", "aws", "aws-nova-sonic", "azure", "cartesia", "cerebras", "daily", "deepgram", "deepseek", "elevenlabs", "fal", "fireworks", "fish", "gladia", "google", "gradium", "grok", "groq", "gstreamer", "heygen", "hume", "inworld", "koala", "krisp", "langchain", "livekit", "lmnt", "local", "local-smart-turn", "local-smart-turn-v3", "mcp", "mem0", "mistral", "mlx-whisper", "moondream", "neuphonic", "noisereduce", "nvidia", "openai", "rnnoise", "openpipe", "openrouter", "perplexity", "playht", "qwen", "remote-smart-turn", "rime", "riva", "runner", "sagemaker", "sambanova", "sarvam", "sentry", "silero", "simli", "soniox", "soundfile", "speechmatics", "strands", "tavus", "together", "tracing", "ultravox", "webrtc", "websocket", "websockets-base", "whisper"]
|
||||
|
||||
[package.metadata.requires-dev]
|
||||
dev = [
|
||||
@@ -7584,15 +7558,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/6e/d4/ed38dd3b1767193de971e694aa544356e63353c33a85d948166b5ff58b9e/watchfiles-1.1.1-pp311-pypy311_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3e6f39af2eab0118338902798b5aa6664f46ff66bc0280de76fca67a7f262a49", size = 457546, upload-time = "2025-10-14T15:06:13.372Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "websocket-client"
|
||||
version = "1.9.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/2c/41/aa4bf9664e4cda14c3b39865b12251e8e7d239f4cd0e3cc1b6c2ccde25c1/websocket_client-1.9.0.tar.gz", hash = "sha256:9e813624b6eb619999a97dc7958469217c3176312b3a16a4bd1bc7e08a46ec98", size = 70576, upload-time = "2025-10-07T21:16:36.495Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/34/db/b10e48aa8fff7407e67470363eac595018441cf32d5e1001567a7aeba5d2/websocket_client-1.9.0-py3-none-any.whl", hash = "sha256:af248a825037ef591efbf6ed20cc5faa03d3b47b9e5a2230a529eeee1c1fc3ef", size = 82616, upload-time = "2025-10-07T21:16:34.951Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "websockets"
|
||||
version = "13.1"
|
||||
|
||||
Reference in New Issue
Block a user