Compare commits
24 Commits
fix/eleven
...
hush/TurnT
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
8bbfa829d3 | ||
|
|
c2eb663bdc | ||
|
|
bf055843e6 | ||
|
|
2607699664 | ||
|
|
47fa3b8556 | ||
|
|
fa0100c38b | ||
|
|
e5142c1210 | ||
|
|
5907b51c7d | ||
|
|
9e4ec4f7f3 | ||
|
|
e2161ea63d | ||
|
|
7c81f66241 | ||
|
|
60da466379 | ||
|
|
12c29b71f3 | ||
|
|
b52b108932 | ||
|
|
a357ff0205 | ||
|
|
0ece8b5894 | ||
|
|
782b257bbb | ||
|
|
ab8dcd6ede | ||
|
|
012c2f7dde | ||
|
|
87fdd8f006 | ||
|
|
7bdac02837 | ||
|
|
861567bc59 | ||
|
|
d0ff43134a | ||
|
|
ec8964425a |
335
CHANGELOG.md
335
CHANGELOG.md
@@ -7,156 +7,252 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
|
||||
## [Unreleased]
|
||||
|
||||
### Fixed
|
||||
|
||||
- Fixed an issue in `ElevenLabsRealtimeSTTService` where dynamic language updates were not working.
|
||||
|
||||
### Added
|
||||
|
||||
- Added `LiveKitRESTHelper` utility class for managing LiveKit rooms via REST API.
|
||||
|
||||
- Added `DeepgramSageMakerSTTService` which connects to a SageMaker hosted
|
||||
Deepgram STT model. Added `07c-interruptible-deepgram-sagemaker.py`
|
||||
foundational example.
|
||||
|
||||
- Added `SageMakerBidiClient` to connect to SageMaker hosted BiDi compatible
|
||||
services.
|
||||
|
||||
- Added support for `include_timestamps` and `enable_logging` in
|
||||
`ElevenLabsRealtimeSTTService`. When `include_timestamps` is enabled,
|
||||
timestamp data is included in the `TranscriptionFrame`'s `result`
|
||||
parameter.
|
||||
|
||||
- Added optional speaking rate control to `InworldTTSService`.
|
||||
|
||||
- Introduced a new `AggregatedTextFrame` type to support passing text along with an
|
||||
`aggregated_by` field to describe the type of text included. `TTSTextFrame`s now
|
||||
inherit from `AggregatedTextFrame`. With this inheritance, an observer can watch for
|
||||
`AggregatedTextFrame`s to accumlate the perceived output and determine whether or not
|
||||
the text was spoken based on if that frame is also a `TTSTextFrame`.
|
||||
- Introduced a new `AggregatedTextFrame` type to support passing text along with
|
||||
an `aggregated_by` field to describe the type of text
|
||||
included. `TTSTextFrame`s now inherit from `AggregatedTextFrame`. With this
|
||||
inheritance, an observer can watch for `AggregatedTextFrame`s to accumlate the
|
||||
perceived output and determine whether or not the text was spoken based on if
|
||||
that frame is also a `TTSTextFrame`.
|
||||
|
||||
With this frame, the llm token stream can be transformed into custom composable
|
||||
chunks, allowing for aggregation outside the TTS service. This makes it possible to
|
||||
listen for or handle those aggregations and sets the stage for doing things like
|
||||
composing a best effort of the perceived llm output in a more digestable form and
|
||||
to do so whether or not it is processed by a TTS or if even a TTS exists.
|
||||
With this frame, the llm token stream can be transformed into custom
|
||||
composable chunks, allowing for aggregation outside the TTS service. This
|
||||
makes it possible to listen for or handle those aggregations and sets the
|
||||
stage for doing things like composing a best effort of the perceived llm
|
||||
output in a more digestable form and to do so whether or not it is processed
|
||||
by a TTS or if even a TTS exists.
|
||||
|
||||
- Introduced `LLMTextProcessor`: A new processor meant to allow customization for how
|
||||
LLMTextFrames should be aggregated and considered. It's purpose is to turn
|
||||
`LLMTextFrame`s into `AggregatedTextFrame`s. By default, a TTSService will still
|
||||
aggregate `LLMTextFrame`s by sentence for the service to consume. However, if you
|
||||
wish to override how the llm text is aggregated, you should no longer override the
|
||||
TTS's internal text_aggregator, but instead, insert this processor between your LLM
|
||||
and TTS in the pipeline.
|
||||
- Introduced `LLMTextProcessor`: A new processor meant to allow customization
|
||||
for how LLMTextFrames should be aggregated and considered. It's purpose is to
|
||||
turn `LLMTextFrame`s into `AggregatedTextFrame`s. By default, a TTSService
|
||||
will still aggregate `LLMTextFrame`s by sentence for the service to
|
||||
consume. However, if you wish to override how the llm text is aggregated, you
|
||||
should no longer override the TTS's internal text_aggregator, but instead,
|
||||
insert this processor between your LLM and TTS in the pipeline.
|
||||
|
||||
- New `bot-output` RTVI message to represent what the bot actually "says".
|
||||
- The `RTVIObserver` now emits `bot-output` messages based off the new `AggregatedTextFrame`s
|
||||
(`bot-tts-text` and `bot-llm-text` are still supported and generated, but `bot-transcript` is
|
||||
now deprecated in lieu of this new, more thorough, message).
|
||||
|
||||
- The `RTVIObserver` now emits `bot-output` messages based off the new
|
||||
`AggregatedTextFrame`s (`bot-tts-text` and `bot-llm-text` are still
|
||||
supported and generated, but `bot-transcript` is now deprecated in lieu of
|
||||
this new, more thorough, message).
|
||||
|
||||
- The new `RTVIBotOutputMessage` includes the fields:
|
||||
|
||||
- `spoken`: A boolean indicating whether the text was spoken by TTS
|
||||
- `aggregated_by`: A string representing how the text was aggregated ("sentence", "word",
|
||||
"my custom aggregation")
|
||||
- Introduced new fields to `RTVIObserver` to support the new `bot-output` messaging:
|
||||
- `bot_output_enabled`: Defaults to True. Set to false to disable bot-output messages.
|
||||
- `skip_aggregator_types`: Defaults to `None`. Set to a list of strings that match
|
||||
aggregation types that should not be included in bot-output messages. (Ex. `credit_card`)
|
||||
- Introduced new methods, `add_text_transformer()` and `remove_text_transformer()`, to
|
||||
`RTVIObserver` to support providing (and subsequently removing) callbacks for various types of
|
||||
aggregations (or all aggregations with `*`) that can modify the text before being sent as a
|
||||
`bot-output` or `tts-text` message. (Think obscuring the credit card or inserting extra detail
|
||||
the client might want that the context doesn't need.)
|
||||
|
||||
- `aggregated_by`: A string representing how the text was aggregated
|
||||
("sentence", "word", "my custom aggregation")
|
||||
|
||||
- Introduced new fields to `RTVIObserver` to support the new `bot-output`
|
||||
messaging:
|
||||
|
||||
- `bot_output_enabled`: Defaults to True. Set to false to disable bot-output
|
||||
messages.
|
||||
|
||||
- `skip_aggregator_types`: Defaults to `None`. Set to a list of strings that
|
||||
match aggregation types that should not be included in bot-output
|
||||
messages. (Ex. `credit_card`)
|
||||
|
||||
- Introduced new methods, `add_text_transformer()` and
|
||||
`remove_text_transformer()`, to `RTVIObserver` to support providing (and
|
||||
subsequently removing) callbacks for various types of aggregations (or all
|
||||
aggregations with `*`) that can modify the text before being sent as a
|
||||
`bot-output` or `tts-text` message. (Think obscuring the credit card or
|
||||
inserting extra detail the client might want that the context doesn't need.)
|
||||
|
||||
- In `MiniMaxHttpTTSService`:
|
||||
|
||||
- Added support for speech-2.6-hd and speech-2.6-turbo models
|
||||
|
||||
- Added languages: Afrikaans, Bulgarian, Catalan, Danish, Persian, Filipino,
|
||||
Hebrew, Croatian, Hungarian, Malay, Norwegian, Nynorsk, Slovak, Slovenian,
|
||||
Swedish, and Tamil
|
||||
|
||||
- Added new emotions: calm and fluent
|
||||
|
||||
### Changed
|
||||
|
||||
- Updated `daily-python` to 0.22.0.
|
||||
|
||||
- `BaseTextAggregator` changes:
|
||||
Modified the BaseTextAggregator type so that when text gets aggregated, metadata can
|
||||
be associated with it. Currently, that just means a `type`, so that the aggregation
|
||||
can be classified or described. Changes made to support this:
|
||||
- ⚠️ IMPORTANT: Aggregators are now expected to strip leading/trailing white space
|
||||
characters before returning their aggregation from `aggregation()` or `.text`. This
|
||||
way all aggregators have a consistent contract allowing downstream use to know how
|
||||
to stitch aggregations back together.
|
||||
- Introduced a new `Aggregation` dataclass to represent both the aggregated `text` and
|
||||
a string identifying the `type` of aggregation (ex. "sentence", "word", "my custom
|
||||
aggregation")
|
||||
- ⚠️ Breaking change: `BaseTextAggregator.text` now returns an `Aggregation` (instead of `str`).
|
||||
To update: `aggregated_text = myAggregator.text` -> `aggregated_text = myAggregator.text.text`
|
||||
- ⚠️ Breaking change: `BaseTextAggregator.aggregate()` now returns `Optional[Aggregation]`
|
||||
(instead of `Optional[str]`). To update:
|
||||
```
|
||||
aggregation = myAggregator.aggregate(text)
|
||||
if (aggregation):
|
||||
print(f"successfully aggregated text: {aggregation.text}") // instead of {aggregation}
|
||||
```
|
||||
- `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator` updated to
|
||||
produce/consume `Aggregation` objects.
|
||||
|
||||
Modified the BaseTextAggregator type so that when text gets aggregated,
|
||||
metadata can be associated with it. Currently, that just means a `type`, so
|
||||
that the aggregation can be classified or described. Changes made to support
|
||||
this:
|
||||
|
||||
- ⚠️ IMPORTANT: Aggregators are now expected to strip leading/trailing white
|
||||
space characters before returning their aggregation from `aggregation()` or
|
||||
`.text`. This way all aggregators have a consistent contract allowing
|
||||
downstream use to know how to stitch aggregations back together.
|
||||
|
||||
- Introduced a new `Aggregation` dataclass to represent both the aggregated
|
||||
`text` and a string identifying the `type` of aggregation (ex. "sentence",
|
||||
"word", "my custom aggregation")
|
||||
|
||||
- ⚠️ Breaking change: `BaseTextAggregator.text` now returns an `Aggregation`
|
||||
(instead of `str`).
|
||||
|
||||
Before:
|
||||
|
||||
```python
|
||||
aggregated_text = myAggregator.text
|
||||
```
|
||||
|
||||
Now:
|
||||
|
||||
```python
|
||||
aggregated_text = myAggregator.text.text
|
||||
```
|
||||
|
||||
- ⚠️ Breaking change: `BaseTextAggregator.aggregate()` now returns
|
||||
`Optional[Aggregation]` (instead of `Optional[str]`).
|
||||
|
||||
Before:
|
||||
|
||||
```python
|
||||
aggregation = myAggregator.aggregate(text)
|
||||
print(f"successfully aggregated text: {aggregation}")
|
||||
```
|
||||
|
||||
Now:
|
||||
|
||||
```python
|
||||
aggregation = myAggregator.aggregate(text)
|
||||
if aggregation:
|
||||
print(f"successfully aggregated text: {aggregation.text}")
|
||||
```
|
||||
|
||||
- `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator`
|
||||
updated to produce/consume `Aggregation` objects.
|
||||
|
||||
- All uses of the above Aggregators have been updated accordingly.
|
||||
|
||||
- Augmented the `PatternPairAggregator` so that matched patterns can be treated as their own
|
||||
aggregation, taking advantage of the new. To that end:
|
||||
- Introduced a new, preferred version of `add_pattern` to support a new option for treating a
|
||||
match as a separate aggregation returned from `aggregate()`. This replaces the now
|
||||
deprecated `add_pattern_pair` method and you provide a `MatchAction` in lieu of the `remove_match` field.
|
||||
- `MatchAction` enum: `REMOVE`, `KEEP`, `AGGREGATE`, allowing customization for how
|
||||
a match should be handled.
|
||||
- `REMOVE`: The text along with its delimiters will be removed from the streaming text.
|
||||
Sentence aggregation will continue on as if this text did not exist.
|
||||
- `KEEP`: The delimiters will be removed, but the content between them will be kept.
|
||||
Sentence aggregation will continue on with the internal text included.
|
||||
- `AGGREGATE`: The delimiters will be removed and the content between will be treated
|
||||
as a separate aggregation. Any text before the start of the pattern will be
|
||||
returned early, whether or not a complete sentence was found. Then the pattern
|
||||
will be returned. Then the aggregation will continue on sentence matching after
|
||||
the closing delimiter is found. The content between the delimiters is not
|
||||
aggregated by sentence. It is aggregated as one single block of text.
|
||||
- `PatternMatch` now extends `Aggregation` and provides richer info to handlers.
|
||||
- ⚠️ Breaking change: The `PatternMatch` type returned to handlers registered via `on_pattern_match`
|
||||
has been updated to subclass from the new `Aggregation` type, which means that `content`
|
||||
has been replaced with `text` and `pattern_id` has been replaced with `type`:
|
||||
```
|
||||
async dev on_match_tag(match: PatternMatch):
|
||||
pattern = match.type # instead of match.pattern_id
|
||||
text = match.text # instead of match.content
|
||||
```
|
||||
- Augmented the `PatternPairAggregator` so that matched patterns can be treated
|
||||
as their own aggregation, taking advantage of the new. To that end:
|
||||
|
||||
- `TextFrame` now includes the field `append_to_context` to support setting whether or not the
|
||||
encompassing text should be added to the LLM context (by the LLM assistant aggregator). It
|
||||
defaults to `True`.
|
||||
- Introduced a new, preferred version of `add_pattern` to support a new option
|
||||
for treating a match as a separate aggregation returned from
|
||||
`aggregate()`. This replaces the now deprecated `add_pattern_pair` method
|
||||
and you provide a `MatchAction` in lieu of the `remove_match` field.
|
||||
|
||||
- `MatchAction` enum: `REMOVE`, `KEEP`, `AGGREGATE`, allowing customization
|
||||
for how a match should be handled.
|
||||
|
||||
- `REMOVE`: The text along with its delimiters will be removed from the
|
||||
streaming text. Sentence aggregation will continue on as if this text
|
||||
did not exist.
|
||||
|
||||
- `KEEP`: The delimiters will be removed, but the content between them
|
||||
will be kept. Sentence aggregation will continue on with the internal
|
||||
text included.
|
||||
|
||||
- `AGGREGATE`: The delimiters will be removed and the content between will
|
||||
be treated as a separate aggregation. Any text before the start of the
|
||||
pattern will be returned early, whether or not a complete sentence was
|
||||
found. Then the pattern will be returned. Then the aggregation will
|
||||
continue on sentence matching after the closing delimiter is found. The
|
||||
content between the delimiters is not aggregated by sentence. It is
|
||||
aggregated as one single block of text.
|
||||
|
||||
- `PatternMatch` now extends `Aggregation` and provides richer info to
|
||||
handlers.
|
||||
|
||||
- ⚠️ Breaking change: The `PatternMatch` type returned to handlers registered
|
||||
via `on_pattern_match` has been updated to subclass from the new
|
||||
`Aggregation` type, which means that `content` has been replaced with
|
||||
`text` and `pattern_id` has been replaced with `type`:
|
||||
|
||||
```python
|
||||
async dev on_match_tag(match: PatternMatch):
|
||||
pattern = match.type # instead of match.pattern_id
|
||||
text = match.text # instead of match.content
|
||||
```
|
||||
|
||||
- `TextFrame` now includes the field `append_to_context` to support setting
|
||||
whether or not the encompassing text should be added to the LLM context (by
|
||||
the LLM assistant aggregator). It defaults to `True`.
|
||||
|
||||
- `TTSService` base class updates:
|
||||
- `TTSService`s now accept a new `skip_aggregator_types` to avoid speaking certain aggregation
|
||||
types (now determined/returned by the aggregator)
|
||||
- Introduced the ability to do a just-in-time transform of text before it gets sent to the
|
||||
TTS service via callbacks you can set up via a new init field, `text_transforms` or a new
|
||||
method `add_text_transformer()`. This makes it possible to do things like introduce
|
||||
TTS-specific tags for spelling or emotion or change the pronunciation of something on the
|
||||
fly. `remove_text_transformer` has also been added to support removing a registered
|
||||
transform callback.
|
||||
- TTS services push `AggregatedTextFrame` in addition to `TTSTextFrame`s when either an
|
||||
aggregation occurs that should not be spoken or when the TTS service supports word-by-word
|
||||
timestamping. In the latter case, the `TTSService` preliminarily generates an
|
||||
`AggregatedTextFrame`, aggregated by sentence to generate the full sentence content as early
|
||||
as possible.
|
||||
|
||||
- `TTSService`s now accept a new `skip_aggregator_types` to avoid speaking
|
||||
certain aggregation types (now determined/returned by the aggregator)
|
||||
|
||||
- Introduced the ability to do a just-in-time transform of text before it gets
|
||||
sent to the TTS service via callbacks you can set up via a new init field,
|
||||
`text_transforms` or a new method `add_text_transformer()`. This makes it
|
||||
possible to do things like introduce TTS-specific tags for spelling or
|
||||
emotion or change the pronunciation of something on the
|
||||
fly. `remove_text_transformer` has also been added to support removing a
|
||||
registered transform callback.
|
||||
|
||||
- TTS services push `AggregatedTextFrame` in addition to `TTSTextFrame`s when
|
||||
either an aggregation occurs that should not be spoken or when the TTS
|
||||
service supports word-by-word timestamping. In the latter case, the
|
||||
`TTSService` preliminarily generates an `AggregatedTextFrame`, aggregated by
|
||||
sentence to generate the full sentence content as early as possible.
|
||||
|
||||
- Updated `CartesiaTTSService`:
|
||||
- Modified use of custom default text_aggregator to avoid deprecation warnings and push users
|
||||
towards use of transformers or the `LLMTextProcessor`
|
||||
- Added convenience methods for taking advantage of Cartesia's SSML tags: spell, emotion,
|
||||
pauses, volume, and speed.
|
||||
|
||||
- Modified use of custom default text_aggregator to avoid deprecation warnings
|
||||
and push users towards use of transformers or the `LLMTextProcessor`
|
||||
|
||||
- Added convenience methods for taking advantage of Cartesia's SSML tags:
|
||||
spell, emotion, pauses, volume, and speed.
|
||||
|
||||
- Updated `RimeTTSService`:
|
||||
- Modified use of custom default text_aggregator to avoid deprecation warnings and push users
|
||||
towards use of transformers or the `LLMTextProcessor`
|
||||
- Added convenience methods for taking advantage of Rime's customization options: spell,
|
||||
pauses, pronunciations, and inline speed control.
|
||||
|
||||
- Modified use of custom default text_aggregator to avoid deprecation warnings
|
||||
and push users towards use of transformers or the `LLMTextProcessor`
|
||||
|
||||
- Added convenience methods for taking advantage of Rime's customization
|
||||
options: spell, pauses, pronunciations, and inline speed control.
|
||||
|
||||
### Deprecated
|
||||
|
||||
- The TTS constructor field, `text_aggregator` is deprecated in favor of the new
|
||||
`LLMTextProcessor`. TTSServices still have an internal aggregator for support of default
|
||||
behavior, but if you want to override the aggregation behavior, you should use the new
|
||||
processor.
|
||||
`LLMTextProcessor`. TTSServices still have an internal aggregator for support
|
||||
of default behavior, but if you want to override the aggregation behavior, you
|
||||
should use the new processor.
|
||||
|
||||
- The RTVI `bot-transcription` event is deprecated in favor of the new `bot-output`
|
||||
message which is the canonical representation of bot output (spoken or not). The code
|
||||
still emits a transcription message for backwards compatibility while transition occurs.
|
||||
- The RTVI `bot-transcription` event is deprecated in favor of the new
|
||||
`bot-output` message which is the canonical representation of bot output
|
||||
(spoken or not). The code still emits a transcription message for backwards
|
||||
compatibility while transition occurs.
|
||||
|
||||
- Deprecated `add_pattern_pair` in the `PatternPairAggregator` which takes a `pattern_id`
|
||||
and `remove_match` field in favor of the new `add_pattern` method which takes a `type` and an
|
||||
`action`
|
||||
- Deprecated `add_pattern_pair` in the `PatternPairAggregator` which takes a
|
||||
`pattern_id` and `remove_match` field in favor of the new `add_pattern` method
|
||||
which takes a `type` and an `action`
|
||||
|
||||
- `english_normalization` input parameter for `MiniMaxHttpTTSService` is
|
||||
deprecated, use `test_normalization` instead.
|
||||
|
||||
### Fixed
|
||||
|
||||
- Fixed an issue in `ElevenLabsRealtimeSTTService` where dynamic language
|
||||
updates were not working.
|
||||
|
||||
- Fixed an issue in `ElevenLabsRealtimeSTTService` where setting the sample
|
||||
rate would result in transcripts failing.
|
||||
|
||||
- Fixed `InworldTTSService` audio config payload to use camelCase keys expected
|
||||
by the Inworld API.
|
||||
|
||||
@@ -218,20 +314,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
- Updated language mappings for the Google and Gemini TTS services to match
|
||||
official documentation.
|
||||
|
||||
- In `MiniMaxHttpTTSService`:
|
||||
-- Added support for speech-2.6-hd and speech-2.6-turbo models
|
||||
-- Added languages: Afrikaans, Bulgarian, Catalan, Danish, Persian, Filipino, Hebrew,
|
||||
Croatian, Hungarian, Malay, Norwegian, Nynorsk, Slovak, Slovenian, Swedish, and Tamil
|
||||
-- Added new emotions: calm and fluent
|
||||
|
||||
### Deprecated
|
||||
|
||||
- The `api_key` parameter in `GeminiTTSService` is deprecated. Use
|
||||
`credentials` or `credentials_path` instead for Google Cloud authentication.
|
||||
|
||||
- `english_normalization` input parameter for `MiniMaxHttpTTSService` is deprecated,
|
||||
use `test_normalization` instead.
|
||||
|
||||
### Fixed
|
||||
|
||||
- Fixed a `SimliVideoService` connection issue.
|
||||
|
||||
103
docs/TURN_AWARE_TRANSCRIPT_PROCESSOR.md
Normal file
103
docs/TURN_AWARE_TRANSCRIPT_PROCESSOR.md
Normal file
@@ -0,0 +1,103 @@
|
||||
# TurnAwareTranscriptProcessor Example
|
||||
|
||||
## Overview
|
||||
|
||||
The `TurnAwareTranscriptProcessor` combines user and assistant transcript tracking with turn boundary detection. It correctly handles interruptions by only capturing what was actually spoken.
|
||||
|
||||
## Basic Usage
|
||||
|
||||
```python
|
||||
from pipecat.processors.transcript_processor import TurnAwareTranscriptProcessor
|
||||
|
||||
# Create the processor
|
||||
turn_processor = TurnAwareTranscriptProcessor()
|
||||
|
||||
# Register event handlers
|
||||
@turn_processor.event_handler("on_turn_started")
|
||||
async def handle_turn_started(processor, turn_number):
|
||||
print(f"Turn {turn_number} started")
|
||||
|
||||
@turn_processor.event_handler("on_turn_ended")
|
||||
async def handle_turn_ended(processor, turn_number, user_text, assistant_text, was_interrupted):
|
||||
print(f"\nTurn {turn_number} ended:")
|
||||
print(f" User said: {user_text}")
|
||||
print(f" Assistant said: {assistant_text}")
|
||||
print(f" Was interrupted: {was_interrupted}")
|
||||
|
||||
@turn_processor.event_handler("on_transcript_update")
|
||||
async def handle_transcript_update(processor, frame):
|
||||
for msg in frame.messages:
|
||||
print(f"[{msg.role}]: {msg.content}")
|
||||
|
||||
# Add to pipeline
|
||||
pipeline = Pipeline([
|
||||
transport.input(),
|
||||
stt,
|
||||
turn_processor, # Process transcripts and track turns
|
||||
context_aggregator.user(),
|
||||
llm,
|
||||
tts,
|
||||
transport.output(),
|
||||
context_aggregator.assistant(),
|
||||
])
|
||||
```
|
||||
|
||||
## Features
|
||||
|
||||
1. **Turn Boundary Detection**: Automatically detects when turns start and end based on user and bot speaking patterns
|
||||
2. **Interruption Handling**: Correctly captures only what was actually spoken when interruptions occur
|
||||
3. **Real-time Transcripts**: Emits transcript messages for both user and assistant speech
|
||||
4. **Turn Events**: Provides start/end events with accumulated transcripts for each turn
|
||||
|
||||
## Events
|
||||
|
||||
### on_turn_started
|
||||
Emitted when a new turn begins (user starts speaking).
|
||||
|
||||
**Handler signature**: `async def handler(processor, turn_number)`
|
||||
|
||||
### on_turn_ended
|
||||
Emitted when a turn ends with accumulated transcripts.
|
||||
|
||||
**Handler signature**: `async def handler(processor, turn_number, user_transcript, assistant_transcript, was_interrupted)`
|
||||
|
||||
### on_transcript_update
|
||||
Inherited from `BaseTranscriptProcessor`, emitted for individual transcript messages.
|
||||
|
||||
**Handler signature**: `async def handler(processor, frame)`
|
||||
|
||||
## Turn Logic
|
||||
|
||||
- Turns start when the user begins speaking (`UserStartedSpeakingFrame`)
|
||||
- Turns end when:
|
||||
- The user starts speaking again (previous turn ends, new turn starts)
|
||||
- The bot is interrupted (`InterruptionFrame`)
|
||||
- The pipeline ends (`EndFrame`/`CancelFrame`)
|
||||
|
||||
## Integration with OpenTelemetry
|
||||
|
||||
You can use turn events to enrich OpenTelemetry spans:
|
||||
|
||||
```python
|
||||
from pipecat.utils.tracing.turn_trace_observer import TurnTraceObserver
|
||||
|
||||
turn_tracker = TurnTrackingObserver()
|
||||
turn_tracer = TurnTraceObserver(turn_tracker)
|
||||
turn_processor = TurnAwareTranscriptProcessor()
|
||||
|
||||
@turn_processor.event_handler("on_turn_ended")
|
||||
async def add_transcripts_to_span(processor, turn_number, user_text, assistant_text, interrupted):
|
||||
# Get current span and add transcript data
|
||||
from opentelemetry import trace
|
||||
current_span = trace.get_current_span()
|
||||
if current_span:
|
||||
current_span.set_attribute("turn.user_text", user_text)
|
||||
current_span.set_attribute("turn.assistant_text", assistant_text)
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- The processor handles async frame processing correctly by delaying turn end until frames are processed
|
||||
- Works with word-level timestamps from TTS services like Cartesia
|
||||
- Accumulates both user (`TranscriptionFrame`) and assistant (`TTSTextFrame`) speech
|
||||
- Emits individual transcript messages in addition to turn-level aggregation
|
||||
@@ -44,6 +44,7 @@ DAILY_SAMPLE_ROOM_URL=https://...
|
||||
|
||||
# Deepgram
|
||||
DEEPGRAM_API_KEY=...
|
||||
SAGEMAKER_ENDPOINT_NAME=...
|
||||
|
||||
# DeepSeek
|
||||
DEEPSEEK_API_KEY=...
|
||||
|
||||
137
examples/foundational/07c-interruptible-deepgram-sagemaker.py
Normal file
137
examples/foundational/07c-interruptible-deepgram-sagemaker.py
Normal file
@@ -0,0 +1,137 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
|
||||
import os
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
|
||||
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.audio.vad.vad_analyzer import VADParams
|
||||
from pipecat.frames.frames import LLMRunFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.aws.llm import AWSBedrockLLMService
|
||||
from pipecat.services.deepgram.stt_sagemaker import DeepgramSageMakerSTTService
|
||||
from pipecat.services.deepgram.tts import DeepgramTTSService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
|
||||
),
|
||||
"twilio": lambda: FastAPIWebsocketParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
|
||||
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
# Initialize Deepgram SageMaker STT Service
|
||||
# This requires:
|
||||
# - AWS credentials configured (via environment variables or AWS CLI)
|
||||
# - A deployed SageMaker endpoint with Deepgram model
|
||||
stt = DeepgramSageMakerSTTService(
|
||||
endpoint_name=os.getenv("SAGEMAKER_ENDPOINT_NAME"),
|
||||
region=os.getenv("AWS_REGION"),
|
||||
)
|
||||
|
||||
tts = DeepgramTTSService(api_key=os.getenv("DEEPGRAM_API_KEY"), voice="aura-2-andromeda-en")
|
||||
|
||||
llm = AWSBedrockLLMService(
|
||||
aws_region=os.getenv("AWS_REGION"),
|
||||
model="us.amazon.nova-pro-v1:0",
|
||||
params=AWSBedrockLLMService.InputParams(temperature=0.8),
|
||||
)
|
||||
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
|
||||
},
|
||||
]
|
||||
|
||||
context = LLMContext(messages)
|
||||
context_aggregator = LLMContextAggregatorPair(context)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
stt, # STT
|
||||
context_aggregator.user(), # User responses
|
||||
llm, # LLM
|
||||
tts, # TTS
|
||||
transport.output(), # Transport bot output
|
||||
context_aggregator.assistant(), # Assistant spoken responses
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
# Kick off the conversation.
|
||||
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
|
||||
@@ -49,14 +49,14 @@ aic = [ "aic-sdk~=1.1.0" ]
|
||||
anthropic = [ "anthropic~=0.49.0" ]
|
||||
assemblyai = [ "pipecat-ai[websockets-base]" ]
|
||||
asyncai = [ "pipecat-ai[websockets-base]" ]
|
||||
aws = [ "aioboto3~=15.0.0", "pipecat-ai[websockets-base]" ]
|
||||
aws-nova-sonic = [ "aws_sdk_bedrock_runtime~=0.1.1; python_version>='3.12'" ]
|
||||
aws = [ "aioboto3~=15.5.0", "pipecat-ai[websockets-base]" ]
|
||||
aws-nova-sonic = [ "aws_sdk_bedrock_runtime~=0.2.0; python_version>='3.12'" ]
|
||||
azure = [ "azure-cognitiveservices-speech~=1.42.0"]
|
||||
cartesia = [ "cartesia~=2.0.3", "pipecat-ai[websockets-base]" ]
|
||||
cerebras = []
|
||||
deepseek = []
|
||||
daily = [ "daily-python~=0.22.0" ]
|
||||
deepgram = [ "deepgram-sdk~=4.7.0" ]
|
||||
deepseek = []
|
||||
elevenlabs = [ "pipecat-ai[websockets-base]" ]
|
||||
fal = [ "fal-client~=0.5.9" ]
|
||||
fireworks = []
|
||||
@@ -69,19 +69,21 @@ gstreamer = [ "pygobject~=3.50.0" ]
|
||||
heygen = [ "livekit>=1.0.13", "pipecat-ai[websockets-base]" ]
|
||||
hume = [ "hume>=0.11.2" ]
|
||||
inworld = []
|
||||
krisp = [ "pipecat-ai-krisp~=0.4.0" ]
|
||||
koala = [ "pvkoala~=2.0.3" ]
|
||||
krisp = [ "pipecat-ai-krisp~=0.4.0" ]
|
||||
langchain = [ "langchain~=0.3.20", "langchain-community~=0.3.20", "langchain-openai~=0.3.9" ]
|
||||
livekit = [ "livekit~=1.0.13", "livekit-api~=1.0.5", "tenacity>=8.2.3,<10.0.0" ]
|
||||
livekit = [ "livekit~=1.0.13", "livekit-api~=1.0.5", "tenacity>=8.2.3,<10.0.0", "pyjwt>=2.10.1" ]
|
||||
lmnt = [ "pipecat-ai[websockets-base]" ]
|
||||
local = [ "pyaudio~=0.2.14" ]
|
||||
local-smart-turn = [ "coremltools>=8.0", "transformers", "torch>=2.5.0,<3", "torchaudio>=2.5.0,<3" ]
|
||||
local-smart-turn-v3 = [ "transformers", "onnxruntime>=1.20.1,<2" ]
|
||||
mcp = [ "mcp[cli]>=1.11.0,<2" ]
|
||||
mem0 = [ "mem0ai~=0.1.94" ]
|
||||
mistral = []
|
||||
mlx-whisper = [ "mlx-whisper~=0.4.2" ]
|
||||
moondream = [ "accelerate~=1.10.0", "einops~=0.8.0", "pyvips[binary]~=3.0.0", "timm~=1.0.13", "transformers>=4.48.0" ]
|
||||
nim = []
|
||||
neuphonic = [ "pipecat-ai[websockets-base]" ]
|
||||
nim = []
|
||||
noisereduce = [ "noisereduce~=3.0.3" ]
|
||||
openai = [ "pipecat-ai[websockets-base]" ]
|
||||
openpipe = [ "openpipe>=4.50.0,<6" ]
|
||||
@@ -89,15 +91,14 @@ openrouter = []
|
||||
perplexity = []
|
||||
playht = [ "pipecat-ai[websockets-base]" ]
|
||||
qwen = []
|
||||
remote-smart-turn = []
|
||||
rime = [ "pipecat-ai[websockets-base]" ]
|
||||
riva = [ "nvidia-riva-client~=2.21.1" ]
|
||||
runner = [ "python-dotenv>=1.0.0,<2.0.0", "uvicorn>=0.32.0,<1.0.0", "fastapi>=0.115.6,<0.122.0", "pipecat-ai-small-webrtc-prebuilt>=1.0.0"]
|
||||
sagemaker = ["aws_sdk_sagemaker_runtime_http2; python_version>='3.12'"]
|
||||
sambanova = []
|
||||
sarvam = [ "sarvamai==0.1.21", "pipecat-ai[websockets-base]" ]
|
||||
sentry = [ "sentry-sdk>=2.28.0,<3" ]
|
||||
local-smart-turn = [ "coremltools>=8.0", "transformers", "torch>=2.5.0,<3", "torchaudio>=2.5.0,<3" ]
|
||||
local-smart-turn-v3 = [ "transformers", "onnxruntime>=1.20.1,<2" ]
|
||||
remote-smart-turn = []
|
||||
silero = [ "onnxruntime>=1.20.1,<2" ]
|
||||
simli = [ "simli-ai~=1.0.3"]
|
||||
soniox = [ "pipecat-ai[websockets-base]" ]
|
||||
|
||||
@@ -15,6 +15,7 @@ from typing import List, Optional
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
BotStartedSpeakingFrame,
|
||||
BotStoppedSpeakingFrame,
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
@@ -24,6 +25,7 @@ from pipecat.frames.frames import (
|
||||
TranscriptionMessage,
|
||||
TranscriptionUpdateFrame,
|
||||
TTSTextFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.utils.string import TextPartForConcatenation, concatenate_aggregated_text
|
||||
@@ -306,3 +308,267 @@ class TranscriptProcessor:
|
||||
return handler
|
||||
|
||||
return decorator
|
||||
|
||||
|
||||
class TurnAwareTranscriptProcessor(BaseTranscriptProcessor):
|
||||
"""Processes transcripts with turn boundary awareness.
|
||||
|
||||
This processor combines user and assistant transcript tracking with turn
|
||||
detection, emitting events when turns start and end. It correctly handles
|
||||
interruptions by only capturing what was actually spoken.
|
||||
|
||||
Turn boundaries are detected based on:
|
||||
- User started speaking (UserStartedSpeakingFrame)
|
||||
- Bot stopped speaking (BotStoppedSpeakingFrame)
|
||||
- Interruptions (InterruptionFrame)
|
||||
|
||||
Events:
|
||||
on_turn_started: Emitted when a new turn begins.
|
||||
Handler signature: async def handler(processor, turn_number)
|
||||
|
||||
on_turn_ended: Emitted when a turn ends.
|
||||
Handler signature: async def handler(processor, turn_number,
|
||||
user_transcript, assistant_transcript,
|
||||
was_interrupted)
|
||||
|
||||
on_transcript_update: Inherited from BaseTranscriptProcessor, emitted for
|
||||
individual transcript messages.
|
||||
|
||||
Example::
|
||||
|
||||
turn_processor = TurnAwareTranscriptProcessor()
|
||||
|
||||
@turn_processor.event_handler("on_turn_started")
|
||||
async def handle_turn_started(processor, turn_number):
|
||||
print(f"Turn {turn_number} started")
|
||||
|
||||
@turn_processor.event_handler("on_turn_ended")
|
||||
async def handle_turn_ended(processor, turn_number, user_text, assistant_text, interrupted):
|
||||
print(f"Turn {turn_number} ended")
|
||||
print(f"User said: {user_text}")
|
||||
print(f"Assistant said: {assistant_text}")
|
||||
print(f"Was interrupted: {interrupted}")
|
||||
|
||||
pipeline = Pipeline([
|
||||
transport.input(),
|
||||
stt,
|
||||
turn_processor,
|
||||
context_aggregator.user(),
|
||||
llm,
|
||||
tts,
|
||||
transport.output(),
|
||||
context_aggregator.assistant(),
|
||||
])
|
||||
"""
|
||||
|
||||
def __init__(self, **kwargs):
|
||||
"""Initialize the turn-aware transcript processor.
|
||||
|
||||
Args:
|
||||
**kwargs: Additional arguments passed to parent class.
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
|
||||
# Turn tracking state
|
||||
self._turn_number = 0
|
||||
self._turn_active = False
|
||||
self._turn_start_time: Optional[str] = None
|
||||
|
||||
# Accumulate text for current turn
|
||||
self._current_turn_user_parts: List[TextPartForConcatenation] = []
|
||||
self._current_turn_assistant_parts: List[TextPartForConcatenation] = []
|
||||
|
||||
# Track bot speaking state
|
||||
self._bot_is_speaking = False
|
||||
|
||||
# Register turn events
|
||||
self._register_event_handler("on_turn_started")
|
||||
self._register_event_handler("on_turn_ended")
|
||||
|
||||
async def _start_turn(self):
|
||||
"""Start a new turn."""
|
||||
if not self._turn_active:
|
||||
self._turn_number += 1
|
||||
self._turn_active = True
|
||||
self._turn_start_time = time_now_iso8601()
|
||||
self._current_turn_user_parts = []
|
||||
self._current_turn_assistant_parts = []
|
||||
|
||||
logger.debug(f"Turn {self._turn_number} started")
|
||||
await self._call_event_handler("on_turn_started", self._turn_number)
|
||||
|
||||
async def _end_turn(self, was_interrupted: bool = False):
|
||||
"""End the current turn and emit aggregated transcripts.
|
||||
|
||||
Args:
|
||||
was_interrupted: Whether the turn ended due to an interruption.
|
||||
"""
|
||||
if not self._turn_active:
|
||||
return
|
||||
|
||||
# Aggregate user text
|
||||
user_transcript = ""
|
||||
if self._current_turn_user_parts:
|
||||
user_transcript = concatenate_aggregated_text(self._current_turn_user_parts)
|
||||
|
||||
# Aggregate assistant text
|
||||
assistant_transcript = ""
|
||||
if self._current_turn_assistant_parts:
|
||||
assistant_transcript = concatenate_aggregated_text(self._current_turn_assistant_parts)
|
||||
|
||||
# Emit turn ended event
|
||||
logger.debug(
|
||||
f"Turn {self._turn_number} ended (interrupted={was_interrupted}). "
|
||||
f"User: '{user_transcript}', Assistant: '{assistant_transcript}'"
|
||||
)
|
||||
await self._call_event_handler(
|
||||
"on_turn_ended",
|
||||
self._turn_number,
|
||||
user_transcript,
|
||||
assistant_transcript,
|
||||
was_interrupted,
|
||||
)
|
||||
|
||||
# Reset turn state
|
||||
self._turn_active = False
|
||||
self._current_turn_user_parts = []
|
||||
self._current_turn_assistant_parts = []
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process frames for turn-aware transcript tracking.
|
||||
|
||||
Handles:
|
||||
- UserStartedSpeakingFrame: Start new turn
|
||||
- TranscriptionFrame: Accumulate user speech and emit transcript message
|
||||
- BotStartedSpeakingFrame: Track bot speaking state
|
||||
- TTSTextFrame: Accumulate assistant speech
|
||||
- BotStoppedSpeakingFrame: End turn if no interruption pending
|
||||
- InterruptionFrame: End turn immediately as interrupted
|
||||
- EndFrame/CancelFrame: End any active turn
|
||||
|
||||
Args:
|
||||
frame: Input frame to process.
|
||||
direction: Frame processing direction.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, UserStartedSpeakingFrame):
|
||||
# User started speaking
|
||||
if self._bot_is_speaking:
|
||||
# This is an interruption - end the current turn with what was spoken
|
||||
if self._current_turn_assistant_parts:
|
||||
assistant_content = concatenate_aggregated_text(
|
||||
self._current_turn_assistant_parts
|
||||
)
|
||||
if assistant_content:
|
||||
message = TranscriptionMessage(
|
||||
role="assistant",
|
||||
content=assistant_content,
|
||||
timestamp=self._turn_start_time or time_now_iso8601(),
|
||||
)
|
||||
await self._emit_update([message])
|
||||
await self._end_turn(was_interrupted=True)
|
||||
self._bot_is_speaking = False
|
||||
elif self._turn_active:
|
||||
# Previous turn is ending normally (bot finished speaking)
|
||||
if self._current_turn_assistant_parts:
|
||||
assistant_content = concatenate_aggregated_text(
|
||||
self._current_turn_assistant_parts
|
||||
)
|
||||
if assistant_content:
|
||||
message = TranscriptionMessage(
|
||||
role="assistant",
|
||||
content=assistant_content,
|
||||
timestamp=self._turn_start_time or time_now_iso8601(),
|
||||
)
|
||||
await self._emit_update([message])
|
||||
await self._end_turn(was_interrupted=False)
|
||||
|
||||
# Start a new turn
|
||||
await self._start_turn()
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
elif isinstance(frame, TranscriptionFrame):
|
||||
# Accumulate user speech for the current turn
|
||||
if self._turn_active:
|
||||
self._current_turn_user_parts.append(
|
||||
TextPartForConcatenation(frame.text, includes_inter_part_spaces=True)
|
||||
)
|
||||
|
||||
# Also emit individual transcript message
|
||||
message = TranscriptionMessage(
|
||||
role="user",
|
||||
user_id=frame.user_id,
|
||||
content=frame.text,
|
||||
timestamp=frame.timestamp,
|
||||
)
|
||||
await self._emit_update([message])
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
elif isinstance(frame, BotStartedSpeakingFrame):
|
||||
# Bot started speaking
|
||||
self._bot_is_speaking = True
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
elif isinstance(frame, TTSTextFrame):
|
||||
# Accumulate assistant speech for the current turn
|
||||
if self._turn_active:
|
||||
self._current_turn_assistant_parts.append(
|
||||
TextPartForConcatenation(
|
||||
frame.text, includes_inter_part_spaces=frame.includes_inter_frame_spaces
|
||||
)
|
||||
)
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
elif isinstance(frame, BotStoppedSpeakingFrame):
|
||||
# Bot stopped speaking - just mark it, don't end turn yet
|
||||
# Turn will end when next user speaks or pipeline ends
|
||||
self._bot_is_speaking = False
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
elif isinstance(frame, InterruptionFrame):
|
||||
# Emit assistant transcript message with what was spoken before interruption
|
||||
if self._current_turn_assistant_parts:
|
||||
assistant_content = concatenate_aggregated_text(self._current_turn_assistant_parts)
|
||||
if assistant_content:
|
||||
message = TranscriptionMessage(
|
||||
role="assistant",
|
||||
content=assistant_content,
|
||||
timestamp=self._turn_start_time or time_now_iso8601(),
|
||||
)
|
||||
await self._emit_update([message])
|
||||
|
||||
# Push frame first to ensure proper cleanup
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
# End turn as interrupted
|
||||
await self._end_turn(was_interrupted=True)
|
||||
self._bot_is_speaking = False
|
||||
|
||||
elif isinstance(frame, (EndFrame, CancelFrame)):
|
||||
# Pipeline ending - finalize any active turn
|
||||
if self._turn_active:
|
||||
# Emit any pending assistant transcript (allow time for TTSTextFrames to be processed)
|
||||
# Give a brief moment for any pending frames to process
|
||||
import asyncio
|
||||
|
||||
await asyncio.sleep(0.001)
|
||||
|
||||
if self._current_turn_assistant_parts:
|
||||
assistant_content = concatenate_aggregated_text(
|
||||
self._current_turn_assistant_parts
|
||||
)
|
||||
if assistant_content:
|
||||
message = TranscriptionMessage(
|
||||
role="assistant",
|
||||
content=assistant_content,
|
||||
timestamp=self._turn_start_time or time_now_iso8601(),
|
||||
)
|
||||
await self._emit_update([message])
|
||||
|
||||
await self._end_turn(was_interrupted=isinstance(frame, CancelFrame))
|
||||
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
@@ -8,8 +8,10 @@ import sys
|
||||
|
||||
from pipecat.services import DeprecatedModuleProxy
|
||||
|
||||
from .agent_core import *
|
||||
from .llm import *
|
||||
from .nova_sonic import *
|
||||
from .sagemaker import *
|
||||
from .stt import *
|
||||
from .tts import *
|
||||
|
||||
|
||||
258
src/pipecat/services/aws/agent_core.py
Normal file
258
src/pipecat/services/aws/agent_core.py
Normal file
@@ -0,0 +1,258 @@
|
||||
#
|
||||
# Copyright (c) 2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""AWS AgentCore Processor Module.
|
||||
|
||||
This module defines the AWSAgentCoreProcessor, which invokes agents hosted on
|
||||
Amazon Bedrock AgentCore Runtime and streams their responses as LLMTextFrames.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
import os
|
||||
from typing import Callable, Optional
|
||||
|
||||
import aioboto3
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
Frame,
|
||||
LLMContextFrame,
|
||||
LLMFullResponseEndFrame,
|
||||
LLMFullResponseStartFrame,
|
||||
LLMTextFrame,
|
||||
)
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext, LLMSpecificMessage
|
||||
from pipecat.processors.aggregators.openai_llm_context import (
|
||||
OpenAILLMContext,
|
||||
OpenAILLMContextFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
|
||||
|
||||
def default_context_to_payload_transformer(
|
||||
context: LLMContext | OpenAILLMContext,
|
||||
) -> Optional[str]:
|
||||
"""Default transformer to create AgentCore payload from LLM context.
|
||||
|
||||
Extracts the latest user or system message text and wraps it in {"prompt": "<text>"}.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing conversation messages.
|
||||
|
||||
Returns:
|
||||
A JSON string payload for AgentCore, or None if no valid message found.
|
||||
"""
|
||||
messages = context.messages
|
||||
|
||||
if not messages:
|
||||
return None
|
||||
|
||||
last_message = messages[-1]
|
||||
if isinstance(last_message, LLMSpecificMessage) or last_message.get("role") not in (
|
||||
"user",
|
||||
"system",
|
||||
):
|
||||
return None
|
||||
|
||||
content = last_message.get("content")
|
||||
if not content:
|
||||
return None
|
||||
|
||||
if isinstance(content, str):
|
||||
prompt = content
|
||||
elif isinstance(content, list):
|
||||
prompt = " ".join([part.get("text", "") for part in content])
|
||||
else:
|
||||
return None
|
||||
|
||||
return json.dumps({"prompt": prompt})
|
||||
|
||||
|
||||
def default_response_to_output_transformer(response_line: str) -> Optional[str]:
|
||||
"""Default transformer to extract output text from AgentCore response.
|
||||
|
||||
Expects responses with {"response": "<text>"} format.
|
||||
|
||||
Args:
|
||||
response_line: The raw response line from AgentCore (without "data: " prefix).
|
||||
|
||||
Returns:
|
||||
The extracted output text, or None if no text found.
|
||||
"""
|
||||
response_json = json.loads(response_line)
|
||||
return response_json.get("response")
|
||||
|
||||
|
||||
class AWSAgentCoreProcessor(FrameProcessor):
|
||||
"""Processor that runs an Amazon Bedrock AgentCore agent.
|
||||
|
||||
Input:
|
||||
- LLMContextFrame: Supplies a context used to invoke the agent.
|
||||
|
||||
Output:
|
||||
- LLMTextFrame: The agent's text response(s).
|
||||
A single agent invocation may result in multiple text frames.
|
||||
|
||||
This processor transforms the input context to a payload for the AgentCore
|
||||
agent, and transforms the agent's response(s) into output text frame(s). Both
|
||||
mappings are configurable via transformers. Below is the default behavior.
|
||||
|
||||
Input transformer (context_to_payload_transformer):
|
||||
- Grabs the latest user or system message (if it's the latest message)
|
||||
- Extracts its text content
|
||||
- Constructs a payload that looks like {"prompt": "<text>"}
|
||||
|
||||
Output transformer (response_to_output_transformer):
|
||||
- Expects responses that look like {"response": "<text>"}
|
||||
- Extracts the text for use in the LLMTextFrame(s)
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
agentArn: str,
|
||||
aws_access_key: Optional[str] = None,
|
||||
aws_secret_key: Optional[str] = None,
|
||||
aws_session_token: Optional[str] = None,
|
||||
aws_region: Optional[str] = None,
|
||||
context_to_payload_transformer: Optional[
|
||||
Callable[[LLMContext | OpenAILLMContext], Optional[str]]
|
||||
] = None,
|
||||
response_to_output_transformer: Optional[Callable[[str], Optional[str]]] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the AWS AgentCore processor.
|
||||
|
||||
Args:
|
||||
agentArn: The Amazon Web Services Resource Name (ARN) of the agent.
|
||||
aws_access_key: AWS access key ID. If None, uses default credentials.
|
||||
aws_secret_key: AWS secret access key. If None, uses default credentials.
|
||||
aws_session_token: AWS session token for temporary credentials.
|
||||
aws_region: AWS region.
|
||||
context_to_payload_transformer: Optional callable to transform
|
||||
LLMContext into AgentCore payload string. If None, uses
|
||||
default_context_to_payload_transformer.
|
||||
response_to_output_transformer: Optional callable to extract output text
|
||||
from AgentCore response. If None, uses
|
||||
default_response_to_output_transformer.
|
||||
**kwargs: Additional arguments passed to parent FrameProcessor.
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
|
||||
self._agentArn = agentArn
|
||||
self._aws_session = aioboto3.Session()
|
||||
|
||||
# Store AWS session parameters for creating client in async context
|
||||
self._aws_params = {
|
||||
"aws_access_key_id": aws_access_key or os.getenv("AWS_ACCESS_KEY_ID"),
|
||||
"aws_secret_access_key": aws_secret_key or os.getenv("AWS_SECRET_ACCESS_KEY"),
|
||||
"aws_session_token": aws_session_token or os.getenv("AWS_SESSION_TOKEN"),
|
||||
"region_name": aws_region or os.getenv("AWS_REGION", "us-east-1"),
|
||||
}
|
||||
|
||||
# Set transformers with defaults
|
||||
self._context_to_payload_transformer = (
|
||||
context_to_payload_transformer or default_context_to_payload_transformer
|
||||
)
|
||||
self._response_to_output_transformer = (
|
||||
response_to_output_transformer or default_response_to_output_transformer
|
||||
)
|
||||
|
||||
# State for managing output response bookends
|
||||
self._output_response_open = False
|
||||
self._last_text_frame_time: Optional[float] = None
|
||||
self._close_task: Optional[asyncio.Task] = None
|
||||
self._output_response_timeout = 1.0 # seconds
|
||||
|
||||
async def _close_output_response_after_timeout(self):
|
||||
"""Close the output response after timeout if no new text frames arrive."""
|
||||
await asyncio.sleep(self._output_response_timeout)
|
||||
if self._output_response_open:
|
||||
self._output_response_open = False
|
||||
await self.push_frame(LLMFullResponseEndFrame())
|
||||
|
||||
async def _push_text_frame(self, text: str):
|
||||
"""Push a text frame, managing output response bookends."""
|
||||
# Cancel any pending close task
|
||||
if self._close_task and not self._close_task.done():
|
||||
await self.cancel_task(self._close_task)
|
||||
|
||||
# Open output response if needed
|
||||
if not self._output_response_open:
|
||||
await self.push_frame(LLMFullResponseStartFrame())
|
||||
self._output_response_open = True
|
||||
|
||||
# Push the text frame
|
||||
await self.push_frame(LLMTextFrame(text))
|
||||
self._last_text_frame_time = asyncio.get_event_loop().time()
|
||||
|
||||
# Schedule closing the output response after timeout
|
||||
self._close_task = self.create_task(self._close_output_response_after_timeout())
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process incoming frames and handle LLM message frames.
|
||||
|
||||
Args:
|
||||
frame: The incoming frame to process.
|
||||
direction: The direction of frame flow in the pipeline.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
if isinstance(frame, (LLMContextFrame, OpenAILLMContextFrame)):
|
||||
# Create payload to invoke AgentCore agent
|
||||
payload = self._context_to_payload_transformer(frame.context)
|
||||
|
||||
if not payload:
|
||||
return
|
||||
|
||||
async with self._aws_session.client("bedrock-agentcore", **self._aws_params) as client:
|
||||
# Invoke the AgentCore agent
|
||||
response = await client.invoke_agent_runtime(
|
||||
agentRuntimeArn=self._agentArn, payload=payload.encode()
|
||||
)
|
||||
|
||||
# Determine if this is a streamed multi-part response, which
|
||||
# will affect our parsing
|
||||
is_multi_part_response = "text/event-stream" in response.get("contentType", "")
|
||||
|
||||
# Handle each response part (there may be one, for single
|
||||
# responses, or multiple, for streamed multi-part responses)
|
||||
async for part in response.get("response", []):
|
||||
part_string = part.decode("utf-8")
|
||||
|
||||
# In streamed multi-part responses, each part might have
|
||||
# one or more lines, each of which starts with "data: ".
|
||||
# Treat each line as a response.
|
||||
if is_multi_part_response:
|
||||
for line in part_string.split("\n"):
|
||||
# Get response text from this line
|
||||
if not line:
|
||||
continue
|
||||
if not line.startswith("data: "):
|
||||
logger.warning(f"Expected line to start with 'data: ', got: {line}")
|
||||
continue
|
||||
line = line[6:] # omit "data: "
|
||||
|
||||
# Transform response line to output text
|
||||
text = self._response_to_output_transformer(line)
|
||||
if text:
|
||||
await self._push_text_frame(text)
|
||||
|
||||
# In single-part responses, the whole part is one response
|
||||
# and there's no "data: " prefix
|
||||
else:
|
||||
# Transform response part string to output text
|
||||
text = self._response_to_output_transformer(part_string)
|
||||
if text:
|
||||
await self._push_text_frame(text)
|
||||
|
||||
# Final close if output response is still open after all parts processed
|
||||
if self._output_response_open:
|
||||
if self._close_task and not self._close_task.done():
|
||||
await self.cancel_task(self._close_task)
|
||||
self._output_response_open = False
|
||||
await self.push_frame(LLMFullResponseEndFrame())
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
0
src/pipecat/services/aws/sagemaker/__init__.py
Normal file
0
src/pipecat/services/aws/sagemaker/__init__.py
Normal file
283
src/pipecat/services/aws/sagemaker/bidi_client.py
Normal file
283
src/pipecat/services/aws/sagemaker/bidi_client.py
Normal file
@@ -0,0 +1,283 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""AWS SageMaker bidirectional streaming client.
|
||||
|
||||
This module provides a client for streaming bidirectional communication with
|
||||
SageMaker endpoints using the HTTP/2 protocol. Supports sending audio, text,
|
||||
and JSON data to SageMaker model endpoints and receiving streaming responses.
|
||||
"""
|
||||
|
||||
import os
|
||||
from typing import Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
try:
|
||||
from aws_sdk_sagemaker_runtime_http2.client import SageMakerRuntimeHTTP2Client
|
||||
from aws_sdk_sagemaker_runtime_http2.config import Config, HTTPAuthSchemeResolver
|
||||
from aws_sdk_sagemaker_runtime_http2.models import (
|
||||
InvokeEndpointWithBidirectionalStreamInput,
|
||||
RequestPayloadPart,
|
||||
RequestStreamEventPayloadPart,
|
||||
ResponseStreamEvent,
|
||||
)
|
||||
from smithy_aws_core.auth.sigv4 import SigV4AuthScheme
|
||||
from smithy_aws_core.identity import EnvironmentCredentialsResolver
|
||||
from smithy_core.aio.eventstream import DuplexEventStream
|
||||
except ModuleNotFoundError as e:
|
||||
logger.error(f"Exception: {e}")
|
||||
logger.error(
|
||||
"In order to use SageMaker BiDi client, you need to `pip install pipecat-ai[sagemaker]`."
|
||||
)
|
||||
raise Exception(f"Missing module: {e}")
|
||||
|
||||
|
||||
class SageMakerBidiClient:
|
||||
"""Client for bidirectional streaming with AWS SageMaker endpoints.
|
||||
|
||||
Handles low-level HTTP/2 bidirectional streaming protocol for communicating
|
||||
with SageMaker model endpoints. Provides methods for sending various data
|
||||
types (audio, text, JSON) and receiving streaming responses.
|
||||
|
||||
This client uses AWS SigV4 authentication and supports credential resolution
|
||||
from environment variables, AWS CLI configuration, and instance metadata.
|
||||
|
||||
Example::
|
||||
|
||||
client = SageMakerBidiClient(
|
||||
endpoint_name="my-deepgram-endpoint",
|
||||
region="us-east-2",
|
||||
model_invocation_path="v1/listen",
|
||||
model_query_string="model=nova-3&language=en"
|
||||
)
|
||||
await client.start_session()
|
||||
await client.send_audio_chunk(audio_bytes)
|
||||
response = await client.receive_response()
|
||||
await client.close_session()
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
endpoint_name: str,
|
||||
region: str,
|
||||
model_invocation_path: str = "",
|
||||
model_query_string: str = "",
|
||||
):
|
||||
"""Initialize the SageMaker BiDi client.
|
||||
|
||||
Args:
|
||||
endpoint_name: Name of the SageMaker endpoint to connect to.
|
||||
region: AWS region where the endpoint is deployed.
|
||||
model_invocation_path: API path for the model invocation (e.g., "v1/listen").
|
||||
model_query_string: Query string parameters for the model (e.g., "model=nova-3").
|
||||
"""
|
||||
self.endpoint_name = endpoint_name
|
||||
self.region = region
|
||||
self.model_invocation_path = model_invocation_path
|
||||
self.model_query_string = model_query_string
|
||||
self.bidi_endpoint = f"https://runtime.sagemaker.{region}.amazonaws.com:8443"
|
||||
self._client: Optional[SageMakerRuntimeHTTP2Client] = None
|
||||
self._stream: Optional[
|
||||
DuplexEventStream[RequestStreamEventPayloadPart, ResponseStreamEvent, any]
|
||||
] = None
|
||||
self._output_stream = None
|
||||
self._is_active = False
|
||||
|
||||
def _initialize_client(self):
|
||||
"""Initialize the SageMaker Runtime HTTP2 client with AWS credentials.
|
||||
|
||||
Creates and configures the SageMaker Runtime HTTP2 client with SigV4
|
||||
authentication. Attempts to resolve AWS credentials from environment
|
||||
variables, AWS CLI configuration, or instance metadata.
|
||||
"""
|
||||
logger.debug(f"Initializing SageMaker BiDi client for region: {self.region}")
|
||||
logger.debug(f"Using endpoint URI: {self.bidi_endpoint}")
|
||||
|
||||
# Check for AWS credentials
|
||||
has_env_creds = bool(os.getenv("AWS_ACCESS_KEY_ID") and os.getenv("AWS_SECRET_ACCESS_KEY"))
|
||||
|
||||
if not has_env_creds:
|
||||
logger.warning(
|
||||
"AWS credentials not found in environment variables. "
|
||||
"Attempting to use EnvironmentCredentialsResolver which will check "
|
||||
"AWS CLI configuration and instance metadata."
|
||||
)
|
||||
|
||||
config = Config(
|
||||
endpoint_uri=self.bidi_endpoint,
|
||||
region=self.region,
|
||||
aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
|
||||
auth_scheme_resolver=HTTPAuthSchemeResolver(),
|
||||
auth_schemes={"aws.auth#sigv4": SigV4AuthScheme(service="sagemaker")},
|
||||
)
|
||||
self._client = SageMakerRuntimeHTTP2Client(config=config)
|
||||
|
||||
async def start_session(self):
|
||||
"""Start a bidirectional streaming session with the SageMaker endpoint.
|
||||
|
||||
Initializes the client if needed, creates the bidirectional stream, and
|
||||
establishes the connection to the SageMaker endpoint. Must be called
|
||||
before sending or receiving data.
|
||||
|
||||
Returns:
|
||||
The output stream for receiving responses.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If client initialization or connection fails.
|
||||
"""
|
||||
if not self._client:
|
||||
self._initialize_client()
|
||||
|
||||
logger.debug(f"Starting BiDi session with endpoint: {self.endpoint_name}")
|
||||
logger.debug(f"Model invocation path: {self.model_invocation_path}")
|
||||
logger.debug(f"Model query string: {self.model_query_string}")
|
||||
|
||||
# Create the bidirectional stream
|
||||
stream_input = InvokeEndpointWithBidirectionalStreamInput(
|
||||
endpoint_name=self.endpoint_name,
|
||||
model_invocation_path=self.model_invocation_path,
|
||||
model_query_string=self.model_query_string,
|
||||
)
|
||||
|
||||
try:
|
||||
self._stream = await self._client.invoke_endpoint_with_bidirectional_stream(
|
||||
stream_input
|
||||
)
|
||||
self._is_active = True
|
||||
|
||||
# Get output stream
|
||||
output = await self._stream.await_output()
|
||||
self._output_stream = output[1]
|
||||
|
||||
logger.debug("BiDi session started successfully")
|
||||
return self._output_stream
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to start BiDi session: {e}")
|
||||
self._is_active = False
|
||||
raise RuntimeError(f"Failed to start SageMaker BiDi session: {e}")
|
||||
|
||||
async def send_data(self, data_bytes: bytes, data_type: Optional[str] = None):
|
||||
"""Send a chunk of data to the stream.
|
||||
|
||||
Generic method for sending any type of data to the SageMaker endpoint.
|
||||
Use the convenience methods (send_audio_chunk, send_text, send_json)
|
||||
for common data types.
|
||||
|
||||
Args:
|
||||
data_bytes: Raw bytes to send.
|
||||
data_type: Optional data type header. Common values are "BINARY" for
|
||||
audio/binary data and "UTF8" for text/JSON data.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If session is not active or send fails.
|
||||
"""
|
||||
if not self._is_active or not self._stream:
|
||||
raise RuntimeError("BiDi session not active")
|
||||
|
||||
try:
|
||||
payload = RequestPayloadPart(bytes_=data_bytes, data_type=data_type)
|
||||
event = RequestStreamEventPayloadPart(value=payload)
|
||||
await self._stream.input_stream.send(event)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to send data: {e}")
|
||||
raise
|
||||
|
||||
async def send_audio_chunk(self, audio_bytes: bytes):
|
||||
"""Send a chunk of audio data to the stream.
|
||||
|
||||
Convenience method for sending audio data. Automatically sets the data
|
||||
type to "BINARY".
|
||||
|
||||
Args:
|
||||
audio_bytes: Raw audio bytes to send (e.g., PCM audio data).
|
||||
|
||||
Raises:
|
||||
RuntimeError: If session is not active or send fails.
|
||||
"""
|
||||
await self.send_data(audio_bytes, data_type="BINARY")
|
||||
|
||||
async def send_text(self, text: str):
|
||||
"""Send text data to the stream.
|
||||
|
||||
Convenience method for sending text data. Automatically encodes the text
|
||||
as UTF-8 and sets the data type to "UTF8".
|
||||
|
||||
Args:
|
||||
text: Text string to send.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If session is not active or send fails.
|
||||
"""
|
||||
await self.send_data(text.encode("utf-8"), data_type="UTF8")
|
||||
|
||||
async def send_json(self, data: dict):
|
||||
"""Send JSON data to the stream.
|
||||
|
||||
Convenience method for sending JSON-encoded messages. Useful for control
|
||||
messages like KeepAlive or CloseStream. Automatically serializes the
|
||||
dictionary to JSON, encodes as UTF-8, and sets the data type to "UTF8".
|
||||
|
||||
Args:
|
||||
data: Dictionary to send as JSON (e.g., {"type": "KeepAlive"}).
|
||||
|
||||
Raises:
|
||||
RuntimeError: If session is not active or send fails.
|
||||
"""
|
||||
import json
|
||||
|
||||
await self.send_data(json.dumps(data).encode("utf-8"), data_type="UTF8")
|
||||
|
||||
async def receive_response(self) -> Optional[ResponseStreamEvent]:
|
||||
"""Receive a response from the stream.
|
||||
|
||||
Blocks until a response is available from the SageMaker endpoint. Returns
|
||||
None when the stream is closed.
|
||||
|
||||
Returns:
|
||||
The response event containing payload data, or None if stream is closed.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If session is not active.
|
||||
"""
|
||||
if not self._is_active or not self._output_stream:
|
||||
raise RuntimeError("BiDi session not active")
|
||||
|
||||
try:
|
||||
result = await self._output_stream.receive()
|
||||
return result
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to receive response: {e}")
|
||||
raise
|
||||
|
||||
async def close_session(self):
|
||||
"""Close the bidirectional streaming session.
|
||||
|
||||
Gracefully closes the input stream and marks the session as inactive.
|
||||
Safe to call multiple times.
|
||||
"""
|
||||
if not self._is_active:
|
||||
return
|
||||
|
||||
logger.debug("Closing BiDi session...")
|
||||
self._is_active = False
|
||||
|
||||
try:
|
||||
if self._stream:
|
||||
await self._stream.input_stream.close()
|
||||
logger.debug("BiDi session closed successfully")
|
||||
except Exception as e:
|
||||
logger.warning(f"Error closing BiDi session: {e}")
|
||||
|
||||
@property
|
||||
def is_active(self) -> bool:
|
||||
"""Check if the session is currently active.
|
||||
|
||||
Returns:
|
||||
True if session is active, False otherwise.
|
||||
"""
|
||||
return self._is_active
|
||||
@@ -183,6 +183,14 @@ class DeepgramFluxSTTService(WebsocketSTTService):
|
||||
"""
|
||||
await self._connect_websocket()
|
||||
|
||||
# Creating the receiver task (only created once during initial connection)
|
||||
if not self._receive_task:
|
||||
self._receive_task = self.create_task(self._receive_task_handler(self._report_error))
|
||||
|
||||
# Creating the watchdog task (only created once during initial connection)
|
||||
if not self._watchdog_task:
|
||||
self._watchdog_task = self.create_task(self._watchdog_task_handler())
|
||||
|
||||
async def _disconnect(self):
|
||||
"""Disconnect from WebSocket and clean up tasks.
|
||||
|
||||
@@ -235,16 +243,6 @@ class DeepgramFluxSTTService(WebsocketSTTService):
|
||||
additional_headers={"Authorization": f"Token {self._api_key}"},
|
||||
)
|
||||
|
||||
# Creating the receiver task
|
||||
if not self._receive_task:
|
||||
self._receive_task = self.create_task(
|
||||
self._receive_task_handler(self._report_error)
|
||||
)
|
||||
|
||||
# Creating the watchdog task
|
||||
if not self._watchdog_task:
|
||||
self._watchdog_task = self.create_task(self._watchdog_task_handler())
|
||||
|
||||
# Now wait for the connection established event
|
||||
logger.debug("WebSocket connected, waiting for server confirmation...")
|
||||
await self._connection_established_event.wait()
|
||||
|
||||
447
src/pipecat/services/deepgram/stt_sagemaker.py
Normal file
447
src/pipecat/services/deepgram/stt_sagemaker.py
Normal file
@@ -0,0 +1,447 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Deepgram speech-to-text service for AWS SageMaker.
|
||||
|
||||
This module provides a Pipecat STT service that connects to Deepgram models
|
||||
deployed on AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for
|
||||
low-latency real-time transcription with support for interim results, multiple
|
||||
languages, and various Deepgram features.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from typing import AsyncGenerator, Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
InterimTranscriptionFrame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
UserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.aws.sagemaker.bidi_client import SageMakerBidiClient
|
||||
from pipecat.services.stt_service import STTService
|
||||
from pipecat.transcriptions.language import Language
|
||||
from pipecat.utils.time import time_now_iso8601
|
||||
from pipecat.utils.tracing.service_decorators import traced_stt
|
||||
|
||||
try:
|
||||
from deepgram import LiveOptions
|
||||
except ModuleNotFoundError as e:
|
||||
logger.error(f"Exception: {e}")
|
||||
logger.error(
|
||||
"In order to use DeepgramSageMakerSTTService, you need to `pip install pipecat-ai[deepgram,sagemaker]`."
|
||||
)
|
||||
raise Exception(f"Missing module: {e}")
|
||||
|
||||
|
||||
class DeepgramSageMakerSTTService(STTService):
|
||||
"""Deepgram speech-to-text service for AWS SageMaker.
|
||||
|
||||
Provides real-time speech recognition using Deepgram models deployed on
|
||||
AWS SageMaker endpoints. Uses HTTP/2 bidirectional streaming for low-latency
|
||||
transcription with support for interim results, speaker diarization, and
|
||||
multiple languages.
|
||||
|
||||
Requirements:
|
||||
|
||||
- AWS credentials configured (via environment variables, AWS CLI, or instance metadata)
|
||||
- A deployed SageMaker endpoint with Deepgram model: https://developers.deepgram.com/docs/deploy-amazon-sagemaker
|
||||
- Deepgram SDK for LiveOptions configuration
|
||||
|
||||
Example::
|
||||
|
||||
stt = DeepgramSageMakerSTTService(
|
||||
endpoint_name="my-deepgram-endpoint",
|
||||
region="us-east-2",
|
||||
live_options=LiveOptions(
|
||||
model="nova-3",
|
||||
language="en",
|
||||
interim_results=True,
|
||||
punctuate=True,
|
||||
),
|
||||
)
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
endpoint_name: str,
|
||||
region: str,
|
||||
sample_rate: Optional[int] = None,
|
||||
live_options: Optional[LiveOptions] = None,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
endpoint_name: Name of the SageMaker endpoint with Deepgram model
|
||||
deployed (e.g., "my-deepgram-nova-3-endpoint").
|
||||
region: AWS region where the endpoint is deployed (e.g., "us-east-2").
|
||||
sample_rate: Audio sample rate in Hz. If None, uses value from
|
||||
live_options or defaults to the value from StartFrame.
|
||||
live_options: Deepgram LiveOptions for detailed configuration. If None,
|
||||
uses sensible defaults (nova-3 model, English, interim results enabled).
|
||||
**kwargs: Additional arguments passed to the parent STTService.
|
||||
"""
|
||||
sample_rate = sample_rate or (live_options.sample_rate if live_options else None)
|
||||
super().__init__(sample_rate=sample_rate, **kwargs)
|
||||
|
||||
self._endpoint_name = endpoint_name
|
||||
self._region = region
|
||||
|
||||
# Create default options similar to DeepgramSTTService
|
||||
default_options = LiveOptions(
|
||||
encoding="linear16",
|
||||
language=Language.EN,
|
||||
model="nova-3",
|
||||
channels=1,
|
||||
interim_results=True,
|
||||
punctuate=True,
|
||||
)
|
||||
|
||||
# Merge with provided options
|
||||
merged_options = default_options.to_dict()
|
||||
if live_options:
|
||||
default_model = default_options.model
|
||||
merged_options.update(live_options.to_dict())
|
||||
# Handle the "None" string bug from deepgram-sdk
|
||||
if "model" in merged_options and merged_options["model"] == "None":
|
||||
merged_options["model"] = default_model
|
||||
|
||||
# Convert Language enum to string if needed
|
||||
if "language" in merged_options and isinstance(merged_options["language"], Language):
|
||||
merged_options["language"] = merged_options["language"].value
|
||||
|
||||
self.set_model_name(merged_options["model"])
|
||||
self._settings = merged_options
|
||||
|
||||
self._client: Optional[SageMakerBidiClient] = None
|
||||
self._response_task: Optional[asyncio.Task] = None
|
||||
self._keepalive_task: Optional[asyncio.Task] = None
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if this service can generate processing metrics.
|
||||
|
||||
Returns:
|
||||
True, as Deepgram SageMaker service supports metrics generation.
|
||||
"""
|
||||
return True
|
||||
|
||||
async def set_model(self, model: str):
|
||||
"""Set the Deepgram model and reconnect.
|
||||
|
||||
Disconnects from the current session, updates the model setting, and
|
||||
establishes a new connection with the updated model.
|
||||
|
||||
Args:
|
||||
model: The Deepgram model name to use (e.g., "nova-3").
|
||||
"""
|
||||
await super().set_model(model)
|
||||
logger.info(f"Switching STT model to: [{model}]")
|
||||
self._settings["model"] = model
|
||||
await self._disconnect()
|
||||
await self._connect()
|
||||
|
||||
async def set_language(self, language: Language):
|
||||
"""Set the recognition language and reconnect.
|
||||
|
||||
Disconnects from the current session, updates the language setting, and
|
||||
establishes a new connection with the updated language.
|
||||
|
||||
Args:
|
||||
language: The language to use for speech recognition (e.g., Language.EN,
|
||||
Language.ES).
|
||||
"""
|
||||
logger.info(f"Switching STT language to: [{language}]")
|
||||
self._settings["language"] = language
|
||||
await self._disconnect()
|
||||
await self._connect()
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
"""Start the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The start frame containing initialization parameters.
|
||||
"""
|
||||
await super().start(frame)
|
||||
self._settings["sample_rate"] = self.sample_rate
|
||||
await self._connect()
|
||||
|
||||
async def stop(self, frame: EndFrame):
|
||||
"""Stop the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The end frame.
|
||||
"""
|
||||
await super().stop(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def cancel(self, frame: CancelFrame):
|
||||
"""Cancel the Deepgram SageMaker STT service.
|
||||
|
||||
Args:
|
||||
frame: The cancel frame.
|
||||
"""
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
|
||||
"""Send audio data to Deepgram for transcription.
|
||||
|
||||
Args:
|
||||
audio: Raw audio bytes to transcribe.
|
||||
|
||||
Yields:
|
||||
Frame: None (transcription results come via BiDi stream callbacks).
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_audio_chunk(audio)
|
||||
except Exception as e:
|
||||
logger.error(f"Error sending audio to SageMaker: {e}")
|
||||
await self.push_error(ErrorFrame(error=f"SageMaker STT error: {e}"))
|
||||
yield None
|
||||
|
||||
async def _connect(self):
|
||||
"""Connect to the SageMaker endpoint and start the BiDi session.
|
||||
|
||||
Builds the Deepgram query string from settings, creates the BiDi client,
|
||||
starts the streaming session, and launches background tasks for processing
|
||||
responses and sending KeepAlive messages.
|
||||
"""
|
||||
logger.debug("Connecting to Deepgram on SageMaker...")
|
||||
|
||||
# Update sample rate in settings
|
||||
self._settings["sample_rate"] = self.sample_rate
|
||||
|
||||
# Build query string from settings, converting booleans to strings
|
||||
query_params = {}
|
||||
for key, value in self._settings.items():
|
||||
if value is not None:
|
||||
# Convert boolean values to lowercase strings for Deepgram API
|
||||
if isinstance(value, bool):
|
||||
query_params[key] = str(value).lower()
|
||||
else:
|
||||
query_params[key] = str(value)
|
||||
|
||||
query_string = "&".join(f"{k}={v}" for k, v in query_params.items())
|
||||
|
||||
# Create BiDi client
|
||||
self._client = SageMakerBidiClient(
|
||||
endpoint_name=self._endpoint_name,
|
||||
region=self._region,
|
||||
model_invocation_path="v1/listen",
|
||||
model_query_string=query_string,
|
||||
)
|
||||
|
||||
try:
|
||||
# Start the session
|
||||
await self._client.start_session()
|
||||
|
||||
# Start processing responses in the background
|
||||
self._response_task = self.create_task(self._process_responses())
|
||||
|
||||
# Start keepalive task to maintain connection
|
||||
self._keepalive_task = self.create_task(self._send_keepalive())
|
||||
|
||||
logger.debug("Connected to Deepgram on SageMaker")
|
||||
await self._call_event_handler("on_connected")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to connect to SageMaker: {e}")
|
||||
await self.push_error(ErrorFrame(error=f"SageMaker connection error: {e}"))
|
||||
await self._call_event_handler("on_connection_error", str(e))
|
||||
|
||||
async def _disconnect(self):
|
||||
"""Disconnect from the SageMaker endpoint.
|
||||
|
||||
Sends a CloseStream message to Deepgram, cancels background tasks
|
||||
(KeepAlive and response processing), and closes the BiDi session.
|
||||
Safe to call multiple times.
|
||||
"""
|
||||
if self._client and self._client.is_active:
|
||||
logger.debug("Disconnecting from Deepgram on SageMaker...")
|
||||
|
||||
# Send CloseStream message to Deepgram
|
||||
try:
|
||||
await self._client.send_json({"type": "CloseStream"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send CloseStream message: {e}")
|
||||
|
||||
# Cancel keepalive task
|
||||
if self._keepalive_task and not self._keepalive_task.done():
|
||||
await self.cancel_task(self._keepalive_task)
|
||||
|
||||
# Cancel response processing task
|
||||
if self._response_task and not self._response_task.done():
|
||||
await self.cancel_task(self._response_task)
|
||||
|
||||
# Close the BiDi session
|
||||
await self._client.close_session()
|
||||
|
||||
logger.debug("Disconnected from Deepgram on SageMaker")
|
||||
await self._call_event_handler("on_disconnected")
|
||||
|
||||
async def _send_keepalive(self):
|
||||
"""Send periodic KeepAlive messages to maintain the connection.
|
||||
|
||||
Sends a KeepAlive JSON message to Deepgram every 5 seconds while the
|
||||
connection is active. This prevents the connection from timing out during
|
||||
periods of silence.
|
||||
"""
|
||||
while self._client and self._client.is_active:
|
||||
await asyncio.sleep(5)
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "KeepAlive"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Failed to send KeepAlive: {e}")
|
||||
|
||||
async def _process_responses(self):
|
||||
"""Process streaming responses from Deepgram on SageMaker.
|
||||
|
||||
Continuously receives responses from the BiDi stream, decodes the payload,
|
||||
parses JSON responses from Deepgram, and processes transcription results.
|
||||
Runs as a background task until the connection is closed or cancelled.
|
||||
"""
|
||||
try:
|
||||
while self._client and self._client.is_active:
|
||||
result = await self._client.receive_response()
|
||||
|
||||
if result is None:
|
||||
break
|
||||
|
||||
# Check if this is a PayloadPart with bytes
|
||||
if hasattr(result, "value") and hasattr(result.value, "bytes_"):
|
||||
if result.value.bytes_:
|
||||
response_data = result.value.bytes_.decode("utf-8")
|
||||
|
||||
try:
|
||||
# Parse JSON response from Deepgram
|
||||
parsed = json.loads(response_data)
|
||||
|
||||
# Extract and process transcript if available
|
||||
if "channel" in parsed:
|
||||
await self._handle_transcript_response(parsed)
|
||||
|
||||
except json.JSONDecodeError:
|
||||
logger.warning(f"Non-JSON response: {response_data}")
|
||||
|
||||
except asyncio.CancelledError:
|
||||
logger.debug("Response processor cancelled")
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing responses: {e}", exc_info=True)
|
||||
await self.push_error(ErrorFrame(error=f"SageMaker response error: {e}"))
|
||||
finally:
|
||||
logger.debug("Response processor stopped")
|
||||
|
||||
async def _handle_transcript_response(self, parsed: dict):
|
||||
"""Handle a transcript response from Deepgram.
|
||||
|
||||
Extracts the transcript text, determines if it's final or interim, extracts
|
||||
language information, and pushes the appropriate frame (TranscriptionFrame
|
||||
or InterimTranscriptionFrame) downstream.
|
||||
|
||||
Args:
|
||||
parsed: The parsed JSON response from Deepgram containing channel,
|
||||
alternatives, transcript, and metadata.
|
||||
"""
|
||||
alternatives = parsed.get("channel", {}).get("alternatives", [])
|
||||
if not alternatives or not alternatives[0].get("transcript"):
|
||||
return
|
||||
|
||||
transcript = alternatives[0]["transcript"]
|
||||
if not transcript.strip():
|
||||
return
|
||||
|
||||
# Stop TTFB metrics on first transcript
|
||||
await self.stop_ttfb_metrics()
|
||||
|
||||
is_final = parsed.get("is_final", False)
|
||||
speech_final = parsed.get("speech_final", False)
|
||||
|
||||
# Extract language if available
|
||||
language = None
|
||||
if alternatives[0].get("languages"):
|
||||
language = alternatives[0]["languages"][0]
|
||||
language = Language(language)
|
||||
|
||||
if is_final and speech_final:
|
||||
# Final transcription
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
result=parsed,
|
||||
)
|
||||
)
|
||||
await self._handle_transcription(transcript, is_final, language)
|
||||
await self.stop_processing_metrics()
|
||||
else:
|
||||
# Interim transcription
|
||||
await self.push_frame(
|
||||
InterimTranscriptionFrame(
|
||||
transcript,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
result=parsed,
|
||||
)
|
||||
)
|
||||
|
||||
@traced_stt
|
||||
async def _handle_transcription(
|
||||
self, transcript: str, is_final: bool, language: Optional[Language] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing.
|
||||
|
||||
This method is decorated with @traced_stt for observability and tracing
|
||||
integration. The actual transcription processing is handled by the parent
|
||||
class and observers.
|
||||
|
||||
Args:
|
||||
transcript: The transcribed text.
|
||||
is_final: Whether this is a final transcription result.
|
||||
language: The detected language of the transcription, if available.
|
||||
"""
|
||||
pass
|
||||
|
||||
async def start_metrics(self):
|
||||
"""Start TTFB and processing metrics collection."""
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process frames with Deepgram SageMaker-specific handling.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame processing.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
# Start metrics when user starts speaking (if VAD is not provided by Deepgram)
|
||||
if isinstance(frame, UserStartedSpeakingFrame):
|
||||
await self.start_metrics()
|
||||
elif isinstance(frame, UserStoppedSpeakingFrame):
|
||||
# Send finalize message to Deepgram when user stops speaking
|
||||
# This tells Deepgram to flush any remaining audio and return final results
|
||||
if self._client and self._client.is_active:
|
||||
try:
|
||||
await self._client.send_json({"type": "Finalize"})
|
||||
except Exception as e:
|
||||
logger.warning(f"Error sending Finalize message: {e}")
|
||||
@@ -416,6 +416,8 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
Only used when commit_strategy is VAD. None uses ElevenLabs default.
|
||||
min_silence_duration_ms: Minimum silence duration for VAD (50-2000ms).
|
||||
Only used when commit_strategy is VAD. None uses ElevenLabs default.
|
||||
include_timestamps: Whether to include word-level timestamps in transcripts.
|
||||
enable_logging: Whether to enable logging on ElevenLabs' side.
|
||||
"""
|
||||
|
||||
language_code: Optional[str] = None
|
||||
@@ -424,6 +426,8 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
vad_threshold: Optional[float] = None
|
||||
min_speech_duration_ms: Optional[int] = None
|
||||
min_silence_duration_ms: Optional[int] = None
|
||||
include_timestamps: bool = False
|
||||
enable_logging: bool = False
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
@@ -628,10 +632,16 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
if self._params.language_code:
|
||||
params.append(f"language_code={self._params.language_code}")
|
||||
|
||||
params.append(f"encoding={self._audio_format}")
|
||||
params.append(f"sample_rate={self.sample_rate}")
|
||||
params.append(f"audio_format={self._audio_format}")
|
||||
params.append(f"commit_strategy={self._params.commit_strategy.value}")
|
||||
|
||||
# Add optional parameters
|
||||
if self._params.include_timestamps:
|
||||
params.append(f"include_timestamps={str(self._params.include_timestamps).lower()}")
|
||||
|
||||
if self._params.enable_logging:
|
||||
params.append(f"enable_logging={str(self._params.enable_logging).lower()}")
|
||||
|
||||
# Add VAD parameters if using VAD commit strategy and values are specified
|
||||
if self._params.commit_strategy == CommitStrategy.VAD:
|
||||
if self._params.vad_silence_threshold_secs is not None:
|
||||
@@ -720,15 +730,20 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
elif message_type == "committed_transcript_with_timestamps":
|
||||
await self._on_committed_transcript_with_timestamps(data)
|
||||
|
||||
elif message_type == "input_error":
|
||||
error_msg = data.get("error", "Unknown input error")
|
||||
logger.error(f"ElevenLabs input error: {error_msg}")
|
||||
await self.push_error(ErrorFrame(f"Input error: {error_msg}"))
|
||||
elif message_type == "error":
|
||||
error_msg = data.get("error", "Unknown error")
|
||||
logger.error(f"ElevenLabs error: {error_msg}")
|
||||
await self.push_error(ErrorFrame(f"Error: {error_msg}"))
|
||||
|
||||
elif message_type in ["auth_error", "quota_exceeded", "transcriber_error", "error"]:
|
||||
error_msg = data.get("error", data.get("message", "Unknown error"))
|
||||
logger.error(f"ElevenLabs error ({message_type}): {error_msg}")
|
||||
await self.push_error(ErrorFrame(f"{message_type}: {error_msg}"))
|
||||
elif message_type == "auth_error":
|
||||
error_msg = data.get("error", "Authentication error")
|
||||
logger.error(f"ElevenLabs auth error: {error_msg}")
|
||||
await self.push_error(ErrorFrame(f"Auth error: {error_msg}"))
|
||||
|
||||
elif message_type == "quota_exceeded_error":
|
||||
error_msg = data.get("error", "Quota exceeded")
|
||||
logger.error(f"ElevenLabs quota exceeded: {error_msg}")
|
||||
await self.push_error(ErrorFrame(f"Quota exceeded: {error_msg}"))
|
||||
|
||||
else:
|
||||
logger.debug(f"Unknown message type: {message_type}")
|
||||
@@ -773,6 +788,11 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
Args:
|
||||
data: Committed transcript data.
|
||||
"""
|
||||
# If timestamps are enabled, skip this message and wait for the
|
||||
# committed_transcript_with_timestamps message which contains all the data
|
||||
if self._params.include_timestamps:
|
||||
return
|
||||
|
||||
text = data.get("text", "").strip()
|
||||
if not text:
|
||||
return
|
||||
@@ -800,6 +820,18 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
async def _on_committed_transcript_with_timestamps(self, data: dict):
|
||||
"""Handle committed transcript with word-level timestamps.
|
||||
|
||||
This message is sent when include_timestamps=true. The result data includes:
|
||||
- text: The transcribed text
|
||||
- language_code: Detected language (if available)
|
||||
- words: Array of word objects with timing information:
|
||||
- text: The word text
|
||||
- start: Start time in seconds
|
||||
- end: End time in seconds
|
||||
- type: "word" or "spacing"
|
||||
- speaker_id: Speaker identifier (if available)
|
||||
- logprob: Log probability score (if available)
|
||||
- characters: Array of character strings (if available)
|
||||
|
||||
Args:
|
||||
data: Committed transcript data with timestamps.
|
||||
"""
|
||||
@@ -807,9 +839,24 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
if not text:
|
||||
return
|
||||
|
||||
logger.debug(f"Committed transcript with timestamps: [{text}]")
|
||||
logger.trace(f"Timestamps: {data.get('words', [])}")
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
# This is sent after the committed_transcript, so we don't need to
|
||||
# push another TranscriptionFrame, but we could use the timestamps
|
||||
# for additional processing if needed in the future
|
||||
# Get language if provided
|
||||
language = data.get("language_code")
|
||||
|
||||
logger.debug(f"Committed transcript with timestamps: [{text}]")
|
||||
|
||||
await self._handle_transcription(text, True, language)
|
||||
|
||||
# This message is sent after committed_transcript when include_timestamps=true.
|
||||
# It contains the full transcript data including text and word-level timestamps.
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
text,
|
||||
self._user_id,
|
||||
time_now_iso8601(),
|
||||
language,
|
||||
result=data,
|
||||
)
|
||||
)
|
||||
|
||||
96
src/pipecat/transports/livekit/utils.py
Normal file
96
src/pipecat/transports/livekit/utils.py
Normal file
@@ -0,0 +1,96 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""LiveKit REST Helpers.
|
||||
|
||||
Methods that wrap the LiveKit API for room management.
|
||||
"""
|
||||
|
||||
import aiohttp
|
||||
|
||||
|
||||
class LiveKitRESTHelper:
|
||||
"""Helper class for interacting with LiveKit's REST API.
|
||||
|
||||
Provides methods for managing LiveKit rooms.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
api_key: str,
|
||||
api_secret: str,
|
||||
api_url: str = "https://your-livekit-host.com",
|
||||
aiohttp_session: aiohttp.ClientSession,
|
||||
):
|
||||
"""Initialize the LiveKit REST helper.
|
||||
|
||||
Args:
|
||||
api_key: Your LiveKit API key.
|
||||
api_secret: Your LiveKit API secret.
|
||||
api_url: LiveKit server URL (e.g. "https://your-livekit-host.com").
|
||||
aiohttp_session: Async HTTP session for making requests.
|
||||
"""
|
||||
self.api_key = api_key
|
||||
self.api_secret = api_secret
|
||||
self.api_url = api_url.rstrip("/")
|
||||
self.aiohttp_session = aiohttp_session
|
||||
|
||||
def _create_access_token(self, room_create: bool = True) -> str:
|
||||
"""Create a signed access token for LiveKit API authentication.
|
||||
|
||||
Args:
|
||||
room_create: Whether to grant roomCreate permission.
|
||||
|
||||
Returns:
|
||||
Signed JWT access token.
|
||||
"""
|
||||
import time
|
||||
|
||||
import jwt
|
||||
|
||||
claims = {
|
||||
"iss": self.api_key,
|
||||
"sub": self.api_key,
|
||||
"nbf": int(time.time()),
|
||||
"exp": int(time.time()) + 60, # Token valid for 60 seconds
|
||||
"video": {
|
||||
"roomCreate": room_create,
|
||||
},
|
||||
}
|
||||
|
||||
return jwt.encode(claims, self.api_secret, algorithm="HS256")
|
||||
|
||||
async def delete_room_by_name(self, room_name: str) -> bool:
|
||||
"""Delete a LiveKit room by name.
|
||||
|
||||
This will forcibly disconnect all participants currently in the room.
|
||||
|
||||
Args:
|
||||
room_name: Name of the room to delete.
|
||||
|
||||
Returns:
|
||||
True if deletion was successful.
|
||||
|
||||
Raises:
|
||||
Exception: If deletion fails.
|
||||
"""
|
||||
token = self._create_access_token(room_create=True)
|
||||
headers = {
|
||||
"Authorization": f"Bearer {token}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
async with self.aiohttp_session.post(
|
||||
f"{self.api_url}/twirp/livekit.RoomService/DeleteRoom",
|
||||
headers=headers,
|
||||
json={"room": room_name},
|
||||
) as r:
|
||||
if r.status != 200:
|
||||
text = await r.text()
|
||||
raise Exception(f"Failed to delete room [{room_name}] (status: {r.status}): {text}")
|
||||
|
||||
return True
|
||||
189
tests/test_turn_aware_transcript_processor.py
Normal file
189
tests/test_turn_aware_transcript_processor.py
Normal file
@@ -0,0 +1,189 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
import unittest
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
AggregationType,
|
||||
BotStartedSpeakingFrame,
|
||||
BotStoppedSpeakingFrame,
|
||||
InterruptionFrame,
|
||||
TranscriptionFrame,
|
||||
TranscriptionUpdateFrame,
|
||||
TTSTextFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
)
|
||||
from pipecat.processors.transcript_processor import TurnAwareTranscriptProcessor
|
||||
from pipecat.tests.utils import SleepFrame, run_test
|
||||
|
||||
|
||||
class TestTurnAwareTranscriptProcessor(unittest.IsolatedAsyncioTestCase):
|
||||
"""Tests for TurnAwareTranscriptProcessor."""
|
||||
|
||||
async def test_basic_turn_flow(self):
|
||||
"""Test basic turn start/end with user and assistant speech."""
|
||||
processor = TurnAwareTranscriptProcessor()
|
||||
|
||||
# Track events
|
||||
turn_started_calls = []
|
||||
turn_ended_calls = []
|
||||
|
||||
@processor.event_handler("on_turn_started")
|
||||
async def on_turn_started(proc, turn_number):
|
||||
turn_started_calls.append(turn_number)
|
||||
|
||||
@processor.event_handler("on_turn_ended")
|
||||
async def on_turn_ended(proc, turn_number, user_text, assistant_text, interrupted):
|
||||
turn_ended_calls.append(
|
||||
{
|
||||
"turn_number": turn_number,
|
||||
"user_text": user_text,
|
||||
"assistant_text": assistant_text,
|
||||
"interrupted": interrupted,
|
||||
}
|
||||
)
|
||||
|
||||
frames_to_send = [
|
||||
# Turn 1: User speaks, bot responds
|
||||
UserStartedSpeakingFrame(),
|
||||
TranscriptionFrame(text="Hello", user_id="user1", timestamp=""),
|
||||
SleepFrame(sleep=0.01), # Allow transcription to process
|
||||
BotStartedSpeakingFrame(),
|
||||
TTSTextFrame(text="Hi", aggregated_by=AggregationType.WORD),
|
||||
TTSTextFrame(text=" there", aggregated_by=AggregationType.WORD),
|
||||
BotStoppedSpeakingFrame(),
|
||||
SleepFrame(sleep=0.1),
|
||||
]
|
||||
|
||||
await run_test(processor, frames_to_send=frames_to_send)
|
||||
|
||||
# Verify events
|
||||
self.assertEqual(
|
||||
len(turn_started_calls), 1, f"Expected 1 turn started, got {len(turn_started_calls)}"
|
||||
)
|
||||
self.assertEqual(turn_started_calls[0], 1)
|
||||
|
||||
self.assertEqual(
|
||||
len(turn_ended_calls), 1, f"Expected 1 turn ended, got {len(turn_ended_calls)}"
|
||||
)
|
||||
self.assertEqual(turn_ended_calls[0]["turn_number"], 1)
|
||||
self.assertEqual(turn_ended_calls[0]["user_text"], "Hello")
|
||||
self.assertEqual(turn_ended_calls[0]["assistant_text"], "Hi there")
|
||||
self.assertFalse(turn_ended_calls[0]["interrupted"])
|
||||
|
||||
async def test_interruption(self):
|
||||
"""Test turn ending on interruption."""
|
||||
processor = TurnAwareTranscriptProcessor()
|
||||
|
||||
# Track events
|
||||
turn_ended_calls = []
|
||||
|
||||
@processor.event_handler("on_turn_ended")
|
||||
async def on_turn_ended(proc, turn_number, user_text, assistant_text, interrupted):
|
||||
turn_ended_calls.append(
|
||||
{
|
||||
"turn_number": turn_number,
|
||||
"user_text": user_text,
|
||||
"assistant_text": assistant_text,
|
||||
"interrupted": interrupted,
|
||||
}
|
||||
)
|
||||
|
||||
frames_to_send = [
|
||||
# User speaks
|
||||
UserStartedSpeakingFrame(),
|
||||
TranscriptionFrame(text="Tell me", user_id="user1", timestamp=""),
|
||||
SleepFrame(sleep=0.01), # Allow transcription to process
|
||||
# Bot starts responding
|
||||
BotStartedSpeakingFrame(),
|
||||
TTSTextFrame(text="Sure", aggregated_by=AggregationType.WORD),
|
||||
TTSTextFrame(text=" I", aggregated_by=AggregationType.WORD),
|
||||
TTSTextFrame(text=" can", aggregated_by=AggregationType.WORD),
|
||||
# User interrupts
|
||||
InterruptionFrame(),
|
||||
# New turn starts
|
||||
UserStartedSpeakingFrame(),
|
||||
TranscriptionFrame(text="Wait", user_id="user1", timestamp=""),
|
||||
SleepFrame(sleep=0.1),
|
||||
]
|
||||
|
||||
await run_test(processor, frames_to_send=frames_to_send)
|
||||
|
||||
# Verify first turn was interrupted
|
||||
self.assertGreaterEqual(
|
||||
len(turn_ended_calls), 1, f"Expected at least 1 turn ended, got {len(turn_ended_calls)}"
|
||||
)
|
||||
first_turn = turn_ended_calls[0]
|
||||
self.assertEqual(first_turn["user_text"], "Tell me")
|
||||
# Note: In this test flow, InterruptionFrame arrives before TTSTextFrames are processed,
|
||||
# so assistant text may be empty. In real scenarios, word timestamps ensure proper capture.
|
||||
self.assertIn(first_turn["assistant_text"], ["", "Sure I can", "Sure I can"])
|
||||
self.assertTrue(first_turn["interrupted"])
|
||||
|
||||
async def test_multiple_turns(self):
|
||||
"""Test multiple back-and-forth turns."""
|
||||
processor = TurnAwareTranscriptProcessor()
|
||||
|
||||
# Track events
|
||||
turn_started_calls = []
|
||||
turn_ended_calls = []
|
||||
|
||||
@processor.event_handler("on_turn_started")
|
||||
async def on_turn_started(proc, turn_number):
|
||||
turn_started_calls.append(turn_number)
|
||||
|
||||
@processor.event_handler("on_turn_ended")
|
||||
async def on_turn_ended(proc, turn_number, user_text, assistant_text, interrupted):
|
||||
turn_ended_calls.append(
|
||||
{
|
||||
"turn_number": turn_number,
|
||||
"user_text": user_text,
|
||||
"assistant_text": assistant_text,
|
||||
}
|
||||
)
|
||||
|
||||
frames_to_send = [
|
||||
# Turn 1
|
||||
UserStartedSpeakingFrame(),
|
||||
TranscriptionFrame(text="Hi", user_id="user1", timestamp=""),
|
||||
SleepFrame(sleep=0.01), # Allow transcription to process
|
||||
BotStartedSpeakingFrame(),
|
||||
TTSTextFrame(text="Hello", aggregated_by=AggregationType.WORD),
|
||||
BotStoppedSpeakingFrame(),
|
||||
SleepFrame(sleep=0.05),
|
||||
# Turn 2
|
||||
UserStartedSpeakingFrame(),
|
||||
TranscriptionFrame(text="How are you", user_id="user1", timestamp=""),
|
||||
SleepFrame(sleep=0.01), # Allow transcription to process
|
||||
BotStartedSpeakingFrame(),
|
||||
TTSTextFrame(text="I'm", aggregated_by=AggregationType.WORD),
|
||||
TTSTextFrame(text=" good", aggregated_by=AggregationType.WORD),
|
||||
BotStoppedSpeakingFrame(),
|
||||
SleepFrame(sleep=0.1),
|
||||
]
|
||||
|
||||
await run_test(processor, frames_to_send=frames_to_send)
|
||||
|
||||
# Verify multiple turns
|
||||
self.assertEqual(
|
||||
len(turn_started_calls), 2, f"Expected 2 turns started, got {len(turn_started_calls)}"
|
||||
)
|
||||
self.assertEqual(turn_started_calls, [1, 2])
|
||||
|
||||
self.assertEqual(
|
||||
len(turn_ended_calls), 2, f"Expected 2 turns ended, got {len(turn_ended_calls)}"
|
||||
)
|
||||
self.assertEqual(turn_ended_calls[0]["turn_number"], 1)
|
||||
self.assertEqual(turn_ended_calls[0]["user_text"], "Hi")
|
||||
self.assertEqual(turn_ended_calls[0]["assistant_text"], "Hello")
|
||||
|
||||
self.assertEqual(turn_ended_calls[1]["turn_number"], 2)
|
||||
self.assertEqual(turn_ended_calls[1]["user_text"], "How are you")
|
||||
self.assertEqual(turn_ended_calls[1]["assistant_text"], "I'm good")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
120
uv.lock
generated
120
uv.lock
generated
@@ -45,20 +45,20 @@ sdist = { url = "https://files.pythonhosted.org/packages/99/83/bf38b95d98c67b8eb
|
||||
|
||||
[[package]]
|
||||
name = "aioboto3"
|
||||
version = "15.0.0"
|
||||
version = "15.5.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "aiobotocore", extra = ["boto3"] },
|
||||
{ name = "aiofiles" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/80/d0/ed107e16551ba1b93ddcca9a6bf79580450945268a8bc396530687b3189f/aioboto3-15.0.0.tar.gz", hash = "sha256:dce40b701d1f8e0886dc874d27cd9799b8bf6b32d63743f57e7bef7e4a562756", size = 225278, upload-time = "2025-06-26T16:30:48.967Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/a2/01/92e9ab00f36e2899315f49eefcd5b4685fbb19016c7f19a9edf06da80bb0/aioboto3-15.5.0.tar.gz", hash = "sha256:ea8d8787d315594842fbfcf2c4dce3bac2ad61be275bc8584b2ce9a3402a6979", size = 255069, upload-time = "2025-10-30T13:37:16.122Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/bf/95/d69c744f408e5e4592fe53ed98fc244dd13b83d84cf1f83b2499d98bfcc9/aioboto3-15.0.0-py3-none-any.whl", hash = "sha256:9cf54b3627c8b34bb82eaf43ab327e7027e37f92b1e10dd5cfe343cd512568d0", size = 35785, upload-time = "2025-06-26T16:30:47.444Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/e5/3e/e8f5b665bca646d43b916763c901e00a07e40f7746c9128bdc912a089424/aioboto3-15.5.0-py3-none-any.whl", hash = "sha256:cc880c4d6a8481dd7e05da89f41c384dbd841454fc1998ae25ca9c39201437a6", size = 35913, upload-time = "2025-10-30T13:37:14.549Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "aiobotocore"
|
||||
version = "2.23.0"
|
||||
version = "2.25.1"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "aiohttp" },
|
||||
@@ -69,9 +69,9 @@ dependencies = [
|
||||
{ name = "python-dateutil" },
|
||||
{ name = "wrapt" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/9d/25/4b06ea1214ddf020a28df27dc7136ac9dfaf87929d51e6f6044dd350ed67/aiobotocore-2.23.0.tar.gz", hash = "sha256:0333931365a6c7053aee292fe6ef50c74690c4ae06bb019afdf706cb6f2f5e32", size = 115825, upload-time = "2025-06-12T23:46:38.055Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/62/94/2e4ec48cf1abb89971cb2612d86f979a6240520f0a659b53a43116d344dc/aiobotocore-2.25.1.tar.gz", hash = "sha256:ea9be739bfd7ece8864f072ec99bb9ed5c7e78ebb2b0b15f29781fbe02daedbc", size = 120560, upload-time = "2025-10-28T22:33:21.787Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/ea/43/ccf9b29669cdb09fd4bfc0a8effeb2973b22a0f3c3be4142d0b485975d11/aiobotocore-2.23.0-py3-none-any.whl", hash = "sha256:8202cebbf147804a083a02bc282fbfda873bfdd0065fd34b64784acb7757b66e", size = 84161, upload-time = "2025-06-12T23:46:36.305Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/95/2a/d275ec4ce5cd0096665043995a7d76f5d0524853c76a3d04656de49f8808/aiobotocore-2.25.1-py3-none-any.whl", hash = "sha256:eb6daebe3cbef5b39a0bb2a97cffbe9c7cb46b2fcc399ad141f369f3c2134b1f", size = 86039, upload-time = "2025-10-28T22:33:19.949Z" },
|
||||
]
|
||||
|
||||
[package.optional-dependencies]
|
||||
@@ -419,16 +419,30 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "aws-sdk-bedrock-runtime"
|
||||
version = "0.1.1"
|
||||
version = "0.2.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "smithy-aws-core", extra = ["eventstream", "json"], marker = "python_full_version >= '3.12'" },
|
||||
{ name = "smithy-core", marker = "python_full_version >= '3.12'" },
|
||||
{ name = "smithy-http", extra = ["awscrt"], marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/1d/78/48574454b3cac869df67665e4a403ebfc3abfcfba2c2ff01ccfd67d55f8f/aws_sdk_bedrock_runtime-0.1.1.tar.gz", hash = "sha256:c896f99e675c3a1ab600633a07b785f3dc9fe8ab94f640b1f992b63da2dfc784", size = 82446, upload-time = "2025-10-21T20:25:25.845Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/db/94/f2451bb09c106e5690bbb88fc366637cdcec942b352ed9bb788804c877e0/aws_sdk_bedrock_runtime-0.2.0.tar.gz", hash = "sha256:8de52dd4492e74c73244d4b41a52304e1db368814a10e49dbbf8f4e8e412cd0e", size = 88156, upload-time = "2025-11-22T00:35:44.978Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/83/07/62c0b70223d178c138f29124ac2f7973a6ba803abc7735b6a01a85217f3d/aws_sdk_bedrock_runtime-0.1.1-py3-none-any.whl", hash = "sha256:c0336b377b2112cf88197d3d44302fbeb3efb1101989fa49ae55e78f49cfe345", size = 74954, upload-time = "2025-10-21T20:25:24.973Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/eb/6b/07fbddd31dd6e38c967fe088b5e91a7cc3a2bc0f645f18b4e5d45bc03f1f/aws_sdk_bedrock_runtime-0.2.0-py3-none-any.whl", hash = "sha256:19594de50a52d199d73efca153c0a2328bd781827715a6e012d50b11085236cc", size = 79875, upload-time = "2025-11-22T00:35:44.092Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "aws-sdk-sagemaker-runtime-http2"
|
||||
version = "0.1.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "smithy-aws-core", extra = ["eventstream", "json"], marker = "python_full_version >= '3.12'" },
|
||||
{ name = "smithy-core", marker = "python_full_version >= '3.12'" },
|
||||
{ name = "smithy-http", extra = ["awscrt"], marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/6e/ca/00f9c55887fc0f3fa345995dd871d40ff81473ab1591e56b4b4483d99d00/aws_sdk_sagemaker_runtime_http2-0.1.0.tar.gz", hash = "sha256:5077ec0c4440495b15004bbf04e27bc0bc137f1f8950d32195c6b45d7788d837", size = 20863, upload-time = "2025-11-22T00:20:56.358Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/9c/24/2e2f727c51c20f4625cd19364d9421dbd7c893fe2b53a46eb0caaf6263a2/aws_sdk_sagemaker_runtime_http2-0.1.0-py3-none-any.whl", hash = "sha256:1aebb728ba6c6d14e58e29ecf89b51f7abbe8786d34144f8a7d59a419e80bd2f", size = 21911, upload-time = "2025-11-22T00:20:55.054Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -606,30 +620,30 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "boto3"
|
||||
version = "1.38.27"
|
||||
version = "1.40.61"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "botocore" },
|
||||
{ name = "jmespath" },
|
||||
{ name = "s3transfer" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/e7/96/fc74d8521d2369dd8c412438401ff12e1350a1cd3eab5c758ed3dd5e5f82/boto3-1.38.27.tar.gz", hash = "sha256:94bd7fdd92d5701b362d4df100d21e28f8307a67ff56b6a8b0398119cf22f859", size = 111875, upload-time = "2025-05-30T19:32:41.352Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/ed/f9/6ef8feb52c3cce5ec3967a535a6114b57ac7949fd166b0f3090c2b06e4e5/boto3-1.40.61.tar.gz", hash = "sha256:d6c56277251adf6c2bdd25249feae625abe4966831676689ff23b4694dea5b12", size = 111535, upload-time = "2025-10-28T19:26:57.247Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/43/8b/b2361188bd1e293eede1bc165e2461d390394f71ec0c8c21211c8dabf62c/boto3-1.38.27-py3-none-any.whl", hash = "sha256:95f5fe688795303a8a15e8b7e7f255cadab35eae459d00cc281a4fd77252ea80", size = 139938, upload-time = "2025-05-30T19:32:38.006Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/61/24/3bf865b07d15fea85b63504856e137029b6acbc73762496064219cdb265d/boto3-1.40.61-py3-none-any.whl", hash = "sha256:6b9c57b2a922b5d8c17766e29ed792586a818098efe84def27c8f582b33f898c", size = 139321, upload-time = "2025-10-28T19:26:55.007Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "botocore"
|
||||
version = "1.38.27"
|
||||
version = "1.40.61"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "jmespath" },
|
||||
{ name = "python-dateutil" },
|
||||
{ name = "urllib3" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/36/5e/67899214ad57f7f26af5bd776ac5eb583dc4ecf5c1e52e2cbfdc200e487a/botocore-1.38.27.tar.gz", hash = "sha256:9788f7efe974328a38cbade64cc0b1e67d27944b899f88cb786ae362973133b6", size = 13919963, upload-time = "2025-05-30T19:32:29.657Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/28/a3/81d3a47c2dbfd76f185d3b894f2ad01a75096c006a2dd91f237dca182188/botocore-1.40.61.tar.gz", hash = "sha256:a2487ad69b090f9cccd64cf07c7021cd80ee9c0655ad974f87045b02f3ef52cd", size = 14393956, upload-time = "2025-10-28T19:26:46.108Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/7e/83/a753562020b69fa90cebc39e8af2c753b24dcdc74bee8355ee3f6cefdf34/botocore-1.38.27-py3-none-any.whl", hash = "sha256:a785d5e9a5eda88ad6ab9ed8b87d1f2ac409d0226bba6ff801c55359e94d91a8", size = 13580545, upload-time = "2025-05-30T19:32:26.712Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/38/c5/f6ce561004db45f0b847c2cd9b19c67c6bf348a82018a48cb718be6b58b0/botocore-1.40.61-py3-none-any.whl", hash = "sha256:17ebae412692fd4824f99cde0f08d50126dc97954008e5ba2b522eb049238aa7", size = 14055973, upload-time = "2025-10-28T19:26:42.15Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -4508,6 +4522,7 @@ langchain = [
|
||||
livekit = [
|
||||
{ name = "livekit" },
|
||||
{ name = "livekit-api" },
|
||||
{ name = "pyjwt" },
|
||||
{ name = "tenacity" },
|
||||
]
|
||||
lmnt = [
|
||||
@@ -4569,6 +4584,9 @@ runner = [
|
||||
{ name = "python-dotenv" },
|
||||
{ name = "uvicorn" },
|
||||
]
|
||||
sagemaker = [
|
||||
{ name = "aws-sdk-sagemaker-runtime-http2", marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sarvam = [
|
||||
{ name = "sarvamai" },
|
||||
{ name = "websockets" },
|
||||
@@ -4648,13 +4666,14 @@ docs = [
|
||||
requires-dist = [
|
||||
{ name = "accelerate", marker = "extra == 'moondream'", specifier = "~=1.10.0" },
|
||||
{ name = "aic-sdk", marker = "extra == 'aic'", specifier = "~=1.1.0" },
|
||||
{ name = "aioboto3", marker = "extra == 'aws'", specifier = "~=15.0.0" },
|
||||
{ name = "aioboto3", marker = "extra == 'aws'", specifier = "~=15.5.0" },
|
||||
{ name = "aiofiles", specifier = ">=24.1.0,<25" },
|
||||
{ name = "aiohttp", specifier = ">=3.11.12,<4" },
|
||||
{ name = "aiortc", marker = "extra == 'webrtc'", specifier = ">=1.13.0,<2" },
|
||||
{ name = "anthropic", marker = "extra == 'anthropic'", specifier = "~=0.49.0" },
|
||||
{ name = "audioop-lts", marker = "python_full_version >= '3.13'", specifier = "~=0.2.1" },
|
||||
{ name = "aws-sdk-bedrock-runtime", marker = "python_full_version >= '3.12' and extra == 'aws-nova-sonic'", specifier = "~=0.1.1" },
|
||||
{ name = "aws-sdk-bedrock-runtime", marker = "python_full_version >= '3.12' and extra == 'aws-nova-sonic'", specifier = "~=0.2.0" },
|
||||
{ name = "aws-sdk-sagemaker-runtime-http2", marker = "python_full_version >= '3.12' and extra == 'sagemaker'" },
|
||||
{ name = "azure-cognitiveservices-speech", marker = "extra == 'azure'", specifier = "~=1.42.0" },
|
||||
{ name = "cartesia", marker = "extra == 'cartesia'", specifier = "~=2.0.3" },
|
||||
{ name = "coremltools", marker = "extra == 'local-smart-turn'", specifier = ">=8.0" },
|
||||
@@ -4721,6 +4740,7 @@ requires-dist = [
|
||||
{ name = "pyaudio", marker = "extra == 'local'", specifier = "~=0.2.14" },
|
||||
{ name = "pydantic", specifier = ">=2.10.6,<3" },
|
||||
{ name = "pygobject", marker = "extra == 'gstreamer'", specifier = "~=3.50.0" },
|
||||
{ name = "pyjwt", marker = "extra == 'livekit'", specifier = ">=2.10.1" },
|
||||
{ name = "pyloudnorm", specifier = "~=0.1.1" },
|
||||
{ name = "python-dotenv", marker = "extra == 'runner'", specifier = ">=1.0.0,<2.0.0" },
|
||||
{ name = "pyvips", extras = ["binary"], marker = "extra == 'moondream'", specifier = "~=3.0.0" },
|
||||
@@ -4745,7 +4765,7 @@ requires-dist = [
|
||||
{ name = "wait-for2", marker = "python_full_version < '3.12'", specifier = ">=0.4.1" },
|
||||
{ name = "websockets", marker = "extra == 'websockets-base'", specifier = ">=13.1,<16.0" },
|
||||
]
|
||||
provides-extras = ["aic", "anthropic", "assemblyai", "asyncai", "aws", "aws-nova-sonic", "azure", "cartesia", "cerebras", "deepseek", "daily", "deepgram", "elevenlabs", "fal", "fireworks", "fish", "gladia", "google", "grok", "groq", "gstreamer", "heygen", "hume", "inworld", "krisp", "koala", "langchain", "livekit", "lmnt", "local", "mcp", "mem0", "mistral", "mlx-whisper", "moondream", "nim", "neuphonic", "noisereduce", "openai", "openpipe", "openrouter", "perplexity", "playht", "qwen", "rime", "riva", "runner", "sambanova", "sarvam", "sentry", "local-smart-turn", "local-smart-turn-v3", "remote-smart-turn", "silero", "simli", "soniox", "soundfile", "speechmatics", "strands", "tavus", "together", "tracing", "ultravox", "webrtc", "websocket", "websockets-base", "whisper"]
|
||||
provides-extras = ["aic", "anthropic", "assemblyai", "asyncai", "aws", "aws-nova-sonic", "azure", "cartesia", "cerebras", "daily", "deepgram", "deepseek", "elevenlabs", "fal", "fireworks", "fish", "gladia", "google", "grok", "groq", "gstreamer", "heygen", "hume", "inworld", "koala", "krisp", "langchain", "livekit", "lmnt", "local", "local-smart-turn", "local-smart-turn-v3", "mcp", "mem0", "mistral", "mlx-whisper", "moondream", "neuphonic", "nim", "noisereduce", "openai", "openpipe", "openrouter", "perplexity", "playht", "qwen", "remote-smart-turn", "rime", "riva", "runner", "sagemaker", "sambanova", "sarvam", "sentry", "silero", "simli", "soniox", "soundfile", "speechmatics", "strands", "tavus", "together", "tracing", "ultravox", "webrtc", "websocket", "websockets-base", "whisper"]
|
||||
|
||||
[package.metadata.requires-dev]
|
||||
dev = [
|
||||
@@ -6202,14 +6222,14 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "s3transfer"
|
||||
version = "0.13.1"
|
||||
version = "0.14.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "botocore" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/6d/05/d52bf1e65044b4e5e27d4e63e8d1579dbdec54fce685908ae09bc3720030/s3transfer-0.13.1.tar.gz", hash = "sha256:c3fdba22ba1bd367922f27ec8032d6a1cf5f10c934fb5d68cf60fd5a23d936cf", size = 150589, upload-time = "2025-07-18T19:22:42.31Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/62/74/8d69dcb7a9efe8baa2046891735e5dfe433ad558ae23d9e3c14c633d1d58/s3transfer-0.14.0.tar.gz", hash = "sha256:eff12264e7c8b4985074ccce27a3b38a485bb7f7422cc8046fee9be4983e4125", size = 151547, upload-time = "2025-09-09T19:23:31.089Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/6d/4f/d073e09df851cfa251ef7840007d04db3293a0482ce607d2b993926089be/s3transfer-0.13.1-py3-none-any.whl", hash = "sha256:a981aa7429be23fe6dfc13e80e4020057cbab622b08c0315288758d67cabc724", size = 85308, upload-time = "2025-07-18T19:22:40.947Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/48/f0/ae7ca09223a81a1d890b2557186ea015f6e0502e9b8cb8e1813f1d8cfa4e/s3transfer-0.14.0-py3-none-any.whl", hash = "sha256:ea3b790c7077558ed1f02a3072fb3cb992bbbd253392f4b6e9e8976941c7d456", size = 85712, upload-time = "2025-09-09T19:23:30.041Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -6522,16 +6542,16 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "smithy-aws-core"
|
||||
version = "0.1.1"
|
||||
version = "0.2.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "aws-sdk-signers", marker = "python_full_version >= '3.12'" },
|
||||
{ name = "smithy-core", marker = "python_full_version >= '3.12'" },
|
||||
{ name = "smithy-http", marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/56/d3/f847e0fd36b95aa36ce3a4c9ce1a08e16b2aa9a56b71714045c9c924e282/smithy_aws_core-0.1.1.tar.gz", hash = "sha256:78dfd7040fc2bc72b6af293096642fc9a7bfd2db28ddbdf7c4110535eab9d662", size = 11196, upload-time = "2025-10-21T20:21:18.648Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/c1/c8/5970c869527972b23a1733de3993d54283c84a2340e84acdd48a11aa0ff4/smithy_aws_core-0.2.0.tar.gz", hash = "sha256:dfa1ecd311d6f0a16f48c86d793085e2a0a33a46de897d129dd1f142a43bf7f6", size = 11344, upload-time = "2025-11-21T18:33:01.928Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/d0/04/87cb06f0f6d664b5cffdef6d4042dd52c11c138436084d30ffdaa3543031/smithy_aws_core-0.1.1-py3-none-any.whl", hash = "sha256:0d1634f276c2999dc2a04fafef63b9d28309de50d939d1d49df952773a7063c4", size = 18963, upload-time = "2025-10-21T20:21:17.692Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/88/25/739c0005a6cb4effbc2d37fe23590660b508fe314200f4acf94410a8f315/smithy_aws_core-0.2.0-py3-none-any.whl", hash = "sha256:d112082ef77758e1977f8694cf690ac35c76570c12a6590fccd5da085a3ce507", size = 18966, upload-time = "2025-11-21T18:33:00.812Z" },
|
||||
]
|
||||
|
||||
[package.optional-dependencies]
|
||||
@@ -6544,35 +6564,35 @@ json = [
|
||||
|
||||
[[package]]
|
||||
name = "smithy-aws-event-stream"
|
||||
version = "0.1.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "smithy-core", marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/49/26/8ff24194efed60b2df18f610ea05fa2a4c6546858b80a0a51335a4943b9b/smithy_aws_event_stream-0.1.0.tar.gz", hash = "sha256:6634691a3bf5d4801a2c29f0761db2dc4771f3ae43cdee50c10d4b4bb2f86475", size = 12216, upload-time = "2025-09-29T19:37:14.659Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/90/c4/2b63d31af58fc359577e5515bf730348a235f2f2fa10e17af8640495c81c/smithy_aws_event_stream-0.1.0-py3-none-any.whl", hash = "sha256:17a7300a85cb90df4c6c23f895ca6343361fa419203c3cf80019edd7d3b5f036", size = 15581, upload-time = "2025-09-29T19:37:13.589Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "smithy-core"
|
||||
version = "0.1.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/b9/8d/16028d03456071d21de7591f1e1e6a1cc81b2389e53ef8663dbf59caf9cd/smithy_core-0.1.0.tar.gz", hash = "sha256:b159b8905264e1e4c613eab9f74cec0b2f5b8119c42fbadddb4da0a8ed8050e9", size = 48415, upload-time = "2025-09-29T19:37:16.873Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/ca/5b/563cb2beadcfa40597b0c3ff3f2d42e21f065b14782c4ba9cb41a44b745f/smithy_core-0.1.0-py3-none-any.whl", hash = "sha256:cb44e9355fb89e89f2c6ba6a1d59c5db4f2f7282c72d31d9307b6202d66cd0fa", size = 62895, upload-time = "2025-09-29T19:37:15.917Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "smithy-http"
|
||||
version = "0.2.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "smithy-core", marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/3c/1c/44e99a7dfb8c39bf0c3d998accdf4573a7a3488863b90f21af260cec2d45/smithy_http-0.2.0.tar.gz", hash = "sha256:2382562fa9af326be455f14b18615f16ffe9db756e51b2a4ca0d23e1b881cff8", size = 26729, upload-time = "2025-10-21T20:21:06.146Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/31/90/78283c21484f8cf9862982e53bc2769b784910735fb5fb2400a17bfb5fdd/smithy_aws_event_stream-0.2.0.tar.gz", hash = "sha256:99700a11346e7ab1435ff2e53e6f6d60a1e857f2b2ee1941d40b54270adf3323", size = 12278, upload-time = "2025-11-21T18:33:03.79Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/d4/e2/d475fad81ac74ec0e145cb6d72afe5ecde4e2358bd632c2fd5d3f4bc87dc/smithy_http-0.2.0-py3-none-any.whl", hash = "sha256:49ee2402d7737798d70f99f491fbfb2a5767283ae562e21b6f86e3fd14f3e3e0", size = 37328, upload-time = "2025-10-21T20:21:05.362Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/ca/f5/08b997eee81b55150496ce565f0e03c72d0c80e5b218170bdeae7c46a5a4/smithy_aws_event_stream-0.2.0-py3-none-any.whl", hash = "sha256:679a0c7d944e67d3a55d287541b3ca1e61f9d6a62e13401367dcc034e75aa55d", size = 15567, upload-time = "2025-11-21T18:33:02.711Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "smithy-core"
|
||||
version = "0.2.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/c7/f6/140f0be9331dd7cd8fa012b3ca4735df39a1a81d03eea89728f997249116/smithy_core-0.2.0.tar.gz", hash = "sha256:05c3e3309df5dcb9cf53e241bd57a96510e4575186443ea157db9dbb59b6c85e", size = 50334, upload-time = "2025-11-21T18:33:05.697Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/16/e3/d0defa2acf50b91625fe15e3ddb0c8e41ff64363a1f4cd9b8f19ae2ec0c6/smithy_core-0.2.0-py3-none-any.whl", hash = "sha256:db4620da3497abb60f79ac1d8a738d3eac46d7e820bfb50c777c36e932915239", size = 64777, upload-time = "2025-11-21T18:33:04.591Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "smithy-http"
|
||||
version = "0.3.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "smithy-core", marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/1c/c7/4d8be56e897f99f3b6ffcdf52ba00a468febc939fca85b90f1c122450830/smithy_http-0.3.0.tar.gz", hash = "sha256:55dcc3af315eee6863d2f3f58ada1d9cb4bcc3a57faac10a1b21d4a93722f520", size = 28674, upload-time = "2025-11-21T18:33:07.387Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/2d/e5/59ae79ecdc9a935ad10512c581b3054ebb1afd90498ecc8afaf141dbc22b/smithy_http-0.3.0-py3-none-any.whl", hash = "sha256:972924304febd77c7134a7cffab83ce3b48423ff966dcc1f257e2c0d58fa9b18", size = 40520, upload-time = "2025-11-21T18:33:06.312Z" },
|
||||
]
|
||||
|
||||
[package.optional-dependencies]
|
||||
@@ -6582,15 +6602,15 @@ awscrt = [
|
||||
|
||||
[[package]]
|
||||
name = "smithy-json"
|
||||
version = "0.1.0"
|
||||
version = "0.2.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "ijson", marker = "python_full_version >= '3.12'" },
|
||||
{ name = "smithy-core", marker = "python_full_version >= '3.12'" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/e2/5b/0ecb10007475e1b8faca3bbff1be2fc6edb3ea12ffc5e939e2249be95325/smithy_json-0.1.0.tar.gz", hash = "sha256:84fb48e445b87d850c240d837702c16b259ea53bad76b655ac1bbd8094d48912", size = 7086, upload-time = "2025-09-29T19:37:20.432Z" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/89/cf/e319a2a299b27bc0addf46ee3d4b9c25ec0817e3a0507b2b7a33eddc19f1/smithy_json-0.2.0.tar.gz", hash = "sha256:0946066fdda15d6a579dfdd4b61a547ab915eb057bd176fc2bc17d01dc789499", size = 7157, upload-time = "2025-11-21T18:33:08.968Z" }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/62/95/e11c04e56aae12b62e38c49000004a1dc598a64dc207018c08448efde322/smithy_json-0.1.0-py3-none-any.whl", hash = "sha256:80ff64734dccdabf1ba6a2908555b97e60f62c07c3a27df48e421ee058413cb9", size = 9914, upload-time = "2025-09-29T19:37:19.459Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/2e/b1/33012ac5b2e5940a00b6e1ccc313330e6f8692152a151f72a398cd6be0e0/smithy_json-0.2.0-py3-none-any.whl", hash = "sha256:5018a4e61731afa3094a02d737d4f956dbf270c271410c089045a17d86fc3b3b", size = 9911, upload-time = "2025-11-21T18:33:08.267Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
|
||||
Reference in New Issue
Block a user