Compare commits

..

23 Commits

Author SHA1 Message Date
mattie ruth backman
e8640d84ae test fix now that we send an aggregated text frame for non word-by-word tts services 2025-11-14 17:13:08 -05:00
mattie ruth backman
23e4e29999 CHANGELOG fixes 2025-11-14 13:57:49 -05:00
mattie ruth backman
713b488bb6 Final PR Feedback changes 2025-11-14 13:54:20 -05:00
mattie ruth backman
71b87fd420 add transformers to initialization args 2025-11-14 13:54:20 -05:00
mattie ruth backman
3f269f9834 Add backwards compatibility for add_pattern_pair 2025-11-14 13:54:20 -05:00
mattie ruth backman
4c698777f3 PR Feedback 2025-11-14 13:54:20 -05:00
mattie ruth backman
5ca04ad741 CHANGELOG updates 2025-11-14 13:54:20 -05:00
mattie ruth backman
9a3902a82c Introducing a new processor: LLMTextProcessor
This new processor wraps an aggregator that can be overridden for the purposes
of customizing how the llm output gets categorized and handled in the pipeline.

Along with this, we are deprecating the ability to override the default
aggregator in the TTS to encourage use of the LLMTextProcessor in cases where
custom aggregation is needed.

This PR also:
- Introduces TTSService.transform_aggregation_type():
  This function provides the ability to provide callbacks to the TTS to
  transform text based on its aggregated type prior to sending the text to the
  underlying TTS service. This makes it possible to do things like introduce
  TTS-specific tags for spelling or emotion or change the pronunciation of
  something on the fly.
- Introduces to the RTVIObserver:
  - new init field skip_aggregator_types: A way to provide a list of aggregation
    types that should not be included in bot-output (or tts-text) messages
  - transform_aggregation_type(): Same as with TTSService, this allows you
    to provide a callback to transform text being sent as bot-output before
    it gets sent.
2025-11-14 13:54:20 -05:00
mattie ruth backman
8ab0c92681 Rename AggregatedLLMTextFrame to AggregatedTextFrame and made built-in types an enum 2025-11-14 13:54:20 -05:00
mattie ruth backman
124f147a37 CHANGELOG improvements 2025-11-14 13:54:18 -05:00
mattie ruth backman
ed808a9246 Fix new test and str version of PatternMatch 2025-11-14 13:53:23 -05:00
mattie ruth backman
e9de9daf8c Update PatternPairAggregator patterns to replace pattern_id with type to simplify the API 2025-11-14 13:53:23 -05:00
mattie ruth backman
82b9c4f0b6 various PR Review fixes:
1. Added support for turning off bot-output messages with the bot_output_enabled flag
2. Cleaned up logic and comments around TTSService:_push_tts_frames to hopefully make
   it easier to understand
3. Other minor cleanup
2025-11-14 13:53:23 -05:00
mattie ruth backman
5dfe20be91 Update Changelog 2025-11-14 13:53:22 -05:00
mattie ruth backman
0d2c5286fa Support customization over the way the assistant aggregator aggregates LLMTextFrames when tts_skip is on 2025-11-14 13:51:45 -05:00
mattie ruth backman
29417ba44d Move aggregation logic when skip_tts is on to the assistant aggregator 2025-11-14 13:51:45 -05:00
mattie ruth backman
bc6a9cac26 Add append_to_context boolean field to TextFrames
This allows any given TextFrame to be marked in a way such that it does not get
added to the context.

Specifically, this fixes a problem with the new AggregatedTextFrames where we
need to send LLM text both in an aggregated form as well as word-by-word but
avoid duplicating the text in the context.
2025-11-14 13:51:45 -05:00
mattie ruth backman
8a90decbc0 codepilot review fixes 2025-11-14 13:51:45 -05:00
mattie ruth backman
ccca6e8d81 Make the PatternPair action an Enum 2025-11-14 13:51:45 -05:00
mattie ruth backman
e6dc1a510d Introduce AggregatedLLMTextFrame to allow a separation of TTSTextFrame, indicating a spoken frame vs other aggregated, non-spoken frames 2025-11-14 13:51:45 -05:00
mattie ruth backman
69945c5e0d Various fixes:
1. Fixed pattern_pair_aggregator to support various ways of handling
   pattern matches (remove, keep and just trigger a callback, or
   aggregate)
2. Fixed ivr_navigator use of pattern_pair_aggregator
3. Test fixes -- Tests now pass
2025-11-14 13:51:45 -05:00
mattie ruth backman
5c8635570d test fixes 2025-11-14 13:51:45 -05:00
mattie ruth backman
fe9aa3383e Adding support for new bot-output RTVI Message:
1. TTSTextFrames now include metadata about whether the text was spoken
   or not along with a type string to describe what the text represents:
   ex. "sentence", "word", "custom aggregation"
2. Expanded how aggregators work so that the aggregate method returns
   aggregated text along with the type of aggregation used to create it
3. Deprecated the RTVI bot-transcription event in lieu of...
4. Introduced support for a new bot-output event. This event is meant
   to be the one stop shop for communicating what the bot actually "says".
   It is based off TTSTextFrames to communicate both sentence by sentence
   (or whatever aggregation is used) as well as word by word. In addition,
   it will include LLMTextFrames, aggregated by sentence when tts is
   turned off (i.e. skip_tts is true).

Resolves pipecat-ai/pipecat-client-web#158
2025-11-14 13:51:45 -05:00
207 changed files with 5193 additions and 11735 deletions

View File

@@ -1,174 +0,0 @@
name: Generate Changelog for Release
on:
workflow_dispatch:
inputs:
version:
description: "Release version (e.g., 0.0.97)"
required: true
type: string
date:
description: "Release date (YYYY-MM-DD format, defaults to today)"
required: false
type: string
default: ""
permissions:
contents: write
pull-requests: write
jobs:
generate-changelog:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@v4
with:
enable-cache: true
- name: Install dependencies
run: |
uv sync --group dev
- name: Set release date
id: set_date
run: |
if [ -z "${{ inputs.date }}" ]; then
RELEASE_DATE=$(date +%Y-%m-%d)
echo "Using today's date: $RELEASE_DATE"
else
RELEASE_DATE="${{ inputs.date }}"
echo "Using provided date: $RELEASE_DATE"
fi
echo "release_date=$RELEASE_DATE" >> $GITHUB_OUTPUT
- name: Validate inputs
run: |
# Validate version format (basic check)
if ! [[ "${{ inputs.version }}" =~ ^[0-9]+\.[0-9]+\.[0-9]+.*$ ]]; then
echo "Error: Version must be in format X.Y.Z (e.g., 0.0.97)"
exit 1
fi
# Validate date format if provided
if [ -n "${{ inputs.date }}" ]; then
if ! date -d "${{ inputs.date }}" >/dev/null 2>&1; then
# Try macOS date format
if ! date -j -f "%Y-%m-%d" "${{ inputs.date }}" >/dev/null 2>&1; then
echo "Error: Date must be in YYYY-MM-DD format (e.g., 2025-12-04)"
exit 1
fi
fi
fi
- name: Check for changelog fragments
id: check_fragments
run: |
FRAGMENT_COUNT=$(find changelog -name "*.md" ! -name "_template.md.j2" | wc -l | tr -d ' ')
echo "fragment_count=$FRAGMENT_COUNT" >> $GITHUB_OUTPUT
if [ "$FRAGMENT_COUNT" -eq "0" ]; then
echo "❌ Error: No changelog fragments found in changelog/"
echo ""
echo "Cannot create a release without changelog entries."
echo "Add changelog fragments to the changelog/ directory (e.g., 1234.added.md) and try again."
exit 1
fi
# Validate fragment types
VALID_TYPES="added changed deprecated removed fixed security"
INVALID_FRAGMENTS=""
for file in changelog/*.md; do
# Skip template
if [[ "$file" == "changelog/_template.md.j2" ]]; then
continue
fi
# Extract type from filename (e.g., 1234.added.md -> added)
filename=$(basename "$file")
# Handle both 1234.added.md and 1234.added.2.md patterns
type=$(echo "$filename" | sed -E 's/^[0-9]+\.([a-z]+)(\.[0-9]+)?\.md$/\1/')
# Check if type is valid
if ! echo "$VALID_TYPES" | grep -wq "$type"; then
INVALID_FRAGMENTS="$INVALID_FRAGMENTS\n - $filename (type: '$type')"
fi
done
if [ -n "$INVALID_FRAGMENTS" ]; then
echo "❌ Error: Invalid changelog fragment types found:"
echo -e "$INVALID_FRAGMENTS"
echo ""
echo "Valid types are: $VALID_TYPES"
echo "Example: 1234.added.md, 5678.fixed.md"
exit 1
fi
echo "✓ Found $FRAGMENT_COUNT changelog fragment(s)"
echo "has_fragments=true" >> $GITHUB_OUTPUT
- name: Preview changelog
run: |
echo "## Preview of changelog for version ${{ inputs.version }}"
echo ""
uv run towncrier build --draft --version "${{ inputs.version }}" --date "${{ steps.set_date.outputs.release_date }}"
- name: Build changelog
run: |
uv run towncrier build --version "${{ inputs.version }}" --date "${{ steps.set_date.outputs.release_date }}" --yes
- name: Create Pull Request
uses: peter-evans/create-pull-request@v7
with:
token: ${{ secrets.GITHUB_TOKEN }}
commit-message: "Update changelog for version ${{ inputs.version }}"
title: "Release ${{ inputs.version }} - Changelog Update"
body: |
## Changelog Update for Release ${{ inputs.version }}
This PR updates the CHANGELOG.md with all changes for version **${{ inputs.version }}**.
### Summary
- **Version:** ${{ inputs.version }}
- **Date:** ${{ steps.set_date.outputs.release_date }}
- **Fragments processed:** ${{ steps.check_fragments.outputs.fragment_count }}
### What this PR does
- ✅ Adds new release section to CHANGELOG.md
- ✅ Removes processed changelog fragments
- ✅ Ready to merge for release
### Next Steps
1. Review the changelog entries below
2. Make any necessary edits to CHANGELOG.md if needed
3. Merge this PR
4. Continue with your release process
---
<details>
<summary>📋 Preview of changes</summary>
The changelog has been updated with entries from the following fragments:
```bash
${{ steps.check_fragments.outputs.fragment_count }} fragments processed
```
</details>
branch: changelog-${{ inputs.version }}
delete-branch: true
labels: |
changelog
release

View File

@@ -50,6 +50,7 @@ jobs:
run: |
uv sync --group dev --all-extras \
--no-extra krisp \
--no-extra ultravox \
--no-extra local-smart-turn \
--no-extra moondream \
--no-extra mlx-whisper

View File

@@ -11,7 +11,7 @@ build:
jobs:
post_install:
- pip install uv
- UV_PROJECT_ENVIRONMENT=$READTHEDOCS_VIRTUALENV_PATH uv sync --group docs --all-extras --no-extra krisp --no-extra gstreamer --no-extra local_smart_turn --no-extra moondream --no-extra riva --no-extra mlx-whisper
- UV_PROJECT_ENVIRONMENT=$READTHEDOCS_VIRTUALENV_PATH uv sync --group docs --all-extras --no-extra krisp --no-extra gstreamer --no-extra ultravox --no-extra local_smart_turn --no-extra moondream --no-extra riva --no-extra mlx-whisper
sphinx:
configuration: docs/api/conf.py

View File

@@ -5,695 +5,98 @@ All notable changes to **Pipecat** will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
<!-- towncrier release notes start -->
## [0.0.98] - 2025-12-17
## [Unreleased]
### Added
- Added `RimeNonJsonTTSService` which supports non-JSON streaming mode. This
new class supports websocket streaming for the Arcana model.
(PR [#3085](https://github.com/pipecat-ai/pipecat/pull/3085))
- Added additional functionality related to "thinking", for Google and
Anthropic LLMs.
1. New typed parameters for Google and Anthropic LLMs that control the
models' thinking behavior (like how much thinking to do, and whether to
output thoughts or thought summaries):
- `AnthropicLLMService.ThinkingConfig`
- `GoogleLLMService.ThinkingConfig`
2. New frames for representing thoughts output by LLMs:
- `LLMThoughtStartFrame`
- `LLMThoughtTextFrame`
- `LLMThoughtEndFrame`
3. A generic mechanism for recording LLM thoughts to context, used
specifically to support Anthropic, whose thought signatures are expected
to appear alongside the text of the thoughts within assistant context
messages. See:
- `LLMThoughtEndFrame.signature`
- `LLMAssistantAggregator` handling of the above field
- `AnthropicLLMAdapter` handling of `"thought"` context messages
4. Google-specific logic for inserting thought signatures into the context,
to help maintain thinking continuity in a chain of LLM calls. See:
- `GoogleLLMService` sending `LLMMessagesAppendFrame`s to add
LLM-specific
`"thought_signature"` messages to context
- `GeminiLLMAdapter` handling of `"thought_signature"` messages
5. An expansion of `TranscriptProcessor` to process LLM thoughts in
addition to user and assistant utterances. See:
- `TranscriptProcessor(process_thoughts=True)` (defaults to `False`)
- `ThoughtTranscriptionMessage`, which is now also emitted with the
`"on_transcript_update"` event
(PR [#3175](https://github.com/pipecat-ai/pipecat/pull/3175))
- Data and control frames can now be marked as non-interruptible by using the
`UninterruptibleFrame` mixin. Frames marked as `UninterruptibleFrame` will
not be interrupted during processing, and any queued frames of this type will
be retained in the internal queues. This is useful when you need ordered
frames (data or control) that should not be discarded or cancelled due to
interruptions.
(PR [#3189](https://github.com/pipecat-ai/pipecat/pull/3189))
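To illustrate the `UninterruptibleFrame` entry above, here is a minimal, self-contained sketch. It does not use Pipecat itself: `Frame`, `TextChunk`, `ToolResult`, and `drop_interruptible` are hypothetical stand-ins, and the real queue handling inside Pipecat may differ. It only shows the idea of a marker mixin deciding which queued frames survive an interruption.

```python
from collections import deque
from dataclasses import dataclass


class Frame:
    """Hypothetical stand-in for a pipeline frame (not the Pipecat class)."""


class UninterruptibleFrame:
    """Marker mixin: frames that inherit from it are kept on interruption."""


@dataclass
class TextChunk(Frame):
    text: str


@dataclass
class ToolResult(Frame, UninterruptibleFrame):
    result: str


def drop_interruptible(queue: deque) -> deque:
    """Simulate an interruption: keep only marked frames, preserving order."""
    return deque(f for f in queue if isinstance(f, UninterruptibleFrame))


queue = deque([TextChunk("Hello"), ToolResult("42"), TextChunk("world")])
remaining = drop_interruptible(queue)
print([type(f).__name__ for f in remaining])
```

Using a mixin rather than a flag keeps the check to a single `isinstance` call and lets both data and control frames opt in.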
- Added `on_conversation_detected` event to `VoicemailDetector`.
(PR [#3207](https://github.com/pipecat-ai/pipecat/pull/3207))
- Added `x-goog-api-client` header with Pipecat's version to all Google
services' requests.
(PR [#3208](https://github.com/pipecat-ai/pipecat/pull/3208))
- Added support for the HeyGen LiveAvatar API (see https://www.liveavatar.com/).
(PR [#3210](https://github.com/pipecat-ai/pipecat/pull/3210))
- Added to `AWSNovaSonicLLMService` functionality related to the new (and now
default) Nova 2 Sonic model (`"amazon.nova-2-sonic-v1:0"`):
- Added the `endpointing_sensitivity` parameter to control how quickly the
model decides the user has stopped speaking.
- Made the assistant-response-trigger hack a no-op. It's only needed for
the older Nova Sonic model.
(PR [#3212](https://github.com/pipecat-ai/pipecat/pull/3212))
- [Ultravox Realtime](https://docs.ultravox.ai) is now a supported
speech-to-speech service.
- Added `UltravoxRealtimeLLMService` for the integration.
- Added `49-ultravox-realtime.py` example (with tool calling).
(PR [#3227](https://github.com/pipecat-ai/pipecat/pull/3227))
- Added Daily PSTN dial-in support to the development runner with `--dialin`
flag. This includes:
- `/daily-dialin-webhook` endpoint that handles incoming Daily PSTN webhooks
- Automatic Daily room creation with SIP configuration
- `DialinSettings` and `DailyDialinRequest` types in `pipecat.runner.types`
for type-safe dial-in data
- The runner now mimics Pipecat Cloud's dial-in webhook handling for local
development
(PR [#3235](https://github.com/pipecat-ai/pipecat/pull/3235))
- Add Gladia session id to logs for `GladiaSTTService`.
(PR [#3236](https://github.com/pipecat-ai/pipecat/pull/3236))
- Added `InworldHttpTTSService` which uses Inworld's HTTP based TTS service in
either streaming or non-streaming mode. Note: This class was previously named
`InworldTTSService`.
(PR [#3239](https://github.com/pipecat-ai/pipecat/pull/3239))
- Added `language_hints_strict` parameter to `SonioxSTTService` to strictly
enforce language hints. This ensures that transcription occurs in the
specified language.
(PR [#3245](https://github.com/pipecat-ai/pipecat/pull/3245))
- Added Pipecat library version info to the `about` field in the `bot-ready`
RTVI message.
(PR [#3248](https://github.com/pipecat-ai/pipecat/pull/3248))
- Added `VisionFullResponseStartFrame`, `VisionFullResponseEndFrame` and
`VisionTextFrame`. These are used by vision services similar to LLM
services.
(PR [#3252](https://github.com/pipecat-ai/pipecat/pull/3252))
### Changed
- `FunctionCallInProgressFrame` and `FunctionCallResultFrame` have changed from
system frames to a control frame and a data frame, respectively, and are
now both marked as `UninterruptibleFrame`.
(PR [#3189](https://github.com/pipecat-ai/pipecat/pull/3189))
- `UserBotLatencyLogObserver` now uses `VADUserStartedSpeakingFrame` and
`VADUserStoppedSpeakingFrame` to determine latency from user stopped speaking
to bot started speaking.
(PR [#3206](https://github.com/pipecat-ai/pipecat/pull/3206))
- Updated `HeyGenVideoService` and `HeyGenTransport` to support both HeyGen
APIs (Interactive Avatar and Live Avatar).
Using them is as simple as specifying the `service_type` when creating the
`HeyGenVideoService` and the `HeyGenTransport`:
```python
heyGen = HeyGenVideoService(
    api_key=os.getenv("HEYGEN_LIVE_AVATAR_API_KEY"),
    service_type=ServiceType.LIVE_AVATAR,
    session=session,
)
```
(PR [#3210](https://github.com/pipecat-ai/pipecat/pull/3210))
- Made `"amazon.nova-2-sonic-v1:0"` the new default model for
`AWSNovaSonicLLMService`.
(PR [#3212](https://github.com/pipecat-ai/pipecat/pull/3212))
- Updated the `run_inference` methods in the LLM service classes
(`AnthropicLLMService`, `AWSBedrockLLMService`, `GoogleLLMService`, and
`OpenAILLMService` and its base classes) to use the provided LLM
configuration parameters.
(PR [#3214](https://github.com/pipecat-ai/pipecat/pull/3214))
- Updated default models for:
- `GeminiLiveLLMService` to `gemini-2.5-flash-native-audio-preview-12-2025`.
- `GeminiLiveVertexLLMService` to `gemini-live-2.5-flash-native-audio`.
(PR [#3228](https://github.com/pipecat-ai/pipecat/pull/3228))
- Changed the `reason` field in `EndFrame`, `CancelFrame`, `EndTaskFrame`, and
`CancelTaskFrame` from `str` to `Any` to indicate that it can hold values
other than strings.
(PR [#3231](https://github.com/pipecat-ai/pipecat/pull/3231))
- Updated websocket STT services to use the `WebsocketSTTService` base class.
This base class manages the websocket connection and handles reconnects.
Updated services:
- `AssemblyAISTTService`
- `AWSTranscribeSTTService`
- `GladiaSTTService`
- `SonioxSTTService`
(PR [#3236](https://github.com/pipecat-ai/pipecat/pull/3236))
- Changed Inworld's TTS service implementations:
- Previously, the HTTP implementation was named `InworldTTSService`. That
has been moved to `InworldHttpTTSService`. This service now supports
word-timestamp alignment data in both streaming and non-streaming modes.
- Updated the `InworldTTSService` class to use Inworld's Websocket API.
This class now has support for word-timestamp alignment data and tracks
contexts for each user turn.
(PR [#3239](https://github.com/pipecat-ai/pipecat/pull/3239))
- ⚠️ Breaking change: `WordTTSService.start_word_timestamps()` and
`WordTTSService.reset_word_timestamps()` are now async.
(PR [#3240](https://github.com/pipecat-ai/pipecat/pull/3240))
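A before/after sketch of what the breaking change above means for callers. `FakeWordTTSService` is a hypothetical stand-in, not the real `WordTTSService`; only the two method names come from the entry. Calling either method without `await` would now just create a coroutine object that never runs.

```python
import asyncio


class FakeWordTTSService:
    """Hypothetical stand-in for a WordTTSService subclass."""

    def __init__(self) -> None:
        self.tracking = False

    async def start_word_timestamps(self) -> None:
        self.tracking = True

    async def reset_word_timestamps(self) -> None:
        self.tracking = False


async def main() -> list:
    tts = FakeWordTTSService()
    states = []
    await tts.start_word_timestamps()  # before: tts.start_word_timestamps()
    states.append(tts.tracking)
    await tts.reset_word_timestamps()  # before: tts.reset_word_timestamps()
    states.append(tts.tracking)
    return states


states = asyncio.run(main())
print(states)
```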
- Updated the current RTVI version to 1.1.0 to reflect recent additions and
deprecations.
- New RTVI Messages: `send-text` and `bot-output`
- Deprecated Messages: `append-to-context` and `bot-transcription`
(PR [#3248](https://github.com/pipecat-ai/pipecat/pull/3248))
- `MoondreamService` now pushes `VisionFullResponseStartFrame`,
`VisionFullResponseEndFrame` and `VisionTextFrame`.
(PR [#3252](https://github.com/pipecat-ai/pipecat/pull/3252))
### Deprecated
- `FalSmartTurnAnalyzer` and `LocalSmartTurnAnalyzer` are deprecated and will
be removed in a future version. Use `LocalSmartTurnAnalyzerV3` instead.
(PR [#3219](https://github.com/pipecat-ai/pipecat/pull/3219))
### Removed
- Removed the deprecated VLLM-based open source Ultravox STT service.
(PR [#3227](https://github.com/pipecat-ai/pipecat/pull/3227))
### Fixed
- Fixed a bug in `AWSNovaSonicLLMService` where we would mishandle cancelled
tool calls in the context, resulting in errors.
(PR [#3212](https://github.com/pipecat-ai/pipecat/pull/3212))
- Better support conversation history with Gemini 2.5 Flash Image (model
"gemini-2.5-flash-image"). Prior to this fix, the model had no memory of
previous images it had generated, so it wouldn't be able to iterate on
them.
(PR [#3224](https://github.com/pipecat-ai/pipecat/pull/3224))
- Support conversations with Gemini 3 Pro Image (model
"gemini-3-pro-image-preview"). Prior to this fix, after the model generated
an image the conversation would not be able to progress.
(PR [#3224](https://github.com/pipecat-ai/pipecat/pull/3224))
- Fixed an issue where `ElevenLabsHttpTTSService` was not updating
voice settings when receiving a `TTSUpdateSettingsFrame`.
(PR [#3226](https://github.com/pipecat-ai/pipecat/pull/3226))
- Fixed the return type for `SmallWebRTCRequestHandler.handle_web_request()`
function.
(PR [#3230](https://github.com/pipecat-ai/pipecat/pull/3230))
- Fixed a bug in LLM context audio content handling.
(PR [#3234](https://github.com/pipecat-ai/pipecat/pull/3234))
- In `GladiaSTTService`, reset the `_bytes_sent` counter on connecting the
websocket. This avoids unnecessary audio buffer trimming.
(PR [#3236](https://github.com/pipecat-ai/pipecat/pull/3236))
- Fixed a TTS service word-timestamp issue that could cause generated
`TTSTextFrame` instances to have an incorrect pts (`pts = -1`).
(PR [#3240](https://github.com/pipecat-ai/pipecat/pull/3240))
- Fixed an issue in `SimpleTextAggregator` where spaces were not being stripped
before returning the aggregation. This resulted in an extra space for TTS
services that don't support word-timestamp alignment data.
(PR [#3247](https://github.com/pipecat-ai/pipecat/pull/3247))
## [0.0.97] - 2025-12-05
### Added
- Added new Gradium services, `GradiumSTTService` and `GradiumTTSService`, for
speech-to-text and text-to-speech functionality using Gradium's API.
- Additions for `AsyncAITTSService` and `AsyncAIHttpTTSService`:
- Added new `languages`: `pt`, `nl`, `ar`, `ru`, `ro`, `ja`, `he`, `hy`,
`tr`, `hi`, `zh`.
- Updated the default model to `asyncflow_multilingual_v1.0` for improved
accuracy and broader language coverage.
- Added optional tool and tool output filters for MCP services.
### Changed
- Updated Deepgram logging to include Deepgram request IDs for improved
debugging.
- Text Aggregation Improvements:
- **Breaking Change**: `BaseTextAggregator.aggregate()` now returns
`AsyncIterator[Aggregation]` instead of `Optional[Aggregation]`. This
enables the aggregator to return multiple results based on the provided
text.
- Refactored text aggregators to use inheritance: `SkipTagsAggregator` and
`PatternPairAggregator` now inherit from `SimpleTextAggregator`, reusing
the base class's sentence detection logic.
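The switch to `AsyncIterator[Aggregation]` described above can be sketched as follows. This is a self-contained illustration, not Pipecat's implementation: the `Aggregation` dataclass and `SentenceAggregator` here are simplified stand-ins, and the sentence detection is a naive regex (end punctuation followed by whitespace) rather than the library's real logic.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Aggregation:
    text: str
    type: str


class SentenceAggregator:
    """Hypothetical aggregator that may yield several results per call."""

    def __init__(self) -> None:
        self._buffer = ""

    async def aggregate(self, text: str) -> AsyncIterator[Aggregation]:
        self._buffer += text
        # Emit every complete sentence currently in the buffer.
        while (match := re.search(r"[.!?]\s", self._buffer)):
            sentence = self._buffer[: match.start() + 1].strip()
            self._buffer = self._buffer[match.end():]
            yield Aggregation(text=sentence, type="sentence")


async def collect(chunk: str) -> list:
    aggregator = SentenceAggregator()
    # Callers now iterate with `async for` instead of checking for None.
    return [a.text async for a in aggregator.aggregate(chunk)]


results = asyncio.run(collect("Hi there. How are you? I am"))
print(results)
```

The trailing `I am` stays buffered until later text completes the sentence, which is why a single call can return zero, one, or many aggregations.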
- Improved interruption handling to prevent bots from repeating themselves. LLM
services that return multiple sentences in a single response (e.g.,
`GoogleLLMService`) are now split into individual sentences before being sent
to TTS. This ensures interruptions occur at sentence boundaries, preventing
the bot from repeating content after being interrupted during long responses.
- Updated `AICFilter` to use Quail STT as the default model
(`AICModelType.QUAIL_STT`). Quail STT is optimized for human-to-machine
interaction (e.g., voice agents, speech-to-text) and operates at a native
sample rate of 16 kHz with fixed enhancement parameters.
- If an unexpected exception is caught, or if `FrameProcessor.push_error()` is
called with an exception, the file name and line number where the exception
occurred are now logged.
- Updated Smart Turn model weights to v3.1.
- Smart Turn analyzer now uses the full context of the turn rather than just
the audio since VAD last triggered.
- Updated `CartesiaSTTService` to return the full transcription `result` in the
`TranscriptionFrame` and `InterimTranscriptionFrame`. This provides access to
word timestamp data.
- `HumeTTSService` changes:
- Added tracking headers (`X-Hume-Client-Name` and `X-Hume-Client-Version`)
to all requests made by `HumeTTSService` to the Hume API for better usage
tracking and analytics.
- Added `stop()` and `cancel()` cleanup methods to `HumeTTSService` to
properly close the HTTP client and prevent resource leaks.
### Deprecated
- NVIDIA Services name changes (all functionality is unchanged):
- `NimLLMService` is now deprecated, use `NvidiaLLMService` instead.
- `RivaSTTService` is now deprecated, use `NvidiaSTTService` instead.
- `RivaTTSService` is now deprecated, use `NvidiaTTSService` instead.
- Use `uv pip install pipecat-ai[nvidia]` instead of
`uv pip install pipecat-ai[riva]`
- The `noise_gate_enable` parameter in `AICFilter` is deprecated and no longer
has any effect. Noise gating is now handled automatically by the AIC VAD
system. Use `AICFilter.create_vad_analyzer()` for VAD functionality instead.
- Package `pipecat.sync` is deprecated, use `pipecat.utils.sync` instead.
### Fixed
- Fixed bug in `PatternPairAggregator` where pattern handlers could be called
multiple times for `KEEP` or `AGGREGATE` patterns.
- Fixed sentence aggregation to correctly handle ambiguous punctuation in
streaming text, such as currency ("$29.95") and abbreviations ("Mr. Smith").
- Fixed an issue in `AWSTranscribeSTTService` where the `region` arg was always
set to `us-east-1` when providing an AWS_REGION env var.
- Fixed an issue in `SarvamTTSService` where the last sentence was not being
spoken. Now, audio is flushed when the TTS service receives the
`LLMFullResponseEndFrame` or `EndFrame`.
- Fixed an issue in `DeepgramTTSService` where a `TTSStoppedFrame` was
incorrectly pushed after a function call. This caused an issue with the
voice-ui-kit's conversational panel rendering of the LLM output after a
function call.
- Fixed an issue where `LLMTextFrame.skip_tts` was being overwritten by LLM
services.
- Fixed an issue that caused `WebsocketService` instances to attempt
reconnection during shutdown.
- Fixed an issue in `ElevenLabsTTSService` where character usage metrics were
only reported on the first TTS generation per turn.
## [0.0.96] - 2025-11-26 🦃 "Happy Thanksgiving!" 🦃
### Added
- Added `AWSBedrockAgentCoreProcessor` to support invoking an AgentCore-hosted
agent in a Pipecat pipeline.
- Enhanced error handling across the framework:
- Added `on_error` callback to `FrameProcessor` for centralized error
handling.
- Renamed `push_error(error: ErrorFrame)` to `push_error_frame(error: ErrorFrame)`
for clarity.
- Added new `push_error` method for simplified error reporting:
```python
async def push_error(error_msg: str,
exception: Optional[Exception] = None,
fatal: bool = False)
```
- Standardized error logging by replacing `logger.exception` calls with
`logger.error` throughout the codebase.
- Added `cache_read_input_tokens`, `cache_creation_input_tokens` and
`reasoning_tokens` to OTel spans for LLM calls.
- Added `LiveKitRESTHelper` utility class for managing LiveKit rooms via REST API.
- Added `DeepgramSageMakerSTTService` which connects to a SageMaker hosted
Deepgram STT model. Added `07c-interruptible-deepgram-sagemaker.py`
foundational example.
- Added `SageMakerBidiClient` to connect to SageMaker hosted BiDi compatible
services.
- Added support for `include_timestamps` and `enable_logging` in
`ElevenLabsRealtimeSTTService`. When `include_timestamps` is enabled,
timestamp data is included in the `TranscriptionFrame`'s `result`
parameter.
- Added optional speaking rate control to `InworldTTSService`.
- Introduced a new `AggregatedTextFrame` type to support passing text along with
an `aggregated_by` field to describe the type of text
included. `TTSTextFrame`s now inherit from `AggregatedTextFrame`. With this
inheritance, an observer can watch for `AggregatedTextFrame`s to accumulate the
perceived output and determine whether the text was spoken by checking whether
the frame is also a `TTSTextFrame`.
With this frame, the LLM token stream can be transformed into custom
composable chunks, allowing for aggregation outside the TTS service. This
makes it possible to listen for or handle those aggregations. It also sets the
stage for composing a best-effort view of the perceived LLM output in a more
digestible form, whether or not that output is processed by a TTS service and
whether or not a TTS service exists at all.
- Introduced `LLMTextProcessor`: A new processor meant to allow customization
for how LLMTextFrames should be aggregated and considered. Its purpose is to
turn `LLMTextFrame`s into `AggregatedTextFrame`s. By default, a TTSService
will still aggregate `LLMTextFrame`s by sentence for the service to
consume. However, if you wish to override how the llm text is aggregated, you
should no longer override the TTS's internal text_aggregator, but instead,
insert this processor between your LLM and TTS in the pipeline.
- New `bot-output` RTVI message to represent what the bot actually "says".
- The `RTVIObserver` now emits `bot-output` messages based off the new
`AggregatedTextFrame`s (`bot-tts-text` and `bot-llm-text` are still
supported and generated, but `bot-transcript` is now deprecated in lieu of
this new, more thorough, message).
- The new `RTVIBotOutputMessage` includes the fields:
- `spoken`: A boolean indicating whether the text was spoken by TTS
- `aggregated_by`: A string representing how the text was aggregated
("sentence", "word", "my custom aggregation")
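A rough sketch of how an observer could derive the two `bot-output` fields listed above from the frame hierarchy. The frame classes are simplified stand-ins for the Pipecat ones, and the message envelope (`type`, `data`, and the `text` key) is an assumption made for illustration; only `spoken` and `aggregated_by` are named in this changelog.

```python
from dataclasses import dataclass


@dataclass
class AggregatedTextFrame:
    """Stand-in: text plus a label describing how it was aggregated."""

    text: str
    aggregated_by: str


@dataclass
class TTSTextFrame(AggregatedTextFrame):
    """Stand-in: an aggregated frame that was actually spoken by TTS."""


def to_bot_output(frame: AggregatedTextFrame) -> dict:
    # `spoken` falls out of the inheritance: only TTS frames were spoken.
    return {
        "type": "bot-output",  # assumed envelope, for illustration only
        "data": {
            "text": frame.text,
            "spoken": isinstance(frame, TTSTextFrame),
            "aggregated_by": frame.aggregated_by,
        },
    }


spoken_msg = to_bot_output(TTSTextFrame("Hello there.", "sentence"))
silent_msg = to_bot_output(AggregatedTextFrame("4111 1111", "credit_card"))
print(spoken_msg["data"]["spoken"], silent_msg["data"]["spoken"])
```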
- Introduced new fields to `RTVIObserver` to support the new `bot-output`
messaging:
- `bot_output_enabled`: Defaults to True. Set to false to disable bot-output
messages.
- `skip_aggregator_types`: Defaults to `None`. Set to a list of strings that
match aggregation types that should not be included in bot-output
messages. (Ex. `credit_card`)
- Introduced new methods, `add_text_transformer()` and
`remove_text_transformer()`, to `RTVIObserver` to support providing (and
subsequently removing) callbacks for various types of aggregations (or all
aggregations with `*`) that can modify the text before being sent as a
`bot-output` or `tts-text` message. (Think obscuring the credit card or
inserting extra detail the client might want that the context doesn't need.)
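The skip list and text transformers described above might work along these lines. `OutputFilter` is a hypothetical class written for this sketch; the method names mirror the changelog, but the argument order, the synchronous callbacks, and the order in which type-specific and `*` transformers run are all assumptions, not the `RTVIObserver` API.

```python
from typing import Callable, Dict, List, Optional

Transformer = Callable[[str, str], str]  # (text, aggregation_type) -> text


class OutputFilter:
    """Hypothetical helper mimicking skip types and text transformers."""

    def __init__(self, skip_aggregator_types: Optional[List[str]] = None):
        self._skip = set(skip_aggregator_types or [])
        self._transformers: Dict[str, List[Transformer]] = {}

    def add_text_transformer(self, fn: Transformer, aggregation_type: str = "*"):
        self._transformers.setdefault(aggregation_type, []).append(fn)

    def remove_text_transformer(self, fn: Transformer, aggregation_type: str = "*"):
        registered = self._transformers.get(aggregation_type, [])
        if fn in registered:
            registered.remove(fn)

    def process(self, text: str, aggregation_type: str) -> Optional[str]:
        if aggregation_type in self._skip:
            return None  # this type is never sent to the client
        # Type-specific transformers first, then the "*" wildcard ones.
        for key in (aggregation_type, "*"):
            for fn in self._transformers.get(key, []):
                text = fn(text, aggregation_type)
        return text


def mask_card(text: str, _type: str) -> str:
    return "*" * (len(text) - 4) + text[-4:]


output = OutputFilter(skip_aggregator_types=["internal_note"])
output.add_text_transformer(mask_card, "credit_card")

masked = output.process("4111111111111111", "credit_card")
skipped = output.process("do not show", "internal_note")
plain = output.process("Hello there.", "sentence")
print(masked, skipped, plain)
```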
- In `MiniMaxHttpTTSService`:
- Added support for speech-2.6-hd and speech-2.6-turbo models
- Added languages: Afrikaans, Bulgarian, Catalan, Danish, Persian, Filipino,
Hebrew, Croatian, Hungarian, Malay, Norwegian, Nynorsk, Slovak, Slovenian,
Swedish, and Tamil
- Added new emotions: calm and fluent
- Added `enable_logging` to `SimliVideoService` input parameters. It's disabled
by default.
### Changed
- Updated `FishAudioTTSService` default model to `s1`.
- Updated `DeepgramTTSService` to use Deepgram's TTS websocket API. ⚠️ This is
a potential breaking change, which only affects you if you're self-hosting
`DeepgramTTSService`. The new service uses Websockets and improves TTFB
latency.
- Updated `daily-python` to 0.22.0.
- `BaseTextAggregator` changes:
Modified the BaseTextAggregator type so that when text gets aggregated,
metadata can be associated with it. Currently, that just means a `type`, so
that the aggregation can be classified or described. Changes made to support
this:
- ⚠️ IMPORTANT: Aggregators are now expected to strip leading/trailing white
space characters before returning their aggregation from `aggregate()` or
`.text`. This way all aggregators have a consistent contract allowing
downstream use to know how to stitch aggregations back together.
- Introduced a new `Aggregation` dataclass to represent both the aggregated
`text` and a string identifying the `type` of aggregation (ex. "sentence",
"word", "my custom aggregation")
- ⚠️ Breaking change: `BaseTextAggregator.text` now returns an `Aggregation`
(instead of `str`).
Before:
```python
aggregated_text = myAggregator.text
```
Now:
```python
aggregated_text = myAggregator.text.text
```
- ⚠️ Breaking change: `BaseTextAggregator.aggregate()` now returns
`Optional[Aggregation]` (instead of `Optional[str]`).
Before:
```python
aggregation = myAggregator.aggregate(text)
print(f"successfully aggregated text: {aggregation}")
```
Now:
```python
aggregation = myAggregator.aggregate(text)
if aggregation:
    print(f"successfully aggregated text: {aggregation.text}")
```
- `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator`
updated to produce/consume `Aggregation` objects.
- All uses of the above Aggregators have been updated accordingly.
- Augmented the `PatternPairAggregator` so that matched patterns can be treated
as their own aggregation, taking advantage of the new `Aggregation` type. To that end:
- Introduced a new, preferred version of `add_pattern` to support a new option
for treating a match as a separate aggregation returned from
`aggregate()`. This replaces the now deprecated `add_pattern_pair` method
and you provide a `MatchAction` in lieu of the `remove_match` field.
- `MatchAction` enum: `REMOVE`, `KEEP`, `AGGREGATE`, allowing customization
for how a match should be handled.
- `REMOVE`: The text along with its delimiters will be removed from the
streaming text. Sentence aggregation will continue on as if this text
did not exist.
- `KEEP`: The delimiters will be removed, but the content between them
will be kept. Sentence aggregation will continue on with the internal
text included.
- `AGGREGATE`: The delimiters will be removed and the content between will
be treated as a separate aggregation. Any text before the start of the
pattern will be returned early, whether or not a complete sentence was
found. Then the pattern will be returned. Then the aggregation will
continue on sentence matching after the closing delimiter is found. The
content between the delimiters is not aggregated by sentence. It is
aggregated as one single block of text.
- `PatternMatch` now extends `Aggregation` and provides richer info to
handlers.
- ⚠️ Breaking change: The `PatternMatch` type returned to handlers registered
via `on_pattern_match` has been updated to subclass from the new
`Aggregation` type, which means that `content` has been replaced with
`text` and `pattern_id` has been replaced with `type`:
```python
async def on_match_tag(match: PatternMatch):
    pattern = match.type  # instead of match.pattern_id
    text = match.text  # instead of match.content
```
- `TextFrame` now includes the field `append_to_context` to support setting
whether or not the encompassing text should be added to the LLM context (by
the LLM assistant aggregator). It defaults to `True`.
- `TTSService` base class updates:
- `TTSService`s now accept a new `skip_aggregator_types` to avoid speaking
certain aggregation types (now determined/returned by the aggregator)
- Introduced the ability to do a just-in-time transform of text before it gets
sent to the TTS service via callbacks you can set up via a new init field,
`text_transforms` or a new method `add_text_transformer()`. This makes it
possible to do things like introduce TTS-specific tags for spelling or
emotion or change the pronunciation of something on the
fly. `remove_text_transformer` has also been added to support removing a
registered transform callback.
- TTS services push `AggregatedTextFrame` in addition to `TTSTextFrame`s when
either an aggregation occurs that should not be spoken or when the TTS
service supports word-by-word timestamping. In the latter case, the
`TTSService` preliminarily generates an `AggregatedTextFrame`, aggregated by
sentence to generate the full sentence content as early as possible.
- Updated `CartesiaTTSService`:
- Modified use of custom default text_aggregator to avoid deprecation warnings
and push users towards use of transformers or the `LLMTextProcessor`
- Added convenience methods for taking advantage of Cartesia's SSML tags:
spell, emotion, pauses, volume, and speed.
- Updated `RimeTTSService`:
- Modified use of custom default text_aggregator to avoid deprecation warnings
and push users towards use of transformers or the `LLMTextProcessor`
- Added convenience methods for taking advantage of Rime's customization
options: spell, pauses, pronunciations, and inline speed control.
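The three `MatchAction` behaviors described in this section can be sketched on a complete (non-streaming) string. This is only an illustration under stated assumptions: the `Aggregation` and `MatchAction` classes below are local stand-ins rather than imports from Pipecat, `split_on_pattern` is a hypothetical helper, and the real `PatternPairAggregator` works incrementally on streamed text and also splits the surrounding text by sentence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MatchAction(Enum):
    """Local stand-in for the enum named in the changelog."""

    REMOVE = "remove"
    KEEP = "keep"
    AGGREGATE = "aggregate"


@dataclass
class Aggregation:
    """Local stand-in: aggregated text plus a type label."""

    text: str
    type: str


def split_on_pattern(
    text: str, start: str, end: str, pattern_type: str, action: MatchAction
) -> List[Aggregation]:
    """Apply one delimiter pair to a complete string (hypothetical helper)."""
    begin = text.find(start)
    finish = text.find(end, begin + len(start)) if begin != -1 else -1
    if finish == -1:
        # No complete match: everything is ordinary text.
        return [Aggregation(text.strip(), "sentence")]

    before = text[:begin]
    inner = text[begin + len(start) : finish]
    after = text[finish + len(end) :]

    if action is MatchAction.REMOVE:
        # Drop the delimiters and the content between them.
        pieces = [(before + after, "sentence")]
    elif action is MatchAction.KEEP:
        # Drop only the delimiters; the content stays in the sentence.
        pieces = [(before + inner + after, "sentence")]
    else:
        # AGGREGATE: flush the text before the match early, then emit the
        # match as its own block, then continue with what follows.
        pieces = [(before, "sentence"), (inner, pattern_type), (after, "sentence")]

    result = []
    for piece, kind in pieces:
        cleaned = " ".join(piece.split())  # strip and collapse whitespace
        if cleaned:
            result.append(Aggregation(cleaned, kind))
    return result
```

With the input `"Your card is <card>4111</card> on file."`, `REMOVE` and `KEEP` each yield a single `"sentence"` aggregation, while `AGGREGATE` yields three aggregations whose middle element carries the custom type.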
### Deprecated
- The TTS constructor field, `text_aggregator` is deprecated in favor of the new
`LLMTextProcessor`. TTSServices still have an internal aggregator for support
of default behavior, but if you want to override the aggregation behavior, you
should use the new processor.
- The RTVI `bot-transcription` event is deprecated in favor of the new
`bot-output` message which is the canonical representation of bot output
(spoken or not). The code still emits a transcription message for backwards
compatibility during the transition.
- Deprecated `add_pattern_pair` in the `PatternPairAggregator` which takes a
`pattern_id` and `remove_match` field in favor of the new `add_pattern` method
which takes a `type` and an `action`
- `english_normalization` input parameter for `MiniMaxHttpTTSService` is
deprecated; use `text_normalization` instead.
### Fixed
- Fixed an issue in `AWSBedrockLLMService` where the `aws_region` arg was
always set to `us-east-1` when providing an AWS_REGION env var.
- Fixed an issue with `DeepgramFluxSTTService` where it sometimes failed to reconnect.
- Fixed an issue in `ElevenLabsRealtimeSTTService` where dynamic language
updates were not working.
- Fixed an issue in `ElevenLabsRealtimeSTTService` where setting the sample
rate would result in transcripts failing.
- Fixed `InworldTTSService` audio config payload to use camelCase keys expected
by the Inworld API.
## [0.0.95] - 2025-11-18
### Added
- Added ai-coustics integrated VAD (`AICVADAnalyzer`) with `AICFilter` factory and
example wiring; leverages the enhancement model for robust detection with no
ONNX dependency or added processing complexity.
- Added a watchdog to `DeepgramFluxSTTService` to prevent dangling tasks in case the
user was speaking and we stop receiving audio.
- Introduced a minimum confidence parameter in `DeepgramFluxSTTService` to avoid
generating transcriptions below a defined threshold.
- Added `ElevenLabsRealtimeSTTService` which implements the Realtime STT
service from ElevenLabs.
- Added word-level timestamps support to Hume TTS service
- Added a `TTSService.includes_inter_frame_spaces` property getter, so that TTS
services that subclass `TTSService` can indicate whether the text in the
`TTSTextFrame`s they push already contain any necessary inter-frame spaces.
- Introduced new `AggregatedTextFrame` type to support representing a best effort of
the perceived llm output whether or not it is processed by the TTS. This new frame
type includes the field `aggregated_by` to represent the conceptual format by which
the given text is aggregated. `TTSTextFrame`s now inherit from `AggregatedTextFrame`.
With this inheritance, an observer can watch for `AggregatedTextFrame`s to accumulate
the perceived output and determine whether or not the text was spoken based on if that
frame is also a `TTSTextFrame`. (See bullet below on new `bot-output` which takes
advantage of this)
- Introduced `LLMTextProcessor`: A new processor meant to allow customization for how
`LLMTextFrame`s should be aggregated and considered. Its purpose is to turn
`LLMTextFrame`s into `AggregatedTextFrame`s. By default, a TTSService will still
aggregate `LLMTextFrame`s by sentence for the service to consume. However, if you
wish to override how the llm text is aggregated, you should no longer override the
TTS's internal aggregator, but instead, insert this processor between your LLM and
TTS in the pipeline.
- New `bot-output` RTVI message to represent what the bot actually "says".
- The `RTVIObserver` now emits `bot-output` messages based off the new `AggregatedTextFrame`s
(`bot-tts-text` and `bot-llm-text` are still supported and generated, but `bot-transcription` is
now deprecated in lieu of this new, more thorough, message).
- The new `RTVIBotOutputMessage` includes the fields:
- `spoken`: A boolean indicating whether the text was spoken by TTS
- `aggregated_by`: A string representing how the text was aggregated ("sentence", "word",
"my custom aggregation")
- Introduced new fields to `RTVIObserver` to support the new `bot-output` messaging:
- `bot_output_enabled`: Defaults to True. Set to false to disable bot-output messages.
- `skip_aggregator_types`: Defaults to `None`. Set to a list of strings that match
aggregation types that should not be included in bot-output messages. (Ex. `credit_card`)
- Introduced new methods, `add_text_transformer()` and `remove_text_transformer()`, to `RTVIObserver` to support providing (and subsequently removing)
callbacks for various types of aggregations (or all aggregations with `*`) that can modify the
text before being sent as a `bot-output` or `tts-text` message. (Think obscuring the credit card
or inserting extra detail the client might want that the context doesn't need.)
- Updated the base aggregator type:
- Introduced a new `Aggregation` dataclass to represent both the aggregated `text` and
a string identifying the `type` of aggregation (ex. "sentence", "word", "my custom
aggregation")
- **BREAKING**: `BaseTextAggregator.text` now returns an `Aggregation` (instead of `str`).
To update: `aggregated_text = myAggregator.text` -> `aggregated_text = myAggregator.text.text`
- **BREAKING**: `BaseTextAggregator.aggregate()` now returns `Optional[Aggregation]`
(instead of `Optional[str]`). To update:
```
aggregation = myAggregator.aggregate(text)
if aggregation:
    print(f"successfully aggregated text: {aggregation.text}")  # instead of {aggregation}
```
- `SimpleTextAggregator`, `SkipTagsAggregator`, `PatternPairAggregator` updated to
produce/consume `Aggregation` objects.
- Augmented the `PatternPairAggregator`:
- Introduced a new, preferred version of `add_pattern` to support a new option for treating a
match as a separate aggregation returned from `aggregate()`. This replaces the now
deprecated `add_pattern_pair` method and you provide a `MatchAction` in lieu of the `remove_match` field.
- `MatchAction` enum: `REMOVE`, `KEEP`, `AGGREGATE`, allowing customization for how
a match should be handled.
- `REMOVE`: The text along with its delimiters will be removed from the streaming text.
Sentence aggregation will continue on as if this text did not exist.
- `KEEP`: The delimiters will be removed, but the content between them will be kept.
Sentence aggregation will continue on with the internal text included.
- `AGGREGATE`: The delimiters will be removed and the content between will be treated
as a separate aggregation. Any text before the start of the pattern will be
returned early, whether or not a complete sentence was found. Then the pattern
will be returned. Then the aggregation will continue on sentence matching after
the closing delimiter is found. The content between the delimiters is not
aggregated by sentence. It is aggregated as one single block of text.
- `PatternMatch` now extends `Aggregation` and provides richer info to handlers.
- **BREAKING**: The `PatternMatch` type returned to handlers registered via `on_pattern_match`
has been updated to subclass from the new `Aggregation` type, which means that `content`
has been replaced with `text` and `pattern_id` has been replaced with `type`:
```
async def on_match_tag(match: PatternMatch):
    pattern = match.type  # instead of match.pattern_id
    text = match.text  # instead of match.content
```
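As a rough illustration of how `bot-output` messages relate to the new frame types, the sketch below derives `spoken` from whether a frame is also a `TTSTextFrame` and drops skipped aggregation types. The two frame classes are minimal stand-ins defined locally (not the Pipecat classes), `to_bot_output` is a hypothetical helper, and the `text` key in the resulting dictionaries is an assumption; only `spoken` and `aggregated_by` are named in the entries above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Union


@dataclass
class AggregatedTextFrame:
    """Stand-in frame: text plus the label describing how it was aggregated."""

    text: str
    aggregated_by: str


@dataclass
class TTSTextFrame(AggregatedTextFrame):
    """Stand-in frame for text that the TTS actually spoke."""


def to_bot_output(
    frames: Sequence[AggregatedTextFrame],
    skip_aggregator_types: Optional[List[str]] = None,
) -> List[Dict[str, Union[str, bool]]]:
    """Build bot-output style payloads from a sequence of frames (hypothetical)."""
    skipped = set(skip_aggregator_types or [])
    messages = []
    for frame in frames:
        if frame.aggregated_by in skipped:
            # Types such as "credit_card" never reach the client.
            continue
        messages.append(
            {
                "text": frame.text,
                # A frame counts as spoken only if it is also a TTSTextFrame.
                "spoken": isinstance(frame, TTSTextFrame),
                "aggregated_by": frame.aggregated_by,
            }
        )
    return messages
```

Because `TTSTextFrame` subclasses `AggregatedTextFrame`, a single `isinstance` check is enough to separate spoken from unspoken output in this sketch.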
### Changed
- ⚠️ Breaking change: `LLMContext.create_image_message()`,
`LLMContext.create_audio_message()`, `LLMContext.add_image_frame_message()`
and `LLMContext.add_audio_frames_message()` are now async methods. This fixes
an issue where the asyncio event loop would be blocked while encoding audio or
images.
- `ConsumerProcessor` now queues frames from the producer internally instead of
pushing them directly. This allows us to subclass consumer processors and
manipulate frames before they are pushed.
- `BaseTextFilter` now only requires subclasses to implement the `filter()` method.
- Extracted the connection-retry logic into a new `send_with_retry` method
inside `WebSocketService`.
- Refactored `DeepgramFluxSTTService` to automatically reconnect if sending a
message fails.
- Updated all STT and TTS services to use consistent error handling pattern with
`push_error()` method for better pipeline error event integration.
- Added support for `maybe_capture_participant_camera()` and
`maybe_capture_participant_screen()` for `SmallWebRTCTransport` in the runner
utils.
- Added Hindi support for Rime TTS services.
- Updated `GeminiTTSService` to use Google Cloud Text-to-Speech streaming API
@@ -706,18 +109,44 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Updated language mappings for the Google and Gemini TTS services to match
official documentation.
- `TextFrame` new field `append_to_context` used to indicate if the encompassing
text should be added to the LLM context (by the LLM assistant aggregator). It
defaults to `True`.
- TTS flow respects aggregation metadata
- `TTSService` accepts a new `skip_aggregator_types` to avoid speaking certain aggregation types
(now determined/returned by the aggregator)
- TTS services push `AggregatedTextFrame` in addition to `TTSTextFrame`s when either an
aggregation occurs that should not be spoken or when the TTS service supports word-by-word
timestamping. In the latter case, the `TTSService` preliminarily generates an
`AggregatedTextFrame`, aggregated by sentence to generate the full sentence content as early
as possible.
- Introduced new methods, `add_text_transformer()` and `remove_text_transformer()`:
These functions introduce the ability to provide (and subsequently remove) callbacks to the TTS to transform text based on
its aggregated type prior to sending the text to the underlying TTS service. This makes it
possible to do things like introduce TTS-specific tags for spelling or emotion or change the
pronunciation of something on the fly.
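The idea of per-type text transforms can be sketched with a small registry. Everything here is an assumption made for illustration: `TransformRegistry` is a hypothetical class, the callback shape (`str -> str`) and the `"*"` wildcard for "all types" are modeled on the descriptions in this changelog, and the `<spell>` tag is a placeholder rather than a statement about any particular TTS vendor's markup. The actual `add_text_transformer()` signature may differ.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

TextTransform = Callable[[str], str]


class TransformRegistry:
    """Hypothetical registry of text transforms keyed by aggregation type."""

    def __init__(self) -> None:
        self._by_type: DefaultDict[str, List[TextTransform]] = defaultdict(list)

    def add(self, fn: TextTransform, aggregation_type: str = "*") -> None:
        """Register a transform for one aggregation type, or "*" for all."""
        self._by_type[aggregation_type].append(fn)

    def remove(self, fn: TextTransform, aggregation_type: str = "*") -> None:
        """Unregister a previously added transform."""
        self._by_type[aggregation_type].remove(fn)

    def apply(self, text: str, aggregation_type: str) -> str:
        """Run type-specific transforms first, then wildcard transforms."""
        transforms = self._by_type.get(aggregation_type, []) + self._by_type.get("*", [])
        for fn in transforms:
            text = fn(text)
        return text


def spell_out(text: str) -> str:
    """Wrap text in a placeholder spelling tag (not a real vendor tag)."""
    return f"<spell>{text}</spell>"


registry = TransformRegistry()
registry.add(spell_out, "confirmation_code")
```

Here only aggregations typed `confirmation_code` are wrapped just before synthesis; ordinary sentences pass through unchanged, and removing the callback restores the default behavior.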
### Deprecated
- The `api_key` parameter in `GeminiTTSService` is deprecated. Use
`credentials` or `credentials_path` instead for Google Cloud authentication.
- The RTVI `bot-transcription` event is deprecated in favor of the new `bot-output`
message which is the canonical representation of bot output (spoken or not). The code
still emits a transcription message for backwards compatibility during the transition.
- The TTS constructor field, `text_aggregator` is deprecated in favor of the new
`LLMTextProcessor`. TTSServices still have an internal aggregator for support of default
behavior, but if you want to override the aggregation behavior, you should use the new
processor.
- Deprecated `add_pattern_pair` in the `PatternPairAggregator` which takes a `pattern_id`
and `remove_match` field in favor of the new `add_pattern` method which takes a `type` and an
`action`
### Fixed
- Fixed a `SimliVideoService` connection issue.
- Fixed an issue in the `Runner` where, when using `SmallWebRTCTransport`, the
`request_data` was not being passed to the `SmallWebRTCRunnerArguments` body.
- Fixed subtle issue of assistant context messages ending up with double spaces
between words or sentences.
@@ -732,6 +161,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Prevented `HeyGenVideoService` from automatically disconnecting after 5 minutes.
### Added
- Added ai-coustics integrated VAD (`AICVADAnalyzer`) with `AICFilter` factory and
example wiring; leverages the enhancement model for robust detection with no
ONNX dependency or added processing complexity.
## [0.0.94] - 2025-11-10
### Changed

View File

@@ -79,7 +79,7 @@ Once your PR is submitted, post in the `#community-integrations` Discord channel
**Examples:**
- [NvidiaSTTService](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/nvidia/stt.py)
- [RivaSTTService](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/riva/stt.py)
- [FalSTTService](https://github.com/pipecat-ai/pipecat/blob/main/src/pipecat/services/fal/stt.py)
#### Key requirements:

View File

@@ -17,121 +17,24 @@ We welcome contributions of all kinds! Your help is appreciated. Follow these st
git checkout -b your-branch-name
```
4. **Make your changes**: Edit or add files as necessary.
5. **Add a changelog entry**: Create a changelog fragment file (see [Changelog Entries](#changelog-entries) below).
6. **Test your changes**: Ensure that your changes look correct and follow the style set in the codebase.
7. **Commit your changes**: Once you're satisfied with your changes, commit them with a meaningful message.
5. **Test your changes**: Ensure that your changes look correct and follow the style set in the codebase.
6. **Commit your changes**: Once you're satisfied with your changes, commit them with a meaningful message.
```bash
git commit -m "Description of your changes"
```
8. **Push your changes**: Push your branch to your forked repository.
7. **Push your changes**: Push your branch to your forked repository.
```bash
git push origin your-branch-name
```
9. **Submit a Pull Request (PR)**: Open a PR from your forked repository to the main branch of this repo.
8. **Submit a Pull Request (PR)**: Open a PR from your forked repository to the main branch of this repo.
> Important: Describe the changes you've made clearly!
Our maintainers will review your PR, and once everything is good, your contributions will be merged!
## Changelog Entries
Every pull request that makes a user-facing change should include a changelog entry. We use a changelog fragment system to avoid merge conflicts.
### Creating a Changelog Fragment
1. Create a new file in the `changelog/` directory with this naming pattern:
```
<PR_number>.<type>.md
```
2. Choose the appropriate type:
- `added.md` - New features
- `changed.md` - Changes in existing functionality
- `deprecated.md` - Soon-to-be removed features
- `removed.md` - Removed features
- `fixed.md` - Bug fixes
- `security.md` - Security fixes
3. Write your changelog entry as a Markdown bullet point. Include the `-` at the start:
**Example files:**
`changelog/1234.added.md`:
```markdown
- Added support for Anthropic Claude 3.5 Sonnet with improved streaming performance.
```
`changelog/5678.fixed.md`:
```markdown
- Fixed an issue where audio frames were dropped during high-load scenarios.
```
**For entries with nested bullets:**
`changelog/1234.changed.md`:
```markdown
- Updated service configuration:
- Changed default timeout to 30 seconds
- Added retry logic for failed connections
```
### Multiple Changes in One PR
**Different types of changes:** Create separate fragment files for each type:
```
changelog/1234.added.md
changelog/1234.fixed.md
```
**Multiple changes of the same type:** Create numbered fragment files:
```
changelog/1234.changed.md
changelog/1234.changed.2.md
```
**Related changes:** Use nested bullets in a single fragment:
```markdown
- Updated service configuration:
- Changed default timeout to 30 seconds
- Added retry logic for failed connections
```
**Rule of thumb:** One logical change per fragment file. If changes are unrelated, use separate files.
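A quick way to check fragment names against the convention described above is a small validator. This is a sketch under assumptions: `parse_fragment_name` is a hypothetical helper, not part of the repository or of towncrier, and it only mirrors the `<PR_number>.<type>.md` and numbered `<PR_number>.<type>.<n>.md` forms shown in this guide.

```python
import re
from typing import Optional, Tuple

# Fragment types listed in this guide.
FRAGMENT_TYPES = ("added", "changed", "deprecated", "removed", "fixed", "security")

# <PR_number>.<type>.md, with an optional numeric counter before ".md".
FRAGMENT_RE = re.compile(
    r"^(\d+)\.(" + "|".join(FRAGMENT_TYPES) + r")(?:\.(\d+))?\.md$"
)


def parse_fragment_name(filename: str) -> Optional[Tuple[int, str, Optional[int]]]:
    """Return (pr_number, type, counter) for a valid name, else None."""
    match = FRAGMENT_RE.match(filename)
    if match is None:
        return None
    pr_number, kind, counter = match.groups()
    return (int(pr_number), kind, int(counter) if counter else None)


if __name__ == "__main__":
    for name in ("1234.added.md", "1234.changed.2.md", "1234.feature.md"):
        print(name, "->", parse_fragment_name(name))
```

Names with an unknown type (for example `1234.feature.md`) or without a PR number are rejected by this check.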
### Preview Your Changes
To see what your changelog entry will look like:
```bash
towncrier build --draft --version Unreleased
```
This won't modify any files, just show you a preview.
### When to Skip Changelog Entries
You can skip adding a changelog entry for:
- Documentation-only changes
- Internal refactoring with no user-facing impact
- Test-only changes
- CI/build configuration changes
If you're unsure whether your change needs a changelog entry, ask in your PR!
## Dependency Management
This project uses [uv](https://docs.astral.sh/uv/) for dependency management. The `uv.lock` file is committed to ensure reproducible builds.

View File

@@ -3,6 +3,7 @@
</div></h1>
[![PyPI](https://img.shields.io/pypi/v/pipecat-ai)](https://pypi.org/project/pipecat-ai) ![Tests](https://github.com/pipecat-ai/pipecat/actions/workflows/tests.yaml/badge.svg) [![codecov](https://codecov.io/gh/pipecat-ai/pipecat/graph/badge.svg?token=LNVUIVO4Y9)](https://codecov.io/gh/pipecat-ai/pipecat) [![Docs](https://img.shields.io/badge/Documentation-blue)](https://docs.pipecat.ai) [![Discord](https://img.shields.io/discord/1239284677165056021)](https://discord.gg/pipecat) [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/pipecat-ai/pipecat)
[![](https://getmanta.ai/api/badges?text=Manta%20Graph&link=manta)](https://getmanta.ai/pipecat)
# 🎙️ Pipecat: Real-Time Voice & Multimodal AI Agents
@@ -73,10 +74,10 @@ Catch new features, interviews, and how-tos on our [Pipecat TV](https://www.yout
| Category | Services |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Speech-to-Text | [AssemblyAI](https://docs.pipecat.ai/server/services/stt/assemblyai), [AWS](https://docs.pipecat.ai/server/services/stt/aws), [Azure](https://docs.pipecat.ai/server/services/stt/azure), [Cartesia](https://docs.pipecat.ai/server/services/stt/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/stt/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/stt/elevenlabs), [Fal Wizper](https://docs.pipecat.ai/server/services/stt/fal), [Gladia](https://docs.pipecat.ai/server/services/stt/gladia), [Google](https://docs.pipecat.ai/server/services/stt/google), [Gradium](https://docs.pipecat.ai/server/services/stt/gradium), [Groq (Whisper)](https://docs.pipecat.ai/server/services/stt/groq), [NVIDIA Riva](https://docs.pipecat.ai/server/services/stt/riva), [OpenAI (Whisper)](https://docs.pipecat.ai/server/services/stt/openai), [SambaNova (Whisper)](https://docs.pipecat.ai/server/services/stt/sambanova), [Sarvam](https://docs.pipecat.ai/server/services/stt/sarvam), [Soniox](https://docs.pipecat.ai/server/services/stt/soniox), [Speechmatics](https://docs.pipecat.ai/server/services/stt/speechmatics), [Whisper](https://docs.pipecat.ai/server/services/stt/whisper) |
| Speech-to-Text | [AssemblyAI](https://docs.pipecat.ai/server/services/stt/assemblyai), [AWS](https://docs.pipecat.ai/server/services/stt/aws), [Azure](https://docs.pipecat.ai/server/services/stt/azure), [Cartesia](https://docs.pipecat.ai/server/services/stt/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/stt/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/stt/elevenlabs), [Fal Wizper](https://docs.pipecat.ai/server/services/stt/fal), [Gladia](https://docs.pipecat.ai/server/services/stt/gladia), [Google](https://docs.pipecat.ai/server/services/stt/google), [Groq (Whisper)](https://docs.pipecat.ai/server/services/stt/groq), [NVIDIA Riva](https://docs.pipecat.ai/server/services/stt/riva), [OpenAI (Whisper)](https://docs.pipecat.ai/server/services/stt/openai), [SambaNova (Whisper)](https://docs.pipecat.ai/server/services/stt/sambanova), [Sarvam](https://docs.pipecat.ai/server/services/stt/sarvam), [Soniox](https://docs.pipecat.ai/server/services/stt/soniox), [Speechmatics](https://docs.pipecat.ai/server/services/stt/speechmatics), [Ultravox](https://docs.pipecat.ai/server/services/stt/ultravox), [Whisper](https://docs.pipecat.ai/server/services/stt/whisper) |
| LLMs | [Anthropic](https://docs.pipecat.ai/server/services/llm/anthropic), [AWS](https://docs.pipecat.ai/server/services/llm/aws), [Azure](https://docs.pipecat.ai/server/services/llm/azure), [Cerebras](https://docs.pipecat.ai/server/services/llm/cerebras), [DeepSeek](https://docs.pipecat.ai/server/services/llm/deepseek), [Fireworks AI](https://docs.pipecat.ai/server/services/llm/fireworks), [Gemini](https://docs.pipecat.ai/server/services/llm/gemini), [Grok](https://docs.pipecat.ai/server/services/llm/grok), [Groq](https://docs.pipecat.ai/server/services/llm/groq), [Mistral](https://docs.pipecat.ai/server/services/llm/mistral), [NVIDIA NIM](https://docs.pipecat.ai/server/services/llm/nim), [Ollama](https://docs.pipecat.ai/server/services/llm/ollama), [OpenAI](https://docs.pipecat.ai/server/services/llm/openai), [OpenRouter](https://docs.pipecat.ai/server/services/llm/openrouter), [Perplexity](https://docs.pipecat.ai/server/services/llm/perplexity), [Qwen](https://docs.pipecat.ai/server/services/llm/qwen), [SambaNova](https://docs.pipecat.ai/server/services/llm/sambanova) [Together AI](https://docs.pipecat.ai/server/services/llm/together) |
| Text-to-Speech | [Async](https://docs.pipecat.ai/server/services/tts/asyncai), [AWS](https://docs.pipecat.ai/server/services/tts/aws), [Azure](https://docs.pipecat.ai/server/services/tts/azure), [Cartesia](https://docs.pipecat.ai/server/services/tts/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/tts/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/tts/elevenlabs), [Fish](https://docs.pipecat.ai/server/services/tts/fish), [Google](https://docs.pipecat.ai/server/services/tts/google), [Gradium](https://docs.pipecat.ai/server/services/tts/gradium), [Groq](https://docs.pipecat.ai/server/services/tts/groq), [Hume](https://docs.pipecat.ai/server/services/tts/hume), [Inworld](https://docs.pipecat.ai/server/services/tts/inworld), [LMNT](https://docs.pipecat.ai/server/services/tts/lmnt), [MiniMax](https://docs.pipecat.ai/server/services/tts/minimax), [Neuphonic](https://docs.pipecat.ai/server/services/tts/neuphonic), [NVIDIA Riva](https://docs.pipecat.ai/server/services/tts/riva), [OpenAI](https://docs.pipecat.ai/server/services/tts/openai), [Piper](https://docs.pipecat.ai/server/services/tts/piper), [PlayHT](https://docs.pipecat.ai/server/services/tts/playht), [Rime](https://docs.pipecat.ai/server/services/tts/rime), [Sarvam](https://docs.pipecat.ai/server/services/tts/sarvam), [Speechmatics](https://docs.pipecat.ai/server/services/tts/speechmatics), [XTTS](https://docs.pipecat.ai/server/services/tts/xtts) |
| Speech-to-Speech | [AWS Nova Sonic](https://docs.pipecat.ai/server/services/s2s/aws), [Gemini Multimodal Live](https://docs.pipecat.ai/server/services/s2s/gemini), [OpenAI Realtime](https://docs.pipecat.ai/server/services/s2s/openai), Ultravox, |
| Text-to-Speech | [Async](https://docs.pipecat.ai/server/services/tts/asyncai), [AWS](https://docs.pipecat.ai/server/services/tts/aws), [Azure](https://docs.pipecat.ai/server/services/tts/azure), [Cartesia](https://docs.pipecat.ai/server/services/tts/cartesia), [Deepgram](https://docs.pipecat.ai/server/services/tts/deepgram), [ElevenLabs](https://docs.pipecat.ai/server/services/tts/elevenlabs), [Fish](https://docs.pipecat.ai/server/services/tts/fish), [Google](https://docs.pipecat.ai/server/services/tts/google), [Groq](https://docs.pipecat.ai/server/services/tts/groq), [Hume](https://docs.pipecat.ai/server/services/tts/hume), [Inworld](https://docs.pipecat.ai/server/services/tts/inworld), [LMNT](https://docs.pipecat.ai/server/services/tts/lmnt), [MiniMax](https://docs.pipecat.ai/server/services/tts/minimax), [Neuphonic](https://docs.pipecat.ai/server/services/tts/neuphonic), [NVIDIA Riva](https://docs.pipecat.ai/server/services/tts/riva), [OpenAI](https://docs.pipecat.ai/server/services/tts/openai), [Piper](https://docs.pipecat.ai/server/services/tts/piper), [PlayHT](https://docs.pipecat.ai/server/services/tts/playht), [Rime](https://docs.pipecat.ai/server/services/tts/rime), [Sarvam](https://docs.pipecat.ai/server/services/tts/sarvam), [Speechmatics](https://docs.pipecat.ai/server/services/tts/speechmatics), [XTTS](https://docs.pipecat.ai/server/services/tts/xtts) |
| Speech-to-Speech | [AWS Nova Sonic](https://docs.pipecat.ai/server/services/s2s/aws), [Gemini Multimodal Live](https://docs.pipecat.ai/server/services/s2s/gemini), [OpenAI Realtime](https://docs.pipecat.ai/server/services/s2s/openai) |
| Transport | [Daily (WebRTC)](https://docs.pipecat.ai/server/services/transport/daily), [FastAPI Websocket](https://docs.pipecat.ai/server/services/transport/fastapi-websocket), [SmallWebRTCTransport](https://docs.pipecat.ai/server/services/transport/small-webrtc), [WebSocket Server](https://docs.pipecat.ai/server/services/transport/websocket-server), Local |
| Serializers | [Plivo](https://docs.pipecat.ai/server/utilities/serializers/plivo), [Twilio](https://docs.pipecat.ai/server/utilities/serializers/twilio), [Telnyx](https://docs.pipecat.ai/server/utilities/serializers/telnyx) |
| Video | [HeyGen](https://docs.pipecat.ai/server/services/video/heygen), [Tavus](https://docs.pipecat.ai/server/services/video/tavus), [Simli](https://docs.pipecat.ai/server/services/video/simli) |
@@ -153,6 +154,7 @@ You can get started with Pipecat running on your local machine, then move your a
--no-extra gstreamer \
--no-extra krisp \
--no-extra local \
--no-extra ultravox # (ultravox not fully supported on macOS)
```
3. Install the git pre-commit hooks:

View File

@@ -1,16 +0,0 @@
{% for section, _ in sections.items() %}
{% if sections[section] %}
{% for category, val in definitions.items() if category in sections[section]%}
### {{ definitions[category]['name'] }}
{% for text, values in sections[section][category].items() %}
{{ text }}
(PR {{ values|join(', ') }})
{% endfor %}
{% endfor %}
{% else %}
No significant changes.
{% endif %}
{% endfor %}

View File

@@ -2,7 +2,7 @@
# Build docs using uv
echo "Installing dependencies with uv..."
uv sync --group docs --all-extras --no-extra krisp --no-extra gstreamer --no-extra local_smart_turn --no-extra moondream --no-extra riva --no-extra mlx-whisper
uv sync --group docs --all-extras --no-extra krisp --no-extra gstreamer --no-extra ultravox --no-extra local_smart_turn --no-extra moondream --no-extra riva --no-extra mlx-whisper
# Check if sphinx-build is available
if ! uv run sphinx-build --version &> /dev/null; then
@@ -24,4 +24,4 @@ if [ $? -eq 0 ]; then
else
echo "Documentation build failed!" >&2
exit 1
fi
fi

View File

@@ -61,6 +61,9 @@ autodoc_mock_imports = [
# OpenCV - sometimes has import issues during docs build
"cv2",
# Heavy ML packages excluded from ReadTheDocs
# ultravox dependencies
"vllm",
"vllm.engine.arg_utils",
# local-smart-turn dependencies
"coremltools",
"coremltools.models",
@@ -116,6 +119,7 @@ def import_core_modules():
"pipecat.observers",
"pipecat.runner",
"pipecat.serializers",
"pipecat.sync",
"pipecat.transcriptions",
"pipecat.utils",
]

View File

@@ -30,6 +30,7 @@ Quick Links
Runner <api/pipecat.runner>
Serializers <api/pipecat.serializers>
Services <api/pipecat.services>
Sync <api/pipecat.sync>
Transcriptions <api/pipecat.transcriptions>
Transports <api/pipecat.transports>
Utils <api/pipecat.utils>
Utils <api/pipecat.utils>

View File

@@ -44,7 +44,6 @@ DAILY_SAMPLE_ROOM_URL=https://...
# Deepgram
DEEPGRAM_API_KEY=...
SAGEMAKER_ENDPOINT_NAME=...
# DeepSeek
DEEPSEEK_API_KEY=...
@@ -73,9 +72,6 @@ GOOGLE_CLOUD_PROJECT_ID=...
GOOGLE_CLOUD_LOCATION=...
GOOGLE_TEST_CREDENTIALS=...
# Gradium
GRAPDIUM_API_KEY=...
# Grok
GROK_API_KEY=...
@@ -84,7 +80,6 @@ GROQ_API_KEY=...
# Heygen
HEYGEN_API_KEY=...
HEYGEN_LIVE_AVATAR_API_KEY=...
# Hume
HUME_API_KEY=...
@@ -191,11 +186,8 @@ TOGETHER_API_KEY=...
TWILIO_ACCOUNT_SID=...
TWILIO_AUTH_TOKEN=...
# Ultravox Realtime
ULTRAVOX_API_KEY=...
# WhatsApp
WHATSAPP_TOKEN=...
WHATSAPP_WEBHOOK_VERIFICATION_TOKEN=...
WHATSAPP_PHONE_NUMBER_ID=...
WHATSAPP_APP_SECRET=...
WHATSAPP_APP_SECRET=...

View File

@@ -15,7 +15,7 @@ from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.nvidia.tts import NvidiaTTSService
from pipecat.services.riva.tts import FastPitchTTSService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
@@ -36,7 +36,7 @@ transport_params = {
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
tts = NvidiaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
tts = FastPitchTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
task = PipelineTask(
Pipeline([tts, transport.output()]),

@@ -4,6 +4,7 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
import aiohttp
@@ -14,26 +15,26 @@ from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame, TTSTextFrame
from pipecat.observers.loggers.debug_log_observer import DebugLogObserver, FrameEndpoint
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.frameworks.rtvi import RTVIObserver, RTVIProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.inworld.tts import InworldHttpTTSService
from pipecat.services.inworld.tts import InworldTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_output import BaseOutputTransport
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
@@ -57,18 +58,22 @@ transport_params = {
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info("Starting bot")
logger.info(f"Starting bot")
# Create an HTTP session
async with aiohttp.ClientSession() as session:
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
tts = InworldHttpTTSService(
# Inworld TTS Service - Unified streaming and non-streaming
# Set streaming=True for real-time audio, streaming=False for complete audio generation
streaming = True # Toggle this to switch between modes
tts = InworldTTSService(
api_key=os.getenv("INWORLD_API_KEY", ""),
aiohttp_session=session,
voice_id="Ashley",
model="inworld-tts-1",
# Set to False for non-streaming mode or True for streaming mode.
streaming=True,
streaming=streaming, # True: real-time chunks, False: complete audio then playback
)
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
@@ -76,25 +81,22 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
messages = [
{
"role": "system",
"content": "You are a helpful AI demonstrating Inworld AI's TTS. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a friendly and helpful way.",
"content": "You are very knowledgeable about dogs. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages)
context_aggregator = LLMContextAggregatorPair(context)
rtvi = RTVIProcessor()
pipeline = Pipeline(
[
transport.input(),
rtvi,
stt,
context_aggregator.user(),
llm,
tts,
transport.output(),
context_aggregator.assistant(),
transport.input(), # Transport user input
stt, # STT
context_aggregator.user(), # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
context_aggregator.assistant(), # Assistant spoken responses
]
)
@@ -104,27 +106,19 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
enable_metrics=True,
enable_usage_metrics=True,
),
observers=[
RTVIObserver(rtvi),
DebugLogObserver(
frame_types={
TTSTextFrame: (BaseOutputTransport, FrameEndpoint.SOURCE),
}
),
],
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info("Client connected")
logger.info(f"Client connected")
# Kick off the conversation.
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info("Client disconnected")
logger.info(f"Client disconnected")
await task.cancel()
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)

@@ -1,141 +0,0 @@
#
# Copyright (c) 2024–2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame, TTSTextFrame
from pipecat.observers.loggers.debug_log_observer import DebugLogObserver, FrameEndpoint
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.frameworks.rtvi import RTVIConfig, RTVIObserver, RTVIProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.inworld.tts import InworldTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_output import BaseOutputTransport
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
}
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info("Starting bot")
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
tts = InworldTTSService(
api_key=os.getenv("INWORLD_API_KEY", ""),
voice_id="Ashley",
model="inworld-tts-1",
temperature=1.1,
)
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
messages = [
{
"role": "system",
"content": "You are a helpful AI demonstrating Inworld AI's TTS. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a friendly and helpful way.",
},
]
context = LLMContext(messages)
context_aggregator = LLMContextAggregatorPair(context)
rtvi = RTVIProcessor(config=RTVIConfig(config=[]))
pipeline = Pipeline(
[
transport.input(),
rtvi,
stt,
context_aggregator.user(),
llm,
tts,
transport.output(),
context_aggregator.assistant(),
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
observers=[
RTVIObserver(rtvi),
DebugLogObserver(
frame_types={
TTSTextFrame: (BaseOutputTransport, FrameEndpoint.SOURCE),
}
),
],
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info("Client connected")
# Kick off the conversation.
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info("Client disconnected")
await task.cancel()
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()

@@ -13,29 +13,24 @@ from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame, TTSTextFrame
from pipecat.observers.loggers.debug_log_observer import DebugLogObserver, FrameEndpoint
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import (
LLMContextAggregatorPair,
)
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.frameworks.rtvi import RTVIConfig, RTVIObserver, RTVIProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.hume.tts import HUME_SAMPLE_RATE, HumeTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.base_output import BaseOutputTransport
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
@@ -93,7 +88,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
stt,
context_aggregator.user(), # User responses
llm, # LLM
tts, # TTS (HumeTTSService with word timestamps)
tts, # TTS
transport.output(), # Transport bot output
context_aggregator.assistant(), # Assistant spoken responses
]
@@ -107,14 +102,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
audio_out_sample_rate=HUME_SAMPLE_RATE,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
observers=[
RTVIObserver(rtvi),
DebugLogObserver(
frame_types={
TTSTextFrame: (BaseOutputTransport, FrameEndpoint.SOURCE),
}
),
],
observers=[RTVIObserver(rtvi)],
)
@rtvi.event_handler("on_client_ready")
@@ -124,9 +112,6 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
logger.info(
"💡 Word timestamps are enabled! Watch the console for TTSTextFrame logs showing each word with its PTS."
)
# Kick off the conversation.
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMRunFrame()])

@@ -52,10 +52,7 @@ transport_params = {
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
stt = DeepgramFluxSTTService(
api_key=os.getenv("DEEPGRAM_API_KEY"),
params=DeepgramFluxSTTService.InputParams(min_confidence=0.3),
)
stt = DeepgramFluxSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
tts = DeepgramTTSService(api_key=os.getenv("DEEPGRAM_API_KEY"), voice="aura-2-andromeda-en")

@@ -1,137 +0,0 @@
#
# Copyright (c) 2024–2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.aws.llm import AWSBedrockLLMService
from pipecat.services.deepgram.stt_sagemaker import DeepgramSageMakerSTTService
from pipecat.services.deepgram.tts import DeepgramTTSService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
}
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
# Initialize Deepgram SageMaker STT Service
# This requires:
# - AWS credentials configured (via environment variables or AWS CLI)
# - A deployed SageMaker endpoint with Deepgram model
stt = DeepgramSageMakerSTTService(
endpoint_name=os.getenv("SAGEMAKER_ENDPOINT_NAME"),
region=os.getenv("AWS_REGION"),
)
tts = DeepgramTTSService(api_key=os.getenv("DEEPGRAM_API_KEY"), voice="aura-2-andromeda-en")
llm = AWSBedrockLLMService(
aws_region=os.getenv("AWS_REGION"),
model="us.amazon.nova-pro-v1:0",
params=AWSBedrockLLMService.InputParams(temperature=0.8),
)
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages)
context_aggregator = LLMContextAggregatorPair(context)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt, # STT
context_aggregator.user(), # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
context_aggregator.assistant(), # Assistant spoken responses
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
# Kick off the conversation.
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info(f"Client disconnected")
await task.cancel()
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()
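
The comment in this example ("We store functions so objects ... don't get instantiated") describes a lazy-factory pattern: each dictionary value is a zero-argument function, so only the parameters for the selected transport are ever built. A minimal sketch of that idea, using hypothetical stand-in classes rather than the real Pipecat types:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Records which analyzers were actually constructed.
created: List[str] = []


@dataclass
class FakeAnalyzer:
    """Hypothetical stand-in for an expensive object such as a VAD analyzer."""

    name: str

    def __post_init__(self) -> None:
        created.append(self.name)


@dataclass
class FakeParams:
    """Hypothetical stand-in for a transport params object."""

    analyzer: FakeAnalyzer


# Each value is a lambda, so nothing is constructed when the dict is defined.
lazy_params: Dict[str, Callable[[], FakeParams]] = {
    "daily": lambda: FakeParams(analyzer=FakeAnalyzer("daily")),
    "twilio": lambda: FakeParams(analyzer=FakeAnalyzer("twilio")),
    "webrtc": lambda: FakeParams(analyzer=FakeAnalyzer("webrtc")),
}


def select_params(transport: str) -> FakeParams:
    """Call only the factory for the requested transport."""
    return lazy_params[transport]()


selected = select_params("webrtc")
print(created)
```

Only the `"webrtc"` analyzer is created here. Had the dictionary held `FakeParams(...)` instances directly, all three analyzers would have been built up front.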

@@ -71,9 +71,9 @@ def build_agent(model_id: str, max_tokens: int):
@tool
def check_weather(location: str) -> str:
if location.lower() == "san francisco":
return "The weather in San Francisco is sunny and 75 degrees."
return "The weather in San Francisco is sunny and 30 degrees."
elif location.lower() == "sydney":
return "The weather in Sydney is cloudy and 60 degrees."
return "The weather in Sydney is cloudy and 20 degrees."
else:
return "I'm not sure about the weather in that location."

@@ -89,7 +89,6 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm = GoogleLLMService(
api_key=os.getenv("GOOGLE_API_KEY"),
model="gemini-2.5-flash-image",
# model="gemini-3-pro-image-preview", # A more powerful model, but slower
)
messages = [

@@ -136,7 +136,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
messages.append(
{
"role": "system",
"content": "You are an AI assistant. You can help with a variety of tasks. Introduce yourself and ask the user what they would like to know.",
"content": "Hello! I'm your AI assistant. I can help you with a variety of tasks. What would you like to know?",
}
)
await task.queue_frames([LLMRunFrame()])

@@ -75,10 +75,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm = GoogleLLMService(
api_key=os.getenv("GOOGLE_API_KEY"),
model="gemini-2.5-flash",
# force a certain amount of thinking if you want it
# params=GoogleLLMService.InputParams(
# thinking=GoogleLLMService.ThinkingConfig(thinking_budget=4096)
# ),
# turn on thinking if you want it
# params=GoogleLLMService.InputParams(extra={"thinking_config": {"thinking_budget": 4096}}),)
)
messages = [

@@ -75,10 +75,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm = GoogleLLMService(
api_key=os.getenv("GOOGLE_API_KEY"),
model="gemini-2.5-flash",
# force a certain amount of thinking if you want it
# params=GoogleLLMService.InputParams(
# thinking=GoogleLLMService.ThinkingConfig(thinking_budget=4096)
# ),
# turn on thinking if you want it
# params=GoogleLLMService.InputParams(extra={"thinking_config": {"thinking_budget": 4096}}),)
)
messages = [

@@ -22,9 +22,9 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.nvidia.llm import NvidiaLLMService
from pipecat.services.nvidia.stt import NvidiaSTTService
from pipecat.services.nvidia.tts import NvidiaTTSService
from pipecat.services.nim.llm import NimLLMService
from pipecat.services.riva.stt import RivaSTTService
from pipecat.services.riva.tts import RivaTTSService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
@@ -59,13 +59,11 @@ transport_params = {
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
stt = NvidiaSTTService(api_key=os.getenv("NVIDIA_API_KEY"))
stt = RivaSTTService(api_key=os.getenv("NVIDIA_API_KEY"))
llm = NvidiaLLMService(
api_key=os.getenv("NVIDIA_API_KEY"), model="meta/llama-3.1-405b-instruct"
)
llm = NimLLMService(api_key=os.getenv("NVIDIA_API_KEY"), model="meta/llama-3.1-405b-instruct")
tts = NvidiaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
tts = RivaTTSService(api_key=os.getenv("NVIDIA_API_KEY"))
messages = [
{

@@ -224,10 +224,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm = GoogleLLMService(
api_key=os.getenv("GOOGLE_API_KEY"),
model="gemini-2.5-flash",
# force a certain amount of thinking if you want it
# params=GoogleLLMService.InputParams(
# thinking=GoogleLLMService.ThinkingConfig(thinking_budget=4096)
# ),
# turn on thinking if you want it
# params=GoogleLLMService.InputParams(extra={"thinking_config": {"thinking_budget": 4096}}),
)
tts = GoogleTTSService(

@@ -4,6 +4,7 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
@@ -13,23 +14,32 @@ from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.gradium.stt import GradiumSTTService
from pipecat.services.gradium.tts import GradiumTTSService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.ultravox.stt import UltravoxSTTService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
# NOTE: This example requires GPU resources to run efficiently.
# The Ultravox model is compute-intensive and performs best with GPU acceleration.
# This can be deployed on cloud GPU providers like Cerebrium.ai for optimal performance.
# Initialize the Ultravox processor once at import time: loading the model is
# slow, and we don't want to reload it every time the pipeline is run.
ultravox_processor = UltravoxSTTService(
model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
hf_token=os.getenv("HF_TOKEN"),
)
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
@@ -58,34 +68,17 @@ transport_params = {
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
stt = GradiumSTTService(api_key=os.getenv("GRADIUM_API_KEY"))
tts = GradiumTTSService(
api_key=os.getenv("GRADIUM_API_KEY"),
voice_id="YTpq7expH9539ERJ",
tts = CartesiaTTSService(
api_key=os.environ.get("CARTESIA_API_KEY"),
voice_id="97f4b8fb-f2fe-444b-bb9a-c109783a857a",
)
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages)
context_aggregator = LLMContextAggregatorPair(context)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt,
context_aggregator.user(), # User responses
llm, # LLM
ultravox_processor,
tts, # TTS
transport.output(), # Transport bot output
context_aggregator.assistant(), # Assistant spoken responses
]
)
@@ -101,9 +94,6 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
# Kick off the conversation.
messages.append({"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):

@@ -110,7 +110,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# Kick off the conversation.
image = Image.open(image_path)
message = await LLMContext.create_image_message(
message = LLMContext.create_image_message(
image=image.tobytes(),
format="RGB",
size=image.size,

@@ -110,7 +110,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# Kick off the conversation.
image = Image.open(image_path)
message = await LLMContext.create_image_message(
message = LLMContext.create_image_message(
image=image.tobytes(),
format="RGB",
size=image.size,

@@ -117,7 +117,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# Kick off the conversation.
image = Image.open(image_path)
message = await LLMContext.create_image_message(
message = LLMContext.create_image_message(
image=image.tobytes(),
format="RGB",
size=image.size,

@@ -110,7 +110,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# Kick off the conversation.
image = Image.open(image_path)
message = await LLMContext.create_image_message(
message = LLMContext.create_image_message(
image=image.tobytes(),
format="RGB",
size=image.size,

@@ -15,21 +15,14 @@ from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import (
Frame,
LLMFullResponseEndFrame,
LLMFullResponseStartFrame,
LLMRunFrame,
TextFrame,
UserImageRequestFrame,
)
from pipecat.frames.frames import LLMRunFrame, UserImageRequestFrame
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.processors.frame_processor import FrameDirection
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import (
create_transport,
@@ -73,27 +66,6 @@ async def fetch_user_image(params: FunctionCallParams):
# await params.result_callback({"result": "Image is being captured."})
class MoondreamTextFrameWrapper(FrameProcessor):
"""Wraps Moondream-provided TextFrames with LLM response start/end frames.
This processor detects TextFrames and automatically wraps them with
LLMFullResponseStartFrame and LLMFullResponseEndFrame to provide proper
response boundaries for downstream processors.
"""
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
# If we receive a TextFrame, wrap it with response start/end frames
if isinstance(frame, TextFrame):
await self.push_frame(LLMFullResponseStartFrame(), direction)
await self.push_frame(frame, direction)
await self.push_frame(LLMFullResponseEndFrame(), direction)
else:
# For all other frames, just pass them through
await self.push_frame(frame, direction)
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
@@ -158,12 +130,6 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# If you run into weird description, try with use_cpu=True
moondream = MoondreamService()
# Wrap TextFrames with LLM response start/end frames, which makes Moondream
# output be treated like LLM responses for the purpose of context
# aggregation. Without this, the assistant context aggregator would ignore
# Moondream output (if the TTS service is disabled).
moondream_text_wrapper = MoondreamTextFrameWrapper()
pipeline = Pipeline(
[
transport.input(), # Transport user input
@@ -171,7 +137,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
context_aggregator.user(), # User responses
ParallelPipeline(
[llm], # LLM
[moondream, moondream_text_wrapper],
[moondream],
),
tts, # TTS
transport.output(), # Transport bot output
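
The `MoondreamTextFrameWrapper` in this diff brackets each text frame with response start and end frames and passes every other frame through unchanged. The ordering it produces can be sketched without Pipecat. All classes below are simplified, hypothetical stand-ins, not the real frame or processor types.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class TextFrame:
    text: str


@dataclass
class OtherFrame:
    label: str


class ResponseStart:
    """Stand-in for a 'response started' marker frame."""


class ResponseEnd:
    """Stand-in for a 'response ended' marker frame."""


class TextWrapper:
    """Wraps TextFrames in start/end markers; forwards other frames as-is."""

    def __init__(self) -> None:
        self.pushed: List[object] = []

    async def push_frame(self, frame: object) -> None:
        # A real processor would hand the frame to the next pipeline stage.
        self.pushed.append(frame)

    async def process_frame(self, frame: object) -> None:
        if isinstance(frame, TextFrame):
            await self.push_frame(ResponseStart())
            await self.push_frame(frame)
            await self.push_frame(ResponseEnd())
        else:
            await self.push_frame(frame)


async def demo() -> List[str]:
    wrapper = TextWrapper()
    await wrapper.process_frame(OtherFrame("audio"))
    await wrapper.process_frame(TextFrame("A cat on a sofa."))
    return [type(f).__name__ for f in wrapper.pushed]


names = asyncio.run(demo())
print(names)
```

The text frame arrives downstream between the two markers, which is the boundary information the code comment says the assistant context aggregator relies on.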

@@ -76,7 +76,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm = FireworksLLMService(
api_key=os.getenv("FIREWORKS_API_KEY"),
model="accounts/fireworks/models/gpt-oss-20b",
model="accounts/fireworks/models/llama-v3p1-405b-instruct",
)
# You can also register a function_name of None to get all functions
# sent to the same callback with an additional function_name parameter.

@@ -27,7 +27,7 @@ from pipecat.runner.utils import create_transport
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.llm_service import FunctionCallParams
from pipecat.services.nvidia.llm import NvidiaLLMService
from pipecat.services.nim.llm import NimLLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
@@ -75,11 +75,11 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# text_filters=[MarkdownTextFilter()],
)
llm = NvidiaLLMService(
llm = NimLLMService(
api_key=os.getenv("NVIDIA_API_KEY"),
model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
# Recommended when turning thinking off
params=NvidiaLLMService.InputParams(temperature=0.0),
params=NimLLMService.InputParams(temperature=0.0),
)
# You can also register a function_name of None to get all functions
# sent to the same callback with an additional function_name parameter.

@@ -14,13 +14,20 @@ from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.adapters.services.open_ai_realtime_adapter import OpenAIRealtimeLLMAdapter
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMRunFrame, LLMSetToolsFrame, TranscriptionMessage
from pipecat.frames.frames import (
LLMRunFrame,
LLMSetToolsFrame,
LLMUpdateSettingsFrame,
TranscriptionMessage,
)
from pipecat.observers.loggers.transcription_log_observer import TranscriptionLogObserver
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.runner.types import RunnerArguments

@@ -19,6 +19,7 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport

@@ -28,10 +28,10 @@ from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.llm_service import LLMService
from pipecat.services.openai.llm import OpenAIContextAggregatorPair, OpenAILLMService
from pipecat.sync.event_notifier import EventNotifier
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
from pipecat.utils.sync.event_notifier import EventNotifier
load_dotenv(override=True)

@@ -45,11 +45,11 @@ from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.llm_service import FunctionCallParams, LLMService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.sync.base_notifier import BaseNotifier
from pipecat.sync.event_notifier import EventNotifier
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
from pipecat.utils.sync.base_notifier import BaseNotifier
from pipecat.utils.sync.event_notifier import EventNotifier
from pipecat.utils.time import time_now_iso8601
load_dotenv(override=True)

@@ -46,11 +46,11 @@ from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.llm_service import FunctionCallParams, LLMService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.sync.base_notifier import BaseNotifier
from pipecat.sync.event_notifier import EventNotifier
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
from pipecat.utils.sync.base_notifier import BaseNotifier
from pipecat.utils.sync.event_notifier import EventNotifier
from pipecat.utils.time import time_now_iso8601
load_dotenv(override=True)

@@ -47,11 +47,11 @@ from pipecat.runner.utils import create_transport
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.google.llm import GoogleLLMService
from pipecat.services.llm_service import LLMService
from pipecat.sync.base_notifier import BaseNotifier
from pipecat.sync.event_notifier import EventNotifier
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
from pipecat.utils.sync.base_notifier import BaseNotifier
from pipecat.utils.sync.event_notifier import EventNotifier
from pipecat.utils.time import time_now_iso8601
load_dotenv(override=True)
@@ -391,7 +391,7 @@ class AudioAccumulator(FrameProcessor):
)
self._user_speaking = False
context = LLMContext()
await context.add_audio_frames_message(audio_frames=self._audio_frames)
context.add_audio_frames_message(audio_frames=self._audio_frames)
await self.push_frame(LLMContextFrame(context=context))
elif isinstance(frame, InputAudioRawFrame):
# Append the audio frame to our buffer. Treat the buffer as a ring buffer, dropping the oldest

@@ -17,6 +17,7 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.runner.types import RunnerArguments

@@ -20,6 +20,7 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport

@@ -18,6 +18,7 @@ from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response import LLMAssistantAggregatorParams
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import (

@@ -150,7 +150,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
LLMLogObserver(),
DebugLogObserver(
frame_types={
TTSTextFrame: (BaseOutputTransport, FrameEndpoint.SOURCE),
TTSTextFrame: (BaseOutputTransport, FrameEndpoint.DESTINATION),
UserStartedSpeakingFrame: (BaseInputTransport, FrameEndpoint.SOURCE),
EndFrame: None,
}

@@ -64,14 +64,11 @@ class UrlToImageProcessor(FrameProcessor):
await self.push_frame(frame, direction)
def extract_url(self, text: str):
try:
data = json.loads(text)
if "artObject" in data:
return data["artObject"]["webImage"]["url"]
if "artworks" in data and len(data["artworks"]):
return data["artworks"][0]["webImage"]["url"]
except:
pass
data = json.loads(text)
if "artObject" in data:
return data["artObject"]["webImage"]["url"]
if "artworks" in data and len(data["artworks"]):
return data["artworks"][0]["webImage"]["url"]
return None
@@ -91,23 +88,6 @@ class UrlToImageProcessor(FrameProcessor):
logger.error(error_msg)
# full list of tools available from rijksmuseum MCP:
# - get_artwork_details
# - get_artwork_image
# - get_user_sets
# - get_user_set_details
# - open_image_in_browser
# - get_artist_timeline
mcp_tools_filter = ["get_artwork_details", "get_artwork_image", "open_image_in_browser"]
def open_image_output_filter(output: str):
pattern = r"Successfully opened image in browser: "
text_to_print = re.sub(pattern, "", output)
print(f"🖼️ link to high resolution artwork: {text_to_print}")
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
@@ -156,10 +136,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# https://github.com/r-huijts/rijksmuseum-mcp
args=["-y", "mcp-server-rijksmuseum"],
env={"RIJKSMUSEUM_API_KEY": os.getenv("RIJKSMUSEUM_API_KEY")},
),
# Optional
tools_filter=mcp_tools_filter, # Optional
tools_output_filters={"open_image_in_browser": open_image_output_filter},
)
)
except Exception as e:
logger.error(f"error setting up mcp")
@@ -178,7 +155,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
You are a helpful LLM in a WebRTC call.
Your goal is to demonstrate your capabilities in a succinct way.
You have access to tools to search the Rijksmuseum collection.
Offer, for example, to show a floral still life, use the `search_artwork` tool.
Offer, for example, to show the earliest Rembrandt work from the museum. Use the `search_artwork` tool.
The tool may respond with a JSON object with an `artworks` array. Choose the art from that array.
Once the tool has responded, tell the user the title and use the `open_image_in_browser` tool.
Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points.
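
Review note on the `extract_url` hunk in this file: the two versions differ only in whether the JSON lookup is wrapped in a `try` with a bare `except`. A possible middle ground, sketched here under assumptions rather than taken from the repository, is to catch only the exceptions that malformed tool output can raise. The payload shapes (`artObject`, `artworks`, `webImage`, `url`) come from the hunk; the standalone function form and the choice of exceptions are illustrative.

```python
import json
from typing import Optional


def extract_url(text: str) -> Optional[str]:
    """Return the first artwork image URL in a JSON tool response, or None.

    Illustrative sketch: mirrors the lookups shown in the diff, but
    narrows the error handling instead of using a bare `except`.
    """
    try:
        data = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # The tool returned plain text (or a non-string), not JSON.
        return None

    if not isinstance(data, dict):
        # Valid JSON, but not the object shape we expect.
        return None

    try:
        if "artObject" in data:
            return data["artObject"]["webImage"]["url"]
        if data.get("artworks"):
            return data["artworks"][0]["webImage"]["url"]
    except (KeyError, IndexError, TypeError):
        # The expected nested keys are missing or have the wrong type.
        return None
    return None
```

With this shape, non-JSON text yields `None` instead of raising, while unrelated errors such as `KeyboardInterrupt` are no longer swallowed.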

@@ -4,27 +4,29 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
from loguru import logger
from mcp.client.session_group import SseServerParameters
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame, ThoughtTranscriptionMessage, TranscriptionMessage
from pipecat.frames.frames import LLMRunFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.anthropic.llm import AnthropicLLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.mcp_service import MCPClient
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
@@ -67,35 +69,48 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
)
llm = AnthropicLLMService(
api_key=os.getenv("ANTHROPIC_API_KEY"),
params=AnthropicLLMService.InputParams(
thinking=AnthropicLLMService.ThinkingConfig(type="enabled", budget_tokens=2048)
),
api_key=os.getenv("ANTHROPIC_API_KEY"), model="claude-3-7-sonnet-latest"
)
transcript = TranscriptProcessor(process_thoughts=True)
try:
# https://docs.mcp.run/integrating/tutorials/mcp-run-sse-openai-agents/
mcp = MCPClient(server_params=SseServerParameters(url=os.getenv("MCP_RUN_SSE_URL")))
except Exception as e:
logger.error(f"error setting up mcp")
logger.exception("error trace:")
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
tools = {}
try:
tools = await mcp.register_tools(llm)
except Exception as e:
logger.error(f"error registering tools")
logger.exception("error trace:")
context = LLMContext(messages)
system = f"""
You are a helpful LLM in a WebRTC call.
Your goal is to demonstrate your capabilities in a succinct way.
You have access to a number of tools provided by mcp.run. Use any and all tools to help users.
Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points.
Respond to what the user said in a creative and helpful way.
When asked for today's date, use 'https://www.datetoday.net/'.
Don't overexplain what you are doing.
Just respond with short sentences when you are carrying out tool calls.
"""
messages = [{"role": "system", "content": system}]
context = LLMContext(messages, tools)
context_aggregator = LLMContextAggregatorPair(context)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt,
transcript.user(), # User transcripts
context_aggregator.user(), # User responses
context_aggregator.user(), # User spoken responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
transcript.assistant(), # Assistant transcripts (including thoughts)
context_aggregator.assistant(), # Assistant spoken responses
context_aggregator.assistant(), # Assistant spoken responses and tool context
]
)
@@ -110,24 +125,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
logger.info(f"Client connected: {client}")
# Kick off the conversation.
messages.append(
{
"role": "user",
"content": "Say hello briefly.",
}
)
# Here are some example prompts conducive to demonstrating
# thinking (picked from Google and Anthropic docs).
# messages.append(
# {
# "role": "user",
# "content": "Analogize photosynthesis and growing up. Keep your answer concise.",
# # "content": "Compare and contrast electric cars and hybrid cars."
# # "content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"
# }
# )
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
@@ -135,15 +134,6 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Client disconnected")
await task.cancel()
# Register event handler for transcript updates
@transcript.event_handler("on_transcript_update")
async def on_transcript_update(processor, frame):
for msg in frame.messages:
if isinstance(msg, (ThoughtTranscriptionMessage, TranscriptionMessage)):
timestamp = f"[{msg.timestamp}] " if msg.timestamp else ""
role = "THOUGHT" if isinstance(msg, ThoughtTranscriptionMessage) else msg.role
logger.info(f"Transcript: {timestamp}{role}: {msg.content}")
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
@@ -156,6 +146,14 @@ async def bot(runner_args: RunnerArguments):
if __name__ == "__main__":
if not os.getenv("MCP_RUN_SSE_URL"):
logger.error(
f"Please set MCP_RUN_SSE_URL environment variable for this example. See https://mcp.run"
)
import sys
sys.exit(1)
from pipecat.runner.run import main
main()

@@ -7,7 +7,6 @@
import asyncio
import io
import json
import os
import re
import shutil
@@ -16,7 +15,7 @@ import aiohttp
from dotenv import load_dotenv
from loguru import logger
from mcp import StdioServerParameters
from mcp.client.session_group import StreamableHttpParameters
from mcp.client.session_group import SseServerParameters
from PIL import Image
from pipecat.adapters.schemas.tools_schema import ToolsSchema
@@ -67,14 +66,11 @@ class UrlToImageProcessor(FrameProcessor):
await self.push_frame(frame, direction)
def extract_url(self, text: str):
try:
data = json.loads(text)
if "artObject" in data:
return data["artObject"]["webImage"]["url"]
if "artworks" in data and len(data["artworks"]):
return data["artworks"][0]["webImage"]["url"]
except:
pass
pattern = r"!\[[^\]]*\]\((https?://[^)]+\.(png|jpg|jpeg|PNG|JPG|JPEG|gif))\)"
match = re.search(pattern, text)
if match:
return match.group(1)
return None
async def run_image_process(self, image_url: str):
try:
@@ -136,11 +132,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
system = f"""
You are a helpful LLM in a WebRTC call.
Your goal is to demonstrate your capabilities in a succinct way.
You have access to tools to search the Rijksmuseum collection and the user's GitHub repositories and account.
Offer, for example, to show a floral still life, use the `search_artwork` tool.
You have access to tools to search the Rijksmuseum collection.
Offer, for example, to show the earliest Rembrandt work from the museum. Use the `search_artwork` tool.
The tool may respond with a JSON object with an `artworks` array. Choose the art from that array.
Once the tool has responded, tell the user the title and use the `open_image_in_browser` tool.
You can also offer to answer users questions about their GitHub repositories and account.
Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points.
Respond to what the user said in a creative and helpful way.
Don't overexplain what you are doing.
@@ -150,11 +145,11 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
messages = [{"role": "system", "content": system}]
try:
rijksmuseum_mcp = MCPClient(
mcp = MCPClient(
server_params=StdioServerParameters(
command=shutil.which("npx"),
# https://github.com/r-huijts/rijksmuseum-mcp
args=["-y", "mcp-server-rijksmuseum"],
args=["-y", "mcp-server-error setting up mcp"],
env={"RIJKSMUSEUM_API_KEY": os.getenv("RIJKSMUSEUM_API_KEY")},
)
)
@@ -162,32 +157,24 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.error(f"error setting up rijksmuseum mcp")
logger.exception("error trace:")
try:
# Github MCP docs: https://github.com/github/github-mcp-server
# Enable Github Copilot on your GitHub account. Free tier is ok. (https://github.com/settings/copilot)
# Generate a personal access token. It must be a Fine-grained token, classic tokens are not supported. (https://github.com/settings/personal-access-tokens)
# Set permissions you want to use (eg. "all repositories", "profile: read/write", etc)
github_mcp = MCPClient(
server_params=StreamableHttpParameters(
url="https://api.githubcopilot.com/mcp/",
headers={
"Authorization": f"Bearer {os.getenv('GITHUB_PERSONAL_ACCESS_TOKEN')}"
},
)
)
# https://docs.mcp.run/integrating/tutorials/mcp-run-sse-openai-agents/
# ie. "https://www.mcp.run/api/mcp/sse?..."
# ensure the profile has a tool or few installed
mcp_run = MCPClient(server_params=SseServerParameters(url=os.getenv("MCP_RUN_SSE_URL")))
except Exception as e:
logger.error(f"error setting up mcp.run")
logger.exception("error trace:")
rijksmuseum_tools = {}
github_tools = {}
tools = {}
run_tools = {}
try:
rijksmuseum_tools = await rijksmuseum_mcp.register_tools(llm)
github_tools = await github_mcp.register_tools(llm)
tools = await mcp.register_tools(llm)
run_tools = await mcp_run.register_tools(llm)
except Exception as e:
logger.error(f"error registering tools")
logger.exception("error trace:")
all_standard_tools = rijksmuseum_tools.standard_tools + github_tools.standard_tools
all_standard_tools = run_tools.standard_tools + tools.standard_tools
all_tools = ToolsSchema(standard_tools=all_standard_tools)
context = LLMContext(messages, all_tools)
@@ -239,9 +226,9 @@ async def bot(runner_args: RunnerArguments):
if __name__ == "__main__":
if not os.getenv("RIJKSMUSEUM_API_KEY") or not os.getenv("GITHUB_PERSONAL_ACCESS_TOKEN"):
if not os.getenv("RIJKSMUSEUM_API_KEY") or not os.getenv("MCP_RUN_SSE_URL"):
logger.error(
f"Please set `RIJKSMUSEUM_API_KEY` and `GITHUB_PERSONAL_ACCESS_TOKEN` environment variables. See https://github.com/r-huijts/rijksmuseum-mcp."
f"Please set RIJKSMUSEUM_API_KEY and MCP_RUN_SSE_URL environment variables. See https://github.com/r-huijts/rijksmuseum-mcp and https://mcp.run"
)
import sys
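
Review note on the regex-based `extract_url` in this file: the pattern pulls the URL out of a Markdown image link such as `![alt](https://host/img.jpg)`. The sketch below wraps the same pattern in a standalone function so its behavior can be checked in isolation; the function name and the sample strings are assumptions made for illustration.

```python
import re
from typing import Optional

# Same pattern as in the diff: a Markdown image whose URL ends in a
# recognized image extension. Group 1 captures the full URL.
MARKDOWN_IMAGE = re.compile(
    r"!\[[^\]]*\]\((https?://[^)]+\.(png|jpg|jpeg|PNG|JPG|JPEG|gif))\)"
)


def extract_markdown_image_url(text: str) -> Optional[str]:
    """Return the first Markdown image URL in `text`, or None if absent."""
    match = MARKDOWN_IMAGE.search(text)
    if match:
        return match.group(1)
    return None


sample = "Here it is: ![Night Watch](https://example.com/nightwatch.jpg)"
print(extract_markdown_image_url(sample))
```

Note that the extension list is case-sensitive except for the spelled-out upper-case variants, so a mixed-case extension such as `.Jpg` or an unlisted one such as `.webp` would not match.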

@@ -5,9 +5,7 @@
#
import asyncio
import os
import random
from datetime import datetime
from dotenv import load_dotenv
@@ -35,21 +33,11 @@ load_dotenv(override=True)
async def fetch_weather_from_api(params: FunctionCallParams):
temperature = (
random.randint(60, 85)
if params.arguments["format"] == "fahrenheit"
else random.randint(15, 30)
)
# Simulate a long network delay.
# You can continue chatting while waiting for this to complete.
# With Nova 2 Sonic (the default model), the assistant will respond
# appropriately once the function call is complete.
await asyncio.sleep(5)
temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
await params.result_callback(
{
"conditions": "nice",
"temperature": temperature,
"location": params.arguments["location"],
"format": params.arguments["format"],
"timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
}
@@ -103,31 +91,23 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
# Specify initial system instruction.
# HACK: note that, for now, we need to inject a special bit of text into this instruction to
# allow the first assistant response to be programmatically triggered (which happens in the
# on_client_connected handler, below)
system_instruction = (
"You are a friendly assistant. The user and you will engage in a spoken dialog exchanging "
"the transcripts of a natural real-time conversation. Keep your responses short, generally "
"two or three sentences for chatty scenarios."
# HACK: if using the older Nova Sonic (pre-2) model, note that you need to inject a special
# bit of text into this instruction to allow the first assistant response to be
# programmatically triggered (which happens in the on_client_connected handler)
# f"{AWSNovaSonicLLMService.AWAIT_TRIGGER_ASSISTANT_RESPONSE_INSTRUCTION}"
"two or three sentences for chatty scenarios. "
f"{AWSNovaSonicLLMService.AWAIT_TRIGGER_ASSISTANT_RESPONSE_INSTRUCTION}"
)
# Create the AWS Nova Sonic LLM service
llm = AWSNovaSonicLLMService(
secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
# as of 2025-12-09, these are the supported regions:
# - Nova 2 Sonic (the default model):
# - us-east-1
# - us-west-2
# - ap-northeast-1
# - Nova Sonic (the older model):
# - us-east-1
# - ap-northeast-1
region=os.getenv("AWS_REGION"),
region=os.getenv("AWS_REGION"), # as of 2025-05-06, us-east-1 is the only supported region
session_token=os.getenv("AWS_SESSION_TOKEN"),
voice_id="tiffany",
voice_id="tiffany", # matthew, tiffany, amy
# you could choose to pass instruction here rather than via context
# system_instruction=system_instruction
# you could choose to pass tools here rather than via context
@@ -137,9 +117,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
# Register function for function calls
# you can either register a single function for all function calls, or specific functions
# llm.register_function(None, fetch_weather_from_api)
llm.register_function(
"get_current_weather", fetch_weather_from_api, cancel_on_interruption=False
)
llm.register_function("get_current_weather", fetch_weather_from_api)
# Set up context and context management.
context = LLMContext(
@@ -181,10 +159,10 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Client connected")
# Kick off the conversation.
await task.queue_frames([LLMRunFrame()])
# HACK: if using the older Nova Sonic (pre-2) model, you need this special way of
# triggering the first assistant response. Note that this trigger requires a special
# corresponding bit of text in the system instruction.
# await llm.trigger_assistant_response()
# HACK: for now, we need this special way of triggering the first assistant response in AWS
# Nova Sonic. Note that this trigger requires a special corresponding bit of text in the
# system instruction. In the future, simply queueing the context frame should be sufficient.
await llm.trigger_assistant_response()
# Handle client disconnection events
@transport.event_handler("on_client_disconnected")
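
Review note on `fetch_weather_from_api` in this file: the handler reads `params.arguments` and reports its result through `params.result_callback`, optionally after a simulated delay. The sketch below exercises that pattern without Pipecat installed. `FakeFunctionCallParams`, the `delay_secs` parameter, and `run_once` are hypothetical helpers for illustration only; they are not part of the library.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Awaitable, Callable, Dict


@dataclass
class FakeFunctionCallParams:
    """Hypothetical stand-in exposing only the two attributes the handler uses."""

    arguments: Dict[str, Any]
    result_callback: Callable[[Dict[str, Any]], Awaitable[None]]


async def fetch_weather_from_api(params, delay_secs: float = 0.0) -> None:
    # Simulate network latency (the diff uses a fixed 5 second sleep).
    await asyncio.sleep(delay_secs)
    temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
    await params.result_callback(
        {
            "conditions": "nice",
            "temperature": temperature,
            "location": params.arguments["location"],
            "format": params.arguments["format"],
            "timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
        }
    )


def run_once(location: str, fmt: str) -> Dict[str, Any]:
    """Call the handler once and return what it passed to the callback."""
    results = []

    async def collect(result: Dict[str, Any]) -> None:
        results.append(result)

    params = FakeFunctionCallParams({"location": location, "format": fmt}, collect)
    asyncio.run(fetch_weather_from_api(params))
    return results[0]


print(run_once("Paris", "celsius")["temperature"])
```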

@@ -25,7 +25,7 @@ from pipecat.runner.utils import create_transport
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.google.llm import GoogleLLMService
from pipecat.services.heygen.client import ServiceType
from pipecat.services.heygen.api import AvatarQuality, NewSessionRequest
from pipecat.services.heygen.video import HeyGenVideoService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams, DailyTransport
@@ -73,9 +73,11 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
llm = GoogleLLMService(api_key=os.getenv("GOOGLE_API_KEY"))
heyGen = HeyGenVideoService(
api_key=os.getenv("HEYGEN_LIVE_AVATAR_API_KEY"),
service_type=ServiceType.LIVE_AVATAR,
api_key=os.getenv("HEYGEN_API_KEY"),
session=session,
session_request=NewSessionRequest(
avatar_id="Shawn_Therapist_public", version="v2", quality=AvatarQuality.high
),
)
messages = [

@@ -113,12 +113,8 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Client disconnected")
await task.cancel()
@voicemail.event_handler("on_conversation_detected")
async def on_conversation_detected(processor):
logger.info("Conversation detected!")
@voicemail.event_handler("on_voicemail_detected")
async def on_voicemail_detected(processor):
async def handle_voicemail(processor):
logger.info("Voicemail detected! Leaving a message...")
# Push frames using standard Pipecat pattern

@@ -1,221 +0,0 @@
#
# Copyright (c) 2024–2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import datetime
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.llm_service import FunctionCallParams
from pipecat.services.ultravox.llm import OneShotInputParams, UltravoxRealtimeLLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
# Load environment variables
load_dotenv(override=True)
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
),
}
async def get_secret_menu(params: FunctionCallParams):
category = params.arguments.get("category", "both")
logger.debug(f"Fetching secret menu with category: {category}")
items = []
if category in {"donuts", "both"}:
items.append(
{
"name": "Butter Pecan Ice Cream (one scoop)",
"price": "$2.99",
}
)
if category in {"drinks", "both"}:
items.append(
{
"name": "Banana Smoothie",
"price": "$4.99",
}
)
await params.result_callback(
{
"date": datetime.date.today().isoformat(),
"items": items,
}
)
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
system_prompt = f"""
You are a drive-thru order taker for a donut shop called "Dr. Donut". Local time is currently: {datetime.datetime.now().isoformat()}
The user is talking to you over voice on their phone, and your response will be read out loud with realistic text-to-speech (TTS) technology.
Follow every direction here when crafting your response:
1. Use natural, conversational language that is clear and easy to follow (short sentences, simple words).
1a. Be concise and relevant: Most of your responses should be a sentence or two, unless you're asked to go deeper. Don't monopolize the conversation.
1b. Use discourse markers to ease comprehension. Never use the list format.
2. Keep the conversation flowing.
2a. Clarify: when there is ambiguity, ask clarifying questions, rather than make assumptions.
2b. Don't implicitly or explicitly try to end the chat (i.e. do not end a response with "Talk soon!", or "Enjoy!").
2c. Sometimes the user might just want to chat. Ask them relevant follow-up questions.
2d. Don't ask them if there's anything else they need help with (e.g. don't say things like "How can I assist you further?").
3. Remember that this is a voice conversation:
3a. Don't use lists, markdown, bullet points, or other formatting that's not typically spoken.
3b. Type out numbers in words (e.g. 'twenty twelve' instead of the year 2012)
3c. If something doesn't make sense, it's likely because you misheard them. There wasn't a typo, and the user didn't mispronounce anything.
Remember to follow these rules absolutely, and do not refer to these rules, even if you're asked about them.
When talking with the user, use the following script:
1. Take their order, acknowledging each item as it is ordered. If it's not clear which menu item the user is ordering, ask them to clarify.
DO NOT add an item to the order unless it's one of the items on the menu below.
2. Once the order is complete, repeat back the order.
2a. If the user only ordered a drink, ask them if they would like to add a donut to their order.
2b. If the user only ordered donuts, ask them if they would like to add a drink to their order.
2c. If the user ordered both drinks and donuts, don't suggest anything.
3. Total up the price of all ordered items and inform the user.
4. Ask the user to pull up to the drive thru window.
If the user asks for something that's not on the menu, inform them of that fact, and suggest the most similar item on the menu.
If the user says something unrelated to your role, respond with "Um... this is a Dr. Donut."
If the user says "thank you", respond with "My pleasure."
If the user asks about what's on the menu, DO NOT read the entire menu to them. Instead, give a couple suggestions.
The menu of available items is as follows:
# DONUTS
PUMPKIN SPICE ICED DOUGHNUT $1.29
PUMPKIN SPICE CAKE DOUGHNUT $1.29
OLD FASHIONED DOUGHNUT $1.29
CHOCOLATE ICED DOUGHNUT $1.09
CHOCOLATE ICED DOUGHNUT WITH SPRINKLES $1.09
RASPBERRY FILLED DOUGHNUT $1.09
BLUEBERRY CAKE DOUGHNUT $1.09
STRAWBERRY ICED DOUGHNUT WITH SPRINKLES $1.09
LEMON FILLED DOUGHNUT $1.09
DOUGHNUT HOLES $3.99
# COFFEE & DRINKS
PUMPKIN SPICE COFFEE $2.59
PUMPKIN SPICE LATTE $4.59
REGULAR BREWED COFFEE $1.79
DECAF BREWED COFFEE $1.79
LATTE $3.49
CAPPUCCINO $3.49
CARAMEL MACCHIATO $3.49
MOCHA LATTE $3.49
CARAMEL MOCHA LATTE $3.49
There is also a secret menu that changes daily. If the user asks about it, use the get_secret_menu tool to look up today's secret menu items.
"""
secret_menu_function = FunctionSchema(
name="get_secret_menu",
description="Get today's secret menu items",
properties={
"category": {
"type": "string",
"enum": ["donuts", "drinks", "both"],
"description": "The category of secret menu items to retrieve. Defaults to both.",
},
},
required=[],
)
llm = UltravoxRealtimeLLMService(
params=OneShotInputParams(
api_key=os.getenv("ULTRAVOX_API_KEY"),
system_prompt=system_prompt,
temperature=0.3,
max_duration=datetime.timedelta(minutes=3),
),
one_shot_selected_tools=ToolsSchema(standard_tools=[secret_menu_function]),
)
llm.register_function("get_secret_menu", get_secret_menu)
# Necessary to complete the function call lifecycle in Pipecat.
context_aggregator = LLMContextAggregatorPair(LLMContext([]))
# Build the pipeline
pipeline = Pipeline(
[
transport.input(),
context_aggregator.user(),
llm,
context_aggregator.assistant(),
transport.output(),
]
)
# Configure the pipeline task
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
# Handle client connection event
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
# Handle client disconnection events
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info(f"Client disconnected")
await task.cancel()
# Run the pipeline
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()


@@ -1,167 +0,0 @@
#
# Copyright (c) 2024–2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame, ThoughtTranscriptionMessage, TranscriptionMessage
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.google.llm import GoogleLLMService
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
}
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
)
llm = GoogleLLMService(
api_key=os.getenv("GOOGLE_API_KEY"),
# model="gemini-3-pro-preview", # A more powerful reasoning model, but slower
params=GoogleLLMService.InputParams(
thinking=GoogleLLMService.ThinkingConfig(
# thinking_level="low", # Use this field instead of thinking_budget for Gemini 3 Pro. Defaults to "high".
thinking_budget=-1, # Dynamic thinking
include_thoughts=True,
)
),
)
transcript = TranscriptProcessor(process_thoughts=True)
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages)
context_aggregator = LLMContextAggregatorPair(context)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt,
transcript.user(), # User transcripts
context_aggregator.user(), # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
transcript.assistant(), # Assistant transcripts (including thoughts)
context_aggregator.assistant(), # Assistant spoken responses
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
# Kick off the conversation.
messages.append(
{
"role": "user",
"content": "Say hello briefly.",
}
)
# Replace the above with one of these example prompts to demonstrate
# thinking.
# These examples come from Gemini and Anthropic docs.
# messages.append(
# {
# "role": "user",
# "content": "Analogize photosynthesis and growing up. Keep your answer concise.",
# # "content": "Compare and contrast electric cars and hybrid cars."
# # "content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"
# }
# )
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info(f"Client disconnected")
await task.cancel()
# Register event handler for transcript updates
@transcript.event_handler("on_transcript_update")
async def on_transcript_update(processor, frame):
for msg in frame.messages:
if isinstance(msg, (ThoughtTranscriptionMessage, TranscriptionMessage)):
timestamp = f"[{msg.timestamp}] " if msg.timestamp else ""
role = "THOUGHT" if isinstance(msg, ThoughtTranscriptionMessage) else msg.role
logger.info(f"Transcript: {timestamp}{role}: {msg.content}")
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()


@@ -1,185 +0,0 @@
#
# Copyright (c) 2024–2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame, ThoughtTranscriptionMessage, TranscriptionMessage
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.anthropic.llm import AnthropicLLMService
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.llm_service import FunctionCallParams
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
async def check_flight_status(params: FunctionCallParams, flight_number: str):
"""Check the status of a flight. Returns status (e.g., "on time", "delayed") and departure time.
Args:
flight_number (str): The flight number, e.g. "AA100".
"""
await params.result_callback({"status": "delayed", "departure_time": "14:30"})
async def book_taxi(params: FunctionCallParams, time: str):
"""Book a taxi for a given time. Returns status (e.g., "done").
Args:
time (str): The time to book the taxi for, e.g. "15:00".
"""
await params.result_callback({"status": "done"})
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
}
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
)
llm = AnthropicLLMService(
api_key=os.getenv("ANTHROPIC_API_KEY"),
params=AnthropicLLMService.InputParams(
thinking=AnthropicLLMService.ThinkingConfig(type="enabled", budget_tokens=2048)
),
)
llm.register_direct_function(check_flight_status)
llm.register_direct_function(book_taxi)
tools = ToolsSchema(standard_tools=[check_flight_status, book_taxi])
transcript = TranscriptProcessor(process_thoughts=True)
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages, tools)
context_aggregator = LLMContextAggregatorPair(context)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt,
transcript.user(), # User transcripts
context_aggregator.user(), # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
transcript.assistant(), # Assistant transcripts (including thoughts)
context_aggregator.assistant(), # Assistant spoken responses
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
# Kick off the conversation.
messages.append(
{
"role": "user",
"content": "Say hello briefly.",
}
)
# Here is an example prompt conducive to demonstrating thinking and
# function calling.
# This example comes from Gemini docs.
# messages.append(
# {
# "role": "user",
# "content": "Check the status of flight AA100 and, if it's delayed, book me a taxi 2 hours before its departure time.",
# }
# )
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info(f"Client disconnected")
await task.cancel()
@transcript.event_handler("on_transcript_update")
async def on_transcript_update(processor, frame):
for msg in frame.messages:
if isinstance(msg, (ThoughtTranscriptionMessage, TranscriptionMessage)):
timestamp = f"[{msg.timestamp}] " if msg.timestamp else ""
role = "THOUGHT" if isinstance(msg, ThoughtTranscriptionMessage) else msg.role
logger.info(f"Transcript: {timestamp}{role}: {msg.content}")
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()


@@ -1,190 +0,0 @@
#
# Copyright (c) 2024–2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
from dotenv import load_dotenv
from loguru import logger
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import LLMRunFrame, ThoughtTranscriptionMessage, TranscriptionMessage
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.transcript_processor import TranscriptProcessor
from pipecat.runner.types import RunnerArguments
from pipecat.runner.utils import create_transport
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.google.llm import GoogleLLMService
from pipecat.services.llm_service import FunctionCallParams
from pipecat.transports.base_transport import BaseTransport, TransportParams
from pipecat.transports.daily.transport import DailyParams
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
load_dotenv(override=True)
async def check_flight_status(params: FunctionCallParams, flight_number: str):
"""Check the status of a flight. Returns status (e.g., "on time", "delayed") and departure time.
Args:
flight_number (str): The flight number, e.g. "AA100".
"""
await params.result_callback({"status": "delayed", "departure_time": "14:30"})
async def book_taxi(params: FunctionCallParams, time: str):
"""Book a taxi for a given time. Returns status (e.g., "done").
Args:
time (str): The time to book the taxi for, e.g. "15:00".
"""
await params.result_callback({"status": "done"})
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
# instantiated. The function will be called when the desired transport gets
# selected.
transport_params = {
"daily": lambda: DailyParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"twilio": lambda: FastAPIWebsocketParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
"webrtc": lambda: TransportParams(
audio_in_enabled=True,
audio_out_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
turn_analyzer=LocalSmartTurnAnalyzerV3(params=SmartTurnParams()),
),
}
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
logger.info(f"Starting bot")
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
)
llm = GoogleLLMService(
api_key=os.getenv("GOOGLE_API_KEY"),
# model="gemini-3-pro-preview", # A more powerful reasoning model, but slower
params=GoogleLLMService.InputParams(
thinking=GoogleLLMService.ThinkingConfig(
# thinking_level="low", # Use this field instead of thinking_budget for Gemini 3 Pro. Defaults to "high".
thinking_budget=-1, # Dynamic thinking
include_thoughts=True,
)
),
)
llm.register_direct_function(check_flight_status)
llm.register_direct_function(book_taxi)
tools = ToolsSchema(standard_tools=[check_flight_status, book_taxi])
transcript = TranscriptProcessor(process_thoughts=True)
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
},
]
context = LLMContext(messages, tools)
context_aggregator = LLMContextAggregatorPair(context)
pipeline = Pipeline(
[
transport.input(), # Transport user input
stt,
transcript.user(), # User transcripts
context_aggregator.user(), # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
transcript.assistant(), # Assistant transcripts (including thoughts)
context_aggregator.assistant(), # Assistant spoken responses
]
)
task = PipelineTask(
pipeline,
params=PipelineParams(
enable_metrics=True,
enable_usage_metrics=True,
),
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
)
@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
logger.info(f"Client connected")
# Kick off the conversation.
messages.append(
{
"role": "user",
"content": "Say hello briefly.",
}
)
# Replace the above with one of these example prompts to demonstrate
# thinking and function calling.
# This example comes from Gemini docs.
# messages.append(
# {
# "role": "user",
# "content": "Check the status of flight AA100 and, if it's delayed, book me a taxi 2 hours before its departure time.",
# }
# )
await task.queue_frames([LLMRunFrame()])
@transport.event_handler("on_client_disconnected")
async def on_client_disconnected(transport, client):
logger.info(f"Client disconnected")
await task.cancel()
@transcript.event_handler("on_transcript_update")
async def on_transcript_update(processor, frame):
for msg in frame.messages:
if isinstance(msg, (ThoughtTranscriptionMessage, TranscriptionMessage)):
timestamp = f"[{msg.timestamp}] " if msg.timestamp else ""
role = "THOUGHT" if isinstance(msg, ThoughtTranscriptionMessage) else msg.role
logger.info(f"Transcript: {timestamp}{role}: {msg.content}")
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
await runner.run(task)
async def bot(runner_args: RunnerArguments):
"""Main bot entry point compatible with Pipecat Cloud."""
transport = await create_transport(runner_args, transport_params)
await run_bot(transport, runner_args)
if __name__ == "__main__":
from pipecat.runner.run import main
main()


@@ -45,63 +45,61 @@ Source = "https://github.com/pipecat-ai/pipecat"
Website = "https://pipecat.ai"
[project.optional-dependencies]
aic = [ "aic-sdk~=1.2.0" ]
aic = [ "aic-sdk~=1.1.0" ]
anthropic = [ "anthropic~=0.49.0" ]
assemblyai = [ "pipecat-ai[websockets-base]" ]
asyncai = [ "pipecat-ai[websockets-base]" ]
aws = [ "aioboto3~=15.5.0", "pipecat-ai[websockets-base]" ]
aws-nova-sonic = [ "aws_sdk_bedrock_runtime~=0.2.0; python_version>='3.12'" ]
aws = [ "aioboto3~=15.0.0", "pipecat-ai[websockets-base]" ]
aws-nova-sonic = [ "aws_sdk_bedrock_runtime~=0.1.1; python_version>='3.12'" ]
azure = [ "azure-cognitiveservices-speech~=1.42.0"]
cartesia = [ "cartesia~=2.0.3", "pipecat-ai[websockets-base]" ]
cerebras = []
daily = [ "daily-python~=0.22.0" ]
deepgram = [ "deepgram-sdk~=4.7.0", "pipecat-ai[websockets-base]" ]
deepseek = []
daily = [ "daily-python~=0.21.0" ]
deepgram = [ "deepgram-sdk~=4.7.0" ]
elevenlabs = [ "pipecat-ai[websockets-base]" ]
fal = [ "fal-client~=0.5.9" ]
fireworks = []
fish = [ "ormsgpack~=1.7.0", "pipecat-ai[websockets-base]" ]
gladia = [ "pipecat-ai[websockets-base]" ]
google = [ "google-cloud-speech>=2.33.0,<3", "google-cloud-texttospeech>=2.31.0,<3", "google-genai>=1.51.0,<2", "pipecat-ai[websockets-base]" ]
gradium = [ "pipecat-ai[websockets-base]" ]
google = [ "google-cloud-speech>=2.33.0,<3", "google-cloud-texttospeech>=2.31.0,<3", "google-genai>=1.41.0,<2", "pipecat-ai[websockets-base]" ]
grok = []
groq = [ "groq~=0.23.0" ]
gstreamer = [ "pygobject~=3.50.0" ]
heygen = [ "livekit>=1.0.13", "pipecat-ai[websockets-base]" ]
hume = [ "hume>=0.11.2" ]
inworld = []
koala = [ "pvkoala~=2.0.3" ]
krisp = [ "pipecat-ai-krisp~=0.4.0" ]
koala = [ "pvkoala~=2.0.3" ]
langchain = [ "langchain~=0.3.20", "langchain-community~=0.3.20", "langchain-openai~=0.3.9" ]
livekit = [ "livekit~=1.0.13", "livekit-api~=1.0.5", "tenacity>=8.2.3,<10.0.0", "pyjwt>=2.10.1" ]
livekit = [ "livekit~=1.0.13", "livekit-api~=1.0.5", "tenacity>=8.2.3,<10.0.0" ]
lmnt = [ "pipecat-ai[websockets-base]" ]
local = [ "pyaudio~=0.2.14" ]
local-smart-turn = [ "coremltools>=8.0", "transformers", "torch>=2.5.0,<3", "torchaudio>=2.5.0,<3" ]
local-smart-turn-v3 = [ "transformers", "onnxruntime>=1.20.1,<2" ]
mcp = [ "mcp[cli]>=1.11.0,<2" ]
mem0 = [ "mem0ai~=0.1.94" ]
mistral = []
mlx-whisper = [ "mlx-whisper~=0.4.2" ]
moondream = [ "accelerate~=1.10.0", "einops~=0.8.0", "pyvips[binary]~=3.0.0", "timm~=1.0.13", "transformers>=4.48.0" ]
nim = []
neuphonic = [ "pipecat-ai[websockets-base]" ]
noisereduce = [ "noisereduce~=3.0.3" ]
nvidia = [ "nvidia-riva-client~=2.21.1" ]
openai = [ "pipecat-ai[websockets-base]" ]
openpipe = [ "openpipe>=4.50.0,<6" ]
openrouter = []
perplexity = []
playht = [ "pipecat-ai[websockets-base]" ]
qwen = []
remote-smart-turn = []
rime = [ "pipecat-ai[websockets-base]" ]
riva = [ "pipecat-ai[nvidia]" ]
riva = [ "nvidia-riva-client~=2.21.1" ]
runner = [ "python-dotenv>=1.0.0,<2.0.0", "uvicorn>=0.32.0,<1.0.0", "fastapi>=0.115.6,<0.122.0", "pipecat-ai-small-webrtc-prebuilt>=1.0.0"]
sagemaker = ["aws_sdk_sagemaker_runtime_http2; python_version>='3.12'"]
sambanova = []
sarvam = [ "sarvamai==0.1.21", "pipecat-ai[websockets-base]" ]
sentry = [ "sentry-sdk>=2.28.0,<3" ]
local-smart-turn = [ "coremltools>=8.0", "transformers", "torch>=2.5.0,<3", "torchaudio>=2.5.0,<3" ]
local-smart-turn-v3 = [ "transformers", "onnxruntime>=1.20.1,<2" ]
remote-smart-turn = []
silero = [ "onnxruntime>=1.20.1,<2" ]
simli = [ "simli-ai~=1.0.3"]
simli = [ "simli-ai~=0.1.25"]
soniox = [ "pipecat-ai[websockets-base]" ]
soundfile = [ "soundfile~=0.13.1" ]
speechmatics = [ "speechmatics-rt>=0.5.0" ]
@@ -109,7 +107,7 @@ strands = [ "strands-agents>=1.9.1,<2" ]
tavus=[]
together = []
tracing = [ "opentelemetry-sdk>=1.33.0", "opentelemetry-api>=1.33.0", "opentelemetry-instrumentation>=0.54b0" ]
ultravox = [ "pipecat-ai[websockets-base]" ]
ultravox = [ "transformers>=4.48.0", "vllm>=0.9.0" ]
webrtc = [ "aiortc>=1.13.0,<2", "opencv-python>=4.11.0.86,<5" ]
websocket = [ "pipecat-ai[websockets-base]", "fastapi>=0.115.6,<0.122.0" ]
websockets-base = [ "websockets>=13.1,<16.0" ]
@@ -130,7 +128,6 @@ dev = [
"setuptools~=78.1.1",
"setuptools_scm~=8.3.1",
"python-dotenv>=1.0.1,<2.0.0",
"towncrier~=25.8.0",
]
docs = [
@@ -161,7 +158,7 @@ where = ["src"]
"src/pipecat/audio/dtmf/dtmf-star.wav",
]
"pipecat.services.aws_nova_sonic" = ["src/pipecat/services/aws_nova_sonic/ready.wav"]
"pipecat.audio.turn.smart_turn.data" = ["src/pipecat/audio/turn/smart_turn/data/smart-turn-v3.1-cpu.onnx"]
"pipecat.audio.turn.smart_turn.data" = ["src/pipecat/audio/turn/smart_turn/data/smart-turn-v3.0.onnx"]
[tool.pytest.ini_options]
addopts = "--verbose"
@@ -208,45 +205,3 @@ convention = "google"
command_line = "--module pytest"
source = ["src"]
omit = ["*/tests/*"]
[tool.towncrier]
package = "pipecat"
package_dir = "src"
filename = "CHANGELOG.md"
directory = "changelog"
start_string = "<!-- towncrier release notes start -->\n"
template = "changelog/_template.md.j2"
title_format = "## [{version}] - {project_date}"
issue_format = "[#{issue}](https://github.com/pipecat-ai/pipecat/pull/{issue})"
underlines = ["", "", ""]
wrap = true
[[tool.towncrier.type]]
directory = "added"
name = "Added"
showcontent = true
[[tool.towncrier.type]]
directory = "changed"
name = "Changed"
showcontent = true
[[tool.towncrier.type]]
directory = "deprecated"
name = "Deprecated"
showcontent = true
[[tool.towncrier.type]]
directory = "removed"
name = "Removed"
showcontent = true
[[tool.towncrier.type]]
directory = "fixed"
name = "Fixed"
showcontent = true
[[tool.towncrier.type]]
directory = "security"
name = "Security"
showcontent = true


@@ -31,13 +31,7 @@ from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import (
CancelFrame,
EndFrame,
EndTaskFrame,
LLMRunFrame,
OutputImageRawFrame,
)
from pipecat.frames.frames import EndTaskFrame, LLMRunFrame, OutputImageRawFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
@@ -56,7 +50,6 @@ SCRIPT_DIR = Path(__file__).resolve().parent
PIPELINE_IDLE_TIMEOUT_SECS = 60
EVAL_TIMEOUT_SECS = 120
EVAL_RESULT_TIMEOUT_SECS = 10
EvalPrompt = str | Tuple[str, ImageFile]
@@ -85,7 +78,7 @@ class EvalRunner:
self._log_level = log_level
self._total_success = 0
self._tests: List[EvalResult] = []
self._result_future: Optional[asyncio.Future[bool]] = None
self._queue = asyncio.Queue()
# We need to save runner files.
name = name or f"{datetime.now().strftime('%Y%m%d_%H%M%S')}"
@@ -95,16 +88,16 @@ class EvalRunner:
os.makedirs(self._logs_dir, exist_ok=True)
os.makedirs(self._recordings_dir, exist_ok=True)
async def function_assert_eval(self, params: FunctionCallParams):
async def assert_eval(self, params: FunctionCallParams):
result = params.arguments["result"]
reasoning = params.arguments["reasoning"]
logger.debug(f"🧠 EVAL REASONING(result: {result}): {reasoning}")
await self._queue.put(result)
await params.result_callback(None)
await params.llm.push_frame(EndTaskFrame(reason=result), FrameDirection.UPSTREAM)
await params.llm.push_frame(EndTaskFrame(), FrameDirection.UPSTREAM)
async def assert_eval(self, result: bool):
if self._result_future:
self._result_future.set_result(result)
async def assert_eval_false(self):
await self._queue.put(False)
async def run_eval(
self,
@@ -124,9 +117,6 @@ class EvalRunner:
start_time = time.time()
# Create a future to store the eval result.
self._result_future = asyncio.get_running_loop().create_future()
try:
tasks = [
asyncio.create_task(run_example_pipeline(script_path, eval_config)),
@@ -146,10 +136,8 @@ class EvalRunner:
logger.error(f"ERROR: Unable to run {example_file}: {e}")
try:
# Wait for the future to resolve.
result = await asyncio.wait_for(self._result_future, timeout=EVAL_RESULT_TIMEOUT_SECS)
result = await asyncio.wait_for(self._queue.get(), timeout=1.0)
except asyncio.TimeoutError:
logger.error(f"ERROR: Timeout waiting for eval result.")
result = False
if result:
@@ -256,25 +244,19 @@ async def run_eval_pipeline(
llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
llm.register_function("eval_function", eval_runner.function_assert_eval)
llm.register_function("eval_function", eval_runner.assert_eval)
eval_function = FunctionSchema(
name="eval_function",
description=(
"Determines whether the user's response satisfies the evaluation "
"criteria defined for the current prompt or interaction."
),
description="Called when the user answers a question.",
properties={
"result": {
"type": "boolean",
"description": "Whether the user's response meets the evaluation criteria.",
"description": "Whether the answer is correct or not",
},
"reasoning": {
"type": "string",
"description": (
"A concise explanation of how the user's response did or did "
"not satisfy the evaluation criteria."
),
"description": "Why the answer was considered correct or invalid",
},
},
required=["result", "reasoning"],
@@ -296,9 +278,9 @@ async def run_eval_pipeline(
"Ignore greetings, comments, non-answers, or requests for clarification."
)
if eval_config.eval_speaks_first:
system_prompt = f"You are an evaluation agent, be extremely brief. Numerical word answers are allowed. You will start the conversation by saying: '{example_prompt}'. {common_system_prompt}"
system_prompt = f"You are an evaluation agent, be extremely brief. You will start the conversation by saying: '{example_prompt}'. {common_system_prompt}"
else:
system_prompt = f"You are an evaluation agent, be extremely brief. Numerical word answers are allowed. First, ask one question: {example_prompt}. {common_system_prompt}"
system_prompt = f"You are an evaluation agent, be extremely brief. First, ask one question: {example_prompt}. {common_system_prompt}"
messages = [
{
@@ -364,12 +346,9 @@ async def run_eval_pipeline(
logger.info(f"Client disconnected")
await task.cancel()
@task.event_handler("on_pipeline_finished")
async def on_pipeline_finished(task, frame):
if isinstance(frame, EndFrame):
await eval_runner.assert_eval(frame.reason)
elif isinstance(frame, CancelFrame):
await eval_runner.assert_eval(False)
@task.event_handler("on_idle_timeout")
async def on_pipeline_idle_timeout(task):
await eval_runner.assert_eval_false()
# TODO(aleix): We should handle SIGINT and SIGTERM so we can cancel both the
# eval and the example.
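
The hunks above show the eval result being handed from the function-call handler to `run_eval` in two ways: through an `asyncio.Future` awaited with `asyncio.wait_for`, and through an `asyncio.Queue`. The following is a minimal sketch of the Future-based handoff only. `ResultBox` is a hypothetical helper that is not part of Pipecat, and the `done()` guard is an added assumption that does not appear in the diff.

```python
import asyncio
from typing import Optional, Tuple


class ResultBox:
    """Hypothetical helper (not a Pipecat class): holds one eval result."""

    def __init__(self) -> None:
        self._future: Optional[asyncio.Future] = None

    def arm(self) -> None:
        # Create a fresh future per eval run; requires a running event loop.
        self._future = asyncio.get_running_loop().create_future()

    def set(self, result: bool) -> None:
        # Assumption: silently ignore late or duplicate results rather than
        # letting set_result() raise InvalidStateError.
        if self._future is not None and not self._future.done():
            self._future.set_result(result)

    async def wait(self, timeout: float) -> bool:
        # Treat "never armed" and "timed out" as a failed eval.
        if self._future is None:
            return False
        try:
            return await asyncio.wait_for(self._future, timeout=timeout)
        except asyncio.TimeoutError:
            return False


async def demo() -> Tuple[bool, bool]:
    box = ResultBox()

    # Case 1: a result arrives before the timeout.
    box.arm()
    asyncio.get_running_loop().call_later(0.01, box.set, True)
    delivered = await box.wait(timeout=1.0)

    # Case 2: nothing sets the result, so the wait times out.
    box.arm()
    timed_out = await box.wait(timeout=0.05)
    return delivered, timed_out


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A future resolves at most once per run, which is why the sketch re-arms it before each wait. A queue, by contrast, can buffer a stale result from an earlier run if nothing drains it.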


@@ -30,13 +30,13 @@ EVAL_SIMPLE_MATH = EvalConfig(
)
EVAL_WEATHER = EvalConfig(
prompt="What's the weather in San Francisco? Temperature should be in Fahrenheit.",
eval="The user talks about the weather in San Francisco, including the degrees.",
prompt="What's the weather in San Francisco?",
eval="The user says something specific about the current weather in San Francisco, including the degrees.",
)
EVAL_ONLINE_SEARCH = EvalConfig(
prompt="What's the current date in UTC?",
eval=f"Current date in UTC is {datetime.now(timezone.utc).strftime('%A, %B %d, %Y')}.",
prompt="What's the date right now in London?",
eval=f"The user says today is {datetime.now(timezone.utc).strftime('%B %d, %Y')} in London.",
)
EVAL_SWITCH_LANGUAGE = EvalConfig(
@@ -64,21 +64,16 @@ def EVAL_VISION_IMAGE(*, eval_speaks_first: bool = False):
EVAL_VOICEMAIL = EvalConfig(
prompt="Please leave a message.",
eval="The user provides a reasonable voicemail message.",
eval="The user leaves a voicemail message.",
eval_speaks_first=True,
)
EVAL_CONVERSATION = EvalConfig(
prompt="Hello, this is Mark.",
eval="The user provides any reasonable conversational response to the greeting.",
eval="The user replies with a greeting.",
eval_speaks_first=True,
)
EVAL_FLIGHT_STATUS = EvalConfig(
prompt="Check the status of flight AA100.",
eval="The user says something about the status of flight AA100, such as whether it's on time or delayed.",
)
TESTS_07 = [
# 07 series
@@ -86,7 +81,6 @@ TESTS_07 = [
("07-interruptible-cartesia-http.py", EVAL_SIMPLE_MATH),
("07a-interruptible-speechmatics.py", EVAL_SIMPLE_MATH),
("07aa-interruptible-soniox.py", EVAL_SIMPLE_MATH),
("07ab-interruptible-inworld.py", EVAL_SIMPLE_MATH),
("07ab-interruptible-inworld-http.py", EVAL_SIMPLE_MATH),
("07ac-interruptible-asyncai.py", EVAL_SIMPLE_MATH),
("07ac-interruptible-asyncai-http.py", EVAL_SIMPLE_MATH),
@@ -109,7 +103,7 @@ TESTS_07 = [
("07o-interruptible-assemblyai.py", EVAL_SIMPLE_MATH),
("07q-interruptible-rime.py", EVAL_SIMPLE_MATH),
("07q-interruptible-rime-http.py", EVAL_SIMPLE_MATH),
("07r-interruptible-nvidia.py", EVAL_SIMPLE_MATH),
("07r-interruptible-riva-nim.py", EVAL_SIMPLE_MATH),
("07s-interruptible-google-audio-in.py", EVAL_SIMPLE_MATH),
("07t-interruptible-fish.py", EVAL_SIMPLE_MATH),
("07v-interruptible-neuphonic.py", EVAL_SIMPLE_MATH),
@@ -122,6 +116,8 @@ TESTS_07 = [
# ("07i-interruptible-xtts.py", EVAL_SIMPLE_MATH),
# Needs a Krisp license.
# ("07p-interruptible-krisp.py", EVAL_SIMPLE_MATH),
# Needs GPU resources.
# ("07u-interruptible-ultravox.py", EVAL_SIMPLE_MATH),
]
TESTS_12 = [
@@ -140,7 +136,7 @@ TESTS_14 = [
("14g-function-calling-grok.py", EVAL_WEATHER),
("14h-function-calling-azure.py", EVAL_WEATHER),
("14i-function-calling-fireworks.py", EVAL_WEATHER),
("14j-function-calling-nvidia.py", EVAL_WEATHER),
("14j-function-calling-nim.py", EVAL_WEATHER),
("14k-function-calling-cerebras.py", EVAL_WEATHER),
("14m-function-calling-openrouter.py", EVAL_WEATHER),
("14n-function-calling-perplexity.py", EVAL_WEATHER),
@@ -208,13 +204,6 @@ TESTS_44 = [
("44-voicemail-detection.py", EVAL_CONVERSATION),
]
TESTS_49 = [
("49a-thinking-anthropic.py", EVAL_SIMPLE_MATH),
("49b-thinking-google.py", EVAL_SIMPLE_MATH),
("49c-thinking-functions-anthropic.py", EVAL_FLIGHT_STATUS),
("49d-thinking-functions-google.py", EVAL_FLIGHT_STATUS),
]
TESTS = [
*TESTS_07,
*TESTS_12,
@@ -227,7 +216,6 @@ TESTS = [
*TESTS_40,
*TESTS_43,
*TESTS_44,
*TESTS_49,
]


@@ -5,20 +5,14 @@
#
import sys
from importlib.metadata import version as lib_version
from importlib.metadata import version
from loguru import logger
__version__ = lib_version("pipecat-ai")
__version__ = version("pipecat-ai")
logger.info(f"ᓚᘏᗢ Pipecat {__version__} (Python {sys.version}) ᓚᘏᗢ")
def version() -> str:
"""Returns the Pipecat version."""
return __version__
# We replace `asyncio.wait_for()` with `wait_for2.wait_for()` for Python < 3.12.
#
# In Python 3.12, `asyncio.wait_for()` is implemented in terms of


@@ -94,8 +94,6 @@ class AnthropicLLMAdapter(BaseLLMAdapter[AnthropicLLMInvocationParams]):
for item in msg["content"]:
if item["type"] == "image":
item["source"]["data"] = "..."
if item["type"] == "thinking" and item.get("signature"):
item["signature"] = "..."
messages_for_logging.append(msg)
return messages_for_logging
@@ -167,44 +165,9 @@ class AnthropicLLMAdapter(BaseLLMAdapter[AnthropicLLMInvocationParams]):
def _from_universal_context_message(self, message: LLMContextMessage) -> MessageParam:
if isinstance(message, LLMSpecificMessage):
return self._from_anthropic_specific_message(message)
return copy.deepcopy(message.message)
return self._from_standard_message(message)
def _from_anthropic_specific_message(self, message: LLMSpecificMessage) -> MessageParam:
"""Convert LLMSpecificMessage to Anthropic format.
Anthropic-specific messages may either be special thought messages that
need to be handled in a special way, or messages already in Anthropic
format.
Args:
message: Anthropic-specific message.
"""
# Handle special case of thought messages.
# These can be converted to standalone "assistant" messages; later
# these thinking messages will be properly merged into the assistant
# response messages before the context is sent to Anthropic for the
# next turn.
if (
isinstance(message.message, dict)
and message.message.get("type") == "thought"
and (text := message.message.get("text"))
and (signature := message.message.get("signature"))
):
return {
"role": "assistant",
"content": [
{
"type": "thinking",
"thinking": text,
"signature": signature,
}
],
}
# Fall back to assuming that the message is already in Anthropic format
return copy.deepcopy(message.message)
def _from_standard_message(self, message: LLMStandardMessage) -> MessageParam:
"""Convert standard universal context message to Anthropic format.
@@ -283,14 +246,11 @@ class AnthropicLLMAdapter(BaseLLMAdapter[AnthropicLLMInvocationParams]):
# handle image_url -> image conversion
if item["type"] == "image_url":
if item["image_url"]["url"].startswith("data:"):
# Extract MIME type from data URL (format: "data:image/jpeg;base64,...")
url = item["image_url"]["url"]
mime_type = url.split(":")[1].split(";")[0]
item["type"] = "image"
item["source"] = {
"type": "base64",
"media_type": mime_type,
"data": url.split(",")[1],
"media_type": "image/jpeg",
"data": item["image_url"]["url"].split(",")[1],
}
del item["image_url"]
elif item["image_url"]["url"].startswith("http"):
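
The two sides of this hunk differ in how the image MIME type is obtained: one parses it out of the data URL, the other hard-codes `image/jpeg`. As a minimal sketch of the parsing variant only (the helper names `parse_data_url` and `bedrock_image_format` are hypothetical and not part of the adapters, and the input is assumed to be a well-formed base64 data URL):

```python
import base64
from typing import Tuple


def parse_data_url(url: str) -> Tuple[str, bytes]:
    """Split a base64 data URL into (mime_type, decoded_bytes).

    Hypothetical helper for illustration; expects "data:<mime>;base64,<payload>".
    """
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    # Everything before the first comma is the header, e.g. "data:image/jpeg;base64".
    header, _, payload = url.partition(",")
    mime_type = header.split(":", 1)[1].split(";")[0]
    return mime_type, base64.b64decode(payload)


def bedrock_image_format(mime_type: str) -> str:
    """Return the MIME subtype, e.g. "image/png" -> "png" (hypothetical helper)."""
    return mime_type.split("/")[1]
```

With the hard-coded variant, a PNG passed as a data URL would be labelled as JPEG; whether the receiving service tolerates that mismatch is not something this diff states.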


@@ -257,15 +257,14 @@ class AWSBedrockLLMAdapter(BaseLLMAdapter[AWSBedrockLLMInvocationParams]):
# handle image_url -> image conversion
if item["type"] == "image_url":
if item["image_url"]["url"].startswith("data:"):
# Extract format from data URL (format: "data:image/jpeg;base64,...")
url = item["image_url"]["url"]
mime_type = url.split(":")[1].split(";")[0]
# Bedrock expects format like "jpeg", "png" etc., not "image/jpeg"
image_format = mime_type.split("/")[1]
new_item = {
"image": {
"format": image_format,
"source": {"bytes": base64.b64decode(url.split(",")[1])},
"format": "jpeg",
"source": {
"bytes": base64.b64decode(
item["image_url"]["url"].split(",")[1]
)
},
}
}
new_content.append(new_item)


@@ -151,8 +151,6 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
for part in obj["parts"]:
if "inline_data" in part:
part["inline_data"]["data"] = "..."
if "thought_signature" in part:
part["thought_signature"] = "..."
except Exception as e:
logger.debug(f"Error: {e}")
messages_for_logging.append(obj)
@@ -211,37 +209,16 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
system_instruction = None
messages = []
tool_call_id_to_name_mapping = {}
thought_signature_dicts = []
# Process each message, converting to Google format as needed
# Process each message, preserving Google-formatted messages and converting others
for message in universal_context_messages:
# We have a Google-specific message; this may either be a
# thought-signature-containing message that we need to handle in a
# special way, or a message already in Google format that we can
# use directly
if isinstance(message, LLMSpecificMessage):
if (
isinstance(message.message, dict)
and message.message.get("type") == "thought_signature"
):
thought_signature_dicts.append(message.message)
continue
# Fall back to assuming that the message is already in Google
# format
messages.append(message.message)
continue
# We have a standard universal context message; convert it to
# Google format
result = self._from_standard_message(
result = self._from_universal_context_message(
message,
params=self.MessageConversionParams(
already_have_system_instruction=bool(system_instruction),
tool_call_id_to_name_mapping=tool_call_id_to_name_mapping,
),
)
# Each result is either a Content or a system instruction
if result.content:
messages.append(result.content)
@@ -252,9 +229,6 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
if result.tool_call_id_to_name_mapping:
tool_call_id_to_name_mapping.update(result.tool_call_id_to_name_mapping)
# Apply thought signatures to the corresponding messages
self._apply_thought_signatures_to_messages(thought_signature_dicts, messages)
# Check if we only have function-related messages (no regular text)
has_regular_messages = any(
len(msg.parts) == 1
@@ -273,6 +247,13 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
return self.ConvertedMessages(messages=messages, system_instruction=system_instruction)
def _from_universal_context_message(
self, message: LLMContextMessage, *, params: MessageConversionParams
) -> MessageConversionResult:
if isinstance(message, LLMSpecificMessage):
return self.MessageConversionResult(content=message.message)
return self._from_standard_message(message, params=params)
def _from_standard_message(
self, message: LLMStandardMessage, *, params: MessageConversionParams
) -> MessageConversionResult:
@@ -399,14 +380,11 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
if c["type"] == "text":
parts.append(Part(text=c["text"]))
elif c["type"] == "image_url" and c["image_url"]["url"].startswith("data:"):
# Extract MIME type from data URL (format: "data:image/jpeg;base64,...")
url = c["image_url"]["url"]
mime_type = url.split(":")[1].split(";")[0]
parts.append(
Part(
inline_data=Blob(
mime_type=mime_type,
data=base64.b64decode(url.split(",")[1]),
mime_type="image/jpeg",
data=base64.b64decode(c["image_url"]["url"].split(",")[1]),
)
)
)
@@ -432,139 +410,3 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
content=Content(role=role, parts=parts),
tool_call_id_to_name_mapping=tool_call_id_to_name_mapping,
)
def _apply_thought_signatures_to_messages(
self, thought_signature_dicts: List[dict], messages: List[Content]
) -> None:
"""Apply thought signatures to corresponding assistant messages.
See GoogleLLMService for more details about thought signatures.
Args:
thought_signature_dicts: A list of dicts containing:
- "signature": a thought signature
- "bookmark": a bookmark to identify the message part to apply the signature to.
The bookmark may contain one of:
- "function_call" (a function call ID string)
- "text" (a text string)
- "inline_data" (a Blob)
The list of thought signature dicts is in order.
messages: List of messages to apply the thought signatures to.
"""
if not thought_signature_dicts:
return
# For debugging, print out thought signatures and their bookmarks
logger.debug(f"Thought signatures to apply: {len(thought_signature_dicts)}")
for ts in thought_signature_dicts:
bookmark = ts.get("bookmark")
if bookmark.get("function_call"):
logger.trace(f" - To function call: {bookmark['function_call']}")
elif bookmark.get("text"):
text = bookmark["text"]
log_display_text = f"{text[:50]}..." if len(text) > 50 else text
logger.trace(f" - To text: {log_display_text}")
elif bookmark.get("inline_data"):
logger.trace(f" - To inline data")
# Get all assistant messages
assistant_messages = [
message
for message in messages
if isinstance(message, Content) and message.role == "model"
]
# Apply thought signatures to the corresponding assistant messages.
# Thought signatures are already in message order.
thought_signatures_applied = 0
message_start_index = 0 # Track where to start searching for the next matching message.
for thought_signature_dict in thought_signature_dicts:
signature = thought_signature_dict.get("signature")
bookmark = thought_signature_dict.get("bookmark")
if not signature or not bookmark:
continue
# Search through remaining assistant messages for a match
for i in range(message_start_index, len(assistant_messages)):
message = assistant_messages[i]
if not message.parts:
continue
# We're assuming that the thought signature always applies to the last part
last_part = message.parts[-1]
# If the bookmark matches the part...
if self._thought_signature_bookmark_matches_part(bookmark, last_part):
# Apply the thought signature
last_part.thought_signature = signature
thought_signatures_applied += 1
# Update the start index and stop searching for a match
message_start_index = i + 1
break
# For debugging, print out how many thought signatures were applied
logger.debug(f"Applied {thought_signatures_applied} thought signatures.")
def _thought_signature_bookmark_matches_part(self, bookmark: dict, part: Part) -> bool:
if function_call_bookmark := bookmark.get("function_call"):
return self._thought_signature_function_call_bookmark_matches_part(
function_call_bookmark, part
)
elif text_bookmark := bookmark.get("text"):
return self._thought_signature_text_bookmark_matches_part(text_bookmark, part)
elif inline_data := bookmark.get("inline_data"):
return self._thought_signature_inline_data_bookmark_matches_part(inline_data, part)
else:
logger.warning(f"Unknown thought signature bookmark type: {bookmark}")
return False
def _thought_signature_function_call_bookmark_matches_part(
self, bookmark_function_call_id: str, part: Part
) -> bool:
if (
hasattr(part, "function_call")
and part.function_call
and part.function_call.id == bookmark_function_call_id
):
logger.trace(f"Thought signature function call match: {bookmark_function_call_id}")
return True
return False
def _thought_signature_text_bookmark_matches_part(self, bookmark_text: str, part: Part) -> bool:
if hasattr(part, "text") and part.text:
# Normalize whitespace for comparison
bookmark_text = " ".join(bookmark_text.split())
part_text = " ".join(part.text.split())
# Check that either:
# - the part text is the same as the bookmark text
# - a prefix of the bookmark text (in case the part text was truncated due to interruption)
# - the bookmark text is a prefix of the part text (in case the bookmark represents just first chunk of multi-chunk text)
if (
part_text == bookmark_text
or bookmark_text.startswith(part_text)
or part_text.startswith(bookmark_text)
):
log_display_text = f"{part.text[:50]}..." if len(part.text) > 50 else part.text
logger.trace(f"Thought signature text match: {log_display_text}")
return True
return False
def _thought_signature_inline_data_bookmark_matches_part(
self, bookmark_inline_data: Blob, part: Part
) -> bool:
if (
hasattr(part, "inline_data")
and part.inline_data
# Comparing length should be good enough for matching inline data,
# especially since we're already matching thought signatures in
# strict message order. Comparing actual data is expensive.
and len(part.inline_data.data) == len(bookmark_inline_data.data)
):
logger.trace(f"Thought signature inline data match")
return True
return False
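
The text-bookmark matching in this hunk (whitespace normalization, prefix matching in either direction, and a forward-only search) can be sketched on plain strings. This is a simplified illustration, not the adapter code: it drops the `Part`/`Content` types and logging, and the function names are hypothetical.

```python
from typing import List, Tuple


def text_bookmark_matches(bookmark_text: str, part_text: str) -> bool:
    """Return True if the bookmark and part text agree up to whitespace and truncation."""
    if not part_text:
        return False
    # Collapse all whitespace runs to single spaces before comparing.
    bookmark = " ".join(bookmark_text.split())
    part = " ".join(part_text.split())
    return (
        part == bookmark
        or bookmark.startswith(part)  # part was truncated (e.g. by an interruption)
        or part.startswith(bookmark)  # bookmark holds only the first chunk
    )


def match_in_order(bookmarks: List[str], parts: List[str]) -> List[Tuple[int, int]]:
    """Pair each bookmark with the first later part it matches, never moving backwards."""
    matches = []
    start = 0  # where to resume searching for the next bookmark
    for b_idx, bookmark in enumerate(bookmarks):
        for i in range(start, len(parts)):
            if text_bookmark_matches(bookmark, parts[i]):
                matches.append((b_idx, i))
                start = i + 1
                break
    return matches
```

Because the search index only advances, a bookmark that matches nothing is skipped without affecting later bookmarks, which mirrors the `message_start_index` handling above.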


@@ -39,7 +39,7 @@ class AICFilter(BaseAudioFilter):
self,
*,
license_key: str = "",
model_type: AICModelType = AICModelType.QUAIL_STT,
model_type: AICModelType = AICModelType.QUAIL_L,
enhancement_level: Optional[float] = 1.0,
voice_gain: Optional[float] = 1.0,
noise_gate_enable: Optional[bool] = True,
@@ -52,27 +52,12 @@ class AICFilter(BaseAudioFilter):
enhancement_level: Optional overall enhancement strength (0.0..1.0).
voice_gain: Optional linear gain applied to detected speech (0.0..4.0).
noise_gate_enable: Optional enable/disable noise gate (default: True).
.. deprecated:: 1.3.0
The `noise_gate_enable` parameter is deprecated and no longer has any effect.
It will be removed in a future version.
"""
self._license_key = license_key
self._model_type = model_type
self._enhancement_level = enhancement_level
self._voice_gain = voice_gain
if noise_gate_enable is not None:
import warnings
with warnings.catch_warnings():
warnings.simplefilter("always")
warnings.warn(
"Parameter `noise_gate_enable` is deprecated and no longer has any effect. "
"It will be removed in a future version. Use AIC VAD instead (create_vad_analyzer()).",
DeprecationWarning,
)
self._noise_gate_enable = noise_gate_enable
self._enabled = True
@@ -164,6 +149,10 @@ class AICFilter(BaseAudioFilter):
)
if self._voice_gain is not None:
self._aic.set_parameter(AICParameter.VOICE_GAIN, float(self._voice_gain))
if self._noise_gate_enable is not None:
self._aic.set_parameter(
AICParameter.NOISE_GATE_ENABLE, 1.0 if bool(self._noise_gate_enable) else 0.0
)
self._aic_ready = True


@@ -28,6 +28,7 @@ from pipecat.metrics.metrics import MetricsData, SmartTurnMetricsData
STOP_SECS = 3
PRE_SPEECH_MS = 0
MAX_DURATION_SECONDS = 8 # Max allowed segment duration
USE_ONLY_LAST_VAD_SEGMENT = True
class SmartTurnParams(BaseTurnParams):
@@ -42,6 +43,8 @@ class SmartTurnParams(BaseTurnParams):
stop_secs: float = STOP_SECS
pre_speech_ms: float = PRE_SPEECH_MS
max_duration_secs: float = MAX_DURATION_SECONDS
# Not exposing this until the model can handle it.
# use_only_last_vad_segment: bool = USE_ONLY_LAST_VAD_SEGMENT
class SmartTurnTimeoutException(Exception):
@@ -157,7 +160,7 @@ class BaseSmartTurn(BaseTurnAnalyzer):
state, result = await loop.run_in_executor(
self._executor, self._process_speech_segment, self._audio_buffer
)
if state == EndOfTurnState.COMPLETE:
if state == EndOfTurnState.COMPLETE or USE_ONLY_LAST_VAD_SEGMENT:
self._clear(state)
logger.debug(f"End of Turn result: {state}")
return state, result


@@ -14,7 +14,6 @@ Note: To learn more about the smart-turn model, visit:
- https://github.com/pipecat-ai/smart-turn
"""
import warnings
from typing import Optional
import aiohttp
@@ -27,10 +26,6 @@ class FalSmartTurnAnalyzer(HttpSmartTurnAnalyzer):
Extends HttpSmartTurnAnalyzer to provide integration with Fal.ai's
smart turn detection API endpoint with proper authentication.
.. deprecated:: 0.98.0
FalSmartTurnAnalyzer is deprecated and will be removed in a future version.
Use LocalSmartTurnAnalyzerV3 instead.
"""
def __init__(
@@ -53,12 +48,3 @@ class FalSmartTurnAnalyzer(HttpSmartTurnAnalyzer):
if api_key:
headers = {"Authorization": f"Key {api_key}"}
super().__init__(url=url, aiohttp_session=aiohttp_session, headers=headers, **kwargs)
with warnings.catch_warnings():
warnings.simplefilter("always")
warnings.warn(
"FalSmartTurnAnalyzer is deprecated and will be removed in a future version. "
"Use LocalSmartTurnAnalyzerV3 instead.",
DeprecationWarning,
stacklevel=2,
)
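
The deprecation pattern in this hunk wraps `warnings.warn` in `catch_warnings()` with `simplefilter("always")`, so the warning is emitted even if the surrounding process filters `DeprecationWarning` by default. A minimal standalone sketch, using a hypothetical `OldAnalyzer` class rather than the real analyzers:

```python
import warnings


class OldAnalyzer:
    """Hypothetical deprecated class, for illustration only."""

    def __init__(self) -> None:
        # Temporarily force the warning to be shown; the previous filter
        # state is restored when the context manager exits.
        with warnings.catch_warnings():
            warnings.simplefilter("always")
            warnings.warn(
                "OldAnalyzer is deprecated and will be removed in a future version. "
                "Use NewAnalyzer instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller, not __init__
            )
```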


@@ -10,7 +10,6 @@ This module provides a smart turn analyzer that uses PyTorch models for
local end-of-turn detection without requiring network connectivity.
"""
import warnings
from typing import Any, Dict
import numpy as np
@@ -35,10 +34,6 @@ class LocalSmartTurnAnalyzer(BaseSmartTurn):
Provides end-of-turn detection using locally-stored PyTorch models,
enabling offline operation without network dependencies. Uses
Wav2Vec2-BERT architecture for audio sequence classification.
.. deprecated:: 0.98.0
LocalSmartTurnAnalyzer is deprecated and will be removed in a future version.
Use LocalSmartTurnAnalyzerV3 instead.
"""
def __init__(self, *, smart_turn_model_path: str, **kwargs):
@@ -51,15 +46,6 @@ class LocalSmartTurnAnalyzer(BaseSmartTurn):
"""
super().__init__(**kwargs)
with warnings.catch_warnings():
warnings.simplefilter("always")
warnings.warn(
"LocalSmartTurnAnalyzer is deprecated and will be removed in a future version. "
"Use LocalSmartTurnAnalyzerV3 instead.",
DeprecationWarning,
stacklevel=2,
)
if not smart_turn_model_path:
# Define the path to the pretrained model on Hugging Face
smart_turn_model_path = "pipecat-ai/smart-turn"


@@ -42,15 +42,17 @@ class LocalSmartTurnAnalyzerV3(BaseSmartTurn):
Args:
smart_turn_model_path: Path to the ONNX model file. If this is not
set, the bundled smart-turn-v3.1-cpu model will be used.
set, the bundled smart-turn-v3.0 model will be used.
cpu_count: The number of CPUs to use for inference. Defaults to 1.
**kwargs: Additional arguments passed to BaseSmartTurn.
"""
super().__init__(**kwargs)
logger.debug("Loading Local Smart Turn v3 model...")
if not smart_turn_model_path:
# Load bundled model
model_name = "smart-turn-v3.1-cpu.onnx"
model_name = "smart-turn-v3.0.onnx"
package_path = "pipecat.audio.turn.smart_turn.data"
try:
@@ -68,8 +70,6 @@ class LocalSmartTurnAnalyzerV3(BaseSmartTurn):
impresources.files(package_path).joinpath(model_name)
)
logger.debug(f"Loading Local Smart Turn v3.x model from {smart_turn_model_path}...")
so = ort.SessionOptions()
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
so.inter_op_num_threads = 1
@@ -79,7 +79,7 @@ class LocalSmartTurnAnalyzerV3(BaseSmartTurn):
self._feature_extractor = WhisperFeatureExtractor(chunk_length=8)
self._session = ort.InferenceSession(smart_turn_model_path, sess_options=so)
logger.debug("Loaded Local Smart Turn v3.x")
logger.debug("Loaded Local Smart Turn v3")
def _predict_endpoint(self, audio_array: np.ndarray) -> Dict[str, Any]:
"""Predict end-of-turn using local ONNX model."""


@@ -18,10 +18,8 @@ from loguru import logger
from pipecat.audio.dtmf.types import KeypadEntry
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import (
EndFrame,
Frame,
LLMContextFrame,
LLMFullResponseEndFrame,
LLMMessagesUpdateFrame,
LLMTextFrame,
OutputDTMFUrgentFrame,
@@ -151,17 +149,10 @@ class IVRProcessor(FrameProcessor):
elif isinstance(frame, LLMTextFrame):
# Process text through the pattern aggregator
async for result in self._aggregator.aggregate(frame.text):
result = await self._aggregator.aggregate(frame.text)
if result:
# Push aggregated text that doesn't contain XML patterns
await self.push_frame(LLMTextFrame(result.text), direction)
elif isinstance(frame, (LLMFullResponseEndFrame, EndFrame)):
# Flush any remaining text from the aggregator
remaining = await self._aggregator.flush()
if remaining:
await self.push_frame(LLMTextFrame(remaining.text), direction)
# Push the end frame
await self.push_frame(frame, direction)
await self.push_frame(LLMTextFrame(result), direction)
else:
await self.push_frame(frame, direction)
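
One side of this hunk consumes the aggregator as an async iterator and flushes leftover text when the response ends; the other awaits a single result per text frame. A rough sketch of the iterator-plus-flush shape, using a hypothetical `SentenceAggregator` that splits on periods (the real pattern aggregator and its result type are not reproduced here):

```python
import asyncio
from typing import AsyncIterator, List, Optional


class SentenceAggregator:
    """Hypothetical aggregator: buffers text and yields complete sentences."""

    def __init__(self) -> None:
        self._buffer = ""

    async def aggregate(self, text: str) -> AsyncIterator[str]:
        # A single chunk may complete zero, one, or several sentences.
        self._buffer += text
        while (end := self._buffer.find(".")) != -1:
            sentence, self._buffer = self._buffer[: end + 1], self._buffer[end + 1 :]
            yield sentence.strip()

    async def flush(self) -> Optional[str]:
        # Return whatever is left when the stream ends, then reset.
        remaining, self._buffer = self._buffer.strip(), ""
        return remaining or None


async def run(chunks: List[str]) -> List[str]:
    aggregator = SentenceAggregator()
    out: List[str] = []
    for chunk in chunks:
        async for sentence in aggregator.aggregate(chunk):
            out.append(sentence)
    # Without this flush, trailing text lacking a terminator would be lost.
    remaining = await aggregator.flush()
    if remaining:
        out.append(remaining)
    return out
```

The flush step is the reason the iterator variant also handles `LLMFullResponseEndFrame` and `EndFrame`: text buffered at end of response has no later chunk to release it.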


@@ -40,8 +40,8 @@ from pipecat.processors.aggregators.llm_context import LLMContext
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor, FrameProcessorSetup
from pipecat.services.llm_service import LLMService
from pipecat.utils.sync.base_notifier import BaseNotifier
from pipecat.utils.sync.event_notifier import EventNotifier
from pipecat.sync.base_notifier import BaseNotifier
from pipecat.sync.event_notifier import EventNotifier
class NotifierGate(FrameProcessor):
@@ -252,8 +252,7 @@ class ClassificationProcessor(FrameProcessor):
self._voicemail_notifier = voicemail_notifier
self._voicemail_response_delay = voicemail_response_delay
# Register the conversation and voicemail detected events
self._register_event_handler("on_conversation_detected")
# Register the voicemail detected event
self._register_event_handler("on_voicemail_detected")
# Aggregation state for collecting complete LLM responses
@@ -351,7 +350,6 @@ class ClassificationProcessor(FrameProcessor):
logger.info(f"{self}: CONVERSATION detected")
await self._gate_notifier.notify() # Close the classifier gate
await self._conversation_notifier.notify() # Release buffered TTS frames
await self._call_event_handler("on_conversation_detected")
elif "VOICEMAIL" in response:
# Voicemail detected - trigger voicemail handling
@@ -541,9 +539,6 @@ class VoicemailDetector(ParallelPipeline):
custom_prompt = "Your custom classification logic here. " + VoicemailDetector.CLASSIFIER_RESPONSE_INSTRUCTION
Events:
on_conversation_detected: Triggered when a human conversation is detected. The
event handler receives one argument: the ClassificationProcessor instance
which can be used to push frames.
on_voicemail_detected: Triggered when voicemail is detected after the configured
delay. The event handler receives one argument: the ClassificationProcessor
instance which can be used to push frames.
@@ -706,7 +701,7 @@ VOICEMAIL SYSTEM (respond "VOICEMAIL"):
event_name: The name of the event to handle.
handler: The function to call when the event occurs.
"""
if event_name in ("on_conversation_detected", "on_voicemail_detected"):
if event_name == "on_voicemail_detected":
self._classification_processor.add_event_handler(event_name, handler)
else:
super().add_event_handler(event_name, handler)
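
The `add_event_handler` override in this hunk forwards selected event names to an inner processor and leaves everything else to the parent class. A minimal sketch of that delegation, with hypothetical `EventSource` and `Detector` classes standing in for the real pipeline types:

```python
from typing import Callable, Dict, List


class EventSource:
    """Hypothetical base class that stores handlers per event name."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable]] = {}

    def add_event_handler(self, event_name: str, handler: Callable) -> None:
        self._handlers.setdefault(event_name, []).append(handler)

    def handlers_for(self, event_name: str) -> List[Callable]:
        return list(self._handlers.get(event_name, []))


class Detector(EventSource):
    """Hypothetical wrapper that forwards some events to an inner classifier."""

    FORWARDED = ("on_conversation_detected", "on_voicemail_detected")

    def __init__(self) -> None:
        super().__init__()
        self.classifier = EventSource()

    def add_event_handler(self, event_name: str, handler: Callable) -> None:
        if event_name in self.FORWARDED:
            # The inner object is the one that fires these events.
            self.classifier.add_event_handler(event_name, handler)
        else:
            super().add_event_handler(event_name, handler)
```

Narrowing the forwarded set to a single name, as the other side of the hunk does, means a handler registered for `on_conversation_detected` falls through to the parent class instead of reaching the classifier.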


@@ -38,7 +38,7 @@ from pipecat.utils.time import nanoseconds_to_str
from pipecat.utils.utils import obj_count, obj_id
if TYPE_CHECKING:
from pipecat.processors.aggregators.llm_context import LLMContext, LLMContextMessage, NotGiven
from pipecat.processors.aggregators.llm_context import LLMContext, NotGiven
from pipecat.processors.frame_processor import FrameProcessor
@@ -186,20 +186,6 @@ class ControlFrame(Frame):
#
@dataclass
class UninterruptibleFrame:
"""A marker for data or control frames that must not be interrupted.
Frames with this mixin are still ordered normally, but unlike other frames,
they are preserved during interruptions: they remain in internal queues and
any task processing them will not be cancelled. This ensures the frame is
always delivered and processed to completion.
"""
pass
@dataclass
class AudioRawFrame:
"""A frame containing a chunk of raw audio.
@@ -227,7 +213,7 @@ class ImageRawFrame:
Parameters:
image: Raw image bytes.
size: Image dimensions as (width, height) tuple.
format: Image format (e.g., 'RGB', 'RGBA').
format: Image format (e.g., 'JPEG', 'PNG').
"""
image: bytes
@@ -344,7 +330,7 @@ class TextFrame(DataFrame):
"""
text: str
skip_tts: Optional[bool] = field(init=False)
skip_tts: bool = field(init=False)
# Whether any necessary inter-frame (leading/trailing) spaces are already
# included in the text.
# NOTE: Ideally this would be available at init time with a default value,
@@ -357,7 +343,7 @@ class TextFrame(DataFrame):
def __post_init__(self):
super().__post_init__()
self.skip_tts = None
self.skip_tts = False
self.includes_inter_frame_spaces = False
self.append_to_context = True
@@ -370,10 +356,7 @@ class TextFrame(DataFrame):
class LLMTextFrame(TextFrame):
"""Text frame generated by LLM services."""
def __post_init__(self):
super().__post_init__()
# LLM services send text frames with all necessary spaces included
self.includes_inter_frame_spaces = True
pass
class AggregationType(str, Enum):
@@ -400,13 +383,6 @@ class AggregatedTextFrame(TextFrame):
aggregated_by: AggregationType | str
@dataclass
class VisionTextFrame(LLMTextFrame):
"""Text frame generated by vision services."""
pass
@dataclass
class TTSTextFrame(AggregatedTextFrame):
"""Text frame generated by Text-to-Speech services."""
@@ -519,15 +495,6 @@ class TranscriptionMessage:
timestamp: Optional[str] = None
@dataclass
class ThoughtTranscriptionMessage:
"""An LLM thought message in a conversation transcript."""
role: Literal["assistant"] = field(default="assistant", init=False)
content: str
timestamp: Optional[str] = None
@dataclass
class TranscriptionUpdateFrame(DataFrame):
"""Frame containing new messages added to conversation transcript.
@@ -572,7 +539,7 @@ class TranscriptionUpdateFrame(DataFrame):
messages: List of new transcript messages that were added.
"""
messages: List[TranscriptionMessage | ThoughtTranscriptionMessage]
messages: List[TranscriptionMessage]
def __str__(self):
pts = format_pts(self.pts)
@@ -593,75 +560,6 @@ class LLMContextFrame(Frame):
context: "LLMContext"
@dataclass
class LLMThoughtStartFrame(ControlFrame):
"""Frame indicating the start of an LLM thought.
Parameters:
append_to_context: Whether the thought should be appended to the LLM context.
If it is appended, the `llm` field is required, since it will be
appended as an `LLMSpecificMessage`.
llm: Optional identifier of the LLM provider for LLM-specific handling.
Only required if `append_to_context` is True, as the thought is
appended to context as an `LLMSpecificMessage`.
"""
append_to_context: bool = False
llm: Optional[str] = None
def __post_init__(self):
super().__post_init__()
if self.append_to_context and self.llm is None:
raise ValueError("When append_to_context is True, llm must be set")
def __str__(self):
pts = format_pts(self.pts)
return (
f"{self.name}(pts: {pts}, append_to_context: {self.append_to_context}, llm: {self.llm})"
)
@dataclass
class LLMThoughtTextFrame(DataFrame):
"""Frame containing the text (or text chunk) of an LLM thought.
Note that despite this containing text, it is a DataFrame and not a
TextFrame, to avoid most typical text processing, such as TTS.
Parameters:
text: The text (or text chunk) of the thought.
"""
text: str
includes_inter_frame_spaces: bool = field(init=False)
def __post_init__(self):
super().__post_init__()
# Assume that thought text chunks include all necessary spaces
self.includes_inter_frame_spaces = True
def __str__(self):
pts = format_pts(self.pts)
return f"{self.name}(pts: {pts}, thought text: {self.text})"
@dataclass
class LLMThoughtEndFrame(ControlFrame):
"""Frame indicating the end of an LLM thought.
Parameters:
signature: Optional signature associated with the thought.
This is used by Anthropic, which includes a signature at the end of
each thought.
"""
signature: Any = None
def __str__(self):
pts = format_pts(self.pts)
return f"{self.name}(pts: {pts}, signature: {self.signature})"
@dataclass
class LLMMessagesFrame(DataFrame):
"""Frame containing LLM messages for chat completion.
@@ -795,44 +693,6 @@ class LLMConfigureOutputFrame(DataFrame):
skip_tts: bool
@dataclass
class FunctionCallResultProperties:
"""Properties for configuring function call result behavior.
Parameters:
run_llm: Whether to run the LLM after receiving this result.
on_context_updated: Callback to execute when context is updated.
"""
run_llm: Optional[bool] = None
on_context_updated: Optional[Callable[[], Awaitable[None]]] = None
@dataclass
class FunctionCallResultFrame(DataFrame, UninterruptibleFrame):
"""Frame containing the result of an LLM function call.
This is an uninterruptible frame because once a result is generated we
always want to update the context.
Parameters:
function_name: Name of the function that was executed.
tool_call_id: Unique identifier for the function call.
arguments: Arguments that were passed to the function.
result: The result returned by the function.
run_llm: Whether to run the LLM after this result.
properties: Additional properties for result handling.
"""
function_name: str
tool_call_id: str
arguments: Any
result: Any
run_llm: Optional[bool] = None
properties: Optional[FunctionCallResultProperties] = None
@dataclass
class TTSSpeakFrame(DataFrame):
"""Frame containing text that should be spoken by TTS.
@@ -954,7 +814,7 @@ class CancelFrame(SystemFrame):
reason: Optional reason for pushing a cancel frame.
"""
reason: Optional[Any] = None
reason: Optional[str] = None
def __str__(self):
return f"{self.name}(reason: {self.reason})"
@@ -972,13 +832,11 @@ class ErrorFrame(SystemFrame):
error: Description of the error that occurred.
fatal: Whether the error is fatal and requires bot shutdown.
processor: The frame processor that generated the error.
exception: The exception that occurred.
"""
error: str
fatal: bool = False
processor: Optional["FrameProcessor"] = None
exception: Optional[Exception] = None
def __str__(self):
return f"{self.name}(error: {self.error}, fatal: {self.fatal})"
@@ -1226,6 +1084,23 @@ class FunctionCallsStartedFrame(SystemFrame):
function_calls: Sequence[FunctionCallFromLLM]
@dataclass
class FunctionCallInProgressFrame(SystemFrame):
"""Frame signaling that a function call is currently executing.
Parameters:
function_name: Name of the function being executed.
tool_call_id: Unique identifier for this function call.
arguments: Arguments passed to the function.
cancel_on_interruption: Whether to cancel this call if interrupted.
"""
function_name: str
tool_call_id: str
arguments: Any
cancel_on_interruption: bool = False
@dataclass
class FunctionCallCancelFrame(SystemFrame):
"""Frame signaling that a function call has been cancelled.
@@ -1239,6 +1114,40 @@ class FunctionCallCancelFrame(SystemFrame):
tool_call_id: str
@dataclass
class FunctionCallResultProperties:
"""Properties for configuring function call result behavior.
Parameters:
run_llm: Whether to run the LLM after receiving this result.
on_context_updated: Callback to execute when context is updated.
"""
run_llm: Optional[bool] = None
on_context_updated: Optional[Callable[[], Awaitable[None]]] = None
@dataclass
class FunctionCallResultFrame(SystemFrame):
"""Frame containing the result of an LLM function call.
Parameters:
function_name: Name of the function that was executed.
tool_call_id: Unique identifier for the function call.
arguments: Arguments that were passed to the function.
result: The result returned by the function.
run_llm: Whether to run the LLM after this result.
properties: Additional properties for result handling.
"""
function_name: str
tool_call_id: str
arguments: Any
result: Any
run_llm: Optional[bool] = None
properties: Optional[FunctionCallResultProperties] = None
@dataclass
class STTMuteFrame(SystemFrame):
"""Frame to mute/unmute the Speech-to-Text service.
@@ -1473,23 +1382,6 @@ class UserImageRawFrame(InputImageRawFrame):
return f"{self.name}(pts: {pts}, user: {self.user_id}, source: {self.transport_source}, size: {self.size}, format: {self.format}, text: {self.text}, append_to_context: {self.append_to_context})"
@dataclass
class AssistantImageRawFrame(OutputImageRawFrame):
"""Frame containing an image generated by the assistant.
Contains both the raw frame for display (superclass functionality) as well
as the original image, which can get used directly in LLM contexts.
Parameters:
original_data: The original image data, which can get used directly in
an LLM context message without further encoding.
original_mime_type: The MIME type of the original image data.
"""
original_data: Optional[bytes] = None
original_mime_type: Optional[str] = None
@dataclass
class InputDTMFFrame(DTMFFrame, SystemFrame):
"""DTMF keypress input frame from transport."""
@@ -1557,7 +1449,7 @@ class EndTaskFrame(TaskFrame):
reason: Optional reason for pushing an end frame.
"""
reason: Optional[Any] = None
reason: Optional[str] = None
def __str__(self):
return f"{self.name}(reason: {self.reason})"
@@ -1575,7 +1467,7 @@ class CancelTaskFrame(TaskFrame):
reason: Optional reason for pushing a cancel frame.
"""
reason: Optional[Any] = None
reason: Optional[str] = None
def __str__(self):
return f"{self.name}(reason: {self.reason})"
@@ -1654,7 +1546,7 @@ class EndFrame(ControlFrame):
reason: Optional reason for pushing an end frame.
"""
reason: Optional[Any] = None
reason: Optional[str] = None
def __str__(self):
return f"{self.name}(reason: {self.reason})"
@@ -1735,61 +1627,22 @@ class LLMFullResponseStartFrame(ControlFrame):
more TextFrames and a final LLMFullResponseEndFrame.
"""
skip_tts: Optional[bool] = field(init=False)
skip_tts: bool = field(init=False)
def __post_init__(self):
super().__post_init__()
self.skip_tts = None
self.skip_tts = False
@dataclass
class LLMFullResponseEndFrame(ControlFrame):
"""Frame indicating the end of an LLM response."""
skip_tts: Optional[bool] = field(init=False)
skip_tts: bool = field(init=False)
def __post_init__(self):
super().__post_init__()
self.skip_tts = None
@dataclass
class FunctionCallInProgressFrame(ControlFrame, UninterruptibleFrame):
"""Frame signaling that a function call is currently executing.
This is an uninterruptible frame because we always want to update the
context.
Parameters:
function_name: Name of the function being executed.
tool_call_id: Unique identifier for this function call.
arguments: Arguments passed to the function.
cancel_on_interruption: Whether to cancel this call if interrupted.
"""
function_name: str
tool_call_id: str
arguments: Any
cancel_on_interruption: bool = False
@dataclass
class VisionFullResponseStartFrame(LLMFullResponseStartFrame):
"""Frame indicating the beginning of a vision model response.
Used to indicate the beginning of a vision model response. Followed by one
or more VisionTextFrames and a final VisionFullResponseEndFrame.
"""
pass
@dataclass
class VisionFullResponseEndFrame(LLMFullResponseEndFrame):
"""Frame indicating the end of a Vision model response."""
pass
self.skip_tts = False
@dataclass

View File

@@ -15,8 +15,8 @@ from pipecat.frames.frames import (
BotStartedSpeakingFrame,
CancelFrame,
EndFrame,
VADUserStartedSpeakingFrame,
VADUserStoppedSpeakingFrame,
UserStartedSpeakingFrame,
UserStoppedSpeakingFrame,
)
from pipecat.observers.base_observer import BaseObserver, FramePushed
from pipecat.processors.frame_processor import FrameDirection
@@ -36,7 +36,7 @@ class UserBotLatencyLogObserver(BaseObserver):
to calculate response latencies.
"""
super().__init__()
self._user_bot_latency_processed_frames = set()
self._processed_frames = set()
self._user_stopped_time = 0
self._latencies = []
@@ -51,14 +51,14 @@ class UserBotLatencyLogObserver(BaseObserver):
return
# Skip already processed frames
if data.frame.id in self._user_bot_latency_processed_frames:
if data.frame.id in self._processed_frames:
return
self._user_bot_latency_processed_frames.add(data.frame.id)
self._processed_frames.add(data.frame.id)
if isinstance(data.frame, VADUserStartedSpeakingFrame):
if isinstance(data.frame, UserStartedSpeakingFrame):
self._user_stopped_time = 0
elif isinstance(data.frame, VADUserStoppedSpeakingFrame):
elif isinstance(data.frame, UserStoppedSpeakingFrame):
self._user_stopped_time = time.time()
elif isinstance(data.frame, (EndFrame, CancelFrame)):
self._log_summary()

View File

@@ -9,7 +9,7 @@
from pipecat.frames.frames import CancelFrame, EndFrame, Frame, LLMContextFrame, StartFrame
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContextFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.utils.sync.base_notifier import BaseNotifier
from pipecat.sync.base_notifier import BaseNotifier
class GatedLLMContextAggregator(FrameProcessor):

View File

@@ -14,7 +14,6 @@ translation from this universal context into whatever format it needs, using a
service-specific adapter.
"""
import asyncio
import base64
import io
import wave
@@ -138,7 +137,7 @@ class LLMContext:
return {"role": role, "content": content}
@staticmethod
async def create_image_message(
def create_image_message(
*,
role: str = "user",
format: str,
@@ -150,34 +149,20 @@ class LLMContext:
Args:
role: The role of this message (defaults to "user").
format: Image format (e.g., 'RGB', 'RGBA', or, if already encoded,
the MIME type like 'image/jpeg').
format: Image format (e.g., 'RGB', 'RGBA').
size: Image dimensions as (width, height) tuple.
image: Raw image bytes.
text: Optional text to include with the image.
"""
# Format is a mime type: image is already encoded
image_already_encoded = format.startswith("image/")
def encode_image():
if image_already_encoded:
bytes = image
else:
# Encode to JPEG
buffer = io.BytesIO()
Image.frombytes(format, size, image).save(buffer, format="JPEG")
bytes = buffer.getvalue()
encoded_image = base64.b64encode(bytes).decode("utf-8")
return encoded_image
encoded_image = await asyncio.to_thread(encode_image)
url = f"data:{format if image_already_encoded else 'image/jpeg'};base64,{encoded_image}"
buffer = io.BytesIO()
Image.frombytes(format, size, image).save(buffer, format="JPEG")
encoded_image = base64.b64encode(buffer.getvalue()).decode("utf-8")
url = f"data:image/jpeg;base64,{encoded_image}"
return LLMContext.create_image_url_message(role=role, url=url, text=text)
@staticmethod
async def create_audio_message(
def create_audio_message(
*, role: str = "user", audio_frames: list[AudioRawFrame], text: str = "Audio follows"
) -> LLMContextMessage:
"""Create a context message containing audio.
@@ -187,25 +172,21 @@ class LLMContext:
audio_frames: List of audio frame objects to include.
text: Optional text to include with the audio.
"""
content = [{"type": "text", "text": text}]
sample_rate = audio_frames[0].sample_rate
num_channels = audio_frames[0].num_channels
async def encode_audio():
sample_rate = audio_frames[0].sample_rate
num_channels = audio_frames[0].num_channels
content = []
content.append({"type": "text", "text": text})
data = b"".join(frame.audio for frame in audio_frames)
data = b"".join(frame.audio for frame in audio_frames)
with io.BytesIO() as buffer:
with wave.open(buffer, "wb") as wf:
wf.setsampwidth(2)
wf.setnchannels(num_channels)
wf.setframerate(sample_rate)
wf.writeframes(data)
with io.BytesIO() as buffer:
with wave.open(buffer, "wb") as wf:
wf.setsampwidth(2)
wf.setnchannels(num_channels)
wf.setframerate(sample_rate)
wf.writeframes(data)
encoded_audio = base64.b64encode(buffer.getvalue()).decode("utf-8")
return encoded_audio
encoded_audio = await asyncio.to_thread(encode_audio)
encoded_audio = base64.b64encode(buffer.getvalue()).decode("utf-8")
content.append(
{
@@ -340,31 +321,21 @@ class LLMContext:
"""
self._tool_choice = tool_choice
async def add_image_frame_message(
self,
*,
format: str,
size: tuple[int, int],
image: bytes,
text: Optional[str] = None,
role: str = "user",
def add_image_frame_message(
self, *, format: str, size: tuple[int, int], image: bytes, text: Optional[str] = None
):
"""Add a message containing an image frame.
Args:
format: Image format (e.g., 'RGB', 'RGBA', or, if already encoded,
the MIME type like 'image/jpeg').
format: Image format (e.g., 'RGB', 'RGBA').
size: Image dimensions as (width, height) tuple.
image: Raw image bytes.
text: Optional text to include with the image.
role: The role of this message (defaults to "user").
"""
message = await LLMContext.create_image_message(
role=role, format=format, size=size, image=image, text=text
)
message = LLMContext.create_image_message(format=format, size=size, image=image, text=text)
self.add_message(message)
async def add_audio_frames_message(
def add_audio_frames_message(
self, *, audio_frames: list[AudioRawFrame], text: str = "Audio follows"
):
"""Add a message containing audio frames.
@@ -373,7 +344,7 @@ class LLMContext:
audio_frames: List of audio frame objects to include.
text: Optional text to include with the audio.
"""
message = await LLMContext.create_audio_message(audio_frames=audio_frames, text=text)
message = LLMContext.create_audio_message(audio_frames=audio_frames, text=text)
self.add_message(message)
@staticmethod
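
Review note: both sides of the `create_audio_message` hunks above do the same encoding work (join the 16-bit PCM chunks, wrap them in a WAV container, base64-encode the result); they differ only in whether that work runs inline or is handed to `asyncio.to_thread`. The sketch below is a standalone illustration of that step using only the standard library. The names `encode_pcm_as_wav_base64` and `encode_off_loop` are hypothetical and are not part of pipecat. Note that `asyncio.to_thread` expects a plain callable; given a coroutine function it would return an un-awaited coroutine object instead of the encoded string.

```python
import asyncio
import base64
import io
import wave
from typing import List


def encode_pcm_as_wav_base64(chunks: List[bytes], sample_rate: int, num_channels: int) -> str:
    """Join 16-bit PCM chunks, wrap them in a WAV container, return base64 text.

    Hypothetical helper for illustration; not pipecat's API.
    """
    data = b"".join(chunks)
    with io.BytesIO() as buffer:
        with wave.open(buffer, "wb") as wf:
            wf.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit PCM
            wf.setnchannels(num_channels)
            wf.setframerate(sample_rate)
            wf.writeframes(data)
        # Read the bytes after the WAV writer has closed and finalized the header.
        return base64.b64encode(buffer.getvalue()).decode("utf-8")


async def encode_off_loop(chunks: List[bytes], sample_rate: int, num_channels: int) -> str:
    """Run the blocking encoder in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(encode_pcm_as_wav_base64, chunks, sample_rate, num_channels)
```

The synchronous variant is simpler to call, while the threaded variant avoids stalling other pipeline tasks when the audio buffer is large; which trade-off fits depends on how long the encode takes in practice.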

View File

@@ -24,7 +24,6 @@ from pipecat.audio.interruptions.base_interruption_strategy import BaseInterrupt
from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.frames.frames import (
AssistantImageRawFrame,
BotStartedSpeakingFrame,
BotStoppedSpeakingFrame,
CancelFrame,
@@ -48,9 +47,6 @@ from pipecat.frames.frames import (
LLMRunFrame,
LLMSetToolChoiceFrame,
LLMSetToolsFrame,
LLMThoughtEndFrame,
LLMThoughtStartFrame,
LLMThoughtTextFrame,
SpeechControlParamsFrame,
StartFrame,
TextFrame,
@@ -70,7 +66,7 @@ from pipecat.processors.aggregators.llm_response import (
LLMUserAggregatorParams,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.utils.string import TextPartForConcatenation, concatenate_aggregated_text
from pipecat.utils.string import concatenate_aggregated_text
from pipecat.utils.time import time_now_iso8601
@@ -94,7 +90,15 @@ class LLMContextAggregator(FrameProcessor):
self._context = context
self._role = role
self._aggregation: List[TextPartForConcatenation] = []
self._aggregation: List[str] = []
# Whether to add spaces between text parts.
# (Currently only used by LLMAssistantAggregator, but could be expanded
# to LLMUserAggregator in the future if needed; that would require
# additional work since LLMUserAggregator currently trims spaces from
# incoming frames before determining whether it "really" received any
# text).
self._add_spaces = True
@property
def messages(self) -> List[LLMContextMessage]:
@@ -187,7 +191,7 @@ class LLMContextAggregator(FrameProcessor):
Returns:
The concatenated aggregation string.
"""
return concatenate_aggregated_text(self._aggregation)
return concatenate_aggregated_text(self._aggregation, self._add_spaces)
class LLMUserAggregator(LLMContextAggregator):
@@ -437,12 +441,7 @@ class LLMUserAggregator(LLMContextAggregator):
if not text.strip():
return
# Transcriptions never include inter-part spaces (so far).
self._aggregation.append(
TextPartForConcatenation(
text, includes_inter_part_spaces=frame.includes_inter_frame_spaces
)
)
self._aggregation.append(text)
# We just got a final result, so let's reset interim results.
self._seen_interim_results = False
# Reset aggregation timer.
@@ -596,10 +595,6 @@ class LLMAssistantAggregator(LLMContextAggregator):
self._function_calls_in_progress: Dict[str, Optional[FunctionCallInProgressFrame]] = {}
self._context_updated_tasks: Set[asyncio.Task] = set()
self._thought_aggregation_enabled = False
self._thought_llm: str = ""
self._thought_aggregation: List[TextPartForConcatenation] = []
@property
def has_function_calls_in_progress(self) -> bool:
"""Check if there are any function calls currently in progress.
@@ -609,17 +604,6 @@ class LLMAssistantAggregator(LLMContextAggregator):
"""
return bool(self._function_calls_in_progress)
async def reset(self):
"""Reset the aggregation state."""
await super().reset()
await self._reset_thought_aggregation() # Just to be safe
async def _reset_thought_aggregation(self):
"""Reset the thought aggregation state."""
self._thought_aggregation_enabled = False
self._thought_llm = ""
self._thought_aggregation = []
async def process_frame(self, frame: Frame, direction: FrameDirection):
"""Process frames for assistant response aggregation and function call management.
@@ -638,12 +622,6 @@ class LLMAssistantAggregator(LLMContextAggregator):
await self._handle_llm_end(frame)
elif isinstance(frame, TextFrame):
await self._handle_text(frame)
elif isinstance(frame, LLMThoughtStartFrame):
await self._handle_thought_start(frame)
elif isinstance(frame, LLMThoughtTextFrame):
await self._handle_thought_text(frame)
elif isinstance(frame, LLMThoughtEndFrame):
await self._handle_thought_end(frame)
elif isinstance(frame, LLMRunFrame):
await self._handle_llm_run(frame)
elif isinstance(frame, LLMMessagesAppendFrame):
@@ -664,8 +642,6 @@ class LLMAssistantAggregator(LLMContextAggregator):
await self._handle_function_call_cancel(frame)
elif isinstance(frame, UserImageRawFrame):
await self._handle_user_image_frame(frame)
elif isinstance(frame, AssistantImageRawFrame):
await self._handle_assistant_image_frame(frame)
elif isinstance(frame, BotStoppedSpeakingFrame):
await self.push_aggregation()
await self.push_frame(frame, direction)
@@ -820,7 +796,7 @@ class LLMAssistantAggregator(LLMContextAggregator):
logger.debug(f"{self} Appending UserImageRawFrame to LLM context (size: {frame.size})")
await self._context.add_image_frame_message(
self._context.add_image_frame_message(
format=frame.format,
size=frame.size,
image=frame.image,
@@ -830,24 +806,6 @@ class LLMAssistantAggregator(LLMContextAggregator):
await self.push_aggregation()
await self.push_context_frame(FrameDirection.UPSTREAM)
async def _handle_assistant_image_frame(self, frame: AssistantImageRawFrame):
logger.debug(f"{self} Appending AssistantImageRawFrame to LLM context (size: {frame.size})")
if frame.original_data and frame.original_mime_type:
await self._context.add_image_frame_message(
format=frame.original_mime_type,
size=frame.size, # Technically doesn't matter, since already encoded
image=frame.original_data,
role="assistant",
)
else:
await self._context.add_image_frame_message(
format=frame.format,
size=frame.size,
image=frame.image,
role="assistant",
)
async def _handle_llm_start(self, _: LLMFullResponseStartFrame):
self._started += 1
@@ -863,52 +821,11 @@ class LLMAssistantAggregator(LLMContextAggregator):
if len(frame.text) == 0:
return
self._aggregation.append(
TextPartForConcatenation(
frame.text, includes_inter_part_spaces=frame.includes_inter_frame_spaces
)
)
# Track whether we need to add spaces between text parts
# Assumption: we can just keep track of the latest frame's value
self._add_spaces = not frame.includes_inter_frame_spaces
async def _handle_thought_start(self, frame: LLMThoughtStartFrame):
if not self._started:
return
await self._reset_thought_aggregation()
self._thought_aggregation_enabled = frame.append_to_context
self._thought_llm = frame.llm
async def _handle_thought_text(self, frame: LLMThoughtTextFrame):
if not self._started or not self._thought_aggregation_enabled:
return
# Make sure we really have text (spaces count, too!)
if len(frame.text) == 0:
return
self._thought_aggregation.append(
TextPartForConcatenation(
frame.text, includes_inter_part_spaces=frame.includes_inter_frame_spaces
)
)
async def _handle_thought_end(self, frame: LLMThoughtEndFrame):
if not self._started or not self._thought_aggregation_enabled:
return
thought = concatenate_aggregated_text(self._thought_aggregation)
llm = self._thought_llm
await self._reset_thought_aggregation()
self._context.add_message(
LLMSpecificMessage(
llm=llm,
message={
"type": "thought",
"text": thought,
"signature": frame.signature,
},
)
)
self._aggregation.append(frame.text)
def _context_updated_task_finished(self, task: asyncio.Task):
self._context_updated_tasks.discard(task)
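
Review note: the `_add_spaces` flag in this file decides whether aggregated text parts are joined with a separator or concatenated as-is. The following is a minimal sketch of that idea, assuming the flag simply switches the separator. `join_parts` is a hypothetical stand-in and should not be read as the actual body of `concatenate_aggregated_text`.

```python
from typing import List


def join_parts(parts: List[str], add_spaces: bool) -> str:
    """Join text fragments into one string.

    Hypothetical helper: when add_spaces is True the fragments are assumed to
    carry no spacing of their own, so a single space is placed between them.
    When False, the fragments are assumed to already include their spacing.
    """
    separator = " " if add_spaces else ""
    return separator.join(parts).strip()


# Fragments that already embed their spacing are concatenated directly.
print(join_parts(["Hello", " there", "!", " How", "'s", " it"], add_spaces=False))

# Word-by-word fragments without spacing are joined with spaces.
print(join_parts(["Hello", "there", "how", "are", "you"], add_spaces=True))
```

Tracking one flag per aggregation (as the `_add_spaces` side of the diff does) assumes every frame in a response uses the same spacing convention; the per-part `TextPartForConcatenation` side does not need that assumption.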

View File

@@ -83,7 +83,8 @@ class LLMTextProcessor(FrameProcessor):
await self._text_aggregator.reset()
async def _handle_llm_text(self, in_frame: LLMTextFrame):
async for aggregation in self._text_aggregator.aggregate(in_frame.text):
aggregation = await self._text_aggregator.aggregate(in_frame.text)
if aggregation:
out_frame = AggregatedTextFrame(
text=aggregation.text,
aggregated_by=aggregation.type,
@@ -91,13 +92,15 @@ class LLMTextProcessor(FrameProcessor):
out_frame.skip_tts = in_frame.skip_tts
await self.push_frame(out_frame)
async def _handle_llm_end(self, skip_tts: Optional[bool] = None):
# Flush any remaining text
remaining = await self._text_aggregator.flush()
if remaining:
async def _handle_llm_end(self, skip_tts: bool = False):
# Flush any remaining aggregated text at the end of the LLM response
aggregation = self._text_aggregator.text
await self._text_aggregator.reset()
text = aggregation.text.strip()
if text:
out_frame = AggregatedTextFrame(
text=remaining.text,
aggregated_by=remaining.type,
text=text,
aggregated_by=aggregation.type,
)
out_frame.skip_tts = skip_tts
await self.push_frame(out_frame)

View File

@@ -83,4 +83,4 @@ class ConsumerProcessor(FrameProcessor):
while True:
frame = await self._queue.get()
new_frame = await self._transformer(frame)
await self.queue_frame(new_frame, self._direction)
await self.push_frame(new_frame, self._direction)

View File

@@ -126,4 +126,6 @@ class WakeCheckFilter(FrameProcessor):
else:
await self.push_frame(frame, direction)
except Exception as e:
await self.push_error(error_msg=f"Error in wake word filter: {e}", exception=e)
error_msg = f"Error in wake word filter: {e}"
logger.exception(error_msg)
await self.push_error(ErrorFrame(error_msg))

View File

@@ -10,7 +10,7 @@ from typing import Awaitable, Callable, Tuple, Type
from pipecat.frames.frames import Frame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.utils.sync.base_notifier import BaseNotifier
from pipecat.sync.base_notifier import BaseNotifier
class WakeNotifierFilter(FrameProcessor):

View File

@@ -12,7 +12,6 @@ management, and frame flow control mechanisms.
"""
import asyncio
import traceback
from dataclasses import dataclass
from enum import Enum
from typing import Any, Awaitable, Callable, Coroutine, List, Optional, Sequence, Tuple, Type
@@ -33,7 +32,6 @@ from pipecat.frames.frames import (
InterruptionTaskFrame,
StartFrame,
SystemFrame,
UninterruptibleFrame,
)
from pipecat.metrics.metrics import LLMTokenUsage, MetricsData
from pipecat.observers.base_observer import BaseObserver, FrameProcessed, FramePushed
@@ -144,7 +142,6 @@ class FrameProcessor(BaseObject):
- on_after_process_frame: Called after a frame is processed
- on_before_push_frame: Called before a frame is pushed
- on_after_push_frame: Called after a frame is pushed
- on_error: Called when an error is raised in the frame processing.
"""
def __init__(
@@ -212,7 +209,6 @@ class FrameProcessor(BaseObject):
# The input task that handles all types of frames. It processes system
# frames right away and queues non-system frames for later processing.
self.__should_block_system_frames = False
self.__input_queue = FrameProcessorQueue()
self.__input_event: Optional[asyncio.Event] = None
self.__input_frame_task: Optional[asyncio.Task] = None
@@ -222,10 +218,8 @@ class FrameProcessor(BaseObject):
# called. To resume processing frames we need to call
# `resume_processing_frames()` which will wake up the event.
self.__should_block_frames = False
self.__process_queue = asyncio.Queue()
self.__process_event: Optional[asyncio.Event] = None
self.__process_frame_task: Optional[asyncio.Task] = None
self.__process_current_frame: Optional[Frame] = None
# To interrupt a pipeline, we push an `InterruptionTaskFrame` upstream.
# Then we wait for the corresponding `InterruptionFrame` to travel from
@@ -240,7 +234,6 @@ class FrameProcessor(BaseObject):
self._register_event_handler("on_after_process_frame", sync=True)
self._register_event_handler("on_before_push_frame", sync=True)
self._register_event_handler("on_after_push_frame", sync=True)
self._register_event_handler("on_error", sync=True)
@property
def id(self) -> int:
@@ -637,43 +630,7 @@ class FrameProcessor(BaseObject):
elif isinstance(frame, (FrameProcessorResumeFrame, FrameProcessorResumeUrgentFrame)):
await self.__resume(frame)
async def push_error(
self,
error_msg: str,
exception: Optional[Exception] = None,
fatal: bool = False,
):
"""Creates and pushes an ErrorFrame upstream.
Creates and pushes an ErrorFrame upstream to notify other processors in the
pipeline about an error condition. The error frame will include context about
which processor generated the error.
Args:
error_msg: Descriptive message explaining the error condition.
exception: Optional exception object that caused the error, if available.
This provides additional context for debugging and error handling.
fatal: Whether this error should be considered fatal to the pipeline.
Fatal errors typically cause the entire pipeline to stop processing.
Defaults to False for non-fatal errors.
Example::
```python
# Non-fatal error
await self.push_error("Failed to process audio chunk, skipping")
# Fatal error with exception context
try:
result = some_critical_operation()
except Exception as e:
await self.push_error("Critical operation failed", exception=e, fatal=True)
```
"""
error_frame = ErrorFrame(error=error_msg, fatal=fatal, exception=exception, processor=self)
await self.push_error_frame(error=error_frame)
async def push_error_frame(self, error: ErrorFrame):
async def push_error(self, error: ErrorFrame):
"""Push an error frame upstream.
Args:
@@ -681,18 +638,6 @@ class FrameProcessor(BaseObject):
"""
if not error.processor:
error.processor = self
await self._call_event_handler("on_error", error)
if error.exception:
tb = traceback.extract_tb(error.exception.__traceback__)
last = tb[-1]
error_message = (
f"{error.processor} exception ({last.filename}:{last.lineno}): {error.error}"
)
else:
error_message = f"{error.processor} error: {error.error}"
logger.error(error_message)
await self.push_frame(error, FrameDirection.UPSTREAM)
async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
@@ -809,19 +754,13 @@ class FrameProcessor(BaseObject):
# interruption). Instead we just drain the queue because this is
# an interruption.
self.__reset_process_task()
elif isinstance(self.__process_current_frame, UninterruptibleFrame):
# We don't want to cancel UninterruptibleFrame, so we simply
# cleanup the queue.
self.__reset_process_queue()
else:
# Cancel and re-create the process task.
# Cancel and re-create the process task including the queue.
await self.__cancel_process_task()
self.__create_process_task()
except Exception as e:
await self.push_error(
error_msg=f"Uncaught exception handling _start_interruption: {e}",
exception=e,
)
logger.exception(f"Uncaught exception in {self} when handling _start_interruption: {e}")
await self.push_error(ErrorFrame(str(e)))
async def __internal_push_frame(self, frame: Frame, direction: FrameDirection):
"""Internal method to push frames to adjacent processors.
@@ -858,7 +797,8 @@ class FrameProcessor(BaseObject):
await self._observer.on_push_frame(data)
await self._prev.queue_frame(frame, direction)
except Exception as e:
await self.push_error(error_msg=f"Uncaught exception: {e}", exception=e)
logger.exception(f"Uncaught exception in {self}: {e}")
await self.push_error(ErrorFrame(str(e)))
def _check_started(self, frame: Frame):
"""Check if the processor has been started.
@@ -880,6 +820,7 @@ class FrameProcessor(BaseObject):
if not self.__input_frame_task:
self.__input_event = asyncio.Event()
self.__input_queue = FrameProcessorQueue()
self.__input_frame_task = self.create_task(self.__input_frame_task_handler())
async def __cancel_input_task(self):
@@ -897,7 +838,9 @@ class FrameProcessor(BaseObject):
return
if not self.__process_frame_task:
self.__reset_process_task()
self.__should_block_frames = False
self.__process_event = asyncio.Event()
self.__process_queue = asyncio.Queue()
self.__process_frame_task = self.create_task(self.__process_frame_task_handler())
def __reset_process_task(self):
@@ -907,26 +850,10 @@ class FrameProcessor(BaseObject):
self.__should_block_frames = False
self.__process_event = asyncio.Event()
self.__reset_process_queue()
def __reset_process_queue(self):
"""Reset non-system frame processing queue."""
# Create a new queue to insert UninterruptibleFrame frames.
new_queue = asyncio.Queue()
# Process current queue and keep UninterruptibleFrame frames.
while not self.__process_queue.empty():
item = self.__process_queue.get_nowait()
if isinstance(item, UninterruptibleFrame):
new_queue.put_nowait(item)
self.__process_queue.get_nowait()
self.__process_queue.task_done()
# Put back UninterruptibleFrame frames into our process queue.
while not new_queue.empty():
item = new_queue.get_nowait()
self.__process_queue.put_nowait(item)
new_queue.task_done()
async def __cancel_process_task(self):
"""Cancel the non-system frame processing task."""
if self.__process_frame_task:
@@ -947,7 +874,8 @@ class FrameProcessor(BaseObject):
await self._call_event_handler("on_after_process_frame", frame)
except Exception as e:
await self.push_error(error_msg=f"Error processing frame: {e}", exception=e)
logger.exception(f"{self}: error processing frame: {e}")
await self.push_error(ErrorFrame(str(e)))
async def __input_frame_task_handler(self):
"""Handle frames from the input queue.
@@ -980,12 +908,8 @@ class FrameProcessor(BaseObject):
async def __process_frame_task_handler(self):
"""Handle non-system frames from the process queue."""
while True:
self.__process_current_frame = None
(frame, direction, callback) = await self.__process_queue.get()
self.__process_current_frame = frame
if self.__should_block_frames and self.__process_event:
logger.trace(f"{self}: frame processing paused")
await self.__process_event.wait()
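
Review note: one side of the `__reset_process_queue` hunk above drains the process queue on interruption while putting frames of a protected type back. The sketch below shows that pattern on a plain `asyncio.Queue`. All names in it (`DemoFrame`, `DemoUninterruptibleFrame`, `drain_keeping`) are hypothetical, and the queue holds bare frames here rather than the `(frame, direction, callback)` tuples used by `FrameProcessor`.

```python
import asyncio


class DemoFrame:
    """Hypothetical stand-in for a regular frame."""


class DemoUninterruptibleFrame(DemoFrame):
    """Hypothetical stand-in for a frame that must survive an interruption."""


def drain_keeping(queue: asyncio.Queue, keep_type: type) -> int:
    """Empty the queue, then re-queue items of keep_type in their original order.

    Returns the number of items that were dropped.
    """
    kept = []
    dropped = 0
    while not queue.empty():
        item = queue.get_nowait()
        queue.task_done()  # balance the unfinished-task counter for this get
        if isinstance(item, keep_type):
            kept.append(item)
        else:
            dropped += 1
    for item in kept:
        queue.put_nowait(item)
    return dropped


async def demo() -> int:
    queue: asyncio.Queue = asyncio.Queue()
    for frame in (DemoFrame(), DemoUninterruptibleFrame(), DemoFrame()):
        queue.put_nowait(frame)
    return drain_keeping(queue, DemoUninterruptibleFrame)
```

Calling `asyncio.run(demo())` drops the two regular frames and leaves the protected one queued. The other side of the diff replaces the queue wholesale, which is simpler but discards everything that was pending.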

View File

@@ -24,7 +24,7 @@ try:
from langchain_core.messages import AIMessageChunk
from langchain_core.runnables import Runnable
except ModuleNotFoundError as e:
logger.error("In order to use Langchain, you need to `pip install pipecat-ai[langchain]`. ")
logger.exception("In order to use Langchain, you need to `pip install pipecat-ai[langchain]`. ")
raise Exception(f"Missing module: {e}")
@@ -113,6 +113,6 @@ class LangchainProcessor(FrameProcessor):
except GeneratorExit:
logger.warning(f"{self} generator was closed prematurely")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.exception(f"{self} an unknown error occurred: {e}")
finally:
await self.push_frame(LLMFullResponseEndFrame())

View File

@@ -31,7 +31,6 @@ from typing import (
from loguru import logger
from pydantic import BaseModel, Field, PrivateAttr, ValidationError
from pipecat import version as pipecat_version
from pipecat.audio.utils import calculate_audio_volume
from pipecat.frames.frames import (
AggregatedTextFrame,
@@ -86,7 +85,7 @@ from pipecat.transports.base_output import BaseOutputTransport
from pipecat.transports.base_transport import BaseTransport
from pipecat.utils.string import match_endofsentence
RTVI_PROTOCOL_VERSION = "1.1.0"
RTVI_PROTOCOL_VERSION = "1.0.0"
RTVI_MESSAGE_LABEL = "rtvi-ai"
RTVIMessageLiteral = Literal["rtvi-ai"]
@@ -936,8 +935,8 @@ class RTVIObserverParams:
system_logs_enabled: Indicates if system logs should be sent.
errors_enabled: [Deprecated] Indicates if errors messages should be sent.
skip_aggregator_types: List of aggregation types to skip sending as tts/output messages.
Note: if using this to avoid sending secure information, be sure to also disable
bot_llm_enabled to avoid leaking through LLM messages.
Note: if using this to avoid sending secure information, be sure to also disable
bot_llm_enabled to avoid leaking through LLM messages.
bot_output_transforms: A list of callables to transform text just before sending it
to TTS. Each callable takes the aggregated text and its type, and returns the
transformed text. To register, provide a list of tuples of
@@ -1418,20 +1417,15 @@ class RTVIProcessor(FrameProcessor):
self._client_ready = True
await self._call_event_handler("on_client_ready")
async def set_bot_ready(self, about: Mapping[str, Any] = None):
"""Mark the bot as ready and send the bot-ready message.
Args:
about: Optional information about the bot to include in the ready message.
If left as None, the Pipecat library and version will be used.
"""
async def set_bot_ready(self):
"""Mark the bot as ready and send the bot-ready message."""
self._bot_ready = True
# Only call the (deprecated) _update_config method if we're using a
# config (which is deprecated). Otherwise we'd always print an
# unnecessary deprecation warning.
if self._config.config:
await self._update_config(self._config, False)
await self._send_bot_ready(about=about)
await self._send_bot_ready()
async def interrupt_bot(self):
"""Send a bot interruption frame upstream."""
@@ -1879,21 +1873,14 @@ class RTVIProcessor(FrameProcessor):
message = RTVIActionResponse(id=request_id, data=RTVIActionResponseData(result=result))
await self.push_transport_message(message)
async def _send_bot_ready(self, about: Mapping[str, Any] = None):
"""Send the bot-ready message to the client.
Args:
about: Optional information about the bot to include in the ready message.
If left as None, the pipecat library and version will be used.
"""
async def _send_bot_ready(self):
"""Send the bot-ready message to the client."""
config = None
if self._client_version and self._client_version[0] < 1:
config = self._config.config
if not about:
about = {"library": "pipecat-ai", "library_version": f"{pipecat_version()}"}
message = RTVIBotReady(
id=self._client_ready_id,
data=RTVIBotReadyData(version=RTVI_PROTOCOL_VERSION, about=about, config=config),
data=RTVIBotReadyData(version=RTVI_PROTOCOL_VERSION, config=config),
)
await self.push_transport_message(message)

View File

@@ -23,7 +23,7 @@ try:
from strands import Agent
from strands.multiagent.graph import Graph
except ModuleNotFoundError as e:
logger.error("In order to use Strands Agents, you need to `pip install strands-agents`.")
logger.exception("In order to use Strands Agents, you need to `pip install strands-agents`.")
raise Exception(f"Missing module: {e}")
@@ -143,7 +143,7 @@ class StrandsAgentsProcessor(FrameProcessor):
except GeneratorExit:
logger.warning(f"{self} generator was closed prematurely")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.exception(f"{self} an unknown error occurred: {e}")
finally:
if ttfb_tracking:
await self.stop_ttfb_metrics()

View File

@@ -20,17 +20,13 @@ from pipecat.frames.frames import (
EndFrame,
Frame,
InterruptionFrame,
LLMThoughtEndFrame,
LLMThoughtStartFrame,
LLMThoughtTextFrame,
ThoughtTranscriptionMessage,
TranscriptionFrame,
TranscriptionMessage,
TranscriptionUpdateFrame,
TTSTextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.utils.string import TextPartForConcatenation, concatenate_aggregated_text
from pipecat.utils.string import concatenate_aggregated_text
from pipecat.utils.time import time_now_iso8601
@@ -85,98 +81,98 @@ class UserTranscriptProcessor(BaseTranscriptProcessor):
class AssistantTranscriptProcessor(BaseTranscriptProcessor):
"""Processes assistant TTS text frames and LLM thought frames into timestamped messages.
"""Processes assistant TTS text frames into timestamped conversation messages.
This processor aggregates both TTS text frames and LLM thought frames into
complete utterances and thoughts, emitting them as transcript messages.
This processor aggregates TTS text frames into complete utterances and emits them as
transcript messages. Utterances are completed when:
An assistant utterance is completed when:
- The bot stops speaking (BotStoppedSpeakingFrame)
- The bot is interrupted (InterruptionFrame)
- The pipeline ends (EndFrame, CancelFrame)
A thought is completed when:
- The thought ends (LLMThoughtEndFrame)
- The bot is interrupted (InterruptionFrame)
- The pipeline ends (EndFrame, CancelFrame)
- The pipeline ends (EndFrame)
"""
def __init__(self, *, process_thoughts: bool = False, **kwargs):
def __init__(self, **kwargs):
"""Initialize processor with aggregation state.
Args:
process_thoughts: Whether to process LLM thought frames. Defaults to False.
**kwargs: Additional arguments passed to parent class.
"""
super().__init__(**kwargs)
self._current_text_parts: List[str] = []
self._aggregation_start_time: Optional[str] = None
self._process_thoughts = process_thoughts
self._current_assistant_text_parts: List[TextPartForConcatenation] = []
self._assistant_text_start_time: Optional[str] = None
# Whether to add spaces between text parts.
# (The use of this could be expanded to the UserTranscriptProcessor in
# the future if needed; currently the UserTranscriptProcessor assumes
# that user transcription frames do not need aggregation).
self._add_spaces = True
self._current_thought_parts: List[TextPartForConcatenation] = []
self._thought_start_time: Optional[str] = None
self._thought_active = False
async def _emit_aggregated_assistant_text(self):
async def _emit_aggregated_text(self):
"""Aggregates and emits text fragments as a transcript message.
This method aggregates text fragments that may arrive in multiple
TTSTextFrame instances and emits them as a single TranscriptionMessage.
This method uses a heuristic to automatically detect whether text fragments
contain embedded spacing (spaces at the beginning or end of fragments) or not,
and applies the appropriate joining strategy. It handles fragments from different
TTS services with different formatting patterns.
Examples:
Fragments with embedded spacing (concatenated)::
TTSTextFrame: ["Hello"]
TTSTextFrame: [" there"] # Leading space
TTSTextFrame: ["!"]
TTSTextFrame: [" How"] # Leading space
TTSTextFrame: ["'s"]
TTSTextFrame: [" it"] # Leading space
Result: "Hello there! How's it"
Fragments with trailing spaces (concatenated)::
TTSTextFrame: ["Hel"]
TTSTextFrame: ["lo "] # Trailing space
TTSTextFrame: ["to "] # Trailing space
TTSTextFrame: ["you"]
Result: "Hello to you"
Word-by-word fragments without spacing (joined with spaces)::
TTSTextFrame: ["Hello"]
TTSTextFrame: ["there"]
TTSTextFrame: ["how"]
TTSTextFrame: ["are"]
TTSTextFrame: ["you"]
Result: "Hello there how are you"
"""
if self._current_assistant_text_parts and self._assistant_text_start_time:
content = concatenate_aggregated_text(self._current_assistant_text_parts)
if self._current_text_parts and self._aggregation_start_time:
content = concatenate_aggregated_text(self._current_text_parts, self._add_spaces)
if content:
logger.trace(f"Emitting aggregated assistant message: {content}")
message = TranscriptionMessage(
role="assistant",
content=content,
timestamp=self._assistant_text_start_time,
timestamp=self._aggregation_start_time,
)
await self._emit_update([message])
else:
logger.trace("No content to emit after stripping whitespace")
# Reset aggregation state
self._current_assistant_text_parts = []
self._assistant_text_start_time = None
async def _emit_aggregated_thought(self):
"""Aggregates and emits thought text fragments as a thought transcript message.
This method aggregates thought fragments that may arrive in multiple
LLMThoughtTextFrame instances and emits them as a single ThoughtTranscriptionMessage.
"""
if self._current_thought_parts and self._thought_start_time:
content = concatenate_aggregated_text(self._current_thought_parts)
if content:
logger.trace(f"Emitting aggregated thought message: {content}")
message = ThoughtTranscriptionMessage(
content=content,
timestamp=self._thought_start_time,
)
await self._emit_update([message])
else:
logger.trace("No thought content to emit after stripping whitespace")
# Reset aggregation state
self._current_thought_parts = []
self._thought_start_time = None
self._thought_active = False
self._current_text_parts = []
self._aggregation_start_time = None
async def process_frame(self, frame: Frame, direction: FrameDirection):
"""Process frames into assistant conversation messages and thought messages.
"""Process frames into assistant conversation messages.
Handles different frame types:
- TTSTextFrame: Aggregates text for current utterance
- LLMThoughtStartFrame: Begins aggregating a new thought
- LLMThoughtTextFrame: Aggregates text for current thought
- LLMThoughtEndFrame: Completes current thought
- BotStoppedSpeakingFrame: Completes current utterance
- InterruptionFrame: Completes current utterance and thought due to interruption
- EndFrame: Completes current utterance and thought at pipeline end
- CancelFrame: Completes current utterance and thought due to cancellation
- InterruptionFrame: Completes current utterance due to interruption
- EndFrame: Completes current utterance at pipeline end
- CancelFrame: Completes current utterance due to cancellation
Args:
frame: Input frame to process.
@@ -188,53 +184,24 @@ class AssistantTranscriptProcessor(BaseTranscriptProcessor):
# Push frame first otherwise our emitted transcription update frame
# might get cleaned up.
await self.push_frame(frame, direction)
# Emit accumulated text and thought with interruptions
await self._emit_aggregated_assistant_text()
if self._process_thoughts and self._thought_active:
await self._emit_aggregated_thought()
elif isinstance(frame, LLMThoughtStartFrame):
# Start a new thought
if self._process_thoughts:
self._thought_active = True
self._thought_start_time = time_now_iso8601()
self._current_thought_parts = []
# Push frame.
await self.push_frame(frame, direction)
elif isinstance(frame, LLMThoughtTextFrame):
# Aggregate thought text if we have an active thought
if self._process_thoughts and self._thought_active:
self._current_thought_parts.append(
TextPartForConcatenation(
frame.text, includes_inter_part_spaces=frame.includes_inter_frame_spaces
)
)
# Push frame.
await self.push_frame(frame, direction)
elif isinstance(frame, LLMThoughtEndFrame):
# Emit accumulated thought when thought ends
if self._process_thoughts and self._thought_active:
await self._emit_aggregated_thought()
# Push frame.
await self.push_frame(frame, direction)
# Emit accumulated text with interruptions
await self._emit_aggregated_text()
elif isinstance(frame, TTSTextFrame):
# Start timestamp on first text part
if not self._assistant_text_start_time:
self._assistant_text_start_time = time_now_iso8601()
if not self._aggregation_start_time:
self._aggregation_start_time = time_now_iso8601()
self._current_assistant_text_parts.append(
TextPartForConcatenation(
frame.text, includes_inter_part_spaces=frame.includes_inter_frame_spaces
)
)
# Track whether we need to add spaces between text parts
# Assumption: we can just keep track of the latest frame's value
self._add_spaces = not frame.includes_inter_frame_spaces
self._current_text_parts.append(frame.text)
# Push frame.
await self.push_frame(frame, direction)
elif isinstance(frame, (BotStoppedSpeakingFrame, EndFrame)):
# Emit accumulated text when bot finishes speaking or pipeline ends.
await self._emit_aggregated_assistant_text()
# Emit accumulated thought at pipeline end if still active
if isinstance(frame, EndFrame) and self._process_thoughts and self._thought_active:
await self._emit_aggregated_thought()
await self._emit_aggregated_text()
# Push frame.
await self.push_frame(frame, direction)
else:
@@ -245,8 +212,7 @@ class TranscriptProcessor:
"""Factory for creating and managing transcript processors.
Provides unified access to user and assistant transcript processors
with shared event handling. The assistant processor handles both TTS text
and LLM thought frames.
with shared event handling.
Example::
@@ -261,7 +227,7 @@ class TranscriptProcessor:
llm,
tts,
transport.output(),
transcript.assistant(), # Assistant transcripts (including thoughts)
transcript.assistant_tts(), # Assistant transcripts
context_aggregator.assistant(),
]
)
@@ -271,14 +237,8 @@ class TranscriptProcessor:
print(f"New messages: {frame.messages}")
"""
def __init__(self, *, process_thoughts: bool = False):
"""Initialize factory.
Args:
process_thoughts: Whether the assistant processor should handle LLM thought
frames. Defaults to False.
"""
self._process_thoughts = process_thoughts
def __init__(self):
"""Initialize factory."""
self._user_processor = None
self._assistant_processor = None
self._event_handlers = {}
@@ -313,9 +273,7 @@ class TranscriptProcessor:
The assistant transcript processor instance.
"""
if self._assistant_processor is None:
self._assistant_processor = AssistantTranscriptProcessor(
process_thoughts=self._process_thoughts, **kwargs
)
self._assistant_processor = AssistantTranscriptProcessor(**kwargs)
# Apply any registered event handlers
for event_name, handler in self._event_handlers.items():
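
Note on the `_emit_aggregated_text` docstring in the hunk above: it describes a heuristic that checks whether TTS text fragments carry their own spacing and then either concatenates them or joins them with spaces. The sketch below illustrates that idea only. `join_fragments` is a hypothetical helper written for this note, not pipecat's `concatenate_aggregated_text`, whose real implementation is not shown in this diff.

```python
from typing import List


def join_fragments(parts: List[str]) -> str:
    """Join text fragments (hypothetical helper, for illustration only)."""
    # Assumption: if any fragment has leading or trailing whitespace,
    # the producer already embeds spacing, so plain concatenation is right.
    has_embedded_spacing = any(p != p.strip() for p in parts)
    if has_embedded_spacing:
        text = "".join(parts)
    else:
        # Word-by-word fragments without spacing: insert single spaces.
        text = " ".join(parts)
    return text.strip()


if __name__ == "__main__":
    print(join_fragments(["Hello", " there", "!", " How", "'s", " it"]))
    print(join_fragments(["Hel", "lo ", "to ", "you"]))
    print(join_fragments(["Hello", "there", "how", "are", "you"]))
```

The three calls mirror the three docstring examples. A per-fragment flag such as `includes_inter_frame_spaces` avoids guessing from the text itself, which is presumably why the code in this hunk tracks it.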


@@ -171,7 +171,6 @@ def _create_server_app(
esp32_mode: bool = False,
whatsapp_enabled: bool = False,
folder: Optional[str] = None,
dialin_enabled: bool = False,
):
"""Create FastAPI app with transport-specific routes."""
app = FastAPI()
@@ -190,7 +189,7 @@ def _create_server_app(
if whatsapp_enabled:
_setup_whatsapp_routes(app)
elif transport_type == "daily":
_setup_daily_routes(app, dialin_enabled=dialin_enabled)
_setup_daily_routes(app)
elif transport_type in TELEPHONY_TRANSPORTS:
_setup_telephony_routes(app, transport_type=transport_type, proxy=proxy)
else:
@@ -265,10 +264,7 @@ def _setup_webrtc_routes(
# Prepare runner arguments with the callback to run your bot
async def webrtc_connection_callback(connection):
bot_module = _get_bot_module()
runner_args = SmallWebRTCRunnerArguments(
webrtc_connection=connection, body=request.request_data
)
runner_args = SmallWebRTCRunnerArguments(webrtc_connection=connection)
background_tasks.add_task(bot_module.bot, runner_args)
# Delegate handling to SmallWebRTCRequestHandler
@@ -303,7 +299,7 @@ def _setup_webrtc_routes(
result: StartBotResult = {"sessionId": session_id}
if request_data.get("enableDefaultIceServers"):
result["iceConfig"] = IceConfig(
iceServers=[IceServer(urls=["stun:stun.l.google.com:19302"])]
iceServers=[IceServer(urls="stun:stun.l.google.com:19302")]
)
return result
@@ -330,8 +326,7 @@ def _setup_webrtc_routes(
type=request_data["type"],
pc_id=request_data.get("pc_id"),
restart_pc=request_data.get("restart_pc"),
request_data=request_data.get("request_data")
or request_data.get("requestData"),
request_data=request_data,
)
return await offer(webrtc_request, background_tasks)
elif request.method == HTTPMethod.PATCH.value:
@@ -534,13 +529,8 @@ def _setup_whatsapp_routes(app: FastAPI):
_add_lifespan_to_app(app, whatsapp_lifespan)
def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):
"""Set up Daily-specific routes.
Args:
app: FastAPI application instance
dialin_enabled: If True, adds /daily-dialin-webhook endpoint for PSTN dial-in handling
"""
def _setup_daily_routes(app: FastAPI):
"""Set up Daily-specific routes."""
@app.get("/")
async def create_room_and_start_agent():
@@ -645,116 +635,6 @@ def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):
return result
if dialin_enabled:
@app.post("/daily-dialin-webhook")
async def handle_dialin_webhook(request: Request):
"""Handle incoming Daily PSTN dial-in webhook.
This endpoint mimics Pipecat Cloud's dial-in webhook handler.
It receives Daily webhook data, creates a SIP-enabled room, and starts the bot.
Expected webhook payload::
{
"From": "+15551234567",
"To": "+15559876543",
"callId": "uuid-call-id",
"callDomain": "uuid-call-domain",
"sipHeaders": {...} // optional
}
Returns::
{
"dailyRoom": "https://...",
"dailyToken": "...",
"sessionId": "uuid"
}
"""
logger.debug("Received Daily dial-in webhook")
try:
data = await request.json()
logger.debug(f"Webhook data: {data}")
except Exception as e:
logger.error(f"Failed to parse webhook data: {e}")
raise HTTPException(status_code=400, detail="Invalid JSON payload")
# Handle webhook verification test (sent by Daily when configuring webhook)
if data.get("test") or data.get("Test"):
logger.debug("Webhook verification test received")
return {"status": "OK"}
# Validate required fields
if not all(key in data for key in ["From", "To", "callId", "callDomain"]):
raise HTTPException(
status_code=400,
detail="Missing required fields: From, To, callId, callDomain",
)
import aiohttp
from pipecat.runner.daily import configure
from pipecat.runner.types import DailyDialinRequest, DialinSettings
# Create Daily room with SIP capabilities
async with aiohttp.ClientSession() as session:
try:
room_config = await configure(session, sip_caller_phone=data.get("From"))
except Exception as e:
logger.error(f"Failed to create Daily room: {e}")
raise HTTPException(
status_code=500, detail=f"Failed to create Daily room: {str(e)}"
)
# Get Daily API URL from environment, fallback to production
daily_api_url = os.getenv("DAILY_API_URL", "https://api.daily.co/v1")
# Get Daily API key from environment
daily_api_key = os.getenv("DAILY_API_KEY")
if not daily_api_key:
logger.error("DAILY_API_KEY not found in environment")
raise HTTPException(
status_code=500, detail="DAILY_API_KEY not configured on server"
)
# Prepare dial-in settings matching Pipecat Cloud structure
dialin_settings = DialinSettings(
call_id=data.get("callId"),
call_domain=data.get("callDomain"),
To=data.get("To"),
From=data.get("From"),
sip_headers=data.get("sipHeaders"),
)
# Create request body matching Pipecat Cloud payload
request_body = DailyDialinRequest(
dialin_settings=dialin_settings,
daily_api_key=daily_api_key,
daily_api_url=daily_api_url,
)
# Start bot with dial-in context
bot_module = _get_bot_module()
runner_args = DailyRunnerArguments(
room_url=room_config.room_url,
token=room_config.token,
body=request_body.model_dump(),
)
asyncio.create_task(bot_module.bot(runner_args))
# Generate session ID
session_id = str(uuid.uuid4())
# Return response matching Pipecat Cloud format
return {
"dailyRoom": room_config.room_url,
"dailyToken": room_config.token,
"sessionId": session_id,
}
def _setup_telephony_routes(app: FastAPI, *, transport_type: str, proxy: str):
"""Set up telephony-specific routes."""
@@ -929,12 +809,6 @@ def main():
default=False,
help="Ensure required WhatsApp environment variables are present",
)
parser.add_argument(
"--dialin",
action="store_true",
default=False,
help="Enable Daily PSTN dial-in webhook handling (requires Daily transport)",
)
args = parser.parse_args()
@@ -954,11 +828,6 @@ def main():
logger.error("For ESP32, you need to specify `--host IP` so we can do SDP munging.")
return
# Validate dial-in requirements
if args.dialin and args.transport != "daily":
logger.error("--dialin flag only works with Daily transport (-t daily)")
return
# Log level
logger.remove()
logger.add(sys.stderr, level="TRACE" if args.verbose else "DEBUG")
@@ -987,13 +856,7 @@ def main():
elif args.transport == "daily":
print()
print(f"🚀 Bot ready!")
if args.dialin:
print(
f" → Daily dial-in webhook: http://{args.host}:{args.port}/daily-dialin-webhook"
)
print(f" → Configure this URL in your Daily phone number settings")
else:
print(f" → Open http://{args.host}:{args.port} in your browser to start a session")
print(f" → Open http://{args.host}:{args.port} in your browser to start a session")
print()
RUNNER_DOWNLOADS_FOLDER = args.folder
@@ -1008,7 +871,6 @@ def main():
esp32_mode=args.esp32,
whatsapp_enabled=args.whatsapp,
folder=args.folder,
dialin_enabled=args.dialin,
)
# Run the server
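
Note on the `/daily-dialin-webhook` handler in this file's diff: it first short-circuits Daily's webhook verification ping, then rejects payloads missing any of `From`, `To`, `callId`, or `callDomain`. A minimal sketch of that ordering follows. `check_dialin_payload` and its string return values are hypothetical, chosen so the logic can run without FastAPI; the real handler raises `HTTPException` instead.

```python
from typing import Any, Dict

# Field names taken from the webhook docstring in the diff.
REQUIRED_FIELDS = ("From", "To", "callId", "callDomain")


def check_dialin_payload(data: Dict[str, Any]) -> str:
    """Classify a webhook payload (hypothetical helper, for illustration)."""
    # Verification pings are answered before any field validation,
    # because they do not carry call details.
    if data.get("test") or data.get("Test"):
        return "test"
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        return "missing: " + ", ".join(missing)
    return "ok"


if __name__ == "__main__":
    print(check_dialin_payload({"test": True}))
    print(check_dialin_payload({"From": "+15551234567", "To": "+15559876543"}))
```

Checking the test flag first matters: a verification ping would otherwise be rejected with a 400 for missing fields.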


@@ -11,48 +11,9 @@ information to bot functions.
"""
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
from typing import Any, Optional
from fastapi import WebSocket
from pydantic import BaseModel
class DialinSettings(BaseModel):
"""Dial-in settings from the Daily webhook.
This model matches the structure sent by Pipecat Cloud and Daily.co webhooks
for incoming PSTN/SIP calls.
Parameters:
call_id: Unique identifier for the call (UUID representing sessionId in SIP Network)
call_domain: Daily domain for the call (UUID representing Daily Domain on SIP Network)
To: The dialed phone number (optional)
From: The caller's phone number (optional)
sip_headers: Optional SIP headers from the call
"""
call_id: str
call_domain: str
To: Optional[str] = None
From: Optional[str] = None
sip_headers: Optional[Dict[str, str]] = None
class DailyDialinRequest(BaseModel):
"""Request data for Daily PSTN dial-in requests.
This is the structure passed in runner_args.body for dial-in calls.
It matches the payload structure from Pipecat Cloud's dial-in webhook handler.
Parameters:
dialin_settings: Dial-in configuration including call_id, call_domain, To, From
daily_api_key: Daily API key for pinlessCallUpdate (required for dial-in)
daily_api_url: Daily API URL (staging or production)
"""
dialin_settings: DialinSettings
daily_api_key: str
daily_api_url: str
@dataclass


@@ -281,14 +281,6 @@ async def maybe_capture_participant_camera(
except ImportError:
pass
try:
from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport
if isinstance(transport, SmallWebRTCTransport):
await transport.capture_participant_video(video_source="camera")
except ImportError:
pass
async def maybe_capture_participant_screen(
transport: BaseTransport, client: Any, framerate: int = 0
@@ -311,14 +303,6 @@ async def maybe_capture_participant_screen(
except ImportError:
pass
try:
from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport
if isinstance(transport, SmallWebRTCTransport):
await transport.capture_participant_video(video_source="screenVideo")
except ImportError:
pass
def _smallwebrtc_sdp_cleanup_ice_candidates(text: str, pattern: str) -> str:
"""Clean up ICE candidates in SDP text for SmallWebRTC.


@@ -199,7 +199,7 @@ class PlivoFrameSerializer(FrameSerializer):
)
except Exception as e:
logger.error(f"Failed to hang up Plivo call: {e}")
logger.exception(f"Failed to hang up Plivo call: {e}")
async def deserialize(self, data: str | bytes) -> Frame | None:
"""Deserializes Plivo WebSocket data to Pipecat frames.


@@ -225,7 +225,7 @@ class TelnyxFrameSerializer(FrameSerializer):
)
except Exception as e:
logger.error(f"Failed to hang up Telnyx call: {e}")
logger.exception(f"Failed to hang up Telnyx call: {e}")
async def deserialize(self, data: str | bytes) -> Frame | None:
"""Deserializes Telnyx WebSocket data to Pipecat frames.


@@ -236,7 +236,7 @@ class TwilioFrameSerializer(FrameSerializer):
)
except Exception as e:
logger.error(f"Failed to hang up Twilio call: {e}")
logger.exception(f"Failed to hang up Twilio call: {e}")
async def deserialize(self, data: str | bytes) -> Frame | None:
"""Deserializes Twilio WebSocket data to Pipecat frames.


@@ -166,6 +166,6 @@ class AIService(FrameProcessor):
async for f in generator:
if f:
if isinstance(f, ErrorFrame):
await self.push_error_frame(f)
await self.push_error(f)
else:
await self.push_frame(f)


@@ -17,7 +17,7 @@ import io
import json
import re
from dataclasses import dataclass
from typing import Any, Dict, List, Literal, Optional, Union
from typing import Any, Dict, List, Optional, Union
import httpx
from loguru import logger
@@ -40,9 +40,6 @@ from pipecat.frames.frames import (
LLMFullResponseStartFrame,
LLMMessagesFrame,
LLMTextFrame,
LLMThoughtEndFrame,
LLMThoughtStartFrame,
LLMThoughtTextFrame,
LLMUpdateSettingsFrame,
UserImageRawFrame,
)
@@ -113,24 +110,6 @@ class AnthropicLLMService(LLMService):
# Overriding the default adapter to use the Anthropic one.
adapter_class = AnthropicLLMAdapter
class ThinkingConfig(BaseModel):
"""Configuration for extended thinking.
Parameters:
type: Type of thinking mode (currently only "enabled" or "disabled").
budget_tokens: Maximum number of tokens for thinking.
With today's models, the minimum is 1024.
Only allowed if type is "enabled".
"""
# Why `| str` here? To not break compatibility in case Anthropic adds
# more types in the future.
type: Literal["enabled", "disabled"] | str
# Why not enforce minimum of 1024 here? To not break compatibility in
# case Anthropic changes this requirement in the future.
budget_tokens: int
class InputParams(BaseModel):
"""Input parameters for Anthropic model inference.
@@ -145,10 +124,6 @@ class AnthropicLLMService(LLMService):
temperature: Sampling temperature between 0.0 and 1.0.
top_k: Top-k sampling parameter.
top_p: Top-p sampling parameter between 0.0 and 1.0.
thinking: Extended thinking configuration.
Enabling extended thinking causes the model to spend more time "thinking" before responding.
It also causes this service to emit LLMThought*Frames during response generation.
Extended thinking is disabled by default.
extra: Additional parameters to pass to the API.
"""
@@ -158,9 +133,6 @@ class AnthropicLLMService(LLMService):
temperature: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=1.0)
top_k: Optional[int] = Field(default_factory=lambda: NOT_GIVEN, ge=0)
top_p: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=1.0)
thinking: Optional["AnthropicLLMService.ThinkingConfig"] = Field(
default_factory=lambda: NOT_GIVEN
)
extra: Optional[Dict[str, Any]] = Field(default_factory=dict)
def model_post_init(self, __context):
@@ -219,7 +191,6 @@ class AnthropicLLMService(LLMService):
"temperature": params.temperature,
"top_k": params.top_k,
"top_p": params.top_p,
"thinking": params.thinking,
"extra": params.extra if isinstance(params.extra, dict) else {},
}
@@ -267,43 +238,28 @@ class AnthropicLLMService(LLMService):
"""
messages = []
system = NOT_GIVEN
tools = []
if isinstance(context, LLMContext):
adapter: AnthropicLLMAdapter = self.get_llm_adapter()
invocation_params = adapter.get_llm_invocation_params(
params = adapter.get_llm_invocation_params(
context, enable_prompt_caching=self._settings["enable_prompt_caching"]
)
messages = invocation_params["messages"]
system = invocation_params["system"]
tools = invocation_params["tools"]
messages = params["messages"]
system = params["system"]
else:
context = AnthropicLLMContext.upgrade_to_anthropic(context)
messages = context.messages
system = getattr(context, "system", NOT_GIVEN)
tools = context.tools or []
# Build params using the same method as streaming completions
params = {
"model": self.model_name,
"max_tokens": self._settings["max_tokens"],
"stream": False,
"temperature": self._settings["temperature"],
"top_k": self._settings["top_k"],
"top_p": self._settings["top_p"],
"messages": messages,
"system": system,
"tools": tools,
"betas": ["interleaved-thinking-2025-05-14"],
}
if self._settings["thinking"]:
params["thinking"] = self._settings["thinking"].model_dump(exclude_unset=True)
params.update(self._settings["extra"])
# LLM completion
response = await self._client.beta.messages.create(**params)
response = await self._client.messages.create(
model=self.model_name,
messages=messages,
system=system,
max_tokens=8192,
stream=False,
)
return next((block.text for block in response.content if hasattr(block, "text")), None)
return response.content[0].text
def create_context_aggregator(
self,
@@ -398,21 +354,12 @@ class AnthropicLLMService(LLMService):
"top_p": self._settings["top_p"],
}
# Add thinking parameter if set
if self._settings["thinking"]:
params["thinking"] = self._settings["thinking"].model_dump(exclude_unset=True)
# Messages, system, tools
params.update(params_from_context)
params.update(self._settings["extra"])
# "Interleaved thinking" needed to allow thinking between sequences
# of function calls, when extended thinking is enabled.
# Note that this requires us to use `client.beta`, below.
params.update({"betas": ["interleaved-thinking-2025-05-14"]})
response = await self._create_message_stream(self._client.beta.messages.create, params)
response = await self._create_message_stream(self._client.messages.create, params)
await self.stop_ttfb_metrics()
@@ -426,28 +373,19 @@ class AnthropicLLMService(LLMService):
if event.type == "content_block_delta":
if hasattr(event.delta, "text"):
await self.push_frame(LLMTextFrame(event.delta.text))
frame = LLMTextFrame(event.delta.text)
frame.includes_inter_frame_spaces = True
await self.push_frame(frame)
completion_tokens_estimate += self._estimate_tokens(event.delta.text)
elif hasattr(event.delta, "partial_json") and tool_use_block:
json_accumulator += event.delta.partial_json
completion_tokens_estimate += self._estimate_tokens(
event.delta.partial_json
)
elif hasattr(event.delta, "thinking"):
await self.push_frame(LLMThoughtTextFrame(text=event.delta.thinking))
elif hasattr(event.delta, "signature"):
await self.push_frame(LLMThoughtEndFrame(signature=event.delta.signature))
elif event.type == "content_block_start":
if event.content_block.type == "tool_use":
tool_use_block = event.content_block
json_accumulator = ""
elif event.content_block.type == "thinking":
await self.push_frame(
LLMThoughtStartFrame(
append_to_context=True,
llm=self.get_llm_adapter().id_for_llm_specific_messages,
)
)
elif (
event.type == "message_delta"
and hasattr(event.delta, "stop_reason")
@@ -522,7 +460,8 @@ class AnthropicLLMService(LLMService):
except httpx.TimeoutException:
await self._call_event_handler("on_completion_timeout")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.exception(f"{self} exception: {e}")
await self.push_error(ErrorFrame(f"{e}"))
finally:
await self.stop_processing_metrics()
await self.push_frame(LLMFullResponseEndFrame())
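
Note on the two return statements in the `@@ -267,43 +238,28 @@` hunk above: `response.content[0].text` assumes the first content block is a text block, while the `next(...)` form scans for the first block that has a `text` attribute and falls back to `None`. The sketch below shows the difference using stand-in objects. The `SimpleNamespace` blocks are an assumption made for illustration and are not the Anthropic SDK's real block classes.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence


def first_text(content: Sequence[Any]) -> Optional[str]:
    """Return the text of the first block that has a `text` attribute."""
    return next((block.text for block in content if hasattr(block, "text")), None)


# Stand-ins for response content blocks (hypothetical shapes).
thinking_block = SimpleNamespace(type="thinking", thinking="...")
text_block = SimpleNamespace(type="text", text="Hello")

if __name__ == "__main__":
    content = [thinking_block, text_block]
    print(first_text(content))
    try:
        print(content[0].text)
    except AttributeError as e:
        # Indexing the first block fails when it is not a text block.
        print(f"content[0].text failed: {e}")
```

The scanning form also returns `None` for an empty content list, where indexing would raise `IndexError`.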


@@ -17,10 +17,11 @@ from urllib.parse import urlencode
from loguru import logger
from pipecat import version as pipecat_version
from pipecat import __version__ as pipecat_version
from pipecat.frames.frames import (
CancelFrame,
EndFrame,
ErrorFrame,
Frame,
InterimTranscriptionFrame,
StartFrame,
@@ -29,7 +30,7 @@ from pipecat.frames.frames import (
UserStoppedSpeakingFrame,
)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.stt_service import WebsocketSTTService
from pipecat.services.stt_service import STTService
from pipecat.transcriptions.language import Language
from pipecat.utils.time import time_now_iso8601
from pipecat.utils.tracing.service_decorators import traced_stt
@@ -43,15 +44,15 @@ from .models import (
)
try:
import websockets
from websockets.asyncio.client import connect as websocket_connect
from websockets.protocol import State
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error('In order to use AssemblyAI, you need to `pip install "pipecat-ai[assemblyai]"`.')
raise Exception(f"Missing module: {e}")
class AssemblyAISTTService(WebsocketSTTService):
class AssemblyAISTTService(STTService):
"""AssemblyAI real-time speech-to-text service.
Provides real-time speech transcription using AssemblyAI's WebSocket API.
@@ -79,14 +80,15 @@ class AssemblyAISTTService(WebsocketSTTService):
vad_force_turn_endpoint: Whether to force turn endpoint on VAD stop. Defaults to True.
**kwargs: Additional arguments passed to parent STTService class.
"""
super().__init__(sample_rate=connection_params.sample_rate, **kwargs)
self._api_key = api_key
self._language = language
self._api_endpoint_base_url = api_endpoint_base_url
self._connection_params = connection_params
self._vad_force_turn_endpoint = vad_force_turn_endpoint
super().__init__(sample_rate=self._connection_params.sample_rate, **kwargs)
self._websocket = None
self._termination_event = asyncio.Event()
self._received_termination = False
self._connected = False
@@ -112,7 +114,7 @@ class AssemblyAISTTService(WebsocketSTTService):
frame: Start frame to begin processing.
"""
await super().start(frame)
self._chunk_size_bytes = int(self._chunk_size_ms * self.sample_rate * 2 / 1000)
self._chunk_size_bytes = int(self._chunk_size_ms * self._sample_rate * 2 / 1000)
await self._connect()
async def stop(self, frame: EndFrame):
@@ -144,11 +146,10 @@ class AssemblyAISTTService(WebsocketSTTService):
"""
self._audio_buffer.extend(audio)
if self._websocket and self._websocket.state is State.OPEN:
while len(self._audio_buffer) >= self._chunk_size_bytes:
chunk = bytes(self._audio_buffer[: self._chunk_size_bytes])
self._audio_buffer = self._audio_buffer[self._chunk_size_bytes :]
await self._websocket.send(chunk)
while len(self._audio_buffer) >= self._chunk_size_bytes:
chunk = bytes(self._audio_buffer[: self._chunk_size_bytes])
self._audio_buffer = self._audio_buffer[self._chunk_size_bytes :]
await self._websocket.send(chunk)
yield None
@@ -163,11 +164,7 @@ class AssemblyAISTTService(WebsocketSTTService):
if isinstance(frame, UserStartedSpeakingFrame):
await self.start_ttfb_metrics()
elif isinstance(frame, UserStoppedSpeakingFrame):
if (
self._vad_force_turn_endpoint
and self._websocket
and self._websocket.state is State.OPEN
):
if self._vad_force_turn_endpoint:
await self._websocket.send(json.dumps({"type": "ForceEndpoint"}))
await self.start_processing_metrics()
@@ -194,20 +191,28 @@ class AssemblyAISTTService(WebsocketSTTService):
return self._api_endpoint_base_url
async def _connect(self):
"""Connect to the AssemblyAI service.
try:
ws_url = self._build_ws_url()
headers = {
"Authorization": self._api_key,
"User-Agent": f"AssemblyAI/1.0 (integration=Pipecat/{pipecat_version})",
}
self._websocket = await websocket_connect(
ws_url,
additional_headers=headers,
)
self._connected = True
self._receive_task = self.create_task(self._receive_task_handler())
Establishes websocket connection and starts receive task.
"""
await self._connect_websocket()
if self._websocket and not self._receive_task:
self._receive_task = self.create_task(self._receive_task_handler(self._report_error))
await self._call_event_handler("on_connected")
except Exception as e:
logger.error(f"{self} exception: {e}")
self._connected = False
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
raise
async def _disconnect(self):
"""Disconnect from the AssemblyAI service.
Sends termination message, waits for acknowledgment, and cleans up.
"""
"""Disconnect from AssemblyAI WebSocket and wait for termination message."""
if not self._connected or not self._websocket:
return
@@ -215,96 +220,55 @@ class AssemblyAISTTService(WebsocketSTTService):
self._termination_event.clear()
self._received_termination = False
if self._websocket.state is State.OPEN:
# Send any remaining audio
if len(self._audio_buffer) > 0:
await self._websocket.send(bytes(self._audio_buffer))
self._audio_buffer.clear()
if len(self._audio_buffer) > 0:
await self._websocket.send(bytes(self._audio_buffer))
self._audio_buffer.clear()
try:
await self._websocket.send(json.dumps({"type": "Terminate"}))
# Send termination message and wait for acknowledgment
try:
await self._websocket.send(json.dumps({"type": "Terminate"}))
await asyncio.wait_for(self._termination_event.wait(), timeout=5.0)
except asyncio.TimeoutError:
logger.warning("Timed out waiting for termination message from server")
try:
await asyncio.wait_for(self._termination_event.wait(), timeout=5.0)
except asyncio.TimeoutError:
logger.warning("Timed out waiting for termination message from server")
except Exception as e:
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
finally:
# Clean up tasks and connection
if self._receive_task:
await self.cancel_task(self._receive_task)
self._receive_task = None
await self._disconnect_websocket()
await self._websocket.close()
async def _connect_websocket(self):
"""Establish the websocket connection to AssemblyAI."""
try:
if self._websocket and self._websocket.state is State.OPEN:
return
logger.debug("Connecting to AssemblyAI WebSocket")
ws_url = self._build_ws_url()
headers = {
"Authorization": self._api_key,
"User-Agent": f"AssemblyAI/1.0 (integration=Pipecat/{pipecat_version()})",
}
self._websocket = await websocket_connect(
ws_url,
additional_headers=headers,
)
self._connected = True
await self._call_event_handler("on_connected")
logger.debug(f"{self} Connected to AssemblyAI WebSocket")
except Exception as e:
self._connected = False
await self.push_error(error_msg=f"Unable to connect to AssemblyAI: {e}", exception=e)
raise
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
async def _disconnect_websocket(self):
"""Close the websocket connection to AssemblyAI."""
try:
if self._websocket:
logger.debug("Disconnecting from AssemblyAI WebSocket")
await self._websocket.close()
except Exception as e:
await self.push_error(error_msg=f"Error closing websocket: {e}", exception=e)
finally:
self._websocket = None
self._connected = False
self._receive_task = None
await self._call_event_handler("on_disconnected")
def _get_websocket(self):
"""Get the current WebSocket connection.
async def _receive_task_handler(self):
"""Handle incoming WebSocket messages."""
try:
while self._connected:
try:
message = await self._websocket.recv()
data = json.loads(message)
await self._handle_message(data)
except websockets.exceptions.ConnectionClosedOK:
break
except Exception as e:
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
break
Returns:
The WebSocket connection.
Raises:
Exception: If WebSocket is not connected.
"""
if self._websocket:
return self._websocket
raise Exception("Websocket not connected")
async def _receive_messages(self):
"""Receive and process websocket messages.
Continuously processes messages from the websocket connection.
"""
async for message in self._get_websocket():
try:
data = json.loads(message)
await self._handle_message(data)
except json.JSONDecodeError:
logger.warning(f"Received non-JSON message: {message}")
except Exception as e:
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
def _parse_message(self, message: Dict[str, Any]) -> BaseMessage:
"""Parse a raw message into the appropriate message type."""
@@ -333,7 +297,8 @@ class AssemblyAISTTService(WebsocketSTTService):
elif isinstance(parsed_message, TerminationMessage):
await self._handle_termination(parsed_message)
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
async def _handle_termination(self, message: TerminationMessage):
"""Handle termination message."""


@@ -56,17 +56,6 @@ def language_to_async_language(language: Language) -> Optional[str]:
Language.ES: "es",
Language.DE: "de",
Language.IT: "it",
Language.PT: "pt",
Language.NL: "nl",
Language.AR: "ar",
Language.RU: "ru",
Language.RO: "ro",
Language.JA: "ja",
Language.HE: "he",
Language.HY: "hy",
Language.TR: "tr",
Language.HI: "hi",
Language.ZH: "zh",
}
return resolve_language(language, LANGUAGE_MAP, use_base_code=True)
@@ -85,7 +74,7 @@ class AsyncAITTSService(InterruptibleTTSService):
language: Language to use for synthesis.
"""
language: Optional[Language] = None
language: Optional[Language] = Language.EN
def __init__(
self,
@@ -94,7 +83,7 @@ class AsyncAITTSService(InterruptibleTTSService):
voice_id: str,
version: str = "v1",
url: str = "wss://api.async.ai/text_to_speech/websocket/ws",
model: str = "asyncflow_multilingual_v1.0",
model: str = "asyncflow_v2.0",
sample_rate: Optional[int] = None,
encoding: str = "pcm_s16le",
container: str = "raw",
@@ -110,7 +99,7 @@ class AsyncAITTSService(InterruptibleTTSService):
https://docs.async.ai/list-voices-16699698e0
version: Async API version.
url: WebSocket URL for Async TTS API.
model: TTS model to use (e.g., "asyncflow_multilingual_v1.0").
model: TTS model to use (e.g., "asyncflow_v2.0").
sample_rate: Audio sample rate.
encoding: Audio encoding format.
container: Audio container format.
@@ -139,7 +128,7 @@ class AsyncAITTSService(InterruptibleTTSService):
},
"language": self.language_to_service_language(params.language)
if params.language
else None,
else "en",
}
self.set_model_name(model)
@@ -157,6 +146,15 @@ class AsyncAITTSService(InterruptibleTTSService):
"""
return True
@property
def includes_inter_frame_spaces(self) -> bool:
"""Indicates that AsyncAI TTSTextFrames include necessary inter-frame spaces.
Returns:
True, indicating that AsyncAI's text frames include necessary inter-frame spaces.
"""
return True
def language_to_service_language(self, language: Language) -> Optional[str]:
"""Convert a Language enum to Async language format.
@@ -239,7 +237,8 @@ class AsyncAITTSService(InterruptibleTTSService):
await self._call_event_handler("on_connected")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
self._websocket = None
await self._call_event_handler("on_connection_error", f"{e}")
@@ -251,7 +250,8 @@ class AsyncAITTSService(InterruptibleTTSService):
logger.debug("Disconnecting from Async")
await self._websocket.close()
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
finally:
self._websocket = None
self._started = False
@@ -296,11 +296,12 @@ class AsyncAITTSService(InterruptibleTTSService):
)
await self.push_frame(frame)
elif msg.get("error_code"):
logger.error(f"{self} error: {msg}")
await self.push_frame(TTSStoppedFrame())
await self.stop_all_metrics()
await self.push_error(error_msg=f"Error: {msg['message']}")
await self.push_error(ErrorFrame(error=f"{self} error: {msg['message']}"))
else:
await self.push_error(error_msg=f"Unknown message type: {msg}")
logger.error(f"{self} error, unknown message type: {msg}")
async def _keepalive_task_handler(self):
"""Send periodic keepalive messages to maintain WebSocket connection."""
@@ -343,14 +344,16 @@ class AsyncAITTSService(InterruptibleTTSService):
await self._get_websocket().send(msg)
await self.start_tts_usage_metrics(text)
except Exception as e:
yield ErrorFrame(error=f"Unknown error occurred: {e}")
logger.error(f"{self} exception: {e}")
yield ErrorFrame(error=f"{self} error: {e}")
yield TTSStoppedFrame()
await self._disconnect()
await self._connect()
return
yield None
except Exception as e:
yield ErrorFrame(error=f"Unknown error occurred: {e}")
logger.error(f"{self} exception: {e}")
yield ErrorFrame(error=f"{self} error: {e}")
class AsyncAIHttpTTSService(TTSService):
@@ -368,7 +371,7 @@ class AsyncAIHttpTTSService(TTSService):
language: Language to use for synthesis.
"""
language: Optional[Language] = None
language: Optional[Language] = Language.EN
def __init__(
self,
@@ -376,7 +379,7 @@ class AsyncAIHttpTTSService(TTSService):
api_key: str,
voice_id: str,
aiohttp_session: aiohttp.ClientSession,
model: str = "asyncflow_multilingual_v1.0",
model: str = "asyncflow_v2.0",
url: str = "https://api.async.ai",
version: str = "v1",
sample_rate: Optional[int] = None,
@@ -391,7 +394,7 @@ class AsyncAIHttpTTSService(TTSService):
api_key: Async API key.
voice_id: ID of the voice to use for synthesis.
aiohttp_session: An aiohttp session for making HTTP requests.
model: TTS model to use (e.g., "asyncflow_multilingual_v1.0").
model: TTS model to use (e.g., "asyncflow_v2.0").
url: Base URL for Async API.
version: API version string for Async API.
sample_rate: Audio sample rate.
@@ -415,7 +418,7 @@ class AsyncAIHttpTTSService(TTSService):
},
"language": self.language_to_service_language(params.language)
if params.language
else None,
else "en",
}
self.set_voice(voice_id)
self.set_model_name(model)
@@ -430,6 +433,15 @@ class AsyncAIHttpTTSService(TTSService):
"""
return True
@property
def includes_inter_frame_spaces(self) -> bool:
"""Indicates that AsyncAI TTSTextFrames include necessary inter-frame spaces.
Returns:
True, indicating that AsyncAI's text frames include necessary inter-frame spaces.
"""
return True
def language_to_service_language(self, language: Language) -> Optional[str]:
"""Convert a Language enum to Async language format.
@@ -483,7 +495,8 @@ class AsyncAIHttpTTSService(TTSService):
async with self._session.post(url, json=payload, headers=headers) as response:
if response.status != 200:
error_text = await response.text()
await self.push_error(error_msg=f"Async API error: {error_text}")
logger.error(f"Async API error: {error_text}")
await self.push_error(ErrorFrame(error=f"Async API error: {error_text}"))
raise Exception(f"Async API returned status {response.status}: {error_text}")
audio_data = await response.read()
@@ -499,7 +512,8 @@ class AsyncAIHttpTTSService(TTSService):
yield frame
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.error(f"{self} exception: {e}")
await self.push_error(ErrorFrame(error=f"{self} error: {e}"))
finally:
await self.stop_ttfb_metrics()
yield TTSStoppedFrame()
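
Reviewer note: the AsyncAI hunks above differ in whether an unset `params.language` becomes `None` or `"en"`, and in how many entries the language map carries. A minimal sketch of the fallback behavior follows. It assumes plain string tags instead of pipecat's `Language` enum, and `to_service_language` is a hypothetical stand-in, not the library's `resolve_language`.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the service's LANGUAGE_MAP. The real map is keyed
# by pipecat's Language enum, not by plain strings.
LANGUAGE_MAP: Dict[str, str] = {"en": "en", "es": "es", "de": "de", "it": "it"}


def to_service_language(language: Optional[str], default: str = "en") -> str:
    """Map a language tag to a service code, falling back to a default.

    Assumption: unsupported tags fall back to `default`; the real helper may
    behave differently for unknown languages.
    """
    if not language:
        # Mirrors the `... if params.language else "en"` side of the hunk.
        return default
    tag = language.lower()
    if tag in LANGUAGE_MAP:
        return LANGUAGE_MAP[tag]
    # Try the base code, e.g. "es-MX" -> "es".
    base = tag.split("-")[0]
    return LANGUAGE_MAP.get(base, default)
```

With this shape, callers never send a null language to the service, which is the practical difference between the two sides of the hunk.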


@@ -8,10 +8,8 @@ import sys
from pipecat.services import DeprecatedModuleProxy
from .agent_core import *
from .llm import *
from .nova_sonic import *
from .sagemaker import *
from .stt import *
from .tts import *


@@ -1,258 +0,0 @@
#
# Copyright (c) 2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""AWS AgentCore Processor Module.
This module defines the AWSAgentCoreProcessor, which invokes agents hosted on
Amazon Bedrock AgentCore Runtime and streams their responses as LLMTextFrames.
"""
import asyncio
import json
import os
from typing import Callable, Optional
import aioboto3
from loguru import logger
from pipecat.frames.frames import (
Frame,
LLMContextFrame,
LLMFullResponseEndFrame,
LLMFullResponseStartFrame,
LLMTextFrame,
)
from pipecat.processors.aggregators.llm_context import LLMContext, LLMSpecificMessage
from pipecat.processors.aggregators.openai_llm_context import (
OpenAILLMContext,
OpenAILLMContextFrame,
)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
def default_context_to_payload_transformer(
context: LLMContext | OpenAILLMContext,
) -> Optional[str]:
"""Default transformer to create AgentCore payload from LLM context.
Extracts the latest user or system message text and wraps it in {"prompt": "<text>"}.
Args:
context: The LLM context containing conversation messages.
Returns:
A JSON string payload for AgentCore, or None if no valid message found.
"""
messages = context.messages
if not messages:
return None
last_message = messages[-1]
if isinstance(last_message, LLMSpecificMessage) or last_message.get("role") not in (
"user",
"system",
):
return None
content = last_message.get("content")
if not content:
return None
if isinstance(content, str):
prompt = content
elif isinstance(content, list):
prompt = " ".join([part.get("text", "") for part in content])
else:
return None
return json.dumps({"prompt": prompt})
def default_response_to_output_transformer(response_line: str) -> Optional[str]:
"""Default transformer to extract output text from AgentCore response.
Expects responses with {"response": "<text>"} format.
Args:
response_line: The raw response line from AgentCore (without "data: " prefix).
Returns:
The extracted output text, or None if no text found.
"""
response_json = json.loads(response_line)
return response_json.get("response")
class AWSAgentCoreProcessor(FrameProcessor):
"""Processor that runs an Amazon Bedrock AgentCore agent.
Input:
- LLMContextFrame: Supplies a context used to invoke the agent.
Output:
- LLMTextFrame: The agent's text response(s).
A single agent invocation may result in multiple text frames.
This processor transforms the input context to a payload for the AgentCore
agent, and transforms the agent's response(s) into output text frame(s). Both
mappings are configurable via transformers. Below is the default behavior.
Input transformer (context_to_payload_transformer):
- Grabs the latest user or system message (if it's the latest message)
- Extracts its text content
- Constructs a payload that looks like {"prompt": "<text>"}
Output transformer (response_to_output_transformer):
- Expects responses that look like {"response": "<text>"}
- Extracts the text for use in the LLMTextFrame(s)
"""
def __init__(
self,
agentArn: str,
aws_access_key: Optional[str] = None,
aws_secret_key: Optional[str] = None,
aws_session_token: Optional[str] = None,
aws_region: Optional[str] = None,
context_to_payload_transformer: Optional[
Callable[[LLMContext | OpenAILLMContext], Optional[str]]
] = None,
response_to_output_transformer: Optional[Callable[[str], Optional[str]]] = None,
**kwargs,
):
"""Initialize the AWS AgentCore processor.
Args:
agentArn: The Amazon Resource Name (ARN) of the agent.
aws_access_key: AWS access key ID. If None, uses default credentials.
aws_secret_key: AWS secret access key. If None, uses default credentials.
aws_session_token: AWS session token for temporary credentials.
aws_region: AWS region.
context_to_payload_transformer: Optional callable to transform
LLMContext into AgentCore payload string. If None, uses
default_context_to_payload_transformer.
response_to_output_transformer: Optional callable to extract output text
from AgentCore response. If None, uses
default_response_to_output_transformer.
**kwargs: Additional arguments passed to parent FrameProcessor.
"""
super().__init__(**kwargs)
self._agentArn = agentArn
self._aws_session = aioboto3.Session()
# Store AWS session parameters for creating client in async context
self._aws_params = {
"aws_access_key_id": aws_access_key or os.getenv("AWS_ACCESS_KEY_ID"),
"aws_secret_access_key": aws_secret_key or os.getenv("AWS_SECRET_ACCESS_KEY"),
"aws_session_token": aws_session_token or os.getenv("AWS_SESSION_TOKEN"),
"region_name": aws_region or os.getenv("AWS_REGION", "us-east-1"),
}
# Set transformers with defaults
self._context_to_payload_transformer = (
context_to_payload_transformer or default_context_to_payload_transformer
)
self._response_to_output_transformer = (
response_to_output_transformer or default_response_to_output_transformer
)
# State for managing output response bookends
self._output_response_open = False
self._last_text_frame_time: Optional[float] = None
self._close_task: Optional[asyncio.Task] = None
self._output_response_timeout = 1.0 # seconds
async def _close_output_response_after_timeout(self):
"""Close the output response after timeout if no new text frames arrive."""
await asyncio.sleep(self._output_response_timeout)
if self._output_response_open:
self._output_response_open = False
await self.push_frame(LLMFullResponseEndFrame())
async def _push_text_frame(self, text: str):
"""Push a text frame, managing output response bookends."""
# Cancel any pending close task
if self._close_task and not self._close_task.done():
await self.cancel_task(self._close_task)
# Open output response if needed
if not self._output_response_open:
await self.push_frame(LLMFullResponseStartFrame())
self._output_response_open = True
# Push the text frame
await self.push_frame(LLMTextFrame(text))
self._last_text_frame_time = asyncio.get_event_loop().time()
# Schedule closing the output response after timeout
self._close_task = self.create_task(self._close_output_response_after_timeout())
async def process_frame(self, frame: Frame, direction: FrameDirection):
"""Process incoming frames and handle LLM message frames.
Args:
frame: The incoming frame to process.
direction: The direction of frame flow in the pipeline.
"""
await super().process_frame(frame, direction)
if isinstance(frame, (LLMContextFrame, OpenAILLMContextFrame)):
# Create payload to invoke AgentCore agent
payload = self._context_to_payload_transformer(frame.context)
if not payload:
return
async with self._aws_session.client("bedrock-agentcore", **self._aws_params) as client:
# Invoke the AgentCore agent
response = await client.invoke_agent_runtime(
agentRuntimeArn=self._agentArn, payload=payload.encode()
)
# Determine if this is a streamed multi-part response, which
# will affect our parsing
is_multi_part_response = "text/event-stream" in response.get("contentType", "")
# Handle each response part (there may be one, for single
# responses, or multiple, for streamed multi-part responses)
async for part in response.get("response", []):
part_string = part.decode("utf-8")
# In streamed multi-part responses, each part might have
# one or more lines, each of which starts with "data: ".
# Treat each line as a response.
if is_multi_part_response:
for line in part_string.split("\n"):
# Get response text from this line
if not line:
continue
if not line.startswith("data: "):
logger.warning(f"Expected line to start with 'data: ', got: {line}")
continue
line = line[6:] # omit "data: "
# Transform response line to output text
text = self._response_to_output_transformer(line)
if text:
await self._push_text_frame(text)
# In single-part responses, the whole part is one response
# and there's no "data: " prefix
else:
# Transform response part string to output text
text = self._response_to_output_transformer(part_string)
if text:
await self._push_text_frame(text)
# Final close if output response is still open after all parts processed
if self._output_response_open:
if self._close_task and not self._close_task.done():
await self.cancel_task(self._close_task)
self._output_response_open = False
await self.push_frame(LLMFullResponseEndFrame())
else:
await self.push_frame(frame, direction)
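
Reviewer note: the response handling in `process_frame` above can be exercised without AWS by isolating the parsing step. The sketch below is a simplified, synchronous rendering of that logic. `iter_texts` and `extract_response_text` are hypothetical names introduced here, and lines without the `data: ` prefix are skipped silently where the original logs a warning.

```python
import json
from typing import Iterator, Optional

DATA_PREFIX = "data: "


def extract_response_text(response_line: str) -> Optional[str]:
    """Same idea as the default output transformer: read the "response" field."""
    return json.loads(response_line).get("response")


def iter_texts(part: bytes, is_event_stream: bool) -> Iterator[str]:
    """Yield output texts from one response part.

    For "text/event-stream" responses, each non-empty line is expected to
    start with "data: ". Otherwise the whole part is one JSON document.
    """
    part_string = part.decode("utf-8")
    if not is_event_stream:
        text = extract_response_text(part_string)
        if text:
            yield text
        return
    for line in part_string.split("\n"):
        if not line or not line.startswith(DATA_PREFIX):
            continue  # The original warns on unexpected non-empty lines.
        text = extract_response_text(line[len(DATA_PREFIX):])
        if text:
            yield text
```

Keeping the parsing separate from the frame pushing makes it straightforward to unit-test a custom `response_to_output_transformer` against recorded response parts.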


@@ -734,7 +734,7 @@ class AWSBedrockLLMService(LLMService):
aws_access_key: Optional[str] = None,
aws_secret_key: Optional[str] = None,
aws_session_token: Optional[str] = None,
aws_region: Optional[str] = None,
aws_region: str = "us-east-1",
params: Optional[InputParams] = None,
client_config: Optional[Config] = None,
retry_timeout_secs: Optional[float] = 5.0,
@@ -840,13 +840,15 @@ class AWSBedrockLLMService(LLMService):
messages = context.messages
system = getattr(context, "system", None) # [{"text": "system message"}]
# Prepare request parameters using the same method as streaming
# Determine if we're using Claude or Nova based on model ID
model_id = self.model_name
# Prepare request parameters
inference_config = self._build_inference_config()
request_params = {
"modelId": self.model_name,
"modelId": model_id,
"messages": messages,
"additionalModelRequestFields": self._settings["additional_model_request_fields"],
}
if inference_config:
@@ -1076,7 +1078,9 @@ class AWSBedrockLLMService(LLMService):
if "contentBlockDelta" in event:
delta = event["contentBlockDelta"]["delta"]
if "text" in delta:
await self.push_frame(LLMTextFrame(delta["text"]))
frame = LLMTextFrame(delta["text"])
frame.includes_inter_frame_spaces = True
await self.push_frame(frame)
completion_tokens_estimate += self._estimate_tokens(delta["text"])
elif "toolUse" in delta and "input" in delta["toolUse"]:
# Handle partial JSON for tool use
@@ -1134,7 +1138,7 @@ class AWSBedrockLLMService(LLMService):
except (ReadTimeoutError, asyncio.TimeoutError):
await self._call_event_handler("on_completion_timeout")
except Exception as e:
await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
logger.exception(f"{self} exception: {e}")
finally:
await self.stop_processing_metrics()
await self.push_frame(LLMFullResponseEndFrame())


@@ -157,12 +157,6 @@ class Params(BaseModel):
max_tokens: Maximum number of tokens to generate.
top_p: Nucleus sampling parameter.
temperature: Sampling temperature for text generation.
endpointing_sensitivity: Controls how quickly Nova Sonic decides the
user has stopped speaking. Can be "LOW", "MEDIUM", or "HIGH", with
"HIGH" being the most sensitive (i.e., causing the model to respond
most quickly).
If not set, uses the model's default behavior.
Only supported with Nova 2 Sonic (the default model).
"""
# Audio input
@@ -180,9 +174,6 @@ class Params(BaseModel):
top_p: Optional[float] = Field(default=0.9)
temperature: Optional[float] = Field(default=0.7)
# Turn-taking
endpointing_sensitivity: Optional[str] = Field(default=None)
class AWSNovaSonicLLMService(LLMService):
"""AWS Nova Sonic speech-to-speech LLM service.
@@ -201,8 +192,8 @@ class AWSNovaSonicLLMService(LLMService):
access_key_id: str,
session_token: Optional[str] = None,
region: str,
model: str = "amazon.nova-2-sonic-v1:0",
voice_id: str = "matthew",
model: str = "amazon.nova-sonic-v1:0",
voice_id: str = "matthew", # matthew, tiffany, amy
params: Optional[Params] = None,
system_instruction: Optional[str] = None,
tools: Optional[ToolsSchema] = None,
@@ -216,15 +207,8 @@ class AWSNovaSonicLLMService(LLMService):
access_key_id: AWS access key ID for authentication.
session_token: AWS session token for authentication.
region: AWS region where the service is hosted.
Supported regions:
- Nova 2 Sonic (the default model): "us-east-1", "us-west-2", "ap-northeast-1"
- Nova Sonic (the older model): "us-east-1", "ap-northeast-1"
model: Model identifier. Defaults to "amazon.nova-2-sonic-v1:0".
voice_id: Voice ID for speech synthesis.
Note that some voices are designed for use with a specific language.
Options:
- Nova 2 Sonic (the default model): see https://docs.aws.amazon.com/nova/latest/nova2-userguide/sonic-language-support.html
- Nova Sonic (the older model): see https://docs.aws.amazon.com/nova/latest/userguide/available-voices.html.
model: Model identifier. Defaults to "amazon.nova-sonic-v1:0".
voice_id: Voice ID for speech synthesis. Options: matthew, tiffany, amy.
params: Model parameters for audio configuration and inference.
system_instruction: System-level instruction for the model.
tools: Available tools/functions for the model to use.
@@ -248,17 +232,6 @@ class AWSNovaSonicLLMService(LLMService):
self._system_instruction = system_instruction
self._tools = tools
# Validate endpointing_sensitivity parameter
if (
self._params.endpointing_sensitivity
and not self._is_endpointing_sensitivity_supported()
):
logger.warning(
f"endpointing_sensitivity is not supported for model '{model}' and will be ignored. "
"This parameter is only supported starting with Nova 2 Sonic (amazon.nova-2-sonic-v1:0)."
)
self._params.endpointing_sensitivity = None
if not send_transcription_frames:
import warnings
@@ -480,13 +453,13 @@ class AWSNovaSonicLLMService(LLMService):
self._ready_to_send_context = True
await self._finish_connecting_if_context_available()
except Exception as e:
await self.push_error(error_msg=f"Initialization error: {e}", exception=e)
logger.error(f"{self} initialization error: {e}")
await self._disconnect()
async def _process_completed_function_calls(self, send_new_results: bool):
# Check for set of completed function calls in the context
for message in self._context.get_messages():
if message.get("role") and message.get("content") not in ["IN_PROGRESS", "CANCELLED"]:
if message.get("role") and message.get("content") != "IN_PROGRESS":
tool_call_id = message.get("tool_call_id")
if tool_call_id and tool_call_id not in self._completed_tool_calls:
# Found a newly-completed function call - send the result to the service
@@ -604,7 +577,7 @@ class AWSNovaSonicLLMService(LLMService):
logger.info("Finished disconnecting")
except Exception as e:
await self.push_error(error_msg=f"Error disconnecting: {e}", exception=e)
logger.error(f"{self} error disconnecting: {e}")
def _create_client(self) -> BedrockRuntimeClient:
config = Config(
@@ -618,33 +591,11 @@ class AWSNovaSonicLLMService(LLMService):
)
return BedrockRuntimeClient(config=config)
def _is_first_generation_sonic_model(self) -> bool:
# Nova Sonic (the older model) is identified by "amazon.nova-sonic-v1:0"
return self._model == "amazon.nova-sonic-v1:0"
def _is_endpointing_sensitivity_supported(self) -> bool:
# endpointing_sensitivity is only supported with Nova 2 Sonic (and,
# presumably, future models)
return not self._is_first_generation_sonic_model()
def _is_assistant_response_trigger_needed(self) -> bool:
# Assistant response trigger audio is only needed with the older model
return self._is_first_generation_sonic_model()
#
# LLM communication: input events (pipecat -> LLM)
#
async def _send_session_start_event(self):
turn_detection_config = (
f""",
"turnDetectionConfiguration": {{
"endpointingSensitivity": "{self._params.endpointing_sensitivity}"
}}"""
if self._params.endpointing_sensitivity
else ""
)
session_start = f"""
{{
"event": {{
@@ -653,7 +604,7 @@ class AWSNovaSonicLLMService(LLMService):
"maxTokens": {self._params.max_tokens},
"topP": {self._params.top_p},
"temperature": {self._params.temperature}
}}{turn_detection_config}
}}
}}
}}
}}
@@ -934,7 +885,7 @@ class AWSNovaSonicLLMService(LLMService):
# Errors are kind of expected while disconnecting, so just
# ignore them and do nothing
return
await self.push_error(error_msg=f"Error processing responses: {e}", exception=e)
logger.error(f"{self} error processing responses: {e}")
if self._wants_connection:
await self.reset_conversation()
@@ -1238,8 +1189,7 @@ class AWSNovaSonicLLMService(LLMService):
)
#
# assistant response trigger
# HACK: only needed for the older Nova Sonic (as opposed to Nova 2 Sonic) model
# assistant response trigger (HACK)
#
# Class variable
@@ -1253,17 +1203,12 @@ class AWSNovaSonicLLMService(LLMService):
Sends a pre-recorded "ready" audio trigger to prompt the assistant
to start speaking. This is useful for controlling conversation flow.
"""
if not self._is_assistant_response_trigger_needed():
logger.warning(
f"Assistant response trigger not needed for model '{self._model}'; skipping. "
"An LLMRunFrame() should be sufficient to prompt the assistant to respond, "
"assuming the context ends in a user message."
)
return
Returns:
False if already triggering a response, True otherwise.
"""
if self._triggering_assistant_response:
return
return False
self._triggering_assistant_response = True
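
Reviewer note: the `_send_session_start_event` hunk earlier in this file splices an optional `turnDetectionConfiguration` fragment into an f-string JSON template. A possible alternative is to build a dict and serialize it once, as sketched below. The `sessionStart` and `inferenceConfiguration` keys are assumptions, since they fall outside the visible hunk, and `build_session_start` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Optional


def build_session_start(
    max_tokens: int,
    top_p: float,
    temperature: float,
    endpointing_sensitivity: Optional[str] = None,
) -> str:
    """Serialize a session-start event, adding turn detection only when set."""
    # Assumed key names: "sessionStart" and "inferenceConfiguration" are not
    # visible in the hunk and are filled in here for illustration.
    session_start: Dict[str, Any] = {
        "inferenceConfiguration": {
            "maxTokens": max_tokens,
            "topP": top_p,
            "temperature": temperature,
        }
    }
    if endpointing_sensitivity:
        session_start["turnDetectionConfiguration"] = {
            "endpointingSensitivity": endpointing_sensitivity
        }
    return json.dumps({"event": {"sessionStart": session_start}})
```

Building the structure first avoids doubled-brace escaping and guarantees well-formed JSON regardless of which optional fields are present.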


@@ -1,283 +0,0 @@
#
# Copyright (c) 2024-2025, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
"""AWS SageMaker bidirectional streaming client.
This module provides a client for streaming bidirectional communication with
SageMaker endpoints using the HTTP/2 protocol. Supports sending audio, text,
and JSON data to SageMaker model endpoints and receiving streaming responses.
"""
import os
from typing import Optional
from loguru import logger
try:
from aws_sdk_sagemaker_runtime_http2.client import SageMakerRuntimeHTTP2Client
from aws_sdk_sagemaker_runtime_http2.config import Config, HTTPAuthSchemeResolver
from aws_sdk_sagemaker_runtime_http2.models import (
InvokeEndpointWithBidirectionalStreamInput,
RequestPayloadPart,
RequestStreamEventPayloadPart,
ResponseStreamEvent,
)
from smithy_aws_core.auth.sigv4 import SigV4AuthScheme
from smithy_aws_core.identity import EnvironmentCredentialsResolver
from smithy_core.aio.eventstream import DuplexEventStream
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error(
"In order to use SageMaker BiDi client, you need to `pip install pipecat-ai[sagemaker]`."
)
raise Exception(f"Missing module: {e}")
class SageMakerBidiClient:
"""Client for bidirectional streaming with AWS SageMaker endpoints.
Handles low-level HTTP/2 bidirectional streaming protocol for communicating
with SageMaker model endpoints. Provides methods for sending various data
types (audio, text, JSON) and receiving streaming responses.
This client uses AWS SigV4 authentication and supports credential resolution
from environment variables, AWS CLI configuration, and instance metadata.
Example::
client = SageMakerBidiClient(
endpoint_name="my-deepgram-endpoint",
region="us-east-2",
model_invocation_path="v1/listen",
model_query_string="model=nova-3&language=en"
)
await client.start_session()
await client.send_audio_chunk(audio_bytes)
response = await client.receive_response()
await client.close_session()
"""
def __init__(
self,
endpoint_name: str,
region: str,
model_invocation_path: str = "",
model_query_string: str = "",
):
"""Initialize the SageMaker BiDi client.
Args:
endpoint_name: Name of the SageMaker endpoint to connect to.
region: AWS region where the endpoint is deployed.
model_invocation_path: API path for the model invocation (e.g., "v1/listen").
model_query_string: Query string parameters for the model (e.g., "model=nova-3").
"""
self.endpoint_name = endpoint_name
self.region = region
self.model_invocation_path = model_invocation_path
self.model_query_string = model_query_string
self.bidi_endpoint = f"https://runtime.sagemaker.{region}.amazonaws.com:8443"
self._client: Optional[SageMakerRuntimeHTTP2Client] = None
self._stream: Optional[
DuplexEventStream[RequestStreamEventPayloadPart, ResponseStreamEvent, any]
] = None
self._output_stream = None
self._is_active = False
def _initialize_client(self):
"""Initialize the SageMaker Runtime HTTP2 client with AWS credentials.
Creates and configures the SageMaker Runtime HTTP2 client with SigV4
authentication. Attempts to resolve AWS credentials from environment
variables, AWS CLI configuration, or instance metadata.
"""
logger.debug(f"Initializing SageMaker BiDi client for region: {self.region}")
logger.debug(f"Using endpoint URI: {self.bidi_endpoint}")
# Check for AWS credentials
has_env_creds = bool(os.getenv("AWS_ACCESS_KEY_ID") and os.getenv("AWS_SECRET_ACCESS_KEY"))
if not has_env_creds:
logger.warning(
"AWS credentials not found in environment variables. "
"Attempting to use EnvironmentCredentialsResolver which will check "
"AWS CLI configuration and instance metadata."
)
config = Config(
endpoint_uri=self.bidi_endpoint,
region=self.region,
aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
auth_scheme_resolver=HTTPAuthSchemeResolver(),
auth_schemes={"aws.auth#sigv4": SigV4AuthScheme(service="sagemaker")},
)
self._client = SageMakerRuntimeHTTP2Client(config=config)
async def start_session(self):
"""Start a bidirectional streaming session with the SageMaker endpoint.
Initializes the client if needed, creates the bidirectional stream, and
establishes the connection to the SageMaker endpoint. Must be called
before sending or receiving data.
Returns:
The output stream for receiving responses.
Raises:
RuntimeError: If client initialization or connection fails.
"""
if not self._client:
self._initialize_client()
logger.debug(f"Starting BiDi session with endpoint: {self.endpoint_name}")
logger.debug(f"Model invocation path: {self.model_invocation_path}")
logger.debug(f"Model query string: {self.model_query_string}")
# Create the bidirectional stream
stream_input = InvokeEndpointWithBidirectionalStreamInput(
endpoint_name=self.endpoint_name,
model_invocation_path=self.model_invocation_path,
model_query_string=self.model_query_string,
)
try:
self._stream = await self._client.invoke_endpoint_with_bidirectional_stream(
stream_input
)
self._is_active = True
# Get output stream
output = await self._stream.await_output()
self._output_stream = output[1]
logger.debug("BiDi session started successfully")
return self._output_stream
except Exception as e:
logger.error(f"Failed to start BiDi session: {e}")
self._is_active = False
raise RuntimeError(f"Failed to start SageMaker BiDi session: {e}")
async def send_data(self, data_bytes: bytes, data_type: Optional[str] = None):
"""Send a chunk of data to the stream.
Generic method for sending any type of data to the SageMaker endpoint.
Use the convenience methods (send_audio_chunk, send_text, send_json)
for common data types.
Args:
data_bytes: Raw bytes to send.
data_type: Optional data type header. Common values are "BINARY" for
audio/binary data and "UTF8" for text/JSON data.
Raises:
RuntimeError: If session is not active or send fails.
"""
if not self._is_active or not self._stream:
raise RuntimeError("BiDi session not active")
try:
payload = RequestPayloadPart(bytes_=data_bytes, data_type=data_type)
event = RequestStreamEventPayloadPart(value=payload)
await self._stream.input_stream.send(event)
except Exception as e:
logger.error(f"Failed to send data: {e}")
raise
async def send_audio_chunk(self, audio_bytes: bytes):
"""Send a chunk of audio data to the stream.
Convenience method for sending audio data. Automatically sets the data
type to "BINARY".
Args:
audio_bytes: Raw audio bytes to send (e.g., PCM audio data).
Raises:
RuntimeError: If session is not active or send fails.
"""
await self.send_data(audio_bytes, data_type="BINARY")
async def send_text(self, text: str):
"""Send text data to the stream.
Convenience method for sending text data. Automatically encodes the text
as UTF-8 and sets the data type to "UTF8".
Args:
text: Text string to send.
Raises:
RuntimeError: If session is not active or send fails.
"""
await self.send_data(text.encode("utf-8"), data_type="UTF8")
async def send_json(self, data: dict):
"""Send JSON data to the stream.
Convenience method for sending JSON-encoded messages. Useful for control
messages like KeepAlive or CloseStream. Automatically serializes the
dictionary to JSON, encodes as UTF-8, and sets the data type to "UTF8".
Args:
data: Dictionary to send as JSON (e.g., {"type": "KeepAlive"}).
Raises:
RuntimeError: If session is not active or send fails.
"""
import json
await self.send_data(json.dumps(data).encode("utf-8"), data_type="UTF8")
async def receive_response(self) -> Optional[ResponseStreamEvent]:
"""Receive a response from the stream.
Blocks until a response is available from the SageMaker endpoint. Returns
None when the stream is closed.
Returns:
The response event containing payload data, or None if stream is closed.
Raises:
RuntimeError: If session is not active.
"""
if not self._is_active or not self._output_stream:
raise RuntimeError("BiDi session not active")
try:
result = await self._output_stream.receive()
return result
except Exception as e:
logger.error(f"Failed to receive response: {e}")
raise
async def close_session(self):
"""Close the bidirectional streaming session.
Gracefully closes the input stream and marks the session as inactive.
Safe to call multiple times.
"""
if not self._is_active:
return
logger.debug("Closing BiDi session...")
self._is_active = False
try:
if self._stream:
await self._stream.input_stream.close()
logger.debug("BiDi session closed successfully")
except Exception as e:
logger.warning(f"Error closing BiDi session: {e}")
@property
def is_active(self) -> bool:
"""Check if the session is currently active.
Returns:
True if session is active, False otherwise.
"""
return self._is_active
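
Reviewer note: the three convenience senders in `SageMakerBidiClient` differ only in how they encode the payload and which data-type header they attach. The sketch below isolates that mapping as pure functions so it can be checked without a live endpoint. The `encode_*` names are hypothetical and not part of the client.

```python
import json
from typing import Tuple

# (payload bytes, data-type header), as passed to send_data() in the client.
Encoded = Tuple[bytes, str]


def encode_audio(audio_bytes: bytes) -> Encoded:
    """Audio is sent unchanged with the "BINARY" data type."""
    return audio_bytes, "BINARY"


def encode_text(text: str) -> Encoded:
    """Text is UTF-8 encoded and tagged "UTF8"."""
    return text.encode("utf-8"), "UTF8"


def encode_json(data: dict) -> Encoded:
    """JSON control messages are serialized, then handled like text."""
    return encode_text(json.dumps(data))
```

For example, a keepalive message such as `{"type": "KeepAlive"}` goes through `encode_json` and reaches the stream as UTF-8 bytes with the `"UTF8"` header.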

Some files were not shown because too many files have changed in this diff.