Compare commits
65 Commits
v0.0.83
...
hush/delay
| Author | SHA1 | Date | |
|---|---|---|---|
| | 6bb3cb2b83 | | |
| | 908325484d | | |
| | dd6ff789c7 | | |
| | f4938e0fad | | |
| | e8f60c7c6f | | |
| | 38f6e33f97 | | |
| | 1c3e4e34e5 | | |
| | 623c660027 | | |
| | a3e65ab3b5 | | |
| | f3a4b416df | | |
| | aa471a4ef5 | | |
| | d55133a44f | | |
| | 0f1cf81691 | | |
| | ac4d335799 | | |
| | e65385c151 | | |
| | 0bb7df7a6b | | |
| | daee1ddf3b | | |
| | 1cccb97ccf | | |
| | d7794abf21 | | |
| | 6a6a63a532 | | |
| | 6edb6fed41 | | |
| | a537382816 | | |
| | 46deaada70 | | |
| | dbc52bc6b0 | | |
| | d6432589f6 | | |
| | 13b73d4406 | | |
| | 85d8282f7e | | |
| | 070690ec64 | | |
| | b9c96fd623 | | |
| | f8b2ab6331 | | |
| | ea3f7e3c34 | | |
| | 2f44f88b08 | | |
| | 25747a001b | | |
| | fbe4338440 | | |
| | 64b4c65728 | | |
| | 29442969a9 | | |
| | dc2e1d4ad3 | | |
| | 5477dfcbea | | |
| | 516f0e08ab | | |
| | 246f9f3325 | | |
| | 3d850e8cc5 | | |
| | 6e734a37f9 | | |
| | f72ca2fd7d | | |
| | 0826d72f74 | | |
| | ba5ebfa0ec | | |
| | dc3412b2df | | |
| | b2e9fd9341 | | |
| | c11b207c97 | | |
| | d6205027cf | | |
| | 986160c077 | | |
| | b56ff86fee | | |
| | 5c574eaad9 | | |
| | 2df231143a | | |
| | 65298ab792 | | |
| | b609b02614 | | |
| | f2b50c14d2 | | |
| | ee3b023986 | | |
| | 0d9e1190d7 | | |
| | 595a7c7fbe | | |
| | 586586f743 | | |
| | a1c6ad539d | | |
| | daf7fed8b3 | | |
| | a26647c433 | | |
| | 83f64ecd3b | | |
| | 0a3e98857e | | |
73
CHANGELOG.md
@@ -5,14 +5,79 @@ All notable changes to **Pipecat** will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- Added video streaming support to `LiveKitTransport`.

- Added `OpenAIRealtimeLLMService` and `AzureRealtimeLLMService`, which provide
  access to OpenAI Realtime.

### Removed

- Removed `VisionImageRawFrame` in favor of context frames (`LLMContextFrame` or
  `OpenAILLMContextFrame`).

### Deprecated

- Deprecated `VisionImageFrameAggregator` because `VisionImageRawFrame` has been
  removed. See the `12*` examples for the new recommended replacement pattern.

- `NoisereduceFilter` is now deprecated and will be removed in a future
  version. Use other audio filters like `KrispFilter` or `AICFilter`.

- Deprecated `OpenAIRealtimeBetaLLMService` and `AzureRealtimeBetaLLMService`.
  Use `OpenAIRealtimeLLMService` and `AzureRealtimeLLMService`, respectively.
  Both services will be removed in version 1.0.0.

### Fixed

- Fixed a `LiveKitTransport` issue where RTVI messages were not properly
  encoded.

- Added additional fixups to Mistral context messages to ensure they meet
  Mistral-specific requirements, avoiding Mistral "invalid request" errors.

- Fixed `DailyTransport` transcription handling to gracefully handle a missing
  `rawResponse` field in transcription messages, preventing `KeyError` crashes.

## [0.0.84] - 2025-09-05

### Added

- Added the ability to send DTMF to `LiveKitTransport`.

- Expanded support for universal `LLMContext` to the Anthropic LLM service.
  Using the universal `LLMContext` and associated `LLMContextAggregatorPair` is
  a prerequisite for using `LLMSwitcher` to switch between LLMs at runtime.

### Changed

- Updated `daily-python` to 0.19.9.

- Restored `DailyTransport`'s native DTMF support using Daily's `send_dtmf()`
  method instead of generated audio tones.

### Fixed

- Fixed an `AWSBedrockLLMService` crash caused by an extra `await`.

- Fixed an `OpenAIImageGenService` issue where it was not creating
  `URLImageRawFrame` correctly.

## [0.0.83] - 2025-09-03

### Added

- Added multilingual support for AsyncAI in `AsyncAITTSService` and `AsyncAIHttpTTSService`.

  - New `languages`: `es`, `fr`, `de`, `it`.

- Added new frames `InputTransportMessageUrgentFrame` and
  `DailyInputTransportMessageUrgentFrame` for transport messages received from
  external sources.

- Added `UserSpeakingFrame`. This will be sent upstream and downstream while VAD
  detects the user is speaking.

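The `rawResponse` fix listed in this hunk comes down to a tolerant dictionary lookup. The sketch below is only an illustration of that idea, not the code in `DailyTransport`: the function name `read_transcription` and the `text` and `is_final` fields are assumptions made for the example, and only `rawResponse` is named in the changelog.

```python
from typing import Any, Dict, Tuple


def read_transcription(message: Dict[str, Any]) -> Tuple[str, bool]:
    """Return (text, is_final) without assuming `rawResponse` is present.

    The `text` and `is_final` keys are hypothetical; they stand in for
    whatever the real transcription payload carries.
    """
    text = message.get("text", "")
    # `rawResponse` may be missing or None; fall back to an empty dict
    # instead of indexing it directly, which would raise KeyError.
    raw = message.get("rawResponse") or {}
    is_final = bool(raw.get("is_final", False))
    return text, is_final
```

Using `.get()` with a fallback keeps a partial message from stopping the transcription handler; the caller simply sees the default values.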
@@ -82,7 +147,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
|
||||
- Added new config parameters to `GladiaSTTService`.
|
||||
- PreProcessingConfig > `audio_enhancer` to enhance audio quality.
|
||||
-  - CustomVocabularyItem > `pronunciations` and `language` to specify special pronunciations and in which language it will be pronounced.
+  - CustomVocabularyItem > `pronunciations` and `language` to specify special
+    pronunciations and in which language it will be pronounced.
|
||||
|
||||
### Changed
|
||||
|
||||
@@ -101,7 +167,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
||||
- `pipecat.frames.frames.KeypadEntry` is deprecated and has been moved to
|
||||
`pipecat.audio.dtmf.types.KeypadEntry`.
|
||||
|
||||
-- Updated `RimeTTSService`'s flush_audio message to conform with Rime's official API.
+- Updated `RimeTTSService`'s flush_audio message to conform with Rime's official
+  API.
|
||||
|
||||
- Updated the default model for `CerebrasLLMService` to GPT-OSS-120B.
|
||||
|
||||
|
||||
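The `KeypadEntry` entry in this hunk describes a class that moved modules while the old import path keeps working with a deprecation notice. One common way to get that behavior in Python is a module-level `__getattr__` (PEP 562). The sketch below is a generic illustration under that assumption; it is not taken from the Pipecat source, and the `legacy_frames` module and the two enum members are hypothetical stand-ins.

```python
import types
import warnings
from enum import Enum


class KeypadEntry(Enum):
    """Stand-in for the class at its new location (members are illustrative)."""

    ONE = "1"
    POUND = "#"


# Hypothetical old module that no longer defines KeypadEntry itself.
legacy_frames = types.ModuleType("legacy_frames")


def _legacy_getattr(name: str):
    """Resolve moved names lazily and warn about the old import path."""
    if name == "KeypadEntry":
        warnings.warn(
            "legacy_frames.KeypadEntry is deprecated; import it from its new module.",
            DeprecationWarning,
            stacklevel=2,
        )
        return KeypadEntry
    raise AttributeError(f"module 'legacy_frames' has no attribute {name!r}")


# PEP 562: a `__getattr__` stored in the module namespace handles missing names.
legacy_frames.__getattr__ = _legacy_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = legacy_frames.KeypadEntry
```

The old path still resolves to the same class object, so `isinstance` checks keep working during the deprecation window.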
48
README.md
@@ -28,6 +28,41 @@
|
||||
- **Composable Pipelines**: Build complex behavior from modular components
|
||||
- **Real-Time**: Ultra-low latency interaction with different transports (e.g. WebSockets or WebRTC)
|
||||
|
||||
## 📱 Client SDKs
|
||||
|
||||
You can connect to Pipecat from any platform using our official SDKs:
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<td>
|
||||
<img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/javascript/javascript-original.svg" width="40" height="40" alt="JavaScript"/>
|
||||
<a href="https://docs.pipecat.ai/client/js/introduction">JavaScript</a>
|
||||
</td>
|
||||
<td>
|
||||
<img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/react/react-original.svg" width="40" height="40" alt="React"/>
|
||||
<a href="https://docs.pipecat.ai/client/react/introduction">React</a>
|
||||
</td>
|
||||
<td>
|
||||
<img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/react/react-original.svg" width="40" height="40" alt="React Native"/>
|
||||
<a href="https://docs.pipecat.ai/client/react-native/introduction">React Native</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>
|
||||
<img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/swift/swift-original.svg" width="40" height="40" alt="Swift"/>
|
||||
<a href="https://docs.pipecat.ai/client/ios/introduction">Swift</a>
|
||||
</td>
|
||||
<td>
|
||||
<img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/kotlin/kotlin-original.svg" width="40" height="40" alt="Kotlin"/>
|
||||
<a href="https://docs.pipecat.ai/client/android/introduction">Kotlin</a>
|
||||
</td>
|
||||
<td>
|
||||
<img src="https://cdn.jsdelivr.net/gh/devicons/devicon/icons/cplusplus/cplusplus-original.svg" width="40" height="40" alt="C++"/>
|
||||
<a href="https://docs.pipecat.ai/client/c++/introduction">C++</a>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
## 🎬 See it in action
|
||||
|
||||
<p float="left">
|
||||
@@ -38,17 +73,6 @@
|
||||
<a href="https://github.com/pipecat-ai/pipecat-examples/tree/main/moondream-chatbot"><img src="https://raw.githubusercontent.com/pipecat-ai/pipecat-examples/main/moondream-chatbot/image.png" width="400" /></a>
|
||||
</p>
|
||||
|
||||
-## 📱 Client SDKs
-
-You can connect to Pipecat from any platform using our official SDKs:
-
-| Platform | SDK Repo | Description |
-| -------- | ------------------------------------------------------------------------------ | -------------------------------- |
-| Web | [pipecat-client-web](https://github.com/pipecat-ai/pipecat-client-web) | JavaScript and React client SDKs |
-| iOS | [pipecat-client-ios](https://github.com/pipecat-ai/pipecat-client-ios) | Swift SDK for iOS |
-| Android | [pipecat-client-android](https://github.com/pipecat-ai/pipecat-client-android) | Kotlin SDK for Android |
-| C++ | [pipecat-client-cxx](https://github.com/pipecat-ai/pipecat-client-cxx) | C++ client SDK |
|
||||
|
||||
## 🧩 Available services
|
||||
|
||||
| Category | Services |
|
||||
@@ -62,7 +86,7 @@ You can connect to Pipecat from any platform using our official SDKs:
|
||||
| Video | [HeyGen](https://docs.pipecat.ai/server/services/video/heygen), [Tavus](https://docs.pipecat.ai/server/services/video/tavus), [Simli](https://docs.pipecat.ai/server/services/video/simli) |
|
||||
| Memory | [mem0](https://docs.pipecat.ai/server/services/memory/mem0) |
|
||||
| Vision & Image | [fal](https://docs.pipecat.ai/server/services/image-generation/fal), [Google Imagen](https://docs.pipecat.ai/server/services/image-generation/fal), [Moondream](https://docs.pipecat.ai/server/services/vision/moondream) |
|
||||
-| Audio Processing | [Silero VAD](https://docs.pipecat.ai/server/utilities/audio/silero-vad-analyzer), [Krisp](https://docs.pipecat.ai/server/utilities/audio/krisp-filter), [Koala](https://docs.pipecat.ai/server/utilities/audio/koala-filter), [ai-coustics](https://docs.pipecat.ai/server/utilities/audio/aic-filter), [Noisereduce](https://docs.pipecat.ai/server/utilities/audio/noisereduce-filter) |
+| Audio Processing | [Silero VAD](https://docs.pipecat.ai/server/utilities/audio/silero-vad-analyzer), [Krisp](https://docs.pipecat.ai/server/utilities/audio/krisp-filter), [Koala](https://docs.pipecat.ai/server/utilities/audio/koala-filter), [ai-coustics](https://docs.pipecat.ai/server/utilities/audio/aic-filter) |
|
||||
| Analytics & Metrics | [OpenTelemetry](https://docs.pipecat.ai/server/utilities/opentelemetry), [Sentry](https://docs.pipecat.ai/server/services/analytics/sentry) |
|
||||
|
||||
📚 [View full services documentation →](https://docs.pipecat.ai/server/services/supported-services)
|
||||
|
||||
@@ -4,17 +4,19 @@
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
-from pipecat.frames.frames import LLMRunFrame
+from pipecat.frames.frames import Frame, LLMFullResponseEndFrame, LLMRunFrame, LLMTextFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
@@ -26,6 +28,62 @@ from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
+class DelayProcessor(FrameProcessor):
+    """Custom processor that queues LLM text frames until response is complete.
+
+    This creates a more natural conversation flow by preventing the agent from
+    responding immediately after the user stops speaking. It queues all LLMTextFrames
+    until it sees an LLMFullResponseEndFrame, then waits for the specified delay
+    before releasing all queued frames at once.
+    """
+
+    def __init__(self, *, delay_seconds: float = 1.0, **kwargs) -> None:
+        """Initialize the DelayProcessor.
+
+        Args:
+            delay_seconds: Number of seconds to delay before releasing queued frames (default: 1.0)
+        """
+        super().__init__(**kwargs)
+        self._delay_seconds = delay_seconds
+        self._queued_frames = []
+
+    async def process_frame(self, frame: Frame, direction: FrameDirection) -> None:
+        """Process frames, queuing LLM text frames until response is complete.
+
+        Args:
+            frame: The frame to process
+            direction: Direction of the frame in the pipeline
+        """
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, LLMTextFrame):
+            # Queue LLM text frames instead of pushing them immediately
+            logger.debug(f"Queuing LLMTextFrame: {frame.text}")
+            self._queued_frames.append((frame, direction))
+        elif isinstance(frame, LLMFullResponseEndFrame):
+            # When we see the end frame, wait for delay then push all queued frames
+            logger.debug(
+                f"LLM response complete, delaying {self._delay_seconds} seconds before releasing {len(self._queued_frames)} queued frames"
+            )
+            await asyncio.sleep(self._delay_seconds)
+
+            # Push all queued LLM text frames
+            for queued_frame, queued_direction in self._queued_frames:
+                logger.debug(f"Releasing queued LLMTextFrame: {queued_frame.text}")
+                await self.push_frame(queued_frame, queued_direction)
+
+            # Clear the queue
+            self._queued_frames.clear()
+
+            # Push the end frame
+            logger.debug("Pushing LLMFullResponseEndFrame")
+            await self.push_frame(frame, direction)
+        else:
+            # Push all other frames immediately
+            await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
@@ -70,12 +128,16 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
context = OpenAILLMContext(messages)
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
|
||||
# Create delay processor to add 1-second delay before agent responses
|
||||
delay_processor = DelayProcessor(delay_seconds=1.0)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
stt,
|
||||
context_aggregator.user(), # User responses
|
||||
llm, # LLM
|
||||
delay_processor, # Add delay before TTS
|
||||
tts, # TTS
|
||||
transport.output(), # Transport bot output
|
||||
context_aggregator.assistant(), # Assistant spoken responses
|
||||
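The buffering behavior that `DelayProcessor` adds to this pipeline can be tried without Pipecat installed. The following is a minimal sketch of the same hold-then-release idea using plain `asyncio`; `Text`, `End`, `DelayBuffer`, and the `emit` callback are hypothetical stand-ins for the frame classes and `push_frame`, not Pipecat APIs.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Text:
    """Stand-in for an LLM text frame (hypothetical, not a Pipecat class)."""

    text: str


class End:
    """Stand-in for the end-of-response marker (hypothetical)."""


class DelayBuffer:
    """Holds Text items until End arrives, then releases them after a pause."""

    def __init__(self, delay_seconds: float, emit):
        self._delay_seconds = delay_seconds
        self._emit = emit  # async callable that receives released items
        self._queued = []

    async def process(self, item):
        if isinstance(item, Text):
            # Hold text back instead of forwarding it right away.
            self._queued.append(item)
        elif isinstance(item, End):
            await asyncio.sleep(self._delay_seconds)
            # Swap the list out first so a failure mid-release cannot replay items.
            queued, self._queued = self._queued, []
            for queued_item in queued:
                await self._emit(queued_item)
            await self._emit(item)
        else:
            # Anything else passes straight through.
            await self._emit(item)


async def demo():
    released = []

    async def emit(item):
        released.append(item)

    buffer = DelayBuffer(delay_seconds=0.01, emit=emit)
    await buffer.process(Text("Hello"))
    await buffer.process(Text(" world"))
    held_back = len(released) == 0  # nothing is released before End
    await buffer.process(End())
    return held_back, released


held_back, released = asyncio.run(demo())
```

One design point worth noting when reading the diff: this sketch, like the example in the hunk, keeps buffered text across an interruption. Whether the queue should be dropped when the user interrupts is a separate decision that neither version addresses.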
|
||||
@@ -93,9 +93,8 @@ class UserAudioCollector(FrameProcessor):
|
||||
elif isinstance(frame, UserStoppedSpeakingFrame):
|
||||
self._user_speaking = False
|
||||
self._context.add_audio_frames_message(audio_frames=self._audio_frames)
|
||||
await self._user_context_aggregator.push_frame(
|
||||
self._user_context_aggregator.get_context_frame()
|
||||
)
|
||||
await self._user_context_aggregator.push_frame(LLMRunFrame())
|
||||
|
||||
elif isinstance(frame, InputAudioRawFrame):
|
||||
if self._user_speaking:
|
||||
self._audio_frames.append(frame)
|
||||
@@ -151,7 +150,7 @@ class TranscriptExtractor(FrameProcessor):
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
-class TanscriptionContextFixup(FrameProcessor):
+class TranscriptionContextFixup(FrameProcessor):
|
||||
def __init__(self, context):
|
||||
super().__init__()
|
||||
self._context = context
|
||||
@@ -245,7 +244,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
audio_collector = UserAudioCollector(context, context_aggregator.user())
|
||||
pull_transcript_out_of_llm_output = TranscriptExtractor(context)
|
||||
-    fixup_context_messages = TanscriptionContextFixup(context)
+    fixup_context_messages = TranscriptionContextFixup(context)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
|
||||
@@ -11,12 +11,19 @@ from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
-from pipecat.frames.frames import Frame, TextFrame, TTSSpeakFrame, UserImageRequestFrame
+from pipecat.frames.frames import (
+    Frame,
+    LLMContextFrame,
+    TextFrame,
+    TTSSpeakFrame,
+    UserImageRawFrame,
+    UserImageRequestFrame,
+)
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.user_response import UserResponseAggregator
|
||||
from pipecat.processors.aggregators.vision_image_frame import VisionImageFrameAggregator
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import (
|
||||
@@ -34,6 +41,8 @@ load_dotenv(override=True)
|
||||
|
||||
|
||||
class UserImageRequester(FrameProcessor):
|
||||
"""Converts incoming text into requests for user images."""
|
||||
|
||||
def __init__(self, participant_id: Optional[str] = None):
|
||||
super().__init__()
|
||||
self._participant_id = participant_id
|
||||
@@ -46,9 +55,32 @@ class UserImageRequester(FrameProcessor):
|
||||
|
||||
if self._participant_id and isinstance(frame, TextFrame):
|
||||
await self.push_frame(
|
||||
-                UserImageRequestFrame(self._participant_id), FrameDirection.UPSTREAM
+                UserImageRequestFrame(self._participant_id, context=frame.text),
+                FrameDirection.UPSTREAM,
|
||||
)
|
||||
-        await self.push_frame(frame, direction)
+        else:
+            await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
class UserImageProcessor(FrameProcessor):
|
||||
"""Converts incoming user images into context frames."""
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, UserImageRawFrame):
|
||||
if frame.request and frame.request.context:
|
||||
context = LLMContext()
|
||||
context.add_image_frame_message(
|
||||
image=frame.image,
|
||||
text=frame.request.context,
|
||||
size=frame.size,
|
||||
format=frame.format,
|
||||
)
|
||||
frame = LLMContextFrame(context)
|
||||
await self.push_frame(frame)
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
@@ -78,7 +110,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
# Initialize the image requester without setting the participant ID yet
|
||||
image_requester = UserImageRequester()
|
||||
|
||||
-    vision_aggregator = VisionImageFrameAggregator()
+    image_processor = UserImageProcessor()
|
||||
|
||||
# If you run into weird description, try with use_cpu=True
|
||||
moondream = MoondreamService()
|
||||
@@ -96,7 +128,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
stt,
|
||||
user_response,
|
||||
image_requester,
|
||||
-            vision_aggregator,
+            image_processor,
|
||||
moondream,
|
||||
tts,
|
||||
transport.output(),
|
||||
@@ -119,7 +151,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
image_requester.set_participant_id(client_id)
|
||||
|
||||
# Welcome message
|
||||
-        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me what I see."))
+        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me about what I see."))
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
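The change in this file replaces the aggregator with a request that carries the user's question (`context=frame.text`) and a processor that pairs the returned image with that question. A dependency-free sketch of the pairing follows; `ImageRequest`, `ImageReply`, `to_user_message`, and the message layout are hypothetical stand-ins chosen for illustration, not Pipecat classes or the format `LLMContext` uses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ImageRequest:
    """Hypothetical stand-in for a request that asks a participant for an image."""

    participant_id: str
    context: Optional[str] = None  # the user's question, carried with the request


@dataclass
class ImageReply:
    """Hypothetical stand-in for the image that comes back, linked to its request."""

    image: bytes
    size: Tuple[int, int]
    format: str
    request: Optional[ImageRequest] = None


def to_user_message(reply: ImageReply) -> Optional[dict]:
    """Build one user message from an image reply, or None if no question came with it."""
    if not (reply.request and reply.request.context):
        return None
    return {
        "role": "user",
        "text": reply.request.context,
        "image": {
            "size": reply.size,
            "format": reply.format,
            "num_bytes": len(reply.image),
        },
    }


request = ImageRequest(participant_id="user-1", context="What is on my desk?")
reply = ImageReply(image=b"\x00" * 12, size=(2, 2), format="RGB", request=request)
message = to_user_message(reply)
unsolicited = to_user_message(ImageReply(image=b"", size=(0, 0), format="RGB"))
```

Carrying the question on the request means the image and its prompt arrive together, so no separate aggregation step has to match text to images after the fact.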
|
||||
@@ -11,12 +11,19 @@ from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
-from pipecat.frames.frames import Frame, TextFrame, TTSSpeakFrame, UserImageRequestFrame
+from pipecat.frames.frames import (
+    Frame,
+    LLMContextFrame,
+    TextFrame,
+    TTSSpeakFrame,
+    UserImageRawFrame,
+    UserImageRequestFrame,
+)
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.user_response import UserResponseAggregator
|
||||
from pipecat.processors.aggregators.vision_image_frame import VisionImageFrameAggregator
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import (
|
||||
@@ -34,6 +41,8 @@ load_dotenv(override=True)
|
||||
|
||||
|
||||
class UserImageRequester(FrameProcessor):
|
||||
"""Converts incoming text into requests for user images."""
|
||||
|
||||
def __init__(self, participant_id: Optional[str] = None):
|
||||
super().__init__()
|
||||
self._participant_id = participant_id
|
||||
@@ -46,9 +55,32 @@ class UserImageRequester(FrameProcessor):
|
||||
|
||||
if self._participant_id and isinstance(frame, TextFrame):
|
||||
await self.push_frame(
|
||||
-                UserImageRequestFrame(self._participant_id), FrameDirection.UPSTREAM
+                UserImageRequestFrame(self._participant_id, context=frame.text),
+                FrameDirection.UPSTREAM,
|
||||
)
|
||||
-        await self.push_frame(frame, direction)
+        else:
+            await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
class UserImageProcessor(FrameProcessor):
|
||||
"""Converts incoming user images into context frames."""
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, UserImageRawFrame):
|
||||
if frame.request and frame.request.context:
|
||||
context = LLMContext()
|
||||
context.add_image_frame_message(
|
||||
image=frame.image,
|
||||
text=frame.request.context,
|
||||
size=frame.size,
|
||||
format=frame.format,
|
||||
)
|
||||
frame = LLMContextFrame(context)
|
||||
await self.push_frame(frame)
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
@@ -78,7 +110,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
# Initialize the image requester without setting the participant ID yet
|
||||
image_requester = UserImageRequester()
|
||||
|
||||
-    vision_aggregator = VisionImageFrameAggregator()
+    image_processor = UserImageProcessor()
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
@@ -96,7 +128,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
stt,
|
||||
user_response,
|
||||
image_requester,
|
||||
-            vision_aggregator,
+            image_processor,
|
||||
google,
|
||||
tts,
|
||||
transport.output(),
|
||||
@@ -123,7 +155,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
image_requester.set_participant_id(client_id)
|
||||
|
||||
# Welcome message
|
||||
-        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me what I see."))
+        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me about what I see."))
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
|
||||
@@ -11,12 +11,19 @@ from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
-from pipecat.frames.frames import Frame, TextFrame, TTSSpeakFrame, UserImageRequestFrame
+from pipecat.frames.frames import (
+    Frame,
+    LLMContextFrame,
+    TextFrame,
+    TTSSpeakFrame,
+    UserImageRawFrame,
+    UserImageRequestFrame,
+)
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.user_response import UserResponseAggregator
|
||||
from pipecat.processors.aggregators.vision_image_frame import VisionImageFrameAggregator
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import (
|
||||
@@ -34,6 +41,8 @@ load_dotenv(override=True)
|
||||
|
||||
|
||||
class UserImageRequester(FrameProcessor):
|
||||
"""Converts incoming text into requests for user images."""
|
||||
|
||||
def __init__(self, participant_id: Optional[str] = None):
|
||||
super().__init__()
|
||||
self._participant_id = participant_id
|
||||
@@ -46,9 +55,32 @@ class UserImageRequester(FrameProcessor):
|
||||
|
||||
if self._participant_id and isinstance(frame, TextFrame):
|
||||
await self.push_frame(
|
||||
-                UserImageRequestFrame(self._participant_id), FrameDirection.UPSTREAM
+                UserImageRequestFrame(self._participant_id, context=frame.text),
+                FrameDirection.UPSTREAM,
|
||||
)
|
||||
-        await self.push_frame(frame, direction)
+        else:
+            await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
class UserImageProcessor(FrameProcessor):
|
||||
"""Converts incoming user images into context frames."""
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, UserImageRawFrame):
|
||||
if frame.request and frame.request.context:
|
||||
context = LLMContext()
|
||||
context.add_image_frame_message(
|
||||
image=frame.image,
|
||||
text=frame.request.context,
|
||||
size=frame.size,
|
||||
format=frame.format,
|
||||
)
|
||||
frame = LLMContextFrame(context)
|
||||
await self.push_frame(frame)
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
@@ -78,7 +110,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
# Initialize the image requester without setting the participant ID yet
|
||||
image_requester = UserImageRequester()
|
||||
|
||||
-    vision_aggregator = VisionImageFrameAggregator()
+    image_processor = UserImageProcessor()
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
@@ -96,7 +128,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
stt,
|
||||
user_response,
|
||||
image_requester,
|
||||
-            vision_aggregator,
+            image_processor,
|
||||
openai,
|
||||
tts,
|
||||
transport.output(),
|
||||
@@ -123,7 +155,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
image_requester.set_participant_id(client_id)
|
||||
|
||||
# Welcome message
|
||||
-        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me what I see."))
+        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me about what I see."))
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
|
||||
@@ -11,12 +11,19 @@ from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
-from pipecat.frames.frames import Frame, TextFrame, TTSSpeakFrame, UserImageRequestFrame
+from pipecat.frames.frames import (
+    Frame,
+    LLMContextFrame,
+    TextFrame,
+    TTSSpeakFrame,
+    UserImageRawFrame,
+    UserImageRequestFrame,
+)
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.user_response import UserResponseAggregator
|
||||
from pipecat.processors.aggregators.vision_image_frame import VisionImageFrameAggregator
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import (
|
||||
@@ -34,6 +41,8 @@ load_dotenv(override=True)
|
||||
|
||||
|
||||
class UserImageRequester(FrameProcessor):
|
||||
"""Converts incoming text into requests for user images."""
|
||||
|
||||
def __init__(self, participant_id: Optional[str] = None):
|
||||
super().__init__()
|
||||
self._participant_id = participant_id
|
||||
@@ -46,9 +55,32 @@ class UserImageRequester(FrameProcessor):
|
||||
|
||||
if self._participant_id and isinstance(frame, TextFrame):
|
||||
await self.push_frame(
|
||||
-                UserImageRequestFrame(self._participant_id), FrameDirection.UPSTREAM
+                UserImageRequestFrame(self._participant_id, context=frame.text),
+                FrameDirection.UPSTREAM,
|
||||
)
|
||||
-        await self.push_frame(frame, direction)
+        else:
+            await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
class UserImageProcessor(FrameProcessor):
|
||||
"""Converts incoming user images into context frames."""
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, UserImageRawFrame):
|
||||
if frame.request and frame.request.context:
|
||||
context = LLMContext()
|
||||
context.add_image_frame_message(
|
||||
image=frame.image,
|
||||
text=frame.request.context,
|
||||
size=frame.size,
|
||||
format=frame.format,
|
||||
)
|
||||
frame = LLMContextFrame(context)
|
||||
await self.push_frame(frame)
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
@@ -78,7 +110,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
# Initialize the image requester without setting the participant ID yet
|
||||
image_requester = UserImageRequester()
|
||||
|
||||
-    vision_aggregator = VisionImageFrameAggregator()
+    image_processor = UserImageProcessor()
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
@@ -96,7 +128,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
stt,
|
||||
user_response,
|
||||
image_requester,
|
||||
-            vision_aggregator,
+            image_processor,
|
||||
anthropic,
|
||||
tts,
|
||||
transport.output(),
|
||||
@@ -123,7 +155,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
image_requester.set_participant_id(client_id)
|
||||
|
||||
# Welcome message
|
||||
-        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me what I see."))
+        await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me about what I see."))
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
|
||||
186
examples/foundational/12d-describe-video-aws.py
Normal file
@@ -0,0 +1,186 @@
+#
+# Copyright (c) 2024–2025, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import os
+from typing import Optional
+
+from dotenv import load_dotenv
+from loguru import logger
+
+from pipecat.audio.vad.silero import SileroVADAnalyzer
+from pipecat.frames.frames import (
+    Frame,
+    TextFrame,
+    TTSSpeakFrame,
+    UserImageRawFrame,
+    UserImageRequestFrame,
+)
+from pipecat.pipeline.pipeline import Pipeline
+from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.task import PipelineParams, PipelineTask
+from pipecat.processors.aggregators.openai_llm_context import (
+    OpenAILLMContext,
+    OpenAILLMContextFrame,
+)
+from pipecat.processors.aggregators.user_response import UserResponseAggregator
+from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
+from pipecat.runner.types import RunnerArguments
+from pipecat.runner.utils import (
+    create_transport,
+    get_transport_client_id,
+    maybe_capture_participant_camera,
+)
+from pipecat.services.aws.llm import AWSBedrockLLMService
+from pipecat.services.cartesia.tts import CartesiaTTSService
+from pipecat.services.deepgram.stt import DeepgramSTTService
+from pipecat.transports.base_transport import BaseTransport, TransportParams
+from pipecat.transports.daily.transport import DailyParams
+
+load_dotenv(override=True)
+
+
+class UserImageRequester(FrameProcessor):
+    """Converts incoming text into requests for user images."""
+
+    def __init__(self, participant_id: Optional[str] = None):
+        super().__init__()
+        self._participant_id = participant_id
+
+    def set_participant_id(self, participant_id: str):
+        self._participant_id = participant_id
+
+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        await super().process_frame(frame, direction)
+
+        if self._participant_id and isinstance(frame, TextFrame):
+            await self.push_frame(
+                UserImageRequestFrame(self._participant_id, context=frame.text),
+                FrameDirection.UPSTREAM,
+            )
+        else:
+            await self.push_frame(frame, direction)
+
+
+class UserImageProcessor(FrameProcessor):
|
||||
"""Converts incoming user images into context frames."""
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, UserImageRawFrame):
|
||||
if frame.request and frame.request.context:
|
||||
# Note: AWS Bedrock does not yet support the universal LLMContext
|
||||
context = OpenAILLMContext()
|
||||
context.add_image_frame_message(
|
||||
image=frame.image,
|
||||
text=frame.request.context,
|
||||
size=frame.size,
|
||||
format=frame.format,
|
||||
)
|
||||
frame = OpenAILLMContextFrame(context)
|
||||
await self.push_frame(frame)
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
video_in_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
video_in_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
user_response = UserResponseAggregator()
|
||||
|
||||
# Initialize the image requester without setting the participant ID yet
|
||||
image_requester = UserImageRequester()
|
||||
|
||||
image_processor = UserImageProcessor()
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
# AWS for vision analysis
|
||||
aws = AWSBedrockLLMService(
|
||||
aws_region="us-west-2",
|
||||
model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
|
||||
params=AWSBedrockLLMService.InputParams(temperature=0.8),
|
||||
)
|
||||
|
||||
tts = CartesiaTTSService(
|
||||
api_key=os.getenv("CARTESIA_API_KEY"),
|
||||
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
|
||||
)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(),
|
||||
stt,
|
||||
user_response,
|
||||
image_requester,
|
||||
image_processor,
|
||||
aws,
|
||||
tts,
|
||||
transport.output(),
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected: {client}")
|
||||
|
||||
await maybe_capture_participant_camera(transport, client)
|
||||
|
||||
# Set the participant ID in the image requester
|
||||
client_id = get_transport_client_id(transport, client)
|
||||
image_requester.set_participant_id(client_id)
|
||||
|
||||
# Welcome message
|
||||
await task.queue_frame(TTSSpeakFrame("Hi there! Feel free to ask me about what I see."))
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
|
||||
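The `UserImageRequester` in the file above follows a convert-or-pass-through pattern: once a participant ID is set, text frames are replaced by image requests sent upstream, and every other frame is forwarded unchanged. Below is a minimal, self-contained sketch of that routing logic. The classes `TextFrame`, `ImageRequestFrame` and `ImageRequester` are simplified stand-ins written for illustration, not the Pipecat classes, and the `upstream`/`downstream` lists are assumptions that replace `push_frame`.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextFrame:
    """Stand-in for a text frame (hypothetical, not the Pipecat class)."""

    text: str


@dataclass
class ImageRequestFrame:
    """Stand-in for an image request carrying the user's question."""

    participant_id: str
    context: str


class ImageRequester:
    """Turns text into image requests once a participant ID is known."""

    def __init__(self, participant_id: Optional[str] = None):
        self._participant_id = participant_id
        # Lists standing in for the two push directions (assumption).
        self.upstream: List[object] = []
        self.downstream: List[object] = []

    def set_participant_id(self, participant_id: str) -> None:
        self._participant_id = participant_id

    async def process_frame(self, frame: object) -> None:
        if self._participant_id and isinstance(frame, TextFrame):
            # Matching frame: replace it with a request sent upstream.
            self.upstream.append(ImageRequestFrame(self._participant_id, frame.text))
        else:
            # Anything else passes through unchanged.
            self.downstream.append(frame)


async def demo() -> ImageRequester:
    requester = ImageRequester()
    # No participant yet, so the text frame is simply forwarded.
    await requester.process_frame(TextFrame("too early"))
    requester.set_participant_id("user-1")
    await requester.process_frame(TextFrame("What do you see?"))
    return requester


requester = asyncio.run(demo())
print(requester.upstream)
print(requester.downstream)
```

This also shows why the example creates the requester before a client connects: until `set_participant_id` is called, text flows through untouched.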
@@ -31,6 +31,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -32,6 +32,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
async def main():
|
||||
transport = LocalAudioTransport(
|
||||
|
||||
@@ -31,6 +31,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -31,6 +31,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -40,6 +40,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
elif isinstance(frame, TranslationFrame):
|
||||
print(f"Translation ({frame.language}): {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -31,6 +31,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -52,6 +52,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
self._last_transcription_time = time.time()
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -31,6 +31,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -53,6 +53,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
self._last_transcription_time = time.time()
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -32,6 +32,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
|
||||
@@ -32,6 +32,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
|
||||
@@ -32,6 +32,9 @@ class TranscriptionLogger(FrameProcessor):
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
print(f"Transcription: {frame.text}")
|
||||
|
||||
# Push all frames through
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
|
||||
@@ -97,7 +97,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
llm = AnthropicLLMService(
|
||||
api_key=os.getenv("ANTHROPIC_API_KEY"),
|
||||
model="claude-3-7-sonnet-latest",
|
||||
enable_prompt_caching_beta=True,
|
||||
params=AnthropicLLMService.InputParams(enable_prompt_caching=True),
|
||||
)
|
||||
llm.register_function("get_weather", get_weather)
|
||||
llm.register_function("get_image", get_image)
|
||||
|
||||
@@ -0,0 +1,211 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.schemas.function_schema import FunctionSchema
|
||||
from pipecat.adapters.schemas.tools_schema import ToolsSchema
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.frames.frames import LLMRunFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.llm_response_universal import LLMContextAggregatorPair
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import (
|
||||
create_transport,
|
||||
get_transport_client_id,
|
||||
maybe_capture_participant_camera,
|
||||
)
|
||||
from pipecat.services.anthropic.llm import AnthropicLLMService
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
# Global variable to store the client ID
|
||||
client_id = ""
|
||||
|
||||
|
||||
async def get_weather(params: FunctionCallParams):
|
||||
location = params.arguments["location"]
|
||||
await params.result_callback(f"The weather in {location} is currently 72 degrees and sunny.")
|
||||
|
||||
|
||||
async def get_image(params: FunctionCallParams):
|
||||
question = params.arguments["question"]
|
||||
logger.debug(f"Requesting image with user_id={client_id}, question={question}")
|
||||
|
||||
# Request the image frame
|
||||
await params.llm.request_image_frame(
|
||||
user_id=client_id,
|
||||
function_name=params.function_name,
|
||||
tool_call_id=params.tool_call_id,
|
||||
text_content=question,
|
||||
)
|
||||
|
||||
# Wait a short time for the frame to be processed
|
||||
await asyncio.sleep(0.5)
|
||||
|
||||
# Return a result to complete the function call
|
||||
await params.result_callback(
|
||||
f"I've captured an image from your camera and I'm analyzing what you asked about: {question}"
|
||||
)
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
video_in_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
video_in_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
tts = CartesiaTTSService(
|
||||
api_key=os.getenv("CARTESIA_API_KEY"),
|
||||
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
|
||||
)
|
||||
|
||||
llm = AnthropicLLMService(
|
||||
api_key=os.getenv("ANTHROPIC_API_KEY"),
|
||||
model="claude-3-7-sonnet-latest",
|
||||
params=AnthropicLLMService.InputParams(enable_prompt_caching=True),
|
||||
)
|
||||
llm.register_function("get_weather", get_weather)
|
||||
llm.register_function("get_image", get_image)
|
||||
|
||||
weather_function = FunctionSchema(
|
||||
name="get_weather",
|
||||
description="Get the current weather",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
},
|
||||
required=["location"],
|
||||
)
|
||||
get_image_function = FunctionSchema(
|
||||
name="get_image",
|
||||
description="Get an image from the video stream.",
|
||||
properties={
|
||||
"question": {
|
||||
"type": "string",
|
||||
"description": "The question that the user is asking about the image.",
|
||||
}
|
||||
},
|
||||
required=["question"],
|
||||
)
|
||||
tools = ToolsSchema(standard_tools=[weather_function, get_image_function])
|
||||
|
||||
system_prompt = """\
|
||||
You are a helpful assistant who converses with a user and answers questions. Respond concisely to general questions.
|
||||
|
||||
Your response will be turned into speech so use only simple words and punctuation.
|
||||
|
||||
You have access to two tools: get_weather and get_image.
|
||||
|
||||
You can respond to questions about the weather using the get_weather tool.
|
||||
|
||||
You can answer questions about the user's video stream using the get_image tool. Some examples of phrases that \
|
||||
indicate you should use the get_image tool are:
|
||||
- What do you see?
|
||||
- What's in the video?
|
||||
- Can you describe the video?
|
||||
- Tell me about what you see.
|
||||
- Tell me something interesting about what you see.
|
||||
- What's happening in the video?
|
||||
|
||||
If you need to use a tool, simply use the tool. Do not tell the user the tool you are using. Be brief and concise.
|
||||
"""
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": system_prompt},
|
||||
{"role": "user", "content": "Start the conversation by introducing yourself."},
|
||||
]
|
||||
|
||||
context = LLMContext(messages, tools)
|
||||
context_aggregator = LLMContextAggregatorPair(context)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
stt, # STT
|
||||
context_aggregator.user(), # User speech to text
|
||||
llm, # LLM
|
||||
tts, # TTS
|
||||
transport.output(), # Transport bot output
|
||||
context_aggregator.assistant(), # Assistant spoken responses and tool context
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected: {client}")
|
||||
|
||||
await maybe_capture_participant_camera(transport, client)
|
||||
|
||||
global client_id
|
||||
client_id = get_transport_client_id(transport, client)
|
||||
|
||||
# Kick off the conversation.
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
|
||||
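The comment on `transport_params` in the file above says the dictionary stores functions so that objects such as `SileroVADAnalyzer` are not instantiated until a transport is selected. A minimal sketch of that lazy-factory idea follows. `FakeParams`, `transport_factories` and `select_params` are hypothetical names used only for illustration; the real selection is done by `create_transport`, whose internals are not shown in this diff.

```python
from typing import Callable, Dict

created = []  # records which parameter objects were actually built


class FakeParams:
    """Hypothetical stand-in for a params object that is costly to build."""

    def __init__(self, name: str):
        created.append(name)
        self.name = name


# Each value is a zero-argument factory, so nothing is built at import time.
transport_factories: Dict[str, Callable[[], FakeParams]] = {
    "daily": lambda: FakeParams("daily"),
    "webrtc": lambda: FakeParams("webrtc"),
}


def select_params(kind: str) -> FakeParams:
    """Build parameters only for the transport that was chosen."""
    return transport_factories[kind]()


params = select_params("webrtc")
print(created)
```

Only the selected factory runs, so the unused transport never pays the cost of constructing its analyzer.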
228
examples/foundational/19-openai-realtime.py
Normal file
@@ -0,0 +1,228 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
|
||||
import os
|
||||
from datetime import datetime
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.schemas.function_schema import FunctionSchema
|
||||
from pipecat.adapters.schemas.tools_schema import ToolsSchema
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.frames.frames import LLMRunFrame, TranscriptionMessage
|
||||
from pipecat.observers.loggers.transcription_log_observer import TranscriptionLogObserver
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
|
||||
from pipecat.processors.transcript_processor import TranscriptProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.services.openai_realtime import (
|
||||
InputAudioNoiseReduction,
|
||||
InputAudioTranscription,
|
||||
OpenAIRealtimeLLMService,
|
||||
SemanticTurnDetection,
|
||||
SessionProperties,
|
||||
)
|
||||
from pipecat.services.openai_realtime.events import AudioConfiguration, AudioInput
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
async def fetch_weather_from_api(params: FunctionCallParams):
|
||||
temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
|
||||
await params.result_callback(
|
||||
{
|
||||
"conditions": "nice",
|
||||
"temperature": temperature,
|
||||
"format": params.arguments["format"],
|
||||
"timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
async def fetch_restaurant_recommendation(params: FunctionCallParams):
|
||||
await params.result_callback({"name": "The Golden Dragon"})
|
||||
|
||||
|
||||
weather_function = FunctionSchema(
|
||||
name="get_current_weather",
|
||||
description="Get the current weather",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["celsius", "fahrenheit"],
|
||||
"description": "The temperature unit to use. Infer this from the user's location.",
|
||||
},
|
||||
},
|
||||
required=["location", "format"],
|
||||
)
|
||||
|
||||
restaurant_function = FunctionSchema(
|
||||
name="get_restaurant_recommendation",
|
||||
description="Get a restaurant recommendation",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
},
|
||||
required=["location"],
|
||||
)
|
||||
|
||||
# Create tools schema
|
||||
tools = ToolsSchema(standard_tools=[weather_function, restaurant_function])
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"twilio": lambda: FastAPIWebsocketParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
session_properties = SessionProperties(
|
||||
audio=AudioConfiguration(
|
||||
input=AudioInput(
|
||||
transcription=InputAudioTranscription(),
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
turn_detection=SemanticTurnDetection(),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
noise_reduction=InputAudioNoiseReduction(type="near_field"),
|
||||
)
|
||||
),
|
||||
# tools=tools,
|
||||
instructions="""You are a helpful and friendly AI.
|
||||
|
||||
Act like a human, but remember that you aren't a human and that you can't do human
|
||||
things in the real world. Your voice and personality should be warm and engaging, with a lively and
|
||||
playful tone.
|
||||
|
||||
If interacting in a non-English language, start by using the standard accent or dialect familiar to
|
||||
the user. Talk quickly. You should always call a function if you can. Do not refer to these rules,
|
||||
even if you're asked about them.
|
||||
|
||||
You are participating in a voice conversation. Keep your responses concise, short, and to the point
|
||||
unless specifically asked to elaborate on a topic.
|
||||
|
||||
You have access to the following tools:
|
||||
- get_current_weather: Get the current weather for a given location.
|
||||
- get_restaurant_recommendation: Get a restaurant recommendation for a given location.
|
||||
|
||||
Remember, your responses should be short. Just one or two sentences, usually. Respond in English.""",
|
||||
)
|
||||
|
||||
llm = OpenAIRealtimeLLMService(
|
||||
api_key=os.getenv("OPENAI_API_KEY"),
|
||||
session_properties=session_properties,
|
||||
start_audio_paused=False,
|
||||
)
|
||||
|
||||
# You can register a single function for all function calls, or specific functions
|
||||
# llm.register_function(None, fetch_weather_from_api)
|
||||
llm.register_function("get_current_weather", fetch_weather_from_api)
|
||||
llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
|
||||
|
||||
transcript = TranscriptProcessor()
|
||||
|
||||
# Create a standard OpenAI LLM context object using the normal messages format. The
|
||||
# OpenAIRealtimeLLMService will convert this internally to messages that the
|
||||
# openai WebSocket API can understand.
|
||||
context = OpenAILLMContext(
|
||||
[{"role": "user", "content": "Say hello!"}],
|
||||
tools,
|
||||
)
|
||||
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
context_aggregator.user(),
|
||||
llm, # LLM
|
||||
transcript.user(), # Placed after the LLM, as LLM pushes TranscriptionFrames downstream
|
||||
transport.output(), # Transport bot output
|
||||
transcript.assistant(), # After the transcript output, to time with the audio output
|
||||
context_aggregator.assistant(),
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
observers=[TranscriptionLogObserver()],
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
# Kick off the conversation.
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
# Register event handler for transcript updates
|
||||
@transcript.event_handler("on_transcript_update")
|
||||
async def on_transcript_update(processor, frame):
|
||||
for msg in frame.messages:
|
||||
if isinstance(msg, TranscriptionMessage):
|
||||
timestamp = f"[{msg.timestamp}] " if msg.timestamp else ""
|
||||
line = f"{timestamp}{msg.role}: {msg.content}"
|
||||
logger.info(f"Transcript: {line}")
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
|
||||
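The function handlers in the file above do not return a value; they hand their result to `params.result_callback`. The sketch below exercises that contract without any Pipecat dependency. `FakeCallParams` is a hypothetical stand-in that exposes only the `arguments` and `result_callback` fields used here, and `call` is an illustrative helper, not part of the library.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict


@dataclass
class FakeCallParams:
    """Hypothetical stand-in exposing only the two fields the handler uses."""

    arguments: Dict[str, Any]
    result_callback: Callable[[Any], Awaitable[None]]


async def fetch_weather(params: FakeCallParams) -> None:
    # Same branch as the example: 75 for fahrenheit, otherwise 24.
    temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
    await params.result_callback(
        {
            "conditions": "nice",
            "temperature": temperature,
            "format": params.arguments["format"],
        }
    )


async def call(arguments: Dict[str, Any]) -> Any:
    """Invoke the handler and capture whatever it passes to the callback."""
    results = []

    async def capture(result: Any) -> None:
        results.append(result)

    await fetch_weather(FakeCallParams(arguments=arguments, result_callback=capture))
    return results[0]


celsius = asyncio.run(call({"location": "Paris", "format": "celsius"}))
fahrenheit = asyncio.run(call({"location": "Austin, TX", "format": "fahrenheit"}))
print(celsius["temperature"], fahrenheit["temperature"])
```

Because the handler reads `params.arguments["format"]` directly, the schema marks `format` as required; a call without it would raise `KeyError` in this sketch.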
221
examples/foundational/19a-azure-realtime.py
Normal file
@@ -0,0 +1,221 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
|
||||
import os
|
||||
from datetime import datetime
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.schemas.function_schema import FunctionSchema
|
||||
from pipecat.adapters.schemas.tools_schema import ToolsSchema
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.frames.frames import LLMRunFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.services.openai_realtime import (
|
||||
AzureRealtimeLLMService,
|
||||
InputAudioTranscription,
|
||||
SessionProperties,
|
||||
)
|
||||
from pipecat.services.openai_realtime.events import AudioConfiguration, AudioInput
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
async def fetch_weather_from_api(params: FunctionCallParams):
|
||||
temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
|
||||
await params.result_callback(
|
||||
{
|
||||
"conditions": "nice",
|
||||
"temperature": temperature,
|
||||
"format": params.arguments["format"],
|
||||
"timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
async def fetch_restaurant_recommendation(params: FunctionCallParams):
|
||||
await params.result_callback({"name": "The Golden Dragon"})
|
||||
|
||||
|
||||
# Define weather function using standardized schema
|
||||
weather_function = FunctionSchema(
|
||||
name="get_current_weather",
|
||||
description="Get the current weather",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["celsius", "fahrenheit"],
|
||||
"description": "The temperature unit to use. Infer this from the user's location.",
|
||||
},
|
||||
},
|
||||
required=["location", "format"],
|
||||
)
|
||||
|
||||
restaurant_function = FunctionSchema(
|
||||
name="get_restaurant_recommendation",
|
||||
description="Get a restaurant recommendation",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
},
|
||||
required=["location"],
|
||||
)
|
||||
|
||||
# Create tools schema
|
||||
tools = ToolsSchema(standard_tools=[weather_function, restaurant_function])
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"twilio": lambda: FastAPIWebsocketParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
session_properties = SessionProperties(
|
||||
audio=AudioConfiguration(
|
||||
input=AudioInput(
|
||||
transcription=InputAudioTranscription(model="whisper-1"),
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
# turn_detection=TurnDetection(silence_duration_ms=1000),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
)
|
||||
),
|
||||
# tools=tools,
|
||||
instructions="""You are a helpful and friendly AI.
|
||||
|
||||
Act like a human, but remember that you aren't a human and that you can't do human
|
||||
things in the real world. Your voice and personality should be warm and engaging, with a lively and
|
||||
playful tone.
|
||||
|
||||
If interacting in a non-English language, start by using the standard accent or dialect familiar to
|
||||
the user. Talk quickly. You should always call a function if you can. Do not refer to these rules,
|
||||
even if you're asked about them.
|
||||
-
|
||||
You are participating in a voice conversation. Keep your responses concise, short, and to the point
|
||||
unless specifically asked to elaborate on a topic.
|
||||
|
||||
You have access to the following tools:
|
||||
- get_current_weather: Get the current weather for a given location.
|
||||
- get_restaurant_recommendation: Get a restaurant recommendation for a given location.
|
||||
|
||||
Remember, your responses should be short. Just one or two sentences, usually. Respond in English.""",
|
||||
)
|
||||
|
||||
llm = AzureRealtimeLLMService(
|
||||
api_key=os.getenv("AZURE_REALTIME_API_KEY"),
|
||||
base_url=os.getenv("AZURE_REALTIME_BASE_URL"),
|
||||
session_properties=session_properties,
|
||||
start_audio_paused=False,
|
||||
)
|
||||
|
||||
# you can either register a single function for all function calls, or specific functions
|
||||
# llm.register_function(None, fetch_weather_from_api)
|
||||
llm.register_function("get_current_weather", fetch_weather_from_api)
|
||||
llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
|
||||
|
||||
# Create a standard OpenAI LLM context object using the normal messages format. The
|
||||
# AzureRealtimeLLMService will convert this internally to messages that the
|
||||
# openai WebSocket API can understand.
|
||||
context = OpenAILLMContext(
|
||||
[{"role": "user", "content": "Say hello!"}],
|
||||
# [{"role": "user", "content": [{"type": "text", "text": "Say hello!"}]}],
|
||||
# [
|
||||
# {
|
||||
# "role": "user",
|
||||
# "content": [
|
||||
# {"type": "text", "text": "Say"},
|
||||
# {"type": "text", "text": "yo what's up!"},
|
||||
# ],
|
||||
# }
|
||||
# ],
|
||||
tools,
|
||||
)
|
||||
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
context_aggregator.user(),
|
||||
llm, # LLM
|
||||
transport.output(), # Transport bot output
|
||||
context_aggregator.assistant(),
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
# Kick off the conversation.
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
|
||||
@@ -31,6 +31,7 @@ from pipecat.services.openai_realtime_beta import (
|
||||
SemanticTurnDetection,
|
||||
SessionProperties,
|
||||
)
|
||||
from pipecat.services.openai_realtime_beta.events import AudioConfiguration, AudioInput
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
@@ -113,14 +114,18 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
session_properties = SessionProperties(
|
||||
input_audio_transcription=InputAudioTranscription(),
|
||||
modalities=["text"],
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
turn_detection=SemanticTurnDetection(),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
input_audio_noise_reduction=InputAudioNoiseReduction(type="near_field"),
|
||||
audio=AudioConfiguration(
|
||||
input=AudioInput(
|
||||
transcription=InputAudioTranscription(),
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
turn_detection=SemanticTurnDetection(),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
noise_reduction=InputAudioNoiseReduction(type="near_field"),
|
||||
)
|
||||
),
|
||||
output_modalities=["text"],
|
||||
# tools=tools,
|
||||
instructions="""You are a helpful and friendly AI.
|
||||
|
||||
|
||||
234
examples/foundational/19b-openai-realtime-text.py
Normal file
@@ -0,0 +1,234 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
|
||||
import os
|
||||
from datetime import datetime
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.schemas.function_schema import FunctionSchema
|
||||
from pipecat.adapters.schemas.tools_schema import ToolsSchema
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.frames.frames import LLMRunFrame, TranscriptionMessage
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
|
||||
from pipecat.processors.transcript_processor import TranscriptProcessor
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia import CartesiaTTSService
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.services.openai_realtime import (
|
||||
InputAudioNoiseReduction,
|
||||
InputAudioTranscription,
|
||||
OpenAIRealtimeLLMService,
|
||||
SemanticTurnDetection,
|
||||
SessionProperties,
|
||||
)
|
||||
from pipecat.services.openai_realtime.events import AudioConfiguration, AudioInput
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
|
||||
async def fetch_weather_from_api(params: FunctionCallParams):
|
||||
temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
|
||||
await params.result_callback(
|
||||
{
|
||||
"conditions": "nice",
|
||||
"temperature": temperature,
|
||||
"format": params.arguments["format"],
|
||||
"timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
async def fetch_restaurant_recommendation(params: FunctionCallParams):
|
||||
await params.result_callback({"name": "The Golden Dragon"})
|
||||
|
||||
|
||||
weather_function = FunctionSchema(
|
||||
name="get_current_weather",
|
||||
description="Get the current weather",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["celsius", "fahrenheit"],
|
||||
"description": "The temperature unit to use. Infer this from the user's location.",
|
||||
},
|
||||
},
|
||||
required=["location", "format"],
|
||||
)
|
||||
|
||||
restaurant_function = FunctionSchema(
|
||||
name="get_restaurant_recommendation",
|
||||
description="Get a restaurant recommendation",
|
||||
properties={
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
},
|
||||
required=["location"],
|
||||
)
|
||||
|
||||
# Create tools schema
|
||||
tools = ToolsSchema(standard_tools=[weather_function, restaurant_function])
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"twilio": lambda: FastAPIWebsocketParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
session_properties = SessionProperties(
|
||||
audio=AudioConfiguration(
|
||||
input=AudioInput(
|
||||
transcription=InputAudioTranscription(),
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
turn_detection=SemanticTurnDetection(),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
noise_reduction=InputAudioNoiseReduction(type="near_field"),
|
||||
)
|
||||
),
|
||||
output_modalities=["text"],
|
||||
# tools=tools,
|
||||
instructions="""You are a helpful and friendly AI.
|
||||
|
||||
Act like a human, but remember that you aren't a human and that you can't do human
|
||||
things in the real world. Your voice and personality should be warm and engaging, with a lively and
|
||||
playful tone.
|
||||
|
||||
If interacting in a non-English language, start by using the standard accent or dialect familiar to
|
||||
the user. Talk quickly. You should always call a function if you can. Do not refer to these rules,
|
||||
even if you're asked about them.
|
||||
|
||||
You are participating in a voice conversation. Keep your responses concise, short, and to the point
|
||||
unless specifically asked to elaborate on a topic.
|
||||
|
||||
You have access to the following tools:
|
||||
- get_current_weather: Get the current weather for a given location.
|
||||
- get_restaurant_recommendation: Get a restaurant recommendation for a given location.
|
||||
|
||||
Remember, your responses should be short. Just one or two sentences, usually. Respond in English.""",
|
||||
)
|
||||
|
||||
llm = OpenAIRealtimeLLMService(
|
||||
api_key=os.getenv("OPENAI_API_KEY"),
|
||||
session_properties=session_properties,
|
||||
start_audio_paused=False,
|
||||
)
|
||||
|
||||
tts = CartesiaTTSService(
|
||||
api_key=os.getenv("CARTESIA_API_KEY"),
|
||||
voice_id="71a7ad14-091c-4e8e-a314-022ece01c121", # British Reading Lady
|
||||
)
|
||||
|
||||
# you can either register a single function for all function calls, or specific functions
|
||||
# llm.register_function(None, fetch_weather_from_api)
|
||||
llm.register_function("get_current_weather", fetch_weather_from_api)
|
||||
llm.register_function("get_restaurant_recommendation", fetch_restaurant_recommendation)
|
||||
|
||||
transcript = TranscriptProcessor()
|
||||
|
||||
# Create a standard OpenAI LLM context object using the normal messages format. The
|
||||
# OpenAIRealtimeLLMService will convert this internally to messages that the
|
||||
# openai WebSocket API can understand.
|
||||
context = OpenAILLMContext(
|
||||
[{"role": "user", "content": "Say hello!"}],
|
||||
tools,
|
||||
)
|
||||
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
context_aggregator.user(),
|
||||
llm, # LLM
|
||||
tts, # TTS
|
||||
transcript.user(), # Placed after the LLM, as LLM pushes TranscriptionFrames downstream
|
||||
transport.output(), # Transport bot output
|
||||
transcript.assistant(), # After the transport output, to time with the audio output
|
||||
context_aggregator.assistant(),
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
# Kick off the conversation.
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
# Register event handler for transcript updates
|
||||
@transcript.event_handler("on_transcript_update")
|
||||
async def on_transcript_update(processor, frame):
|
||||
for msg in frame.messages:
|
||||
if isinstance(msg, TranscriptionMessage):
|
||||
timestamp = f"[{msg.timestamp}] " if msg.timestamp else ""
|
||||
line = f"{timestamp}{msg.role}: {msg.content}"
|
||||
logger.info(f"Transcript: {line}")
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
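The `on_transcript_update` handler in the file above logs each message as an optional bracketed timestamp followed by the role and content. A minimal standalone sketch of that formatting follows, assuming a hypothetical `Message` dataclass that stands in for Pipecat's `TranscriptionMessage`; only the `role`, `content`, and `timestamp` fields the handler reads are modeled, and `format_transcript_line` is an illustrative helper name, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Hypothetical stand-in for TranscriptionMessage (not the Pipecat class)."""

    role: str
    content: str
    timestamp: Optional[str] = None


def format_transcript_line(msg: Message) -> str:
    # Same formatting as the handler: the timestamp prefix is omitted when unset.
    timestamp = f"[{msg.timestamp}] " if msg.timestamp else ""
    return f"{timestamp}{msg.role}: {msg.content}"
```

Keeping the formatting in a plain function like this makes it easy to check without running a pipeline; the handler itself would only loop over `frame.messages` and log the result.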
|
||||
@@ -0,0 +1,274 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
import asyncio
|
||||
import glob
|
||||
import json
|
||||
import os
|
||||
from datetime import datetime
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.audio.vad.silero import SileroVADAnalyzer
|
||||
from pipecat.frames.frames import LLMRunFrame
|
||||
from pipecat.pipeline.pipeline import Pipeline
|
||||
from pipecat.pipeline.runner import PipelineRunner
|
||||
from pipecat.pipeline.task import PipelineParams, PipelineTask
|
||||
from pipecat.processors.aggregators.openai_llm_context import (
|
||||
OpenAILLMContext,
|
||||
)
|
||||
from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.services.openai_realtime_beta import (
|
||||
InputAudioTranscription,
|
||||
OpenAIRealtimeBetaLLMService,
|
||||
SessionProperties,
|
||||
TurnDetection,
|
||||
)
|
||||
from pipecat.services.openai_realtime_beta.events import AudioConfiguration, AudioInput
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
BASE_FILENAME = "/tmp/pipecat_conversation_"
|
||||
|
||||
|
||||
async def fetch_weather_from_api(params: FunctionCallParams):
|
||||
temperature = 75 if params.arguments["format"] == "fahrenheit" else 24
|
||||
await params.result_callback(
|
||||
{
|
||||
"conditions": "nice",
|
||||
"temperature": temperature,
|
||||
"format": params.arguments["format"],
|
||||
"timestamp": datetime.now().strftime("%Y%m%d_%H%M%S"),
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
async def get_saved_conversation_filenames(params: FunctionCallParams):
|
||||
# Construct the full pattern including the BASE_FILENAME
|
||||
full_pattern = f"{BASE_FILENAME}*.json"
|
||||
|
||||
# Use glob to find all matching files
|
||||
matching_files = glob.glob(full_pattern)
|
||||
logger.debug(f"matching files: {matching_files}")
|
||||
|
||||
await params.result_callback({"filenames": matching_files})
|
||||
|
||||
|
||||
async def save_conversation(params: FunctionCallParams):
|
||||
timestamp = datetime.now().strftime("%Y-%m-%d_%H:%M:%S")
|
||||
filename = f"{BASE_FILENAME}{timestamp}.json"
|
||||
logger.debug(
|
||||
f"writing conversation to {filename}\n{json.dumps(params.context.messages, indent=4)}"
|
||||
)
|
||||
try:
|
||||
with open(filename, "w") as file:
|
||||
messages = params.context.get_messages_for_persistent_storage()
|
||||
# remove the last message, which is the instruction we just gave to save the conversation
|
||||
messages.pop()
|
||||
json.dump(messages, file, indent=2)
|
||||
await params.result_callback({"success": True})
|
||||
except Exception as e:
|
||||
await params.result_callback({"success": False, "error": str(e)})
|
||||
|
||||
|
||||
async def load_conversation(params: FunctionCallParams):
|
||||
async def _reset():
|
||||
filename = params.arguments["filename"]
|
||||
logger.debug(f"loading conversation from {filename}")
|
||||
try:
|
||||
with open(filename, "r") as file:
|
||||
params.context.set_messages(json.load(file))
|
||||
await params.llm.reset_conversation()
|
||||
await params.llm._create_response()
|
||||
except Exception as e:
|
||||
await params.result_callback({"success": False, "error": str(e)})
|
||||
|
||||
asyncio.create_task(_reset())
|
||||
|
||||
|
||||
tools = [
|
||||
{
|
||||
"type": "function",
|
||||
"name": "get_current_weather",
|
||||
"description": "Get the current weather",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"format": {
|
||||
"type": "string",
|
||||
"enum": ["celsius", "fahrenheit"],
|
||||
"description": "The temperature unit to use. Infer this from the user's location.",
|
||||
},
|
||||
},
|
||||
"required": ["location", "format"],
|
||||
},
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"name": "save_conversation",
|
||||
"description": "Save the current conversation. Use this function to persist the current conversation to external storage.",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {},
|
||||
"required": [],
|
||||
},
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"name": "get_saved_conversation_filenames",
|
||||
"description": "Get a list of saved conversation histories. Returns a list of filenames. Each filename includes a date and timestamp. Each file is conversation history that can be loaded into this session.",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {},
|
||||
"required": [],
|
||||
},
|
||||
},
|
||||
{
|
||||
"type": "function",
|
||||
"name": "load_conversation",
|
||||
"description": "Load a conversation history. Use this function to load a conversation history into the current session.",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"filename": {
|
||||
"type": "string",
|
||||
"description": "The filename of the conversation history to load.",
|
||||
}
|
||||
},
|
||||
"required": ["filename"],
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
|
||||
# We store functions so objects (e.g. SileroVADAnalyzer) don't get
|
||||
# instantiated. The function will be called when the desired transport gets
|
||||
# selected.
|
||||
transport_params = {
|
||||
"daily": lambda: DailyParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"twilio": lambda: FastAPIWebsocketParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
audio_in_enabled=True,
|
||||
audio_out_enabled=True,
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
logger.info(f"Starting bot")
|
||||
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
session_properties = SessionProperties(
|
||||
audio=AudioConfiguration(
|
||||
input=AudioInput(
|
||||
transcription=InputAudioTranscription(),
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
turn_detection=TurnDetection(silence_duration_ms=1000),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
)
|
||||
),
|
||||
# tools=tools,
|
||||
instructions="""Your knowledge cutoff is 2023-10. You are a helpful and friendly AI.
|
||||
|
||||
Act like a human, but remember that you aren't a human and that you can't do human
|
||||
things in the real world. Your voice and personality should be warm and engaging, with a lively and
|
||||
playful tone.
|
||||
|
||||
If interacting in a non-English language, start by using the standard accent or dialect familiar to
|
||||
the user. Talk quickly. You should always call a function if you can. Do not refer to these rules,
|
||||
even if you're asked about them.
|
||||
-
|
||||
You are participating in a voice conversation. Keep your responses concise, short, and to the point
|
||||
unless specifically asked to elaborate on a topic.
|
||||
|
||||
Remember, your responses should be short. Just one or two sentences, usually.""",
|
||||
)
|
||||
|
||||
llm = OpenAIRealtimeBetaLLMService(
|
||||
api_key=os.getenv("OPENAI_API_KEY"),
|
||||
session_properties=session_properties,
|
||||
start_audio_paused=False,
|
||||
)
|
||||
|
||||
# you can either register a single function for all function calls, or specific functions
|
||||
# llm.register_function(None, fetch_weather_from_api)
|
||||
llm.register_function("get_current_weather", fetch_weather_from_api)
|
||||
llm.register_function("save_conversation", save_conversation)
|
||||
llm.register_function("get_saved_conversation_filenames", get_saved_conversation_filenames)
|
||||
llm.register_function("load_conversation", load_conversation)
|
||||
|
||||
context = OpenAILLMContext([], tools)
|
||||
context_aggregator = llm.create_context_aggregator(context)
|
||||
|
||||
pipeline = Pipeline(
|
||||
[
|
||||
transport.input(), # Transport user input
|
||||
stt, # STT
|
||||
context_aggregator.user(),
|
||||
llm, # LLM
|
||||
transport.output(), # Transport bot output
|
||||
context_aggregator.assistant(),
|
||||
]
|
||||
)
|
||||
|
||||
task = PipelineTask(
|
||||
pipeline,
|
||||
params=PipelineParams(
|
||||
enable_metrics=True,
|
||||
enable_usage_metrics=True,
|
||||
),
|
||||
idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
|
||||
)
|
||||
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
# Kick off the conversation.
|
||||
await task.queue_frames([LLMRunFrame()])
|
||||
|
||||
@transport.event_handler("on_client_disconnected")
|
||||
async def on_client_disconnected(transport, client):
|
||||
logger.info(f"Client disconnected")
|
||||
await task.cancel()
|
||||
|
||||
runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
|
||||
|
||||
await runner.run(task)
|
||||
|
||||
|
||||
async def bot(runner_args: RunnerArguments):
|
||||
"""Main bot entry point compatible with Pipecat Cloud."""
|
||||
transport = await create_transport(runner_args, transport_params)
|
||||
await run_bot(transport, runner_args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
from pipecat.runner.run import main
|
||||
|
||||
main()
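The save, list, and load functions in the file above amount to a small file-based round trip: write the messages to a timestamped JSON file under a common prefix, find saved files with a glob pattern, and read one back. A rough sketch of that round trip without any Pipecat objects is shown here. The helper names (`save_messages`, `list_saved`, `load_messages`) are hypothetical, and the timestamp uses `-` instead of `:` in the time part, which differs from the example and is only an assumption made for portability on filesystems that reject colons.

```python
import glob
import json
from datetime import datetime


def save_messages(base: str, messages: list) -> str:
    """Write all but the last message to a timestamped JSON file and return its name."""
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"{base}{timestamp}.json"
    with open(filename, "w") as file:
        # The last message is the "save the conversation" instruction itself,
        # so it is dropped, mirroring messages.pop() in the example.
        json.dump(messages[:-1], file, indent=2)
    return filename


def list_saved(base: str) -> list:
    """Return every saved conversation file that shares the given prefix."""
    # glob.escape guards against special characters in the prefix.
    return sorted(glob.glob(f"{glob.escape(base)}*.json"))


def load_messages(filename: str) -> list:
    """Read a saved conversation back as a list of message dicts."""
    with open(filename, "r") as file:
        return json.load(file)
```

In the example, the loaded list is passed to `params.context.set_messages(...)` before the service resets its conversation; this sketch stops at returning the list.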
|
||||
@@ -25,12 +25,13 @@ from pipecat.runner.types import RunnerArguments
|
||||
from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.llm_service import FunctionCallParams
|
||||
from pipecat.services.openai_realtime_beta import (
|
||||
from pipecat.services.openai_realtime import (
|
||||
InputAudioTranscription,
|
||||
OpenAIRealtimeBetaLLMService,
|
||||
OpenAIRealtimeLLMService,
|
||||
SessionProperties,
|
||||
TurnDetection,
|
||||
)
|
||||
from pipecat.services.openai_realtime.events import AudioConfiguration, AudioInput
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
|
||||
@@ -182,12 +183,16 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
|
||||
|
||||
session_properties = SessionProperties(
|
||||
input_audio_transcription=InputAudioTranscription(),
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
turn_detection=TurnDetection(silence_duration_ms=1000),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
audio=AudioConfiguration(
|
||||
input=AudioInput(
|
||||
transcription=InputAudioTranscription(),
|
||||
# Set openai TurnDetection parameters. Not setting this at all will turn it
|
||||
# on by default
|
||||
turn_detection=TurnDetection(silence_duration_ms=1000),
|
||||
# Or set to False to disable openai turn detection and use transport VAD
|
||||
# turn_detection=False,
|
||||
)
|
||||
),
|
||||
# tools=tools,
|
||||
instructions="""Your knowledge cutoff is 2023-10. You are a helpful and friendly AI.
|
||||
|
||||
@@ -205,7 +210,7 @@ unless specifically asked to elaborate on a topic.
|
||||
Remember, your responses should be short. Just one or two sentences, usually.""",
|
||||
)
|
||||
|
||||
llm = OpenAIRealtimeBetaLLMService(
|
||||
llm = OpenAIRealtimeLLMService(
|
||||
api_key=os.getenv("OPENAI_API_KEY"),
|
||||
session_properties=session_properties,
|
||||
start_audio_paused=False,
|
||||
|
||||
@@ -21,9 +21,10 @@ from pipecat.runner.utils import create_transport
|
||||
from pipecat.services.cartesia.tts import CartesiaTTSService
|
||||
from pipecat.services.deepgram.stt import DeepgramSTTService
|
||||
from pipecat.services.google.llm import GoogleLLMService
|
||||
from pipecat.services.heygen.api import AvatarQuality, NewSessionRequest
|
||||
from pipecat.services.heygen.video import HeyGenVideoService
|
||||
from pipecat.transports.base_transport import BaseTransport, TransportParams
|
||||
from pipecat.transports.daily.transport import DailyParams
|
||||
from pipecat.transports.daily.transport import DailyParams, DailyTransport
|
||||
|
||||
load_dotenv(override=True)
|
||||
|
||||
@@ -38,6 +39,7 @@ transport_params = {
|
||||
video_out_is_live=True,
|
||||
video_out_width=1280,
|
||||
video_out_height=720,
|
||||
video_out_bitrate=2_000_000, # 2 Mbps
|
||||
vad_analyzer=SileroVADAnalyzer(),
|
||||
),
|
||||
"webrtc": lambda: TransportParams(
|
||||
@@ -64,7 +66,13 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
|
||||
llm = GoogleLLMService(api_key=os.getenv("GOOGLE_API_KEY"))
|
||||
|
||||
heyGen = HeyGenVideoService(api_key=os.getenv("HEYGEN_API_KEY"), session=session)
|
||||
heyGen = HeyGenVideoService(
|
||||
api_key=os.getenv("HEYGEN_API_KEY"),
|
||||
session=session,
|
||||
session_request=NewSessionRequest(
|
||||
avatar_id="Shawn_Therapist_public", version="v2", quality=AvatarQuality.high
|
||||
),
|
||||
)
|
||||
|
||||
messages = [
|
||||
{
|
||||
@@ -101,6 +109,18 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
|
||||
@transport.event_handler("on_client_connected")
|
||||
async def on_client_connected(transport, client):
|
||||
logger.info(f"Client connected")
|
||||
# Updating publishing settings to enable adaptive bitrate
|
||||
if isinstance(transport, DailyTransport):
|
||||
await transport.update_publishing(
|
||||
publishing_settings={
|
||||
"camera": {
|
||||
"sendSettings": {
|
||||
"allowAdaptiveLayers": True,
|
||||
}
|
||||
}
|
||||
}
|
||||
)
|
||||
|
||||
# Kick off the conversation.
|
||||
messages.append(
|
||||
{
|
||||
|
||||
@@ -4,7 +4,7 @@ version = "0.1.0"
|
||||
description = "Quickstart example for building voice AI bots with Pipecat"
|
||||
requires-python = ">=3.10"
|
||||
dependencies = [
|
||||
"pipecat-ai[webrtc,daily,silero,deepgram,openai,cartesia,runner]>=0.0.82",
|
||||
"pipecat-ai[webrtc,daily,silero,deepgram,openai,cartesia,runner]>=0.0.83",
|
||||
"pipecatcloud>=0.2.4"
|
||||
]
|
||||
|
||||
|
||||
3158
examples/quickstart/uv.lock
generated
File diff suppressed because it is too large
@@ -55,7 +55,7 @@ azure = [ "azure-cognitiveservices-speech~=1.42.0"]
|
||||
cartesia = [ "cartesia~=2.0.3", "websockets>=13.1,<15.0" ]
|
||||
cerebras = []
|
||||
deepseek = []
|
||||
daily = [ "daily-python~=0.19.8" ]
|
||||
daily = [ "daily-python~=0.19.9" ]
|
||||
deepgram = [ "deepgram-sdk~=4.7.0" ]
|
||||
elevenlabs = [ "websockets>=13.1,<15.0" ]
|
||||
fal = [ "fal-client~=0.5.9" ]
|
||||
|
||||
@@ -47,7 +47,7 @@ from pipecat.transports.daily.transport import DailyParams, DailyTransport
|
||||
SCRIPT_DIR = Path(__file__).resolve().parent
|
||||
|
||||
PIPELINE_IDLE_TIMEOUT_SECS = 60
|
||||
EVAL_TIMEOUT_SECS = 90
|
||||
EVAL_TIMEOUT_SECS = 120
|
||||
|
||||
EvalPrompt = str | Tuple[str, ImageFile]
|
||||
|
||||
@@ -266,8 +266,11 @@ async def run_eval_pipeline(
|
||||
elif isinstance(prompt, tuple):
|
||||
example_prompt, example_image = prompt
|
||||
|
||||
eval_prompt = f"The answer is correct if it's appropriate for the context and matches: {eval}."
|
||||
common_system_prompt = f"Call the eval function with your assessment only if the user answers the question. {eval_prompt}"
|
||||
eval_prompt = f"The answer is correct if it matches: {eval}."
|
||||
common_system_prompt = (
|
||||
"The user might say things other than the answer and that's allowed. "
|
||||
f"You should only call the eval function with your assessment when the user actually answers the question. {eval_prompt}"
|
||||
)
|
||||
if user_speaks_first:
|
||||
system_prompt = f"You are an LLM eval, be extremely brief. You will start the conversation by saying: '{example_prompt}'. {common_system_prompt}"
|
||||
else:
|
||||
|
||||
@@ -39,11 +39,12 @@ class BaseLLMAdapter(ABC, Generic[TLLMInvocationParams]):
|
||||
"""
|
||||
|
||||
@abstractmethod
|
||||
def get_llm_invocation_params(self, context: LLMContext) -> TLLMInvocationParams:
|
||||
def get_llm_invocation_params(self, context: LLMContext, **kwargs) -> TLLMInvocationParams:
|
||||
"""Get provider-specific LLM invocation parameters from a universal LLM context.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing messages, tools, etc.
|
||||
**kwargs: Additional provider-specific arguments that subclasses can use.
|
||||
|
||||
Returns:
|
||||
Provider-specific parameters for invoking the LLM.
|
||||
|
||||
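Review note: adding `**kwargs` to the abstract signature lets an adapter accept provider-specific arguments while callers that only know the base type keep passing just `context`. Below is a minimal sketch of that pattern. `BaseAdapter` and `CachingAdapter` are hypothetical stand-ins, not Pipecat classes, and the plain `dict` context is an assumption made to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseAdapter(ABC):
    """Hypothetical stand-in for an adapter base class."""

    @abstractmethod
    def get_llm_invocation_params(self, context: Dict[str, Any], **kwargs) -> Dict[str, Any]:
        """Build provider-specific parameters from a generic context."""


class CachingAdapter(BaseAdapter):
    """Hypothetical subclass that needs one extra, provider-specific flag."""

    def get_llm_invocation_params(
        self, context: Dict[str, Any], enable_prompt_caching: bool = False, **kwargs
    ) -> Dict[str, Any]:
        # Copy the messages so the caller's context is not mutated.
        params: Dict[str, Any] = {"messages": list(context.get("messages", []))}
        # The extra flag only influences this provider's output.
        params["cache"] = enable_prompt_caching
        return params


adapter = CachingAdapter()
params = adapter.get_llm_invocation_params({"messages": ["hi"]}, enable_prompt_caching=True)
```

In the sketch the flag has a default so the base call shape still works. The Anthropic override in this diff instead declares `enable_prompt_caching` without a default, so its callers must always supply it.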
@@ -6,12 +6,25 @@
|
||||
|
||||
"""Anthropic LLM adapter for Pipecat."""
|
||||
|
||||
from typing import Any, Dict, List, TypedDict
|
||||
import copy
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from typing import Any, Dict, List, Optional, TypedDict
|
||||
|
||||
from anthropic import NOT_GIVEN, NotGiven
|
||||
from anthropic.types.message_param import MessageParam
|
||||
from anthropic.types.tool_union_param import ToolUnionParam
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.base_llm_adapter import BaseLLMAdapter
|
||||
from pipecat.adapters.schemas.function_schema import FunctionSchema
|
||||
from pipecat.adapters.schemas.tools_schema import ToolsSchema
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.processors.aggregators.llm_context import (
|
||||
LLMContext,
|
||||
LLMContextMessage,
|
||||
LLMSpecificMessage,
|
||||
LLMStandardMessage,
|
||||
)
|
||||
|
||||
|
||||
class AnthropicLLMInvocationParams(TypedDict):
|
||||
@@ -20,7 +33,9 @@ class AnthropicLLMInvocationParams(TypedDict):
|
||||
This is a placeholder until support for universal LLMContext machinery is added for Anthropic.
|
||||
"""
|
||||
|
||||
pass
|
||||
system: str | NotGiven
|
||||
messages: List[MessageParam]
|
||||
tools: List[ToolUnionParam]
|
||||
|
||||
|
||||
class AnthropicLLMAdapter(BaseLLMAdapter[AnthropicLLMInvocationParams]):
|
||||
@@ -30,20 +45,33 @@ class AnthropicLLMAdapter(BaseLLMAdapter[AnthropicLLMInvocationParams]):
|
||||
to the specific format required by Anthropic's Claude models for function calling.
|
||||
"""
|
||||
|
||||
def get_llm_invocation_params(self, context: LLMContext) -> AnthropicLLMInvocationParams:
|
||||
def get_llm_invocation_params(
|
||||
self, context: LLMContext, enable_prompt_caching: bool
|
||||
) -> AnthropicLLMInvocationParams:
|
||||
"""Get Anthropic-specific LLM invocation parameters from a universal LLM context.
|
||||
|
||||
This is a placeholder until support for universal LLMContext machinery is added for Anthropic.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing messages, tools, etc.
|
||||
enable_prompt_caching: Whether prompt caching should be enabled.
|
||||
|
||||
Returns:
|
||||
Dictionary of parameters for invoking Anthropic's LLM API.
|
||||
"""
|
||||
raise NotImplementedError("Universal LLMContext is not yet supported for Anthropic.")
|
||||
messages = self._from_universal_context_messages(self._get_messages(context))
|
||||
return {
|
||||
"system": messages.system,
|
||||
"messages": (
|
||||
self._with_cache_control_markers(messages.messages)
|
||||
if enable_prompt_caching
|
||||
else messages.messages
|
||||
),
|
||||
# NOTE: LLMContext's tools are guaranteed to be a ToolsSchema (or NOT_GIVEN)
|
||||
"tools": self.from_standard_tools(context.tools) or [],
|
||||
}
|
||||
|
||||
def get_messages_for_logging(self, context) -> List[Dict[str, Any]]:
|
||||
def get_messages_for_logging(self, context: LLMContext) -> List[Dict[str, Any]]:
|
||||
"""Get messages from a universal LLM context in a format ready for logging about Anthropic.
|
||||
|
||||
Removes or truncates sensitive data like image content for safe logging.
|
||||
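Review note: `_from_universal_context_messages` in this file pulls a leading system message out of the list, relabels any later system messages as user messages, and merges consecutive messages that share a role. The sketch below restates that logic on plain dicts, as I read it from the diff. It is an illustration rather than the adapter's code: `split_system_and_merge` is a hypothetical name, and `None` stands in for Anthropic's `NOT_GIVEN` sentinel.

```python
import copy
from typing import Any, Dict, List, Optional, Tuple

Message = Dict[str, Any]


def split_system_and_merge(messages: List[Message]) -> Tuple[Optional[str], List[Message]]:
    """Extract a leading system message, then merge same-role neighbours."""
    msgs = copy.deepcopy(messages)
    system: Optional[str] = None  # assumption: None replaces NOT_GIVEN here

    if msgs and msgs[0]["role"] == "system":
        if len(msgs) == 1:
            # A lone system message is kept but relabelled as a user message.
            msgs[0]["role"] = "user"
        else:
            system = msgs.pop(0)["content"]

    # Any remaining system messages become user messages.
    for message in msgs:
        if message["role"] == "system":
            message["role"] = "user"

    # Merge consecutive messages with the same role into one content list.
    i = 0
    while i < len(msgs) - 1:
        current, following = msgs[i], msgs[i + 1]
        if current["role"] == following["role"]:
            for message in (current, following):
                if isinstance(message["content"], str):
                    message["content"] = [{"type": "text", "text": message["content"]}]
            current["content"].extend(following["content"])
            msgs.pop(i + 1)
        else:
            i += 1

    return system, msgs


system, merged = split_system_and_merge(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "system", "content": "Note"},
        {"role": "assistant", "content": "Hello"},
    ]
)
```

Content is only converted from a string to a list of text blocks when a merge actually happens, so messages that are never merged keep their original string form.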
@@ -56,7 +84,241 @@ class AnthropicLLMAdapter(BaseLLMAdapter[AnthropicLLMInvocationParams]):
|
||||
Returns:
|
||||
List of messages in a format ready for logging about Anthropic.
|
||||
"""
|
||||
raise NotImplementedError("Universal LLMContext is not yet supported for Anthropic.")
|
||||
# Get messages in Anthropic's format
|
||||
messages = self._from_universal_context_messages(self._get_messages(context)).messages
|
||||
|
||||
# Sanitize messages for logging
|
||||
messages_for_logging = []
|
||||
for message in messages:
|
||||
msg = copy.deepcopy(message)
|
||||
if "content" in msg:
|
||||
if isinstance(msg["content"], list):
|
||||
for item in msg["content"]:
|
||||
if item["type"] == "image":
|
||||
item["source"]["data"] = "..."
|
||||
messages_for_logging.append(msg)
|
||||
return messages_for_logging
|
||||
|
||||
def _get_messages(self, context: LLMContext) -> List[LLMContextMessage]:
|
||||
return context.get_messages("anthropic")
|
||||
|
||||
@dataclass
|
||||
class ConvertedMessages:
|
||||
"""Container for Anthropic-formatted messages converted from universal context."""
|
||||
|
||||
messages: List[MessageParam]
|
||||
system: str | NotGiven
|
||||
|
||||
def _from_universal_context_messages(
|
||||
self, universal_context_messages: List[LLMContextMessage]
|
||||
) -> ConvertedMessages:
|
||||
system = NOT_GIVEN
|
||||
messages = []
|
||||
|
||||
# first, map messages using self._from_universal_context_message(m)
|
||||
try:
|
||||
messages = [self._from_universal_context_message(m) for m in universal_context_messages]
|
||||
except Exception as e:
|
||||
logger.error(f"Error mapping messages: {e}")
|
||||
|
||||
# See if we should pull the system message out of our messages list.
|
||||
if messages and messages[0]["role"] == "system":
|
||||
if len(messages) == 1:
|
||||
# If we only have a system message in the list, all we can really do
|
||||
# without introducing too much magic is change the role to "user".
|
||||
messages[0]["role"] = "user"
|
||||
else:
|
||||
# If we have more than one message, we'll pull the system message out of the
|
||||
# list.
|
||||
system = messages[0]["content"]
|
||||
messages.pop(0)
|
||||
|
||||
# Convert any subsequent "system"-role messages to "user"-role
|
||||
# messages, as Anthropic doesn't support system input messages.
|
||||
for message in messages:
|
||||
if message["role"] == "system":
|
||||
message["role"] = "user"
|
||||
|
||||
# Merge consecutive messages with the same role.
|
||||
i = 0
|
||||
while i < len(messages) - 1:
|
||||
current_message = messages[i]
|
||||
next_message = messages[i + 1]
|
||||
if current_message["role"] == next_message["role"]:
|
||||
# Convert content to list of dictionaries if it's a string
|
||||
if isinstance(current_message["content"], str):
|
||||
current_message["content"] = [
|
||||
{"type": "text", "text": current_message["content"]}
|
||||
]
|
||||
if isinstance(next_message["content"], str):
|
||||
next_message["content"] = [{"type": "text", "text": next_message["content"]}]
|
||||
# Concatenate the content
|
||||
current_message["content"].extend(next_message["content"])
|
||||
# Remove the next message from the list
|
||||
messages.pop(i + 1)
|
||||
else:
|
||||
i += 1
|
||||
|
||||
# Avoid empty content in messages
|
||||
for message in messages:
|
||||
if isinstance(message["content"], str) and message["content"] == "":
|
||||
message["content"] = "(empty)"
|
||||
elif isinstance(message["content"], list) and len(message["content"]) == 0:
|
||||
message["content"] = [{"type": "text", "text": "(empty)"}]
|
||||
|
||||
return self.ConvertedMessages(messages=messages, system=system)
|
||||
|
||||
def _from_universal_context_message(self, message: LLMContextMessage) -> MessageParam:
|
||||
if isinstance(message, LLMSpecificMessage):
|
||||
return copy.deepcopy(message.message)
|
||||
return self._from_standard_message(message)
|
||||
|
||||
def _from_standard_message(self, message: LLMStandardMessage) -> MessageParam:
|
||||
"""Convert standard universal context message to Anthropic format.
|
||||
|
||||
Handles conversion of text content, tool calls, and tool results.
|
||||
Empty text content is converted to "(empty)".
|
||||
|
||||
Args:
|
||||
message: Message in standard universal context format.
|
||||
|
||||
Returns:
|
||||
Message in Anthropic format.
|
||||
|
||||
Examples:
|
||||
Input standard format::
|
||||
|
||||
{
|
||||
"role": "assistant",
|
||||
"tool_calls": [
|
||||
{
|
||||
"id": "123",
|
||||
"function": {"name": "search", "arguments": '{"q": "test"}'}
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
Output Anthropic format::
|
||||
|
||||
{
|
||||
"role": "assistant",
|
||||
"content": [
|
||||
{
|
||||
"type": "tool_use",
|
||||
"id": "123",
|
||||
"name": "search",
|
||||
"input": {"q": "test"}
|
||||
}
|
||||
]
|
||||
}
|
||||
"""
|
||||
message = copy.deepcopy(message)
|
||||
if message["role"] == "tool":
|
||||
return {
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "tool_result",
|
||||
"tool_use_id": message["tool_call_id"],
|
||||
"content": message["content"],
|
||||
},
|
||||
],
|
||||
}
|
||||
if message.get("tool_calls"):
|
||||
tc = message["tool_calls"]
|
||||
ret = {"role": "assistant", "content": []}
|
||||
for tool_call in tc:
|
||||
function = tool_call["function"]
|
||||
arguments = json.loads(function["arguments"])
|
||||
new_tool_use = {
|
||||
"type": "tool_use",
|
||||
"id": tool_call["id"],
|
||||
"name": function["name"],
|
||||
"input": arguments,
|
||||
}
|
||||
ret["content"].append(new_tool_use)
|
||||
return ret
|
||||
content = message.get("content")
|
||||
if isinstance(content, str):
|
||||
# fix empty text
|
||||
if content == "":
|
||||
content = "(empty)"
|
||||
elif isinstance(content, list):
|
||||
for item in content:
|
||||
# fix empty text
|
||||
if item["type"] == "text" and item["text"] == "":
|
||||
item["text"] = "(empty)"
|
||||
# handle image_url -> image conversion
|
||||
if item["type"] == "image_url":
|
||||
item["type"] = "image"
|
||||
item["source"] = {
|
||||
"type": "base64",
|
||||
"media_type": "image/jpeg",
|
||||
"data": item["image_url"]["url"].split(",")[1],
|
||||
}
|
||||
del item["image_url"]
|
||||
# In the case where there's a single image in the list (like what
|
||||
# would result from a UserImageRawFrame), ensure that the image
|
||||
# comes before text, as recommended by Anthropic docs
|
||||
# (https://docs.anthropic.com/en/docs/build-with-claude/vision#example-one-image)
|
||||
image_indices = [i for i, item in enumerate(content) if item["type"] == "image"]
|
||||
text_indices = [i for i, item in enumerate(content) if item["type"] == "text"]
|
||||
if len(image_indices) == 1 and text_indices:
|
||||
img_idx = image_indices[0]
|
||||
first_txt_idx = text_indices[0]
|
||||
if img_idx > first_txt_idx:
|
||||
# Move image before the first text
|
||||
image_item = content.pop(img_idx)
|
||||
content.insert(first_txt_idx, image_item)
|
||||
|
||||
return message
|
||||
|
||||
def _with_cache_control_markers(self, messages: List[MessageParam]) -> List[MessageParam]:
|
||||
"""Add cache control markers to messages for prompt caching.
|
||||
|
||||
Args:
|
||||
messages: List of messages in Anthropic format.
|
||||
|
||||
Returns:
|
||||
List of messages with cache control markers added.
|
||||
"""
|
||||
|
||||
def add_cache_control_marker(message: MessageParam):
|
||||
if isinstance(message["content"], str):
|
||||
message["content"] = [{"type": "text", "text": message["content"]}]
|
||||
message["content"][-1]["cache_control"] = {"type": "ephemeral"}
|
||||
|
||||
try:
|
||||
# Add cache control markers to the most recent two user messages.
|
||||
# - The marker at the most recent user message tells Anthropic to
|
||||
# cache the prompt up to that point.
|
||||
# - The marker at the second-most-recent user message tells Anthropic
|
||||
# to look up the cached prompt that goes up to that point (the
|
||||
# point that *was* the last user message the previous turn).
|
||||
# If we only added the marker to the last user message, we'd only
|
||||
# ever be adding to the cache, never looking up from it.
|
||||
# Why user messages? We're assuming that we're primarily running
|
||||
# inference as soon as user turns come in. In Anthropic, turns
|
||||
# strictly alternate between user and assistant.
|
||||
|
||||
messages_with_markers = copy.deepcopy(messages)
|
||||
|
||||
# Find the most recent two user messages
|
||||
user_message_indices = []
|
||||
for i in range(len(messages_with_markers) - 1, -1, -1):
|
||||
if messages_with_markers[i]["role"] == "user":
|
||||
user_message_indices.append(i)
|
||||
if len(user_message_indices) == 2:
|
||||
break
|
||||
|
||||
# Add cache control markers to the identified user messages
|
||||
for index in user_message_indices:
|
||||
add_cache_control_marker(messages_with_markers[index])
|
||||
|
||||
return messages_with_markers
|
||||
except Exception as e:
|
||||
logger.error(f"Error adding cache control marker: {e}")
|
||||
return messages  # messages_with_markers may be unbound or only partially marked here
|
||||
|
||||
@staticmethod
|
||||
def _to_anthropic_function_format(function: FunctionSchema) -> Dict[str, Any]:
|
||||
|
||||
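Review note: `_from_standard_message` moves a lone image ahead of the first text item, following the ordering the linked Anthropic vision docs recommend. A minimal sketch of that reordering on plain dicts follows. `move_single_image_first` is a hypothetical helper name, and unlike the adapter it returns a new list instead of editing the content in place.

```python
from typing import Any, Dict, List


def move_single_image_first(content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of content with a lone image placed before the first text item."""
    items = list(content)
    image_indices = [i for i, item in enumerate(items) if item["type"] == "image"]
    text_indices = [i for i, item in enumerate(items) if item["type"] == "text"]

    # Only reorder when there is exactly one image and it follows some text.
    if len(image_indices) == 1 and text_indices and image_indices[0] > text_indices[0]:
        # The image sits after the first text item, so popping it does not
        # shift the index of that text item.
        items.insert(text_indices[0], items.pop(image_indices[0]))
    return items


reordered = move_single_image_first(
    [
        {"type": "text", "text": "What is this?"},
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "..."}},
    ]
)
```

Content with several images is left untouched, which matches the single-image condition in the diff.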
@@ -67,7 +67,7 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
|
||||
return {
|
||||
"system_instruction": messages.system_instruction,
|
||||
"messages": messages.messages,
|
||||
# NOTE; LLMContext's tools are guaranteed to be a ToolsSchema (or NOT_GIVEN)
|
||||
# NOTE: LLMContext's tools are guaranteed to be a ToolsSchema (or NOT_GIVEN)
|
||||
"tools": self.from_standard_tools(context.tools),
|
||||
}
|
||||
|
||||
@@ -192,14 +192,14 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
|
||||
def _from_standard_message(
|
||||
self, message: LLMStandardMessage, already_have_system_instruction: bool
|
||||
) -> Content | str:
|
||||
"""Convert universal context message to Google Content object.
|
||||
"""Convert standard universal context message to Google Content object.
|
||||
|
||||
Handles conversion of text, images, and function calls to Google's
|
||||
format.
|
||||
System instructions are returned as a plain string.
|
||||
|
||||
Args:
|
||||
message: Message in universal context format.
|
||||
message: Message in standard universal context format.
|
||||
already_have_system_instruction: Whether we already have a system instruction
|
||||
|
||||
Returns:
|
||||
@@ -308,5 +308,4 @@ class GeminiLLMAdapter(BaseLLMAdapter[GeminiLLMInvocationParams]):
|
||||
audio_bytes = base64.b64decode(input_audio["data"])
|
||||
parts.append(Part(inline_data=Blob(mime_type="audio/wav", data=audio_bytes)))
|
||||
|
||||
message = Content(role=role, parts=parts)
|
||||
return message
|
||||
return Content(role=role, parts=parts)
|
||||
|
||||
@@ -33,6 +33,10 @@ class NoisereduceFilter(BaseAudioFilter):
|
||||
Applies spectral gating noise reduction algorithms to suppress background
|
||||
noise in audio streams. Uses the noisereduce library's default noise
|
||||
reduction parameters.
|
||||
|
||||
.. deprecated:: 0.0.85
|
||||
`NoisereduceFilter` is deprecated and will be removed in a future version.
|
||||
We recommend using other real-time audio filters like `KrispFilter` or `AICFilter`.
|
||||
"""
|
||||
|
||||
def __init__(self) -> None:
|
||||
@@ -40,6 +44,17 @@ class NoisereduceFilter(BaseAudioFilter):
|
||||
self._filtering = True
|
||||
self._sample_rate = 0
|
||||
|
||||
import warnings
|
||||
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter("always")
|
||||
warnings.warn(
|
||||
"`NoisereduceFilter` is deprecated. "
|
||||
"Use other real-time audio filters like `KrispFilter` or `AICFilter`.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
async def start(self, sample_rate: int):
|
||||
"""Initialize the filter with the transport's sample rate.
|
||||
|
||||
|
||||
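Review note: the deprecation warning above is raised inside `warnings.catch_warnings()` with `simplefilter("always")`, so the forced filter is undone as soon as the block exits. A small sketch of the same pattern, using a hypothetical `LegacyFilter` class in place of `NoisereduceFilter`:

```python
import warnings


class LegacyFilter:
    """Hypothetical class standing in for a deprecated component."""

    def __init__(self) -> None:
        # catch_warnings() restores the previous filter state on exit, so
        # forcing "always" here does not leak into the rest of the program.
        with warnings.catch_warnings():
            warnings.simplefilter("always")
            warnings.warn(
                "`LegacyFilter` is deprecated.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )


# Record warnings so the emitted DeprecationWarning can be inspected.
with warnings.catch_warnings(record=True) as caught:
    LegacyFilter()
```

Without the `"always"` filter, Python's default filters can hide a `DeprecationWarning` that is attributed to code outside `__main__`.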
@@ -1253,23 +1253,6 @@ class UserImageRawFrame(InputImageRawFrame):
|
||||
return f"{self.name}(pts: {pts}, user: {self.user_id}, source: {self.transport_source}, size: {self.size}, format: {self.format}, request: {self.request})"
|
||||
|
||||
|
||||
@dataclass
|
||||
class VisionImageRawFrame(InputImageRawFrame):
|
||||
"""Image frame for vision/image analysis with associated text prompt.
|
||||
|
||||
An image with an associated text to ask for a description of it.
|
||||
|
||||
Parameters:
|
||||
text: Optional text prompt describing what to analyze in the image.
|
||||
"""
|
||||
|
||||
text: Optional[str] = None
|
||||
|
||||
def __str__(self):
|
||||
pts = format_pts(self.pts)
|
||||
return f"{self.name}(pts: {pts}, text: [{self.text}], size: {self.size}, format: {self.format})"
|
||||
|
||||
|
||||
@dataclass
|
||||
class InputDTMFFrame(DTMFFrame, SystemFrame):
|
||||
"""DTMF keypress input frame from transport."""
|
||||
|
||||
@@ -30,25 +30,17 @@ class LLMSwitcher(ServiceSwitcher[StrategyType]):
|
||||
"""Get the currently active LLM, if any."""
|
||||
return self.strategy.active_service
|
||||
|
||||
async def run_inference(
|
||||
self, context: LLMContext, system_instruction: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
async def run_inference(self, context: LLMContext) -> Optional[str]:
|
||||
"""Run a one-shot, out-of-band (i.e. out-of-pipeline) inference with the given LLM context, using the currently active LLM.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing conversation history.
|
||||
system_instruction: Optional system instruction to guide the LLM's
|
||||
behavior. You could also (again, optionally) provide a system
|
||||
instruction directly in the context. If both are provided, the
|
||||
one in the context takes precedence.
|
||||
|
||||
Returns:
|
||||
The LLM's response as a string, or None if no response is generated.
|
||||
"""
|
||||
if self.active_llm:
|
||||
return await self.active_llm.run_inference(
|
||||
context=context, system_instruction=system_instruction
|
||||
)
|
||||
return await self.active_llm.run_inference(context=context)
|
||||
return None
|
||||
|
||||
def register_function(
|
||||
|
||||
@@ -10,13 +10,22 @@ This module provides frame aggregation functionality to combine text and image
|
||||
frames into vision frames for multimodal processing.
|
||||
"""
|
||||
|
||||
from pipecat.frames.frames import Frame, InputImageRawFrame, TextFrame, VisionImageRawFrame
|
||||
from pipecat.frames.frames import Frame, InputImageRawFrame, TextFrame
|
||||
from pipecat.processors.aggregators.openai_llm_context import (
|
||||
OpenAILLMContext,
|
||||
OpenAILLMContextFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
|
||||
|
||||
|
||||
class VisionImageFrameAggregator(FrameProcessor):
|
||||
"""Aggregates consecutive text and image frames into vision frames.
|
||||
|
||||
.. deprecated:: 0.0.85
|
||||
VisionImageRawFrame has been removed in favor of context frames
|
||||
(LLMContextFrame or OpenAILLMContextFrame), so this aggregator is not
|
||||
needed anymore. See the 12* examples for the new recommended pattern.
|
||||
|
||||
This aggregator waits for a consecutive TextFrame and an InputImageRawFrame.
|
||||
After the InputImageRawFrame arrives it will output a VisionImageRawFrame
|
||||
combining both the text and image data for multimodal processing.
|
||||
@@ -28,6 +37,17 @@ class VisionImageFrameAggregator(FrameProcessor):
|
||||
The aggregator starts with no cached text, waiting for the first
|
||||
TextFrame to arrive before it can create vision frames.
|
||||
"""
|
||||
import warnings
|
||||
|
||||
warnings.warn(
|
||||
"VisionImageFrameAggregator is deprecated. "
|
||||
"VisionImageRawFrame has been removed in favor of context frames "
|
||||
"(LLMContextFrame or OpenAILLMContextFrame), so this aggregator is "
|
||||
"not needed anymore. See the 12* examples for the new recommended "
|
||||
"pattern.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
super().__init__()
|
||||
self._describe_text = None
|
||||
|
||||
@@ -47,12 +67,14 @@ class VisionImageFrameAggregator(FrameProcessor):
|
||||
self._describe_text = frame.text
|
||||
elif isinstance(frame, InputImageRawFrame):
|
||||
if self._describe_text:
|
||||
frame = VisionImageRawFrame(
|
||||
context = OpenAILLMContext()
|
||||
context.add_image_frame_message(
|
||||
text=self._describe_text,
|
||||
image=frame.image,
|
||||
size=frame.size,
|
||||
format=frame.format,
|
||||
)
|
||||
frame = OpenAILLMContextFrame(context)
|
||||
await self.push_frame(frame)
|
||||
self._describe_text = None
|
||||
else:
|
||||
|
||||
@@ -24,7 +24,10 @@ from loguru import logger
|
||||
from PIL import Image
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
from pipecat.adapters.services.anthropic_adapter import AnthropicLLMAdapter
|
||||
from pipecat.adapters.services.anthropic_adapter import (
|
||||
AnthropicLLMAdapter,
|
||||
AnthropicLLMInvocationParams,
|
||||
)
|
||||
from pipecat.frames.frames import (
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
@@ -39,7 +42,6 @@ from pipecat.frames.frames import (
|
||||
LLMTextFrame,
|
||||
LLMUpdateSettingsFrame,
|
||||
UserImageRawFrame,
|
||||
VisionImageRawFrame,
|
||||
)
|
||||
from pipecat.metrics.metrics import LLMTokenUsage
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
@@ -112,7 +114,12 @@ class AnthropicLLMService(LLMService):
|
||||
"""Input parameters for Anthropic model inference.
|
||||
|
||||
Parameters:
|
||||
enable_prompt_caching_beta: Whether to enable beta prompt caching feature.
|
||||
enable_prompt_caching: Whether to enable the prompt caching feature.
|
||||
enable_prompt_caching_beta (deprecated): Whether to enable the beta prompt caching feature.
|
||||
|
||||
.. deprecated:: 0.0.84
|
||||
Use the `enable_prompt_caching` parameter instead.
|
||||
|
||||
max_tokens: Maximum tokens to generate. Must be at least 1.
|
||||
temperature: Sampling temperature between 0.0 and 1.0.
|
||||
top_k: Top-k sampling parameter.
|
||||
@@ -120,13 +127,26 @@ class AnthropicLLMService(LLMService):
|
||||
extra: Additional parameters to pass to the API.
|
||||
"""
|
||||
|
||||
enable_prompt_caching_beta: Optional[bool] = False
|
||||
enable_prompt_caching: Optional[bool] = None
|
||||
enable_prompt_caching_beta: Optional[bool] = None
|
||||
max_tokens: Optional[int] = Field(default_factory=lambda: 4096, ge=1)
|
||||
temperature: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=1.0)
|
||||
top_k: Optional[int] = Field(default_factory=lambda: NOT_GIVEN, ge=0)
|
||||
top_p: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=1.0)
|
||||
extra: Optional[Dict[str, Any]] = Field(default_factory=dict)
|
||||
|
||||
def model_post_init(self, __context):
|
||||
"""Post-initialization to handle deprecated parameters."""
|
||||
if self.enable_prompt_caching_beta is not None:
|
||||
import warnings
|
||||
|
||||
warnings.simplefilter("always")
|
||||
warnings.warn(
|
||||
"enable_prompt_caching_beta is deprecated. Use enable_prompt_caching instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
@@ -159,7 +179,15 @@ class AnthropicLLMService(LLMService):
|
||||
self._retry_on_timeout = retry_on_timeout
|
||||
self._settings = {
|
||||
"max_tokens": params.max_tokens,
|
||||
"enable_prompt_caching_beta": params.enable_prompt_caching_beta or False,
|
||||
"enable_prompt_caching": (
|
||||
params.enable_prompt_caching
|
||||
if params.enable_prompt_caching is not None
|
||||
else (
|
||||
params.enable_prompt_caching_beta
|
||||
if params.enable_prompt_caching_beta is not None
|
||||
else False
|
||||
)
|
||||
),
|
||||
"temperature": params.temperature,
|
||||
"top_k": params.top_k,
|
||||
"top_p": params.top_p,
|
||||
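Review note: the nested conditional that fills `enable_prompt_caching` in `_settings` prefers the new parameter, falls back to the deprecated `enable_prompt_caching_beta`, and otherwise defaults to `False`. The same precedence as a flat function, where `resolve_prompt_caching` is a hypothetical name used only for illustration:

```python
from typing import Optional


def resolve_prompt_caching(new_flag: Optional[bool], legacy_flag: Optional[bool]) -> bool:
    """Prefer the new flag, fall back to the legacy one, default to False."""
    if new_flag is not None:
        return new_flag
    if legacy_flag is not None:
        return legacy_flag
    return False
```

An explicit `False` on the new parameter wins over `True` on the deprecated one, which is why both fields default to `None` instead of `False`.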
@@ -199,34 +227,28 @@ class AnthropicLLMService(LLMService):
|
||||
response = await api_call(**params)
|
||||
return response
|
||||
|
||||
async def run_inference(
|
||||
self, context: LLMContext | OpenAILLMContext, system_instruction: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
async def run_inference(self, context: LLMContext | OpenAILLMContext) -> Optional[str]:
|
||||
"""Run a one-shot, out-of-band (i.e. out-of-pipeline) inference with the given LLM context.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing conversation history.
|
||||
system_instruction: Optional system instruction to guide the LLM's
|
||||
behavior. You could also (again, optionally) provide a system
|
||||
instruction directly in the context. If both are provided, the
|
||||
one in the context takes precedence.
|
||||
|
||||
Returns:
|
||||
The LLM's response as a string, or None if no response is generated.
|
||||
"""
|
||||
messages = []
|
||||
system = []
|
||||
system = NOT_GIVEN
|
||||
if isinstance(context, LLMContext):
|
||||
# Future code will be something like this:
|
||||
# adapter = self.get_llm_adapter()
|
||||
# params: AnthropicLLMInvocationParams = adapter.get_llm_invocation_params(context)
|
||||
# messages = params["messages"]
|
||||
# system = params["system_instruction"]
|
||||
raise NotImplementedError("Universal LLMContext is not yet supported for Anthropic.")
|
||||
adapter: AnthropicLLMAdapter = self.get_llm_adapter()
|
||||
params = adapter.get_llm_invocation_params(
|
||||
context, enable_prompt_caching=self._settings["enable_prompt_caching"]
|
||||
)
|
||||
messages = params["messages"]
|
||||
system = params["system"]
|
||||
else:
|
||||
context = AnthropicLLMContext.upgrade_to_anthropic(context)
|
||||
messages = context.messages
|
||||
system = getattr(context, "system", None) or system_instruction
|
||||
system = getattr(context, "system", NOT_GIVEN)
|
||||
|
||||
# LLM completion
|
||||
response = await self._client.messages.create(
|
||||
@@ -239,15 +261,6 @@ class AnthropicLLMService(LLMService):
|
||||
|
||||
return response.content[0].text
|
||||
|
||||
@property
|
||||
def enable_prompt_caching_beta(self) -> bool:
|
||||
"""Check if prompt caching beta feature is enabled.
|
||||
|
||||
Returns:
|
||||
True if prompt caching is enabled.
|
||||
"""
|
||||
return self._enable_prompt_caching_beta
|
||||
|
||||
def create_context_aggregator(
|
||||
self,
|
||||
context: OpenAILLMContext,
|
||||
@@ -277,8 +290,31 @@ class AnthropicLLMService(LLMService):
|
||||
assistant = AnthropicAssistantContextAggregator(context, params=assistant_params)
|
||||
return AnthropicContextAggregatorPair(_user=user, _assistant=assistant)
|
||||
|
||||
def _get_llm_invocation_params(
|
||||
self, context: OpenAILLMContext | LLMContext
|
||||
) -> AnthropicLLMInvocationParams:
|
||||
# Universal LLMContext
|
||||
if isinstance(context, LLMContext):
|
||||
adapter: AnthropicLLMAdapter = self.get_llm_adapter()
|
||||
params = adapter.get_llm_invocation_params(
|
||||
context, enable_prompt_caching=self._settings["enable_prompt_caching"]
|
||||
)
|
||||
return params
|
||||
|
||||
# Anthropic-specific context
|
||||
messages = (
|
||||
context.get_messages_with_cache_control_markers()
|
||||
if self._settings["enable_prompt_caching"]
|
||||
else context.messages
|
||||
)
|
||||
return AnthropicLLMInvocationParams(
|
||||
system=context.system,
|
||||
messages=messages,
|
||||
tools=context.tools or [],
|
||||
)
|
||||
|
||||
@traced_llm
|
||||
async def _process_context(self, context: OpenAILLMContext):
|
||||
async def _process_context(self, context: OpenAILLMContext | LLMContext):
|
||||
# Usage tracking. We track the usage reported by Anthropic in prompt_tokens and
|
||||
# completion_tokens. We also estimate the completion tokens from output text
|
||||
# and use that estimate if we are interrupted, because we almost certainly won't
|
||||
@@ -294,24 +330,22 @@ class AnthropicLLMService(LLMService):
|
||||
await self.push_frame(LLMFullResponseStartFrame())
|
||||
await self.start_processing_metrics()
|
||||
|
||||
params_from_context = self._get_llm_invocation_params(context)
|
||||
|
||||
if isinstance(context, LLMContext):
|
||||
adapter = self.get_llm_adapter()
|
||||
context_type_for_logging = "universal"
|
||||
messages_for_logging = adapter.get_messages_for_logging(context)
|
||||
else:
|
||||
context_type_for_logging = "LLM-specific"
|
||||
messages_for_logging = context.get_messages_for_logging()
|
||||
logger.debug(
|
||||
f"{self}: Generating chat [{context.system}] | {context.get_messages_for_logging()}"
|
||||
f"{self}: Generating chat from {context_type_for_logging} context [{params_from_context['system']}] | {messages_for_logging}"
|
||||
)
|
||||
|
||||
messages = context.messages
|
||||
if self._settings["enable_prompt_caching_beta"]:
|
||||
messages = context.get_messages_with_cache_control_markers()
|
||||
|
||||
api_call = self._client.messages.create
|
||||
if self._settings["enable_prompt_caching_beta"]:
|
||||
api_call = self._client.beta.prompt_caching.messages.create
|
||||
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
params = {
|
||||
"tools": context.tools or [],
|
||||
"system": context.system,
|
||||
"messages": messages,
|
||||
"model": self.model_name,
|
||||
"max_tokens": self._settings["max_tokens"],
|
||||
"stream": True,
|
||||
@@ -320,9 +354,12 @@ class AnthropicLLMService(LLMService):
|
||||
"top_p": self._settings["top_p"],
|
||||
}
|
||||
|
||||
# Messages, system, tools
|
||||
params.update(params_from_context)
|
||||
|
||||
params.update(self._settings["extra"])
|
||||
|
||||
response = await self._create_message_stream(api_call, params)
|
||||
response = await self._create_message_stream(self._client.messages.create, params)
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
|
||||
@@ -405,7 +442,10 @@ class AnthropicLLMService(LLMService):
|
||||
prompt_tokens + cache_creation_input_tokens + cache_read_input_tokens
|
||||
)
|
||||
if total_input_tokens >= 1024:
|
||||
context.turns_above_cache_threshold += 1
|
||||
if hasattr(
|
||||
context, "turns_above_cache_threshold"
|
||||
): # LLMContext doesn't have this attribute
|
||||
context.turns_above_cache_threshold += 1
|
||||
|
||||
await self.run_function_calls(function_calls)
|
||||
|
||||
@@ -451,20 +491,14 @@ class AnthropicLLMService(LLMService):
|
||||
if isinstance(frame, OpenAILLMContextFrame):
|
||||
context: "AnthropicLLMContext" = AnthropicLLMContext.upgrade_to_anthropic(frame.context)
|
||||
elif isinstance(frame, LLMContextFrame):
|
||||
raise NotImplementedError("Universal LLMContext is not yet supported for Anthropic.")
|
||||
context = frame.context
|
||||
elif isinstance(frame, LLMMessagesFrame):
|
||||
context = AnthropicLLMContext.from_messages(frame.messages)
|
||||
elif isinstance(frame, VisionImageRawFrame):
|
||||
# This is only useful in very simple pipelines because it creates
|
||||
# a new context. Generally we want a context manager to catch
|
||||
# UserImageRawFrames coming through the pipeline and add them
|
||||
# to the context.
|
||||
context = AnthropicLLMContext.from_image_frame(frame)
|
||||
elif isinstance(frame, LLMUpdateSettingsFrame):
|
||||
await self._update_settings(frame.settings)
|
||||
elif isinstance(frame, LLMEnablePromptCachingFrame):
|
||||
logger.debug(f"Setting enable prompt caching to: [{frame.enable}]")
|
||||
self._settings["enable_prompt_caching_beta"] = frame.enable
|
||||
self._settings["enable_prompt_caching"] = frame.enable
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
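Review note: when `enable_prompt_caching` is on, the adapter marks the two most recent user messages with `cache_control`. Per the comments in `_with_cache_control_markers`, the newest marker writes the cache and the previous one lets the request read what the last turn wrote. A minimal sketch on plain dicts follows. `with_cache_markers` is a hypothetical name, and the sketch assumes every message has non-empty content.

```python
import copy
from typing import Any, Dict, List

Message = Dict[str, Any]


def with_cache_markers(messages: List[Message]) -> List[Message]:
    """Return a copy with cache_control on the two most recent user messages."""
    marked = copy.deepcopy(messages)
    remaining = 2
    for message in reversed(marked):
        if remaining == 0:
            break
        if message["role"] != "user":
            continue
        # String content is wrapped so the marker can sit on a content block.
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
        message["content"][-1]["cache_control"] = {"type": "ephemeral"}
        remaining -= 1
    return marked


history = [
    {"role": "user", "content": "a"},
    {"role": "assistant", "content": "b"},
    {"role": "user", "content": "c"},
    {"role": "assistant", "content": "d"},
    {"role": "user", "content": "e"},
]
marked = with_cache_markers(history)
```

Working on a deep copy keeps the markers out of the stored conversation, so they are recomputed for each request as the conversation grows.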
@@ -585,22 +619,6 @@ class AnthropicLLMContext(OpenAILLMContext):
|
||||
self._restructure_from_openai_messages()
|
||||
return self
|
||||
|
||||
@classmethod
|
||||
def from_image_frame(cls, frame: VisionImageRawFrame) -> "AnthropicLLMContext":
|
||||
"""Create context from a vision image frame.
|
||||
|
||||
Args:
|
||||
frame: The vision image frame to process.
|
||||
|
||||
Returns:
|
||||
New Anthropic context with the image message.
|
||||
"""
|
||||
context = cls()
|
||||
context.add_image_frame_message(
|
||||
format=frame.format, size=frame.size, image=frame.image, text=frame.text
|
||||
)
|
||||
return context
|
||||
|
||||
def set_messages(self, messages: List):
|
||||
"""Set the messages list and reset cache tracking.
|
||||
|
||||
|
||||
@@ -52,6 +52,10 @@ def language_to_async_language(language: Language) -> Optional[str]:
|
||||
"""
|
||||
BASE_LANGUAGES = {
|
||||
Language.EN: "en",
|
||||
Language.FR: "fr",
|
||||
Language.ES: "es",
|
||||
Language.DE: "de",
|
||||
Language.IT: "it",
|
||||
}
|
||||
|
||||
result = BASE_LANGUAGES.get(language)
|
||||
|
||||
@@ -39,7 +39,6 @@ from pipecat.frames.frames import (
|
||||
LLMTextFrame,
|
||||
LLMUpdateSettingsFrame,
|
||||
UserImageRawFrame,
|
||||
VisionImageRawFrame,
|
||||
)
|
||||
from pipecat.metrics.metrics import LLMTokenUsage
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
@@ -180,22 +179,6 @@ class AWSBedrockLLMContext(OpenAILLMContext):
|
||||
self._restructure_from_openai_messages()
|
||||
return self
|
||||
|
||||
@classmethod
|
||||
def from_image_frame(cls, frame: VisionImageRawFrame) -> "AWSBedrockLLMContext":
|
||||
"""Create AWS Bedrock context from vision image frame.
|
||||
|
||||
Args:
|
||||
frame: The vision image frame to convert.
|
||||
|
||||
Returns:
|
||||
New AWS Bedrock LLM context instance.
|
||||
"""
|
||||
context = cls()
|
||||
context.add_image_frame_message(
|
||||
format=frame.format, size=frame.size, image=frame.image, text=frame.text
|
||||
)
|
||||
return context
|
||||
|
||||
def set_messages(self, messages: List):
|
||||
"""Set the messages list and restructure for Bedrock format.
|
||||
|
||||
@@ -399,9 +382,33 @@ class AWSBedrockLLMContext(OpenAILLMContext):
|
||||
elif isinstance(content, list):
|
||||
new_content = []
|
||||
for item in content:
|
||||
# fix empty text
|
||||
if item.get("type", "") == "text":
|
||||
text_content = item["text"] if item["text"] != "" else "(empty)"
|
||||
new_content.append({"text": text_content})
|
||||
# handle image_url -> image conversion
|
||||
if item["type"] == "image_url":
|
||||
new_item = {
|
||||
"image": {
|
||||
"format": "jpeg",
|
||||
"source": {
|
||||
"bytes": base64.b64decode(item["image_url"]["url"].split(",")[1])
|
||||
},
|
||||
}
|
||||
}
|
||||
new_content.append(new_item)
|
||||
# In the case where there's a single image in the list (like what
|
||||
# would result from a UserImageRawFrame), ensure that the image
|
||||
# comes before text
|
||||
image_indices = [i for i, item in enumerate(new_content) if "image" in item]
|
||||
text_indices = [i for i, item in enumerate(new_content) if "text" in item]
|
||||
if len(image_indices) == 1 and text_indices:
|
||||
img_idx = image_indices[0]
|
||||
first_txt_idx = text_indices[0]
|
||||
if img_idx > first_txt_idx:
|
||||
# Move image before the first text
|
||||
image_item = new_content.pop(img_idx)
|
||||
new_content.insert(first_txt_idx, image_item)
|
||||
return {"role": message["role"], "content": new_content}
|
||||
|
||||
return message
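The reordering step in the hunk above (a lone image is moved ahead of the first text item) can be sketched in isolation. This is a minimal illustration and not the Bedrock context code itself: the helper name `move_single_image_first` and the sample payload are hypothetical, and the item shapes are assumed to match the `{"image": ...}` / `{"text": ...}` dictionaries shown in the diff.

```python
from typing import Any, Dict, List


def move_single_image_first(content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `content` with a lone image placed before the first text item.

    Hypothetical helper mirroring the logic in the hunk above; lists with zero
    or several images are returned in their original order.
    """
    items = list(content)
    image_indices = [i for i, item in enumerate(items) if "image" in item]
    text_indices = [i for i, item in enumerate(items) if "text" in item]
    if len(image_indices) == 1 and text_indices:
        img_idx, first_txt_idx = image_indices[0], text_indices[0]
        if img_idx > first_txt_idx:
            # pop() runs before insert(), and img_idx > first_txt_idx,
            # so the target index is still valid after the pop.
            items.insert(first_txt_idx, items.pop(img_idx))
    return items


# Sample payload (made up for illustration).
example = [
    {"text": "What is this?"},
    {"image": {"format": "jpeg", "source": {"bytes": b"..."}}},
]
reordered = move_single_image_first(example)
```

Working on a copy keeps the caller's list untouched, which is a design choice of this sketch rather than something the diff requires.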
|
||||
@@ -569,7 +576,7 @@ class AWSBedrockLLMContext(OpenAILLMContext):
|
||||
if isinstance(msg["content"], list):
|
||||
for item in msg["content"]:
|
||||
if item.get("image"):
|
||||
item["source"]["bytes"] = "..."
|
||||
item["image"]["source"]["bytes"] = "..."
|
||||
msgs.append(msg)
|
||||
return msgs
|
||||
|
||||
@@ -792,17 +799,11 @@ class AWSBedrockLLMService(LLMService):
|
||||
"""
|
||||
return True
|
||||
|
||||
async def run_inference(
|
||||
self, context: LLMContext | OpenAILLMContext, system_instruction: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
async def run_inference(self, context: LLMContext | OpenAILLMContext) -> Optional[str]:
|
||||
"""Run a one-shot, out-of-band (i.e. out-of-pipeline) inference with the given LLM context.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing conversation history.
|
||||
system_instruction: Optional system instruction to guide the LLM's
|
||||
behavior. You could also (again, optionally) provide a system
|
||||
instruction directly in the context. If both are provided, the
|
||||
one in the context takes precedence.
|
||||
|
||||
Returns:
|
||||
The LLM's response as a string, or None if no response is generated.
|
||||
@@ -815,14 +816,14 @@ class AWSBedrockLLMService(LLMService):
|
||||
# adapter = self.get_llm_adapter()
|
||||
# params: AWSBedrockLLMInvocationParams = adapter.get_llm_invocation_params(context)
|
||||
# messages = params["messages"]
|
||||
# system = params["system_instruction"]
|
||||
# system = params["system_instruction"] # [{"text": "system message"}]
|
||||
raise NotImplementedError(
|
||||
"Universal LLMContext is not yet supported for AWS Bedrock."
|
||||
)
|
||||
else:
|
||||
context = AWSBedrockLLMContext.upgrade_to_bedrock(context)
|
||||
messages = context.messages
|
||||
system = getattr(context, "system", None) or system_instruction
|
||||
system = getattr(context, "system", None) # [{"text": "system message"}]
|
||||
|
||||
# Determine if we're using Claude or Nova based on model ID
|
||||
model_id = self.model_name
|
||||
@@ -839,7 +840,7 @@ class AWSBedrockLLMService(LLMService):
|
||||
}
|
||||
|
||||
if system:
|
||||
request_params["system"] = [{"text": system}]
|
||||
request_params["system"] = system
|
||||
|
||||
async with self._aws_session.client(
|
||||
service_name="bedrock-runtime", **self._aws_params
|
||||
@@ -880,7 +881,7 @@ class AWSBedrockLLMService(LLMService):
|
||||
if self._retry_on_timeout:
|
||||
try:
|
||||
response = await asyncio.wait_for(
|
||||
await client.converse_stream(**request_params), timeout=self._retry_timeout_secs
|
||||
client.converse_stream(**request_params), timeout=self._retry_timeout_secs
|
||||
)
|
||||
return response
|
||||
except (ReadTimeoutError, asyncio.TimeoutError) as e:
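The change in this hunk drops the inner `await` so that `asyncio.wait_for` receives the awaitable itself rather than an already-awaited result. A small sketch of the pattern, assuming a stand-in coroutine `slow_call` in place of `client.converse_stream` (both `slow_call` and `call_with_timeout` are hypothetical names):

```python
import asyncio


async def slow_call(delay: float) -> str:
    """Stand-in for a network call; sleeps, then returns."""
    await asyncio.sleep(delay)
    return "done"


async def call_with_timeout(delay: float, timeout: float) -> str:
    try:
        # Pass the un-awaited coroutine: wait_for drives it and can cancel
        # it once the timeout expires. Awaiting it first would run the call
        # to completion before wait_for ever saw it.
        return await asyncio.wait_for(slow_call(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return "timed out"


fast = asyncio.run(call_with_timeout(0.0, 1.0))
slow = asyncio.run(call_with_timeout(0.5, 0.01))
```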
|
||||
@@ -973,7 +974,9 @@ class AWSBedrockLLMService(LLMService):
|
||||
}
|
||||
|
||||
# Add system message
|
||||
request_params["system"] = context.system
|
||||
system = getattr(context, "system", None)
|
||||
if system:
|
||||
request_params["system"] = system
|
||||
|
||||
# Check if messages contain tool use or tool result content blocks
|
||||
has_tool_content = False
|
||||
@@ -1015,7 +1018,10 @@ class AWSBedrockLLMService(LLMService):
|
||||
if self._settings["latency"] in ["standard", "optimized"]:
|
||||
request_params["performanceConfig"] = {"latency": self._settings["latency"]}
|
||||
|
||||
logger.debug(f"Calling AWS Bedrock model with: {request_params}")
|
||||
# Log request params with messages redacted for logging
|
||||
log_params = dict(request_params)
|
||||
log_params["messages"] = context.get_messages_for_logging()
|
||||
logger.debug(f"Calling AWS Bedrock model with: {log_params}")
|
||||
|
||||
async with self._aws_session.client(
|
||||
service_name="bedrock-runtime", **self._aws_params
|
||||
@@ -1126,12 +1132,6 @@ class AWSBedrockLLMService(LLMService):
|
||||
raise NotImplementedError("Universal LLMContext is not yet supported for AWS Bedrock.")
|
||||
elif isinstance(frame, LLMMessagesFrame):
|
||||
context = AWSBedrockLLMContext.from_messages(frame.messages)
|
||||
elif isinstance(frame, VisionImageRawFrame):
|
||||
# This is only useful in very simple pipelines because it creates
|
||||
# a new context. Generally we want a context manager to catch
|
||||
# UserImageRawFrames coming through the pipeline and add them
|
||||
# to the context.
|
||||
context = AWSBedrockLLMContext.from_image_frame(frame)
|
||||
elif isinstance(frame, LLMUpdateSettingsFrame):
|
||||
await self._update_settings(frame.settings)
|
||||
else:
|
||||
|
||||
@@ -33,6 +33,7 @@ from pipecat.frames.frames import (
|
||||
InputAudioRawFrame,
|
||||
InputImageRawFrame,
|
||||
InputTextRawFrame,
|
||||
LLMContextFrame,
|
||||
LLMFullResponseEndFrame,
|
||||
LLMFullResponseStartFrame,
|
||||
LLMMessagesAppendFrame,
|
||||
@@ -738,6 +739,10 @@ class GeminiMultimodalLiveLLMService(LLMService):
|
||||
# Support just one tool call per context frame for now
|
||||
tool_result_message = context.messages[-1]
|
||||
await self._tool_result(tool_result_message)
|
||||
elif isinstance(frame, LLMContextFrame):
|
||||
raise NotImplementedError(
|
||||
"Universal LLMContext is not yet supported for Gemini Multimodal Live."
|
||||
)
|
||||
elif isinstance(frame, InputTextRawFrame):
|
||||
await self._send_user_text(frame.text)
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
@@ -36,7 +36,6 @@ from pipecat.frames.frames import (
|
||||
LLMTextFrame,
|
||||
LLMUpdateSettingsFrame,
|
||||
UserImageRawFrame,
|
||||
VisionImageRawFrame,
|
||||
)
|
||||
from pipecat.metrics.metrics import LLMTokenUsage
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
@@ -733,17 +732,11 @@ class GoogleLLMService(LLMService):
|
||||
def _create_client(self, api_key: str, http_options: Optional[HttpOptions] = None):
|
||||
self._client = genai.Client(api_key=api_key, http_options=http_options)
|
||||
|
||||
async def run_inference(
|
||||
self, context: LLMContext | OpenAILLMContext, system_instruction: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
async def run_inference(self, context: LLMContext | OpenAILLMContext) -> Optional[str]:
|
||||
"""Run a one-shot, out-of-band (i.e. out-of-pipeline) inference with the given LLM context.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing conversation history.
|
||||
system_instruction: Optional system instruction to guide the LLM's
|
||||
behavior. You could also (again, optionally) provide a system
|
||||
instruction directly in the context. If both are provided, the
|
||||
one in the context takes precedence.
|
||||
|
||||
Returns:
|
||||
The LLM's response as a string, or None if no response is generated.
|
||||
@@ -758,7 +751,7 @@ class GoogleLLMService(LLMService):
|
||||
else:
|
||||
context = GoogleLLMContext.upgrade_to_google(context)
|
||||
messages = context.messages
|
||||
system = getattr(context, "system_message", None) or system_instruction
|
||||
system = getattr(context, "system_message", None)
|
||||
|
||||
generation_config = GenerateContentConfig(system_instruction=system)
|
||||
|
||||
@@ -858,8 +851,7 @@ class GoogleLLMService(LLMService):
|
||||
self, context: OpenAILLMContext
|
||||
) -> AsyncIterator[GenerateContentResponse]:
|
||||
logger.debug(
|
||||
# f"{self}: Generating chat [{self._system_instruction}] | {context.get_messages_for_logging()}"
|
||||
f"{self}: Generating chat from OpenAI context {context.get_messages_for_logging()}"
|
||||
f"{self}: Generating chat from LLM-specific context [{context.system_message}] | {context.get_messages_for_logging()}"
|
||||
)
|
||||
|
||||
params = GeminiLLMInvocationParams(
|
||||
@@ -874,13 +866,12 @@ class GoogleLLMService(LLMService):
|
||||
self, context: LLMContext
|
||||
) -> AsyncIterator[GenerateContentResponse]:
|
||||
adapter = self.get_llm_adapter()
|
||||
logger.debug(
|
||||
# f"{self}: Generating chat [{self._system_instruction}] | {context.get_messages_for_logging()}"
|
||||
f"{self}: Generating chat from universal context {adapter.get_messages_for_logging(context)}"
|
||||
)
|
||||
|
||||
params: GeminiLLMInvocationParams = adapter.get_llm_invocation_params(context)
|
||||
|
||||
logger.debug(
|
||||
f"{self}: Generating chat from universal context [{params['system_instruction']}] | {adapter.get_messages_for_logging(context)}"
|
||||
)
|
||||
|
||||
return await self._stream_content(params)
|
||||
|
||||
@traced_llm
|
||||
@@ -1021,15 +1012,6 @@ class GoogleLLMService(LLMService):
|
||||
# NOTE: LLMMessagesFrame is deprecated, so we don't support the newer universal
|
||||
# LLMContext with it
|
||||
context = GoogleLLMContext(frame.messages)
|
||||
elif isinstance(frame, VisionImageRawFrame):
|
||||
# This is only useful in very simple pipelines because it creates
|
||||
# a new context. Generally we want a context manager to catch
|
||||
# UserImageRawFrames coming through the pipeline and add them
|
||||
# to the context.
|
||||
context = GoogleLLMContext()
|
||||
context.add_image_frame_message(
|
||||
format=frame.format, size=frame.size, image=frame.image, text=frame.text
|
||||
)
|
||||
elif isinstance(frame, LLMUpdateSettingsFrame):
|
||||
await self._update_settings(frame.settings)
|
||||
else:
|
||||
|
||||
@@ -195,18 +195,13 @@ class LLMService(AIService):
|
||||
"""
|
||||
return self._adapter
|
||||
|
||||
async def run_inference(
|
||||
self, context: LLMContext | OpenAILLMContext, system_instruction: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
async def run_inference(self, context: LLMContext | OpenAILLMContext) -> Optional[str]:
|
||||
"""Run a one-shot, out-of-band (i.e. out-of-pipeline) inference with the given LLM context.
|
||||
|
||||
Must be implemented by subclasses.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing conversation history.
|
||||
system_instruction: Optional system instruction to guide the LLM's
|
||||
behavior. You could also (again, optionally) provide a system
|
||||
instruction directly in the context.
|
||||
|
||||
Returns:
|
||||
The LLM's response as a string, or None if no response is generated.
|
||||
|
||||
@@ -57,16 +57,18 @@ class MistralLLMService(OpenAILLMService):
|
||||
logger.debug(f"Creating Mistral client with api {base_url}")
|
||||
return super().create_client(api_key, base_url, **kwargs)
|
||||
|
||||
def _apply_mistral_assistant_prefix(
|
||||
def _apply_mistral_fixups(
|
||||
self, messages: List[ChatCompletionMessageParam]
|
||||
) -> List[ChatCompletionMessageParam]:
|
||||
"""Apply Mistral's assistant message prefix requirement.
|
||||
"""Apply fixups to messages to meet Mistral-specific requirements.
|
||||
|
||||
Mistral requires assistant messages to have prefix=True when they
|
||||
are the final message in a conversation. According to Mistral's API:
|
||||
- Assistant messages with prefix=True MUST be the last message
|
||||
- Only add prefix=True to the final assistant message when needed
|
||||
- This allows assistant messages to be accepted as the last message
|
||||
1. A "tool"-role message must be followed by an assistant message.
|
||||
|
||||
2. "system"-role messages must only appear at the start of a
|
||||
conversation.
|
||||
|
||||
3. Assistant messages must have prefix=True when they are the final
|
||||
message in a conversation (but at no other point).
|
||||
|
||||
Args:
|
||||
messages: The original list of messages.
|
||||
@@ -80,6 +82,25 @@ class MistralLLMService(OpenAILLMService):
|
||||
# Create a copy to avoid modifying the original
|
||||
fixed_messages = [dict(msg) for msg in messages]
|
||||
|
||||
# Ensure all tool responses are followed by an assistant message
|
||||
assistant_insert_indices = []
|
||||
for i, msg in enumerate(fixed_messages):
|
||||
if msg.get("role") == "tool":
|
||||
# If this is the last message or the next message is not assistant
|
||||
if i == len(fixed_messages) - 1 or fixed_messages[i + 1].get("role") != "assistant":
|
||||
assistant_insert_indices.append(i + 1)
|
||||
for idx in reversed(assistant_insert_indices):
|
||||
fixed_messages.insert(idx, {"role": "assistant", "content": " "})
|
||||
|
||||
# Convert any "system" messages that aren't at the start (i.e., after the initial contiguous block) to "user"
|
||||
first_non_system_idx = next(
|
||||
(i for i, msg in enumerate(fixed_messages) if msg.get("role") != "system"),
|
||||
len(fixed_messages),
|
||||
)
|
||||
for i, msg in enumerate(fixed_messages):
|
||||
if msg.get("role") == "system" and i >= first_non_system_idx:
|
||||
msg["role"] = "user"
|
||||
|
||||
# Get the last message
|
||||
last_message = fixed_messages[-1]
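The first two fixups described in the docstring (an assistant message after every tool message, and no system messages after the initial block) can be sketched on plain dictionaries. This is an illustrative stand-alone version, not the service method: `apply_fixups` is a hypothetical name, and the `prefix=True` handling for a trailing assistant message is deliberately left out.

```python
from typing import Dict, List


def apply_fixups(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply two of the Mistral fixups to a copy of `messages`."""
    fixed = [dict(m) for m in messages]

    # 1. Every "tool" message must be followed by an assistant message.
    insert_at = [
        i + 1
        for i, m in enumerate(fixed)
        if m.get("role") == "tool"
        and (i == len(fixed) - 1 or fixed[i + 1].get("role") != "assistant")
    ]
    # Insert back to front so earlier indices stay valid.
    for idx in reversed(insert_at):
        fixed.insert(idx, {"role": "assistant", "content": " "})

    # 2. "system" messages after the initial contiguous block become "user".
    first_non_system = next(
        (i for i, m in enumerate(fixed) if m.get("role") != "system"),
        len(fixed),
    )
    for m in fixed[first_non_system:]:
        if m.get("role") == "system":
            m["role"] = "user"
    return fixed


# Sample conversation (made up for illustration).
history = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Weather?"},
    {"role": "tool", "content": "72F"},
    {"role": "system", "content": "Mention units."},
    {"role": "user", "content": "Thanks"},
]
roles = [m["role"] for m in apply_fixups(history)]
```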
|
||||
|
||||
@@ -158,7 +179,7 @@ class MistralLLMService(OpenAILLMService):
|
||||
- Core completion settings
|
||||
"""
|
||||
# Apply Mistral's assistant prefix requirement for API compatibility
|
||||
fixed_messages = self._apply_mistral_assistant_prefix(params_from_context["messages"])
|
||||
fixed_messages = self._apply_mistral_fixups(params_from_context["messages"])
|
||||
|
||||
params = {
|
||||
"model": self.model_name,
|
||||
|
||||
@@ -11,17 +11,20 @@ for image analysis and description generation.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
from typing import AsyncGenerator
|
||||
import base64
|
||||
from io import BytesIO
|
||||
from typing import AsyncGenerator, Optional
|
||||
|
||||
from loguru import logger
|
||||
from PIL import Image
|
||||
|
||||
from pipecat.frames.frames import ErrorFrame, Frame, TextFrame, VisionImageRawFrame
|
||||
from pipecat.frames.frames import ErrorFrame, Frame, TextFrame
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
from pipecat.services.vision_service import VisionService
|
||||
|
||||
try:
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
from transformers import AutoModelForCausalLM
|
||||
except ModuleNotFoundError as e:
|
||||
logger.error(f"Exception: {e}")
|
||||
logger.error("In order to use Moondream, you need to `pip install pipecat-ai[moondream]`.")
|
||||
@@ -94,11 +97,11 @@ class MoondreamService(VisionService):
|
||||
|
||||
logger.debug("Loaded Moondream model")
|
||||
|
||||
async def run_vision(self, frame: VisionImageRawFrame) -> AsyncGenerator[Frame, None]:
|
||||
async def run_vision(self, context: LLMContext) -> AsyncGenerator[Frame, None]:
|
||||
"""Analyze an image and generate a description.
|
||||
|
||||
Args:
|
||||
frame: Vision frame containing the image data and optional question text.
|
||||
context: The context to process, containing image data.
|
||||
|
||||
Yields:
|
||||
Frame: TextFrame containing the generated image description, or ErrorFrame
|
||||
@@ -109,22 +112,45 @@ class MoondreamService(VisionService):
|
||||
yield ErrorFrame("Moondream model not available")
|
||||
return
|
||||
|
||||
logger.debug(f"Analyzing image: {frame}")
|
||||
image_bytes = None
|
||||
text = None
|
||||
try:
|
||||
messages = context.get_messages()
|
||||
last_message = messages[-1]
|
||||
last_message_content = last_message.get("content")
|
||||
|
||||
def get_image_description(frame: VisionImageRawFrame):
|
||||
"""Generate description for the given image frame.
|
||||
for item in last_message_content:
|
||||
if isinstance(item, dict):
|
||||
if (
|
||||
"image_url" in item
|
||||
and isinstance(item["image_url"], dict)
|
||||
and item["image_url"].get("url")
|
||||
):
|
||||
image_bytes = base64.b64decode(item["image_url"]["url"].split(",")[1])
|
||||
elif "text" in item and isinstance(item["text"], str):
|
||||
text = item["text"]
|
||||
|
||||
Args:
|
||||
frame: Vision frame containing image data and question.
|
||||
except Exception as e:
|
||||
logger.error(f"Exception during image extraction: {e}")
|
||||
yield ErrorFrame("Failed to extract image from context")
|
||||
return
|
||||
|
||||
Returns:
|
||||
str: Generated description of the image.
|
||||
"""
|
||||
image = Image.frombytes(frame.format, frame.size, frame.image)
|
||||
if not image_bytes:
|
||||
logger.error("No image found in context")
|
||||
yield ErrorFrame("No image found in context")
|
||||
return
|
||||
|
||||
logger.debug(
|
||||
f"Analyzing image (bytes length: {len(image_bytes) if image_bytes else 'None'})"
|
||||
)
|
||||
|
||||
def get_image_description(bytes: bytes, text: Optional[str]) -> str:
|
||||
image_buffer = BytesIO(bytes)
|
||||
image = Image.open(image_buffer)
|
||||
image_embeds = self._model.encode_image(image)
|
||||
description = self._model.query(image_embeds, frame.text)["answer"]
|
||||
description = self._model.query(image_embeds, text)["answer"]
|
||||
return description
|
||||
|
||||
description = await asyncio.to_thread(get_image_description, frame)
|
||||
description = await asyncio.to_thread(get_image_description, image_bytes, text)
|
||||
|
||||
yield TextFrame(text=description)
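The extraction loop in the new `run_vision` body reads the last message's content list and pulls out a base64 data URL plus an optional question. A minimal sketch of that step, assuming OpenAI-style content items as shown in the diff; the function name `extract_image_and_text` and the sample payload are hypothetical:

```python
import base64
from typing import Any, List, Optional, Tuple


def extract_image_and_text(content: List[Any]) -> Tuple[Optional[bytes], Optional[str]]:
    """Return (image_bytes, text) from a list of message content items."""
    image_bytes: Optional[bytes] = None
    text: Optional[str] = None
    for item in content:
        if not isinstance(item, dict):
            continue
        image_url = item.get("image_url")
        if isinstance(image_url, dict) and image_url.get("url"):
            # Assumes a data URL of the form "data:<mime>;base64,<payload>".
            # A URL without a comma raises IndexError here, which the
            # original code guards against with its surrounding try/except.
            image_bytes = base64.b64decode(image_url["url"].split(",", 1)[1])
        elif isinstance(item.get("text"), str):
            text = item["text"]
    return image_bytes, text


# Sample content list (made up for illustration).
payload = base64.b64encode(b"fake-jpeg-bytes").decode("ascii")
content = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{payload}"}},
]
image_bytes, text = extract_image_and_text(content)
```

The decoded bytes are an encoded image file, which is why the new code opens them with `Image.open(BytesIO(...))` instead of `Image.frombytes`.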
|
||||
|
||||
@@ -32,7 +32,6 @@ from pipecat.frames.frames import (
|
||||
LLMMessagesFrame,
|
||||
LLMTextFrame,
|
||||
LLMUpdateSettingsFrame,
|
||||
VisionImageRawFrame,
|
||||
)
|
||||
from pipecat.metrics.metrics import LLMTokenUsage
|
||||
from pipecat.processors.aggregators.llm_context import LLMContext
|
||||
@@ -245,16 +244,11 @@ class BaseOpenAILLMService(LLMService):
|
||||
params.update(self._settings["extra"])
|
||||
return params
|
||||
|
||||
async def run_inference(
|
||||
self, context: LLMContext | OpenAILLMContext, system_instruction: Optional[str] = None
|
||||
) -> Optional[str]:
|
||||
async def run_inference(self, context: LLMContext | OpenAILLMContext) -> Optional[str]:
|
||||
"""Run a one-shot, out-of-band (i.e. out-of-pipeline) inference with the given LLM context.
|
||||
|
||||
Args:
|
||||
context: The LLM context containing conversation history.
|
||||
system_instruction: Optional system instruction to guide the LLM's
|
||||
behavior. You could also (again, optionally) provide a system
|
||||
instruction directly in the context.
|
||||
|
||||
Returns:
|
||||
The LLM's response as a string, or None if no response is generated.
|
||||
@@ -279,7 +273,7 @@ class BaseOpenAILLMService(LLMService):
|
||||
self, context: OpenAILLMContext
|
||||
) -> AsyncStream[ChatCompletionChunk]:
|
||||
logger.debug(
|
||||
f"{self}: Generating chat from OpenAI context {context.get_messages_for_logging()}"
|
||||
f"{self}: Generating chat from LLM-specific context {context.get_messages_for_logging()}"
|
||||
)
|
||||
|
||||
messages: List[ChatCompletionMessageParam] = context.get_messages()
|
||||
@@ -423,8 +417,8 @@ class BaseOpenAILLMService(LLMService):
|
||||
"""Process frames for LLM completion requests.
|
||||
|
||||
Handles OpenAILLMContextFrame, LLMContextFrame, LLMMessagesFrame,
|
||||
VisionImageRawFrame, and LLMUpdateSettingsFrame to trigger LLM
|
||||
completions and manage settings.
|
||||
and LLMUpdateSettingsFrame to trigger LLM completions and manage
|
||||
settings.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
@@ -443,16 +437,6 @@ class BaseOpenAILLMService(LLMService):
|
||||
# NOTE: LLMMessagesFrame is deprecated, so we don't support the newer universal
|
||||
# LLMContext with it
|
||||
context = OpenAILLMContext.from_messages(frame.messages)
|
||||
elif isinstance(frame, VisionImageRawFrame):
|
||||
# This is only useful in very simple pipelines because it creates
|
||||
# a new context. Generally we want a context manager to catch
|
||||
# UserImageRawFrames coming through the pipeline and add them
|
||||
# to the context.
|
||||
# TODO: support the newer universal LLMContext with a VisionImageRawFrame equivalent?
|
||||
context = OpenAILLMContext()
|
||||
context.add_image_frame_message(
|
||||
format=frame.format, size=frame.size, image=frame.image, text=frame.text
|
||||
)
|
||||
elif isinstance(frame, LLMUpdateSettingsFrame):
|
||||
await self._update_settings(frame.settings)
|
||||
else:
|
||||
|
||||
@@ -84,5 +84,10 @@ class OpenAIImageGenService(ImageGenService):
|
||||
async with self._aiohttp_session.get(image_url) as response:
|
||||
image_stream = io.BytesIO(await response.content.read())
|
||||
image = Image.open(image_stream)
|
||||
frame = URLImageRawFrame(image_url, image.tobytes(), image.size, image.format)
|
||||
frame = URLImageRawFrame(
|
||||
image=image.tobytes(),
|
||||
size=image.size,
|
||||
format=image.format,
|
||||
url=image_url,
|
||||
)
|
||||
yield frame
|
||||
|
||||
9
src/pipecat/services/openai_realtime/__init__.py
Normal file
@@ -0,0 +1,9 @@
|
||||
from .azure import AzureRealtimeLLMService
|
||||
from .events import (
|
||||
InputAudioNoiseReduction,
|
||||
InputAudioTranscription,
|
||||
SemanticTurnDetection,
|
||||
SessionProperties,
|
||||
TurnDetection,
|
||||
)
|
||||
from .openai import OpenAIRealtimeLLMService
|
||||
67
src/pipecat/services/openai_realtime/azure.py
Normal file
@@ -0,0 +1,67 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Azure OpenAI Realtime LLM service implementation."""
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from .openai import OpenAIRealtimeLLMService
|
||||
|
||||
try:
|
||||
from websockets.asyncio.client import connect as websocket_connect
|
||||
except ModuleNotFoundError as e:
|
||||
logger.error(f"Exception: {e}")
|
||||
logger.error(
|
||||
"In order to use OpenAI, you need to `pip install pipecat-ai[openai]`. Also, set `OPENAI_API_KEY` environment variable."
|
||||
)
|
||||
raise Exception(f"Missing module: {e}")
|
||||
|
||||
|
||||
class AzureRealtimeLLMService(OpenAIRealtimeLLMService):
|
||||
"""Azure OpenAI Realtime LLM service with Azure-specific authentication.
|
||||
|
||||
Extends the OpenAI Realtime service to work with Azure OpenAI endpoints,
|
||||
using Azure's authentication headers and endpoint format. Provides the same
|
||||
real-time audio and text communication capabilities as the base OpenAI service.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
api_key: str,
|
||||
base_url: str,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize Azure Realtime LLM service.
|
||||
|
||||
Args:
|
||||
api_key: The API key for the Azure OpenAI service.
|
||||
base_url: The full Azure WebSocket endpoint URL including api-version and deployment.
|
||||
Example: "wss://my-project.openai.azure.com/openai/realtime?api-version=2024-10-01-preview&deployment=my-realtime-deployment"
|
||||
**kwargs: Additional arguments passed to parent OpenAIRealtimeLLMService.
|
||||
"""
|
||||
super().__init__(base_url=base_url, api_key=api_key, **kwargs)
|
||||
self.api_key = api_key
|
||||
self.base_url = base_url
|
||||
|
||||
async def _connect(self):
|
||||
try:
|
||||
if self._websocket:
|
||||
# Here we assume that if we have a websocket, we are connected. We
|
||||
# handle disconnections in the send/recv code paths.
|
||||
return
|
||||
|
||||
logger.info(f"Connecting to {self.base_url}")
|
||||
self._websocket = await websocket_connect(
|
||||
uri=self.base_url,
|
||||
additional_headers={
|
||||
"api-key": self.api_key,
|
||||
},
|
||||
)
|
||||
self._receive_task = self.create_task(self._receive_task_handler())
|
||||
except Exception as e:
|
||||
logger.error(f"{self} initialization error: {e}")
|
||||
self._websocket = None
|
||||
272
src/pipecat/services/openai_realtime/context.py
Normal file
@@ -0,0 +1,272 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""OpenAI Realtime LLM context and aggregator implementations."""
|
||||
|
||||
import copy
|
||||
import json
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.frames.frames import (
|
||||
Frame,
|
||||
FunctionCallResultFrame,
|
||||
InterimTranscriptionFrame,
|
||||
LLMMessagesUpdateFrame,
|
||||
LLMSetToolsFrame,
|
||||
LLMTextFrame,
|
||||
TranscriptionFrame,
|
||||
)
|
||||
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.openai.llm import (
|
||||
OpenAIAssistantContextAggregator,
|
||||
OpenAIUserContextAggregator,
|
||||
)
|
||||
|
||||
from . import events
|
||||
from .frames import RealtimeFunctionCallResultFrame, RealtimeMessagesUpdateFrame
|
||||
|
||||
|
||||
class OpenAIRealtimeLLMContext(OpenAILLMContext):
|
||||
"""OpenAI Realtime LLM context with session management and message conversion.
|
||||
|
||||
Extends the standard OpenAI LLM context to support real-time session properties,
|
||||
instruction management, and conversion between standard message formats and
|
||||
realtime conversation items.
|
||||
"""
|
||||
|
||||
def __init__(self, messages=None, tools=None, **kwargs):
|
||||
"""Initialize the OpenAIRealtimeLLMContext.
|
||||
|
||||
Args:
|
||||
messages: Initial conversation messages. Defaults to None.
|
||||
tools: Available function tools. Defaults to None.
|
||||
**kwargs: Additional arguments passed to parent OpenAILLMContext.
|
||||
"""
|
||||
super().__init__(messages=messages, tools=tools, **kwargs)
|
||||
self.__setup_local()
|
||||
|
||||
def __setup_local(self):
|
||||
self.llm_needs_settings_update = True
|
||||
self.llm_needs_initial_messages = True
|
||||
self._session_instructions = ""
|
||||
|
||||
return
|
||||
|
||||
@staticmethod
|
||||
def upgrade_to_realtime(obj: OpenAILLMContext) -> "OpenAIRealtimeLLMContext":
|
||||
"""Upgrade a standard OpenAI LLM context to a realtime context.
|
||||
|
||||
Args:
|
||||
obj: The OpenAILLMContext instance to upgrade.
|
||||
|
||||
Returns:
|
||||
The upgraded OpenAIRealtimeLLMContext instance.
|
||||
"""
|
||||
if isinstance(obj, OpenAILLMContext) and not isinstance(obj, OpenAIRealtimeLLMContext):
|
||||
obj.__class__ = OpenAIRealtimeLLMContext
|
||||
obj.__setup_local()
|
||||
return obj
|
||||
|
||||
# todo
|
||||
# - finish implementing all frames
|
||||
|
||||
def from_standard_message(self, message):
|
||||
"""Convert a standard message format to a realtime conversation item.
|
||||
|
||||
Args:
|
||||
message: The standard message dictionary to convert.
|
||||
|
||||
Returns:
|
||||
A ConversationItem instance for the realtime API.
|
||||
"""
|
||||
if message.get("role") == "user":
|
||||
content = message.get("content")
|
||||
if isinstance(message.get("content"), list):
|
||||
content = ""
|
||||
for c in message.get("content"):
|
||||
if c.get("type") == "text":
|
||||
content += " " + c.get("text")
|
||||
else:
|
||||
logger.error(
|
||||
f"Unhandled content type in context message: {c.get('type')} - {message}"
|
||||
)
|
||||
return events.ConversationItem(
|
||||
role="user",
|
||||
type="message",
|
||||
content=[events.ItemContent(type="input_text", text=content)],
|
||||
)
|
||||
if message.get("role") == "assistant" and message.get("tool_calls"):
|
||||
tc = message.get("tool_calls")[0]
|
||||
return events.ConversationItem(
|
||||
type="function_call",
|
||||
call_id=tc["id"],
|
||||
name=tc["function"]["name"],
|
||||
arguments=tc["function"]["arguments"],
|
||||
)
|
||||
logger.error(f"Unhandled message type in from_standard_message: {message}")
|
||||
|
||||
def get_messages_for_initializing_history(self):
|
||||
"""Get conversation items for initializing the realtime session history.
|
||||
|
||||
Converts the context's messages to a format suitable for the realtime API,
|
||||
handling system instructions and conversation history packaging.
|
||||
|
||||
Returns:
|
||||
List of conversation items for session initialization.
|
||||
"""
|
||||
# We can't load a long conversation history into the openai realtime api yet. (The API/model
|
||||
# forgets that it can do audio, if you do a series of `conversation.item.create` calls.) So
|
||||
# our general strategy until this is fixed is just to put everything into a first "user"
|
||||
# message as a single input.
|
||||
if not self.messages:
|
||||
return []
|
||||
|
||||
messages = copy.deepcopy(self.messages)
|
||||
|
||||
# If we have a "system" message as our first message, let's pull that out into session
|
||||
# "instructions"
|
||||
if messages[0].get("role") == "system":
|
||||
self.llm_needs_settings_update = True
|
||||
system = messages.pop(0)
|
||||
content = system.get("content")
|
||||
if isinstance(content, str):
|
||||
self._session_instructions = content
|
||||
elif isinstance(content, list):
|
||||
self._session_instructions = content[0].get("text")
|
||||
if not messages:
|
||||
return []
|
||||
|
||||
# If we have just a single "user" item, we can just send it normally
|
||||
if len(messages) == 1 and messages[0].get("role") == "user":
|
||||
return [self.from_standard_message(messages[0])]
|
||||
|
||||
# Otherwise, let's pack everything into a single "user" message with a bit of
|
||||
# explanation for the LLM
|
||||
intro_text = """
|
||||
This is a previously saved conversation. Please treat this conversation history as a
|
||||
starting point for the current conversation."""
|
||||
|
||||
trailing_text = """
|
||||
This is the end of the previously saved conversation. Please continue the conversation
|
||||
from here. If the last message is a user instruction or question, act on that instruction
|
||||
or answer the question. If the last message is an assistant response, simply say that you
|
||||
are ready to continue the conversation."""
|
||||
|
||||
return [
|
||||
{
|
||||
"role": "user",
|
||||
"type": "message",
|
||||
"content": [
|
||||
{
|
||||
"type": "input_text",
|
||||
"text": "\n\n".join(
|
||||
[intro_text, json.dumps(messages, indent=2), trailing_text]
|
||||
),
|
||||
}
|
||||
],
|
||||
}
|
||||
]
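The strategy in `get_messages_for_initializing_history` (pull a leading system message into session instructions, then pack the remaining history into one user item) can be sketched without the Pipecat classes. This is a simplified illustration under stated assumptions: `pack_history`, `INTRO`, and `OUTRO` are hypothetical names, the intro and outro strings are shortened placeholders, and a single user message is assumed to carry string content.

```python
import json
from typing import Any, Dict, List, Tuple

# Placeholder wording; the real prompt text is longer.
INTRO = "This is a previously saved conversation."
OUTRO = "This is the end of the previously saved conversation."


def pack_history(messages: List[Dict[str, Any]]) -> Tuple[str, List[Dict[str, Any]]]:
    """Return (session_instructions, conversation_items) for a message list."""
    remaining = [dict(m) for m in messages]
    instructions = ""

    # A leading "system" message becomes the session instructions.
    if remaining and remaining[0].get("role") == "system":
        content = remaining.pop(0).get("content")
        if isinstance(content, str):
            instructions = content
        elif isinstance(content, list) and content:
            instructions = content[0].get("text", "")

    if not remaining:
        return instructions, []

    if len(remaining) == 1 and remaining[0].get("role") == "user":
        # A single user message is sent as-is (string content assumed).
        text = remaining[0].get("content")
    else:
        # Otherwise serialize the whole history into one user message.
        text = "\n\n".join([INTRO, json.dumps(remaining, indent=2), OUTRO])

    item = {
        "role": "user",
        "type": "message",
        "content": [{"type": "input_text", "text": text}],
    }
    return instructions, [item]


# Sample history (made up for illustration).
instructions, items = pack_history(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
)
```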
|
||||
|
||||
def add_user_content_item_as_message(self, item):
|
||||
"""Add a user content item as a standard message to the context.
|
||||
|
||||
Args:
|
||||
item: The conversation item to add as a user message.
|
||||
"""
|
||||
message = {
|
||||
"role": "user",
|
||||
"content": [{"type": "text", "text": item.content[0].transcript}],
|
||||
}
|
||||
self.add_message(message)
|
||||
|
||||
|
||||
class OpenAIRealtimeUserContextAggregator(OpenAIUserContextAggregator):
|
||||
"""User context aggregator for OpenAI Realtime API.
|
||||
|
||||
Handles user input frames and generates appropriate context updates
|
||||
for the realtime conversation, including message updates and tool settings.
|
||||
|
||||
Args:
|
||||
context: The OpenAI realtime LLM context.
|
||||
**kwargs: Additional arguments passed to parent aggregator.
|
||||
"""
|
||||
|
||||
async def process_frame(
|
||||
self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM
|
||||
):
|
||||
"""Process incoming frames and handle realtime-specific frame types.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame flow in the pipeline.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
# Parent does not push LLMMessagesUpdateFrame. This ensures that in a typical pipeline,
|
||||
# messages are only processed by the user context aggregator, which is generally what we want. But
|
||||
# we also need to send new messages over the websocket, so the openai realtime API has them
|
||||
# in its context.
|
||||
if isinstance(frame, LLMMessagesUpdateFrame):
|
||||
await self.push_frame(RealtimeMessagesUpdateFrame(context=self._context))
|
||||
|
||||
# Parent also doesn't push the LLMSetToolsFrame.
|
||||
if isinstance(frame, LLMSetToolsFrame):
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
async def push_aggregation(self):
|
||||
"""Push user input aggregation.
|
||||
|
||||
Currently ignores all user input coming into the pipeline as realtime
|
||||
audio input is handled directly by the service.
|
||||
"""
|
||||
# for the moment, ignore all user input coming into the pipeline.
|
||||
# todo: think about whether/how to fix this to allow for text input from
|
||||
# upstream (transport/transcription, or other sources)
|
||||
pass
|
||||
|
||||
|
||||
class OpenAIRealtimeAssistantContextAggregator(OpenAIAssistantContextAggregator):
|
||||
"""Assistant context aggregator for OpenAI Realtime API.
|
||||
|
||||
Handles assistant output frames from the realtime service, filtering
|
||||
out duplicate text frames and managing function call results.
|
||||
|
||||
Args:
|
||||
context: The OpenAI realtime LLM context.
|
||||
**kwargs: Additional arguments passed to parent aggregator.
|
||||
"""
|
||||
|
||||
# The LLMAssistantContextAggregator uses TextFrames to aggregate the LLM output,
|
||||
# but the OpenAIRealtimeLLMService pushes LLMTextFrames and TTSTextFrames. We
|
||||
# need to override process_frame for LLMTextFrame, so that only the TTSTextFrames
|
||||
# are processed. This ensures that the context gets only one set of messages.
|
||||
# OpenAIRealtimeLLMService also pushes TranscriptionFrames and InterimTranscriptionFrames,
|
||||
# so we need to ignore pushing those as well, as they're also TextFrames.
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process assistant frames, filtering out duplicate text content.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame flow in the pipeline.
|
||||
"""
|
||||
if not isinstance(frame, (LLMTextFrame, TranscriptionFrame, InterimTranscriptionFrame)):
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
async def handle_function_call_result(self, frame: FunctionCallResultFrame):
|
||||
"""Handle function call result and notify the realtime service.
|
||||
|
||||
Args:
|
||||
frame: The function call result frame to handle.
|
||||
"""
|
||||
await super().handle_function_call_result(frame)
|
||||
|
||||
# The standard function callback code path pushes the FunctionCallResultFrame from the llm itself,
|
||||
# so we didn't have a chance to add the result to the openai realtime api context. Let's push a
|
||||
# special frame to do that.
|
||||
await self.push_frame(
|
||||
RealtimeFunctionCallResultFrame(result_frame=frame), FrameDirection.UPSTREAM
|
||||
)
|
||||
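The message-packing strategy in the context class above (pull a leading "system" message out into session instructions, pass a lone "user" message through, otherwise serialize the whole history into one "user" input) can be sketched in isolation. This is a minimal illustration, not Pipecat's implementation: `pack_history` and `to_input_item` are hypothetical names, `to_input_item` is a simplified stand-in for `from_standard_message`, and the intro and trailing texts are abbreviated.

```python
import copy
import json

# Abbreviated stand-ins for the intro/trailing prompt text used in the diff.
INTRO = "This is a previously saved conversation."
TRAILER = "This is the end of the previously saved conversation."


def to_input_item(message):
    """Hypothetical, simplified conversion of one standard user message."""
    content = message.get("content")
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]
    return {
        "role": "user",
        "type": "message",
        "content": [
            {"type": "input_text", "text": part["text"]}
            for part in content
            if part.get("type") == "text"
        ],
    }


def pack_history(messages):
    """Return (session_instructions, initial_items) for a message list."""
    if not messages:
        return None, []

    # Work on a copy so the caller's history is never mutated.
    messages = copy.deepcopy(messages)

    # A leading "system" message becomes session instructions.
    instructions = None
    if messages[0].get("role") == "system":
        content = messages.pop(0).get("content")
        if isinstance(content, str):
            instructions = content
        elif isinstance(content, list) and content:
            instructions = content[0].get("text")
    if not messages:
        return instructions, []

    # A single "user" message can be sent as-is.
    if len(messages) == 1 and messages[0].get("role") == "user":
        return instructions, [to_input_item(messages[0])]

    # Otherwise pack the entire history into one "user" text input.
    text = "\n\n".join([INTRO, json.dumps(messages, indent=2), TRAILER])
    item = {
        "role": "user",
        "type": "message",
        "content": [{"type": "input_text", "text": text}],
    }
    return instructions, [item]
```

Calling `pack_history` with a system message followed by one user message returns the system text as instructions plus a single pass-through item; any longer history collapses into one packed item whose text embeds the JSON dump.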
1106
src/pipecat/services/openai_realtime/events.py
Normal file
File diff suppressed because it is too large
37
src/pipecat/services/openai_realtime/frames.py
Normal file
@@ -0,0 +1,37 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""Custom frame types for OpenAI Realtime API integration."""
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from pipecat.frames.frames import DataFrame, FunctionCallResultFrame
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from pipecat.services.openai_realtime.context import OpenAIRealtimeLLMContext
|
||||
|
||||
|
||||
@dataclass
|
||||
class RealtimeMessagesUpdateFrame(DataFrame):
|
||||
"""Frame indicating that the realtime context messages have been updated.
|
||||
|
||||
Parameters:
|
||||
context: The updated OpenAI realtime LLM context.
|
||||
"""
|
||||
|
||||
context: "OpenAIRealtimeLLMContext"
|
||||
|
||||
|
||||
@dataclass
|
||||
class RealtimeFunctionCallResultFrame(DataFrame):
|
||||
"""Frame containing function call results for the realtime service.
|
||||
|
||||
Parameters:
|
||||
result_frame: The function call result frame to send to the realtime API.
|
||||
"""
|
||||
|
||||
result_frame: FunctionCallResultFrame
|
||||
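The `TYPE_CHECKING` guard used in `frames.py` keeps the context import out of the runtime import graph while still letting type checkers resolve the quoted annotation. A minimal sketch of the pattern, using a standard-library class as a stand-in for the guarded import (`RatioUpdate` is a hypothetical name, not part of Pipecat):

```python
from dataclasses import dataclass, fields
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; this import never runs at runtime,
    # so it cannot create a circular import or add load-time cost.
    from fractions import Fraction


@dataclass
class RatioUpdate:
    # The quoted annotation is stored as a plain string, so the name
    # Fraction does not need to exist when the class is created.
    ratio: "Fraction"


# The dataclass works normally even though Fraction was never imported.
update = RatioUpdate(ratio=None)
annotation = fields(RatioUpdate)[0].type
```

Because the annotation stays a string at runtime, code that needs the real class (for example via `typing.get_type_hints`) would have to import it explicitly first.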
831
src/pipecat/services/openai_realtime/openai.py
Normal file
@@ -0,0 +1,831 @@
|
||||
#
|
||||
# Copyright (c) 2024–2025, Daily
|
||||
#
|
||||
# SPDX-License-Identifier: BSD 2-Clause License
|
||||
#
|
||||
|
||||
"""OpenAI Realtime LLM service implementation with WebSocket support."""
|
||||
|
||||
import base64
|
||||
import json
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
from pipecat.adapters.services.open_ai_realtime_adapter import OpenAIRealtimeLLMAdapter
|
||||
from pipecat.frames.frames import (
|
||||
BotStoppedSpeakingFrame,
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
InputAudioRawFrame,
|
||||
InterimTranscriptionFrame,
|
||||
LLMContextFrame,
|
||||
LLMFullResponseEndFrame,
|
||||
LLMFullResponseStartFrame,
|
||||
LLMMessagesAppendFrame,
|
||||
LLMSetToolsFrame,
|
||||
LLMTextFrame,
|
||||
LLMUpdateSettingsFrame,
|
||||
StartFrame,
|
||||
StartInterruptionFrame,
|
||||
TranscriptionFrame,
|
||||
TTSAudioRawFrame,
|
||||
TTSStartedFrame,
|
||||
TTSStoppedFrame,
|
||||
TTSTextFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
UserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.metrics.metrics import LLMTokenUsage
|
||||
from pipecat.processors.aggregators.llm_response import (
|
||||
LLMAssistantAggregatorParams,
|
||||
LLMUserAggregatorParams,
|
||||
)
|
||||
from pipecat.processors.aggregators.openai_llm_context import (
|
||||
OpenAILLMContext,
|
||||
OpenAILLMContextFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.llm_service import FunctionCallFromLLM, LLMService
|
||||
from pipecat.services.openai.llm import OpenAIContextAggregatorPair
|
||||
from pipecat.transcriptions.language import Language
|
||||
from pipecat.utils.time import time_now_iso8601
|
||||
from pipecat.utils.tracing.service_decorators import traced_openai_realtime, traced_stt
|
||||
|
||||
from . import events
|
||||
from .context import (
|
||||
OpenAIRealtimeAssistantContextAggregator,
|
||||
OpenAIRealtimeLLMContext,
|
||||
OpenAIRealtimeUserContextAggregator,
|
||||
)
|
||||
from .frames import RealtimeFunctionCallResultFrame, RealtimeMessagesUpdateFrame
|
||||
|
||||
try:
|
||||
from websockets.asyncio.client import connect as websocket_connect
|
||||
except ModuleNotFoundError as e:
|
||||
logger.error(f"Exception: {e}")
|
||||
logger.error("In order to use OpenAI, you need to `pip install pipecat-ai[openai]`.")
|
||||
raise Exception(f"Missing module: {e}")
|
||||
|
||||
|
||||
@dataclass
|
||||
class CurrentAudioResponse:
|
||||
"""Tracks the current audio response from the assistant.
|
||||
|
||||
Parameters:
|
||||
item_id: Unique identifier for the audio response item.
|
||||
content_index: Index of the audio content within the item.
|
||||
start_time_ms: Timestamp when the audio response started in milliseconds.
|
||||
total_size: Total size of audio data received in bytes. Defaults to 0.
|
||||
"""
|
||||
|
||||
item_id: str
|
||||
content_index: int
|
||||
start_time_ms: int
|
||||
total_size: int = 0
|
||||
|
||||
|
||||
class OpenAIRealtimeLLMService(LLMService):
|
||||
"""OpenAI Realtime LLM service providing real-time audio and text communication.
|
||||
|
||||
Implements the OpenAI Realtime API with WebSocket communication for low-latency
|
||||
bidirectional audio and text interactions. Supports function calling, conversation
|
||||
management, and real-time transcription.
|
||||
"""
|
||||
|
||||
# Overriding the default adapter to use the OpenAIRealtimeLLMAdapter one.
|
||||
adapter_class = OpenAIRealtimeLLMAdapter
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
api_key: str,
|
||||
model: str = "gpt-realtime",
|
||||
base_url: str = "wss://api.openai.com/v1/realtime",
|
||||
session_properties: Optional[events.SessionProperties] = None,
|
||||
start_audio_paused: bool = False,
|
||||
send_transcription_frames: bool = True,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the OpenAI Realtime LLM service.
|
||||
|
||||
Args:
|
||||
api_key: OpenAI API key for authentication.
|
||||
model: OpenAI model name. Defaults to "gpt-realtime".
|
||||
base_url: WebSocket base URL for the realtime API.
|
||||
Defaults to "wss://api.openai.com/v1/realtime".
|
||||
session_properties: Configuration properties for the realtime session.
|
||||
If None, uses default SessionProperties.
|
||||
start_audio_paused: Whether to start with audio input paused. Defaults to False.
|
||||
send_transcription_frames: Whether to emit transcription frames. Defaults to True.
|
||||
**kwargs: Additional arguments passed to parent LLMService.
|
||||
"""
|
||||
full_url = f"{base_url}?model={model}"
|
||||
super().__init__(base_url=full_url, **kwargs)
|
||||
|
||||
self.api_key = api_key
|
||||
self.base_url = full_url
|
||||
self.set_model_name(model)
|
||||
|
||||
self._session_properties: events.SessionProperties = (
|
||||
session_properties or events.SessionProperties()
|
||||
)
|
||||
self._audio_input_paused = start_audio_paused
|
||||
self._send_transcription_frames = send_transcription_frames
|
||||
self._websocket = None
|
||||
self._receive_task = None
|
||||
self._context = None
|
||||
|
||||
self._disconnecting = False
|
||||
self._api_session_ready = False
|
||||
self._run_llm_when_api_session_ready = False
|
||||
|
||||
self._current_assistant_response = None
|
||||
self._current_audio_response = None
|
||||
|
||||
self._messages_added_manually = {}
|
||||
self._user_and_response_message_tuple = None
|
||||
self._pending_function_calls = {} # Track function calls by call_id
|
||||
|
||||
self._register_event_handler("on_conversation_item_created")
|
||||
self._register_event_handler("on_conversation_item_updated")
|
||||
self._retrieve_conversation_item_futures = {}
|
||||
|
||||
def can_generate_metrics(self) -> bool:
|
||||
"""Check if the service can generate usage metrics.
|
||||
|
||||
Returns:
|
||||
True if metrics generation is supported.
|
||||
"""
|
||||
return True
|
||||
|
||||
def set_audio_input_paused(self, paused: bool):
|
||||
"""Set whether audio input is paused.
|
||||
|
||||
Args:
|
||||
paused: True to pause audio input, False to resume.
|
||||
"""
|
||||
self._audio_input_paused = paused
|
||||
|
||||
def _is_modality_enabled(self, modality: str) -> bool:
|
||||
"""Check if a specific modality is enabled, "text" or "audio"."""
|
||||
modalities = self._session_properties.output_modalities or ["audio", "text"]
|
||||
return modality in modalities
|
||||
|
||||
def _get_enabled_modalities(self) -> list[str]:
|
||||
"""Get the list of enabled modalities."""
|
||||
modalities = self._session_properties.output_modalities or ["audio", "text"]
|
||||
# API only supports single modality responses: either ["text"] or ["audio"]
|
||||
if "audio" in modalities:
|
||||
return ["audio"]
|
||||
elif "text" in modalities:
|
||||
return ["text"]
|
||||
|
||||
async def retrieve_conversation_item(self, item_id: str):
|
||||
"""Retrieve a conversation item by ID from the server.
|
||||
|
||||
Args:
|
||||
item_id: The ID of the conversation item to retrieve.
|
||||
|
||||
Returns:
|
||||
The retrieved conversation item.
|
||||
"""
|
||||
future = self.get_event_loop().create_future()
|
||||
retrieval_in_flight = False
|
||||
if not self._retrieve_conversation_item_futures.get(item_id):
|
||||
self._retrieve_conversation_item_futures[item_id] = []
|
||||
else:
|
||||
retrieval_in_flight = True
|
||||
self._retrieve_conversation_item_futures[item_id].append(future)
|
||||
if not retrieval_in_flight:
|
||||
await self.send_client_event(
|
||||
# Set event_id to "rci_{item_id}" so that we can identify an
|
||||
# error later if the retrieval fails. We don't need a UUID
|
||||
# suffix to the event_id because we're ensuring only one
|
||||
# in-flight retrieval per item_id. (Note: "rci" = "retrieve
|
||||
# conversation item")
|
||||
events.ConversationItemRetrieveEvent(item_id=item_id, event_id=f"rci_{item_id}")
|
||||
)
|
||||
return await future
|
||||
|
||||
#
|
||||
# standard AIService frame handling
|
||||
#
|
||||
|
||||
async def start(self, frame: StartFrame):
|
||||
"""Start the service and establish WebSocket connection.
|
||||
|
||||
Args:
|
||||
frame: The start frame triggering service initialization.
|
||||
"""
|
||||
await super().start(frame)
|
||||
await self._connect()
|
||||
|
||||
async def stop(self, frame: EndFrame):
|
||||
"""Stop the service and close WebSocket connection.
|
||||
|
||||
Args:
|
||||
frame: The end frame triggering service shutdown.
|
||||
"""
|
||||
await super().stop(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def cancel(self, frame: CancelFrame):
|
||||
"""Cancel the service and close WebSocket connection.
|
||||
|
||||
Args:
|
||||
frame: The cancel frame triggering service cancellation.
|
||||
"""
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
#
|
||||
# speech and interruption handling
|
||||
#
|
||||
|
||||
async def _handle_interruption(self):
|
||||
# None and False are different. Check for False. None means we're using OpenAI's
|
||||
# built-in turn detection defaults.
|
||||
turn_detection_disabled = (
|
||||
self._session_properties.audio
|
||||
and self._session_properties.audio.input
|
||||
and self._session_properties.audio.input.turn_detection is False
|
||||
)
|
||||
if turn_detection_disabled:
|
||||
await self.send_client_event(events.InputAudioBufferClearEvent())
|
||||
await self.send_client_event(events.ResponseCancelEvent())
|
||||
await self._truncate_current_audio_response()
|
||||
await self.stop_all_metrics()
|
||||
if self._current_assistant_response:
|
||||
await self.push_frame(LLMFullResponseEndFrame())
|
||||
# Only push TTSStoppedFrame if audio modality is enabled
|
||||
if self._is_modality_enabled("audio"):
|
||||
await self.push_frame(TTSStoppedFrame())
|
||||
|
||||
async def _handle_user_started_speaking(self, frame):
|
||||
pass
|
||||
|
||||
async def _handle_user_stopped_speaking(self, frame):
|
||||
# None and False are different. Check for False. None means we're using OpenAI's
|
||||
# built-in turn detection defaults.
|
||||
turn_detection_disabled = (
|
||||
self._session_properties.audio
|
||||
and self._session_properties.audio.input
|
||||
and self._session_properties.audio.input.turn_detection is False
|
||||
)
|
||||
if turn_detection_disabled:
|
||||
await self.send_client_event(events.InputAudioBufferCommitEvent())
|
||||
await self.send_client_event(events.ResponseCreateEvent())
|
||||
|
||||
async def _handle_bot_stopped_speaking(self):
|
||||
self._current_audio_response = None
|
||||
|
||||
def _calculate_audio_duration_ms(
|
||||
self, total_bytes: int, sample_rate: int = 24000, bytes_per_sample: int = 2
|
||||
) -> int:
|
||||
"""Calculate audio duration in milliseconds based on PCM audio parameters."""
|
||||
samples = total_bytes / bytes_per_sample
|
||||
duration_seconds = samples / sample_rate
|
||||
return int(duration_seconds * 1000)
|
||||
|
||||
async def _truncate_current_audio_response(self):
|
||||
"""Truncates the current audio response at the appropriate duration.
|
||||
|
||||
Calculates the actual duration of the audio content and truncates at the shorter of
|
||||
either the wall clock time or the actual audio duration to prevent invalid truncation
|
||||
requests.
|
||||
"""
|
||||
if not self._current_audio_response:
|
||||
return
|
||||
|
||||
# if the bot is still speaking, truncate the last message
|
||||
try:
|
||||
current = self._current_audio_response
|
||||
self._current_audio_response = None
|
||||
|
||||
# Calculate actual audio duration instead of using wall clock time
|
||||
audio_duration_ms = self._calculate_audio_duration_ms(current.total_size)
|
||||
|
||||
# Use the shorter of wall clock time or actual audio duration
|
||||
elapsed_ms = int(time.time() * 1000 - current.start_time_ms)
|
||||
truncate_ms = min(elapsed_ms, audio_duration_ms)
|
||||
|
||||
logger.trace(
|
||||
f"Truncating audio: duration={audio_duration_ms}ms, "
|
||||
f"elapsed={elapsed_ms}ms, truncate={truncate_ms}ms"
|
||||
)
|
||||
|
||||
await self.send_client_event(
|
||||
events.ConversationItemTruncateEvent(
|
||||
item_id=current.item_id,
|
||||
content_index=current.content_index,
|
||||
audio_end_ms=truncate_ms,
|
||||
)
|
||||
)
|
||||
except Exception as e:
|
||||
# Log warning and don't re-raise - allow session to continue
|
||||
logger.warning(f"Audio truncation failed (non-fatal): {e}")
|
||||
|
||||
#
|
||||
# frame processing
|
||||
#
|
||||
# StartFrame, StopFrame, CancelFrame implemented in base class
|
||||
#
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process incoming frames from the pipeline.
|
||||
|
||||
Args:
|
||||
frame: The frame to process.
|
||||
direction: The direction of frame flow in the pipeline.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
pass
|
||||
elif isinstance(frame, OpenAILLMContextFrame):
|
||||
context: OpenAIRealtimeLLMContext = OpenAIRealtimeLLMContext.upgrade_to_realtime(
|
||||
frame.context
|
||||
)
|
||||
if not self._context:
|
||||
self._context = context
|
||||
elif frame.context is not self._context:
|
||||
# If the context has changed, reset the conversation
|
||||
self._context = context
|
||||
await self.reset_conversation()
|
||||
# Run the LLM at next opportunity
|
||||
await self._create_response()
|
||||
elif isinstance(frame, LLMContextFrame):
|
||||
raise NotImplementedError(
|
||||
"Universal LLMContext is not yet supported for OpenAI Realtime."
|
||||
)
|
||||
elif isinstance(frame, InputAudioRawFrame):
|
||||
if not self._audio_input_paused:
|
||||
await self._send_user_audio(frame)
|
||||
elif isinstance(frame, StartInterruptionFrame):
|
||||
await self._handle_interruption()
|
||||
elif isinstance(frame, UserStartedSpeakingFrame):
|
||||
await self._handle_user_started_speaking(frame)
|
||||
elif isinstance(frame, UserStoppedSpeakingFrame):
|
||||
await self._handle_user_stopped_speaking(frame)
|
||||
elif isinstance(frame, BotStoppedSpeakingFrame):
|
||||
await self._handle_bot_stopped_speaking()
|
||||
elif isinstance(frame, LLMMessagesAppendFrame):
|
||||
await self._handle_messages_append(frame)
|
||||
elif isinstance(frame, RealtimeMessagesUpdateFrame):
|
||||
self._context = frame.context
|
||||
elif isinstance(frame, LLMUpdateSettingsFrame):
|
||||
self._session_properties = events.SessionProperties(**frame.settings)
|
||||
await self._update_settings()
|
||||
elif isinstance(frame, LLMSetToolsFrame):
|
||||
await self._update_settings()
|
||||
elif isinstance(frame, RealtimeFunctionCallResultFrame):
|
||||
await self._handle_function_call_result(frame.result_frame)
|
||||
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
async def _handle_messages_append(self, frame):
|
||||
logger.error("!!! NEED TO IMPLEMENT MESSAGES APPEND")
|
||||
|
||||
async def _handle_function_call_result(self, frame):
|
||||
item = events.ConversationItem(
|
||||
type="function_call_output",
|
||||
call_id=frame.tool_call_id,
|
||||
output=json.dumps(frame.result),
|
||||
)
|
||||
await self.send_client_event(events.ConversationItemCreateEvent(item=item))
|
||||
|
||||
#
|
||||
# websocket communication
|
||||
#
|
||||
|
||||
async def send_client_event(self, event: events.ClientEvent):
|
||||
"""Send a client event to the OpenAI Realtime API.
|
||||
|
||||
Args:
|
||||
event: The client event to send.
|
||||
"""
|
||||
await self._ws_send(event.model_dump(exclude_none=True))
|
||||
|
||||
async def _connect(self):
|
||||
try:
|
||||
if self._websocket:
|
||||
# Here we assume that if we have a websocket, we are connected. We
|
||||
# handle disconnections in the send/recv code paths.
|
||||
return
|
||||
self._websocket = await websocket_connect(
|
||||
uri=self.base_url,
|
||||
additional_headers={
|
||||
"Authorization": f"Bearer {self.api_key}",
|
||||
},
|
||||
)
|
||||
self._receive_task = self.create_task(self._receive_task_handler())
|
||||
except Exception as e:
|
||||
logger.error(f"{self} initialization error: {e}")
|
||||
self._websocket = None
|
||||
|
||||
async def _disconnect(self):
|
||||
try:
|
||||
self._disconnecting = True
|
||||
self._api_session_ready = False
|
||||
await self.stop_all_metrics()
|
||||
if self._websocket:
|
||||
await self._websocket.close()
|
||||
self._websocket = None
|
||||
if self._receive_task:
|
||||
await self.cancel_task(self._receive_task, timeout=1.0)
|
||||
self._receive_task = None
|
||||
self._disconnecting = False
|
||||
except Exception as e:
|
||||
logger.error(f"{self} error disconnecting: {e}")
|
||||
|
||||
async def _ws_send(self, realtime_message):
|
||||
try:
|
||||
if self._websocket:
|
||||
await self._websocket.send(json.dumps(realtime_message))
|
||||
except Exception as e:
|
||||
if self._disconnecting:
|
||||
return
|
||||
logger.error(f"Error sending message to websocket: {e}")
|
||||
# In server-to-server contexts, a WebSocket error should be quite rare. Given how hard
|
||||
# it is to recover from a send-side error with proper state management, and that exponential
|
||||
# backoff for retries can have cost/stability implications for a service cluster, let's just
|
||||
# treat a send-side error as fatal.
|
||||
await self.push_error(ErrorFrame(error=f"Error sending client event: {e}", fatal=True))
|
||||
|
||||
async def _update_settings(self):
|
||||
settings = self._session_properties
|
||||
# tools given in the context override the tools in the session properties
|
||||
if self._context and self._context.tools:
|
||||
settings.tools = self._context.tools
|
||||
# instructions in the context come from an initial "system" message in the
|
||||
# messages list, and override instructions in the session properties
|
||||
if self._context and self._context._session_instructions:
|
||||
settings.instructions = self._context._session_instructions
|
||||
await self.send_client_event(events.SessionUpdateEvent(session=settings))
|
||||
|
||||
#
|
||||
# inbound server event handling
|
||||
# https://platform.openai.com/docs/api-reference/realtime-server-events
|
||||
#
|
||||
|
||||
async def _receive_task_handler(self):
|
||||
async for message in self._websocket:
|
||||
evt = events.parse_server_event(message)
|
||||
if evt.type == "session.created":
|
||||
await self._handle_evt_session_created(evt)
|
||||
elif evt.type == "session.updated":
|
||||
await self._handle_evt_session_updated(evt)
|
||||
elif evt.type == "response.output_audio.delta":
|
||||
await self._handle_evt_audio_delta(evt)
|
||||
elif evt.type == "response.output_audio.done":
|
||||
await self._handle_evt_audio_done(evt)
|
||||
elif evt.type == "conversation.item.added":
|
||||
await self._handle_evt_conversation_item_added(evt)
|
||||
elif evt.type == "conversation.item.done":
|
||||
await self._handle_evt_conversation_item_done(evt)
|
||||
elif evt.type == "conversation.item.input_audio_transcription.delta":
|
||||
await self._handle_evt_input_audio_transcription_delta(evt)
|
||||
elif evt.type == "conversation.item.input_audio_transcription.completed":
|
||||
await self.handle_evt_input_audio_transcription_completed(evt)
|
||||
elif evt.type == "conversation.item.retrieved":
|
||||
await self._handle_conversation_item_retrieved(evt)
|
||||
elif evt.type == "response.done":
|
||||
await self._handle_evt_response_done(evt)
|
||||
elif evt.type == "input_audio_buffer.speech_started":
|
||||
await self._handle_evt_speech_started(evt)
|
||||
elif evt.type == "input_audio_buffer.speech_stopped":
|
||||
await self._handle_evt_speech_stopped(evt)
|
||||
elif evt.type == "response.output_text.delta":
|
||||
await self._handle_evt_text_delta(evt)
|
||||
elif evt.type == "response.output_audio_transcript.delta":
|
||||
await self._handle_evt_audio_transcript_delta(evt)
|
||||
elif evt.type == "response.function_call_arguments.done":
|
||||
await self._handle_evt_function_call_arguments_done(evt)
|
||||
elif evt.type == "error":
|
||||
if not await self._maybe_handle_evt_retrieve_conversation_item_error(evt):
|
||||
await self._handle_evt_error(evt)
|
||||
# errors are fatal, so exit the receive loop
|
||||
return
|
||||
|
||||
@traced_openai_realtime(operation="llm_setup")
|
||||
async def _handle_evt_session_created(self, evt):
|
||||
# session.created is received right after connecting. Send a message
|
||||
# to configure the session properties.
|
||||
await self._update_settings()
|
||||
|
||||
async def _handle_evt_session_updated(self, evt):
|
||||
# If this is our first context frame, run the LLM
|
||||
self._api_session_ready = True
|
||||
# Now that we've configured the session, we can run the LLM if we need to.
|
||||
if self._run_llm_when_api_session_ready:
|
||||
self._run_llm_when_api_session_ready = False
|
||||
await self._create_response()
|
||||
|
||||
async def _handle_evt_audio_delta(self, evt):
|
||||
# note: ttfb is faster by 1/2 RTT than ttfb as measured for other services, since we're getting
|
||||
# this event from the server
|
||||
await self.stop_ttfb_metrics()
|
||||
if not self._current_audio_response:
|
||||
self._current_audio_response = CurrentAudioResponse(
|
||||
item_id=evt.item_id,
|
||||
content_index=evt.content_index,
|
||||
start_time_ms=int(time.time() * 1000),
|
||||
)
|
||||
await self.push_frame(TTSStartedFrame())
|
||||
audio = base64.b64decode(evt.delta)
|
||||
self._current_audio_response.total_size += len(audio)
|
||||
frame = TTSAudioRawFrame(
|
||||
audio=audio,
|
||||
sample_rate=24000,
|
||||
num_channels=1,
|
||||
)
|
||||
await self.push_frame(frame)
|
||||
|
||||
async def _handle_evt_audio_done(self, evt):
|
||||
if self._current_audio_response:
|
||||
await self.push_frame(TTSStoppedFrame())
|
||||
# Don't clear the self._current_audio_response here. We need to wait until we
|
||||
# receive a BotStoppedSpeakingFrame from the output transport.
|
||||
|
||||
async def _handle_evt_conversation_item_added(self, evt):
|
||||
"""Handle conversation.item.added event - item is added but may still be processing."""
|
||||
if evt.item.type == "function_call":
|
||||
# Track this function call for when arguments are completed
|
||||
# Only add if not already tracked (prevent duplicates)
|
||||
if evt.item.call_id not in self._pending_function_calls:
|
||||
self._pending_function_calls[evt.item.call_id] = evt.item
|
||||
else:
|
||||
logger.warning(f"Function call {evt.item.call_id} already tracked, skipping")
|
||||
|
||||
await self._call_event_handler("on_conversation_item_created", evt.item.id, evt.item)
|
||||
|
||||
# This will get sent from the server every time a new "message" is added
|
||||
# to the server's conversation state, whether we create it via the API
|
||||
# or the server creates it from LLM output.
|
||||
if self._messages_added_manually.get(evt.item.id):
|
||||
del self._messages_added_manually[evt.item.id]
|
||||
return
|
||||
|
||||
if evt.item.role == "user":
|
||||
# We need to wait for completion of both user message and response message. Then we'll
|
||||
# add both to the context. User message is complete when we have a "transcript" field
|
||||
# that is not None. Response message is complete when we get a "response.done" event.
|
||||
self._user_and_response_message_tuple = (evt.item, {"done": False, "output": []})
|
||||
elif evt.item.role == "assistant":
|
||||
self._current_assistant_response = evt.item
|
||||
await self.push_frame(LLMFullResponseStartFrame())
|
||||
|
||||
async def _handle_evt_conversation_item_done(self, evt):
|
||||
"""Handle conversation.item.done event - item is fully completed."""
|
||||
await self._call_event_handler("on_conversation_item_updated", evt.item.id, evt.item)
|
||||
# The item is now fully processed and ready
|
||||
# For now, no additional logic needed beyond the event handler call
|
||||
|
||||
async def _handle_evt_input_audio_transcription_delta(self, evt):
|
||||
if self._send_transcription_frames:
|
||||
await self.push_frame(
|
||||
# no way to get a language code?
|
||||
InterimTranscriptionFrame(evt.delta, "", time_now_iso8601(), result=evt)
|
||||
)
|
||||
|
||||
@traced_stt
|
||||
async def _handle_user_transcription(
|
||||
self, transcript: str, is_final: bool, language: Optional[Language] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing."""
|
||||
pass
|
||||
|
||||
async def handle_evt_input_audio_transcription_completed(self, evt):
|
||||
"""Handle completion of input audio transcription.
|
||||
|
||||
Args:
|
||||
evt: The transcription completed event.
|
||||
"""
|
||||
await self._call_event_handler("on_conversation_item_updated", evt.item_id, None)
|
||||
|
||||
if self._send_transcription_frames:
|
||||
await self.push_frame(
|
||||
# no way to get a language code?
|
||||
TranscriptionFrame(evt.transcript, "", time_now_iso8601(), result=evt)
|
||||
)
|
||||
await self._handle_user_transcription(evt.transcript, True, Language.EN)
|
||||
pair = self._user_and_response_message_tuple
|
||||
if pair:
|
||||
user, assistant = pair
|
||||
user.content[0].transcript = evt.transcript
|
||||
if assistant["done"]:
|
||||
self._user_and_response_message_tuple = None
|
||||
self._context.add_user_content_item_as_message(user)
|
||||
else:
|
||||
# User message without preceding conversation.item.added. Bug?
|
||||
logger.warning(f"Transcript for unknown user message: {evt}")
|
||||
|
||||
    async def _handle_conversation_item_retrieved(self, evt: events.ConversationItemRetrieved):
        futures = self._retrieve_conversation_item_futures.pop(evt.item.id, None)
        if futures:
            for future in futures:
                future.set_result(evt.item)

    @traced_openai_realtime(operation="llm_response")
    async def _handle_evt_response_done(self, evt):
        # todo: figure out whether there's anything we need to do for "cancelled" events
        # usage metrics
        tokens = LLMTokenUsage(
            prompt_tokens=evt.response.usage.input_tokens,
            completion_tokens=evt.response.usage.output_tokens,
            total_tokens=evt.response.usage.total_tokens,
        )
        await self.start_llm_usage_metrics(tokens)
        await self.stop_processing_metrics()
        await self.push_frame(LLMFullResponseEndFrame())
        self._current_assistant_response = None
        # error handling
        if evt.response.status == "failed":
            await self.push_error(
                ErrorFrame(error=evt.response.status_details["error"]["message"], fatal=True)
            )
            return
        # response content
        for item in evt.response.output:
            await self._call_event_handler("on_conversation_item_updated", item.id, item)
        pair = self._user_and_response_message_tuple
        if pair:
            user, assistant = pair
            assistant["done"] = True
            assistant["output"] = evt.response.output
            if user.content[0].transcript is not None:
                self._user_and_response_message_tuple = None
                self._context.add_user_content_item_as_message(user)
        else:
            # Response message without preceding user message (standalone response)
            # Function calls in this response were already processed immediately when arguments were complete
            logger.debug(f"Handling standalone response: {evt.response.id}")

    async def _handle_evt_text_delta(self, evt):
        if evt.delta:
            await self.push_frame(LLMTextFrame(evt.delta))

    async def _handle_evt_audio_transcript_delta(self, evt):
        if evt.delta:
            await self.push_frame(LLMTextFrame(evt.delta))
            await self.push_frame(TTSTextFrame(evt.delta))

    async def _handle_evt_function_call_arguments_done(self, evt):
        """Handle completion of function call arguments.

        Args:
            evt: The response.function_call_arguments.done event.
        """
        # Process the function call immediately when arguments are complete
        # This is needed because function calls might not trigger response.done
        try:
            # Parse the arguments
            args = json.loads(evt.arguments)

            # Get the function call item we tracked earlier
            function_call_item = self._pending_function_calls.get(evt.call_id)
            if function_call_item:
                # Remove from pending calls FIRST to prevent duplicate processing
                del self._pending_function_calls[evt.call_id]

                # Create the function call and process it
                function_calls = [
                    FunctionCallFromLLM(
                        context=self._context,
                        tool_call_id=evt.call_id,
                        function_name=function_call_item.name,
                        arguments=args,
                    )
                ]

                await self.run_function_calls(function_calls)
                logger.debug(f"Processed function call: {function_call_item.name}")
            else:
                logger.warning(f"No tracked function call found for call_id: {evt.call_id}")
                logger.warning(
                    f"Available pending calls: {list(self._pending_function_calls.keys())}"
                )

        except Exception as e:
            logger.error(f"Failed to process function call arguments: {e}")

    async def _handle_evt_speech_started(self, evt):
        await self._truncate_current_audio_response()
        await self._start_interruption()  # cancels this processor task
        await self.push_frame(StartInterruptionFrame())  # cancels downstream tasks
        await self.push_frame(UserStartedSpeakingFrame())

    async def _handle_evt_speech_stopped(self, evt):
        await self.start_ttfb_metrics()
        await self.start_processing_metrics()
        await self._stop_interruption()
        await self.push_frame(UserStoppedSpeakingFrame())

    async def _maybe_handle_evt_retrieve_conversation_item_error(self, evt: events.ErrorEvent):
        """Maybe handle an error event related to retrieving a conversation item.

        If the given error event is an error retrieving a conversation item:

        - set an exception on the future that retrieve_conversation_item() is waiting on
        - return true
        Otherwise:
        - return false
        """
        if evt.error.code == "item_retrieve_invalid_item_id":
            item_id = evt.error.event_id.split("_", 1)[1]  # event_id is of the form "rci_{item_id}"
            futures = self._retrieve_conversation_item_futures.pop(item_id, None)
            if futures:
                for future in futures:
                    future.set_exception(Exception(evt.error.message))
            return True
        return False

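The two retrieval handlers above resolve or fail futures that are stored per item ID. The pattern can be sketched in isolation as follows. This is a minimal illustration, not Pipecat's API: the `ItemRetriever` class and its `wait_for`, `resolve`, and `fail` methods are hypothetical names chosen for the example.

```python
import asyncio


class ItemRetriever:
    """Hypothetical helper: tracks futures waiting on an item, keyed by item ID."""

    def __init__(self):
        self._futures = {}

    def wait_for(self, item_id):
        # Several callers may wait on the same item, so keep a list per ID.
        future = asyncio.get_running_loop().create_future()
        self._futures.setdefault(item_id, []).append(future)
        return future

    def resolve(self, item_id, item):
        # pop() removes the entry, so a repeated event cannot resolve twice.
        for future in self._futures.pop(item_id, []):
            future.set_result(item)

    def fail(self, item_id, message):
        for future in self._futures.pop(item_id, []):
            future.set_exception(Exception(message))


async def demo():
    retriever = ItemRetriever()
    ok = retriever.wait_for("item_1")
    bad = retriever.wait_for("item_2")
    retriever.resolve("item_1", {"id": "item_1"})
    retriever.fail("item_2", "invalid item id")
    result = await ok
    try:
        await bad
        error = None
    except Exception as e:
        error = str(e)
    return result, error
```

Popping the entry before touching the futures means a late or duplicate server event finds nothing to resolve, which avoids `InvalidStateError` on an already completed future.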
    async def _handle_evt_error(self, evt):
        # Errors are fatal to this connection. Send an ErrorFrame.
        await self.push_error(ErrorFrame(error=f"Error: {evt}", fatal=True))

    #
    # state and client events for the current conversation
    # https://platform.openai.com/docs/api-reference/realtime-client-events
    #

    async def reset_conversation(self):
        """Reset the conversation by disconnecting and reconnecting.

        This is the safest way to start a new conversation. Note that this will
        fail if called from the receive task.
        """
        logger.debug("Resetting conversation")
        await self._disconnect()
        if self._context:
            self._context.llm_needs_settings_update = True
            self._context.llm_needs_initial_messages = True
        await self._connect()

    @traced_openai_realtime(operation="llm_request")
    async def _create_response(self):
        if not self._api_session_ready:
            self._run_llm_when_api_session_ready = True
            return

        if self._context.llm_needs_initial_messages:
            messages = self._context.get_messages_for_initializing_history()
            for item in messages:
                evt = events.ConversationItemCreateEvent(item=item)
                self._messages_added_manually[evt.item.id] = True
                await self.send_client_event(evt)
            self._context.llm_needs_initial_messages = False

        if self._context.llm_needs_settings_update:
            await self._update_settings()
            self._context.llm_needs_settings_update = False

        logger.debug(f"Creating response: {self._context.get_messages_for_logging()}")

        await self.push_frame(LLMFullResponseStartFrame())
        await self.start_processing_metrics()
        await self.start_ttfb_metrics()
        await self.send_client_event(
            events.ResponseCreateEvent(
                response=events.ResponseProperties(output_modalities=self._get_enabled_modalities())
            )
        )

    async def _send_user_audio(self, frame):
        payload = base64.b64encode(frame.audio).decode("utf-8")
        await self.send_client_event(events.InputAudioBufferAppendEvent(audio=payload))

    def create_context_aggregator(
        self,
        context: OpenAILLMContext,
        *,
        user_params: LLMUserAggregatorParams = LLMUserAggregatorParams(),
        assistant_params: LLMAssistantAggregatorParams = LLMAssistantAggregatorParams(),
    ) -> OpenAIContextAggregatorPair:
        """Create an instance of OpenAIContextAggregatorPair from an OpenAILLMContext.

        Constructor keyword arguments for both the user and assistant aggregators can be provided.

        Args:
            context: The LLM context.
            user_params: User aggregator parameters.
            assistant_params: Assistant aggregator parameters.

        Returns:
            OpenAIContextAggregatorPair: A pair of context aggregators, one for
            the user and one for the assistant, encapsulated in an
            OpenAIContextAggregatorPair.
        """
        context.set_llm_adapter(self.get_llm_adapter())

        OpenAIRealtimeLLMContext.upgrade_to_realtime(context)
        user = OpenAIRealtimeUserContextAggregator(context, params=user_params)

        assistant_params.expect_stripped_words = False
        assistant = OpenAIRealtimeAssistantContextAggregator(context, params=assistant_params)
        return OpenAIContextAggregatorPair(_user=user, _assistant=assistant)
@@ -6,6 +6,8 @@

 """Azure OpenAI Realtime Beta LLM service implementation."""

+import warnings
+
 from loguru import logger

 from .openai import OpenAIRealtimeBetaLLMService
@@ -23,6 +25,10 @@ except ModuleNotFoundError as e:
 class AzureRealtimeBetaLLMService(OpenAIRealtimeBetaLLMService):
     """Azure OpenAI Realtime Beta LLM service with Azure-specific authentication.

+    .. deprecated:: 0.0.84
+        `AzureRealtimeBetaLLMService` is deprecated, use `AzureRealtimeLLMService` instead.
+        This class will be removed in version 1.0.0.
+
     Extends the OpenAI Realtime service to work with Azure OpenAI endpoints,
     using Azure's authentication headers and endpoint format. Provides the same
     real-time audio and text communication capabilities as the base OpenAI service.
@@ -44,6 +50,16 @@ class AzureRealtimeBetaLLMService(OpenAIRealtimeBetaLLMService):
             **kwargs: Additional arguments passed to parent OpenAIRealtimeBetaLLMService.
         """
         super().__init__(base_url=base_url, api_key=api_key, **kwargs)
+
+        with warnings.catch_warnings():
+            warnings.simplefilter("always")
+            warnings.warn(
+                "AzureRealtimeBetaLLMService is deprecated and will be removed in version 1.0.0. "
+                "Use AzureRealtimeLLMService instead.",
+                DeprecationWarning,
+                stacklevel=2,
+            )
+
         self.api_key = api_key
         self.base_url = base_url

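The deprecation block added to the constructor above can be tried out in isolation. The sketch below assumes two hypothetical classes, `OldService` and `NewService`, standing in for the Beta service and its replacement; only the `warnings` calls are taken from the diff.

```python
import warnings


class NewService:
    """Hypothetical replacement class."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class OldService(NewService):
    """Hypothetical deprecated class that still works but warns on construction."""

    def __init__(self, **kwargs):
        with warnings.catch_warnings():
            # DeprecationWarning is often filtered out by default;
            # "always" forces it to be emitted for this call only.
            warnings.simplefilter("always")
            warnings.warn(
                "OldService is deprecated. Use NewService instead.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller
            )
        super().__init__(**kwargs)


def construct_and_capture():
    # record=True collects emitted warnings in a list instead of printing them.
    with warnings.catch_warnings(record=True) as caught:
        service = OldService(model="demo")
    return service, caught
```

Because the filter change lives inside `catch_warnings()`, the caller's own warning filters are restored as soon as the block exits.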
@@ -9,6 +9,7 @@
 import base64
 import json
 import time
+import warnings
 from dataclasses import dataclass
 from typing import Optional

@@ -92,6 +93,10 @@ class CurrentAudioResponse:
 class OpenAIRealtimeBetaLLMService(LLMService):
     """OpenAI Realtime Beta LLM service providing real-time audio and text communication.

+    .. deprecated:: 0.0.84
+        `OpenAIRealtimeBetaLLMService` is deprecated, use `OpenAIRealtimeLLMService` instead.
+        This class will be removed in version 1.0.0.
+
     Implements the OpenAI Realtime API Beta with WebSocket communication for low-latency
     bidirectional audio and text interactions. Supports function calling, conversation
     management, and real-time transcription.
@@ -124,6 +129,15 @@ class OpenAIRealtimeBetaLLMService(LLMService):
             send_transcription_frames: Whether to emit transcription frames. Defaults to True.
             **kwargs: Additional arguments passed to parent LLMService.
         """
+        with warnings.catch_warnings():
+            warnings.simplefilter("always")
+            warnings.warn(
+                "OpenAIRealtimeBetaLLMService is deprecated and will be removed in version 1.0.0. "
+                "Use OpenAIRealtimeLLMService instead.",
+                DeprecationWarning,
+                stacklevel=2,
+            )
+
         full_url = f"{base_url}?model={model}"
         super().__init__(base_url=full_url, **kwargs)

@@ -14,7 +14,8 @@ visual content.
 from abc import abstractmethod
 from typing import AsyncGenerator

-from pipecat.frames.frames import Frame, VisionImageRawFrame
+from pipecat.frames.frames import Frame, LLMContextFrame
+from pipecat.processors.aggregators.llm_context import LLMContext
 from pipecat.processors.frame_processor import FrameDirection
 from pipecat.services.ai_service import AIService

@@ -37,15 +38,15 @@ class VisionService(AIService):
         self._describe_text = None

     @abstractmethod
-    async def run_vision(self, frame: VisionImageRawFrame) -> AsyncGenerator[Frame, None]:
-        """Process a vision image frame and generate results.
+    async def run_vision(self, context: LLMContext) -> AsyncGenerator[Frame, None]:
+        """Process the latest image in the context and generate results.

         This method must be implemented by subclasses to provide actual computer
         vision functionality such as image description, object detection, or
         visual question answering.

         Args:
-            frame: The vision image frame to process, containing image data.
+            context: The context to process, containing image data.

         Yields:
             Frame: Frames containing the vision analysis results, typically TextFrame
@@ -65,9 +66,9 @@ class VisionService(AIService):
         """
         await super().process_frame(frame, direction)

-        if isinstance(frame, VisionImageRawFrame):
+        if isinstance(frame, LLMContextFrame):
             await self.start_processing_metrics()
-            await self.process_generator(self.run_vision(frame))
+            await self.process_generator(self.run_vision(frame.context))
             await self.stop_processing_metrics()
         else:
             await self.push_frame(frame, direction)

@@ -219,7 +219,34 @@ class BaseOutputTransport(FrameProcessor):
         pass

     async def write_dtmf(self, frame: OutputDTMFFrame | OutputDTMFUrgentFrame):
-        """Write a DTMF tone to the transport.
+        """Write a DTMF tone using the transport's preferred method.
+
+        Args:
+            frame: The DTMF frame to write.
+        """
+        if self._supports_native_dtmf():
+            await self._write_dtmf_native(frame)
+        else:
+            await self._write_dtmf_audio(frame)
+
+    def _supports_native_dtmf(self) -> bool:
+        """Override in transport implementations that support native DTMF.
+
+        Returns:
+            True if the transport supports native DTMF, False otherwise.
+        """
+        return False
+
+    async def _write_dtmf_native(self, frame: OutputDTMFFrame | OutputDTMFUrgentFrame):
+        """Override in transport implementations for native DTMF.
+
+        Args:
+            frame: The DTMF frame to write.
+        """
+        raise NotImplementedError("Transport claims native DTMF support but doesn't implement it")
+
+    async def _write_dtmf_audio(self, frame: OutputDTMFFrame | OutputDTMFUrgentFrame):
+        """Generate and send audio tones for DTMF.

         Args:
             frame: The DTMF frame to write.
@@ -228,7 +255,6 @@ class BaseOutputTransport(FrameProcessor):
         dtmf_audio_frame = OutputAudioRawFrame(
             audio=dtmf_audio, sample_rate=self._sample_rate, num_channels=1
         )
-        dtmf_audio_frame.transport_destination = frame.transport_destination
         await self.write_audio_frame(dtmf_audio_frame)

     async def send_audio(self, frame: OutputAudioRawFrame):

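The `write_dtmf` change above is a small template-method dispatch: the base class picks a path and subclasses opt in to the native one. The following sketch reproduces the shape with hypothetical stand-in classes (`OutputTransport`, `NativeTransport`) and plain digit strings instead of Pipecat frames.

```python
import asyncio


class OutputTransport:
    """Hypothetical base class mirroring the native-or-audio DTMF dispatch."""

    def __init__(self):
        self.sent = []

    async def write_dtmf(self, digit):
        if self._supports_native_dtmf():
            await self._write_dtmf_native(digit)
        else:
            await self._write_dtmf_audio(digit)

    def _supports_native_dtmf(self):
        return False

    async def _write_dtmf_native(self, digit):
        raise NotImplementedError("native DTMF claimed but not implemented")

    async def _write_dtmf_audio(self, digit):
        # Stand-in for generating and sending tone audio.
        self.sent.append(("audio", digit))


class NativeTransport(OutputTransport):
    """Hypothetical transport that opts in to the native path."""

    def _supports_native_dtmf(self):
        return True

    async def _write_dtmf_native(self, digit):
        self.sent.append(("native", digit))


async def demo():
    fallback, native = OutputTransport(), NativeTransport()
    await fallback.write_dtmf("5")
    await native.write_dtmf("5")
    return fallback.sent, native.sent
```

A subclass that returns `True` from `_supports_native_dtmf` without overriding `_write_dtmf_native` fails loudly with `NotImplementedError` rather than silently dropping the tone.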
@@ -61,9 +61,7 @@ try:
         VirtualCameraDevice,
         VirtualSpeakerDevice,
     )
-    from daily import (
-        LogLevel as DailyLogLevel,
-    )
+    from daily import LogLevel as DailyLogLevel
 except ModuleNotFoundError as e:
     logger.error(f"Exception: {e}")
     logger.error(
@@ -1809,6 +1807,27 @@ class DailyOutputTransport(BaseOutputTransport):
         """
         await self._client.write_video_frame(frame)

+    def _supports_native_dtmf(self) -> bool:
+        """Daily supports native DTMF via telephone events.
+
+        Returns:
+            True, as Daily supports native DTMF transmission.
+        """
+        return True
+
+    async def _write_dtmf_native(self, frame):
+        """Use Daily's native send_dtmf method for telephone events.
+
+        Args:
+            frame: The DTMF frame to write.
+        """
+        await self._client.send_dtmf(
+            {
+                "sessionId": frame.transport_destination,
+                "tones": frame.button.value,
+            }
+        )
+

 class DailyTransport(BaseTransport):
     """Transport implementation for Daily audio and video calls.
@@ -2296,7 +2315,7 @@ class DailyTransport(BaseTransport):
         """Handle participant updated events."""
         await self._call_event_handler("on_participant_updated", participant)

-    async def _on_transcription_message(self, message):
+    async def _on_transcription_message(self, message: Dict[str, Any]) -> None:
         """Handle transcription message events."""
         await self._call_event_handler("on_transcription_message", message)

@@ -2308,9 +2327,10 @@ class DailyTransport(BaseTransport):

         text = message["text"]
         timestamp = message["timestamp"]
-        is_final = message["rawResponse"]["is_final"]
+        raw_response = message.get("rawResponse", {})
+        is_final = raw_response.get("is_final", False)
         try:
-            language = message["rawResponse"]["channel"]["alternatives"][0]["languages"][0]
+            language = raw_response["channel"]["alternatives"][0]["languages"][0]
             language = Language(language)
         except KeyError:
             language = None

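The last hunk above makes transcription parsing tolerant of a missing `rawResponse`. A standalone sketch of the same idea follows. The function name `parse_transcription` is hypothetical, the language is returned as a plain string rather than a `Language` enum, and catching `IndexError` as well is an addition of this example, not part of the diff.

```python
def parse_transcription(message):
    """Extract text, finality, and language code from a transcription message."""
    raw_response = message.get("rawResponse", {})
    # A message without rawResponse is treated as an interim result.
    is_final = raw_response.get("is_final", False)
    try:
        language = raw_response["channel"]["alternatives"][0]["languages"][0]
    except (KeyError, IndexError):
        # Assumption of this sketch: empty lists are treated like missing keys.
        language = None
    return message["text"], is_final, language
```

With this shape, a message that carries only `text` still yields a usable, non-final result instead of raising `KeyError`.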
@@ -12,6 +12,7 @@ event handling for conversational AI applications.
 """

 import asyncio
+import json
 from dataclasses import dataclass
 from typing import Any, Awaitable, Callable, List, Optional

@@ -24,11 +25,15 @@ from pipecat.frames.frames import (
     AudioRawFrame,
     CancelFrame,
     EndFrame,
+    ImageRawFrame,
     OutputAudioRawFrame,
+    OutputDTMFFrame,
+    OutputDTMFUrgentFrame,
     StartFrame,
     TransportMessageFrame,
     TransportMessageUrgentFrame,
     UserAudioRawFrame,
+    UserImageRawFrame,
 )
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessorSetup
 from pipecat.transports.base_input import BaseInputTransport
@@ -38,12 +43,29 @@ from pipecat.utils.asyncio.task_manager import BaseTaskManager

 try:
     from livekit import rtc
+    from livekit.rtc._proto import video_frame_pb2 as proto_video_frame
     from tenacity import retry, stop_after_attempt, wait_exponential
 except ModuleNotFoundError as e:
     logger.error(f"Exception: {e}")
     logger.error("In order to use LiveKit, you need to `pip install pipecat-ai[livekit]`.")
     raise Exception(f"Missing module: {e}")

+# DTMF mapping according to RFC 4733
+DTMF_CODE_MAP = {
+    "0": 0,
+    "1": 1,
+    "2": 2,
+    "3": 3,
+    "4": 4,
+    "5": 5,
+    "6": 6,
+    "7": 7,
+    "8": 8,
+    "9": 9,
+    "*": 10,
+    "#": 11,
+}
+

 @dataclass
 class LiveKitTransportMessageFrame(TransportMessageFrame):
@@ -96,6 +118,8 @@ class LiveKitCallbacks(BaseModel):
     on_participant_disconnected: Callable[[str], Awaitable[None]]
     on_audio_track_subscribed: Callable[[str], Awaitable[None]]
     on_audio_track_unsubscribed: Callable[[str], Awaitable[None]]
+    on_video_track_subscribed: Callable[[str], Awaitable[None]]
+    on_video_track_unsubscribed: Callable[[str], Awaitable[None]]
     on_data_received: Callable[[bytes, str], Awaitable[None]]
     on_first_participant_joined: Callable[[str], Awaitable[None]]

@@ -140,8 +164,11 @@ class LiveKitTransportClient:
         self._audio_track: Optional[rtc.LocalAudioTrack] = None
         self._audio_tracks = {}
         self._audio_queue = asyncio.Queue()
+        self._video_tracks = {}
+        self._video_queue = asyncio.Queue()
         self._other_participant_has_joined = False
         self._task_manager: Optional[BaseTaskManager] = None
+        self._async_lock = asyncio.Lock()

     @property
     def participant_id(self) -> str:
@@ -202,61 +229,63 @@ class LiveKitTransportClient:
     @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
     async def connect(self):
         """Connect to the LiveKit room with retry logic."""
-        if self._connected:
-            # Increment disconnect counter if already connected.
-            self._disconnect_counter += 1
-            return
+        async with self._async_lock:
+            if self._connected:
+                # Increment disconnect counter if already connected.
+                self._disconnect_counter += 1
+                return

-        logger.info(f"Connecting to {self._room_name}")
+            logger.info(f"Connecting to {self._room_name}")

-        try:
-            await self.room.connect(
-                self._url,
-                self._token,
-                options=rtc.RoomOptions(auto_subscribe=True),
-            )
-            self._connected = True
-            # Increment disconnect counter if we successfully connected.
-            self._disconnect_counter += 1
+            try:
+                await self.room.connect(
+                    self._url,
+                    self._token,
+                    options=rtc.RoomOptions(auto_subscribe=True),
+                )
+                self._connected = True
+                # Increment disconnect counter if we successfully connected.
+                self._disconnect_counter += 1

-            self._participant_id = self.room.local_participant.sid
-            logger.info(f"Connected to {self._room_name}")
+                self._participant_id = self.room.local_participant.sid
+                logger.info(f"Connected to {self._room_name}")

-            # Set up audio source and track
-            self._audio_source = rtc.AudioSource(
-                self._out_sample_rate, self._params.audio_out_channels
-            )
-            self._audio_track = rtc.LocalAudioTrack.create_audio_track(
-                "pipecat-audio", self._audio_source
-            )
-            options = rtc.TrackPublishOptions()
-            options.source = rtc.TrackSource.SOURCE_MICROPHONE
-            await self.room.local_participant.publish_track(self._audio_track, options)
+                # Set up audio source and track
+                self._audio_source = rtc.AudioSource(
+                    self._out_sample_rate, self._params.audio_out_channels
+                )
+                self._audio_track = rtc.LocalAudioTrack.create_audio_track(
+                    "pipecat-audio", self._audio_source
+                )
+                options = rtc.TrackPublishOptions()
+                options.source = rtc.TrackSource.SOURCE_MICROPHONE
+                await self.room.local_participant.publish_track(self._audio_track, options)

-            await self._callbacks.on_connected()
+                await self._callbacks.on_connected()

-            # Check if there are already participants in the room
-            participants = self.get_participants()
-            if participants and not self._other_participant_has_joined:
-                self._other_participant_has_joined = True
-                await self._callbacks.on_first_participant_joined(participants[0])
-        except Exception as e:
-            logger.error(f"Error connecting to {self._room_name}: {e}")
-            raise
+                # Check if there are already participants in the room
+                participants = self.get_participants()
+                if participants and not self._other_participant_has_joined:
+                    self._other_participant_has_joined = True
+                    await self._callbacks.on_first_participant_joined(participants[0])
+            except Exception as e:
+                logger.error(f"Error connecting to {self._room_name}: {e}")
+                raise

     async def disconnect(self):
         """Disconnect from the LiveKit room."""
-        # Decrement leave counter when leaving.
-        self._disconnect_counter -= 1
+        async with self._async_lock:
+            # Decrement leave counter when leaving.
+            self._disconnect_counter -= 1

-        if not self._connected or self._disconnect_counter > 0:
-            return
+            if not self._connected or self._disconnect_counter > 0:
+                return

-        logger.info(f"Disconnecting from {self._room_name}")
-        await self.room.disconnect()
-        self._connected = False
-        logger.info(f"Disconnected from {self._room_name}")
-        await self._callbacks.on_disconnected()
+            logger.info(f"Disconnecting from {self._room_name}")
+            await self.room.disconnect()
+            self._connected = False
+            logger.info(f"Disconnected from {self._room_name}")
+            await self._callbacks.on_disconnected()

     async def send_data(self, data: bytes, participant_id: Optional[str] = None):
         """Send data to participants in the room.
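The hunk above wraps `connect()` and `disconnect()` in an `asyncio.Lock` so that concurrent callers cannot both see "not connected" and open the room twice. A reduced sketch of that lock plus reference counter idea follows; `SharedConnection` and its counters are hypothetical, and `asyncio.sleep(0)` stands in for the real network calls.

```python
import asyncio


class SharedConnection:
    """Hypothetical client shared by several users, connected at most once."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._connected = False
        self._users = 0
        self.connect_calls = 0
        self.disconnect_calls = 0

    async def connect(self):
        async with self._lock:
            if self._connected:
                # Already connected: only record one more user.
                self._users += 1
                return
            await asyncio.sleep(0)  # stand-in for the real handshake
            self._connected = True
            self._users += 1
            self.connect_calls += 1

    async def disconnect(self):
        async with self._lock:
            self._users -= 1
            if not self._connected or self._users > 0:
                return
            await asyncio.sleep(0)  # stand-in for the real teardown
            self._connected = False
            self.disconnect_calls += 1


async def demo():
    conn = SharedConnection()
    # Two concurrent connects must result in a single handshake.
    await asyncio.gather(conn.connect(), conn.connect())
    await conn.disconnect()
    still_connected = conn._connected
    await conn.disconnect()
    return conn.connect_calls, still_connected, conn.disconnect_calls
```

Without the lock, the second `connect()` could pass the `_connected` check while the first is still awaiting the handshake, which is the race the diff appears to close.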
@@ -278,6 +307,26 @@ class LiveKitTransportClient:
         except Exception as e:
             logger.error(f"Error sending data: {e}")

+    async def send_dtmf(self, digit: str):
+        """Send DTMF tone to the room.
+
+        Args:
+            digit: The DTMF digit to send (0-9, *, #).
+        """
+        if not self._connected:
+            return
+
+        if digit not in DTMF_CODE_MAP:
+            logger.warning(f"Invalid DTMF digit: {digit}")
+            return
+
+        code = DTMF_CODE_MAP[digit]
+
+        try:
+            await self.room.local_participant.publish_dtmf(code=code, digit=digit)
+        except Exception as e:
+            logger.error(f"Error sending DTMF tone {digit}: {e}")
+
     async def publish_audio(self, audio_frame: rtc.AudioFrame):
         """Publish an audio frame to the room.

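`send_dtmf` above validates a digit against `DTMF_CODE_MAP` before publishing it. The lookup step can be illustrated on a whole dial string. The helper `dtmf_codes` below is hypothetical, and the map is rebuilt here only so the example is self-contained.

```python
# Event codes for the twelve keypad digits, matching the diff's DTMF_CODE_MAP.
DTMF_CODE_MAP = {**{str(n): n for n in range(10)}, "*": 10, "#": 11}


def dtmf_codes(sequence):
    """Translate a dial string into (code, digit) pairs, skipping unknown characters."""
    return [(DTMF_CODE_MAP[ch], ch) for ch in sequence if ch in DTMF_CODE_MAP]
```

Each resulting pair corresponds to the `code` and `digit` arguments that the diff passes to `publish_dtmf`.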
@@ -439,6 +488,15 @@ class LiveKitTransportClient:
                 f"{self}::_process_audio_stream",
             )
             await self._callbacks.on_audio_track_subscribed(participant.sid)
+        elif track.kind == rtc.TrackKind.KIND_VIDEO:
+            logger.info(f"Video track subscribed: {track.sid} from participant {participant.sid}")
+            self._video_tracks[participant.sid] = track
+            video_stream = rtc.VideoStream(track)
+            self._task_manager.create_task(
+                self._process_video_stream(video_stream, participant.sid),
+                f"{self}::_process_video_stream",
+            )
+            await self._callbacks.on_video_track_subscribed(participant.sid)

     async def _async_on_track_unsubscribed(
         self,
@@ -450,6 +508,8 @@ class LiveKitTransportClient:
         logger.info(f"Track unsubscribed: {publication.sid} from {participant.identity}")
         if track.kind == rtc.TrackKind.KIND_AUDIO:
             await self._callbacks.on_audio_track_unsubscribed(participant.sid)
+        elif track.kind == rtc.TrackKind.KIND_VIDEO:
+            await self._callbacks.on_video_track_unsubscribed(participant.sid)

     async def _async_on_data_received(self, data: rtc.DataPacket):
         """Handle data received events."""
@@ -480,6 +540,21 @@ class LiveKitTransportClient:
             frame, participant_id = await self._audio_queue.get()
             yield frame, participant_id

+    async def _process_video_stream(self, video_stream: rtc.VideoStream, participant_id: str):
+        """Process incoming video stream from a participant."""
+        logger.info(f"Started processing video stream for participant {participant_id}")
+        async for event in video_stream:
+            if isinstance(event, rtc.VideoFrameEvent):
+                await self._video_queue.put((event, participant_id))
+            else:
+                logger.warning(f"Received unexpected event type: {type(event)}")
+
+    async def get_next_video_frame(self):
+        """Get the next video frame from the queue."""
+        while True:
+            frame, participant_id = await self._video_queue.get()
+            yield frame, participant_id
+
     def __str__(self):
         """String representation of the LiveKit transport client."""
         return f"{self._transport_name}::LiveKitTransportClient"
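The video path above decouples the stream reader from the consumer with an `asyncio.Queue` that is drained by an endless async generator. A self-contained sketch of that producer and consumer pairing follows; `FrameQueue` and the string "frames" are hypothetical placeholders for the LiveKit types.

```python
import asyncio


class FrameQueue:
    """Hypothetical buffer between a stream reader and a frame consumer."""

    def __init__(self):
        self._queue = asyncio.Queue()

    async def produce(self, frames, participant_id):
        # Mirrors _process_video_stream: tag each frame with its sender.
        for frame in frames:
            await self._queue.put((frame, participant_id))

    async def frames(self):
        # Mirrors get_next_video_frame: an endless async generator.
        while True:
            yield await self._queue.get()


async def demo():
    buffer = FrameQueue()
    await buffer.produce(["f1", "f2"], "alice")
    received = []
    async for item in buffer.frames():
        received.append(item)
        if len(received) == 2:
            break  # the generator never ends on its own
    return received
```

Since the generator never terminates by itself, the consuming task has to be cancelled on shutdown, which is what the `cancel_task(self._video_in_task)` calls in the input transport do.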
@@ -512,6 +587,7 @@ class LiveKitInputTransport(BaseInputTransport):
         self._client = client

         self._audio_in_task = None
+        self._video_in_task = None
         self._vad_analyzer: Optional[VADAnalyzer] = params.vad_analyzer
         self._resampler = create_stream_resampler()

@@ -544,6 +620,8 @@ class LiveKitInputTransport(BaseInputTransport):
         await self._client.connect()
         if not self._audio_in_task and self._params.audio_in_enabled:
             self._audio_in_task = self.create_task(self._audio_in_task_handler())
+        if not self._video_in_task and self._params.video_in_enabled:
+            self._video_in_task = self.create_task(self._video_in_task_handler())
         await self.set_transport_ready(frame)
         logger.info("LiveKitInputTransport started")

@@ -557,6 +635,8 @@ class LiveKitInputTransport(BaseInputTransport):
         await self._client.disconnect()
         if self._audio_in_task:
             await self.cancel_task(self._audio_in_task)
+        if self._video_in_task:
+            await self.cancel_task(self._video_in_task)
         logger.info("LiveKitInputTransport stopped")

     async def cancel(self, frame: CancelFrame):
@@ -569,6 +649,8 @@ class LiveKitInputTransport(BaseInputTransport):
         await self._client.disconnect()
         if self._audio_in_task and self._params.audio_in_enabled:
             await self.cancel_task(self._audio_in_task)
+        if self._video_in_task and self._params.video_in_enabled:
+            await self.cancel_task(self._video_in_task)

     async def setup(self, setup: FrameProcessorSetup):
         """Setup the input transport with shared client setup.
@@ -617,6 +699,29 @@ class LiveKitInputTransport(BaseInputTransport):
                 )
                 await self.push_audio_frame(input_audio_frame)

+    async def _video_in_task_handler(self):
+        """Handle incoming video frames from participants."""
+        logger.info("Video input task started")
+        video_iterator = self._client.get_next_video_frame()
+        async for video_data in video_iterator:
+            if video_data:
+                video_frame_event, participant_id = video_data
+                pipecat_video_frame = await self._convert_livekit_video_to_pipecat(
+                    video_frame_event=video_frame_event
+                )
+
+                # Skip frames with no video data
+                if len(pipecat_video_frame.image) == 0:
+                    continue
+
+                input_video_frame = UserImageRawFrame(
+                    user_id=participant_id,
+                    image=pipecat_video_frame.image,
+                    size=pipecat_video_frame.size,
+                    format=pipecat_video_frame.format,
+                )
+                await self.push_video_frame(input_video_frame)
+
     async def _convert_livekit_audio_to_pipecat(
         self, audio_frame_event: rtc.AudioFrameEvent
     ) -> AudioRawFrame:
@@ -633,6 +738,19 @@ class LiveKitInputTransport(BaseInputTransport):
             num_channels=audio_frame.num_channels,
         )

+    async def _convert_livekit_video_to_pipecat(
+        self,
+        video_frame_event: rtc.VideoFrameEvent,
+    ) -> ImageRawFrame:
+        """Convert LiveKit video frame to Pipecat video frame."""
+        rgb_frame = video_frame_event.frame.convert(proto_video_frame.VideoBufferType.RGB24)
+        image_frame = ImageRawFrame(
+            image=rgb_frame.data,
+            size=(rgb_frame.width, rgb_frame.height),
+            format="RGB",
+        )
+        return image_frame
+

 class LiveKitOutputTransport(BaseOutputTransport):
     """Handles outgoing media streams and events to LiveKit rooms.
@@ -720,10 +838,14 @@ class LiveKitOutputTransport(BaseOutputTransport):
         Args:
             frame: The transport message frame to send.
         """
+        message = frame.message
+        if isinstance(message, dict):
+            # fix message encoding for dict-like messages, e.g. RTVI messages.
+            message = json.dumps(message, ensure_ascii=False)
         if isinstance(frame, (LiveKitTransportMessageFrame, LiveKitTransportMessageUrgentFrame)):
-            await self._client.send_data(frame.message.encode(), frame.participant_id)
+            await self._client.send_data(message.encode(), frame.participant_id)
         else:
-            await self._client.send_data(frame.message.encode())
+            await self._client.send_data(message.encode())

     async def write_audio_frame(self, frame: OutputAudioRawFrame):
         """Write an audio frame to the LiveKit room.
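The message hunk above fixes a crash for dict payloads: a `dict` has no `.encode()`, so it must be serialized to JSON before being turned into bytes. A minimal sketch of that normalization, with the hypothetical helper name `encode_message`:

```python
import json


def encode_message(message):
    """Return the bytes to send for a str or dict transport message."""
    if isinstance(message, dict):
        # ensure_ascii=False keeps non-ASCII text as-is instead of \uXXXX escapes.
        message = json.dumps(message, ensure_ascii=False)
    return message.encode()
```

String messages pass through unchanged, so existing callers that already send pre-serialized text are not affected.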
@@ -734,6 +856,22 @@ class LiveKitOutputTransport(BaseOutputTransport):
         livekit_audio = self._convert_pipecat_audio_to_livekit(frame.audio)
         await self._client.publish_audio(livekit_audio)
 
+    def _supports_native_dtmf(self) -> bool:
+        """LiveKit supports native DTMF via telephone events.
+
+        Returns:
+            True, as LiveKit supports native DTMF transmission.
+        """
+        return True
+
+    async def _write_dtmf_native(self, frame: OutputDTMFFrame | OutputDTMFUrgentFrame):
+        """Use LiveKit's native publish_dtmf method for telephone events.
+
+        Args:
+            frame: The DTMF frame to write.
+        """
+        await self._client.send_dtmf(frame.button.value)
+
     def _convert_pipecat_audio_to_livekit(self, pipecat_audio: bytes) -> rtc.AudioFrame:
         """Convert Pipecat audio data to LiveKit audio frame."""
         bytes_per_sample = 2  # Assuming 16-bit audio
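The two DTMF hooks added in the hunk above suggest a capability check followed by a native write. The sketch below shows one way such a dispatch could look. Everything in it is a stand-in: `KeypadEntry`, `SketchDTMFFrame`, `FakeClient`, `SketchOutputTransport`, `write_dtmf`, and `_write_dtmf_audio` are hypothetical names, and the dispatch logic is an assumption about the base class, which this diff does not show.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import List


class KeypadEntry(str, Enum):
    """Hypothetical stand-in for the keypad enum carried by DTMF frames."""

    ONE = "1"
    POUND = "#"


@dataclass
class SketchDTMFFrame:
    """Hypothetical stand-in for a DTMF output frame."""

    button: KeypadEntry


class FakeClient:
    """Records digits instead of talking to a real room."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    async def send_dtmf(self, digit: str) -> None:
        self.sent.append(digit)


class SketchOutputTransport:
    def __init__(self, client: FakeClient) -> None:
        self._client = client
        self.audio_fallback: List[str] = []

    def _supports_native_dtmf(self) -> bool:
        # Mirrors the diff: this transport reports native DTMF support.
        return True

    async def _write_dtmf_native(self, frame: SketchDTMFFrame) -> None:
        # Mirrors the diff: forward the button's string value to the client.
        await self._client.send_dtmf(frame.button.value)

    async def _write_dtmf_audio(self, frame: SketchDTMFFrame) -> None:
        # Assumed fallback path (e.g. generated tones); not shown in the diff.
        self.audio_fallback.append(frame.button.value)

    async def write_dtmf(self, frame: SketchDTMFFrame) -> None:
        # Assumed dispatch: prefer the native path when it is supported.
        if self._supports_native_dtmf():
            await self._write_dtmf_native(frame)
        else:
            await self._write_dtmf_audio(frame)


def demo() -> List[str]:
    """Send two digits through the sketch and return what the client saw."""
    client = FakeClient()
    transport = SketchOutputTransport(client)

    async def run() -> None:
        for button in (KeypadEntry.ONE, KeypadEntry.POUND):
            await transport.write_dtmf(SketchDTMFFrame(button))

    asyncio.run(run())
    return client.sent
```

Keeping the capability check in a separate method lets a transport without native support fall back without overriding the write path itself.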
@@ -784,6 +922,8 @@ class LiveKitTransport(BaseTransport):
             on_participant_disconnected=self._on_participant_disconnected,
             on_audio_track_subscribed=self._on_audio_track_subscribed,
             on_audio_track_unsubscribed=self._on_audio_track_unsubscribed,
+            on_video_track_subscribed=self._on_video_track_subscribed,
+            on_video_track_unsubscribed=self._on_video_track_unsubscribed,
             on_data_received=self._on_data_received,
             on_first_participant_joined=self._on_first_participant_joined,
         )
@@ -801,6 +941,8 @@ class LiveKitTransport(BaseTransport):
         self._register_event_handler("on_participant_disconnected")
         self._register_event_handler("on_audio_track_subscribed")
         self._register_event_handler("on_audio_track_unsubscribed")
+        self._register_event_handler("on_video_track_subscribed")
+        self._register_event_handler("on_video_track_unsubscribed")
         self._register_event_handler("on_data_received")
         self._register_event_handler("on_first_participant_joined")
         self._register_event_handler("on_participant_left")
@@ -922,6 +1064,20 @@ class LiveKitTransport(BaseTransport):
         """Handle audio track unsubscribed events."""
         await self._call_event_handler("on_audio_track_unsubscribed", participant_id)
 
+    async def _on_video_track_subscribed(self, participant_id: str):
+        """Handle video track subscribed events."""
+        await self._call_event_handler("on_video_track_subscribed", participant_id)
+        participant = self._client.room.remote_participants.get(participant_id)
+        if participant:
+            for publication in participant.video_tracks.values():
+                self._client._on_track_subscribed_wrapper(
+                    publication.track, publication, participant
+                )
+
+    async def _on_video_track_unsubscribed(self, participant_id: str):
+        """Handle video track unsubscribed events."""
+        await self._call_event_handler("on_video_track_unsubscribed", participant_id)
+
     async def _on_data_received(self, data: bytes, participant_id: str):
         """Handle data received events."""
         if self._input:
12
uv.lock
generated
@@ -1236,13 +1236,13 @@ wheels = [
 
 [[package]]
 name = "daily-python"
-version = "0.19.8"
+version = "0.19.9"
 source = { registry = "https://pypi.org/simple" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/25/33/21029ca23df6bae54dfa4e8af550fdcc557f053dc924a554cfa0e32b2904/daily_python-0.19.8-cp37-abi3-macosx_10_15_x86_64.whl", hash = "sha256:cccd2eb8b223299408a9f1269a6f1a257d03aba749ef9fa97678010474c2b40b", size = 13692303, upload-time = "2025-08-27T21:24:36.567Z" },
-    { url = "https://files.pythonhosted.org/packages/14/5b/c795498ffe7cbdca72530b71b4102c4ef43c2f528a72f0deb2d4c3af1cac/daily_python-0.19.8-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:2c1010d44238d492cee3a6af231ff613899efb54943fe3a191f5b84d6af3330d", size = 12047872, upload-time = "2025-08-27T21:24:39.097Z" },
-    { url = "https://files.pythonhosted.org/packages/94/fd/145d65d6902873f3b44f3da7918d1bbbeeb6228e260e2aeb163311a33aa2/daily_python-0.19.8-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:d2b2fedf92e84b599f18d424ede423116eabffbe01ec6434f478d0b577d8bec3", size = 14111600, upload-time = "2025-08-27T21:24:40.962Z" },
-    { url = "https://files.pythonhosted.org/packages/5b/b6/a0123f00003cee45e488467649e2d69f09058f7815a4c590b4827992d3ef/daily_python-0.19.8-cp37-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:faec30ae64e233384d8bb96ad13de871843bb02f67bf3aa793a6bce9722734c5", size = 14582824, upload-time = "2025-08-27T21:24:43.148Z" },
+    { url = "https://files.pythonhosted.org/packages/22/85/6064c3225e5b190e522e8f3bc6a460efc5e3e6632f16fd5f9799c44ba57a/daily_python-0.19.9-cp37-abi3-macosx_10_15_x86_64.whl", hash = "sha256:cbc558ad7d49e79b550bf7567b9ceae75e2864d4fcaf41c90377b620e38a2461", size = 13365213, upload-time = "2025-09-06T00:31:00.224Z" },
+    { url = "https://files.pythonhosted.org/packages/23/58/af986c6881180a46a7b60dd418ce58d6d7c0c4ffc48d261748067c679317/daily_python-0.19.9-cp37-abi3-macosx_11_0_arm64.whl", hash = "sha256:446bb9ee848d88bc68ca29a2216793c9b5ebaf5991bf604daf76f7c5a53d5919", size = 11711673, upload-time = "2025-09-06T00:31:02.526Z" },
+    { url = "https://files.pythonhosted.org/packages/9d/48/1cad4c3e92cdb5ef06467d972c76a510fe5e807513334b10ad7f8c21bf74/daily_python-0.19.9-cp37-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:2facaf82b614404c642c70bbf0874fb045d8ad46400acb051470cd4df93cb4db", size = 13679393, upload-time = "2025-09-06T00:31:04.999Z" },
+    { url = "https://files.pythonhosted.org/packages/3c/e9/354f4699619e83d13e266256b2352b21741ac527e3e5ab5f2264d5c482cd/daily_python-0.19.9-cp37-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:ffc205efca7b47739efd358febab17577248c8db2ebc4d17d819307a83b9eefc", size = 14221932, upload-time = "2025-09-06T00:31:07.471Z" },
 ]
 
 [[package]]
@@ -4432,7 +4432,7 @@ requires-dist = [
     { name = "azure-cognitiveservices-speech", marker = "extra == 'azure'", specifier = "~=1.42.0" },
     { name = "cartesia", marker = "extra == 'cartesia'", specifier = "~=2.0.3" },
     { name = "coremltools", marker = "extra == 'local-smart-turn'", specifier = ">=8.0" },
-    { name = "daily-python", marker = "extra == 'daily'", specifier = "~=0.19.8" },
+    { name = "daily-python", marker = "extra == 'daily'", specifier = "~=0.19.9" },
     { name = "deepgram-sdk", marker = "extra == 'deepgram'", specifier = "~=4.7.0" },
     { name = "docstring-parser", specifier = "~=0.16" },
     { name = "einops", marker = "extra == 'moondream'", specifier = "~=0.8.0" },