Compare commits

6 Commits

Author SHA1 Message Date
Jon Taylor
5bd5d22270 removed space from event handler 2024-06-26 18:30:56 +01:00
Jon Taylor
6ee7932337 added pause to start and new intro prompt 2024-06-26 18:24:14 +01:00
Jon Taylor
c407445dd1 removed header comment from bot runner 2024-06-24 17:35:26 +01:00
Jon Taylor
447f37167e added VAD stop seconds env 2024-06-24 17:34:25 +01:00
Jon Taylor
354c21500e prompt tweaks 2024-06-24 17:28:10 +01:00
Jon Taylor
5728e25b5a added fastbot example 2024-06-24 16:25:36 +01:00
80 changed files with 1259 additions and 2847 deletions

View File

@@ -5,159 +5,6 @@ All notable changes to **pipecat** will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.0.37] - 2024-07-22
### Added
- Added `RTVIProcessor` which implements the RTVI-AI standard.
See https://github.com/rtvi-ai
- Added `BotInterruptionFrame` which allows interrupting the bot while talking.
- Added `LLMMessagesAppendFrame` which allows appending messages to the current
LLM context.
- Added `LLMMessagesUpdateFrame` which allows changing the LLM context for the
one provided in this new frame.
- Added `LLMModelUpdateFrame` which allows updating the LLM model.
- Added `TTSSpeakFrame` which causes the bot to say some text. This text will not
be part of the LLM context.
- Added `TTSVoiceUpdateFrame` which allows updating the TTS voice.
### Removed
- Removed the `LLMResponseStartFrame` and `LLMResponseEndFrame` frames. These
were added in the past to properly handle interruptions for the
`LLMAssistantContextAggregator`. But the `LLMContextAggregator` is now based
on `LLMResponseAggregator` which handles interruptions properly by just
processing the `StartInterruptionFrame`, so there's no need for these extra
frames any more.
### Fixed
- Fixed an issue with `StatelessTextTransformer` where it was pushing a string
instead of a `TextFrame`.
- `TTSService` end of sentence detection has been improved. It now works with
acronyms, numbers, hours and others.
- Fixed an issue in `TTSService` that would not properly flush the current
aggregated sentence if an `LLMFullResponseEndFrame` was found.
### Performance
- `CartesiaTTSService` now uses websockets, which improves speed. It also
leverages the new Cartesia contexts, which maintain generated audio prosody
when multiple inputs are sent, noticeably improving audio quality.
## [0.0.36] - 2024-07-02
### Added
- Added `GladiaSTTService`.
See https://docs.gladia.io/chapters/speech-to-text-api/pages/live-speech-recognition
- Added `XTTSService`. This is a local Text-To-Speech service.
See https://github.com/coqui-ai/TTS
- Added `UserIdleProcessor`. This processor can be used to wait for any
interaction with the user. If the user doesn't say anything within a given
timeout a provided callback is called.
- Added `IdleFrameProcessor`. This processor can be used to wait for frames
within a given timeout. If no frame is received within the timeout a provided
callback is called.
- Added new frame `BotSpeakingFrame`. This frame will be continuously pushed
upstream while the bot is talking.
- It is now possible to specify a Silero VAD version when using `SileroVADAnalyzer`
or `SileroVAD`.
- Added `AsyncFrameProcessor` and `AsyncAIService`. Some services like
`DeepgramSTTService` need to process things asynchronously. For example, audio
is sent to Deepgram but transcriptions are not returned immediately. In these
cases we still require all frames (except system frames) to be pushed
downstream from a single task. That's what `AsyncFrameProcessor` is for. It
creates a task and all frames should be pushed from that task. So, whenever a
new Deepgram transcription is ready that transcription will also be pushed
from this internal task.
- The `MetricsFrame` now includes processing metrics if metrics are enabled. The
processing metrics indicate the time a processor needs to generate all its
output. Note that not all processors generate this kind of metric.
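The single-task rule described for `AsyncFrameProcessor` can be illustrated without any pipecat code. The sketch below is an assumption-laden stand-in (the class `SingleTaskPusher` and its methods are hypothetical, not the pipecat API): every producer only enqueues, and one internal task is the sole place where frames are "pushed", so late results such as a transcription keep a single, well-defined output order.

```python
import asyncio


class SingleTaskPusher:
    """Hypothetical illustration: all frames leave through one internal task."""

    def __init__(self):
        self._queue = asyncio.Queue()
        self._task = None
        self.pushed = []  # stands in for "pushed downstream"

    def start(self):
        # The only task that is allowed to push frames.
        self._task = asyncio.get_running_loop().create_task(self._push_loop())

    async def queue_frame(self, frame):
        # Producers never push directly; they only enqueue.
        await self._queue.put(frame)

    async def _push_loop(self):
        while True:
            frame = await self._queue.get()
            if frame is None:  # sentinel used to stop the loop
                break
            self.pushed.append(frame)

    async def stop(self):
        await self._queue.put(None)
        await self._task


async def demo():
    pusher = SingleTaskPusher()
    pusher.start()
    await pusher.queue_frame("audio-1")

    async def late_transcription():
        # Simulates a result that arrives asynchronously, after more audio.
        await asyncio.sleep(0.01)
        await pusher.queue_frame("transcription-1")

    await asyncio.gather(late_transcription(), pusher.queue_frame("audio-2"))
    await pusher.stop()
    return pusher.pushed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design choice is that ordering is decided by the queue, not by whichever coroutine happens to finish first.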
### Changed
- `WhisperSTTService` model can now also be a string.
- Added missing * keyword separators in services.
### Fixed
- `WebsocketServerTransport` no longer tries to send frames if the serializer
returns `None`.
- Fixed an issue where exceptions that occurred inside frame processors were
being swallowed and not displayed.
- Fixed an issue in `FastAPIWebsocketTransport` where it would still try to send
data to the websocket after being closed.
### Other
- Added Fly.io deployment example in `examples/deployment/flyio-example`.
- Added new `17-detect-user-idle.py` example that shows how to use the new
`UserIdleProcessor`.
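The idle-detection idea behind `UserIdleProcessor` and `IdleFrameProcessor` (call a callback when nothing arrives within a timeout) can be sketched with plain `asyncio`. This is not the pipecat implementation; `watch_idle`, the queue of events, and the `None` stop sentinel are assumptions made for illustration only.

```python
import asyncio


async def watch_idle(events: asyncio.Queue, timeout: float, on_idle):
    """Call on_idle() whenever `timeout` seconds pass without a new event."""
    while True:
        try:
            item = await asyncio.wait_for(events.get(), timeout=timeout)
        except asyncio.TimeoutError:
            await on_idle()
            continue
        if item is None:  # hypothetical stop sentinel
            return


async def demo():
    idle_calls = []

    async def on_idle():
        idle_calls.append("idle")

    events = asyncio.Queue()
    watcher = asyncio.create_task(watch_idle(events, 0.05, on_idle))
    await events.put("user spoke")  # activity resets the wait
    await asyncio.sleep(0.12)       # then silence longer than the timeout
    await events.put(None)
    await watcher
    return idle_calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```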
## [0.0.35] - 2024-06-28
### Changed
- `FastAPIWebsocketParams` now requires a serializer.
- `TwilioFrameSerializer` now requires a `streamSid`.
### Fixed
- Silero VAD number of frames needs to be 512 for 16000 sample rate or 256 for
8000 sample rate.
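Both frame counts in the fix above describe the same window length in time. A small sketch of the arithmetic (the helper name `silero_window` is hypothetical and not part of pipecat):

```python
# Frame counts quoted in the changelog entry, keyed by sample rate in Hz.
SILERO_FRAMES_BY_RATE = {16000: 512, 8000: 256}


def silero_window(sample_rate: int):
    """Return (number of frames, window duration in milliseconds)."""
    num_frames = SILERO_FRAMES_BY_RATE[sample_rate]
    duration_ms = 1000 * num_frames / sample_rate
    return num_frames, duration_ms


if __name__ == "__main__":
    for rate in (16000, 8000):
        print(rate, silero_window(rate))
```

In both cases the window works out to 32 ms of audio.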
## [0.0.34] - 2024-06-25
### Fixed
- Fixed an issue with asynchronous STT services (Deepgram and Azure) that could
cause interruptions to ignore transcriptions.
- Fixed an issue introduced in 0.0.33 that would cause the LLM to generate
shorter output.
## [0.0.33] - 2024-06-25
### Changed
- Upgraded to Cartesia's new Python library 1.0.0. `CartesiaTTSService` now
expects a voice ID instead of a voice name (you can get the voice ID from
Cartesia's playground). You can also specify the audio `sample_rate` and
`encoding` instead of the previous `output_format`.
### Fixed
- Fixed an issue with asynchronous STT services (Deepgram and Azure) that could
cause static audio issues and interruptions to not work properly when dealing
with multiple LLM sentences.
- Fixed an issue that could mix new LLM responses with previous ones when
handling interruptions.
- Fixed a Daily transport blocking situation that occurred while reading audio
frames after a participant left the room. Needs daily-python >= 0.10.1.
## [0.0.32] - 2024-06-22
### Added

View File

@@ -39,7 +39,7 @@ pip install "pipecat-ai[option,...]"
Your project may or may not need these, so they're made available as optional requirements. Here is a list:
- **AI services**: `anthropic`, `azure`, `deepgram`, `gladia`, `google`, `fal`, `moondream`, `openai`, `openpipe`, `playht`, `silero`, `whisper`, `xtts`
- **AI services**: `anthropic`, `azure`, `deepgram`, `google`, `fal`, `moondream`, `openai`, `openpipe`, `playht`, `silero`, `whisper`
- **Transports**: `local`, `websocket`, `daily`
## Code examples
@@ -70,8 +70,8 @@ async def main():
transport = DailyTransport(
room_url=...,
token=...,
bot_name="Bot Name",
params=DailyParams(audio_out_enabled=True))
"Bot Name",
DailyParams(audio_out_enabled=True))
# Use Eleven Labs for Text-to-Speech
tts = ElevenLabsTTSService(
@@ -125,7 +125,7 @@ Sign up [here](https://dashboard.daily.co/u/signup) and [create a room](https://
Voice Activity Detection — very important for knowing when a user has finished speaking to your bot. If you are not using press-to-talk, and want Pipecat to detect when the user has finished talking, VAD is an essential component for a natural feeling conversation.
Pipecat makes use of WebRTC VAD by default when using a WebRTC transport layer. Optionally, you can use Silero VAD for improved accuracy at the cost of higher CPU usage.
Pipecast makes use of WebRTC VAD by default when using a WebRTC transport layer. Optionally, you can use Silero VAD for improved accuracy at the cost of higher CPU usage.
```shell
pip install pipecat-ai[silero]

View File

@@ -27,9 +27,6 @@ FAL_KEY=...
# Fireworks
FIREWORKS_API_KEY=...
# Gladia
GLADIA_API_KEY=...
# PlayHT
PLAY_HT_USER_ID=...
PLAY_HT_API_KEY=...

View File

@@ -1,16 +0,0 @@
FROM python:3.11-bullseye
# Open port 7860 for http service
ENV FAST_API_PORT=7860
EXPOSE 7860
# Install Python dependencies
COPY *.py .
COPY ./requirements.txt requirements.txt
RUN pip3 install --no-cache-dir --upgrade -r requirements.txt
# Install models
RUN python3 install_deps.py
# Start the FastAPI server
CMD python3 bot_runner.py --port ${FAST_API_PORT}

View File

@@ -1,43 +0,0 @@
# Fly.io deployment example
This project modifies the `bot_runner.py` server to launch a new machine for each user session. This is the recommended approach for production: running bots as shell processes will quickly exhaust your deployment's system resources under load.
To speed up machine boot times, we also download and cache Silero VAD as part of the Dockerfile (`install_deps.py`). If you are using other custom models, you can add them here too.
For this example, we are using Daily as a WebRTC transport and provisioning a new room and token for each session. You can use another transport, such as WebSockets, by modifying the `bot.py` and `bot_runner.py` files accordingly.
## Setting up your fly.io deployment
### Create your fly.toml file
You can copy the `example-fly.toml` as a reference. Be sure to change the app name to something unique.
### Create your .env file
Copy the base `env.example` to `.env` and enter the necessary API keys.
`FLY_APP_NAME` should match that in the `fly.toml` file.
### Launch a new fly.io project
`fly launch` or `fly launch --org your-org-name`
### Set the necessary app secrets from your .env
Note: you can do this manually via the fly.io dashboard under the "secrets" sub-section of your deployment (e.g. "https://fly.io/apps/fly-app-name/secrets") or run the following terminal command:
`cat .env | tr '\n' ' ' | xargs flyctl secrets set`
### Deploy your machine
`fly deploy`
## Connecting to your bot
Send a post request to your running fly.io instance:
`curl --location --request POST 'https://YOUR_FLY_APP_NAME/start_bot'`
This request will wait until the machine enters the `started` state before returning a room URL and token to join.
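The same POST request can be made from Python. This is a minimal standard-library sketch, not part of the example project: the function names are hypothetical, the host is a placeholder, and nothing is assumed about the response beyond it being JSON.

```python
import json
import urllib.request


def build_start_bot_request(base_url: str, payload=None) -> urllib.request.Request:
    """Build a POST request for the /start_bot endpoint (body is optional)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/start_bot",
        data=data,
        headers=headers,
        method="POST",
    )


def start_bot(base_url: str, payload=None, timeout: float = 60.0):
    """Send the request and return the decoded JSON response."""
    request = build_start_bot_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder host: replace with your own fly.io app URL before running.
    req = build_start_bot_request("https://YOUR_FLY_APP_NAME")
    print(req.get_method(), req.full_url)
```

A generous timeout is used because the request blocks while the machine boots.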

View File

@@ -1,103 +0,0 @@
import asyncio
import aiohttp
import os
import sys
import argparse
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import LLMAssistantResponseAggregator, LLMUserResponseAggregator
from pipecat.frames.frames import LLMMessagesFrame, EndFrame
from pipecat.services.openai import OpenAILLMService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer
from loguru import logger
from dotenv import load_dotenv
load_dotenv(override=True)
logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
daily_api_key = os.getenv("DAILY_API_KEY", "")
daily_api_url = os.getenv("DAILY_API_URL", "https://api.daily.co/v1")
async def main(room_url: str, token: str):
async with aiohttp.ClientSession() as session:
transport = DailyTransport(
room_url,
token,
"Chatbot",
DailyParams(
api_url=daily_api_url,
api_key=daily_api_key,
audio_in_enabled=True,
audio_out_enabled=True,
camera_out_enabled=False,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(),
transcription_enabled=True,
)
)
tts = ElevenLabsTTSService(
aiohttp_session=session,
api_key=os.getenv("ELEVENLABS_API_KEY", ""),
voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
)
llm = OpenAILLMService(
api_key=os.getenv("OPENAI_API_KEY"),
model="gpt-4o")
messages = [
{
"role": "system",
"content": "You are Chatbot, a friendly, helpful robot. Your output will be converted to audio so don't include special characters other than '!' or '?' in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by saying hello.",
},
]
tma_in = LLMUserResponseAggregator(messages)
tma_out = LLMAssistantResponseAggregator(messages)
pipeline = Pipeline([
transport.input(),
tma_in,
llm,
tts,
transport.output(),
tma_out,
])
task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
transport.capture_participant_transcription(participant["id"])
await task.queue_frames([LLMMessagesFrame(messages)])
@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
await task.queue_frame(EndFrame())
@transport.event_handler("on_call_state_updated")
async def on_call_state_updated(transport, state):
if state == "left":
await task.queue_frame(EndFrame())
runner = PipelineRunner()
await runner.run(task)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Pipecat Bot")
parser.add_argument("-u", type=str, help="Room URL")
parser.add_argument("-t", type=str, help="Token")
config = parser.parse_args()
asyncio.run(main(config.u, config.t))

View File

@@ -1,8 +0,0 @@
DAILY_API_KEY=
DAILY_SAMPLE_ROOM_URL= # Enter a Daily room URL to use a set room URL each time (useful for local testing)
OPENAI_API_KEY=
ELEVENLABS_API_KEY=
ELEVENLABS_VOICE_ID=
FLY_API_KEY=
FLY_APP_NAME=
RUN_AS_PROCESS= # Spawn fly.io machine for each session or run as local process

View File

@@ -1,25 +0,0 @@
# fly.toml app configuration file generated for pipecat-fly-example on 2024-07-01T15:04:53+01:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#
app = 'pipecat-fly-example'
primary_region = 'sjc'
[build]
[env]
FLY_APP_NAME = 'pipecat-fly-example'
[http_service]
internal_port = 7860
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0
processes = ['app']
[[vm]]
memory = 512
cpu_kind = 'shared'
cpus = 1

View File

@@ -1,4 +0,0 @@
import torch
# Download (cache) the Silero VAD model
torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad', force_reload=True)

165
examples/fast-chatbot/.gitignore vendored Normal file
View File

@@ -0,0 +1,165 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
runpod.toml
# custom script to recursively upgrade items in requirements.py
upgrade_requirements.py
.DS_Store

View File

@@ -0,0 +1,164 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
from loguru import logger
import argparse
import asyncio
import aiohttp
import os
import sys
import time
from typing import Optional
from pydantic import BaseModel, ValidationError
from pipecat.vad.vad_analyzer import VADParams
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.services.openai import OpenAILLMService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.pipeline import Pipeline
from pipecat.frames.frames import LLMMessagesFrame, EndFrame
from pipecat.processors.aggregators.llm_response import (
LLMAssistantResponseAggregator, LLMUserResponseAggregator
)
from helpers import (
ClearableDeepgramTTSService,
AudioVolumeTimer,
TranscriptionTimingLogger
)
from dotenv import load_dotenv
load_dotenv(override=True)
logger.remove(0)
logger.add(sys.stderr, level=os.getenv("LOG_LEVEL", "DEBUG"))
class BotSettings(BaseModel):
    room_url: str
    room_token: str
    bot_name: str = "Pipecat"
    prompt: Optional[str] = "You are a helpful assistant."
    deepgram_api_key: Optional[str] = os.getenv("DEEPGRAM_API_KEY", None)
    deepgram_voice: Optional[str] = os.getenv("DEEPGRAM_VOICE", "aura-asteria-en")
    deepgram_tts_base_url: Optional[str] = os.getenv(
        "DEEPGRAM_TTS_BASE_URL", "https://api.deepgram.com/v1/speak")
    deepgram_stt_base_url: Optional[str] = os.getenv(
        "DEEPGRAM_STT_BASE_URL", "https://api.deepgram.com/v1/speak")
    # No trailing commas on these defaults: a trailing comma would turn the
    # default value into a one-element tuple.
    openai_api_key: Optional[str] = os.getenv("OPENAI_API_KEY", None)
    openai_model: Optional[str] = os.getenv("OPENAI_MODEL", None)
    openai_base_url: Optional[str] = os.getenv("OPENAI_BASE_URL", None)
    # os.getenv returns a string when the variable is set, so convert explicitly.
    vad_stop_secs: Optional[float] = float(os.getenv("VAD_STOP_SECS", "0.2"))
async def main(settings: BotSettings):
async with aiohttp.ClientSession() as session:
transport = DailyTransport(
settings.room_url,
settings.room_token,
settings.bot_name,
DailyParams(
audio_out_enabled=True,
transcription_enabled=False,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(params=VADParams(
stop_secs=settings.vad_stop_secs
)),
vad_audio_passthrough=True
)
)
stt = DeepgramSTTService(
name="STT",
api_key=settings.deepgram_api_key,
url=settings.deepgram_stt_base_url
)
tts = ClearableDeepgramTTSService(
name="Voice",
aiohttp_session=session,
api_key=settings.deepgram_api_key,
voice=settings.deepgram_voice,
**({'base_url': url} if (url := settings.deepgram_tts_base_url) else {})
)
llm = OpenAILLMService(
name="LLM",
api_key=settings.openai_api_key,
model=settings.openai_model,
base_url=settings.openai_base_url,
)
messages = [
{
"role": "system",
"content": settings.prompt,
},
]
avt = AudioVolumeTimer()
tl = TranscriptionTimingLogger(avt)
tma_in = LLMUserResponseAggregator(messages)
tma_out = LLMAssistantResponseAggregator(messages)
pipeline = Pipeline([
transport.input(), # Transport user input
avt, # Audio volume timer
stt, # Speech-to-text
tl, # Transcription timing logger
tma_in, # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
tma_out, # Assistant spoken responses
])
task = PipelineTask(
pipeline,
PipelineParams(
allow_interruptions=True,
enable_metrics=True,
report_only_initial_ttfb=True
))
# When the participant leaves, we exit the bot.
@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
await task.queue_frame(EndFrame())
# When the first participant joins, the bot should introduce itself.
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
# Provide some air whilst tracks subscribe; asyncio.sleep keeps the event loop unblocked
await asyncio.sleep(2)
messages.append(
{
"role": "system",
"content": "Briefly introduce yourself by saying 'hello, I'm FastBot, how can I help you today?'"})
await task.queue_frames([LLMMessagesFrame(messages)])
runner = PipelineRunner()
await runner.run(task)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Pipecat Bot")
parser.add_argument("-s", "--settings", type=str, required=True, help="Pipecat bot settings")
args, unknown = parser.parse_known_args()
try:
settings = BotSettings.model_validate_json(args.settings)
asyncio.run(main(settings))
except ValidationError as e:
print(e)
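The `-s/--settings` flag carries the whole bot configuration as a single JSON string. The round trip can be sketched with the standard library alone; here a hypothetical `DemoSettings` dataclass stands in for the pydantic `BotSettings` model, so unlike pydantic it performs no type validation.

```python
import argparse
import json
from dataclasses import asdict, dataclass


@dataclass
class DemoSettings:
    """Hypothetical stand-in for BotSettings with only a few fields."""
    room_url: str
    room_token: str
    bot_name: str = "Pipecat"
    vad_stop_secs: float = 0.2


def parse_settings(argv) -> DemoSettings:
    """Parse `-s '<json>'` from an argument list into a settings object."""
    parser = argparse.ArgumentParser(description="settings demo")
    parser.add_argument("-s", "--settings", type=str, required=True)
    args, _unknown = parser.parse_known_args(argv)
    return DemoSettings(**json.loads(args.settings))


def dump_settings(settings: DemoSettings) -> str:
    """Serialize settings back to the JSON string passed on the command line."""
    return json.dumps(asdict(settings))


if __name__ == "__main__":
    raw = '{"room_url": "https://example.invalid/room", "room_token": "abc"}'
    print(parse_settings(["-s", raw]))
```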

View File

@@ -1,7 +1,15 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import os
import argparse
import subprocess
import requests
from pydantic import BaseModel, ValidationError
from typing import Optional
from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper, DailyRoomObject, DailyRoomProperties, DailyRoomParams
@@ -9,6 +17,8 @@ from fastapi import FastAPI, Request, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from bot import BotSettings
from dotenv import load_dotenv
load_dotenv(override=True)
@@ -16,29 +26,24 @@ load_dotenv(override=True)
# ------------ Configuration ------------ #
MAX_SESSION_TIME = 5 * 60 # 5 minutes
REQUIRED_ENV_VARS = [
'DAILY_API_KEY',
'OPENAI_API_KEY',
'ELEVENLABS_API_KEY',
'ELEVENLABS_VOICE_ID',
'FLY_API_KEY',
'FLY_APP_NAME',]
FLY_API_HOST = os.getenv("FLY_API_HOST", "https://api.machines.dev/v1")
FLY_APP_NAME = os.getenv("FLY_APP_NAME", "pipecat-fly-example")
FLY_API_KEY = os.getenv("FLY_API_KEY", "")
FLY_HEADERS = {
'Authorization': f"Bearer {FLY_API_KEY}",
'Content-Type': 'application/json'
}
REQUIRED_ENV_VARS = ['DAILY_API_URL', 'DAILY_API_KEY', 'DEEPGRAM_API_KEY']
daily_rest_helper = DailyRESTHelper(
os.getenv("DAILY_API_KEY", ""),
os.getenv("DAILY_API_URL", 'https://api.daily.co/v1'))
class RunnerSettings(BaseModel):
prompt: Optional[
str] = "You are a fast, low-latency chatbot. Your goal is to demonstrate voice-driven AI capabilities at human-like speeds. When introducing yourself briefly mention your goal is to showcase speed and conversational flow. The technology powering you is Daily for transport, Cerebrium for GPU hosting, Llama 3 (8-B version) LLM, and Deepgram for speech-to-text and text-to-speech. You are hosted on the east coast of the United States. Respond to what the user said in a creative and helpful way, but keep responses short and legible. Ensure responses contain only words. Check again that you have not included special characters other than '?' or '!'."
deepgram_voice: Optional[str] = os.getenv("DEEPGRAM_VOICE")
openai_model: Optional[str] = os.getenv("OPENAI_MODEL", "gpt-4o")
openai_api_key: Optional[str] = os.getenv("OPENAI_API_KEY")
test: Optional[bool] = None
# ----------------- API ----------------- #
app = FastAPI()
app.add_middleware(
@@ -52,67 +57,25 @@ app.add_middleware(
# ----------------- Main ----------------- #
def spawn_fly_machine(room_url: str, token: str):
# Use the same image as the bot runner
res = requests.get(f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines", headers=FLY_HEADERS)
if res.status_code != 200:
raise Exception(f"Unable to get machine info from Fly: {res.text}")
image = res.json()[0]['config']['image']
# Machine configuration
cmd = f"python3 bot.py -u {room_url} -t {token}"
cmd = cmd.split()
worker_props = {
"config": {
"image": image,
"auto_destroy": True,
"init": {
"cmd": cmd
},
"restart": {
"policy": "no"
},
"guest": {
"cpu_kind": "shared",
"cpus": 1,
"memory_mb": 1024
}
},
}
# Spawn a new machine instance
res = requests.post(
f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines",
headers=FLY_HEADERS,
json=worker_props)
if res.status_code != 200:
raise Exception(f"Problem starting a bot worker: {res.text}")
# Wait for the machine to enter the started state
vm_id = res.json()['id']
res = requests.get(
f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines/{vm_id}/wait?state=started",
headers=FLY_HEADERS)
if res.status_code != 200:
raise Exception(f"Bot was unable to enter started state: {res.text}")
print(f"Machine joined room: {room_url}")
@app.post("/start_bot")
async def start_bot(request: Request) -> JSONResponse:
runner_settings = RunnerSettings()
try:
data = await request.json()
# Is this a webhook creation request?
if "test" in data:
return JSONResponse({"test": True})
request_body = await request.body()
if len(request_body) > 0:
runner_settings = RunnerSettings.model_validate_json(request_body)
except ValidationError as e:
raise HTTPException(
status_code=400,
detail=f"Invalid request: {e}")
except Exception as e:
# If no data in request, pass
pass
# Is this a webhook creation request?
if runner_settings.test is not None:
return JSONResponse({"test": True})
# Use specified room URL, or create a new one if not specified
room_url = os.getenv("DAILY_SAMPLE_ROOM_URL", "")
@@ -141,25 +104,26 @@ async def start_bot(request: Request) -> JSONResponse:
raise HTTPException(
status_code=500, detail=f"Failed to get token for room: {room_url}")
# Launch a new fly.io machine, or run as a shell process (not recommended)
run_as_process = os.getenv("RUN_AS_PROCESS", False)
# Spawn a new agent, and join the user session
try:
bot_settings = BotSettings(
room_url=room.url,
room_token=token,
prompt=runner_settings.prompt,
deepgram_voice=runner_settings.deepgram_voice,
openai_model=runner_settings.openai_model,
openai_api_key=runner_settings.openai_api_key,
)
bot_settings_str = bot_settings.model_dump_json(exclude_none=True)
if run_as_process:
try:
subprocess.Popen(
[f"python3 -m bot -u {room.url} -t {token}"],
shell=True,
bufsize=1,
cwd=os.path.dirname(os.path.abspath(__file__)))
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Failed to start subprocess: {e}")
else:
try:
spawn_fly_machine(room.url, token)
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Failed to spawn VM: {e}")
# Pass the settings JSON as its own argument (no shell), so quotes in the prompt cannot break the command
subprocess.Popen(
["python3", "-m", "bot", "-s", bot_settings_str],
bufsize=1,
cwd=os.path.dirname(os.path.abspath(__file__)))
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Failed to start subprocess: {e}")
# Grab a token for the user to join with
user_token = daily_rest_helper.get_token(room.url, MAX_SESSION_TIME)
@@ -169,6 +133,7 @@ async def start_bot(request: Request) -> JSONResponse:
"token": user_token,
})
if __name__ == "__main__":
# Check environment variables
for env_var in REQUIRED_ENV_VARS:
@@ -181,7 +146,7 @@ if __name__ == "__main__":
parser.add_argument("--port", type=int,
default=os.getenv("PORT", 7860), help="Port number")
parser.add_argument("--reload", action="store_true",
default=False, help="Reload code on change")
default=True, help="Reload code on change")
config = parser.parse_args()
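The runner hands the settings JSON to the bot on the command line, and prompts may contain quote characters. A minimal sketch of two safe ways to build that command, using hypothetical helper names: an argument list (no shell involved), or `shlex.quote` when a shell string is unavoidable.

```python
import shlex


def bot_command_argv(settings_json: str):
    """Argument-list form: each element reaches the bot exactly as given."""
    return ["python3", "-m", "bot", "-s", settings_json]


def bot_command_shell(settings_json: str) -> str:
    """Shell-string form: shlex.quote escapes embedded quotes and spaces."""
    return f"python3 -m bot -s {shlex.quote(settings_json)}"


if __name__ == "__main__":
    settings = '{"prompt": "Avoid characters other than \'?\' or \'!\'."}'
    print(bot_command_argv(settings))
    print(bot_command_shell(settings))
```

Splitting the shell string with `shlex.split` yields the same arguments as the list form, which is a quick way to check the quoting.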

View File

@@ -0,0 +1,12 @@
DAILY_SAMPLE_ROOM_URL= #optional: use the same room each time, or create a new one if unset
DAILY_API_KEY=
DAILY_API_URL=
DEEPGRAM_API_KEY=
DEEPGRAM_VOICE=
DEEPGRAM_STT_BASE_URL=
DEEPGRAM_TTS_BASE_URL=
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_BASE_URL=

View File

@@ -0,0 +1,267 @@
from loguru import logger
import asyncio
import math
import struct
import time
from dataclasses import dataclass, field
from typing import List
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.frames.frames import (
Frame,
AudioRawFrame,
InterimTranscriptionFrame,
TranscriptionFrame,
TextFrame,
StartInterruptionFrame,
LLMFullResponseStartFrame,
TTSStoppedFrame,
MetricsFrame
)
from pipecat.vad.vad_analyzer import VADAnalyzer, VADState
from pipecat.services.deepgram import DeepgramTTSService
from pipecat.services.openai import OpenAILLMContext, OpenAILLMContextFrame
class GreedyLLMAggregator(FrameProcessor):
def __init__(self, context: OpenAILLMContext = None, **kwargs):
super().__init__(**kwargs)
self.context: OpenAILLMContext = context if context else OpenAILLMContext()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
logger.debug(f"{frame}")
try:
if isinstance(frame, InterimTranscriptionFrame):
return
if isinstance(frame, TranscriptionFrame):
# append transcribed text to last "user" frame
if self.context.messages and self.context.messages[-1]["role"] == "user":
last_frame = self.context.messages.pop()
else:
last_frame = {"role": "user", "content": ""}
last_frame["content"] += " " + frame.text
self.context.messages.append(last_frame)
oai_context_frame = OpenAILLMContextFrame(context=self.context)
logger.debug(f"pushing frame {oai_context_frame}")
await self.push_frame(oai_context_frame)
return
await self.push_frame(frame, direction)
except Exception as e:
logger.debug(f"error: {e}")
class ClearableDeepgramTTSService(DeepgramTTSService):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, StartInterruptionFrame):
            # Drop any partially aggregated sentence when the user interrupts.
            self._current_sentence = ""
@dataclass
class BufferedSentence:
audio_frames: List[AudioRawFrame] = field(default_factory=list)
text_frame: TextFrame = None
class VADGate(FrameProcessor):
def __init__(
self,
vad_analyzer: VADAnalyzer = None,
context: OpenAILLMContext = None,
**kwargs):
super().__init__(**kwargs)
self.vad_analyzer = vad_analyzer
self.context = context
self._audio_pusher_task = None
self._expect_text_frame_next = False
self._sentences: List[BufferedSentence] = []
# queue output from tts one sentence at a time. associate a buffer of audio frames with the content of
# each text frame.
#
# start a coroutine to service the queue and send sentences down the pipeline when possible.
# 1. do not send anything when we are not in VADState.QUIET
# 2. if we are in VADState.QUIET, send a sentence, estimate how long it will take for that sentence
# to output, sleep until it's time to send another sentence
# 3. each time we send a sentence, append it to the conversation context
# 4. when the sentence buffer becomes empty, cancel the coroutine
# 5. if we get a new LLMFullResponse, treat that as a cancellation, too
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
try:
# A TTSService will emit a series of AudioRawFrame objects, then a TTSStoppedFrame,
# then a TextFrame.
if self._expect_text_frame_next:
self._expect_text_frame_next = False
if isinstance(frame, TextFrame):
self._sentences[-1].text_frame = frame
else:
logger.debug(f"expected a text frame, but received {frame}")
await self.push_frame(frame, direction)
return
else:
if isinstance(frame, TextFrame):
logger.error(f"received an unexpected text frame: {frame}")
if isinstance(frame, AudioRawFrame):
# if our buffer is empty or has a "finished" sentence at the end,
# then we need to start buffering a new sentence
if not self._sentences or self._sentences[-1].text_frame:
self._sentences.append(BufferedSentence())
self._sentences[-1].audio_frames.append(frame)
await self.maybe_start_audio_pusher_task()
return
if isinstance(frame, TTSStoppedFrame):
self._expect_text_frame_next = True
await self.push_frame(frame, direction)
return
# There are two ways we can be interrupted. During greedy inference, a new
# LLM response can start. Or, during playout, we can get a traditional
# user interruption frame.
if (isinstance(frame, LLMFullResponseStartFrame) or
isinstance(frame, StartInterruptionFrame)):
logger.debug(f"{frame} - Handle interruption in VADGate")
self._sentences = []
if self._audio_pusher_task:
self._audio_pusher_task.cancel()
self._audio_pusher_task = None
await self.push_frame(frame, direction)
return
await self.push_frame(frame, direction)
except Exception as e:
logger.debug(f"error: {e}")
async def maybe_start_audio_pusher_task(self):
try:
if self._audio_pusher_task:
return
self._audio_pusher_task = self.get_event_loop().create_task(self.push_audio())
except Exception as e:
logger.debug(f"Exception {e}")
async def push_audio(self):
try:
while True:
if not self._sentences:
await asyncio.sleep(0.01)
continue
if self.vad_analyzer._vad_state != VADState.QUIET:
await asyncio.sleep(0.01)
continue
# we only want to push completed sentence buffers
if not self._sentences[0].text_frame:
await asyncio.sleep(0.01)
continue
s = self._sentences.pop(0)
if not s.audio_frames:
continue
sample_rate = s.audio_frames[0].sample_rate
duration = 0
logger.debug(f"Pushing {len(s.audio_frames)} audio frames")
for frame in s.audio_frames:
await self.push_frame(frame)
# assume linear16 encoding (2 bytes per sample). todo: add some more
# metadata to AudioRawFrame, maybe
duration += (len(frame.audio) / 2 / frame.num_channels) / sample_rate
await asyncio.sleep(duration - 20 / 1000)
if self.context:
logger.debug(f"Appending assistant message to context: [{s.text_frame.text}]")
self.context.messages.append(
{"role": "assistant", "content": s.text_frame.text}
)
await self.push_frame(s.text_frame)
except Exception as e:
logger.debug(f"Exception {e}")
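
Review note: `push_audio` above estimates how long a buffered sentence will take to play by assuming linear16 audio (2 bytes per sample per channel) and then sleeping for that long minus 20 ms. The arithmetic can be sketched in isolation as below. The helper names `pcm16_duration_seconds` and `playout_sleep_seconds` are hypothetical and not part of pipecat; clamping the sleep at zero is an added assumption, not something the diff does.

```python
from typing import Iterable


def pcm16_duration_seconds(audio: bytes, sample_rate: int, num_channels: int = 1) -> float:
    """Estimate playout time for raw linear16 PCM (2 bytes per sample per channel)."""
    if sample_rate <= 0 or num_channels <= 0:
        raise ValueError("sample_rate and num_channels must be positive")
    bytes_per_frame = 2 * num_channels
    return (len(audio) / bytes_per_frame) / sample_rate


def playout_sleep_seconds(
    chunks: Iterable[bytes],
    sample_rate: int,
    num_channels: int = 1,
    lead_seconds: float = 0.02,
) -> float:
    """Total duration of all chunks minus a small lead, never negative.

    The 20 ms lead mirrors the `duration - 20 / 1000` expression in the diff;
    the clamp to zero is an assumption added for very short buffers.
    """
    total = sum(pcm16_duration_seconds(c, sample_rate, num_channels) for c in chunks)
    return max(0.0, total - lead_seconds)
```

With a 16 kHz mono stream, 32000 bytes of audio correspond to one second of playout, so two such chunks give a sleep of roughly 1.98 seconds.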
class TranscriptionTimingLogger(FrameProcessor):
def __init__(self, avt):
super().__init__()
self.name = "Transcription"
self._avt = avt
async def process_frame(self, frame: Frame, direction: FrameDirection):
try:
await super().process_frame(frame, direction)
if isinstance(frame, TranscriptionFrame):
elapsed = time.time() - self._avt.last_transition_ts
logger.debug(f"Transcription TTF: {elapsed}")
await self.push_frame(MetricsFrame(ttfb={self.name: elapsed}))
await self.push_frame(frame, direction)
except Exception as e:
logger.debug(f"Exception {e}")
class AudioVolumeTimer(FrameProcessor):
def __init__(self):
super().__init__()
self.last_transition_ts = 0
self._prev_volume = -80
self._speech_volume_threshold = -50
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, AudioRawFrame):
volume = self.calculate_volume(frame)
# print(f"Audio volume: {volume:.2f} dB")
if (volume >= self._speech_volume_threshold and
self._prev_volume < self._speech_volume_threshold):
# logger.debug("transition above speech volume threshold")
self.last_transition_ts = time.time()
elif (volume < self._speech_volume_threshold and
self._prev_volume >= self._speech_volume_threshold):
# logger.debug("transition below non-speech volume threshold")
self.last_transition_ts = time.time()
self._prev_volume = volume
await self.push_frame(frame, direction)
def calculate_volume(self, frame: AudioRawFrame) -> float:
if frame.num_channels != 1:
raise ValueError(f"Expected 1 channel, got {frame.num_channels}")
# Unpack audio data into 16-bit integers
fmt = f"{len(frame.audio) // 2}h"
audio_samples = struct.unpack(fmt, frame.audio)
# Calculate RMS
sum_squares = sum(sample**2 for sample in audio_samples)
rms = math.sqrt(sum_squares / len(audio_samples))
# Convert RMS to decibels (dB)
# Reference: maximum value for 16-bit audio is 32767
if rms > 0:
db = 20 * math.log10(rms / 32767)
else:
db = -96 # Minimum value (almost silent)
return db
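
Review note: `calculate_volume` above converts 16-bit samples to an RMS level in decibels relative to full scale (32767), and would divide by zero on an empty frame. A minimal standalone sketch of the same calculation with that case guarded follows. The function name `pcm16_rms_dbfs` is hypothetical, and the explicit little-endian format is an assumption (the diff uses the platform's native byte order).

```python
import math
import struct


def pcm16_rms_dbfs(audio: bytes, floor_db: float = -96.0) -> float:
    """Return the RMS level of mono linear16 audio in dB relative to full scale.

    Empty or all-zero input returns `floor_db`, matching the -96 dB floor
    used in the diff for silent frames.
    """
    num_samples = len(audio) // 2
    if num_samples == 0:
        return floor_db
    # Assumption: samples are little-endian signed 16-bit integers.
    samples = struct.unpack(f"<{num_samples}h", audio[: num_samples * 2])
    rms = math.sqrt(sum(s * s for s in samples) / num_samples)
    if rms <= 0:
        return floor_db
    return 20 * math.log10(rms / 32767)
```

A level computed this way can be compared against a threshold such as the -50 dB value in `AudioVolumeTimer` to detect transitions between speech and non-speech.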

View File

@@ -1,4 +1,4 @@
pipecat-ai[daily,openai,silero]
pipecat-ai[daily,openai,silero,deepgram]
fastapi
uvicorn
requests

View File

@@ -67,12 +67,11 @@ async def main(room_url: str, token):
"Respond bot",
DailyParams(
audio_out_enabled=True,
camera_out_enabled=True,
camera_out_width=1024,
camera_out_height=1024,
transcription_enabled=True,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(),
vad_analyzer=SileroVADAnalyzer()
)
)
@@ -117,7 +116,7 @@ async def main(room_url: str, token):
async def on_first_participant_joined(transport, participant):
participant_name = participant["info"]["userName"] or ''
transport.capture_participant_transcription(participant["id"])
await task.queue_frames([TextFrame(f"Hi there {participant_name}!")])
await task.queue_frames([TextFrame(f"Hi, this is {participant_name}.")])
runner = PipelineRunner()

View File

@@ -37,8 +37,8 @@ async def main(room_url: str, token):
token,
"Respond bot",
DailyParams(
audio_out_sample_rate=44100,
audio_out_enabled=True,
audio_out_sample_rate=44100,
transcription_enabled=True,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer()
@@ -47,8 +47,8 @@ async def main(room_url: str, token):
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="a0e99841-438c-4a64-b679-ae501e7d6091", # Barbershop Man
sample_rate=44100,
voice_name="British Lady",
output_format="pcm_44100"
)
llm = OpenAILLMService(
@@ -70,11 +70,11 @@ async def main(room_url: str, token):
tma_in, # User responses
llm, # LLM
tts, # TTS
tma_out, # Goes before the transport because cartesia has word-level timestamps!
transport.output(), # Transport bot output
tma_out # Assistant spoken responses
])
task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True, enable_metrics=True))
task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):

View File

@@ -1,96 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
import aiohttp
import os
import sys
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
LLMAssistantResponseAggregator, LLMUserResponseAggregator)
from pipecat.services.deepgram import DeepgramSTTService, DeepgramTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.xtts import XTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer
from runner import configure
from loguru import logger
from dotenv import load_dotenv
load_dotenv(override=True)
logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
async def main(room_url: str, token):
async with aiohttp.ClientSession() as session:
transport = DailyTransport(
room_url,
token,
"Respond bot",
DailyParams(
audio_out_enabled=True,
transcription_enabled=True,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(),
)
)
tts = XTTSService(
aiohttp_session=session,
voice_id="Claribel Dervla",
language="en",
base_url="http://localhost:8000"
)
llm = OpenAILLMService(
api_key=os.getenv("OPENAI_API_KEY"),
model="gpt-4o")
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
},
]
tma_in = LLMUserResponseAggregator(messages)
tma_out = LLMAssistantResponseAggregator(messages)
pipeline = Pipeline([
transport.input(), # Transport user input
tma_in, # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
tma_out # Assistant spoken responses
])
task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
transport.capture_participant_transcription(participant["id"])
# Kick off the conversation.
messages.append(
{"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMMessagesFrame(messages)])
runner = PipelineRunner()
await runner.run(task)
if __name__ == "__main__":
(url, token) = configure()
asyncio.run(main(url, token))

View File

@@ -1,101 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
import aiohttp
import os
import sys
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
LLMAssistantResponseAggregator, LLMUserResponseAggregator)
from pipecat.services.deepgram import DeepgramSTTService, DeepgramTTSService
from pipecat.services.gladia import GladiaSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.xtts import XTTSService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer
from runner import configure
from loguru import logger
from dotenv import load_dotenv
load_dotenv(override=True)
logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
async def main(room_url: str, token):
async with aiohttp.ClientSession() as session:
transport = DailyTransport(
room_url,
token,
"Respond bot",
DailyParams(
audio_out_enabled=True,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(),
vad_audio_passthrough=True,
)
)
stt = GladiaSTTService(
api_key=os.getenv("GLADIA_API_KEY"),
)
tts = DeepgramTTSService(
aiohttp_session=session,
api_key=os.getenv("DEEPGRAM_API_KEY"),
voice="aura-helios-en"
)
llm = OpenAILLMService(
api_key=os.getenv("OPENAI_API_KEY"),
model="gpt-4o")
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
},
]
tma_in = LLMUserResponseAggregator(messages)
tma_out = LLMAssistantResponseAggregator(messages)
pipeline = Pipeline([
transport.input(), # Transport user input
stt, # STT
tma_in, # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
tma_out # Assistant spoken responses
])
task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
transport.capture_participant_transcription(participant["id"])
# Kick off the conversation.
messages.append(
{"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMMessagesFrame(messages)])
runner = PipelineRunner()
await runner.run(task)
if __name__ == "__main__":
(url, token) = configure()
asyncio.run(main(url, token))

View File

@@ -66,6 +66,7 @@ async def main(room_url: str, token):
"Pipecat",
DailyParams(
audio_out_enabled=True,
audio_out_sample_rate=44100,
transcription_enabled=True,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer()
@@ -74,17 +75,20 @@ async def main(room_url: str, token):
news_lady = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="bf991597-6c13-47e4-8411-91ec2de5c466", # Newslady
voice_name="Newslady",
output_format="pcm_44100"
)
british_lady = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22", # British Lady
voice_name="British Lady",
output_format="pcm_44100"
)
barbershop_man = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="a0e99841-438c-4a64-b679-ae501e7d6091", # Barbershop Man
voice_name="Barbershop Man",
output_format="pcm_44100"
)
llm = OpenAILLMService(

View File

@@ -1,108 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
import aiohttp
import os
import sys
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.llm_response import (
LLMAssistantResponseAggregator, LLMUserResponseAggregator)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.processors.user_idle_processor import UserIdleProcessor
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.vad.silero import SileroVADAnalyzer
from runner import configure
from loguru import logger
from dotenv import load_dotenv
load_dotenv(override=True)
logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
async def main(room_url: str, token):
async with aiohttp.ClientSession() as session:
transport = DailyTransport(
room_url,
token,
"Respond bot",
DailyParams(
audio_out_enabled=True,
transcription_enabled=True,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer()
)
)
tts = ElevenLabsTTSService(
aiohttp_session=session,
api_key=os.getenv("ELEVENLABS_API_KEY"),
voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
)
llm = OpenAILLMService(
api_key=os.getenv("OPENAI_API_KEY"),
model="gpt-4o")
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
},
]
tma_in = LLMUserResponseAggregator(messages)
tma_out = LLMAssistantResponseAggregator(messages)
async def user_idle_callback(user_idle: UserIdleProcessor):
messages.append(
{"role": "system", "content": "Ask the user if they are still there and try to prompt for some input, but be short."})
await user_idle.queue_frame(LLMMessagesFrame(messages))
user_idle = UserIdleProcessor(callback=user_idle_callback, timeout=5.0)
pipeline = Pipeline([
transport.input(), # Transport user input
user_idle, # Idle user check-in
tma_in, # User responses
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
tma_out # Assistant spoken responses
])
task = PipelineTask(pipeline, PipelineParams(
allow_interruptions=True,
enable_metrics=True,
report_only_initial_ttfb=True,
))
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
transport.capture_participant_transcription(participant["id"])
# Kick off the conversation.
messages.append(
{"role": "system", "content": "Please introduce yourself to the user."})
await task.queue_frames([LLMMessagesFrame(messages)])
runner = PipelineRunner()
await runner.run(task)
if __name__ == "__main__":
(url, token) = configure()
asyncio.run(main(url, token))

View File

@@ -1,4 +1,4 @@
FROM python:3.11-slim-bookworm
FROM python:3.11-bullseye
ARG DEBIAN_FRONTEND=noninteractive
ARG USE_PERSISTENT_DATA
@@ -51,4 +51,4 @@ COPY --chown=user ./frontend/ frontend/
RUN cd frontend && npm install && npm run build
# Start the FastAPI server
CMD python3 src/bot_runner.py --port ${FAST_API_PORT}
CMD python3 src/server.py --port ${FAST_API_PORT}

View File

@@ -48,8 +48,6 @@ pip install -r requirements.txt
mv env.example .env
```
When deploying to production, to ensure only this app can spawn a new bot, set your `ENV` to `production`
**Build the frontend:**
This project uses a custom frontend, which needs to be built. Note: this is done automatically as part of the Docker deployment.
@@ -66,11 +64,11 @@ The build UI files can be found in `frontend/out`
Start the API / bot manager:
`python src/bot_runner.py`
`python src/server.py`
If you'd like to run a custom domain or port:
`python src/bot_runner.py --host somehost --p someport`
`python src/server.py --host somehost --p 7777`
➡️ Open the host URL in your browser `http://localhost:7860`

View File

@@ -1,9 +1,5 @@
DAILY_API_KEY=
DAILY_SAMPLE_ROOM_URL=
ELEVENLABS_API_KEY=
ELEVENLABS_VOICE_ID=
FAL_KEY=
OPENAI_API_KEY=
ENV= # dev | production
RUN_AS_VM= # Set this to launch each bot in a new VM (leave unset to run bots as local subprocesses)
DAILY_API_KEY=7df...
ELEVENLABS_API_KEY=aeb...
ELEVENLABS_VOICE_ID=7S...
FAL_KEY=8c...
OPENAI_API_KEY=sk-PL...

View File

@@ -27,11 +27,14 @@ export default function Call() {
// Create a new room for the story session
try {
const response = await fetch("/start_bot", {
const response = await fetch("/create", {
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({
room_url: process.env.NEXT_PUBLIC_ROOM_URL || null,
}),
});
const { room_url, token } = await response.json();
@@ -52,9 +55,21 @@ export default function Call() {
// Disable local audio, the bot will say hello first
daily.setLocalAudio(false);
// Start the bot
const resp = await fetch("/start", {
method: "POST",
headers: {
"Content-Type": "application/json",
},
body: JSON.stringify({
room_url,
}),
});
setState("started");
} catch (error) {
setState("error");
leave();
}
}
@@ -64,13 +79,7 @@ export default function Call() {
}
if (state === "error") {
return (
<div className="flex items-center mx-auto">
<p className="text-red-500 font-semibold bg-white px-4 py-2 shadow-xl rounded-lg">
This demo is currently at capacity. Please try again later.
</p>
</div>
);
return <div>An error occurred</div>;
}
if (state === "started") {

View File

@@ -108,26 +108,26 @@ export default function DevicePicker({}: Props) {
{hasMicError && (
<div className="error">
{micState === "blocked" ? (
<p className="text-red-500">
<p>
Please check your browser and system permissions. Make sure that
this app is allowed to access your microphone.
</p>
) : micState === "in-use" ? (
<p className="text-red-500">
<p>
Your microphone is being used by another app. Please close any
other apps using your microphone and restart this app.
</p>
) : micState === "not-found" ? (
<p className="text-red-500">
<p>
No microphone seems to be connected. Please connect a microphone.
</p>
) : micState === "not-supported" ? (
<p className="text-red-500">
<p>
This app is not supported on your device. Please update your
software or use a different device.
</p>
) : (
<p className="text-red-500">
<p>
There seems to be an issue accessing your microphone. Try
restarting the app or consult a system administrator.
</p>

View File

@@ -1,7 +1,7 @@
import React from "react";
import { Button } from "@/components/ui/button";
import DevicePicker from "@/components/DevicePicker";
import { IconAlertCircle, IconEar, IconLoader2 } from "@tabler/icons-react";
import { IconEar, IconLoader2 } from "@tabler/icons-react";
type SetupProps = {
handleStart: () => void;
@@ -24,6 +24,7 @@ export const Setup: React.FC<SetupProps> = ({ handleStart }) => {
<h1 className="text-4xl font-bold text-pretty tracking-tighter mb-4">
Welcome to <span className="text-sky-500">Storytime</span>
</h1>
{state === "intro" ? (
<>
<p className="text-gray-600 leading-relaxed text-pretty">
@@ -37,9 +38,6 @@ export const Setup: React.FC<SetupProps> = ({ handleStart }) => {
<IconEar size={24} /> For best results, try in a quiet
environment!
</p>
<p className="flex flex-row gap-2 text-gray-600 font-medium text-red-500">
<IconAlertCircle size={24} /> This demo expires after 5 minutes.
</p>
</>
) : (
<>
@@ -51,6 +49,7 @@ export const Setup: React.FC<SetupProps> = ({ handleStart }) => {
<DevicePicker />
</>
)}
<hr className="border-gray-150 my-2" />
<Button

View File

@@ -1 +1,2 @@
NEXT_PUBLIC_ROOM_URL=
SITE_URL=

View File

@@ -899,11 +899,11 @@ brace-expansion@^2.0.1:
balanced-match "^1.0.0"
braces@^3.0.2, braces@~3.0.2:
version "3.0.3"
resolved "https://registry.yarnpkg.com/braces/-/braces-3.0.3.tgz#490332f40919452272d55a8480adc0c441358789"
integrity "sha1-SQMy9AkZRSJy1VqEgK3AxEE1h4k= sha512-yQbXgO/OSZVD2IsiLlro+7Hf6Q18EJrKSEsdoMzKePKXct3gvD8oLcOQdIzGupr5Fj+EDe8gO/lxc1BzfMpxvA=="
version "3.0.2"
resolved "https://registry.yarnpkg.com/braces/-/braces-3.0.2.tgz#3454e1a462ee8d599e236df336cd9ea4f8afe107"
integrity sha512-b8um+L1RzM3WDSzvhm6gIz1yfTbBt6YTlcEKAvsmqCZZFw46z626lVj9j1yEPW33H5H+lBQpZMP1k8l+78Ha0A==
dependencies:
fill-range "^7.1.1"
fill-range "^7.0.1"
browserslist@^4.23.0:
version "4.23.0"
@@ -1551,10 +1551,10 @@ file-entry-cache@^6.0.1:
dependencies:
flat-cache "^3.0.4"
fill-range@^7.1.1:
version "7.1.1"
resolved "https://registry.yarnpkg.com/fill-range/-/fill-range-7.1.1.tgz#44265d3cac07e3ea7dc247516380643754a05292"
integrity "sha1-RCZdPKwH4+p9wkdRY4BkN1SgUpI= sha512-YsGpe3WHLK8ZYi4tWDg2Jy3ebRz2rXowDxnld4bkQB00cc/1Zw9AWnC0i9ztDJitivtQvaI9KaLyKrc+hBW0yg=="
fill-range@^7.0.1:
version "7.0.1"
resolved "https://registry.yarnpkg.com/fill-range/-/fill-range-7.0.1.tgz#1919a6a7c75fe38b2c7c77e5198535da9acdda40"
integrity sha512-qOo9F+dMUmC2Lcb4BbVvnKJxTPjCm+RRpe4gDuGrzkL7mEVl/djYSu2OdQ2Pa302N4oqkSg9ir6jaLWJ2USVpQ==
dependencies:
to-regex-range "^5.0.1"

View File

@@ -5,7 +5,7 @@ import os
import sys
from pipecat.frames.frames import LLMMessagesFrame, StopTaskFrame, EndFrame
from pipecat.frames.frames import LLMMessagesFrame, StopTaskFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
@@ -139,16 +139,6 @@ async def main(room_url, token=None):
main_task = PipelineTask(main_pipeline)
@transport.event_handler("on_participant_left")
async def on_participant_left(transport, participant, reason):
intro_task.queue_frame(EndFrame())
await main_task.queue_frame(EndFrame())
@transport.event_handler("on_call_state_updated")
async def on_call_state_updated(transport, state):
if state == "left":
await main_task.queue_frame(EndFrame())
await runner.run(main_task)
if __name__ == "__main__":

View File

@@ -1,233 +0,0 @@
import os
import argparse
import subprocess
import requests
from pathlib import Path
from typing import Optional
from fastapi import FastAPI, Request, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse, JSONResponse
from pipecat.transports.services.helpers.daily_rest import DailyRESTHelper, DailyRoomObject, DailyRoomProperties, DailyRoomParams
from dotenv import load_dotenv
load_dotenv(override=True)
# ------------ Fast API Config ------------ #
MAX_SESSION_TIME = 5 * 60 # 5 minutes
daily_rest_helper = DailyRESTHelper(
os.getenv("DAILY_API_KEY", ""),
os.getenv("DAILY_API_URL", 'https://api.daily.co/v1'))
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Mount the static directory
STATIC_DIR = "frontend/out"
# ------------ Fast API Routes ------------ #
app.mount("/static", StaticFiles(directory=STATIC_DIR, html=True), name="static")
@app.post("/start_bot")
async def start_bot(request: Request) -> JSONResponse:
if os.getenv("ENV", "dev") == "production":
# Only allow requests from the specified domain
host_header = request.headers.get("host")
allowed_domains = ["storytelling-chatbot.fly.dev", "www.storytelling-chatbot.fly.dev"]
# Check if the Host header matches the allowed domain
if host_header not in allowed_domains:
raise HTTPException(status_code=403, detail="Access denied")
try:
data = await request.json()
# Is this a webhook creation request?
if "test" in data:
return JSONResponse({"test": True})
except Exception as e:
pass
# Use specified room URL, or create a new one if not specified
room_url = os.getenv("DAILY_SAMPLE_ROOM_URL", "")
if not room_url:
params = DailyRoomParams(
properties=DailyRoomProperties()
)
try:
room: DailyRoomObject = daily_rest_helper.create_room(params=params)
except Exception as e:
raise HTTPException(
status_code=500,
detail=f"Unable to provision room {e}")
else:
# Check passed room URL exists, we should assume that it already has a sip set up
try:
room: DailyRoomObject = daily_rest_helper.get_room_from_url(room_url)
except Exception:
raise HTTPException(
status_code=500, detail=f"Room not found: {room_url}")
# Give the agent a token to join the session
token = daily_rest_helper.get_token(room.url, MAX_SESSION_TIME)
if not room or not token:
raise HTTPException(
status_code=500, detail=f"Failed to get token for room: {room_url}")
# Launch a new VM, or run as a shell process (not recommended)
if os.getenv("RUN_AS_VM", False):
try:
virtualize_bot(room.url, token)
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Failed to spawn VM: {e}")
else:
try:
subprocess.Popen(
[f"python3 -m bot -u {room.url} -t {token}"],
shell=True,
bufsize=1,
cwd=os.path.dirname(os.path.abspath(__file__)))
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Failed to start subprocess: {e}")
# Grab a token for the user to join with
user_token = daily_rest_helper.get_token(room.url, MAX_SESSION_TIME)
return JSONResponse({
"room_url": room.url,
"token": user_token,
})
@app.get("/{path_name:path}", response_class=FileResponse)
async def catch_all(path_name: Optional[str] = ""):
if path_name == "":
return FileResponse(f"{STATIC_DIR}/index.html")
file_path = Path(STATIC_DIR) / (path_name or "")
if file_path.is_file():
return file_path
html_file_path = file_path.with_suffix(".html")
if html_file_path.is_file():
return FileResponse(html_file_path)
raise HTTPException(status_code=450, detail="Incorrect API call")
# ------------ Virtualization ------------ #
def virtualize_bot(room_url: str, token: str):
"""
This is an example of how to virtualize the bot using Fly.io
You can adapt this method to use whichever cloud provider you prefer.
"""
FLY_API_HOST = os.getenv("FLY_API_HOST", "https://api.machines.dev/v1")
FLY_APP_NAME = os.getenv("FLY_APP_NAME", "storytelling-chatbot")
FLY_API_KEY = os.getenv("FLY_API_KEY", "")
FLY_HEADERS = {
'Authorization': f"Bearer {FLY_API_KEY}",
'Content-Type': 'application/json'
}
# Use the same image as the bot runner
res = requests.get(f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines", headers=FLY_HEADERS)
if res.status_code != 200:
raise Exception(f"Unable to get machine info from Fly: {res.text}")
image = res.json()[0]['config']['image']
# Machine configuration
cmd = f"python3 src/bot.py -u {room_url} -t {token}"
cmd = cmd.split()
worker_props = {
"config": {
"image": image,
"auto_destroy": True,
"init": {
"cmd": cmd
},
"restart": {
"policy": "no"
},
"guest": {
"cpu_kind": "shared",
"cpus": 1,
"memory_mb": 512
}
},
}
# Spawn a new machine instance
res = requests.post(
f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines",
headers=FLY_HEADERS,
json=worker_props)
if res.status_code != 200:
raise Exception(f"Problem starting a bot worker: {res.text}")
# Wait for the machine to enter the started state
vm_id = res.json()['id']
res = requests.get(
f"{FLY_API_HOST}/apps/{FLY_APP_NAME}/machines/{vm_id}/wait?state=started",
headers=FLY_HEADERS)
if res.status_code != 200:
raise Exception(f"Bot was unable to enter started state: {res.text}")
print(f"Machine joined room: {room_url}")
# ------------ Main ------------ #
if __name__ == "__main__":
# Check environment variables
required_env_vars = ['OPENAI_API_KEY', 'DAILY_API_KEY',
'FAL_KEY', 'ELEVENLABS_VOICE_ID', 'ELEVENLABS_API_KEY']
for env_var in required_env_vars:
if env_var not in os.environ:
raise Exception(f"Missing environment variable: {env_var}.")
import uvicorn
default_host = os.getenv("HOST", "0.0.0.0")
default_port = int(os.getenv("FAST_API_PORT", "7860"))
parser = argparse.ArgumentParser(
description="Daily Storyteller FastAPI server")
parser.add_argument("--host", type=str,
default=default_host, help="Host address")
parser.add_argument("--port", type=int,
default=default_port, help="Port number")
parser.add_argument("--reload", action="store_true",
help="Reload code on change")
config = parser.parse_args()
uvicorn.run(
"bot_runner:app",
host=config.host,
port=config.port,
reload=config.reload
)

View File

@@ -0,0 +1,175 @@
import os
import argparse
import subprocess
import atexit
from pathlib import Path
from typing import Optional
from fastapi import FastAPI, Request, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles
from fastapi.responses import FileResponse, JSONResponse
from utils.daily_helpers import create_room as _create_room, get_token, get_name_from_url
MAX_BOTS_PER_ROOM = 1
# Bot sub-process dict for status reporting and concurrency control
bot_procs = {}
def cleanup():
# Clean up function, just to be extra safe
# bot_procs values are (process, room_url) tuples
for proc, _room_url in bot_procs.values():
proc.terminate()
proc.wait()
atexit.register(cleanup)
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# Mount the static directory
STATIC_DIR = "frontend/out"
app.mount("/static", StaticFiles(directory=STATIC_DIR, html=True), name="static")
@app.post("/create")
async def create_room(request: Request) -> JSONResponse:
data = await request.json()
if data.get('room_url') is not None:
room_url = data.get('room_url')
room_name = get_name_from_url(room_url)
else:
room_url, room_name = _create_room()
token = get_token(room_url)
return JSONResponse({"room_url": room_url, "room_name": room_name, "token": token})
@app.post("/start")
async def start_agent(request: Request) -> JSONResponse:
data = await request.json()
# Is this a webhook creation request?
if "test" in data:
return JSONResponse({"test": True})
# Ensure the room property is present
room_url = data.get('room_url')
if not room_url:
raise HTTPException(
status_code=500,
detail="Missing 'room_url' property in request data. Cannot start agent without a target room!")
# Check if there is already an existing process running in this room
num_bots_in_room = sum(
1 for proc in bot_procs.values() if proc[1] == room_url and proc[0].poll() is None)
if num_bots_in_room >= MAX_BOTS_PER_ROOM:
raise HTTPException(
status_code=500, detail=f"Max bot limit reached for room: {room_url}")
# Get the token for the room
token = get_token(room_url)
if not token:
raise HTTPException(
status_code=500, detail=f"Failed to get token for room: {room_url}")
# Spawn a new agent, and join the user session
# Note: this is mostly for demonstration purposes (refer to 'deployment' in README)
try:
proc = subprocess.Popen(
[
f"python3 -m bot -u {room_url} -t {token}"
],
shell=True,
bufsize=1,
cwd=os.path.dirname(os.path.abspath(__file__))
)
bot_procs[proc.pid] = (proc, room_url)
except Exception as e:
raise HTTPException(
status_code=500, detail=f"Failed to start subprocess: {e}")
return JSONResponse({"bot_id": proc.pid, "room_url": room_url})
@app.get("/status/{pid}")
def get_status(pid: int):
# Look up the subprocess
proc = bot_procs.get(pid)
# If the subprocess doesn't exist, return an error
if not proc:
raise HTTPException(
status_code=404, detail=f"Bot with process id: {pid} not found")
# Check the status of the subprocess
if proc[0].poll() is None:
status = "running"
else:
status = "finished"
return JSONResponse({"bot_id": pid, "status": status})
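
Review note: the `/start` and `/status` handlers above rely on `bot_procs` mapping a PID to a `(process, room_url)` tuple, and treat a bot as running while `poll()` returns `None`. The per-room limit check can be sketched on its own as follows. `count_running_bots` and `FakeProc` are hypothetical names used only for illustration; `FakeProc` stands in for `subprocess.Popen` so the sketch runs without spawning anything.

```python
from typing import Dict, Optional, Tuple


class FakeProc:
    """Stand-in for subprocess.Popen: poll() returns None while 'running'."""

    def __init__(self, returncode: Optional[int]):
        self.returncode = returncode

    def poll(self) -> Optional[int]:
        return self.returncode


def count_running_bots(bot_procs: Dict[int, Tuple[FakeProc, str]], room_url: str) -> int:
    """Count registry entries for `room_url` whose process has not exited."""
    return sum(
        1
        for proc, url in bot_procs.values()
        if url == room_url and proc.poll() is None
    )


# Usage: one live bot and one finished bot in room "a", one live bot in room "b".
registry = {
    101: (FakeProc(None), "a"),
    102: (FakeProc(0), "a"),
    103: (FakeProc(None), "b"),
}
running_in_a = count_running_bots(registry, "a")
```

Unpacking the tuple in the loop keeps the process and the room URL clearly separated, which avoids indexing mistakes such as calling process methods on the tuple itself.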
@app.get("/{path_name:path}", response_class=FileResponse)
async def catch_all(path_name: Optional[str] = ""):
if path_name == "":
return FileResponse(f"{STATIC_DIR}/index.html")
file_path = Path(STATIC_DIR) / (path_name or "")
if file_path.is_file():
return file_path
html_file_path = file_path.with_suffix(".html")
if html_file_path.is_file():
return FileResponse(html_file_path)
raise HTTPException(status_code=450, detail="Incorrect API call")
if __name__ == "__main__":
# Check environment variables
required_env_vars = ['OPENAI_API_KEY', 'DAILY_API_KEY',
'FAL_KEY', 'ELEVENLABS_VOICE_ID', 'ELEVENLABS_API_KEY']
for env_var in required_env_vars:
if env_var not in os.environ:
raise Exception(f"Missing environment variable: {env_var}.")
import uvicorn
default_host = os.getenv("HOST", "0.0.0.0")
default_port = int(os.getenv("FAST_API_PORT", "7860"))
parser = argparse.ArgumentParser(
description="Daily Storyteller FastAPI server")
parser.add_argument("--host", type=str,
default=default_host, help="Host address")
parser.add_argument("--port", type=int,
default=default_port, help="Port number")
parser.add_argument("--reload", action="store_true",
help="Reload code on change")
config = parser.parse_args()
uvicorn.run(
"server:app",
host=config.host,
port=config.port,
reload=config.reload
)

View File

@@ -15,7 +15,6 @@ from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.transports.network.fastapi_websocket import FastAPIWebsocketTransport, FastAPIWebsocketParams
from pipecat.vad.silero import SileroVADAnalyzer
from pipecat.serializers.twilio import TwilioFrameSerializer
from loguru import logger
@@ -26,7 +25,7 @@ logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
async def run_bot(websocket_client, stream_sid):
async def run_bot(websocket_client):
async with aiohttp.ClientSession() as session:
transport = FastAPIWebsocketTransport(
websocket=websocket_client,
@@ -35,8 +34,7 @@ async def run_bot(websocket_client, stream_sid):
add_wav_header=False,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(),
vad_audio_passthrough=True,
serializer=TwilioFrameSerializer(stream_sid)
vad_audio_passthrough=True
)
)

View File

@@ -1,5 +1,3 @@
import json
import uvicorn
from fastapi import FastAPI, WebSocket
@@ -28,13 +26,8 @@ async def start_call():
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
await websocket.accept()
start_data = websocket.iter_text()
await start_data.__anext__()
call_data = json.loads(await start_data.__anext__())
print(call_data, flush=True)
stream_sid = call_data['start']['streamSid']
print("WebSocket connection accepted")
await run_bot(websocket, stream_sid)
await run_bot(websocket)
if __name__ == "__main__":
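
The Twilio variant of the websocket endpoint above skips the first text message and reads `start.streamSid` from the second one. A minimal sketch of that parsing step, assuming the message is a JSON object shaped like the one the endpoint indexes (`call_data['start']['streamSid']`), is shown below. The helper name `extract_stream_sid` is hypothetical, and the `event` check is an added safeguard rather than something the original code does.

```python
import json
from typing import Optional


def extract_stream_sid(raw_message: str) -> Optional[str]:
    """Return the stream SID from a 'start' message, or None for other messages."""
    data = json.loads(raw_message)
    if data.get("event") != "start":
        # Not the start message (for example the initial 'connected' message).
        return None
    return data.get("start", {}).get("streamSid")


if __name__ == "__main__":
    sample = json.dumps({"event": "start", "start": {"streamSid": "MZ123"}})
    print(extract_stream_sid(sample))
```

Returning `None` instead of raising lets the caller decide whether to keep waiting for the start message or to close the connection.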

View File

@@ -4,7 +4,7 @@
#
# pip-compile --all-extras pyproject.toml
#
aiofiles==24.1.0
aiofiles==23.2.1
# via deepgram-sdk
aiohttp==3.9.5
# via
@@ -17,7 +17,7 @@ aiosignal==1.3.1
# via aiohttp
annotated-types==0.7.0
# via pydantic
anthropic==0.28.1
anthropic==0.25.9
# via
# openpipe
# pipecat-ai (pyproject.toml)
@@ -36,21 +36,23 @@ attrs==23.2.0
# via
# aiohttp
# openpipe
av==12.2.0
av==12.1.0
# via faster-whisper
azure-cognitiveservices-speech==1.38.0
azure-cognitiveservices-speech==1.37.0
# via pipecat-ai (pyproject.toml)
blinker==1.8.2
# via flask
cachetools==5.3.3
# via google-auth
cartesia==1.0.3
cartesia==0.1.1
# via pipecat-ai (pyproject.toml)
certifi==2024.6.2
# via
# httpcore
# httpx
# requests
cffi==1.16.0
# via sounddevice
charset-normalizer==3.3.2
# via requests
click==8.1.7
@@ -62,7 +64,7 @@ coloredlogs==15.0.1
# via onnxruntime
ctranslate2==4.3.1
# via faster-whisper
daily-python==0.10.1
daily-python==0.10.0
# via pipecat-ai (pyproject.toml)
dataclasses-json==0.6.7
# via
@@ -84,15 +86,15 @@ exceptiongroup==1.2.1
# via
# anyio
# pytest
fal-client==0.4.1
fal-client==0.4.0
# via pipecat-ai (pyproject.toml)
fastapi==0.111.0
# via pipecat-ai (pyproject.toml)
fastapi-cli==0.0.4
# via fastapi
faster-whisper==1.0.3
faster-whisper==1.0.2
# via pipecat-ai (pyproject.toml)
filelock==3.15.4
filelock==3.15.3
# via
# huggingface-hub
# pyht
@@ -111,22 +113,22 @@ frozenlist==1.4.1
# via
# aiohttp
# aiosignal
fsspec==2024.6.1
fsspec==2024.6.0
# via
# huggingface-hub
# torch
future==1.0.0
# via pyloudnorm
google-ai-generativelanguage==0.6.6
google-ai-generativelanguage==0.6.4
# via google-generativeai
google-api-core[grpc]==2.19.1
google-api-core[grpc]==2.19.0
# via
# google-ai-generativelanguage
# google-api-python-client
# google-generativeai
google-api-python-client==2.135.0
google-api-python-client==2.134.0
# via google-generativeai
google-auth==2.31.0
google-auth==2.30.0
# via
# google-ai-generativelanguage
# google-api-core
@@ -135,9 +137,9 @@ google-auth==2.31.0
# google-generativeai
google-auth-httplib2==0.2.0
# via google-api-python-client
google-generativeai==0.7.1
google-generativeai==0.5.4
# via pipecat-ai (pyproject.toml)
googleapis-common-protos==1.63.2
googleapis-common-protos==1.63.1
# via
# google-api-core
# grpcio-status
@@ -197,35 +199,31 @@ jinja2==3.1.4
# fastapi
# flask
# torch
jiter==0.5.0
# via anthropic
jsonpatch==1.33
# via langchain-core
jsonpointer==3.0.0
# via jsonpatch
langchain==0.2.6
langchain==0.2.5
# via
# langchain-community
# pipecat-ai (pyproject.toml)
langchain-community==0.2.6
langchain-community==0.2.5
# via pipecat-ai (pyproject.toml)
langchain-core==0.2.10
langchain-core==0.2.9
# via
# langchain
# langchain-community
# langchain-openai
# langchain-text-splitters
langchain-openai==0.1.10
langchain-openai==0.1.9
# via pipecat-ai (pyproject.toml)
langchain-text-splitters==0.2.2
langchain-text-splitters==0.2.1
# via langchain
langsmith==0.1.83
langsmith==0.1.81
# via
# langchain
# langchain-community
# langchain-core
llvmlite==0.43.0
# via numba
loguru==0.7.2
# via pipecat-ai (pyproject.toml)
markdown-it-py==3.0.0
@@ -248,18 +246,14 @@ mypy-extensions==1.0.0
# via typing-inspect
networkx==3.3
# via torch
numba==0.60.0
# via resampy
numpy==1.26.4
# via
# ctranslate2
# langchain
# langchain-community
# numba
# onnxruntime
# pipecat-ai (pyproject.toml)
# pyloudnorm
# resampy
# scipy
# torchvision
# transformers
@@ -288,20 +282,20 @@ nvidia-cusparse-cu12==12.1.0.106
# torch
nvidia-nccl-cu12==2.20.5
# via torch
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvjitlink-cu12==12.5.40
# via
# nvidia-cusolver-cu12
# nvidia-cusparse-cu12
nvidia-nvtx-cu12==12.1.105
# via torch
onnxruntime==1.18.1
onnxruntime==1.18.0
# via faster-whisper
openai==1.27.0
openai==1.26.0
# via
# langchain-openai
# openpipe
# pipecat-ai (pyproject.toml)
openpipe==4.16.0
openpipe==4.14.0
# via pipecat-ai (pyproject.toml)
orjson==3.10.5
# via
@@ -344,7 +338,9 @@ pyasn1-modules==0.4.0
# via google-auth
pyaudio==0.2.14
# via pipecat-ai (pyproject.toml)
pydantic==2.8.0
pycparser==2.22
# via cffi
pydantic==2.7.4
# via
# anthropic
# fastapi
@@ -353,7 +349,7 @@ pydantic==2.8.0
# langchain-core
# langsmith
# openai
pydantic-core==2.20.0
pydantic-core==2.18.4
# via pydantic
pygments==2.18.0
# via rich
@@ -400,8 +396,6 @@ requests==2.32.3
# pyht
# tiktoken
# transformers
resampy==0.4.3
# via pipecat-ai (pyproject.toml)
rich==13.7.1
# via typer
rsa==4.9
@@ -410,7 +404,7 @@ safetensors==0.4.3
# via
# timm
# transformers
scipy==1.14.0
scipy==1.13.1
# via pyloudnorm
shellingham==1.5.4
# via typer
@@ -422,6 +416,8 @@ sniffio==1.3.1
# anyio
# httpx
# openai
sounddevice==0.4.7
# via pipecat-ai (pyproject.toml)
sqlalchemy==2.0.31
# via
# langchain
@@ -432,7 +428,7 @@ sympy==1.12.1
# via
# onnxruntime
# torch
tenacity==8.4.2
tenacity==8.4.1
# via
# langchain
# langchain-community

View File

@@ -1,10 +1,10 @@
#
# This file is autogenerated by pip-compile with Python 3.10
# This file is autogenerated by pip-compile with Python 3.12
# by the following command:
#
# pip-compile --all-extras pyproject.toml
#
aiofiles==24.1.0
aiofiles==23.2.1
# via deepgram-sdk
aiohttp==3.9.5
# via
@@ -17,7 +17,7 @@ aiosignal==1.3.1
# via aiohttp
annotated-types==0.7.0
# via pydantic
anthropic==0.28.1
anthropic==0.25.9
# via
# openpipe
# pipecat-ai (pyproject.toml)
@@ -28,29 +28,27 @@ anyio==4.4.0
# openai
# starlette
# watchfiles
async-timeout==4.0.3
# via
# aiohttp
# langchain
attrs==23.2.0
# via
# aiohttp
# openpipe
av==12.2.0
av==12.1.0
# via faster-whisper
azure-cognitiveservices-speech==1.38.0
azure-cognitiveservices-speech==1.37.0
# via pipecat-ai (pyproject.toml)
blinker==1.8.2
# via flask
cachetools==5.3.3
# via google-auth
cartesia==1.0.3
cartesia==0.1.1
# via pipecat-ai (pyproject.toml)
certifi==2024.6.2
# via
# httpcore
# httpx
# requests
cffi==1.16.0
# via sounddevice
charset-normalizer==3.3.2
# via requests
click==8.1.7
@@ -62,7 +60,7 @@ coloredlogs==15.0.1
# via onnxruntime
ctranslate2==4.3.1
# via faster-whisper
daily-python==0.10.1
daily-python==0.10.0
# via pipecat-ai (pyproject.toml)
dataclasses-json==0.6.7
# via
@@ -80,19 +78,15 @@ einops==0.8.0
# via pipecat-ai (pyproject.toml)
email-validator==2.2.0
# via fastapi
exceptiongroup==1.2.1
# via
# anyio
# pytest
fal-client==0.4.1
fal-client==0.4.0
# via pipecat-ai (pyproject.toml)
fastapi==0.111.0
# via pipecat-ai (pyproject.toml)
fastapi-cli==0.0.4
# via fastapi
faster-whisper==1.0.3
faster-whisper==1.0.2
# via pipecat-ai (pyproject.toml)
filelock==3.15.4
filelock==3.15.3
# via
# huggingface-hub
# pyht
@@ -110,22 +104,22 @@ frozenlist==1.4.1
# via
# aiohttp
# aiosignal
fsspec==2024.6.1
fsspec==2024.6.0
# via
# huggingface-hub
# torch
future==1.0.0
# via pyloudnorm
google-ai-generativelanguage==0.6.6
google-ai-generativelanguage==0.6.4
# via google-generativeai
google-api-core[grpc]==2.19.1
google-api-core[grpc]==2.19.0
# via
# google-ai-generativelanguage
# google-api-python-client
# google-generativeai
google-api-python-client==2.135.0
google-api-python-client==2.134.0
# via google-generativeai
google-auth==2.31.0
google-auth==2.30.0
# via
# google-ai-generativelanguage
# google-api-core
@@ -134,9 +128,9 @@ google-auth==2.31.0
# google-generativeai
google-auth-httplib2==0.2.0
# via google-api-python-client
google-generativeai==0.7.1
google-generativeai==0.5.4
# via pipecat-ai (pyproject.toml)
googleapis-common-protos==1.63.2
googleapis-common-protos==1.63.1
# via
# google-api-core
# grpcio-status
@@ -194,35 +188,31 @@ jinja2==3.1.4
# fastapi
# flask
# torch
jiter==0.5.0
# via anthropic
jsonpatch==1.33
# via langchain-core
jsonpointer==3.0.0
# via jsonpatch
langchain==0.2.6
langchain==0.2.5
# via
# langchain-community
# pipecat-ai (pyproject.toml)
langchain-community==0.2.6
langchain-community==0.2.5
# via pipecat-ai (pyproject.toml)
langchain-core==0.2.10
langchain-core==0.2.9
# via
# langchain
# langchain-community
# langchain-openai
# langchain-text-splitters
langchain-openai==0.1.10
langchain-openai==0.1.9
# via pipecat-ai (pyproject.toml)
langchain-text-splitters==0.2.2
langchain-text-splitters==0.2.1
# via langchain
langsmith==0.1.83
langsmith==0.1.81
# via
# langchain
# langchain-community
# langchain-core
llvmlite==0.43.0
# via numba
loguru==0.7.2
# via pipecat-ai (pyproject.toml)
markdown-it-py==3.0.0
@@ -245,29 +235,25 @@ mypy-extensions==1.0.0
# via typing-inspect
networkx==3.3
# via torch
numba==0.60.0
# via resampy
numpy==1.26.4
# via
# ctranslate2
# langchain
# langchain-community
# numba
# onnxruntime
# pipecat-ai (pyproject.toml)
# pyloudnorm
# resampy
# scipy
# torchvision
# transformers
onnxruntime==1.18.1
onnxruntime==1.18.0
# via faster-whisper
openai==1.27.0
openai==1.26.0
# via
# langchain-openai
# openpipe
# pipecat-ai (pyproject.toml)
openpipe==4.16.0
openpipe==4.14.0
# via pipecat-ai (pyproject.toml)
orjson==3.10.5
# via
@@ -310,7 +296,9 @@ pyasn1-modules==0.4.0
# via google-auth
pyaudio==0.2.14
# via pipecat-ai (pyproject.toml)
pydantic==2.8.0
pycparser==2.22
# via cffi
pydantic==2.7.4
# via
# anthropic
# fastapi
@@ -319,7 +307,7 @@ pydantic==2.8.0
# langchain-core
# langsmith
# openai
pydantic-core==2.20.0
pydantic-core==2.18.4
# via pydantic
pygments==2.18.0
# via rich
@@ -366,8 +354,6 @@ requests==2.32.3
# pyht
# tiktoken
# transformers
resampy==0.4.3
# via pipecat-ai (pyproject.toml)
rich==13.7.1
# via typer
rsa==4.9
@@ -376,7 +362,7 @@ safetensors==0.4.3
# via
# timm
# transformers
scipy==1.14.0
scipy==1.13.1
# via pyloudnorm
shellingham==1.5.4
# via typer
@@ -388,6 +374,8 @@ sniffio==1.3.1
# anyio
# httpx
# openai
sounddevice==0.4.7
# via pipecat-ai (pyproject.toml)
sqlalchemy==2.0.31
# via
# langchain
@@ -398,7 +386,7 @@ sympy==1.12.1
# via
# onnxruntime
# torch
tenacity==8.4.2
tenacity==8.4.1
# via
# langchain
# langchain-community
@@ -412,8 +400,6 @@ tokenizers==0.19.1
# anthropic
# faster-whisper
# transformers
tomli==2.0.1
# via pytest
torch==2.3.1
# via
# pipecat-ai (pyproject.toml)
@@ -437,7 +423,6 @@ typer==0.12.3
typing-extensions==4.12.2
# via
# anthropic
# anyio
# deepgram-sdk
# fastapi
# google-generativeai
@@ -450,7 +435,6 @@ typing-extensions==4.12.2
# torch
# typer
# typing-inspect
# uvicorn
typing-inspect==0.9.0
# via dataclasses-json
ujson==5.10.0

View File

@@ -8,7 +8,7 @@ dynamic = ["version"]
description = "An open source framework for voice (and multimodal) assistants"
license = { text = "BSD 2-Clause License" }
readme = "README.md"
requires-python = ">=3.10"
requires-python = ">=3.7"
keywords = ["webrtc", "audio", "video", "ai"]
classifiers = [
"Development Status :: 5 - Production/Stable",
@@ -34,26 +34,24 @@ Source = "https://github.com/pipecat-ai/pipecat"
Website = "https://pipecat.ai"
[project.optional-dependencies]
anthropic = [ "anthropic~=0.28.1" ]
azure = [ "azure-cognitiveservices-speech~=1.38.0" ]
cartesia = [ "websockets~=12.0" ]
daily = [ "daily-python~=0.10.1" ]
anthropic = [ "anthropic~=0.25.7" ]
azure = [ "azure-cognitiveservices-speech~=1.37.0" ]
cartesia = [ "numpy~=1.26.0", "sounddevice", "cartesia" ]
daily = [ "daily-python~=0.10.0" ]
deepgram = [ "deepgram-sdk~=3.2.7" ]
examples = [ "python-dotenv~=1.0.0", "flask~=3.0.3", "flask_cors~=4.0.1" ]
fal = [ "fal-client~=0.4.1" ]
gladia = [ "websockets~=12.0" ]
google = [ "google-generativeai~=0.7.1" ]
fireworks = [ "openai~=1.27.0" ]
langchain = [ "langchain~=0.2.6", "langchain-community~=0.2.6", "langchain-openai~=0.1.10" ]
fal = [ "fal-client~=0.4.0" ]
google = [ "google-generativeai~=0.5.3" ]
fireworks = [ "openai~=1.26.0" ]
langchain = [ "langchain~=0.2.1", "langchain-community~=0.2.1", "langchain-openai~=0.1.8" ]
local = [ "pyaudio~=0.2.0" ]
moondream = [ "einops~=0.8.0", "timm~=0.9.16", "transformers~=4.40.2" ]
openai = [ "openai~=1.27.0" ]
openpipe = [ "openpipe~=4.16.0" ]
openai = [ "openai~=1.26.0" ]
openpipe = [ "openpipe~=4.14.0" ]
playht = [ "pyht~=0.0.28" ]
silero = [ "torch~=2.3.1", "torchaudio~=2.3.1" ]
silero = [ "torch~=2.3.0", "torchaudio~=2.3.0" ]
websocket = [ "websockets~=12.0", "fastapi~=0.111.0" ]
whisper = [ "faster-whisper~=1.0.3" ]
xtts = [ "resampy~=0.4.3" ]
whisper = [ "faster-whisper~=1.0.2" ]
[tool.setuptools.packages.find]
# All the following settings are optional:
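
The optional dependencies above use the `~=` (compatible release) operator, which explains why the lock files can pin `anthropic==0.25.9` against `anthropic~=0.25.7`. The following is a simplified sketch of that rule for plain numeric versions only. It ignores pre-releases, epochs and local version labels, and the function names are hypothetical. A real tool would normally rely on a dedicated version-parsing library.

```python
from typing import Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Split a plain numeric version such as '0.25.9' into integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(version: str, spec: str) -> bool:
    """Check `version` against `~=spec` for plain numeric versions.

    `~=X.Y.Z` means: at least X.Y.Z, and the same X.Y prefix.
    """
    lower = _parse(spec)
    if len(lower) < 2:
        raise ValueError("~= needs at least two release segments")
    # Upper bound: drop the last segment and bump the one before it.
    upper = lower[:-2] + (lower[-2] + 1,)
    candidate = _parse(version)
    # Pad with zeros so that '1.26' and '1.26.0' compare as equal.
    width = max(len(candidate), len(lower))
    candidate = candidate + (0,) * (width - len(candidate))
    lower = lower + (0,) * (width - len(lower))
    return lower <= candidate and candidate[: len(upper)] < upper


if __name__ == "__main__":
    print(satisfies_compatible_release("0.25.9", "0.25.7"))
    print(satisfies_compatible_release("0.28.1", "0.25.7"))
```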

View File

@@ -158,34 +158,6 @@ class LLMMessagesFrame(DataFrame):
messages: List[dict]
@dataclass
class LLMMessagesAppendFrame(DataFrame):
"""A frame containing a list of LLM messages that need to be added to the
current context.
"""
messages: List[dict]
@dataclass
class LLMMessagesUpdateFrame(DataFrame):
"""A frame containing a list of new LLM messages. These messages will
replace the current context LLM messages and should generate a new
LLMMessagesFrame.
"""
messages: List[dict]
@dataclass
class TTSSpeakFrame(DataFrame):
"""A frame that contains a text that should be spoken by the TTS in the
pipeline (if any).
"""
text: str
@dataclass
class TransportMessageFrame(DataFrame):
message: Any
@@ -268,33 +240,12 @@ class StopInterruptionFrame(SystemFrame):
pass
@dataclass
class BotInterruptionFrame(SystemFrame):
"""Emitted when the bot should be interrupted. This will mainly cause the
same actions as if the user interrupted except that the
UserStartedSpeakingFrame and UserStoppedSpeakingFrame won't be generated.
"""
pass
@dataclass
class BotSpeakingFrame(SystemFrame):
"""Emitted by transport outputs while the bot is still speaking. This can be
used, for example, to detect when a user is idle. That is, while the bot is
speaking we don't want to trigger any user idle timeout since the user might
be listening.
"""
pass
@dataclass
class MetricsFrame(SystemFrame):
"""Emitted by processors that can compute metrics like latencies.
"""
ttfb: List[Mapping[str, Any]] | None = None
processing: List[Mapping[str, Any]] | None = None
ttfb: Mapping[str, float]
#
# Control frames
@@ -320,13 +271,27 @@ class EndFrame(ControlFrame):
@dataclass
class LLMFullResponseStartFrame(ControlFrame):
"""Used to indicate the beginning of an LLM response. Followed by one or
more TextFrames and a final LLMFullResponseEndFrame."""
"""Used to indicate the beginning of a full LLM response. Followed by
LLMResponseStartFrame, TextFrame and LLMResponseEndFrame for each sentence
until an LLMFullResponseEndFrame."""
pass
@dataclass
class LLMFullResponseEndFrame(ControlFrame):
"""Indicates the end of a full LLM response."""
pass
@dataclass
class LLMResponseStartFrame(ControlFrame):
"""Used to indicate the beginning of an LLM response. Following TextFrames
are part of the LLM response until an LLMResponseEndFrame"""
pass
@dataclass
class LLMResponseEndFrame(ControlFrame):
"""Indicates the end of an LLM response."""
pass
@@ -373,17 +338,3 @@ class UserImageRequestFrame(ControlFrame):
def __str__(self):
return f"{self.name}, user: {self.user_id}"
@dataclass
class LLMModelUpdateFrame(ControlFrame):
"""A control frame containing a request to update to a new LLM model.
"""
model: str
@dataclass
class TTSVoiceUpdateFrame(ControlFrame):
"""A control frame containing a request to update to a new TTS voice.
"""
voice: str

View File

@@ -91,7 +91,5 @@ class Pipeline(BasePipeline):
def _link_processors(self):
prev = self._processors[0]
for curr in self._processors[1:]:
prev.set_parent(self)
prev.link(curr)
prev = curr
prev.set_parent(self)

View File

@@ -15,7 +15,7 @@ from loguru import logger
class PipelineRunner:
def __init__(self, *, name: str | None = None, handle_sigint: bool = True):
def __init__(self, name: str | None = None, handle_sigint: bool = True):
self.id: int = obj_id()
self.name: str = name or f"{self.__class__.__name__}#{obj_count(self)}"

View File

@@ -95,9 +95,8 @@ class PipelineTask:
def _initial_metrics_frame(self) -> MetricsFrame:
processors = self._pipeline.processors_with_metrics()
ttfb = [{"name": p.name, "time": 0.0} for p in processors]
processing = [{"name": p.name, "time": 0.0} for p in processors]
return MetricsFrame(ttfb=ttfb, processing=processing)
ttfb = dict(zip([p.name for p in processors], [0] * len(processors)))
return MetricsFrame(ttfb=ttfb)
async def _process_down_queue(self):
start_frame = StartFrame(

View File

@@ -14,9 +14,9 @@ from pipecat.frames.frames import (
InterimTranscriptionFrame,
LLMFullResponseEndFrame,
LLMFullResponseStartFrame,
LLMMessagesAppendFrame,
LLMResponseEndFrame,
LLMResponseStartFrame,
LLMMessagesFrame,
LLMMessagesUpdateFrame,
StartInterruptionFrame,
TranscriptionFrame,
TextFrame,
@@ -122,19 +122,6 @@ class LLMResponseAggregator(FrameProcessor):
# Reset anyways
self._reset()
await self.push_frame(frame, direction)
elif isinstance(frame, LLMMessagesAppendFrame):
self._messages.extend(frame.messages)
messages_frame = LLMMessagesFrame(self._messages)
await self.push_frame(messages_frame)
elif isinstance(frame, LLMMessagesUpdateFrame):
# We push the frame downstream so the assistant aggregator gets
# updated as well.
await self.push_frame(frame)
# We can now reset this one.
self._reset()
self._messages = frame.messages
messages_frame = LLMMessagesFrame(self._messages)
await self.push_frame(messages_frame)
else:
await self.push_frame(frame, direction)
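
The append and update branches above differ in one respect: an append frame extends the aggregator's current message list, while an update frame replaces it. The sketch below isolates that behaviour using stand-in dataclasses. `AppendMessages`, `UpdateMessages` and `apply_frame` are hypothetical names, not pipecat classes, and the sketch returns a new list instead of mutating state or pushing frames.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AppendMessages:
    """Stand-in for a frame whose messages are added to the context."""
    messages: List[dict]


@dataclass
class UpdateMessages:
    """Stand-in for a frame whose messages replace the context."""
    messages: List[dict]


def apply_frame(context: List[dict], frame: object) -> List[dict]:
    """Return the new context after handling one frame."""
    if isinstance(frame, AppendMessages):
        return context + frame.messages
    if isinstance(frame, UpdateMessages):
        return list(frame.messages)
    # Any other frame leaves the context unchanged.
    return context


if __name__ == "__main__":
    ctx = [{"role": "system", "content": "Be brief."}]
    ctx = apply_frame(ctx, AppendMessages([{"role": "user", "content": "Hi"}]))
    print(len(ctx))
    ctx = apply_frame(ctx, UpdateMessages([{"role": "system", "content": "New."}]))
    print(ctx)
```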
@@ -186,7 +173,7 @@ class LLMUserResponseAggregator(LLMResponseAggregator):
class LLMFullResponseAggregator(FrameProcessor):
"""This class aggregates Text frames until it receives a
LLMFullResponseEndFrame, then emits the concatenated text as
LLMResponseEndFrame, then emits the concatenated text as
a single text frame.
given the following frames:
@@ -195,12 +182,12 @@ class LLMFullResponseAggregator(FrameProcessor):
TextFrame(" world.")
TextFrame(" I am")
TextFrame(" an LLM.")
LLMFullResponseEndFrame()]
LLMResponseEndFrame()]
this processor will yield nothing for the first 4 frames, then
TextFrame("Hello, world. I am an LLM.")
LLMFullResponseEndFrame()
LLMResponseEndFrame()
when passed the last frame.
@@ -216,9 +203,9 @@ class LLMFullResponseAggregator(FrameProcessor):
>>> asyncio.run(print_frames(aggregator, TextFrame(" world.")))
>>> asyncio.run(print_frames(aggregator, TextFrame(" I am")))
>>> asyncio.run(print_frames(aggregator, TextFrame(" an LLM.")))
>>> asyncio.run(print_frames(aggregator, LLMFullResponseEndFrame()))
>>> asyncio.run(print_frames(aggregator, LLMResponseEndFrame()))
Hello, world. I am an LLM.
LLMFullResponseEndFrame
LLMResponseEndFrame
"""
def __init__(self):
@@ -247,11 +234,6 @@ class LLMContextAggregator(LLMResponseAggregator):
async def _push_aggregation(self):
if len(self._aggregation) > 0:
self._context.add_message({"role": self._role, "content": self._aggregation})
# Reset the aggregation. Reset it before pushing it down, otherwise
# if the task gets cancelled we won't be able to clear things up.
self._aggregation = ""
frame = OpenAILLMContextFrame(self._context)
await self.push_frame(frame)
@@ -265,10 +247,9 @@ class LLMAssistantContextAggregator(LLMContextAggregator):
messages=[],
context=context,
role="assistant",
start_frame=LLMFullResponseStartFrame,
end_frame=LLMFullResponseEndFrame,
accumulator_frame=TextFrame,
handle_interruptions=True
start_frame=LLMResponseStartFrame,
end_frame=LLMResponseEndFrame,
accumulator_frame=TextFrame
)

View File

@@ -1,64 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
from pipecat.frames.frames import EndFrame, Frame, StartInterruptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
class AsyncFrameProcessor(FrameProcessor):
def __init__(
self,
*,
name: str | None = None,
loop: asyncio.AbstractEventLoop | None = None,
**kwargs):
super().__init__(name=name, loop=loop, **kwargs)
self._create_push_task()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, StartInterruptionFrame):
await self._handle_interruptions(frame)
async def queue_frame(
self,
frame: Frame,
direction: FrameDirection = FrameDirection.DOWNSTREAM):
await self._push_queue.put((frame, direction))
async def cleanup(self):
self._push_frame_task.cancel()
await self._push_frame_task
async def _handle_interruptions(self, frame: Frame):
# Cancel the task. This will stop pushing frames downstream.
self._push_frame_task.cancel()
await self._push_frame_task
# Push an out-of-band frame (i.e. not using the ordered push
# frame task).
await self.push_frame(frame)
# Create a new queue and task.
self._create_push_task()
def _create_push_task(self):
self._push_queue = asyncio.Queue()
self._push_frame_task = self.get_event_loop().create_task(self._push_frame_task_handler())
async def _push_frame_task_handler(self):
running = True
while running:
try:
(frame, direction) = await self._push_queue.get()
await self.push_frame(frame, direction)
running = not isinstance(frame, EndFrame)
self._push_queue.task_done()
except asyncio.CancelledError:
break
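
The `AsyncFrameProcessor` above pushes frames from a background task fed by a queue. On interruption it cancels that task, pushes the interruption frame out of band, then creates a fresh queue and task. The following self-contained sketch shows the same pattern with plain strings instead of frames. The class `QueuedPusher` and its methods are hypothetical, and delivery is simulated by appending to a list.

```python
import asyncio
from typing import List


class QueuedPusher:
    """Delivers queued items in order from a background task."""

    def __init__(self) -> None:
        self.delivered: List[str] = []
        self._create_push_task()

    def _create_push_task(self) -> None:
        # A fresh queue drops anything that was still pending.
        self._queue: asyncio.Queue = asyncio.Queue()
        self._task = asyncio.get_running_loop().create_task(self._run())

    async def _run(self) -> None:
        while True:
            try:
                item = await self._queue.get()
                self.delivered.append(item)
                self._queue.task_done()
            except asyncio.CancelledError:
                break

    async def queue(self, item: str) -> None:
        await self._queue.put(item)

    async def flush(self) -> None:
        # Wait until every queued item has been delivered.
        await self._queue.join()

    async def interrupt(self, marker: str) -> None:
        # Stop ordered delivery, deliver the marker out of band, start over.
        self._task.cancel()
        await self._task
        self.delivered.append(marker)
        self._create_push_task()

    async def cleanup(self) -> None:
        self._task.cancel()
        await self._task


async def demo() -> List[str]:
    pusher = QueuedPusher()
    await pusher.queue("a")
    await pusher.flush()
    await pusher.interrupt("interrupt")
    await pusher.queue("b")
    await pusher.flush()
    await pusher.cleanup()
    return pusher.delivered


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

One caveat of this pattern, which the sketch shares: if the task is cancelled before it has run at all, awaiting it raises `CancelledError`, so the demo flushes once before interrupting.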

View File

@@ -82,5 +82,5 @@ class WakeCheckFilter(FrameProcessor):
await self.push_frame(frame, direction)
except Exception as e:
error_msg = f"Error in wake word filter: {e}"
logger.exception(error_msg)
logger.error(error_msg)
await self.push_error(ErrorFrame(error_msg))

View File

@@ -9,7 +9,7 @@ import time
from enum import Enum
from pipecat.frames.frames import ErrorFrame, Frame, MetricsFrame, StartFrame, StartInterruptionFrame, UserStoppedSpeakingFrame
from pipecat.frames.frames import ErrorFrame, Frame, MetricsFrame, StartFrame, UserStoppedSpeakingFrame
from pipecat.utils.utils import obj_count, obj_id
from loguru import logger
@@ -20,59 +20,15 @@ class FrameDirection(Enum):
UPSTREAM = 2
class FrameProcessorMetrics:
def __init__(self, name: str):
self._name = name
self._start_ttfb_time = 0
self._start_processing_time = 0
self._should_report_ttfb = True
async def start_ttfb_metrics(self, report_only_initial_ttfb):
if self._should_report_ttfb:
self._start_ttfb_time = time.time()
self._should_report_ttfb = not report_only_initial_ttfb
async def stop_ttfb_metrics(self):
if self._start_ttfb_time == 0:
return None
value = time.time() - self._start_ttfb_time
logger.debug(f"{self._name} TTFB: {value}")
ttfb = {
"processor": self._name,
"value": value
}
self._start_ttfb_time = 0
return MetricsFrame(ttfb=[ttfb])
async def start_processing_metrics(self):
self._start_processing_time = time.time()
async def stop_processing_metrics(self):
if self._start_processing_time == 0:
return None
value = time.time() - self._start_processing_time
logger.debug(f"{self._name} processing time: {value}")
processing = {
"processor": self._name,
"value": value
}
self._start_processing_time = 0
return MetricsFrame(processing=[processing])
class FrameProcessor:
def __init__(
self,
*,
name: str | None = None,
loop: asyncio.AbstractEventLoop | None = None,
**kwargs):
self.id: int = obj_id()
self.name = name or f"{self.__class__.__name__}#{obj_count(self)}"
self._parent: "FrameProcessor" | None = None
self._prev: "FrameProcessor" | None = None
self._next: "FrameProcessor" | None = None
self._loop: asyncio.AbstractEventLoop = loop or asyncio.get_running_loop()
@@ -83,7 +39,8 @@ class FrameProcessor:
self._report_only_initial_ttfb = False
# Metrics
self._metrics = FrameProcessorMetrics(name=self.name)
self._start_ttfb_time = 0
self._should_report_ttfb = True
@property
def interruptions_allowed(self):
@@ -101,33 +58,21 @@ class FrameProcessor:
return False
async def start_ttfb_metrics(self):
if self.can_generate_metrics() and self.metrics_enabled:
await self._metrics.start_ttfb_metrics(self._report_only_initial_ttfb)
if self.metrics_enabled and self._should_report_ttfb:
self._start_ttfb_time = time.time()
self._should_report_ttfb = not self._report_only_initial_ttfb
async def stop_ttfb_metrics(self):
if self.can_generate_metrics() and self.metrics_enabled:
frame = await self._metrics.stop_ttfb_metrics()
if frame:
await self.push_frame(frame)
async def start_processing_metrics(self):
if self.can_generate_metrics() and self.metrics_enabled:
await self._metrics.start_processing_metrics()
async def stop_processing_metrics(self):
if self.can_generate_metrics() and self.metrics_enabled:
frame = await self._metrics.stop_processing_metrics()
if frame:
await self.push_frame(frame)
async def stop_all_metrics(self):
await self.stop_ttfb_metrics()
await self.stop_processing_metrics()
if self.metrics_enabled and self._start_ttfb_time > 0:
ttfb = time.time() - self._start_ttfb_time
logger.debug(f"{self.name} TTFB: {ttfb}")
await self.push_frame(MetricsFrame(ttfb={self.name: ttfb}))
self._start_ttfb_time = 0
async def cleanup(self):
pass
def link(self, processor: "FrameProcessor"):
def link(self, processor: 'FrameProcessor'):
self._next = processor
processor._prev = self
logger.debug(f"Linking {self} -> {self._next}")
@@ -135,19 +80,11 @@ class FrameProcessor:
def get_event_loop(self) -> asyncio.AbstractEventLoop:
return self._loop
def set_parent(self, parent: "FrameProcessor"):
self._parent = parent
def get_parent(self) -> "FrameProcessor":
return self._parent
async def process_frame(self, frame: Frame, direction: FrameDirection):
if isinstance(frame, StartFrame):
self._allow_interruptions = frame.allow_interruptions
self._enable_metrics = frame.enable_metrics
self._report_only_initial_ttfb = frame.report_only_initial_ttfb
elif isinstance(frame, StartInterruptionFrame):
await self.stop_all_metrics()
elif isinstance(frame, UserStoppedSpeakingFrame):
self._should_report_ttfb = True
@@ -155,15 +92,12 @@ class FrameProcessor:
await self.push_frame(error, FrameDirection.UPSTREAM)
async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
try:
if direction == FrameDirection.DOWNSTREAM and self._next:
logger.trace(f"Pushing {frame} from {self} to {self._next}")
await self._next.process_frame(frame, direction)
elif direction == FrameDirection.UPSTREAM and self._prev:
logger.trace(f"Pushing {frame} upstream from {self} to {self._prev}")
await self._prev.process_frame(frame, direction)
except Exception as e:
logger.exception(f"Uncaught exception in {self}: {e}")
if direction == FrameDirection.DOWNSTREAM and self._next:
logger.trace(f"Pushing {frame} from {self} to {self._next}")
await self._next.process_frame(frame, direction)
elif direction == FrameDirection.UPSTREAM and self._prev:
logger.trace(f"Pushing {frame} upstream from {self} to {self._prev}")
await self._prev.process_frame(frame, direction)
def __str__(self):
return self.name
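
Both versions of the TTFB (time to first byte) bookkeeping above follow the same idea: record a start time, and on stop report the elapsed time only if a measurement was actually started, then reset. A minimal sketch of that logic with an injectable clock, so it can be exercised without real delays, is given below. `TTFBTimer` is a hypothetical name and the sketch returns the value instead of pushing a `MetricsFrame`.

```python
import time
from typing import Callable, Optional


class TTFBTimer:
    """Measures the time between start() and stop()."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._start = 0.0

    def start(self) -> None:
        self._start = self._clock()

    def stop(self) -> Optional[float]:
        # No measurement in progress: nothing to report.
        if self._start == 0:
            return None
        value = self._clock() - self._start
        # Reset so a second stop() does not report again.
        self._start = 0.0
        return value


if __name__ == "__main__":
    timer = TTFBTimer()
    timer.start()
    print(timer.stop() is not None)
```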

View File

@@ -11,6 +11,8 @@ from pipecat.frames.frames import (
LLMFullResponseEndFrame,
LLMFullResponseStartFrame,
LLMMessagesFrame,
LLMResponseEndFrame,
LLMResponseStartFrame,
TextFrame)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
@@ -67,10 +69,11 @@ class LangchainProcessor(FrameProcessor):
{self._transcript_key: text},
config={"configurable": {"session_id": self._participant_id}},
):
await self.push_frame(LLMResponseStartFrame())
await self.push_frame(TextFrame(self.__get_token_value(token)))
await self.push_frame(LLMResponseEndFrame())
except GeneratorExit:
logger.warning(f"{self} generator was closed prematurely")
except Exception as e:
logger.exception(f"{self} an unknown error occurred: {e}")
finally:
await self.push_frame(LLMFullResponseEndFrame())
logger.error(f"{self} an unknown error occurred: {e}")
await self.push_frame(LLMFullResponseEndFrame())
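
One version of the `LangchainProcessor` hunk above pushes `LLMFullResponseEndFrame` from a `finally` block, which suggests the intent is to emit the end marker whether or not the token stream fails. The sketch below demonstrates that guarantee with a toy async generator. All names are hypothetical and strings stand in for frames.

```python
import asyncio
from typing import AsyncIterator, List, Optional


async def stream_tokens(tokens: List[str], fail_at: Optional[int] = None) -> AsyncIterator[str]:
    """Yield tokens, optionally raising before the token at index fail_at."""
    for index, token in enumerate(tokens):
        if index == fail_at:
            raise RuntimeError("boom")
        yield token


async def run_stream(tokens: List[str], fail_at: Optional[int] = None) -> List[str]:
    """Collect output between START and END; END is emitted even on errors."""
    out = ["START"]
    try:
        async for token in stream_tokens(tokens, fail_at):
            out.append(token)
    except Exception as e:
        out.append(f"error: {e}")
    finally:
        out.append("END")
    return out


if __name__ == "__main__":
    print(asyncio.run(run_stream(["a", "b"])))
    print(asyncio.run(run_stream(["a", "b"], fail_at=1)))
```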

View File

@@ -1,523 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
import dataclasses
from typing import List, Literal, Optional, Type
from pydantic import BaseModel, ValidationError
from pipecat.frames.frames import (
BotInterruptionFrame,
Frame,
InterimTranscriptionFrame,
LLMFullResponseEndFrame,
LLMFullResponseStartFrame,
LLMMessagesAppendFrame,
LLMMessagesUpdateFrame,
LLMModelUpdateFrame,
StartFrame,
SystemFrame,
TTSSpeakFrame,
TTSVoiceUpdateFrame,
TextFrame,
TranscriptionFrame,
TransportMessageFrame,
UserStartedSpeakingFrame,
UserStoppedSpeakingFrame)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.aggregators.llm_response import (
LLMAssistantResponseAggregator, LLMUserResponseAggregator)
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.ai_services import AIService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.openai import OpenAILLMService, OpenAILLMContext
from pipecat.transports.base_transport import BaseTransport
DEFAULT_MESSAGES = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
}
]
DEFAULT_MODEL = "llama3-70b-8192"
DEFAULT_VOICE = "79a125e8-cd45-4c13-8a67-188112f4dd22"
class RTVILLMConfig(BaseModel):
model: Optional[str] = None
messages: Optional[List[dict]] = None
class RTVITTSConfig(BaseModel):
voice: Optional[str] = None
class RTVIConfig(BaseModel):
llm: Optional[RTVILLMConfig] = None
tts: Optional[RTVITTSConfig] = None
class RTVISetup(BaseModel):
config: Optional[RTVIConfig] = None
class RTVILLMMessageData(BaseModel):
messages: List[dict]
class RTVITTSMessageData(BaseModel):
text: str
interrupt: Optional[bool] = False
class RTVIMessageData(BaseModel):
setup: Optional[RTVISetup] = None
config: Optional[RTVIConfig] = None
llm: Optional[RTVILLMMessageData] = None
tts: Optional[RTVITTSMessageData] = None
class RTVIMessage(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: str
id: str
data: Optional[RTVIMessageData] = None
class RTVIResponseData(BaseModel):
success: bool
error: Optional[str] = None
class RTVIResponse(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["response"] = "response"
id: str
data: RTVIResponseData
class RTVIErrorData(BaseModel):
message: str
class RTVIError(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["error"] = "error"
data: RTVIErrorData
class RTVILLMContextMessageData(BaseModel):
messages: List[dict]
class RTVILLMContextMessage(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["llm-context"] = "llm-context"
data: RTVILLMContextMessageData
class RTVITTSTextMessageData(BaseModel):
text: str
class RTVITTSTextMessage(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["tts-text"] = "tts-text"
data: RTVITTSTextMessageData
class RTVIBotReady(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["bot-ready"] = "bot-ready"
class RTVITranscriptionMessageData(BaseModel):
text: str
user_id: str
timestamp: str
final: bool
class RTVITranscriptionMessage(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["user-transcription"] = "user-transcription"
data: RTVITranscriptionMessageData
class RTVIUserStartedSpeakingMessage(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["user-started-speaking"] = "user-started-speaking"
class RTVIUserStoppedSpeakingMessage(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["user-stopped-speaking"] = "user-stopped-speaking"
class RTVIJSONCompletion(BaseModel):
label: Literal["rtvi"] = "rtvi"
type: Literal["json-completion"] = "json-completion"
data: str
class FunctionCaller(FrameProcessor):
def __init__(self, context):
super().__init__()
self._checking = False
self._aggregating = False
self._emitted_start = False
self._aggregation = ""
self._context = context
self._callbacks = {}
self._start_callbacks = {}
def register_function(self, function_name: str, callback, start_callback=None):
self._callbacks[function_name] = callback
if start_callback:
self._start_callbacks[function_name] = start_callback
def unregister_function(self, function_name: str):
del self._callbacks[function_name]
if function_name in self._start_callbacks:
del self._start_callbacks[function_name]
def has_function(self, function_name: str):
return function_name in self._callbacks.keys()
async def call_function(self, function_name: str, args):
if function_name in self._callbacks.keys():
return await self._callbacks[function_name](self, args)
return None
async def call_start_function(self, function_name: str):
if function_name in self._start_callbacks.keys():
await self._start_callbacks[function_name](self)
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, LLMFullResponseStartFrame):
self._checking = True
await self.push_frame(frame, direction)
elif isinstance(frame, TextFrame) and self._checking:
# TODO-CB: should we expand this to any non-text character to start the completion?
if frame.text.strip().startswith("{") or frame.text.strip().startswith("```"):
self._emitted_start = False
self._checking = False
self._aggregation = frame.text
self._aggregating = True
else:
self._checking = False
self._aggregating = False
self._aggregation = ""
self._emitted_start = False
await self.push_frame(frame, direction)
elif isinstance(frame, TextFrame) and self._aggregating:
self._aggregation += frame.text
# TODO-CB: We can probably ignore function start I think
# if not self._emitted_start:
# fn = re.search(r'{"function_name":\s*"(.*)",', self._aggregation)
# if fn and fn.group(1):
# await self.call_start_function(fn.group(1))
# self._emitted_start = True
elif isinstance(frame, LLMFullResponseEndFrame) and self._aggregating:
try:
self._aggregation = self._aggregation.replace("```json", "").replace("```", "")
self._context.add_message({"role": "assistant", "content": self._aggregation})
message = RTVIJSONCompletion(data=self._aggregation)
msg = message.model_dump(exclude_none=True)
await self.push_frame(TransportMessageFrame(message=msg))
except Exception as e:
print(f"Error parsing function call json: {e}")
print(f"aggregation was: {self._aggregation}")
self._aggregating = False
self._aggregation = ""
self._emitted_start = False
elif isinstance(frame, LLMFullResponseEndFrame):
await self.push_frame(frame, direction)
else:
await self.push_frame(frame, direction)
class RTVITTSTextProcessor(FrameProcessor):
def __init__(self):
super().__init__()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
await self.push_frame(frame, direction)
if isinstance(frame, TextFrame):
message = RTVITTSTextMessage(data=RTVITTSTextMessageData(text=frame.text))
await self.push_frame(TransportMessageFrame(message=message.model_dump(exclude_none=True)))
class RTVIProcessor(FrameProcessor):
def __init__(
self,
*,
transport: BaseTransport,
setup: RTVISetup | None = None,
llm_api_key: str = "",
llm_base_url: str = "https://api.groq.com/openai/v1",
tts_api_key: str = "",
llm_cls: Type[AIService] = OpenAILLMService,
tts_cls: Type[AIService] = CartesiaTTSService):
super().__init__()
self._transport = transport
self._setup = setup
self._llm_api_key = llm_api_key
self._llm_base_url = llm_base_url
self._tts_api_key = tts_api_key
self._llm_cls = llm_cls
self._tts_cls = tts_cls
self._start_frame: Frame | None = None
self._llm: FrameProcessor | None = None
self._tts: FrameProcessor | None = None
self._pipeline: FrameProcessor | None = None
self._frame_handler_task = self.get_event_loop().create_task(self._frame_handler())
self._frame_queue = asyncio.Queue()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, SystemFrame):
await self.push_frame(frame, direction)
else:
await self._frame_queue.put((frame, direction))
if isinstance(frame, StartFrame):
self._start_frame = frame
try:
await self._handle_setup(self._setup)
except Exception as e:
await self._send_error(f"unable to setup RTVI: {e}")
async def cleanup(self):
self._frame_handler_task.cancel()
await self._frame_handler_task
async def _frame_handler(self):
while True:
try:
(frame, direction) = await self._frame_queue.get()
await self._handle_frame(frame, direction)
self._frame_queue.task_done()
except asyncio.CancelledError:
break
async def _handle_frame(self, frame: Frame, direction: FrameDirection):
if isinstance(frame, TransportMessageFrame):
await self._handle_message(frame)
else:
await self.push_frame(frame, direction)
if isinstance(frame, TranscriptionFrame) or isinstance(frame, InterimTranscriptionFrame):
await self._handle_transcriptions(frame)
elif isinstance(frame, UserStartedSpeakingFrame) or isinstance(frame, UserStoppedSpeakingFrame):
await self._handle_interruptions(frame)
async def _handle_transcriptions(self, frame: Frame):
# TODO(aleix): Once we add support for using custom pipelines, the STTs will
# be in the pipeline after this processor. This means the STT will have to
# push transcriptions upstream as well.
message = None
if isinstance(frame, TranscriptionFrame):
message = RTVITranscriptionMessage(
data=RTVITranscriptionMessageData(
text=frame.text,
user_id=frame.user_id,
timestamp=frame.timestamp,
final=True))
elif isinstance(frame, InterimTranscriptionFrame):
message = RTVITranscriptionMessage(
data=RTVITranscriptionMessageData(
text=frame.text,
user_id=frame.user_id,
timestamp=frame.timestamp,
final=False))
if message:
frame = TransportMessageFrame(message=message.model_dump(exclude_none=True))
await self.push_frame(frame)
async def _handle_interruptions(self, frame: Frame):
message = None
if isinstance(frame, UserStartedSpeakingFrame):
message = RTVIUserStartedSpeakingMessage()
elif isinstance(frame, UserStoppedSpeakingFrame):
message = RTVIUserStoppedSpeakingMessage()
if message:
frame = TransportMessageFrame(message=message.model_dump(exclude_none=True))
await self.push_frame(frame)
async def _handle_message(self, frame: TransportMessageFrame):
try:
message = RTVIMessage.model_validate(frame.message)
except ValidationError as e:
await self._send_error(f"invalid message: {e}")
return
try:
success = True
error = None
match message.type:
case "setup":
setup = None
if message.data:
setup = message.data.setup
await self._handle_setup(setup)
case "config-update":
await self._handle_config_update(message.data.config)
case "llm-get-context":
await self._handle_llm_get_context()
case "llm-append-context":
await self._handle_llm_append_context(message.data.llm)
case "llm-update-context":
await self._handle_llm_update_context(message.data.llm)
case "tts-speak":
await self._handle_tts_speak(message.data.tts)
case "tts-interrupt":
await self._handle_tts_interrupt()
case _:
success = False
error = f"unsupported type {message.type}"
await self._send_response(message.id, success, error)
except ValidationError as e:
await self._send_response(message.id, False, f"invalid message: {e}")
except Exception as e:
await self._send_response(message.id, False, f"{e}")
async def _handle_setup(self, setup: RTVISetup | None):
model = DEFAULT_MODEL
if setup and setup.config and setup.config.llm and setup.config.llm.model:
model = setup.config.llm.model
messages = DEFAULT_MESSAGES
if setup and setup.config and setup.config.llm and setup.config.llm.messages:
messages = setup.config.llm.messages
voice = DEFAULT_VOICE
if setup and setup.config and setup.config.tts and setup.config.tts.voice:
voice = setup.config.tts.voice
self._tma_in = LLMUserResponseAggregator(messages)
self._tma_out = LLMAssistantResponseAggregator(messages)
self._llm = self._llm_cls(
name="LLM",
base_url=self._llm_base_url,
api_key=self._llm_api_key,
model=model)
self._tts = self._tts_cls(name="TTS", api_key=self._tts_api_key, voice_id=voice)
# TODO-CB: Eventually we'll need to switch the context aggregators to use the
# OpenAI context frames instead of message frames
context = OpenAILLMContext(messages=messages)
self._fc = FunctionCaller(context)
self._tts_text = RTVITTSTextProcessor()
pipeline = Pipeline([
self._tma_in,
self._llm,
self._fc,
self._tts,
self._tts_text,
self._tma_out,
self._transport.output(),
])
self._pipeline = pipeline
parent = self.get_parent()
if parent and self._start_frame:
parent.link(pipeline)
# We need to initialize the new pipeline with the same settings
# as the initial one.
start_frame = dataclasses.replace(self._start_frame)
await self.push_frame(start_frame)
message = RTVIBotReady()
frame = TransportMessageFrame(message=message.model_dump(exclude_none=True))
await self.push_frame(frame)
async def _handle_config_update(self, config: RTVIConfig):
# Change voice before LLM updates, so we can hear the new voice.
if config.tts and config.tts.voice:
frame = TTSVoiceUpdateFrame(config.tts.voice)
await self.push_frame(frame)
if config.llm and config.llm.model:
frame = LLMModelUpdateFrame(config.llm.model)
await self.push_frame(frame)
if config.llm and config.llm.messages:
frame = LLMMessagesUpdateFrame(config.llm.messages)
await self.push_frame(frame)
async def _handle_llm_get_context(self):
data = RTVILLMContextMessageData(messages=self._tma_in.messages)
message = RTVILLMContextMessage(data=data)
frame = TransportMessageFrame(message=message.model_dump(exclude_none=True))
await self.push_frame(frame)
async def _handle_llm_append_context(self, data: RTVILLMMessageData):
if data and data.messages:
frame = LLMMessagesAppendFrame(data.messages)
await self.push_frame(frame)
async def _handle_llm_update_context(self, data: RTVILLMMessageData):
if data and data.messages:
frame = LLMMessagesUpdateFrame(data.messages)
await self.push_frame(frame)
async def _handle_tts_speak(self, data: RTVITTSMessageData):
if data and data.text:
if data.interrupt:
await self._handle_tts_interrupt()
frame = TTSSpeakFrame(text=data.text)
await self.push_frame(frame)
async def _handle_tts_interrupt(self):
await self.push_frame(BotInterruptionFrame(), FrameDirection.UPSTREAM)
async def _send_error(self, error: str):
message = RTVIError(data=RTVIErrorData(message=error))
frame = TransportMessageFrame(message=message.model_dump(exclude_none=True))
await self.push_frame(frame)
async def _send_response(self, id: str, success: bool, error: str | None = None):
# TODO(aleix): This is a bit hacky, but we might get invalid
# configuration or something might go wrong during setup and we would
# like to send the error to the client. However, if the pipeline is not
# set up yet we don't have an output transport and therefore we can't
# send any messages. So, we set up a super basic pipeline with just the
# output transport so we can send messages.
if not self._pipeline:
pipeline = Pipeline([self._transport.output()])
self._pipeline = pipeline
parent = self.get_parent()
if parent and self._start_frame:
parent.link(pipeline)
message = RTVIResponse(id=id, data=RTVIResponseData(success=success, error=error))
frame = TransportMessageFrame(message=message.model_dump(exclude_none=True))
await self.push_frame(frame)
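The `FunctionCaller` in this file decides from the first streamed text chunk whether a response is a JSON completion or ordinary speech. Below is a minimal, standard-library sketch of that decision, assuming the chunks arrive as plain strings. The helper name `split_completion` and its return shape are hypothetical and not part of pipecat.

```python
from typing import Iterable, Optional, Tuple

# Built at runtime so the Markdown fence marker never appears literally here.
FENCE = "`" * 3


def split_completion(chunks: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (spoken_text, json_payload) for one streamed LLM response.

    Only the first chunk decides the route: if it starts with "{" or a
    Markdown fence, the whole response is aggregated as a JSON payload;
    otherwise every chunk is treated as text to be spoken.
    """
    parts = list(chunks)
    if not parts:
        return "", None

    first = parts[0].strip()
    if first.startswith("{") or first.startswith(FENCE):
        # Same cleanup as the diff: drop the fence markers, keep the body.
        payload = "".join(parts).replace(FENCE + "json", "").replace(FENCE, "")
        return "", payload

    return "".join(parts), None


spoken, payload = split_completion(["Hello", " world"])
fenced = split_completion([FENCE + "json", '{"a": 1}', FENCE])
```

Because only the first chunk is inspected, a response that begins with prose and later contains JSON is spoken in full. This matches the behaviour of the hunk above rather than improving on it.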


@@ -1,76 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
from typing import Awaitable, Callable, List
from pipecat.frames.frames import Frame, SystemFrame
from pipecat.processors.async_frame_processor import AsyncFrameProcessor
from pipecat.processors.frame_processor import FrameDirection
class IdleFrameProcessor(AsyncFrameProcessor):
"""This class waits to receive any frame or list of desired frames within a
given timeout. If the timeout is reached before receiving any of those
frames the provided callback will be called.
The callback can then be used to push frames downstream by using
`queue_frame()` (or `push_frame()` for system frames).
"""
def __init__(
self,
*,
callback: Callable[["IdleFrameProcessor"], Awaitable[None]],
timeout: float,
types: List[type] = [],
**kwargs):
super().__init__(**kwargs)
self._callback = callback
self._timeout = timeout
self._types = types
self._create_idle_task()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, SystemFrame):
await self.push_frame(frame, direction)
else:
await self.queue_frame(frame, direction)
# If we are not waiting for any specific frame set the event, otherwise
# check if we have received one of the desired frames.
if not self._types:
self._idle_event.set()
else:
for t in self._types:
if isinstance(frame, t):
self._idle_event.set()
async def cleanup(self):
self._idle_task.cancel()
await self._idle_task
def _create_idle_task(self):
self._idle_event = asyncio.Event()
self._idle_task = self.get_event_loop().create_task(self._idle_task_handler())
async def _idle_task_handler(self):
while True:
try:
await asyncio.wait_for(self._idle_event.wait(), timeout=self._timeout)
except asyncio.TimeoutError:
await self._callback(self)
except asyncio.CancelledError:
break
finally:
self._idle_event.clear()
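The idle detection above combines an `asyncio.Event` with `asyncio.wait_for`: activity sets the event, and a timeout means nothing arrived in time. A minimal sketch of that loop outside any frame processor follows. It is bounded by a hypothetical `rounds` parameter so the example terminates, and the names `watch_idle` and `demo` are illustrative only.

```python
import asyncio
from typing import Awaitable, Callable


async def watch_idle(
    event: asyncio.Event,
    timeout: float,
    on_idle: Callable[[], Awaitable[None]],
    rounds: int,
) -> None:
    """Wait `rounds` times for activity; call `on_idle` on each timeout."""
    for _ in range(rounds):
        try:
            await asyncio.wait_for(event.wait(), timeout=timeout)
        except asyncio.TimeoutError:
            # Nothing set the event within `timeout` seconds.
            await on_idle()
        finally:
            # Re-arm the event for the next waiting period.
            event.clear()


async def demo() -> int:
    idle_calls = 0

    async def on_idle() -> None:
        nonlocal idle_calls
        idle_calls += 1

    event = asyncio.Event()
    task = asyncio.create_task(watch_idle(event, 0.2, on_idle, rounds=2))
    await asyncio.sleep(0.01)
    event.set()  # activity during the first round, none during the second
    await task
    return idle_calls


idle_count = asyncio.run(demo())
```

Clearing the event in `finally` re-arms it after both activity and a timeout. That is why the processors above can reuse one event for the whole session.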


@@ -33,6 +33,6 @@ class StatelessTextTransformer(FrameProcessor):
result = self._transform_fn(frame.text)
if isinstance(result, Coroutine):
result = await result
await self.push_frame(TextFrame(text=result))
await self.push_frame(result)
else:
await self.push_frame(frame, direction)


@@ -1,77 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
from typing import Awaitable, Callable
from pipecat.frames.frames import BotSpeakingFrame, Frame, StartInterruptionFrame, StopInterruptionFrame, SystemFrame
from pipecat.processors.async_frame_processor import AsyncFrameProcessor
from pipecat.processors.frame_processor import FrameDirection
class UserIdleProcessor(AsyncFrameProcessor):
"""This class is useful to check if the user is interacting with the bot
within a given timeout. If the timeout is reached before any interaction
occurred the provided callback will be called.
The callback can then be used to push frames downstream by using
`queue_frame()` (or `push_frame()` for system frames).
"""
def __init__(
self,
*,
callback: Callable[["UserIdleProcessor"], Awaitable[None]],
timeout: float,
**kwargs):
super().__init__(**kwargs)
self._callback = callback
self._timeout = timeout
self._interrupted = False
self._create_idle_task()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, SystemFrame):
await self.push_frame(frame, direction)
else:
await self.queue_frame(frame, direction)
# We shouldn't call the idle callback if the user or the bot are speaking.
if isinstance(frame, StartInterruptionFrame):
self._interrupted = True
self._idle_event.set()
elif isinstance(frame, StopInterruptionFrame):
self._interrupted = False
self._idle_event.set()
elif isinstance(frame, BotSpeakingFrame):
self._idle_event.set()
async def cleanup(self):
self._idle_task.cancel()
await self._idle_task
def _create_idle_task(self):
self._idle_event = asyncio.Event()
self._idle_task = self.get_event_loop().create_task(self._idle_task_handler())
async def _idle_task_handler(self):
while True:
try:
await asyncio.wait_for(self._idle_event.wait(), timeout=self._timeout)
except asyncio.TimeoutError:
if not self._interrupted:
await self._callback(self)
except asyncio.CancelledError:
break
finally:
self._idle_event.clear()


@@ -17,8 +17,8 @@ class TwilioFrameSerializer(FrameSerializer):
AudioRawFrame: "audio",
}
def __init__(self, stream_sid: str):
self._stream_sid = stream_sid
def __init__(self):
self._sid = None
def serialize(self, frame: Frame) -> str | bytes | None:
if not isinstance(frame, AudioRawFrame):
@@ -30,7 +30,7 @@ class TwilioFrameSerializer(FrameSerializer):
payload = base64.b64encode(serialized_data).decode("utf-8")
answer = {
"event": "media",
"streamSid": self._stream_sid,
"streamSid": self._sid,
"media": {
"payload": payload
}
@@ -41,6 +41,9 @@ class TwilioFrameSerializer(FrameSerializer):
def deserialize(self, data: str | bytes) -> Frame | None:
message = json.loads(data)
if not self._sid:
self._sid = message["streamSid"] if "streamSid" in message else None
if message["event"] != "media":
return None
else:
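One side of this hunk passes the stream SID to the constructor, while the other learns it lazily from the first incoming message that carries a `streamSid` key. The lazy variant can be sketched with the standard library alone. The class `StreamSidTracker` is hypothetical and only mirrors the `deserialize()` logic shown above.

```python
import json
from typing import Optional


class StreamSidTracker:
    """Remember the first `streamSid` seen in incoming JSON messages."""

    def __init__(self) -> None:
        self._sid: Optional[str] = None

    @property
    def sid(self) -> Optional[str]:
        return self._sid

    def observe(self, data: str) -> dict:
        message = json.loads(data)
        # Capture the SID only once; later messages never overwrite it.
        if not self._sid:
            self._sid = message.get("streamSid")
        return message


tracker = StreamSidTracker()
tracker.observe('{"event": "connected"}')
sid_before = tracker.sid
tracker.observe('{"event": "start", "streamSid": "MZ123"}')
tracker.observe('{"event": "media", "streamSid": "MZ999"}')
sid_after = tracker.sid
```

With this approach, any outgoing message serialized before the first SID-bearing message arrives would carry a `None` SID. Callers may want to guard against that.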


@@ -16,38 +16,16 @@ from pipecat.frames.frames import (
EndFrame,
ErrorFrame,
Frame,
LLMFullResponseEndFrame,
StartFrame,
StartInterruptionFrame,
TTSSpeakFrame,
TTSStartedFrame,
TTSStoppedFrame,
TTSVoiceUpdateFrame,
TextFrame,
VisionImageRawFrame,
LLMFullResponseEndFrame,
)
from pipecat.processors.async_frame_processor import AsyncFrameProcessor
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.utils.audio import calculate_audio_volume
from pipecat.utils.utils import exp_smoothing
import re
ENDOFSENTENCE_PATTERN_STR = r"""
(?<![A-Z]) # Negative lookbehind: not preceded by an uppercase letter (e.g., "U.S.A.")
(?<!\d) # Negative lookbehind: not preceded by a digit (e.g., "1. Let's start")
(?<!\d\s[ap]) # Negative lookbehind: not preceded by time (e.g., "3:00 a.m.")
(?<!Mr|Ms|Dr) # Negative lookbehind: not preceded by Mr, Ms, Dr (combined bc. length is the same)
(?<!Mrs) # Negative lookbehind: not preceded by "Mrs"
(?<!Prof) # Negative lookbehind: not preceded by "Prof"
[\.\?\!:] # Match a period, question mark, exclamation point, or colon
$ # End of string
"""
ENDOFSENTENCE_PATTERN = re.compile(ENDOFSENTENCE_PATTERN_STR, re.VERBOSE)
def match_endofsentence(text: str) -> bool:
return ENDOFSENTENCE_PATTERN.search(text.rstrip()) is not None
class AIService(FrameProcessor):
@@ -81,30 +59,6 @@ class AIService(FrameProcessor):
await self.push_frame(f)
class AsyncAIService(AsyncFrameProcessor):
def __init__(self, **kwargs):
super().__init__(**kwargs)
async def start(self, frame: StartFrame):
pass
async def stop(self, frame: EndFrame):
pass
async def cancel(self, frame: CancelFrame):
pass
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, StartFrame):
await self.start(frame)
elif isinstance(frame, CancelFrame):
await self.cancel(frame)
elif isinstance(frame, EndFrame):
await self.stop(frame)
class LLMService(AIService):
"""This class is a no-op but serves as a base class for LLM services."""
@@ -138,22 +92,11 @@ class LLMService(AIService):
class TTSService(AIService):
def __init__(
self,
*,
aggregate_sentences: bool = True,
# if True, subclass is responsible for pushing TextFrames and LLMFullResponseEndFrames
push_text_frames: bool = True,
**kwargs):
def __init__(self, aggregate_sentences: bool = True, **kwargs):
super().__init__(**kwargs)
self._aggregate_sentences: bool = aggregate_sentences
self._push_text_frames: bool = push_text_frames
self._current_sentence: str = ""
@abstractmethod
async def set_voice(self, voice: str):
pass
# Converts the text to audio.
@abstractmethod
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
@@ -162,58 +105,43 @@ class TTSService(AIService):
async def say(self, text: str):
await self.process_frame(TextFrame(text=text), FrameDirection.DOWNSTREAM)
async def _handle_interruption(self, frame: StartInterruptionFrame, direction: FrameDirection):
self._current_sentence = ""
await self.push_frame(frame, direction)
async def _process_text_frame(self, frame: TextFrame):
text: str | None = None
if not self._aggregate_sentences:
text = frame.text
else:
self._current_sentence += frame.text
if match_endofsentence(self._current_sentence):
text = self._current_sentence
if self._current_sentence.strip().endswith(
(".", "?", "!")) and not self._current_sentence.strip().endswith(
("Mr,", "Mrs.", "Ms.", "Dr.")):
text = self._current_sentence.strip()
self._current_sentence = ""
if text:
await self._push_tts_frames(text)
async def _push_tts_frames(self, text: str, text_passthrough: bool = True):
text = text.strip()
if not text:
return
async def _push_tts_frames(self, text: str):
await self.push_frame(TTSStartedFrame())
await self.start_processing_metrics()
await self.process_generator(self.run_tts(text))
await self.stop_processing_metrics()
await self.push_frame(TTSStoppedFrame())
if self._push_text_frames:
# We send the original text after the audio. This way, if we are
# interrupted, the text is not added to the assistant context.
await self.push_frame(TextFrame(text))
# We send the original text after the audio. This way, if we are
# interrupted, the text is not added to the assistant context.
await self.push_frame(TextFrame(text))
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, TextFrame):
await self._process_text_frame(frame)
elif isinstance(frame, StartInterruptionFrame):
await self._handle_interruption(frame, direction)
elif isinstance(frame, LLMFullResponseEndFrame) or isinstance(frame, EndFrame):
sentence = self._current_sentence
self._current_sentence = ""
await self._push_tts_frames(sentence)
if isinstance(frame, LLMFullResponseEndFrame):
if self._push_text_frames:
await self.push_frame(frame, direction)
else:
await self.push_frame(frame, direction)
elif isinstance(frame, TTSSpeakFrame):
await self._push_tts_frames(frame.text, False)
elif isinstance(frame, TTSVoiceUpdateFrame):
await self.set_voice(frame.voice)
elif isinstance(frame, EndFrame):
if self._current_sentence:
await self._push_tts_frames(self._current_sentence)
await self.push_frame(frame)
elif isinstance(frame, LLMFullResponseEndFrame):
if self._current_sentence:
await self._push_tts_frames(self._current_sentence.strip())
self._current_sentence = ""
await self.push_frame(frame)
else:
await self.push_frame(frame, direction)
@@ -222,7 +150,6 @@ class STTService(AIService):
"""STTService is a base class for speech-to-text services."""
def __init__(self,
*,
min_volume: float = 0.6,
max_silence_secs: float = 0.3,
max_buffer_secs: float = 1.5,
@@ -278,9 +205,7 @@ class STTService(AIService):
self._silence_num_frames = 0
self._wave.close()
self._content.seek(0)
await self.start_processing_metrics()
await self.process_generator(self.run_stt(self._content.read()))
await self.stop_processing_metrics()
(self._content, self._wave) = self._new_wave()
async def process_frame(self, frame: Frame, direction: FrameDirection):
@@ -313,9 +238,7 @@ class ImageGenService(AIService):
if isinstance(frame, TextFrame):
await self.push_frame(frame, direction)
await self.start_processing_metrics()
await self.process_generator(self.run_image_gen(frame.text))
await self.stop_processing_metrics()
else:
await self.push_frame(frame, direction)
@@ -335,8 +258,6 @@ class VisionService(AIService):
await super().process_frame(frame, direction)
if isinstance(frame, VisionImageRawFrame):
await self.start_processing_metrics()
await self.process_generator(self.run_vision(frame))
await self.stop_processing_metrics()
else:
await self.push_frame(frame, direction)
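The regex-based `match_endofsentence` in this diff is meant to run on text that grows token by token. Here is a small sketch of how such a check can drive sentence aggregation. It uses the same lookbehinds in compact form, and the generator `sentences` is a hypothetical helper, not pipecat API.

```python
import re
from typing import Iterable, Iterator

# Same lookbehinds as the verbose pattern in the diff, written on one line.
END_OF_SENTENCE = re.compile(
    r"(?<![A-Z])(?<!\d)(?<!\d\s[ap])(?<!Mr|Ms|Dr)(?<!Mrs)(?<!Prof)[.?!:]$"
)


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each completed sentence."""
    current = ""
    for token in tokens:
        current += token
        if END_OF_SENTENCE.search(current.rstrip()):
            yield current.strip()
            current = ""
    # Flush whatever is left when the stream ends mid-sentence.
    if current.strip():
        yield current.strip()


result = list(sentences(["Hello", " Mr.", " Smith", ".", " How", " are", " you", "?"]))
```

The title "Mr." does not end the first sentence because of the `(?<!Mr|Ms|Dr)` lookbehind. The plain `endswith` check on the other side of this hunk has no equivalent rule for partial abbreviations.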


@@ -8,11 +8,12 @@ import base64
from pipecat.frames.frames import (
Frame,
LLMModelUpdateFrame,
TextFrame,
VisionImageRawFrame,
LLMMessagesFrame,
LLMFullResponseStartFrame,
LLMResponseStartFrame,
LLMResponseEndFrame,
LLMFullResponseEndFrame
)
from pipecat.processors.frame_processor import FrameDirection
@@ -40,7 +41,6 @@ class AnthropicLLMService(LLMService):
def __init__(
self,
*,
api_key: str,
model: str = "claude-3-opus-20240229",
max_tokens: int = 1024):
@@ -117,10 +117,12 @@ class AnthropicLLMService(LLMService):
async for event in response:
# logger.debug(f"Anthropic LLM event: {event}")
if (event.type == "content_block_delta"):
await self.push_frame(LLMResponseStartFrame())
await self.push_frame(TextFrame(event.delta.text))
await self.push_frame(LLMResponseEndFrame())
except Exception as e:
logger.exception(f"{self} exception: {e}")
logger.error(f"{self} exception: {e}")
finally:
await self.push_frame(LLMFullResponseEndFrame())
@@ -135,9 +137,6 @@ class AnthropicLLMService(LLMService):
context = OpenAILLMContext.from_messages(frame.messages)
elif isinstance(frame, VisionImageRawFrame):
context = OpenAILLMContext.from_image_frame(frame)
elif isinstance(frame, LLMModelUpdateFrame):
logger.debug(f"Switching LLM model to: [{frame.model}]")
self._model = frame.model
else:
await self.push_frame(frame, direction)


@@ -12,18 +12,9 @@ import time
from PIL import Image
from typing import AsyncGenerator
from pipecat.frames.frames import (
AudioRawFrame,
CancelFrame,
EndFrame,
ErrorFrame,
Frame,
StartFrame,
SystemFrame,
TranscriptionFrame,
URLImageRawFrame)
from pipecat.frames.frames import AudioRawFrame, CancelFrame, EndFrame, ErrorFrame, Frame, StartFrame, SystemFrame, TranscriptionFrame, URLImageRawFrame
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.ai_services import AsyncAIService, TTSService, ImageGenService
from pipecat.services.ai_services import AIService, TTSService, ImageGenService
from pipecat.services.openai import BaseOpenAILLMService
from loguru import logger
@@ -43,7 +34,7 @@ try:
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error(
"In order to use Azure, you need to `pip install pipecat-ai[azure]`. Also, set `AZURE_SPEECH_API_KEY` and `AZURE_SPEECH_REGION` environment variables.")
"In order to use Azure TTS, you need to `pip install pipecat-ai[azure]`. Also, set `AZURE_SPEECH_API_KEY` and `AZURE_SPEECH_REGION` environment variables.")
raise Exception(f"Missing module: {e}")
@@ -81,12 +72,8 @@ class AzureTTSService(TTSService):
def can_generate_metrics(self) -> bool:
return True
async def set_voice(self, voice: str):
logger.debug(f"Switching TTS voice to: [{voice}]")
self._voice = voice
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
logger.debug(f"Generating TTS: [{text}]")
logger.debug(f"Generating TTS: {text}")
await self.start_ttfb_metrics()
@@ -113,7 +100,7 @@ class AzureTTSService(TTSService):
logger.error(f"{self} error: {cancellation_details.error_details}")
class AzureSTTService(AsyncAIService):
class AzureSTTService(AIService):
def __init__(
self,
*,
@@ -136,6 +123,8 @@ class AzureSTTService(AsyncAIService):
speech_config=speech_config, audio_config=audio_config)
self._speech_recognizer.recognized.connect(self._on_handle_recognized)
self._create_push_task()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
@@ -151,16 +140,34 @@ class AzureSTTService(AsyncAIService):
async def stop(self, frame: EndFrame):
self._speech_recognizer.stop_continuous_recognition_async()
self._audio_stream.close()
await self._push_queue.put((frame, FrameDirection.DOWNSTREAM))
await self._push_frame_task
async def cancel(self, frame: CancelFrame):
self._speech_recognizer.stop_continuous_recognition_async()
self._audio_stream.close()
self._push_frame_task.cancel()
await self._push_frame_task
def _create_push_task(self):
self._push_queue = asyncio.Queue()
self._push_frame_task = self.get_event_loop().create_task(self._push_frame_task_handler())
async def _push_frame_task_handler(self):
running = True
while running:
try:
(frame, direction) = await self._push_queue.get()
await self.push_frame(frame, direction)
running = not isinstance(frame, EndFrame)
except asyncio.CancelledError:
break
def _on_handle_recognized(self, event):
if event.result.reason == ResultReason.RecognizedSpeech and len(event.result.text) > 0:
direction = FrameDirection.DOWNSTREAM
frame = TranscriptionFrame(event.result.text, "", int(time.time_ns() / 1000000))
asyncio.run_coroutine_threadsafe(self.queue_frame(frame), self.get_event_loop())
asyncio.run_coroutine_threadsafe(
self._push_queue.put((frame, direction)), self.get_event_loop())
class AzureImageGenServiceREST(ImageGenService):
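The recognizer callback in this file runs on an SDK thread, so results are handed to the event loop with `asyncio.run_coroutine_threadsafe` and drained by a task on the loop. The following is a minimal sketch of that handoff, assuming a plain worker thread stands in for the speech SDK. The names `sdk_callback_thread`, `consume`, and the sentinel are illustrative.

```python
import asyncio
import threading
from typing import List


async def consume(queue: asyncio.Queue, sentinel: object) -> List[str]:
    """Drain the queue on the event loop until the sentinel arrives."""
    items: List[str] = []
    while True:
        item = await queue.get()
        if item is sentinel:
            return items
        items.append(item)


async def demo() -> List[str]:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    sentinel = object()

    def sdk_callback_thread() -> None:
        # Runs off the loop thread: never touch the queue directly here,
        # schedule the put on the loop instead.
        for text in ("hello", "world"):
            asyncio.run_coroutine_threadsafe(queue.put(text), loop)
        asyncio.run_coroutine_threadsafe(queue.put(sentinel), loop)

    worker = threading.Thread(target=sdk_callback_thread)
    worker.start()
    items = await consume(queue, sentinel)
    worker.join()  # safe: the thread never waits on the loop
    return items


received = asyncio.run(demo())
```

The worker thread deliberately does not wait on the returned futures, so joining it from the loop thread cannot deadlock.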


@@ -4,37 +4,15 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
import json
import uuid
import base64
import asyncio
import time
from cartesia.tts import AsyncCartesiaTTS
from typing import AsyncGenerator
from pipecat.processors.frame_processor import FrameDirection
from pipecat.frames.frames import (
Frame,
AudioRawFrame,
StartInterruptionFrame,
StartFrame,
EndFrame,
TextFrame,
LLMFullResponseEndFrame
)
from pipecat.frames.frames import AudioRawFrame, Frame
from pipecat.services.ai_services import TTSService
from loguru import logger
# See .env.example for Cartesia configuration needed
try:
import websockets
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error(
"In order to use Cartesia, you need to `pip install pipecat-ai[cartesia]`. Also, set `CARTESIA_API_KEY` environment variable.")
raise Exception(f"Missing module: {e}")
class CartesiaTTSService(TTSService):
@@ -42,184 +20,44 @@ class CartesiaTTSService(TTSService):
self,
*,
api_key: str,
voice_id: str,
cartesia_version: str = "2024-06-10",
url: str = "wss://api.cartesia.ai/tts/websocket",
model_id: str = "sonic-english",
encoding: str = "pcm_s16le",
sample_rate: int = 16000,
language: str = "en",
voice_name: str,
model_id: str = "upbeat-moon",
output_format: str = "pcm_16000",
**kwargs):
super().__init__(**kwargs)
# Aggregating sentences still gives cleaner-sounding results and fewer
# artifacts than streaming one word at a time. On average, waiting for
# a full sentence should only "cost" us 15ms or so with GPT-4o or a Llama 3
# model, and it's worth it for the better audio quality.
self._aggregate_sentences = True
# we don't want to automatically push LLM response text frames, because the
# context aggregators will add them to the LLM context even if we're
# interrupted. cartesia gives us word-by-word timestamps. we can use those
# to generate text frames ourselves aligned with the playout timing of the audio!
self._push_text_frames = False
self._api_key = api_key
self._cartesia_version = cartesia_version
self._url = url
self._voice_id = voice_id
self._voice_name = voice_name
self._model_id = model_id
self._output_format = {
"container": "raw",
"encoding": encoding,
"sample_rate": sample_rate,
}
self._language = language
self._output_format = output_format
self._websocket = None
self._context_id = None
self._context_id_start_timestamp = None
self._timestamped_words_buffer = []
self._receive_task = None
self._context_appending_task = None
try:
self._client = AsyncCartesiaTTS(api_key=self._api_key)
voices = self._client.get_voices()
voice_id = voices[self._voice_name]["id"]
self._voice = self._client.get_voice_embedding(voice_id=voice_id)
except Exception as e:
logger.error(f"{self} initialization error: {e}")
def can_generate_metrics(self) -> bool:
return True
async def set_voice(self, voice: str):
logger.debug(f"Switching TTS voice to: [{voice}]")
self._voice_id = voice
async def start(self, frame: StartFrame):
await super().start(frame)
await self._connect()
async def stop(self, frame: EndFrame):
await super().stop(frame)
await self._disconnect()
async def _connect(self):
try:
self._websocket = await websockets.connect(
f"{self._url}?api_key={self._api_key}&cartesia_version={self._cartesia_version}"
)
self._receive_task = self.get_event_loop().create_task(self._receive_task_handler())
self._context_appending_task = self.get_event_loop().create_task(self._context_appending_task_handler())
except Exception as e:
logger.exception(f"{self} initialization error: {e}")
self._websocket = None
async def _disconnect(self):
try:
if self._context_appending_task:
self._context_appending_task.cancel()
await self._context_appending_task
self._context_appending_task = None
if self._receive_task:
self._receive_task.cancel()
await self._receive_task
self._receive_task = None
if self._websocket:
ws = self._websocket
self._websocket = None
await ws.close()
self._context_id = None
self._context_id_start_timestamp = None
self._timestamped_words_buffer = []
await self.stop_all_metrics()
except Exception as e:
logger.exception(f"{self} error closing websocket: {e}")
async def _handle_interruption(self, frame: StartInterruptionFrame, direction: FrameDirection):
await super()._handle_interruption(frame, direction)
self._context_id = None
self._context_id_start_timestamp = None
self._timestamped_words_buffer = []
await self.stop_all_metrics()
await self.push_frame(LLMFullResponseEndFrame())
async def _receive_task_handler(self):
try:
async for message in self._websocket:
msg = json.loads(message)
# logger.debug(f"Received message: {msg['type']} {msg['context_id']}")
if not msg or msg["context_id"] != self._context_id:
continue
if msg["type"] == "done":
await self.stop_ttfb_metrics()
# Unset _context_id but keep _context_id_start_timestamp: we are likely still
# playing out audio and need the timestamp to send context frames.
self._context_id = None
self._timestamped_words_buffer.append(("LLMFullResponseEndFrame", 0))
elif msg["type"] == "timestamps":
# logger.debug(f"TIMESTAMPS: {msg}")
self._timestamped_words_buffer.extend(
list(zip(msg["word_timestamps"]["words"], msg["word_timestamps"]["end"]))
)
elif msg["type"] == "chunk":
await self.stop_ttfb_metrics()
if not self._context_id_start_timestamp:
self._context_id_start_timestamp = time.time()
frame = AudioRawFrame(
audio=base64.b64decode(msg["data"]),
sample_rate=self._output_format["sample_rate"],
num_channels=1
)
await self.push_frame(frame)
except Exception as e:
logger.exception(f"{self} exception: {e}")
async def _context_appending_task_handler(self):
try:
while True:
await asyncio.sleep(0.1)
if not self._context_id_start_timestamp:
continue
elapsed_seconds = time.time() - self._context_id_start_timestamp
# Pop all words from self._timestamped_words_buffer whose end timestamp is at or
# before the elapsed time and push each one downstream as a TextFrame.
while self._timestamped_words_buffer and self._timestamped_words_buffer[0][1] <= elapsed_seconds:
word, timestamp = self._timestamped_words_buffer.pop(0)
if word == "LLMFullResponseEndFrame" and timestamp == 0:
await self.push_frame(LLMFullResponseEndFrame())
continue
# print(f"Word '{word}' with timestamp {timestamp:.2f}s has been spoken.")
await self.push_frame(TextFrame(word))
except Exception as e:
logger.exception(f"{self} exception: {e}")
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
logger.debug(f"Generating TTS: [{text}]")
try:
if not self._websocket:
await self._connect()
await self.start_ttfb_metrics()
if not self._context_id:
await self.start_ttfb_metrics()
self._context_id = str(uuid.uuid4())
chunk_generator = await self._client.generate(
stream=True,
transcript=text,
voice=self._voice,
model_id=self._model_id,
output_format=self._output_format,
)
msg = {
"transcript": text + " ",
"continue": True,
"context_id": self._context_id,
"model_id": self._model_id,
"voice": {
"mode": "id",
"id": self._voice_id
},
"output_format": self._output_format,
"language": self._language,
"add_timestamps": True,
}
# logger.debug(f"SENDING MESSAGE {json.dumps(msg)}")
try:
await self._websocket.send(json.dumps(msg))
except Exception as e:
logger.exception(f"{self} error sending message: {e}")
await self._disconnect()
await self._connect()
return
yield None
async for chunk in chunk_generator:
await self.stop_ttfb_metrics()
yield AudioRawFrame(chunk["audio"], chunk["sampling_rate"], 1)
except Exception as e:
logger.exception(f"{self} exception: {e}")
logger.error(f"{self} exception: {e}")
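The word-timestamp handling in the Cartesia hunk above (buffer `(word, end)` pairs, then release each word once playback has reached it) can be sketched on its own. This is a minimal illustration, not the service's code: `WordPlayoutBuffer` and its method names are hypothetical, and the `now` argument exists only so the clock can be injected.

```python
import time
from collections import deque


class WordPlayoutBuffer:
    """Hypothetical helper: holds (word, end_seconds) pairs until playback reaches them."""

    def __init__(self):
        self._words = deque()
        self._start = None  # wall-clock time at which the first audio chunk arrived

    def mark_audio_started(self, now=None):
        # Only the first chunk sets the reference time.
        if self._start is None:
            self._start = time.time() if now is None else now

    def add(self, words, end_times):
        # Mirrors zip(msg["word_timestamps"]["words"], msg["word_timestamps"]["end"]).
        self._words.extend(zip(words, end_times))

    def pop_spoken(self, now=None):
        """Return the words whose end timestamp is at or before the elapsed playout time."""
        if self._start is None:
            return []
        elapsed = (time.time() if now is None else now) - self._start
        spoken = []
        while self._words and self._words[0][1] <= elapsed:
            spoken.append(self._words.popleft()[0])
        return spoken
```

A periodic task would call `pop_spoken()` every 100 ms or so and emit one text frame per returned word, which keeps the LLM context limited to what was actually spoken before an interruption.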

View File

@@ -5,6 +5,7 @@
#
import aiohttp
import asyncio
import time
from typing import AsyncGenerator
@@ -20,24 +21,17 @@ from pipecat.frames.frames import (
SystemFrame,
TranscriptionFrame)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.ai_services import AsyncAIService, TTSService
from pipecat.services.ai_services import AIService, TTSService
from deepgram import (
DeepgramClient,
DeepgramClientOptions,
LiveTranscriptionEvents,
LiveOptions,
)
from loguru import logger
# See .env.example for Deepgram configuration needed
try:
from deepgram import (
DeepgramClient,
DeepgramClientOptions,
LiveTranscriptionEvents,
LiveOptions,
)
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error(
"In order to use Deepgram, you need to `pip install pipecat-ai[deepgram]`. Also, set `DEEPGRAM_API_KEY` environment variable.")
raise Exception(f"Missing module: {e}")
class DeepgramTTSService(TTSService):
@@ -59,10 +53,6 @@ class DeepgramTTSService(TTSService):
def can_generate_metrics(self) -> bool:
return True
async def set_voice(self, voice: str):
logger.debug(f"Switching TTS voice to: [{voice}]")
self._voice = voice
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
logger.debug(f"Generating TTS: [{text}]")
@@ -93,12 +83,11 @@ class DeepgramTTSService(TTSService):
frame = AudioRawFrame(audio=data, sample_rate=16000, num_channels=1)
yield frame
except Exception as e:
logger.exception(f"{self} exception: {e}")
logger.error(f"{self} exception: {e}")
class DeepgramSTTService(AsyncAIService):
class DeepgramSTTService(AIService):
def __init__(self,
*,
api_key: str,
url: str = "",
live_options: LiveOptions = LiveOptions(
@@ -120,6 +109,8 @@ class DeepgramSTTService(AsyncAIService):
self._connection = self._client.listen.asynclive.v("1")
self._connection.on(LiveTranscriptionEvents.Transcript, self._on_message)
self._create_push_task()
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
@@ -128,7 +119,7 @@ class DeepgramSTTService(AsyncAIService):
elif isinstance(frame, AudioRawFrame):
await self._connection.send(frame.audio)
else:
await self.queue_frame(frame, direction)
await self._push_queue.put((frame, direction))
async def start(self, frame: StartFrame):
if await self._connection.start(self._live_options):
@@ -138,9 +129,27 @@ class DeepgramSTTService(AsyncAIService):
async def stop(self, frame: EndFrame):
await self._connection.finish()
await self._push_queue.put((frame, FrameDirection.DOWNSTREAM))
await self._push_frame_task
async def cancel(self, frame: CancelFrame):
await self._connection.finish()
self._push_frame_task.cancel()
await self._push_frame_task
def _create_push_task(self):
self._push_queue = asyncio.Queue()
self._push_frame_task = self.get_event_loop().create_task(self._push_frame_task_handler())
async def _push_frame_task_handler(self):
running = True
while running:
try:
(frame, direction) = await self._push_queue.get()
await self.push_frame(frame, direction)
running = not isinstance(frame, EndFrame)
except asyncio.CancelledError:
break
async def _on_message(self, *args, **kwargs):
result = kwargs["result"]
@@ -148,6 +157,6 @@ class DeepgramSTTService(AsyncAIService):
transcript = result.channel.alternatives[0].transcript
if len(transcript) > 0:
if is_final:
await self.queue_frame(TranscriptionFrame(transcript, "", int(time.time_ns() / 1000000)))
await self._push_queue.put((TranscriptionFrame(transcript, "", int(time.time_ns() / 1000000)), FrameDirection.DOWNSTREAM))
else:
await self.queue_frame(InterimTranscriptionFrame(transcript, "", int(time.time_ns() / 1000000)))
await self._push_queue.put((InterimTranscriptionFrame(transcript, "", int(time.time_ns() / 1000000)), FrameDirection.DOWNSTREAM))
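The queue-backed push task in the Deepgram hunk above follows a common asyncio pattern: producers put items on a queue, a single worker pushes them in order, and a sentinel ends the worker. A self-contained sketch, where `EndSignal`, `push_worker`, and `demo` are hypothetical stand-ins rather than pipecat names:

```python
import asyncio


class EndSignal:
    """Hypothetical stand-in for an end-of-stream frame such as EndFrame."""


async def push_worker(queue, push):
    # Drain the queue in order; stop after the end signal has been pushed.
    running = True
    while running:
        item = await queue.get()
        await push(item)
        running = not isinstance(item, EndSignal)


async def demo():
    pushed = []

    async def push(item):
        pushed.append(item)

    queue = asyncio.Queue()
    worker = asyncio.create_task(push_worker(queue, push))
    await queue.put("interim transcript")
    await queue.put("final transcript")
    await queue.put(EndSignal())
    await worker  # returns once the end signal has been pushed
    return pushed
```

Awaiting the worker after queueing the sentinel, as `stop()` does in the hunk, guarantees every earlier item is pushed before shutdown completes.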

View File

@@ -34,10 +34,6 @@ class ElevenLabsTTSService(TTSService):
def can_generate_metrics(self) -> bool:
return True
async def set_voice(self, voice: str):
logger.debug(f"Switching TTS voice to: [{voice}]")
self._voice_id = voice
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
logger.debug(f"Generating TTS: [{text}]")

View File

@@ -56,7 +56,7 @@ class FalImageGenService(ImageGenService):
response = await fal_client.run_async(
self._model,
arguments={"prompt": prompt, **self._params.model_dump(exclude_none=True)}
arguments={"prompt": prompt, **self._params.model_dump()}
)
image_url = response["images"][0]["url"] if response else None

View File

@@ -19,7 +19,6 @@ except ModuleNotFoundError as e:
class FireworksLLMService(BaseOpenAILLMService):
def __init__(self,
*,
model: str = "accounts/fireworks/models/firefunction-v1",
base_url: str = "https://api.fireworks.ai/inference/v1"):
super().__init__(model, base_url)

View File

@@ -1,115 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import base64
import json
import time
from typing import Optional
from pydantic.main import BaseModel
from pipecat.frames.frames import (
AudioRawFrame,
CancelFrame,
EndFrame,
Frame,
InterimTranscriptionFrame,
StartFrame,
SystemFrame,
TranscriptionFrame)
from pipecat.processors.frame_processor import FrameDirection
from pipecat.services.ai_services import AsyncAIService
from loguru import logger
# See .env.example for Gladia configuration needed
try:
import websockets
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error(
"In order to use Gladia, you need to `pip install pipecat-ai[gladia]`. Also, set `GLADIA_API_KEY` environment variable.")
raise Exception(f"Missing module: {e}")
class GladiaSTTService(AsyncAIService):
class InputParams(BaseModel):
sample_rate: Optional[int] = 16000
language: Optional[str] = "english"
transcription_hint: Optional[str] = None
endpointing: Optional[int] = 200
prosody: Optional[bool] = None
def __init__(self,
*,
api_key: str,
url: str = "wss://api.gladia.io/audio/text/audio-transcription",
confidence: float = 0.5,
params: InputParams = InputParams(),
**kwargs):
super().__init__(**kwargs)
self._api_key = api_key
self._url = url
self._params = params
self._confidence = confidence
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if isinstance(frame, SystemFrame):
await self.push_frame(frame, direction)
elif isinstance(frame, AudioRawFrame):
await self._send_audio(frame)
else:
await self.queue_frame(frame, direction)
async def start(self, frame: StartFrame):
self._websocket = await websockets.connect(self._url)
self._receive_task = self.get_event_loop().create_task(self._receive_task_handler())
await self._setup_gladia()
async def stop(self, frame: EndFrame):
await self._websocket.close()
async def cancel(self, frame: CancelFrame):
await self._websocket.close()
async def _setup_gladia(self):
configuration = {
"x_gladia_key": self._api_key,
"encoding": "WAV/PCM",
"model_type": "fast",
"language_behaviour": "manual",
**self._params.model_dump(exclude_none=True)
}
await self._websocket.send(json.dumps(configuration))
async def _send_audio(self, frame: AudioRawFrame):
message = {
'frames': base64.b64encode(frame.audio).decode("utf-8")
}
await self._websocket.send(json.dumps(message))
async def _receive_task_handler(self):
async for message in self._websocket:
utterance = json.loads(message)
if not utterance:
continue
if "error" in utterance:
message = utterance["message"]
logger.error(f"Gladia error: {message}")
elif "confidence" in utterance:
type = utterance["type"]
confidence = utterance["confidence"]
transcript = utterance["transcription"]
if confidence >= self._confidence:
if type == "final":
await self.queue_frame(TranscriptionFrame(transcript, "", int(time.time_ns() / 1000000)))
else:
await self.queue_frame(InterimTranscriptionFrame(transcript, "", int(time.time_ns() / 1000000)))

View File

@@ -10,11 +10,12 @@ from typing import List
from pipecat.frames.frames import (
Frame,
LLMModelUpdateFrame,
TextFrame,
VisionImageRawFrame,
LLMMessagesFrame,
LLMFullResponseStartFrame,
LLMResponseStartFrame,
LLMResponseEndFrame,
LLMFullResponseEndFrame
)
from pipecat.processors.frame_processor import FrameDirection
@@ -41,17 +42,14 @@ class GoogleLLMService(LLMService):
franca for all LLM services, so that it is easy to switch between different LLMs.
"""
def __init__(self, *, api_key: str, model: str = "gemini-1.5-flash-latest", **kwargs):
def __init__(self, api_key: str, model: str = "gemini-1.5-flash-latest", **kwargs):
super().__init__(**kwargs)
gai.configure(api_key=api_key)
self._create_client(model)
self._client = gai.GenerativeModel(model)
def can_generate_metrics(self) -> bool:
return True
def _create_client(self, model: str):
self._client = gai.GenerativeModel(model)
def _get_messages_from_openai_context(
self, context: OpenAILLMContext) -> List[glm.Content]:
openai_messages = context.get_messages()
@@ -97,17 +95,19 @@ class GoogleLLMService(LLMService):
async for chunk in self._async_generator_wrapper(response):
try:
text = chunk.text
await self.push_frame(LLMResponseStartFrame())
await self.push_frame(TextFrame(text))
await self.push_frame(LLMResponseEndFrame())
except Exception as e:
# Google LLMs seem to flag safety issues a lot!
if chunk.candidates[0].finish_reason == 3:
logger.debug(
f"LLM refused to generate content for safety reasons - {messages}.")
else:
logger.exception(f"{self} error: {e}")
logger.error(f"{self} error: {e}")
except Exception as e:
logger.exception(f"{self} exception: {e}")
logger.error(f"{self} exception: {e}")
finally:
await self.push_frame(LLMFullResponseEndFrame())
@@ -122,9 +122,6 @@ class GoogleLLMService(LLMService):
context = OpenAILLMContext.from_messages(frame.messages)
elif isinstance(frame, VisionImageRawFrame):
context = OpenAILLMContext.from_image_frame(frame)
elif isinstance(frame, LLMModelUpdateFrame):
logger.debug(f"Switching LLM model to: [{frame.model}]")
self._create_client(frame.model)
else:
await self.push_frame(frame, direction)

View File

@@ -46,7 +46,6 @@ def detect_device():
class MoondreamService(VisionService):
def __init__(
self,
*,
model="vikhyatk/moondream2",
revision="2024-04-02",
use_cpu=False

View File

@@ -9,5 +9,5 @@ from pipecat.services.openai import BaseOpenAILLMService
class OLLamaLLMService(BaseOpenAILLMService):
def __init__(self, *, model: str = "llama2", base_url: str = "http://localhost:11434/v1"):
def __init__(self, model: str = "llama2", base_url: str = "http://localhost:11434/v1"):
super().__init__(model=model, base_url=base_url, api_key="ollama")

View File

@@ -8,9 +8,8 @@ import aiohttp
import base64
import io
import json
import httpx
from typing import AsyncGenerator, List, Literal
from typing import Any, AsyncGenerator, List, Literal
from loguru import logger
from PIL import Image
@@ -22,7 +21,8 @@ from pipecat.frames.frames import (
LLMFullResponseEndFrame,
LLMFullResponseStartFrame,
LLMMessagesFrame,
LLMModelUpdateFrame,
LLMResponseEndFrame,
LLMResponseStartFrame,
TextFrame,
URLImageRawFrame,
VisionImageRawFrame
@@ -39,7 +39,7 @@ from pipecat.services.ai_services import (
)
try:
from openai import AsyncOpenAI, AsyncStream, DefaultAsyncHttpxClient, BadRequestError
from openai import AsyncOpenAI, AsyncStream, BadRequestError
from openai.types.chat import (
ChatCompletionChunk,
ChatCompletionFunctionMessageParam,
@@ -53,7 +53,7 @@ except ModuleNotFoundError as e:
raise Exception(f"Missing module: {e}")
class OpenAIUnhandledFunctionException(Exception):
class OpenAIUnhandledFunctionException(BaseException):
pass
@@ -67,20 +67,13 @@ class BaseOpenAILLMService(LLMService):
calls from the LLM.
"""
def __init__(self, *, model: str, api_key=None, base_url=None, **kwargs):
def __init__(self, model: str, api_key=None, base_url=None, **kwargs):
super().__init__(**kwargs)
self._model: str = model
self._client = self.create_client(api_key=api_key, base_url=base_url, **kwargs)
def create_client(self, api_key=None, base_url=None, **kwargs):
return AsyncOpenAI(
api_key=api_key,
base_url=base_url,
http_client=DefaultAsyncHttpxClient(
limits=httpx.Limits(
max_keepalive_connections=100,
max_connections=1000,
keepalive_expiry=None)))
return AsyncOpenAI(api_key=api_key, base_url=base_url)
def can_generate_metrics(self) -> bool:
return True
@@ -116,7 +109,10 @@ class BaseOpenAILLMService(LLMService):
del message["data"]
del message["mime_type"]
chunks = await self.get_chat_completions(context, messages)
try:
chunks = await self.get_chat_completions(context, messages)
except Exception as e:
logger.error(f"{self} exception: {e}")
return chunks
@@ -158,7 +154,9 @@ class BaseOpenAILLMService(LLMService):
# Keep iterating through the response to collect all the argument fragments
arguments += tool_call.function.arguments
elif chunk.choices[0].delta.content:
await self.push_frame(LLMResponseStartFrame())
await self.push_frame(TextFrame(chunk.choices[0].delta.content))
await self.push_frame(LLMResponseEndFrame())
# if we got a function name and arguments, check to see if it's a function with
# a registered handler. If so, run the registered callback, save the result to
@@ -216,7 +214,7 @@ class BaseOpenAILLMService(LLMService):
elif isinstance(result, type(None)):
pass
else:
raise TypeError(f"Unknown return type from function callback: {type(result)}")
raise BaseException(f"Unknown return type from function callback: {type(result)}")
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
@@ -228,24 +226,19 @@ class BaseOpenAILLMService(LLMService):
context = OpenAILLMContext.from_messages(frame.messages)
elif isinstance(frame, VisionImageRawFrame):
context = OpenAILLMContext.from_image_frame(frame)
elif isinstance(frame, LLMModelUpdateFrame):
logger.debug(f"Switching LLM model to: [{frame.model}]")
self._model = frame.model
else:
await self.push_frame(frame, direction)
if context:
await self.push_frame(LLMFullResponseStartFrame())
await self.start_processing_metrics()
await self._process_context(context)
await self.stop_processing_metrics()
await self.push_frame(LLMFullResponseEndFrame())
class OpenAILLMService(BaseOpenAILLMService):
def __init__(self, *, model: str = "gpt-4o", **kwargs):
super().__init__(model=model, **kwargs)
def __init__(self, model="gpt-4o", **kwargs):
super().__init__(model, **kwargs)
class OpenAIImageGenService(ImageGenService):
@@ -317,10 +310,6 @@ class OpenAITTSService(TTSService):
def can_generate_metrics(self) -> bool:
return True
async def set_voice(self, voice: str):
logger.debug(f"Switching TTS voice to: [{voice}]")
self._voice = voice
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
logger.debug(f"Generating TTS: [{text}]")
@@ -345,4 +334,4 @@ class OpenAITTSService(TTSService):
frame = AudioRawFrame(chunk, 24_000, 1)
yield frame
except BadRequestError as e:
logger.exception(f"{self} error generating TTS: {e}")
logger.error(f"{self} error generating TTS: {e}")
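The base-class change for `OpenAIUnhandledFunctionException` in this file affects which handlers see it: an `except Exception` clause does not catch classes that derive directly from `BaseException`. A small sketch with hypothetical class names makes the difference concrete:

```python
class UnhandledAsException(Exception):
    """Hypothetical error deriving from Exception."""


class UnhandledAsBaseException(BaseException):
    """Hypothetical error deriving directly from BaseException."""


def caught_by_generic_handler(exc_type):
    """Return True if a plain `except Exception` handler catches exc_type."""
    try:
        try:
            raise exc_type("no handler registered")
        except Exception:
            return True
    except BaseException:
        # Only reached when the inner handler let the error through.
        return False
```

Errors deriving from `BaseException` bypass generic handlers such as the `except Exception as e:` blocks elsewhere in this diff, which is why Python reserves that base for things like `KeyboardInterrupt` and `asyncio.CancelledError`.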

View File

@@ -25,7 +25,6 @@ class OpenPipeLLMService(BaseOpenAILLMService):
def __init__(
self,
*,
model: str = "gpt-4o",
api_key: str | None = None,
base_url: str | None = None,
@@ -34,9 +33,9 @@ class OpenPipeLLMService(BaseOpenAILLMService):
tags: Dict[str, str] | None = None,
**kwargs):
super().__init__(
model=model,
api_key=api_key,
base_url=base_url,
model,
api_key,
base_url,
openpipe_api_key=openpipe_api_key,
openpipe_base_url=openpipe_base_url,
**kwargs)

View File

@@ -80,4 +80,4 @@ class PlayHTTTSService(TTSService):
frame = AudioRawFrame(chunk, 16000, 1)
yield frame
except Exception as e:
logger.exception(f"{self} error generating TTS: {e}")
logger.error(f"{self} error generating TTS: {e}")

View File

@@ -42,8 +42,7 @@ class WhisperSTTService(STTService):
"""Class to transcribe audio with a locally-downloaded Whisper model"""
def __init__(self,
*,
model: str | Model = Model.DISTIL_MEDIUM_EN,
model: Model = Model.DISTIL_MEDIUM_EN,
device: str = "auto",
compute_type: str = "default",
no_speech_prob: float = 0.4,
@@ -52,7 +51,7 @@ class WhisperSTTService(STTService):
super().__init__(**kwargs)
self._device: str = device
self._compute_type = compute_type
self._model_name: str | Model = model
self._model_name: Model = model
self._no_speech_prob = no_speech_prob
self._model: WhisperModel | None = None
self._load()
@@ -65,7 +64,7 @@ class WhisperSTTService(STTService):
this model is being run, it will take time to download."""
logger.debug("Loading Whisper model...")
self._model = WhisperModel(
self._model_name.value if isinstance(self._model_name, Enum) else self._model_name,
self._model_name.value,
device=self._device,
compute_type=self._compute_type)
logger.debug("Loaded Whisper model")
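The `str | Model` variant in the Whisper hunk above accepts either an enum member or a free-form model name. The resolution step can be sketched as follows; the enum values here are made-up placeholders, not the real model identifiers:

```python
from enum import Enum


class Model(Enum):
    # Placeholder values for illustration only.
    TINY = "example/tiny"
    DISTIL_MEDIUM_EN = "example/distil-medium-en"


def resolve_model_name(model):
    """Return the string to hand to the model loader for an enum member or a plain string."""
    return model.value if isinstance(model, Enum) else model
```

With the `Model`-only signature, passing a plain string would fail at `self._model_name.value`, so callers are limited to the enumerated models.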

View File

@@ -1,116 +0,0 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import aiohttp
from typing import AsyncGenerator
from pipecat.frames.frames import AudioRawFrame, ErrorFrame, Frame
from pipecat.services.ai_services import TTSService
from loguru import logger
import requests
import numpy as np
try:
import resampy
except ModuleNotFoundError as e:
logger.error(f"Exception: {e}")
logger.error("In order to use XTTS, you need to `pip install pipecat-ai[xtts]`.")
raise Exception(f"Missing module: {e}")
# The server below can connect to XTTS through a local running docker
#
# Docker command: $ docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121
#
# You can find more information on the official repo:
# https://github.com/coqui-ai/xtts-streaming-server
class XTTSService(TTSService):
def __init__(
self,
*,
aiohttp_session: aiohttp.ClientSession,
voice_id: str,
language: str,
base_url: str,
**kwargs):
super().__init__(**kwargs)
self._voice_id = voice_id
self._language = language
self._base_url = base_url
self._aiohttp_session = aiohttp_session
self._studio_speakers = requests.get(self._base_url + "/studio_speakers").json()
def can_generate_metrics(self) -> bool:
return True
async def set_voice(self, voice: str):
logger.debug(f"Switching TTS voice to: [{voice}]")
self._voice_id = voice
async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
logger.debug(f"Generating TTS: [{text}]")
embeddings = self._studio_speakers[self._voice_id]
url = self._base_url + "/tts_stream"
payload = {
"text": text.replace('.', '').replace('*', ''),
"language": self._language,
"speaker_embedding": embeddings["speaker_embedding"],
"gpt_cond_latent": embeddings["gpt_cond_latent"],
"add_wav_header": False,
"stream_chunk_size": 20,
}
await self.start_ttfb_metrics()
async with self._aiohttp_session.post(url, json=payload) as r:
if r.status != 200:
text = await r.text()
logger.error(f"{self} error getting audio (status: {r.status}, error: {text})")
yield ErrorFrame(f"Error getting audio (status: {r.status}, error: {text})")
return
buffer = bytearray()
async for chunk in r.content.iter_chunked(1024):
if len(chunk) > 0:
await self.stop_ttfb_metrics()
# Append new chunk to the buffer
buffer.extend(chunk)
# Check if buffer has enough data for processing
while len(buffer) >= 48000:  # 48000 bytes = 1 second of 16-bit mono audio at 24000 Hz
# Process the buffer up to a safe size for resampling
process_data = buffer[:48000]
# Remove processed data from buffer
buffer = buffer[48000:]
# Convert the byte data to numpy array for resampling
audio_np = np.frombuffer(process_data, dtype=np.int16)
# Resample the audio from 24000 Hz to 16000 Hz
resampled_audio = resampy.resample(audio_np, 24000, 16000)
# Convert the numpy array back to bytes
resampled_audio_bytes = resampled_audio.astype(np.int16).tobytes()
# Create the frame with the resampled audio
frame = AudioRawFrame(resampled_audio_bytes, 16000, 1)
yield frame
# Process any remaining data in the buffer
if len(buffer) > 0:
audio_np = np.frombuffer(buffer, dtype=np.int16)
resampled_audio = resampy.resample(audio_np, 24000, 16000)
resampled_audio_bytes = resampled_audio.astype(np.int16).tobytes()
frame = AudioRawFrame(resampled_audio_bytes, 16000, 1)
yield frame
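The buffering loop in the XTTS hunk above cuts a byte stream into fixed 48000-byte blocks and flushes the remainder at the end. A dependency-free sketch of that splitting, plus the duration arithmetic for raw PCM (function names are hypothetical, and resampling is left out):

```python
def pcm_duration_seconds(num_bytes, sample_rate, sample_width=2, channels=1):
    """Duration of raw PCM audio; sample_width is bytes per sample (2 for int16)."""
    return num_bytes / (sample_rate * sample_width * channels)


def split_blocks(chunks, block_size):
    """Yield fixed-size blocks from an iterable of byte chunks, then any leftover bytes."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= block_size:
            yield bytes(buffer[:block_size])
            del buffer[:block_size]
    if buffer:
        yield bytes(buffer)
```

For 16-bit mono audio at 24000 Hz, a 48000-byte block holds 24000 samples, so each full block corresponds to one second of audio before resampling.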

View File

@@ -11,7 +11,6 @@ from concurrent.futures import ThreadPoolExecutor
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.frames.frames import (
AudioRawFrame,
BotInterruptionFrame,
CancelFrame,
StartFrame,
EndFrame,
@@ -56,7 +55,7 @@ class BaseInputTransport(FrameProcessor):
async def push_audio_frame(self, frame: AudioRawFrame):
if self._params.audio_in_enabled or self._params.vad_enabled:
await self._audio_in_queue.put(frame)
self._audio_in_queue.put_nowait(frame)
#
# Frame processor
@@ -79,8 +78,6 @@ class BaseInputTransport(FrameProcessor):
elif isinstance(frame, EndFrame):
await self._internal_push_frame(frame, direction)
await self.stop()
elif isinstance(frame, BotInterruptionFrame):
await self._handle_interruptions(frame, False)
else:
await self._internal_push_frame(frame, direction)
@@ -104,7 +101,6 @@ class BaseInputTransport(FrameProcessor):
try:
(frame, direction) = await self._push_queue.get()
await self.push_frame(frame, direction)
self._push_queue.task_done()
except asyncio.CancelledError:
break
@@ -112,35 +108,19 @@ class BaseInputTransport(FrameProcessor):
# Handle interruptions
#
async def _start_interruption(self):
# Cancel the task. This will stop pushing frames downstream.
self._push_frame_task.cancel()
await self._push_frame_task
# Push an out-of-band frame (i.e. not using the ordered push
# frame task) to stop everything, especially at the output
# transport.
await self.push_frame(StartInterruptionFrame())
# Create a new queue and task.
self._create_push_task()
async def _stop_interruption(self):
await self.push_frame(StopInterruptionFrame())
async def _handle_interruptions(self, frame: Frame, push_frame: bool):
async def _handle_interruptions(self, frame: Frame):
if self.interruptions_allowed:
# Make sure we notify about interruptions quickly out-of-band
if isinstance(frame, BotInterruptionFrame):
logger.debug("Bot interruption")
await self._start_interruption()
elif isinstance(frame, UserStartedSpeakingFrame):
if isinstance(frame, UserStartedSpeakingFrame):
logger.debug("User started speaking")
await self._start_interruption()
self._push_frame_task.cancel()
await self._push_frame_task
self._create_push_task()
await self.push_frame(StartInterruptionFrame())
elif isinstance(frame, UserStoppedSpeakingFrame):
logger.debug("User stopped speaking")
await self._stop_interruption()
if push_frame:
await self._internal_push_frame(frame)
await self.push_frame(StopInterruptionFrame())
await self._internal_push_frame(frame)
#
# Audio input
@@ -164,7 +144,7 @@ class BaseInputTransport(FrameProcessor):
frame = UserStoppedSpeakingFrame()
if frame:
await self._handle_interruptions(frame, True)
await self._handle_interruptions(frame)
vad_state = new_vad_state
return vad_state
@@ -186,9 +166,7 @@ class BaseInputTransport(FrameProcessor):
# Push audio downstream if passthrough.
if audio_passthrough:
await self._internal_push_frame(frame)
self._audio_in_queue.task_done()
except asyncio.CancelledError:
break
except Exception as e:
logger.exception(f"{self} error reading audio frames: {e}")
except BaseException as e:
logger.error(f"{self} error reading audio frames: {e}")
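The interruption handling in this file (cancel the push task, recreate the queue and task, then push an out-of-band interruption frame) can be sketched in isolation. `InterruptiblePusher` and the `"START_INTERRUPTION"` marker are hypothetical; the sketch assumes the worker task has already started running when it is cancelled, since a task cancelled before its first step would raise `CancelledError` on `await`.

```python
import asyncio


class InterruptiblePusher:
    """Hypothetical model of an ordered push task that can be interrupted."""

    def __init__(self, sink):
        self._sink = sink  # list standing in for downstream processors
        self._create_push_task()

    def _create_push_task(self):
        # A fresh queue discards anything that was still waiting to be pushed.
        self._queue = asyncio.Queue()
        self._task = asyncio.get_running_loop().create_task(self._run())

    async def _run(self):
        while True:
            try:
                item = await self._queue.get()
                self._sink.append(item)
            except asyncio.CancelledError:
                break

    async def queue(self, item):
        await self._queue.put(item)

    async def interrupt(self):
        self._task.cancel()
        await self._task  # the worker swallows the cancellation and exits
        self._create_push_task()
        self._sink.append("START_INTERRUPTION")  # pushed out-of-band, not via the queue

    async def close(self):
        self._task.cancel()
        await self._task


async def demo():
    sink = []
    pusher = InterruptiblePusher(sink)
    await pusher.queue("a")
    await asyncio.sleep(0.01)  # let the worker push "a"
    await pusher.queue("b")
    await pusher.queue("c")
    await pusher.interrupt()  # "b" and "c" are dropped
    await pusher.queue("d")
    await asyncio.sleep(0.01)
    await pusher.close()
    return sink
```

Whether the interruption marker is pushed before or after the new task is created differs between the two versions in the hunk; the sketch follows the ordering of the simplified version.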

View File

@@ -14,7 +14,6 @@ from typing import List
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.frames.frames import (
AudioRawFrame,
BotSpeakingFrame,
CancelFrame,
MetricsFrame,
SpriteFrame,
@@ -181,8 +180,8 @@ class BaseOutputTransport(FrameProcessor):
self._sink_queue.task_done()
except asyncio.CancelledError:
break
except Exception as e:
logger.exception(f"{self} error processing sink queue: {e}")
except BaseException as e:
logger.error(f"{self} error processing sink queue: {e}")
#
# Push frames task
@@ -204,7 +203,6 @@ class BaseOutputTransport(FrameProcessor):
try:
(frame, direction) = await self._push_queue.get()
await self.push_frame(frame, direction)
self._push_queue.task_done()
except asyncio.CancelledError:
break
@@ -252,7 +250,7 @@ class BaseOutputTransport(FrameProcessor):
except asyncio.CancelledError:
break
except Exception as e:
logger.exception(f"{self} error writing to camera: {e}")
logger.error(f"{self} error writing to camera: {e}")
#
# Audio out
@@ -265,5 +263,4 @@ class BaseOutputTransport(FrameProcessor):
if len(buffer) >= self._audio_chunk_size:
await self.write_raw_audio_frames(bytes(buffer[:self._audio_chunk_size]))
buffer = buffer[self._audio_chunk_size:]
await self.push_frame(BotSpeakingFrame(), FrameDirection.UPSTREAM)
return buffer

View File

@@ -82,4 +82,5 @@ class BaseTransport(ABC):
else:
handler(self, *args, **kwargs)
except Exception as e:
logger.exception(f"Exception in event handler {event_name}: {e}")
logger.error(f"Exception in event handler {event_name}: {e}")
raise e

View File

@@ -12,6 +12,7 @@ import wave
from typing import Awaitable, Callable
from pydantic.main import BaseModel
from pipecat.serializers.twilio import TwilioFrameSerializer
from pipecat.frames.frames import AudioRawFrame, StartFrame
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.serializers.base_serializer import FrameSerializer
@@ -34,7 +35,7 @@ except ModuleNotFoundError as e:
class FastAPIWebsocketParams(TransportParams):
add_wav_header: bool = False
audio_frame_size: int = 6400 # 200ms
serializer: FrameSerializer
serializer: FrameSerializer = TwilioFrameSerializer()
class FastAPIWebsocketCallbacks(BaseModel):
@@ -113,7 +114,7 @@ class FastAPIWebsocketOutputTransport(BaseOutputTransport):
frame = wav_frame
payload = self._params.serializer.serialize(frame)
if payload and self._websocket.client_state == WebSocketState.CONNECTED:
if payload:
await self._websocket.send_text(payload)
self._audio_buffer = self._audio_buffer[self._params.audio_frame_size:]
@@ -124,7 +125,7 @@ class FastAPIWebsocketTransport(BaseTransport):
def __init__(
self,
websocket: WebSocket,
params: FastAPIWebsocketParams,
params: FastAPIWebsocketParams = FastAPIWebsocketParams(),
input_name: str | None = None,
output_name: str | None = None,
loop: asyncio.AbstractEventLoop | None = None):

View File

@@ -124,9 +124,6 @@ class WebsocketServerOutputTransport(BaseOutputTransport):
self._websocket = websocket
async def write_raw_audio_frames(self, frames: bytes):
if not self._websocket:
return
self._audio_buffer += frames
while len(self._audio_buffer) >= self._params.audio_frame_size:
frame = AudioRawFrame(
@@ -151,8 +148,8 @@ class WebsocketServerOutputTransport(BaseOutputTransport):
frame = wav_frame
proto = self._params.serializer.serialize(frame)
if proto:
await self._websocket.send(proto)
await self._websocket.send(proto)
self._audio_buffer = self._audio_buffer[self._params.audio_frame_size:]

View File

@@ -9,7 +9,7 @@ import asyncio
import time
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Mapping, Optional
from typing import Any, Awaitable, Callable, Mapping
from concurrent.futures import ThreadPoolExecutor
from daily import (
@@ -59,8 +59,8 @@ class DailyTransportMessageFrame(TransportMessageFrame):
class WebRTCVADAnalyzer(VADAnalyzer):
def __init__(self, *, sample_rate=16000, num_channels=1, params: VADParams = VADParams()):
super().__init__(sample_rate=sample_rate, num_channels=num_channels, params=params)
def __init__(self, sample_rate=16000, num_channels=1, params: VADParams = VADParams()):
super().__init__(sample_rate, num_channels, params)
self._webrtc_vad = Daily.create_native_vad(
reset_period_ms=VAD_RESET_PERIOD_MS,
@@ -101,7 +101,7 @@ class DailyTranscriptionSettings(BaseModel):
class DailyParams(TransportParams):
api_url: str = "https://api.daily.co/v1"
api_key: str = ""
dialin_settings: Optional[DailyDialinSettings] = None
dialin_settings: DailyDialinSettings | None = None
transcription_enabled: bool = False
transcription_settings: DailyTranscriptionSettings = DailyTranscriptionSettings()
@@ -198,36 +198,30 @@ class DailyTransportClient(EventHandler):
def set_callbacks(self, callbacks: DailyCallbacks):
self._callbacks = callbacks
async def send_message(self, frame: TransportMessageFrame):
if not self._client:
return
participant_id = None
if isinstance(frame, DailyTransportMessageFrame):
participant_id = frame.participant_id
async def send_message(self, frame: DailyTransportMessageFrame):
future = self._loop.create_future()
self._client.send_app_message(
frame.message,
participant_id,
frame.participant_id,
completion=completion_callback(future))
await future
async def read_next_audio_frame(self) -> AudioRawFrame | None:
sample_rate = self._params.audio_in_sample_rate
num_channels = self._params.audio_in_channels
num_frames = int(sample_rate / 100) * 2 # 20ms of audio
future = self._loop.create_future()
self._speaker.read_frames(num_frames, completion=completion_callback(future))
audio = await future
if self._other_participant_has_joined:
num_frames = int(sample_rate / 100) * 2 # 20ms of audio
future = self._loop.create_future()
self._speaker.read_frames(num_frames, completion=completion_callback(future))
audio = await future
if len(audio) > 0:
return AudioRawFrame(audio=audio, sample_rate=sample_rate, num_channels=num_channels)
else:
# If we don't read any audio it could be there's no participant
# connected. daily-python will return immediately if that's the
# case, so let's sleep for a little bit (i.e. busy wait).
# If no one has ever joined the meeting `read_frames()` would block,
# instead we just wait a bit. daily-python should probably return
# silence instead.
await asyncio.sleep(0.01)
return None
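
Review note: `int(sample_rate / 100) * 2` in `read_next_audio_frame` is the number of frames in 10 ms, doubled to get 20 ms. The arithmetic can be sketched as a small helper; `frames_for_duration` is a hypothetical name and assumes durations that are multiples of 10 ms.

```python
def frames_for_duration(sample_rate: int, duration_ms: int) -> int:
    """Number of audio frames covering duration_ms, in 10 ms steps."""
    frames_per_10ms = int(sample_rate / 100)
    return frames_per_10ms * (duration_ms // 10)


frames_16k_20ms = frames_for_duration(16000, 20)
frames_8k_20ms = frames_for_duration(8000, 20)
```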
@@ -272,7 +266,7 @@ class DailyTransportClient(EventHandler):
logger.info(
f"Enabling transcription with settings {self._params.transcription_settings}")
self._client.start_transcription(
self._params.transcription_settings.model_dump(exclude_none=True))
self._params.transcription_settings.model_dump())
await self._callbacks.on_joined(data["participants"]["local"])
else:
@@ -659,15 +653,15 @@ class DailyOutputTransport(BaseOutputTransport):
await super().cleanup()
await self._client.cleanup()
async def send_message(self, frame: TransportMessageFrame):
async def send_message(self, frame: DailyTransportMessageFrame):
await self._client.send_message(frame)
async def send_metrics(self, frame: MetricsFrame):
ttfb = [{"name": n, "time": t} for n, t in frame.ttfb.items()]
message = DailyTransportMessageFrame(message={
"type": "pipecat-metrics",
"metrics": {
"ttfb": frame.ttfb or [],
"processing": frame.processing or [],
"ttfb": ttfb
},
})
await self._client.send_message(message)
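
Review note: the two `send_metrics` variants appear to expect different shapes for `frame.ttfb`. The `frame.ttfb or []` form passes a list through, while the comprehension form calls `.items()` and so needs a mapping. The sketch below shows what the comprehension produces, assuming a mapping from processor name to time; `ttfb_to_entries` and the sample names are hypothetical.

```python
from typing import Dict, List


def ttfb_to_entries(ttfb: Dict[str, float]) -> List[dict]:
    """Turn {name: time} into a list of {"name": ..., "time": ...} entries."""
    return [{"name": n, "time": t} for n, t in ttfb.items()]


entries = ttfb_to_entries({"llm": 0.25, "tts": 0.5})
```

Consumers of the `pipecat-metrics` message would need to match whichever shape the sending side uses.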
@@ -842,8 +836,8 @@ class DailyTransport(BaseTransport):
logger.debug("Event dialin-ready was handled successfully")
except asyncio.TimeoutError:
logger.error(f"Timeout handling dialin-ready event ({url})")
except Exception as e:
logger.exception(f"Error handling dialin-ready event ({url}): {e}")
except BaseException as e:
logger.error(f"Error handling dialin-ready event ({url}): {e}")
async def _on_dialin_ready(self, sip_endpoint):
if self._params.dialin_settings:
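
Review note on `except Exception` versus `except BaseException`: since Python 3.8, `asyncio.CancelledError` derives from `BaseException`, so a `BaseException` handler inside a coroutine also catches task cancellation. The sketch below demonstrates the difference with a hypothetical `swallow_with` helper; it is not taken from the transport code.

```python
import asyncio


async def swallow_with(exc_type):
    """Cancel a worker whose handler catches exc_type and report the outcome."""

    async def worker():
        try:
            await asyncio.sleep(10)
        except exc_type:
            # With BaseException this branch also catches CancelledError.
            return "swallowed"

    task = asyncio.create_task(worker())
    await asyncio.sleep(0)  # let the worker reach its first await
    task.cancel()
    try:
        return await task
    except asyncio.CancelledError:
        return "cancelled"


with_exception = asyncio.run(swallow_with(Exception))
with_base_exception = asyncio.run(swallow_with(BaseException))
```

In the same way, `logger.exception` records the traceback while `logger.error` records only the message, so the two variants also differ in how much diagnostic detail reaches the log.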

@@ -2,7 +2,7 @@ from typing import List
from pipecat.processors.frame_processor import FrameProcessor
class TestException(Exception):
class TestException(BaseException):
pass

@@ -33,23 +33,14 @@ _MODEL_RESET_STATES_TIME = 5.0
class SileroVADAnalyzer(VADAnalyzer):
def __init__(
self,
*,
sample_rate: int = 16000,
version: str = "v5.0",
params: VADParams = VADParams()):
def __init__(self, sample_rate=16000, params: VADParams = VADParams()):
super().__init__(sample_rate=sample_rate, num_channels=1, params=params)
if sample_rate != 16000 and sample_rate != 8000:
raise ValueError("Silero VAD sample rate needs to be 16000 or 8000")
logger.debug("Loading Silero VAD model...")
(self._model, _) = torch.hub.load(repo_or_dir=f"snakers4/silero-vad:{version}",
model="silero_vad",
force_reload=False,
trust_repo=True)
(self._model, utils) = torch.hub.load(
repo_or_dir="snakers4/silero-vad", model="silero_vad", force_reload=False
)
self._last_reset_time = 0
@@ -60,7 +51,7 @@ class SileroVADAnalyzer(VADAnalyzer):
#
def num_frames_required(self) -> int:
return 512 if self.sample_rate == 16000 else 256
return int(self.sample_rate / 100) * 4 # 40ms
def voice_confidence(self, buffer) -> float:
try:
@@ -78,9 +69,9 @@ class SileroVADAnalyzer(VADAnalyzer):
self._last_reset_time = curr_time
return new_confidence
except Exception as e:
except BaseException as e:
# This comes from an empty audio array
logger.exception(f"Error analyzing audio with Silero VAD: {e}")
logger.error(f"Error analyzing audio with Silero VAD: {e}")
return 0
@@ -88,15 +79,12 @@ class SileroVAD(FrameProcessor):
def __init__(
self,
*,
sample_rate: int = 16000,
version: str = "v5.0",
vad_params: VADParams = VADParams(),
audio_passthrough: bool = False):
super().__init__()
self._vad_analyzer = SileroVADAnalyzer(
sample_rate=sample_rate, version=version, params=vad_params)
self._vad_analyzer = SileroVADAnalyzer(sample_rate=sample_rate, params=vad_params)
self._audio_passthrough = audio_passthrough
self._processor_vad_state: VADState = VADState.QUIET
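
Review note: the two `num_frames_required` variants imply different analysis windows. A fixed 512 samples at 16 kHz (or 256 at 8 kHz) is a 32 ms window, whereas `int(sample_rate / 100) * 4` is 40 ms at any rate. The conversion can be checked with a small helper; `window_ms` is a hypothetical name.

```python
def window_ms(num_samples: int, sample_rate: int) -> float:
    """Duration in milliseconds covered by num_samples at sample_rate."""
    return num_samples * 1000 / sample_rate


fixed_16k = window_ms(512, 16000)
fixed_8k = window_ms(256, 8000)
scaled_16k = window_ms(int(16000 / 100) * 4, 16000)
```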

@@ -28,7 +28,7 @@ class VADParams(BaseModel):
class VADAnalyzer:
def __init__(self, *, sample_rate: int, num_channels: int, params: VADParams):
def __init__(self, sample_rate: int, num_channels: int, params: VADParams):
self._sample_rate = sample_rate
self._num_channels = num_channels
self._params = params
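
Review note: the bare `*` in `def __init__(self, *, sample_rate, num_channels, params)` makes every following parameter keyword-only. A minimal sketch of the effect, using hypothetical stand-in classes rather than the real `VADAnalyzer`:

```python
class PositionalAnalyzer:
    def __init__(self, sample_rate: int, num_channels: int):
        self.sample_rate = sample_rate
        self.num_channels = num_channels


class KeywordOnlyAnalyzer:
    def __init__(self, *, sample_rate: int, num_channels: int):
        self.sample_rate = sample_rate
        self.num_channels = num_channels


# Swapped positional arguments are accepted silently.
swapped = PositionalAnalyzer(1, 16000)

# The keyword-only form rejects positional arguments outright.
try:
    KeywordOnlyAnalyzer(1, 16000)
    rejected = False
except TypeError:
    rejected = True
```

Subclasses that call `super().__init__(sample_rate, num_channels, params)` positionally only work against the variant without the `*`.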

@@ -8,6 +8,8 @@ from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.frames.frames import (
LLMFullResponseStartFrame,
LLMFullResponseEndFrame,
LLMResponseEndFrame,
LLMResponseStartFrame,
TextFrame
)
from pipecat.utils.test_frame_processor import TestFrameProcessor
@@ -62,7 +64,7 @@ if __name__ == "__main__":
llm.register_function("get_current_weather", get_weather_from_api)
t = TestFrameProcessor([
LLMFullResponseStartFrame,
TextFrame,
[LLMResponseStartFrame, TextFrame, LLMResponseEndFrame],
LLMFullResponseEndFrame
])
llm.link(t)
@@ -96,7 +98,7 @@ if __name__ == "__main__":
llm.register_function("get_current_weather", get_weather_from_api)
t = TestFrameProcessor([
LLMFullResponseStartFrame,
TextFrame,
[LLMResponseStartFrame, TextFrame, LLMResponseEndFrame],
LLMFullResponseEndFrame
])
llm.link(t)
@@ -119,7 +121,7 @@ if __name__ == "__main__":
api_key = os.getenv("OPENAI_API_KEY")
t = TestFrameProcessor([
LLMFullResponseStartFrame,
TextFrame,
[LLMResponseStartFrame, TextFrame, LLMResponseEndFrame],
LLMFullResponseEndFrame
])
llm = OpenAILLMService(

View File

@@ -2,8 +2,8 @@ import unittest
from typing import AsyncGenerator
from pipecat.services.ai_services import AIService, match_endofsentence
from pipecat.frames.frames import EndFrame, Frame, TextFrame
from pipecat.services.ai_services import AIService
from pipecat.pipeline.frames import EndFrame, Frame, TextFrame
class SimpleAIService(AIService):
@@ -27,22 +27,6 @@ class TestBaseAIService(unittest.IsolatedAsyncioTestCase):
self.assertEqual(input_frames, output_frames)
async def test_endofsentence(self):
assert match_endofsentence("This is a sentence.")
assert match_endofsentence("This is a sentence! ")
assert match_endofsentence("This is a sentence?")
assert match_endofsentence("This is a sentence:")
assert not match_endofsentence("This is not a sentence")
assert not match_endofsentence("This is not a sentence,")
assert not match_endofsentence("This is not a sentence, ")
assert not match_endofsentence("Ok, Mr. Smith let's ")
assert not match_endofsentence("Dr. Walker, I presume ")
assert not match_endofsentence("Prof. Walker, I presume ")
assert not match_endofsentence("zweitens, und 3.")
assert not match_endofsentence("Heute ist Dienstag, der 3.") # 3. Juli 2024
assert not match_endofsentence("America, or the U.") # U.S.A.
assert not match_endofsentence("It still early, it's 3:00 a.") # 3:00 a.m.
if __name__ == "__main__":
unittest.main()