Compare commits


46 Commits

Author SHA1 Message Date
James Hush
1884ff3f09 logging 2024-11-27 19:38:37 +08:00
James Hush
f34e6bce94 Switch questions 2024-11-27 15:10:50 +08:00
James Hush
909bb30517 Better recreation 2024-11-27 14:08:01 +08:00
James Hush
632bae7eee Interrupted? 2024-11-27 12:21:45 +08:00
James Hush
cedccdcbc0 Add interruptions 2024-11-27 11:50:28 +08:00
James Hush
1893784b89 Save race bot 2024-11-27 11:36:28 +08:00
James Hush
e2384e2484 fix: add logging and error handling for issue #721 2024-11-26 11:22:58 +08:00
Mark Backman
98c0a6e047 Merge pull request #749 from pipecat-ai/mb/pipecat-flows-standalone
Make Pipecat Flows an independent package
2024-11-25 17:09:11 -05:00
Mark Backman
f599e160de Make Pipecat Flows an independent package 2024-11-25 13:42:08 -05:00
Mark Backman
11c5d822f9 Merge pull request #746 from pipecat-ai/mb/update-flows
Bumping pipecat-ai-flows version
2024-11-22 11:25:03 -05:00
Mark Backman
c3e22f0931 Bumping pipecat-ai-flows version 2024-11-22 11:21:40 -05:00
Kwindla Hultman Kramer
9409546f90 Merge pull request #743 from pipecat-ai/khk/gemini-exp
Empty text content bug fix for Gemini
2024-11-21 14:04:28 -08:00
Kwindla Hultman Kramer
8ddac0ccd8 Testing with gemini-exp-1114. Bug fix 2024-11-21 10:33:12 -08:00
Mark Backman
f938960d50 Merge pull request #736 from pipecat-ai/mb/language-support
Make language support more robust
2024-11-20 13:03:47 -05:00
Mark Backman
2981d87bc1 Update changelog 2024-11-20 12:56:35 -05:00
Mark Backman
106042bbb2 Make language support more robust 2024-11-20 12:56:11 -05:00
Filipi da Silva Fuchter
d25ddeb962 Merge pull request #739 from pipecat-ai/krisp_v7
bumping krisp to support v7
2024-11-20 11:39:39 -03:00
Filipi Fuchter
c441baa692 bumping krisp to support v7 2024-11-20 11:37:45 -03:00
Mark Backman
676ff14913 Merge pull request #735 from pipecat-ai/vp-internal-push-frame-fix
internal push frame fix
2024-11-20 06:34:40 -05:00
Vanessa Pyne
14893ade92 Update src/pipecat/processors/frame_processor.py
Co-authored-by: Mark Backman <mark@daily.co>
2024-11-19 22:37:58 -06:00
Mark Backman
2a39ff69d6 Merge pull request #720 from pipecat-ai/mb/conversation-flow 2024-11-19 21:46:20 -05:00
Mark Backman
e79289454a Merge pull request #734 from pipecat-ai/mb/fix-cartesia 2024-11-19 21:27:52 -05:00
Mark Backman
25d02da1b2 Merge pull request #738 from pipecat-ai/mb/natural-conversation-demo 2024-11-19 21:27:38 -05:00
Mark Backman
a36fc370fa Improve the 22c foundational example 2024-11-19 15:49:40 -05:00
Mark Backman
e4c2f6d4c2 Update changelog 2024-11-18 21:32:53 -05:00
Mark Backman
97659ca3f0 Use the new pipecat-ai-flows module 2024-11-18 21:29:35 -05:00
vipyne
e00c75ce3f fix: raise exception in internal_push_frame 2024-11-18 16:01:04 -06:00
Mark Backman
cf62167f54 Revert: services(cartesia): generated TTSStoppedFrame after no more audio 2024-11-18 12:25:04 -05:00
Mark Backman
b3dfeb61c4 Add CHANGELOG entry 2024-11-18 12:18:20 -05:00
Mark Backman
bd020320cd Support a list of messages 2024-11-18 12:18:20 -05:00
Mark Backman
7a55d2d7db Add end session handler and update example 2024-11-18 12:18:20 -05:00
Mark Backman
b7308dca5d Fix issue where actions would execute on terminating nodes 2024-11-18 12:18:20 -05:00
Mark Backman
5301f44b3b Add pre- and post-actions 2024-11-18 12:18:20 -05:00
Mark Backman
686165b95a Add ability to register actions 2024-11-18 12:18:20 -05:00
Mark Backman
4e0ecdd673 Class name updates and remove FrameProcessor base class 2024-11-18 12:18:20 -05:00
Mark Backman
1b74560f9d Move function registration into the ConversationFlowProcessor class 2024-11-18 12:18:20 -05:00
Mark Backman
0c1070433f Clean up and commenting 2024-11-18 12:18:20 -05:00
Mark Backman
ece2c08cde debugging 2024-11-18 12:18:20 -05:00
Mark Backman
0b9742da9e Add a conversation flow processor 2024-11-18 12:18:20 -05:00
Aleix Conchillo Flaqué
635aa6eb5b Merge pull request #729 from pipecat-ai/aleix/fastapi-websocket-dont-close
transports(fastapi): don't try to close socket
2024-11-18 16:01:41 +01:00
Mark Backman
1ff17cc2b6 Merge pull request #733 from pipecat-ai/aleix/add-missing-init-files
processors: add missing __init__.py
2024-11-18 09:44:56 -05:00
Mark Backman
41ce9e9087 Merge pull request #697 from pipecat-ai/cst/leave-message
add handler for disconnect-bot message
2024-11-18 09:38:11 -05:00
Mark Backman
4803c54ecf Update CHANGELOG 2024-11-18 09:36:19 -05:00
Christian Stuff
5d7b3f2b38 add handler for disconnect-bot message 2024-11-18 09:33:30 -05:00
Aleix Conchillo Flaqué
23e5b1ec4d processors: add missing __init__.py 2024-11-18 11:32:20 +01:00
Aleix Conchillo Flaqué
7f5a8928b8 transports(fastapi): don't try to close socket
The websocket is passed from outside (in the transport constructor) so we should
not be trying to close it. FastAPI does actually close it later. We didn't see
any issue because these functions were not implemented properly. The value to
check was `application_state` instead of `client_state`. But in any case,
Pipecat should not be responsible for closing things passed from outside.
2024-11-18 01:15:19 +01:00
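The commit message above argues that a transport should not close a socket it received from outside, because the owner (here FastAPI) closes it later. A minimal sketch of that ownership rule, using hypothetical `FakeSocket` and `OwnershipAwareTransport` classes (not Pipecat's or FastAPI's API): the transport records whether it created the resource and only closes it in that case.

```python
from typing import Optional


class FakeSocket:
    """Hypothetical stand-in for a websocket; records whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class OwnershipAwareTransport:
    """Hypothetical transport that closes only the sockets it created itself."""

    def __init__(self, socket: Optional[FakeSocket] = None) -> None:
        # A socket passed in by the caller is owned by the caller.
        self._owns_socket = socket is None
        self.socket = socket if socket is not None else FakeSocket()

    def stop(self) -> None:
        # Leave externally provided sockets for their owner (e.g. the web framework) to close.
        if self._owns_socket:
            self.socket.close()


external = FakeSocket()
OwnershipAwareTransport(external).stop()
print(external.closed)  # False: the caller still owns and closes it

owned = OwnershipAwareTransport()
owned.stop()
print(owned.socket.closed)  # True: the transport created it, so it closes it
```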
24 changed files with 1450 additions and 230 deletions


@@ -5,6 +5,21 @@ All notable changes to **Pipecat** will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Added
- Added a new RTVI message called `disconnect-bot`, which when handled pushes
an `EndFrame` to trigger the pipeline to stop.
### Changed
- Expanded the transcriptions.language module to support a superset of
languages.
- Updated STT and TTS services with language options that match the supported
languages for each service.
## [0.0.49] - 2024-11-17
### Added


@@ -13,6 +13,7 @@ Pipecat is an open source Python framework for building voice and multimodal con
- **Multimodal Apps**: Combine voice, video, images, and text
- **Creative Tools**: [Story-telling experiences](https://storytelling-chatbot.fly.dev/) and social companions
- **Business Solutions**: [Customer intake flows](https://www.youtube.com/watch?v=lDevgsp9vn0) and support bots
- **Complex conversational flows**: [Refer to Pipecat Flows](https://github.com/pipecat-ai/pipecat-flows) to learn more
## See it in action
@@ -32,6 +33,8 @@ Pipecat is an open source Python framework for building voice and multimodal con
- **Real-time Processing**: Frame-based pipeline architecture for fluid interactions
- **Production Ready**: Enterprise-grade WebRTC and Websocket support
💡 Looking to build structured conversations? Check out [Pipecat Flows](https://github.com/pipecat-ai/pipecat-flows) for managing complex conversational states and transitions.
## Getting started
You can get started with Pipecat running on your local machine, then move your agent processes to the cloud when you're ready. You can also add a 📞 telephone number, 🖼️ image output, 📺 video input, use different LLMs, and more.


@@ -10,11 +10,12 @@ import os
import sys
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMMessagesFrame
from pipecat.frames.frames import BotSpeakingFrame, Frame, InputAudioRawFrame, LLMMessagesFrame, TTSAudioRawFrame, TextFrame, UserStoppedSpeakingFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport
@@ -30,6 +31,22 @@ load_dotenv(override=True)
logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
class DebugProcessor(FrameProcessor):
def __init__(self, name, **kwargs):
self._name = name
super().__init__(**kwargs)
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if not (
isinstance(frame, InputAudioRawFrame)
or isinstance(frame, BotSpeakingFrame)
or isinstance(frame, TTSAudioRawFrame)
or isinstance(frame, TextFrame)
):
logger.debug(f"--- {self._name}: {frame} {direction}")
await self.push_frame(frame, direction)
async def main():
async with aiohttp.ClientSession() as session:
@@ -63,11 +80,14 @@ async def main():
context = OpenAILLMContext(messages)
context_aggregator = llm.create_context_aggregator(context)
dp = DebugProcessor("dp")
pipeline = Pipeline(
[
transport.input(), # Transport user input
context_aggregator.user(), # User responses
dp,
llm, # LLM
tts, # TTS
transport.output(), # Transport bot output
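The `DebugProcessor` in this hunk logs every frame except high-volume audio and text frames, then forwards the frame unchanged. The filtering decision can be sketched in isolation with hypothetical stand-in frame classes (not Pipecat's real frame types). `isinstance` accepts a tuple of classes, which keeps the exclusion list in one place; because it also matches subclasses, any frame type derived from `TextFrame` is silenced too. The sketch models `TranscriptionFrame` as such a subclass, which is an assumption about the real hierarchy worth checking before relying on this logger to show transcriptions.

```python
# Hypothetical stand-ins for the frame classes named in the diff.
class Frame: ...
class InputAudioRawFrame(Frame): ...
class BotSpeakingFrame(Frame): ...
class TTSAudioRawFrame(Frame): ...
class TextFrame(Frame): ...
class TranscriptionFrame(TextFrame): ...  # modeled as a TextFrame subclass (assumption)
class UserStoppedSpeakingFrame(Frame): ...

# High-volume frame types that would flood the debug log.
NOISY_FRAME_TYPES = (InputAudioRawFrame, BotSpeakingFrame, TTSAudioRawFrame, TextFrame)


def should_log(frame: Frame) -> bool:
    # isinstance() accepts a tuple, so one call replaces the chain of `or` checks.
    return not isinstance(frame, NOISY_FRAME_TYPES)


def debug_pass_through(frames, log):
    forwarded = []
    for frame in frames:
        if should_log(frame):
            log.append(type(frame).__name__)
        forwarded.append(frame)  # every frame is forwarded, logged or not
    return forwarded


log = []
frames = [InputAudioRawFrame(), TranscriptionFrame(), UserStoppedSpeakingFrame(), TextFrame()]
forwarded = debug_pass_through(frames, log)
print(log)  # ['UserStoppedSpeakingFrame']
```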


@@ -217,7 +217,11 @@ async def main():
voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22", # British Lady
)
llm = GoogleLLMService(model="gemini-1.5-flash-latest", api_key=os.getenv("GOOGLE_API_KEY"))
llm = GoogleLLMService(
model="gemini-1.5-flash-latest",
# model="gemini-exp-1114",
api_key=os.getenv("GOOGLE_API_KEY"),
)
messages = [
{


@@ -64,7 +64,11 @@ async def main():
voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22", # British Lady
)
llm = GoogleLLMService(model="gemini-1.5-flash-latest", api_key=os.getenv("GOOGLE_API_KEY"))
llm = GoogleLLMService(
model="gemini-1.5-flash-latest",
# model="gemini-exp-1114",
api_key=os.getenv("GOOGLE_API_KEY"),
)
llm.register_function("get_weather", get_weather)
llm.register_function("get_image", get_image)
@@ -151,7 +155,6 @@ indicate you should use the get_image tool are:
allow_interruptions=True,
enable_metrics=True,
enable_usage_metrics=True,
report_only_initial_ttfb=True,
),
)


@@ -4,50 +4,49 @@
# SPDX-License-Identifier: BSD 2-Clause License
#
import aiohttp
import asyncio
import os
import sys
import time
import aiohttp
from dotenv import load_dotenv
from loguru import logger
from runner import configure
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import LLMMessagesFrame, TextFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import (
OpenAILLMContext,
)
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.sync.event_notifier import EventNotifier
from pipecat.transports.services.daily import DailyParams, DailyTransport
from pipecat.processors.frame_processor import FrameProcessor, FrameDirection
from pipecat.frames.frames import (
CancelFrame,
EndFrame,
Frame,
LLMMessagesFrame,
StartFrame,
StartInterruptionFrame,
StopInterruptionFrame,
SystemFrame,
TextFrame,
TranscriptionFrame,
UserStartedSpeakingFrame,
UserStoppedSpeakingFrame,
)
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContextFrame
from pipecat.sync.base_notifier import BaseNotifier
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import (
OpenAILLMContext,
OpenAILLMContextFrame,
)
from pipecat.processors.filters.function_filter import FunctionFilter
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.processors.user_idle_processor import UserIdleProcessor
from runner import configure
from loguru import logger
from dotenv import load_dotenv
from pipecat.services.anthropic import AnthropicLLMService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.sync.base_notifier import BaseNotifier
from pipecat.sync.event_notifier import EventNotifier
from pipecat.transports.services.daily import DailyParams, DailyTransport
load_dotenv(override=True)
@@ -55,86 +54,206 @@ logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
classifier_statement = """Determine if the user's statement ends with a complete thought and you should respond.
classifier_statement = """CRITICAL INSTRUCTION:
You are a BINARY CLASSIFIER that must ONLY output "YES" or "NO".
DO NOT engage with the content.
DO NOT respond to questions.
DO NOT provide assistance.
Your ONLY job is to output YES or NO.
The user text is transcribed speech. You are trying to determine if:
EXAMPLES OF INVALID RESPONSES:
- "I can help you with that"
- "Let me explain"
- "To answer your question"
- Any response other than YES or NO
1. the user has finished talking and expects a response from you, or
2. this statement is incomplete and the user will continue talking
VALID RESPONSES:
YES
NO
A previous assistant response is provided for additional context. But you are only evaluating the user text.
If you output anything else, you are failing at your task.
You are NOT an assistant.
You are NOT a chatbot.
You are a binary classifier.
The user text may contain multiple fragments concatenated together. There may be repeated words or mistakes in the transcription. There may be grammatical errors. There may be extra punctuation. Ignore all of that. Interpret the transcribed text as text that would have been spoken. Then consider only whether the user has finished speaking and is expecting a response.
ROLE:
You are a real-time speech completeness classifier. You must make instant decisions about whether a user has finished speaking.
You must output ONLY 'YES' or 'NO' with no other text.
Categorize the last user statement as either complete with the user now expecting a response, or incomplete.
INPUT FORMAT:
You receive two pieces of information:
1. The assistant's last message (if available)
2. The user's current speech input
Return 'YES' if text is likely complete and the user is expecting a response. Return 'NO' if the text seems to be a partial expression or unfinished thought.
OUTPUT REQUIREMENTS:
- MUST output ONLY 'YES' or 'NO'
- No explanations
- No clarifications
- No additional text
- No punctuation
If you are not sure, respond with your best guess. If the user is expecting a response, respond with YES. If the user is not expecting a response, respond with NO. Always output either YES or NO and no other text.
HIGH PRIORITY SIGNALS:
Respond only YES or NO
1. Clear Questions:
- Wh-questions (What, Where, When, Why, How)
- Yes/No questions
- Questions with STT errors but clear meaning
Examples:
# Complete Wh-question
[{"role": "assistant", "content": "I can help you learn."},
{"role": "user", "content": "What's the fastest way to learn Spanish"}]
Output: YES
User: What's the capital of
Assistant: NO
# Complete Yes/No question despite STT error
[{"role": "assistant", "content": "I know about planets."},
{"role": "user", "content": "Is is Jupiter the biggest planet"}]
Output: YES
User: What's the captial of France?
Assistant: YES
2. Complete Commands:
- Direct instructions
- Clear requests
- Action demands
- Complete statements needing response
User: Tell me a story about
Assistant: NO
Examples:
# Direct instruction
[{"role": "assistant", "content": "I can explain many topics."},
{"role": "user", "content": "Tell me about black holes"}]
Output: YES
User: Tell me a story about a dragon
Assistant: YES
# Action demand
[{"role": "assistant", "content": "I can help with math."},
{"role": "user", "content": "Solve this equation x plus 5 equals 12"}]
Output: YES
User: Is there a
Assistant: NO
3. Direct Responses:
- Answers to specific questions
- Option selections
- Clear acknowledgments with completion
User: Is there a large
Assistant: NO
Examples:
# Specific answer
[{"role": "assistant", "content": "What's your favorite color?"},
{"role": "user", "content": "I really like blue"}]
Output: YES
User: Is there a large lake near Chicago?
Assistant: YES
# Option selection
[{"role": "assistant", "content": "Would you prefer morning or evening?"},
{"role": "user", "content": "Morning"}]
Output: YES
User: When is the longest day of the year?
Assistant: YES
MEDIUM PRIORITY SIGNALS:
User: When when is the longest day of the year
Assistant: YES
1. Speech Pattern Completions:
- Self-corrections reaching completion
- False starts with clear ending
- Topic changes with complete thought
- Mid-sentence completions
User: When when is the
Assistant: NO
Examples:
# Self-correction reaching completion
[{"role": "assistant", "content": "What would you like to know?"},
{"role": "user", "content": "Tell me about... no wait, explain how rainbows form"}]
Output: YES
User: What is the um I u
Assistant: NO
# Topic change with complete thought
[{"role": "assistant", "content": "The weather is nice today."},
{"role": "user", "content": "Actually can you tell me who invented the telephone"}]
Output: YES
User: What is the um i u largest city in the world
Assistant: YES
# Mid-sentence completion
[{"role": "assistant", "content": "Hello I'm ready."},
{"role": "user", "content": "What's the capital of? France"}]
Output: YES
User: How much does a how much does an adult elephant weigh?
Assistant: YES
2. Context-Dependent Brief Responses:
- Acknowledgments (okay, sure, alright)
- Agreements (yes, yeah)
- Disagreements (no, nah)
- Confirmations (correct, exactly)
User: How much does a how much does
Assistant: NO
Examples:
# Acknowledgment
[{"role": "assistant", "content": "Should we talk about history?"},
{"role": "user", "content": "Sure"}]
Output: YES
User: What can you tell me All the
Assistant: NO
# Disagreement with completion
[{"role": "assistant", "content": "Is that what you meant?"},
{"role": "user", "content": "No not really"}]
Output: YES
User: What can you tell me All the prime numbers less than 100
Assistant: YES
LOW PRIORITY SIGNALS:
User: What's the what's the length of the Amazon River?
Assistant: YES
1. STT Artifacts (Consider but don't over-weight):
- Repeated words
- Unusual punctuation
- Capitalization errors
- Word insertions/deletions
User: What's what's the length of the Amazon River?
Assistant: YES
Examples:
# Word repetition but complete
[{"role": "assistant", "content": "I can help with that."},
{"role": "user", "content": "What what is the time right now"}]
Output: YES
User: What's what's the length of the Amazon River
Assistant: YES
# Missing punctuation but complete
[{"role": "assistant", "content": "I can explain that."},
{"role": "user", "content": "Please tell me how computers work"}]
Output: YES
User: What's what's the best way to get a coffee stain out of a white shirt
Assistant: YES
2. Speech Features:
- Filler words (um, uh, like)
- Thinking pauses
- Word repetitions
- Brief hesitations
Examples:
# Filler words but complete
[{"role": "assistant", "content": "What would you like to know?"},
{"role": "user", "content": "Um uh how do airplanes fly"}]
Output: YES
# Thinking pause but incomplete
[{"role": "assistant", "content": "I can explain anything."},
{"role": "user", "content": "Well um I want to know about the"}]
Output: NO
DECISION RULES:
1. Return YES if:
- ANY high priority signal shows clear completion
- Medium priority signals combine to show completion
- Meaning is clear despite low priority artifacts
2. Return NO if:
- No high priority signals present
- Thought clearly trails off
- Multiple incomplete indicators
- User appears mid-formulation
3. When uncertain:
- If you can understand the intent → YES
- If meaning is unclear → NO
- Always make a binary decision
- Never request clarification
Examples:
# Incomplete despite corrections
[{"role": "assistant", "content": "What would you like to know about?"},
{"role": "user", "content": "Can you tell me about"}]
Output: NO
# Complete despite multiple artifacts
[{"role": "assistant", "content": "I can help you learn."},
{"role": "user", "content": "How do you I mean what's the best way to learn programming"}]
Output: YES
# Trailing off incomplete
[{"role": "assistant", "content": "I can explain anything."},
{"role": "user", "content": "I was wondering if you could tell me why"}]
Output: NO
"""
conversational_system_message = """You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.
@@ -297,15 +416,14 @@ async def main():
# statement. This doesn't really need to be an LLM, we could use NLP
# libraries for that, but we have the machinery to use an LLM, so we might as well!
statement_llm = AnthropicLLMService(
api_key=os.getenv("ANTHROPIC_API_KEY"), model="claude-3-5-haiku-20241022", name="Haiku"
api_key=os.getenv("ANTHROPIC_API_KEY"),
model="claude-3-5-sonnet-20241022",
)
# This is the regular LLM.
llm = AnthropicLLMService(
api_key=os.getenv("ANTHROPIC_API_KEY"),
model="claude-3-5-sonnet-20241022",
name="Sonnet",
params=AnthropicLLMService.InputParams(enable_prompt_caching_beta=True),
llm = OpenAILLMService(
api_key=os.getenv("OPENAI_API_KEY"),
model="gpt-4o",
)
messages = [
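The classifier prompt demands a bare `YES` or `NO`, but a model can still add whitespace, punctuation, or a stray sentence. A defensive parser is one way to turn the reply into a boolean. The function below is a hypothetical helper, not the example's actual code, and treating any unrecognized reply as "not finished" is an assumption that errs toward waiting for more speech.

```python
import re

# Hypothetical helper: accepts YES/NO in any case, with surrounding whitespace or
# trailing punctuation. Anything else counts as NO, i.e. keep listening.
_VERDICT = re.compile(r"\s*(YES|NO)[\s.!]*", re.IGNORECASE)


def user_turn_is_complete(reply: str) -> bool:
    match = _VERDICT.fullmatch(reply)
    if match is None:
        return False
    return match.group(1).upper() == "YES"


for reply in ["YES", " no.\n", "Yes!", "I can help you with that"]:
    print(repr(reply), user_turn_is_complete(reply))
```

Defaulting to `False` means a chatty classifier reply delays the bot's answer rather than cutting the user off mid-thought; the opposite default is equally defensible if latency matters more.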


@@ -0,0 +1,191 @@
#
# Copyright (c) 2024, Daily
#
# SPDX-License-Identifier: BSD 2-Clause License
#
import asyncio
import os
import sys
import time
import aiohttp
from loguru import logger
from runner import configure
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.frames.frames import (
BotSpeakingFrame,
EndFrame,
Frame,
InputAudioRawFrame,
StartInterruptionFrame,
StopInterruptionFrame,
TextFrame,
TranscriptionFrame,
TTSAudioRawFrame,
UserStartedSpeakingFrame,
UserStoppedSpeakingFrame,
)
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.services.openai import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport
logger.remove(0)
logger.add(sys.stderr, level="DEBUG")
class DebugProcessor(FrameProcessor):
def __init__(self, name, **kwargs):
self._name = name
super().__init__(**kwargs)
async def process_frame(self, frame: Frame, direction: FrameDirection):
await super().process_frame(frame, direction)
if not (
isinstance(frame, InputAudioRawFrame)
or isinstance(frame, BotSpeakingFrame)
or isinstance(frame, UserStoppedSpeakingFrame)
or isinstance(frame, TTSAudioRawFrame)
or isinstance(frame, TextFrame)
):
logger.debug(f"--- {self._name}: {frame} {direction}")
await self.push_frame(frame, direction)
async def main():
async with aiohttp.ClientSession() as session:
(room_url, _) = await configure(session)
transport = DailyTransport(
room_url,
None,
"AI Bot",
DailyParams(
audio_out_enabled=True,
transcription_enabled=True,
vad_enabled=True,
vad_analyzer=SileroVADAnalyzer(),
),
)
tts = CartesiaTTSService(
api_key=os.getenv("CARTESIA_API_KEY"),
voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22", # British Lady
)
llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"], model="gpt-4o")
messages = [
{
"role": "system",
"content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
},
]
dp = DebugProcessor("dp")
context = OpenAILLMContext(messages)
context_aggregator = llm.create_context_aggregator(context)
runner = PipelineRunner()
task = PipelineTask(
Pipeline(
[
# transport.input(),
context_aggregator.user(),
llm,
dp,
tts,
transport.output(),
context_aggregator.assistant(),
]
),
PipelineParams(
allow_interruptions=True,
),
)
# Register an event handler so we can play the audio when the
# participant joins.
@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
participant_id = participant.get("info", {}).get("participantId", "")
# Create frames for 300 seconds
start_time = time.time()
while time.time() - start_time < 300:
elapsed_time = round(time.time() - start_time)
logger.info(f"Running for {elapsed_time} seconds")
await task.queue_frame(
StartInterruptionFrame(),
)
await asyncio.sleep(1)
await task.queue_frame(
UserStartedSpeakingFrame(),
)
await asyncio.sleep(1)
await task.queue_frame(
TranscriptionFrame("Tell me more about your company.", participant_id, time.time()),
)
await asyncio.sleep(1)
await task.queue_frame(
StopInterruptionFrame(),
)
await asyncio.sleep(1)
await task.queue_frame(
UserStoppedSpeakingFrame(),
)
await asyncio.sleep(5)
await task.queue_frame(StartInterruptionFrame())
await asyncio.sleep(1)
await task.queue_frame(
UserStartedSpeakingFrame(),
)
await asyncio.sleep(1)
await task.queue_frame(
TranscriptionFrame("Give me a list of appointment dates.", participant_id, time.time()),
)
await asyncio.sleep(1)
await task.queue_frame(
StopInterruptionFrame(),
)
await asyncio.sleep(1)
await task.queue_frame(
UserStoppedSpeakingFrame(),
)
await asyncio.sleep(5)
await task.queue_frame(EndFrame())
# @transport.event_handler("on_first_participant_joined")
# async def on_first_participant_joined(transport, participant):
# await transport.capture_participant_transcription(participant["id"])
# # Kick off the conversation.
# messages.append({"role": "system", "content": "Please introduce yourself to the user."})
# await task.queue_frames([LLMMessagesFrame(messages)])
await runner.run(task)
if __name__ == "__main__":
asyncio.run(main())
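The new example drives the pipeline without a microphone: once a participant joins, it replays a fixed script of interruption, speaking, and transcription frames with one-second gaps, which makes interruption timing reproducible. The same idea can be sketched with plain `asyncio`, using hypothetical string labels instead of Pipecat frames and an `asyncio.Queue` in place of `PipelineTask`; the delays are shortened so the sketch runs quickly.

```python
import asyncio

# One simulated user turn, mirroring the order used in the example (labels are hypothetical).
USER_TURN = [
    "StartInterruptionFrame",
    "UserStartedSpeakingFrame",
    "TranscriptionFrame",
    "StopInterruptionFrame",
    "UserStoppedSpeakingFrame",
]


async def replay(queue: asyncio.Queue, turns: int, gap: float) -> None:
    for _ in range(turns):
        for frame in USER_TURN:
            await queue.put(frame)
            await asyncio.sleep(gap)  # spacing between frames, as in the example
    await queue.put("EndFrame")  # tell the consumer to stop


async def consume(queue: asyncio.Queue) -> list:
    seen = []
    while True:
        frame = await queue.get()
        seen.append(frame)
        if frame == "EndFrame":
            return seen


async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    _, seen = await asyncio.gather(replay(queue, turns=2, gap=0.01), consume(queue))
    return seen


seen = asyncio.run(main())
print(len(seen))  # 11: two turns of five frames plus EndFrame
```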


@@ -51,7 +51,7 @@ gladia = [ "websockets~=13.1" ]
google = [ "google-generativeai~=0.8.3", "google-cloud-texttospeech~=2.17.2" ]
gstreamer = [ "pygobject~=3.48.2" ]
fireworks = [ "openai~=1.37.2" ]
krisp = [ "pipecat-ai-krisp~=0.2.0" ]
krisp = [ "pipecat-ai-krisp~=0.3.0" ]
langchain = [ "langchain~=0.2.14", "langchain-community~=0.2.12", "langchain-openai~=0.1.20" ]
livekit = [ "livekit~=0.17.5", "livekit-api~=0.7.1", "tenacity~=8.5.0" ]
lmnt = [ "lmnt~=1.1.4" ]
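The Krisp extra moves from `~=0.2.0` to `~=0.3.0`. Under PEP 440, `~=X.Y.Z` is a compatible-release clause equivalent to `>=X.Y.Z, ==X.Y.*`, so the new pin accepts any 0.3.x patch release but excludes both 0.4.0 and the old 0.2.x line. A small sketch of that rule for plain numeric versions only (no pre-releases, epochs, or local labels; the `packaging` library is the right tool for real checks):

```python
def _parse(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def _pad(parts: tuple, width: int) -> tuple:
    # Treat missing trailing components as zeros, so "0.3" compares equal to "0.3.0".
    return parts + (0,) * (width - len(parts))


def satisfies_compatible_release(candidate: str, spec: str) -> bool:
    """Return True if `candidate` satisfies `~=spec` (plain numeric versions only)."""
    spec_parts = _parse(spec)
    if len(spec_parts) < 2:
        raise ValueError("~= needs a version with at least two components")
    cand_parts = _parse(candidate)
    width = max(len(spec_parts), len(cand_parts))
    cand, base = _pad(cand_parts, width), _pad(spec_parts, width)
    prefix_len = len(spec_parts) - 1  # every component except the last must match
    return cand >= base and cand[:prefix_len] == base[:prefix_len]


for version in ["0.2.9", "0.3.0", "0.3.7", "0.4.0"]:
    print(version, satisfies_compatible_release(version, "0.3.0"))
```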


@@ -246,6 +246,8 @@ class FrameProcessor:
await self._prev.queue_frame(frame, direction)
except Exception as e:
logger.exception(f"Uncaught exception in {self}: {e}")
await self.push_error(ErrorFrame(str(e)))
raise
def __create_input_task(self):
self.__input_queue = asyncio.Queue()
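As this hunk reads after commit e00c75ce3f ("fix: raise exception in internal_push_frame"), an exception raised while handing a frame to the neighboring processor is logged, reported with `push_error`, and then re-raised instead of being swallowed. The pattern in isolation, with a hypothetical `push_frame_safely` helper and an `errors` list standing in for the error frame (not Pipecat's API):

```python
import asyncio
import logging

logger = logging.getLogger("push_sketch")


async def push_frame_safely(send, frame, errors: list) -> None:
    """Hypothetical helper mirroring the try/except shape in the hunk above."""
    try:
        await send(frame)
    except Exception as e:
        logger.exception("Uncaught exception while pushing %r", frame)
        errors.append(str(e))  # stands in for push_error(ErrorFrame(str(e)))
        raise  # re-raise so the caller also sees the failure


async def failing_send(frame) -> None:
    raise RuntimeError("downstream failed")


async def main() -> tuple:
    errors: list = []
    try:
        await push_frame_safely(failing_send, "frame-1", errors)
    except RuntimeError:
        return errors, True
    return errors, False


errors, raised = asyncio.run(main())
print(errors, raised)  # ['downstream failed'] True
```

The trade-off is that callers must now handle the exception, in exchange for failures no longer disappearing after a log line.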


@@ -743,6 +743,8 @@ class RTVIProcessor(FrameProcessor):
case "update-config":
update_config = RTVIUpdateConfig.model_validate(message.data)
await self._handle_update_config(message.id, update_config)
case "disconnect-bot":
await self.push_frame(EndFrame())
case "action":
action = RTVIActionRun.model_validate(message.data)
action_frame = RTVIActionFrame(message_id=message.id, rtvi_action_run=action)


@@ -34,35 +34,77 @@ except ModuleNotFoundError as e:
def language_to_aws_language(language: Language) -> str | None:
language_map = {
# Arabic
Language.AR: "arb",
Language.AR_AE: "ar-AE",
# Catalan
Language.CA: "ca-ES",
Language.ZH: "cmn-CN",
# Chinese
Language.ZH: "cmn-CN", # Mandarin
Language.YUE: "yue-CN", # Cantonese
Language.YUE_CN: "yue-CN",
# Czech
Language.CS: "cs-CZ",
# Danish
Language.DA: "da-DK",
# Dutch
Language.NL: "nl-NL",
Language.NL_BE: "nl-BE",
Language.EN: "en-US",
Language.EN_US: "en-US",
# English
Language.EN: "en-US", # Default to US English
Language.EN_AU: "en-AU",
Language.EN_GB: "en-GB",
Language.EN_NZ: "en-NZ",
Language.EN_IN: "en-IN",
Language.EN_NZ: "en-NZ",
Language.EN_US: "en-US",
Language.EN_ZA: "en-ZA",
# Finnish
Language.FI: "fi-FI",
# French
Language.FR: "fr-FR",
Language.FR_BE: "fr-BE",
Language.FR_CA: "fr-CA",
# German
Language.DE: "de-DE",
Language.DE_AT: "de-AT",
Language.DE_CH: "de-CH",
# Hindi
Language.HI: "hi-IN",
# Icelandic
Language.IS: "is-IS",
# Italian
Language.IT: "it-IT",
# Japanese
Language.JA: "ja-JP",
# Korean
Language.KO: "ko-KR",
# Norwegian
Language.NO: "nb-NO",
Language.NB: "nb-NO",
Language.NB_NO: "nb-NO",
# Polish
Language.PL: "pl-PL",
# Portuguese
Language.PT: "pt-PT",
Language.PT_BR: "pt-BR",
Language.PT_PT: "pt-PT",
# Romanian
Language.RO: "ro-RO",
# Russian
Language.RU: "ru-RU",
# Spanish
Language.ES: "es-ES",
Language.ES_MX: "es-MX",
Language.ES_US: "es-US",
# Swedish
Language.SV: "sv-SE",
# Turkish
Language.TR: "tr-TR",
# Welsh
Language.CY: "cy-GB",
Language.CY_GB: "cy-GB",
}
return language_map.get(language)
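Each service maps Pipecat's `Language` enum to the provider's own codes with a plain dictionary: a bare language (for example `Language.EN`) resolves to a default region, regional variants map to their own codes, and anything the provider does not support falls through `dict.get` to `None`. A reduced sketch with a hypothetical four-member enum (the real `Language` enum and maps are much larger):

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    # Hypothetical subset; XX stands for a language the service does not support.
    EN = "en"
    EN_GB = "en-GB"
    ZH = "zh"
    XX = "xx"


def language_to_service_language(language: Language) -> Optional[str]:
    language_map = {
        Language.EN: "en-US",  # bare language defaults to a region
        Language.EN_GB: "en-GB",
        Language.ZH: "cmn-CN",  # Mandarin, as in the AWS map above
    }
    return language_map.get(language)  # None means "unsupported by this service"


print(language_to_service_language(Language.EN))  # en-US
print(language_to_service_language(Language.XX))  # None
```

Returning `None` leaves the caller free to decide whether an unsupported language is an error or should fall back to a default.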


@@ -63,49 +63,325 @@ except ModuleNotFoundError as e:
def language_to_azure_language(language: Language) -> str | None:
language_map = {
# Afrikaans
Language.AF: "af-ZA",
Language.AF_ZA: "af-ZA",
# Amharic
Language.AM: "am-ET",
Language.AM_ET: "am-ET",
# Arabic
Language.AR: "ar-AE", # Default to UAE Arabic
Language.AR_AE: "ar-AE",
Language.AR_BH: "ar-BH",
Language.AR_DZ: "ar-DZ",
Language.AR_EG: "ar-EG",
Language.AR_IQ: "ar-IQ",
Language.AR_JO: "ar-JO",
Language.AR_KW: "ar-KW",
Language.AR_LB: "ar-LB",
Language.AR_LY: "ar-LY",
Language.AR_MA: "ar-MA",
Language.AR_OM: "ar-OM",
Language.AR_QA: "ar-QA",
Language.AR_SA: "ar-SA",
Language.AR_SY: "ar-SY",
Language.AR_TN: "ar-TN",
Language.AR_YE: "ar-YE",
# Assamese
Language.AS: "as-IN",
Language.AS_IN: "as-IN",
# Azerbaijani
Language.AZ: "az-AZ",
Language.AZ_AZ: "az-AZ",
# Bulgarian
Language.BG: "bg-BG",
Language.BG_BG: "bg-BG",
# Bengali
Language.BN: "bn-IN", # Default to Indian Bengali
Language.BN_BD: "bn-BD",
Language.BN_IN: "bn-IN",
# Bosnian
Language.BS: "bs-BA",
Language.BS_BA: "bs-BA",
# Catalan
Language.CA: "ca-ES",
Language.ZH: "zh-CN",
Language.ZH_TW: "zh-TW",
Language.CA_ES: "ca-ES",
# Czech
Language.CS: "cs-CZ",
Language.CS_CZ: "cs-CZ",
# Welsh
Language.CY: "cy-GB",
Language.CY_GB: "cy-GB",
# Danish
Language.DA: "da-DK",
Language.NL: "nl-NL",
Language.EN: "en-US",
Language.EN_US: "en-US",
Language.EN_AU: "en-AU",
Language.EN_GB: "en-GB",
Language.EN_NZ: "en-NZ",
Language.EN_IN: "en-IN",
Language.ET: "et-EE",
Language.FI: "fi-FI",
Language.NL_BE: "nl-BE",
Language.FR: "fr-FR",
Language.FR_CA: "fr-CA",
Language.DA_DK: "da-DK",
# German
Language.DE: "de-DE",
Language.DE_AT: "de-AT",
Language.DE_CH: "de-CH",
Language.DE_DE: "de-DE",
# Greek
Language.EL: "el-GR",
Language.EL_GR: "el-GR",
# English
Language.EN: "en-US", # Default to US English
Language.EN_AU: "en-AU",
Language.EN_CA: "en-CA",
Language.EN_GB: "en-GB",
Language.EN_HK: "en-HK",
Language.EN_IE: "en-IE",
Language.EN_IN: "en-IN",
Language.EN_KE: "en-KE",
Language.EN_NG: "en-NG",
Language.EN_NZ: "en-NZ",
Language.EN_PH: "en-PH",
Language.EN_SG: "en-SG",
Language.EN_TZ: "en-TZ",
Language.EN_US: "en-US",
Language.EN_ZA: "en-ZA",
# Spanish
Language.ES: "es-ES", # Default to Spain Spanish
Language.ES_AR: "es-AR",
Language.ES_BO: "es-BO",
Language.ES_CL: "es-CL",
Language.ES_CO: "es-CO",
Language.ES_CR: "es-CR",
Language.ES_CU: "es-CU",
Language.ES_DO: "es-DO",
Language.ES_EC: "es-EC",
Language.ES_ES: "es-ES",
Language.ES_GQ: "es-GQ",
Language.ES_GT: "es-GT",
Language.ES_HN: "es-HN",
Language.ES_MX: "es-MX",
Language.ES_NI: "es-NI",
Language.ES_PA: "es-PA",
Language.ES_PE: "es-PE",
Language.ES_PR: "es-PR",
Language.ES_PY: "es-PY",
Language.ES_SV: "es-SV",
Language.ES_US: "es-US",
Language.ES_UY: "es-UY",
Language.ES_VE: "es-VE",
# Estonian
Language.ET: "et-EE",
Language.ET_EE: "et-EE",
# Basque
Language.EU: "eu-ES",
Language.EU_ES: "eu-ES",
# Persian
Language.FA: "fa-IR",
Language.FA_IR: "fa-IR",
# Finnish
Language.FI: "fi-FI",
Language.FI_FI: "fi-FI",
# Filipino
Language.FIL: "fil-PH",
Language.FIL_PH: "fil-PH",
# French
Language.FR: "fr-FR",
Language.FR_BE: "fr-BE",
Language.FR_CA: "fr-CA",
Language.FR_CH: "fr-CH",
Language.FR_FR: "fr-FR",
# Irish
Language.GA: "ga-IE",
Language.GA_IE: "ga-IE",
# Galician
Language.GL: "gl-ES",
Language.GL_ES: "gl-ES",
# Gujarati
Language.GU: "gu-IN",
Language.GU_IN: "gu-IN",
# Hebrew
Language.HE: "he-IL",
Language.HE_IL: "he-IL",
# Hindi
Language.HI: "hi-IN",
Language.HI_IN: "hi-IN",
# Croatian
Language.HR: "hr-HR",
Language.HR_HR: "hr-HR",
# Hungarian
Language.HU: "hu-HU",
Language.HU_HU: "hu-HU",
# Armenian
Language.HY: "hy-AM",
Language.HY_AM: "hy-AM",
# Indonesian
Language.ID: "id-ID",
Language.ID_ID: "id-ID",
# Icelandic
Language.IS: "is-IS",
Language.IS_IS: "is-IS",
# Italian
Language.IT: "it-IT",
Language.IT_IT: "it-IT",
# Inuktitut
Language.IU_CANS_CA: "iu-Cans-CA",
Language.IU_LATN_CA: "iu-Latn-CA",
# Japanese
Language.JA: "ja-JP",
Language.JA_JP: "ja-JP",
# Javanese
Language.JV: "jv-ID",
Language.JV_ID: "jv-ID",
# Georgian
Language.KA: "ka-GE",
Language.KA_GE: "ka-GE",
# Kazakh
Language.KK: "kk-KZ",
Language.KK_KZ: "kk-KZ",
# Khmer
Language.KM: "km-KH",
Language.KM_KH: "km-KH",
# Kannada
Language.KN: "kn-IN",
Language.KN_IN: "kn-IN",
# Korean
Language.KO: "ko-KR",
Language.LV: "lv-LV",
Language.KO_KR: "ko-KR",
# Lao
Language.LO: "lo-LA",
Language.LO_LA: "lo-LA",
# Lithuanian
Language.LT: "lt-LT",
Language.LT_LT: "lt-LT",
# Latvian
Language.LV: "lv-LV",
Language.LV_LV: "lv-LV",
# Macedonian
Language.MK: "mk-MK",
Language.MK_MK: "mk-MK",
# Malayalam
Language.ML: "ml-IN",
Language.ML_IN: "ml-IN",
# Mongolian
Language.MN: "mn-MN",
Language.MN_MN: "mn-MN",
# Marathi
Language.MR: "mr-IN",
Language.MR_IN: "mr-IN",
# Malay
Language.MS: "ms-MY",
Language.MS_MY: "ms-MY",
# Maltese
Language.MT: "mt-MT",
Language.MT_MT: "mt-MT",
# Burmese
Language.MY: "my-MM",
Language.MY_MM: "my-MM",
# Norwegian
Language.NB: "nb-NO",
Language.NB_NO: "nb-NO",
Language.NO: "nb-NO",
# Nepali
Language.NE: "ne-NP",
Language.NE_NP: "ne-NP",
# Dutch
Language.NL: "nl-NL",
Language.NL_BE: "nl-BE",
Language.NL_NL: "nl-NL",
# Odia
Language.OR: "or-IN",
Language.OR_IN: "or-IN",
# Punjabi
Language.PA: "pa-IN",
Language.PA_IN: "pa-IN",
# Polish
Language.PL: "pl-PL",
Language.PL_PL: "pl-PL",
# Pashto
Language.PS: "ps-AF",
Language.PS_AF: "ps-AF",
# Portuguese
Language.PT: "pt-PT",
Language.PT_BR: "pt-BR",
Language.PT_PT: "pt-PT",
# Romanian
Language.RO: "ro-RO",
Language.RO_RO: "ro-RO",
# Russian
Language.RU: "ru-RU",
Language.RU_RU: "ru-RU",
# Sinhala
Language.SI: "si-LK",
Language.SI_LK: "si-LK",
# Slovak
Language.SK: "sk-SK",
Language.ES: "es-ES",
Language.SK_SK: "sk-SK",
# Slovenian
Language.SL: "sl-SI",
Language.SL_SI: "sl-SI",
# Somali
Language.SO: "so-SO",
Language.SO_SO: "so-SO",
# Albanian
Language.SQ: "sq-AL",
Language.SQ_AL: "sq-AL",
# Serbian
Language.SR: "sr-RS",
Language.SR_RS: "sr-RS",
Language.SR_LATN: "sr-Latn-RS",
Language.SR_LATN_RS: "sr-Latn-RS",
# Sundanese
Language.SU: "su-ID",
Language.SU_ID: "su-ID",
# Swedish
Language.SV: "sv-SE",
Language.SV_SE: "sv-SE",
# Swahili
Language.SW: "sw-KE",
Language.SW_KE: "sw-KE",
Language.SW_TZ: "sw-TZ",
# Tamil
Language.TA: "ta-IN",
Language.TA_IN: "ta-IN",
Language.TA_LK: "ta-LK",
Language.TA_MY: "ta-MY",
Language.TA_SG: "ta-SG",
# Telugu
Language.TE: "te-IN",
Language.TE_IN: "te-IN",
# Thai
Language.TH: "th-TH",
Language.TH_TH: "th-TH",
# Turkish
Language.TR: "tr-TR",
Language.TR_TR: "tr-TR",
# Ukrainian
Language.UK: "uk-UA",
Language.UK_UA: "uk-UA",
# Urdu
Language.UR: "ur-IN",
Language.UR_IN: "ur-IN",
Language.UR_PK: "ur-PK",
# Uzbek
Language.UZ: "uz-UZ",
Language.UZ_UZ: "uz-UZ",
# Vietnamese
Language.VI: "vi-VN",
Language.VI_VN: "vi-VN",
# Wu Chinese
Language.WUU: "wuu-CN",
Language.WUU_CN: "wuu-CN",
# Yue Chinese
Language.YUE: "yue-CN",
Language.YUE_CN: "yue-CN",
# Chinese
Language.ZH: "zh-CN",
Language.ZH_CN: "zh-CN",
Language.ZH_CN_GUANGXI: "zh-CN-guangxi",
Language.ZH_CN_HENAN: "zh-CN-henan",
Language.ZH_CN_LIAONING: "zh-CN-liaoning",
Language.ZH_CN_SHAANXI: "zh-CN-shaanxi",
Language.ZH_CN_SHANDONG: "zh-CN-shandong",
Language.ZH_CN_SICHUAN: "zh-CN-sichuan",
Language.ZH_HK: "zh-HK",
Language.ZH_TW: "zh-TW",
# Zulu
Language.ZU: "zu-ZA",
Language.ZU_ZA: "zu-ZA",
}
return language_map.get(language)
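
The Azure map above is an exact lookup. Bare codes resolve to a default region (`Language.EN` to `en-US`, `Language.ES` to `es-ES`), and each listed variant resolves to itself. Below is a minimal sketch of that behavior. It assumes a hypothetical six-member subset of the enum and a hypothetical function name, `language_to_azure_locale`, because the real function's signature starts above this excerpt:

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    # Hypothetical subset of the real enum, for illustration only
    EN = "en"
    EN_GB = "en-GB"
    EN_US = "en-US"
    ES = "es"
    ES_MX = "es-MX"
    KO = "ko"


# Bare codes map to a default region; explicit variants map to themselves
AZURE_LOCALES = {
    Language.EN: "en-US",  # default to US English
    Language.EN_GB: "en-GB",
    Language.EN_US: "en-US",
    Language.ES: "es-ES",  # default to Spain Spanish
    Language.ES_MX: "es-MX",
}


def language_to_azure_locale(language: Language) -> Optional[str]:
    # Plain dict lookup: anything not listed (KO here) yields None
    return AZURE_LOCALES.get(language)


print(language_to_azure_locale(Language.EN))
print(language_to_azure_locale(Language.ES_MX))
print(language_to_azure_locale(Language.KO))
```

Because this is a plain `dict.get`, a variant missing from the map returns `None` and does not fall back to its base language. The other services in this diff add that fallback explicitly.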

@@ -7,6 +7,7 @@
import asyncio
import base64
import json
import random
import uuid
from typing import AsyncGenerator, List, Optional, Union
@@ -44,24 +45,27 @@ except ModuleNotFoundError as e:
def language_to_cartesia_language(language: Language) -> str | None:
language_map = {
BASE_LANGUAGES = {
Language.DE: "de",
Language.EN: "en",
Language.EN_US: "en",
Language.EN_GB: "en",
Language.EN_AU: "en",
Language.EN_NZ: "en",
Language.EN_IN: "en",
Language.ES: "es",
Language.FR: "fr",
Language.FR_CA: "fr",
Language.JA: "ja",
Language.PT: "pt",
Language.PT_BR: "pt",
Language.ZH: "zh",
Language.ZH_TW: "zh",
}
return language_map.get(language)
result = BASE_LANGUAGES.get(language)
# If not found in base languages, try to find the base language from a variant
if not result:
# Convert enum value to string and get the base language part (e.g. es-ES -> es)
lang_str = str(language.value)
base_code = lang_str.split("-")[0].lower()
# Look up the base code in our supported languages
result = base_code if base_code in BASE_LANGUAGES.values() else None
return result
class CartesiaTTSService(WordTTSService):
@@ -219,17 +223,22 @@ class CartesiaTTSService(WordTTSService):
async def _receive_task_handler(self):
try:
async for message in self._get_websocket():
# Randomly cancel the asyncio task 1% of the time
if random.random() < 0.01:
logger.info(f"Cancelling task for {self} due to random chance")
asyncio.current_task().cancel()
msg = json.loads(message)
if not msg or msg["context_id"] != self._context_id:
continue
if msg["type"] == "done":
await self.push_frame(TTSStoppedFrame())
await self.stop_ttfb_metrics()
# Unset _context_id but not the _context_id_start_timestamp
# because we are likely still playing out audio and need the
# timestamp to send context frames.
self._context_id = None
await self.add_word_timestamps([("LLMFullResponseEndFrame", 0), ("Reset", 0)])
await self.add_word_timestamps(
[("TTSStoppedFrame", 0), ("LLMFullResponseEndFrame", 0), ("Reset", 0)]
)
elif msg["type"] == "timestamps":
await self.add_word_timestamps(
list(zip(msg["word_timestamps"]["words"], msg["word_timestamps"]["start"]))
@@ -252,6 +261,7 @@ class CartesiaTTSService(WordTTSService):
logger.error(f"Cartesia error, unknown message type: {msg}")
except asyncio.CancelledError:
pass
# await self.push_error(ErrorFrame(f"{self} cancelled", True))
except Exception as e:
logger.error(f"{self} exception: {e}")
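
The debugging hook in `_receive_task_handler` calls `asyncio.current_task().cancel()` from inside the task itself. `cancel()` only requests cancellation. The current message continues to be processed synchronously, and `CancelledError` is raised at the next `await`. At that point the handler's `except asyncio.CancelledError: pass` swallows the error and the loop ends. The following deterministic sketch reproduces that behavior. It uses a hypothetical `receive_loop` and a fixed index in place of `random.random() < 0.01`:

```python
import asyncio
from typing import List


async def receive_loop(messages: List[str], cancel_at: int) -> List[str]:
    processed: List[str] = []
    try:
        for i, message in enumerate(messages):
            if i == cancel_at:
                # Request cancellation of the task we are running in
                asyncio.current_task().cancel()
            # Still runs: cancel() does not interrupt synchronous code
            processed.append(message)
            # The pending CancelledError is delivered at this await
            await asyncio.sleep(0)
    except asyncio.CancelledError:
        # Mirrors the `except asyncio.CancelledError: pass` in the diff
        processed.append("cancelled")
    return processed


async def main() -> List[str]:
    task = asyncio.create_task(receive_loop(["a", "b", "c"], cancel_at=1))
    return await task


print(asyncio.run(main()))  # ['a', 'b', 'cancelled']
```

Because the exception is swallowed, the task finishes normally and the receive loop simply stops. The commented-out `push_error(...)` line would be one way to report the stop instead.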

@@ -43,24 +43,16 @@ ElevenLabsOutputFormat = Literal["pcm_16000", "pcm_22050", "pcm_24000", "pcm_441
def language_to_elevenlabs_language(language: Language) -> str | None:
language_map = {
BASE_LANGUAGES = {
Language.BG: "bg",
Language.ZH: "zh",
Language.CS: "cs",
Language.DA: "da",
Language.NL: "nl",
Language.DE: "de",
Language.EL: "el",
Language.EN: "en",
Language.EN_US: "en",
Language.EN_AU: "en",
Language.EN_GB: "en",
Language.EN_NZ: "en",
Language.EN_IN: "en",
Language.ES: "es",
Language.FI: "fi",
Language.FR: "fr",
Language.FR_CA: "fr",
Language.DE: "de",
Language.DE_CH: "de",
Language.EL: "el",
Language.HI: "hi",
Language.HU: "hu",
Language.ID: "id",
@@ -68,20 +60,31 @@ def language_to_elevenlabs_language(language: Language) -> str | None:
Language.JA: "ja",
Language.KO: "ko",
Language.MS: "ms",
Language.NL: "nl",
Language.NO: "no",
Language.PL: "pl",
Language.PT: "pt-PT",
Language.PT_BR: "pt-BR",
Language.PT: "pt",
Language.RO: "ro",
Language.RU: "ru",
Language.SK: "sk",
Language.ES: "es",
Language.SV: "sv",
Language.TR: "tr",
Language.UK: "uk",
Language.VI: "vi",
Language.ZH: "zh",
}
return language_map.get(language)
result = BASE_LANGUAGES.get(language)
# If not found in base languages, try to find the base language from a variant
if not result:
# Convert enum value to string and get the base language part (e.g. es-ES -> es)
lang_str = str(language.value)
base_code = lang_str.split("-")[0].lower()
# Look up the base code in our supported languages
result = base_code if base_code in BASE_LANGUAGES.values() else None
return result
def sample_rate_from_output_format(output_format: str) -> int:
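
Cartesia, ElevenLabs, Gladia, and LMNT now share the same two-step lookup. First, try the exact enum member in `BASE_LANGUAGES`. Otherwise, take the part of the member's value before the first hyphen and accept it only if it is one of the map's values. Below is a self-contained sketch of that logic. It uses a hypothetical enum subset and a hypothetical function name, `to_base_language`:

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    # Hypothetical subset of the real enum
    EN = "en"
    EN_GB = "en-GB"
    KO = "ko"
    KO_KR = "ko-KR"
    PT = "pt"
    PT_BR = "pt-BR"
    TH_TH = "th-TH"


BASE_LANGUAGES = {
    Language.EN: "en",
    Language.KO: "ko",
    Language.PT: "pt",
}


def to_base_language(language: Language) -> Optional[str]:
    # 1. Exact match on the enum member
    result = BASE_LANGUAGES.get(language)
    if not result:
        # 2. Take the primary subtag of the value, e.g. "pt-BR" -> "pt"
        base_code = str(language.value).split("-")[0].lower()
        # Accept it only if the service supports that base language
        result = base_code if base_code in BASE_LANGUAGES.values() else None
    return result


print(to_base_language(Language.PT_BR))  # pt
print(to_base_language(Language.TH_TH))  # None ("th" is not in this subset)
```

With this change, `Language.PT_BR` resolves to `"pt"` through the fallback. The removed ElevenLabs entries mapped it to `"pt-BR"`.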

@@ -35,50 +35,98 @@ except ModuleNotFoundError as e:
def language_to_gladia_language(language: Language) -> str | None:
language_map = {
BASE_LANGUAGES = {
Language.AF: "af",
Language.AM: "am",
Language.AR: "ar",
Language.AS: "as",
Language.AZ: "az",
Language.BG: "bg",
Language.BN: "bn",
Language.BS: "bs",
Language.CA: "ca",
Language.ZH: "zh",
Language.CS: "cs",
Language.CY: "cy",
Language.DA: "da",
Language.NL: "nl",
Language.DE: "de",
Language.EL: "el",
Language.EN: "en",
Language.EN_US: "en",
Language.EN_AU: "en",
Language.EN_GB: "en",
Language.EN_NZ: "en",
Language.EN_IN: "en",
Language.ES: "es",
Language.ET: "et",
Language.EU: "eu",
Language.FA: "fa",
Language.FI: "fi",
Language.FR: "fr",
Language.FR_CA: "fr",
Language.DE: "de",
Language.DE_CH: "de",
Language.EL: "el",
Language.GA: "ga",
Language.GL: "gl",
Language.GU: "gu",
Language.HE: "he",
Language.HI: "hi",
Language.HR: "hr",
Language.HU: "hu",
Language.HY: "hy",
Language.ID: "id",
Language.IS: "is",
Language.IT: "it",
Language.JA: "ja",
Language.JV: "jv",
Language.KA: "ka",
Language.KK: "kk",
Language.KM: "km",
Language.KN: "kn",
Language.KO: "ko",
Language.LV: "lv",
Language.LO: "lo",
Language.LT: "lt",
Language.LV: "lv",
Language.MK: "mk",
Language.ML: "ml",
Language.MN: "mn",
Language.MR: "mr",
Language.MS: "ms",
Language.MT: "mt",
Language.MY: "my",
Language.NE: "ne",
Language.NL: "nl",
Language.NO: "no",
Language.OR: "or",
Language.PA: "pa",
Language.PL: "pl",
Language.PS: "ps",
Language.PT: "pt",
Language.PT_BR: "pt",
Language.RO: "ro",
Language.RU: "ru",
Language.SI: "si",
Language.SK: "sk",
Language.ES: "es",
Language.SL: "sl",
Language.SO: "so",
Language.SQ: "sq",
Language.SR: "sr",
Language.SU: "su",
Language.SV: "sv",
Language.SW: "sw",
Language.TA: "ta",
Language.TE: "te",
Language.TH: "th",
Language.TR: "tr",
Language.UK: "uk",
Language.UR: "ur",
Language.UZ: "uz",
Language.VI: "vi",
Language.ZH: "zh",
Language.ZU: "zu",
}
return language_map.get(language)
result = BASE_LANGUAGES.get(language)
# If not found in base languages, try to find the base language from a variant
if not result:
# Convert enum value to string and get the base language part (e.g. es-ES -> es)
lang_str = str(language.value)
base_code = lang_str.split("-")[0].lower()
# Look up the base code in our supported languages
result = base_code if base_code in BASE_LANGUAGES.values() else None
return result
class GladiaSTTService(STTService):
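
The fallback only looks at the text before the first hyphen. It therefore copes with the longer tags the enum now contains, such as script and dialect subtags, but it cannot bridge codes that differ at the primary subtag. Below is a sketch of just the subtag step. It uses hypothetical helper names and a four-code subset of the values in Gladia's `BASE_LANGUAGES`:

```python
from typing import Optional, Set

# Subset of the short codes in Gladia's BASE_LANGUAGES values (from the diff)
GLADIA_CODES: Set[str] = {"en", "no", "sr", "zh"}


def primary_subtag(tag: str) -> str:
    # Language tags put the language first: "sr-Latn-RS" -> "sr"
    return tag.split("-")[0].lower()


def fallback_code(tag: str, supported: Set[str]) -> Optional[str]:
    base = primary_subtag(tag)
    return base if base in supported else None


for tag in ["sr-Latn-RS", "zh-CN-guangxi", "en-NG", "nb-NO", "iu-Cans-CA"]:
    print(tag, "->", fallback_code(tag, GLADIA_CODES))
```

For example, `Language.NB_NO` yields `nb`, which is not among Gladia's values. It therefore resolves to `None`, even though `Language.NO` maps to `"no"`.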

@@ -58,48 +58,161 @@ except ModuleNotFoundError as e:
def language_to_google_language(language: Language) -> str | None:
language_map = {
# Afrikaans
Language.AF: "af-ZA",
Language.AF_ZA: "af-ZA",
# Arabic
Language.AR: "ar-XA",
# Bengali
Language.BN: "bn-IN",
Language.BN_IN: "bn-IN",
# Bulgarian
Language.BG: "bg-BG",
Language.BG_BG: "bg-BG",
# Catalan
Language.CA: "ca-ES",
Language.CA_ES: "ca-ES",
# Chinese (Mandarin and Cantonese)
Language.ZH: "cmn-CN",
Language.ZH_CN: "cmn-CN",
Language.ZH_TW: "cmn-TW",
Language.ZH_HK: "yue-HK",
# Czech
Language.CS: "cs-CZ",
Language.CS_CZ: "cs-CZ",
# Danish
Language.DA: "da-DK",
Language.DA_DK: "da-DK",
# Dutch
Language.NL: "nl-NL",
Language.NL_BE: "nl-BE",
Language.NL_NL: "nl-NL",
# English
Language.EN: "en-US",
Language.EN_US: "en-US",
Language.EN_AU: "en-AU",
Language.EN_GB: "en-GB",
Language.EN_IN: "en-IN",
# Estonian
Language.ET: "et-EE",
Language.ET_EE: "et-EE",
# Filipino
Language.FIL: "fil-PH",
Language.FIL_PH: "fil-PH",
# Finnish
Language.FI: "fi-FI",
Language.NL_BE: "nl-BE",
Language.FI_FI: "fi-FI",
# French
Language.FR: "fr-FR",
Language.FR_CA: "fr-CA",
Language.FR_FR: "fr-FR",
# Galician
Language.GL: "gl-ES",
Language.GL_ES: "gl-ES",
# German
Language.DE: "de-DE",
Language.DE_DE: "de-DE",
# Greek
Language.EL: "el-GR",
Language.EL_GR: "el-GR",
# Gujarati
Language.GU: "gu-IN",
Language.GU_IN: "gu-IN",
# Hebrew
Language.HE: "he-IL",
Language.HE_IL: "he-IL",
# Hindi
Language.HI: "hi-IN",
Language.HI_IN: "hi-IN",
# Hungarian
Language.HU: "hu-HU",
Language.HU_HU: "hu-HU",
# Icelandic
Language.IS: "is-IS",
Language.IS_IS: "is-IS",
# Indonesian
Language.ID: "id-ID",
Language.ID_ID: "id-ID",
# Italian
Language.IT: "it-IT",
Language.IT_IT: "it-IT",
# Japanese
Language.JA: "ja-JP",
Language.JA_JP: "ja-JP",
# Kannada
Language.KN: "kn-IN",
Language.KN_IN: "kn-IN",
# Korean
Language.KO: "ko-KR",
Language.KO_KR: "ko-KR",
# Latvian
Language.LV: "lv-LV",
Language.LV_LV: "lv-LV",
# Lithuanian
Language.LT: "lt-LT",
Language.LT_LT: "lt-LT",
# Malay
Language.MS: "ms-MY",
Language.MS_MY: "ms-MY",
# Malayalam
Language.ML: "ml-IN",
Language.ML_IN: "ml-IN",
# Marathi
Language.MR: "mr-IN",
Language.MR_IN: "mr-IN",
# Norwegian
Language.NO: "nb-NO",
Language.NB: "nb-NO",
Language.NB_NO: "nb-NO",
# Polish
Language.PL: "pl-PL",
Language.PL_PL: "pl-PL",
# Portuguese
Language.PT: "pt-PT",
Language.PT_BR: "pt-BR",
Language.PT_PT: "pt-PT",
# Punjabi
Language.PA: "pa-IN",
Language.PA_IN: "pa-IN",
# Romanian
Language.RO: "ro-RO",
Language.RO_RO: "ro-RO",
# Russian
Language.RU: "ru-RU",
Language.RU_RU: "ru-RU",
# Serbian
Language.SR: "sr-RS",
Language.SR_RS: "sr-RS",
# Slovak
Language.SK: "sk-SK",
Language.SK_SK: "sk-SK",
# Spanish
Language.ES: "es-ES",
Language.ES_ES: "es-ES",
Language.ES_US: "es-US",
# Swedish
Language.SV: "sv-SE",
Language.SV_SE: "sv-SE",
# Tamil
Language.TA: "ta-IN",
Language.TA_IN: "ta-IN",
# Telugu
Language.TE: "te-IN",
Language.TE_IN: "te-IN",
# Thai
Language.TH: "th-TH",
Language.TH_TH: "th-TH",
# Turkish
Language.TR: "tr-TR",
Language.TR_TR: "tr-TR",
# Ukrainian
Language.UK: "uk-UA",
Language.UK_UA: "uk-UA",
# Vietnamese
Language.VI: "vi-VN",
Language.VI_VN: "vi-VN",
}
return language_map.get(language)
@@ -168,9 +281,10 @@ class GoogleAssistantContextAggregator(OpenAIAssistantContextAggregator):
)
run_llm = not bool(self._function_calls_in_progress)
else:
self._context.add_message(
glm.Content(role="model", parts=[glm.Part(text=aggregation)])
)
if aggregation.strip():
self._context.add_message(
glm.Content(role="model", parts=[glm.Part(text=aggregation)])
)
if self._pending_image_frame_message:
frame = self._pending_image_frame_message
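
The Gemini change wraps `add_message` in `if aggregation.strip():`. As a result, an assistant turn that aggregated to an empty or whitespace-only string no longer becomes a model message with empty text. Below is a minimal sketch of that guard. The `Content` and `Part` dataclasses are hypothetical stand-ins for the `glm.Content` and `glm.Part` types used in the diff:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Part:
    text: str


@dataclass
class Content:
    role: str
    parts: List[Part] = field(default_factory=list)


def add_model_turn(messages: List[Content], aggregation: str) -> None:
    # Skip empty or whitespace-only aggregations so no empty text part is added
    if aggregation.strip():
        messages.append(Content(role="model", parts=[Part(text=aggregation)]))


history: List[Content] = []
add_model_turn(history, "Hello!")
add_model_turn(history, "   ")
add_model_turn(history, "")
print(len(history))  # 1
```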

@@ -36,24 +36,27 @@ except ModuleNotFoundError as e:
def language_to_lmnt_language(language: Language) -> str | None:
language_map = {
BASE_LANGUAGES = {
Language.DE: "de",
Language.EN: "en",
Language.EN_US: "en",
Language.EN_AU: "en",
Language.EN_GB: "en",
Language.EN_NZ: "en",
Language.EN_IN: "en",
Language.ES: "es",
Language.FR: "fr",
Language.FR_CA: "fr",
Language.PT: "pt",
Language.PT_BR: "pt",
Language.ZH: "zh",
Language.ZH_TW: "zh",
Language.KO: "ko",
Language.PT: "pt",
Language.ZH: "zh",
}
return language_map.get(language)
result = BASE_LANGUAGES.get(language)
# If not found in base languages, try to find the base language from a variant
if not result:
# Convert enum value to string and get the base language part (e.g. es-ES -> es)
lang_str = str(language.value)
base_code = lang_str.split("-")[0].lower()
# Look up the base code in our supported languages
result = base_code if base_code in BASE_LANGUAGES.values() else None
return result
class LmntTTSService(TTSService):

@@ -7,6 +7,7 @@
from typing import Any, AsyncGenerator, Dict
import aiohttp
from loguru import logger
from pipecat.audio.utils import resample_audio
from pipecat.frames.frames import (
@@ -20,9 +21,6 @@ from pipecat.frames.frames import (
from pipecat.services.ai_services import TTSService
from pipecat.transcriptions.language import Language
from loguru import logger
# The server below can connect to XTTS through a locally running Docker container
#
# Docker command: $ docker run --gpus=all -e COQUI_TOS_AGREED=1 --rm -p 8000:80 ghcr.io/coqui-ai/xtts-streaming-server:latest-cuda121
@@ -32,15 +30,10 @@ from loguru import logger
def language_to_xtts_language(language: Language) -> str | None:
language_map = {
BASE_LANGUAGES = {
Language.CS: "cs",
Language.DE: "de",
Language.EN: "en",
Language.EN_US: "en",
Language.EN_AU: "en",
Language.EN_GB: "en",
Language.EN_NZ: "en",
Language.EN_IN: "en",
Language.ES: "es",
Language.FR: "fr",
Language.HI: "hi",
@@ -51,12 +44,28 @@ def language_to_xtts_language(language: Language) -> str | None:
Language.NL: "nl",
Language.PL: "pl",
Language.PT: "pt",
Language.PT_BR: "pt",
Language.RU: "ru",
Language.TR: "tr",
# Special case for Chinese base language
Language.ZH: "zh-cn",
}
return language_map.get(language)
result = BASE_LANGUAGES.get(language)
# If not found in base languages, try to find the base language from a variant
if not result:
# Convert enum value to string and get the base language part (e.g. es-ES -> es)
lang_str = str(language.value)
base_code = lang_str.split("-")[0].lower()
# Special handling for Chinese variants
if base_code == "zh":
result = "zh-cn"
else:
# Look up the base code in our supported languages
result = base_code if base_code in BASE_LANGUAGES.values() else None
return result
class XTTSService(TTSService):
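
In the XTTS map, the value for `Language.ZH` is `"zh-cn"` rather than a bare base code. The generic check `base_code in BASE_LANGUAGES.values()` would therefore never match `"zh"`, which is why the diff adds a separate branch for Chinese. Below is a sketch of that branch, using a hypothetical enum subset and a hypothetical function name, `language_to_xtts`:

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    # Hypothetical subset of the real enum
    EN = "en"
    EN_GB = "en-GB"
    KO = "ko"
    ZH = "zh"
    ZH_TW = "zh-TW"


BASE_LANGUAGES = {
    Language.EN: "en",
    Language.ZH: "zh-cn",  # the value the diff uses for Chinese
}


def language_to_xtts(language: Language) -> Optional[str]:
    result = BASE_LANGUAGES.get(language)
    if not result:
        base_code = str(language.value).split("-")[0].lower()
        if base_code == "zh":
            # Every Chinese variant, including zh-TW, is sent as "zh-cn"
            result = "zh-cn"
        else:
            result = base_code if base_code in BASE_LANGUAGES.values() else None
    return result


print(language_to_xtts(Language.ZH_TW))  # zh-cn
```

As the sketch shows, this branch also sends Traditional Chinese (`zh-TW`) as `zh-cn`.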

@@ -5,7 +5,6 @@
#
import sys
from enum import Enum
if sys.version_info < (3, 11):
@@ -20,46 +19,411 @@ else:
class Language(StrEnum):
BG = "bg" # Bulgarian
CA = "ca" # Catalan
ZH = "zh" # Chinese simplified
ZH_TW = "zh-TW" # Chinese traditional
CS = "cs" # Czech
DA = "da" # Danish
NL = "nl" # Dutch
EN = "en" # English
EN_US = "en-US" # English (USA)
EN_AU = "en-AU" # English (Australia)
EN_GB = "en-GB" # English (Great Britain)
EN_NZ = "en-NZ" # English (New Zealand)
EN_IN = "en-IN" # English (India)
ET = "et" # Estonian
FI = "fi" # Finnish
NL_BE = "nl-BE" # Flemish
FR = "fr" # French
FR_CA = "fr-CA" # French (Canada)
DE = "de" # German
DE_CH = "de-CH" # German (Switzerland)
EL = "el" # Greek
HI = "hi" # Hindi
HU = "hu" # Hungarian
ID = "id" # Indonesian
IT = "it" # Italian
JA = "ja" # Japanese
KO = "ko" # Korean
LV = "lv" # Latvian
LT = "lt" # Lithuanian
MS = "ms" # Malay
NO = "no" # Norwegian
PL = "pl" # Polish
PT = "pt" # Portuguese
PT_BR = "pt-BR" # Portuguese (Brazil)
RO = "ro" # Romanian
RU = "ru" # Russian
SK = "sk" # Slovak
ES = "es" # Spanish
SV = "sv" # Swedish
TH = "th" # Thai
TR = "tr" # Turkish
UK = "uk" # Ukrainian
VI = "vi" # Vietnamese
# Afrikaans
AF = "af"
AF_ZA = "af-ZA"
# Amharic
AM = "am"
AM_ET = "am-ET"
# Arabic
AR = "ar"
AR_AE = "ar-AE"
AR_BH = "ar-BH"
AR_DZ = "ar-DZ"
AR_EG = "ar-EG"
AR_IQ = "ar-IQ"
AR_JO = "ar-JO"
AR_KW = "ar-KW"
AR_LB = "ar-LB"
AR_LY = "ar-LY"
AR_MA = "ar-MA"
AR_OM = "ar-OM"
AR_QA = "ar-QA"
AR_SA = "ar-SA"
AR_SY = "ar-SY"
AR_TN = "ar-TN"
AR_YE = "ar-YE"
# Assamese
AS = "as"
AS_IN = "as-IN"
# Azerbaijani
AZ = "az"
AZ_AZ = "az-AZ"
# Bulgarian
BG = "bg"
BG_BG = "bg-BG"
# Bengali
BN = "bn"
BN_BD = "bn-BD"
BN_IN = "bn-IN"
# Bosnian
BS = "bs"
BS_BA = "bs-BA"
# Catalan
CA = "ca"
CA_ES = "ca-ES"
# Czech
CS = "cs"
CS_CZ = "cs-CZ"
# Welsh
CY = "cy"
CY_GB = "cy-GB"
# Danish
DA = "da"
DA_DK = "da-DK"
# German
DE = "de"
DE_AT = "de-AT"
DE_CH = "de-CH"
DE_DE = "de-DE"
# Greek
EL = "el"
EL_GR = "el-GR"
# English
EN = "en"
EN_AU = "en-AU"
EN_CA = "en-CA"
EN_GB = "en-GB"
EN_HK = "en-HK"
EN_IE = "en-IE"
EN_IN = "en-IN"
EN_KE = "en-KE"
EN_NG = "en-NG"
EN_NZ = "en-NZ"
EN_PH = "en-PH"
EN_SG = "en-SG"
EN_TZ = "en-TZ"
EN_US = "en-US"
EN_ZA = "en-ZA"
# Spanish
ES = "es"
ES_AR = "es-AR"
ES_BO = "es-BO"
ES_CL = "es-CL"
ES_CO = "es-CO"
ES_CR = "es-CR"
ES_CU = "es-CU"
ES_DO = "es-DO"
ES_EC = "es-EC"
ES_ES = "es-ES"
ES_GQ = "es-GQ"
ES_GT = "es-GT"
ES_HN = "es-HN"
ES_MX = "es-MX"
ES_NI = "es-NI"
ES_PA = "es-PA"
ES_PE = "es-PE"
ES_PR = "es-PR"
ES_PY = "es-PY"
ES_SV = "es-SV"
ES_US = "es-US"
ES_UY = "es-UY"
ES_VE = "es-VE"
# Estonian
ET = "et"
ET_EE = "et-EE"
# Basque
EU = "eu"
EU_ES = "eu-ES"
# Persian
FA = "fa"
FA_IR = "fa-IR"
# Finnish
FI = "fi"
FI_FI = "fi-FI"
# Filipino
FIL = "fil"
FIL_PH = "fil-PH"
# French
FR = "fr"
FR_BE = "fr-BE"
FR_CA = "fr-CA"
FR_CH = "fr-CH"
FR_FR = "fr-FR"
# Irish
GA = "ga"
GA_IE = "ga-IE"
# Galician
GL = "gl"
GL_ES = "gl-ES"
# Gujarati
GU = "gu"
GU_IN = "gu-IN"
# Hebrew
HE = "he"
HE_IL = "he-IL"
# Hindi
HI = "hi"
HI_IN = "hi-IN"
# Croatian
HR = "hr"
HR_HR = "hr-HR"
# Hungarian
HU = "hu"
HU_HU = "hu-HU"
# Armenian
HY = "hy"
HY_AM = "hy-AM"
# Indonesian
ID = "id"
ID_ID = "id-ID"
# Icelandic
IS = "is"
IS_IS = "is-IS"
# Italian
IT = "it"
IT_IT = "it-IT"
# Inuktitut
IU_CANS = "iu-Cans"
IU_CANS_CA = "iu-Cans-CA"
IU_LATN = "iu-Latn"
IU_LATN_CA = "iu-Latn-CA"
# Japanese
JA = "ja"
JA_JP = "ja-JP"
# Javanese
JV = "jv"
JV_ID = "jv-ID"
# Georgian
KA = "ka"
KA_GE = "ka-GE"
# Kazakh
KK = "kk"
KK_KZ = "kk-KZ"
# Khmer
KM = "km"
KM_KH = "km-KH"
# Kannada
KN = "kn"
KN_IN = "kn-IN"
# Korean
KO = "ko"
KO_KR = "ko-KR"
# Lao
LO = "lo"
LO_LA = "lo-LA"
# Lithuanian
LT = "lt"
LT_LT = "lt-LT"
# Latvian
LV = "lv"
LV_LV = "lv-LV"
# Macedonian
MK = "mk"
MK_MK = "mk-MK"
# Malayalam
ML = "ml"
ML_IN = "ml-IN"
# Mongolian
MN = "mn"
MN_MN = "mn-MN"
# Marathi
MR = "mr"
MR_IN = "mr-IN"
# Malay
MS = "ms"
MS_MY = "ms-MY"
# Maltese
MT = "mt"
MT_MT = "mt-MT"
# Burmese
MY = "my"
MY_MM = "my-MM"
# Norwegian
NB = "nb"
NB_NO = "nb-NO"
NO = "no"
# Nepali
NE = "ne"
NE_NP = "ne-NP"
# Dutch
NL = "nl"
NL_BE = "nl-BE"
NL_NL = "nl-NL"
# Odia
OR = "or"
OR_IN = "or-IN"
# Punjabi
PA = "pa"
PA_IN = "pa-IN"
# Polish
PL = "pl"
PL_PL = "pl-PL"
# Pashto
PS = "ps"
PS_AF = "ps-AF"
# Portuguese
PT = "pt"
PT_BR = "pt-BR"
PT_PT = "pt-PT"
# Romanian
RO = "ro"
RO_RO = "ro-RO"
# Russian
RU = "ru"
RU_RU = "ru-RU"
# Sinhala
SI = "si"
SI_LK = "si-LK"
# Slovak
SK = "sk"
SK_SK = "sk-SK"
# Slovenian
SL = "sl"
SL_SI = "sl-SI"
# Somali
SO = "so"
SO_SO = "so-SO"
# Albanian
SQ = "sq"
SQ_AL = "sq-AL"
# Serbian
SR = "sr"
SR_RS = "sr-RS"
SR_LATN = "sr-Latn"
SR_LATN_RS = "sr-Latn-RS"
# Sundanese
SU = "su"
SU_ID = "su-ID"
# Swedish
SV = "sv"
SV_SE = "sv-SE"
# Swahili
SW = "sw"
SW_KE = "sw-KE"
SW_TZ = "sw-TZ"
# Tagalog
TL = "tl"
# Tamil
TA = "ta"
TA_IN = "ta-IN"
TA_LK = "ta-LK"
TA_MY = "ta-MY"
TA_SG = "ta-SG"
# Telugu
TE = "te"
TE_IN = "te-IN"
# Thai
TH = "th"
TH_TH = "th-TH"
# Turkish
TR = "tr"
TR_TR = "tr-TR"
# Ukrainian
UK = "uk"
UK_UA = "uk-UA"
# Urdu
UR = "ur"
UR_IN = "ur-IN"
UR_PK = "ur-PK"
# Uzbek
UZ = "uz"
UZ_UZ = "uz-UZ"
# Vietnamese
VI = "vi"
VI_VN = "vi-VN"
# Wu Chinese
WUU = "wuu"
WUU_CN = "wuu-CN"
# Yue Chinese
YUE = "yue"
YUE_CN = "yue-CN"
# Chinese
ZH = "zh"
ZH_CN = "zh-CN"
ZH_CN_GUANGXI = "zh-CN-guangxi"
ZH_CN_HENAN = "zh-CN-henan"
ZH_CN_LIAONING = "zh-CN-liaoning"
ZH_CN_SHAANXI = "zh-CN-shaanxi"
ZH_CN_SHANDONG = "zh-CN-shandong"
ZH_CN_SICHUAN = "zh-CN-sichuan"
ZH_HK = "zh-HK"
ZH_TW = "zh-TW"
# Xhosa
XH = "xh"
# Zulu
ZU = "zu"
ZU_ZA = "zu-ZA"
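
`Language` subclasses `StrEnum`, and the module checks `sys.version_info < (3, 11)` because `enum.StrEnum` was added in Python 3.11. The body of that branch is outside this hunk, so the backport below is an assumption rather than the file's code. It shows the properties the fallback lookups rely on: members are `str` instances, and `.value` is the language tag.

```python
import sys
from enum import Enum

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    class StrEnum(str, Enum):
        """Assumed minimal backport: str() returns the member's value."""

        def __str__(self) -> str:
            return str(self.value)


class Language(StrEnum):
    # Small subset of the members added in the diff
    EN = "en"
    EN_US = "en-US"
    SR_LATN_RS = "sr-Latn-RS"


print(Language.EN_US == "en-US")    # True
print(str(Language.EN_US))          # en-US
print(Language("sr-Latn-RS").name)  # SR_LATN_RS
```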

@@ -71,6 +71,7 @@ class BaseInputTransport(FrameProcessor):
return self._params.vad_analyzer
async def push_audio_frame(self, frame: InputAudioRawFrame):
logger.info(f"Pushing audio qsize: {self._audio_in_queue.qsize()}")
if self._params.audio_in_enabled or self._params.vad_enabled:
await self._audio_in_queue.put(frame)
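
`push_audio_frame` only enqueues a frame when audio input or VAD is enabled. The added line logs the queue depth on every call. Below is a sketch of that gate using an `asyncio.Queue`. `Params` and `InputGate` are hypothetical stand-ins for the transport's params object and `BaseInputTransport`, and `print` replaces loguru's `logger.info`:

```python
import asyncio
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Params:
    audio_in_enabled: bool = False
    vad_enabled: bool = False


class InputGate:
    def __init__(self, params: Params) -> None:
        self._params = params
        self._audio_in_queue: asyncio.Queue = asyncio.Queue()

    async def push_audio_frame(self, frame: bytes) -> None:
        print(f"Pushing audio qsize: {self._audio_in_queue.qsize()}")
        # Same condition as the diff: enqueue only if audio input or VAD is on
        if self._params.audio_in_enabled or self._params.vad_enabled:
            await self._audio_in_queue.put(frame)


async def demo() -> Tuple[int, int]:
    off = InputGate(Params())
    on = InputGate(Params(vad_enabled=True))
    for gate in (off, on):
        await gate.push_audio_frame(b"\x00\x00")
        await gate.push_audio_frame(b"\x00\x00")
    return off._audio_in_queue.qsize(), on._audio_in_queue.qsize()


print(asyncio.run(demo()))  # (0, 2)
```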
@@ -167,6 +168,7 @@ class BaseInputTransport(FrameProcessor):
return vad_state
async def _audio_task_handler(self):
logger.info("_audio_task_handler started")
vad_state: VADState = VADState.QUIET
while True:
try:

@@ -70,16 +70,6 @@ class FastAPIWebsocketInputTransport(BaseInputTransport):
await self._callbacks.on_client_connected(self._websocket)
self._receive_task = self.get_event_loop().create_task(self._receive_messages())
async def stop(self, frame: EndFrame):
await super().stop(frame)
if self._websocket.client_state != WebSocketState.DISCONNECTED:
await self._websocket.close()
async def cancel(self, frame: CancelFrame):
await super().cancel(frame)
if self._websocket.client_state != WebSocketState.DISCONNECTED:
await self._websocket.close()
async def _receive_messages(self):
async for message in self._websocket.iter_text():
frame = self._params.serializer.deserialize(message)
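
The removed `stop` and `cancel` overrides closed the websocket only when `client_state` was not already `WebSocketState.DISCONNECTED`. That guard keeps a second shutdown path from closing an already-closed socket. Below is a sketch of the guard. It uses a hypothetical `FakeWebSocket` and a local `State` enum in place of Starlette's `WebSocket` and `WebSocketState`. The fake raises on a second close only to make a double close visible; this sketch does not establish what the real socket does in that case.

```python
import asyncio
from enum import Enum


class State(Enum):
    # Local stand-in for starlette.websockets.WebSocketState
    CONNECTED = 1
    DISCONNECTED = 2


class FakeWebSocket:
    def __init__(self) -> None:
        self.client_state = State.CONNECTED
        self.close_calls = 0

    async def close(self) -> None:
        if self.client_state == State.DISCONNECTED:
            raise RuntimeError("close() called on a closed socket")
        self.close_calls += 1
        self.client_state = State.DISCONNECTED


async def close_if_connected(ws: FakeWebSocket) -> None:
    # The guard from the removed stop()/cancel() methods
    if ws.client_state != State.DISCONNECTED:
        await ws.close()


async def shutdown() -> int:
    ws = FakeWebSocket()
    await close_if_connected(ws)  # e.g. from stop(EndFrame)
    await close_if_connected(ws)  # e.g. from cancel(CancelFrame): no-op
    return ws.close_calls


print(asyncio.run(shutdown()))  # 1
```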

@@ -106,6 +106,7 @@ class WebsocketServerInputTransport(BaseInputTransport):
continue
if isinstance(frame, AudioRawFrame):
logger.info("websocket_server")
await self.push_audio_frame(
InputAudioRawFrame(
audio=frame.audio,