attempt at 2 pipelines

Fixed logic
Starting to add logic for native audio input for flash lite
2025-02-24 21:25:13 +00:00 · 2025-02-24 10:44:07 -08:00 · 2025-02-24 10:28:28 -08:00 · 2025-02-22 14:52:53 -08:00 · 2025-02-22 14:49:33 -08:00 · 2025-02-22 14:38:14 -08:00
3 changed files with 268 additions and 78 deletions
--- a/examples/phone-chatbot/README.md
+++ b/examples/phone-chatbot/README.md
@@ -106,12 +106,13 @@ curl -X POST "http://localhost:7860/daily_start_bot" \
     -d '{"dialoutNumber": "+18057145330", "detectVoicemail": true}'
 ```
-### New! Using Gemini with Daily
+### New! Using Gemini 2.0 Flash Lite with Daily
-We have introduced a new example file that uses Gemini. You can find the code within bot_daily_gemini.py.
+We have introduced support for Google's Gemini 2.0 Flash Lite model in this example. This lightweight model offers faster response times and reduced costs while maintaining good conversational capabilities.
-If you want to spin up a Gemini-based bot for this demo, instead of an OpenAI-based bot, call the same properties above but on the `daily_gemini_start_bot` endpoint instead.
+
 **Quick Start**
 To use the Gemini-based bot instead of the OpenAI-based bot, send the same request body to the `/daily_gemini_start_bot` endpoint. For example:
 ```shell
 curl -X POST "http://localhost:7860/daily_gemini_start_bot" \
@@ -119,7 +120,26 @@ curl -X POST "http://localhost:7860/daily_gemini_start_bot" \
     -d '{"detectVoicemail": true}'
 ```
-Any request body properties supported by `/daily_start_bot` (such as "detectVoicemail", "dialoutnumber", etc) can also be passed to `/daily_gemini_start_bot`. The only difference is that calling the Gemini endpoint will start a Gemini bot session.
+All request body parameters supported by /daily_start_bot (such as detectVoicemail, dialoutNumber, etc.) are also compatible with /daily_gemini_start_bot.
 This example uses context switching to steer the bot. Because Flash Lite is a smaller model, it struggled to call functions consistently when given one long prompt. Breaking the prompt
 down into smaller pieces improved the accuracy of the bot.
 **Implementation Details**
 The implementation is available in bot_daily_gemini.py and features:
 - Staged prompting approach: Breaking down complex tasks into smaller, more focused prompts to improve the lightweight model's performance
 - Dynamic context switching: The bot can change its behavior in real-time based on what it detects (voicemail vs. human caller)
 - Function-based architecture: Uses function calling to trigger context switches and call termination
 **Optimizations for Lightweight Models**
 Working with Gemini 2.0 Flash Lite required some specific optimizations:
 - Simplified prompts: Each prompt focuses on a single task with clear instructions
 - Function-driven state changes: The model calls specific functions to switch between different conversation modes
 - Reduced context requirements: Each stage maintains only the context needed for its specific purpose
 This approach significantly improves the consistency of function calling in this lightweight model, which was challenging with longer, more complex prompts.
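The staged approach described above can be sketched without any Pipecat dependency. This is a minimal illustration, not the code in bot_daily_gemini.py: the `StagedConversation` class, the `STAGE_PROMPTS` and `TRANSITIONS` tables, and the shortened prompt texts are all hypothetical. Only the three function names come from the example.

```python
from dataclasses import dataclass, field

# Hypothetical, shortened prompts; the real prompts live in bot_daily_gemini.py.
STAGE_PROMPTS = {
    "detect": "Decide whether you hear a voicemail system or a human. Say nothing.",
    "voicemail": "Leave the voicemail message, then call terminate_call.",
    "human": "Chat briefly with the caller, then call terminate_call.",
}

# Which function call moves the conversation to which stage.
TRANSITIONS = {
    "switch_to_voicemail_response": "voicemail",
    "switch_to_human_conversation": "human",
}


@dataclass
class StagedConversation:
    stage: str = "detect"
    messages: list = field(default_factory=list)

    def __post_init__(self):
        self._reset_context()

    def _reset_context(self):
        # Each stage keeps only its own short system prompt.
        self.messages = [{"role": "system", "content": STAGE_PROMPTS[self.stage]}]

    def handle_function_call(self, name: str) -> bool:
        """Return True if the call should continue, False if it should end."""
        if name == "terminate_call":
            return False
        if name not in TRANSITIONS:
            raise ValueError(f"unknown function: {name}")
        self.stage = TRANSITIONS[name]
        self._reset_context()
        return True
```

Keeping one short prompt per stage means the model only ever has to choose among the functions relevant to that stage, which is the property the README credits for more consistent function calling.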
 ### More information
--- a/examples/phone-chatbot/bot_daily.py
+++ b/examples/phone-chatbot/bot_daily.py
@@ -49,7 +49,11 @@ async def main(
    # If you are handling this via Twilio, Telnyx, set this to None
    # and handle call-forwarding when on_dialin_ready fires.
-    dialin_settings = DailyDialinSettings(call_id=callId, call_domain=callDomain)
+    # We don't want to specify dialin settings if we're not dialing in
    dialin_settings = None
    if callId and callDomain:
        dialin_settings = DailyDialinSettings(call_id=callId, call_domain=callDomain)
    transport = DailyTransport(
        room_url,
        token,
@@ -96,6 +100,13 @@ async def main(
            - **"Please leave a message after the beep."**
            - **"No one is available to take your call."**
            - **"Record your message after the tone."**
            - **"Please leave a message after the beep"**
            - **"You have reached voicemail for..."**
            - **"You have reached [phone number]"**
            - **"[phone number] is unavailable"**
            - **"The person you are trying to reach..."**
            - **"The number you have dialed..."**
            - **"Your call has been forwarded to an automated voice messaging system"**
            - **Any phrase that suggests an answering machine or voicemail.**
            - **ASSUME IT IS A VOICEMAIL. DO NOT WAIT FOR MORE CONFIRMATION.**
            - **IF THE CALL SAYS "PLEASE LEAVE A MESSAGE AFTER THE BEEP", WAIT FOR THE BEEP BEFORE LEAVING A MESSAGE.**
--- a/examples/phone-chatbot/bot_daily_gemini.py
+++ b/examples/phone-chatbot/bot_daily_gemini.py
@@ -7,17 +7,30 @@ import argparse
 import asyncio
 import os
 import sys
 from dataclasses import dataclass
 from typing import Optional
 import google.ai.generativelanguage as glm
 from dotenv import load_dotenv
 from loguru import logger
 from pipecat.audio.vad.silero import SileroVADAnalyzer
-from pipecat.frames.frames import EndTaskFrame
+from pipecat.frames.frames import (
    BotStoppedSpeakingFrame,
    EndTaskFrame,
    Frame,
    InputAudioRawFrame,
    StopTaskFrame,
    SystemFrame,
    TranscriptionFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
 )
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineParams, PipelineTask
-from pipecat.processors.frame_processor import FrameDirection
+from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContextFrame
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
 from pipecat.services.ai_services import LLMService
 from pipecat.services.elevenlabs import ElevenLabsTTSService
 from pipecat.services.google import GoogleLLMContext, GoogleLLMService
@@ -32,11 +45,124 @@ logger.add(sys.stderr, level="DEBUG")
 daily_api_key = os.getenv("DAILY_API_KEY", "")
 daily_api_url = os.getenv("DAILY_API_URL", "https://api.daily.co/v1")
 system_message = None
 class UserAudioCollector(FrameProcessor):
    """This FrameProcessor collects audio frames in a buffer, then adds them to the
    LLM context when the user stops speaking.
    """
    def __init__(self, context, user_context_aggregator):
        super().__init__()
        self._context = context
        self._user_context_aggregator = user_context_aggregator
        self._audio_frames = []
        self._start_secs = 0.2  # this should match VAD start_secs (hardcoding for now)
        self._user_speaking = False
    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            # We could gracefully handle both audio input and text/transcription input ...
            # but let's leave that as an exercise to the reader. :-)
            return
        if isinstance(frame, UserStartedSpeakingFrame):
            self._user_speaking = True
        elif isinstance(frame, UserStoppedSpeakingFrame):
            self._user_speaking = False
            self._context.add_audio_frames_message(audio_frames=self._audio_frames)
            await self._user_context_aggregator.push_frame(
                self._user_context_aggregator.get_context_frame()
            )
        elif isinstance(frame, InputAudioRawFrame):
            if self._user_speaking:
                self._audio_frames.append(frame)
            else:
                # Append the audio frame to our buffer. Treat the buffer as a ring buffer, dropping the oldest
                # frames as necessary. Assume all audio frames have the same duration.
                self._audio_frames.append(frame)
                frame_duration = len(frame.audio) / 16 * frame.num_channels / frame.sample_rate
                buffer_duration = frame_duration * len(self._audio_frames)
                while buffer_duration > self._start_secs:
                    self._audio_frames.pop(0)
                    buffer_duration -= frame_duration
        await self.push_frame(frame, direction)
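Review note: the pre-roll arithmetic in `UserAudioCollector` is worth checking. The sketch below assumes frames carry 16-bit PCM (2 bytes per sample), which the diff does not confirm. Under that assumption a frame's duration is `len(audio) / (2 * num_channels) / sample_rate`, so the `/ 16 * frame.num_channels` expression above would underestimate it by a factor of 8 for mono audio and retain more pre-roll than `_start_secs` intends. `PreRollBuffer` and `frame_duration_secs` are hypothetical names used only for illustration.

```python
from collections import deque

BYTES_PER_SAMPLE = 2  # assumption: 16-bit PCM audio


def frame_duration_secs(audio: bytes, sample_rate: int, num_channels: int) -> float:
    """Duration of one raw audio frame in seconds."""
    samples_per_channel = len(audio) / (BYTES_PER_SAMPLE * num_channels)
    return samples_per_channel / sample_rate


class PreRollBuffer:
    """Keeps roughly the most recent `max_secs` of audio, dropping the oldest frames."""

    def __init__(self, max_secs: float):
        self._max_secs = max_secs
        self._frames = deque()  # (audio, duration) pairs, oldest first
        self._duration = 0.0

    def append(self, audio: bytes, sample_rate: int, num_channels: int) -> None:
        duration = frame_duration_secs(audio, sample_rate, num_channels)
        self._frames.append((audio, duration))
        self._duration += duration
        # Drop the oldest frames until the buffer fits within the limit again.
        while self._frames and self._duration > self._max_secs:
            _, dropped = self._frames.popleft()
            self._duration -= dropped

    def __len__(self) -> int:
        return len(self._frames)
```

Using a `deque` makes dropping the oldest frame constant-time, and storing each frame's own duration avoids the "all frames have the same duration" assumption noted in the comment above.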
 class ContextSwitcher:
    def __init__(self, llm, context_aggregator):
        self._llm = llm
        self._context_aggregator = context_aggregator
    async def switch_context(self, system_instruction):
        """Switch the context to a new system instruction based on what the bot hears."""
        # Create messages with updated system instruction
        messages = [
            {
                "role": "system",
                "content": system_instruction,
            }
        ]
        # Update context with new messages
        self._context_aggregator.set_messages(messages)
        # Get the context frame with the updated messages
        context_frame = self._context_aggregator.get_context_frame()
        # Trigger LLM response by pushing a context frame
        await self._llm.push_frame(context_frame)
 class FunctionHandlers:
    def __init__(self, context_switcher):
        self.context_switcher = context_switcher
    async def voicemail_response(
        self, function_name, tool_call_id, args, llm, context, result_callback
    ):
        """Function the bot can call to leave a voicemail message."""
        print(f"!!! Got a voicemail response, llm is: {llm}")
        system_message = """You are Chatbot leaving a voicemail message. Say EXACTLY this message and nothing else:
                    "Hello, this is a message for Pipecat example user. This is Chatbot. Please call back on 123-456-7891. Thank you."
                    After saying this message, call the terminate_call function."""
        print("!!! about to push stop task frame from voicemail")
        await llm.queue_frame(StopTaskFrame(), FrameDirection.UPSTREAM)
        print("!!! pushed stop task frame from voicemail")
        await result_callback("Goodbye")
    async def human_conversation(
        self, function_name, tool_call_id, args, llm, context, result_callback
    ):
        """Function the bot can call when it detects it's talking to a human."""
        print(f"!!! Got a human response, llm is: {llm}")
        system_message = """You are Chatbot talking to a human. Be friendly and helpful.
                    Start with: "Hello! I'm a friendly chatbot. How can I help you today?"
                    Keep your responses brief and to the point. Listen to what the person says.
                    When the person indicates they're done with the conversation by saying something like:
                    - "Goodbye"
                    - "That's all"
                    - "I'm done"
                    - "Thank you, that's all I needed"
                    THEN say: "Thank you for chatting. Goodbye!" and call the terminate_call function."""
        print("!!! about to push stop task frame from human")
        await llm.queue_frame(StopTaskFrame(), FrameDirection.UPSTREAM)
        print("!!! pushed stop task frame from human")
        await result_callback("Goodbye")
 async def terminate_call(
    function_name, tool_call_id, args, llm: LLMService, context, result_callback
 ):
-    """Function the bot can call to terminate the call upon completion of a voicemail message."""
+    """Function the bot can call to terminate the call upon completion of the call."""
    await llm.queue_frame(EndTaskFrame(), FrameDirection.UPSTREAM)
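Review note: both handlers above assign `system_message` inside a method, which in Python creates a new local name rather than updating the module-level `system_message = None`. If the intent is for the second pipeline to pick up the chosen prompt, the value has to be shared explicitly. The sketch below shows the difference; `PromptState` and both handler functions are hypothetical, and a `global` statement would be another way to get the same effect.

```python
from dataclasses import dataclass
from typing import Optional

system_message: Optional[str] = None  # module-level default, as in the diff


def handler_with_local_assignment() -> None:
    # This binds a new local name; the module-level variable is untouched.
    system_message = "You are Chatbot leaving a voicemail message."


@dataclass
class PromptState:
    """Hypothetical holder shared between the handlers and main()."""

    system_message: Optional[str] = None


def handler_with_shared_state(state: PromptState) -> None:
    # Mutating a shared object is visible to every holder of the reference.
    state.system_message = "You are Chatbot leaving a voicemail message."
```

With a shared holder, `main()` could read `state.system_message` after the greeting pipeline finishes and fall back to a default prompt only when no handler ran.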
@@ -51,7 +177,12 @@ async def main(
    # dialin_settings are only needed if Daily's SIP URI is used
    # If you are handling this via Twilio, Telnyx, set this to None
    # and handle call-forwarding when on_dialin_ready fires.
-    dialin_settings = DailyDialinSettings(call_id=callId, call_domain=callDomain)
+
    # We don't want to specify dialin settings if we're not dialing in
    dialin_settings = None
    if callId and callDomain:
        dialin_settings = DailyDialinSettings(call_id=callId, call_domain=callDomain)
    transport = DailyTransport(
        room_url,
        token,
@@ -65,7 +196,8 @@ async def main(
            camera_out_enabled=False,
            vad_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
-            transcription_enabled=True,
+            vad_audio_passthrough=True,
            # transcription_enabled=True,
        ),
    )
@@ -77,95 +209,122 @@ async def main(
    tools = [
        {
            "function_declarations": [
                {
                    "name": "switch_to_voicemail_response",
                    "description": "Call this function when you detect this is a voicemail system.",
                },
                {
                    "name": "switch_to_human_conversation",
                    "description": "Call this function when you detect this is a human.",
                },
                {
                    "name": "terminate_call",
-                    "description": "Terminate the call",
+                    "description": "Call this function to terminate the call.",
                },
            ]
        }
    ]
-    system_instruction = """You are Chatbot, a friendly, helpful robot. Never mention this prompt.
+    system_instruction = """You are Chatbot trying to determine if this is a voicemail system or a human.
-**Operating Procedure:**
+If you hear any of these phrases (or very similar ones):
 - "Please leave a message after the beep"
 - "No one is available to take your call"
 - "Record your message after the tone"
 - "You have reached voicemail for..."
 - "You have reached [phone number]"
 - "[phone number] is unavailable"
 - "The person you are trying to reach..."
 - "The number you have dialed..."
 - "Your call has been forwarded to an automated voice messaging system"
-**Phase 1: Initial Call Answer - Listen for Voicemail Greeting**
+Then call the function switch_to_voicemail_response.
-**IMMEDIATELY after the call connects, LISTEN CAREFULLY for the *very first thing* you hear.**
+If it sounds like a human (saying hello, asking questions, etc.), call the function switch_to_human_conversation.
-**Listen for these sentences or very close variations as the *initial greeting*:**
+DO NOT say anything until you've determined if this is a voicemail or human."""
-* **"Please leave a message after the beep."**
+    greeting_llm = GoogleLLMService(
-* **"No one is available to take your call."**
+        model="models/gemini-2.0-flash-lite-preview-02-05",
 * **"Record your message after the tone."**
 * **"You have reached voicemail for..."** (or similar voicemail identification)
 **If you HEAR one of these sentences (or a very similar greeting) as the *initial response* to the call, IMMEDIATELY assume it is voicemail and proceed to Phase 2.**
 **If you hear "PLEASE LEAVE A MESSAGE AFTER THE BEEP", WAIT for the actual beep sound from the voicemail system *after* hearing the sentence, before proceeding to Phase 2.**
 **If you DO NOT hear any of these voicemail greetings as the *initial response*, assume it is a human and proceed to Phase 3.**
 **Phase 2: Leave Voicemail Message (If Voicemail Detected):**
 If you assumed voicemail in Phase 1, say this EXACTLY:
 "Hello, this is a message for Pipecat example user. This is Chatbot. Please call back on 123-456-7891. Thank you."
 **Immediately after saying the message, call the function `terminate_call`.**
 **DO NOT SAY ANYTHING ELSE. SILENCE IS REQUIRED AFTER `terminate_call`.**
 **Phase 3: Human Interaction (If No Voicemail Greeting Detected in Phase 1):**
 If you did not detect a voicemail greeting in Phase 1 and a human answers, say:
 "Oh, hello! I'm a friendly chatbot. Is there anything I can help you with?"
 Keep your responses **short and helpful.**
 If the human is finished, say:
 "Okay, thank you! Have a great day!"
 **Then, immediately call the function `terminate_call`.**
 **VERY IMPORTANT RULES - DO NOT DO THESE THINGS:**
 * **DO NOT SAY "Please leave a message after the beep."**
 * **DO NOT SAY "No one is available to take your call."**
 * **DO NOT SAY "Record your message after the tone."**
 * **DO NOT SAY ANY voicemail greeting yourself.**
 * **Only check for voicemail greetings in Phase 1, *immediately after the call connects*.**
 * **After voicemail or human interaction, ALWAYS call `terminate_call` immediately.**
 * **Do not speak after calling `terminate_call`.**
 * Your speech will be audio, so use simple language without special characters.
 """
    llm = GoogleLLMService(
        model="models/gemini-2.0-flash-exp",
        api_key=os.getenv("GOOGLE_API_KEY"),
        system_instruction=system_instruction,
        tools=tools,
    )
    llm.register_function("terminate_call", terminate_call)
-    context = GoogleLLMContext()
+    greeting_context = GoogleLLMContext()
    greeting_context_aggregator = greeting_llm.create_context_aggregator(greeting_context)
    greeting_audio_collector = UserAudioCollector(
        greeting_context, greeting_context_aggregator.user()
    )
-    context_aggregator = llm.create_context_aggregator(context)
+    context_switcher = ContextSwitcher(greeting_llm, greeting_context_aggregator.user())
    handlers = FunctionHandlers(context_switcher)
-    pipeline = Pipeline(
+    greeting_llm.register_function("switch_to_voicemail_response", handlers.voicemail_response)
    greeting_llm.register_function("switch_to_human_conversation", handlers.human_conversation)
    greeting_llm.register_function("terminate_call", terminate_call)
    greeting_pipeline = Pipeline(
        [
            transport.input(),  # Transport user input
-            context_aggregator.user(),  # User responses
+            greeting_audio_collector,  # Collect audio frames
-            llm,  # LLM
+            greeting_context_aggregator.user(),  # User responses
            greeting_llm,  # LLM
            tts,  # TTS
            transport.output(),  # Transport bot output
-            context_aggregator.assistant(),  # Assistant spoken responses
+            greeting_context_aggregator.assistant(),  # Assistant spoken responses
        ]
    )
    greeting_pipeline_task = PipelineTask(
        greeting_pipeline,
        PipelineParams(allow_interruptions=True),
    )
    runner = PipelineRunner()
    print("!!! starting greeting")
    await runner.run(greeting_pipeline_task)
    print("!!! Done with greeting")
    # Create conversation pipeline with new system message
    conversation_llm = GoogleLLMService(
        model="models/gemini-2.0-flash-lite-preview-02-05",
        api_key=os.getenv("GOOGLE_API_KEY"),
        system_instruction=system_message if system_message else "You are a helpful chatbot.",
        tools=[
            {
                "function_declarations": [
                    {
                        "name": "terminate_call",
                        "description": "Call this function to terminate the call.",
                    }
                ]
            }
        ],
    )
    conversation_llm.register_function("terminate_call", terminate_call)
    conversation_context = GoogleLLMContext()
    conversation_context_aggregator = conversation_llm.create_context_aggregator(
        conversation_context
    )
    conversation_audio_collector = UserAudioCollector(
        conversation_context, conversation_context_aggregator.user()
    )
    conversation_pipeline = Pipeline(
        [
            transport.input(),  # Transport user input
            conversation_audio_collector,  # Collect audio frames
            conversation_context_aggregator.user(),  # User responses
            conversation_llm,  # LLM
            tts,  # TTS
            transport.output(),  # Transport bot output
            conversation_context_aggregator.assistant(),  # Assistant spoken responses
        ]
    )
-    task = PipelineTask(
+    conversation_task = PipelineTask(
-        pipeline,
+        conversation_pipeline,
        PipelineParams(allow_interruptions=True),
    )
@@ -214,11 +373,11 @@ If the human is finished, say:
    @transport.event_handler("on_participant_left")
    async def on_participant_left(transport, participant, reason):
-        await task.cancel()
+        await conversation_task.cancel()
-    runner = PipelineRunner()
+    print("!!! Starting conversation")
-
+    await runner.run(conversation_task)
-    await runner.run(task)
+    print("!!! Done with conversation")
 if __name__ == "__main__":
Author	SHA1	Message	Date
Chad Bailey	1472a3abb8	attempt at 2 pipelines	2025-02-24 21:25:13 +00:00
Dominic	3745078bf1	Fixed logic	2025-02-24 10:44:07 -08:00
Dominic	1a2c98f70b	Starting to add logic for native audio input for flash lite	2025-02-24 10:28:28 -08:00
Dominic	e988ce6838	Forgot to use the same logic for the openai bot	2025-02-22 14:52:53 -08:00
Dominic	546c97e75b	Simplified logic for dialin	2025-02-22 14:49:33 -08:00
Dominic	410a6b9238	moved terminate call to handlers class	2025-02-22 14:38:14 -08:00
Dominic	281b56e5de	Updated prompt for non gemini bot to look for more voicemail examples, plus added logic to detect if we're doing dialin or not to avoid a non-fatal dialin related error	2025-02-21 16:19:59 -08:00
Dominic	c66042afb6	Fixed import ordering	2025-02-20 14:56:45 -08:00
Dominic Stewart	61f8e54dec	Merge branch 'main' into dom/gemini-system-prompt-switching	2025-02-20 14:48:45 -08:00
Dominic	390adf193a	Added a few more things to detect in the prompt	2025-02-20 14:44:12 -08:00
Dominic	68587ca4e9	Updated the code to use the correct prompt broken down into smaller pieces	2025-02-20 14:28:02 -08:00
Dominic	b71ad2d082	I think this works	2025-02-20 09:42:19 -08:00
Dominic	781652f4f9	Improvement	2025-02-20 09:27:34 -08:00
Dominic	621813571a	This works	2025-02-19 20:24:27 -08:00
Dominic	ceefea8d63	Changed example to use gemini 2.0 flash lite	2025-02-18 19:08:22 -08:00
Dominic	1974474480	Updated the readme	2025-02-18 18:16:27 -08:00
Dominic	160d054aa5	Updated the readme	2025-02-18 18:10:34 -08:00
Dominic	4718f68717	Based on feedback, made the gemini file something that can be called separately	2025-02-18 18:04:29 -08:00
Dominic	3a781c786c	Fixed typo	2025-02-17 10:22:06 -08:00
Dominic	a066e2bcfd	Updated example to use Gemini	2025-02-17 10:17:59 -08:00