Merge pull request #487 from pipecat-ai/mb/llm-rtvi-service-option

Add control frames for LLM config params
Add control frames for LLM param updates
2024-09-21 18:26:17 -04:00 · 2024-09-21 00:02:58 -04:00 · 2024-09-21 00:01:25 -04:00 · 2024-09-20 18:24:19 -07:00 · 2024-09-20 18:21:06 -07:00 · 2024-09-20 20:35:49 -04:00
106 changed files with 3030 additions and 1441 deletions
--- a/.github/workflows/tests.yaml
+++ b/.github/workflows/tests.yaml
@@ -20,14 +20,17 @@ jobs:
    name: "Unit and Integration Tests"
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - name: Checkout repo
+        uses: actions/checkout@v4
      - name: Set up Python
        id: setup_python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install system packages
-        run: sudo apt-get install -y portaudio19-dev
+        id: install_system_packages
+        run: |
+          sudo apt-get install -y portaudio19-dev
      - name: Setup virtual environment
        run: |
          python -m venv .venv
@@ -35,8 +38,8 @@ jobs:
        run: |
          source .venv/bin/activate
          python -m pip install --upgrade pip
-          pip install -r dev-requirements.txt
+          pip install -r test-requirements.txt
      - name: Test with pytest
        run: |
          source .venv/bin/activate
-          pytest --doctest-modules --ignore-glob="*to_be_updated*" src tests
+          pytest --ignore-glob="*to_be_updated*" --ignore-glob=*pipeline_source* src tests
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,126 @@ All notable changes to **pipecat** will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [Unreleased]
+
+### Added
+
+- Added configurable LLM parameters (e.g., temperature, top_p, max_tokens, seed)
+  for OpenAI, Anthropic, and Together AI services along with corresponding
+  setter functions.
+
+- Added `sample_rate` as a constructor parameter for TTS services.
+
+- Pipecat has a pipeline-based architecture. The pipeline consists of frame
+  processors linked to each other. The elements traveling across the pipeline
+  are called frames.
+
+  To have a deterministic behavior the frames traveling through the pipeline
+  should always be ordered, except system frames which are out-of-band
+  frames. To achieve that, each frame processor should only output frames from a
+  single task.
+
+  In this version we introduce synchronous and asynchronous frame
+  processors. The synchronous processors push output frames from the same task
+  that they receive input frames, and therefore only pushing frames from one
+  task. Asynchronous frame processors can have internal tasks to perform things
+  asynchronously (e.g. receiving data from a websocket) but they also have a
+  single task where they push frames from.
+
+  By default, frame processors are synchronous. To change a frame processor to
+  asynchronous you only need to pass `sync=False` to the base class constructor.
+
+- Added pipeline clocks. A pipeline clock is used by the output transport to
+  know when a frame needs to be presented. For that, all frames now have an
+  optional `pts` field (prensentation timestamp). There's currently just one
+  clock implementation `SystemClock` and the `pts` field is currently only used
+  for `TextFrame`s (audio and image frames will be next).
+
+- A clock can now be specified to `PipelineTask` (defaults to
+  `SystemClock`). This clock will be passed to each frame processor via the
+  `StartFrame`.
+
+- Added `CartesiaHttpTTSService`. This is a synchronous frame processor
+  (i.e. given an input text frame it will wait for the whole output before
+  returning).
+
+- `DailyTransport` now supports setting the audio bitrate to improve audio
+  quality through the `DailyParams.audio_out_bitrate` parameter. The new
+  default is 96kbps.
+
+- `DailyTransport` now uses the number of audio output channels (1 or 2) to set
+  mono or stereo audio when needed.
+
+- Interruptions support has been added to `TwilioFrameSerializer` when using
+  `FastAPIWebsocketTransport`.
+
+- Added new `LmntTTSService` text-to-speech service.
+  (see https://www.lmnt.com/)
+
+- Added `TTSModelUpdateFrame`, `TTSLanguageUpdateFrame`, `STTModelUpdateFrame`,
+  and `STTLanguageUpdateFrame` frames to allow you to switch models, language
+  and voices in TTS and STT services.
+
+- Added new `transcriptions.Language` enum.
+
+### Changed
+
+- We now distinguish between input and output audio and image frames. We
+  introduce `InputAudioRawFrame`, `OutputAudioRawFrame`, `InputImageRawFrame`
+  and `OutputImageRawFrame` (and other subclasses of those). The input frames
+  usually come from an input transport and are meant to be processed inside the
+  pipeline to generate new frames. However, the input frames will not be sent
+  through an output transport. The output frames can also be processed by any
+  frame processor in the pipeline and they are allowed to be sent by the output
+  transport.
+
+- `ParallelTask` has been renamed to `SyncParallelPipeline`. A
+  `SyncParallelPipeline` is a frame processor that contains a list of different
+  pipelines to be executed concurrently. The difference between a
+  `SyncParallelPipeline` and a `ParallelPipeline` is that, given an input frame,
+  the `SyncParallelPipeline` will wait for all the internal pipelines to
+  complete. This is achieved by ensuring all the processors in each of the
+  internal pipelines are synchronous.
+
+- `StartFrame` is back a system frame so we make sure it's processed immediately
+  by all processors. `EndFrame` stays a control frame since it needs to be
+  ordered allowing the frames in the pipeline to be processed.
+
+- Updated `MoondreamService` revision to `2024-08-26`.
+
+- `CartesiaTTSService` and `ElevenLabsTTSService` now add presentation
+  timestamps to their text output. This allows the output transport to push the
+  text frames downstream at almost the same time the words are spoken. We say
+  "almost" because currently the audio frames don't have presentation timestamp
+  but they should be played at roughly the same time.
+
+- `DailyTransport.on_joined` event now returns the full session data instead of
+  just the participant.
+
+- `CartesiaTTSService` is now a subclass of `TTSService`.
+
+- `DeepgramSTTService` is now a subclass of `STTService`.
+
+- `WhisperSTTService` is now a subclass of `SegmentedSTTService`. A
+  `SegmentedSTTService` is a `STTService` where the provided audio is given in a
+  big chunk (i.e. from when the user starts speaking until the user stops
+  speaking) instead of a continous stream.
+
+### Fixed
+
+- Fixed a `BaseOutputTransport` issue that would stop audio and video rendering
+  tasks (after receiving and `EndFrame`) before the internal queue was emptied,
+  causing the pipeline to finish prematurely.
+
+- `StartFrame` should be the first frame every processor receives to avoid
+  situations where things are not initialized (because initialization happens on
+  `StartFrame`) and other frames come in resulting in undesired behavior.
+
+### Performance
+
+- `obj_id()` and `obj_count()` now use `itertools.count` avoiding the need of
+  `threading.Lock`.
+
 ## [0.0.41] - 2024-08-22

 ### Added
@@ -229,7 +349,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - It is now possible to specify a Silero VAD version when using `SileroVADAnalyzer`
  or `SileroVAD`.

- Added `AysncFrameProcessor` and `AsyncAIService`.  Some services like
+- Added `AysncFrameProcessor` and `AsyncAIService`. Some services like
  `DeepgramSTTService` need to process things asynchronously. For example, audio
  is sent to Deepgram but transcriptions are not returned immediately. In these
  cases we still require all frames (except system frames) to be pushed
@@ -246,7 +366,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 - `WhisperSTTService` model can now also be a string.

- Added missing * keyword separators in services.
+- Added missing \* keyword separators in services.

 ### Fixed

@@ -323,7 +443,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added new `TwilioFrameSerializer`. This is a new serializer that knows how to
  serialize and deserialize audio frames from Twilio.

- Added Daily transport event: `on_dialout_answered`.  See
+- Added Daily transport event: `on_dialout_answered`. See
  https://reference-python.daily.co/api_reference.html#daily.EventHandler

 - Added new `AzureSTTService`. This allows you to use Azure Speech-To-Text.
@@ -563,7 +683,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Added Daily transport support for dial-in use cases.

 - Added Daily transport events: `on_dialout_connected`, `on_dialout_stopped`,
-  `on_dialout_error` and `on_dialout_warning`.  See
+  `on_dialout_error` and `on_dialout_warning`. See
  https://reference-python.daily.co/api_reference.html#daily.EventHandler

 ## [0.0.21] - 2024-05-22
--- a/README.md
+++ b/README.md
@@ -38,7 +38,7 @@ pip install "pipecat-ai[option,...]"

 Your project may or may not need these, so they're made available as optional requirements. Here is a list:

- **AI services**: `anthropic`, `azure`, `deepgram`, `gladia`, `google`, `fal`, `moondream`, `openai`, `openpipe`, `playht`, `silero`, `whisper`, `xtts`
+- **AI services**: `anthropic`, `azure`, `deepgram`, `gladia`, `google`, `fal`, `lmnt`, `moondream`, `openai`, `openpipe`, `playht`, `silero`, `whisper`, `xtts`
 - **Transports**: `local`, `websocket`, `daily`

 ## Code examples
@@ -165,7 +165,7 @@ pip install "path_to_this_repo[option,...]"
 From the root directory, run:

 ```shell
-pytest --doctest-modules --ignore-glob="*to_be_updated*" src tests
+pytest --doctest-modules --ignore-glob="*to_be_updated*" --ignore-glob=*pipeline_source* src tests
 ```

 ## Setting up your editor
--- a/dot-env.template
+++ b/dot-env.template
@@ -30,6 +30,10 @@ FIREWORKS_API_KEY=...
 # Gladia
 GLADIA_API_KEY=...

+# LMNT
+LMNT_API_KEY=...
+LMNT_VOICE_ID=...
+
 # PlayHT
 PLAY_HT_USER_ID=...
 PLAY_HT_API_KEY=...
--- a/examples/deployment/flyio-example/bot.py
+++ b/examples/deployment/flyio-example/bot.py
@@ -1,5 +1,4 @@
 import asyncio
-import aiohttp
 import os
 import sys
 import argparse
@@ -27,71 +26,69 @@ daily_api_url = os.getenv("DAILY_API_URL", "https://api.daily.co/v1")


 async def main(room_url: str, token: str):
-    async with aiohttp.ClientSession() as session:
-        transport = DailyTransport(
-            room_url,
-            token,
-            "Chatbot",
-            DailyParams(
-                api_url=daily_api_url,
-                api_key=daily_api_key,
-                audio_in_enabled=True,
-                audio_out_enabled=True,
-                camera_out_enabled=False,
-                vad_enabled=True,
-                vad_analyzer=SileroVADAnalyzer(),
-                transcription_enabled=True,
-            )
+    transport = DailyTransport(
+        room_url,
+        token,
+        "Chatbot",
+        DailyParams(
+            api_url=daily_api_url,
+            api_key=daily_api_key,
+            audio_in_enabled=True,
+            audio_out_enabled=True,
+            camera_out_enabled=False,
+            vad_enabled=True,
+            vad_analyzer=SileroVADAnalyzer(),
+            transcription_enabled=True,
        )
+    )

-        tts = ElevenLabsTTSService(
-            aiohttp_session=session,
-            api_key=os.getenv("ELEVENLABS_API_KEY", ""),
-            voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
-        )
+    tts = ElevenLabsTTSService(
+        api_key=os.getenv("ELEVENLABS_API_KEY", ""),
+        voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
+    )

-        llm = OpenAILLMService(
-            api_key=os.getenv("OPENAI_API_KEY"),
-            model="gpt-4o")
+    llm = OpenAILLMService(
+        api_key=os.getenv("OPENAI_API_KEY"),
+        model="gpt-4o")

-        messages = [
-            {
-                "role": "system",
-                "content": "You are Chatbot, a friendly, helpful robot. Your output will be converted to audio so don't include special characters other than '!' or '?' in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by saying hello.",
-            },
-        ]
+    messages = [
+        {
+            "role": "system",
+            "content": "You are Chatbot, a friendly, helpful robot. Your output will be converted to audio so don't include special characters other than '!' or '?' in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by saying hello.",
+        },
+    ]

-        tma_in = LLMUserResponseAggregator(messages)
-        tma_out = LLMAssistantResponseAggregator(messages)
+    tma_in = LLMUserResponseAggregator(messages)
+    tma_out = LLMAssistantResponseAggregator(messages)

-        pipeline = Pipeline([
-            transport.input(),
-            tma_in,
-            llm,
-            tts,
-            transport.output(),
-            tma_out,
-        ])
+    pipeline = Pipeline([
+        transport.input(),
+        tma_in,
+        llm,
+        tts,
+        transport.output(),
+        tma_out,
+    ])

-        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
+    task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))

-        @transport.event_handler("on_first_participant_joined")
-        async def on_first_participant_joined(transport, participant):
-            transport.capture_participant_transcription(participant["id"])
-            await task.queue_frames([LLMMessagesFrame(messages)])
+    @transport.event_handler("on_first_participant_joined")
+    async def on_first_participant_joined(transport, participant):
+        transport.capture_participant_transcription(participant["id"])
+        await task.queue_frames([LLMMessagesFrame(messages)])

-        @transport.event_handler("on_participant_left")
-        async def on_participant_left(transport, participant, reason):
+    @transport.event_handler("on_participant_left")
+    async def on_participant_left(transport, participant, reason):
+        await task.queue_frame(EndFrame())
+
+    @transport.event_handler("on_call_state_updated")
+    async def on_call_state_updated(transport, state):
+        if state == "left":
            await task.queue_frame(EndFrame())

-        @transport.event_handler("on_call_state_updated")
-        async def on_call_state_updated(transport, state):
-            if state == "left":
-                await task.queue_frame(EndFrame())
+    runner = PipelineRunner()

-        runner = PipelineRunner()
-
-        await runner.run(task)
+    await runner.run(task)


 if __name__ == "__main__":
--- a/examples/dialin-chatbot/bot_daily.py
+++ b/examples/dialin-chatbot/bot_daily.py
@@ -1,5 +1,4 @@
 import asyncio
-import aiohttp
 import os
 import sys
 import argparse
@@ -29,75 +28,74 @@ daily_api_url = os.getenv("DAILY_API_URL", "https://api.daily.co/v1")


 async def main(room_url: str, token: str, callId: str, callDomain: str):
-    async with aiohttp.ClientSession() as session:
-        # diallin_settings are only needed if Daily's SIP URI is used
-        # If you are handling this via Twilio, Telnyx, set this to None
-        # and handle call-forwarding when on_dialin_ready fires.
-        diallin_settings = DailyDialinSettings(
-            call_id=callId,
-            call_domain=callDomain
+    # diallin_settings are only needed if Daily's SIP URI is used
+    # If you are handling this via Twilio, Telnyx, set this to None
+    # and handle call-forwarding when on_dialin_ready fires.
+    diallin_settings = DailyDialinSettings(
+        call_id=callId,
+        call_domain=callDomain
+    )
+
+    transport = DailyTransport(
+        room_url,
+        token,
+        "Chatbot",
+        DailyParams(
+            api_url=daily_api_url,
+            api_key=daily_api_key,
+            dialin_settings=diallin_settings,
+            audio_in_enabled=True,
+            audio_out_enabled=True,
+            camera_out_enabled=False,
+            vad_enabled=True,
+            vad_analyzer=SileroVADAnalyzer(),
+            transcription_enabled=True,
        )
+    )

-        transport = DailyTransport(
-            room_url,
-            token,
-            "Chatbot",
-            DailyParams(
-                api_url=daily_api_url,
-                api_key=daily_api_key,
-                dialin_settings=diallin_settings,
-                audio_in_enabled=True,
-                audio_out_enabled=True,
-                camera_out_enabled=False,
-                vad_enabled=True,
-                vad_analyzer=SileroVADAnalyzer(),
-                transcription_enabled=True,
-            )
-        )
+    tts = ElevenLabsTTSService(
+        api_key=os.getenv("ELEVENLABS_API_KEY", ""),
+        voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
+    )

-        tts = ElevenLabsTTSService(
-            aiohttp_session=session,
-            api_key=os.getenv("ELEVENLABS_API_KEY", ""),
-            voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
-        )
+    llm = OpenAILLMService(
+        api_key=os.getenv("OPENAI_API_KEY"),
+        model="gpt-4o"
+    )

-        llm = OpenAILLMService(
-            api_key=os.getenv("OPENAI_API_KEY"),
-            model="gpt-4o")
+    messages = [
+        {
+            "role": "system",
+            "content": "You are Chatbot, a friendly, helpful robot. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by saying 'Oh, hello! Who dares dial me at this hour?!'.",
+        },
+    ]

-        messages = [
-            {
-                "role": "system",
-                "content": "You are Chatbot, a friendly, helpful robot. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by saying 'Oh, hello! Who dares dial me at this hour?!'.",
-            },
-        ]
+    tma_in = LLMUserResponseAggregator(messages)
+    tma_out = LLMAssistantResponseAggregator(messages)

-        tma_in = LLMUserResponseAggregator(messages)
-        tma_out = LLMAssistantResponseAggregator(messages)
+    pipeline = Pipeline([
+        transport.input(),
+        tma_in,
+        llm,
+        tts,
+        transport.output(),
+        tma_out,
+    ])

-        pipeline = Pipeline([
-            transport.input(),
-            tma_in,
-            llm,
-            tts,
-            transport.output(),
-            tma_out,
-        ])
+    task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))

-        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
+    @transport.event_handler("on_first_participant_joined")
+    async def on_first_participant_joined(transport, participant):
+        transport.capture_participant_transcription(participant["id"])
+        await task.queue_frames([LLMMessagesFrame(messages)])

-        @transport.event_handler("on_first_participant_joined")
-        async def on_first_participant_joined(transport, participant):
-            transport.capture_participant_transcription(participant["id"])
-            await task.queue_frames([LLMMessagesFrame(messages)])
+    @transport.event_handler("on_participant_left")
+    async def on_participant_left(transport, participant, reason):
+        await task.queue_frame(EndFrame())

-        @transport.event_handler("on_participant_left")
-        async def on_participant_left(transport, participant, reason):
-            await task.queue_frame(EndFrame())
+    runner = PipelineRunner()

-        runner = PipelineRunner()
-
-        await runner.run(task)
+    await runner.run(task)


 if __name__ == "__main__":
--- a/examples/dialin-chatbot/bot_twilio.py
+++ b/examples/dialin-chatbot/bot_twilio.py
@@ -1,5 +1,4 @@
 import asyncio
-import aiohttp
 import os
 import sys
 import argparse
@@ -36,82 +35,81 @@ daily_api_key = os.getenv("DAILY_API_KEY", "")


 async def main(room_url: str, token: str, callId: str, sipUri: str):
-    async with aiohttp.ClientSession() as session:
-        # diallin_settings are only needed if Daily's SIP URI is used
-        # If you are handling this via Twilio, Telnyx, set this to None
-        # and handle call-forwarding when on_dialin_ready fires.
-        transport = DailyTransport(
-            room_url,
-            token,
-            "Chatbot",
-            DailyParams(
-                api_key=daily_api_key,
-                dialin_settings=None,  # Not required for Twilio
-                audio_in_enabled=True,
-                audio_out_enabled=True,
-                camera_out_enabled=False,
-                vad_enabled=True,
-                vad_analyzer=SileroVADAnalyzer(),
-                transcription_enabled=True,
+    # dialin_settings are only needed if Daily's SIP URI is used
+    # If you are handling this via Twilio, Telnyx, set this to None
+    # and handle call-forwarding when on_dialin_ready fires.
+    transport = DailyTransport(
+        room_url,
+        token,
+        "Chatbot",
+        DailyParams(
+            api_key=daily_api_key,
+            dialin_settings=None,  # Not required for Twilio
+            audio_in_enabled=True,
+            audio_out_enabled=True,
+            camera_out_enabled=False,
+            vad_enabled=True,
+            vad_analyzer=SileroVADAnalyzer(),
+            transcription_enabled=True,
+        )
+    )
+
+    tts = ElevenLabsTTSService(
+        api_key=os.getenv("ELEVENLABS_API_KEY", ""),
+        voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
+    )
+
+    llm = OpenAILLMService(
+        api_key=os.getenv("OPENAI_API_KEY"),
+        model="gpt-4o"
+    )
+
+    messages = [
+        {
+            "role": "system",
+            "content": "You are Chatbot, a friendly, helpful robot. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by saying 'Hello! Who dares dial me at this hour?!'.",
+        },
+    ]
+
+    tma_in = LLMUserResponseAggregator(messages)
+    tma_out = LLMAssistantResponseAggregator(messages)
+
+    pipeline = Pipeline([
+        transport.input(),
+        tma_in,
+        llm,
+        tts,
+        transport.output(),
+        tma_out,
+    ])
+
+    task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
+
+    @transport.event_handler("on_first_participant_joined")
+    async def on_first_participant_joined(transport, participant):
+        transport.capture_participant_transcription(participant["id"])
+        await task.queue_frames([LLMMessagesFrame(messages)])
+
+    @transport.event_handler("on_participant_left")
+    async def on_participant_left(transport, participant, reason):
+        await task.queue_frame(EndFrame())
+
+    @transport.event_handler("on_dialin_ready")
+    async def on_dialin_ready(transport, cdata):
+        # For Twilio, Telnyx, etc. You need to update the state of the call
+        # and forward it to the sip_uri..
+        print(f"Forwarding call: {callId} {sipUri}")
+
+        try:
+            # The TwiML is updated using Twilio's client library
+            call = twilioclient.calls(callId).update(
+                twiml=f'<Response><Dial><Sip>{sipUri}</Sip></Dial></Response>'
            )
-        )
+        except Exception as e:
+            raise Exception(f"Failed to forward call: {str(e)}")

-        tts = ElevenLabsTTSService(
-            aiohttp_session=session,
-            api_key=os.getenv("ELEVENLABS_API_KEY", ""),
-            voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
-        )
-
-        llm = OpenAILLMService(
-            api_key=os.getenv("OPENAI_API_KEY"),
-            model="gpt-4o")
-
-        messages = [
-            {
-                "role": "system",
-                "content": "You are Chatbot, a friendly, helpful robot. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way, but keep your responses brief. Start by saying 'Hello! Who dares dial me at this hour?!'.",
-            },
-        ]
-
-        tma_in = LLMUserResponseAggregator(messages)
-        tma_out = LLMAssistantResponseAggregator(messages)
-
-        pipeline = Pipeline([
-            transport.input(),
-            tma_in,
-            llm,
-            tts,
-            transport.output(),
-            tma_out,
-        ])
-
-        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
-
-        @transport.event_handler("on_first_participant_joined")
-        async def on_first_participant_joined(transport, participant):
-            transport.capture_participant_transcription(participant["id"])
-            await task.queue_frames([LLMMessagesFrame(messages)])
-
-        @transport.event_handler("on_participant_left")
-        async def on_participant_left(transport, participant, reason):
-            await task.queue_frame(EndFrame())
-
-        @transport.event_handler("on_dialin_ready")
-        async def on_dialin_ready(transport, cdata):
-            # For Twilio, Telnyx, etc. You need to update the state of the call
-            # and forward it to the sip_uri..
-            print(f"Forwarding call: {callId} {sipUri}")
-
-            try:
-                # The TwiML is updated using Twilio's client library
-                call = twilioclient.calls(callId).update(
-                    twiml=f'<Response><Dial><Sip>{sipUri}</Sip></Dial></Response>'
-                )
-            except Exception as e:
-                raise Exception(f"Failed to forward call: {str(e)}")
-
-        runner = PipelineRunner()
-        await runner.run(task)
+    runner = PipelineRunner()
+    await runner.run(task)


 if __name__ == "__main__":
--- a/examples/dialin-chatbot/requirements.txt
+++ b/examples/dialin-chatbot/requirements.txt
@@ -1,5 +1,6 @@
-pipecat-ai[daily,openai,silero]
+pipecat-ai[daily,elevenlabs,openai,silero]
 fastapi
 uvicorn
 python-dotenv
 twilio
+python-multipart
--- a/examples/foundational/01-say-one-thing.py
+++ b/examples/foundational/01-say-one-thing.py
@@ -9,11 +9,11 @@ import aiohttp
 import os
 import sys

-from pipecat.frames.frames import TextFrame
+from pipecat.frames.frames import EndFrame, TextFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.task import PipelineTask
 from pipecat.pipeline.runner import PipelineRunner
-from pipecat.services.cartesia import CartesiaTTSService
+from pipecat.services.cartesia import CartesiaHttpTTSService
 from pipecat.transports.services.daily import DailyParams, DailyTransport

 from runner import configure
@@ -34,7 +34,7 @@ async def main():
        transport = DailyTransport(
            room_url, None, "Say One Thing", DailyParams(audio_out_enabled=True))

-        tts = CartesiaTTSService(
+        tts = CartesiaHttpTTSService(
            api_key=os.getenv("CARTESIA_API_KEY"),
            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
        )
@@ -48,7 +48,7 @@ async def main():
        @transport.event_handler("on_participant_joined")
        async def on_new_participant_joined(transport, participant):
            participant_name = participant["info"]["userName"] or ''
-            await task.queue_frame(TextFrame(f"Hello there, {participant_name}!"))
+            await task.queue_frames([TextFrame(f"Hello there, {participant_name}!"), EndFrame()])

        await runner.run(task)

--- a/examples/foundational/02-llm-say-one-thing.py
+++ b/examples/foundational/02-llm-say-one-thing.py
@@ -9,11 +9,11 @@ import aiohttp
 import os
 import sys

-from pipecat.frames.frames import LLMMessagesFrame
+from pipecat.frames.frames import EndFrame, LLMMessagesFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineTask
-from pipecat.services.cartesia import CartesiaTTSService
+from pipecat.services.cartesia import CartesiaHttpTTSService
 from pipecat.services.openai import OpenAILLMService
 from pipecat.transports.services.daily import DailyParams, DailyTransport

@@ -38,7 +38,7 @@ async def main():
            "Say One Thing From an LLM",
            DailyParams(audio_out_enabled=True))

-        tts = CartesiaTTSService(
+        tts = CartesiaHttpTTSService(
            api_key=os.getenv("CARTESIA_API_KEY"),
            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
        )
@@ -59,7 +59,7 @@ async def main():

        @transport.event_handler("on_first_participant_joined")
        async def on_first_participant_joined(transport, participant):
-            await task.queue_frame(LLMMessagesFrame(messages))
+            await task.queue_frames([LLMMessagesFrame(messages), EndFrame()])

        await runner.run(task)

--- a/examples/foundational/04-utterance-and-speech.py
+++ b/examples/foundational/04-utterance-and-speech.py
@@ -4,6 +4,10 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

+#
+# This example broken on latest pipecat and needs updating.
+#
+
 import aiohttp
 import asyncio
 import os
--- a/examples/foundational/05-sync-speech-and-image.py
+++ b/examples/foundational/05-sync-speech-and-image.py
@@ -14,21 +14,18 @@ from dataclasses import dataclass
 from pipecat.frames.frames import (
    AppFrame,
    Frame,
-    ImageRawFrame,
    LLMFullResponseStartFrame,
    LLMMessagesFrame,
    TextFrame
 )
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.sync_parallel_pipeline import SyncParallelPipeline
 from pipecat.pipeline.task import PipelineTask
-from pipecat.pipeline.parallel_task import ParallelTask
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
-from pipecat.processors.aggregators.gated import GatedAggregator
-from pipecat.processors.aggregators.llm_response import LLMFullResponseAggregator
 from pipecat.processors.aggregators.sentence import SentenceAggregator
+from pipecat.services.cartesia import CartesiaHttpTTSService
 from pipecat.services.openai import OpenAILLMService
-from pipecat.services.elevenlabs import ElevenLabsTTSService
 from pipecat.services.fal import FalImageGenService
 from pipecat.transports.services.daily import DailyParams, DailyTransport

@@ -88,10 +85,9 @@ async def main():
            )
        )

-        tts = ElevenLabsTTSService(
-            aiohttp_session=session,
-            api_key=os.getenv("ELEVENLABS_API_KEY"),
-            voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
+        tts = CartesiaHttpTTSService(
+            api_key=os.getenv("CARTESIA_API_KEY"),
+            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
        )

        llm = OpenAILLMService(
@@ -106,24 +102,23 @@ async def main():
            key=os.getenv("FAL_KEY"),
        )

-        gated_aggregator = GatedAggregator(
-            gate_open_fn=lambda frame: isinstance(frame, ImageRawFrame),
-            gate_close_fn=lambda frame: isinstance(frame, LLMFullResponseStartFrame),
-            start_open=False
-        )
-
        sentence_aggregator = SentenceAggregator()
        month_prepender = MonthPrepender()
-        llm_full_response_aggregator = LLMFullResponseAggregator()

+        # With `SyncParallelPipeline` we synchronize audio and images by pushing
+        # them basically in order (e.g. I1 A1 A1 A1 I2 A2 A2 A2 A2 I3 A3). To do
+        # that, each pipeline runs concurrently and `SyncParallelPipeline` will
+        # wait for the input frame to be processed.
+        #
+        # Note that `SyncParallelPipeline` requires all processors in it to be
+        # synchronous (which is the default for most processors).
        pipeline = Pipeline([
            llm,                     # LLM
            sentence_aggregator,     # Aggregates LLM output into full sentences
-            ParallelTask(            # Run pipelines in parallel aggregating the result
-                [month_prepender, tts],                   # Create "Month: sentence" and output audio
-                [llm_full_response_aggregator, imagegen]  # Aggregate full LLM response
+            SyncParallelPipeline(    # Run pipelines in parallel aggregating the result
+                [month_prepender, tts],  # Create "Month: sentence" and output audio
+                [imagegen]               # Generate image
            ),
-            gated_aggregator,        # Queues everything until an image is available
            transport.output()       # Transport output
        ])

--- a/examples/foundational/05a-local-sync-speech-and-image.py
+++ b/examples/foundational/05a-local-sync-speech-and-image.py
@@ -11,18 +11,24 @@ import sys

 import tkinter as tk

-from pipecat.frames.frames import AudioRawFrame, Frame, URLImageRawFrame, LLMMessagesFrame, TextFrame
-from pipecat.pipeline.parallel_pipeline import ParallelPipeline
+from pipecat.frames.frames import (
+    Frame,
+    OutputAudioRawFrame,
+    TTSAudioRawFrame,
+    URLImageRawFrame,
+    LLMMessagesFrame,
+    TextFrame)
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.sync_parallel_pipeline import SyncParallelPipeline
 from pipecat.pipeline.task import PipelineTask
-from pipecat.processors.aggregators.llm_response import LLMFullResponseAggregator
+from pipecat.processors.aggregators.sentence import SentenceAggregator
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
+from pipecat.services.cartesia import CartesiaHttpTTSService
 from pipecat.services.openai import OpenAILLMService
-from pipecat.services.elevenlabs import ElevenLabsTTSService
 from pipecat.services.fal import FalImageGenService
 from pipecat.transports.base_transport import TransportParams
-from pipecat.transports.local.tk import TkLocalTransport
+from pipecat.transports.local.tk import TkLocalTransport, TkOutputTransport

 from loguru import logger

@@ -60,13 +66,14 @@ async def main():
                def __init__(self):
                    super().__init__()
                    self.audio = bytearray()
+                    self.frame = None

                async def process_frame(self, frame: Frame, direction: FrameDirection):
                    await super().process_frame(frame, direction)

-                    if isinstance(frame, AudioRawFrame):
+                    if isinstance(frame, TTSAudioRawFrame):
                        self.audio.extend(frame.audio)
-                        self.frame = AudioRawFrame(
+                        self.frame = OutputAudioRawFrame(
                            bytes(self.audio), frame.sample_rate, frame.num_channels)

            class ImageGrabber(FrameProcessor):
@@ -84,10 +91,10 @@ async def main():
                api_key=os.getenv("OPENAI_API_KEY"),
                model="gpt-4o")

-            tts = ElevenLabsTTSService(
-                aiohttp_session=session,
-                api_key=os.getenv("ELEVENLABS_API_KEY"),
-                voice_id=os.getenv("ELEVENLABS_VOICE_ID"))
+            tts = CartesiaHttpTTSService(
+                api_key=os.getenv("CARTESIA_API_KEY"),
+                voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
+            )

            imagegen = FalImageGenService(
                params=FalImageGenService.InputParams(
@@ -96,7 +103,7 @@ async def main():
                aiohttp_session=session,
                key=os.getenv("FAL_KEY"))

-            aggregator = LLMFullResponseAggregator()
+            sentence_aggregator = SentenceAggregator()

            description = ImageDescription()

@@ -104,12 +111,22 @@ async def main():

            image_grabber = ImageGrabber()

+            # With `SyncParallelPipeline` we synchronize audio and images by
+            # pushing them basically in order (e.g. I1 A1 A1 A1 I2 A2 A2 A2 A2
+            # I3 A3). To do that, each pipeline runs concurrently and
+            # `SyncParallelPipeline` will wait for the input frame to be
+            # processed.
+            #
+            # Note that `SyncParallelPipeline` requires all processors in it to
+            # be synchronous (which is the default for most processors).
            pipeline = Pipeline([
-                llm,
-                aggregator,
-                description,
-                ParallelPipeline([tts, audio_grabber],
-                                 [imagegen, image_grabber])
+                llm,                     # LLM
+                sentence_aggregator,     # Aggregates LLM output into full sentences
+                description,             # Store sentence
+                SyncParallelPipeline(
+                    [tts, audio_grabber],      # Generate and store audio for the given sentence
+                    [imagegen, image_grabber]  # Generate and storeimage for the given sentence
+                )
            ])

            task = PipelineTask(pipeline)
--- a/examples/foundational/06-listen-and-respond.py
+++ b/examples/foundational/06-listen-and-respond.py
@@ -10,6 +10,7 @@ import os
 import sys

 from pipecat.frames.frames import Frame, LLMMessagesFrame, MetricsFrame
+from pipecat.metrics.metrics import TTFBMetricsData, ProcessingMetricsData, LLMUsageMetricsData, TTSUsageMetricsData
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineParams, PipelineTask
@@ -37,8 +38,19 @@ logger.add(sys.stderr, level="DEBUG")
 class MetricsLogger(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        if isinstance(frame, MetricsFrame):
-            print(
-                f"!!! MetricsFrame: {frame}, ttfb: {frame.ttfb}, processing: {frame.processing}, tokens: {frame.tokens}, characters: {frame.characters}")
+            for d in frame.data:
+                if isinstance(d, TTFBMetricsData):
+                    print(f"!!! MetricsFrame: {frame}, ttfb: {d.value}")
+                elif isinstance(d, ProcessingMetricsData):
+                    print(f"!!! MetricsFrame: {frame}, processing: {d.value}")
+                elif isinstance(d, LLMUsageMetricsData):
+                    tokens = d.value
+                    print(
+                        f"!!! MetricsFrame: {frame}, tokens: {
+                            tokens.prompt_tokens}, characters: {
+                            tokens.completion_tokens}")
+                elif isinstance(d, TTSUsageMetricsData):
+                    print(f"!!! MetricsFrame: {frame}, characters: {d.value}")
        await self.push_frame(frame, direction)


@@ -90,11 +102,6 @@ async def main():
        ])

        task = PipelineTask(pipeline)
-        task = PipelineTask(pipeline, PipelineParams(
-            allow_interruptions=True,
-            enable_metrics=True,
-            report_only_initial_ttfb=False,
-        ))

        @transport.event_handler("on_first_participant_joined")
        async def on_first_participant_joined(transport, participant):
--- a/examples/foundational/06a-image-sync.py
+++ b/examples/foundational/06a-image-sync.py
@@ -11,7 +11,7 @@ import sys

 from PIL import Image

-from pipecat.frames.frames import ImageRawFrame, Frame, SystemFrame, TextFrame
+from pipecat.frames.frames import Frame, OutputImageRawFrame, SystemFrame, TextFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineTask
@@ -20,8 +20,8 @@ from pipecat.processors.aggregators.llm_response import (
    LLMUserResponseAggregator,
 )
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
+from pipecat.services.cartesia import CartesiaHttpTTSService
 from pipecat.services.openai import OpenAILLMService
-from pipecat.services.elevenlabs import ElevenLabsTTSService
 from pipecat.transports.services.daily import DailyTransport
 from pipecat.vad.silero import SileroVADAnalyzer

@@ -52,9 +52,16 @@ class ImageSyncAggregator(FrameProcessor):
        await super().process_frame(frame, direction)

        if not isinstance(frame, SystemFrame) and direction == FrameDirection.DOWNSTREAM:
-            await self.push_frame(ImageRawFrame(image=self._speaking_image_bytes, size=(1024, 1024), format=self._speaking_image_format))
+            await self.push_frame(OutputImageRawFrame(
+                image=self._speaking_image_bytes,
+                size=(1024, 1024),
+                format=self._speaking_image_format)
+            )
            await self.push_frame(frame)
-            await self.push_frame(ImageRawFrame(image=self._waiting_image_bytes, size=(1024, 1024), format=self._waiting_image_format))
+            await self.push_frame(OutputImageRawFrame(
+                image=self._waiting_image_bytes,
+                size=(1024, 1024),
+                format=self._waiting_image_format))
        else:
            await self.push_frame(frame)

@@ -78,10 +85,9 @@ async def main():
            )
        )

-        tts = ElevenLabsTTSService(
-            aiohttp_session=session,
-            api_key=os.getenv("ELEVENLABS_API_KEY"),
-            voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
+        tts = CartesiaHttpTTSService(
+            api_key=os.getenv("CARTESIA_API_KEY"),
+            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
        )

        llm = OpenAILLMService(
--- a/examples/foundational/07b-interruptible-langchain.py
+++ b/examples/foundational/07b-interruptible-langchain.py
@@ -18,7 +18,6 @@ from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator, LLMUserResponseAggregator)
 from pipecat.processors.frameworks.langchain import LangchainProcessor
 from pipecat.services.cartesia import CartesiaTTSService
-from pipecat.services.elevenlabs import ElevenLabsTTSService
 from pipecat.transports.services.daily import DailyParams, DailyTransport
 from pipecat.vad.silero import SileroVADAnalyzer

--- a/examples/foundational/07d-interruptible-elevenlabs.py
+++ b/examples/foundational/07d-interruptible-elevenlabs.py
@@ -0,0 +1,99 @@
+#
+# Copyright (c) 2024, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import asyncio
+import aiohttp
+import os
+import sys
+
+from pipecat.frames.frames import LLMMessagesFrame
+from pipecat.pipeline.pipeline import Pipeline
+from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.task import PipelineParams, PipelineTask
+from pipecat.processors.aggregators.llm_response import (
+    LLMAssistantResponseAggregator, LLMUserResponseAggregator)
+from pipecat.services.elevenlabs import ElevenLabsTTSService
+from pipecat.services.openai import OpenAILLMService
+from pipecat.transports.services.daily import DailyParams, DailyTransport
+from pipecat.vad.silero import SileroVADAnalyzer
+
+from runner import configure
+
+from loguru import logger
+
+from dotenv import load_dotenv
+load_dotenv(override=True)
+
+logger.remove(0)
+logger.add(sys.stderr, level="DEBUG")
+
+
+async def main():
+    async with aiohttp.ClientSession() as session:
+        (room_url, token) = await configure(session)
+
+        transport = DailyTransport(
+            room_url,
+            token,
+            "Respond bot",
+            DailyParams(
+                audio_out_enabled=True,
+                transcription_enabled=True,
+                vad_enabled=True,
+                vad_analyzer=SileroVADAnalyzer()
+            )
+        )
+
+        tts = ElevenLabsTTSService(
+            api_key=os.getenv("ELEVENLABS_API_KEY", ""),
+            voice_id=os.getenv("ELEVENLABS_VOICE_ID", ""),
+        )
+
+        llm = OpenAILLMService(
+            api_key=os.getenv("OPENAI_API_KEY"),
+            model="gpt-4o")
+
+        messages = [
+            {
+                "role": "system",
+                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
+            },
+        ]
+
+        tma_in = LLMUserResponseAggregator(messages)
+        tma_out = LLMAssistantResponseAggregator(messages)
+
+        pipeline = Pipeline([
+            transport.input(),   # Transport user input
+            tma_in,              # User responses
+            llm,                 # LLM
+            tts,                 # TTS
+            transport.output(),  # Transport bot output
+            tma_out              # Assistant spoken responses
+        ])
+
+        task = PipelineTask(pipeline, PipelineParams(
+            allow_interruptions=True,
+            enable_metrics=True,
+            enable_usage_metrics=True,
+            report_only_initial_ttfb=True,
+        ))
+
+        @transport.event_handler("on_first_participant_joined")
+        async def on_first_participant_joined(transport, participant):
+            transport.capture_participant_transcription(participant["id"])
+            # Kick off the conversation.
+            messages.append(
+                {"role": "system", "content": "Please introduce yourself to the user."})
+            await task.queue_frames([LLMMessagesFrame(messages)])
+
+        runner = PipelineRunner()
+
+        await runner.run(task)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/examples/foundational/07d-interruptible-cartesia.py
+++ b/examples/foundational/07d-interruptible-cartesia.py
@@ -15,12 +15,11 @@ from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.llm_response import (
    LLMAssistantResponseAggregator, LLMUserResponseAggregator)
-from pipecat.services.cartesia import CartesiaTTSService
+from pipecat.services.lmnt import LmntTTSService
 from pipecat.services.openai import OpenAILLMService
 from pipecat.transports.services.daily import DailyParams, DailyTransport
 from pipecat.vad.silero import SileroVADAnalyzer

-
 from runner import configure

 from loguru import logger
@@ -41,18 +40,17 @@ async def main():
            token,
            "Respond bot",
            DailyParams(
-                audio_out_sample_rate=44100,
                audio_out_enabled=True,
+                audio_out_sample_rate=24000,
                transcription_enabled=True,
                vad_enabled=True,
                vad_analyzer=SileroVADAnalyzer()
            )
        )

-        tts = CartesiaTTSService(
-            api_key=os.getenv("CARTESIA_API_KEY"),
-            voice_id="a0e99841-438c-4a64-b679-ae501e7d6091",  # Barbershop Man
-            sample_rate=44100,
+        tts = LmntTTSService(
+            api_key=os.getenv("LMNT_API_KEY"),
+            voice_id="morgan"
        )

        llm = OpenAILLMService(
@@ -74,11 +72,11 @@ async def main():
            tma_in,              # User responses
            llm,                 # LLM
            tts,                 # TTS
-            tma_out,             # Goes before the transport because cartesia has word-level timestamps!
            transport.output(),  # Transport bot output
+            tma_out              # Assistant spoken responses
        ])

-        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True, enable_metrics=True))
+        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))

        @transport.event_handler("on_first_participant_joined")
        async def on_first_participant_joined(transport, participant):
--- a/examples/foundational/07l-interruptible-together.py
+++ b/examples/foundational/07l-interruptible-together.py
@@ -0,0 +1,102 @@
+#
+# Copyright (c) 2024, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import asyncio
+import aiohttp
+import os
+import sys
+
+from pipecat.frames.frames import LLMMessagesFrame
+from pipecat.pipeline.pipeline import Pipeline
+from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.task import PipelineParams, PipelineTask
+from pipecat.processors.aggregators.llm_response import (
+    LLMAssistantResponseAggregator, LLMUserResponseAggregator)
+from pipecat.services.cartesia import CartesiaTTSService
+from pipecat.services.together import TogetherLLMService
+from pipecat.transports.services.daily import DailyParams, DailyTransport
+from pipecat.vad.silero import SileroVADAnalyzer
+
+from runner import configure
+
+from loguru import logger
+
+from dotenv import load_dotenv
+load_dotenv(override=True)
+
+logger.remove(0)
+logger.add(sys.stderr, level="DEBUG")
+
+
+async def main():
+    async with aiohttp.ClientSession() as session:
+        (room_url, token) = await configure(session)
+
+        transport = DailyTransport(
+            room_url,
+            token,
+            "Respond bot",
+            DailyParams(
+                audio_out_enabled=True,
+                transcription_enabled=True,
+                vad_enabled=True,
+                vad_analyzer=SileroVADAnalyzer()
+            )
+        )
+
+        tts = CartesiaTTSService(
+            api_key=os.getenv("CARTESIA_API_KEY"),
+            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
+        )
+
+        llm = TogetherLLMService(
+            api_key=os.getenv("TOGETHER_API_KEY"),
+            model=os.getenv("TOGETHER_MODEL"),
+            params=TogetherLLMService.InputParams(
+                temperature=1.0,
+                top_p=0.9,
+                top_k=40,
+                extra={
+                    "frequency_penalty": 2.0,
+                    "presence_penalty": 0.0,
+                }
+            )
+        )
+
+        messages = [
+            {
+                "role": "system",
+                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
+            },
+        ]
+
+        tma_in = LLMUserResponseAggregator(messages)
+        tma_out = LLMAssistantResponseAggregator(messages)
+
+        pipeline = Pipeline([
+            transport.input(),   # Transport user input
+            tma_in,              # User responses
+            llm,                 # LLM
+            tts,                 # TTS
+            transport.output(),  # Transport bot output
+            tma_out              # Assistant spoken responses
+        ])
+
+        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True))
+
+        @transport.event_handler("on_first_participant_joined")
+        async def on_first_participant_joined(transport, participant):
+            transport.capture_participant_transcription(participant["id"])
+            # Kick off the conversation.
+            await task.queue_frames([LLMMessagesFrame(messages)])
+
+        runner = PipelineRunner()
+
+        await runner.run(task)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/examples/foundational/08-bots-arguing.py
+++ b/examples/foundational/08-bots-arguing.py
@@ -3,14 +3,14 @@ import aiohttp
 import asyncio
 import logging
 import os
-from pipecat.pipeline.aggregators import SentenceAggregator
+from pipecat.processors.aggregators import SentenceAggregator
 from pipecat.pipeline.pipeline import Pipeline

-from pipecat.transports.daily_transport import DailyTransport
-from pipecat.services.azure_ai_services import AzureLLMService, AzureTTSService
-from pipecat.services.elevenlabs_ai_services import ElevenLabsTTSService
-from pipecat.services.fal_ai_services import FalImageGenService
-from pipecat.pipeline.frames import AudioFrame, EndFrame, ImageFrame, LLMMessagesFrame, TextFrame
+from pipecat.transports.services.daily import DailyTransport
+from pipecat.services.azure import AzureLLMService, AzureTTSService
+from pipecat.services.elevenlabs import ElevenLabsTTSService
+from pipecat.services.fal import FalImageGenService
+from pipecat.frames.frames import AudioFrame, EndFrame, ImageFrame, LLMMessagesFrame, TextFrame

 from runner import configure

--- a/examples/foundational/09-mirror.py
+++ b/examples/foundational/09-mirror.py
@@ -8,9 +8,11 @@ import aiohttp
 import asyncio
 import sys

+from pipecat.frames.frames import Frame, InputAudioRawFrame, InputImageRawFrame, OutputAudioRawFrame, OutputImageRawFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineTask
+from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
 from pipecat.transports.services.daily import DailyTransport, DailyParams

 from runner import configure
@@ -24,6 +26,27 @@ logger.remove(0)
 logger.add(sys.stderr, level="DEBUG")


+class MirrorProcessor(FrameProcessor):
+
+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, InputAudioRawFrame):
+            await self.push_frame(OutputAudioRawFrame(
+                audio=frame.audio,
+                sample_rate=frame.sample_rate,
+                num_channels=frame.num_channels)
+            )
+        elif isinstance(frame, InputImageRawFrame):
+            await self.push_frame(OutputImageRawFrame(
+                image=frame.image,
+                size=frame.size,
+                format=frame.format)
+            )
+        else:
+            await self.push_frame(frame, direction)
+
+
 async def main():
    async with aiohttp.ClientSession() as session:
        (room_url, token) = await configure(session)
@@ -44,7 +67,7 @@ async def main():
        async def on_first_participant_joined(transport, participant):
            transport.capture_participant_video(participant["id"])

-        pipeline = Pipeline([transport.input(), transport.output()])
+        pipeline = Pipeline([transport.input(), MirrorProcessor(), transport.output()])

        runner = PipelineRunner()

--- a/examples/foundational/09a-local-mirror.py
+++ b/examples/foundational/09a-local-mirror.py
@@ -10,9 +10,11 @@ import sys

 import tkinter as tk

+from pipecat.frames.frames import Frame, InputAudioRawFrame, InputImageRawFrame, OutputAudioRawFrame, OutputImageRawFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineTask
+from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
 from pipecat.transports.base_transport import TransportParams
 from pipecat.transports.local.tk import TkLocalTransport
 from pipecat.transports.services.daily import DailyParams, DailyTransport
@@ -27,6 +29,25 @@ load_dotenv(override=True)
 logger.remove(0)
 logger.add(sys.stderr, level="DEBUG")

+class MirrorProcessor(FrameProcessor):
+
+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, InputAudioRawFrame):
+            await self.push_frame(OutputAudioRawFrame(
+                audio=frame.audio,
+                sample_rate=frame.sample_rate,
+                num_channels=frame.num_channels)
+            )
+        elif isinstance(frame, InputImageRawFrame):
+            await self.push_frame(OutputImageRawFrame(
+                image=frame.image,
+                size=frame.size,
+                format=frame.format)
+            )
+        else:
+            await self.push_frame(frame, direction)

 async def main():
    async with aiohttp.ClientSession() as session:
@@ -52,7 +73,7 @@ async def main():
        async def on_first_participant_joined(transport, participant):
            transport.capture_participant_video(participant["id"])

-        pipeline = Pipeline([daily_transport.input(), tk_transport.output()])
+        pipeline = Pipeline([daily_transport.input(), MirrorProcessor(), tk_transport.output()])

        task = PipelineTask(pipeline)

--- a/examples/foundational/11-sound-effects.py
+++ b/examples/foundational/11-sound-effects.py
@@ -12,9 +12,9 @@ import wave

 from pipecat.frames.frames import (
    Frame,
-    AudioRawFrame,
    LLMFullResponseEndFrame,
    LLMMessagesFrame,
+    OutputAudioRawFrame,
 )
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
@@ -25,7 +25,7 @@ from pipecat.processors.aggregators.llm_response import (
 )
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
 from pipecat.processors.logger import FrameLogger
-from pipecat.services.elevenlabs import ElevenLabsTTSService
+from pipecat.services.cartesia import CartesiaHttpTTSService
 from pipecat.services.openai import OpenAILLMService
 from pipecat.transports.services.daily import DailyParams, DailyTransport
 from pipecat.vad.silero import SileroVADAnalyzer
@@ -53,8 +53,8 @@ for file in sound_files:
    filename = os.path.splitext(os.path.basename(full_path))[0]
    # Open the image and convert it to bytes
    with wave.open(full_path) as audio_file:
-        sounds[file] = AudioRawFrame(audio_file.readframes(-1),
-                                     audio_file.getframerate(), audio_file.getnchannels())
+        sounds[file] = OutputAudioRawFrame(audio_file.readframes(-1),
+                                           audio_file.getframerate(), audio_file.getnchannels())


 class OutboundSoundEffectWrapper(FrameProcessor):
@@ -103,10 +103,9 @@ async def main():
            api_key=os.getenv("OPENAI_API_KEY"),
            model="gpt-4o")

-        tts = ElevenLabsTTSService(
-            aiohttp_session=session,
-            api_key=os.getenv("ELEVENLABS_API_KEY"),
-            voice_id="ErXwobaYiN019PkySvjV",
+        tts = CartesiaHttpTTSService(
+            api_key=os.getenv("CARTESIA_API_KEY"),
+            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
        )

        messages = [
--- a/examples/foundational/17-detect-user-idle.py
+++ b/examples/foundational/17-detect-user-idle.py
@@ -70,7 +70,7 @@ async def main():
        async def user_idle_callback(user_idle: UserIdleProcessor):
            messages.append(
                {"role": "system", "content": "Ask the user if they are still there and try to prompt for some input, but be short."})
-            await user_idle.queue_frame(LLMMessagesFrame(messages))
+            await user_idle.push_frame(LLMMessagesFrame(messages))

        user_idle = UserIdleProcessor(callback=user_idle_callback, timeout=5.0)

--- a/examples/moondream-chatbot/bot.py
+++ b/examples/moondream-chatbot/bot.py
@@ -13,10 +13,11 @@ from PIL import Image

 from pipecat.frames.frames import (
    ImageRawFrame,
+    OutputImageRawFrame,
    SpriteFrame,
    Frame,
    LLMMessagesFrame,
-    AudioRawFrame,
+    TTSAudioRawFrame,
    TTSStoppedFrame,
    TextFrame,
    UserImageRawFrame,
@@ -59,7 +60,11 @@ for i in range(1, 26):
    # Get the filename without the extension to use as the dictionary key
    # Open the image and convert it to bytes
    with Image.open(full_path) as img:
-        sprites.append(ImageRawFrame(image=img.tobytes(), size=img.size, format=img.format))
+        sprites.append(OutputImageRawFrame(
+            image=img.tobytes(),
+            size=img.size,
+            format=img.format)
+        )

 flipped = sprites[::-1]
 sprites.extend(flipped)
@@ -82,7 +87,7 @@ class TalkingAnimation(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

-        if isinstance(frame, AudioRawFrame):
+        if isinstance(frame, TTSAudioRawFrame):
            if not self._is_talking:
                await self.push_frame(talking_frame)
                self._is_talking = True
--- a/examples/moondream-chatbot/requirements.txt
+++ b/examples/moondream-chatbot/requirements.txt
@@ -1,4 +1,4 @@
 python-dotenv
 fastapi[all]
 uvicorn
-pipecat-ai[daily,moondream,openai,silero]
+pipecat-ai[daily,cartesia,moondream,openai,silero]
--- a/examples/patient-intake/bot.py
+++ b/examples/patient-intake/bot.py
@@ -10,7 +10,7 @@ import os
 import sys
 import wave

-from pipecat.frames.frames import AudioRawFrame
+from pipecat.frames.frames import OutputAudioRawFrame
 from pipecat.pipeline.pipeline import Pipeline
 from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineParams, PipelineTask
@@ -49,8 +49,9 @@ for file in sound_files:
    filename = os.path.splitext(os.path.basename(full_path))[0]
    # Open the sound and convert it to bytes
    with wave.open(full_path) as audio_file:
-        sounds[file] = AudioRawFrame(audio_file.readframes(-1),
-                                     audio_file.getframerate(), audio_file.getnchannels())
+        sounds[file] = OutputAudioRawFrame(audio_file.readframes(-1),
+                                           audio_file.getframerate(),
+                                           audio_file.getnchannels())


 class IntakeProcessor:
--- a/examples/patient-intake/requirements.txt
+++ b/examples/patient-intake/requirements.txt
@@ -1,4 +1,4 @@
 python-dotenv
 fastapi[all]
 uvicorn
-pipecat-ai[daily,openai,silero]
+pipecat-ai[daily,cartesia,openai,silero]
--- a/examples/simple-chatbot/bot.py
+++ b/examples/simple-chatbot/bot.py
@@ -16,11 +16,11 @@ from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineParams, PipelineTask
 from pipecat.processors.aggregators.llm_response import LLMAssistantResponseAggregator, LLMUserResponseAggregator
 from pipecat.frames.frames import (
-    AudioRawFrame,
-    ImageRawFrame,
+    OutputImageRawFrame,
    SpriteFrame,
    Frame,
    LLMMessagesFrame,
+    TTSAudioRawFrame,
    TTSStoppedFrame
 )
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
@@ -49,7 +49,11 @@ for i in range(1, 26):
    # Get the filename without the extension to use as the dictionary key
    # Open the image and convert it to bytes
    with Image.open(full_path) as img:
-        sprites.append(ImageRawFrame(image=img.tobytes(), size=img.size, format=img.format))
+        sprites.append(OutputImageRawFrame(
+            image=img.tobytes(),
+            size=img.size,
+            format=img.format)
+        )

 flipped = sprites[::-1]
 sprites.extend(flipped)
@@ -72,7 +76,7 @@ class TalkingAnimation(FrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

-        if isinstance(frame, AudioRawFrame):
+        if isinstance(frame, TTSAudioRawFrame):
            if not self._is_talking:
                await self.push_frame(talking_frame)
                self._is_talking = True
@@ -111,7 +115,6 @@ async def main():
        )

        tts = ElevenLabsTTSService(
-            aiohttp_session=session,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            #
            # English
--- a/examples/simple-chatbot/requirements.txt
+++ b/examples/simple-chatbot/requirements.txt
@@ -1,4 +1,4 @@
 python-dotenv
 fastapi[all]
 uvicorn
-pipecat-ai[daily,openai,silero]
+pipecat-ai[daily,elevenlabs,openai,silero]
--- a/examples/storytelling-chatbot/requirements.txt
+++ b/examples/storytelling-chatbot/requirements.txt
@@ -2,4 +2,4 @@ async_timeout
 fastapi
 uvicorn
 python-dotenv
-pipecat-ai[daily,openai,fal]
+pipecat-ai[daily,elevenlabs,openai,fal]
--- a/examples/storytelling-chatbot/src/bot.py
+++ b/examples/storytelling-chatbot/src/bot.py
@@ -60,7 +60,6 @@ async def main(room_url, token=None):
        )

        tts_service = ElevenLabsTTSService(
-            aiohttp_session=session,
            api_key=os.getenv("ELEVENLABS_API_KEY"),
            voice_id=os.getenv("ELEVENLABS_VOICE_ID"),
        )
--- a/examples/storytelling-chatbot/src/utils/helpers.py
+++ b/examples/storytelling-chatbot/src/utils/helpers.py
@@ -2,7 +2,7 @@ import os
 import wave
 from PIL import Image

-from pipecat.frames.frames import AudioRawFrame, ImageRawFrame
+from pipecat.frames.frames import OutputAudioRawFrame, OutputImageRawFrame

 script_dir = os.path.dirname(__file__)

@@ -16,7 +16,8 @@ def load_images(image_files):
        filename = os.path.splitext(os.path.basename(full_path))[0]
        # Open the image and convert it to bytes
        with Image.open(full_path) as img:
-            images[filename] = ImageRawFrame(image=img.tobytes(), size=img.size, format=img.format)
+            images[filename] = OutputImageRawFrame(
+                image=img.tobytes(), size=img.size, format=img.format)
    return images


@@ -30,8 +31,8 @@ def load_sounds(sound_files):
        filename = os.path.splitext(os.path.basename(full_path))[0]
        # Open the sound and convert it to bytes
        with wave.open(full_path) as audio_file:
-            sounds[filename] = AudioRawFrame(audio=audio_file.readframes(-1),
-                                             sample_rate=audio_file.getframerate(),
-                                             num_channels=audio_file.getnchannels())
+            sounds[filename] = OutputAudioRawFrame(audio=audio_file.readframes(-1),
+                                                   sample_rate=audio_file.getframerate(),
+                                                   num_channels=audio_file.getnchannels())

    return sounds
--- a/examples/studypal/studypal.py
+++ b/examples/studypal/studypal.py
@@ -147,8 +147,8 @@ Your task is to help the user understand and learn from this article in 2 senten
            tma_in,
            llm,
            tts,
-            tma_out,
            transport.output(),
+            tma_out,
        ])

        task = PipelineTask(pipeline, PipelineParams(allow_interruptions=True, enable_metrics=True))
--- a/examples/twilio-chatbot/README.md
+++ b/examples/twilio-chatbot/README.md
@@ -55,7 +55,7 @@ This project is a FastAPI-based chatbot that integrates with Twilio to handle We
 2. **Update the Twilio Webhook**:
    Copy the ngrok URL and update your Twilio phone number webhook URL to `http://<ngrok_url>/start_call`.

-3. **Update the streams.xml**:
+3. **Update streams.xml**:
    Copy the ngrok URL and update templates/streams.xml with `wss://<ngrok_url>/ws`.

 ## Running the Application
--- a/examples/twilio-chatbot/bot.py
+++ b/examples/twilio-chatbot/bot.py
@@ -1,4 +1,3 @@
-import aiohttp
 import os
 import sys

@@ -27,63 +26,62 @@ logger.add(sys.stderr, level="DEBUG")


 async def run_bot(websocket_client, stream_sid):
-    async with aiohttp.ClientSession() as session:
-        transport = FastAPIWebsocketTransport(
-            websocket=websocket_client,
-            params=FastAPIWebsocketParams(
-                audio_out_enabled=True,
-                add_wav_header=False,
-                vad_enabled=True,
-                vad_analyzer=SileroVADAnalyzer(),
-                vad_audio_passthrough=True,
-                serializer=TwilioFrameSerializer(stream_sid)
-            )
+    transport = FastAPIWebsocketTransport(
+        websocket=websocket_client,
+        params=FastAPIWebsocketParams(
+            audio_out_enabled=True,
+            add_wav_header=False,
+            vad_enabled=True,
+            vad_analyzer=SileroVADAnalyzer(),
+            vad_audio_passthrough=True,
+            serializer=TwilioFrameSerializer(stream_sid)
        )
+    )

-        llm = OpenAILLMService(
-            api_key=os.getenv("OPENAI_API_KEY"),
-            model="gpt-4o")
+    llm = OpenAILLMService(
+        api_key=os.getenv("OPENAI_API_KEY"),
+        model="gpt-4o")

-        stt = DeepgramSTTService(api_key=os.getenv('DEEPGRAM_API_KEY'))
+    stt = DeepgramSTTService(api_key=os.getenv('DEEPGRAM_API_KEY'))

-        tts = CartesiaTTSService(
-            api_key=os.getenv("CARTESIA_API_KEY"),
-            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
-        )
+    tts = CartesiaTTSService(
+        api_key=os.getenv("CARTESIA_API_KEY"),
+        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
+    )

-        messages = [
-            {
-                "role": "system",
-                "content": "You are a helpful LLM in an audio call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
-            },
-        ]
+    messages = [
+        {
+            "role": "system",
+            "content": "You are a helpful LLM in an audio call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
+        },
+    ]

-        tma_in = LLMUserResponseAggregator(messages)
-        tma_out = LLMAssistantResponseAggregator(messages)
+    tma_in = LLMUserResponseAggregator(messages)
+    tma_out = LLMAssistantResponseAggregator(messages)

-        pipeline = Pipeline([
-            transport.input(),   # Websocket input from client
-            stt,                 # Speech-To-Text
-            tma_in,              # User responses
-            llm,                 # LLM
-            tts,                 # Text-To-Speech
-            transport.output(),  # Websocket output to client
-            tma_out              # LLM responses
-        ])
+    pipeline = Pipeline([
+        transport.input(),   # Websocket input from client
+        stt,                 # Speech-To-Text
+        tma_in,              # User responses
+        llm,                 # LLM
+        tts,                 # Text-To-Speech
+        transport.output(),  # Websocket output to client
+        tma_out              # LLM responses
+    ])

-        task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
+    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))

-        @transport.event_handler("on_client_connected")
-        async def on_client_connected(transport, client):
-            # Kick off the conversation.
-            messages.append(
-                {"role": "system", "content": "Please introduce yourself to the user."})
-            await task.queue_frames([LLMMessagesFrame(messages)])
+    @transport.event_handler("on_client_connected")
+    async def on_client_connected(transport, client):
+        # Kick off the conversation.
+        messages.append(
+            {"role": "system", "content": "Please introduce yourself to the user."})
+        await task.queue_frames([LLMMessagesFrame(messages)])

-        @transport.event_handler("on_client_disconnected")
-        async def on_client_disconnected(transport, client):
-            await task.queue_frames([EndFrame()])
+    @transport.event_handler("on_client_disconnected")
+    async def on_client_disconnected(transport, client):
+        await task.queue_frames([EndFrame()])

-        runner = PipelineRunner(handle_sigint=False)
+    runner = PipelineRunner(handle_sigint=False)

-        await runner.run(task)
+    await runner.run(task)
--- a/examples/twilio-chatbot/requirements.txt
+++ b/examples/twilio-chatbot/requirements.txt
@@ -1,4 +1,4 @@
-pipecat-ai[daily,openai,silero,deepgram]
+pipecat-ai[daily,cartesia,openai,silero,deepgram]
 fastapi
 uvicorn
 python-dotenv
--- a/examples/websocket-server/bot.py
+++ b/examples/websocket-server/bot.py
@@ -4,7 +4,6 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

-import aiohttp
 import asyncio
 import os
 import sys
@@ -33,60 +32,59 @@ logger.add(sys.stderr, level="DEBUG")


 async def main():
-    async with aiohttp.ClientSession() as session:
-        transport = WebsocketServerTransport(
-            params=WebsocketServerParams(
-                audio_out_enabled=True,
-                add_wav_header=True,
-                vad_enabled=True,
-                vad_analyzer=SileroVADAnalyzer(),
-                vad_audio_passthrough=True
-            )
+    transport = WebsocketServerTransport(
+        params=WebsocketServerParams(
+            audio_out_enabled=True,
+            add_wav_header=True,
+            vad_enabled=True,
+            vad_analyzer=SileroVADAnalyzer(),
+            vad_audio_passthrough=True
        )
+    )

-        llm = OpenAILLMService(
-            api_key=os.getenv("OPENAI_API_KEY"),
-            model="gpt-4o")
+    llm = OpenAILLMService(
+        api_key=os.getenv("OPENAI_API_KEY"),
+        model="gpt-4o")

-        stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
+    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

-        tts = CartesiaTTSService(
-            api_key=os.getenv("CARTESIA_API_KEY"),
-            voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
-        )
+    tts = CartesiaTTSService(
+        api_key=os.getenv("CARTESIA_API_KEY"),
+        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # British Lady
+    )

-        messages = [
-            {
-                "role": "system",
-                "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
-            },
-        ]
+    messages = [
+        {
+            "role": "system",
+            "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be converted to audio so don't include special characters in your answers. Respond to what the user said in a creative and helpful way.",
+        },
+    ]

-        tma_in = LLMUserResponseAggregator(messages)
-        tma_out = LLMAssistantResponseAggregator(messages)
+    tma_in = LLMUserResponseAggregator(messages)
+    tma_out = LLMAssistantResponseAggregator(messages)

-        pipeline = Pipeline([
-            transport.input(),   # Websocket input from client
-            stt,                 # Speech-To-Text
-            tma_in,              # User responses
-            llm,                 # LLM
-            tts,                 # Text-To-Speech
-            transport.output(),  # Websocket output to client
-            tma_out              # LLM responses
-        ])
+    pipeline = Pipeline([
+        transport.input(),   # Websocket input from client
+        stt,                 # Speech-To-Text
+        tma_in,              # User responses
+        llm,                 # LLM
+        tts,                 # Text-To-Speech
+        transport.output(),  # Websocket output to client
+        tma_out              # LLM responses
+    ])

-        task = PipelineTask(pipeline)
+    task = PipelineTask(pipeline)

-        @transport.event_handler("on_client_connected")
-        async def on_client_connected(transport, client):
-            # Kick off the conversation.
-            messages.append(
-                {"role": "system", "content": "Please introduce yourself to the user."})
-            await task.queue_frames([LLMMessagesFrame(messages)])
+    @transport.event_handler("on_client_connected")
+    async def on_client_connected(transport, client):
+        # Kick off the conversation.
+        messages.append(
+            {"role": "system", "content": "Please introduce yourself to the user."})
+        await task.queue_frames([LLMMessagesFrame(messages)])

-        runner = PipelineRunner()
+    runner = PipelineRunner()

-        await runner.run(task)
+    await runner.run(task)

 if __name__ == "__main__":
    asyncio.run(main())
--- a/examples/websocket-server/frames.proto
+++ b/examples/websocket-server/frames.proto
@@ -24,6 +24,7 @@ message AudioRawFrame {
  bytes audio = 3;
  uint32 sample_rate = 4;
  uint32 num_channels = 5;
+  optional uint64 pts = 6;
 }

 message TranscriptionFrame {
--- a/examples/websocket-server/requirements.txt
+++ b/examples/websocket-server/requirements.txt
@@ -1,2 +1,2 @@
 python-dotenv
-pipecat-ai[openai,silero,websocket,whisper]
+pipecat-ai[cartesia,openai,silero,websocket,whisper]
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -36,9 +36,10 @@ Website = "https://pipecat.ai"
 [project.optional-dependencies]
 anthropic = [ "anthropic~=0.34.0" ]
 azure = [ "azure-cognitiveservices-speech~=1.40.0" ]
-cartesia = [ "websockets~=12.0" ]
+cartesia = [ "cartesia~=1.0.13", "websockets~=12.0" ]
 daily = [ "daily-python~=0.10.1" ]
 deepgram = [ "deepgram-sdk~=3.5.0" ]
+elevenlabs = [ "websockets~=12.0" ]
 examples = [ "python-dotenv~=1.0.1", "flask~=3.0.3", "flask_cors~=4.0.1" ]
 fal = [ "fal-client~=0.4.1" ]
 gladia = [ "websockets~=12.0" ]
@@ -47,6 +48,7 @@ gstreamer = [ "pygobject~=3.48.2" ]
 fireworks = [ "openai~=1.37.2" ]
 langchain = [ "langchain~=0.2.14", "langchain-community~=0.2.12", "langchain-openai~=0.1.20" ]
 livekit = [ "livekit~=0.13.1" ]
+lmnt = [ "lmnt~=1.1.4" ]
 local = [ "pyaudio~=0.2.14" ]
 moondream = [ "einops~=0.8.0", "timm~=1.0.8", "transformers~=4.44.0" ]
 openai = [ "openai~=1.37.2" ]
--- a/src/pipecat/clocks/init.py
+++ b/src/pipecat/clocks/init.py
--- a/src/pipecat/clocks/base_clock.py
+++ b/src/pipecat/clocks/base_clock.py
@@ -0,0 +1,18 @@
+#
+# Copyright (c) 2024, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+from abc import ABC, abstractmethod
+
+
+class BaseClock(ABC):
+
+    @abstractmethod
+    def get_time(self) -> int:
+        pass
+
+    @abstractmethod
+    def start(self):
+        pass
--- a/src/pipecat/clocks/system_clock.py
+++ b/src/pipecat/clocks/system_clock.py
@@ -0,0 +1,21 @@
+#
+# Copyright (c) 2024, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import time
+
+from pipecat.clocks.base_clock import BaseClock
+
+
+class SystemClock(BaseClock):
+
+    def __init__(self):
+        self._time = 0
+
+    def get_time(self) -> int:
+        return time.monotonic_ns() - self._time if self._time > 0 else 0
+
+    def start(self):
+        self._time = time.monotonic_ns()
--- a/src/pipecat/frames/frames.proto
+++ b/src/pipecat/frames/frames.proto
@@ -24,6 +24,7 @@ message AudioRawFrame {
  bytes audio = 3;
  uint32 sample_rate = 4;
  uint32 num_channels = 5;
+  optional uint64 pts = 6;
 }

 message TranscriptionFrame {
--- a/src/pipecat/frames/frames.py
+++ b/src/pipecat/frames/frames.py
@@ -4,22 +4,32 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

-from typing import Any, List, Mapping, Optional, Tuple
+from typing import Any, List, Optional, Tuple

 from dataclasses import dataclass, field

+from pipecat.clocks.base_clock import BaseClock
+from pipecat.metrics.metrics import MetricsData
+from pipecat.transcriptions.language import Language
+from pipecat.utils.time import nanoseconds_to_str
 from pipecat.utils.utils import obj_count, obj_id
 from pipecat.vad.vad_analyzer import VADParams


+def format_pts(pts: int | None):
+    return nanoseconds_to_str(pts) if pts else None
+
+
@dataclass
 class Frame:
    id: int = field(init=False)
    name: str = field(init=False)
+    pts: Optional[int] = field(init=False)

    def __post_init__(self):
        self.id: int = obj_id()
        self.name: str = f"{self.__class__.__name__}#{obj_count(self)}"
+        self.pts: Optional[int] = None

    def __str__(self):
        return self.name
@@ -32,10 +42,7 @@ class DataFrame(Frame):

@dataclass
 class AudioRawFrame(DataFrame):
-    """A chunk of audio. Will be played by the transport if the transport's
-    microphone has been enabled.
-
-    """
+    """A chunk of audio."""
    audio: bytes
    sample_rate: int
    num_channels: int
@@ -45,7 +52,33 @@ class AudioRawFrame(DataFrame):
        self.num_frames = int(len(self.audio) / (self.num_channels * 2))

    def __str__(self):
-        return f"{self.name}(size: {len(self.audio)}, frames: {self.num_frames}, sample_rate: {self.sample_rate}, channels: {self.num_channels})"
+        pts = format_pts(self.pts)
+        return f"{self.name}(pts: {pts}, size: {len(self.audio)}, frames: {self.num_frames}, sample_rate: {self.sample_rate}, channels: {self.num_channels})"
+
+
+@dataclass
+class InputAudioRawFrame(AudioRawFrame):
+    """A chunk of audio usually coming from an input transport.
+
+    """
+    pass
+
+
+@dataclass
+class OutputAudioRawFrame(AudioRawFrame):
+    """A chunk of audio. Will be played by the output transport if the
+    transport's microphone has been enabled.
+
+    """
+    pass
+
+
+@dataclass
+class TTSAudioRawFrame(OutputAudioRawFrame):
+    """A chunk of output audio generated by a TTS service.
+
+    """
+    pass


@dataclass
@@ -59,35 +92,22 @@ class ImageRawFrame(DataFrame):
    format: str | None

    def __str__(self):
-        return f"{self.name}(size: {self.size}, format: {self.format})"
+        pts = format_pts(self.pts)
+        return f"{self.name}(pts: {pts}, size: {self.size}, format: {self.format})"


@dataclass
-class URLImageRawFrame(ImageRawFrame):
-    """An image with an associated URL. Will be shown by the transport if the
-    transport's camera is enabled.
-
-    """
-    url: str | None
-
-    def __str__(self):
-        return f"{self.name}(url: {self.url}, size: {self.size}, format: {self.format})"
+class InputImageRawFrame(ImageRawFrame):
+    pass


@dataclass
-class VisionImageRawFrame(ImageRawFrame):
-    """An image with an associated text to ask for a description of it. Will be
-    shown by the transport if the transport's camera is enabled.
-
-    """
-    text: str | None
-
-    def __str__(self):
-        return f"{self.name}(text: {self.text}, size: {self.size}, format: {self.format})"
+class OutputImageRawFrame(ImageRawFrame):
+    pass


@dataclass
-class UserImageRawFrame(ImageRawFrame):
+class UserImageRawFrame(InputImageRawFrame):
    """An image associated to a user. Will be shown by the transport if the
    transport's camera is enabled.

@@ -95,7 +115,34 @@ class UserImageRawFrame(ImageRawFrame):
    user_id: str

    def __str__(self):
-        return f"{self.name}(user: {self.user_id}, size: {self.size}, format: {self.format})"
+        pts = format_pts(self.pts)
+        return f"{self.name}(pts: {pts}, user: {self.user_id}, size: {self.size}, format: {self.format})"
+
+
+@dataclass
+class VisionImageRawFrame(InputImageRawFrame):
+    """An image with an associated text to ask for a description of it. Will be
+    shown by the transport if the transport's camera is enabled.
+
+    """
+    text: str | None
+
+    def __str__(self):
+        pts = format_pts(self.pts)
+        return f"{self.name}(pts: {pts}, text: {self.text}, size: {self.size}, format: {self.format})"
+
+
+@dataclass
+class URLImageRawFrame(OutputImageRawFrame):
+    """An image with an associated URL. Will be shown by the transport if the
+    transport's camera is enabled.
+
+    """
+    url: str | None
+
+    def __str__(self):
+        pts = format_pts(self.pts)
+        return f"{self.name}(pts: {pts}, url: {self.url}, size: {self.size}, format: {self.format})"


@dataclass
@@ -108,7 +155,8 @@ class SpriteFrame(Frame):
    images: List[ImageRawFrame]

    def __str__(self):
-        return f"{self.name}(size: {len(self.images)})"
+        pts = format_pts(self.pts)
+        return f"{self.name}(pts: {pts}, size: {len(self.images)})"


@dataclass
@@ -120,7 +168,8 @@ class TextFrame(DataFrame):
    text: str

    def __str__(self):
-        return f"{self.name}(text: {self.text})"
+        pts = format_pts(self.pts)
+        return f"{self.name}(pts: {pts}, text: {self.text})"


@dataclass
@@ -131,9 +180,10 @@ class TranscriptionFrame(TextFrame):
    """
    user_id: str
    timestamp: str
+    language: Language | None = None

    def __str__(self):
-        return f"{self.name}(user: {self.user_id}, text: {self.text}, timestamp: {self.timestamp})"
+        return f"{self.name}(user: {self.user_id}, text: {self.text}, language: {self.language}, timestamp: {self.timestamp})"


@dataclass
@@ -142,9 +192,10 @@ class InterimTranscriptionFrame(TextFrame):
    the transport's receive queue when a participant speaks."""
    user_id: str
    timestamp: str
+    language: Language | None = None

    def __str__(self):
-        return f"{self.name}(user: {self.user_id}, text: {self.text}, timestamp: {self.timestamp})"
+        return f"{self.name}(user: {self.user_id}, text: {self.text}, language: {self.language}, timestamp: {self.timestamp})"


@dataclass
@@ -230,6 +281,16 @@ class SystemFrame(Frame):
    pass


+@dataclass
+class StartFrame(SystemFrame):
+    """This is the first frame that should be pushed down a pipeline."""
+    clock: BaseClock
+    allow_interruptions: bool = False
+    enable_metrics: bool = False
+    enable_usage_metrics: bool = False
+    report_only_initial_ttfb: bool = False
+
+
@dataclass
 class CancelFrame(SystemFrame):
    """Indicates that a pipeline needs to stop right away."""
@@ -305,10 +366,8 @@ class BotInterruptionFrame(SystemFrame):
 class MetricsFrame(SystemFrame):
    """Emitted by processor that can compute metrics like latencies.
    """
-    ttfb: List[Mapping[str, Any]] | None = None
-    processing: List[Mapping[str, Any]] | None = None
-    tokens: List[Mapping[str, Any]] | None = None
-    characters: List[Mapping[str, Any]] | None = None
+    data: List[MetricsData]
+

 #
 # Control frames
@@ -320,15 +379,6 @@ class ControlFrame(Frame):
    pass


-@dataclass
-class StartFrame(ControlFrame):
-    """This is the first frame that should be pushed down a pipeline."""
-    allow_interruptions: bool = False
-    enable_metrics: bool = False
-    enable_usage_metrics: bool = False
-    report_only_initial_ttfb: bool = False
-
-
@dataclass
 class EndFrame(ControlFrame):
    """Indicates that a pipeline has ended and frame processors and pipelines
@@ -401,10 +451,10 @@ class BotSpeakingFrame(ControlFrame):
@dataclass
 class TTSStartedFrame(ControlFrame):
    """Used to indicate the beginning of a TTS response. Following
-    AudioRawFrames are part of the TTS response until an TTSEndFrame. These
-    frames can be used for aggregating audio frames in a transport to optimize
-    the size of frames sent to the session, without needing to control this in
-    the TTS service.
+    TTSAudioRawFrames are part of the TTS response until an
+    TTSStoppedFrame. These frames can be used for aggregating audio frames in a
+    transport to optimize the size of frames sent to the session, without
+    needing to control this in the TTS service.

    """
    pass
@@ -433,6 +483,73 @@ class LLMModelUpdateFrame(ControlFrame):
    model: str


+@dataclass
+class LLMTemperatureUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM temperature.
+    """
+    temperature: float
+
+
+@dataclass
+class LLMTopKUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM top_k.
+    """
+    top_k: int
+
+
+@dataclass
+class LLMTopPUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM top_p.
+    """
+    top_p: float
+
+
+@dataclass
+class LLMFrequencyPenaltyUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM frequency
+    penalty.
+
+    """
+    frequency_penalty: float
+
+
+@dataclass
+class LLMPresencePenaltyUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM presence
+    penalty.
+
+    """
+    presence_penalty: float
+
+
+@dataclass
+class LLMMaxTokensUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM max tokens.
+    """
+    max_tokens: int
+
+
+@dataclass
+class LLMSeedUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM seed.
+    """
+    seed: int
+
+
+@dataclass
+class LLMExtraUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new LLM extra params.
+    """
+    extra: dict
+
+
+@dataclass
+class TTSModelUpdateFrame(ControlFrame):
+    """A control frame containing a request to update the TTS model.
+    """
+    model: str
+
+
@dataclass
 class TTSVoiceUpdateFrame(ControlFrame):
    """A control frame containing a request to update to a new TTS voice.
@@ -440,6 +557,31 @@ class TTSVoiceUpdateFrame(ControlFrame):
    voice: str


+@dataclass
+class TTSLanguageUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to a new TTS language and
+    optional voice.
+
+    """
+    language: Language
+
+
+@dataclass
+class STTModelUpdateFrame(ControlFrame):
+    """A control frame containing a request to update the STT model and optional
+    language.
+
+    """
+    model: str
+
+
+@dataclass
+class STTLanguageUpdateFrame(ControlFrame):
+    """A control frame containing a request to update to STT language.
+    """
+    language: Language
+
+
@dataclass
 class FunctionCallInProgressFrame(SystemFrame):
    """A frame signaling that a function call is in progress.
--- a/src/pipecat/frames/protobufs/frames_pb2.py
+++ b/src/pipecat/frames/protobufs/frames_pb2.py
@@ -14,7 +14,7 @@ _sym_db = _symbol_database.Default()



-DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x0c\x66rames.proto\x12\x07pipecat\"3\n\tTextFrame\x12\n\n\x02id\x18\x01 \x01(\x04\x12\x0c\n\x04name\x18\x02 \x01(\t\x12\x0c\n\x04text\x18\x03 \x01(\t\"c\n\rAudioRawFrame\x12\n\n\x02id\x18\x01 \x01(\x04\x12\x0c\n\x04name\x18\x02 \x01(\t\x12\r\n\x05\x61udio\x18\x03 \x01(\x0c\x12\x13\n\x0bsample_rate\x18\x04 \x01(\r\x12\x14\n\x0cnum_channels\x18\x05 \x01(\r\"`\n\x12TranscriptionFrame\x12\n\n\x02id\x18\x01 \x01(\x04\x12\x0c\n\x04name\x18\x02 \x01(\t\x12\x0c\n\x04text\x18\x03 \x01(\t\x12\x0f\n\x07user_id\x18\x04 \x01(\t\x12\x11\n\ttimestamp\x18\x05 \x01(\t\"\x93\x01\n\x05\x46rame\x12\"\n\x04text\x18\x01 \x01(\x0b\x32\x12.pipecat.TextFrameH\x00\x12\'\n\x05\x61udio\x18\x02 \x01(\x0b\x32\x16.pipecat.AudioRawFrameH\x00\x12\x34\n\rtranscription\x18\x03 \x01(\x0b\x32\x1b.pipecat.TranscriptionFrameH\x00\x42\x07\n\x05\x66rameb\x06proto3')
+DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x0c\x66rames.proto\x12\x07pipecat\"3\n\tTextFrame\x12\n\n\x02id\x18\x01 \x01(\x04\x12\x0c\n\x04name\x18\x02 \x01(\t\x12\x0c\n\x04text\x18\x03 \x01(\t\"}\n\rAudioRawFrame\x12\n\n\x02id\x18\x01 \x01(\x04\x12\x0c\n\x04name\x18\x02 \x01(\t\x12\r\n\x05\x61udio\x18\x03 \x01(\x0c\x12\x13\n\x0bsample_rate\x18\x04 \x01(\r\x12\x14\n\x0cnum_channels\x18\x05 \x01(\r\x12\x10\n\x03pts\x18\x06 \x01(\x04H\x00\x88\x01\x01\x42\x06\n\x04_pts\"`\n\x12TranscriptionFrame\x12\n\n\x02id\x18\x01 \x01(\x04\x12\x0c\n\x04name\x18\x02 \x01(\t\x12\x0c\n\x04text\x18\x03 \x01(\t\x12\x0f\n\x07user_id\x18\x04 \x01(\t\x12\x11\n\ttimestamp\x18\x05 \x01(\t\"\x93\x01\n\x05\x46rame\x12\"\n\x04text\x18\x01 \x01(\x0b\x32\x12.pipecat.TextFrameH\x00\x12\'\n\x05\x61udio\x18\x02 \x01(\x0b\x32\x16.pipecat.AudioRawFrameH\x00\x12\x34\n\rtranscription\x18\x03 \x01(\x0b\x32\x1b.pipecat.TranscriptionFrameH\x00\x42\x07\n\x05\x66rameb\x06proto3')

 _globals = globals()
 _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
@@ -24,9 +24,9 @@ if _descriptor._USE_C_DESCRIPTORS == False:
  _globals['_TEXTFRAME']._serialized_start=25
  _globals['_TEXTFRAME']._serialized_end=76
  _globals['_AUDIORAWFRAME']._serialized_start=78
-  _globals['_AUDIORAWFRAME']._serialized_end=177
-  _globals['_TRANSCRIPTIONFRAME']._serialized_start=179
-  _globals['_TRANSCRIPTIONFRAME']._serialized_end=275
-  _globals['_FRAME']._serialized_start=278
-  _globals['_FRAME']._serialized_end=425
+  _globals['_AUDIORAWFRAME']._serialized_end=203
+  _globals['_TRANSCRIPTIONFRAME']._serialized_start=205
+  _globals['_TRANSCRIPTIONFRAME']._serialized_end=301
+  _globals['_FRAME']._serialized_start=304
+  _globals['_FRAME']._serialized_end=451
 # @@protoc_insertion_point(module_scope)
--- a/src/pipecat/metrics/init.py
+++ b/src/pipecat/metrics/init.py
--- a/src/pipecat/metrics/metrics.py
+++ b/src/pipecat/metrics/metrics.py
@@ -0,0 +1,31 @@
+from typing import Optional
+from pydantic import BaseModel
+
+
+class MetricsData(BaseModel):
+    processor: str
+    model: Optional[str] = None
+
+
+class TTFBMetricsData(MetricsData):
+    value: float
+
+
+class ProcessingMetricsData(MetricsData):
+    value: float
+
+
+class LLMTokenUsage(BaseModel):
+    prompt_tokens: int
+    completion_tokens: int
+    total_tokens: int
+    cache_read_input_tokens: Optional[int] = None
+    cache_creation_input_tokens: Optional[int] = None
+
+
+class LLMUsageMetricsData(MetricsData):
+    value: LLMTokenUsage
+
+
+class TTSUsageMetricsData(MetricsData):
+    value: int
--- a/src/pipecat/pipeline/sync_parallel_pipeline.py
+++ b/src/pipecat/pipeline/sync_parallel_pipeline.py
@@ -49,12 +49,12 @@ class Sink(FrameProcessor):
                await self._down_queue.put(frame)


-class ParallelTask(BasePipeline):
+class SyncParallelPipeline(BasePipeline):
    def __init__(self, *args):
        super().__init__()

        if len(args) == 0:
-            raise Exception(f"ParallelTask needs at least one argument")
+            raise Exception(f"SyncParallelPipeline needs at least one argument")

        self._sinks = []
        self._sources = []
@@ -66,7 +66,7 @@ class ParallelTask(BasePipeline):
        logger.debug(f"Creating {self} pipelines")
        for processors in args:
            if not isinstance(processors, list):
-                raise TypeError(f"ParallelTask argument {processors} is not a list")
+                raise TypeError(f"SyncParallelPipeline argument {processors} is not a list")

            # We add a source at the beginning of the pipeline and a sink at the end.
            source = Source(self._up_queue)
--- a/src/pipecat/pipeline/task.py
+++ b/src/pipecat/pipeline/task.py
@@ -10,6 +10,8 @@ from typing import AsyncIterable, Iterable

 from pydantic import BaseModel

+from pipecat.clocks.base_clock import BaseClock
+from pipecat.clocks.system_clock import SystemClock
 from pipecat.frames.frames import (
    CancelFrame,
    EndFrame,
@@ -18,6 +20,7 @@ from pipecat.frames.frames import (
    MetricsFrame,
    StartFrame,
    StopTaskFrame)
+from pipecat.metrics.metrics import TTFBMetricsData, ProcessingMetricsData
 from pipecat.pipeline.base_pipeline import BasePipeline
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
 from pipecat.utils.utils import obj_count, obj_id
@@ -60,11 +63,16 @@ class Source(FrameProcessor):

 class PipelineTask:

-    def __init__(self, pipeline: BasePipeline, params: PipelineParams = PipelineParams()):
+    def __init__(
+            self,
+            pipeline: BasePipeline,
+            params: PipelineParams = PipelineParams(),
+            clock: BaseClock = SystemClock()):
        self.id: int = obj_id()
        self.name: str = f"{self.__class__.__name__}#{obj_count(self)}"

        self._pipeline = pipeline
+        self._clock = clock
        self._params = params
        self._finished = False

@@ -111,16 +119,21 @@ class PipelineTask:

    def _initial_metrics_frame(self) -> MetricsFrame:
        processors = self._pipeline.processors_with_metrics()
-        ttfb = [{"processor": p.name, "value": 0.0} for p in processors]
-        processing = [{"processor": p.name, "value": 0.0} for p in processors]
-        return MetricsFrame(ttfb=ttfb, processing=processing)
+        data = []
+        for p in processors:
+            data.append(TTFBMetricsData(processor=p.name, value=0.0))
+            data.append(ProcessingMetricsData(processor=p.name, value=0.0))
+        return MetricsFrame(data=data)

    async def _process_down_queue(self):
+        self._clock.start()
+
        start_frame = StartFrame(
            allow_interruptions=self._params.allow_interruptions,
            enable_metrics=self._params.enable_metrics,
            enable_usage_metrics=self._params.enable_metrics,
-            report_only_initial_ttfb=self._params.report_only_initial_ttfb
+            report_only_initial_ttfb=self._params.report_only_initial_ttfb,
+            clock=self._clock
        )
        await self._source.process_frame(start_frame, FrameDirection.DOWNSTREAM)

--- a/src/pipecat/pipeline/to_be_updated/merge_pipeline.py
+++ b/src/pipecat/pipeline/to_be_updated/merge_pipeline.py
@@ -1,5 +1,5 @@
 from typing import List
-from pipecat.pipeline.frames import EndFrame, EndPipeFrame
+from pipecat.frames.frames import EndFrame, EndPipeFrame
 from pipecat.pipeline.pipeline import Pipeline


--- a/src/pipecat/processors/aggregators/gated.py
+++ b/src/pipecat/processors/aggregators/gated.py
@@ -17,7 +17,8 @@ class GatedAggregator(FrameProcessor):
    Yields gate-opening frame before any accumulated frames, then ensuing frames
    until and not including the gate-closed frame.

-    >>> from pipecat.pipeline.frames import ImageFrame
+    Doctest: FIXME to work with asyncio
+    >>> from pipecat.frames.frames import ImageRawFrame

    >>> async def print_frames(aggregator, frame):
    ...     async for frame in aggregator.process_frame(frame):
@@ -28,12 +29,12 @@ class GatedAggregator(FrameProcessor):

    >>> aggregator = GatedAggregator(
    ...     gate_close_fn=lambda x: isinstance(x, LLMResponseStartFrame),
-    ...     gate_open_fn=lambda x: isinstance(x, ImageFrame),
+    ...     gate_open_fn=lambda x: isinstance(x, ImageRawFrame),
    ...     start_open=False)
    >>> asyncio.run(print_frames(aggregator, TextFrame("Hello")))
    >>> asyncio.run(print_frames(aggregator, TextFrame("Hello again.")))
-    >>> asyncio.run(print_frames(aggregator, ImageFrame(image=bytes([]), size=(0, 0))))
-    ImageFrame
+    >>> asyncio.run(print_frames(aggregator, ImageRawFrame(image=bytes([]), size=(0, 0))))
+    ImageRawFrame
    Hello
    Hello again.
    >>> asyncio.run(print_frames(aggregator, TextFrame("Goodbye.")))
--- a/src/pipecat/processors/aggregators/llm_response.py
+++ b/src/pipecat/processors/aggregators/llm_response.py
@@ -4,8 +4,7 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

-import sys
-from typing import List
+from typing import List, Type

 from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContextFrame, OpenAILLMContext

@@ -35,8 +34,8 @@ class LLMResponseAggregator(FrameProcessor):
        role: str,
        start_frame,
        end_frame,
-        accumulator_frame: TextFrame,
-        interim_accumulator_frame: TextFrame | None = None,
+        accumulator_frame: Type[TextFrame],
+        interim_accumulator_frame: Type[TextFrame] | None = None,
        handle_interruptions: bool = False
    ):
        super().__init__()
@@ -109,7 +108,7 @@ class LLMResponseAggregator(FrameProcessor):
            await self.push_frame(frame, direction)
        elif isinstance(frame, self._accumulator_frame):
            if self._aggregating:
-                self._aggregation += f" {frame.text}"
+                self._aggregation += f" {frame.text}" if self._aggregation else frame.text
                # We have recevied a complete sentence, so if we have seen the
                # end frame and we were still aggregating, it means we should
                # send the aggregation.
--- a/src/pipecat/processors/aggregators/openai_llm_context.py
+++ b/src/pipecat/processors/aggregators/openai_llm_context.py
@@ -13,7 +13,11 @@ from typing import Any, Awaitable, Callable, List

 from PIL import Image

-from pipecat.frames.frames import Frame, VisionImageRawFrame, FunctionCallInProgressFrame, FunctionCallResultFrame
+from pipecat.frames.frames import (
+    Frame,
+    VisionImageRawFrame,
+    FunctionCallInProgressFrame,
+    FunctionCallResultFrame)
 from pipecat.processors.frame_processor import FrameProcessor

 from loguru import logger
--- a/src/pipecat/processors/aggregators/sentence.py
+++ b/src/pipecat/processors/aggregators/sentence.py
@@ -16,7 +16,8 @@ class SentenceAggregator(FrameProcessor):
        TextFrame("Hello,") -> None
        TextFrame(" world.") -> TextFrame("Hello world.")

-    Doctest:
+    Doctest: FIXME to work with asyncio
+    >>> import asyncio
    >>> async def print_frames(aggregator, frame):
    ...     async for frame in aggregator.process_frame(frame):
    ...         print(frame.text)
--- a/src/pipecat/processors/aggregators/user_response.py
+++ b/src/pipecat/processors/aggregators/user_response.py
@@ -25,7 +25,7 @@ class ResponseAggregator(FrameProcessor):
        TranscriptionFrame(" world.") -> None
        UserStoppedSpeakingFrame() -> TextFrame("Hello world.")

-    Doctest:
+    Doctest: FIXME to work with asyncio
    >>> async def print_frames(aggregator, frame):
    ...     async for frame in aggregator.process_frame(frame):
    ...         if isinstance(frame, TextFrame):
--- a/src/pipecat/processors/aggregators/vision_image_frame.py
+++ b/src/pipecat/processors/aggregators/vision_image_frame.py
@@ -4,15 +4,21 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

-from pipecat.frames.frames import Frame, ImageRawFrame, TextFrame, VisionImageRawFrame
+from pipecat.frames.frames import (
+    Frame,
+    InputImageRawFrame,
+    TextFrame,
+    VisionImageRawFrame
+)
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


 class VisionImageFrameAggregator(FrameProcessor):
    """This aggregator waits for a consecutive TextFrame and an
-    ImageFrame. After the ImageFrame arrives it will output a VisionImageFrame.
+    InputImageRawFrame. After the InputImageRawFrame arrives it will output a
+    VisionImageRawFrame.

-    >>> from pipecat.pipeline.frames import ImageFrame
+    >>> from pipecat.frames.frames import ImageFrame

    >>> async def print_frames(aggregator, frame):
    ...     async for frame in aggregator.process_frame(frame):
@@ -34,7 +40,7 @@ class VisionImageFrameAggregator(FrameProcessor):

        if isinstance(frame, TextFrame):
            self._describe_text = frame.text
-        elif isinstance(frame, ImageRawFrame):
+        elif isinstance(frame, InputImageRawFrame):
            if self._describe_text:
                frame = VisionImageRawFrame(
                    text=self._describe_text,
--- a/src/pipecat/processors/async_frame_processor.py
+++ b/src/pipecat/processors/async_frame_processor.py
@@ -1,64 +0,0 @@
-#
-# Copyright (c) 2024, Daily
-#
-# SPDX-License-Identifier: BSD 2-Clause License
-#
-
-import asyncio
-
-from pipecat.frames.frames import EndFrame, Frame, StartInterruptionFrame
-from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
-
-
-class AsyncFrameProcessor(FrameProcessor):
-
-    def __init__(
-            self,
-            *,
-            name: str | None = None,
-            loop: asyncio.AbstractEventLoop | None = None,
-            **kwargs):
-        super().__init__(name=name, loop=loop, **kwargs)
-
-        self._create_push_task()
-
-    async def process_frame(self, frame: Frame, direction: FrameDirection):
-        await super().process_frame(frame, direction)
-
-        if isinstance(frame, StartInterruptionFrame):
-            await self._handle_interruptions(frame)
-
-    async def queue_frame(
-            self,
-            frame: Frame,
-            direction: FrameDirection = FrameDirection.DOWNSTREAM):
-        await self._push_queue.put((frame, direction))
-
-    async def cleanup(self):
-        self._push_frame_task.cancel()
-        await self._push_frame_task
-
-    async def _handle_interruptions(self, frame: Frame):
-        # Cancel the task. This will stop pushing frames downstream.
-        self._push_frame_task.cancel()
-        await self._push_frame_task
-        # Push an out-of-band frame (i.e. not using the ordered push
-        # frame task).
-        await self.push_frame(frame)
-        # Create a new queue and task.
-        self._create_push_task()
-
-    def _create_push_task(self):
-        self._push_queue = asyncio.Queue()
-        self._push_frame_task = self.get_event_loop().create_task(self._push_frame_task_handler())
-
-    async def _push_frame_task_handler(self):
-        running = True
-        while running:
-            try:
-                (frame, direction) = await self._push_queue.get()
-                await self.push_frame(frame, direction)
-                running = not isinstance(frame, EndFrame)
-                self._push_queue.task_done()
-            except asyncio.CancelledError:
-                break
--- a/src/pipecat/processors/frame_processor.py
+++ b/src/pipecat/processors/frame_processor.py
@@ -9,13 +9,23 @@ import time

 from enum import Enum

+from pipecat.clocks.base_clock import BaseClock
 from pipecat.frames.frames import (
+    EndFrame,
    ErrorFrame,
    Frame,
    MetricsFrame,
    StartFrame,
    StartInterruptionFrame,
-    UserStoppedSpeakingFrame)
+    StopInterruptionFrame,
+    SystemFrame)
+from pipecat.metrics.metrics import (
+    LLMTokenUsage,
+    LLMUsageMetricsData,
+    MetricsData,
+    ProcessingMetricsData,
+    TTFBMetricsData,
+    TTSUsageMetricsData)
 from pipecat.utils.utils import obj_count, obj_id

 from loguru import logger
@@ -28,11 +38,20 @@ class FrameDirection(Enum):

 class FrameProcessorMetrics:
    def __init__(self, name: str):
-        self._name = name
+        self._core_metrics_data = MetricsData(processor=name)
        self._start_ttfb_time = 0
        self._start_processing_time = 0
        self._should_report_ttfb = True

+    def _processor_name(self):
+        return self._core_metrics_data.processor
+
+    def _model_name(self):
+        return self._core_metrics_data.model
+
+    def set_core_metrics_data(self, data: MetricsData):
+        self._core_metrics_data = data
+
    async def start_ttfb_metrics(self, report_only_initial_ttfb):
        if self._should_report_ttfb:
            self._start_ttfb_time = time.time()
@@ -43,13 +62,13 @@ class FrameProcessorMetrics:
            return None

        value = time.time() - self._start_ttfb_time
-        logger.debug(f"{self._name} TTFB: {value}")
-        ttfb = {
-            "processor": self._name,
-            "value": value
-        }
+        logger.debug(f"{self._processor_name()} TTFB: {value}")
+        ttfb = TTFBMetricsData(
+            processor=self._processor_name(),
+            value=value,
+            model=self._model_name())
        self._start_ttfb_time = 0
-        return MetricsFrame(ttfb=[ttfb])
+        return MetricsFrame(data=[ttfb])

    async def start_processing_metrics(self):
        self._start_processing_time = time.time()
@@ -59,26 +78,28 @@ class FrameProcessorMetrics:
            return None

        value = time.time() - self._start_processing_time
-        logger.debug(f"{self._name} processing time: {value}")
-        processing = {
-            "processor": self._name,
-            "value": value
-        }
+        logger.debug(f"{self._processor_name()} processing time: {value}")
+        processing = ProcessingMetricsData(
+            processor=self._processor_name(), value=value, model=self._model_name())
        self._start_processing_time = 0
-        return MetricsFrame(processing=[processing])
+        return MetricsFrame(data=[processing])

-    async def start_llm_usage_metrics(self, tokens: dict):
+    async def start_llm_usage_metrics(self, tokens: LLMTokenUsage):
        logger.debug(
-            f"{self._name} prompt tokens: {tokens['prompt_tokens']}, completion tokens: {tokens['completion_tokens']}")
-        return MetricsFrame(tokens=[tokens])
+            f"{self._processor_name()} prompt tokens: {tokens.prompt_tokens}, completion tokens: {tokens.completion_tokens}")
+        value = LLMUsageMetricsData(
+            processor=self._processor_name(),
+            model=self._model_name(),
+            value=tokens)
+        return MetricsFrame(data=[value])

    async def start_tts_usage_metrics(self, text: str):
-        characters = {
-            "processor": self._name,
-            "value": len(text),
-        }
-        logger.debug(f"{self._name} usage characters: {characters['value']}")
-        return MetricsFrame(characters=[characters])
+        characters = TTSUsageMetricsData(
+            processor=self._processor_name(),
+            model=self._model_name(),
+            value=len(text))
+        logger.debug(f"{self._processor_name()} usage characters: {characters.value}")
+        return MetricsFrame(data=[characters])


 class FrameProcessor:
@@ -87,6 +108,7 @@ class FrameProcessor:
            self,
            *,
            name: str | None = None,
+            sync: bool = True,
            loop: asyncio.AbstractEventLoop | None = None,
            **kwargs):
        self.id: int = obj_id()
@@ -95,6 +117,10 @@ class FrameProcessor:
        self._prev: "FrameProcessor" | None = None
        self._next: "FrameProcessor" | None = None
        self._loop: asyncio.AbstractEventLoop = loop or asyncio.get_running_loop()
+        self._sync = sync
+
+        # Clock
+        self._clock: BaseClock | None = None

        # Properties
        self._allow_interruptions = False
@@ -105,6 +131,14 @@ class FrameProcessor:
        # Metrics
        self._metrics = FrameProcessorMetrics(name=self.name)

+        # Every processor in Pipecat should only output frames from a single
+        # task. This avoid problems like audio overlapping. System frames are
+        # the exception to this rule.
+        #
+        # This create this task.
+        if not self._sync:
+            self.__create_push_task()
+
    @property
    def interruptions_allowed(self):
        return self._allow_interruptions
@@ -124,6 +158,9 @@ class FrameProcessor:
    def can_generate_metrics(self) -> bool:
        return False

+    def set_core_metrics_data(self, data: MetricsData):
+        self._metrics.set_core_metrics_data(data)
+
    async def start_ttfb_metrics(self):
        if self.can_generate_metrics() and self.metrics_enabled:
            await self._metrics.start_ttfb_metrics(self._report_only_initial_ttfb)
@@ -144,7 +181,7 @@ class FrameProcessor:
            if frame:
                await self.push_frame(frame)

-    async def start_llm_usage_metrics(self, tokens: dict):
+    async def start_llm_usage_metrics(self, tokens: LLMTokenUsage):
        if self.can_generate_metrics() and self.usage_metrics_enabled:
            frame = await self._metrics.start_llm_usage_metrics(tokens)
            if frame:
@@ -177,21 +214,49 @@ class FrameProcessor:
    def get_parent(self) -> "FrameProcessor":
        return self._parent

+    def get_clock(self) -> BaseClock:
+        return self._clock
+
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        if isinstance(frame, StartFrame):
+            self._clock = frame.clock
            self._allow_interruptions = frame.allow_interruptions
            self._enable_metrics = frame.enable_metrics
            self._enable_usage_metrics = frame.enable_usage_metrics
            self._report_only_initial_ttfb = frame.report_only_initial_ttfb
        elif isinstance(frame, StartInterruptionFrame):
+            await self._start_interruption()
            await self.stop_all_metrics()
-        elif isinstance(frame, UserStoppedSpeakingFrame):
+        elif isinstance(frame, StopInterruptionFrame):
            self._should_report_ttfb = True

    async def push_error(self, error: ErrorFrame):
        await self.push_frame(error, FrameDirection.UPSTREAM)

    async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
+        if self._sync or isinstance(frame, SystemFrame):
+            await self.__internal_push_frame(frame, direction)
+        else:
+            await self.__push_queue.put((frame, direction))
+
+    #
+    # Handle interruptions
+    #
+
+    async def _start_interruption(self):
+        if not self._sync:
+            # Cancel the task. This will stop pushing frames downstream.
+            self.__push_frame_task.cancel()
+            await self.__push_frame_task
+
+            # Create a new queue and task.
+            self.__create_push_task()
+
+    async def _stop_interruption(self):
+        # Nothing to do right now.
+        pass
+
+    async def __internal_push_frame(self, frame: Frame, direction: FrameDirection):
        try:
            if direction == FrameDirection.DOWNSTREAM and self._next:
                logger.trace(f"Pushing {frame} from {self} to {self._next}")
@@ -202,5 +267,20 @@ class FrameProcessor:
        except Exception as e:
            logger.exception(f"Uncaught exception in {self}: {e}")

+    def __create_push_task(self):
+        self.__push_queue = asyncio.Queue()
+        self.__push_frame_task = self.get_event_loop().create_task(self.__push_frame_task_handler())
+
+    async def __push_frame_task_handler(self):
+        running = True
+        while running:
+            try:
+                (frame, direction) = await self.__push_queue.get()
+                await self.__internal_push_frame(frame, direction)
+                running = not isinstance(frame, EndFrame)
+                self.__push_queue.task_done()
+            except asyncio.CancelledError:
+                break
+
    def __str__(self):
        return self.name
--- a/src/pipecat/processors/frameworks/rtvi.py
+++ b/src/pipecat/processors/frameworks/rtvi.py
@@ -81,11 +81,6 @@ class RTVIAction(BaseModel):
        return super().model_post_init(__context)


-#
-# Client -> Pipecat messages.
-#
-
-
 class RTVIServiceOptionConfig(BaseModel):
    name: str
    value: Any
@@ -100,6 +95,16 @@ class RTVIConfig(BaseModel):
    config: List[RTVIServiceConfig]


+#
+# Client -> Pipecat messages.
+#
+
+
+class RTVIUpdateConfig(BaseModel):
+    config: List[RTVIServiceConfig]
+    interrupt: bool = False
+
+
 class RTVIActionRunArgument(BaseModel):
    name: str
    value: Any
@@ -222,7 +227,7 @@ class RTVILLMFunctionCallResultData(BaseModel):
    function_name: str
    tool_call_id: str
    arguments: dict
-    result: dict
+    result: dict | str


 class RTVITranscriptionMessageData(BaseModel):
@@ -267,8 +272,9 @@ class RTVIProcessor(FrameProcessor):
    def __init__(self,
                 *,
                 config: RTVIConfig = RTVIConfig(config=[]),
-                 params: RTVIProcessorParams = RTVIProcessorParams()):
-        super().__init__()
+                 params: RTVIProcessorParams = RTVIProcessorParams(),
+                 **kwargs):
+        super().__init__(sync=False, **kwargs)
        self._config = config
        self._params = params

@@ -281,9 +287,6 @@ class RTVIProcessor(FrameProcessor):
        self._registered_actions: Dict[str, RTVIAction] = {}
        self._registered_services: Dict[str, RTVIService] = {}

-        self._push_frame_task = self.get_event_loop().create_task(self._push_frame_task_handler())
-        self._push_queue = asyncio.Queue()
-
        self._message_task = self.get_event_loop().create_task(self._message_task_handler())
        self._message_queue = asyncio.Queue()

@@ -330,17 +333,16 @@ class RTVIProcessor(FrameProcessor):
        message = RTVILLMFunctionCallStartMessage(data=fn)
        await self._push_transport_message(message, exclude_none=False)

-    async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
-        if isinstance(frame, SystemFrame):
-            await super().push_frame(frame, direction)
-        else:
-            await self._internal_push_frame(frame, direction)
-
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        # Specific system frames
-        if isinstance(frame, CancelFrame):
+        if isinstance(frame, StartFrame):
+            # Push StartFrame before start(), because we want StartFrame to be
+            # processed by every processor before any other frame is processed.
+            await self.push_frame(frame, direction)
+            await self._start(frame)
+        elif isinstance(frame, CancelFrame):
            await self._cancel(frame)
            await self.push_frame(frame, direction)
        elif isinstance(frame, ErrorFrame):
@@ -350,9 +352,6 @@ class RTVIProcessor(FrameProcessor):
        elif isinstance(frame, SystemFrame):
            await self.push_frame(frame, direction)
        # Control frames
-        elif isinstance(frame, StartFrame):
-            await self._start(frame)
-            await self.push_frame(frame, direction)
        elif isinstance(frame, EndFrame):
            # Push EndFrame before stop(), because stop() waits on the task to
            # finish and the task finishes when EndFrame is processed.
@@ -387,30 +386,10 @@ class RTVIProcessor(FrameProcessor):
        # processing EndFrames.
        self._message_task.cancel()
        await self._message_task
-        await self._push_frame_task

    async def _cancel(self, frame: CancelFrame):
        self._message_task.cancel()
        await self._message_task
-        self._push_frame_task.cancel()
-        await self._push_frame_task
-
-    async def _internal_push_frame(
-            self,
-            frame: Frame | None,
-            direction: FrameDirection | None = FrameDirection.DOWNSTREAM):
-        await self._push_queue.put((frame, direction))
-
-    async def _push_frame_task_handler(self):
-        running = True
-        while running:
-            try:
-                (frame, direction) = await self._push_queue.get()
-                await super().push_frame(frame, direction)
-                self._push_queue.task_done()
-                running = not isinstance(frame, EndFrame)
-            except asyncio.CancelledError:
-                break

    async def _push_transport_message(self, model: BaseModel, exclude_none: bool = True):
        frame = TransportMessageFrame(
@@ -489,8 +468,8 @@ class RTVIProcessor(FrameProcessor):
                case "get-config":
                    await self._handle_get_config(message.id)
                case "update-config":
-                    config = RTVIConfig.model_validate(message.data)
-                    await self._handle_update_config(message.id, config)
+                    update_config = RTVIUpdateConfig.model_validate(message.data)
+                    await self._handle_update_config(message.id, update_config)
                case "action":
                    action = RTVIActionRun.model_validate(message.data)
                    await self._handle_action(message.id, action)
@@ -545,17 +524,14 @@ class RTVIProcessor(FrameProcessor):
            await handler(self, service.name, option)
            self._update_config_option(service.name, option)

-    async def _update_config(self, data: RTVIConfig):
+    async def _update_config(self, data: RTVIConfig, interrupt: bool):
+        if interrupt:
+            await self.interrupt_bot()
        for service_config in data.config:
            await self._update_service_config(service_config)

-    async def _handle_update_config(self, request_id: str, data: RTVIConfig):
-        # NOTE(aleix): The bot might be talking while we receive a new
-        # config. Let's interrupt it for now and update the config. Another
-        # solution is to wait until the bot stops speaking and then apply the
-        # config, but this definitely is more complicated to achieve.
-        await self.interrupt_bot()
-        await self._update_config(data)
+    async def _handle_update_config(self, request_id: str, data: RTVIUpdateConfig):
+        await self._update_config(RTVIConfig(config=data.config), data.interrupt)
        await self._handle_get_config(request_id)

    async def _handle_function_call_result(self, data):
@@ -583,7 +559,7 @@ class RTVIProcessor(FrameProcessor):
    async def _maybe_send_bot_ready(self):
        if self._pipeline_started and self._client_ready:
            await self._send_bot_ready()
-            await self._update_config(self._config)
+            await self._update_config(self._config, False)

    async def _send_bot_ready(self):
        if not self._params.send_bot_ready:
--- a/src/pipecat/processors/gstreamer/pipeline_source.py
+++ b/src/pipecat/processors/gstreamer/pipeline_source.py
@@ -9,11 +9,11 @@ import asyncio
 from pydantic import BaseModel

 from pipecat.frames.frames import (
-    AudioRawFrame,
    CancelFrame,
    EndFrame,
    Frame,
-    ImageRawFrame,
+    OutputAudioRawFrame,
+    OutputImageRawFrame,
    StartFrame,
    SystemFrame)
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
@@ -41,7 +41,7 @@ class GStreamerPipelineSource(FrameProcessor):
        clock_sync: bool = True

    def __init__(self, *, pipeline: str, out_params: OutputParams = OutputParams(), **kwargs):
-        super().__init__(**kwargs)
+        super().__init__(sync=False, **kwargs)

        self._out_params = out_params

@@ -62,76 +62,42 @@ class GStreamerPipelineSource(FrameProcessor):
        bus.add_signal_watch()
        bus.connect("message", self._on_gstreamer_message)

-        # Create push frame task. This is the task that will push frames in
-        # order. We also guarantee that all frames are pushed in the same task.
-        self._create_push_task()
-
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        # Specific system frames
-        if isinstance(frame, CancelFrame):
+        if isinstance(frame, StartFrame):
+            # Push StartFrame before start(), because we want StartFrame to be
+            # processed by every processor before any other frame is processed.
+            await self.push_frame(frame, direction)
+            await self._start(frame)
+        elif isinstance(frame, CancelFrame):
            await self._cancel(frame)
            await self.push_frame(frame, direction)
        # All other system frames
        elif isinstance(frame, SystemFrame):
            await self.push_frame(frame, direction)
        # Control frames
-        elif isinstance(frame, StartFrame):
-            await self._start(frame)
-            await self._internal_push_frame(frame, direction)
        elif isinstance(frame, EndFrame):
            # Push EndFrame before stop(), because stop() waits on the task to
            # finish and the task finishes when EndFrame is processed.
-            await self._internal_push_frame(frame, direction)
+            await self.push_frame(frame, direction)
            await self._stop(frame)
        # Other frames
        else:
-            await self._internal_push_frame(frame, direction)
+            await self.push_frame(frame, direction)

    async def _start(self, frame: StartFrame):
        self._player.set_state(Gst.State.PLAYING)

    async def _stop(self, frame: EndFrame):
        self._player.set_state(Gst.State.NULL)
-        # Wait for the push frame task to finish. It will finish when the
-        # EndFrame is actually processed.
-        await self._push_frame_task

    async def _cancel(self, frame: CancelFrame):
        self._player.set_state(Gst.State.NULL)
-        # Cancel all the tasks and wait for them to finish.
-        self._push_frame_task.cancel()
-        await self._push_frame_task

    #
-    # Push frames task
-    #
-
-    def _create_push_task(self):
-        loop = self.get_event_loop()
-        self._push_queue = asyncio.Queue()
-        self._push_frame_task = loop.create_task(self._push_frame_task_handler())
-
-    async def _internal_push_frame(
-            self,
-            frame: Frame | None,
-            direction: FrameDirection | None = FrameDirection.DOWNSTREAM):
-        await self._push_queue.put((frame, direction))
-
-    async def _push_frame_task_handler(self):
-        running = True
-        while running:
-            try:
-                (frame, direction) = await self._push_queue.get()
-                await self.push_frame(frame, direction)
-                running = not isinstance(frame, EndFrame)
-                self._push_queue.task_done()
-            except asyncio.CancelledError:
-                break
-
-    #
-    # GStreaner
+    # GStreamer
    #

    def _on_gstreamer_message(self, bus: Gst.Bus, message: Gst.Message):
@@ -216,20 +182,20 @@ class GStreamerPipelineSource(FrameProcessor):
    def _appsink_audio_new_sample(self, appsink: GstApp.AppSink):
        buffer = appsink.pull_sample().get_buffer()
        (_, info) = buffer.map(Gst.MapFlags.READ)
-        frame = AudioRawFrame(audio=info.data,
-                              sample_rate=self._out_params.audio_sample_rate,
-                              num_channels=self._out_params.audio_channels)
-        asyncio.run_coroutine_threadsafe(self._internal_push_frame(frame), self.get_event_loop())
+        frame = OutputAudioRawFrame(audio=info.data,
+                                    sample_rate=self._out_params.audio_sample_rate,
+                                    num_channels=self._out_params.audio_channels)
+        asyncio.run_coroutine_threadsafe(self.push_frame(frame), self.get_event_loop())
        buffer.unmap(info)
        return Gst.FlowReturn.OK

    def _appsink_video_new_sample(self, appsink: GstApp.AppSink):
        buffer = appsink.pull_sample().get_buffer()
        (_, info) = buffer.map(Gst.MapFlags.READ)
-        frame = ImageRawFrame(
+        frame = OutputImageRawFrame(
            image=info.data,
            size=(self._out_params.video_width, self._out_params.video_height),
            format="RGB")
-        asyncio.run_coroutine_threadsafe(self._internal_push_frame(frame), self.get_event_loop())
+        asyncio.run_coroutine_threadsafe(self.push_frame(frame), self.get_event_loop())
        buffer.unmap(info)
        return Gst.FlowReturn.OK
--- a/src/pipecat/processors/idle_frame_processor.py
+++ b/src/pipecat/processors/idle_frame_processor.py
@@ -8,19 +8,14 @@ import asyncio

 from typing import Awaitable, Callable, List

-from pipecat.frames.frames import Frame, SystemFrame
-from pipecat.processors.async_frame_processor import AsyncFrameProcessor
-from pipecat.processors.frame_processor import FrameDirection
+from pipecat.frames.frames import Frame
+from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


-class IdleFrameProcessor(AsyncFrameProcessor):
+class IdleFrameProcessor(FrameProcessor):
    """This class waits to receive any frame or list of desired frames within a
    given timeout. If the timeout is reached before receiving any of those
    frames the provided callback will be called.
-
-    The callback can then be used to push frames downstream by using
-    `queue_frame()` (or `push_frame()` for system frames).
-
    """

    def __init__(
@@ -30,7 +25,7 @@ class IdleFrameProcessor(AsyncFrameProcessor):
            timeout: float,
            types: List[type] = [],
            **kwargs):
-        super().__init__(**kwargs)
+        super().__init__(sync=False, **kwargs)

        self._callback = callback
        self._timeout = timeout
@@ -41,10 +36,7 @@ class IdleFrameProcessor(AsyncFrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

-        if isinstance(frame, SystemFrame):
-            await self.push_frame(frame, direction)
-        else:
-            await self.queue_frame(frame, direction)
+        await self.push_frame(frame, direction)

        # If we are not waiting for any specific frame set the event, otherwise
        # check if we have received one of the desired frames.
@@ -55,7 +47,6 @@ class IdleFrameProcessor(AsyncFrameProcessor):
                if isinstance(frame, t):
                    self._idle_event.set()

-        # If we are not waiting for any specific frame set the event, otherwise
    async def cleanup(self):
        self._idle_task.cancel()
        await self._idle_task
--- a/src/pipecat/processors/user_idle_processor.py
+++ b/src/pipecat/processors/user_idle_processor.py
@@ -11,21 +11,16 @@ from typing import Awaitable, Callable
 from pipecat.frames.frames import (
    BotSpeakingFrame,
    Frame,
-    SystemFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame)
-from pipecat.processors.async_frame_processor import AsyncFrameProcessor
-from pipecat.processors.frame_processor import FrameDirection
+from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


-class UserIdleProcessor(AsyncFrameProcessor):
+class UserIdleProcessor(FrameProcessor):
    """This class is useful to check if the user is interacting with the bot
    within a given timeout. If the timeout is reached before any interaction
    occurred the provided callback will be called.

-    The callback can then be used to push frames downstream by using
-    `queue_frame()` (or `push_frame()` for system frames).
-
    """

    def __init__(
@@ -34,7 +29,7 @@ class UserIdleProcessor(AsyncFrameProcessor):
            callback: Callable[["UserIdleProcessor"], Awaitable[None]],
            timeout: float,
            **kwargs):
-        super().__init__(**kwargs)
+        super().__init__(sync=False, **kwargs)

        self._callback = callback
        self._timeout = timeout
@@ -46,10 +41,7 @@ class UserIdleProcessor(AsyncFrameProcessor):
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

-        if isinstance(frame, SystemFrame):
-            await self.push_frame(frame, direction)
-        else:
-            await self.queue_frame(frame, direction)
+        await self.push_frame(frame, direction)

        # We shouldn't call the idle callback if the user or the bot are speaking.
        if isinstance(frame, UserStartedSpeakingFrame):
--- a/src/pipecat/serializers/livekit.py
+++ b/src/pipecat/serializers/livekit.py
@@ -7,7 +7,10 @@
 import ctypes
 import pickle

-from pipecat.frames.frames import AudioRawFrame, Frame
+from pipecat.frames.frames import (
+    Frame,
+    InputAudioRawFrame,
+    OutputAudioRawFrame)
 from pipecat.serializers.base_serializer import FrameSerializer

 from loguru import logger
@@ -22,12 +25,8 @@ except ModuleNotFoundError as e:


 class LivekitFrameSerializer(FrameSerializer):
-    SERIALIZABLE_TYPES = {
-        AudioRawFrame: "audio",
-    }
-
    def serialize(self, frame: Frame) -> str | bytes | None:
-        if not isinstance(frame, AudioRawFrame):
+        if not isinstance(frame, OutputAudioRawFrame):
            return None
        audio_frame = AudioFrame(
            data=frame.audio,
@@ -39,7 +38,7 @@ class LivekitFrameSerializer(FrameSerializer):

    def deserialize(self, data: str | bytes) -> Frame | None:
        audio_frame: AudioFrame = pickle.loads(data)['frame']
-        return AudioRawFrame(
+        return InputAudioRawFrame(
            audio=bytes(audio_frame.data),
            sample_rate=audio_frame.sample_rate,
            num_channels=audio_frame.num_channels,
--- a/src/pipecat/serializers/protobuf.py
+++ b/src/pipecat/serializers/protobuf.py
@@ -8,7 +8,11 @@ import dataclasses

 import pipecat.frames.protobufs.frames_pb2 as frame_protos

-from pipecat.frames.frames import AudioRawFrame, Frame, TextFrame, TranscriptionFrame
+from pipecat.frames.frames import (
+    AudioRawFrame,
+    Frame,
+    TextFrame,
+    TranscriptionFrame)
 from pipecat.serializers.base_serializer import FrameSerializer

 from loguru import logger
@@ -29,14 +33,15 @@ class ProtobufFrameSerializer(FrameSerializer):
    def serialize(self, frame: Frame) -> str | bytes | None:
        proto_frame = frame_protos.Frame()
        if type(frame) not in self.SERIALIZABLE_TYPES:
-            raise ValueError(
-                f"Frame type {type(frame)} is not serializable. You may need to add it to ProtobufFrameSerializer.SERIALIZABLE_FIELDS.")
+            logger.warning(f"Frame type {type(frame)} is not serializable")
+            return None

        # ignoring linter errors; we check that type(frame) is in this dict above
        proto_optional_name = self.SERIALIZABLE_TYPES[type(frame)]  # type: ignore
        for field in dataclasses.fields(frame):  # type: ignore
-            setattr(getattr(proto_frame, proto_optional_name), field.name,
-                    getattr(frame, field.name))
+            value = getattr(frame, field.name)
+            if value:
+                setattr(getattr(proto_frame, proto_optional_name), field.name, value)

        result = proto_frame.SerializeToString()
        return result
@@ -48,8 +53,8 @@ class ProtobufFrameSerializer(FrameSerializer):

        >>> serializer = ProtobufFrameSerializer()
        >>> serializer.deserialize(
-        ...     serializer.serialize(AudioFrame(data=b'1234567890')))
-        AudioFrame(data=b'1234567890')
+        ...     serializer.serialize(OutputAudioFrame(data=b'1234567890')))
+        InputAudioFrame(data=b'1234567890')

        >>> serializer.deserialize(
        ...     serializer.serialize(TextFrame(text='hello world')))
@@ -75,10 +80,13 @@ class ProtobufFrameSerializer(FrameSerializer):
        # Remove special fields if needed
        id = getattr(args, "id")
        name = getattr(args, "name")
+        pts = getattr(args, "pts")
        if not id:
            del args_dict["id"]
        if not name:
            del args_dict["name"]
+        if not pts:
+            del args_dict["pts"]

        # Create the instance
        instance = class_name(**args_dict)
@@ -88,5 +96,7 @@ class ProtobufFrameSerializer(FrameSerializer):
            setattr(instance, "id", getattr(args, "id"))
        if name:
            setattr(instance, "name", getattr(args, "name"))
+        if pts:
+            setattr(instance, "pts", getattr(args, "pts"))

        return instance
--- a/src/pipecat/serializers/twilio.py
+++ b/src/pipecat/serializers/twilio.py
@@ -9,7 +9,10 @@ import json

 from pydantic import BaseModel

-from pipecat.frames.frames import AudioRawFrame, Frame
+from pipecat.frames.frames import (
+    AudioRawFrame,
+    Frame,
+    StartInterruptionFrame)
 from pipecat.serializers.base_serializer import FrameSerializer
 from pipecat.utils.audio import ulaw_to_pcm, pcm_to_ulaw

@@ -19,31 +22,30 @@ class TwilioFrameSerializer(FrameSerializer):
        twilio_sample_rate: int = 8000
        sample_rate: int = 16000

-    SERIALIZABLE_TYPES = {
-        AudioRawFrame: "audio",
-    }
-
    def __init__(self, stream_sid: str, params: InputParams = InputParams()):
        self._stream_sid = stream_sid
        self._params = params

    def serialize(self, frame: Frame) -> str | bytes | None:
-        if not isinstance(frame, AudioRawFrame):
-            return None
+        if isinstance(frame, AudioRawFrame):
+            data = frame.audio

-        data = frame.audio
-
-        serialized_data = pcm_to_ulaw(data, frame.sample_rate, self._params.twilio_sample_rate)
-        payload = base64.b64encode(serialized_data).decode("utf-8")
-        answer = {
-            "event": "media",
-            "streamSid": self._stream_sid,
-            "media": {
-                "payload": payload
+            serialized_data = pcm_to_ulaw(
+                data, frame.sample_rate, self._params.twilio_sample_rate)
+            payload = base64.b64encode(serialized_data).decode("utf-8")
+            answer = {
+                "event": "media",
+                "streamSid": self._stream_sid,
+                "media": {
+                    "payload": payload
+                }
            }
-        }

-        return json.dumps(answer)
+            return json.dumps(answer)
+
+        if isinstance(frame, StartInterruptionFrame):
+            answer = {"event": "clear", "streamSid": self._stream_sid}
+            return json.dumps(answer)

    def deserialize(self, data: str | bytes) -> Frame | None:
        message = json.loads(data)
--- a/src/pipecat/services/ai_services.py
+++ b/src/pipecat/services/ai_services.py
@@ -4,11 +4,12 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

+import asyncio
 import io
 import wave

 from abc import abstractmethod
-from typing import AsyncGenerator
+from typing import AsyncGenerator, List, Optional, Tuple

 from pipecat.frames.frames import (
    AudioRawFrame,
@@ -17,25 +18,45 @@ from pipecat.frames.frames import (
    ErrorFrame,
    Frame,
    LLMFullResponseEndFrame,
+    STTLanguageUpdateFrame,
+    STTModelUpdateFrame,
    StartFrame,
    StartInterruptionFrame,
+    TTSAudioRawFrame,
+    TTSLanguageUpdateFrame,
+    TTSModelUpdateFrame,
    TTSSpeakFrame,
+    TTSStartedFrame,
+    TTSStoppedFrame,
    TTSVoiceUpdateFrame,
    TextFrame,
    UserImageRequestFrame,
    VisionImageRawFrame
 )
-from pipecat.processors.async_frame_processor import AsyncFrameProcessor
+from pipecat.metrics.metrics import MetricsData
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
+from pipecat.transcriptions.language import Language
 from pipecat.utils.audio import calculate_audio_volume
 from pipecat.utils.string import match_endofsentence
+from pipecat.utils.time import seconds_to_nanoseconds
 from pipecat.utils.utils import exp_smoothing
 from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

+from loguru import logger
+

 class AIService(FrameProcessor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
+        self._model_name: str = ""
+
+    @property
+    def model_name(self) -> str:
+        return self._model_name
+
+    def set_model_name(self, model: str):
+        self._model_name = model
+        self.set_core_metrics_data(MetricsData(processor=self.name, model=self._model_name))

    async def start(self, frame: StartFrame):
        pass
@@ -56,7 +77,7 @@ class AIService(FrameProcessor):
        elif isinstance(frame, EndFrame):
            await self.stop(frame)

-    async def process_generator(self, generator: AsyncGenerator[Frame, None]):
+    async def process_generator(self, generator: AsyncGenerator[Frame | None, None]):
        async for f in generator:
            if f:
                if isinstance(f, ErrorFrame):
@@ -65,30 +86,6 @@ class AIService(FrameProcessor):
                    await self.push_frame(f)


-class AsyncAIService(AsyncFrameProcessor):
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-
-    async def start(self, frame: StartFrame):
-        pass
-
-    async def stop(self, frame: EndFrame):
-        pass
-
-    async def cancel(self, frame: CancelFrame):
-        pass
-
-    async def process_frame(self, frame: Frame, direction: FrameDirection):
-        await super().process_frame(frame, direction)
-
-        if isinstance(frame, StartFrame):
-            await self.start(frame)
-        elif isinstance(frame, CancelFrame):
-            await self.cancel(frame)
-        elif isinstance(frame, EndFrame):
-            await self.stop(frame)
-
-
 class LLMService(AIService):
    """This class is a no-op but serves as a base class for LLM services."""

@@ -154,18 +151,34 @@ class TTSService(AIService):
            self,
            *,
            aggregate_sentences: bool = True,
-            # if True, subclass is responsible for pushing TextFrames and LLMFullResponseEndFrames
+            # if True, TTSService will push TextFrames and LLMFullResponseEndFrames,
+            # otherwise subclass must do it
            push_text_frames: bool = True,
+            # TTS output sample rate
+            sample_rate: int = 16000,
            **kwargs):
        super().__init__(**kwargs)
        self._aggregate_sentences: bool = aggregate_sentences
        self._push_text_frames: bool = push_text_frames
        self._current_sentence: str = ""
+        self._sample_rate: int = sample_rate
+
+    @property
+    def sample_rate(self) -> int:
+        return self._sample_rate
+
+    @abstractmethod
+    async def set_model(self, model: str):
+        self.set_model_name(model)

    @abstractmethod
    async def set_voice(self, voice: str):
        pass

+    @abstractmethod
+    async def set_language(self, language: Language):
+        pass
+
    # Converts the text to audio.
    @abstractmethod
    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
@@ -191,7 +204,7 @@ class TTSService(AIService):
        if text:
            await self._push_tts_frames(text)

-    async def _push_tts_frames(self, text: str, text_passthrough: bool = True):
+    async def _push_tts_frames(self, text: str):
        text = text.strip()
        if not text:
            return
@@ -221,16 +234,186 @@ class TTSService(AIService):
            else:
                await self.push_frame(frame, direction)
        elif isinstance(frame, TTSSpeakFrame):
-            await self._push_tts_frames(frame.text, False)
+            await self._push_tts_frames(frame.text)
+        elif isinstance(frame, TTSModelUpdateFrame):
+            await self.set_model(frame.model)
        elif isinstance(frame, TTSVoiceUpdateFrame):
            await self.set_voice(frame.voice)
+        elif isinstance(frame, TTSLanguageUpdateFrame):
+            await self.set_language(frame.language)
        else:
            await self.push_frame(frame, direction)


+class AsyncTTSService(TTSService):
+    def __init__(
+            self,
+            # if True, TTSService will push TTSStoppedFrames, otherwise subclass must do it
+            push_stop_frames: bool = False,
+            # if push_stop_frames is True, wait for this idle period before pushing TTSStoppedFrame
+            stop_frame_timeout_s: float = 1.0,
+            **kwargs):
+        super().__init__(sync=False, **kwargs)
+        self._push_stop_frames: bool = push_stop_frames
+        self._stop_frame_timeout_s: float = stop_frame_timeout_s
+        self._stop_frame_task: Optional[asyncio.Task] = None
+        self._stop_frame_queue: asyncio.Queue = asyncio.Queue()
+
+    @abstractmethod
+    async def flush_audio(self):
+        pass
+
+    async def start(self, frame: StartFrame):
+        await super().start(frame)
+        if self._push_stop_frames:
+            self._stop_frame_task = self.get_event_loop().create_task(self._stop_frame_handler())
+
+    async def stop(self, frame: EndFrame):
+        await super().stop(frame)
+        if self._stop_frame_task:
+            self._stop_frame_task.cancel()
+            await self._stop_frame_task
+            self._stop_frame_task = None
+
+    async def cancel(self, frame: CancelFrame):
+        await super().cancel(frame)
+        if self._stop_frame_task:
+            self._stop_frame_task.cancel()
+            await self._stop_frame_task
+            self._stop_frame_task = None
+
+    async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
+        await super().push_frame(frame, direction)
+
+        if self._push_stop_frames and (
+                isinstance(frame, StartInterruptionFrame) or
+                isinstance(frame, TTSStartedFrame) or
+                isinstance(frame, TTSAudioRawFrame) or
+                isinstance(frame, TTSStoppedFrame)):
+            await self._stop_frame_queue.put(frame)
+
+    async def _stop_frame_handler(self):
+        try:
+            has_started = False
+            while True:
+                try:
+                    frame = await asyncio.wait_for(self._stop_frame_queue.get(),
+                                                   self._stop_frame_timeout_s)
+                    if isinstance(frame, TTSStartedFrame):
+                        has_started = True
+                    elif isinstance(frame, (TTSStoppedFrame, StartInterruptionFrame)):
+                        has_started = False
+                except asyncio.TimeoutError:
+                    if has_started:
+                        await self.push_frame(TTSStoppedFrame())
+                        has_started = False
+        except asyncio.CancelledError:
+            pass
+
+
+class AsyncWordTTSService(AsyncTTSService):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self._initial_word_timestamp = -1
+        self._words_queue = asyncio.Queue()
+        self._words_task = self.get_event_loop().create_task(self._words_task_handler())
+
+    def start_word_timestamps(self):
+        if self._initial_word_timestamp == -1:
+            self._initial_word_timestamp = self.get_clock().get_time()
+
+    def reset_word_timestamps(self):
+        self._initial_word_timestamp = -1
+        self._word_timestamps = []
+
+    async def add_word_timestamps(self, word_times: List[Tuple[str, float]]):
+        for (word, timestamp) in word_times:
+            await self._words_queue.put((word, seconds_to_nanoseconds(timestamp)))
+
+    async def stop(self, frame: EndFrame):
+        await super().stop(frame)
+        await self._stop_words_task()
+
+    async def cancel(self, frame: CancelFrame):
+        await super().cancel(frame)
+        await self._stop_words_task()
+
+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, LLMFullResponseEndFrame) or isinstance(frame, EndFrame):
+            await self.flush_audio()
+
+    async def _handle_interruption(self, frame: StartInterruptionFrame, direction: FrameDirection):
+        await super()._handle_interruption(frame, direction)
+        self.reset_word_timestamps()
+
+    async def _stop_words_task(self):
+        if self._words_task:
+            self._words_task.cancel()
+            await self._words_task
+
+    async def _words_task_handler(self):
+        while True:
+            try:
+                (word, timestamp) = await self._words_queue.get()
+                if word == "LLMFullResponseEndFrame" and timestamp == 0:
+                    await self.push_frame(LLMFullResponseEndFrame())
+                else:
+                    frame = TextFrame(word)
+                    frame.pts = self._initial_word_timestamp + timestamp
+                    await self.push_frame(frame)
+                self._words_queue.task_done()
+            except asyncio.CancelledError:
+                break
+            except Exception as e:
+                logger.exception(f"{self} exception: {e}")
+
+
 class STTService(AIService):
    """STTService is a base class for speech-to-text services."""

+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+    @abstractmethod
+    async def set_model(self, model: str):
+        self.set_model_name(model)
+
+    @abstractmethod
+    async def set_language(self, language: Language):
+        pass
+
+    @abstractmethod
+    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
+        """Returns transcript as a string"""
+        pass
+
+    async def process_audio_frame(self, frame: AudioRawFrame):
+        await self.process_generator(self.run_stt(frame.audio))
+
+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        """Processes a frame of audio data, either buffering or transcribing it."""
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, AudioRawFrame):
+            # In this service we accumulate audio internally and at the end we
+            # push a TextFrame. We don't really want to push audio frames down.
+            await self.process_audio_frame(frame)
+        elif isinstance(frame, STTModelUpdateFrame):
+            await self.set_model(frame.model)
+        elif isinstance(frame, STTLanguageUpdateFrame):
+            await self.set_language(frame.language)
+        else:
+            await self.push_frame(frame, direction)
+
+
+class SegmentedSTTService(STTService):
+    """SegmentedSTTService is an STTService that will detect speech and will run
+    speech-to-text on speech segments only, instead of a continous stream.
+
+    """
+
    def __init__(self,
                 *,
                 min_volume: float = 0.6,
@@ -251,24 +434,7 @@ class STTService(AIService):
        self._smoothing_factor = 0.2
        self._prev_volume = 0

-    @abstractmethod
-    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
-        """Returns transcript as a string"""
-        pass
-
-    def _new_wave(self):
-        content = io.BytesIO()
-        ww = wave.open(content, "wb")
-        ww.setsampwidth(2)
-        ww.setnchannels(self._num_channels)
-        ww.setframerate(self._sample_rate)
-        return (content, ww)
-
-    def _get_smoothed_volume(self, frame: AudioRawFrame) -> float:
-        volume = calculate_audio_volume(frame.audio, frame.sample_rate)
-        return exp_smoothing(volume, self._prev_volume, self._smoothing_factor)
-
-    async def _append_audio(self, frame: AudioRawFrame):
+    async def process_audio_frame(self, frame: AudioRawFrame):
        # Try to filter out empty background noise
        volume = self._get_smoothed_volume(frame)
        if volume >= self._min_volume:
@@ -288,9 +454,7 @@ class STTService(AIService):
            self._silence_num_frames = 0
            self._wave.close()
            self._content.seek(0)
-            await self.start_processing_metrics()
            await self.process_generator(self.run_stt(self._content.read()))
-            await self.stop_processing_metrics()
            (self._content, self._wave) = self._new_wave()

    async def stop(self, frame: EndFrame):
@@ -299,16 +463,17 @@ class STTService(AIService):
    async def cancel(self, frame: CancelFrame):
        self._wave.close()

-    async def process_frame(self, frame: Frame, direction: FrameDirection):
-        """Processes a frame of audio data, either buffering or transcribing it."""
-        await super().process_frame(frame, direction)
+    def _new_wave(self):
+        content = io.BytesIO()
+        ww = wave.open(content, "wb")
+        ww.setsampwidth(2)
+        ww.setnchannels(self._num_channels)
+        ww.setframerate(self._sample_rate)
+        return (content, ww)

-        if isinstance(frame, AudioRawFrame):
-            # In this service we accumulate audio internally and at the end we
-            # push a TextFrame. We don't really want to push audio frames down.
-            await self._append_audio(frame)
-        else:
-            await self.push_frame(frame, direction)
+    def _get_smoothed_volume(self, frame: AudioRawFrame) -> float:
+        volume = calculate_audio_volume(frame.audio, frame.sample_rate)
+        return exp_smoothing(volume, self._prev_volume, self._smoothing_factor)


 class ImageGenService(AIService):
--- a/src/pipecat/services/anthropic.py
+++ b/src/pipecat/services/anthropic.py
@@ -8,11 +8,12 @@ import base64
 import json
 import io
 import copy
-from typing import List, Optional
+from typing import Any, Dict, List, Optional
 from dataclasses import dataclass
 from PIL import Image
 from asyncio import CancelledError
 import re
+from pydantic import BaseModel, Field

 from pipecat.frames.frames import (
    Frame,
@@ -29,6 +30,7 @@ from pipecat.frames.frames import (
    FunctionCallInProgressFrame,
    StartInterruptionFrame
 )
+from pipecat.metrics.metrics import LLMTokenUsage
 from pipecat.processors.frame_processor import FrameDirection
 from pipecat.services.ai_services import LLMService
 from pipecat.processors.aggregators.openai_llm_context import (
@@ -73,20 +75,30 @@ class AnthropicContextAggregatorPair:
 class AnthropicLLMService(LLMService):
    """This class implements inference with Anthropic's AI models
    """
+    class InputParams(BaseModel):
+        enable_prompt_caching_beta: Optional[bool] = False
+        max_tokens: Optional[int] = Field(default_factory=lambda: 4096, ge=1)
+        temperature: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=1.0)
+        top_k: Optional[int] = Field(default_factory=lambda: NOT_GIVEN, ge=0)
+        top_p: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=1.0)
+        extra: Optional[Dict[str, Any]] = Field(default_factory=dict)

    def __init__(
            self,
            *,
            api_key: str,
            model: str = "claude-3-5-sonnet-20240620",
-            max_tokens: int = 4096,
-            enable_prompt_caching_beta: bool = False,
+            params: InputParams = InputParams(),
            **kwargs):
        super().__init__(**kwargs)
        self._client = AsyncAnthropic(api_key=api_key)
-        self._model = model
-        self._max_tokens = max_tokens
-        self._enable_prompt_caching_beta = enable_prompt_caching_beta
+        self.set_model_name(model)
+        self._max_tokens = params.max_tokens
+        self._enable_prompt_caching_beta: bool = params.enable_prompt_caching_beta or False
+        self._temperature = params.temperature
+        self._top_k = params.top_k
+        self._top_p = params.top_p
+        self._extra = params.extra if isinstance(params.extra, dict) else {}

    def can_generate_metrics(self) -> bool:
        return True
@@ -104,6 +116,30 @@ class AnthropicLLMService(LLMService):
            _assistant=assistant
        )

+    async def set_enable_prompt_caching_beta(self, enable_prompt_caching_beta: bool):
+        logger.debug(f"Switching LLM enable_prompt_caching_beta to: [{enable_prompt_caching_beta}]")
+        self._enable_prompt_caching_beta = enable_prompt_caching_beta
+
+    async def set_max_tokens(self, max_tokens: int):
+        logger.debug(f"Switching LLM max_tokens to: [{max_tokens}]")
+        self._max_tokens = max_tokens
+
+    async def set_temperature(self, temperature: float):
+        logger.debug(f"Switching LLM temperature to: [{temperature}]")
+        self._temperature = temperature
+
+    async def set_top_k(self, top_k: float):
+        logger.debug(f"Switching LLM top_k to: [{top_k}]")
+        self._top_k = top_k
+
+    async def set_top_p(self, top_p: float):
+        logger.debug(f"Switching LLM top_p to: [{top_p}]")
+        self._top_p = top_p
+
+    async def set_extra(self, extra: Dict[str, Any]):
+        logger.debug(f"Switching LLM extra to: [{extra}]")
+        self._extra = extra
+
    async def _process_context(self, context: OpenAILLMContext):
        # Usage tracking. We track the usage reported by Anthropic in prompt_tokens and
        # completion_tokens. We also estimate the completion tokens from output text
@@ -133,13 +169,21 @@ class AnthropicLLMService(LLMService):

            await self.start_ttfb_metrics()

-            response = await api_call(
-                tools=context.tools or [],
-                system=context.system,
-                messages=messages,
-                model=self._model,
-                max_tokens=self._max_tokens,
-                stream=True)
+            params = {
+                "tools": context.tools or [],
+                "system": context.system,
+                "messages": messages,
+                "model": self.model_name,
+                "max_tokens": self._max_tokens,
+                "stream": True,
+                "temperature": self._temperature,
+                "top_k": self._top_k,
+                "top_p": self._top_p
+            }
+
+            params.update(self._extra)
+
+            response = await api_call(**params)

            await self.stop_ttfb_metrics()

@@ -231,7 +275,7 @@ class AnthropicLLMService(LLMService):
            context = AnthropicLLMContext.from_image_frame(frame)
        elif isinstance(frame, LLMModelUpdateFrame):
            logger.debug(f"Switching LLM model to: [{frame.model}]")
-            self._model = frame.model
+            self.set_model_name(frame.model)
        elif isinstance(frame, LLMEnablePromptCachingFrame):
            logger.debug(f"Setting enable prompt caching to: [{frame.enable}]")
            self._enable_prompt_caching_beta = frame.enable
@@ -251,15 +295,13 @@ class AnthropicLLMService(LLMService):
            cache_creation_input_tokens: int,
            cache_read_input_tokens: int):
        if prompt_tokens or completion_tokens or cache_creation_input_tokens or cache_read_input_tokens:
-            tokens = {
-                "processor": self.name,
-                "model": self._model,
-                "prompt_tokens": prompt_tokens,
-                "completion_tokens": completion_tokens,
-                "cache_creation_input_tokens": cache_creation_input_tokens,
-                "cache_read_input_tokens": cache_read_input_tokens,
-                "total_tokens": prompt_tokens + completion_tokens
-            }
+            tokens = LLMTokenUsage(
+                prompt_tokens=prompt_tokens,
+                completion_tokens=completion_tokens,
+                cache_creation_input_tokens=cache_creation_input_tokens,
+                cache_read_input_tokens=cache_read_input_tokens,
+                total_tokens=prompt_tokens + completion_tokens
+            )
            await self.start_llm_usage_metrics(tokens)


--- a/src/pipecat/services/azure.py
+++ b/src/pipecat/services/azure.py
@@ -12,19 +12,19 @@ from PIL import Image
 from typing import AsyncGenerator

 from pipecat.frames.frames import (
-    AudioRawFrame,
    CancelFrame,
    EndFrame,
    ErrorFrame,
    Frame,
    StartFrame,
-    SystemFrame,
+    TTSAudioRawFrame,
    TTSStartedFrame,
    TTSStoppedFrame,
    TranscriptionFrame,
    URLImageRawFrame)
+from pipecat.metrics.metrics import TTSUsageMetricsData
 from pipecat.processors.frame_processor import FrameDirection
-from pipecat.services.ai_services import AsyncAIService, TTSService, ImageGenService
+from pipecat.services.ai_services import STTService, TTSService, ImageGenService
 from pipecat.services.openai import BaseOpenAILLMService
 from pipecat.utils.time import time_now_iso8601

@@ -72,13 +72,21 @@ class AzureLLMService(BaseOpenAILLMService):


 class AzureTTSService(TTSService):
-    def __init__(self, *, api_key: str, region: str, voice="en-US-SaraNeural", **kwargs):
-        super().__init__(**kwargs)
+    def __init__(
+            self,
+            *,
+            api_key: str,
+            region: str,
+            voice="en-US-SaraNeural",
+            sample_rate: int = 16000,
+            **kwargs):
+        super().__init__(sample_rate=sample_rate, **kwargs)

        speech_config = SpeechConfig(subscription=api_key, region=region)
        self._speech_synthesizer = SpeechSynthesizer(speech_config=speech_config, audio_config=None)

        self._voice = voice
+        self._sample_rate = sample_rate

    def can_generate_metrics(self) -> bool:
        return True
@@ -109,7 +117,7 @@ class AzureTTSService(TTSService):
            await self.stop_ttfb_metrics()
            await self.push_frame(TTSStartedFrame())
            # Azure always sends a 44-byte header. Strip it off.
-            yield AudioRawFrame(audio=result.audio_data[44:], sample_rate=16000, num_channels=1)
+            yield TTSAudioRawFrame(audio=result.audio_data[44:], sample_rate=self._sample_rate, num_channels=1)
            await self.push_frame(TTSStoppedFrame())
        elif result.reason == ResultReason.Canceled:
            cancellation_details = result.cancellation_details
@@ -118,7 +126,7 @@ class AzureTTSService(TTSService):
                logger.error(f"{self} error: {cancellation_details.error_details}")


-class AzureSTTService(AsyncAIService):
+class AzureSTTService(STTService):
    def __init__(
            self,
            *,
@@ -141,15 +149,11 @@ class AzureSTTService(AsyncAIService):
            speech_config=speech_config, audio_config=audio_config)
        self._speech_recognizer.recognized.connect(self._on_handle_recognized)

-    async def process_frame(self, frame: Frame, direction: FrameDirection):
-        await super().process_frame(frame, direction)
-
-        if isinstance(frame, SystemFrame):
-            await self.push_frame(frame, direction)
-        elif isinstance(frame, AudioRawFrame):
-            self._audio_stream.write(frame.audio)
-        else:
-            await self._push_queue.put((frame, direction))
+    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
+        await self.start_processing_metrics()
+        self._audio_stream.write(audio)
+        await self.stop_processing_metrics()
+        yield None

    async def start(self, frame: StartFrame):
        await super().start(frame)
@@ -168,7 +172,7 @@ class AzureSTTService(AsyncAIService):
    def _on_handle_recognized(self, event):
        if event.result.reason == ResultReason.RecognizedSpeech and len(event.result.text) > 0:
            frame = TranscriptionFrame(event.result.text, "", time_now_iso8601())
-            asyncio.run_coroutine_threadsafe(self.queue_frame(frame), self.get_event_loop())
+            asyncio.run_coroutine_threadsafe(self.push_frame(frame), self.get_event_loop())


 class AzureImageGenServiceREST(ImageGenService):
@@ -188,7 +192,7 @@ class AzureImageGenServiceREST(ImageGenService):
        self._api_key = api_key
        self._azure_endpoint = endpoint
        self._api_version = api_version
-        self._model = model
+        self.set_model_name(model)
        self._image_size = image_size
        self._aiohttp_session = aiohttp_session

--- a/src/pipecat/services/cartesia.py
+++ b/src/pipecat/services/cartesia.py
@@ -8,30 +8,30 @@ import json
 import uuid
 import base64
 import asyncio
-import time

 from typing import AsyncGenerator

-from pipecat.processors.frame_processor import FrameDirection
 from pipecat.frames.frames import (
    CancelFrame,
    ErrorFrame,
    Frame,
-    AudioRawFrame,
    StartInterruptionFrame,
    StartFrame,
    EndFrame,
+    TTSAudioRawFrame,
    TTSStartedFrame,
    TTSStoppedFrame,
-    TextFrame,
    LLMFullResponseEndFrame
 )
-from pipecat.services.ai_services import TTSService
+from pipecat.processors.frame_processor import FrameDirection
+from pipecat.transcriptions.language import Language
+from pipecat.services.ai_services import AsyncWordTTSService, TTSService

 from loguru import logger

 # See .env.example for Cartesia configuration needed
 try:
+    from cartesia import AsyncCartesia
    import websockets
 except ModuleNotFoundError as e:
    logger.error(f"Exception: {e}")
@@ -40,7 +40,26 @@ except ModuleNotFoundError as e:
    raise Exception(f"Missing module: {e}")


-class CartesiaTTSService(TTSService):
+def language_to_cartesia_language(language: Language) -> str | None:
+    match language:
+        case Language.DE:
+            return "de"
+        case Language.EN:
+            return "en"
+        case Language.ES:
+            return "es"
+        case Language.FR:
+            return "fr"
+        case Language.JA:
+            return "ja"
+        case Language.PT:
+            return "pt"
+        case Language.ZH:
+            return "zh"
+    return None
+
+
+class CartesiaTTSService(AsyncWordTTSService):

    def __init__(
            self,
@@ -54,25 +73,23 @@ class CartesiaTTSService(TTSService):
            sample_rate: int = 16000,
            language: str = "en",
            **kwargs):
-        super().__init__(**kwargs)
-
        # Aggregating sentences still gives cleaner-sounding results and fewer
-        # artifacts than streaming one word at a time. On average, waiting for
-        # a full sentence should only "cost" us 15ms or so with GPT-4o or a Llama 3
-        # model, and it's worth it for the better audio quality.
-        self._aggregate_sentences = True
-
-        # we don't want to automatically push LLM response text frames, because the
-        # context aggregators will add them to the LLM context even if we're
-        # interrupted. cartesia gives us word-by-word timestamps. we can use those
-        # to generate text frames ourselves aligned with the playout timing of the audio!
-        self._push_text_frames = False
+        # artifacts than streaming one word at a time. On average, waiting for a
+        # full sentence should only "cost" us 15ms or so with GPT-4o or a Llama
+        # 3 model, and it's worth it for the better audio quality.
+        #
+        # We also don't want to automatically push LLM response text frames,
+        # because the context aggregators will add them to the LLM context even
+        # if we're interrupted. Cartesia gives us word-by-word timestamps. We
+        # can use those to generate text frames ourselves aligned with the
+        # playout timing of the audio!
+        super().__init__(aggregate_sentences=True, push_text_frames=False, sample_rate=sample_rate, **kwargs)

        self._api_key = api_key
        self._cartesia_version = cartesia_version
        self._url = url
        self._voice_id = voice_id
-        self._model_id = model_id
+        self.set_model_name(model_id)
        self._output_format = {
            "container": "raw",
            "encoding": encoding,
@@ -82,18 +99,23 @@ class CartesiaTTSService(TTSService):

        self._websocket = None
        self._context_id = None
-        self._context_id_start_timestamp = None
-        self._timestamped_words_buffer = []
        self._receive_task = None
-        self._context_appending_task = None

    def can_generate_metrics(self) -> bool:
        return True

+    async def set_model(self, model: str):
+        await super().set_model(model)
+        logger.debug(f"Switching TTS model to: [{model}]")
+
    async def set_voice(self, voice: str):
        logger.debug(f"Switching TTS voice to: [{voice}]")
        self._voice_id = voice

+    async def set_language(self, language: Language):
+        logger.debug(f"Switching TTS language to: [{language}]")
+        self._language = language_to_cartesia_language(language)
+
    async def start(self, frame: StartFrame):
        await super().start(frame)
        await self._connect()
@@ -112,44 +134,60 @@ class CartesiaTTSService(TTSService):
                f"{self._url}?api_key={self._api_key}&cartesia_version={self._cartesia_version}"
            )
            self._receive_task = self.get_event_loop().create_task(self._receive_task_handler())
-            self._context_appending_task = self.get_event_loop().create_task(self._context_appending_task_handler())
        except Exception as e:
-            logger.exception(f"{self} initialization error: {e}")
+            logger.error(f"{self} initialization error: {e}")
            self._websocket = None

    async def _disconnect(self):
        try:
            await self.stop_all_metrics()

-            if self._context_appending_task:
-                self._context_appending_task.cancel()
-                await self._context_appending_task
-                self._context_appending_task = None
-            if self._receive_task:
-                self._receive_task.cancel()
-                await self._receive_task
-                self._receive_task = None
            if self._websocket:
                await self._websocket.close()
                self._websocket = None

+            if self._receive_task:
+                self._receive_task.cancel()
+                await self._receive_task
+                self._receive_task = None
+
            self._context_id = None
-            self._context_id_start_timestamp = None
-            self._timestamped_words_buffer = []
        except Exception as e:
-            logger.exception(f"{self} error closing websocket: {e}")
+            logger.error(f"{self} error closing websocket: {e}")
+
+    def _get_websocket(self):
+        if self._websocket:
+            return self._websocket
+        raise Exception("Websocket not connected")

    async def _handle_interruption(self, frame: StartInterruptionFrame, direction: FrameDirection):
        await super()._handle_interruption(frame, direction)
-        self._context_id = None
-        self._context_id_start_timestamp = None
-        self._timestamped_words_buffer = []
        await self.stop_all_metrics()
        await self.push_frame(LLMFullResponseEndFrame())
+        self._context_id = None
+
+    async def flush_audio(self):
+        if not self._context_id or not self._websocket:
+            return
+        logger.trace("Flushing audio")
+        msg = {
+            "transcript": "",
+            "continue": False,
+            "context_id": self._context_id,
+            "model_id": self.model_name,
+            "voice": {
+                "mode": "id",
+                "id": self._voice_id
+            },
+            "output_format": self._output_format,
+            "language": self._language,
+            "add_timestamps": True,
+        }
+        await self._websocket.send(json.dumps(msg))

    async def _receive_task_handler(self):
        try:
-            async for message in self._websocket:
+            async for message in self._get_websocket():
                msg = json.loads(message)
                if not msg or msg["context_id"] != self._context_id:
                    continue
@@ -160,17 +198,15 @@ class CartesiaTTSService(TTSService):
                    # because we are likely still playing out audio and need the
                    # timestamp to set send context frames.
                    self._context_id = None
-                    self._timestamped_words_buffer.append(("LLMFullResponseEndFrame", 0))
+                    await self.add_word_timestamps([("LLMFullResponseEndFrame", 0)])
                elif msg["type"] == "timestamps":
-                    # logger.debug(f"TIMESTAMPS: {msg}")
-                    self._timestamped_words_buffer.extend(
-                        list(zip(msg["word_timestamps"]["words"], msg["word_timestamps"]["end"]))
+                    await self.add_word_timestamps(
+                        list(zip(msg["word_timestamps"]["words"], msg["word_timestamps"]["start"]))
                    )
                elif msg["type"] == "chunk":
                    await self.stop_ttfb_metrics()
-                    if not self._context_id_start_timestamp:
-                        self._context_id_start_timestamp = time.time()
-                    frame = AudioRawFrame(
+                    self.start_word_timestamps()
+                    frame = TTSAudioRawFrame(
                        audio=base64.b64decode(msg["data"]),
                        sample_rate=self._output_format["sample_rate"],
                        num_channels=1
@@ -186,28 +222,7 @@ class CartesiaTTSService(TTSService):
        except asyncio.CancelledError:
            pass
        except Exception as e:
-            logger.exception(f"{self} exception: {e}")
-
-    async def _context_appending_task_handler(self):
-        try:
-            while True:
-                await asyncio.sleep(0.1)
-                if not self._context_id_start_timestamp:
-                    continue
-                elapsed_seconds = time.time() - self._context_id_start_timestamp
-                # Pop all words from self._timestamped_words_buffer that are
-                # older than the elapsed time and print a message about them to
-                # the console.
-                while self._timestamped_words_buffer and self._timestamped_words_buffer[0][1] <= elapsed_seconds:
-                    word, timestamp = self._timestamped_words_buffer.pop(0)
-                    if word == "LLMFullResponseEndFrame" and timestamp == 0:
-                        await self.push_frame(LLMFullResponseEndFrame())
-                        continue
-                    await self.push_frame(TextFrame(word))
-        except asyncio.CancelledError:
-            pass
-        except Exception as e:
-            logger.exception(f"{self} exception: {e}")
+            logger.error(f"{self} exception: {e}")

    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        logger.debug(f"Generating TTS: [{text}]")
@@ -225,7 +240,7 @@ class CartesiaTTSService(TTSService):
                "transcript": text + " ",
                "continue": True,
                "context_id": self._context_id,
-                "model_id": self._model_id,
+                "model_id": self.model_name,
                "voice": {
                    "mode": "id",
                    "id": self._voice_id
@@ -235,7 +250,7 @@ class CartesiaTTSService(TTSService):
                "add_timestamps": True,
            }
            try:
-                await self._websocket.send(json.dumps(msg))
+                await self._get_websocket().send(json.dumps(msg))
                await self.start_tts_usage_metrics(text)
            except Exception as e:
                logger.error(f"{self} error sending message: {e}")
@@ -245,4 +260,85 @@ class CartesiaTTSService(TTSService):
                return
            yield None
        except Exception as e:
-            logger.exception(f"{self} exception: {e}")
+            logger.error(f"{self} exception: {e}")
+
+
+class CartesiaHttpTTSService(TTSService):
+
+    def __init__(
+            self,
+            *,
+            api_key: str,
+            voice_id: str,
+            model_id: str = "sonic-english",
+            base_url: str = "https://api.cartesia.ai",
+            encoding: str = "pcm_s16le",
+            sample_rate: int = 16000,
+            language: str = "en",
+            **kwargs):
+        super().__init__(**kwargs)
+
+        self._api_key = api_key
+        self._voice_id = voice_id
+        self._model_id = model_id
+        self._output_format = {
+            "container": "raw",
+            "encoding": encoding,
+            "sample_rate": sample_rate,
+        }
+        self._language = language
+
+        self._client = AsyncCartesia(api_key=api_key, base_url=base_url)
+
+    def can_generate_metrics(self) -> bool:
+        return True
+
+    async def set_model(self, model: str):
+        logger.debug(f"Switching TTS model to: [{model}]")
+        self._model_id = model
+
+    async def set_voice(self, voice: str):
+        logger.debug(f"Switching TTS voice to: [{voice}]")
+        self._voice_id = voice
+
+    async def set_language(self, language: Language):
+        logger.debug(f"Switching TTS language to: [{language}]")
+        self._language = language_to_cartesia_language(language)
+
+    async def stop(self, frame: EndFrame):
+        await super().stop(frame)
+        await self._client.close()
+
+    async def cancel(self, frame: CancelFrame):
+        await super().cancel(frame)
+        await self._client.close()
+
+    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
+        logger.debug(f"Generating TTS: [{text}]")
+
+        await self.push_frame(TTSStartedFrame())
+        await self.start_ttfb_metrics()
+
+        try:
+            output = await self._client.tts.sse(
+                model_id=self._model_id,
+                transcript=text,
+                voice_id=self._voice_id,
+                output_format=self._output_format,
+                language=self._language,
+                stream=False
+            )
+
+            await self.stop_ttfb_metrics()
+
+            frame = TTSAudioRawFrame(
+                audio=output["audio"],
+                sample_rate=self._output_format["sample_rate"],
+                num_channels=1
+            )
+            yield frame
+        except Exception as e:
+            logger.error(f"{self} exception: {e}")
+
+        await self.start_tts_usage_metrics(text)
+        await self.push_frame(TTSStoppedFrame())
--- a/src/pipecat/services/deepgram.py
+++ b/src/pipecat/services/deepgram.py
@@ -9,19 +9,18 @@ import aiohttp
 from typing import AsyncGenerator

 from pipecat.frames.frames import (
-    AudioRawFrame,
    CancelFrame,
    EndFrame,
    ErrorFrame,
    Frame,
    InterimTranscriptionFrame,
    StartFrame,
-    SystemFrame,
+    TTSAudioRawFrame,
    TTSStartedFrame,
    TTSStoppedFrame,
    TranscriptionFrame)
-from pipecat.processors.frame_processor import FrameDirection
-from pipecat.services.ai_services import AsyncAIService, TTSService
+from pipecat.services.ai_services import STTService, TTSService
+from pipecat.transcriptions.language import Language
 from pipecat.utils.time import time_now_iso8601

 from loguru import logger
@@ -30,10 +29,12 @@ from loguru import logger
 # See .env.example for Deepgram configuration needed
 try:
    from deepgram import (
+        AsyncListenWebSocketClient,
        DeepgramClient,
        DeepgramClientOptions,
        LiveTranscriptionEvents,
        LiveOptions,
+        LiveResultResponse
    )
 except ModuleNotFoundError as e:
    logger.error(f"Exception: {e}")
@@ -100,14 +101,15 @@ class DeepgramTTSService(TTSService):
                await self.push_frame(TTSStartedFrame())
                async for data in r.content:
                    await self.stop_ttfb_metrics()
-                    frame = AudioRawFrame(audio=data, sample_rate=self._sample_rate, num_channels=1)
+                    frame = TTSAudioRawFrame(
+                        audio=data, sample_rate=self._sample_rate, num_channels=1)
                    yield frame
                await self.push_frame(TTSStoppedFrame())
        except Exception as e:
            logger.exception(f"{self} exception: {e}")


-class DeepgramSTTService(AsyncAIService):
+class DeepgramSTTService(STTService):
    def __init__(self,
                 *,
                 api_key: str,
@@ -120,6 +122,8 @@ class DeepgramSTTService(AsyncAIService):
                     channels=1,
                     interim_results=True,
                     smart_format=True,
+                     punctuate=True,
+                     profanity_filter=True,
                 ),
                 **kwargs):
        super().__init__(**kwargs)
@@ -128,40 +132,63 @@ class DeepgramSTTService(AsyncAIService):

        self._client = DeepgramClient(
            api_key, config=DeepgramClientOptions(url=url, options={"keepalive": "true"}))
-        self._connection = self._client.listen.asynclive.v("1")
+        self._connection: AsyncListenWebSocketClient = self._client.listen.asyncwebsocket.v("1")
        self._connection.on(LiveTranscriptionEvents.Transcript, self._on_message)

-    async def process_frame(self, frame: Frame, direction: FrameDirection):
-        await super().process_frame(frame, direction)
+    async def set_model(self, model: str):
+        await super().set_model(model)
+        logger.debug(f"Switching STT model to: [{model}]")
+        self._live_options.model = model
+        await self._disconnect()
+        await self._connect()

-        if isinstance(frame, SystemFrame):
-            await self.push_frame(frame, direction)
-        elif isinstance(frame, AudioRawFrame):
-            await self._connection.send(frame.audio)
-        else:
-            await self.queue_frame(frame, direction)
+    async def set_language(self, language: Language):
+        logger.debug(f"Switching STT language to: [{language}]")
+        self._live_options.language = language
+        await self._disconnect()
+        await self._connect()

    async def start(self, frame: StartFrame):
        await super().start(frame)
+        await self._connect()
+
+    async def stop(self, frame: EndFrame):
+        await super().stop(frame)
+        await self._disconnect()
+
+    async def cancel(self, frame: CancelFrame):
+        await super().cancel(frame)
+        await self._disconnect()
+
+    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
+        await self.start_processing_metrics()
+        await self._connection.send(audio)
+        await self.stop_processing_metrics()
+        yield None
+
+    async def _connect(self):
        if await self._connection.start(self._live_options):
            logger.debug(f"{self}: Connected to Deepgram")
        else:
            logger.error(f"{self}: Unable to connect to Deepgram")

-    async def stop(self, frame: EndFrame):
-        await super().stop(frame)
-        await self._connection.finish()
-
-    async def cancel(self, frame: CancelFrame):
-        await super().cancel(frame)
-        await self._connection.finish()
+    async def _disconnect(self):
+        if self._connection.is_connected:
+            await self._connection.finish()
+            logger.debug(f"{self}: Disconnected from Deepgram")

    async def _on_message(self, *args, **kwargs):
-        result = kwargs["result"]
+        result: LiveResultResponse = kwargs["result"]
+        if len(result.channel.alternatives) == 0:
+            return
        is_final = result.is_final
        transcript = result.channel.alternatives[0].transcript
+        language = None
+        if result.channel.alternatives[0].languages:
+            language = result.channel.alternatives[0].languages[0]
+            language = Language(language)
        if len(transcript) > 0:
            if is_final:
-                await self.queue_frame(TranscriptionFrame(transcript, "", time_now_iso8601()))
+                await self.push_frame(TranscriptionFrame(transcript, "", time_now_iso8601(), language))
            else:
-                await self.queue_frame(InterimTranscriptionFrame(transcript, "", time_now_iso8601()))
+                await self.push_frame(InterimTranscriptionFrame(transcript, "", time_now_iso8601(), language))
--- a/src/pipecat/services/elevenlabs.py
+++ b/src/pipecat/services/elevenlabs.py
@@ -4,18 +4,72 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

-import aiohttp
+import asyncio
+import base64
+import json

-from typing import AsyncGenerator, Literal
+from typing import Any, AsyncGenerator, List, Literal, Mapping, Tuple
 from pydantic import BaseModel

-from pipecat.frames.frames import AudioRawFrame, ErrorFrame, Frame, TTSStartedFrame, TTSStoppedFrame
-from pipecat.services.ai_services import TTSService
+from pipecat.frames.frames import (
+    CancelFrame,
+    EndFrame,
+    Frame,
+    StartFrame,
+    StartInterruptionFrame,
+    TTSAudioRawFrame,
+    TTSStartedFrame,
+    TTSStoppedFrame)
+from pipecat.processors.frame_processor import FrameDirection
+from pipecat.services.ai_services import AsyncWordTTSService

 from loguru import logger

+# See .env.example for ElevenLabs configuration needed
+try:
+    import websockets
+except ModuleNotFoundError as e:
+    logger.error(f"Exception: {e}")
+    logger.error(
+        "In order to use ElevenLabs, you need to `pip install pipecat-ai[elevenlabs]`. Also, set `ELEVENLABS_API_KEY` environment variable.")
+    raise Exception(f"Missing module: {e}")

-class ElevenLabsTTSService(TTSService):
+
+def sample_rate_from_output_format(output_format: str) -> int:
+    match output_format:
+        case "pcm_16000":
+            return 16000
+        case "pcm_22050":
+            return 22050
+        case "pcm_24000":
+            return 24000
+        case "pcm_44100":
+            return 44100
+    return 16000
+
+
+def calculate_word_times(
+        alignment_info: Mapping[str, Any], cumulative_time: float
+) -> List[Tuple[str, float]]:
+    zipped_times = list(zip(alignment_info["chars"], alignment_info["charStartTimesMs"]))
+
+    words = "".join(alignment_info["chars"]).split(" ")
+
+    # Calculate start time for each word. We do this by finding a space character
+    # and using the previous word time, also taking into account there might not
+    # be a space at the end.
+    times = []
+    for (i, (a, b)) in enumerate(zipped_times):
+        if a == " " or i == len(zipped_times) - 1:
+            t = cumulative_time + (zipped_times[i - 1][1] / 1000.0)
+            times.append(t)
+
+    word_times = list(zip(words, times))
+
+    return word_times
+
+
+class ElevenLabsTTSService(AsyncWordTTSService):
    class InputParams(BaseModel):
        output_format: Literal["pcm_16000", "pcm_22050", "pcm_24000", "pcm_44100"] = "pcm_16000"

@@ -24,56 +78,186 @@ class ElevenLabsTTSService(TTSService):
            *,
            api_key: str,
            voice_id: str,
-            aiohttp_session: aiohttp.ClientSession,
            model: str = "eleven_turbo_v2_5",
+            url: str = "wss://api.elevenlabs.io",
            params: InputParams = InputParams(),
            **kwargs):
-        super().__init__(**kwargs)
+        # Aggregating sentences still gives cleaner-sounding results and fewer
+        # artifacts than streaming one word at a time. On average, waiting for a
+        # full sentence should only "cost" us 15ms or so with GPT-4o or a Llama
+        # 3 model, and it's worth it for the better audio quality.
+        #
+        # We also don't want to automatically push LLM response text frames,
+        # because the context aggregators will add them to the LLM context even
+        # if we're interrupted. ElevenLabs gives us word-by-word timestamps. We
+        # can use those to generate text frames ourselves aligned with the
+        # playout timing of the audio!
+        #
+        # Finally, ElevenLabs doesn't provide information on when the bot stops
+        # speaking for a while, so we want the parent class to send TTSStopFrame
+        # after a short period not receiving any audio.
+        super().__init__(
+            aggregate_sentences=True,
+            push_text_frames=False,
+            push_stop_frames=True,
+            stop_frame_timeout_s=2.0,
+            sample_rate=sample_rate_from_output_format(params.output_format),
+            **kwargs
+        )

        self._api_key = api_key
        self._voice_id = voice_id
-        self._model = model
+        self.set_model_name(model)
+        self._url = url
        self._params = params
-        self._aiohttp_session = aiohttp_session
+
+        # Websocket connection to ElevenLabs.
+        self._websocket = None
+        # Indicates if we have sent TTSStartedFrame. It will reset to False when
+        # there's an interruption or TTSStoppedFrame.
+        self._started = False
+        self._cumulative_time = 0

    def can_generate_metrics(self) -> bool:
        return True

+    async def set_model(self, model: str):
+        await super().set_model(model)
+        logger.debug(f"Switching TTS model to: [{model}]")
+        await self._disconnect()
+        await self._connect()
+
    async def set_voice(self, voice: str):
        logger.debug(f"Switching TTS voice to: [{voice}]")
        self._voice_id = voice
+        await self._disconnect()
+        await self._connect()
+
+    async def start(self, frame: StartFrame):
+        await super().start(frame)
+        await self._connect()
+
+    async def stop(self, frame: EndFrame):
+        await super().stop(frame)
+        await self._disconnect()
+
+    async def cancel(self, frame: CancelFrame):
+        await super().cancel(frame)
+        await self._disconnect()
+
+    async def flush_audio(self):
+        if self._websocket:
+            msg = {"text": " ", "flush": True}
+            await self._websocket.send(json.dumps(msg))
+
+    async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
+        await super().push_frame(frame, direction)
+        if isinstance(frame, (TTSStoppedFrame, StartInterruptionFrame)):
+            self._started = False
+            if isinstance(frame, TTSStoppedFrame):
+                await self.add_word_timestamps([("LLMFullResponseEndFrame", 0)])
+
+    async def _connect(self):
+        try:
+            voice_id = self._voice_id
+            model = self.model_name
+            output_format = self._params.output_format
+            url = f"{self._url}/v1/text-to-speech/{voice_id}/stream-input?model_id={model}&output_format={output_format}"
+            self._websocket = await websockets.connect(url)
+            self._receive_task = self.get_event_loop().create_task(self._receive_task_handler())
+            self._keepalive_task = self.get_event_loop().create_task(self._keepalive_task_handler())
+
+            # According to ElevenLabs, we should always start with a single space.
+            msg = {
+                "text": " ",
+                "xi_api_key": self._api_key,
+            }
+            await self._websocket.send(json.dumps(msg))
+        except Exception as e:
+            logger.error(f"{self} initialization error: {e}")
+            self._websocket = None
+
+    async def _disconnect(self):
+        try:
+            await self.stop_all_metrics()
+
+            if self._websocket:
+                await self._websocket.send(json.dumps({"text": ""}))
+                await self._websocket.close()
+                self._websocket = None
+
+            if self._receive_task:
+                self._receive_task.cancel()
+                await self._receive_task
+                self._receive_task = None
+
+            if self._keepalive_task:
+                self._keepalive_task.cancel()
+                await self._keepalive_task
+                self._keepalive_task = None
+
+            self._started = False
+        except Exception as e:
+            logger.error(f"{self} error closing websocket: {e}")
+
+    async def _receive_task_handler(self):
+        try:
+            async for message in self._websocket:
+                msg = json.loads(message)
+                if msg.get("audio"):
+                    await self.stop_ttfb_metrics()
+                    self.start_word_timestamps()
+
+                    audio = base64.b64decode(msg["audio"])
+                    frame = TTSAudioRawFrame(audio, self.sample_rate, 1)
+                    await self.push_frame(frame)
+
+                if msg.get("alignment"):
+                    word_times = calculate_word_times(msg["alignment"], self._cumulative_time)
+                    await self.add_word_timestamps(word_times)
+                    self._cumulative_time = word_times[-1][1]
+        except asyncio.CancelledError:
+            pass
+        except Exception as e:
+            logger.error(f"{self} exception: {e}")
+
+    async def _keepalive_task_handler(self):
+        while True:
+            try:
+                await asyncio.sleep(10)
+                await self._send_text("")
+            except asyncio.CancelledError:
+                break
+            except Exception as e:
+                logger.error(f"{self} exception: {e}")
+
+    async def _send_text(self, text: str):
+        if self._websocket:
+            msg = {"text": text + " "}
+            await self._websocket.send(json.dumps(msg))

    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        logger.debug(f"Generating TTS: [{text}]")

-        url = f"https://api.elevenlabs.io/v1/text-to-speech/{self._voice_id}/stream"
+        try:
+            if not self._websocket:
+                await self._connect()

-        payload = {"text": text, "model_id": self._model}
+            try:
+                if not self._started:
+                    await self.push_frame(TTSStartedFrame())
+                    await self.start_ttfb_metrics()
+                    self._started = True
+                    self._cumulative_time = 0

-        querystring = {
-            "output_format": self._params.output_format
-        }
-
-        headers = {
-            "xi-api-key": self._api_key,
-            "Content-Type": "application/json",
-        }
-
-        await self.start_ttfb_metrics()
-
-        async with self._aiohttp_session.post(url, json=payload, headers=headers, params=querystring) as r:
-            if r.status != 200:
-                text = await r.text()
-                logger.error(f"{self} error getting audio (status: {r.status}, error: {text})")
-                yield ErrorFrame(f"Error getting audio (status: {r.status}, error: {text})")
+                await self._send_text(text)
+                await self.start_tts_usage_metrics(text)
+            except Exception as e:
+                logger.error(f"{self} error sending message: {e}")
+                await self.push_frame(TTSStoppedFrame())
+                await self._disconnect()
+                await self._connect()
                return
-
-            await self.start_tts_usage_metrics(text)
-
-            await self.push_frame(TTSStartedFrame())
-            async for chunk in r.content:
-                if len(chunk) > 0:
-                    await self.stop_ttfb_metrics()
-                    frame = AudioRawFrame(chunk, 16000, 1)
-                    yield frame
-            await self.push_frame(TTSStoppedFrame())
+            yield None
+        except Exception as e:
+            logger.error(f"{self} exception: {e}")
--- a/src/pipecat/services/fal.py
+++ b/src/pipecat/services/fal.py
@@ -43,9 +43,10 @@ class FalImageGenService(ImageGenService):
        aiohttp_session: aiohttp.ClientSession,
        model: str = "fal-ai/fast-sdxl",
        key: str | None = None,
+        **kwargs
    ):
-        super().__init__()
-        self._model = model
+        super().__init__(**kwargs)
+        self.set_model_name(model)
        self._params = params
        self._aiohttp_session = aiohttp_session
        if key:
@@ -55,7 +56,7 @@ class FalImageGenService(ImageGenService):
        logger.debug(f"Generating image from prompt: {prompt}")

        response = await fal_client.run_async(
-            self._model,
+            self.model_name,
            arguments={"prompt": prompt, **self._params.model_dump(exclude_none=True)}
        )

--- a/src/pipecat/services/fireworks.py
+++ b/src/pipecat/services/fireworks.py
@@ -22,4 +22,4 @@ class FireworksLLMService(BaseOpenAILLMService):
                 *,
                 model: str = "accounts/fireworks/models/firefunction-v1",
                 base_url: str = "https://api.fireworks.ai/inference/v1"):
-        super().__init__(model, base_url)
+        super().__init__(model=model, base_url=base_url)
--- a/src/pipecat/services/gladia.py
+++ b/src/pipecat/services/gladia.py
@@ -7,20 +7,17 @@
 import base64
 import json

-from typing import Optional
+from typing import AsyncGenerator, Optional
 from pydantic.main import BaseModel

 from pipecat.frames.frames import (
-    AudioRawFrame,
    CancelFrame,
    EndFrame,
    Frame,
    InterimTranscriptionFrame,
    StartFrame,
-    SystemFrame,
    TranscriptionFrame)
-from pipecat.processors.frame_processor import FrameDirection
-from pipecat.services.ai_services import AsyncAIService
+from pipecat.services.ai_services import STTService
 from pipecat.utils.time import time_now_iso8601

 from loguru import logger
@@ -35,7 +32,7 @@ except ModuleNotFoundError as e:
    raise Exception(f"Missing module: {e}")


-class GladiaSTTService(AsyncAIService):
+class GladiaSTTService(STTService):
    class InputParams(BaseModel):
        sample_rate: Optional[int] = 16000
        language: Optional[str] = "english"
@@ -50,23 +47,13 @@ class GladiaSTTService(AsyncAIService):
                 confidence: float = 0.5,
                 params: InputParams = InputParams(),
                 **kwargs):
-        super().__init__(**kwargs)
+        super().__init__(sync=False, **kwargs)

        self._api_key = api_key
        self._url = url
        self._params = params
        self._confidence = confidence

-    async def process_frame(self, frame: Frame, direction: FrameDirection):
-        await super().process_frame(frame, direction)
-
-        if isinstance(frame, SystemFrame):
-            await self.push_frame(frame, direction)
-        elif isinstance(frame, AudioRawFrame):
-            await self._send_audio(frame)
-        else:
-            await self.queue_frame(frame, direction)
-
    async def start(self, frame: StartFrame):
        await super().start(frame)
        self._websocket = await websockets.connect(self._url)
@@ -81,6 +68,12 @@ class GladiaSTTService(AsyncAIService):
        await super().cancel(frame)
        await self._websocket.close()

+    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
+        await self.start_processing_metrics()
+        await self._send_audio(audio)
+        await self.stop_processing_metrics()
+        yield None
+
    async def _setup_gladia(self):
        configuration = {
            "x_gladia_key": self._api_key,
@@ -92,9 +85,9 @@ class GladiaSTTService(AsyncAIService):

        await self._websocket.send(json.dumps(configuration))

-    async def _send_audio(self, frame: AudioRawFrame):
+    async def _send_audio(self, audio: bytes):
        message = {
-            'frames': base64.b64encode(frame.audio).decode("utf-8")
+            'frames': base64.b64encode(audio).decode("utf-8")
        }
        await self._websocket.send(json.dumps(message))

@@ -113,6 +106,6 @@ class GladiaSTTService(AsyncAIService):
                transcript = utterance["transcription"]
                if confidence >= self._confidence:
                    if type == "final":
-                        await self.queue_frame(TranscriptionFrame(transcript, "", time_now_iso8601()))
+                        await self.push_frame(TranscriptionFrame(transcript, "", time_now_iso8601()))
                    else:
-                        await self.queue_frame(InterimTranscriptionFrame(transcript, "", time_now_iso8601()))
+                        await self.push_frame(InterimTranscriptionFrame(transcript, "", time_now_iso8601()))
--- a/src/pipecat/services/google.py
+++ b/src/pipecat/services/google.py
@@ -50,6 +50,7 @@ class GoogleLLMService(LLMService):
        return True

    def _create_client(self, model: str):
+        self.set_model_name(model)
        self._client = gai.GenerativeModel(model)

    def _get_messages_from_openai_context(
--- a/src/pipecat/services/lmnt.py
+++ b/src/pipecat/services/lmnt.py
@@ -0,0 +1,166 @@
+#
+# Copyright (c) 2024, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import asyncio
+
+from typing import AsyncGenerator
+
+from pipecat.processors.frame_processor import FrameDirection
+from pipecat.frames.frames import (
+    CancelFrame,
+    EndFrame,
+    ErrorFrame,
+    Frame,
+    StartFrame,
+    StartInterruptionFrame,
+    TTSAudioRawFrame,
+    TTSStartedFrame,
+    TTSStoppedFrame,
+)
+from pipecat.services.ai_services import AsyncTTSService
+
+from loguru import logger
+
+# See .env.example for LMNT configuration needed
+try:
+    from lmnt.api import Speech
+except ModuleNotFoundError as e:
+    logger.error(f"Exception: {e}")
+    logger.error(
+        "In order to use LMNT, you need to `pip install pipecat-ai[lmnt]`. Also, set `LMNT_API_KEY` environment variable.")
+    raise Exception(f"Missing module: {e}")
+
+
+class LmntTTSService(AsyncTTSService):
+
+    def __init__(
+            self,
+            *,
+            api_key: str,
+            voice_id: str,
+            sample_rate: int = 24000,
+            language: str = "en",
+            **kwargs):
+        # Let TTSService produce TTSStoppedFrames after a short delay of
+        # no activity.
+        super().__init__(sync=False, push_stop_frames=True, sample_rate=sample_rate, **kwargs)
+
+        self._api_key = api_key
+        self._voice_id = voice_id
+        self._output_format = {
+            "container": "raw",
+            "encoding": "pcm_s16le",
+            "sample_rate": sample_rate,
+        }
+        self._language = language
+
+        self._speech = None
+        self._connection = None
+        self._receive_task = None
+        # Indicates if we have sent TTSStartedFrame. It will reset to False when
+        # there's an interruption or TTSStoppedFrame.
+        self._started = False
+
+    def can_generate_metrics(self) -> bool:
+        return True
+
+    async def set_voice(self, voice: str):
+        logger.debug(f"Switching TTS voice to: [{voice}]")
+        self._voice_id = voice
+
+    async def start(self, frame: StartFrame):
+        await super().start(frame)
+        await self._connect()
+
+    async def stop(self, frame: EndFrame):
+        await super().stop(frame)
+        await self._disconnect()
+
+    async def cancel(self, frame: CancelFrame):
+        await super().cancel(frame)
+        await self._disconnect()
+
+    async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
+        await super().push_frame(frame, direction)
+        if isinstance(frame, (TTSStoppedFrame, StartInterruptionFrame)):
+            self._started = False
+
+    async def _connect(self):
+        try:
+            self._speech = Speech()
+            self._connection = await self._speech.synthesize_streaming(
+                self._voice_id, format="raw", sample_rate=self._output_format["sample_rate"])
+            self._receive_task = self.get_event_loop().create_task(self._receive_task_handler())
+        except Exception as e:
+            logger.exception(f"{self} initialization error: {e}")
+            self._connection = None
+
+    async def _disconnect(self):
+        try:
+            await self.stop_all_metrics()
+
+            if self._receive_task:
+                self._receive_task.cancel()
+                await self._receive_task
+                self._receive_task = None
+            if self._connection:
+                await self._connection.socket.close()
+                self._connection = None
+            if self._speech:
+                await self._speech.close()
+                self._speech = None
+            self._started = False
+        except Exception as e:
+            logger.exception(f"{self} error closing websocket: {e}")
+
+    async def _receive_task_handler(self):
+        try:
+            async for msg in self._connection:
+                if "error" in msg:
+                    logger.error(f'{self} error: {msg["error"]}')
+                    await self.push_frame(TTSStoppedFrame())
+                    await self.stop_all_metrics()
+                    await self.push_error(ErrorFrame(f'{self} error: {msg["error"]}'))
+                elif "audio" in msg:
+                    await self.stop_ttfb_metrics()
+                    frame = TTSAudioRawFrame(
+                        audio=msg["audio"],
+                        sample_rate=self._output_format["sample_rate"],
+                        num_channels=1
+                    )
+                    await self.push_frame(frame)
+                else:
+                    logger.error(f"LMNT error, unknown message type: {msg}")
+        except asyncio.CancelledError:
+            pass
+        except Exception as e:
+            logger.exception(f"{self} exception: {e}")
+
+    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
+        logger.debug(f"Generating TTS: [{text}]")
+
+        try:
+            if not self._connection:
+                await self._connect()
+
+            if not self._started:
+                await self.push_frame(TTSStartedFrame())
+                await self.start_ttfb_metrics()
+                self._started = True
+
+            try:
+                await self._connection.append_text(text)
+                await self._connection.flush()
+                await self.start_tts_usage_metrics(text)
+            except Exception as e:
+                logger.error(f"{self} error sending message: {e}")
+                await self.push_frame(TTSStoppedFrame())
+                await self._disconnect()
+                await self._connect()
+                return
+            yield None
+        except Exception as e:
+            logger.exception(f"{self} exception: {e}")
--- a/src/pipecat/services/moondream.py
+++ b/src/pipecat/services/moondream.py
@@ -46,12 +46,15 @@ def detect_device():
 class MoondreamService(VisionService):
    def __init__(
        self,
-            *,
+        *,
        model="vikhyatk/moondream2",
-        revision="2024-04-02",
-        use_cpu=False
+        revision="2024-08-26",
+        use_cpu=False,
+        **kwargs
    ):
-        super().__init__()
+        super().__init__(**kwargs)
+
+        self.set_model_name(model)

        if not use_cpu:
            device, dtype = detect_device()
@@ -72,7 +75,7 @@ class MoondreamService(VisionService):

    async def run_vision(self, frame: VisionImageRawFrame) -> AsyncGenerator[Frame, None]:
        if not self._model:
-            logger.error(f"{self} error: Moondream model not available")
+            logger.error(f"{self} error: Moondream model not available ({self.model_name})")
            yield ErrorFrame("Moondream model not available")
            return

--- a/src/pipecat/services/openai.py
+++ b/src/pipecat/services/openai.py
@@ -11,19 +11,20 @@ import json
 import httpx
 from dataclasses import dataclass

-from typing import AsyncGenerator, List, Literal
+from typing import Any, AsyncGenerator, Dict, List, Literal, Optional
+from pydantic import BaseModel, Field

 from loguru import logger
 from PIL import Image

 from pipecat.frames.frames import (
-    AudioRawFrame,
    ErrorFrame,
    Frame,
    LLMFullResponseEndFrame,
    LLMFullResponseStartFrame,
    LLMMessagesFrame,
    LLMModelUpdateFrame,
+    TTSAudioRawFrame,
    TTSStartedFrame,
    TTSStoppedFrame,
    TextFrame,
@@ -33,6 +34,7 @@ from pipecat.frames.frames import (
    FunctionCallInProgressFrame,
    StartInterruptionFrame
 )
+from pipecat.metrics.metrics import LLMTokenUsage
 from pipecat.processors.aggregators.llm_response import LLMUserContextAggregator, LLMAssistantContextAggregator

 from pipecat.processors.aggregators.openai_llm_context import (
@@ -47,7 +49,7 @@ from pipecat.services.ai_services import (
 )

 try:
-    from openai import AsyncOpenAI, AsyncStream, DefaultAsyncHttpxClient, BadRequestError
+    from openai import AsyncOpenAI, AsyncStream, DefaultAsyncHttpxClient, BadRequestError, NOT_GIVEN
    from openai.types.chat import ChatCompletionChunk, ChatCompletionMessageParam
 except ModuleNotFoundError as e:
    logger.error(f"Exception: {e}")
@@ -55,6 +57,17 @@ except ModuleNotFoundError as e:
        "In order to use OpenAI, you need to `pip install pipecat-ai[openai]`. Also, set `OPENAI_API_KEY` environment variable.")
    raise Exception(f"Missing module: {e}")

+ValidVoice = Literal["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
+
+VALID_VOICES: Dict[str, ValidVoice] = {
+    "alloy": "alloy",
+    "echo": "echo",
+    "fable": "fable",
+    "onyx": "onyx",
+    "nova": "nova",
+    "shimmer": "shimmer",
+}
+

 class OpenAIUnhandledFunctionException(Exception):
    pass
@@ -69,11 +82,33 @@ class BaseOpenAILLMService(LLMService):
    as well as tool choices and the tool, which is used if requesting function
    calls from the LLM.
    """
+    class InputParams(BaseModel):
+        frequency_penalty: Optional[float] = Field(
+            default_factory=lambda: NOT_GIVEN, ge=-2.0, le=2.0)
+        presence_penalty: Optional[float] = Field(
+            default_factory=lambda: NOT_GIVEN, ge=-2.0, le=2.0)
+        seed: Optional[int] = Field(default_factory=lambda: NOT_GIVEN, ge=0)
+        temperature: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=2.0)
+        top_p: Optional[float] = Field(default_factory=lambda: NOT_GIVEN, ge=0.0, le=1.0)
+        extra: Optional[Dict[str, Any]] = Field(default_factory=dict)

-    def __init__(self, *, model: str, api_key=None, base_url=None, **kwargs):
+    def __init__(
+            self,
+            *,
+            model: str,
+            api_key=None,
+            base_url=None,
+            params: InputParams = InputParams(),
+            **kwargs):
        super().__init__(**kwargs)
-        self._model: str = model
+        self.set_model_name(model)
        self._client = self.create_client(api_key=api_key, base_url=base_url, **kwargs)
+        self._frequency_penalty = params.frequency_penalty
+        self._presence_penalty = params.presence_penalty
+        self._seed = params.seed
+        self._temperature = params.temperature
+        self._top_p = params.top_p
+        self._extra = params.extra if isinstance(params.extra, dict) else {}

    def create_client(self, api_key=None, base_url=None, **kwargs):
        return AsyncOpenAI(
@@ -88,18 +123,52 @@ class BaseOpenAILLMService(LLMService):
    def can_generate_metrics(self) -> bool:
        return True

+    async def set_frequency_penalty(self, frequency_penalty: float):
+        logger.debug(f"Switching LLM frequency_penalty to: [{frequency_penalty}]")
+        self._frequency_penalty = frequency_penalty
+
+    async def set_presence_penalty(self, presence_penalty: float):
+        logger.debug(f"Switching LLM presence_penalty to: [{presence_penalty}]")
+        self._presence_penalty = presence_penalty
+
+    async def set_seed(self, seed: int):
+        logger.debug(f"Switching LLM seed to: [{seed}]")
+        self._seed = seed
+
+    async def set_temperature(self, temperature: float):
+        logger.debug(f"Switching LLM temperature to: [{temperature}]")
+        self._temperature = temperature
+
+    async def set_top_p(self, top_p: float):
+        logger.debug(f"Switching LLM top_p to: [{top_p}]")
+        self._top_p = top_p
+
+    async def set_extra(self, extra: Dict[str, Any]):
+        logger.debug(f"Switching LLM extra to: [{extra}]")
+        self._extra = extra
+
    async def get_chat_completions(
            self,
            context: OpenAILLMContext,
            messages: List[ChatCompletionMessageParam]) -> AsyncStream[ChatCompletionChunk]:
-        chunks = await self._client.chat.completions.create(
-            model=self._model,
-            stream=True,
-            messages=messages,
-            tools=context.tools,
-            tool_choice=context.tool_choice,
-            stream_options={"include_usage": True}
-        )
+
+        params = {
+            "model": self.model_name,
+            "stream": True,
+            "messages": messages,
+            "tools": context.tools,
+            "tool_choice": context.tool_choice,
+            "stream_options": {"include_usage": True},
+            "frequency_penalty": self._frequency_penalty,
+            "presence_penalty": self._presence_penalty,
+            "seed": self._seed,
+            "temperature": self._temperature,
+            "top_p": self._top_p,
+        }
+
+        params.update(self._extra)
+
+        chunks = await self._client.chat.completions.create(**params)
        return chunks

    async def _stream_chat_completions(
@@ -137,13 +206,11 @@ class BaseOpenAILLMService(LLMService):

        async for chunk in chunk_stream:
            if chunk.usage:
-                tokens = {
-                    "processor": self.name,
-                    "model": self._model,
-                    "prompt_tokens": chunk.usage.prompt_tokens,
-                    "completion_tokens": chunk.usage.completion_tokens,
-                    "total_tokens": chunk.usage.total_tokens
-                }
+                tokens = LLMTokenUsage(
+                    prompt_tokens=chunk.usage.prompt_tokens,
+                    completion_tokens=chunk.usage.completion_tokens,
+                    total_tokens=chunk.usage.total_tokens
+                )
                await self.start_llm_usage_metrics(tokens)

            if len(chunk.choices) == 0:
@@ -212,7 +279,7 @@ class BaseOpenAILLMService(LLMService):
            context = OpenAILLMContext.from_image_frame(frame)
        elif isinstance(frame, LLMModelUpdateFrame):
            logger.debug(f"Switching LLM model to: [{frame.model}]")
-            self._model = frame.model
+            self.set_model_name(frame.model)
        else:
            await self.push_frame(frame, direction)

@@ -238,8 +305,13 @@ class OpenAIContextAggregatorPair:

 class OpenAILLMService(BaseOpenAILLMService):

-    def __init__(self, *, model: str = "gpt-4o", **kwargs):
-        super().__init__(model=model, **kwargs)
+    def __init__(
+            self,
+            *,
+            model: str = "gpt-4o",
+            params: BaseOpenAILLMService.InputParams = BaseOpenAILLMService.InputParams(),
+            **kwargs):
+        super().__init__(model=model, params=params, **kwargs)

    @staticmethod
    def create_context_aggregator(context: OpenAILLMContext) -> OpenAIContextAggregatorPair:
@@ -262,7 +334,7 @@ class OpenAIImageGenService(ImageGenService):
        model: str = "dall-e-3",
    ):
        super().__init__()
-        self._model = model
+        self.set_model_name(model)
        self._image_size = image_size
        self._client = AsyncOpenAI(api_key=api_key)
        self._aiohttp_session = aiohttp_session
@@ -272,7 +344,7 @@ class OpenAIImageGenService(ImageGenService):

        image = await self._client.images.generate(
            prompt=prompt,
-            model=self._model,
+            model=self.model_name,
            n=1,
            size=self._image_size
        )
@@ -307,13 +379,15 @@ class OpenAITTSService(TTSService):
            self,
            *,
            api_key: str | None = None,
-            voice: Literal["alloy", "echo", "fable", "onyx", "nova", "shimmer"] = "alloy",
+            voice: str = "alloy",
            model: Literal["tts-1", "tts-1-hd"] = "tts-1",
+            sample_rate: int = 24000,
            **kwargs):
-        super().__init__(**kwargs)
+        super().__init__(sample_rate=sample_rate, **kwargs)

-        self._voice = voice
-        self._model = model
+        self._voice: ValidVoice = VALID_VOICES.get(voice, "alloy")
+        self.set_model_name(model)
+        self._sample_rate = sample_rate

        self._client = AsyncOpenAI(api_key=api_key)

@@ -322,7 +396,11 @@ class OpenAITTSService(TTSService):

    async def set_voice(self, voice: str):
        logger.debug(f"Switching TTS voice to: [{voice}]")
-        self._voice = voice
+        self._voice = VALID_VOICES.get(voice, self._voice)
+
+    async def set_model(self, model: str):
+        logger.debug(f"Switching TTS model to: [{model}]")
+        self._model = model

    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        logger.debug(f"Generating TTS: [{text}]")
@@ -331,7 +409,7 @@ class OpenAITTSService(TTSService):

            async with self._client.audio.speech.with_streaming_response.create(
                    input=text,
-                    model=self._model,
+                    model=self.model_name,
                    voice=self._voice,
                    response_format="pcm",
            ) as r:
@@ -348,7 +426,7 @@ class OpenAITTSService(TTSService):
                async for chunk in r.iter_bytes(8192):
                    if len(chunk) > 0:
                        await self.stop_ttfb_metrics()
-                        frame = AudioRawFrame(chunk, 24_000, 1)
+                        frame = TTSAudioRawFrame(chunk, self.sample_rate, 1)
                        yield frame
                await self.push_frame(TTSStoppedFrame())
        except BadRequestError as e:
--- a/src/pipecat/services/openpipe.py
+++ b/src/pipecat/services/openpipe.py
@@ -60,7 +60,7 @@ class OpenPipeLLMService(BaseOpenAILLMService):
            context: OpenAILLMContext,
            messages: List[ChatCompletionMessageParam]) -> AsyncStream[ChatCompletionChunk]:
        chunks = await self._client.chat.completions.create(
-            model=self._model,
+            model=self.model_name,
            stream=True,
            messages=messages,
            openpipe={
--- a/src/pipecat/services/playht.py
+++ b/src/pipecat/services/playht.py
@@ -9,7 +9,11 @@ import struct

 from typing import AsyncGenerator

-from pipecat.frames.frames import AudioRawFrame, Frame, TTSStartedFrame, TTSStoppedFrame
+from pipecat.frames.frames import (
+    Frame,
+    TTSAudioRawFrame,
+    TTSStartedFrame,
+    TTSStoppedFrame)
 from pipecat.services.ai_services import TTSService

 from loguru import logger
@@ -27,8 +31,15 @@ except ModuleNotFoundError as e:

 class PlayHTTTSService(TTSService):

-    def __init__(self, *, api_key: str, user_id: str, voice_url: str, **kwargs):
-        super().__init__(**kwargs)
+    def __init__(
+            self,
+            *,
+            api_key: str,
+            user_id: str,
+            voice_url: str,
+            sample_rate: int = 16000,
+            **kwargs):
+        super().__init__(sample_rate=sample_rate, **kwargs)

        self._user_id = user_id
        self._speech_key = api_key
@@ -39,13 +50,17 @@ class PlayHTTTSService(TTSService):
        )
        self._options = TTSOptions(
            voice=voice_url,
-            sample_rate=16000,
+            sample_rate=sample_rate,
            quality="higher",
            format=Format.FORMAT_WAV)

    def can_generate_metrics(self) -> bool:
        return True

+    async def set_voice(self, voice: str):
+        logger.debug(f"Switching TTS voice to: [{voice}]")
+        self._options.voice = voice
+
    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
        logger.debug(f"Generating TTS: [{text}]")

@@ -80,7 +95,7 @@ class PlayHTTTSService(TTSService):
                else:
                    if len(chunk):
                        await self.stop_ttfb_metrics()
-                        frame = AudioRawFrame(chunk, 16000, 1)
+                        frame = TTSAudioRawFrame(chunk, 16000, 1)
                        yield frame
            await self.push_frame(TTSStoppedFrame())
        except Exception as e:
--- a/src/pipecat/services/together.py
+++ b/src/pipecat/services/together.py
@@ -4,23 +4,20 @@
 # SPDX-License-Identifier: BSD 2-Clause License
 #

-import base64
 import json
-import io
-import copy
-from typing import List, Optional
-from dataclasses import dataclass
-from asyncio import CancelledError
 import re
 import uuid
+from pydantic import BaseModel, Field
+
+from typing import Any, Dict, List, Optional
+from dataclasses import dataclass
+from asyncio import CancelledError

 from pipecat.frames.frames import (
    Frame,
    LLMModelUpdateFrame,
    TextFrame,
-    VisionImageRawFrame,
    UserImageRequestFrame,
-    UserImageRawFrame,
    LLMMessagesFrame,
    LLMFullResponseStartFrame,
    LLMFullResponseEndFrame,
@@ -28,6 +25,7 @@ from pipecat.frames.frames import (
    FunctionCallInProgressFrame,
    StartInterruptionFrame
 )
+from pipecat.metrics.metrics import LLMTokenUsage
 from pipecat.processors.frame_processor import FrameDirection
 from pipecat.services.ai_services import LLMService
 from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext, OpenAILLMContextFrame
@@ -59,18 +57,32 @@ class TogetherContextAggregatorPair:
 class TogetherLLMService(LLMService):
    """This class implements inference with Together's Llama 3.1 models
    """
+    class InputParams(BaseModel):
+        frequency_penalty: Optional[float] = Field(default=None, ge=-2.0, le=2.0)
+        max_tokens: Optional[int] = Field(default=4096, ge=1)
+        presence_penalty: Optional[float] = Field(default=None, ge=-2.0, le=2.0)
+        temperature: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+        top_k: Optional[int] = Field(default=None, ge=0)
+        top_p: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+        extra: Optional[Dict[str, Any]] = Field(default_factory=dict)

    def __init__(
            self,
            *,
            api_key: str,
            model: str = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
-            max_tokens: int = 4096,
+            params: InputParams = InputParams(),
            **kwargs):
        super().__init__(**kwargs)
        self._client = AsyncTogether(api_key=api_key)
-        self._model = model
-        self._max_tokens = max_tokens
+        self.set_model_name(model)
+        self._max_tokens = params.max_tokens
+        self._frequency_penalty = params.frequency_penalty
+        self._presence_penalty = params.presence_penalty
+        self._temperature = params.temperature
+        self._top_k = params.top_k
+        self._top_p = params.top_p
+        self._extra = params.extra if isinstance(params.extra, dict) else {}

    def can_generate_metrics(self) -> bool:
        return True
@@ -84,6 +96,34 @@ class TogetherLLMService(LLMService):
            _assistant=assistant
        )

+    async def set_frequency_penalty(self, frequency_penalty: float):
+        logger.debug(f"Switching LLM frequency_penalty to: [{frequency_penalty}]")
+        self._frequency_penalty = frequency_penalty
+
+    async def set_max_tokens(self, max_tokens: int):
+        logger.debug(f"Switching LLM max_tokens to: [{max_tokens}]")
+        self._max_tokens = max_tokens
+
+    async def set_presence_penalty(self, presence_penalty: float):
+        logger.debug(f"Switching LLM presence_penalty to: [{presence_penalty}]")
+        self._presence_penalty = presence_penalty
+
+    async def set_temperature(self, temperature: float):
+        logger.debug(f"Switching LLM temperature to: [{temperature}]")
+        self._temperature = temperature
+
+    async def set_top_k(self, top_k: float):
+        logger.debug(f"Switching LLM top_k to: [{top_k}]")
+        self._top_k = top_k
+
+    async def set_top_p(self, top_p: float):
+        logger.debug(f"Switching LLM top_p to: [{top_p}]")
+        self._top_p = top_p
+
+    async def set_extra(self, extra: Dict[str, Any]):
+        logger.debug(f"Switching LLM extra to: [{extra}]")
+        self._extra = extra
+
    async def _process_context(self, context: OpenAILLMContext):
        try:
            await self.push_frame(LLMFullResponseStartFrame())
@@ -93,12 +133,21 @@ class TogetherLLMService(LLMService):

            await self.start_ttfb_metrics()

-            stream = await self._client.chat.completions.create(
-                messages=context.messages,
-                model=self._model,
-                max_tokens=self._max_tokens,
-                stream=True,
-            )
+            params = {
+                "messages": context.messages,
+                "model": self.model_name,
+                "max_tokens": self._max_tokens,
+                "stream": True,
+                "frequency_penalty": self._frequency_penalty,
+                "presence_penalty": self._presence_penalty,
+                "temperature": self._temperature,
+                "top_k": self._top_k,
+                "top_p": self._top_p
+            }
+
+            params.update(self._extra)
+
+            stream = await self._client.chat.completions.create(**params)

            # Function calling
            got_first_chunk = False
@@ -108,13 +157,11 @@ class TogetherLLMService(LLMService):
            async for chunk in stream:
                # logger.debug(f"Together LLM event: {chunk}")
                if chunk.usage:
-                    tokens = {
-                        "processor": self.name,
-                        "model": self._model,
-                        "prompt_tokens": chunk.usage.prompt_tokens,
-                        "completion_tokens": chunk.usage.completion_tokens,
-                        "total_tokens": chunk.usage.total_tokens
-                    }
+                    tokens = LLMTokenUsage(
+                        prompt_tokens=chunk.usage.prompt_tokens,
+                        completion_tokens=chunk.usage.completion_tokens,
+                        total_tokens=chunk.usage.total_tokens
+                    )
                    await self.start_llm_usage_metrics(tokens)

                if len(chunk.choices) == 0:
@@ -156,7 +203,7 @@ class TogetherLLMService(LLMService):
            context = TogetherLLMContext.from_messages(frame.messages)
        elif isinstance(frame, LLMModelUpdateFrame):
            logger.debug(f"Switching LLM model to: [{frame.model}]")
-            self._model = frame.model
+            self.set_model_name(frame.model)
        else:
            await self.push_frame(frame, direction)

@@ -173,7 +220,7 @@ class TogetherLLMService(LLMService):
            try:
                arguments = json.loads(args_string)
                await self.call_function(context=context,
-                                         tool_call_id=uuid.uuid4(),
+                                         tool_call_id=str(uuid.uuid4()),
                                         function_name=function_name,
                                         arguments=arguments)
                return
@@ -301,7 +348,8 @@ class TogetherAssistantContextAggregator(LLMAssistantContextAggregator):
                self._function_call_result = None
                self._context.add_message({
                    "role": "tool",
-                    "content": frame.result
+                    # Together expects the content here to be a string, so stringify it
+                    "content": str(frame.result)
                })
                run_llm = True
            else:
--- a/src/pipecat/services/whisper.py
+++ b/src/pipecat/services/whisper.py
@@ -14,7 +14,7 @@ from typing import AsyncGenerator
 import numpy as np

 from pipecat.frames.frames import ErrorFrame, Frame, TranscriptionFrame
-from pipecat.services.ai_services import STTService
+from pipecat.services.ai_services import SegmentedSTTService
 from pipecat.utils.time import time_now_iso8601

 from loguru import logger
@@ -38,7 +38,7 @@ class Model(Enum):
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"


-class WhisperSTTService(STTService):
+class WhisperSTTService(SegmentedSTTService):
    """Class to transcribe audio with a locally-downloaded Whisper model"""

    def __init__(self,
@@ -52,7 +52,7 @@ class WhisperSTTService(STTService):
        super().__init__(**kwargs)
        self._device: str = device
        self._compute_type = compute_type
-        self._model_name: str | Model = model
+        self.set_model_name(model if isinstance(model, str) else model.value)
        self._no_speech_prob = no_speech_prob
        self._model: WhisperModel | None = None
        self._load()
@@ -65,7 +65,7 @@ class WhisperSTTService(STTService):
        this model is being run, it will take time to download."""
        logger.debug("Loading Whisper model...")
        self._model = WhisperModel(
-            self._model_name.value if isinstance(self._model_name, Enum) else self._model_name,
+            self.model_name,
            device=self._device,
            compute_type=self._compute_type)
        logger.debug("Loaded Whisper model")
@@ -77,6 +77,7 @@ class WhisperSTTService(STTService):
            yield ErrorFrame("Whisper model not available")
            return

+        await self.start_processing_metrics()
        await self.start_ttfb_metrics()

        # Divide by 32768 because we have signed 16-bit data.
@@ -88,7 +89,9 @@ class WhisperSTTService(STTService):
            if segment.no_speech_prob < self._no_speech_prob:
                text += f"{segment.text} "

+        await self.stop_ttfb_metrics()
+        await self.stop_processing_metrics()
+
        if text:
-            await self.stop_ttfb_metrics()
            logger.debug(f"Transcription: [{text}]")
            yield TranscriptionFrame(text, "", time_now_iso8601())
--- a/src/pipecat/services/xtts.py
+++ b/src/pipecat/services/xtts.py
@@ -9,10 +9,10 @@ import aiohttp
 from typing import Any, AsyncGenerator, Dict

 from pipecat.frames.frames import (
-    AudioRawFrame,
    ErrorFrame,
    Frame,
    StartFrame,
+    TTSAudioRawFrame,
    TTSStartedFrame,
    TTSStoppedFrame)
 from pipecat.services.ai_services import TTSService
@@ -128,7 +128,7 @@ class XTTSService(TTSService):
                        # Convert the numpy array back to bytes
                        resampled_audio_bytes = resampled_audio.astype(np.int16).tobytes()
                        # Create the frame with the resampled audio
-                        frame = AudioRawFrame(resampled_audio_bytes, 16000, 1)
+                        frame = TTSAudioRawFrame(resampled_audio_bytes, 16000, 1)
                        yield frame

            # Process any remaining data in the buffer
@@ -136,7 +136,7 @@ class XTTSService(TTSService):
                audio_np = np.frombuffer(buffer, dtype=np.int16)
                resampled_audio = resampy.resample(audio_np, 24000, 16000)
                resampled_audio_bytes = resampled_audio.astype(np.int16).tobytes()
-                frame = AudioRawFrame(resampled_audio_bytes, 16000, 1)
+                frame = TTSAudioRawFrame(resampled_audio_bytes, 16000, 1)
                yield frame

            await self.push_frame(TTSStoppedFrame())
--- a/src/pipecat/transcriptions/init.py
+++ b/src/pipecat/transcriptions/init.py
--- a/src/pipecat/transcriptions/language.py
+++ b/src/pipecat/transcriptions/language.py
@@ -0,0 +1,64 @@
+#
+# Copyright (c) 2024, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import sys
+
+from enum import Enum
+
+if sys.version_info < (3, 11):
+    class StrEnum(str, Enum):
+        def __new__(cls, value):
+            obj = str.__new__(cls, value)
+            obj._value_ = value
+            return obj
+else:
+    from enum import StrEnum
+
+
+class Language(StrEnum):
+    BG = "bg"            # Bulgarian
+    CA = "ca"            # Catalan
+    ZH = "zh"            # Chinese simplified
+    ZH_TW = "zh-TW"      # Chinese traditional
+    CS = "cs"            # Czech
+    DA = "da"            # Danish
+    NL = "nl"            # Dutch
+    EN = "en"            # English
+    EN_US = "en-US"      # English (USA)
+    EN_AU = "en-AU"      # English (Australia)
+    EN_GB = "en-GB"      # English (Great Britain)
+    EN_NZ = "en-NZ"      # English (New Zealand)
+    EN_IN = "en-IN"      # English (India)
+    ET = "et"            # Estonian
+    FI = "fi"            # Finnish
+    NL_BE = "nl-BE"      # Flemmish
+    FR = "fr"            # French
+    FR_CA = "fr-CA"      # French (Canada)
+    DE = "de"            # German
+    DE_CH = "de-CH"      # German (Switzerland)
+    EL = "el"            # Greek
+    HI = "hi"            # Hindi
+    HU = "hu"            # Hungarian
+    ID = "id"            # Indonesian
+    IT = "it"            # Italian
+    JA = "ja"            # Japanese
+    KO = "ko"            # Korean
+    LV = "lv"            # Latvian
+    LT = "lt"            # Lithuanian
+    MS = "ms"            # Malay
+    NO = "no"            # Norwegian
+    PL = "pl"            # Polish
+    PT = "pt"            # Portuguese
+    PT_BR = "pt-BR"      # Portuguese (Brazil)
+    RO = "ro"            # Romanian
+    RU = "ru"            # Russian
+    SK = "sk"            # Slovak
+    ES = "es"            # Spanish
+    SV = "sv"            # Swedish
+    TH = "th"            # Thai
+    TR = "tr"            # Turkish
+    UK = "uk"            # Ukrainian
+    VI = "vi"            # Vietnamese
--- a/src/pipecat/transports/base_input.py
+++ b/src/pipecat/transports/base_input.py
@@ -10,9 +10,9 @@ from concurrent.futures import ThreadPoolExecutor

 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
 from pipecat.frames.frames import (
-    AudioRawFrame,
    BotInterruptionFrame,
    CancelFrame,
+    InputAudioRawFrame,
    StartFrame,
    EndFrame,
    Frame,
@@ -31,16 +31,12 @@ from loguru import logger
 class BaseInputTransport(FrameProcessor):

    def __init__(self, params: TransportParams, **kwargs):
-        super().__init__(**kwargs)
+        super().__init__(sync=False, **kwargs)

        self._params = params

        self._executor = ThreadPoolExecutor(max_workers=5)

-        # Create push frame task. This is the task that will push frames in
-        # order. We also guarantee that all frames are pushed in the same task.
-        self._create_push_task()
-
    async def start(self, frame: StartFrame):
        # Create audio input queue and task if needed.
        if self._params.audio_in_enabled or self._params.vad_enabled:
@@ -53,10 +49,6 @@ class BaseInputTransport(FrameProcessor):
            self._audio_task.cancel()
            await self._audio_task

-        # Wait for the push frame task to finish. It will finish when the
-        # EndFrame is actually processed.
-        await self._push_frame_task
-
    async def cancel(self, frame: CancelFrame):
        # Cancel all the tasks and wait for them to finish.

@@ -64,13 +56,10 @@ class BaseInputTransport(FrameProcessor):
            self._audio_task.cancel()
            await self._audio_task

-        self._push_frame_task.cancel()
-        await self._push_frame_task
-
    def vad_analyzer(self) -> VADAnalyzer | None:
        return self._params.vad_analyzer

-    async def push_audio_frame(self, frame: AudioRawFrame):
+    async def push_audio_frame(self, frame: InputAudioRawFrame):
        if self._params.audio_in_enabled or self._params.vad_enabled:
            await self._audio_in_queue.put(frame)

@@ -82,26 +71,25 @@ class BaseInputTransport(FrameProcessor):
        await super().process_frame(frame, direction)

        # Specific system frames
-        if isinstance(frame, CancelFrame):
+        if isinstance(frame, StartFrame):
+            # Push StartFrame before start(), because we want StartFrame to be
+            # processed by every processor before any other frame is processed.
+            await self.push_frame(frame, direction)
+            await self.start(frame)
+        elif isinstance(frame, CancelFrame):
            await self.cancel(frame)
            await self.push_frame(frame, direction)
        elif isinstance(frame, BotInterruptionFrame):
-            await self._handle_interruptions(frame, False)
-        elif isinstance(frame, StartInterruptionFrame):
+            logger.debug("Bot interruption")
            await self._start_interruption()
-        elif isinstance(frame, StopInterruptionFrame):
-            await self._stop_interruption()
        # All other system frames
        elif isinstance(frame, SystemFrame):
            await self.push_frame(frame, direction)
        # Control frames
-        elif isinstance(frame, StartFrame):
-            await self.start(frame)
-            await self._internal_push_frame(frame, direction)
        elif isinstance(frame, EndFrame):
            # Push EndFrame before stop(), because stop() waits on the task to
            # finish and the task finishes when EndFrame is processed.
-            await self._internal_push_frame(frame, direction)
+            await self.push_frame(frame, direction)
            await self.stop(frame)
        elif isinstance(frame, VADParamsUpdateFrame):
            vad_analyzer = self.vad_analyzer()
@@ -109,73 +97,28 @@ class BaseInputTransport(FrameProcessor):
                vad_analyzer.set_params(frame.params)
        # Other frames
        else:
-            await self._internal_push_frame(frame, direction)
-
-    #
-    # Push frames task
-    #
-
-    def _create_push_task(self):
-        loop = self.get_event_loop()
-        self._push_queue = asyncio.Queue()
-        self._push_frame_task = loop.create_task(self._push_frame_task_handler())
-
-    async def _internal_push_frame(
-            self,
-            frame: Frame | None,
-            direction: FrameDirection | None = FrameDirection.DOWNSTREAM):
-        await self._push_queue.put((frame, direction))
-
-    async def _push_frame_task_handler(self):
-        running = True
-        while running:
-            try:
-                (frame, direction) = await self._push_queue.get()
-                await self.push_frame(frame, direction)
-                running = not isinstance(frame, EndFrame)
-                self._push_queue.task_done()
-            except asyncio.CancelledError:
-                break
+            await self.push_frame(frame, direction)

    #
    # Handle interruptions
    #

-    async def _start_interruption(self):
-        if not self.interruptions_allowed:
-            return
-
-        # Cancel the task. This will stop pushing frames downstream.
-        self._push_frame_task.cancel()
-        await self._push_frame_task
-        # Push an out-of-band frame (i.e. not using the ordered push
-        # frame task) to stop everything, specially at the output
-        # transport.
-        await self.push_frame(StartInterruptionFrame())
-        # Create a new queue and task.
-        self._create_push_task()
-
-    async def _stop_interruption(self):
-        if not self.interruptions_allowed:
-            return
-
-        await self.push_frame(StopInterruptionFrame())
-
-    async def _handle_interruptions(self, frame: Frame, push_frame: bool):
+    async def _handle_interruptions(self, frame: Frame):
        if self.interruptions_allowed:
-            # Make sure we notify about interruptions quickly out-of-band
-            if isinstance(frame, BotInterruptionFrame):
-                logger.debug("Bot interruption")
-                await self._start_interruption()
-            elif isinstance(frame, UserStartedSpeakingFrame):
+            # Make sure we notify about interruptions quickly out-of-band.
+            if isinstance(frame, UserStartedSpeakingFrame):
                logger.debug("User started speaking")
                await self._start_interruption()
+                # Push an out-of-band frame (i.e. not using the ordered push
+                # frame task) to stop everything, specially at the output
+                # transport.
+                await self.push_frame(StartInterruptionFrame())
            elif isinstance(frame, UserStoppedSpeakingFrame):
                logger.debug("User stopped speaking")
                await self._stop_interruption()
+                await self.push_frame(StopInterruptionFrame())

-        if push_frame:
-            await self._internal_push_frame(frame)
+        await self.push_frame(frame)

    #
    # Audio input
@@ -199,7 +142,7 @@ class BaseInputTransport(FrameProcessor):
                frame = UserStoppedSpeakingFrame()

            if frame:
-                await self._handle_interruptions(frame, True)
+                await self._handle_interruptions(frame)

            vad_state = new_vad_state
        return vad_state
@@ -208,7 +151,7 @@ class BaseInputTransport(FrameProcessor):
        vad_state: VADState = VADState.QUIET
        while True:
            try:
-                frame: AudioRawFrame = await self._audio_in_queue.get()
+                frame: InputAudioRawFrame = await self._audio_in_queue.get()

                audio_passthrough = True

@@ -220,7 +163,7 @@ class BaseInputTransport(FrameProcessor):

                # Push audio downstream if passthrough.
                if audio_passthrough:
-                    await self._internal_push_frame(frame)
+                    await self.push_frame(frame)

                self._audio_in_queue.task_done()
            except asyncio.CancelledError:
--- a/src/pipecat/transports/base_output.py
+++ b/src/pipecat/transports/base_output.py
@@ -8,38 +8,42 @@
 import asyncio
 import itertools
 import time
+import sys

 from PIL import Image
 from typing import List

 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
 from pipecat.frames.frames import (
-    AudioRawFrame,
    BotSpeakingFrame,
    BotStartedSpeakingFrame,
    BotStoppedSpeakingFrame,
    CancelFrame,
    MetricsFrame,
+    OutputAudioRawFrame,
+    OutputImageRawFrame,
    SpriteFrame,
    StartFrame,
    EndFrame,
    Frame,
-    ImageRawFrame,
    StartInterruptionFrame,
    StopInterruptionFrame,
    SystemFrame,
    TTSStartedFrame,
    TTSStoppedFrame,
+    TextFrame,
    TransportMessageFrame)
 from pipecat.transports.base_transport import TransportParams

 from loguru import logger

+from pipecat.utils.time import nanoseconds_to_seconds
+

 class BaseOutputTransport(FrameProcessor):

    def __init__(self, params: TransportParams, **kwargs):
-        super().__init__(**kwargs)
+        super().__init__(sync=False, **kwargs)

        self._params = params

@@ -64,11 +68,7 @@ class BaseOutputTransport(FrameProcessor):
        # Create sink frame task. This is the task that will actually write
        # audio or video frames. We write audio/video in a task so we can keep
        # generating frames upstream while, for example, the audio is playing.
-        self._create_sink_task()
-
-        # Create push frame task. This is the task that will push frames in
-        # order. We also guarantee that all frames are pushed in the same task.
-        self._create_push_task()
+        self._create_sink_tasks()

    async def start(self, frame: StartFrame):
        # Create camera output queue and task if needed.
@@ -81,6 +81,13 @@ class BaseOutputTransport(FrameProcessor):
            self._audio_out_task = self.get_event_loop().create_task(self._audio_out_task_handler())

    async def stop(self, frame: EndFrame):
+        # At this point we have enqueued an EndFrame and we need to wait for
+        # that EndFrame to be processed by the sink tasks. We also need to wait
+        # for these tasks before cancelling the camera and audio tasks below
+        # because they might be still rendering.
+        await self._sink_task
+        await self._sink_clock_task
+
        # Cancel and wait for the camera output task to finish.
        if self._params.camera_out_enabled:
            self._camera_out_task.cancel()
@@ -91,23 +98,23 @@ class BaseOutputTransport(FrameProcessor):
            self._audio_out_task.cancel()
            await self._audio_out_task

-        # Wait for the push frame and sink tasks to finish. They will finish when
-        # the EndFrame is actually processed.
-        await self._push_frame_task
-        await self._sink_task
-
    async def cancel(self, frame: CancelFrame):
-        # Cancel all the tasks and wait for them to finish.
+        # Since we are cancelling everything it doesn't matter if we cancel sink
+        # tasks first or not.
+        self._sink_task.cancel()
+        self._sink_clock_task.cancel()
+        await self._sink_task
+        await self._sink_clock_task

+        # Cancel and wait for the camera output task to finish.
        if self._params.camera_out_enabled:
            self._camera_out_task.cancel()
            await self._camera_out_task

-        self._push_frame_task.cancel()
-        await self._push_frame_task
-
-        self._sink_task.cancel()
-        await self._sink_task
+        # Cancel and wait for the audio output task to finish.
+        if self._params.audio_out_enabled and self._params.audio_out_is_live:
+            self._audio_out_task.cancel()
+            await self._audio_out_task

    async def send_message(self, frame: TransportMessageFrame):
        pass
@@ -115,7 +122,7 @@ class BaseOutputTransport(FrameProcessor):
    async def send_metrics(self, frame: MetricsFrame):
        pass

-    async def write_frame_to_camera(self, frame: ImageRawFrame):
+    async def write_frame_to_camera(self, frame: OutputImageRawFrame):
        pass

    async def write_raw_audio_frames(self, frames: bytes):
@@ -133,9 +140,14 @@ class BaseOutputTransport(FrameProcessor):
        # immediately. Other frames require order so they are put in the sink
        # queue.
        #
-        if isinstance(frame, CancelFrame):
+        if isinstance(frame, StartFrame):
+            # Push StartFrame before start(), because we want StartFrame to be
+            # processed by every processor before any other frame is processed.
            await self.push_frame(frame, direction)
+            await self.start(frame)
+        elif isinstance(frame, CancelFrame):
            await self.cancel(frame)
+            await self.push_frame(frame, direction)
        elif isinstance(frame, StartInterruptionFrame) or isinstance(frame, StopInterruptionFrame):
            await self.push_frame(frame, direction)
            await self._handle_interruptions(frame)
@@ -145,19 +157,20 @@ class BaseOutputTransport(FrameProcessor):
        elif isinstance(frame, SystemFrame):
            await self.push_frame(frame, direction)
        # Control frames.
-        elif isinstance(frame, StartFrame):
-            await self._sink_queue.put(frame)
-            await self.start(frame)
        elif isinstance(frame, EndFrame):
+            await self._sink_clock_queue.put((sys.maxsize, frame.id, frame))
            await self._sink_queue.put(frame)
            await self.stop(frame)
        # Other frames.
-        elif isinstance(frame, AudioRawFrame):
+        elif isinstance(frame, OutputAudioRawFrame):
            await self._handle_audio(frame)
-        elif isinstance(frame, ImageRawFrame) or isinstance(frame, SpriteFrame):
+        elif isinstance(frame, OutputImageRawFrame) or isinstance(frame, SpriteFrame):
            await self._handle_image(frame)
        elif isinstance(frame, TransportMessageFrame) and frame.urgent:
            await self.send_message(frame)
+        # TODO(aleix): Images and audio should support presentation timestamps.
+        elif frame.pts:
+            await self._sink_clock_queue.put((frame.pts, frame.id, frame))
        else:
            await self._sink_queue.put(frame)

@@ -166,19 +179,19 @@ class BaseOutputTransport(FrameProcessor):
            return

        if isinstance(frame, StartInterruptionFrame):
-            # Stop sink task.
+            # Stop sink tasks.
            self._sink_task.cancel()
            await self._sink_task
-            self._create_sink_task()
-            # Stop push task.
-            self._push_frame_task.cancel()
-            await self._push_frame_task
-            self._create_push_task()
+            # Stop sink clock tasks.
+            self._sink_clock_task.cancel()
+            await self._sink_clock_task
+            # Create sink tasks.
+            self._create_sink_tasks()
            # Let's send a bot stopped speaking if we have to.
            if self._bot_speaking:
                await self._bot_stopped_speaking()

-    async def _handle_audio(self, frame: AudioRawFrame):
+    async def _handle_audio(self, frame: OutputAudioRawFrame):
        if not self._params.audio_out_enabled:
            return

@@ -187,12 +200,14 @@ class BaseOutputTransport(FrameProcessor):
        else:
            self._audio_buffer.extend(frame.audio)
            while len(self._audio_buffer) >= self._audio_chunk_size:
-                chunk = AudioRawFrame(bytes(self._audio_buffer[:self._audio_chunk_size]),
-                                      sample_rate=frame.sample_rate, num_channels=frame.num_channels)
+                chunk = OutputAudioRawFrame(
+                    bytes(self._audio_buffer[:self._audio_chunk_size]),
+                    sample_rate=frame.sample_rate, num_channels=frame.num_channels
+                )
                await self._sink_queue.put(chunk)
                self._audio_buffer = self._audio_buffer[self._audio_chunk_size:]

-    async def _handle_image(self, frame: ImageRawFrame | SpriteFrame):
+    async def _handle_image(self, frame: OutputImageRawFrame | SpriteFrame):
        if not self._params.camera_out_enabled:
            return

@@ -201,87 +216,101 @@ class BaseOutputTransport(FrameProcessor):
        else:
            await self._sink_queue.put(frame)

-    def _create_sink_task(self):
+    #
+    # Sink tasks
+    #
+
+    def _create_sink_tasks(self):
        loop = self.get_event_loop()
        self._sink_queue = asyncio.Queue()
        self._sink_task = loop.create_task(self._sink_task_handler())
+        self._sink_clock_queue = asyncio.PriorityQueue()
+        self._sink_clock_task = loop.create_task(self._sink_clock_task_handler())
+
+    async def _sink_frame_handler(self, frame: Frame):
+        if isinstance(frame, OutputAudioRawFrame):
+            await self.write_raw_audio_frames(frame.audio)
+            await self.push_frame(frame)
+            await self.push_frame(BotSpeakingFrame(), FrameDirection.UPSTREAM)
+        elif isinstance(frame, OutputImageRawFrame):
+            await self._set_camera_image(frame)
+        elif isinstance(frame, SpriteFrame):
+            await self._set_camera_images(frame.images)
+        elif isinstance(frame, TransportMessageFrame):
+            await self.send_message(frame)
+        elif isinstance(frame, TTSStartedFrame):
+            await self._bot_started_speaking()
+            await self.push_frame(frame)
+        elif isinstance(frame, TTSStoppedFrame):
+            await self._bot_stopped_speaking()
+            await self.push_frame(frame)
+        else:
+            await self.push_frame(frame)

    async def _sink_task_handler(self):
        running = True
        while running:
            try:
                frame = await self._sink_queue.get()
-                if isinstance(frame, AudioRawFrame):
-                    await self.write_raw_audio_frames(frame.audio)
-                    await self._internal_push_frame(frame)
-                    await self.push_frame(BotSpeakingFrame(), FrameDirection.UPSTREAM)
-                elif isinstance(frame, ImageRawFrame):
-                    await self._set_camera_image(frame)
-                elif isinstance(frame, SpriteFrame):
-                    await self._set_camera_images(frame.images)
-                elif isinstance(frame, TransportMessageFrame):
-                    await self.send_message(frame)
-                elif isinstance(frame, TTSStartedFrame):
-                    await self._bot_started_speaking()
-                    await self._internal_push_frame(frame)
-                elif isinstance(frame, TTSStoppedFrame):
-                    await self._bot_stopped_speaking()
-                    await self._internal_push_frame(frame)
-                else:
-                    await self._internal_push_frame(frame)
-
+                await self._sink_frame_handler(frame)
                running = not isinstance(frame, EndFrame)
-
                self._sink_queue.task_done()
            except asyncio.CancelledError:
                break
            except Exception as e:
                logger.exception(f"{self} error processing sink queue: {e}")

+    async def _sink_clock_frame_handler(self, frame: Frame):
+        # TODO(aleix): For now we just process TextFrame. But we should process
+        # audio and video as well.
+        if isinstance(frame, TextFrame):
+            await self.push_frame(frame)
+
+    async def _sink_clock_task_handler(self):
+        running = True
+        while running:
+            try:
+                timestamp, _, frame = await self._sink_clock_queue.get()
+
+                # If we hit an EndFrame, we can finish right away.
+                running = not isinstance(frame, EndFrame)
+
+                # If we have a frame we check it's presentation timestamp. If it
+                # has already passed we process it, otherwise we wait until it's
+                # time to process it.
+                if running:
+                    current_time = self.get_clock().get_time()
+                    if timestamp <= current_time:
+                        await self._sink_clock_frame_handler(frame)
+                    else:
+                        wait_time = nanoseconds_to_seconds(timestamp - current_time)
+                        await asyncio.sleep(wait_time)
+                        await self._sink_frame_handler(frame)
+
+                self._sink_clock_queue.task_done()
+            except asyncio.CancelledError:
+                break
+            except Exception as e:
+                logger.exception(f"{self} error processing sink clock queue: {e}")
+
    async def _bot_started_speaking(self):
        logger.debug("Bot started speaking")
        self._bot_speaking = True
-        await self._internal_push_frame(BotStartedSpeakingFrame(), FrameDirection.UPSTREAM)
+        await self.push_frame(BotStartedSpeakingFrame(), FrameDirection.UPSTREAM)

    async def _bot_stopped_speaking(self):
        logger.debug("Bot stopped speaking")
        self._bot_speaking = False
-        await self._internal_push_frame(BotStoppedSpeakingFrame(), FrameDirection.UPSTREAM)
-
-    #
-    # Push frames task
-    #
-
-    def _create_push_task(self):
-        loop = self.get_event_loop()
-        self._push_queue = asyncio.Queue()
-        self._push_frame_task = loop.create_task(self._push_frame_task_handler())
-
-    async def _internal_push_frame(
-            self,
-            frame: Frame | None,
-            direction: FrameDirection | None = FrameDirection.DOWNSTREAM):
-        await self._push_queue.put((frame, direction))
-
-    async def _push_frame_task_handler(self):
-        running = True
-        while running:
-            try:
-                (frame, direction) = await self._push_queue.get()
-                await self.push_frame(frame, direction)
-                running = not isinstance(frame, EndFrame)
-                self._push_queue.task_done()
-            except asyncio.CancelledError:
-                break
+        await self.push_frame(BotStoppedSpeakingFrame(), FrameDirection.UPSTREAM)

    #
    # Camera out
    #

-    async def send_image(self, frame: ImageRawFrame | SpriteFrame):
+    async def send_image(self, frame: OutputImageRawFrame | SpriteFrame):
        await self.process_frame(frame, FrameDirection.DOWNSTREAM)

-    async def _draw_image(self, frame: ImageRawFrame):
+    async def _draw_image(self, frame: OutputImageRawFrame):
        desired_size = (self._params.camera_out_width, self._params.camera_out_height)

        if frame.size != desired_size:
@@ -289,14 +318,17 @@ class BaseOutputTransport(FrameProcessor):
            resized_image = image.resize(desired_size)
            logger.warning(
                f"{frame} does not have the expected size {desired_size}, resizing")
-            frame = ImageRawFrame(resized_image.tobytes(), resized_image.size, resized_image.format)
+            frame = OutputImageRawFrame(
+                resized_image.tobytes(),
+                resized_image.size,
+                resized_image.format)

        await self.write_frame_to_camera(frame)

-    async def _set_camera_image(self, image: ImageRawFrame):
+    async def _set_camera_image(self, image: OutputImageRawFrame):
        self._camera_images = itertools.cycle([image])

-    async def _set_camera_images(self, images: List[ImageRawFrame]):
+    async def _set_camera_images(self, images: List[OutputImageRawFrame]):
        self._camera_images = itertools.cycle(images)

    async def _camera_out_task_handler(self):
@@ -311,9 +343,9 @@ class BaseOutputTransport(FrameProcessor):
                elif self._camera_images:
                    image = next(self._camera_images)
                    await self._draw_image(image)
-                    await asyncio.sleep(1.0 / self._params.camera_out_framerate)
+                    await asyncio.sleep(self._camera_out_frame_duration)
                else:
-                    await asyncio.sleep(1.0 / self._params.camera_out_framerate)
+                    await asyncio.sleep(self._camera_out_frame_duration)
            except asyncio.CancelledError:
                break
            except Exception as e:
@@ -348,7 +380,7 @@ class BaseOutputTransport(FrameProcessor):
    # Audio out
    #

-    async def send_audio(self, frame: AudioRawFrame):
+    async def send_audio(self, frame: OutputAudioRawFrame):
        await self.process_frame(frame, FrameDirection.DOWNSTREAM)

    async def _audio_out_task_handler(self):
@@ -356,7 +388,7 @@ class BaseOutputTransport(FrameProcessor):
            try:
                frame = await self._audio_out_queue.get()
                await self.write_raw_audio_frames(frame.audio)
-                await self._internal_push_frame(frame)
+                await self.push_frame(frame)
                await self.push_frame(BotSpeakingFrame(), FrameDirection.UPSTREAM)
            except asyncio.CancelledError:
                break
--- a/src/pipecat/transports/base_transport.py
+++ b/src/pipecat/transports/base_transport.py
@@ -32,6 +32,7 @@ class TransportParams(BaseModel):
    audio_out_is_live: bool = False
    audio_out_sample_rate: int = 16000
    audio_out_channels: int = 1
+    audio_out_bitrate: int = 96000
    audio_in_enabled: bool = False
    audio_in_sample_rate: int = 16000
    audio_in_channels: int = 1
--- a/src/pipecat/transports/local/audio.py
+++ b/src/pipecat/transports/local/audio.py
@@ -8,7 +8,7 @@ import asyncio

 from concurrent.futures import ThreadPoolExecutor

-from pipecat.frames.frames import AudioRawFrame, StartFrame
+from pipecat.frames.frames import InputAudioRawFrame, StartFrame
 from pipecat.processors.frame_processor import FrameProcessor
 from pipecat.transports.base_input import BaseInputTransport
 from pipecat.transports.base_output import BaseOutputTransport
@@ -54,9 +54,9 @@ class LocalAudioInputTransport(BaseInputTransport):
        self._in_stream.close()

    def _audio_in_callback(self, in_data, frame_count, time_info, status):
-        frame = AudioRawFrame(audio=in_data,
-                              sample_rate=self._params.audio_in_sample_rate,
-                              num_channels=self._params.audio_in_channels)
+        frame = InputAudioRawFrame(audio=in_data,
+                                   sample_rate=self._params.audio_in_sample_rate,
+                                   num_channels=self._params.audio_in_channels)

        asyncio.run_coroutine_threadsafe(self.push_audio_frame(frame), self.get_event_loop())

--- a/src/pipecat/transports/local/tk.py
+++ b/src/pipecat/transports/local/tk.py
@@ -11,8 +11,7 @@ from concurrent.futures import ThreadPoolExecutor
 import numpy as np
 import tkinter as tk

-from pipecat.frames.frames import AudioRawFrame, ImageRawFrame, StartFrame
-from pipecat.processors.frame_processor import FrameProcessor
+from pipecat.frames.frames import InputAudioRawFrame, OutputImageRawFrame, StartFrame
 from pipecat.transports.base_input import BaseInputTransport
 from pipecat.transports.base_output import BaseOutputTransport
 from pipecat.transports.base_transport import BaseTransport, TransportParams
@@ -64,9 +63,9 @@ class TkInputTransport(BaseInputTransport):
        self._in_stream.close()

    def _audio_in_callback(self, in_data, frame_count, time_info, status):
-        frame = AudioRawFrame(audio=in_data,
-                              sample_rate=self._params.audio_in_sample_rate,
-                              num_channels=self._params.audio_in_channels)
+        frame = InputAudioRawFrame(audio=in_data,
+                                   sample_rate=self._params.audio_in_sample_rate,
+                                   num_channels=self._params.audio_in_channels)

        asyncio.run_coroutine_threadsafe(self.push_audio_frame(frame), self.get_event_loop())

@@ -108,10 +107,10 @@ class TkOutputTransport(BaseOutputTransport):
    async def write_raw_audio_frames(self, frames: bytes):
        await self.get_event_loop().run_in_executor(self._executor, self._out_stream.write, frames)

-    async def write_frame_to_camera(self, frame: ImageRawFrame):
+    async def write_frame_to_camera(self, frame: OutputImageRawFrame):
        self.get_event_loop().call_soon(self._write_frame_to_tk, frame)

-    def _write_frame_to_tk(self, frame: ImageRawFrame):
+    def _write_frame_to_tk(self, frame: OutputImageRawFrame):
        width = frame.size[0]
        height = frame.size[1]
        data = f"P6 {width} {height} 255 ".encode() + frame.image
@@ -141,12 +140,12 @@ class TkLocalTransport(BaseTransport):
    # BaseTransport
    #

-    def input(self) -> FrameProcessor:
+    def input(self) -> TkInputTransport:
        if not self._input:
            self._input = TkInputTransport(self._pyaudio, self._params)
        return self._input

-    def output(self) -> FrameProcessor:
+    def output(self) -> TkOutputTransport:
        if not self._output:
            self._output = TkOutputTransport(self._tk_root, self._pyaudio, self._params)
        return self._output
--- a/src/pipecat/transports/network/fastapi_websocket.py
+++ b/src/pipecat/transports/network/fastapi_websocket.py
@@ -12,8 +12,16 @@ import wave
 from typing import Awaitable, Callable
 from pydantic.main import BaseModel

-from pipecat.frames.frames import AudioRawFrame, CancelFrame, EndFrame, StartFrame
-from pipecat.processors.frame_processor import FrameProcessor
+from pipecat.frames.frames import (
+    AudioRawFrame,
+    CancelFrame,
+    EndFrame,
+    Frame,
+    InputAudioRawFrame,
+    StartFrame,
+    StartInterruptionFrame
+)
+from pipecat.processors.frame_processor import FrameDirection
 from pipecat.serializers.base_serializer import FrameSerializer
 from pipecat.transports.base_input import BaseInputTransport
 from pipecat.transports.base_output import BaseOutputTransport
@@ -79,7 +87,11 @@ class FastAPIWebsocketInputTransport(BaseInputTransport):
                continue

            if isinstance(frame, AudioRawFrame):
-                await self.push_audio_frame(frame)
+                await self.push_audio_frame(InputAudioRawFrame(
+                    audio=frame.audio,
+                    sample_rate=frame.sample_rate,
+                    num_channels=frame.num_channels)
+                )

        await self._callbacks.on_client_disconnected(self._websocket)

@@ -93,11 +105,18 @@ class FastAPIWebsocketOutputTransport(BaseOutputTransport):
        self._params = params
        self._websocket_audio_buffer = bytes()

+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, StartInterruptionFrame):
+            await self._write_frame(frame)
+
    async def write_raw_audio_frames(self, frames: bytes):
        self._websocket_audio_buffer += frames
-        while len(self._websocket_audio_buffer) >= self._params.audio_frame_size:
+        while len(self._websocket_audio_buffer):
            frame = AudioRawFrame(
-                audio=self._websocket_audio_buffer[:self._params.audio_frame_size],
+                audio=self._websocket_audio_buffer[:
+                                                   self._params.audio_frame_size],
                sample_rate=self._params.audio_out_sample_rate,
                num_channels=self._params.audio_out_channels
            )
@@ -121,7 +140,13 @@ class FastAPIWebsocketOutputTransport(BaseOutputTransport):
            if payload and self._websocket.client_state == WebSocketState.CONNECTED:
                await self._websocket.send_text(payload)

-            self._websocket_audio_buffer = self._websocket_audio_buffer[self._params.audio_frame_size:]
+            self._websocket_audio_buffer = self._websocket_audio_buffer[
+                self._params.audio_frame_size:]
+
+    async def _write_frame(self, frame: Frame):
+        payload = self._params.serializer.serialize(frame)
+        if payload and self._websocket.client_state == WebSocketState.CONNECTED:
+            await self._websocket.send_text(payload)


 class FastAPIWebsocketTransport(BaseTransport):
@@ -151,10 +176,10 @@ class FastAPIWebsocketTransport(BaseTransport):
        self._register_event_handler("on_client_connected")
        self._register_event_handler("on_client_disconnected")

-    def input(self) -> FrameProcessor:
+    def input(self) -> FastAPIWebsocketInputTransport:
        return self._input

-    def output(self) -> FrameProcessor:
+    def output(self) -> FastAPIWebsocketOutputTransport:
        return self._output

    async def _on_client_connected(self, websocket):
--- a/src/pipecat/transports/network/websocket_server.py
+++ b/src/pipecat/transports/network/websocket_server.py
@@ -11,8 +11,7 @@ import wave
 from typing import Awaitable, Callable
 from pydantic.main import BaseModel

-from pipecat.frames.frames import AudioRawFrame, CancelFrame, EndFrame, StartFrame
-from pipecat.processors.frame_processor import FrameProcessor
+from pipecat.frames.frames import AudioRawFrame, CancelFrame, EndFrame, InputAudioRawFrame, StartFrame
 from pipecat.serializers.base_serializer import FrameSerializer
 from pipecat.serializers.protobuf import ProtobufFrameSerializer
 from pipecat.transports.base_input import BaseInputTransport
@@ -98,9 +97,13 @@ class WebsocketServerInputTransport(BaseInputTransport):
                continue

            if isinstance(frame, AudioRawFrame):
-                await self.push_audio_frame(frame)
+                await self.push_audio_frame(InputAudioRawFrame(
+                    audio=frame.audio,
+                    sample_rate=frame.sample_rate,
+                    num_channels=frame.num_channels)
+                )
            else:
-                await self._internal_push_frame(frame)
+                await self.push_frame(frame)

        # Notify disconnection
        await self._callbacks.on_client_disconnected(websocket)
@@ -190,13 +193,13 @@ class WebsocketServerTransport(BaseTransport):
        self._register_event_handler("on_client_connected")
        self._register_event_handler("on_client_disconnected")

-    def input(self) -> FrameProcessor:
+    def input(self) -> WebsocketServerInputTransport:
        if not self._input:
            self._input = WebsocketServerInputTransport(
                self._host, self._port, self._params, self._callbacks, name=self._input_name)
        return self._input

-    def output(self) -> FrameProcessor:
+    def output(self) -> WebsocketServerOutputTransport:
        if not self._output:
            self._output = WebsocketServerOutputTransport(self._params, name=self._output_name)
        return self._output
--- a/src/pipecat/transports/services/daily.py
+++ b/src/pipecat/transports/services/daily.py
@@ -22,20 +22,23 @@ from daily import (
 from pydantic.main import BaseModel

 from pipecat.frames.frames import (
-    AudioRawFrame,
    CancelFrame,
    EndFrame,
    Frame,
-    ImageRawFrame,
+    InputAudioRawFrame,
    InterimTranscriptionFrame,
    MetricsFrame,
+    OutputAudioRawFrame,
+    OutputImageRawFrame,
    SpriteFrame,
    StartFrame,
    TranscriptionFrame,
    TransportMessageFrame,
    UserImageRawFrame,
    UserImageRequestFrame)
+from pipecat.metrics.metrics import LLMUsageMetricsData, ProcessingMetricsData, TTFBMetricsData, TTSUsageMetricsData
 from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
+from pipecat.transcriptions.language import Language
 from pipecat.transports.base_input import BaseInputTransport
 from pipecat.transports.base_output import BaseOutputTransport
 from pipecat.transports.base_transport import BaseTransport, TransportParams
@@ -238,7 +241,7 @@ class DailyTransportClient(EventHandler):
            completion=completion_callback(future))
        await future

-    async def read_next_audio_frame(self) -> AudioRawFrame | None:
+    async def read_next_audio_frame(self) -> InputAudioRawFrame | None:
        if not self._speaker:
            return None

@@ -251,7 +254,10 @@ class DailyTransportClient(EventHandler):
        audio = await future

        if len(audio) > 0:
-            return AudioRawFrame(audio=audio, sample_rate=sample_rate, num_channels=num_channels)
+            return InputAudioRawFrame(
+                audio=audio,
+                sample_rate=sample_rate,
+                num_channels=num_channels)
        else:
            # If we don't read any audio it could be there's no participant
            # connected. daily-python will return immediately if that's the
@@ -267,7 +273,7 @@ class DailyTransportClient(EventHandler):
        self._mic.write_frames(frames, completion=completion_callback(future))
        await future

-    async def write_frame_to_camera(self, frame: ImageRawFrame):
+    async def write_frame_to_camera(self, frame: OutputImageRawFrame):
        if not self._camera:
            return None

@@ -305,7 +311,7 @@ class DailyTransportClient(EventHandler):
                if self._token and self._params.transcription_enabled:
                    await self._start_transcription()

-                await self._callbacks.on_joined(data["participants"]["local"])
+                await self._callbacks.on_joined(data)
            else:
                error_msg = f"Error joining {self._room_url}: {error}"
                logger.error(error_msg)
@@ -365,6 +371,12 @@ class DailyTransportClient(EventHandler):
                                }
                            },
                        }
+                    },
+                    "microphone": {
+                        "sendSettings": {
+                            "channelConfig": "stereo" if self._params.audio_out_channels == 2 else "mono",
+                            "bitrate": self._params.audio_out_bitrate,
+                        }
                    }
                },
            })
@@ -618,11 +630,11 @@ class DailyInputTransport(BaseInputTransport):
    #

    async def push_transcription_frame(self, frame: TranscriptionFrame | InterimTranscriptionFrame):
-        await self._internal_push_frame(frame)
+        await self.push_frame(frame)

    async def push_app_message(self, message: Any, sender: str):
        frame = DailyTransportMessageFrame(message=message, participant_id=sender)
-        await self._internal_push_frame(frame)
+        await self.push_frame(frame)

    #
    # Audio in
@@ -685,7 +697,7 @@ class DailyInputTransport(BaseInputTransport):
                image=buffer,
                size=size,
                format=format)
-            await self._internal_push_frame(frame)
+            await self.push_frame(frame)

        self._video_renderers[participant_id]["timestamp"] = curr_time

@@ -724,14 +736,23 @@ class DailyOutputTransport(BaseOutputTransport):

    async def send_metrics(self, frame: MetricsFrame):
        metrics = {}
-        if frame.ttfb:
-            metrics["ttfb"] = frame.ttfb
-        if frame.processing:
-            metrics["processing"] = frame.processing
-        if frame.tokens:
-            metrics["tokens"] = frame.tokens
-        if frame.characters:
-            metrics["characters"] = frame.characters
+        for d in frame.data:
+            if isinstance(d, TTFBMetricsData):
+                if "ttfb" not in metrics:
+                    metrics["ttfb"] = []
+                metrics["ttfb"].append(d.model_dump(exclude_none=True))
+            elif isinstance(d, ProcessingMetricsData):
+                if "processing" not in metrics:
+                    metrics["processing"] = []
+                metrics["processing"].append(d.model_dump(exclude_none=True))
+            elif isinstance(d, LLMUsageMetricsData):
+                if "tokens" not in metrics:
+                    metrics["tokens"] = []
+                metrics["tokens"].append(d.value.model_dump(exclude_none=True))
+            elif isinstance(d, TTSUsageMetricsData):
+                if "characters" not in metrics:
+                    metrics["characters"] = []
+                metrics["characters"].append(d.model_dump(exclude_none=True))

        message = DailyTransportMessageFrame(message={
            "type": "pipecat-metrics",
@@ -742,7 +763,7 @@ class DailyOutputTransport(BaseOutputTransport):
    async def write_raw_audio_frames(self, frames: bytes):
        await self._client.write_raw_audio_frames(frames)

-    async def write_frame_to_camera(self, frame: ImageRawFrame):
+    async def write_frame_to_camera(self, frame: OutputImageRawFrame):
        await self._client.write_frame_to_camera(frame)


@@ -804,12 +825,12 @@ class DailyTransport(BaseTransport):
    # BaseTransport
    #

-    def input(self) -> FrameProcessor:
+    def input(self) -> DailyInputTransport:
        if not self._input:
            self._input = DailyInputTransport(self._client, self._params, name=self._input_name)
        return self._input

-    def output(self) -> FrameProcessor:
+    def output(self) -> DailyOutputTransport:
        if not self._output:
            self._output = DailyOutputTransport(self._client, self._params, name=self._output_name)
        return self._output
@@ -822,11 +843,11 @@ class DailyTransport(BaseTransport):
    def participant_id(self) -> str:
        return self._client.participant_id

-    async def send_image(self, frame: ImageRawFrame | SpriteFrame):
+    async def send_image(self, frame: OutputImageRawFrame | SpriteFrame):
        if self._output:
            await self._output.process_frame(frame, FrameDirection.DOWNSTREAM)

-    async def send_audio(self, frame: AudioRawFrame):
+    async def send_audio(self, frame: OutputAudioRawFrame):
        if self._output:
            await self._output.process_frame(frame, FrameDirection.DOWNSTREAM)

@@ -864,8 +885,8 @@ class DailyTransport(BaseTransport):
            self._input.capture_participant_video(
                participant_id, framerate, video_source, color_format)

-    async def _on_joined(self, participant):
-        await self._call_event_handler("on_joined", participant)
+    async def _on_joined(self, data):
+        await self._call_event_handler("on_joined", data)

    async def _on_left(self):
        await self._call_event_handler("on_left")
@@ -950,11 +971,16 @@ class DailyTransport(BaseTransport):
        text = message["text"]
        timestamp = message["timestamp"]
        is_final = message["rawResponse"]["is_final"]
+        try:
+            language = message["rawResponse"]["channel"]["alternatives"][0]["languages"][0]
+            language = Language(language)
+        except KeyError:
+            language = None
        if is_final:
-            frame = TranscriptionFrame(text, participant_id, timestamp)
+            frame = TranscriptionFrame(text, participant_id, timestamp, language)
            logger.debug(f"Transcription (from: {participant_id}): [{text}]")
        else:
-            frame = InterimTranscriptionFrame(text, participant_id, timestamp)
+            frame = InterimTranscriptionFrame(text, participant_id, timestamp, language)

        if self._input:
            await self._input.push_transcription_frame(frame)
--- a/src/pipecat/utils/time.py
+++ b/src/pipecat/utils/time.py
@@ -9,3 +9,20 @@ import datetime

 def time_now_iso8601() -> str:
    return datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="milliseconds")
+
+
+def seconds_to_nanoseconds(seconds: float) -> int:
+    return int(seconds * 1_000_000_000)
+
+
+def nanoseconds_to_seconds(nanoseconds: int) -> float:
+    return nanoseconds / 1_000_000_000
+
+
+def nanoseconds_to_str(nanoseconds: int) -> str:
+    total_seconds = nanoseconds_to_seconds(nanoseconds)
+    hours = int(total_seconds // 3600)
+    minutes = int((total_seconds % 3600) // 60)
+    seconds = int(total_seconds % 60)
+    microseconds = int((total_seconds - int(total_seconds)) * 1_000_000)
+    return f"{hours}:{minutes:02}:{seconds:02}.{microseconds:06}"
--- a/src/pipecat/utils/utils.py
+++ b/src/pipecat/utils/utils.py
@@ -3,32 +3,39 @@
 #
 # SPDX-License-Identifier: BSD 2-Clause License
 #
+import collections
+import itertools

-from threading import Lock
-
-_COUNTS = {}
-_COUNTS_MUTEX = Lock()
-
-_ID = 0
-_ID_MUTEX = Lock()
+_COUNTS = collections.defaultdict(itertools.count)
+_ID = itertools.count()


 def obj_id() -> int:
-    global _ID, _ID_MUTEX
-    with _ID_MUTEX:
-        _ID += 1
-        return _ID
+    """
+    Generate a unique id for an object.
+
+    >>> obj_id()
+    0
+    >>> obj_id()
+    1
+    >>> obj_id()
+    2
+    """
+    return next(_ID)


 def obj_count(obj) -> int:
-    global _COUNTS, COUNTS_MUTEX
-    name = obj.__class__.__name__
-    with _COUNTS_MUTEX:
-        if name not in _COUNTS:
-            _COUNTS[name] = 0
-        else:
-            _COUNTS[name] += 1
-        return _COUNTS[name]
+    """Generate a unique id for an object.
+
+    >>> obj_count(object())
+    0
+    >>> obj_count(object())
+    1
+    >>> new_type = type('NewType', (object,), {})
+    >>> obj_count(new_type())
+    0
+    """
+    return next(_COUNTS[obj.__class__.__name__])


 def exp_smoothing(value: float, prev_value: float, factor: float) -> float:
--- a/test-requirements.txt
+++ b/test-requirements.txt
@@ -0,0 +1,35 @@
+aiohttp~=3.10.3
+anthropic
+autopep8~=2.3.1
+azure-cognitiveservices-speech~=1.40.0
+build~=1.2.1
+daily-python~=0.10.1
+deepgram-sdk~=3.5.0
+fal-client~=0.4.1
+fastapi~=0.112.1
+faster-whisper~=1.0.3
+google-generativeai~=0.7.2
+grpcio-tools~=1.62.2
+langchain~=0.2.14
+livekit~=0.13.1
+lmnt~=1.1.4
+loguru~=0.7.2
+numpy~=1.26.4
+openai~=1.37.2
+openpipe~=4.24.0
+Pillow~=10.4.0
+pip-tools~=7.4.1
+pyaudio~=0.2.14
+pydantic~=2.8.2
+pyloudnorm~=0.1.1
+pyht~=0.0.28
+pyright~=1.1.376
+pytest~=8.3.2
+python-dotenv~=1.0.1
+resampy~=0.4.3
+setuptools~=72.2.0
+setuptools_scm~=8.1.0
+silero-vad~=5.1
+together~=1.2.7
+transformers~=4.44.0
+websockets~=12.0
--- a/tests/integration/integration_azure_llm.py
+++ b/tests/integration/integration_azure_llm.py
@@ -1,14 +1,19 @@
+import unittest
+
 import asyncio
 import os
-from pipecat.pipeline.openai_frames import OpenAILLMContextFrame
-from pipecat.services.azure_ai_services import AzureLLMService
-from pipecat.services.openai_llm_context import OpenAILLMContext
+from pipecat.processors.aggregators.openai_llm_context import (
+    OpenAILLMContext,
+    OpenAILLMContextFrame
+)
+from pipecat.services.azure import AzureLLMService

 from openai.types.chat import (
    ChatCompletionSystemMessageParam,
 )

 if __name__ == "__main__":
+    @unittest.skip("Skip azure integration test")
    async def test_chat():
        llm = AzureLLMService(
            api_key=os.getenv("AZURE_CHATGPT_API_KEY"),
--- a/tests/integration/integration_ollama_llm.py
+++ b/tests/integration/integration_ollama_llm.py
@@ -1,13 +1,18 @@
+import unittest
+
 import asyncio
-from pipecat.pipeline.openai_frames import OpenAILLMContextFrame
-from pipecat.services.openai_llm_context import OpenAILLMContext
+from pipecat.processors.aggregators.openai_llm_context import (
+    OpenAILLMContext,
+    OpenAILLMContextFrame
+)

 from openai.types.chat import (
    ChatCompletionSystemMessageParam,
 )
-from pipecat.services.ollama_ai_services import OLLamaLLMService
+from pipecat.services.ollama import OLLamaLLMService

 if __name__ == "__main__":
+    @unittest.skip("Skip azure integration test")
    async def test_chat():
        llm = OLLamaLLMService()
        context = OpenAILLMContext()
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Mark Backman	6138790448	Merge pull request #487 from pipecat-ai/mb/llm-rtvi-service-option Add control frames for LLM config params	2024-09-21 18:26:17 -04:00
Mark Backman	9e27a8aad0	Add control frames for LLM param updates	2024-09-21 00:02:58 -04:00
Mark Backman	c73111afea	Add extra input param to LLMs	2024-09-21 00:01:25 -04:00
Kwindla Hultman Kramer	26a64afd8d	Merge pull request #485 from pipecat-ai/khk/metrics-model-exclude-none fixup for serialization issue	2024-09-20 18:24:19 -07:00
Kwindla Hultman Kramer	78a3f081de	fixup for serialization issue	2024-09-20 18:21:06 -07:00
Mark Backman	e8f8a49646	Merge pull request #484 from pipecat-ai/mb/llm-input-params Add input params for OpenAI, Anthropic, Together AI LLMs	2024-09-20 20:35:49 -04:00
Mark Backman	219304c5ee	Added Changelog entries	2024-09-20 20:31:42 -04:00
Mark Backman	f3fd312b83	Add Together AI interruptible example	2024-09-20 20:21:19 -04:00
Mark Backman	357e66d64d	Input params for Together AI LLM	2024-09-20 20:21:19 -04:00
Mark Backman	4fa1ea8c4b	Input params for Anthropic LLM	2024-09-20 20:21:19 -04:00
Mark Backman	3b81cd462d	Input params to OpenAI LLM	2024-09-20 20:21:19 -04:00
Aleix Conchillo Flaqué	14acf05a26	Merge pull request #480 from pipecat-ai/aleix/input-output-frames introduce input/output audio and image frames	2024-09-20 14:44:37 -07:00
Mattie Ruth	58d9c84bc9	Merge pull request #474 from pipecat-ai/ruthless/improve-metrics-types-2 Ruthless/improve metrics types 2	2024-09-20 09:47:24 -04:00
Aleix Conchillo Flaqué	7e39d9ad3d	introduce input/output audio and image frames We now distinguish between input and output audio and image frames. We introduce `InputAudioRawFrame`, `OutputAudioRawFrame`, `InputImageRawFrame` and `OutputImageRawFrame` (and other subclasses of those). The input frames usually come from an input transport and are meant to be processed inside the pipeline to generate new frames. However, the input frames will not be sent through an output transport. The output frames can also be processed by any frame processor in the pipeline and they are allowed to be sent by the output transport.	2024-09-19 23:11:03 -07:00
mattie ruth backman	a4edb3dab1	Cleanup on aisle METRICS. Note: See below, this is a breaking change 1. Fleshed out MetricsFrames and broke it into a proper set of types 2. Add model_name as a property to the AIService so that it can be automatically included in metrics and also remove that overhead from all the various services themselves Breaking change! Because of the types improvements, the MetricsFrame type has changed. Each frame will have a list of metrics simlilar to before except each item in the list will only contain one type of metric: "ttfb", "tokens", "characters", or "processing". Previously these fields would be in every entry but set to None if they didn't apply. While this changes internal handling of the MetricsFrame, it does NOT break the RTVI/daily messaging of metrics. That format remains the same. Also. Remember to use model_name for accessing a service's current model and set_model_name for setting it.	2024-09-19 21:30:34 -04:00
Mattie Ruth	ed409d0460	Merge pull request #478 from pipecat-ai/ruthless/get-tests-running Ruthless/get tests running	2024-09-19 21:01:27 -04:00
mattie ruth backman	50b45ac2da	get the test infrastructure running again disable broken tests for now	2024-09-19 20:58:17 -04:00
Kwindla Hultman Kramer	29bcbc68c5	Merge pull request #479 from pipecat-ai/khk/small-fixes fix small issues that crept into main	2024-09-19 17:25:27 -07:00
Kwindla Hultman Kramer	affbe9ac7d	fix small issues that crept into main	2024-09-19 17:17:33 -07:00
Aleix Conchillo Flaqué	1790fa452f	Merge pull request #436 from pipecat-ai/aleix/frameprocessor-single-task introduce synchronous and asynchronous frame processors	2024-09-19 11:22:56 -07:00
Aleix Conchillo Flaqué	607a246572	updated CHANGELOG with sync/async frame processors	2024-09-19 01:32:17 -07:00
Aleix Conchillo Flaqué	4f1b06e6b2	pipeline: renamed ParallelTask to SyncParallelPipeline	2024-09-19 01:32:17 -07:00
Aleix Conchillo Flaqué	62e9a33a70	examples: use CartesiaHttpTTSService to synchronize frames	2024-09-19 01:32:17 -07:00
Aleix Conchillo Flaqué	3298f935ef	services(fal,moondream): add missing **kwargs	2024-09-19 01:32:17 -07:00
Aleix Conchillo Flaqué	0e8f56c752	services: move TTSService push_stop_frames to AsyncTTSService	2024-09-19 01:32:15 -07:00
Aleix Conchillo Flaqué	8224538372	services(cartesia): added CartesiaHttpTTSService	2024-09-19 01:31:12 -07:00
Aleix Conchillo Flaqué	fbf6eef68f	transports(base_output): wait for sink tasks before canceling audio/video tasks	2024-09-19 01:31:12 -07:00
Aleix Conchillo Flaqué	f078d156de	frames: StartFrame is now a SystemFrame	2024-09-19 01:31:12 -07:00
Aleix Conchillo Flaqué	23d6eed5ea	transports: input()/output() return subclass instead of base class	2024-09-19 01:31:12 -07:00
Aleix Conchillo Flaqué	0ed3d118d6	services(moondream); update revision to 2024-08-26	2024-09-19 01:31:12 -07:00
Aleix Conchillo Flaqué	337f048864	introduce synchronous and asynchronous frame processors Pipecat has a pipeline-based architecture. The pipeline consists of frame processors linked to each other. The elements travelling across the pipeline are called frames. To have a deterministic behavior the frames travelling through the pipeline should always be ordered, except system frames which are out-of-band frames. To achieve that, each frame processor should only output frames from a single task. There are synchronous and asynchronous frame processors. The synchronous processors push output frames from the same task that they receive input frames, and therefore only pushing frames from one task. Asynchrnous frame processors can have internal tasks to perform things asynchrnously (e.g. receiving data from a websocket) but they also have a single task where they push frames from.	2024-09-19 01:31:10 -07:00
Mark Backman	6f3c421621	Merge pull request #475 from pipecat-ai/mb/tts-sample-rate Add sample_rate setting to TTS services	2024-09-18 14:59:09 -04:00
Mark Backman	eadd68d40b	Add sample_rate setting to TTS services	2024-09-18 14:50:20 -04:00
Aleix Conchillo Flaqué	13a4a05388	Merge pull request #466 from pipecat-ai/aleix/elevenlabs-cartesia-close-websocket-first services(cartesia,elevenlabs): close websocket before the receiving task	2024-09-16 23:55:28 -07:00
Aleix Conchillo Flaqué	20c019ae16	services(cartesia,elevenlabs): close websocket before the receiving task	2024-09-16 23:54:21 -07:00
Aleix Conchillo Flaqué	d9d6571c73	Merge pull request #465 from kunal-cai/ks--fix-ws [Cartesia] Fix streaming truncation bug with Twilio Fast API WS	2024-09-16 17:17:13 -07:00
Kunal Shah	540cad4844	Undo sorting	2024-09-16 16:07:19 -07:00
Kunal Shah	0a26b650c0	Undo sorting	2024-09-16 16:06:25 -07:00
Kunal Shah	adaac003e5	[Cartesia] Fix streaming truncation bug with Twilio Fast API WS	2024-09-16 15:59:06 -07:00
Aleix Conchillo Flaqué	3d4f125071	Merge pull request #454 from pipecat-ai/aleix/initial-pipeline-clock-support initial pipeline clock support	2024-09-13 13:51:04 -07:00
Aleix Conchillo Flaqué	bce87f8717	update CHANGELOG.md	2024-09-13 13:50:03 -07:00
Aleix Conchillo Flaqué	1fe940bd6b	servceis(cartesia,elevenlabs): use word start times instead	2024-09-13 13:10:44 -07:00
Aleix Conchillo Flaqué	cb36a71381	fix some linting	2024-09-13 09:56:12 -07:00
Aleix Conchillo Flaqué	5acc4928fe	examples: add 07d-interruptible-elevenlabs.py	2024-09-13 09:43:18 -07:00
Aleix Conchillo Flaqué	434493b8aa	services(elevenlabs): implement word-by-word support through websockets	2024-09-13 09:31:35 -07:00
Aleix Conchillo Flaqué	f08b25dbb2	examples: assistant aggregator should always goes after transport	2024-09-12 00:37:34 -07:00
Aleix Conchillo Flaqué	3665734972	transports(output): initial sink clock synchronization	2024-09-12 00:37:34 -07:00
Aleix Conchillo Flaqué	a98d78cdea	services(lmnt): change to subclass of AsyncTTSService	2024-09-12 00:37:34 -07:00
Aleix Conchillo Flaqué	80f6d74e80	services(cartesia): change to subclass of AsyncWordTTSService	2024-09-12 00:37:34 -07:00
Aleix Conchillo Flaqué	02d926e9bd	services: create AsyncTTSService and AsyncWordTTSService	2024-09-12 00:31:48 -07:00
Aleix Conchillo Flaqué	7749692f72	processors: get pipeline clock from StartFrame	2024-09-12 00:31:48 -07:00
Aleix Conchillo Flaqué	7807cbeb39	pipeline(task): add a clock to the pipeline task	2024-09-12 00:31:48 -07:00
Aleix Conchillo Flaqué	72f231b327	frames: add a presentation timestamp (pts) to each frame	2024-09-12 00:31:48 -07:00
Aleix Conchillo Flaqué	3cbe97d346	clocks: added new BaseClock and SystemClock	2024-09-12 00:31:48 -07:00
Kwindla Hultman Kramer	b880e1a60e	Merge pull request #448 from pipecat-ai/khk/aggregation-leading-space fix for leading space in context aggregator strings	2024-09-10 09:57:35 -07:00
Aleix Conchillo Flaqué	886046e696	Merge pull request #445 from dleybz/patch-1 Update requirements.txt	2024-09-09 17:54:33 -07:00
Aleix Conchillo Flaqué	9106a5f8ae	Merge pull request #449 from pipecat-ai/aleix/audio-out-bitrate transports(daily): allow setting audio output bitrate (default 96kpbs)	2024-09-09 08:39:06 -07:00
Aleix Conchillo Flaqué	98286336bf	transports(daily): allow setting audio output bitrate (default 96kpbs) Fixes #388	2024-09-08 19:39:17 -07:00
Kwindla Hultman Kramer	081b001c8b	fix for leading space in context aggregator strings	2024-09-07 16:42:52 -07:00
Danny D. Leybzon	c92531a02f	Update requirements.txt request.form() throws an error if you don't have python-multipart installed	2024-09-06 20:22:18 +02:00
Aleix Conchillo Flaqué	748a7af602	update CHANGELOG.md	2024-09-05 19:05:29 -07:00
Aleix Conchillo Flaqué	f4a0de6327	Merge pull request #444 from pipecat-ai/aleix/elevenlabs-streaming services(elevenlabs): add elevenlabs package and use streaming	2024-09-05 11:24:12 -07:00
Aleix Conchillo Flaqué	e405d7af9f	services(elevenlabs): add elevenlabs package and use streaming	2024-09-05 11:20:01 -07:00
Aashraya	51cd7fd285	twiliohandle interruption (#422 ) * add interuption handler in twilio serializer * fix autopep8 * revert ruff autoformatting * address pr comments * change interruption frame to user started frame in serializer * remove overrrident handle interrupt * remove unused import * change userstarted to interuption frame	2024-09-02 11:06:38 -07:00
Aleix Conchillo Flaqué	aba5f89174	Merge pull request #437 from soof-golan/soof-obj-id-generation Generate ids with itertools.count	2024-09-02 10:53:48 -07:00
Soof Golan	5c0f5a1613	Generate ids with itertools.count Avoids the critical section with threading.Lock in favor of itertools.count. `count` objects are threadsafe, and their critical section is implemented in C and provide better performance that Python level locking.	2024-09-02 15:39:58 +02:00
Aleix Conchillo Flaqué	7c342f7ba2	Merge pull request #433 from pipecat-ai/aleix/process-all-startframes StartFrame should be the first frame every processor receives	2024-08-30 14:17:38 -07:00
Aleix Conchillo Flaqué	37e2388758	StartFrame should be the first frame every processor receives Fixes #427	2024-08-29 22:43:44 -07:00
Aleix Conchillo Flaqué	05f0492a8d	Merge pull request #421 from pipecat-ai/aleix/improve-multi-lingual-support improve multi lingual support	2024-08-29 13:19:40 -07:00
Aleix Conchillo Flaqué	c0ac5c6ae8	services(lmnt): fix example and update README and CHANGELOG	2024-08-29 11:11:24 -07:00
Aleix Conchillo Flaqué	be923687fb	processors(rtvi): user decices if bot interrupts on update config	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	5f32fb125d	updated CHANGELOG.md	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	ae6fbb3146	services: just set model, voice, language independently	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	864768635a	services: add voice and language to set_model()	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	d7c9679977	services: allow TTSModelUpdateFrame to also update language and voice	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	fedfc366f6	services(deepgram): fix strenum values	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	b3b39626e1	services: allow switching STT language and mdoel at the same time	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	4e0ece17b6	services: added support for setting STT model and language	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	fd3fdacdee	transcriptions: added more languages	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	a253606d50	services(daily): on_joined now returns all data not only participant	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	568d9dc0a3	services(whisper): inherit from SegmentedSTTService	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	6629b853c5	services(deepgram): inherit from STTService instead of AsyncAIService	2024-08-29 11:00:03 -07:00
Aleix Conchillo Flaqué	3931cb3235	services(cartesia): allow setting language and language voices	2024-08-29 11:00:01 -07:00
Aleix Conchillo Flaqué	38cd86ad52	services: added language to transcription frames	2024-08-29 10:59:02 -07:00
Aleix Conchillo Flaqué	c0cdabf61d	frames: adde TTSLanguageUpdateFrame and TTSLanguageVoicesUpdateFrame	2024-08-29 10:59:02 -07:00
Aleix Conchillo Flaqué	51270a96c5	frames: add language to transcription frames	2024-08-29 10:59:02 -07:00
Kwindla Hultman Kramer	84d72c0d5c	Merge pull request #425 from pipecat-ai/khk/rtvi-together-function-calling fixup type mismatches between rtvi data structures and together.py	2024-08-28 13:11:52 -07:00
Aleix Conchillo Flaqué	79aca8169a	Merge pull request #391 from sharvil/pr/add-lmnt LMNT TTS	2024-08-27 21:40:46 -07:00
Kwindla Hultman Kramer	b9d362bd62	fixup type mismatches between rtvi data structures and together.py	2024-08-27 17:39:21 -07:00
Sharvil Nanavati	87c4a1bee1	Move stop frame task creation into `TTSService.start`	2024-08-27 04:45:21 +00:00
Sharvil Nanavati	c979762b70	Handle cancellation, stopping, and restarting	2024-08-27 01:24:00 +00:00
Sharvil Nanavati	1d92fc3199	Merge branch 'main' into pr/add-lmnt	2024-08-24 10:07:52 -07:00
Sharvil Nanavati	8ac7fb1a67	Use a single long-lived Task to push TTSStoppedFrame	2024-08-24 16:18:07 +00:00
Sharvil Nanavati	60c3d33def	Default LMNT to 24kHz, add example	2024-08-24 15:40:29 +00:00
Sharvil Nanavati	8a39d3f4eb	services: add a generic mechanism to produce TTSStoppedFrames	2024-08-24 15:40:12 +00:00
Sharvil Nanavati	f4fd7b7028	LMNT TTS	2024-08-22 00:47:41 +00:00