Update GradiumTTSService to flush instead of ending stream

Add delay_in_frames and language support
GradiumSTTService now flushes pending transcripts on VAD stopped detection
2026-01-29 18:47:17 -05:00 · 2026-01-29 18:47:00 -05:00 · 2026-01-29 18:47:00 -05:00 · 2026-01-29 13:53:27 -08:00 · 2026-01-29 13:12:44 -08:00 · 2026-01-29 13:10:10 -08:00
110 changed files with 4244 additions and 646 deletions
--- a/.github/workflows/coverage.yaml
+++ b/.github/workflows/coverage.yaml
@@ -33,7 +33,14 @@ jobs:

      - name: Install dependencies
        run: |
-          uv sync --group dev --extra anthropic --extra aws --extra google --extra langchain --extra livekit --extra websocket
+          uv sync --group dev \
+            --extra anthropic \
+            --extra aws \
+            --extra google \
+            --extra langchain \
+            --extra livekit \
+            --extra piper \
+            --extra websocket

      - name: Run tests with coverage
        run: |
--- a/.github/workflows/tests.yaml
+++ b/.github/workflows/tests.yaml
@@ -37,7 +37,14 @@ jobs:

      - name: Install dependencies
        run: |
-          uv sync --group dev --extra anthropic --extra aws --extra google --extra langchain --extra livekit --extra websocket
+          uv sync --group dev \
+            --extra anthropic \
+            --extra aws \
+            --extra google \
+            --extra langchain \
+            --extra livekit \
+            --extra piper \
+            --extra websocket

      - name: Test with pytest
        run: |
--- a/changelog/3406.fixed.md
+++ b/changelog/3406.fixed.md
@@ -0,0 +1 @@
+- Fixed an issue where if you were using `OpenRouterLLMService` with a Gemini model, it wouldn't handle multiple `"system"` messages as expected (and as we do in `GoogleLLMService`), which is to convert subsequent ones into `"user"` messages. Instead, the latest `"system"` message would overwrite the previous ones.
--- a/changelog/3408.added.md
+++ b/changelog/3408.added.md
@@ -0,0 +1,4 @@
+- Additions for `AICFilter` and `AICVADAnalyzer`:
+  - Added model downloading support to `AICFilter` with `model_id` and `model_download_dir` parameters.
+  - Added `model_path` parameter to `AICFilter` for loading local `.aicmodel` files.
+  - Added unit tests for `AICFilter` and `AICVADAnalyzer`.
--- a/changelog/3408.changed.md
+++ b/changelog/3408.changed.md
@@ -0,0 +1 @@
+- Updated `AICFilter` and `AICVADAnalyzer` to use aic-sdk ~= 2.0.1.
--- a/changelog/3408.removed.md
+++ b/changelog/3408.removed.md
@@ -0,0 +1 @@
+- Removed deprecated `AICFilter` parameters: `enhancement_level`, `voice_gain`, `noise_gate_enable`.
--- a/changelog/3429.added.md
+++ b/changelog/3429.added.md
@@ -0,0 +1 @@
+- Added handling for `server_content.interrupted` signal in the Gemini Live service for faster interruption response in the case where there isn't already turn tracking in the pipeline, e.g. local VAD + context aggregators. When there is already turn tracking in the pipeline, the additional interruption does no harm.
--- a/changelog/3495.changed.2.md
+++ b/changelog/3495.changed.2.md
@@ -0,0 +1 @@
+- `SarvamSTTService` now defaults `vad_signals` and `high_vad_sensitivity` to `None` (omitted from connection parameters), improving latency by ~300ms compared to the previous defaults.
--- a/changelog/3495.changed.md
+++ b/changelog/3495.changed.md
@@ -0,0 +1 @@
+- Improved the STT TTFB (Time To First Byte) measurement, reporting the delay between when the user stops speaking and when the final transcription is received. Note: Unlike traditional TTFB which measures from a discrete request, STT services receive continuous audio input—so we measure from speech end to final transcript, which captures the latency that matters for voice AI applications. In support of this change, added `finalized` field to `TranscriptionFrame` to indicate when a transcript is the final result for an utterance.
--- a/changelog/3500.added.md
+++ b/changelog/3500.added.md
@@ -0,0 +1 @@
+- Added new `GenesysFrameSerializer` for the Genesys AudioHook WebSocket protocol, enabling bidirectional audio streaming between Pipecat pipelines and Genesys Cloud contact center.
--- a/changelog/3525.added.md
+++ b/changelog/3525.added.md
@@ -1 +1 @@
- Added new `SMART_TURN_LOG_DATA` environment variable, which causes Smart Turn input data to be saved to disk
+- Added new `PIPECAT_SMART_TURN_LOG_DATA` environment variable, which causes Smart Turn input data to be saved to disk
--- a/changelog/3529.fixed.md
+++ b/changelog/3529.fixed.md
@@ -0,0 +1 @@
+- Fixed OpenAI LLM services to emit `ErrorFrame` on completion timeout, enabling proper error handling and LLMSwitcher failover.
--- a/changelog/3536.fixed.md
+++ b/changelog/3536.fixed.md
@@ -0,0 +1 @@
+- Fixed a logging issue where non-ASCII characters (e.g., Japanese, Chinese, etc.) were being unnecessarily escaped to Unicode sequences when function call occurred.
--- a/changelog/3541.fixed.md
+++ b/changelog/3541.fixed.md
@@ -0,0 +1 @@
+- Fixed how audio tracks are synchronized inside the `AudioBufferProcessor` to fix timing issues where silence and audio were misaligned between user and bot buffers.
--- a/changelog/3560.changed.md
+++ b/changelog/3560.changed.md
@@ -0,0 +1 @@
+- `FrameSerializer` now subclasses from `BaseObject` to enable event support.
--- a/changelog/3562.changed.md
+++ b/changelog/3562.changed.md
@@ -0,0 +1,2 @@
+- Added support for TTFS in `SpeechmaticsSTTService` and set the default mode to `EXTERNAL` to support Pipecat-controlled VAD.
+- Changed dependency to `speechmatics-voice[smart]>=0.2.8`
--- a/changelog/3567.fixed.md
+++ b/changelog/3567.fixed.md
@@ -0,0 +1 @@
+- Fixed race condition in `OpenAIRealtimeBetaLLMService` that could cause an error when truncating the conversation.
--- a/changelog/3571.added.2.md
+++ b/changelog/3571.added.2.md
@@ -0,0 +1 @@
+- Added `function_call_timeout_secs` parameter to `LLMService` to configure timeout for deferred function calls (defaults to 10.0 seconds).
--- a/changelog/3571.added.md
+++ b/changelog/3571.added.md
@@ -0,0 +1 @@
+- Added `result_callback` parameter to `UserImageRequestFrame` to support deferred function call results.
--- a/changelog/3571.changed.md
+++ b/changelog/3571.changed.md
@@ -0,0 +1,4 @@
+- ⚠️ Changed function call handling to use timeout-based completion instead of immediate callback execution.
+  - Function calls that defer their results (e.g., `UserImageRequestFrame`) now use a timeout mechanism
+  - The `result_callback` is invoked automatically when the deferred operation completes or after timeout
+  - This change affects examples using `UserImageRequestFrame` - the `result_callback` should now be passed to the frame instead of being called immediately
--- a/changelog/3574.fixed.md
+++ b/changelog/3574.fixed.md
@@ -0,0 +1 @@
+- Fixed an infinite loop in `WebsocketService` that blocked the event loop when a remote server closed the connection gracefully.
--- a/changelog/3575.fixed.md
+++ b/changelog/3575.fixed.md
@@ -0,0 +1 @@
+- Fixed `LLMUserAggregator` and `LLMAssistantAggregator` not emitting pending transcripts via `on_user_turn_stopped` and `on_assistant_turn_stopped` events when the conversation ends (`EndFrame`) or is cancelled (`CancelFrame`).
--- a/changelog/3580.fixed.md
+++ b/changelog/3580.fixed.md
@@ -0,0 +1 @@
+- Added missing `LiveKitRunnerArguments` and `LiveKitTransport` support in runner utilities to enable LiveKit transport configuration.
--- a/changelog/3581.fixed.md
+++ b/changelog/3581.fixed.md
@@ -0,0 +1 @@
+- Fixed race condition in `OpenAIRealtimeLLMService` that could cause an error when truncating the conversation.
--- a/changelog/3582.change.md
+++ b/changelog/3582.change.md
@@ -0,0 +1 @@
+- Pipecat runner now uses `DAILY_ROOM_URL` instead of `DAILY_SAMPLE_ROOM_URL`.
--- a/changelog/3585.added.md
+++ b/changelog/3585.added.md
@@ -0,0 +1 @@
+- Added local `PiperTTSService` for offline text-to-speech using Piper voice models. The existing HTTP-based service has been renamed to `PiperHttpTTSService`.
--- a/changelog/3585.fixed.md
+++ b/changelog/3585.fixed.md
@@ -0,0 +1 @@
+- Fixed `PiperHttpTTSService` (olf `PiperTTSService`) to resample audio output based on the model's sample rate parsed from the WAV header.
--- a/changelog/3587.changed.md
+++ b/changelog/3587.changed.md
@@ -0,0 +1,3 @@
+- Updates to `GradiumSTTService`:
+  - Now flushes pending transcriptions when VAD detects the user stopped speaking, improving response latency.
+  - `GradiumSTTService` now supports `InputParams` for configuring `language` and `delay_in_frames` settings.
--- a/changelog/3590.added.md
+++ b/changelog/3590.added.md
@@ -0,0 +1 @@
+- `main()` in `pipecat.runner.run` now accepts an optional `argparse.ArgumentParser`, allowing bots to define custom CLI arguments accessible via `runner_args.cli_args`.
--- a/changelog/3594.fixed.md
+++ b/changelog/3594.fixed.md
@@ -0,0 +1 @@
+- Fixed `UserTurnController` to reset user turn timeout when interim transcriptions are received.
--- a/changelog/3596.fixed.md
+++ b/changelog/3596.fixed.md
@@ -0,0 +1 @@
+- Fixed an issue in `GradiumTTSService` where the websocket was being disconnected at the end of every bot turn.
--- a/env.example
+++ b/env.example
@@ -43,7 +43,7 @@ CEREBRAS_API_KEY=...

 # Daily
 DAILY_API_KEY=...
-DAILY_SAMPLE_ROOM_URL=https://...
+DAILY_ROOM_URL=https://...

 # Deepgram
 DEEPGRAM_API_KEY=...
--- a/examples/foundational/01-say-one-thing-piper.py
+++ b/examples/foundational/01-say-one-thing-piper.py
@@ -16,7 +16,7 @@ from pipecat.pipeline.runner import PipelineRunner
 from pipecat.pipeline.task import PipelineTask
 from pipecat.runner.types import RunnerArguments
 from pipecat.runner.utils import create_transport
-from pipecat.services.piper.tts import PiperTTSService
+from pipecat.services.piper.tts import PiperHttpTTSService
 from pipecat.transports.base_transport import BaseTransport, TransportParams
 from pipecat.transports.daily.transport import DailyParams
 from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
@@ -39,7 +39,7 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):

    # Create an HTTP session
    async with aiohttp.ClientSession() as session:
-        tts = PiperTTSService(
+        tts = PiperHttpTTSService(
            base_url=os.getenv("PIPER_BASE_URL"), aiohttp_session=session, sample_rate=24000
        )

--- a/examples/foundational/07zd-interruptible-aicoustics.py
+++ b/examples/foundational/07zd-interruptible-aicoustics.py
@@ -50,7 +50,7 @@ def _create_aic_filter() -> AICFilter:

    return AICFilter(
        license_key=license_key,
-        enhancement_level=0.5,
+        model_id="quail-vf-l-16khz",
    )


@@ -62,7 +62,9 @@ transport_params = {
        lambda aic: DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
-            vad_analyzer=aic.create_vad_analyzer(lookback_buffer_size=6.0, sensitivity=6.0),
+            vad_analyzer=aic.create_vad_analyzer(
+                speech_hold_duration=0.05, minimum_speech_duration=0.0, sensitivity=6.0
+            ),
            audio_in_filter=aic,
        )
    )(_create_aic_filter()),
@@ -70,7 +72,9 @@ transport_params = {
        lambda aic: FastAPIWebsocketParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
-            vad_analyzer=aic.create_vad_analyzer(lookback_buffer_size=6.0, sensitivity=6.0),
+            vad_analyzer=aic.create_vad_analyzer(
+                speech_hold_duration=0.05, minimum_speech_duration=0.0, sensitivity=6.0
+            ),
            audio_in_filter=aic,
        )
    )(_create_aic_filter()),
@@ -78,7 +82,9 @@ transport_params = {
        lambda aic: TransportParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
-            vad_analyzer=aic.create_vad_analyzer(lookback_buffer_size=6.0, sensitivity=6.0),
+            vad_analyzer=aic.create_vad_analyzer(
+                speech_hold_duration=0.05, minimum_speech_duration=0.0, sensitivity=6.0
+            ),
            audio_in_filter=aic,
        )
    )(_create_aic_filter()),
--- a/examples/foundational/07zf-interruptible-gradium.py
+++ b/examples/foundational/07zf-interruptible-gradium.py
@@ -26,6 +26,7 @@ from pipecat.runner.utils import create_transport
 from pipecat.services.gradium.stt import GradiumSTTService
 from pipecat.services.gradium.tts import GradiumTTSService
 from pipecat.services.openai.llm import OpenAILLMService
+from pipecat.transcriptions.language import Language
 from pipecat.transports.base_transport import BaseTransport, TransportParams
 from pipecat.transports.daily.transport import DailyParams
 from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
@@ -59,11 +60,18 @@ transport_params = {
 async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
    logger.info(f"Starting bot")

-    stt = GradiumSTTService(api_key=os.getenv("GRADIUM_API_KEY"))
+    stt = GradiumSTTService(
+        api_key=os.getenv("GRADIUM_API_KEY"),
+        api_endpoint_base_url="wss://us.api.gradium.ai/api/speech/asr",
+        params=GradiumSTTService.InputParams(
+            language=Language.EN,
+        ),
+    )

    tts = GradiumTTSService(
        api_key=os.getenv("GRADIUM_API_KEY"),
        voice_id="YTpq7expH9539ERJ",
+        url="wss://us.api.gradium.ai/api/speech/tts",
    )

    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
--- a/examples/foundational/07zi-interruptible-piper.py
+++ b/examples/foundational/07zi-interruptible-piper.py
@@ -0,0 +1,132 @@
+#
+# Copyright (c) 2024-2026, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import os
+
+from dotenv import load_dotenv
+from loguru import logger
+
+from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
+from pipecat.audio.vad.silero import SileroVADAnalyzer
+from pipecat.audio.vad.vad_analyzer import VADParams
+from pipecat.frames.frames import LLMRunFrame
+from pipecat.pipeline.pipeline import Pipeline
+from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.task import PipelineParams, PipelineTask
+from pipecat.processors.aggregators.llm_context import LLMContext
+from pipecat.processors.aggregators.llm_response_universal import (
+    LLMContextAggregatorPair,
+    LLMUserAggregatorParams,
+)
+from pipecat.runner.types import RunnerArguments
+from pipecat.runner.utils import create_transport
+from pipecat.services.deepgram.stt import DeepgramSTTService
+from pipecat.services.openai.llm import OpenAILLMService
+from pipecat.services.piper.tts import PiperTTSService
+from pipecat.transports.base_transport import BaseTransport, TransportParams
+from pipecat.transports.daily.transport import DailyParams
+from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
+from pipecat.turns.user_stop import TurnAnalyzerUserTurnStopStrategy
+from pipecat.turns.user_turn_strategies import UserTurnStrategies
+
+load_dotenv(override=True)
+
+# We store functions so objects (e.g. SileroVADAnalyzer) don't get
+# instantiated. The function will be called when the desired transport gets
+# selected.
+transport_params = {
+    "daily": lambda: DailyParams(
+        audio_in_enabled=True,
+        audio_out_enabled=True,
+        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
+    ),
+    "twilio": lambda: FastAPIWebsocketParams(
+        audio_in_enabled=True,
+        audio_out_enabled=True,
+        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
+    ),
+    "webrtc": lambda: TransportParams(
+        audio_in_enabled=True,
+        audio_out_enabled=True,
+        vad_analyzer=SileroVADAnalyzer(params=VADParams(stop_secs=0.2)),
+    ),
+}
+
+
+async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
+    logger.info(f"Starting bot")
+
+    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
+
+    tts = PiperTTSService(voice_id="en_US-ryan-high")
+
+    llm = OpenAILLMService(api_key=os.getenv("OPENAI_API_KEY"))
+
+    messages = [
+        {
+            "role": "system",
+            "content": "You are a helpful LLM in a WebRTC call. Your goal is to demonstrate your capabilities in a succinct way. Your output will be spoken aloud, so avoid special characters that can't easily be spoken, such as emojis or bullet points. Respond to what the user said in a creative and helpful way.",
+        },
+    ]
+
+    context = LLMContext(messages)
+    user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
+        context,
+        user_params=LLMUserAggregatorParams(
+            user_turn_strategies=UserTurnStrategies(
+                stop=[TurnAnalyzerUserTurnStopStrategy(turn_analyzer=LocalSmartTurnAnalyzerV3())]
+            ),
+        ),
+    )
+
+    pipeline = Pipeline(
+        [
+            transport.input(),  # Transport user input
+            stt,
+            user_aggregator,  # User responses
+            llm,  # LLM
+            tts,  # TTS
+            transport.output(),  # Transport bot output
+            assistant_aggregator,  # Assistant spoken responses
+        ]
+    )
+
+    task = PipelineTask(
+        pipeline,
+        params=PipelineParams(
+            enable_metrics=True,
+            enable_usage_metrics=True,
+        ),
+        idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
+    )
+
+    @transport.event_handler("on_client_connected")
+    async def on_client_connected(transport, client):
+        logger.info(f"Client connected")
+        # Kick off the conversation.
+        messages.append({"role": "system", "content": "Please introduce yourself to the user."})
+        await task.queue_frames([LLMRunFrame()])
+
+    @transport.event_handler("on_client_disconnected")
+    async def on_client_disconnected(transport, client):
+        logger.info(f"Client disconnected")
+        await task.cancel()
+
+    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
+
+    await runner.run(task)
+
+
+async def bot(runner_args: RunnerArguments):
+    """Main bot entry point compatible with Pipecat Cloud."""
+    transport = await create_transport(runner_args, transport_params)
+    await run_bot(transport, runner_args)
+
+
+if __name__ == "__main__":
+    from pipecat.runner.run import main
+
+    main()
--- a/examples/foundational/13l-gradium-transcription.py
+++ b/examples/foundational/13l-gradium-transcription.py
@@ -0,0 +1,86 @@
+#
+# Copyright (c) 2024-2026, Daily
+#
+# SPDX-License-Identifier: BSD 2-Clause License
+#
+
+import os
+
+from dotenv import load_dotenv
+from loguru import logger
+
+from pipecat.frames.frames import Frame, TranscriptionFrame
+from pipecat.pipeline.pipeline import Pipeline
+from pipecat.pipeline.runner import PipelineRunner
+from pipecat.pipeline.task import PipelineTask
+from pipecat.processors.frame_processor import FrameDirection, FrameProcessor
+from pipecat.runner.types import RunnerArguments
+from pipecat.runner.utils import create_transport
+from pipecat.services.gradium.stt import GradiumSTTService
+from pipecat.transcriptions.language import Language
+from pipecat.transports.base_transport import BaseTransport, TransportParams
+from pipecat.transports.daily.transport import DailyParams
+from pipecat.transports.websocket.fastapi import FastAPIWebsocketParams
+
+load_dotenv(override=True)
+
+
+class TranscriptionLogger(FrameProcessor):
+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, TranscriptionFrame):
+            print(f"Transcription: {frame.text}")
+
+        # Push all frames through
+        await self.push_frame(frame, direction)
+
+
+# We store functions so objects (e.g. SileroVADAnalyzer) don't get
+# instantiated. The function will be called when the desired transport gets
+# selected.
+transport_params = {
+    "daily": lambda: DailyParams(audio_in_enabled=True),
+    "twilio": lambda: FastAPIWebsocketParams(audio_in_enabled=True),
+    "webrtc": lambda: TransportParams(audio_in_enabled=True),
+}
+
+
+async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
+    logger.info(f"Starting bot")
+
+    stt = GradiumSTTService(
+        api_key=os.getenv("GRADIUM_API_KEY"),
+        api_endpoint_base_url="wss://us.api.gradium.ai/api/speech/asr",
+        params=GradiumSTTService.InputParams(language=Language.EN, delay_in_frames=8),
+    )
+
+    tl = TranscriptionLogger()
+
+    pipeline = Pipeline([transport.input(), stt, tl])
+
+    task = PipelineTask(
+        pipeline,
+        idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
+    )
+
+    @transport.event_handler("on_client_disconnected")
+    async def on_client_disconnected(transport, client):
+        logger.info(f"Client disconnected")
+        await task.cancel()
+
+    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)
+
+    await runner.run(task)
+
+
+async def bot(runner_args: RunnerArguments):
+    """Main bot entry point compatible with Pipecat Cloud."""
+    transport = await create_transport(runner_args, transport_params)
+    await run_bot(transport, runner_args)
+
+
+if __name__ == "__main__":
+    from pipecat.runner.run import main
+
+    main()
--- a/examples/foundational/14d-function-calling-anthropic-video.py
+++ b/examples/foundational/14d-function-calling-anthropic-video.py
@@ -48,14 +48,16 @@ async def fetch_user_image(params: FunctionCallParams):
    When called, this function pushes a UserImageRequestFrame upstream to the
    transport. As a result, the transport will request the user image and push a
    UserImageRawFrame downstream which will be added to the context by the LLM
-    assistant aggregator.
+    assistant aggregator. The result_callback will be invoked once the image is
+    retrieved and processed.
    """
    user_id = params.arguments["user_id"]
    question = params.arguments["question"]
    logger.debug(f"Requesting image with user_id={user_id}, question={question}")

    # Request a user image frame and indicate that it should be added to the
-    # context. Also associate it to the function call.
+    # context. Also associate it to the function call. Pass the result_callback
+    # so it can be invoked when the image is actually retrieved.
    await params.llm.push_frame(
        UserImageRequestFrame(
            user_id=user_id,
@@ -63,16 +65,11 @@ async def fetch_user_image(params: FunctionCallParams):
            append_to_context=True,
            function_name=params.function_name,
            tool_call_id=params.tool_call_id,
+            result_callback=params.result_callback,
        ),
        FrameDirection.UPSTREAM,
    )

-    await params.result_callback(None)
-
-    # Instead of None, it's possible to also provide a tool call answer to
-    # tell the LLM that we are grabbing the image to analyze.
-    # await params.result_callback({"result": "Image is being captured."})
-

 # We store functions so objects (e.g. SileroVADAnalyzer) don't get
 # instantiated. The function will be called when the desired transport gets
--- a/examples/foundational/14d-function-calling-aws-video.py
+++ b/examples/foundational/14d-function-calling-aws-video.py
@@ -48,14 +48,16 @@ async def fetch_user_image(params: FunctionCallParams):
    When called, this function pushes a UserImageRequestFrame upstream to the
    transport. As a result, the transport will request the user image and push a
    UserImageRawFrame downstream which will be added to the context by the LLM
-    assistant aggregator.
+    assistant aggregator. The result_callback will be invoked once the image is
+    retrieved and processed.
    """
    user_id = params.arguments["user_id"]
    question = params.arguments["question"]
    logger.debug(f"Requesting image with user_id={user_id}, question={question}")

    # Request a user image frame and indicate that it should be added to the
-    # context. Also associate it to the function call.
+    # context. Also associate it to the function call. Pass the result_callback
+    # so it can be invoked when the image is actually retrieved.
    await params.llm.push_frame(
        UserImageRequestFrame(
            user_id=user_id,
@@ -63,16 +65,11 @@ async def fetch_user_image(params: FunctionCallParams):
            append_to_context=True,
            function_name=params.function_name,
            tool_call_id=params.tool_call_id,
+            result_callback=params.result_callback,
        ),
        FrameDirection.UPSTREAM,
    )

-    await params.result_callback(None)
-
-    # Instead of None, it's possible to also provide a tool call answer to
-    # tell the LLM that we are grabbing the image to analyze.
-    # await params.result_callback({"result": "Image is being captured."})
-

 # We store functions so objects (e.g. SileroVADAnalyzer) don't get
 # instantiated. The function will be called when the desired transport gets
--- a/examples/foundational/14d-function-calling-gemini-flash-video.py
+++ b/examples/foundational/14d-function-calling-gemini-flash-video.py
@@ -48,14 +48,16 @@ async def fetch_user_image(params: FunctionCallParams):
    When called, this function pushes a UserImageRequestFrame upstream to the
    transport. As a result, the transport will request the user image and push a
    UserImageRawFrame downstream which will be added to the context by the LLM
-    assistant aggregator.
+    assistant aggregator. The result_callback will be invoked once the image is
+    retrieved and processed.
    """
    user_id = params.arguments["user_id"]
    question = params.arguments["question"]
    logger.debug(f"Requesting image with user_id={user_id}, question={question}")

    # Request a user image frame and indicate that it should be added to the
-    # context. Also associate it to the function call.
+    # context. Also associate it to the function call. Pass the result_callback
+    # so it can be invoked when the image is actually retrieved.
    await params.llm.push_frame(
        UserImageRequestFrame(
            user_id=user_id,
@@ -63,16 +65,11 @@ async def fetch_user_image(params: FunctionCallParams):
            append_to_context=True,
            function_name=params.function_name,
            tool_call_id=params.tool_call_id,
+            result_callback=params.result_callback,
        ),
        FrameDirection.UPSTREAM,
    )

-    await params.result_callback(None)
-
-    # Instead of None, it's possible to also provide a tool call answer to
-    # tell the LLM that we are grabbing the image to analyze.
-    # await params.result_callback({"result": "Image is being captured."})
-

 # We store functions so objects (e.g. SileroVADAnalyzer) don't get
 # instantiated. The function will be called when the desired transport gets
--- a/examples/foundational/14d-function-calling-moondream-video.py
+++ b/examples/foundational/14d-function-calling-moondream-video.py
@@ -57,7 +57,8 @@ async def fetch_user_image(params: FunctionCallParams):

    When called, this function pushes a UserImageRequestFrame upstream to the
    transport. As a result, the transport will request the user image and push a
-    UserImageRawFrame downstream.
+    UserImageRawFrame downstream. The result_callback will be invoked once the
+    image is retrieved and processed.
    """
    user_id = params.arguments["user_id"]
    question = params.arguments["question"]
@@ -65,7 +66,8 @@ async def fetch_user_image(params: FunctionCallParams):

    # Request a user image frame. In this case, we don't want the requested
    # image to be added to the context because we will process it with
-    # Moondream. Also associate it to the function call.
+    # Moondream. Also associate it to the function call. Pass the result_callback
+    # so it can be invoked when the image is actually retrieved.
    await params.llm.push_frame(
        UserImageRequestFrame(
            user_id=user_id,
@@ -73,16 +75,11 @@ async def fetch_user_image(params: FunctionCallParams):
            append_to_context=False,
            function_name=params.function_name,
            tool_call_id=params.tool_call_id,
+            result_callback=params.result_callback,
        ),
        FrameDirection.UPSTREAM,
    )

-    await params.result_callback(None)
-
-    # Instead of None, it's possible to also provide a tool call answer to
-    # tell the LLM that we are grabbing the image to analyze.
-    # await params.result_callback({"result": "Image is being captured."})
-

 class MoondreamTextFrameWrapper(FrameProcessor):
    """Wraps Moondream-provided TextFrames with LLM response start/end frames.
--- a/examples/foundational/14d-function-calling-openai-video.py
+++ b/examples/foundational/14d-function-calling-openai-video.py
@@ -49,14 +49,16 @@ async def fetch_user_image(params: FunctionCallParams):
    When called, this function pushes a UserImageRequestFrame upstream to the
    transport. As a result, the transport will request the user image and push a
    UserImageRawFrame downstream which will be added to the context by the LLM
-    assistant aggregator.
+    assistant aggregator. The result_callback will be invoked once the image is
+    retrieved and processed.
    """
    user_id = params.arguments["user_id"]
    question = params.arguments["question"]
    logger.debug(f"Requesting image with user_id={user_id}, question={question}")

    # Request a user image frame and indicate that it should be added to the
-    # context. Also associate it to the function call.
+    # context. Also associate it to the function call. Pass the result_callback
+    # so it can be invoked when the image is actually retrieved.
    await params.llm.push_frame(
        UserImageRequestFrame(
            user_id=user_id,
@@ -64,16 +66,11 @@ async def fetch_user_image(params: FunctionCallParams):
            append_to_context=True,
            function_name=params.function_name,
            tool_call_id=params.tool_call_id,
+            result_callback=params.result_callback,
        ),
        FrameDirection.UPSTREAM,
    )

-    await params.result_callback(None)
-
-    # Instead of None, it's possible to also provide a tool call answer to
-    # tell the LLM that we are grabbing the image to analyze.
-    # await params.result_callback({"result": "Image is being captured."})
-

 # We store functions so objects (e.g. SileroVADAnalyzer) don't get
 # instantiated. The function will be called when the desired transport gets
--- a/examples/foundational/14e-function-calling-google.py
+++ b/examples/foundational/14e-function-calling-google.py
@@ -58,14 +58,16 @@ async def get_image(params: FunctionCallParams):
    When called, this function pushes a UserImageRequestFrame upstream to the
    transport. As a result, the transport will request the user image and push a
    UserImageRawFrame downstream which will be added to the context by the LLM
-    assistant aggregator.
+    assistant aggregator. The result_callback will be invoked once the image is
+    retrieved and processed.
    """
    user_id = params.arguments["user_id"]
    question = params.arguments["question"]
    logger.debug(f"Requesting image with user_id={user_id}, question={question}")

    # Request a user image frame and indicate that it should be added to the
-    # context. Also associate it to the function call.
+    # context. Also associate it to the function call. Pass the result_callback
+    # so it can be invoked when the image is actually retrieved.
    await params.llm.push_frame(
        UserImageRequestFrame(
            user_id=user_id,
@@ -73,16 +75,11 @@ async def get_image(params: FunctionCallParams):
            append_to_context=True,
            function_name=params.function_name,
            tool_call_id=params.tool_call_id,
+            result_callback=params.result_callback,
        ),
        FrameDirection.UPSTREAM,
    )

-    await params.result_callback(None)
-
-    # Instead of None, it's possible to also provide a tool call answer to
-    # tell the LLM that we are grabbing the image to analyze.
-    # await params.result_callback({"result": "Image is being captured."})
-

 # We store functions so objects (e.g. SileroVADAnalyzer) don't get
 # instantiated. The function will be called when the desired transport gets
--- a/examples/foundational/18-gstreamer-filesrc.py
+++ b/examples/foundational/18-gstreamer-filesrc.py
@@ -20,10 +20,6 @@ from pipecat.transports.daily.transport import DailyParams

 load_dotenv(override=True)

-parser = argparse.ArgumentParser(description="Pipecat Video Streaming Bot")
-parser.add_argument("-i", "--input", type=str, required=True, help="Input video file")
-args = parser.parse_args()
-
 # We store functions so objects (e.g. SileroVADAnalyzer) don't get
 # instantiated. The function will be called when the desired transport gets
 # selected.
@@ -46,10 +42,10 @@ transport_params = {


 async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
-    logger.info(f"Starting bot with video input: {args.input}")
+    logger.info(f"Starting bot with video input: {runner_args.cli_args.input}")

    gst = GStreamerPipelineSource(
-        pipeline=f"filesrc location={args.input}",
+        pipeline=f"filesrc location={runner_args.cli_args.input}",
        out_params=GStreamerPipelineSource.OutputParams(
            video_width=1280,
            video_height=720,
@@ -68,6 +64,15 @@ async def run_bot(transport: BaseTransport, runner_args: RunnerArguments):
        idle_timeout_secs=runner_args.pipeline_idle_timeout_secs,
    )

+    @transport.event_handler("on_client_connected")
+    async def on_client_connected(transport, client):
+        logger.info(f"Client connected")
+
+    @transport.event_handler("on_client_disconnected")
+    async def on_client_disconnected(transport, client):
+        logger.info(f"Client disconnected")
+        await task.cancel()
+
    runner = PipelineRunner(handle_sigint=runner_args.handle_sigint)

    await runner.run(task)
@@ -82,4 +87,7 @@ async def bot(runner_args: RunnerArguments):
 if __name__ == "__main__":
    from pipecat.runner.run import main

-    main()
+    parser = argparse.ArgumentParser(description="Pipecat Video Streaming Bot")
+    parser.add_argument("-i", "--input", type=str, required=True, help="Input video file")
+
+    main(parser)
--- a/examples/foundational/20d-persistent-context-gemini.py
+++ b/examples/foundational/20d-persistent-context-gemini.py
@@ -66,7 +66,8 @@ async def get_image(params: FunctionCallParams):
    logger.debug(f"Requesting image with user_id={user_id}, question={question}")

    # Request a user image frame and indicate that it should be added to the
-    # context. Also associate it to the function call.
+    # context. Also associate it to the function call. Pass the result_callback
+    # so it can be invoked when the image is actually retrieved.
    await params.llm.push_frame(
        UserImageRequestFrame(
            user_id=user_id,
@@ -74,16 +75,11 @@ async def get_image(params: FunctionCallParams):
            append_to_context=True,
            function_name=params.function_name,
            tool_call_id=params.tool_call_id,
+            result_callback=params.result_callback,
        ),
        FrameDirection.UPSTREAM,
    )

-    await params.result_callback(None)
-
-    # Instead of None, it's possible to also provide a tool call answer to
-    # tell the LLM that we are grabbing the image to analyze.
-    # await params.result_callback({"result": "Image is being captured."})
-

 async def get_saved_conversation_filenames(params: FunctionCallParams):
    # Construct the full pattern including the BASE_FILENAME
--- a/examples/foundational/37-mem0.py
+++ b/examples/foundational/37-mem0.py
@@ -31,7 +31,7 @@ Requirements:
    - [Optional] Anthropic API key (if using Claude with local config)

    Environment variables (set in .env or in your terminal using `export`):
-        DAILY_SAMPLE_ROOM_URL=daily_sample_room_url
+        DAILY_ROOM_URL=daily_room_url
        DAILY_API_KEY=daily_api_key
        OPENAI_API_KEY=openai_api_key
        ELEVENLABS_API_KEY=elevenlabs_api_key
--- a/examples/foundational/README.md
+++ b/examples/foundational/README.md
@@ -4,7 +4,7 @@ This directory contains examples showing how to build voice and multimodal agent

 ## Setup

-1. Follow the [README](../../README.md#%EF%B8%8F-contributing-to-the-framework) steps to get your local environment configured.
+1. Follow the [README](https://github.com/pipecat-ai/pipecat/blob/main/README.md#%EF%B8%8F-contributing-to-the-framework) steps to get your local environment configured.

   > **Run from root directory**: Make sure you are running the steps from the root directory.

@@ -37,7 +37,7 @@ Most examples support running with other transports, like Twilio or Daily.

 ### Daily

-You need to create a Daily account at https://dashboard.daily.co/u/signup. Once signed up, you can create your own room from the dashboard and set the environment variables `DAILY_SAMPLE_ROOM_URL` and `DAILY_API_KEY`. Alternatively, you can let the example create a room for you (still needs `DAILY_API_KEY` environment variable). Then, start any example with `-t daily`:
+You need to create a Daily account at https://dashboard.daily.co/u/signup. Once signed up, you can create your own room from the dashboard and set the environment variables `DAILY_ROOM_URL` and `DAILY_API_KEY`. Alternatively, you can let the example create a room for you (still needs `DAILY_API_KEY` environment variable). Then, start any example with `-t daily`:

 ```bash
 uv run 07-interruptible.py -t daily
@@ -140,4 +140,4 @@ uv run python <example-name> --host 0.0.0.0 --port 8080
 - **Connection errors**: Verify API keys in `.env` file
 - **Port conflicts**: Use `--port` to change the port

-For more examples, visit our the [`pipecat-examples repository](https://github.com/pipecat-ai/pipecat-examples).
+For more examples, visit our the [pipecat-examples repository](https://github.com/pipecat-ai/pipecat-examples).
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -48,13 +48,13 @@ Issues = "https://github.com/pipecat-ai/pipecat/issues"
 Changelog = "https://github.com/pipecat-ai/pipecat/blob/main/CHANGELOG.md"

 [project.optional-dependencies]
-aic = [ "aic-sdk~=1.2.0" ]
+aic = [ "aic-sdk~=2.0.1" ]
 anthropic = [ "anthropic~=0.49.0" ]
 assemblyai = [ "pipecat-ai[websockets-base]" ]
 asyncai = [ "pipecat-ai[websockets-base]" ]
 aws = [ "aioboto3~=15.5.0", "pipecat-ai[websockets-base]" ]
 aws-nova-sonic = [ "aws_sdk_bedrock_runtime~=0.2.0; python_version>='3.12'" ]
-azure = [ "azure-cognitiveservices-speech~=1.44.0"]
+azure = [ "azure-cognitiveservices-speech~=1.47.0"]
 cartesia = [ "cartesia~=2.0.3", "pipecat-ai[websockets-base]" ]
 camb = [ "camb-sdk>=1.5.4" ]
 cerebras = []
@@ -95,6 +95,7 @@ rnnoise = [ "pyrnnoise~=0.4.1" ]
 openpipe = [ "openpipe>=4.50.0,<6" ]
 openrouter = []
 perplexity = []
+piper = [ "piper-tts>=1.3.0,<2" ]
 playht = [ "pipecat-ai[websockets-base]" ]
 qwen = []
 remote-smart-turn = []
@@ -109,7 +110,7 @@ silero = [ "onnxruntime>=1.20.1,<2" ]
 simli = [ "simli-ai~=1.0.3"]
 soniox = [ "pipecat-ai[websockets-base]" ]
 soundfile = [ "soundfile~=0.13.1" ]
-speechmatics = [ "speechmatics-voice[smart]>=0.2.6" ]
+speechmatics = [ "speechmatics-voice[smart]>=0.2.8" ]
 strands = [ "strands-agents>=1.9.1,<2" ]
 tavus=[]
 together = []
--- a/scripts/evals/eval.py
+++ b/scripts/evals/eval.py
@@ -195,7 +195,7 @@ class EvalRunner:


 async def run_example_pipeline(script_path: Path, eval_config: EvalConfig):
-    room_url = os.getenv("DAILY_SAMPLE_ROOM_URL")
+    room_url = os.getenv("DAILY_ROOM_URL")

    module = load_module_from_path(script_path)

@@ -225,7 +225,7 @@ async def run_eval_pipeline(
 ):
    logger.info(f"Starting eval bot")

-    room_url = os.getenv("DAILY_SAMPLE_ROOM_URL")
+    room_url = os.getenv("DAILY_ROOM_URL")

    transport = DailyTransport(
        room_url,
--- a/scripts/evals/run-release-evals.py
+++ b/scripts/evals/run-release-evals.py
@@ -133,12 +133,12 @@ TESTS_07 = [
    ("07zb-interruptible-inworld-http.py", EVAL_SIMPLE_MATH),
    ("07zc-interruptible-asyncai.py", EVAL_SIMPLE_MATH),
    ("07zc-interruptible-asyncai-http.py", EVAL_SIMPLE_MATH),
-    # Need license key to run
-    # ("07zd-interruptible-aicoustics.py", EVAL_SIMPLE_MATH),
+    ("07zd-interruptible-aicoustics.py", EVAL_SIMPLE_MATH),
    ("07ze-interruptible-hume.py", EVAL_SIMPLE_MATH),
    ("07zf-interruptible-gradium.py", EVAL_SIMPLE_MATH),
    ("07zg-interruptible-camb.py", EVAL_SIMPLE_MATH),
    ("07zh-interruptible-hathora.py", EVAL_SIMPLE_MATH),
+    ("07zi-interruptible-piper.py", EVAL_SIMPLE_MATH),
    # Needs a local XTTS docker instance running.
    # ("07i-interruptible-xtts.py", EVAL_SIMPLE_MATH),
    # Needs a Krisp license.
--- a/scripts/evals/utils.py
+++ b/scripts/evals/utils.py
@@ -28,7 +28,7 @@ def check_env_variables() -> bool:
        "CARTESIA_API_KEY",
        "DEEPGRAM_API_KEY",
        "OPENAI_API_KEY",
-        "DAILY_SAMPLE_ROOM_URL",
+        "DAILY_ROOM_URL",
    ]
    for env in required_envs:
        if not os.getenv(env):
--- a/src/pipecat/audio/filters/aic_filter.py
+++ b/src/pipecat/audio/filters/aic_filter.py
@@ -9,129 +9,145 @@
 This module provides an audio filter implementation using ai-coustics' AIC SDK to
 enhance audio streams in real time. It mirrors the structure of other filters like
 the Koala filter and integrates with Pipecat's input transport pipeline.
+
+Classes:
+    AICFilter: For aic-sdk (uses 'aic_sdk' module)
 """

+from pathlib import Path
 from typing import List, Optional

 import numpy as np
+from aic_sdk import (
+    Model,
+    ParameterFixedError,
+    ProcessorAsync,
+    ProcessorConfig,
+    ProcessorParameter,
+    set_sdk_id,
+)
 from loguru import logger

 from pipecat.audio.filters.base_audio_filter import BaseAudioFilter
+from pipecat.audio.vad.aic_vad import AICVADAnalyzer
 from pipecat.frames.frames import FilterControlFrame, FilterEnableFrame

-try:
-    # AIC SDK (https://ai-coustics.github.io/aic-sdk-py/api/)
-    from aic import AICModelType, AICParameter, Model
-except ModuleNotFoundError as e:
-    logger.error(f"Exception: {e}")
-    logger.error("In order to use the AIC filter, you need to `pip install pipecat-ai[aic]`.")
-    raise Exception(f"Missing module: {e}")
-

 class AICFilter(BaseAudioFilter):
    """Audio filter using ai-coustics' AIC SDK for real-time enhancement.

    Buffers incoming audio to the model's preferred block size and processes
-    planar frames in-place using float32 samples in the linear -1..+1 range.
+    frames using float32 samples normalized to the range -1 to +1.
    """

    def __init__(
        self,
        *,
-        license_key: str = "",
-        model_type: AICModelType = AICModelType.QUAIL_STT,
-        enhancement_level: Optional[float] = 1.0,
-        voice_gain: Optional[float] = 1.0,
-        noise_gate_enable: Optional[bool] = True,
+        license_key: str,
+        model_id: Optional[str] = None,
+        model_path: Optional[Path] = None,
+        model_download_dir: Optional[Path] = None,
    ) -> None:
        """Initialize the AIC filter.

        Args:
            license_key: ai-coustics license key for authentication.
-            model_type: Model variant to load.
-            enhancement_level: Optional overall enhancement strength (0.0..1.0).
-            voice_gain: Optional linear gain applied to detected speech (0.0..4.0).
-            noise_gate_enable: Optional enable/disable noise gate (default: True).
+            model_id: Model identifier to download from CDN. Required if model_path
+                is not provided. See https://artifacts.ai-coustics.io/ for available models.
+            model_path: Optional path to a local .aicmodel file. If provided,
+                model_id is ignored and no download occurs.
+            model_download_dir: Directory for downloading models as a Path object.
+                Defaults to a cache directory in user's home folder.

-                .. deprecated:: 1.3.0
-                    The `noise_gate_enable` parameter is deprecated and no longer has any effect.
-                    It will be removed in a future version.
+        Raises:
+            ValueError: If neither model_id nor model_path is provided.
        """
+        # Set SDK ID for telemetry identification (6 = pipecat)
+        set_sdk_id(6)
+
+        if model_id is None and model_path is None:
+            raise ValueError(
+                "Either 'model_id' or 'model_path' must be provided. "
+                "See https://artifacts.ai-coustics.io/ for available models."
+            )
+
        self._license_key = license_key
-        self._model_type = model_type
+        self._model_id = model_id
+        self._model_path = model_path
+        self._model_download_dir = model_download_dir or (
+            Path.home() / ".cache" / "pipecat" / "aic-models"
+        )

-        self._enhancement_level = enhancement_level
-        self._voice_gain = voice_gain
-        if noise_gate_enable is not None:
-            import warnings
-
-            with warnings.catch_warnings():
-                warnings.simplefilter("always")
-                warnings.warn(
-                    "Parameter `noise_gate_enable` is deprecated and no longer has any effect. "
-                    "It will be removed in a future version. Use AIC VAD instead (create_vad_analyzer()).",
-                    DeprecationWarning,
-                )
-
-        self._noise_gate_enable = noise_gate_enable
-
-        self._enabled = True
+        self._bypass = False
        self._sample_rate = 0
        self._aic_ready = False
        self._frames_per_block = 0
        self._audio_buffer = bytearray()
-        # Model will be created in start() since the API now requires sample_rate
-        self._aic = None

-    def get_vad_factory(self):
-        """Return a zero-arg factory that will create the VAD once the model exists.
+        # Audio format constants
+        self._bytes_per_sample = 2  # int16 = 2 bytes
+        self._dtype = np.int16
+        self._scale = (
+            32768.0  # 2^15, for normalizing int16 (-32768 to 32767) to float32 (-1.0 to 1.0)
+        )
+
+        # AIC SDK objects
+        self._model = None
+        self._processor = None
+        self._processor_ctx = None
+        self._vad_ctx = None
+
+        # Pre-allocated buffers (resized in start() once frames_per_block is known)
+        self._in_f32 = None
+        self._out_i16 = None
+
+    def get_vad_context(self):
+        """Return the VAD context once the processor exists.

        Returns:
-            A zero-argument callable that, when invoked, returns an initialized
-            VoiceActivityDetector bound to the underlying AIC model. Raises a
-            RuntimeError if the model has not been initialized (i.e. start()
-            has not been called successfully).
+            The VadContext instance bound to the underlying processor.
+            Raises RuntimeError if the processor has not been initialized.
        """
-
-        def _factory():
-            if self._aic is None:
-                raise RuntimeError("AIC model not initialized yet. Call start(sample_rate) first.")
-            return self._aic.create_vad()
-
-        return _factory
+        if self._vad_ctx is None:
+            raise RuntimeError("AIC processor not initialized yet. Call start(sample_rate) first.")
+        return self._vad_ctx

    def create_vad_analyzer(
        self,
        *,
-        lookback_buffer_size: Optional[float] = None,
+        speech_hold_duration: Optional[float] = None,
+        minimum_speech_duration: Optional[float] = None,
        sensitivity: Optional[float] = None,
    ):
        """Return an analyzer that will lazily instantiate the AIC VAD when ready.

        AIC VAD parameters:
-          - lookback_buffer_size:
-              Number of window-length audio buffers used as a lookback buffer.
-              Higher values increase prediction stability but add latency.
-              Range: 1.0 .. 20.0, Default (SDK): 6.0
+          - speech_hold_duration:
+              How long VAD continues detecting after speech ends (in seconds).
+              Range: 0.0 to 100x model window length, Default (SDK): 0.05s
+          - minimum_speech_duration:
+              Minimum duration of speech required before VAD reports speech detected
+              (in seconds). Range: 0.0 to 1.0, Default (SDK): 0.0s
          - sensitivity:
              Energy threshold sensitivity. Energy threshold = 10 ** (-sensitivity).
-              Range: 1.0 .. 15.0, Default (SDK): 6.0
+              Range: 1.0 to 15.0, Default (SDK): 6.0

        Args:
-            lookback_buffer_size: Optional lookback buffer size to configure on the VAD.
-                Range: 1.0 .. 20.0. If None, SDK default is used.
+            speech_hold_duration: Optional speech hold duration to configure on the VAD.
+                If None, SDK default (0.05s) is used.
+            minimum_speech_duration: Optional minimum speech duration before VAD reports
+                speech detected. If None, SDK default (0.0s) is used.
            sensitivity: Optional sensitivity (energy threshold) to configure on the VAD.
-                Range: 1.0 .. 15.0. If None, SDK default is used.
+                Range: 1.0 to 15.0. If None, SDK default (6.0) is used.

        Returns:
-            A lazily-initialized AICVADAnalyzer that will bind to the VAD backend
-            once the filter's model has been created (after start(sample_rate)).
+            A lazily-initialized AICVADAnalyzer that will bind to the VAD context
+            once the filter's processor has been created (after start(sample_rate)).
        """
-        from pipecat.audio.vad.aic_vad import AICVADAnalyzer
-
        return AICVADAnalyzer(
-            vad_factory=self.get_vad_factory(),
-            lookback_buffer_size=lookback_buffer_size,
+            vad_context_factory=lambda: self.get_vad_context(),
+            speech_hold_duration=speech_hold_duration,
+            minimum_speech_duration=minimum_speech_duration,
            sensitivity=sensitivity,
        )

@@ -146,52 +162,83 @@ class AICFilter(BaseAudioFilter):
        """
        self._sample_rate = sample_rate

+        # Load or download model
+        if self._model_path:
+            logger.debug(f"Loading AIC model from: {self._model_path}")
+            self._model = Model.from_file(str(self._model_path))
+        else:
+            logger.debug(f"Downloading AIC model: {self._model_id}")
+            self._model_download_dir.mkdir(parents=True, exist_ok=True)
+            model_path = await Model.download_async(self._model_id, str(self._model_download_dir))
+            logger.debug(f"Model downloaded to: {model_path}")
+            self._model = Model.from_file(model_path)
+
+        # Get optimal frames for this sample rate
+        self._frames_per_block = self._model.get_optimal_num_frames(self._sample_rate)
+
+        # Allocate processing buffers now that we know the block size
+        self._in_f32 = np.zeros((1, self._frames_per_block), dtype=np.float32)
+        self._out_i16 = np.zeros(self._frames_per_block, dtype=np.int16)
+
+        # Create configuration
+        config = ProcessorConfig.optimal(
+            self._model,
+            sample_rate=self._sample_rate,
+        )
+
+        # Create async processor
        try:
-            # Create model with required runtime parameters
-            self._aic = Model(
-                model_type=self._model_type,
-                license_key=self._license_key or None,
-                sample_rate=self._sample_rate,
-                channels=1,
-            )
-            self._frames_per_block = self._aic.optimal_num_frames()
-
-            # Optional parameter configuration
-            if self._enhancement_level is not None:
-                self._aic.set_parameter(
-                    AICParameter.ENHANCEMENT_LEVEL,
-                    float(self._enhancement_level if self._enabled else 0.0),
-                )
-            if self._voice_gain is not None:
-                self._aic.set_parameter(AICParameter.VOICE_GAIN, float(self._voice_gain))
-
-            self._aic_ready = True
-
-            # Log processor information
-            logger.debug(f"ai-coustics filter started:")
-            logger.debug(f"  Sample rate: {self._sample_rate} Hz")
-            logger.debug(f"  Frames per chunk: {self._frames_per_block}")
-            logger.debug(f"  Enhancement strength: {int(self._enhancement_level * 100)}%")
-            logger.debug(f"  Optimal input buffer size: {self._aic.optimal_num_frames()} samples")
-            logger.debug(f"  Optimal sample rate: {self._aic.optimal_sample_rate()} Hz")
-            logger.debug(
-                f"  Current algorithmic latency: {self._aic.processing_latency() / self._sample_rate * 1000:.2f}ms"
-            )
+            self._processor = ProcessorAsync(self._model, self._license_key, config)
        except Exception as e:  # noqa: BLE001 - surfacing SDK initialization errors
            logger.error(f"AIC model initialization failed: {e}")
-            self._aic_ready = False
+            self._processor = None
+
+        self._aic_ready = self._processor is not None
+
+        if not self._aic_ready:
+            logger.debug(f"ai-coustics filter is not ready.")
+            return
+
+        # Get contexts for parameter control and VAD
+        self._processor_ctx = self._processor.get_processor_context()
+        self._vad_ctx = self._processor.get_vad_context()
+
+        # Apply initial parameters
+        try:
+            self._processor_ctx.set_parameter(
+                ProcessorParameter.Bypass, 1.0 if self._bypass else 0.0
+            )
+        except ParameterFixedError as e:
+            logger.error(f"AIC parameter update failed: {e}")
+
+        # Log processor information
+        logger.debug(f"ai-coustics filter started:")
+        logger.debug(f"  Model ID: {self._model.get_id()}")
+        logger.debug(f"  Sample rate: {self._sample_rate} Hz")
+        logger.debug(f"  Frames per chunk: {self._frames_per_block}")
+        logger.debug(f"  Optimal sample rate: {self._model.get_optimal_sample_rate()} Hz")
+        logger.debug(
+            f"  Optimal number of frames for {self._sample_rate} Hz: {self._model.get_optimal_num_frames(self._sample_rate)}"
+        )
+        logger.debug(
+            f"  Output delay: {self._processor_ctx.get_output_delay()} samples "
+            f"({self._processor_ctx.get_output_delay() / self._sample_rate * 1000:.2f}ms)"
+        )

    async def stop(self):
-        """Clean up the AIC model when stopping.
+        """Clean up the AIC processor when stopping.

        Returns:
            None
        """
        try:
-            if self._aic is not None:
-                self._aic.close()
+            if self._processor_ctx is not None:
+                self._processor_ctx.reset()
        finally:
-            self._aic = None
+            self._processor = None
+            self._processor_ctx = None
+            self._vad_ctx = None
+            self._model = None
            self._aic_ready = False
            self._audio_buffer.clear()

@@ -205,11 +252,12 @@ class AICFilter(BaseAudioFilter):
            None
        """
        if isinstance(frame, FilterEnableFrame):
-            self._enabled = frame.enable
-            if self._aic is not None:
+            self._bypass = not frame.enable
+            if self._processor_ctx is not None:
                try:
-                    level = float(self._enhancement_level if self._enabled else 0.0)
-                    self._aic.set_parameter(AICParameter.ENHANCEMENT_LEVEL, level)
+                    self._processor_ctx.set_parameter(
+                        ProcessorParameter.Bypass, 1.0 if self._bypass else 0.0
+                    )
                except Exception as e:  # noqa: BLE001
                    logger.error(f"AIC set_parameter failed: {e}")

@@ -220,43 +268,41 @@ class AICFilter(BaseAudioFilter):
        model's required block length. Returns enhanced audio data.

        Args:
-            audio: Raw audio data as bytes to be filtered (int16 PCM, planar).
+            audio: Raw audio data as bytes (int16 PCM).

        Returns:
-            Enhanced audio data as bytes (int16 PCM, planar).
+            Enhanced audio data as bytes (int16 PCM).
        """
-        if not self._aic_ready or self._aic is None:
+        if not self._aic_ready or self._processor is None:
            return audio

        self._audio_buffer.extend(audio)
+        available_frames = len(self._audio_buffer) // self._bytes_per_sample
+        num_blocks = available_frames // self._frames_per_block
+
+        if num_blocks == 0:
+            return b""

        filtered_chunks: List[bytes] = []
+        mv = memoryview(self._audio_buffer)
+        block_size = self._frames_per_block * self._bytes_per_sample

-        # Number of int16 samples currently buffered
-        available_frames = len(self._audio_buffer) // 2
+        for i in range(num_blocks):
+            start = i * block_size
+            block_i16 = np.frombuffer(mv[start : start + block_size], dtype=self._dtype)

-        while available_frames >= self._frames_per_block:
-            # Consume exactly one block worth of frames
-            samples_to_consume = self._frames_per_block * 1
-            bytes_to_consume = samples_to_consume * 2
-            block_bytes = bytes(self._audio_buffer[:bytes_to_consume])
+            # Reuse input buffer, in-place divide
+            np.copyto(self._in_f32[0], block_i16)
+            self._in_f32 /= self._scale

-            # Convert to float32 in -1..+1 range and reshape to planar (channels, frames)
-            block_i16 = np.frombuffer(block_bytes, dtype=np.int16)
-            block_f32 = (block_i16.astype(np.float32) / 32768.0).reshape(
-                (1, self._frames_per_block)
-            )
+            out_f32 = await self._processor.process_async(self._in_f32)

-            # Process planar in-place; returns ndarray (same shape)
-            out_f32 = await self._aic.process_async(block_f32)
+            # Convert float32 output back to int16
+            np.multiply(out_f32, self._scale, out=self._in_f32)  # reuse in_f32 as temp
+            np.clip(self._in_f32, -self._scale, self._scale - 1, out=self._in_f32)
+            np.copyto(self._out_i16, self._in_f32[0].astype(self._dtype))

-            # Convert back to int16 bytes, planar layout
-            out_i16 = np.clip(out_f32 * 32768.0, -32768, 32767).astype(np.int16)
-            filtered_chunks.append(out_i16.reshape(-1).tobytes())
+            filtered_chunks.append(self._out_i16.tobytes())

-            # Slide buffer
-            self._audio_buffer = self._audio_buffer[bytes_to_consume:]
-            available_frames = len(self._audio_buffer) // 2
-
-        # Do not flush incomplete frames; keep them buffered for the next call
+        self._audio_buffer = self._audio_buffer[num_blocks * block_size :]
        return b"".join(filtered_chunks)
--- a/src/pipecat/audio/turn/smart_turn/local_smart_turn_v3.py
+++ b/src/pipecat/audio/turn/smart_turn/local_smart_turn_v3.py
@@ -49,7 +49,7 @@ class LocalSmartTurnAnalyzerV3(BaseSmartTurn):
        """
        super().__init__(**kwargs)

-        self._log_data = env_truthy("SMART_TURN_LOG_DATA", default=False)
+        self._log_data = env_truthy("PIPECAT_SMART_TURN_LOG_DATA", default=False)

        if not smart_turn_model_path:
            # Load bundled model
--- a/src/pipecat/audio/vad/aic_vad.py
+++ b/src/pipecat/audio/vad/aic_vad.py
@@ -1,44 +1,44 @@
 """AIC-integrated VAD analyzer that lazily binds to the AIC SDK backend.

-This analyzer queries the backend's is_speech_detected() and maps it to a float
-confidence (1.0/0.0). It uses 10 ms windows based on the sample rate and applies
-optional AIC VAD parameters (lookback_buffer_size, sensitivity) when available.
+This module provides VAD analyzer implementations that query the AIC SDK's
+is_speech_detected() and map it to a float confidence (1.0/0.0).
+
+Classes:
+    AICVADAnalyzer: For aic-sdk (uses 'aic_sdk' module)
 """

 from typing import Any, Callable, Optional

+from aic_sdk import VadParameter
 from loguru import logger

 from pipecat.audio.vad.vad_analyzer import VADAnalyzer, VADParams

-try:
-    from aic import AICVadParameter
-except ModuleNotFoundError as e:
-    logger.error(f"Exception: {e}")
-    logger.error("In order to use the AIC filter, you need to `pip install pipecat-ai[aic]`.")
-    raise Exception(f"Missing module: {e}")
-

 class AICVADAnalyzer(VADAnalyzer):
-    """VAD analyzer that lazily instantiates the AIC VoiceActivityDetector via a factory.
+    """VAD analyzer that lazily binds to the AIC VadContext via a factory.

-    The analyzer can be constructed before the AIC Model exists. Once the filter has
-    started and the Model is available, the provided factory will succeed and the
-    backend VAD will be created. We then switch to single-sample updates where
-    num_frames_required() returns 1 and confidence is derived from the backend's
-    boolean is_speech_detected() state.
+    The analyzer can be constructed before the AIC Processor exists. Once the filter has
+    started and the Processor is available, the provided factory will succeed and the
+    VadContext will be obtained. The context's is_speech_detected() boolean state is
+    then mapped to 1.0 (speech) or 0.0 (no speech) to satisfy the VADAnalyzer interface.

    AIC VAD runtime parameters:
-      - lookback_buffer_size:
-          Controls the lookback buffer size used by the VAD, i.e. the number of
-          window-length audio buffers used as a lookback buffer. Larger values improve
-          stability but increase latency.
-          Range: 1.0 .. 20.0
-          Default (SDK): 6.0
+      - speech_hold_duration:
+          Controls for how long the VAD continues to detect speech after the audio signal
+          no longer contains speech (in seconds).
+          Range: 0.0 to 100x model window length
+          Default (SDK): 0.05s (50ms)
+      - minimum_speech_duration:
+          Controls for how long speech needs to be present in the audio signal before the
+          VAD considers it speech (in seconds).
+          Range: 0.0 to 1.0
+          Default (SDK): 0.0s
      - sensitivity:
-          Controls the energy threshold sensitivity. Higher values make the detector
-          less sensitive (require more energy to count as speech).
-          Range: 1.0 .. 15.0
+          Controls the sensitivity (energy threshold) of the VAD. This value is used by
+          the VAD as the threshold a speech audio signal's energy has to exceed in order
+          to be considered speech.
+          Range: 1.0 to 15.0
          Formula: Energy threshold = 10 ** (-sensitivity)
          Default (SDK): 6.0
    """
@@ -46,69 +46,80 @@ class AICVADAnalyzer(VADAnalyzer):
    def __init__(
        self,
        *,
-        vad_factory: Optional[Callable[[], Any]] = None,
-        lookback_buffer_size: Optional[float] = None,
+        vad_context_factory: Optional[Callable[[], Any]] = None,
+        speech_hold_duration: Optional[float] = None,
+        minimum_speech_duration: Optional[float] = None,
        sensitivity: Optional[float] = None,
    ):
        """Create an AIC VAD analyzer.

        Args:
-            vad_factory:
-                Zero-arg callable that returns an initialized AIC VoiceActivityDetector.
-                This may raise until the filter's Model has been created; the analyzer
+            vad_context_factory:
+                Zero-arg callable that returns the AIC VadContext.
+                This may raise until the filter's Processor has been created; the analyzer
                will retry on set_sample_rate/first use.
-            lookback_buffer_size:
-                Optional override for AIC VAD lookback buffer size.
-                Range: 1.0 .. 20.0. Larger values increase stability at the cost of latency.
-                If None, the SDK default (6.0) is used.
+            speech_hold_duration:
+                Optional override for AIC VAD speech hold duration (in seconds).
+                Range: 0.0 to 100x model window length.
+                If None, the SDK default (0.05s) is used.
+            minimum_speech_duration:
+                Optional override for minimum speech duration before VAD reports
+                speech detected (in seconds).
+                Range: 0.0 to 1.0.
+                If None, the SDK default (0.0s) is used.
            sensitivity:
                Optional override for AIC VAD sensitivity (energy threshold).
-                Range: 1.0 .. 15.0. Energy threshold = 10 ** (-sensitivity).
+                Range: 1.0 to 15.0. Energy threshold = 10 ** (-sensitivity).
                If None, the SDK default (6.0) is used.
        """
        # Use fixed VAD parameters for AIC: no user override
        fixed_params = VADParams(confidence=0.5, start_secs=0.0, stop_secs=0.0, min_volume=0.0)
        super().__init__(sample_rate=None, params=fixed_params)
-        self._vad_factory = vad_factory
-        self._backend_vad: Optional[Any] = None
-        self._pending_lookback: Optional[float] = lookback_buffer_size
+
+        self._vad_context_factory = vad_context_factory
+        self._vad_ctx: Optional[Any] = None
+        self._pending_speech_hold_duration: Optional[float] = speech_hold_duration
+        self._pending_minimum_speech_duration: Optional[float] = minimum_speech_duration
        self._pending_sensitivity: Optional[float] = sensitivity

-    def bind_vad_factory(self, vad_factory: Callable[[], Any]):
+    def bind_vad_context_factory(self, vad_context_factory: Callable[[], Any]):
        """Attach or replace the factory post-construction."""
-        self._vad_factory = vad_factory
-        self._ensure_backend_initialized()
+        self._vad_context_factory = vad_context_factory
+        self._ensure_vad_context_initialized()

-    def _apply_backend_params(self):
+    def _apply_vad_params(self):
        """Apply optional AIC VAD parameters if available."""
-        if self._backend_vad is None or AICVadParameter is None:
+        if self._vad_ctx is None or VadParameter is None:
            return
+
        try:
-            if self._pending_lookback is not None:
-                self._backend_vad.set_parameter(
-                    AICVadParameter.LOOKBACK_BUFFER_SIZE, float(self._pending_lookback)
+            if self._pending_speech_hold_duration is not None:
+                self._vad_ctx.set_parameter(
+                    VadParameter.SpeechHoldDuration, self._pending_speech_hold_duration
+                )
+            if self._pending_minimum_speech_duration is not None:
+                self._vad_ctx.set_parameter(
+                    VadParameter.MinimumSpeechDuration, self._pending_minimum_speech_duration
                )
            if self._pending_sensitivity is not None:
-                self._backend_vad.set_parameter(
-                    AICVadParameter.SENSITIVITY, float(self._pending_sensitivity)
-                )
+                self._vad_ctx.set_parameter(VadParameter.Sensitivity, self._pending_sensitivity)
        except Exception as e:  # noqa: BLE001
            logger.debug(f"AIC VAD parameter application deferred/failed: {e}")

-    def _ensure_backend_initialized(self):
-        if self._backend_vad is not None:
+    def _ensure_vad_context_initialized(self):
+        if self._vad_ctx is not None:
            return
-        if not self._vad_factory:
+        if not self._vad_context_factory:
            return
        try:
-            self._backend_vad = self._vad_factory()
-            self._apply_backend_params()
-            # With backend ready, recompute internal frame sizing
+            self._vad_ctx = self._vad_context_factory()
+            self._apply_vad_params()
+            # With VAD context ready, recompute internal frame sizing
            super().set_params(self._params)
-            logger.debug("AIC VAD backend initialized in analyzer.")
+            logger.debug("AIC VAD context initialized in analyzer.")
        except Exception as e:  # noqa: BLE001
            # Filter may not be started yet; try again later
-            logger.debug(f"Deferring AIC VAD backend initialization: {e}")
+            logger.debug(f"Deferring AIC VAD context initialization: {e}")

    def set_sample_rate(self, sample_rate: int):
        """Set the sample rate for audio processing.
@@ -116,10 +127,10 @@ class AICVADAnalyzer(VADAnalyzer):
        Args:
            sample_rate: Audio sample rate in Hz.
        """
-        # Set rate and attempt backend initialization once we know SR
+        # Set rate and attempt VAD context initialization once we know SR
        self._sample_rate = self._init_sample_rate or sample_rate
-        self._ensure_backend_initialized()
-        # Ensure params are initialized even if backend not ready yet
+        self._ensure_vad_context_initialized()
+        # Ensure params are initialized even if VAD context not ready yet
        try:
            super().set_params(self._params)
        except Exception:
@@ -135,23 +146,29 @@ class AICVADAnalyzer(VADAnalyzer):
        return int(self.sample_rate * 0.01) if self.sample_rate > 0 else 160

    def voice_confidence(self, buffer: bytes) -> float:
-        """Calculate voice activity confidence for the given audio buffer.
+        """Return voice activity detection result for the given audio buffer.
+
+        Note:
+            The AIC SDK provides binary speech detection (not a probability score).
+            This method returns 1.0 when speech is detected and 0.0 otherwise,
+            rather than a true confidence value.

        Args:
-            buffer: Audio buffer to analyze.
+            buffer: Audio buffer (unused - AIC VAD state is updated internally
+                by the enhancement pipeline).

        Returns:
-            Voice confidence score is 0.0 or 1.0.
+            1.0 if speech is detected, 0.0 otherwise.
        """
-        # Ensure backend exists (filter might have started since last call)
-        self._ensure_backend_initialized()
-        if self._backend_vad is None:
+        # Ensure VAD context exists (filter might have started since last call)
+        self._ensure_vad_context_initialized()
+        if self._vad_ctx is None:
            return 0.0

-        # We do not need to analyze 'buffer' here since the model's VAD is updated
+        # We do not need to analyze 'buffer' here since the processor's VAD is updated
        # as part of the enhancement pipeline. Simply query the boolean and map it.
        try:
-            is_speech = self._backend_vad.is_speech_detected()
+            is_speech = self._vad_ctx.is_speech_detected()
            return 1.0 if is_speech else 0.0
        except Exception as e:  # noqa: BLE001
            logger.error(f"AIC VAD inference error: {e}")
--- a/src/pipecat/frames/frames.py
+++ b/src/pipecat/frames/frames.py
@@ -426,12 +426,15 @@ class TranscriptionFrame(TextFrame):
        timestamp: When the transcription occurred.
        language: Detected or specified language of the speech.
        result: Raw result from the STT service.
+        finalized: Whether this is the final transcription for an utterance.
+            Set by STT services that support commit/finalize signals.
    """

    user_id: str
    timestamp: str
    language: Optional[Language] = None
    result: Optional[Any] = None
+    finalized: bool = False

    def __str__(self):
        return f"{self.name}(user: {self.user_id}, text: [{self.text}], language: {self.language}, timestamp: {self.timestamp})"
@@ -1463,6 +1466,7 @@ class UserImageRequestFrame(SystemFrame):
        video_source: Specific video source to capture from.
        function_name: Name of function that generated this request (if any).
        tool_call_id: Tool call ID if generated by function call (if any).
+        result_callback: Optional callback to invoke when the image is retrieved.
        context: [DEPRECATED] Optional context for the image request.
    """

@@ -1472,6 +1476,7 @@ class UserImageRequestFrame(SystemFrame):
    video_source: Optional[str] = None
    function_name: Optional[str] = None
    tool_call_id: Optional[str] = None
+    result_callback: Optional[Any] = None
    context: Optional[Any] = None

    def __post_init__(self):
--- a/src/pipecat/processors/aggregators/llm_response.py
+++ b/src/pipecat/processors/aggregators/llm_response.py
@@ -1042,6 +1042,11 @@ class LLMAssistantContextAggregator(LLMContextResponseAggregator):

        del self._function_calls_in_progress[frame.request.tool_call_id]

+        # Call the result_callback if provided. This signals that the image
+        # has been retrieved and the function call can now complete.
+        if frame.request and frame.request.result_callback:
+            await frame.request.result_callback(None)
+
        await self.handle_user_image_frame(frame)
        await self.push_aggregation()
        await self.push_context_frame(FrameDirection.UPSTREAM)
--- a/src/pipecat/processors/aggregators/llm_response_universal.py
+++ b/src/pipecat/processors/aggregators/llm_response_universal.py
@@ -464,9 +464,11 @@ class LLMUserAggregator(LLMContextAggregator):
            await s.setup(self.task_manager)

    async def _stop(self, frame: EndFrame):
+        await self._maybe_emit_user_turn_stopped(on_session_end=True)
        await self._cleanup()

    async def _cancel(self, frame: CancelFrame):
+        await self._maybe_emit_user_turn_stopped(on_session_end=True)
        await self._cleanup()

    async def _cleanup(self):
@@ -602,14 +604,7 @@ class LLMUserAggregator(LLMContextAggregator):
        if params.enable_user_speaking_frames:
            await self.broadcast_frame(UserStoppedSpeakingFrame)

-        # Always push context frame.
-        aggregation = await self.push_aggregation()
-
-        message = UserTurnStoppedMessage(
-            content=aggregation, timestamp=self._user_turn_start_timestamp
-        )
-        await self._call_event_handler("on_user_turn_stopped", strategy, message)
-        self._user_turn_start_timestamp = ""
+        await self._maybe_emit_user_turn_stopped(strategy)

    async def _on_user_turn_stop_timeout(self, controller):
        await self._call_event_handler("on_user_turn_stop_timeout")
@@ -617,6 +612,26 @@ class LLMUserAggregator(LLMContextAggregator):
    async def _on_user_turn_idle(self, controller):
        await self._call_event_handler("on_user_turn_idle")

+    async def _maybe_emit_user_turn_stopped(
+        self,
+        strategy: Optional[BaseUserTurnStopStrategy] = None,
+        on_session_end: bool = False,
+    ):
+        """Maybe emit user turn stopped event.
+
+        Args:
+            strategy: The strategy that triggered the turn stop.
+            on_session_end: If True, only emit if there's unemitted content
+                (avoids duplicate events when session ends).
+        """
+        aggregation = await self.push_aggregation()
+        if not on_session_end or aggregation:
+            message = UserTurnStoppedMessage(
+                content=aggregation, timestamp=self._user_turn_start_timestamp
+            )
+            await self._call_event_handler("on_user_turn_stopped", strategy, message)
+            self._user_turn_start_timestamp = ""
+

 class LLMAssistantAggregator(LLMContextAggregator):
    """Assistant LLM aggregator that processes bot responses and function calls.
@@ -739,6 +754,9 @@ class LLMAssistantAggregator(LLMContextAggregator):
        if isinstance(frame, InterruptionFrame):
            await self._handle_interruptions(frame)
            await self.push_frame(frame, direction)
+        elif isinstance(frame, (EndFrame, CancelFrame)):
+            await self._handle_end_or_cancel(frame)
+            await self.push_frame(frame, direction)
        elif isinstance(frame, LLMFullResponseStartFrame):
            await self._handle_llm_start(frame)
        elif isinstance(frame, LLMFullResponseEndFrame):
@@ -813,6 +831,10 @@ class LLMAssistantAggregator(LLMContextAggregator):
        self._started = 0
        await self.reset()

+    async def _handle_end_or_cancel(self, frame: Frame):
+        await self._trigger_assistant_turn_stopped()
+        self._started = 0
+
    async def _handle_function_calls_started(self, frame: FunctionCallsStartedFrame):
        function_names = [f"{f.function_name}:{f.tool_call_id}" for f in frame.function_calls]
        logger.debug(f"{self} FunctionCallsStartedFrame: {function_names}")
@@ -833,7 +855,7 @@ class LLMAssistantAggregator(LLMContextAggregator):
                        "id": frame.tool_call_id,
                        "function": {
                            "name": frame.function_name,
-                            "arguments": json.dumps(frame.arguments),
+                            "arguments": json.dumps(frame.arguments, ensure_ascii=False),
                        },
                        "type": "function",
                    }
@@ -866,7 +888,7 @@ class LLMAssistantAggregator(LLMContextAggregator):

        # Update context with the function call result
        if frame.result:
-            result = json.dumps(frame.result)
+            result = json.dumps(frame.result, ensure_ascii=False)
            self._update_function_call_result(frame.function_name, frame.tool_call_id, result)
        else:
            self._update_function_call_result(frame.function_name, frame.tool_call_id, "COMPLETED")
@@ -919,16 +941,18 @@ class LLMAssistantAggregator(LLMContextAggregator):
    async def _handle_user_image_frame(self, frame: UserImageRawFrame):
        image_appended = False

-        # Check if this image is a result of a function call if so, let's cache.
-        # TODO(aleix): The function call might have already been executed
-        # because FunctionCallResultFrame was just faster, in that case we just
-        # push the context frame now.
+        # Check if this image is a result of a function call.
        if (
            frame.request
            and frame.request.tool_call_id
            and frame.request.tool_call_id in self._function_calls_in_progress
        ):
            self._function_calls_image_results[frame.request.tool_call_id] = frame
+
+            # Call the result_callback if provided. This signals that the image
+            # has been retrieved and the function call can now complete.
+            if frame.request.result_callback:
+                await frame.request.result_callback(None)
        else:
            image_appended = await self._maybe_append_image_to_context(frame)

--- a/src/pipecat/processors/audio/audio_buffer_processor.py
+++ b/src/pipecat/processors/audio/audio_buffer_processor.py
@@ -11,7 +11,6 @@ of audio from both user input and bot output sources, with support for various a
 configurations and event-driven processing.
 """

-import time
 from typing import Optional

 from pipecat.audio.utils import create_stream_resampler, interleave_stereo_audio, mix_audio
@@ -104,10 +103,6 @@ class AudioBufferProcessor(FrameProcessor):
        self._user_turn_audio_buffer = bytearray()
        self._bot_turn_audio_buffer = bytearray()

-        # Intermittent (non continous user stream variables)
-        self._last_user_frame_at = 0
-        self._last_bot_frame_at = 0
-
        self._recording = False

        self._input_resampler = create_stream_resampler()
@@ -211,23 +206,31 @@ class AudioBufferProcessor(FrameProcessor):
        """Process audio frames for recording."""
        resampled = None
        if isinstance(frame, InputAudioRawFrame):
-            # Add silence if we need to.
-            silence = self._compute_silence(self._last_user_frame_at)
-            self._user_audio_buffer.extend(silence)
-            # Add user audio.
            resampled = await self._resample_input_audio(frame)
-            self._user_audio_buffer.extend(resampled)
-            # Save time of frame so we can compute silence.
-            self._last_user_frame_at = time.time()
+            # Ignoring in case we don't have audio
+            if len(resampled) > 0:
+                # Sync bot buffer to current user position before adding user audio.
+                # We sync BEFORE extending to align both buffers at the same starting timestamp.
+                # For example, user buffer is at 100 bytes, and you receive 20 bytes of new audio
+                #  - Bot buffer sees User is at 100. Bot pads itself to 100.
+                #  - User buffer adds 20. User is now at 120.
+                #  - Outcome: At index 100-120, we have User Audio and (potentially) Bot Audio or silence. They are aligned
+                # This gives the opportunity to the bot to send audio.
+                #
+                # If we synced AFTER, we'd pad the bot buffer with silence for the same
+                # window we just gave to the user, effectively "overwriting" that time slot
+                # with silence and causing the bot's audio to flicker or cut out.
+                self._sync_buffer_to_position(self._bot_audio_buffer, len(self._user_audio_buffer))
+                # Add user audio.
+                self._user_audio_buffer.extend(resampled)
        elif self._recording and isinstance(frame, OutputAudioRawFrame):
-            # Add silence if we need to.
-            silence = self._compute_silence(self._last_bot_frame_at)
-            self._bot_audio_buffer.extend(silence)
-            # Add bot audio.
            resampled = await self._resample_output_audio(frame)
-            self._bot_audio_buffer.extend(resampled)
-            # Save time of frame so we can compute silence.
-            self._last_bot_frame_at = time.time()
+            # Ignoring in case we don't have audio
+            if len(resampled) > 0:
+                # Sync user buffer to current bot position before adding bot audio
+                self._sync_buffer_to_position(self._user_audio_buffer, len(self._bot_audio_buffer))
+                # Add bot audio.
+                self._bot_audio_buffer.extend(resampled)

        if self._buffer_size > 0 and (
            len(self._user_audio_buffer) >= self._buffer_size
@@ -240,6 +243,21 @@ class AudioBufferProcessor(FrameProcessor):
        if self._enable_turn_audio:
            await self._process_turn_recording(frame, resampled)

+    def _sync_buffer_to_position(self, buffer: bytearray, target_position: int):
+        """Pad buffer with silence if it's behind the target position.
+
+        This ensures both buffers stay synchronized by padding the lagging
+        buffer before new audio is added to the other buffer.
+
+        Args:
+            buffer: The buffer to potentially pad.
+            target_position: The position (in bytes) the buffer should reach.
+        """
+        current_len = len(buffer)
+        if current_len < target_position:
+            silence_needed = target_position - current_len
+            buffer.extend(b"\x00" * silence_needed)
+
    async def _process_turn_recording(self, frame: Frame, resampled_audio: Optional[bytes] = None):
        """Process frames for turn-based audio recording."""
        if isinstance(frame, UserStartedSpeakingFrame):
@@ -281,8 +299,8 @@ class AudioBufferProcessor(FrameProcessor):
        if len(self._user_audio_buffer) == 0 and len(self._bot_audio_buffer) == 0:
            return

+        # Final alignment before we send the audio
        self._align_track_buffers()
-        flush_time = time.time()

        # Call original handler with merged audio
        merged_audio = self.merge_audio_buffers()
@@ -299,9 +317,6 @@ class AudioBufferProcessor(FrameProcessor):
            self._num_channels,
        )

-        self._last_user_frame_at = flush_time
-        self._last_bot_frame_at = flush_time
-
    def _buffer_has_audio(self, buffer: bytearray) -> bool:
        """Check if a buffer contains audio data."""
        return buffer is not None and len(buffer) > 0
@@ -309,8 +324,6 @@ class AudioBufferProcessor(FrameProcessor):
    def _reset_recording(self):
        """Reset recording state and buffers."""
        self._reset_all_audio_buffers()
-        self._last_user_frame_at = time.time()
-        self._last_bot_frame_at = time.time()

    def _reset_all_audio_buffers(self):
        """Reset all audio buffers to empty state."""
@@ -336,11 +349,9 @@ class AudioBufferProcessor(FrameProcessor):

        target_len = max(user_len, bot_len)
        if user_len < target_len:
-            self._user_audio_buffer.extend(b"\x00" * (target_len - user_len))
-            self._last_user_frame_at = max(self._last_user_frame_at, self._last_bot_frame_at)
+            self._sync_buffer_to_position(self._user_audio_buffer, target_len)
        if bot_len < target_len:
-            self._bot_audio_buffer.extend(b"\x00" * (target_len - bot_len))
-            self._last_bot_frame_at = max(self._last_bot_frame_at, self._last_user_frame_at)
+            self._sync_buffer_to_position(self._bot_audio_buffer, target_len)

    async def _resample_input_audio(self, frame: InputAudioRawFrame) -> bytes:
        """Resample audio frame to the target sample rate."""
@@ -353,14 +364,3 @@ class AudioBufferProcessor(FrameProcessor):
        return await self._output_resampler.resample(
            frame.audio, frame.sample_rate, self._sample_rate
        )
-
-    def _compute_silence(self, from_time: float) -> bytes:
-        """Compute silence to insert based on time gap."""
-        quiet_time = time.time() - from_time
-        # We should get audio frames very frequently. We introduce silence only
-        # if there's a big enough gap of 1s.
-        if from_time == 0 or quiet_time < 1.0:
-            return b""
-        num_bytes = int(quiet_time * self._sample_rate) * 2
-        silence = b"\x00" * num_bytes
-        return silence
--- a/src/pipecat/processors/frame_processor.py
+++ b/src/pipecat/processors/frame_processor.py
@@ -14,7 +14,6 @@ management, and frame flow control mechanisms.
 import asyncio
 import dataclasses
 import traceback
-from copy import deepcopy
 from dataclasses import dataclass
 from enum import Enum
 from typing import (
@@ -775,20 +774,21 @@ class FrameProcessor(BaseObject):
        """Broadcasts a frame of the specified class upstream and downstream.

        This method creates two instances of the given frame class using the
-        provided keyword arguments and pushes them upstream and downstream.
+        provided keyword arguments (without deep-copying them) and pushes them
+        upstream and downstream.

        Args:
            frame_cls: The class of the frame to be broadcasted.
            **kwargs: Keyword arguments to be passed to the frame's constructor.
        """
-        await self.push_frame(frame_cls(**deepcopy(kwargs)))
-        await self.push_frame(frame_cls(**deepcopy(kwargs)), FrameDirection.UPSTREAM)
+        await self.push_frame(frame_cls(**kwargs))
+        await self.push_frame(frame_cls(**kwargs), FrameDirection.UPSTREAM)

    async def broadcast_frame_instance(self, frame: Frame):
        """Broadcasts a frame instance upstream and downstream.

-        This method creates two new frame instances copying all fields from the
-        original frame except `id` and `name`, which get fresh values.
+        This method creates two new frame instances shallow-copying all fields
+        from the original frame except `id` and `name`, which get fresh values.

        Args:
            frame: The frame instance to broadcast.
@@ -806,13 +806,13 @@ class FrameProcessor(BaseObject):
            if not f.init and f.name not in ("id", "name")
        }

-        new_frame = frame_cls(**deepcopy(init_fields))
-        for k, v in deepcopy(extra_fields).items():
+        new_frame = frame_cls(**init_fields)
+        for k, v in extra_fields.items():
            setattr(new_frame, k, v)
        await self.push_frame(new_frame)

-        new_frame = frame_cls(**deepcopy(init_fields))
-        for k, v in deepcopy(extra_fields).items():
+        new_frame = frame_cls(**init_fields)
+        for k, v in extra_fields.items():
            setattr(new_frame, k, v)
        await self.push_frame(new_frame, FrameDirection.UPSTREAM)

--- a/src/pipecat/runner/daily.py
+++ b/src/pipecat/runner/daily.py
@@ -17,7 +17,7 @@ Functions:
 Environment variables:

 - DAILY_API_KEY - Daily API key for room/token creation (required)
- DAILY_SAMPLE_ROOM_URL (optional) - Existing room URL to use. If not provided,
+- DAILY_ROOM_URL (optional) - Existing room URL to use. If not provided,
  a temporary room will be created automatically.

 Example::
@@ -91,7 +91,7 @@ async def configure(
    """Configure Daily room URL and token with optional SIP capabilities.

    This function will either:
-    1. Use an existing room URL from DAILY_SAMPLE_ROOM_URL environment variable (standard mode only)
+    1. Use an existing room URL from DAILY_ROOM_URL environment variable (standard mode only)
    2. Create a new temporary room automatically if no URL is provided

    Args:
@@ -177,7 +177,7 @@ async def configure(
    )

    # Check for existing room URL (only in standard mode)
-    existing_room_url = os.getenv("DAILY_SAMPLE_ROOM_URL")
+    existing_room_url = os.getenv("DAILY_ROOM_URL")
    if existing_room_url and not sip_enabled:
        # Use existing room (standard mode only)
        logger.info(f"Using existing Daily room: {existing_room_url}")
--- a/src/pipecat/runner/run.py
+++ b/src/pipecat/runner/run.py
@@ -153,26 +153,18 @@ def _get_bot_module():
    )


-async def _run_telephony_bot(websocket: WebSocket):
+async def _run_telephony_bot(websocket: WebSocket, args: argparse.Namespace):
    """Run a bot for telephony transports."""
    bot_module = _get_bot_module()

    # Just pass the WebSocket - let the bot handle parsing
    runner_args = WebSocketRunnerArguments(websocket=websocket)
+    runner_args.cli_args = args

    await bot_module.bot(runner_args)


-def _create_server_app(
-    *,
-    transport_type: str,
-    host: str = "localhost",
-    proxy: str,
-    esp32_mode: bool = False,
-    whatsapp_enabled: bool = False,
-    folder: Optional[str] = None,
-    dialin_enabled: bool = False,
-):
+def _create_server_app(args: argparse.Namespace):
    """Create FastAPI app with transport-specific routes."""
    app = FastAPI()

@@ -185,23 +177,21 @@ def _create_server_app(
    )

    # Set up transport-specific routes
-    if transport_type == "webrtc":
-        _setup_webrtc_routes(app, esp32_mode=esp32_mode, host=host, folder=folder)
-        if whatsapp_enabled:
-            _setup_whatsapp_routes(app)
-    elif transport_type == "daily":
-        _setup_daily_routes(app, dialin_enabled=dialin_enabled)
-    elif transport_type in TELEPHONY_TRANSPORTS:
-        _setup_telephony_routes(app, transport_type=transport_type, proxy=proxy)
+    if args.transport == "webrtc":
+        _setup_webrtc_routes(app, args)
+        if args.whatsapp:
+            _setup_whatsapp_routes(app, args)
+    elif args.transport == "daily":
+        _setup_daily_routes(app, args)
+    elif args.transport in TELEPHONY_TRANSPORTS:
+        _setup_telephony_routes(app, args)
    else:
-        logger.warning(f"Unknown transport type: {transport_type}")
+        logger.warning(f"Unknown transport type: {args.transport}")

    return app


-def _setup_webrtc_routes(
-    app: FastAPI, *, esp32_mode: bool = False, host: str = "localhost", folder: Optional[str] = None
-):
+def _setup_webrtc_routes(app: FastAPI, args: argparse.Namespace):
    """Set up WebRTC-specific routes."""
    try:
        from pipecat_ai_small_webrtc_prebuilt.frontend import SmallWebRTCPrebuiltUI
@@ -241,11 +231,11 @@ def _setup_webrtc_routes(
    @app.get("/files/{filename:path}")
    async def download_file(filename: str):
        """Handle file downloads."""
-        if not folder:
+        if not args.folder:
            logger.warning(f"Attempting to dowload {filename}, but downloads folder not setup.")
            return

-        file_path = Path(folder) / filename
+        file_path = Path(args.folder) / filename
        if not os.path.exists(file_path):
            raise HTTPException(404)

@@ -255,7 +245,7 @@ def _setup_webrtc_routes(

    # Initialize the SmallWebRTC request handler
    small_webrtc_handler: SmallWebRTCRequestHandler = SmallWebRTCRequestHandler(
-        esp32_mode=esp32_mode, host=host
+        esp32_mode=args.esp32, host=args.host
    )

    @app.post("/api/offer")
@@ -269,6 +259,7 @@ def _setup_webrtc_routes(
            runner_args = SmallWebRTCRunnerArguments(
                webrtc_connection=connection, body=request.request_data
            )
+            runner_args.cli_args = args
            background_tasks.add_task(bot_module.bot, runner_args)

        # Delegate handling to SmallWebRTCRequestHandler
@@ -298,7 +289,7 @@ def _setup_webrtc_routes(

        # Store session info immediately in memory, replicate the behavior expected on Pipecat Cloud
        session_id = str(uuid.uuid4())
-        active_sessions[session_id] = request_data
+        active_sessions[session_id] = request_data.get("body", {})

        result: StartBotResult = {"sessionId": session_id}
        if request_data.get("enableDefaultIceServers"):
@@ -331,7 +322,8 @@ def _setup_webrtc_routes(
                        pc_id=request_data.get("pc_id"),
                        restart_pc=request_data.get("restart_pc"),
                        request_data=request_data.get("request_data")
-                        or request_data.get("requestData"),
+                        or request_data.get("requestData")
+                        or active_session,
                    )
                    return await offer(webrtc_request, background_tasks)
                elif request.method == HTTPMethod.PATCH.value:
@@ -380,8 +372,8 @@ def _add_lifespan_to_app(app: FastAPI, new_lifespan):
        app.router.lifespan_context = new_lifespan


-def _setup_whatsapp_routes(app: FastAPI):
-    """Set up WebRTC-specific routes."""
+def _setup_whatsapp_routes(app: FastAPI, args: argparse.Namespace):
+    """Set up WhatsApp-specific routes."""
    WHATSAPP_APP_SECRET = os.getenv("WHATSAPP_APP_SECRET")
    WHATSAPP_PHONE_NUMBER_ID = os.getenv("WHATSAPP_PHONE_NUMBER_ID")
    WHATSAPP_TOKEN = os.getenv("WHATSAPP_TOKEN")
@@ -483,6 +475,7 @@ def _setup_whatsapp_routes(app: FastAPI):
            """
            bot_module = _get_bot_module()
            runner_args = SmallWebRTCRunnerArguments(webrtc_connection=connection)
+            runner_args.cli_args = args
            background_tasks.add_task(bot_module.bot, runner_args)

        try:
@@ -528,13 +521,8 @@ def _setup_whatsapp_routes(app: FastAPI):
    _add_lifespan_to_app(app, whatsapp_lifespan)


-def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):
-    """Set up Daily-specific routes.
-
-    Args:
-        app: FastAPI application instance
-        dialin_enabled: If True, adds /daily-dialin-webhook endpoint for PSTN dial-in handling
-    """
+def _setup_daily_routes(app: FastAPI, args: argparse.Namespace):
+    """Set up Daily-specific routes."""

    @app.get("/")
    async def create_room_and_start_agent():
@@ -551,6 +539,7 @@ def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):
            # Start the bot in the background with empty body for GET requests
            bot_module = _get_bot_module()
            runner_args = DailyRunnerArguments(room_url=room_url, token=token)
+            runner_args.cli_args = args
            asyncio.create_task(bot_module.bot(runner_args))
            return RedirectResponse(room_url)

@@ -583,13 +572,13 @@ def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):

        bot_module = _get_bot_module()

-        existing_room_url = os.getenv("DAILY_SAMPLE_ROOM_URL")
+        existing_room_url = os.getenv("DAILY_ROOM_URL")

        result = None

        # Configure room if:
        # 1. Explicitly requested via createDailyRoom in payload
-        # 2. Using pre-configured room from DAILY_SAMPLE_ROOM_URL env var
+        # 2. Using pre-configured room from DAILY_ROOM_URL env var
        if create_daily_room or existing_room_url:
            import aiohttp

@@ -634,12 +623,15 @@ def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):
        else:
            runner_args = RunnerArguments(body=body)

+        # Update CLI args.
+        runner_args.cli_args = args
+
        # Start the bot in the background
        asyncio.create_task(bot_module.bot(runner_args))

        return result

-    if dialin_enabled:
+    if args.dialin:

        @app.post("/daily-dialin-webhook")
        async def handle_dialin_webhook(request: Request):
@@ -736,6 +728,7 @@ def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):
                token=room_config.token,
                body=request_body.model_dump(),
            )
+            runner_args.cli_args = args

            asyncio.create_task(bot_module.bot(runner_args))

@@ -750,44 +743,44 @@ def _setup_daily_routes(app: FastAPI, dialin_enabled: bool = False):
            }


-def _setup_telephony_routes(app: FastAPI, *, transport_type: str, proxy: str):
+def _setup_telephony_routes(app: FastAPI, args: argparse.Namespace):
    """Set up telephony-specific routes."""
    # XML response templates (Exotel doesn't use XML webhooks)
    XML_TEMPLATES = {
        "twilio": f"""<?xml version="1.0" encoding="UTF-8"?>
 <Response>
  <Connect>
-    <Stream url="wss://{proxy}/ws"></Stream>
+    <Stream url="wss://{args.proxy}/ws"></Stream>
  </Connect>
  <Pause length="40"/>
 </Response>""",
        "telnyx": f"""<?xml version="1.0" encoding="UTF-8"?>
 <Response>
  <Connect>
-    <Stream url="wss://{proxy}/ws" bidirectionalMode="rtp"></Stream>
+    <Stream url="wss://{args.proxy}/ws" bidirectionalMode="rtp"></Stream>
  </Connect>
  <Pause length="40"/>
 </Response>""",
        "plivo": f"""<?xml version="1.0" encoding="UTF-8"?>
 <Response>
-  <Stream bidirectional="true" keepCallAlive="true" contentType="audio/x-mulaw;rate=8000">wss://{proxy}/ws</Stream>
+  <Stream bidirectional="true" keepCallAlive="true" contentType="audio/x-mulaw;rate=8000">wss://{args.proxy}/ws</Stream>
 </Response>""",
    }

    @app.post("/")
    async def start_call():
        """Handle telephony webhook and return XML response."""
-        if transport_type == "exotel":
+        if args.transport == "exotel":
            # Exotel doesn't use POST webhooks - redirect to proper documentation
            logger.debug("POST Exotel endpoint - not used")
            return {
                "error": "Exotel doesn't use POST webhooks",
-                "websocket_url": f"wss://{proxy}/ws",
+                "websocket_url": f"wss://{args.proxy}/ws",
                "note": "Configure the WebSocket URL above in your Exotel App Bazaar Voicebot Applet",
            }
        else:
-            logger.debug(f"POST {transport_type.upper()} XML")
-            xml_content = XML_TEMPLATES.get(transport_type, "<Response></Response>")
+            logger.debug(f"POST {args.transport.upper()} XML")
+            xml_content = XML_TEMPLATES.get(args.transport, "<Response></Response>")
            return HTMLResponse(content=xml_content, media_type="application/xml")

    @app.websocket("/ws")
@@ -795,15 +788,15 @@ def _setup_telephony_routes(app: FastAPI, *, transport_type: str, proxy: str):
        """Handle WebSocket connections for telephony."""
        await websocket.accept()
        logger.debug("WebSocket connection accepted")
-        await _run_telephony_bot(websocket)
+        await _run_telephony_bot(websocket, args)

    @app.get("/")
    async def start_agent():
        """Simple status endpoint for telephony transports."""
-        return {"status": f"Bot started with {transport_type}"}
+        return {"status": f"Bot started with {args.transport}"}


-async def _run_daily_direct():
+async def _run_daily_direct(args: argparse.Namespace):
    """Run Daily bot with direct connection (no FastAPI server)."""
    try:
        from pipecat.runner.daily import configure
@@ -819,6 +812,7 @@ async def _run_daily_direct():
        # Direct connections have no request body, so use empty dict
        runner_args = DailyRunnerArguments(room_url=room_url, token=token)
        runner_args.handle_sigint = True
+        runner_args.cli_args = args

        # Get the bot module and run it directly
        bot_module = _get_bot_module()
@@ -866,29 +860,38 @@ def runner_port() -> int:
    return RUNNER_PORT


-def main():
+def main(parser: Optional[argparse.ArgumentParser] = None):
    """Start the Pipecat development runner.

    Parses command-line arguments and starts a FastAPI server configured
-    for the specified transport type. The runner will discover and run
-    any bot() function found in the current directory.
+    for the specified transport type.
+
+    The runner discovers and runs any ``bot(runner_args)`` function found in the
+    calling module.

    Command-line arguments:
+       - --host: Server host address (default: localhost) 879
+       - --port: Server port (default: 7860)
+       - -t/--transport: Transport type (daily, webrtc, twilio, telnyx, plivo, exotel)
+       - -x/--proxy: Public proxy hostname for telephony webhooks
+       - -d/--direct: Connect directly to Daily room (automatically sets transport to daily)
+       - -f/--folder: Path to downloads folder
+       - --dialin: Enable Daily PSTN dial-in webhook handling (requires Daily transport)
+       - --esp32: Enable SDP munging for ESP32 compatibility (requires --host with IP address)
+       - --whatsapp: Ensure requried WhatsApp environment variables are present
+       - -v/--verbose: Increase logging verbosity

    Args:
-        --host: Server host address (default: localhost)
-        --port: Server port (default: 7860)
-        -t/--transport: Transport type (daily, webrtc, twilio, telnyx, plivo, exotel)
-        -x/--proxy: Public proxy hostname for telephony webhooks
-        --esp32: Enable SDP munging for ESP32 compatibility (requires --host with IP address)
-        -d/--direct: Connect directly to Daily room (automatically sets transport to daily)
-        -v/--verbose: Increase logging verbosity
+        parser: Optional custom argument parser. If provided, default runner
+            arguments are added to it so bots can define their own CLI
+            arguments. Custom arguments should not conflict with the default
+            ones. Custom args are accessible via `runner_args.cli_args`.

-    The bot file must contain a `bot(runner_args)` function as the entry point.
    """
    global RUNNER_DOWNLOADS_FOLDER, RUNNER_HOST, RUNNER_PORT

-    parser = argparse.ArgumentParser(description="Pipecat Development Runner")
+    if not parser:
+        parser = argparse.ArgumentParser(description="Pipecat Development Runner")
    parser.add_argument("--host", type=str, default=RUNNER_HOST, help="Host address")
    parser.add_argument("--port", type=int, default=RUNNER_PORT, help="Port number")
    parser.add_argument(
@@ -899,13 +902,7 @@ def main():
        default="webrtc",
        help="Transport type",
    )
-    parser.add_argument("--proxy", "-x", help="Public proxy host name")
-    parser.add_argument(
-        "--esp32",
-        action="store_true",
-        default=False,
-        help="Enable SDP munging for ESP32 compatibility (requires --host with IP address)",
-    )
+    parser.add_argument("-x", "--proxy", help="Public proxy host name")
    parser.add_argument(
        "-d",
        "--direct",
@@ -915,13 +912,7 @@ def main():
    )
    parser.add_argument("-f", "--folder", type=str, help="Path to downloads folder")
    parser.add_argument(
-        "--verbose", "-v", action="count", default=0, help="Increase logging verbosity"
-    )
-    parser.add_argument(
-        "--whatsapp",
-        action="store_true",
-        default=False,
-        help="Ensure requried WhatsApp environment variables are present",
+        "-v", "--verbose", action="count", default=0, help="Increase logging verbosity"
    )
    parser.add_argument(
        "--dialin",
@@ -929,6 +920,18 @@ def main():
        default=False,
        help="Enable Daily PSTN dial-in webhook handling (requires Daily transport)",
    )
+    parser.add_argument(
+        "--esp32",
+        action="store_true",
+        default=False,
+        help="Enable SDP munging for ESP32 compatibility (requires --host with IP address)",
+    )
+    parser.add_argument(
+        "--whatsapp",
+        action="store_true",
+        default=False,
+        help="Ensure requried WhatsApp environment variables are present",
+    )

    args = parser.parse_args()

@@ -964,7 +967,7 @@ def main():
        print()

        # Run direct Daily connection
-        asyncio.run(_run_daily_direct())
+        asyncio.run(_run_daily_direct(args))
        return

    # Print startup message for server-based transports
@@ -995,15 +998,7 @@ def main():
    RUNNER_PORT = args.port

    # Create the app with transport-specific setup
-    app = _create_server_app(
-        transport_type=args.transport,
-        host=args.host,
-        proxy=args.proxy,
-        esp32_mode=args.esp32,
-        whatsapp_enabled=args.whatsapp,
-        folder=args.folder,
-        dialin_enabled=args.dialin,
-    )
+    app = _create_server_app(args)

    # Run the server
    uvicorn.run(app, host=args.host, port=args.port)
--- a/src/pipecat/runner/types.py
+++ b/src/pipecat/runner/types.py
@@ -10,6 +10,7 @@ These types are used by the development runner to pass transport-specific
 information to bot functions.
 """

+import argparse
 from dataclasses import dataclass, field
 from typing import Any, Dict, Optional

@@ -64,6 +65,7 @@ class RunnerArguments:
    handle_sigterm: bool = field(init=False, kw_only=True)
    pipeline_idle_timeout_secs: int = field(init=False, kw_only=True)
    body: Optional[Any] = field(default_factory=dict, kw_only=True)
+    cli_args: Optional[argparse.Namespace] = field(default=None, init=False, kw_only=True)

    def __post_init__(self):
        self.handle_sigint = False
@@ -106,3 +108,18 @@ class SmallWebRTCRunnerArguments(RunnerArguments):
    """

    webrtc_connection: Any
+
+
+@dataclass
+class LiveKitRunnerArguments(RunnerArguments):
+    """LiveKit transport session arguments for the runner.
+
+    Parameters:
+        room_name: LiveKit room name to join
+        token: Authentication token for the room
+        body: Additional request data
+    """
+
+    room_name: str
+    url: str
+    token: Optional[str] = None
--- a/src/pipecat/runner/utils.py
+++ b/src/pipecat/runner/utils.py
@@ -39,6 +39,7 @@ from loguru import logger

 from pipecat.runner.types import (
    DailyRunnerArguments,
+    LiveKitRunnerArguments,
    SmallWebRTCRunnerArguments,
    WebSocketRunnerArguments,
 )
@@ -568,6 +569,17 @@ async def create_transport(
        return await _create_telephony_transport(
            runner_args.websocket, params, transport_type, call_data
        )
+    elif isinstance(runner_args, LiveKitRunnerArguments):
+        params = _get_transport_params("livekit", transport_params)
+
+        from pipecat.transports.livekit.transport import LiveKitTransport
+
+        return LiveKitTransport(
+            runner_args.url,
+            runner_args.token,
+            runner_args.room_name,
+            params=params,
+        )

    else:
        raise ValueError(f"Unsupported runner arguments type: {type(runner_args)}")
--- a/src/pipecat/serializers/base_serializer.py
+++ b/src/pipecat/serializers/base_serializer.py
@@ -9,9 +9,10 @@
 from abc import ABC, abstractmethod

 from pipecat.frames.frames import Frame, StartFrame
+from pipecat.utils.base_object import BaseObject


-class FrameSerializer(ABC):
+class FrameSerializer(BaseObject):
    """Abstract base class for frame serialization implementations.

    Defines the interface for converting frames to/from serialized formats
--- a/src/pipecat/serializers/genesys.py
+++ b/src/pipecat/serializers/genesys.py
@@ -0,0 +1,963 @@
+"""Genesys AudioHook Serializer for Pipecat.
+
+This module provides a serializer for integrating Pipecat pipelines with
+Genesys Cloud Contact Center via the AudioHook protocol.
+
+Features:
+- Bidirectional audio streaming (PCMU μ-law at 8kHz)
+- Automatic protocol handshake handling (open/opened, close/closed, ping/pong)
+- Input/output variables for Architect flow integration
+- DTMF event support
+- Barge-in (interruption) events
+- Pause/resume support for hold scenarios (optional)
+
+Protocol Reference:
+- https://developer.genesys.cloud/devapps/audiohook
+
+Audio Format:
+- PCMU (μ-law) at 8kHz sample rate (preferred)
+- L16 (16-bit linear PCM) at 8kHz also supported
+- Mono (external channel) or Stereo (external on left, internal on right)
+"""
+
+import json
+import uuid
+from datetime import timedelta
+from enum import Enum
+from typing import Any, Dict, List, Optional
+
+from loguru import logger
+from pydantic import BaseModel
+
+from pipecat.audio.dtmf.types import KeypadEntry
+from pipecat.audio.resamplers.soxr_stream_resampler import SOXRStreamAudioResampler
+from pipecat.audio.utils import pcm_to_ulaw, ulaw_to_pcm
+from pipecat.frames.frames import (
+    AudioRawFrame,
+    CancelFrame,
+    EndFrame,
+    Frame,
+    InputAudioRawFrame,
+    InputDTMFFrame,
+    InterruptionFrame,
+    OutputTransportMessageFrame,
+    OutputTransportMessageUrgentFrame,
+    StartFrame,
+)
+from pipecat.serializers.base_serializer import FrameSerializer
+
+
+class AudioHookMessageType(str, Enum):
+    """AudioHook protocol message types."""
+
+    OPEN = "open"
+    OPENED = "opened"
+    CLOSE = "close"
+    CLOSED = "closed"
+    PAUSE = "pause"
+    RESUMED = "resumed"
+    PING = "ping"
+    PONG = "pong"
+    UPDATE = "update"
+    EVENT = "event"
+    ERROR = "error"
+    DISCONNECT = "disconnect"
+
+
+class AudioHookChannel(str, Enum):
+    """AudioHook audio channel configuration."""
+
+    EXTERNAL = "external"  # Customer audio only (mono)
+    INTERNAL = "internal"  # Agent audio only (mono)
+    BOTH = "both"  # Stereo: external=left, internal=right
+
+
+class AudioHookMediaFormat(str, Enum):
+    """Supported audio formats."""
+
+    PCMU = "PCMU"  # μ-law, 8kHz
+    L16 = "L16"  # 16-bit linear PCM, 8kHz
+
+
+class GenesysAudioHookSerializer(FrameSerializer):
+    """Serializer for Genesys AudioHook WebSocket protocol.
+
+    This serializer handles converting between Pipecat frames and Genesys
+    AudioHook protocol messages. It supports:
+
+    - Bidirectional audio streaming (PCMU at 8kHz)
+    - Automatic protocol handshake (open/opened, close/closed, ping/pong)
+    - Session lifecycle management with pause/resume support
+    - Custom input/output variables for Architect flow integration
+    - DTMF event handling
+    - Barge-in events for interruption support
+
+    The AudioHook protocol uses:
+    - Text WebSocket frames for JSON control messages
+    - Binary WebSocket frames for audio data
+
+    Example usage:
+        ```python
+        serializer = GenesysAudioHookSerializer(
+            params=GenesysAudioHookSerializer.InputParams(
+                channel=AudioHookChannel.EXTERNAL,
+                supported_languages=["en-US", "es-ES"],
+                selected_language="en-US",
+            )
+        )
+
+        # Use with FastAPI WebSocket transport
+        transport = FastAPIWebsocketTransport(
+            websocket=websocket,
+            params=FastAPIWebsocketParams(
+                audio_in_enabled=True,
+                audio_out_enabled=True,
+                serializer=serializer,
+                audio_out_fixed_packet_size=1600,  # Important: prevents 429 rate limiting from Genesys
+            ),
+        )
+
+        # Access call information after connection
+        participant = serializer.participant  # ani, dnis, etc.
+        input_vars = serializer.input_variables  # Custom vars from Architect
+
+        # Set output variables to return to Architect
+        serializer.set_output_variables({"intent": "billing", "resolved": True})
+        ```
+
+    Attributes:
+        PROTOCOL_VERSION: The AudioHook protocol version (currently "2").
+    """
+
+    PROTOCOL_VERSION = "2"
+
+    class InputParams(BaseModel):
+        """Configuration parameters for GenesysAudioHookSerializer.
+
+        Attributes:
+            genesys_sample_rate: Sample rate used by Genesys (default: 8000 Hz).
+            sample_rate: Optional override for pipeline input sample rate.
+            channel: Which audio channels to process (external, internal, both).
+            media_format: Audio format (PCMU or L16).
+            process_external: Whether to process external (customer) audio.
+            process_internal: Whether to process internal (agent) audio.
+            supported_languages: List of language codes the bot supports (e.g., ["en-US", "es-ES"]).
+            selected_language: Default language code to use.
+            start_paused: Whether to start the session in paused state.
+        """
+
+        genesys_sample_rate: int = 8000
+        sample_rate: Optional[int] = None
+        channel: AudioHookChannel = AudioHookChannel.EXTERNAL
+        media_format: AudioHookMediaFormat = AudioHookMediaFormat.PCMU
+        process_external: bool = True
+        process_internal: bool = False
+        supported_languages: Optional[List[str]] = None
+        selected_language: Optional[str] = None
+        start_paused: bool = False
+
+    def __init__(
+        self,
+        params: Optional[InputParams] = None,
+        **kwargs,
+    ):
+        """Initialize the GenesysAudioHookSerializer.
+
+        Args:
+            params: Configuration parameters.
+            **kwargs: Additional arguments passed to BaseObject (e.g., name).
+        """
+        super().__init__(**kwargs)
+        self._params = params or GenesysAudioHookSerializer.InputParams()
+
+        self._genesys_sample_rate = self._params.genesys_sample_rate
+        self._sample_rate = 0  # Pipeline input rate, set in setup()
+        self._session_id = str(uuid.uuid4())
+
+        # Use Pipecat's official resampler if needed (SOXR)
+        # Only used for TTS output (16kHz → 8kHz), input goes without resampling
+        self._input_resampler = SOXRStreamAudioResampler()
+        self._output_resampler = SOXRStreamAudioResampler()
+
+        # Protocol state
+        self._client_seq = 0
+        self._server_seq = 0
+        self._is_open = False
+        self._is_paused = False
+        self._position = timedelta(0)
+
+        # Session metadata
+        self._conversation_id: Optional[str] = None
+        self._participant: Optional[Dict[str, Any]] = None
+        self._custom_config: Optional[Dict[str, Any]] = None
+        self._media_info: Optional[List[Dict[str, Any]]] = None
+        self._input_variables: Optional[Dict[str, Any]] = None  # Custom input from Genesys
+        self._output_variables: Optional[Dict[str, Any]] = None  # Custom output to Genesys
+
+        # Event handlers
+        self._register_event_handler("on_open")
+        self._register_event_handler("on_close")
+        self._register_event_handler("on_ping")
+        self._register_event_handler("on_pause")
+        self._register_event_handler("on_update")
+        self._register_event_handler("on_error")
+        self._register_event_handler("on_dtmf")
+
+    @property
+    def session_id(self) -> str:
+        """Get the Genesys AudioHook session ID generated by the serializer."""
+        return self._session_id
+
+    @property
+    def conversation_id(self) -> Optional[str]:
+        """Get the Genesys conversation ID."""
+        return self._conversation_id
+
+    @property
+    def is_open(self) -> bool:
+        """Check if the AudioHook session is open."""
+        return self._is_open
+
+    @property
+    def is_paused(self) -> bool:
+        """Check if audio streaming is paused."""
+        return self._is_paused
+
+    @property
+    def participant(self) -> Optional[Dict[str, Any]]:
+        """Get participant info (ani, dnis, etc.) from the open message."""
+        return self._participant
+
+    @property
+    def input_variables(self) -> Optional[Dict[str, Any]]:
+        """Get custom input variables from the open message."""
+        return self._input_variables
+
+    @property
+    def output_variables(self) -> Optional[Dict[str, Any]]:
+        """Get custom output variables to send back to Genesys."""
+        return self._output_variables
+
+    def set_output_variables(self, variables: Dict[str, Any]) -> None:
+        """Set custom output variables to send back to Genesys on close.
+
+        These variables will be included in the 'closed' response when Genesys
+        closes the connection, making them available in the Architect flow.
+
+        Args:
+            variables: Dictionary of custom variables to send to Genesys.
+
+        Example:
+            ```python
+            # During the conversation, collect data and set it
+            serializer.set_output_variables({
+                "intent": "billing_inquiry",
+                "customer_verified": True,
+                "summary": "Customer asked about their bill",
+                "transfer_to": "billing_queue"
+            })
+            ```
+        """
+        self._output_variables = variables
+        logger.debug(f"Output variables set: {variables}")
+
+    async def setup(self, frame: StartFrame):
+        """Sets up the serializer with pipeline configuration.
+
+        Args:
+            frame: The StartFrame containing pipeline configuration.
+        """
+        self._sample_rate = self._params.sample_rate or frame.audio_in_sample_rate
+        logger.debug(f"GenesysAudioHookSerializer setup with sample_rate={self._sample_rate}")
+
+    def _format_position(self, position: timedelta) -> str:
+        """Format a timedelta as ISO 8601 duration string.
+
+        Args:
+            position: The timedelta to format.
+
+        Returns:
+            ISO 8601 duration string (e.g., "PT1.5S").
+        """
+        total_seconds = position.total_seconds()
+        return f"PT{total_seconds:.3f}S"
+
+    def _parse_position(self, position_str: str) -> timedelta:
+        """Parse an ISO 8601 duration string to timedelta.
+
+        Args:
+            position_str: ISO 8601 duration string (e.g., "PT1.5S").
+
+        Returns:
+            Corresponding timedelta.
+        """
+        # Simple parser for PT#S or PT#.#S format
+        if position_str.startswith("PT") and position_str.endswith("S"):
+            try:
+                seconds = float(position_str[2:-1])
+                return timedelta(seconds=seconds)
+            except ValueError:
+                pass
+        return timedelta(0)
+
+    def _next_server_seq(self) -> int:
+        """Get the next server sequence number."""
+        self._server_seq += 1
+        return self._server_seq
+
+    def _create_message(
+        self,
+        msg_type: AudioHookMessageType,
+        parameters: Optional[Dict[str, Any]] = None,
+        include_position: bool = True,
+    ) -> Dict[str, Any]:
+        """Create a protocol message with common fields.
+
+        Based on the Genesys AudioHook protocol, responses include:
+        - seq: Server's sequence number (incremented per message)
+        - clientseq: Echo of the client's last sequence number
+
+        Args:
+            msg_type: The message type.
+            parameters: Optional parameters object.
+            include_position: Whether to include position field.
+
+        Returns:
+            The message dictionary.
+        """
+        seq = self._next_server_seq()
+        msg = {
+            "version": self.PROTOCOL_VERSION,
+            "type": msg_type.value,
+            "seq": seq,
+            "clientseq": self._client_seq,
+            "id": self._session_id,
+        }
+
+        if include_position:
+            msg["position"] = self._format_position(self._position)
+
+        if parameters:
+            msg["parameters"] = parameters
+
+        return msg
+
+    def create_opened_response(
+        self,
+        start_paused: bool = False,
+        supported_languages: Optional[List[str]] = None,
+        selected_language: Optional[str] = None,
+    ) -> Dict[str, Any]:
+        """Create an 'opened' response message for the client.
+
+        This should be sent in response to an 'open' message from Genesys.
+
+        Args:
+            start_paused: Whether to start the session paused.
+            supported_languages: List of supported language codes.
+            selected_language: The selected language code.
+
+        Returns:
+            Dictionary of the opened response message.
+        """
+        # Build channels list based on configuration
+        channels: list[str] = []
+
+        if self._params.channel == AudioHookChannel.EXTERNAL:
+            channels = ["external"]
+        elif self._params.channel == AudioHookChannel.INTERNAL:
+            channels = ["internal"]
+        elif self._params.channel == AudioHookChannel.BOTH:
+            channels = ["external", "internal"]
+
+        parameters = {
+            "startPaused": start_paused,
+            "media": [
+                {
+                    "type": "audio",
+                    "format": self._params.media_format.value,
+                    "channels": channels,
+                    "rate": self._genesys_sample_rate,
+                }
+            ],
+        }
+
+        if supported_languages:
+            parameters["supportedLanguages"] = supported_languages
+        if selected_language:
+            parameters["selectedLanguage"] = selected_language
+
+        msg = self._create_message(
+            AudioHookMessageType.OPENED,
+            parameters=parameters,
+            include_position=False,  # opened doesn't need position
+        )
+
+        self._is_open = True
+
+        logger.debug(f"AudioHook session opened: {self._session_id}")
+
+        return msg
+
+    def create_closed_response(
+        self,
+        output_variables: Optional[Dict[str, Any]] = None,
+    ) -> Dict[str, Any]:
+        """Create a 'closed' response message.
+
+        This should be sent in response to a 'close' message from Genesys.
+
+        Args:
+            output_variables: Optional custom variables to pass back to Genesys.
+                These will be available in the Architect flow after the AudioHook
+                action completes.
+
+        Returns:
+            Dictionary of the closed response message.
+
+        Example:
+            ```python
+            # Pass custom data back to Genesys
+            serializer.create_closed_response(
+                output_variables={
+                    "intent": "billing_inquiry",
+                    "customer_verified": True,
+                    "summary": "Customer asked about their bill"
+                }
+            )
+            ```
+        """
+        parameters: Optional[Dict[str, Any]] = None
+
+        if output_variables:
+            parameters = {"outputVariables": output_variables}
+
+        msg = self._create_message(
+            AudioHookMessageType.CLOSED,
+            parameters=parameters,
+        )
+
+        self._is_open = False
+        logger.debug(f"AudioHook session closed: {self._session_id}")
+
+        return msg
+
+    def create_pong_response(self) -> Dict[str, Any]:
+        """Create a 'pong' response message.
+
+        This should be sent in response to a 'ping' message from Genesys.
+
+        Returns:
+            Dictionary of the pong response message.
+        """
+        msg = self._create_message(AudioHookMessageType.PONG)
+        return msg
+
+    def create_resumed_response(self) -> Dict[str, Any]:
+        """Create a 'resumed' response message.
+
+        This should be sent in response to a 'pause' message when ready to resume.
+
+        Returns:
+            Dictionary of the resumed response message.
+        """
+        msg = self._create_message(AudioHookMessageType.RESUMED)
+
+        self._is_paused = False
+        logger.debug(f"AudioHook session resumed: {self._session_id}")
+
+        return msg
+
+    def create_barge_in_event(self) -> Dict[str, Any]:
+        """Create a barge-in event message.
+
+        This notifies Genesys Cloud that the user has interrupted the bot's
+        audio output. Genesys will stop any queued audio playback.
+
+        Returns:
+            Dictionary of the barge-in event message.
+        """
+        msg = self._create_message(
+            AudioHookMessageType.EVENT,
+            parameters={"entities": [{"type": "barge_in", "data": {}}]},
+        )
+
+        logger.debug("🔇 Barge-in event sent to Genesys")
+
+        return msg
+
+    def create_disconnect_message(
+        self,
+        reason: str = "completed",
+        action: str = "transfer",
+        output_variables: Optional[Dict[str, Any]] = None,
+        info: Optional[str] = None,
+    ) -> Dict[str, Any]:
+        """Create a 'disconnect' message to initiate session termination.
+
+        Args:
+            reason: Disconnect reason (e.g., "completed", "error").
+            action: Action to take ("transfer" to agent, "finished" if completed).
+            output_variables: Custom output variables to pass back to Genesys.
+            info: Optional additional information.
+
+        Returns:
+            Dictionary of the disconnect message.
+        """
+        parameters: Dict[str, Any] = {"reason": reason}
+
+        # Build outputVariables
+        out_vars = {"action": action}
+        if output_variables:
+            out_vars.update(output_variables)
+        parameters["outputVariables"] = out_vars
+
+        if info:
+            parameters["info"] = info
+
+        msg = self._create_message(
+            AudioHookMessageType.DISCONNECT,
+            parameters=parameters,
+        )
+
+        logger.debug(f"AudioHook disconnect: reason={reason}, action={action}")
+        return msg
+
+    def create_error_message(
+        self,
+        code: int,
+        message: str,
+        retryable: bool = False,
+    ) -> Dict[str, Any]:
+        """Create an 'error' message.
+
+        Args:
+            code: Error code.
+            message: Error message.
+            retryable: Whether the operation can be retried.
+
+        Returns:
+            Dictionary of the error message.
+        """
+        parameters = {
+            "code": code,
+            "message": message,
+            "retryable": retryable,
+        }
+
+        msg = self._create_message(
+            AudioHookMessageType.ERROR,
+            parameters=parameters,
+        )
+
+        logger.error(f"AudioHook error: {code} - {message}")
+        return msg
+
+    async def serialize(self, frame: Frame) -> str | bytes | None:
+        """Serializes a Pipecat frame to Genesys AudioHook format.
+
+        Handles conversion of various frame types to AudioHook messages:
+        - AudioRawFrame -> Binary PCMU audio data (resampled to 8kHz)
+        - EndFrame/CancelFrame -> Disconnect message (JSON)
+        - InterruptionFrame -> Barge-in event (JSON)
+        - OutputTransportMessageFrame -> Pass-through JSON
+
+        Args:
+            frame: The Pipecat frame to serialize.
+
+        Returns:
+            Serialized data as string (JSON) or bytes (audio), or None if
+            the frame type is not handled or session is not open.
+        """
+        if isinstance(frame, (EndFrame, CancelFrame)):
+            return json.dumps(
+                self.create_disconnect_message(
+                    output_variables=self.output_variables, reason="completed"
+                )
+            )
+
+        elif isinstance(frame, AudioRawFrame):
+            if not self._is_open or self._is_paused:
+                return None
+
+            data = frame.audio
+
+            # Convert PCM to μ-law at 8kHz for Genesys
+            if self._params.media_format == AudioHookMediaFormat.PCMU:
+                serialized_data = await pcm_to_ulaw(
+                    data,
+                    frame.sample_rate,
+                    self._genesys_sample_rate,
+                    self._output_resampler,
+                )
+            else:
+                # L16 format - just resample if needed
+                logger.warning("L16 format not yet fully implemented")
+                return None
+
+            if serialized_data is None or len(serialized_data) == 0:
+                return None
+
+            return bytes(serialized_data)
+
+        elif isinstance(frame, InterruptionFrame):
+            return json.dumps(self.create_barge_in_event())
+
+        elif isinstance(frame, (OutputTransportMessageFrame, OutputTransportMessageUrgentFrame)):
+            # Only pass through AudioHook protocol messages (those with "version" field)
+            # Filter out RTVI and other non-AudioHook messages
+            if isinstance(frame.message, dict) and "version" in frame.message:
+                return json.dumps(frame.message)
+            else:
+                # Not an AudioHook message, ignore
+                return None
+
+        # Ignore other frames - we don't need to process them here
+        return None
+
+    async def deserialize(self, data: str | bytes) -> Frame | None:
+        """Deserializes Genesys AudioHook data to Pipecat frames.
+
+        Handles:
+        - Binary data -> InputAudioRawFrame (converted from PCMU to PCM)
+        - JSON 'open' -> OutputTransportMessageUrgentFrame with 'opened' response
+        - JSON 'close' -> OutputTransportMessageUrgentFrame with 'closed' response
+        - JSON 'ping' -> OutputTransportMessageUrgentFrame with 'pong' response
+        - JSON 'pause' -> Sets is_paused=True, returns None
+        - JSON 'dtmf' -> InputDTMFFrame
+        - JSON 'update' -> Updates participant info, returns None
+        - JSON 'error' -> Logs error, returns None
+
+        Protocol responses (opened, closed, pong) are returned as urgent frames
+        to be sent immediately through the transport.
+
+        Args:
+            data: The raw WebSocket data from Genesys (binary audio or JSON text).
+
+        Returns:
+            A Pipecat frame to process, or None if handled internally.
+        """
+        # Binary data = audio
+        if isinstance(data, bytes):
+            logger.debug(f"[AUDIO IN] Received {len(data)} bytes from Genesys")
+            return await self._deserialize_audio(data)
+
+        # Text data = JSON control message
+        try:
+            message = json.loads(data)
+        except json.JSONDecodeError as e:
+            logger.error(f"Failed to parse AudioHook message: {e}")
+            return None
+
+        return await self._handle_control_message(message)
+
+    async def _deserialize_audio(self, data: bytes) -> Frame | None:
+        """Deserialize binary audio data to an InputAudioRawFrame.
+
+        Args:
+            data: Raw audio bytes (PCMU or L16).
+
+        Returns:
+            InputAudioRawFrame with PCM audio at pipeline sample rate.
+        """
+        if not self._is_open or self._is_paused:
+            return None
+
+        audio_data = data
+        original_len = len(data)
+
+        # If Genesys sends stereo audio (BOTH channels), extract only the external channel (left)
+        # Stereo audio comes interleaved: [L0, R0, L1, R1, ...]
+        if self._params.channel == AudioHookChannel.BOTH and len(data) > 0:
+            # For PCMU, each sample is 1 byte
+            # Extract only bytes at even positions (left channel = external)
+            audio_data = bytes(data[i] for i in range(0, len(data), 2))
+            logger.debug(
+                f"🔊 Stereo audio: {original_len} bytes → {len(audio_data)} bytes (external channel)"
+            )
+
+        if self._params.media_format == AudioHookMediaFormat.PCMU:
+            # Convert μ-law at 8kHz to PCM at pipeline rate
+            deserialized_data = await ulaw_to_pcm(
+                audio_data,
+                self._genesys_sample_rate,
+                self._sample_rate,
+                self._input_resampler,
+            )
+        else:
+            # L16 format
+            logger.warning("L16 format not yet fully implemented")
+            return None
+
+        if deserialized_data is None or len(deserialized_data) == 0:
+            return None
+
+        # Always use mono for STT - ElevenLabs expects single channel
+        num_channels = 1
+
+        audio_frame = InputAudioRawFrame(
+            audio=deserialized_data,
+            num_channels=num_channels,
+            sample_rate=self._sample_rate,
+        )
+
+        return audio_frame
+
+    async def _handle_control_message(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle a JSON control message from Genesys.
+
+        Args:
+            message: Parsed JSON message.
+
+        Returns:
+            Frame if the message should be passed to the pipeline, None otherwise.
+        """
+        msg_type = message.get("type", "")
+        self._client_seq = message.get("seq", 0)
+
+        # Update position if provided
+        if "position" in message:
+            self._position = self._parse_position(message["position"])
+
+        if msg_type == AudioHookMessageType.OPEN.value:
+            return await self._handle_open(message)
+
+        elif msg_type == AudioHookMessageType.CLOSE.value:
+            return await self._handle_close(message)
+
+        elif msg_type == AudioHookMessageType.PING.value:
+            return await self._handle_ping(message)
+
+        elif msg_type == AudioHookMessageType.PAUSE.value:
+            return await self._handle_pause(message)
+
+        elif msg_type == AudioHookMessageType.UPDATE.value:
+            return await self._handle_update(message)
+
+        elif msg_type == AudioHookMessageType.ERROR.value:
+            return await self._handle_error(message)
+
+        elif msg_type == "dtmf":
+            return await self._handle_dtmf(message)
+
+        elif msg_type == "playback_started":
+            logger.debug("Playback started (from Genesys)")
+            return None
+
+        elif msg_type == "playback_completed":
+            logger.debug("Playback completed (from Genesys)")
+            return None
+        else:
+            logger.warning(f"Unknown AudioHook message type: {msg_type}")
+            return None
+
+    async def _handle_open(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle an 'open' message from Genesys.
+
+        This initializes the session with metadata from Genesys Cloud and
+        automatically responds with an 'opened' message using the configured
+        InputParams (supported_languages, selected_language, start_paused).
+
+        Extracts and stores:
+        - session_id: The AudioHook session identifier
+        - conversation_id: The Genesys conversation ID
+        - participant: Caller info (ani, dnis, etc.)
+        - input_variables: Custom variables from Architect flow
+        - media_info: Audio configuration from Genesys
+
+        Args:
+            message: The open message from Genesys.
+
+        Returns:
+            OutputTransportMessageUrgentFrame with the 'opened' response.
+        """
+        self._session_id = message.get("id", str(uuid.uuid4()))
+
+        params = message.get("parameters", {})
+        self._conversation_id = params.get("conversationId")
+        self._participant = params.get("participant")
+        self._custom_config = params.get("customConfig")
+        self._media_info = params.get("media")  # This is a list of media objects
+        self._input_variables = params.get("inputVariables")  # Custom vars from Genesys
+
+        # Extract media configuration if present
+        # media is a list like: [{"type": "audio", "format": "PCMU", "channels": ["external"], "rate": 8000}]
+        media_list = self._media_info
+        if media_list and isinstance(media_list, list) and len(media_list) > 0:
+            audio_media: Dict[str, Any] = media_list[0]  # Get first media entry
+            channels = audio_media.get("channels", [])
+            logger.debug(
+                f"📡 Genesys audio config: format={audio_media.get('format')}, channels={channels}, rate={audio_media.get('rate')}"
+            )
+            # channels is a list like ["external"] or ["external", "internal"]
+            if isinstance(channels, list):
+                if "external" in channels and "internal" in channels:
+                    self._params.channel = AudioHookChannel.BOTH
+                    logger.debug("📡 Stereo mode: extracting external channel")
+                elif "external" in channels:
+                    self._params.channel = AudioHookChannel.EXTERNAL
+                    logger.debug("📡 Mono mode: external channel")
+                elif "internal" in channels:
+                    self._params.channel = AudioHookChannel.INTERNAL
+                    logger.debug("📡 Mono mode: internal channel")
+
+        # Log participant info for debugging
+        ani = self._participant.get("ani", "unknown") if self._participant else "unknown"
+        logger.info(
+            f"AudioHook open request: session={self._session_id}, "
+            f"conversation={self._conversation_id}, ani={ani}"
+        )
+
+        await self._call_event_handler("on_open", message)
+
+        return OutputTransportMessageUrgentFrame(
+            message=self.create_opened_response(
+                start_paused=self._params.start_paused,
+                supported_languages=self._params.supported_languages,
+                selected_language=self._params.selected_language,
+            )
+        )
+
+    async def _handle_close(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle a 'close' message from Genesys.
+
+        Automatically responds with a 'closed' message. If output_variables
+        were set via set_output_variables(), they will be included in the
+        response and made available in the Architect flow.
+
+        Args:
+            message: The close message from Genesys.
+
+        Returns:
+            OutputTransportMessageUrgentFrame with the closed response
+            (includes outputVariables if set).
+        """
+        params = message.get("parameters", {})
+        reason = params.get("reason", "unknown")
+
+        logger.info(f"🔴 Genesys closed the connection: {reason}")
+
+        self._is_open = False
+
+        logger.info(f"Sending closed response to Genesys...")
+
+        await self._call_event_handler("on_close", message)
+
+        # Return as urgent frame to be sent through pipeline immediately
+        # Include any output variables that were set during the session
+        return OutputTransportMessageUrgentFrame(
+            message=self.create_closed_response(output_variables=self._output_variables)
+        )
+
+    async def _handle_ping(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle a 'ping' message from Genesys.
+
+        Automatically responds with a 'pong' message to maintain the connection.
+
+        Args:
+            message: The ping message from Genesys.
+
+        Returns:
+            OutputTransportMessageUrgentFrame with pong response.
+        """
+        logger.info(f"Sending pong response to Genesys...")
+
+        await self._call_event_handler("on_ping", message)
+
+        # Return as urgent frame to be sent through pipeline immediately
+        return OutputTransportMessageUrgentFrame(message=self.create_pong_response())
+
+    async def _handle_pause(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle a 'pause' message from Genesys.
+
+        This is used when audio streaming is temporarily suspended
+        (e.g., during hold).
+
+        Args:
+            message: The pause message.
+
+        Returns:
+            None (response should be sent via create_resumed_response()).
+        """
+        params = message.get("parameters", {})
+        reason = params.get("reason", "unknown")
+
+        logger.info(f"AudioHook pause request: reason={reason}")
+
+        self._is_paused = True
+
+        await self._call_event_handler("on_pause", message)
+
+        # Note: Application should call create_resumed_response() when ready
+        return None
+
+    async def _handle_update(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle an 'update' message from Genesys.
+
+        Updates may include changes to participants or configuration.
+
+        Args:
+            message: The update message.
+
+        Returns:
+            None.
+        """
+        params = message.get("parameters", {})
+
+        if "participant" in params:
+            self._participant = params["participant"]
+
+        logger.debug(f"AudioHook update received: {params}")
+
+        await self._call_event_handler("on_update", message)
+
+        return None
+
+    async def _handle_error(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle an 'error' message from Genesys.
+
+        Args:
+            message: The error message.
+
+        Returns:
+            None.
+        """
+        params = message.get("parameters", {})
+        code = params.get("code", 0)
+        error_msg = params.get("message", "Unknown error")
+
+        logger.error(f"AudioHook error from Genesys: {code} - {error_msg}")
+
+        await self._call_event_handler("on_error", message)
+
+        return None
+
+    async def _handle_dtmf(self, message: Dict[str, Any]) -> Frame | None:
+        """Handle a 'dtmf' message from Genesys.
+
+        DTMF (Dual-Tone Multi-Frequency) events are sent when the user
+        presses keys on their phone keypad.
+
+        Args:
+            message: The DTMF message.
+
+        Returns:
+            InputDTMFFrame with the pressed digit.
+        """
+        params = message.get("parameters", {})
+        digit = params.get("digit", "")
+
+        if not digit:
+            logger.warning("DTMF message received without digit")
+            return None
+
+        logger.info(f"DTMF received: {digit}")
+
+        await self._call_event_handler("on_dtmf", message)
+
+        try:
+            return InputDTMFFrame(KeypadEntry(digit))
+        except ValueError:
+            # Invalid digit
+            logger.warning(f"Invalid DTMF digit: {digit}")
+            return None
--- a/src/pipecat/services/assemblyai/stt.py
+++ b/src/pipecat/services/assemblyai/stt.py
@@ -161,7 +161,7 @@ class AssemblyAISTTService(WebsocketSTTService):
        """
        await super().process_frame(frame, direction)
        if isinstance(frame, VADUserStartedSpeakingFrame):
-            await self.start_ttfb_metrics()
+            pass
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            if (
                self._vad_force_turn_endpoint
@@ -354,7 +354,6 @@ class AssemblyAISTTService(WebsocketSTTService):
        """Handle transcription results."""
        if not message.transcript:
            return
-        await self.stop_ttfb_metrics()
        if message.end_of_turn and (
            not self._connection_params.formatted_finals or message.turn_is_formatted
        ):
--- a/src/pipecat/services/aws/stt.py
+++ b/src/pipecat/services/aws/stt.py
@@ -158,7 +158,6 @@ class AWSTranscribeSTTService(WebsocketSTTService):
                await self._websocket.send(event_message)
                # Start metrics after first chunk sent
                await self.start_processing_metrics()
-                await self.start_ttfb_metrics()
            except Exception as e:
                yield ErrorFrame(error=f"Error sending audio: {e}")

@@ -470,7 +469,6 @@ class AWSTranscribeSTTService(WebsocketSTTService):
                            is_final = not result.get("IsPartial", True)

                            if transcript:
-                                await self.stop_ttfb_metrics()
                                if is_final:
                                    await self.push_frame(
                                        TranscriptionFrame(
--- a/src/pipecat/services/azure/stt.py
+++ b/src/pipecat/services/azure/stt.py
@@ -116,7 +116,6 @@ class AzureSTTService(STTService):
        """
        try:
            await self.start_processing_metrics()
-            await self.start_ttfb_metrics()
            if self._audio_stream:
                self._audio_stream.write(audio)
            yield None
@@ -191,7 +190,6 @@ class AzureSTTService(STTService):
        self, transcript: str, is_final: bool, language: Optional[Language] = None
    ):
        """Handle a transcription result with tracing."""
-        await self.stop_ttfb_metrics()
        await self.stop_processing_metrics()

    def _on_handle_recognized(self, event):
--- a/src/pipecat/services/azure/tts.py
+++ b/src/pipecat/services/azure/tts.py
@@ -90,7 +90,7 @@ class AzureBaseTTSService:
            emphasis: Emphasis level for speech ("strong", "moderate", "reduced").
            language: Language for synthesis. Defaults to English (US).
            pitch: Voice pitch adjustment (e.g., "+10%", "-5Hz", "high").
-            rate: Speech rate multiplier. Defaults to "1.05".
+            rate: Speech rate adjustment (e.g., "1.0", "1.25", "slow", "fast").
            role: Voice role for expression (e.g., "YoungAdultFemale").
            style: Speaking style (e.g., "cheerful", "sad", "excited").
            style_degree: Intensity of the speaking style (0.01 to 2.0).
@@ -100,7 +100,7 @@ class AzureBaseTTSService:
        emphasis: Optional[str] = None
        language: Optional[Language] = Language.EN_US
        pitch: Optional[str] = None
-        rate: Optional[str] = "1.05"
+        rate: Optional[str] = None
        role: Optional[str] = None
        style: Optional[str] = None
        style_degree: Optional[str] = None
@@ -185,7 +185,9 @@ class AzureBaseTTSService:
        if self._settings["volume"]:
            prosody_attrs.append(f"volume='{self._settings['volume']}'")

-        ssml += f"<prosody {' '.join(prosody_attrs)}>"
+        # Only wrap in prosody tag if there are prosody attributes
+        if prosody_attrs:
+            ssml += f"<prosody {' '.join(prosody_attrs)}>"

        if self._settings["emphasis"]:
            ssml += f"<emphasis level='{self._settings['emphasis']}'>"
@@ -195,7 +197,8 @@ class AzureBaseTTSService:
        if self._settings["emphasis"]:
            ssml += "</emphasis>"

-        ssml += "</prosody>"
+        if prosody_attrs:
+            ssml += "</prosody>"

        if self._settings["style"]:
            ssml += "</mstts:express-as>"
@@ -277,6 +280,11 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
        self._started = False
        self._first_chunk = True
        self._cumulative_audio_offset: float = 0.0  # Cumulative audio duration in seconds
+        self._current_sentence_base_offset: float = 0.0  # Base offset for current sentence
+        self._current_sentence_duration: float = 0.0  # Duration from Azure callback
+        self._current_sentence_max_word_offset: float = (
+            0.0  # Max word boundary offset seen in current sentence (for 8kHz workaround)
+        )
        self._last_word: Optional[str] = None  # Track last word for punctuation merging
        self._last_timestamp: Optional[float] = None  # Track last timestamp

@@ -386,8 +394,14 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
        word = evt.text
        sentence_relative_seconds = evt.audio_offset / 10_000_000.0

-        # Add cumulative offset to get absolute timestamp across sentences
-        absolute_seconds = self._cumulative_audio_offset + sentence_relative_seconds
+        # Use base offset captured at start of run_tts to avoid race conditions
+        # with callbacks from overlapping TTS requests
+        absolute_seconds = self._current_sentence_base_offset + sentence_relative_seconds
+
+        # Track max word offset for accurate cumulative timing
+        # (audio_duration from Azure doesn't always match word boundary offsets at 8kHz)
+        if sentence_relative_seconds > self._current_sentence_max_word_offset:
+            self._current_sentence_max_word_offset = sentence_relative_seconds

        if not word:
            return
@@ -492,9 +506,9 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
            self._last_word = None
            self._last_timestamp = None

-        # Update cumulative audio offset for next sentence
+        # Store duration for cumulative offset calculation
        if evt.result and evt.result.audio_duration:
-            self._cumulative_audio_offset += evt.result.audio_duration.total_seconds()
+            self._current_sentence_duration = evt.result.audio_duration.total_seconds()

        self._audio_queue.put_nowait(None)  # Signal completion

@@ -530,6 +544,9 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
        self._started = False
        self._first_chunk = True
        self._cumulative_audio_offset = 0.0
+        self._current_sentence_base_offset = 0.0
+        self._current_sentence_duration = 0.0
+        self._current_sentence_max_word_offset = 0.0
        self._last_word = None
        self._last_timestamp = None

@@ -604,6 +621,12 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
                    self._started = True
                    self._first_chunk = True

+                # Capture base offset BEFORE starting synthesis to avoid race conditions
+                # Word boundary callbacks will use this value
+                self._current_sentence_base_offset = self._cumulative_audio_offset
+                self._current_sentence_duration = 0.0
+                self._current_sentence_max_word_offset = 0.0
+
                ssml = self._construct_ssml(text)
                self._speech_synthesizer.speak_ssml_async(ssml)
                await self.start_tts_usage_metrics(text)
@@ -627,6 +650,16 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
                    )
                    yield frame

+                # Update cumulative offset for next sentence
+                # At 8kHz, Azure's audio_duration doesn't match word boundary offsets,
+                # so we use max_word_offset as a workaround. At other sample rates,
+                # audio_duration is accurate.
+                # TODO: Remove after Azure fixes word boundary timing at 8kHz
+                if self.sample_rate == 8000:
+                    self._cumulative_audio_offset += self._current_sentence_max_word_offset
+                else:
+                    self._cumulative_audio_offset += self._current_sentence_duration
+
            except Exception as e:
                yield ErrorFrame(error=f"Unknown error occurred: {e}")
                yield TTSStoppedFrame()
--- a/src/pipecat/services/cartesia/stt.py
+++ b/src/pipecat/services/cartesia/stt.py
@@ -207,9 +207,8 @@ class CartesiaSTTService(WebsocketSTTService):
        await super().cancel(frame)
        await self._disconnect()

-    async def start_metrics(self):
+    async def _start_metrics(self):
        """Start performance metrics collection for transcription processing."""
-        await self.start_ttfb_metrics()
        await self.start_processing_metrics()

    async def process_frame(self, frame: Frame, direction: FrameDirection):
@@ -222,7 +221,7 @@ class CartesiaSTTService(WebsocketSTTService):
        await super().process_frame(frame, direction)

        if isinstance(frame, VADUserStartedSpeakingFrame):
-            await self.start_metrics()
+            await self._start_metrics()
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            # Send finalize command to flush the transcription session
            if self._websocket and self._websocket.state is State.OPEN:
@@ -342,7 +341,6 @@ class CartesiaSTTService(WebsocketSTTService):
                pass

        if len(transcript) > 0:
-            await self.stop_ttfb_metrics()
            if is_final:
                await self.push_frame(
                    TranscriptionFrame(
--- a/src/pipecat/services/deepgram/flux/stt.py
+++ b/src/pipecat/services/deepgram/flux/stt.py
@@ -659,6 +659,8 @@ class DeepgramFluxSTTService(WebsocketSTTService):
        average_confidence = self._calculate_average_confidence(data)

        if not self._params.min_confidence or average_confidence > self._params.min_confidence:
+            # EndOfTurn means Flux has determined the turn is complete,
+            # so this TranscriptionFrame is always finalized
            await self.push_frame(
                TranscriptionFrame(
                    transcript,
@@ -666,6 +668,7 @@ class DeepgramFluxSTTService(WebsocketSTTService):
                    time_now_iso8601(),
                    self._language,
                    result=data,
+                    finalized=True,
                )
            )
        else:
--- a/src/pipecat/services/deepgram/stt.py
+++ b/src/pipecat/services/deepgram/stt.py
@@ -276,9 +276,8 @@ class DeepgramSTTService(STTService):
            # GH issue: https://github.com/deepgram/deepgram-python-sdk/issues/570
            await self._connection.finish()

-    async def start_metrics(self):
-        """Start TTFB and processing metrics collection."""
-        await self.start_ttfb_metrics()
+    async def _start_metrics(self):
+        """Start processing metrics collection for this utterance."""
        await self.start_processing_metrics()

    async def _on_error(self, *args, **kwargs):
@@ -292,7 +291,7 @@ class DeepgramSTTService(STTService):
        await self._connect()

    async def _on_speech_started(self, *args, **kwargs):
-        await self.start_metrics()
+        await self._start_metrics()
        await self._call_event_handler("on_speech_started", *args, **kwargs)
        await self.broadcast_frame(UserStartedSpeakingFrame)
        if self._should_interrupt:
@@ -320,8 +319,12 @@ class DeepgramSTTService(STTService):
            language = result.channel.alternatives[0].languages[0]
            language = Language(language)
        if len(transcript) > 0:
-            await self.stop_ttfb_metrics()
            if is_final:
+                # Check if this response is from a finalize() call.
+                # Only mark as finalized when both we requested it AND Deepgram confirms it.
+                from_finalize = getattr(result, "from_finalize", False)
+                if from_finalize:
+                    self.confirm_finalize()
                await self.push_frame(
                    TranscriptionFrame(
                        transcript,
@@ -356,8 +359,10 @@ class DeepgramSTTService(STTService):

        if isinstance(frame, VADUserStartedSpeakingFrame) and not self.vad_enabled:
            # Start metrics if Deepgram VAD is disabled & pipeline VAD has detected speech
-            await self.start_metrics()
+            await self._start_metrics()
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            # https://developers.deepgram.com/docs/finalize
+            # Mark that we're awaiting a from_finalize response
+            self.request_finalize()
            await self._connection.finalize()
            logger.trace(f"Triggered finalize event on: {frame.name=}, {direction=}")
--- a/src/pipecat/services/deepgram/stt_sagemaker.py
+++ b/src/pipecat/services/deepgram/stt_sagemaker.py
@@ -363,9 +363,6 @@ class DeepgramSageMakerSTTService(STTService):
        if not transcript.strip():
            return

-        # Stop TTFB metrics on first transcript
-        await self.stop_ttfb_metrics()
-
        is_final = parsed.get("is_final", False)
        speech_final = parsed.get("speech_final", False)

@@ -417,9 +414,8 @@ class DeepgramSageMakerSTTService(STTService):
        """
        pass

-    async def start_metrics(self):
-        """Start TTFB and processing metrics collection."""
-        await self.start_ttfb_metrics()
+    async def _start_metrics(self):
+        """Start processing metrics collection."""
        await self.start_processing_metrics()

    async def process_frame(self, frame: Frame, direction: FrameDirection):
@@ -433,7 +429,7 @@ class DeepgramSageMakerSTTService(STTService):

        # Start metrics when user starts speaking (if VAD is not provided by Deepgram)
        if isinstance(frame, VADUserStartedSpeakingFrame):
-            await self.start_metrics()
+            await self._start_metrics()
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            # Send finalize message to Deepgram when user stops speaking
            # This tells Deepgram to flush any remaining audio and return final results
--- a/src/pipecat/services/elevenlabs/stt.py
+++ b/src/pipecat/services/elevenlabs/stt.py
@@ -310,7 +310,6 @@ class ElevenLabsSTTService(SegmentedSTTService):
        self, transcript: str, is_final: bool, language: Optional[str] = None
    ):
        """Handle a transcription result with tracing."""
-        await self.stop_ttfb_metrics()
        await self.stop_processing_metrics()

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
@@ -328,7 +327,6 @@ class ElevenLabsSTTService(SegmentedSTTService):
        """
        try:
            await self.start_processing_metrics()
-            await self.start_ttfb_metrics()

            # Upload audio and get transcription result directly
            result = await self._transcribe_audio(audio)
@@ -539,9 +537,8 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
        await super().cancel(frame)
        await self._disconnect()

-    async def start_metrics(self):
+    async def _start_metrics(self):
        """Start performance metrics collection for transcription processing."""
-        await self.start_ttfb_metrics()
        await self.start_processing_metrics()

    async def process_frame(self, frame: Frame, direction: FrameDirection):
@@ -555,7 +552,7 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):

        if isinstance(frame, VADUserStartedSpeakingFrame):
            # Start metrics when user starts speaking
-            await self.start_metrics()
+            await self._start_metrics()
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            # Send commit when user stops speaking (manual commit mode)
            if self._params.commit_strategy == CommitStrategy.MANUAL:
@@ -764,8 +761,6 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
        if not text:
            return

-        await self.stop_ttfb_metrics()
-
        # Get language if provided
        language = data.get("language_code")

@@ -803,7 +798,6 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
        if not text:
            return

-        await self.stop_ttfb_metrics()
        await self.stop_processing_metrics()

        # Get language if provided
@@ -845,7 +839,6 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
        if not text:
            return

-        await self.stop_ttfb_metrics()
        await self.stop_processing_metrics()

        # Get language if provided
--- a/src/pipecat/services/fal/stt.py
+++ b/src/pipecat/services/fal/stt.py
@@ -249,7 +249,6 @@ class FalSTTService(SegmentedSTTService):
        self, transcript: str, is_final: bool, language: Optional[str] = None
    ):
        """Handle a transcription result with tracing."""
-        await self.stop_ttfb_metrics()
        await self.stop_processing_metrics()

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
@@ -267,7 +266,6 @@ class FalSTTService(SegmentedSTTService):
        """
        try:
            await self.start_processing_metrics()
-            await self.start_ttfb_metrics()

            # Send to Fal directly (audio is already in WAV format from base class)
            data_uri = fal_client.encode(audio, "audio/x-wav")
--- a/src/pipecat/services/gladia/stt.py
+++ b/src/pipecat/services/gladia/stt.py
@@ -385,7 +385,6 @@ class GladiaSTTService(WebsocketSTTService):
        Yields:
            None (processing is handled asynchronously via WebSocket).
        """
-        await self.start_ttfb_metrics()
        await self.start_processing_metrics()

        # Add audio to buffer
@@ -513,7 +512,6 @@ class GladiaSTTService(WebsocketSTTService):
    async def _handle_transcription(
        self, transcript: str, is_final: bool, language: Optional[str] = None
    ):
-        await self.stop_ttfb_metrics()
        await self.stop_processing_metrics()

    async def _on_speech_started(self):
--- a/src/pipecat/services/google/gemini_live/llm.py
+++ b/src/pipecat/services/google/gemini_live/llm.py
@@ -1198,7 +1198,20 @@ class GeminiLiveLLMService(LLMService):
                        # Reset failure counter if connection has been stable
                        self._check_and_reset_failure_counter()

-                        if message.server_content and message.server_content.model_turn:
+                        if message.server_content and message.server_content.interrupted:
+                            # NOTE: while the service triggers interruptions in
+                            # the specific case of barge-ins, it does *not*
+                            # emit UserStarted/StoppedSpeakingFrames, as the
+                            # Gemini Live API does not give us broadly reliable
+                            # signals to base those off of. Pipelines that
+                            # require turn tracking (like those using context
+                            # aggregators) still need an independent way to
+                            # track turns, such as local Silero VAD in
+                            # combination with the context aggregator default
+                            # turn strategies.
+                            logger.debug("Gemini VAD: interrupted signal received")
+                            await self.push_interruption_task_frame_and_wait()
+                        elif message.server_content and message.server_content.model_turn:
                            await self._handle_msg_model_turn(message)
                        elif (
                            message.server_content
--- a/src/pipecat/services/google/llm.py
+++ b/src/pipecat/services/google/llm.py
@@ -40,7 +40,6 @@ from pipecat.frames.frames import (
    LLMThoughtStartFrame,
    LLMThoughtTextFrame,
    LLMUpdateSettingsFrame,
-    UserImageRawFrame,
 )
 from pipecat.metrics.metrics import LLMTokenUsage
 from pipecat.processors.aggregators.llm_context import LLMContext
@@ -199,22 +198,6 @@ class GoogleAssistantContextAggregator(OpenAIAssistantContextAggregator):
                    if part.function_response and part.function_response.id == tool_call_id:
                        part.function_response.response = {"value": json.dumps(result)}

-    async def handle_user_image_frame(self, frame: UserImageRawFrame):
-        """Handle user image frame.
-
-        Args:
-            frame: Frame containing user image data and request context.
-        """
-        await self._update_function_call_result(
-            frame.request.function_name, frame.request.tool_call_id, "COMPLETED"
-        )
-        self._context.add_image_frame_message(
-            format=frame.format,
-            size=frame.size,
-            image=frame.image,
-            text=frame.request.context,
-        )
-

@dataclass
 class GoogleContextAggregatorPair:
--- a/src/pipecat/services/google/stt.py
+++ b/src/pipecat/services/google/stt.py
@@ -823,7 +823,6 @@ class GoogleSTTService(STTService):
        """
        if self._streaming_task:
            # Queue the audio data
-            await self.start_ttfb_metrics()
            await self.start_processing_metrics()
            await self._request_queue.put(audio)
        yield None
@@ -875,7 +874,6 @@ class GoogleSTTService(STTService):
                        )
                    else:
                        self._last_transcript_was_final = False
-                        await self.stop_ttfb_metrics()
                        await self.push_frame(
                            InterimTranscriptionFrame(
                                transcript,
--- a/src/pipecat/services/gradium/stt.py
+++ b/src/pipecat/services/gradium/stt.py
@@ -12,9 +12,10 @@ WebSocket API for streaming audio transcription.

 import base64
 import json
-from typing import AsyncGenerator
+from typing import AsyncGenerator, Optional

 from loguru import logger
+from pydantic import BaseModel

 from pipecat.frames.frames import (
    CancelFrame,
@@ -22,9 +23,12 @@ from pipecat.frames.frames import (
    Frame,
    StartFrame,
    TranscriptionFrame,
+    VADUserStartedSpeakingFrame,
+    VADUserStoppedSpeakingFrame,
 )
+from pipecat.processors.frame_processor import FrameDirection
 from pipecat.services.stt_service import WebsocketSTTService
-from pipecat.transcriptions.language import Language
+from pipecat.transcriptions.language import Language, resolve_language
 from pipecat.utils.time import time_now_iso8601
 from pipecat.utils.tracing.service_decorators import traced_stt

@@ -39,6 +43,26 @@ except ModuleNotFoundError as e:
 SAMPLE_RATE = 24000


+def language_to_gradium_language(language: Language) -> Optional[str]:
+    """Convert a Language enum to Gradium's language code format.
+
+    Args:
+        language: The Language enum value to convert.
+
+    Returns:
+        The Gradium language code string or None if not supported.
+    """
+    LANGUAGE_MAP = {
+        Language.DE: "de",
+        Language.EN: "en",
+        Language.ES: "es",
+        Language.FR: "fr",
+        Language.PT: "pt",
+    }
+
+    return resolve_language(language, LANGUAGE_MAP, use_base_code=True)
+
+
 class GradiumSTTService(WebsocketSTTService):
    """Gradium real-time speech-to-text service.

@@ -47,12 +71,29 @@ class GradiumSTTService(WebsocketSTTService):
    for audio processing and connection management.
    """

+    class InputParams(BaseModel):
+        """Configuration parameters for Gradium STT API.
+
+        Parameters:
+            language: Expected language of the audio (e.g., "en", "es", "fr").
+                This helps ground the model to a specific language and improve
+                transcription quality.
+            delay_in_frames: Delay in audio frames (80ms each) before text is
+                generated. Higher delays allow more context but increase latency.
+                Allowed values: 7, 8, 10, 12, 14, 16, 20, 24, 36, 48.
+                Default is 10 (800ms). Lower values like 7-8 give faster response.
+        """
+
+        language: Optional[Language] = None
+        delay_in_frames: Optional[int] = None
+
    def __init__(
        self,
        *,
        api_key: str,
        api_endpoint_base_url: str = "wss://eu.api.gradium.ai/api/speech/asr",
-        json_config: str | None = None,
+        params: Optional[InputParams] = None,
+        json_config: Optional[str] = None,
        **kwargs,
    ):
        """Initialize the Gradium STT service.
@@ -60,14 +101,29 @@ class GradiumSTTService(WebsocketSTTService):
        Args:
            api_key: Gradium API key for authentication.
            api_endpoint_base_url: WebSocket endpoint URL. Defaults to Gradium's streaming endpoint.
+            params: Configuration parameters for language and delay settings.
            json_config: Optional JSON configuration string for additional model settings.
+
+                .. deprecated:: 0.0.101
+                    Use `params` instead for type-safe configuration.
+
            **kwargs: Additional arguments passed to parent STTService class.
        """
        super().__init__(sample_rate=SAMPLE_RATE, **kwargs)

+        if json_config is not None:
+            import warnings
+
+            warnings.warn(
+                "Parameter 'json_config' is deprecated and will be removed in a future version, use 'params' instead.",
+                DeprecationWarning,
+                stacklevel=2,
+            )
+
        self._api_key = api_key
        self._api_endpoint_base_url = api_endpoint_base_url
        self._websocket = None
+        self._params = params or GradiumSTTService.InputParams()
        self._json_config = json_config

        self._receive_task = None
@@ -76,6 +132,11 @@ class GradiumSTTService(WebsocketSTTService):
        self._chunk_size_ms = 80
        self._chunk_size_bytes = 0

+        # Set from the ready message when connecting to the service.
+        # These values are used for flushing transcription.
+        self._delay_in_frames = 0
+        self._frame_size = 0
+
    def can_generate_metrics(self) -> bool:
        """Check if the service can generate metrics.

@@ -84,6 +145,17 @@ class GradiumSTTService(WebsocketSTTService):
        """
        return True

+    async def set_language(self, language: Language):
+        """Set the recognition language and reconnect.
+
+        Args:
+            language: The language to use for speech recognition.
+        """
+        logger.info(f"Switching STT language to: [{language}]")
+        self._params.language = language
+        await self._disconnect()
+        await self._connect()
+
    async def start(self, frame: StartFrame):
        """Start the speech-to-text service.

@@ -112,6 +184,57 @@ class GradiumSTTService(WebsocketSTTService):
        await super().cancel(frame)
        await self._disconnect()

+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        """Process frames with VAD-specific handling.
+
+        When VAD detects the user has stopped speaking, we flush the transcription
+        by sending silence frames. This makes the system more reactive by getting
+        the final transcription faster without closing the connection.
+
+        Args:
+            frame: The frame to process.
+            direction: The direction of frame processing.
+        """
+        await super().process_frame(frame, direction)
+
+        if isinstance(frame, VADUserStartedSpeakingFrame):
+            await self.start_processing_metrics()
+        elif isinstance(frame, VADUserStoppedSpeakingFrame):
+            await self._flush_transcription()
+
+    async def _flush_transcription(self):
+        """Flush the transcription by sending silence frames.
+
+        When VAD detects the user stopped speaking, we send delay_in_frames
+        chunks of silence (zeros) to flush the remaining audio from the model's
+        buffer. This allows for faster turn-around without closing the connection.
+
+        From Gradium docs: "feed in delay_in_frames chunks of silence (vectors
+        of zeros). If those are fed in faster than realtime, the API also has
+        a possibility to process them faster."
+        """
+        if not self._websocket or self._websocket.state is not State.OPEN:
+            return
+
+        if self._delay_in_frames <= 0:
+            logger.debug("No delay_in_frames set, skipping flush")
+            return
+
+        # Create a silence chunk (zeros) of frame_size samples
+        # Each sample is 2 bytes (16-bit PCM)
+        silence_bytes = bytes(self._frame_size * 2)
+        silence_b64 = base64.b64encode(silence_bytes).decode("utf-8")
+
+        logger.debug(f"Flushing Gradium STT with {self._delay_in_frames} silence frames")
+
+        for _ in range(self._delay_in_frames):
+            msg = {"type": "audio", "audio": silence_b64}
+            try:
+                await self._websocket.send(json.dumps(msg))
+            except Exception as e:
+                logger.warning(f"Failed to send silence frame: {e}")
+                break
+
    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
        """Process audio data for speech-to-text conversion.

@@ -122,8 +245,6 @@ class GradiumSTTService(WebsocketSTTService):
            None (processing handled via WebSocket messages).
        """
        self._audio_buffer.extend(audio)
-        await self.start_ttfb_metrics()
-        await self.start_processing_metrics()

        while len(self._audio_buffer) >= self._chunk_size_bytes:
            chunk = bytes(self._audio_buffer[: self._chunk_size_bytes])
@@ -152,6 +273,9 @@ class GradiumSTTService(WebsocketSTTService):
        try:
            if self._websocket and self._websocket.state is State.OPEN:
                return
+
+            logger.debug("Connecting to Gradium STT")
+
            ws_url = self._api_endpoint_base_url
            headers = {
                "x-api-key": self._api_key,
@@ -166,8 +290,18 @@ class GradiumSTTService(WebsocketSTTService):
                "type": "setup",
                "input_format": "pcm",
            }
-            if self._json_config is not None:
-                setup_msg["json_config"] = self._json_config
+            # Build json_config: start with deprecated json_config, then override with params
+            json_config = {}
+            if self._json_config:
+                json_config = json.loads(self._json_config)
+            if self._params.language:
+                gradium_language = language_to_gradium_language(self._params.language)
+                if gradium_language:
+                    json_config["language"] = gradium_language
+            if self._params.delay_in_frames:
+                json_config["delay_in_frames"] = self._params.delay_in_frames
+            if json_config:
+                setup_msg["json_config"] = json_config
            await self._websocket.send(json.dumps(setup_msg))
            ready_msg = await self._websocket.recv()
            ready_msg = json.loads(ready_msg)
@@ -176,6 +310,14 @@ class GradiumSTTService(WebsocketSTTService):
            if ready_msg["type"] != "ready":
                raise Exception(f"unexpected first message type {ready_msg['type']}")

+            # Store delay_in_frames and frame_size for silence flushing
+            self._delay_in_frames = ready_msg.get("delay_in_frames", 0)
+            self._frame_size = ready_msg.get("frame_size", 1920)
+            logger.debug(
+                f"Connected to Gradium STT (delay_in_frames={self._delay_in_frames}, "
+                f"frame_size={self._frame_size})"
+            )
+
        except Exception as e:
            await self.push_error(error_msg=f"Unknown error occurred: {e}", exception=e)
            raise
@@ -241,3 +383,5 @@ class GradiumSTTService(WebsocketSTTService):
                time_now_iso8601(),
            )
        )
+        await self._trace_transcription(text, is_final=True, language=None)
+        await self.stop_processing_metrics()
--- a/src/pipecat/services/gradium/tts.py
+++ b/src/pipecat/services/gradium/tts.py
@@ -232,11 +232,15 @@ class GradiumTTSService(InterruptibleWordTTSService):
        raise Exception("Websocket not connected")

    async def flush_audio(self):
-        """Flush any pending audio synthesis."""
+        """Flush any pending audio synthesis.
+
+        Sends a <flush> tag to force the model to output audio for all text
+        that has been input so far, without closing the connection.
+        """
        if not self._websocket:
            return
        try:
-            msg = {"type": "end_of_stream"}
+            msg = {"type": "text", "text": "<flush>"}
            await self._websocket.send(json.dumps(msg))
        except ConnectionClosedOK:
            logger.debug(f"{self}: connection closed normally during flush")
--- a/src/pipecat/services/hathora/stt.py
+++ b/src/pipecat/services/hathora/stt.py
@@ -111,7 +111,6 @@ class HathoraSTTService(SegmentedSTTService):
        """
        try:
            await self.start_processing_metrics()
-            await self.start_ttfb_metrics()

            url = f"{self._base_url}"

@@ -153,7 +152,6 @@ class HathoraSTTService(SegmentedSTTService):
                        result=response,
                    )

-            await self.stop_ttfb_metrics()
            await self.stop_processing_metrics()

        except Exception as e:
--- a/src/pipecat/services/llm_service.py
+++ b/src/pipecat/services/llm_service.py
@@ -170,17 +170,22 @@ class LLMService(AIService):
    # However, subclasses should override this with a more specific adapter when necessary.
    adapter_class: Type[BaseLLMAdapter] = OpenAILLMAdapter

-    def __init__(self, run_in_parallel: bool = True, **kwargs):
+    def __init__(
+        self, run_in_parallel: bool = True, function_call_timeout_secs: float = 10.0, **kwargs
+    ):
        """Initialize the LLM service.

        Args:
            run_in_parallel: Whether to run function calls in parallel or sequentially.
                Defaults to True.
+            function_call_timeout_secs: Timeout in seconds for deferred function calls.
+                Defaults to 10.0 seconds.
            **kwargs: Additional arguments passed to the parent AIService.

        """
        super().__init__(**kwargs)
        self._run_in_parallel = run_in_parallel
+        self._function_call_timeout_secs = function_call_timeout_secs
        self._start_callbacks = {}
        self._adapter = self.adapter_class()
        self._functions: Dict[Optional[str], FunctionCallRegistryItem] = {}
@@ -596,14 +601,26 @@ class LLMService(AIService):
            cancel_on_interruption=item.cancel_on_interruption,
        )

-        callback_executed = False
+        # Start a timeout task for deferred function calls
+        async def timeout_handler():
+            await asyncio.sleep(self._function_call_timeout_secs)
+            logger.warning(
+                f"{self} Function call [{runner_item.function_name}:{runner_item.tool_call_id}] timed out after {self._function_call_timeout_secs} seconds"
+            )
+            await function_call_result_callback(None)
+
+        timeout_task = self.create_task(timeout_handler())

        # Define a callback function that pushes a FunctionCallResultFrame upstream & downstream.
        async def function_call_result_callback(
            result: Any, *, properties: Optional[FunctionCallResultProperties] = None
        ):
-            nonlocal callback_executed
-            callback_executed = True
+            nonlocal timeout_task
+
+            # Cancel timeout task if it exists
+            if timeout_task and not timeout_task.done():
+                await self.cancel_task(timeout_task)
+
            await self.broadcast_frame(
                FunctionCallResultFrame,
                function_name=runner_item.function_name,
@@ -653,9 +670,6 @@ class LLMService(AIService):
            error_message = f"Error executing function call [{runner_item.function_name}]: {e}"
            logger.error(f"{self} {error_message}")
            await self.push_error(error_msg=error_message, exception=e, fatal=False)
-        finally:
-            if not callback_executed:
-                await function_call_result_callback(None)

    async def _cancel_function_call(self, function_name: Optional[str]):
        cancelled_tasks = set()
--- a/src/pipecat/services/nvidia/stt.py
+++ b/src/pipecat/services/nvidia/stt.py
@@ -307,7 +307,6 @@ class NvidiaSTTService(STTService):

            transcript = result.alternatives[0].transcript
            if transcript and len(transcript) > 0:
-                await self.stop_ttfb_metrics()
                if result.is_final:
                    await self.stop_processing_metrics()
                    await self.push_frame(
@@ -344,7 +343,6 @@ class NvidiaSTTService(STTService):
        Yields:
            None - transcription results are pushed to the pipeline via frames.
        """
-        await self.start_ttfb_metrics()
        await self.start_processing_metrics()
        await self._queue.put(audio)
        yield None
@@ -598,12 +596,10 @@ class NvidiaSegmentedSTTService(SegmentedSTTService):
            assert self._config is not None, "Recognition config not created"

            await self.start_processing_metrics()
-            await self.start_ttfb_metrics()

            # Process audio with NVIDIA Riva ASR - explicitly request non-future response
            raw_response = self._asr_service.offline_recognize(audio, self._config, future=False)

-            await self.stop_ttfb_metrics()
            await self.stop_processing_metrics()

            # Process the response - handle different possible return types
--- a/src/pipecat/services/openai/base_llm.py
+++ b/src/pipecat/services/openai/base_llm.py
@@ -492,8 +492,11 @@ class BaseOpenAILLMService(LLMService):
                await self.push_frame(LLMFullResponseStartFrame())
                await self.start_processing_metrics()
                await self._process_context(context)
-            except httpx.TimeoutException:
+            except httpx.TimeoutException as e:
                await self._call_event_handler("on_completion_timeout")
+                await self.push_error(error_msg="LLM completion timeout", exception=e)
+            except Exception as e:
+                await self.push_error(error_msg=f"Error during completion: {e}", exception=e)
            finally:
                await self.stop_processing_metrics()
                await self.push_frame(LLMFullResponseEndFrame())
--- a/src/pipecat/services/openai/realtime/llm.py
+++ b/src/pipecat/services/openai/realtime/llm.py
@@ -599,6 +599,14 @@ class OpenAIRealtimeLLMService(LLMService):
        # note: ttfb is faster by 1/2 RTT than ttfb as measured for other services, since we're getting
        # this event from the server
        await self.stop_ttfb_metrics()
+
+        if self._current_audio_response and self._current_audio_response.item_id != evt.item_id:
+            logger.warning(
+                f"Received a new audio delta for an already completed audio response before receiving the BotStoppedSpeakingFrame."
+            )
+            logger.debug("Forcing previous audio response to None")
+            self._current_audio_response = None
+
        if not self._current_audio_response:
            self._current_audio_response = CurrentAudioResponse(
                item_id=evt.item_id,
--- a/src/pipecat/services/openai_realtime_beta/openai.py
+++ b/src/pipecat/services/openai_realtime_beta/openai.py
@@ -525,6 +525,14 @@ class OpenAIRealtimeBetaLLMService(LLMService):
        # note: ttfb is faster by 1/2 RTT than ttfb as measured for other services, since we're getting
        # this event from the server
        await self.stop_ttfb_metrics()
+
+        if self._current_audio_response and self._current_audio_response.item_id != evt.item_id:
+            logger.warning(
+                f"Received a new audio delta for an already completed audio response before receiving the BotStoppedSpeakingFrame."
+            )
+            logger.debug("Forcing previous audio response to None")
+            self._current_audio_response = None
+
        if not self._current_audio_response:
            self._current_audio_response = CurrentAudioResponse(
                item_id=evt.item_id,
--- a/src/pipecat/services/openrouter/llm.py
+++ b/src/pipecat/services/openrouter/llm.py
@@ -10,7 +10,7 @@ This module provides an OpenAI-compatible interface for interacting with OpenRou
 extending the base OpenAI LLM service functionality.
 """

-from typing import Optional
+from typing import Any, Dict, Optional

 from loguru import logger

@@ -61,3 +61,35 @@ class OpenRouterLLMService(OpenAILLMService):
        """
        logger.debug(f"Creating OpenRouter client with api {base_url}")
        return super().create_client(api_key, base_url, **kwargs)
+
+    def build_chat_completion_params(self, params_from_context: Dict[str, Any]) -> Dict[str, Any]:
+        """Builds chat parameters, handling model-specific constraints.
+
+        Args:
+            params_from_context: Parameters from the LLM context.
+
+        Returns:
+            Transformed parameters ready for the API call.
+        """
+        params = super().build_chat_completion_params(params_from_context)
+        model = getattr(self, "model_name", getattr(self, "model", "")).lower()
+        if "gemini" in model:
+            messages = params.get("messages", [])
+            if not messages:
+                return params
+            transformed_messages = []
+            system_message_seen = False
+            for msg in messages:
+                if msg.get("role") == "system":
+                    if not system_message_seen:
+                        transformed_messages.append(msg)
+                        system_message_seen = True
+                    else:
+                        new_msg = msg.copy()
+                        new_msg["role"] = "user"
+                        transformed_messages.append(new_msg)
+                else:
+                    transformed_messages.append(msg)
+            params["messages"] = transformed_messages
+
+        return params
--- a/src/pipecat/services/piper/tts.py
+++ b/src/pipecat/services/piper/tts.py
@@ -6,7 +6,9 @@

 """Piper TTS service implementation."""

-from typing import AsyncGenerator, Optional
+import asyncio
+from pathlib import Path
+from typing import AsyncGenerator, AsyncIterator, Optional

 import aiohttp
 from loguru import logger
@@ -20,11 +22,128 @@ from pipecat.frames.frames import (
 from pipecat.services.tts_service import TTSService
 from pipecat.utils.tracing.service_decorators import traced_tts

+try:
+    from piper import PiperVoice
+    from piper.download_voices import download_voice
+except ModuleNotFoundError as e:
+    logger.error(f"Exception: {e}")
+    logger.error("In order to use Piper, you need to `pip install pipecat-ai[piper]`.")
+    raise Exception(f"Missing module: {e}")
+

-# This assumes a running TTS service running: https://github.com/OHF-Voice/piper1-gpl/blob/main/docs/API_HTTP.md
 class PiperTTSService(TTSService):
    """Piper TTS service implementation.

+    Provides local text-to-speech synthesis using Piper voice models. Automatically
+    downloads voice models if not already present and resamples audio output to
+    match the configured sample rate.
+    """
+
+    def __init__(
+        self,
+        *,
+        voice_id: str,
+        download_dir: Optional[Path] = None,
+        force_redownload: bool = False,
+        use_cuda: bool = False,
+        **kwargs,
+    ):
+        """Initialize the Piper TTS service.
+
+        Args:
+            voice_id: Piper voice model identifier (e.g. `en_US-ryan-high`).
+            download_dir: Directory for storing voice model files. Defaults to
+                the current working directory.
+            force_redownload: Re-download the voice model even if it already exists.
+            use_cuda: Use CUDA for GPU-accelerated inference.
+            **kwargs: Additional arguments passed to the parent `TTSService`.
+        """
+        super().__init__(**kwargs)
+
+        self._voice_id = voice_id
+
+        download_dir = download_dir or Path.cwd()
+
+        model_file = f"{voice_id}.onnx"
+        model_path = Path(download_dir) / model_file
+
+        if not model_path.exists():
+            logger.debug(f"Downloading Piper '{voice_id}' model")
+            download_voice(voice_id, download_dir, force_redownload=force_redownload)
+
+        logger.debug(f"Loading Piper '{voice_id}' model from {model_path}")
+
+        self._voice = PiperVoice.load(model_path, use_cuda=use_cuda)
+
+        logger.debug(f"Loaded Piper '{voice_id}' model")
+
+    def can_generate_metrics(self) -> bool:
+        """Check if this service can generate processing metrics.
+
+        Returns:
+            True, as Piper service supports metrics generation.
+        """
+        return True
+
+    @traced_tts
+    async def run_tts(self, text: str) -> AsyncGenerator[Frame, None]:
+        """Generate speech from text using Piper.
+
+        Args:
+            text: The text to convert to speech.
+
+        Yields:
+            Frame: Audio frames containing the synthesized speech and status frames.
+        """
+
+        def async_next(it):
+            try:
+                return next(it)
+            except StopIteration:
+                return None
+
+        async def async_iterator(iterator) -> AsyncIterator[bytes]:
+            while True:
+                item = await asyncio.to_thread(async_next, iterator)
+                if item is None:
+                    return
+                yield item.audio_int16_bytes
+
+        logger.debug(f"{self}: Generating TTS [{text}]")
+
+        try:
+            await self.start_ttfb_metrics()
+
+            await self.start_tts_usage_metrics(text)
+
+            yield TTSStartedFrame()
+
+            async for frame in self._stream_audio_frames_from_iterator(
+                async_iterator(self._voice.synthesize(text)),
+                in_sample_rate=self._voice.config.sample_rate,
+            ):
+                await self.stop_ttfb_metrics()
+                yield frame
+        except Exception as e:
+            logger.error(f"{self} exception: {e}")
+            yield ErrorFrame(error=f"Unknown error occurred: {e}")
+        finally:
+            logger.debug(f"{self}: Finished TTS [{text}]")
+            await self.stop_ttfb_metrics()
+            yield TTSStoppedFrame()
+
+
+# This assumes a running TTS service running:
+# https://github.com/OHF-Voice/piper1-gpl/blob/main/docs/API_HTTP.md
+#
+# Usage:
+#
+#  $ uv pip install "piper-tts[http]"
+#  $ uv run python -m piper.http_server -m en_US-ryan-high
+#
+class PiperHttpTTSService(TTSService):
+    """Piper HTTP TTS service implementation.
+
    Provides integration with Piper's HTTP TTS server for text-to-speech
    synthesis. Supports streaming audio generation with configurable sample
    rates and automatic WAV header removal.
@@ -35,9 +154,7 @@ class PiperTTSService(TTSService):
        *,
        base_url: str,
        aiohttp_session: aiohttp.ClientSession,
-        # When using Piper, the sample rate of the generated audio depends on the
-        # voice model being used.
-        sample_rate: Optional[int] = None,
+        voice_id: Optional[str] = None,
        **kwargs,
    ):
        """Initialize the Piper TTS service.
@@ -45,10 +162,10 @@ class PiperTTSService(TTSService):
        Args:
            base_url: Base URL for the Piper TTS HTTP server.
            aiohttp_session: aiohttp ClientSession for making HTTP requests.
-            sample_rate: Output sample rate. If None, uses the voice model's native rate.
+            voice_id: Piper voice model identifier (e.g. `en_US-ryan-high`).
            **kwargs: Additional arguments passed to the parent TTSService.
        """
-        super().__init__(sample_rate=sample_rate, **kwargs)
+        super().__init__(**kwargs)

        if base_url.endswith("/"):
            logger.warning("Base URL ends with a slash, this is not allowed.")
@@ -56,7 +173,7 @@ class PiperTTSService(TTSService):

        self._base_url = base_url
        self._session = aiohttp_session
-        self._settings = {"base_url": base_url}
+        self._model_id = voice_id

    def can_generate_metrics(self) -> bool:
        """Check if this service can generate processing metrics.
@@ -83,9 +200,12 @@ class PiperTTSService(TTSService):
        try:
            await self.start_ttfb_metrics()

-            async with self._session.post(
-                self._base_url, json={"text": text}, headers=headers
-            ) as response:
+            data = {
+                "text": text,
+                "voice": self._model_id,
+            }
+
+            async with self._session.post(self._base_url, json=data, headers=headers) as response:
                if response.status != 200:
                    error = await response.text()
                    yield ErrorFrame(
--- a/src/pipecat/services/sarvam/stt.py
+++ b/src/pipecat/services/sarvam/stt.py
@@ -15,9 +15,15 @@ from pipecat.frames.frames import (
    CancelFrame,
    EndFrame,
    ErrorFrame,
+    Frame,
    StartFrame,
    TranscriptionFrame,
+    UserStartedSpeakingFrame,
+    UserStoppedSpeakingFrame,
+    VADUserStartedSpeakingFrame,
+    VADUserStoppedSpeakingFrame,
 )
+from pipecat.processors.frame_processor import FrameDirection
 from pipecat.services.sarvam._sdk import sdk_headers
 from pipecat.services.stt_service import STTService
 from pipecat.transcriptions.language import Language, resolve_language
@@ -75,14 +81,14 @@ class SarvamSTTService(STTService):
            language: Target language for transcription. Defaults to None (required for saarika models).
            prompt: Optional prompt to guide translation style/context for STT-Translate models.
                   Only applicable to saaras (STT-Translate) models. Defaults to None.
-            vad_signals: Enable VAD signals in response. Defaults to True.
-            high_vad_sensitivity: Enable high VAD (Voice Activity Detection) sensitivity. Defaults to False.
+            vad_signals: Enable VAD signals in response. Defaults to None.
+            high_vad_sensitivity: Enable high VAD (Voice Activity Detection) sensitivity. Defaults to None.
        """

        language: Optional[Language] = None
        prompt: Optional[str] = None
-        vad_signals: bool = True
-        high_vad_sensitivity: bool = False
+        vad_signals: bool = None
+        high_vad_sensitivity: bool = None

    def __init__(
        self,
@@ -155,6 +161,7 @@ class SarvamSTTService(STTService):
        self._websocket_context = None
        self._socket_client = None
        self._receive_task = None
+        logger.info(f"Sarvam STT initialized with SDK headers: {self._sdk_headers}")

    def language_to_service_language(self, language: Language) -> str:
        """Convert pipecat Language enum to Sarvam's language code.
@@ -175,6 +182,22 @@ class SarvamSTTService(STTService):
        """
        return True

+    async def process_frame(self, frame: Frame, direction: FrameDirection):
+        """Process incoming frames.
+
+        Handles VAD frames for TTFB tracking when using Pipecat's VAD
+        instead of Sarvam's built-in VAD.
+        """
+        await super().process_frame(frame, direction)
+
+        # Only handle VAD frames when not using Sarvam's VAD signals
+        if not self._vad_signals:
+            if isinstance(frame, VADUserStartedSpeakingFrame):
+                await self._start_metrics()
+            elif isinstance(frame, VADUserStoppedSpeakingFrame):
+                if self._socket_client:
+                    await self._socket_client.flush()
+
    async def set_language(self, language: Language):
        """Set the recognition language and reconnect.

@@ -411,16 +434,18 @@ class SarvamSTTService(STTService):
                logger.debug(f"VAD Signal: {signal}, Occurred at: {timestamp}")

                if signal == "START_SPEECH":
-                    await self.start_metrics()
+                    await self._start_metrics()
                    logger.debug("User started speaking")
                    await self._call_event_handler("on_speech_started")
+                    await self.broadcast_frame(UserStartedSpeakingFrame)
+                    await self.push_interruption_task_frame_and_wait()

                elif signal == "END_SPEECH":
                    logger.debug("User stopped speaking")
                    await self._call_event_handler("on_speech_stopped")
+                    await self.broadcast_frame(UserStoppedSpeakingFrame)

            elif message.type == "data":
-                await self.stop_ttfb_metrics()
                transcript = message.data.transcript
                language_code = message.data.language_code
                # Prefer language from message (auto-detected for translate models). Fallback to configured.
@@ -482,7 +507,6 @@ class SarvamSTTService(STTService):
        }
        return mapping.get(language_code, Language.HI_IN)

-    async def start_metrics(self):
-        """Start TTFB and processing metrics collection."""
-        await self.start_ttfb_metrics()
+    async def _start_metrics(self):
+        """Start processing metrics collection."""
        await self.start_processing_metrics()
--- a/src/pipecat/services/soniox/stt.py
+++ b/src/pipecat/services/soniox/stt.py
@@ -21,7 +21,7 @@ from pipecat.frames.frames import (
    InterimTranscriptionFrame,
    StartFrame,
    TranscriptionFrame,
-    UserStoppedSpeakingFrame,
+    VADUserStoppedSpeakingFrame,
 )
 from pipecat.processors.frame_processor import FrameDirection
 from pipecat.services.stt_service import WebsocketSTTService
@@ -162,7 +162,7 @@ class SonioxSTTService(WebsocketSTTService):
            sample_rate: Audio sample rate.
            params: Additional configuration parameters, such as language hints, context and
                speaker diarization.
-            vad_force_turn_endpoint: Listen to `UserStoppedSpeakingFrame` to send finalize message to Soniox. If disabled, Soniox will detect the end of the speech.
+            vad_force_turn_endpoint: Listen to `VADUserStoppedSpeakingFrame` to send finalize message to Soniox. If disabled, Soniox will detect the end of the speech.
            **kwargs: Additional arguments passed to the STTService.
        """
        super().__init__(sample_rate=sample_rate, **kwargs)
@@ -247,7 +247,7 @@ class SonioxSTTService(WebsocketSTTService):
        """
        await super().process_frame(frame, direction)

-        if isinstance(frame, UserStoppedSpeakingFrame) and self._vad_force_turn_endpoint:
+        if isinstance(frame, VADUserStoppedSpeakingFrame) and self._vad_force_turn_endpoint:
            # Send finalize message to Soniox so we get the final tokens asap.
            if self._websocket and self._websocket.state is State.OPEN:
                await self._websocket.send(FINALIZE_MESSAGE)
@@ -374,12 +374,15 @@ class SonioxSTTService(WebsocketSTTService):
        async def send_endpoint_transcript():
            if self._final_transcription_buffer:
                text = "".join(map(lambda token: token["text"], self._final_transcription_buffer))
+                # Soniox only pushes TranscriptionFrame when an end token is received,
+                # so every TranscriptionFrame is inherently finalized
                await self.push_frame(
                    TranscriptionFrame(
                        text=text,
                        user_id=self._user_id,
                        timestamp=time_now_iso8601(),
                        result=self._final_transcription_buffer,
+                        finalized=True,
                    )
                )
                await self._handle_transcription(text, is_final=True)
--- a/src/pipecat/services/speechmatics/stt.py
+++ b/src/pipecat/services/speechmatics/stt.py
@@ -8,7 +8,6 @@

 import asyncio
 import os
-import time
 from enum import Enum
 from typing import Any, AsyncGenerator

@@ -67,7 +66,7 @@ class TurnDetectionMode(str, Enum):
    """Endpoint and turn detection handling mode.

    How the STT engine handles the endpointing of speech. If using Pipecat's built-in endpointing,
-    then use `TurnDetectionMode.FIXED` (default).
+    then use `TurnDetectionMode.EXTERNAL` (default).

    To use the STT engine's built-in endpointing, then use `TurnDetectionMode.ADAPTIVE` for simple
    voice activity detection or `TurnDetectionMode.SMART_TURN` for more advanced ML-based
@@ -107,7 +106,7 @@ class SpeechmaticsSTTService(STTService):

            turn_detection_mode: Endpoint handling, one of `TurnDetectionMode.FIXED`,
                `TurnDetectionMode.EXTERNAL`, `TurnDetectionMode.ADAPTIVE` and
-                `TurnDetectionMode.SMART_TURN`. Defaults to `TurnDetectionMode.FIXED`.
+                `TurnDetectionMode.SMART_TURN`. Defaults to `TurnDetectionMode.EXTERNAL`.

            speaker_active_format: Formatter for active speaker ID. This formatter is used to format
                the text output for individual speakers and ensures that the context is clear for
@@ -201,6 +200,7 @@ class SpeechmaticsSTTService(STTService):
            extra_params: Extra parameters to pass to the STT engine. This is a dictionary of
                additional parameters that can be used to configure the STT engine.
                Default to None.
+
        """

        # Service configuration
@@ -208,7 +208,7 @@ class SpeechmaticsSTTService(STTService):
        language: Language | str = Language.EN

        # Endpointing mode
-        turn_detection_mode: TurnDetectionMode = TurnDetectionMode.FIXED
+        turn_detection_mode: TurnDetectionMode = TurnDetectionMode.EXTERNAL

        # Output formatting
        speaker_active_format: str | None = None
@@ -346,7 +346,7 @@ class SpeechmaticsSTTService(STTService):
            params.speaker_passive_format or params.speaker_active_format
        )

-        # Metrics
+        # Model + metrics
        self.set_model_name(self._config.operating_point.value)

        # Message queue
@@ -598,9 +598,6 @@ class SpeechmaticsSTTService(STTService):
        if segments:
            await self._send_frames(segments)

-        # Update metrics
-        await self._emit_metrics(message.get("metadata", {}).get("processing_time", 0.0))
-
    async def _handle_segment(self, message: dict[str, Any]) -> None:
        """Handle AddSegment events.

@@ -695,6 +692,7 @@ class SpeechmaticsSTTService(STTService):
                    f"{self} VADUserStoppedSpeakingFrame received but internal VAD is being used"
                )
            elif not self._enable_vad and self._client is not None:
+                self.request_finalize()
                self._client.finalize()

    async def _send_frames(self, segments: list[dict[str, Any]], finalized: bool = False) -> None:
@@ -738,16 +736,33 @@ class SpeechmaticsSTTService(STTService):

        # If final, then re-parse into TranscriptionFrame
        if finalized:
+            # Do any segments have `is_eou` set to True?
+            if (
+                any(segment.get("is_eou", False) for segment in segments)
+                and self._finalize_requested
+            ):
+                self.confirm_finalize()
+
+            # Add the finalized frames
            frames += [TranscriptionFrame(**attr_from_segment(segment)) for segment in segments]
+
+            # Handle the text (for metrics reporting)
            finalized_text = "|".join([s["text"] for s in segments])
-            await self._handle_transcription(finalized_text, True, segments[0]["language"])
+            await self._handle_transcription(
+                finalized_text, is_final=True, language=segments[0]["language"]
+            )
+
+            # Log the frames
            logger.debug(f"{self} finalized transcript: {[f.text for f in frames]}")

        # Return as interim results (unformatted)
        else:
+            # Add the interim frames
            frames += [
                InterimTranscriptionFrame(**attr_from_segment(segment)) for segment in segments
            ]
+
+            # Log the frames
            logger.debug(f"{self} interim transcript: {[f.text for f in frames]}")

        # Send the frames
@@ -804,28 +819,6 @@ class SpeechmaticsSTTService(STTService):
            yield ErrorFrame(f"Speechmatics error: {e}")
            await self._disconnect()

-    async def _emit_metrics(self, processing_time: float) -> None:
-        """Create TTFB metrics.
-
-        The TTFB is the seconds between the person speaking and the STT
-        engine emitting the first partial. This is only calculated at the
-        start of an utterance.
-        """
-        # Skip if metrics not available
-        if not self._metrics or processing_time == 0.0:
-            return
-
-        # Calculate time as time.time() - ttfb (which is seconds)
-        start_time = time.time() - processing_time
-
-        # Update internal metrics
-        self._metrics._start_ttfb_time = start_time
-        self._metrics._start_processing_time = start_time
-
-        # Stop TTFB metrics
-        await self.stop_ttfb_metrics()
-        await self.stop_processing_metrics()
-
    # ============================================================================
    # HELPERS
    # ============================================================================
--- a/src/pipecat/services/stt_service.py
+++ b/src/pipecat/services/stt_service.py
@@ -6,7 +6,9 @@

 """Base classes for Speech-to-Text services with continuous and segmented processing."""

+import asyncio
 import io
+import time
 import wave
 from abc import abstractmethod
 from typing import Any, AsyncGenerator, Dict, Mapping, Optional
@@ -17,12 +19,17 @@ from pipecat.frames.frames import (
    AudioRawFrame,
    ErrorFrame,
    Frame,
+    InterruptionFrame,
+    MetricsFrame,
+    SpeechControlParamsFrame,
    StartFrame,
    STTMuteFrame,
    STTUpdateSettingsFrame,
+    TranscriptionFrame,
    VADUserStartedSpeakingFrame,
    VADUserStoppedSpeakingFrame,
 )
+from pipecat.metrics.metrics import TTFBMetricsData
 from pipecat.processors.frame_processor import FrameDirection
 from pipecat.services.ai_service import AIService
 from pipecat.services.websocket_service import WebsocketService
@@ -61,6 +68,8 @@ class STTService(AIService):
        audio_passthrough=True,
        # STT input sample rate
        sample_rate: Optional[int] = None,
+        # STT TTFB timeout - time to wait after VAD stop before reporting TTFB
+        stt_ttfb_timeout: float = 2.0,
        **kwargs,
    ):
        """Initialize the STT service.
@@ -70,6 +79,12 @@ class STTService(AIService):
                Defaults to True.
            sample_rate: The sample rate for audio input. If None, will be determined
                from the start frame.
+            stt_ttfb_timeout: Time in seconds to wait after VAD stop before reporting
+                TTFB. This delay allows the final transcription to arrive. Defaults to 2.0.
+                Note: STT "TTFB" differs from traditional TTFB (which measures from a discrete
+                request to first response byte). Since STT receives continuous audio, we measure
+                from when the user stops speaking to when the final transcript arrives—capturing
+                the latency that matters for voice AI applications.
            **kwargs: Additional arguments passed to the parent AIService.
        """
        super().__init__(**kwargs)
@@ -81,6 +96,16 @@ class STTService(AIService):
        self._muted: bool = False
        self._user_id: str = ""

+        # STT TTFB tracking state
+        self._stt_ttfb_timeout = stt_ttfb_timeout
+        self._ttfb_timeout_task: Optional[asyncio.Task] = None
+        self._vad_stop_secs: Optional[float] = None
+        self._speech_end_time: Optional[float] = None
+        self._user_speaking: bool = False
+        self._last_transcription_time: Optional[float] = None
+        self._finalize_pending: bool = False
+        self._finalize_requested: bool = False
+
        self._register_event_handler("on_connected")
        self._register_event_handler("on_disconnected")
        self._register_event_handler("on_connection_error")
@@ -94,6 +119,31 @@ class STTService(AIService):
        """
        return self._muted

+    def request_finalize(self):
+        """Mark that a finalize request has been sent, awaiting server confirmation.
+
+        For providers that have explicit server confirmation of finalization
+        (e.g., Deepgram's from_finalize field), call this when sending the finalize
+        request. Then call confirm_finalize() when the server confirms.
+
+        For providers without server confirmation, don't call this method - just
+        send the finalize/flush/commit command and rely on the TTFB timeout.
+        """
+        self._finalize_requested = True
+
+    def confirm_finalize(self):
+        """Confirm that the server has acknowledged the finalize request.
+
+        Call this when the server response confirms finalization (e.g., Deepgram's
+        from_finalize=True). The next TranscriptionFrame pushed will be marked
+        as finalized.
+
+        Only has effect if request_finalize() was previously called.
+        """
+        if self._finalize_requested:
+            self._finalize_pending = True
+            self._finalize_requested = False
+
    @property
    def sample_rate(self) -> int:
        """Get the current sample rate for audio processing.
@@ -144,6 +194,11 @@ class STTService(AIService):
        self._sample_rate = self._init_sample_rate or frame.audio_in_sample_rate
        self._tracing_enabled = frame.enable_tracing

+    async def cleanup(self):
+        """Clean up STT service resources."""
+        await super().cleanup()
+        await self._cancel_ttfb_timeout()
+
    async def _update_settings(self, settings: Mapping[str, Any]):
        logger.info(f"Updating STT settings: {self._settings}")
        for key, value in settings.items():
@@ -206,14 +261,168 @@ class STTService(AIService):
            await self.process_audio_frame(frame, direction)
            if self._audio_passthrough:
                await self.push_frame(frame, direction)
+        elif isinstance(frame, SpeechControlParamsFrame):
+            await self._handle_speech_control_params(frame)
+            await self.push_frame(frame, direction)
+        elif isinstance(frame, VADUserStartedSpeakingFrame):
+            await self._handle_vad_user_started_speaking(frame)
+            await self.push_frame(frame, direction)
+        elif isinstance(frame, VADUserStoppedSpeakingFrame):
+            await self._handle_vad_user_stopped_speaking(frame)
+            await self.push_frame(frame, direction)
        elif isinstance(frame, STTUpdateSettingsFrame):
            await self._update_settings(frame.settings)
        elif isinstance(frame, STTMuteFrame):
            self._muted = frame.mute
            logger.debug(f"STT service {'muted' if frame.mute else 'unmuted'}")
+        elif isinstance(frame, InterruptionFrame):
+            await self._reset_stt_ttfb_state()
+            await self.push_frame(frame, direction)
        else:
            await self.push_frame(frame, direction)

+    async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
+        """Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.
+
+        Stores the timestamp of each TranscriptionFrame for TTFB calculation.
+        If the frame is marked as finalized (via request_finalize/confirm_finalize),
+        reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is
+        reported after a timeout.
+
+        Args:
+            frame: The frame to push.
+            direction: The direction to push the frame.
+        """
+        if isinstance(frame, TranscriptionFrame):
+            # Store the transcription time for TTFB calculation
+            self._last_transcription_time = time.time()
+
+            # Set finalized from pending state and auto-reset
+            if self._finalize_pending:
+                frame.finalized = True
+                self._finalize_pending = False
+
+            # If this is a finalized transcription, report TTFB immediately
+            if frame.finalized and self._speech_end_time is not None:
+                ttfb = self._last_transcription_time - self._speech_end_time
+                await self._emit_stt_ttfb_metric(ttfb)
+                # Cancel the timeout since we've already reported
+                await self._cancel_ttfb_timeout()
+                # Clear state
+                self._speech_end_time = None
+                self._last_transcription_time = None
+
+        await super().push_frame(frame, direction)
+
+    async def _handle_speech_control_params(self, frame: SpeechControlParamsFrame):
+        """Handle speech control parameters frame to extract VAD stop_secs.
+
+        Args:
+            frame: The speech control parameters frame.
+        """
+        if frame.vad_params is not None:
+            self._vad_stop_secs = frame.vad_params.stop_secs
+
+    async def _cancel_ttfb_timeout(self):
+        """Cancel any pending TTFB timeout task."""
+        if self._ttfb_timeout_task:
+            await self.cancel_task(self._ttfb_timeout_task)
+            self._ttfb_timeout_task = None
+
+    async def _reset_stt_ttfb_state(self):
+        """Reset STT TTFB measurement state.
+
+        Called when starting a new utterance or on interruption to ensure
+        we don't use stale state for TTFB calculations. This specifically guards
+        against the case where a TranscriptionFrame is received without corresponding
+        VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame frames.
+
+        Note: Does not reset _user_speaking since InterruptionFrame can arrive
+        while user is still speaking.
+        """
+        await self._cancel_ttfb_timeout()
+        self._speech_end_time = None
+        self._last_transcription_time = None
+
+    async def _handle_vad_user_started_speaking(self, frame: VADUserStartedSpeakingFrame):
+        """Handle VAD user started speaking frame to start tracking transcriptions.
+
+        Cancels any pending TTFB timeout, resets TTFB tracking state, and marks user as speaking.
+        Also resets finalization state to prevent stale finalization from a previous utterance.
+
+        Args:
+            frame: The VAD user started speaking frame.
+        """
+        await self._reset_stt_ttfb_state()
+        self._user_speaking = True
+        self._finalize_requested = False
+        self._finalize_pending = False
+
+    async def _handle_vad_user_stopped_speaking(self, frame: VADUserStoppedSpeakingFrame):
+        """Handle VAD user stopped speaking frame.
+
+        Calculates the actual speech end time and starts a timeout task to wait
+        for the final transcription before reporting TTFB.
+
+        Args:
+            frame: The VAD user stopped speaking frame.
+        """
+        self._user_speaking = False
+
+        # Skip TTFB measurement if we don't have VAD params
+        if self._vad_stop_secs is None:
+            return
+
+        # Calculate the actual speech end time (current time minus VAD stop delay).
+        # This approximates when the last user audio was sent to the STT service,
+        # which we use to measure against the eventual transcription response.
+        self._speech_end_time = time.time() - self._vad_stop_secs
+
+        # Start timeout task (any previous timeout was cancelled by VADUserStartedSpeakingFrame
+        # or InterruptionFrame)
+        self._ttfb_timeout_task = self.create_task(
+            self._ttfb_timeout_handler(), name="stt_ttfb_timeout"
+        )
+
+    async def _ttfb_timeout_handler(self):
+        """Wait for timeout then report TTFB using the last transcription timestamp.
+
+        This timeout allows the final transcription to arrive before we calculate
+        and report TTFB. If no transcription arrived, no TTFB is reported.
+        """
+        try:
+            await asyncio.sleep(self._stt_ttfb_timeout)
+
+            # Report TTFB if we have both speech end time and transcription time
+            if self._speech_end_time is not None and self._last_transcription_time is not None:
+                ttfb = self._last_transcription_time - self._speech_end_time
+                await self._emit_stt_ttfb_metric(ttfb)
+
+            # Clear state after reporting
+            self._speech_end_time = None
+            self._last_transcription_time = None
+        except asyncio.CancelledError:
+            # Task was cancelled (new utterance or interruption), which is expected behavior
+            pass
+        finally:
+            self._ttfb_timeout_task = None
+
+    async def _emit_stt_ttfb_metric(self, ttfb: float):
+        """Emit STT TTFB metric if value is non-negative.
+
+        Args:
+            ttfb: The TTFB value in seconds.
+        """
+        if ttfb >= 0:
+            logger.debug(f"{self} TTFB: {ttfb:.3f}s")
+            if self.metrics_enabled:
+                ttfb_data = TTFBMetricsData(
+                    processor=self.name,
+                    model=self.model_name,
+                    value=ttfb,
+                )
+                await super().push_frame(MetricsFrame(data=[ttfb_data]))
+

 class SegmentedSTTService(STTService):
    """STT service that processes speech in segments using VAD events.
@@ -250,6 +459,20 @@ class SegmentedSTTService(STTService):
        await super().start(frame)
        self._audio_buffer_size_1s = self.sample_rate * 2

+    async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
+        """Push a frame, marking TranscriptionFrames as finalized.
+
+        Segmented STT services process complete speech segments and return a single
+        TranscriptionFrame per segment, so every transcription is inherently finalized.
+
+        Args:
+            frame: The frame to push.
+            direction: The direction of frame flow in the pipeline.
+        """
+        if isinstance(frame, TranscriptionFrame):
+            frame.finalized = True
+        await super().push_frame(frame, direction)
+
    async def process_frame(self, frame: Frame, direction: FrameDirection):
        """Process frames, handling VAD events and audio segmentation."""
        await super().process_frame(frame, direction)
@@ -274,11 +497,11 @@ class SegmentedSTTService(STTService):
        wav.close()
        content.seek(0)

-        await self.process_generator(self.run_stt(content.read()))
-
        # Start clean.
        self._audio_buffer.clear()

+        await self.process_generator(self.run_stt(content.read()))
+
    async def process_audio_frame(self, frame: AudioRawFrame, direction: FrameDirection):
        """Process audio frames by buffering them for segmented transcription.

--- a/src/pipecat/services/tts_service.py
+++ b/src/pipecat/services/tts_service.py
@@ -24,6 +24,7 @@ from typing import (

 from loguru import logger

+from pipecat.audio.utils import create_stream_resampler
 from pipecat.frames.frames import (
    AggregatedTextFrame,
    AggregationType,
@@ -202,6 +203,8 @@ class TTSService(AIService):
                )
            self._text_filters = [text_filter]

+        self._resampler = create_stream_resampler()
+
        self._stop_frame_task: Optional[asyncio.Task] = None
        self._stop_frame_queue: asyncio.Queue = asyncio.Queue()

@@ -505,12 +508,40 @@ class TTSService(AIService):
            await self._stop_frame_queue.put(frame)

    async def _stream_audio_frames_from_iterator(
-        self, iterator: AsyncIterator[bytes], *, strip_wav_header: bool
+        self,
+        iterator: AsyncIterator[bytes],
+        *,
+        strip_wav_header: bool = False,
+        in_sample_rate: Optional[int] = None,
    ) -> AsyncGenerator[Frame, None]:
+        """Stream audio frames from an async byte iterator with optional resampling.
+
+        For WAV data, use `strip_wav_header=True` to strip the header and
+        auto-detect the source sample rate. For raw PCM data, pass
+        `in_sample_rate` directly. Audio is resampled to `self.sample_rate` when
+        the source rate differs.
+
+        Args:
+            iterator: Async iterator yielding audio bytes.
+            strip_wav_header: Strip WAV header and parse source sample rate from it.
+            in_sample_rate: Source sample rate for raw PCM data. Overrides
+                WAV-detected rate if both are provided.
+
+        """
        buffer = bytearray()
+        source_sample_rate = in_sample_rate
        need_to_strip_wav_header = strip_wav_header
+
+        async def maybe_resample(audio: bytes) -> bytes:
+            if source_sample_rate and source_sample_rate != self.sample_rate:
+                return await self._resampler.resample(audio, source_sample_rate, self.sample_rate)
+            return audio
+
        async for chunk in iterator:
            if need_to_strip_wav_header and chunk.startswith(b"RIFF"):
+                # Parse sample rate from WAV header (bytes 24-28, little-endian uint32).
+                if len(chunk) >= 44 and source_sample_rate is None:
+                    source_sample_rate = int.from_bytes(chunk[24:28], "little")
                chunk = chunk[44:]
                need_to_strip_wav_header = False

@@ -520,19 +551,18 @@ class TTSService(AIService):
            # Round to nearest even number.
            aligned_length = len(buffer) & ~1  # 111111111...11110
            if aligned_length > 0:
-                aligned_chunk = buffer[:aligned_length]
+                aligned_chunk = await maybe_resample(bytes(buffer[:aligned_length]))
                buffer = buffer[aligned_length:]  # keep any leftover byte

                if len(aligned_chunk) > 0:
-                    frame = TTSAudioRawFrame(bytes(aligned_chunk), self.sample_rate, 1)
-                    yield frame
+                    yield TTSAudioRawFrame(aligned_chunk, self.sample_rate, 1)

        if len(buffer) > 0:
            # Make sure we don't need an extra padding byte.
            if len(buffer) % 2 == 1:
                buffer.extend(b"\x00")
-            frame = TTSAudioRawFrame(bytes(buffer), self.sample_rate, 1)
-            yield frame
+            audio = await maybe_resample(bytes(buffer))
+            yield TTSAudioRawFrame(audio, self.sample_rate, 1)

    async def _handle_interruption(self, frame: InterruptionFrame, direction: FrameDirection):
        self._processing_text = False
--- a/src/pipecat/services/websocket_service.py
+++ b/src/pipecat/services/websocket_service.py
@@ -123,26 +123,29 @@ class WebsocketService(ABC):

    async def _maybe_try_reconnect(
        self,
-        error: Exception,
        error_message: str,
        report_error: Callable[[ErrorFrame], Awaitable[None]],
+        error: Optional[Exception] = None,
    ) -> bool:
        """Check if reconnection should be attempted and try if appropriate.

        Args:
-            error: The exception that occurred.
            error_message: Human-readable error message for logging.
            report_error: Callback function to report connection errors.
+            error: The exception that occurred (optional, may be None for graceful closes).

        Returns:
            True if should continue the receive loop, False if should break.
        """
        # Don't reconnect if we're intentionally disconnecting
        if self._disconnecting:
-            logger.warning(f"{self} error during disconnect: {error}")
+            if error:
+                logger.warning(f"{self} error during disconnect: {error}")
+            else:
+                logger.debug(f"{self} receive loop ended during disconnect")
            return False

-        # Log the error
+        # Log the message
        logger.warning(error_message)

        # Try to reconnect if enabled
@@ -167,6 +170,14 @@ class WebsocketService(ABC):
        while True:
            try:
                await self._receive_messages()
+                # _receive_messages() returned normally. This happens when the websocket
+                # closes gracefully (server sent close frame). The async for loop over
+                # the websocket exits without raising an exception in this case.
+                # We must handle this to avoid an infinite loop.
+                message = f"{self} connection closed by server"
+                should_continue = await self._maybe_try_reconnect(message, report_error)
+                if not should_continue:
+                    break
            except ConnectionClosedOK as e:
                # Normal closure, don't retry
                logger.debug(f"{self} connection closed normally: {e}")
@@ -175,13 +186,13 @@ class WebsocketService(ABC):
                # Connection closed with error (e.g., no close frame received/sent)
                # This often indicates network issues, server problems, or abrupt disconnection
                message = f"{self} connection closed, but with an error: {e}"
-                should_continue = await self._maybe_try_reconnect(e, message, report_error)
+                should_continue = await self._maybe_try_reconnect(message, report_error, e)
                if not should_continue:
                    break
            except Exception as e:
                # General error during message receiving
                message = f"{self} error receiving messages: {e}"
-                should_continue = await self._maybe_try_reconnect(e, message, report_error)
+                should_continue = await self._maybe_try_reconnect(message, report_error, e)
                if not should_continue:
                    break

--- a/src/pipecat/services/whisper/base_stt.py
+++ b/src/pipecat/services/whisper/base_stt.py
@@ -204,11 +204,9 @@ class BaseWhisperSTTService(SegmentedSTTService):
        """
        try:
            await self.start_processing_metrics()
-            await self.start_ttfb_metrics()

            response = await self._transcribe(audio)

-            await self.stop_ttfb_metrics()
            await self.stop_processing_metrics()

            text = response.text.strip()
--- a/src/pipecat/services/whisper/stt.py
+++ b/src/pipecat/services/whisper/stt.py
@@ -289,7 +289,6 @@ class WhisperSTTService(SegmentedSTTService):
            return

        await self.start_processing_metrics()
-        await self.start_ttfb_metrics()

        # Divide by 32768 because we have signed 16-bit data.
        audio_float = np.frombuffer(audio, dtype=np.int16).astype(np.float32) / 32768.0
@@ -303,7 +302,6 @@ class WhisperSTTService(SegmentedSTTService):
            if segment.no_speech_prob < self._no_speech_prob:
                text += f"{segment.text} "

-        await self.stop_ttfb_metrics()
        await self.stop_processing_metrics()

        if text:
@@ -388,7 +386,6 @@ class WhisperSTTServiceMLX(WhisperSTTService):
            import mlx_whisper

            await self.start_processing_metrics()
-            await self.start_ttfb_metrics()

            # Divide by 32768 because we have signed 16-bit data.
            audio_float = np.frombuffer(audio, dtype=np.int16).astype(np.float32) / 32768.0
@@ -413,7 +410,6 @@ class WhisperSTTServiceMLX(WhisperSTTService):
            if len(text.strip()) == 0:
                text = None

-            await self.stop_ttfb_metrics()
            await self.stop_processing_metrics()

            if text:
--- a/src/pipecat/transports/daily/transport.py
+++ b/src/pipecat/transports/daily/transport.py
@@ -1733,7 +1733,7 @@ class DailyInputTransport(BaseInputTransport):
            message: The message data to send.
            sender: ID of the message sender.
        """
-        await self.broadcast_frame_class(
+        await self.broadcast_frame(
            DailyInputTransportMessageFrame, message=message, participant_id=sender
        )

--- a/src/pipecat/transports/smallwebrtc/transport.py
+++ b/src/pipecat/transports/smallwebrtc/transport.py
@@ -698,7 +698,7 @@ class SmallWebRTCInputTransport(BaseInputTransport):
            message: The application message to process.
        """
        logger.debug(f"Received app message inside SmallWebRTCInputTransport  {message}")
-        await self.broadcast_frame_class(InputTransportMessageFrame, message=message)
+        await self.broadcast_frame(InputTransportMessageFrame, message=message)

    # Add this method similar to DailyInputTransport.request_participant_image
    async def request_participant_image(self, frame: UserImageRequestFrame):
--- a/src/pipecat/turns/user_turn_controller.py
+++ b/src/pipecat/turns/user_turn_controller.py
@@ -11,6 +11,7 @@ from typing import Optional, Type

 from pipecat.frames.frames import (
    Frame,
+    InterimTranscriptionFrame,
    TranscriptionFrame,
    UserStartedSpeakingFrame,
    UserStoppedSpeakingFrame,
@@ -156,7 +157,7 @@ class UserTurnController(BaseObject):
            await self._handle_vad_user_started_speaking(frame)
        elif isinstance(frame, VADUserStoppedSpeakingFrame):
            await self._handle_vad_user_stopped_speaking(frame)
-        elif isinstance(frame, TranscriptionFrame):
+        elif isinstance(frame, (TranscriptionFrame, InterimTranscriptionFrame)):
            await self._handle_transcription(frame)

        for strategy in self._user_turn_strategies.start or []:
@@ -209,8 +210,8 @@ class UserTurnController(BaseObject):
        # The user stopped talking, let's reset the user turn timeout.
        self._user_turn_stop_timeout_event.set()

-    async def _handle_transcription(self, frame: TranscriptionFrame):
-        # We have creceived a transcription, let's reset the user turn timeout.
+    async def _handle_transcription(self, frame: TranscriptionFrame | InterimTranscriptionFrame):
+        # We have received a transcription, let's reset the user turn timeout.
        self._user_turn_stop_timeout_event.set()

    async def _on_push_frame(
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
Mark Backman	d5b34759d7	Update GradiumTTSService to flush instead of ending stream	2026-01-29 18:47:17 -05:00
Mark Backman	0d697d184a	Add delay_in_frames and language support	2026-01-29 18:47:00 -05:00
Mark Backman	717e1ccc01	GradiumSTTService now flushes pending transcripts on VAD stopped detection	2026-01-29 18:47:00 -05:00
Aleix Conchillo Flaqué	671cc8eb74	Merge pull request #3590 from pipecat-ai/aleix/custom-cli-runner-args runner: allow custom CLI arguments	2026-01-29 13:53:27 -08:00
Aleix Conchillo Flaqué	b4dce656f0	Merge pull request #3594 from pipecat-ai/aleix/user-turn-controller-reset-timeout-on-interims UserTurnController: reset user turn timeout with interim transcriptions	2026-01-29 13:12:44 -08:00
Aleix Conchillo Flaqué	253a1d1114	UserTurnController: reset user turn timeout with interim transcriptions	2026-01-29 13:10:10 -08:00
Aleix Conchillo Flaqué	ca613bcb79	Merge pull request #3592 from pipecat-ai/aleix/broadcast-frame-no-deepcopy don't deep copy fields when broadcasting frames	2026-01-29 11:50:20 -08:00
Aleix Conchillo Flaqué	0423acd8a0	STTService: just clear buffer before running run_stt()	2026-01-29 11:47:57 -08:00
Aleix Conchillo Flaqué	7eabaaa0ef	FrameProcessors: do not deepcopy fields when broadcasting frames	2026-01-29 11:47:57 -08:00
Aleix Conchillo Flaqué	bbb8b53d03	runner: allow custom CLI arguments	2026-01-29 10:15:53 -08:00
Aleix Conchillo Flaqué	f3b72e9263	Merge pull request #3585 from pipecat-ai/aleix/improve-piper-tts-support improve Piper TTS support	2026-01-29 08:36:13 -08:00
Mark Backman	b77a50de73	Merge pull request #3529 from lukepayyapilli/fix/llm-timeout-without-retry feat: handle exceptions for BaseOpenAILLMService	2026-01-29 09:12:54 -05:00
Luke Payyapilli	433c1b9b92	add catch-all exception handler per review feedback	2026-01-29 09:07:06 -05:00
Aleix Conchillo Flaqué	bd00587092	changelog: add files for 3585	2026-01-29 00:16:39 -08:00
Aleix Conchillo Flaqué	5a85e27cc5	PiperHttpTTSService: allow passing a voice id	2026-01-29 00:16:39 -08:00
Aleix Conchillo Flaqué	11daa43b1b	TTSService: resample _stream_audio_frames_from_iterator() input audio if needed	2026-01-29 00:16:39 -08:00
Aleix Conchillo Flaqué	875614ff7a	tts: add support for local PiperTTSService	2026-01-29 00:16:39 -08:00
Aleix Conchillo Flaqué	eb1bf1e446	tts: rename PiperTTSService to PiperHttpTTSService	2026-01-28 23:27:32 -08:00
mattie ruth backman	7456a0a55f	Fix the /start and /offer/api proxy endpoints for smallWebRTC to match pipecat cloud behavior WRT requestData	2026-01-28 15:25:13 -05:00
Filipi da Silva Fuchter	27277ed3d9	Merge pull request #3571 from pipecat-ai/filipi/funcion_call_improvements Function call improvements	2026-01-28 14:03:40 -05:00
filipi87	5543bc56f3	Add changelog files for PR #3571 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-28 15:43:59 -03:00
filipi87	c8496dfb8e	Updated the examples which use UserImageRequestFrame to defer the function call result.	2026-01-28 15:39:21 -03:00
filipi87	d3f4cbb620	Providing a way to defer the function call results.	2026-01-28 15:39:06 -03:00
filipi87	c9f922c479	Removed an overridden method that was identical to the parent implementation.	2026-01-28 15:38:40 -03:00
Aleix Conchillo Flaqué	49bd3da26b	Merge pull request #3582 from pipecat-ai/aleix/daily-sample-room-url rename DAILY_SAMPLE_ROOM_URL to DAILY_ROOM_URL	2026-01-28 10:38:14 -08:00
Aleix Conchillo Flaqué	f3ef488925	rename DAILY_SAMPLE_ROOM_URL to DAILY_ROOM_URL	2026-01-28 10:05:27 -08:00
Aleix Conchillo Flaqué	4f08098917	Merge pull request #3580 from Pulkit0729/fix/livekit fix: adding missing livekit transport configs	2026-01-28 10:04:34 -08:00
Pulkit	a7cd5b0322	fix: adding missing livekit transport configs	2026-01-28 23:15:03 +05:30
Aleix Conchillo Flaqué	55dadc9118	tests(genesys): fix formatting	2026-01-28 09:15:42 -08:00
Aleix Conchillo Flaqué	01bbf61e0d	Merge pull request #3500 from ssillerom/feature/genesys_serializer Feature/genesys serializer	2026-01-28 09:09:11 -08:00
ssillerom	10fb77c0e2	added changelog file	2026-01-28 18:07:33 +01:00
ssillerom	2612fae527	ruff linting	2026-01-28 18:02:51 +01:00
ssillerom	c5be67f293	fix: create disconnect message passing output vars	2026-01-28 17:56:21 +01:00
kompfner	312caaba86	Merge pull request #3429 from lukepayyapilli/fix/gemini-live-interrupted-signal feat: handle server_content.interrupted for faster interruptions	2026-01-28 10:25:36 -05:00
Luke Payyapilli	ff0eb6d286	fix: emit ErrorFrame on LLM completion timeout	2026-01-28 09:44:32 -05:00
ssillerom	ef6bbace98	fixes: super init inhereted class to set event hanlders in the construct	2026-01-28 15:40:24 +01:00
Filipi da Silva Fuchter	06ec21387f	Merge pull request #3581 from pipecat-ai/filipi/open_ai_audio_duration Fixed race condition in OpenAIRealtimeLLMService	2026-01-28 07:42:35 -05:00
filipi87	bdae177125	Adding changelog entry for the OpenAiRealtimeLLMService fix.	2026-01-28 08:39:11 -03:00
filipi87	468e159f9b	Fixed race condition in OpenAIRealtimeLLMService that could cause an error when truncating the conversation.	2026-01-28 08:36:31 -03:00
ssillerom	a4acafd3be	feature: added event handlers in constructor and call func in each _handle_* func	2026-01-28 10:54:26 +01:00
ssillerom	105824a372	Merge main into feature/genesys_serializer Incorporates latest changes from main branch including: - AIC filter and VAD updates - STT service improvements - Base serializer changes - Various bug fixes	2026-01-28 10:48:56 +01:00
ssillerom	55e0d4ecc4	ruff fixes done	2026-01-28 08:59:28 +01:00
ssillerom	9102e81cb8	added tests to the PR	2026-01-27 23:39:43 +01:00
ssillerom	d7d8e93a3d	feature: added custom params in closed message to genesys, simplified create_* functions, simplified constructor method and simplified opened message	2026-01-27 23:36:47 +01:00
Mark Backman	bf9b166464	Merge pull request #3575 from pipecat-ai/mb/fix-turn-stopped-event-end-cancel-frame Emit on_assistant_turn_stopped and on_user_turn_stopped from EndFrame…	2026-01-27 14:55:34 -05:00
Mark Backman	e80e0eab29	Emit on_assistant_turn_stopped and on_user_turn_stopped from EndFrame or CancelFrame	2026-01-27 14:50:10 -05:00
Mark Backman	61242e6575	Merge pull request #3574 from pipecat-ai/mb/fix-websocket-close-message-handling Fix WebsocketService infinite loop on graceful server disconnect	2026-01-27 13:53:26 -05:00
Aleix Conchillo Flaqué	8841387121	Merge pull request #3560 from pipecat-ai/aleix/serializer-base-objects FrameSerializer: subclass from BaseObject so we can add events	2026-01-27 09:58:44 -08:00
Aleix Conchillo Flaqué	ee695ae9fe	FrameSerializer: subclass from BaseObject so we can add events	2026-01-27 09:53:46 -08:00
Mark Backman	52012b0fb2	Fix WebsocketService infinite loop on graceful server disconnect	2026-01-27 12:41:28 -05:00
Mark Backman	f7a1c6b719	Merge pull request #3408 from ai-coustics/aic-v2 Add ai-coustics AIC SDK v2 support with model downloading	2026-01-27 10:38:26 -05:00
Gökmen Görgen	6aa77ccc13	group aic related changes in changelog.	2026-01-27 16:22:54 +01:00
Gökmen Görgen	45b7ec4e2c	re-enable `07zd-interruptible-aicoustics.py` in release evals.	2026-01-27 16:18:56 +01:00
Mark Backman	1c434c6ad5	Merge pull request #3562 from speechmatics/fix/smx-ttfs-finals Support TTFS for Speechmatics STT	2026-01-27 08:35:34 -05:00
Mark Backman	4591affba9	Merge pull request #3568 from pipecat-ai/mb/changelog-3536	2026-01-27 07:14:41 -05:00
Sam Sykes	91346f5f37	Add support for `self.request_finalize()` for Pipecat-based VAD.	2026-01-27 10:44:35 +00:00
Filipi da Silva Fuchter	6a66ebe332	Merge pull request #3541 from pipecat-ai/filipi/audio_buffer Refactoring AudioBufferProcessor to fix audio track synchronization.	2026-01-27 05:32:41 -05:00
Filipi da Silva Fuchter	c1d4180042	Merge pull request #3567 from pipecat-ai/filipi/openai_realtime_audio_duration Fixed race condition in OpenAIRealtimeBetaLLMService	2026-01-27 05:30:33 -05:00
Gökmen Görgen	81a53c699c	handle AIC processor init errors gracefully and ensure `_aic_ready` reflects readiness	2026-01-27 11:28:05 +01:00
Sam Sykes	60168f7f69	remove comment	2026-01-26 23:16:43 +00:00
Sam Sykes	23d7608e5f	changelog update	2026-01-26 23:15:30 +00:00
Sam Sykes	99242c0a93	linting updates	2026-01-26 23:14:40 +00:00
Sam Sykes	3a71865cf4	removed old metrics	2026-01-26 23:11:25 +00:00
Mark Backman	ecf2e69f3f	Merge pull request #3536 from surapuramakhil/main LLMAssistantAggregator: preserve non-ASCII characters in JSON output	2026-01-26 16:42:05 -05:00
Mark Backman	febd52274d	Add changelog fragment for PR 3536	2026-01-26 16:42:00 -05:00
Mark Backman	1542d922e7	Merge pull request #3546 from pipecat-ai/pk/changelog-fragment-for-pr-3406 Added a changelog fragment for PR 3406	2026-01-26 16:31:57 -05:00
Paul Kompfner	15d5d1159e	Added a changelog fragment for PR 3406	2026-01-26 16:27:33 -05:00
Mark Backman	884630a6bd	Merge pull request #3559 from pipecat-ai/aleix/transport-broadcast-fixes transports: fix broadcast_frame_class reference	2026-01-26 16:25:31 -05:00
Mark Backman	1cf137c6a8	Merge pull request #3565 from pipecat-ai/markbackman-patch-1	2026-01-26 15:49:35 -05:00
filipi87	98fcfd7c91	Adding changelog entry for the OpenAiRealtimeBetaLLMService fix.	2026-01-26 17:19:08 -03:00
filipi87	2f23f2e39c	Fixed race condition in OpenAIRealtimeBetaLLMService that could cause an error when truncating the conversation.	2026-01-26 17:08:27 -03:00
Mark Backman	9c6b11cecf	Update README links to use absolute URLs	2026-01-26 13:03:39 -05:00
Sam Sykes	fc1444c9d6	Updated changelog	2026-01-26 16:25:37 +00:00
Sam Sykes	ea94939add	update dependency	2026-01-26 16:24:56 +00:00
Sam Sykes	0c69ae6371	Changelog entry.	2026-01-26 16:07:59 +00:00
Sam Sykes	8b88280bb1	Default to using `EXTERNAL` mode.	2026-01-26 15:52:42 +00:00
Sam Sykes	960d0faea5	support `is_eou` for final segment in utterance	2026-01-26 15:48:04 +00:00
Luke Payyapilli	b9390ccb1b	Address review: remove UserStartedSpeakingFrame, add explanatory comment	2026-01-26 10:08:17 -05:00
Mark Backman	061a0dc43d	Merge pull request #3498 from pipecat-ai/mb/azure-tts-8khz-workaround AzureTTSService 8khz workaround	2026-01-26 09:48:22 -05:00
Mark Backman	328bbe069f	Merge pull request #3554 from pipecat-ai/mb/simplify-stt-ttfb Simplify STT finalize handling	2026-01-26 08:00:04 -05:00
Mark Backman	dc32ecc872	Merge pull request #3555 from pipecat-ai/mb/speechmatics-stt-ttfb Align Speechmatics STT TTFB metrics with STT classes	2026-01-26 07:59:34 -05:00
Gökmen Görgen	ca2eb1904f	Merge remote-tracking branch 'origin/aic-v2' into aic-v2	2026-01-26 10:16:23 +01:00
Gökmen Görgen	4bce58f270	update changelog and remove outdated dependency notes	2026-01-26 10:15:15 +01:00
Gökmen Görgen	7572d63f8f	Update src/pipecat/audio/vad/aic_vad.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 10:06:40 +01:00
Gökmen Görgen	3c463c9416	Update src/pipecat/audio/vad/aic_vad.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 10:06:33 +01:00
Gökmen Görgen	bd618d64e3	Update src/pipecat/audio/filters/aic_filter.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 10:06:16 +01:00
Gökmen Görgen	a824660df7	add unit tests for `AICVADAnalyzer` and `AICFilter`.	2026-01-26 09:56:36 +01:00
Gökmen Görgen	58b9019852	bump aic-sdk to 2.0.1 in optional dependencies.	2026-01-26 09:14:16 +01:00
Gökmen Görgen	afcdef8c81	docstring clarification.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	bd92104fb3	clarify voice confidence method behavior in AIC VAD.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	34e9f224a8	Update src/pipecat/audio/vad/aic_vad.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	dca7f3b5b0	add changelog.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	70a85cd192	use path for keeping the consistency between the parameters.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	91e86658b7	force developer to set a license key, it's required.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	0a8588669c	address feedback.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	0e99400148	two dots are rust specific thinks, I'm not sure if it's familiar for Python developers.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	648f20db6d	Update src/pipecat/audio/vad/aic_vad.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	09b5b6b12d	Update src/pipecat/audio/vad/aic_vad.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	0e6a423955	Update src/pipecat/audio/filters/aic_filter.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	dc8972cd94	log optimal number of frames for given sample rate in AICFilter.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	e4e2231958	Update src/pipecat/audio/vad/aic_vad.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	18b3ee743b	replace `os` with `pathlib.Path` in AICFilter for path handling consistency.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	65b8e0e89c	rename `enabled` to `bypass` in AICFilter for clarity.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	b77f8b065f	remove voice gain.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	5fd43faec3	add min speech duration.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	abebcf37bd	address feedback.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	ca4e3c79f9	Update pyproject.toml Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	e8d1bec03b	Update src/pipecat/audio/filters/aic_filter.py Co-authored-by: Andres O. Vela <andresovela@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	f0cc54589e	remove enhancement level parameter from AICFilter.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	22b9aac2ff	use quail model in the example.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	7f86f4ac27	fix class name.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	dcab79753b	even the parameters are fixed, keep aic ready for processing.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	bdded9b026	set SDK ID for telemetry in AIC filter.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	1e1e275fea	address feedback.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	effb6aa8f4	clean up unused imports in audio utils.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	a4a9bae79e	drop v1 support from aic.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	c943ef9261	keep uv.lock as it is.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	f05809520b	Remove outdated AIC Filter and VAD v2 files, migrate to consolidated implementations. Added the new ACIFilter to the same module.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	ec17dc6626	aic-sdk-py v2. # Conflicts: # uv.lock # Conflicts: # examples/foundational/07zd-interruptible-aicoustics.py # pyproject.toml # src/pipecat/audio/filters/aic_filter.py # src/pipecat/audio/vad/aic_vad.py	2026-01-26 08:44:17 +01:00
Gökmen Görgen	4e85e81d9b	Update src/pipecat/audio/filters/aic_filter.py Co-authored-by: Tobias <76444201+Fl1tzi@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	a1cc88a233	Update src/pipecat/audio/filters/aic_filter.py Co-authored-by: Tobias <76444201+Fl1tzi@users.noreply.github.com>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	61a230ec53	Update src/pipecat/audio/filters/aic_filter.py Co-authored-by: Stephan Eckes <stephan@steck.tech>	2026-01-26 08:44:17 +01:00
Gökmen Görgen	a13380b574	clean up unused imports in audio utils.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	2a927189d9	reorganize imports.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	a90c15362c	drop v1 support from aic.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	d3bdd2d246	use new model id.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	465ae4f706	keep uv.lock as it is.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	a0d801b658	Remove outdated AIC Filter and VAD v2 files, migrate to consolidated implementations. Added the new ACIFilter to the same module.	2026-01-26 08:44:17 +01:00
Gökmen Görgen	35919a84e3	aic-sdk-py v2. # Conflicts: # uv.lock	2026-01-26 08:44:17 +01:00
Aleix Conchillo Flaqué	f94a60f381	transports: fix broadcast_frame_class reference	2026-01-25 15:42:09 -08:00
ssillerom	a446bca72d	changes: added OutputTransportUrgentFrame to on closed, removed callback	2026-01-25 21:12:28 +01:00
Sergio Sillero	8ae834366b	Merge branch 'pipecat-ai:main' into feature/genesys_serializer	2026-01-25 21:04:27 +01:00
Mark Backman	a4acc12f91	Align Speechmatics STT TTFB metrics with STT classes	2026-01-24 18:26:34 -05:00
Mark Backman	e93112e76e	Simplify STT finalize handling	2026-01-24 15:28:27 -05:00
Mark Backman	680bcaac66	Merge pull request #3550 from pipecat-ai/mb/update-smart-turn-data-env-var Update env var to PIPECAT_SMART_TURN_LOG_DATA	2026-01-24 13:52:36 -05:00
Mark Backman	d2ac9006a2	Update env var to PIPECAT_SMART_TURN_LOG_DATA	2026-01-24 12:50:42 -05:00
Mark Backman	bcb019e8ab	Add TTFB metrics for STT services (#3495 )	2026-01-23 18:47:34 -05:00
kompfner	4ea546785f	Merge pull request #3406 from omChauhanDev/fix/openrouter-gemini-messages fix(openrouter): handle multiple system messages for Gemini models	2026-01-23 14:53:59 -05:00
filipi87	f128cdd19a	Adding a changelog entry to the AudioBufferProcessor fix.	2026-01-23 16:16:01 -03:00
filipi87	7921bce4af	Refactoring AudioBufferProcessor to fix audio track synchronization.	2026-01-23 16:15:48 -03:00
Luke Payyapilli	cadced3f79	feat: handle server_content.interrupted for faster barge-in response	2026-01-23 10:41:04 -05:00
Akhil	3b3c7aa8cc	LLMAssistantAggregator: preserve non-ASCII characters in JSON output Add ensure_ascii=False to json.dumps() calls for tool call arguments and function call results to prevent unnecessary unicode escaping.	2026-01-22 15:37:44 -06:00
ssillerom	fa5da3b0be	change comments	2026-01-19 20:49:23 +01:00
ssillerom	7e82a0cf49	feature: Genesys AudioHook WebSocket protocol serializer for Pipecat	2026-01-19 20:45:22 +01:00
Mark Backman	0b1a4792b8	Bump to latest azure-cognitiveservices-speech version, 1.47.0	2026-01-19 09:52:28 -05:00
Mark Backman	14bd3b1b32	Set Azure TTS default prosody rate to None	2026-01-19 09:19:57 -05:00
Mark Backman	f733e77496	AzureTTS: work around word ordering issue at 8khz sample rate	2026-01-19 09:13:41 -05:00
Om Chauhan	38506f51f7	fix(openrouter): handle multiple system messages for Gemini models	2026-01-11 21:19:47 +05:30
				`@@ -0,0 +1 @@`
				- Fixed an issue where if you were using `OpenRouterLLMService` with a Gemini model, it wouldn't handle multiple `"system"` messages as expected (and as we do in `GoogleLLMService`), which is to convert subsequent ones into `"user"` messages. Instead, the latest `"system"` message would overwrite the previous ones.
				`@@ -0,0 +1 @@`
				- Updated `AICFilter` and `AICVADAnalyzer` to use aic-sdk ~= 2.0.1.
				`@@ -0,0 +1 @@`
				- Removed deprecated `AICFilter` parameters: `enhancement_level`, `voice_gain`, `noise_gate_enable`.
				`@@ -0,0 +1 @@`
				- Added handling for `server_content.interrupted` signal in the Gemini Live service for faster interruption response in the case where there isn't already turn tracking in the pipeline, e.g. local VAD + context aggregators. When there is already turn tracking in the pipeline, the additional interruption does no harm.
				`@@ -0,0 +1 @@`
				- `SarvamSTTService` now defaults `vad_signals` and `high_vad_sensitivity` to `None` (omitted from connection parameters), improving latency by ~300ms compared to the previous defaults.
				`@@ -0,0 +1 @@`
				- Improved the STT TTFB (Time To First Byte) measurement, reporting the delay between when the user stops speaking and when the final transcription is received. Note: Unlike traditional TTFB which measures from a discrete request, STT services receive continuous audio input—so we measure from speech end to final transcript, which captures the latency that matters for voice AI applications. In support of this change, added `finalized` field to `TranscriptionFrame` to indicate when a transcript is the final result for an utterance.
				`@@ -0,0 +1 @@`
				- Added new `GenesysFrameSerializer` for the Genesys AudioHook WebSocket protocol, enabling bidirectional audio streaming between Pipecat pipelines and Genesys Cloud contact center.
				`@@ -0,0 +1 @@`
				- Fixed OpenAI LLM services to emit `ErrorFrame` on completion timeout, enabling proper error handling and LLMSwitcher failover.
				`@@ -0,0 +1 @@`
				`- Fixed a logging issue where non-ASCII characters (e.g., Japanese, Chinese, etc.) were being unnecessarily escaped to Unicode sequences when function call occurred.`