It intentionally uses Pipecat's FastAPI websocket transport with ProtobufFrameSerializer. The old v3-minimal base64 JSON audio protocol is not supported here.

Shape

FastAPI /ws
-> Pipecat FastAPIWebsocketTransport
-> OpenAI STT
-> LLM context aggregator
-> OpenAI LLM
-> OpenAI TTS
-> Pipecat websocket output

Run

cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8001

Alternatively, put the keys in config.json and pass the file with --config:

uv run python -m engine.main --config ./config.json

Protocols

Pipecat-native endpoint:

ws://localhost:8001/ws

The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with pipecat.frames.protobufs.frames_pb2.

Important defaults:

serializer: ProtobufFrameSerializer
audio input: PCM16 mono
sample rate: 16000
endpoint: /ws

Product endpoint:

ws://localhost:8001/ws-product

This endpoint uses a stable JSON/base64 protocol named va.ws.v1. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.

Start a session:

{
  "type": "session.start",
  "protocol": "va.ws.v1",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1
  }
}

Send audio:

{
  "type": "input.audio",
  "audio": "<base64 pcm_s16le bytes>",
  "sample_rate": 16000,
  "channels": 1
}

The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary avoids the roughly 33% size overhead of base64 encoding, so it is better for latency and bandwidth.
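As a rough illustration of the two framings, the sketch below wraps raw PCM16 bytes in an input.audio message and compares its size with the same bytes sent as a binary message. The helper name build_input_audio is hypothetical and not part of this repository; only the field names are taken from the message shown above.

```python
import base64
import json


def build_input_audio(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> str:
    """Wrap raw pcm_s16le bytes in a va.ws.v1 input.audio JSON message.

    Hypothetical helper for illustration; not part of the engine.
    """
    # PCM16 uses 2 bytes per sample, so a valid payload holds whole sample frames.
    if len(pcm) % (2 * channels) != 0:
        raise ValueError("PCM16 payload must contain whole sample frames")
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
    })


# 20 ms of silence at 16 kHz mono: 320 samples * 2 bytes = 640 bytes.
chunk = bytes(640)
message = build_input_audio(chunk)

binary_size = len(chunk)                  # size if sent as a binary websocket message
json_size = len(message.encode("utf-8"))  # size if sent as a JSON text message
print(binary_size, json_size)
```

The base64 field alone is 856 characters for a 640-byte chunk, before the surrounding JSON keys are counted, which is where the bandwidth difference comes from.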

Stop:

{"type": "session.stop", "reason": "done"}

Cancel:

{"type": "response.cancel"}

Returned bot audio:

{
  "type": "response.audio.delta",
  "protocol": "va.ws.v1",
  "seq": 1,
  "audio": "<base64 pcm_s16le bytes>",
  "bytes": 6400,
  "sample_rate": 16000,
  "channels": 1
}
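A client might decode these events along the following lines. This is a minimal sketch: decode_audio_delta is a hypothetical name, and it assumes that the "bytes" field reports the length of the decoded PCM payload, which the example above suggests but does not state.

```python
import base64
import json
from typing import Tuple


def decode_audio_delta(raw: str) -> Tuple[bytes, float]:
    """Return (pcm bytes, duration in seconds) for a response.audio.delta event."""
    event = json.loads(raw)
    if event.get("type") != "response.audio.delta":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    pcm = base64.b64decode(event["audio"])
    # Assumption: "bytes" is the decoded payload length, so it can act as a sanity check.
    if "bytes" in event and event["bytes"] != len(pcm):
        raise ValueError("declared byte count does not match decoded audio")
    # pcm_s16le uses 2 bytes per sample per channel.
    bytes_per_second = event["sample_rate"] * event["channels"] * 2
    return pcm, len(pcm) / bytes_per_second


# Rebuild an event shaped like the example above, using silence as the payload.
example = json.dumps({
    "type": "response.audio.delta",
    "protocol": "va.ws.v1",
    "seq": 1,
    "audio": base64.b64encode(bytes(6400)).decode("ascii"),
    "bytes": 6400,
    "sample_rate": 16000,
    "channels": 1,
})
pcm, seconds = decode_audio_delta(example)
print(len(pcm), seconds)
```

With the values shown, 6400 bytes at 16000 Hz mono works out to 6400 / 32000 = 0.2 seconds of audio per event.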

Returned transcripts and assistant text:

{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}

response.text.started fires at the start of every assistant turn, whether the turn is a streamed LLM reply or a fixed TTSSpeakFrame greeting. response.text.delta events stream LLM token chunks as they are produced. They arrive ahead of the synthesized audio because the producer sits upstream of the TTS in the pipeline. response.text.final fires when the turn ends and carries the full concatenated assistant text plus an interrupted flag, which is true when an input.text message or a barge-in cut the turn short.
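One way a client could track these three events is sketched below. The AssistantTurn class is hypothetical and only illustrates the ordering described above; it is not code from this repository.

```python
import json
from typing import List, Optional


class AssistantTurn:
    """Accumulates response.text.* events for a single assistant turn."""

    def __init__(self) -> None:
        self.parts: List[str] = []
        self.final_text: Optional[str] = None
        self.interrupted = False

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.text.started":
            # A new turn begins: drop anything left over from the previous one.
            self.parts.clear()
            self.final_text = None
            self.interrupted = False
        elif kind == "response.text.delta":
            self.parts.append(event["text"])
        elif kind == "response.text.final":
            self.final_text = event["text"]
            self.interrupted = event.get("interrupted", False)

    @property
    def streamed_text(self) -> str:
        """Text assembled from the deltas received so far."""
        return "".join(self.parts)


turn = AssistantTurn()
for line in [
    '{"type": "response.text.started"}',
    '{"type": "response.text.delta", "text": "It\'s "}',
    '{"type": "response.text.delta", "text": "sunny in "}',
    '{"type": "response.text.delta", "text": "Berlin."}',
    '{"type": "response.text.final", "text": "It\'s sunny in Berlin.", "interrupted": false}',
]:
    turn.handle(line)

print(turn.streamed_text == turn.final_text, turn.interrupted)
```

Keeping the streamed deltas separate from the final text lets a UI render partial output immediately and then reconcile it once response.text.final arrives.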

Stream A WAV File

Start the server:

cd AI-VideoAssistant-engine-v5-pipecat-minimal
source .venv/bin/activate
export OPENAI_API_KEY=...
uvicorn engine.main:app --host 127.0.0.1 --port 8001

In another terminal, stream a WAV file through the product adapter:

cd AI-VideoAssistant-engine-v5-pipecat-minimal
source .venv/bin/activate
python scripts/stream_wav_product_ws.py \
  ../AI-VideoAssistant-engine-v3-minimal/data/audio_examples/three_utterances_simple.wav \
  --url ws://127.0.0.1:8001/ws-product \
  --save-stereo-wav /tmp/product-conversation.wav

Or stream a WAV file as Pipecat protobuf audio frames:

cd AI-VideoAssistant-engine-v5-pipecat-minimal
source .venv/bin/activate
python scripts/stream_wav_pipecat_ws.py \
  ../AI-VideoAssistant-engine-v3-minimal/data/audio_examples/three_utterances_simple.wav \
  --url ws://127.0.0.1:8001/ws \
  --save-stereo-wav /tmp/pipecat-conversation.wav

The input WAV must be PCM16 mono at 16 kHz. The script sends 20 ms chunks in real time and prints any text, transcription, message, or audio frames returned by the server.
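The format check and the 20 ms chunking can be sketched with the standard library alone. This is not the code of the bundled scripts; iter_pcm_chunks is a hypothetical helper that shows the arithmetic: at 16 kHz mono PCM16, 20 ms is 320 frames, or 640 bytes.

```python
import io
import wave
from typing import Iterator

CHUNK_MS = 20


def iter_pcm_chunks(wav_file, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration PCM16 chunks after checking the WAV format."""
    with wave.open(wav_file, "rb") as wav:
        fmt = (wav.getsampwidth(), wav.getnchannels(), wav.getframerate())
        if fmt != (2, 1, 16000):
            raise ValueError("expected PCM16 mono at 16 kHz")
        frames_per_chunk = wav.getframerate() * chunk_ms // 1000  # 320 frames for 20 ms
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk


# Build one second of silence in memory so the example needs no file on disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(bytes(32000))
buf.seek(0)

chunks = list(iter_pcm_chunks(buf))
print(len(chunks), len(chunks[0]))  # 50 chunks of 640 bytes
```

To approximate real-time streaming, a sender would pause for chunk_ms / 1000 seconds between chunks, for example with asyncio.sleep in an async client.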

Notes

This folder keeps the rewrite minimal on purpose. Dograh's useful pattern is the separation between app entrypoint, service factory, and pipeline builder; its workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.