AI VideoAssistant Engine v5 Pipecat Minimal
This is a Pipecat-based rewrite of AI-VideoAssistant-engine-v3-minimal in a separate folder.
It intentionally uses Pipecat's FastAPI websocket transport with ProtobufFrameSerializer.
The old v3-minimal base64 JSON audio protocol is not supported here.
Shape
FastAPI /ws
-> Pipecat FastAPIWebsocketTransport
-> OpenAI STT
-> LLM context aggregator
-> OpenAI LLM
-> OpenAI TTS
-> Pipecat websocket output
Run
cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8001
Or pass keys directly in config.json.
uv run python -m engine.main --config ./config.json
Protocols
Pipecat-native endpoint:
ws://localhost:8001/ws
The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with pipecat.frames.protobufs.frames_pb2.
Important defaults:
- serializer: ProtobufFrameSerializer
- audio input: PCM16 mono
- sample rate: 16000
- endpoint: /ws
Product endpoint:
ws://localhost:8001/ws-product
This endpoint uses a stable JSON/base64 protocol named va.ws.v1. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.
Start a session:
{
"type": "session.start",
"protocol": "va.ws.v1",
"audio": {
"encoding": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
}
}
Send audio:
{
"type": "input.audio",
"audio": "<base64 pcm_s16le bytes>",
"sample_rate": 16000,
"channels": 1
}
The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.
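The two client-to-server messages above could be built as in the following sketch. The helper names session_start and input_audio are hypothetical and are not part of this repository; only the JSON field names and values come from the examples above.

```python
import base64
import json

PROTOCOL = "va.ws.v1"  # protocol name taken from the session.start example


def session_start(sample_rate: int = 16000, channels: int = 1) -> str:
    """Build a session.start message (hypothetical helper)."""
    return json.dumps({
        "type": "session.start",
        "protocol": PROTOCOL,
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
            "channels": channels,
        },
    })


def input_audio(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> str:
    """Wrap raw PCM16 little-endian bytes in an input.audio message (hypothetical helper)."""
    return json.dumps({
        "type": "input.audio",
        # JSON cannot carry raw bytes, so the PCM payload is base64-encoded.
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
    })
```

Each returned string would be sent as one text websocket message. For the binary path, the same PCM bytes would be sent as a binary message with no JSON wrapper.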
Stop:
{"type": "session.stop", "reason": "done"}
Cancel:
{"type": "response.cancel"}
Returned bot audio:
{
"type": "response.audio.delta",
"protocol": "va.ws.v1",
"seq": 1,
"audio": "<base64 pcm_s16le bytes>",
"bytes": 6400,
"sample_rate": 16000,
"channels": 1
}
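A client might decode a response.audio.delta message roughly as follows. This is a sketch, assuming the payload is always 16-bit PCM (2 bytes per sample) as in the example above; decode_audio_delta is a hypothetical name.

```python
import base64
import json

BYTES_PER_SAMPLE = 2  # pcm_s16le: 16-bit samples (assumption based on the examples)


def decode_audio_delta(raw: str) -> tuple[bytes, float]:
    """Return (pcm_bytes, duration_seconds) for one response.audio.delta message."""
    msg = json.loads(raw)
    if msg.get("type") != "response.audio.delta":
        raise ValueError(f"unexpected message type: {msg.get('type')!r}")
    pcm = base64.b64decode(msg["audio"])
    bytes_per_second = msg["sample_rate"] * msg["channels"] * BYTES_PER_SAMPLE
    return pcm, len(pcm) / bytes_per_second
```

With the values in the example above, 6400 bytes at 16000 Hz mono works out to 6400 / 32000 = 0.2 seconds of audio per message.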
Returned transcripts and assistant text:
{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}
response.text.started fires at the start of every assistant turn (LLM
streaming reply, or a fixed TTSSpeakFrame greeting).
response.text.delta events stream LLM token chunks as they're produced,
ahead of the synthesized audio, because the producer sits upstream of
the TTS in the pipeline. response.text.final fires when the turn ends,
carrying the full concatenated assistant text and an interrupted flag
(true when an input.text or barge-in cut the turn short).
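The event sequence described above can be consumed with a small accumulator. The sketch below is illustrative only: the TurnText class is hypothetical, and it assumes a missing interrupted field means false.

```python
import json


class TurnText:
    """Collect the text events of one assistant turn (hypothetical helper)."""

    def __init__(self) -> None:
        self.parts: list[str] = []
        self.final: str | None = None
        self.interrupted = False

    def feed(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "response.text.started":
            # A new turn begins: discard any state from the previous one.
            self.parts, self.final, self.interrupted = [], None, False
        elif kind == "response.text.delta":
            self.parts.append(msg["text"])
        elif kind == "response.text.final":
            self.final = msg["text"]
            self.interrupted = bool(msg.get("interrupted", False))

    @property
    def text(self) -> str:
        # Prefer the server's final text; fall back to the deltas seen so far.
        return self.final if self.final is not None else "".join(self.parts)
```

Reading text before response.text.final arrives gives the partial reply, which is useful for showing captions while audio is still being synthesized.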
Stream A WAV File
Start the server:
cd AI-VideoAssistant-engine-v5-pipecat-minimal
source .venv/bin/activate
export OPENAI_API_KEY=...
uvicorn engine.main:app --host 127.0.0.1 --port 8001
In another terminal, stream a WAV file through the product adapter:
cd AI-VideoAssistant-engine-v5-pipecat-minimal
source .venv/bin/activate
python scripts/stream_wav_product_ws.py \
../AI-VideoAssistant-engine-v3-minimal/data/audio_examples/three_utterances_simple.wav \
--url ws://127.0.0.1:8001/ws-product \
--save-stereo-wav /tmp/product-conversation.wav
Or stream a WAV file as Pipecat protobuf audio frames:
cd AI-VideoAssistant-engine-v5-pipecat-minimal
source .venv/bin/activate
python scripts/stream_wav_pipecat_ws.py \
../AI-VideoAssistant-engine-v3-minimal/data/audio_examples/three_utterances_simple.wav \
--url ws://127.0.0.1:8001/ws \
--save-stereo-wav /tmp/pipecat-conversation.wav
The input WAV must be PCM16 mono at 16 kHz. The script sends 20 ms chunks in real time and prints any text, transcription, message, or audio frames returned by the server.
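For reference, a 20 ms chunk of PCM16 mono audio at 16 kHz is 16000 * 0.020 = 320 frames, or 640 bytes. The sketch below shows one way to validate a WAV file and split it into such chunks using only the standard library. It is not the code of the scripts above, and iter_pcm_chunks is a hypothetical name.

```python
import wave


def iter_pcm_chunks(source, chunk_ms: int = 20):
    """Yield raw PCM chunks of chunk_ms milliseconds from a PCM16 mono 16 kHz WAV.

    source may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        if fmt != (1, 2, 16000):
            raise ValueError(f"expected PCM16 mono at 16 kHz, got {fmt}")
        frames_per_chunk = wav.getframerate() * chunk_ms // 1000  # 320 at 20 ms
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data  # the last chunk may be shorter than the rest
```

To send in real time, a client would pause for chunk_ms milliseconds between chunks, for example with asyncio.sleep(0.02).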
Notes
This folder keeps the rewrite minimal on purpose. The one pattern kept from Dograh is the separation between app entrypoint, service factory, and pipeline builder; Dograh's workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.