AI VideoAssistant Engine v5 Pipecat Minimal
This is a Pipecat-based rewrite of AI-VideoAssistant-engine-v3-minimal in a separate folder.
It intentionally uses Pipecat's FastAPI websocket transport with ProtobufFrameSerializer.
The old v3-minimal base64 JSON audio protocol is not supported here.
Shape
FastAPI /ws
-> Pipecat FastAPIWebsocketTransport
-> OpenAI STT
-> LLM context aggregator
-> OpenAI LLM
-> OpenAI TTS
-> Pipecat websocket output
Run
cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8001
Alternatively, put the keys in config.json and start the engine with --config:
uv run python -m engine.main --config ./config.json
Protocols
Pipecat-native endpoint:
ws://localhost:8001/ws
The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with pipecat.frames.protobufs.frames_pb2.
Important defaults:
- serializer: ProtobufFrameSerializer
- audio input: PCM16 mono
- sample rate: 16000
- endpoint: /ws
Product endpoint:
ws://localhost:8001/ws-product
This endpoint uses a stable JSON/base64 protocol named va.ws.v1. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.
Start a session:
{
"type": "session.start",
"protocol": "va.ws.v1",
"audio": {
"encoding": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
}
}
Send audio:
{
"type": "input.audio",
"audio": "<base64 pcm_s16le bytes>",
"sample_rate": 16000,
"channels": 1
}
The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.
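As a rough illustration of the client side, the sketch below builds the session.start and input.audio messages shown above using only the Python standard library. The helper names (session_start, input_audio, chunk_pcm) and the 20 ms chunk size are assumptions made for this example; the engine does not prescribe a chunk duration here.

```python
import base64
import json

PROTOCOL = "va.ws.v1"
BYTES_PER_SAMPLE = 2  # pcm_s16le: 16-bit little-endian samples


def session_start(sample_rate: int = 16000, channels: int = 1) -> str:
    """Build the session.start message as a JSON string."""
    return json.dumps({
        "type": "session.start",
        "protocol": PROTOCOL,
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
            "channels": channels,
        },
    })


def input_audio(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> str:
    """Wrap one chunk of raw PCM16 bytes in an input.audio message."""
    if len(pcm) % (BYTES_PER_SAMPLE * channels) != 0:
        raise ValueError("PCM16 payload must contain whole sample frames")
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
    })


def chunk_pcm(pcm: bytes, chunk_ms: int = 20, sample_rate: int = 16000, channels: int = 1):
    """Yield fixed-duration slices of a PCM16 buffer (chunk_ms is an assumed value)."""
    step = sample_rate * channels * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]
```

At 16000 Hz mono, a 20 ms chunk is 640 bytes. Each chunk can be sent either as input_audio(chunk) in a text frame or as the raw bytes in a binary frame.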
Stop:
{"type": "session.stop", "reason": "done"}
Cancel:
{"type": "response.cancel"}
Returned bot audio:
{
"type": "response.audio.delta",
"protocol": "va.ws.v1",
"seq": 1,
"audio": "<base64 pcm_s16le bytes>",
"bytes": 6400,
"sample_rate": 16000,
"channels": 1
}
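A client might decode such an event as sketched below. The function name decode_audio_delta is hypothetical, and the check assumes that the bytes field reports the length of the decoded PCM payload rather than the length of the base64 string; adjust it if the engine defines the field differently.

```python
import base64
import json


def decode_audio_delta(raw: str) -> tuple[bytes, float]:
    """Return (pcm_bytes, duration_seconds) for one response.audio.delta event."""
    event = json.loads(raw)
    if event.get("type") != "response.audio.delta":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")

    pcm = base64.b64decode(event["audio"])

    # Assumption: "bytes" is the decoded payload size.
    if "bytes" in event and event["bytes"] != len(pcm):
        raise ValueError("declared byte count does not match decoded audio")

    # pcm_s16le uses 2 bytes per sample per channel.
    frame_bytes = 2 * event["channels"]
    seconds = len(pcm) / (frame_bytes * event["sample_rate"])
    return pcm, seconds
```

Under that assumption, the 6400-byte example above corresponds to 3200 mono samples, or 0.2 seconds of audio at 16000 Hz.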
Returned transcripts and assistant text:
{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}
response.text.started fires at the start of every assistant turn (LLM
streaming reply, or a fixed TTSSpeakFrame greeting).
response.text.delta events stream LLM token chunks as they're produced,
ahead of the synthesized audio, because the producer sits upstream of
the TTS in the pipeline. response.text.final fires when the turn ends,
carrying the full concatenated assistant text and an interrupted flag
(true when an input.text message or a barge-in cut the turn short).
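One way a client could track these events is to accumulate deltas for live display and treat response.text.final as the authoritative text. The TurnText class below is a hypothetical helper written for this example, not part of the engine.

```python
import json


class TurnText:
    """Collect the text events of one assistant turn."""

    def __init__(self) -> None:
        self.parts: list[str] = []
        self.final: str | None = None
        self.interrupted = False

    def feed(self, raw: str) -> None:
        """Apply one JSON event; event types not handled here are ignored."""
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.text.started":
            # A new turn begins: drop anything left from the previous one.
            self.parts = []
            self.final = None
            self.interrupted = False
        elif kind == "response.text.delta":
            self.parts.append(event["text"])
        elif kind == "response.text.final":
            self.final = event["text"]
            self.interrupted = bool(event.get("interrupted", False))

    @property
    def streamed(self) -> str:
        """Text assembled from the deltas received so far."""
        return "".join(self.parts)
```

Feeding the sample events above leaves streamed and final both equal to "It's sunny in Berlin." with interrupted set to False.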
Xfyun TTS
The TTS provider can be switched to iFlytek/Xfyun's online TTS WebSocket API.
The engine requests aue: "raw" PCM audio so the returned chunks can be sent
through the existing Pipecat/product audio path without MP3 decoding.
"tts": {
"provider": "xfyun",
"app_id": "your_xfyun_app_id",
"api_key": "your_xfyun_api_key",
"api_secret": "your_xfyun_api_secret",
"base_url": "wss://tts-api.xfyun.cn/v2/tts",
"voice": "x4_xiaoyan",
"aue": "raw",
"tte": "UTF8",
"speed": 50,
"volume": 50,
"pitch": 50,
"source_sample_rate_hz": 16000
}
Credentials may also be provided through XFYUN_APP_ID, XFYUN_API_KEY, and
XFYUN_API_SECRET.
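The sketch below shows one possible way to combine the two credential sources. It assumes that a non-empty value in the "tts" config block takes precedence over the environment variable; the precedence the engine actually applies is not stated here, and resolve_xfyun_credentials is a hypothetical name.

```python
import os

# Config field name -> environment variable name.
ENV_NAMES = {
    "app_id": "XFYUN_APP_ID",
    "api_key": "XFYUN_API_KEY",
    "api_secret": "XFYUN_API_SECRET",
}


def resolve_xfyun_credentials(tts_config: dict, env=None) -> dict:
    """Return the three Xfyun credentials, preferring config over environment.

    The config-first precedence is an assumption made for this example.
    """
    env = os.environ if env is None else env
    resolved = {}
    for field, env_name in ENV_NAMES.items():
        value = tts_config.get(field) or env.get(env_name)
        if not value:
            raise ValueError(
                f"missing Xfyun credential: set {field!r} in config or {env_name}"
            )
        resolved[field] = value
    return resolved
```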
Notes
This folder keeps the rewrite minimal on purpose. Dograh's useful pattern is the separation between app entrypoint, service factory, and pipeline builder; its workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.