AI VideoAssistant Engine v5 Pipecat Minimal

This is a Pipecat-based rewrite of AI-VideoAssistant-engine-v3-minimal, kept in its own folder.

It intentionally uses Pipecat's FastAPI websocket transport with ProtobufFrameSerializer. The old v3-minimal base64 JSON audio protocol is not supported here.

Shape

FastAPI /ws
-> Pipecat FastAPIWebsocketTransport
-> OpenAI STT
-> LLM context aggregator
-> OpenAI LLM
-> OpenAI TTS
-> Pipecat websocket output

Run

cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8001

Alternatively, put the keys in config.json and start the engine with the config file:

uv run python -m engine.main --config ./config.json

Protocols

Pipecat-native endpoint:

ws://localhost:8001/ws

The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with pipecat.frames.protobufs.frames_pb2.

Important defaults:

serializer: ProtobufFrameSerializer
audio input: PCM16 mono
sample rate: 16000
endpoint: /ws

Product endpoint:

ws://localhost:8001/ws-product

This endpoint uses a stable JSON/base64 protocol named va.ws.v1. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.

Start a session:

{
  "type": "session.start",
  "protocol": "va.ws.v1",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1
  }
}

Send audio:

{
  "type": "input.audio",
  "audio": "<base64 pcm_s16le bytes>",
  "sample_rate": 16000,
  "channels": 1
}

The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.
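As a rough client-side sketch of the messages above, the helpers below build the session.start and input.audio JSON payloads and split a PCM16 buffer into fixed-size chunks that could be sent either base64-wrapped or as raw binary frames. The function names and the 20 ms chunk size are illustrative assumptions, not part of the engine; only the field names and values come from the examples above.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE = 16000      # Hz, from the session.start example
CHANNELS = 1             # mono
BYTES_PER_SAMPLE = 2     # pcm_s16le is 16 bits per sample


def session_start_message() -> str:
    """Build the va.ws.v1 session.start payload (hypothetical helper)."""
    return json.dumps({
        "type": "session.start",
        "protocol": "va.ws.v1",
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate": SAMPLE_RATE,
            "channels": CHANNELS,
        },
    })


def input_audio_message(pcm: bytes) -> str:
    """Wrap one PCM16 chunk as a base64 input.audio payload (hypothetical helper)."""
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": SAMPLE_RATE,
        "channels": CHANNELS,
    })


def chunk_pcm(pcm: bytes, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive chunks of chunk_ms milliseconds; the last may be shorter."""
    # 16000 samples/s * 1 channel * 2 bytes * 20 ms / 1000 = 640 bytes per chunk
    chunk_bytes = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]
```

Each chunk from chunk_pcm can be passed to input_audio_message for the JSON path, or sent unchanged as a binary websocket message for the lower-overhead path.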

Stop:

{"type": "session.stop", "reason": "done"}

Cancel:

{"type": "response.cancel"}

Returned bot audio:

{
  "type": "response.audio.delta",
  "protocol": "va.ws.v1",
  "seq": 1,
  "audio": "<base64 pcm_s16le bytes>",
  "bytes": 6400,
  "sample_rate": 16000,
  "channels": 1
}
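A client receiving this event might decode it along the following lines. This is a sketch under assumptions: the function name is hypothetical, and it derives the playback duration from the decoded payload rather than trusting the bytes field, since the README does not state whether that field counts decoded PCM bytes. For the example above, 6400 bytes of 16 kHz mono PCM16 would correspond to 6400 / (16000 × 1 × 2) = 0.2 seconds of audio.

```python
import base64
import json
from typing import Tuple

BYTES_PER_SAMPLE = 2  # pcm_s16le


def decode_audio_delta(raw: str) -> Tuple[bytes, float]:
    """Return (pcm_bytes, duration_seconds) for one response.audio.delta event."""
    event = json.loads(raw)
    if event.get("type") != "response.audio.delta":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    pcm = base64.b64decode(event["audio"])
    # Duration is computed from the decoded length and the advertised format.
    bytes_per_second = event["sample_rate"] * event["channels"] * BYTES_PER_SAMPLE
    return pcm, len(pcm) / bytes_per_second
```

Ordering chunks by seq before playback is a reasonable precaution, although the README does not say whether events can arrive out of order.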

Returned transcripts and assistant text:

{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}

response.text.started fires at the start of every assistant turn, whether the turn is a streamed LLM reply or a fixed TTSSpeakFrame greeting. response.text.delta events stream LLM token chunks as they are produced. They arrive ahead of the synthesized audio because the producer sits upstream of the TTS in the pipeline. response.text.final fires when the turn ends and carries the full concatenated assistant text plus an interrupted flag, which is true when an input.text message or a barge-in cut the turn short.
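One way a client could track these events is a small accumulator that resets on response.text.started, appends each delta for live display, and records the authoritative text from response.text.final. The class and attribute names are hypothetical; only the event types and fields come from the examples above.

```python
import json
from typing import List, Tuple


class AssistantTurnTracker:
    """Accumulates response.text.* events for display (illustrative sketch)."""

    def __init__(self) -> None:
        self.active = False                      # True between started and final
        self._parts: List[str] = []              # deltas of the current turn
        self.completed: List[Tuple[str, bool]] = []  # (final text, interrupted)

    @property
    def partial_text(self) -> str:
        """Text streamed so far in the current turn."""
        return "".join(self._parts)

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.text.started":
            self.active = True
            self._parts = []
        elif kind == "response.text.delta":
            self._parts.append(event["text"])
        elif kind == "response.text.final":
            # Prefer the server's concatenated text over the local deltas.
            self.completed.append((event["text"], bool(event.get("interrupted", False))))
            self.active = False
            self._parts = []
        # Other event types (audio, transcripts) are ignored here.
```

Keeping the final text from the server rather than the locally joined deltas means an interrupted turn is recorded exactly as the engine reports it.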

Xfyun TTS

The TTS provider can be switched to iFlytek/Xfyun's online TTS WebSocket API. The engine requests aue: "raw" PCM audio so the returned chunks can be sent through the existing Pipecat/product audio path without MP3 decoding.

"tts": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://tts-api.xfyun.cn/v2/tts",
  "voice": "x4_xiaoyan",
  "aue": "raw",
  "tte": "UTF8",
  "speed": 50,
  "volume": 50,
  "pitch": 50,
  "source_sample_rate_hz": 16000
}

Credentials may also be provided through XFYUN_APP_ID, XFYUN_API_KEY, and XFYUN_API_SECRET.
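The lookup could be sketched as below. The precedence is an assumption made for illustration: this version prefers a non-empty value in the tts config block and falls back to the environment variable, and the engine may resolve them in a different order. The function name is hypothetical.

```python
import os
from typing import Dict, Mapping

# Config field -> environment variable, as named in this README.
XFYUN_ENV_NAMES = {
    "app_id": "XFYUN_APP_ID",
    "api_key": "XFYUN_API_KEY",
    "api_secret": "XFYUN_API_SECRET",
}


def resolve_xfyun_credentials(
    tts_config: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Resolve the three Xfyun credentials (assumed order: config, then env)."""
    resolved: Dict[str, str] = {}
    missing = []
    for field, env_name in XFYUN_ENV_NAMES.items():
        value = tts_config.get(field) or env.get(env_name)
        if value:
            resolved[field] = value
        else:
            missing.append(f"{field} (or {env_name})")
    if missing:
        raise ValueError("missing Xfyun credentials: " + ", ".join(missing))
    return resolved
```

Failing early with the names of the missing fields tends to be easier to debug than an authentication error from the TTS websocket handshake.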

Notes

This folder keeps the rewrite minimal on purpose. Dograh's useful pattern is the separation between app entrypoint, service factory, and pipeline builder; its workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.
