AI VideoAssistant Engine v5 Pipecat Minimal
This is a Pipecat-based rewrite of AI-VideoAssistant-engine-v3-minimal in a separate folder.
It intentionally uses Pipecat's FastAPI websocket transport with ProtobufFrameSerializer.
The old v3-minimal base64 JSON audio protocol is not supported here.
Shape
FastAPI /ws
-> Pipecat FastAPIWebsocketTransport
-> OpenAI STT
-> LLM context aggregator
-> OpenAI LLM
-> OpenAI TTS
-> Pipecat websocket output
Run
cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8000
Or pass keys directly in config.json.
uv run python -m engine.main --config ./config.json
Browser demo (served from the same process when server.serve_webpage is
true in config.json):
http://localhost:8000/voice-demo/
See examples/webpage/README.md for details.
Protocols
Pipecat-native endpoint:
ws://localhost:8000/ws
The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with pipecat.frames.protobufs.frames_pb2.
Important defaults:
- serializer: ProtobufFrameSerializer
- audio input: PCM16 mono
- sample rate: 16000
- endpoint: /ws
Optional input audio filtering can be enabled through audio_filter. See
docs/deepfilternet.md for the DeepFilterNet real-time filter setup.
Product endpoint:
ws://localhost:8000/ws-product?chatId=customer-chat-001
This endpoint uses a stable JSON/base64 protocol named va.ws.v1. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.
For FastGPT sessions, pass chatId on the websocket URL. The engine uses
that id for FastGPT server-side memory; if the id has existing FastGPT
records, the assistant greets with 欢迎回来继续对话 ("Welcome back, let's
continue the conversation"); otherwise it uses the FastGPT app opener.
Start a session:
{
"type": "session.start",
"protocol": "va.ws.v1",
"chatId": "customer-chat-001",
"audio": {
"encoding": "pcm_s16le",
"sample_rate": 16000,
"channels": 1
}
}
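As a minimal client-side sketch, the product URL and the first message could be built as follows. The helper names product_ws_url and session_start are hypothetical and not part of the engine; only the field names and values come from the example above.

```python
import json
from urllib.parse import urlencode


def product_ws_url(base: str, chat_id: str) -> str:
    """Append chatId as a query parameter to the product endpoint."""
    return f"{base}?{urlencode({'chatId': chat_id})}"


def session_start(chat_id: str, sample_rate: int = 16000) -> str:
    """Serialize a session.start message shaped like the example above."""
    return json.dumps(
        {
            "type": "session.start",
            "protocol": "va.ws.v1",
            "chatId": chat_id,
            "audio": {
                "encoding": "pcm_s16le",
                "sample_rate": sample_rate,
                "channels": 1,
            },
        }
    )


url = product_ws_url("ws://localhost:8000/ws-product", "customer-chat-001")
first_message = session_start("customer-chat-001")
```

The resulting url is what a websocket client would connect to, and first_message is the first text frame it would send.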
Send audio:
{
"type": "input.audio",
"audio": "<base64 pcm_s16le bytes>",
"sample_rate": 16000,
"channels": 1
}
The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.
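The trade-off between the two audio paths can be seen in a short sketch. It builds one 20 ms chunk of PCM16 silence and prepares it both as an input.audio JSON message and as a raw binary payload. The helper names are hypothetical, and the 20 ms chunk size is an illustrative choice, not an engine requirement.

```python
import base64
import json
import struct


def pcm16_silence(num_samples: int) -> bytes:
    """Return num_samples of little-endian 16-bit mono silence."""
    return struct.pack(f"<{num_samples}h", *([0] * num_samples))


def input_audio_message(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> str:
    """Wrap raw PCM16 bytes in an input.audio JSON message."""
    return json.dumps(
        {
            "type": "input.audio",
            "audio": base64.b64encode(pcm).decode("ascii"),
            "sample_rate": sample_rate,
            "channels": channels,
        }
    )


chunk = pcm16_silence(320)          # 320 samples = 20 ms at 16 kHz mono
json_frame = input_audio_message(chunk)  # send as a text websocket message
binary_frame = chunk                # send as a binary websocket message

# base64 inflates the payload by roughly one third before JSON overhead.
overhead_bytes = len(json_frame.encode("utf-8")) - len(binary_frame)
```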
Send a camera snapshot for vision-capable LLM replies:
{
"type": "input.image",
"image": "<base64 jpeg/png/webp bytes>",
"mime_type": "image/jpeg",
"width": 640,
"height": 360,
"text": "Answer using this camera image.",
"append_to_context": true
}
input.image appends the image to the Pipecat LLM context as a
UserImageRawFrame and immediately triggers the LLM. The reply returns through
the existing response.text.* and response.audio.* events. Prefer occasional
compressed camera snapshots over continuous video frames.
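One way to follow the "occasional snapshots" advice on the client is a simple time-based throttle. This is a sketch under assumptions: SnapshotThrottle, input_image_message, and the 5-second interval are all hypothetical, and the image bytes would come from the application's own camera capture code.

```python
import base64
import json
from typing import Optional


class SnapshotThrottle:
    """Allow at most one snapshot per min_interval_sec (hypothetical helper)."""

    def __init__(self, min_interval_sec: float) -> None:
        self.min_interval_sec = min_interval_sec
        self._last_sent: Optional[float] = None

    def should_send(self, now: float) -> bool:
        if self._last_sent is not None and now - self._last_sent < self.min_interval_sec:
            return False
        self._last_sent = now
        return True


def input_image_message(image_bytes: bytes, width: int, height: int, text: str) -> str:
    """Wrap compressed JPEG bytes in an input.image message."""
    return json.dumps(
        {
            "type": "input.image",
            "image": base64.b64encode(image_bytes).decode("ascii"),
            "mime_type": "image/jpeg",
            "width": width,
            "height": height,
            "text": text,
            "append_to_context": True,
        }
    )


throttle = SnapshotThrottle(min_interval_sec=5.0)
# Timestamps in seconds; only the snapshots at 0.0, 5.0, and 12.0 pass.
decisions = [throttle.should_send(t) for t in (0.0, 1.0, 4.9, 5.0, 12.0)]
```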
Stop:
{"type": "session.stop", "reason": "done"}
Cancel:
{"type": "response.cancel"}
Returned bot audio:
{
"type": "response.audio.delta",
"protocol": "va.ws.v1",
"seq": 1,
"audio": "<base64 pcm_s16le bytes>",
"bytes": 6400,
"sample_rate": 16000,
"channels": 1
}
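A client can decode such an event and work out how much playback time it carries. The sketch below assumes 16-bit samples (as pcm_s16le implies) and uses the length of the decoded payload rather than the bytes field; decode_audio_delta is a hypothetical helper name.

```python
import base64
import json
from typing import Tuple


def decode_audio_delta(raw: str) -> Tuple[bytes, float]:
    """Return (pcm_bytes, duration_sec) for a response.audio.delta event."""
    event = json.loads(raw)
    pcm = base64.b64decode(event["audio"])
    bytes_per_frame = 2 * event["channels"]  # 2 bytes per 16-bit sample
    duration_sec = len(pcm) / (bytes_per_frame * event["sample_rate"])
    return pcm, duration_sec


example = json.dumps(
    {
        "type": "response.audio.delta",
        "protocol": "va.ws.v1",
        "seq": 1,
        "audio": base64.b64encode(bytes(6400)).decode("ascii"),
        "bytes": 6400,
        "sample_rate": 16000,
        "channels": 1,
    }
)

# 6400 bytes / (2 bytes * 1 channel * 16000 Hz) = 0.2 s of audio
pcm, duration_sec = decode_audio_delta(example)
```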
Returned transcripts and assistant text:
{"type": "input.transcript.interim", "text": "What's the"}
{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.state", "state": "speaking"}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}
response.text.started fires at the start of every assistant turn (LLM
streaming reply, or a fixed TTSSpeakFrame greeting).
response.text.delta events stream LLM token chunks as they're produced,
ahead of the synthesized audio, because the producer sits upstream of
the TTS in the pipeline. response.text.final fires when the turn ends,
carrying the full concatenated assistant text and an interrupted flag
(true when an input.text message or a barge-in cut the turn short).
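The event sequence above could be consumed on the client with a small collector like the following. TurnTextCollector is a hypothetical name; the sketch only relies on the event types and fields shown in the examples.

```python
import json
from typing import List, Optional


class TurnTextCollector:
    """Accumulate response.text.* events for one assistant turn."""

    def __init__(self) -> None:
        self.parts: List[str] = []
        self.final_text: Optional[str] = None
        self.interrupted: Optional[bool] = None

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.text.started":
            # A new turn begins: drop anything left from the previous one.
            self.parts = []
            self.final_text = None
            self.interrupted = None
        elif kind == "response.text.delta":
            self.parts.append(event["text"])
        elif kind == "response.text.final":
            self.final_text = event["text"]
            self.interrupted = event["interrupted"]

    @property
    def streamed_text(self) -> str:
        return "".join(self.parts)


collector = TurnTextCollector()
for line in [
    '{"type": "response.text.started"}',
    '{"type": "response.text.delta", "text": "It\'s "}',
    '{"type": "response.text.delta", "text": "sunny in "}',
    '{"type": "response.text.delta", "text": "Berlin."}',
    '{"type": "response.text.final", "text": "It\'s sunny in Berlin.", "interrupted": false}',
]:
    collector.handle(line)
```

A UI would typically render streamed_text while deltas arrive and swap in final_text once the turn ends.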
When agent.response_state.enabled is true, an LLM response that starts with
<state>...</state> emits the tag body as response.state before the
remaining assistant text is streamed and spoken. If the tag is missing or
malformed, the original response text is streamed unchanged.
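The prefix rule can be illustrated with a small function that works on a complete response string. This is only a sketch of the described behavior, not the engine's implementation: the engine operates on a token stream, and the regular expression, the whitespace tolerance, and the name split_state_prefix are assumptions.

```python
import re
from typing import Optional, Tuple

# Matches a leading <state>...</state> tag, allowing surrounding whitespace.
_STATE_PREFIX = re.compile(r"^\s*<state>(.*?)</state>\s*", re.DOTALL)


def split_state_prefix(text: str) -> Tuple[Optional[str], str]:
    """Return (state, remaining_text); state is None if the tag is absent or malformed."""
    match = _STATE_PREFIX.match(text)
    if match is None:
        return None, text  # stream the original text unchanged
    return match.group(1).strip(), text[match.end():]


with_tag = split_state_prefix("<state>speaking</state>It's sunny in Berlin.")
malformed = split_state_prefix("<state>speaking It's sunny in Berlin.")
```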
Turn detection
User-turn segmentation (VAD thresholds + how long to wait after silence before declaring the turn done) is configurable per environment:
"turn": {
"vad": {
"confidence": 0.7,
"start_secs": 0.2,
"stop_secs": 0.6,
"min_volume": 0.6
},
"interruption_min_chars": 3,
"interruption_use_interim": true,
"interruption_short_replies": ["是的", "行", "可以"],
"user_speech_timeout_sec": 1.0
}
- vad.* maps directly to pipecat.audio.vad.vad_analyzer.VADParams and controls the Silero VAD. stop_secs is the duration of silence required before VAD reports the user stopped speaking; raise it if VAD is cutting users off mid-clause, lower it for snappier turn-taking.
- interruption_min_chars, interruption_use_interim, and interruption_short_replies configure the custom turn-start gate used while the assistant is speaking. Short replies in the allowlist (for example, 是的, 行, 可以) can barge in immediately; other text must contain at least interruption_min_chars countable characters after punctuation and spaces are removed. This keeps common yes/no answers while filtering brief background speech.
- user_speech_timeout_sec is the additional grace window (used by SpeechTimeoutUserTurnStopStrategy) during which the user may resume speaking before the aggregator finalizes the turn. The timer is re-armed every time the user resumes, so brief mid-sentence pauses do not split one utterance into multiple LLM turns.
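The turn-start gate rule described above can be sketched as a pure function. This is an illustration under assumptions, not the engine's code: it treats any Unicode punctuation category (P*) as punctuation, compares the stripped text against the allowlist, and does not model interruption_use_interim. The function names are hypothetical.

```python
import unicodedata
from typing import Sequence


def strip_uncounted(text: str) -> str:
    """Remove whitespace and Unicode punctuation from a transcript."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def may_interrupt(
    text: str,
    min_chars: int = 3,
    short_replies: Sequence[str] = ("是的", "行", "可以"),
) -> bool:
    """Decide whether transcript text may barge in while the assistant speaks."""
    stripped = strip_uncounted(text)
    if stripped in short_replies:
        return True  # allowlisted short reply passes immediately
    return len(stripped) >= min_chars


results = {
    "行。": may_interrupt("行。"),                # allowlisted after stripping
    "嗯": may_interrupt("嗯"),                    # too short, not allowlisted
    "等一下，我想问": may_interrupt("等一下，我想问"),  # six countable characters
}
```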
The total "user pause before turn ends" budget is approximately
vad.stop_secs + user_speech_timeout_sec. The repo defaults are tuned
slightly more conservatively than upstream Pipecat so that streaming
ASR services (Xfyun in particular) do not produce many short fragments
per logical utterance. Setting this stop strategy explicitly also replaces
Pipecat's default Smart Turn v3 analyzer, so the engine no longer loads the
smart-turn-v3.*-cpu.onnx model at startup.
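To make the pause budget concrete, the following sketch groups VAD speech segments into logical turns using the approximate rule above. It is a simplified offline model, not the engine's timer logic: group_into_turns and the sample segment times are hypothetical, and real timing also depends on VAD and ASR latency.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (speech_start_sec, speech_end_sec)


def group_into_turns(
    segments: List[Segment],
    stop_secs: float = 0.6,
    user_speech_timeout_sec: float = 1.0,
) -> List[List[Segment]]:
    """Merge segments whose gap is shorter than the approximate pause budget."""
    budget = stop_secs + user_speech_timeout_sec
    turns: List[List[Segment]] = []
    for start, end in segments:
        if turns and start - turns[-1][-1][1] < budget:
            turns[-1].append((start, end))  # user resumed in time: same turn
        else:
            turns.append([(start, end)])    # pause exceeded the budget: new turn
    return turns


# A 0.8 s pause stays inside the 1.6 s budget; a 2.5 s pause does not.
turns = group_into_turns([(0.0, 1.2), (2.0, 3.5), (6.0, 7.0)])
```

With the sample config values this yields two turns: the first two segments together, then the last segment on its own.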
Xfyun ASR
The STT provider can be switched to iFlytek/Xfyun's streaming voice dictation
WebSocket API. The engine opens the Xfyun websocket when Pipecat VAD detects
the user has started speaking, keeps it open across brief pauses, and closes it
only when Pipecat's user-turn strategy declares the logical turn complete. It
sends PCM chunks as encoding: "raw" and emits input.transcript.interim
events with the current full interim transcript as Xfyun results arrive,
followed by the existing input.transcript.final event.
"stt": {
"provider": "xfyun",
"app_id": "your_xfyun_app_id",
"api_key": "your_xfyun_api_key",
"api_secret": "your_xfyun_api_secret",
"base_url": "wss://iat-api.xfyun.cn/v2/iat",
"language": "zh_cn",
"domain": "iat",
"accent": "mandarin",
"encoding": "raw",
"frame_size": 1280,
"timeout_sec": 10.0
}
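As a rough guide to what frame_size means for pacing, the sketch below assumes frame_size is measured in bytes of 16 kHz, 16-bit mono PCM; the source does not state the unit, so treat this as an assumption. Under that assumption, 1280 bytes corresponds to 40 ms of audio. The helper names are hypothetical.

```python
from typing import List


def frame_duration_ms(frame_size: int = 1280, sample_rate: int = 16000, bytes_per_sample: int = 2) -> float:
    """Duration of one frame of mono PCM, assuming frame_size is in bytes."""
    return frame_size * 1000 / (sample_rate * bytes_per_sample)


def split_frames(pcm: bytes, frame_size: int = 1280) -> List[bytes]:
    """Split a PCM buffer into frame_size-byte chunks; the last may be shorter."""
    return [pcm[i : i + frame_size] for i in range(0, len(pcm), frame_size)]


one_second = bytes(16000 * 2)        # 1 s of 16 kHz 16-bit mono silence
frames = split_frames(one_second)    # 32000 / 1280 = 25 full frames
per_frame_ms = frame_duration_ms()   # 1280 * 1000 / 32000 = 40.0
```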
Credentials may also be provided through XFYUN_APP_ID, XFYUN_API_KEY, and
XFYUN_API_SECRET.
Xfyun TTS
The TTS provider can be switched to iFlytek/Xfyun's online TTS WebSocket API.
The engine requests aue: "raw" PCM audio so the returned chunks can be sent
through the existing Pipecat/product audio path without MP3 decoding.
"tts": {
"provider": "xfyun",
"app_id": "your_xfyun_app_id",
"api_key": "your_xfyun_api_key",
"api_secret": "your_xfyun_api_secret",
"base_url": "wss://tts-api.xfyun.cn/v2/tts",
"voice": "x4_xiaoyan",
"aue": "raw",
"tte": "UTF8",
"speed": 50,
"volume": 50,
"pitch": 50,
"source_sample_rate_hz": 16000
}
Credentials may also be provided through XFYUN_APP_ID, XFYUN_API_KEY, and
XFYUN_API_SECRET.
Notes
This folder keeps the rewrite minimal on purpose. Dograh's useful pattern is the separation between app entrypoint, service factory, and pipeline builder; its workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.