# AI VideoAssistant Engine v5 Pipecat Minimal

This is a Pipecat-based rewrite of `AI-VideoAssistant-engine-v3-minimal` in a separate folder.

It intentionally uses Pipecat's FastAPI websocket transport with `ProtobufFrameSerializer`. The old v3-minimal base64 JSON audio protocol is not supported here.

## Shape

```text
FastAPI /ws
  -> Pipecat FastAPIWebsocketTransport
  -> OpenAI STT
  -> LLM context aggregator
  -> OpenAI LLM
  -> OpenAI TTS
  -> Pipecat websocket output
```

## Run

```bash
cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8000
```

Or pass keys directly in `config.json`.

```bash
uv run python -m engine.main --config ./config.json
```

Browser demo (served from the same process when `server.serve_webpage` is true in `config.json`):

```text
http://localhost:8000/voice-demo/
```

See `examples/webpage/README.md` for details.

## Protocols

Pipecat-native endpoint:

```text
ws://localhost:8000/ws
```

The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with `pipecat.frames.protobufs.frames_pb2`.

Important defaults:

- serializer: `ProtobufFrameSerializer`
- audio input: PCM16 mono
- sample rate: `16000`
- endpoint: `/ws`

Optional input audio filtering can be enabled through `audio_filter`. See `docs/deepfilternet.md` for the DeepFilterNet real-time filter setup.

Product endpoint:

```text
ws://localhost:8000/ws-product?chatId=customer-chat-001
```

This endpoint uses a stable JSON/base64 protocol named `va.ws.v1`. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.

For FastGPT sessions, pass `chatId` on the websocket URL.
The engine uses that id for FastGPT server-side memory: if the id has existing FastGPT records, the assistant greets with `欢迎回来继续对话` ("Welcome back, let's continue the conversation"); otherwise it uses the FastGPT app opener.

Start a session:

```json
{
  "type": "session.start",
  "protocol": "va.ws.v1",
  "chatId": "customer-chat-001",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1
  }
}
```

Send audio:

```json
{
  "type": "input.audio",
  "audio": "",
  "sample_rate": 16000,
  "channels": 1
}
```

The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.

Send a camera snapshot for vision-capable LLM replies:

```json
{
  "type": "input.image",
  "image": "",
  "mime_type": "image/jpeg",
  "width": 640,
  "height": 360,
  "text": "Answer using this camera image.",
  "append_to_context": true
}
```

`input.image` appends the image to the Pipecat LLM context as a `UserImageRawFrame` and immediately triggers the LLM. The reply returns through the existing `response.text.*` and `response.audio.*` events. Prefer occasional compressed camera snapshots over continuous video frames.
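
The client-to-server messages above can be assembled with a few small helpers. This is a minimal sketch, not the engine's own client code: the helper names are hypothetical, and it assumes the `audio` and `image` fields carry standard base64 of the raw bytes, which the examples above leave empty.

```python
import base64
import json


def session_start(chat_id: str, sample_rate: int = 16000) -> str:
    """Build a `session.start` message (hypothetical helper)."""
    return json.dumps({
        "type": "session.start",
        "protocol": "va.ws.v1",
        "chatId": chat_id,
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
            "channels": 1,
        },
    })


def input_audio(pcm16: bytes, sample_rate: int = 16000) -> str:
    """Wrap a PCM16 mono chunk in an `input.audio` message.

    Assumption: the `audio` field is standard base64 of the raw samples.
    """
    if len(pcm16) % 2:
        raise ValueError("PCM16 audio must contain whole 2-byte samples")
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm16).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": 1,
    })


def input_image(jpeg: bytes, width: int, height: int, text: str) -> str:
    """Wrap a JPEG snapshot in an `input.image` message.

    Assumption: the `image` field is standard base64 of the JPEG bytes.
    """
    return json.dumps({
        "type": "input.image",
        "image": base64.b64encode(jpeg).decode("ascii"),
        "mime_type": "image/jpeg",
        "width": width,
        "height": height,
        "text": text,
        "append_to_context": True,
    })
```
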

Stop:

```json
{"type": "session.stop", "reason": "done"}
```

Cancel:

```json
{"type": "response.cancel"}
```

Returned bot audio:

```json
{
  "type": "response.audio.delta",
  "protocol": "va.ws.v1",
  "seq": 1,
  "audio": "",
  "bytes": 6400,
  "sample_rate": 16000,
  "channels": 1
}
```

Returned transcripts and assistant text:

```json
{"type": "input.transcript.interim", "text": "What's the"}
{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.state", "state": "speaking"}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}
```

`response.text.started` fires at the start of every assistant turn (LLM streaming reply, or a fixed `TTSSpeakFrame` greeting). `response.text.delta` events stream LLM token chunks as they're produced, **ahead of the synthesized audio**, because the producer sits upstream of the TTS in the pipeline. `response.text.final` fires when the turn ends, carrying the full concatenated assistant text and an `interrupted` flag (true when an `input.text` or barge-in cut the turn short).

When `agent.response_state.enabled` is true, an LLM response that starts with `...` emits the tag body as `response.state` before the remaining assistant text is streamed and spoken. If the tag is missing or malformed, the original response text is streamed unchanged.
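
On the client side, the event stream above can be folded into a per-turn state object. The following is a rough sketch under stated assumptions: `TurnState` and `apply_event` are hypothetical names, and the `audio` field of `response.audio.delta` is assumed to be standard base64 of PCM16 bytes.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnState:
    """Client-side view of one assistant turn (hypothetical helper)."""
    text_parts: List[str] = field(default_factory=list)
    audio: bytearray = field(default_factory=bytearray)
    final_text: Optional[str] = None
    interrupted: bool = False
    state: Optional[str] = None


def apply_event(turn: TurnState, raw: str) -> TurnState:
    """Fold one JSON event from the product endpoint into the turn state."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "response.text.started":
        # A new assistant turn begins: drop text from the previous turn.
        turn.text_parts.clear()
        turn.final_text = None
        turn.interrupted = False
    elif kind == "response.text.delta":
        # Deltas arrive ahead of the audio, so they suit live captions.
        turn.text_parts.append(event["text"])
    elif kind == "response.text.final":
        turn.final_text = event["text"]
        turn.interrupted = bool(event.get("interrupted", False))
    elif kind == "response.state":
        turn.state = event["state"]
    elif kind == "response.audio.delta":
        # Assumption: `audio` is base64-encoded PCM16.
        turn.audio.extend(base64.b64decode(event["audio"]))
    return turn
```

Unknown event types are ignored, which keeps a client like this tolerant of events it does not handle, such as the `input.transcript.*` messages.
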
### Turn detection

User-turn segmentation (VAD thresholds plus how long to wait after silence before declaring the turn done) is configurable per environment:

```json
"turn": {
  "vad": {
    "confidence": 0.7,
    "start_secs": 0.2,
    "stop_secs": 0.6,
    "min_volume": 0.6
  },
  "interruption_min_chars": 3,
  "interruption_use_interim": true,
  "interruption_short_replies": ["是的", "行", "可以"],
  "user_speech_timeout_sec": 1.0
}
```

- `vad.*` maps directly to `pipecat.audio.vad.vad_analyzer.VADParams` and controls the Silero VAD. `stop_secs` is the duration of silence required before VAD reports the user stopped speaking; raise it if VAD is cutting users off mid-clause, lower it for snappier turn-taking.
- `interruption_min_chars`, `interruption_use_interim`, and `interruption_short_replies` configure the custom turn-start gate used while the assistant is speaking. Short replies in the allowlist (for example, `是的` "yes", `行` "OK", `可以` "sure") can barge in immediately; other text must contain at least `interruption_min_chars` countable characters after punctuation and spaces are removed. This keeps common yes/no answers while filtering brief background speech.
- `user_speech_timeout_sec` is the additional grace window (used by `SpeechTimeoutUserTurnStopStrategy`) during which the user may resume speaking before the aggregator finalizes the turn. The timer is re-armed every time the user resumes, so brief mid-sentence pauses do not split one utterance into multiple LLM turns.

The total "user pause before turn ends" budget is approximately `vad.stop_secs + user_speech_timeout_sec`, which is about 1.6 seconds with the values above. The repo defaults are tuned slightly more conservatively than upstream pipecat so that streaming ASRs (xfyun in particular) do not produce many short fragments per logical utterance. Setting this stop strategy explicitly also replaces pipecat's default Smart Turn v3 analyzer, so the engine no longer loads the `smart-turn-v3.*-cpu.onnx` model at startup.
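
The interruption gate described above can be illustrated roughly as follows. This is a sketch of the rule as stated, not the engine's implementation: the function names are hypothetical, and it assumes that "punctuation" means Unicode category `P*` and that the allowlist is matched after punctuation and spaces are stripped.

```python
import unicodedata

# Allowlist taken from the config example above.
SHORT_REPLIES = frozenset({"是的", "行", "可以"})


def strip_uncounted(text: str) -> str:
    """Remove whitespace and Unicode punctuation (assumed definition)."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def may_interrupt(text: str, min_chars: int = 3,
                  short_replies: frozenset = SHORT_REPLIES) -> bool:
    """Decide whether a transcript may barge in while the assistant speaks."""
    countable = strip_uncounted(text)
    if countable in short_replies:
        # Allowlisted short replies pass regardless of length.
        return True
    return len(countable) >= min_chars
```

With the defaults, `may_interrupt("行。")` passes through the allowlist, while a one-character filler such as `may_interrupt("嗯，")` is rejected because only one countable character remains.
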
### Xfyun ASR

The STT provider can be switched to iFlytek/Xfyun's streaming voice dictation WebSocket API. The engine opens the xfyun websocket when Pipecat VAD detects the user has started speaking, keeps it open across brief pauses, and closes it only when Pipecat's user-turn strategy declares the logical turn complete. It sends PCM chunks as `encoding: "raw"` and emits `input.transcript.interim` events with the current full interim transcript as Xfyun results arrive, followed by the existing `input.transcript.final` event.

```json
"stt": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://iat-api.xfyun.cn/v2/iat",
  "language": "zh_cn",
  "domain": "iat",
  "accent": "mandarin",
  "encoding": "raw",
  "frame_size": 1280,
  "timeout_sec": 10.0
}
```

Credentials may also be provided through `XFYUN_APP_ID`, `XFYUN_API_KEY`, and `XFYUN_API_SECRET`.

### Xfyun TTS

The TTS provider can be switched to iFlytek/Xfyun's online TTS WebSocket API. The engine requests `aue: "raw"` PCM audio so the returned chunks can be sent through the existing Pipecat/product audio path without MP3 decoding.

```json
"tts": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://tts-api.xfyun.cn/v2/tts",
  "voice": "x4_xiaoyan",
  "aue": "raw",
  "tte": "UTF8",
  "speed": 50,
  "volume": 50,
  "pitch": 50,
  "source_sample_rate_hz": 16000
}
```

Credentials may also be provided through `XFYUN_APP_ID`, `XFYUN_API_KEY`, and `XFYUN_API_SECRET`.

## Notes

This folder keeps the rewrite minimal on purpose. Dograh's useful pattern is the separation between app entrypoint, service factory, and pipeline builder; its workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.