engine-v5-pipecat-core/README.md
2026-05-31 22:46:48 +08:00

AI VideoAssistant Engine v5 Pipecat Minimal

This is a Pipecat-based rewrite of AI-VideoAssistant-engine-v3-minimal in a separate folder.

It intentionally uses Pipecat's FastAPI websocket transport with ProtobufFrameSerializer. The old v3-minimal base64 JSON audio protocol is not supported here.

Shape

FastAPI /ws
-> Pipecat FastAPIWebsocketTransport
-> OpenAI STT
-> LLM context aggregator
-> OpenAI LLM
-> OpenAI TTS
-> Pipecat websocket output

Run

cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8000

Alternatively, put the keys directly in config.json and start the engine with an explicit config path:

uv run python -m engine.main --config ./config.json

Browser demo (served from the same process when server.serve_webpage is true in config.json):

http://localhost:8000/voice-demo/

See examples/webpage/README.md for details.

Protocols

Pipecat-native endpoint:

ws://localhost:8000/ws

The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with pipecat.frames.protobufs.frames_pb2.

Important defaults:

  • serializer: ProtobufFrameSerializer
  • audio input: PCM16 mono
  • sample rate: 16000
  • endpoint: /ws

Optional input audio filtering can be enabled through audio_filter. See docs/deepfilternet.md for the DeepFilterNet real-time filter setup.

Product endpoint:

ws://localhost:8000/ws-product?chatId=customer-chat-001

This endpoint uses a stable JSON/base64 protocol named va.ws.v1. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema. For FastGPT sessions, pass chatId on the websocket URL. The engine uses that id for FastGPT server-side memory: if the id already has FastGPT records, the assistant greets the user with 欢迎回来继续对话 ("Welcome back, let's continue the conversation"); otherwise it uses the FastGPT app opener.
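
A chatId taken from user data may contain characters that need percent-encoding in a query string. The sketch below builds the product URL with the standard library; `build_product_ws_url` and its `secure` flag are hypothetical names used only for illustration, not part of the engine.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_product_ws_url(host: str, chat_id: str, secure: bool = False) -> str:
    """Return the /ws-product URL with chatId percent-encoded in the query."""
    scheme = "wss" if secure else "ws"
    return f"{scheme}://{host}/ws-product?{urlencode({'chatId': chat_id})}"


url = build_product_ws_url("localhost:8000", "customer-chat-001")
print(url)
```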

Start a session:

{
  "type": "session.start",
  "protocol": "va.ws.v1",
  "chatId": "customer-chat-001",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1
  }
}

Send audio:

{
  "type": "input.audio",
  "audio": "<base64 pcm_s16le bytes>",
  "sample_rate": 16000,
  "channels": 1
}

The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.
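
As a minimal client-side sketch, the two JSON messages above can be built with the standard library alone. The helper names `session_start` and `input_audio` are hypothetical; the field names and values are the ones shown in the examples. The whole-frame check is an added safety assumption, not documented engine behavior.

```python
import base64
import json


def session_start(chat_id: str, sample_rate: int = 16000, channels: int = 1) -> str:
    """Serialize a session.start message for the va.ws.v1 protocol."""
    return json.dumps({
        "type": "session.start",
        "protocol": "va.ws.v1",
        "chatId": chat_id,
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
            "channels": channels,
        },
    })


def input_audio(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> str:
    """Serialize one PCM16 chunk as a base64 input.audio message."""
    if len(pcm) % (2 * channels):
        raise ValueError("PCM16 payload must contain whole sample frames")
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
    })


# 20 ms of silence at 16 kHz mono: 320 samples * 2 bytes each.
chunk = bytes(640)
message = json.loads(input_audio(chunk))
```

Base64 inflates the payload by roughly a third, which is the bandwidth cost the binary path avoids.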

Send a camera snapshot for vision-capable LLM replies:

{
  "type": "input.image",
  "image": "<base64 jpeg/png/webp bytes>",
  "mime_type": "image/jpeg",
  "width": 640,
  "height": 360,
  "text": "Answer using this camera image.",
  "append_to_context": true
}

input.image appends the image to the Pipecat LLM context as a UserImageRawFrame and immediately triggers the LLM. The reply returns through the existing response.text.* and response.audio.* events. Prefer occasional compressed camera snapshots over continuous video frames.
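
A snapshot message can be assembled the same way. This is a sketch only: `input_image` is a hypothetical helper, the caller is assumed to already hold compressed JPEG/PNG/WebP bytes and to know their dimensions, and the fields mirror the example above.

```python
import base64
import json


def input_image(
    image_bytes: bytes,
    width: int,
    height: int,
    text: str,
    mime_type: str = "image/jpeg",
    append_to_context: bool = True,
) -> str:
    """Serialize an already-compressed camera snapshot as an input.image message."""
    return json.dumps({
        "type": "input.image",
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "mime_type": mime_type,
        "width": width,
        "height": height,
        "text": text,
        "append_to_context": append_to_context,
    })


# Placeholder bytes stand in for a real JPEG file read from disk or a camera.
snapshot = b"\xff\xd8\xff placeholder"
image_message = json.loads(
    input_image(snapshot, 640, 360, "Answer using this camera image.")
)
```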

Stop:

{"type": "session.stop", "reason": "done"}

Cancel:

{"type": "response.cancel"}

Returned bot audio:

{
  "type": "response.audio.delta",
  "protocol": "va.ws.v1",
  "seq": 1,
  "audio": "<base64 pcm_s16le bytes>",
  "bytes": 6400,
  "sample_rate": 16000,
  "channels": 1
}
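
On the receiving side, a client might decode each delta and derive its playback duration from the PCM16 parameters. The sketch below assumes that the bytes field is the length of the decoded payload, which the example suggests but does not state; `decode_audio_delta` is a hypothetical helper.

```python
import base64
from typing import Tuple


def decode_audio_delta(event: dict) -> Tuple[bytes, float]:
    """Return (pcm bytes, duration in seconds) for a response.audio.delta event."""
    pcm = base64.b64decode(event["audio"])
    # Assumption: "bytes" reports the decoded payload length.
    if "bytes" in event and event["bytes"] != len(pcm):
        raise ValueError("decoded length does not match the bytes field")
    bytes_per_second = event["sample_rate"] * event["channels"] * 2  # 2 bytes per PCM16 sample
    return pcm, len(pcm) / bytes_per_second


event = {
    "type": "response.audio.delta",
    "protocol": "va.ws.v1",
    "seq": 1,
    "audio": base64.b64encode(bytes(6400)).decode("ascii"),
    "bytes": 6400,
    "sample_rate": 16000,
    "channels": 1,
}
pcm, seconds = decode_audio_delta(event)
print(seconds)  # 0.2
```

Under that assumption, a 6400-byte delta at 16 kHz mono carries 200 ms of audio.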

Returned transcripts and assistant text:

{"type": "input.transcript.interim", "text": "What's the"}
{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.state", "state": "speaking"}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}

response.text.started fires at the start of every assistant turn, whether it is a streamed LLM reply or a fixed TTSSpeakFrame greeting. response.text.delta events stream LLM token chunks as they are produced; they arrive ahead of the synthesized audio because the producer sits upstream of the TTS in the pipeline. response.text.final fires when the turn ends and carries the full concatenated assistant text plus an interrupted flag (true when an input.text message or a barge-in cut the turn short).
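
A client can track one assistant turn with a small accumulator, sketched below. `TurnCollector` is a hypothetical name, and the `matches_stream` field is only a diagnostic: whether the final text always equals the concatenated deltas, for instance after an interruption, is not something this document guarantees.

```python
from typing import List, Optional


class TurnCollector:
    """Accumulate response.text.* events for a single assistant turn."""

    def __init__(self) -> None:
        self.parts: List[str] = []

    def feed(self, event: dict) -> Optional[dict]:
        """Consume one event; return a summary when the turn is final."""
        kind = event.get("type")
        if kind == "response.text.started":
            self.parts = []  # a new turn begins, drop any stale deltas
        elif kind == "response.text.delta":
            self.parts.append(event["text"])
        elif kind == "response.text.final":
            streamed = "".join(self.parts)
            return {
                "text": event["text"],
                "interrupted": event.get("interrupted", False),
                "matches_stream": streamed == event["text"],
            }
        return None


events = [
    {"type": "response.text.started"},
    {"type": "response.text.delta", "text": "It's "},
    {"type": "response.text.delta", "text": "sunny in "},
    {"type": "response.text.delta", "text": "Berlin."},
    {"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": False},
]
collector = TurnCollector()
results = [collector.feed(e) for e in events]
summary = results[-1]
```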

When agent.response_state.enabled is true, an LLM response that starts with <state>...</state> emits the tag body as response.state before the remaining assistant text is streamed and spoken. If the tag is missing or malformed, the original response text is streamed unchanged.
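
The tag-splitting rule can be illustrated on a complete string. This is a simplified sketch, not the engine's code: the engine works on a token stream, and the tolerance for leading whitespace and the trimming of the tag body are assumptions made here. `split_state` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches a <state>...</state> tag only at the very start of the response.
_STATE_PREFIX = re.compile(r"\s*<state>(.*?)</state>\s*", re.DOTALL)


def split_state(response: str) -> Tuple[Optional[str], str]:
    """Return (state, remaining text); state is None if the tag is missing or malformed."""
    match = _STATE_PREFIX.match(response)
    if match is None:
        return None, response  # leave the original text unchanged
    return match.group(1).strip(), response[match.end():]


print(split_state("<state>speaking</state>It's sunny in Berlin."))
print(split_state("<state>speaking It's sunny in Berlin."))
```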

Turn detection

User-turn segmentation (the VAD thresholds and how long to wait after silence before declaring the turn done) is configurable per environment:

"turn": {
  "vad": {
    "confidence": 0.7,
    "start_secs": 0.2,
    "stop_secs": 0.6,
    "min_volume": 0.6
  },
  "interruption_min_chars": 3,
  "interruption_use_interim": true,
  "interruption_short_replies": ["是的", "行", "可以"],
  "user_speech_timeout_sec": 1.0
}

  • vad.* maps directly to pipecat.audio.vad.vad_analyzer.VADParams and controls the Silero VAD. stop_secs is the duration of silence required before VAD reports that the user stopped speaking; raise it if VAD is cutting users off mid-clause, lower it for snappier turn-taking.
  • interruption_min_chars, interruption_use_interim, and interruption_short_replies configure the custom turn-start gate used while the assistant is speaking. Short replies in the allowlist (for example, 是的, 行, 可以) can barge in immediately; other text must contain at least interruption_min_chars countable characters after punctuation and spaces are removed. This lets common yes/no answers through while filtering out brief background speech.
  • user_speech_timeout_sec is the additional grace window (used by SpeechTimeoutUserTurnStopStrategy) during which the user may resume speaking before the aggregator finalizes the turn. The timer is re-armed every time the user resumes, so brief mid-sentence pauses do not split one utterance into multiple LLM turns.
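
The interruption gate described in the second bullet can be sketched as a pure function. This is an illustration under assumptions, not the engine's implementation: it treats Unicode punctuation (category P*) and whitespace as non-countable, compares against the allowlist after stripping them, and does not model interruption_use_interim. `may_interrupt` is a hypothetical name.

```python
import unicodedata
from typing import Sequence


def _countable(text: str) -> str:
    """Drop whitespace and Unicode punctuation, keep everything else."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def may_interrupt(
    text: str,
    min_chars: int = 3,
    short_replies: Sequence[str] = ("是的", "行", "可以"),
) -> bool:
    """Decide whether a transcript may barge in while the assistant is speaking."""
    stripped = _countable(text)
    if stripped in short_replies:
        return True  # allowlisted short reply passes immediately
    return len(stripped) >= min_chars


print(may_interrupt("行。"))    # True: allowlisted once punctuation is removed
print(may_interrupt("嗯"))      # False: one countable character, not allowlisted
print(may_interrupt("等一下"))  # True: three countable characters
```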

The total "user pause before turn ends" budget is approximately vad.stop_secs + user_speech_timeout_sec. The repo defaults are tuned slightly more conservatively than upstream Pipecat so that streaming ASRs (Xfyun in particular) do not produce many short fragments per logical utterance. Setting this stop strategy explicitly also replaces Pipecat's default Smart Turn v3 analyzer, so the engine no longer loads the smart-turn-v3.*-cpu.onnx model at startup.
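
As a quick worked example of that budget, using the values from the turn block shown earlier (`pause_budget_secs` is a hypothetical helper, and the result is only the approximation stated above):

```python
def pause_budget_secs(turn: dict) -> float:
    """Approximate silence, in seconds, before the user turn is finalized."""
    return turn["vad"]["stop_secs"] + turn["user_speech_timeout_sec"]


turn = {"vad": {"stop_secs": 0.6}, "user_speech_timeout_sec": 1.0}
budget = pause_budget_secs(turn)
print(round(budget, 3))  # 1.6
```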

Xfyun ASR

The STT provider can be switched to iFlytek/Xfyun's streaming voice dictation WebSocket API. The engine opens the Xfyun websocket when Pipecat VAD detects that the user has started speaking, keeps it open across brief pauses, and closes it only when Pipecat's user-turn strategy declares the logical turn complete. It sends PCM chunks as encoding: "raw" and emits input.transcript.interim events with the current full interim transcript as Xfyun results arrive, followed by the existing input.transcript.final event.

"stt": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://iat-api.xfyun.cn/v2/iat",
  "language": "zh_cn",
  "domain": "iat",
  "accent": "mandarin",
  "encoding": "raw",
  "frame_size": 1280,
  "timeout_sec": 10.0
}
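
To give a feel for frame_size, the sketch below splits a PCM buffer into chunks and computes how much audio one chunk holds. It assumes frame_size is measured in bytes and that the audio is 16 kHz mono PCM16; both helper names are hypothetical and the engine's actual chunking is not shown in this document.

```python
from typing import Iterator


def iter_frames(pcm: bytes, frame_size: int = 1280) -> Iterator[bytes]:
    """Yield consecutive chunks of at most frame_size bytes."""
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    for offset in range(0, len(pcm), frame_size):
        yield pcm[offset:offset + frame_size]


def frame_duration_ms(frame_size: int = 1280, sample_rate: int = 16000, channels: int = 1) -> float:
    """Duration of one full frame of PCM16 audio, in milliseconds."""
    return 1000 * frame_size / (sample_rate * channels * 2)


one_second = bytes(32000)  # 1 s of silence at 16 kHz mono PCM16
frames = list(iter_frames(one_second))
print(len(frames), frame_duration_ms())  # 25 40.0
```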

Credentials may also be provided through XFYUN_APP_ID, XFYUN_API_KEY, and XFYUN_API_SECRET.
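
One way to combine the two credential sources is sketched below. The precedence (config value first, environment variable as fallback) is an assumption made for illustration; this document does not say which source wins. `resolve_xfyun_credentials` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

_ENV_NAMES = {
    "app_id": "XFYUN_APP_ID",
    "api_key": "XFYUN_API_KEY",
    "api_secret": "XFYUN_API_SECRET",
}


def resolve_xfyun_credentials(
    section: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Fill app_id, api_key, and api_secret from config, falling back to the environment."""
    env = os.environ if env is None else env
    resolved = {}
    for key, env_name in _ENV_NAMES.items():
        # Assumed precedence: a non-empty config value beats the environment.
        value = section.get(key) or env.get(env_name)
        if not value:
            raise ValueError(f"missing Xfyun credential: set {key} or {env_name}")
        resolved[key] = value
    return resolved


creds = resolve_xfyun_credentials(
    {"provider": "xfyun", "app_id": "from-config"},
    env={"XFYUN_API_KEY": "from-env-key", "XFYUN_API_SECRET": "from-env-secret"},
)
```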

Xfyun TTS

The TTS provider can be switched to iFlytek/Xfyun's online TTS WebSocket API. The engine requests aue: "raw" PCM audio so the returned chunks can be sent through the existing Pipecat/product audio path without MP3 decoding.

"tts": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://tts-api.xfyun.cn/v2/tts",
  "voice": "x4_xiaoyan",
  "aue": "raw",
  "tte": "UTF8",
  "speed": 50,
  "volume": 50,
  "pitch": 50,
  "source_sample_rate_hz": 16000
}

Credentials may also be provided through XFYUN_APP_ID, XFYUN_API_KEY, and XFYUN_API_SECRET.

Notes

This folder keeps the rewrite minimal on purpose. Dograh's useful pattern is the separation between app entrypoint, service factory, and pipeline builder; its workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.