# AI VideoAssistant Engine v5 Pipecat Minimal

This is a Pipecat-based rewrite of `AI-VideoAssistant-engine-v3-minimal` in a separate folder.

It intentionally uses Pipecat's FastAPI websocket transport with `ProtobufFrameSerializer`.
The old v3-minimal base64 JSON audio protocol is not supported here.

## Shape

```text
FastAPI /ws
  -> Pipecat FastAPIWebsocketTransport
  -> OpenAI STT
  -> LLM context aggregator
  -> OpenAI LLM
  -> OpenAI TTS
  -> Pipecat websocket output
```

## Run

```bash
cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8000
```

Alternatively, put the keys directly in `config.json` and start the engine with that file:

```bash
uv run python -m engine.main --config ./config.json
```

Browser demo (served from the same process when `server.serve_webpage` is
true in `config.json`):

```text
http://localhost:8000/voice-demo/
```

See `examples/webpage/README.md` for details.

## Protocols

Pipecat-native endpoint:

```text
ws://localhost:8000/ws
```

The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with `pipecat.frames.protobufs.frames_pb2`.

Important defaults:

- serializer: `ProtobufFrameSerializer`
- audio input: PCM16 mono
- sample rate: `16000`
- endpoint: `/ws`

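As a rough illustration of what these defaults mean for a client, the sketch below computes how many bytes a PCM16 mono chunk occupies at 16 kHz and splits a buffer into chunks of that size. The 20 ms chunk duration and the helper names `chunk_bytes` and `split_pcm` are assumptions made for the example; nothing above prescribes a chunk size.

```python
from typing import List

SAMPLE_RATE_HZ = 16000  # default sample rate listed above
BYTES_PER_SAMPLE = 2    # PCM16 stores each sample in two bytes
CHANNELS = 1            # mono


def chunk_bytes(duration_ms: int) -> int:
    """Return the byte length of a PCM16 mono chunk lasting duration_ms."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def split_pcm(pcm: bytes, duration_ms: int = 20) -> List[bytes]:
    """Split a PCM buffer into fixed-duration chunks; the last may be shorter."""
    size = chunk_bytes(duration_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

With these defaults a 20 ms chunk is 640 bytes, so one second of audio is 32000 bytes.
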
Optional input audio filtering can be enabled through `audio_filter`. See
`docs/deepfilternet.md` for the DeepFilterNet real-time filter setup.

Product endpoint:

```text
ws://localhost:8000/ws-product?chatId=customer-chat-001
```

This endpoint uses a stable JSON/base64 protocol named `va.ws.v1`. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.
For FastGPT sessions, pass `chatId` on the websocket URL. The engine uses
that id for FastGPT server-side memory; if the id has existing FastGPT
records, the assistant greets with `欢迎回来继续对话` ("Welcome back, let's
continue our conversation"), otherwise it uses the FastGPT app opener.

Start a session:

```json
{
  "type": "session.start",
  "protocol": "va.ws.v1",
  "chatId": "customer-chat-001",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1
  }
}
```

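A client might assemble the endpoint URL and the `session.start` payload along the following lines. This is a minimal sketch: the helper names `build_product_url` and `build_session_start` are hypothetical, and only the fields shown above are used.

```python
import json
from urllib.parse import urlencode


def build_product_url(host: str, chat_id: str) -> str:
    """Build the /ws-product URL with a URL-encoded chatId query parameter."""
    return f"ws://{host}/ws-product?{urlencode({'chatId': chat_id})}"


def build_session_start(chat_id: str, sample_rate: int = 16000, channels: int = 1) -> str:
    """Serialize a session.start message with the fields documented above."""
    message = {
        "type": "session.start",
        "protocol": "va.ws.v1",
        "chatId": chat_id,
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate": sample_rate,
            "channels": channels,
        },
    }
    return json.dumps(message)
```

URL-encoding the id keeps the query string valid if a chat id ever contains spaces or other reserved characters.
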
Send audio:

```json
{
  "type": "input.audio",
  "audio": "<base64 pcm_s16le bytes>",
  "sample_rate": 16000,
  "channels": 1
}
```

The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.

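To make the trade-off concrete, here is a small sketch that wraps a PCM16 chunk in an `input.audio` message. The helper name `build_input_audio` is an assumption for illustration; the field names come from the example above.

```python
import base64
import json


def build_input_audio(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> str:
    """Wrap raw PCM16 bytes in a JSON input.audio message."""
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": sample_rate,
        "channels": channels,
    })


if __name__ == "__main__":
    chunk = bytes(640)  # 20 ms of silence at 16 kHz, PCM16 mono
    framed = build_input_audio(chunk)
    print(len(chunk), len(framed))  # raw size versus JSON-framed size
```

Base64 turns every 3 bytes into 4 characters, so a 640-byte chunk becomes 856 characters before the JSON envelope is added. That roughly one-third overhead is what the binary path avoids.
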
Send a camera snapshot for vision-capable LLM replies:

```json
{
  "type": "input.image",
  "image": "<base64 jpeg/png/webp bytes>",
  "mime_type": "image/jpeg",
  "width": 640,
  "height": 360,
  "text": "Answer using this camera image.",
  "append_to_context": true
}
```

`input.image` appends the image to the Pipecat LLM context as a
`UserImageRawFrame` and immediately triggers the LLM. The reply returns through
the existing `response.text.*` and `response.audio.*` events. Prefer occasional
compressed camera snapshots over continuous video frames.

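One way a client could follow the "occasional snapshots" advice is to rate-limit how often it sends `input.image`. The sketch below is an assumption-laden illustration: `SnapshotThrottle`, `build_input_image`, and any particular minimum interval are hypothetical client-side choices, not part of the engine.

```python
import base64
import json
from typing import Optional


def build_input_image(image_bytes: bytes, mime_type: str, width: int,
                      height: int, text: str) -> str:
    """Serialize an input.image message with the fields documented above."""
    return json.dumps({
        "type": "input.image",
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "mime_type": mime_type,
        "width": width,
        "height": height,
        "text": text,
        "append_to_context": True,
    })


class SnapshotThrottle:
    """Allow at most one snapshot per min_interval_sec (client-side policy)."""

    def __init__(self, min_interval_sec: float) -> None:
        self.min_interval_sec = min_interval_sec
        self._last_sent: Optional[float] = None

    def should_send(self, now: float) -> bool:
        """Return True and record the time if enough time has passed."""
        if self._last_sent is None or now - self._last_sent >= self.min_interval_sec:
            self._last_sent = now
            return True
        return False
```

A capture loop would call `should_send(time.monotonic())` before encoding each frame and skip the frame when it returns `False`.
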
Stop:

```json
{"type": "session.stop", "reason": "done"}
```

Cancel:

```json
{"type": "response.cancel"}
```

Returned bot audio:

```json
{
  "type": "response.audio.delta",
  "protocol": "va.ws.v1",
  "seq": 1,
  "audio": "<base64 pcm_s16le bytes>",
  "bytes": 6400,
  "sample_rate": 16000,
  "channels": 1
}
```

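A client could decode such an event roughly as follows. This sketch assumes that `bytes` reports the length of the decoded PCM payload, which the example suggests but does not state outright; `decode_audio_delta` is a hypothetical helper name.

```python
import base64
import json
from typing import Tuple

BYTES_PER_SAMPLE = 2  # pcm_s16le uses two bytes per sample


def decode_audio_delta(raw: str) -> Tuple[bytes, float]:
    """Decode a response.audio.delta event into PCM bytes and a duration in seconds."""
    event = json.loads(raw)
    if event.get("type") != "response.audio.delta":
        raise ValueError("not a response.audio.delta event")
    pcm = base64.b64decode(event["audio"])
    # Assumption: "bytes" is the decoded payload length, so use it as a sanity check.
    if "bytes" in event and event["bytes"] != len(pcm):
        raise ValueError("payload length does not match the 'bytes' field")
    bytes_per_second = event["sample_rate"] * event["channels"] * BYTES_PER_SAMPLE
    return pcm, len(pcm) / bytes_per_second
```

Under that assumption, the 6400-byte example above carries 0.2 seconds of 16 kHz mono audio.
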
Returned transcripts and assistant text:

```json
{"type": "input.transcript.interim", "text": "What's the"}
{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.state", "state": "speaking"}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}
```

`response.text.started` fires at the start of every assistant turn (LLM
streaming reply, or a fixed `TTSSpeakFrame` greeting).
`response.text.delta` events stream LLM token chunks as they're produced,
**ahead of the synthesized audio**, because the producer sits upstream of
the TTS in the pipeline. `response.text.final` fires when the turn ends,
carrying the full concatenated assistant text and an `interrupted` flag
(true when an `input.text` or barge-in cut the turn short).

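On the client side, the event sequence above can be folded into a single assistant turn. The following is a minimal sketch with a hypothetical helper, `collect_assistant_turn`; it relies only on the event types and fields shown in the example.

```python
from typing import Iterable, List, Optional, Tuple


def collect_assistant_turn(events: Iterable[dict]) -> Tuple[str, Optional[bool]]:
    """Return (text, interrupted) for one assistant turn.

    Deltas are accumulated for live display. Once response.text.final
    arrives, its text and interrupted flag are returned. If the stream ends
    without a final event, the partial text is returned with None.
    """
    parts: List[str] = []
    for event in events:
        kind = event.get("type")
        if kind == "response.text.started":
            parts = []  # a new turn begins; drop any stale text
        elif kind == "response.text.delta":
            parts.append(event["text"])
        elif kind == "response.text.final":
            return event["text"], event.get("interrupted")
    return "".join(parts), None
```

Because deltas arrive ahead of the audio, a UI can render `"".join(parts)` immediately and reconcile with the `response.text.final` text when the turn closes.
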
When `agent.response_state.enabled` is true, an LLM response that starts with
`<state>...</state>` emits the tag body as `response.state` before the
remaining assistant text is streamed and spoken. If the tag is missing or
malformed, the original response text is streamed unchanged.

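The rule can be sketched on a complete response string as shown below. This is an illustration only: the engine works on a token stream, so its real parsing may differ, and both the helper name `split_state` and the tolerance for leading whitespace are assumptions.

```python
import re
from typing import Optional, Tuple

# Matches a <state>...</state> tag only at the very start of the response.
_STATE_PREFIX = re.compile(r"^\s*<state>(.*?)</state>\s*", re.DOTALL)


def split_state(response: str) -> Tuple[Optional[str], str]:
    """Return (state, remaining_text); state is None if no leading tag is found."""
    match = _STATE_PREFIX.match(response)
    if match is None:
        # Missing or malformed tag: pass the text through unchanged.
        return None, response
    return match.group(1).strip(), response[match.end():]
```

For example, `split_state("<state>thinking</state>Hello")` yields the state `thinking` and the text `Hello`, while a response with an unclosed `<state>` tag comes back untouched.
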
### Turn detection

User-turn segmentation (VAD thresholds + how long to wait after silence
before declaring the turn done) is configurable per environment:

```json
"turn": {
  "vad": {
    "confidence": 0.7,
    "start_secs": 0.2,
    "stop_secs": 0.6,
    "min_volume": 0.6
  },
  "interruption_min_chars": 3,
  "interruption_use_interim": true,
  "interruption_short_replies": ["是的", "行", "可以"],
  "user_speech_timeout_sec": 1.0
}
```

- `vad.*` maps directly to `pipecat.audio.vad.vad_analyzer.VADParams` and
  controls the Silero VAD. `stop_secs` is the duration of silence required
  before VAD reports the user stopped speaking; raise it if VAD is
  cutting users off mid-clause, lower it for snappier turn-taking.
- `interruption_min_chars`, `interruption_use_interim`, and
  `interruption_short_replies` configure the custom turn-start gate used
  while the assistant is speaking. Short replies in the allowlist (for
  example, `是的` "yes", `行` "OK", `可以` "sure") can barge in immediately;
  other text must contain at least `interruption_min_chars` countable
  characters after punctuation and spaces are removed. This keeps common
  yes/no answers while filtering brief background speech.
- `user_speech_timeout_sec` is the additional grace window (used by
  `SpeechTimeoutUserTurnStopStrategy`) during which the user may resume
  speaking before the aggregator finalizes the turn. The timer is
  re-armed every time the user resumes, so brief mid-sentence pauses do
  not split one utterance into multiple LLM turns.

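The turn-start gate described in the second bullet can be approximated as follows. This is a sketch under stated assumptions, not the engine's code: it treats Unicode punctuation categories and whitespace as non-countable, and it matches the allowlist exactly after stripping. The function names are hypothetical.

```python
import unicodedata
from typing import Sequence


def strip_uncounted(text: str) -> str:
    """Remove whitespace and Unicode punctuation (assumed definition of 'countable')."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def allows_barge_in(
    text: str,
    min_chars: int = 3,
    short_replies: Sequence[str] = ("是的", "行", "可以"),
) -> bool:
    """Decide whether a transcript may interrupt the assistant while it speaks."""
    stripped = strip_uncounted(text)
    if stripped in short_replies:
        return True  # allowlisted short reply: barge in immediately
    return len(stripped) >= min_chars
```

With the defaults from the config above, `行` and `是的。` pass because they are allowlisted, a stray `嗯` is filtered out, and `等一下` passes on length.
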
The total "user pause before turn ends" budget is approximately
`vad.stop_secs + user_speech_timeout_sec`. The repo defaults are tuned
slightly more conservatively than upstream Pipecat so that streaming ASR
services (Xfyun in particular) do not produce many short fragments per
logical utterance. Setting this stop strategy explicitly also replaces
Pipecat's default Smart Turn v3 analyzer, so the engine no longer loads the
`smart-turn-v3.*-cpu.onnx` model at startup.

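As a quick worked example of that budget, the snippet below adds the two values from the `turn` config shown earlier. The helper name `pause_budget_sec` is hypothetical, and the result is only the approximate figure the paragraph describes.

```python
def pause_budget_sec(turn: dict) -> float:
    """Approximate silence, in seconds, before a user turn is finalized."""
    return turn["vad"]["stop_secs"] + turn["user_speech_timeout_sec"]


if __name__ == "__main__":
    turn = {"vad": {"stop_secs": 0.6}, "user_speech_timeout_sec": 1.0}
    print(f"{pause_budget_sec(turn):.1f} s")  # 0.6 s of VAD silence plus a 1.0 s grace window
```

With the example values, a user can pause for roughly 1.6 seconds before the turn is handed to the LLM.
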
### Xfyun ASR

The STT provider can be switched to iFlytek/Xfyun's streaming voice dictation
WebSocket API. The engine opens the Xfyun websocket when Pipecat VAD detects
that the user has started speaking, keeps it open across brief pauses, and
closes it only when Pipecat's user-turn strategy declares the logical turn
complete. It sends PCM chunks as `encoding: "raw"` and emits
`input.transcript.interim` events with the current full interim transcript as
Xfyun results arrive, followed by the existing `input.transcript.final` event.

```json
"stt": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://iat-api.xfyun.cn/v2/iat",
  "language": "zh_cn",
  "domain": "iat",
  "accent": "mandarin",
  "encoding": "raw",
  "frame_size": 1280,
  "timeout_sec": 10.0
}
```

Credentials may also be provided through `XFYUN_APP_ID`, `XFYUN_API_KEY`, and
`XFYUN_API_SECRET`.

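To give a feel for the `frame_size` setting, the sketch below converts a frame size in bytes to its audio duration and counts the frames needed for an utterance. It assumes `frame_size` is measured in bytes of 16 kHz PCM16 mono audio, which matches the defaults above but is not stated explicitly; the helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 16000  # assumed input sample rate
BYTES_PER_SAMPLE = 2    # PCM16
CHANNELS = 1            # mono


def frame_duration_ms(frame_size: int) -> float:
    """Duration in milliseconds of one frame of frame_size bytes."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return 1000 * frame_size / bytes_per_second


def frames_needed(pcm_length: int, frame_size: int) -> int:
    """Number of frames required to send pcm_length bytes of audio."""
    return math.ceil(pcm_length / frame_size)
```

Under these assumptions a 1280-byte frame holds 40 ms of audio, so one second of speech is sent as 25 frames.
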
### Xfyun TTS

The TTS provider can be switched to iFlytek/Xfyun's online TTS WebSocket API.
The engine requests `aue: "raw"` PCM audio so the returned chunks can be sent
through the existing Pipecat/product audio path without MP3 decoding.

```json
"tts": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://tts-api.xfyun.cn/v2/tts",
  "voice": "x4_xiaoyan",
  "aue": "raw",
  "tte": "UTF8",
  "speed": 50,
  "volume": 50,
  "pitch": 50,
  "source_sample_rate_hz": 16000
}
```

Credentials may also be provided through `XFYUN_APP_ID`, `XFYUN_API_KEY`, and
`XFYUN_API_SECRET`.

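A credential lookup that honors both sources might look like the sketch below. The precedence is an assumption: this example prefers values in `config.json` and falls back to the environment, and the engine may resolve them in a different order. `resolve_xfyun_credentials` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Config key -> environment variable named in the text above.
ENV_NAMES = {
    "app_id": "XFYUN_APP_ID",
    "api_key": "XFYUN_API_KEY",
    "api_secret": "XFYUN_API_SECRET",
}


def resolve_xfyun_credentials(
    section: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Resolve Xfyun credentials from a config section, then the environment."""
    env = os.environ if env is None else env
    resolved = {}
    for key, env_name in ENV_NAMES.items():
        # Assumed precedence: config value first, environment variable second.
        value = section.get(key) or env.get(env_name)
        if not value:
            raise ValueError(f"missing Xfyun credential: set '{key}' or {env_name}")
        resolved[key] = value
    return resolved
```

Failing fast on a missing credential surfaces configuration mistakes at startup rather than at the first websocket handshake.
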
## Notes

This folder keeps the rewrite minimal on purpose. Dograh's useful pattern is the separation between app entrypoint, service factory, and pipeline builder; its workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.