engine-v5-pipecat-core/README.md

# AI VideoAssistant Engine v5 Pipecat Minimal

This folder contains a Pipecat-based rewrite of `AI-VideoAssistant-engine-v3-minimal`, kept separate from the original.

It intentionally uses Pipecat's FastAPI websocket transport with `ProtobufFrameSerializer`.
The old v3-minimal base64 JSON audio protocol is not supported here.

## Shape

```text
FastAPI /ws
-> Pipecat FastAPIWebsocketTransport
-> OpenAI STT
-> LLM context aggregator
-> OpenAI LLM
-> OpenAI TTS
-> Pipecat websocket output
```

## Run

```bash
cd AI-VideoAssistant-engine-v5-pipecat-minimal
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
export OPENAI_API_KEY=...
uv run uvicorn engine.main:app --reload --host 0.0.0.0 --port 8000
```

Alternatively, put the keys in `config.json` and pass the file at startup:

```bash
uv run python -m engine.main --config ./config.json
```

Browser demo (served from the same process when `server.serve_webpage` is
true in `config.json`):

```text
http://localhost:8000/voice-demo/
```

See `examples/webpage/README.md` for details.

## Protocols

Pipecat-native endpoint:

```text
ws://localhost:8000/ws
```

The websocket payloads are Pipecat protobuf frames. A client should use Pipecat's websocket client/serializer stack or generate frames compatible with `pipecat.frames.protobufs.frames_pb2`.

Important defaults:

- serializer: `ProtobufFrameSerializer`
- audio input: PCM16 mono
- sample rate: `16000`
- endpoint: `/ws`

Optional input audio filtering can be enabled through `audio_filter`. See
`docs/deepfilternet.md` for the DeepFilterNet real-time filter setup.

Product endpoint:

```text
ws://localhost:8000/ws-product?chatId=customer-chat-001
```

This endpoint uses a stable JSON/base64 protocol named `va.ws.v1`. It is meant for browser, mobile, or other product applications that should not depend on Pipecat's internal protobuf frame schema.

For FastGPT sessions, pass `chatId` on the websocket URL. The engine uses
that id for FastGPT server-side memory. If the id already has FastGPT
records, the assistant greets with `欢迎回来继续对话` ("Welcome back, let's
continue the conversation"); otherwise it uses the FastGPT app opener.
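
As a minimal sketch of building that URL on the client side, the helper below sets `chatId` as the query string and percent-encodes it. The function name `product_ws_url` is hypothetical and not part of the engine.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def product_ws_url(base_url: str, chat_id: str) -> str:
    """Return base_url with chatId as its only query parameter."""
    parts = urlsplit(base_url)
    query = urlencode({"chatId": chat_id})  # percent-encodes unsafe characters
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


url = product_ws_url("ws://localhost:8000/ws-product", "customer-chat-001")
print(url)
```

Encoding the id matters when chat ids come from user data and may contain spaces or `&`.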

Start a session:

```json
{
  "type": "session.start",
  "protocol": "va.ws.v1",
  "chatId": "customer-chat-001",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate": 16000,
    "channels": 1
  }
}
```

Send audio:

```json
{
  "type": "input.audio",
  "audio": "<base64 pcm_s16le bytes>",
  "sample_rate": 16000,
  "channels": 1
}
```

The adapter also accepts raw binary websocket messages as PCM16 audio chunks. JSON/base64 is easier to inspect; binary is better for latency and bandwidth.
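
The two client messages above can be built with the standard library alone. This is a sketch, not the engine's client code: the helper names `session_start` and `input_audio` are hypothetical, and the 20 ms chunk size is an arbitrary choice for illustration.

```python
import base64
import json

SAMPLE_RATE = 16000
CHANNELS = 1


def session_start(chat_id: str) -> str:
    """Serialize a session.start message for the va.ws.v1 protocol."""
    return json.dumps({
        "type": "session.start",
        "protocol": "va.ws.v1",
        "chatId": chat_id,
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate": SAMPLE_RATE,
            "channels": CHANNELS,
        },
    })


def input_audio(pcm: bytes) -> str:
    """Wrap raw PCM16 bytes in a base64 input.audio message."""
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(pcm).decode("ascii"),
        "sample_rate": SAMPLE_RATE,
        "channels": CHANNELS,
    })


# 20 ms of silence: 16000 samples/s * 0.02 s * 2 bytes per sample = 640 bytes.
chunk = bytes(640)
message = input_audio(chunk)
encoded_len = len(json.loads(message)["audio"])
print(len(chunk), encoded_len)
```

Base64 turns every 3 bytes into 4 characters, so the JSON form carries roughly a third more data than the same chunk sent as a binary websocket message.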

Send a camera snapshot for vision-capable LLM replies:

```json
{
  "type": "input.image",
  "image": "<base64 jpeg/png/webp bytes>",
  "mime_type": "image/jpeg",
  "width": 640,
  "height": 360,
  "text": "Answer using this camera image.",
  "append_to_context": true
}
```

`input.image` appends the image to the Pipecat LLM context as a
`UserImageRawFrame` and immediately triggers the LLM. The reply returns through
the existing `response.text.*` and `response.audio.*` events. Prefer occasional
compressed camera snapshots over continuous video frames.
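
One way a client might follow the "occasional snapshots" advice is to rate-limit `input.image` messages. The sketch below is an assumption about client design, not engine behavior: `SnapshotSender` and its 5-second default interval are hypothetical, and the caller supplies the current time (for example from `time.monotonic()`).

```python
import base64
import json
from typing import Optional


class SnapshotSender:
    """Builds input.image messages, at most one per min_interval_sec."""

    def __init__(self, min_interval_sec: float = 5.0) -> None:
        self.min_interval_sec = min_interval_sec
        self._last_sent: Optional[float] = None

    def build(self, image: bytes, width: int, height: int,
              text: str, now: float) -> Optional[str]:
        # Skip the frame if the previous snapshot was sent too recently.
        if (self._last_sent is not None
                and now - self._last_sent < self.min_interval_sec):
            return None
        self._last_sent = now
        return json.dumps({
            "type": "input.image",
            "image": base64.b64encode(image).decode("ascii"),
            "mime_type": "image/jpeg",
            "width": width,
            "height": height,
            "text": text,
            "append_to_context": True,
        })


sender = SnapshotSender(min_interval_sec=5.0)
fake_jpeg = b"\xff\xd8\xff\xe0 placeholder bytes, not a real image"
prompt = "Answer using this camera image."
first = sender.build(fake_jpeg, 640, 360, prompt, now=0.0)
second = sender.build(fake_jpeg, 640, 360, prompt, now=2.0)  # dropped
```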

Stop:

```json
{"type": "session.stop", "reason": "done"}
```

Cancel:

```json
{"type": "response.cancel"}
```

Returned bot audio:

```json
{
  "type": "response.audio.delta",
  "protocol": "va.ws.v1",
  "seq": 1,
  "audio": "<base64 pcm_s16le bytes>",
  "bytes": 6400,
  "sample_rate": 16000,
  "channels": 1
}
```
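
A client can decode such an event and derive its playback duration as sketched below. The sketch assumes that `bytes` reports the decoded PCM length; the source shows the field but does not define it. `decode_audio_delta` is a hypothetical helper name.

```python
import base64
import json
from typing import Tuple

BYTES_PER_SAMPLE = 2  # pcm_s16le is 16 bits per sample


def decode_audio_delta(raw: str) -> Tuple[bytes, float]:
    """Return (pcm_bytes, duration_seconds) for a response.audio.delta event."""
    event = json.loads(raw)
    pcm = base64.b64decode(event["audio"])
    # Assumption: "bytes" is the size of the decoded PCM payload.
    if len(pcm) != event["bytes"]:
        raise ValueError("decoded length does not match the 'bytes' field")
    bytes_per_second = BYTES_PER_SAMPLE * event["channels"] * event["sample_rate"]
    return pcm, len(pcm) / bytes_per_second


example = json.dumps({
    "type": "response.audio.delta",
    "protocol": "va.ws.v1",
    "seq": 1,
    "audio": base64.b64encode(bytes(6400)).decode("ascii"),
    "bytes": 6400,
    "sample_rate": 16000,
    "channels": 1,
})
pcm, seconds = decode_audio_delta(example)
print(len(pcm), seconds)
```

Under that assumption, the 6400-byte chunk in the example corresponds to 6400 / (2 × 16000) = 0.2 seconds of mono audio.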

Returned transcripts and assistant text:

```json
{"type": "input.transcript.interim", "text": "What's the"}
{"type": "input.transcript.final", "text": "What's the weather?", "user_id": "...", "timestamp": "..."}
{"type": "response.state", "state": "speaking"}
{"type": "response.text.started"}
{"type": "response.text.delta", "text": "It's "}
{"type": "response.text.delta", "text": "sunny in "}
{"type": "response.text.delta", "text": "Berlin."}
{"type": "response.text.final", "text": "It's sunny in Berlin.", "interrupted": false}
```

`response.text.started` fires at the start of every assistant turn (LLM
streaming reply, or a fixed `TTSSpeakFrame` greeting).
`response.text.delta` events stream LLM token chunks as they're produced,
**ahead of the synthesized audio**, because the producer sits upstream of
the TTS in the pipeline. `response.text.final` fires when the turn ends,
carrying the full concatenated assistant text and an `interrupted` flag
(true when an `input.text` or barge-in cut the turn short).
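
On the client, these three events can be folded into a per-turn text buffer. The sketch below is one possible handler, not the engine's code; the class name `TurnText` is hypothetical, and it treats the `response.text.final` text as authoritative once it arrives.

```python
from typing import List, Optional


class TurnText:
    """Accumulates assistant text for one turn from response.text.* events."""

    def __init__(self) -> None:
        self.parts: List[str] = []
        self.final: Optional[str] = None
        self.interrupted = False

    def handle(self, event: dict) -> None:
        kind = event["type"]
        if kind == "response.text.started":
            # A new assistant turn begins: reset the buffer.
            self.parts = []
            self.final = None
            self.interrupted = False
        elif kind == "response.text.delta":
            self.parts.append(event["text"])
        elif kind == "response.text.final":
            self.final = event["text"]
            self.interrupted = event["interrupted"]

    @property
    def text(self) -> str:
        """Final text if the turn has ended, else the deltas so far."""
        return self.final if self.final is not None else "".join(self.parts)


turn = TurnText()
for event in [
    {"type": "response.text.started"},
    {"type": "response.text.delta", "text": "It's "},
    {"type": "response.text.delta", "text": "sunny in "},
    {"type": "response.text.delta", "text": "Berlin."},
]:
    turn.handle(event)
streamed = turn.text  # text available before the turn has ended
turn.handle({"type": "response.text.final",
             "text": "It's sunny in Berlin.", "interrupted": False})
```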

When `agent.response_state.enabled` is true, an LLM response that starts with
`<state>...</state>` emits the tag body as `response.state` before the
remaining assistant text is streamed and spoken. If the tag is missing or
malformed, the original response text is streamed unchanged.
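
The tag handling can be illustrated on a complete response string. This is a simplified sketch: the engine works on streamed text, and details such as tolerating leading whitespace or trimming the tag body are assumptions made here, not documented behavior. `split_state` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# Matches a <state>...</state> tag only at the very start of the response.
STATE_TAG = re.compile(r"\A\s*<state>(.*?)</state>\s*", re.DOTALL)


def split_state(response: str) -> Tuple[Optional[str], str]:
    """Return (state, remaining_text); state is None if no leading tag."""
    match = STATE_TAG.match(response)
    if match is None:
        # Missing or malformed tag: pass the text through unchanged.
        return None, response
    return match.group(1).strip(), response[match.end():]


print(split_state("<state>speaking</state>It's sunny in Berlin."))
print(split_state("<state>speaking It's sunny in Berlin."))
```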

### Turn detection

User-turn segmentation (the VAD thresholds, plus how long to wait after
silence before declaring the turn done) is configurable per environment:

```json
"turn": {
  "vad": {
    "confidence": 0.7,
    "start_secs": 0.2,
    "stop_secs": 0.6,
    "min_volume": 0.6
  },
  "interruption_min_chars": 3,
  "interruption_use_interim": true,
  "interruption_short_replies": ["是的", "行", "可以"],
  "user_speech_timeout_sec": 1.0
}
```

- `vad.*` maps directly to `pipecat.audio.vad.vad_analyzer.VADParams` and
  controls the Silero VAD. `stop_secs` is the duration of silence required
  before VAD reports the user stopped speaking; raise it if VAD is
  cutting users off mid-clause, lower it for snappier turn-taking.
- `interruption_min_chars`, `interruption_use_interim`, and
  `interruption_short_replies` configure the custom turn-start gate used
  while the assistant is speaking. Short replies in the allowlist (for
  example, `是的` "yes", `行` "OK", `可以` "sure") can barge in immediately;
  other text must contain at least `interruption_min_chars` countable
  characters after punctuation and spaces are removed. This lets common
  yes/no answers through while filtering out brief background speech.
- `user_speech_timeout_sec` is the additional grace window (used by
  `SpeechTimeoutUserTurnStopStrategy`) during which the user may resume
  speaking before the aggregator finalizes the turn. The timer is
  re-armed every time the user resumes, so brief mid-sentence pauses do
  not split one utterance into multiple LLM turns.
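
The interruption gate described in the second bullet can be sketched as a pure function. This illustrates the rule as stated, not the engine's implementation: `should_interrupt` is a hypothetical name, matching the allowlist against the stripped text is an assumption, and `interruption_use_interim` is not modeled.

```python
import unicodedata
from typing import Sequence


def strip_uncounted(text: str) -> str:
    """Remove whitespace and Unicode punctuation from text."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def should_interrupt(
    text: str,
    min_chars: int = 3,
    short_replies: Sequence[str] = ("是的", "行", "可以"),
) -> bool:
    """Decide whether transcript text may barge in on the assistant."""
    stripped = strip_uncounted(text)
    if stripped in short_replies:
        return True  # allowlisted short reply: interrupt immediately
    return len(stripped) >= min_chars


for sample in ["行。", "嗯", "嗯, 啊", "等一下"]:
    print(sample, should_interrupt(sample))
```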

The total "user pause before turn ends" budget is approximately
`vad.stop_secs + user_speech_timeout_sec`. The repo defaults are tuned
slightly more conservatively than upstream Pipecat to keep streaming
ASRs (Xfyun in particular) from producing many short fragments per logical
utterance. Setting this stop strategy explicitly also replaces Pipecat's
default Smart Turn v3 analyzer, so the engine no longer loads the
`smart-turn-v3.*-cpu.onnx` model at startup.
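
To see what the pause budget means in practice, the sketch below groups speech segments into turns using only the approximate budget `vad.stop_secs + user_speech_timeout_sec`. It is a deliberately simplified model and does not reproduce Pipecat's timers; `group_turns` and the segment timings are made up for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (speech_start_sec, speech_end_sec)


def group_turns(
    segments: List[Segment],
    stop_secs: float = 0.6,
    user_speech_timeout_sec: float = 1.0,
) -> List[Segment]:
    """Merge speech segments whose gap is shorter than the pause budget."""
    budget = stop_secs + user_speech_timeout_sec
    turns: List[Segment] = []
    for start, end in segments:
        if turns and start - turns[-1][1] < budget:
            # The user resumed within the budget: extend the current turn.
            turns[-1] = (turns[-1][0], end)
        else:
            turns.append((start, end))
    return turns


speech = [(0.0, 1.2), (2.0, 3.0), (5.0, 6.0)]
print(group_turns(speech))                 # config shown above: 1.6 s budget
print(group_turns(speech, 0.2, 0.3))       # a tighter 0.5 s budget
```

With the values from the config above, the 0.8 s pause stays inside one turn and only the 2.0 s pause ends it; a tighter budget splits the same speech into three turns.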

### Xfyun ASR

The STT provider can be switched to iFlytek/Xfyun's streaming voice dictation
WebSocket API. The engine opens the Xfyun websocket when Pipecat VAD detects
the user has started speaking, keeps it open across brief pauses, and closes it
only when Pipecat's user-turn strategy declares the logical turn complete. It
sends PCM chunks as `encoding: "raw"` and emits `input.transcript.interim`
events with the current full interim transcript as Xfyun results arrive,
followed by the existing `input.transcript.final` event.

```json
"stt": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://iat-api.xfyun.cn/v2/iat",
  "language": "zh_cn",
  "domain": "iat",
  "accent": "mandarin",
  "encoding": "raw",
  "frame_size": 1280,
  "timeout_sec": 10.0
}
```
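
Assuming `frame_size` is measured in bytes (the config does not state the unit), a 1280-byte frame of 16 kHz PCM16 mono holds 40 ms of audio. The sketch below shows that arithmetic and one way to slice a PCM buffer into frames; both helper names are hypothetical and this is not the engine's sender.

```python
from typing import Iterator


def frame_duration_ms(frame_size: int = 1280, sample_rate: int = 16000,
                      channels: int = 1) -> float:
    """Duration of one PCM16 frame in milliseconds (frame_size in bytes)."""
    bytes_per_second = 2 * channels * sample_rate  # 2 bytes per sample
    return frame_size * 1000 / bytes_per_second


def iter_frames(pcm: bytes, frame_size: int = 1280) -> Iterator[bytes]:
    """Yield consecutive frame_size-byte slices; the last one may be shorter."""
    for offset in range(0, len(pcm), frame_size):
        yield pcm[offset:offset + frame_size]


one_second = bytes(32000)  # 1 s of 16 kHz PCM16 mono silence
frames = list(iter_frames(one_second))
print(frame_duration_ms(), len(frames))
```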

Credentials may also be provided through `XFYUN_APP_ID`, `XFYUN_API_KEY`, and
`XFYUN_API_SECRET`.

### Xfyun TTS

The TTS provider can be switched to iFlytek/Xfyun's online TTS WebSocket API.
The engine requests `aue: "raw"` PCM audio so the returned chunks can be sent
through the existing Pipecat/product audio path without MP3 decoding.

```json
"tts": {
  "provider": "xfyun",
  "app_id": "your_xfyun_app_id",
  "api_key": "your_xfyun_api_key",
  "api_secret": "your_xfyun_api_secret",
  "base_url": "wss://tts-api.xfyun.cn/v2/tts",
  "voice": "x4_xiaoyan",
  "aue": "raw",
  "tte": "UTF8",
  "speed": 50,
  "volume": 50,
  "pitch": 50,
  "source_sample_rate_hz": 16000
}
```

Credentials may also be provided through `XFYUN_APP_ID`, `XFYUN_API_KEY`, and
`XFYUN_API_SECRET`.

## Notes

This folder keeps the rewrite minimal on purpose. The pattern carried over from Dograh is the separation between app entrypoint, service factory, and pipeline builder; Dograh's workflow graph, database, telephony, recordings, and pricing layers are deliberately left out.