AI-VideoAssistant-Engine-V2/docs/ws_v1_schema.md
2026-02-23 17:16:18 +08:00


# WS v1 Protocol Schema (`/ws`)
This document defines the public WebSocket protocol for the `/ws` endpoint.
## Transport
- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16kHz by default).
## Handshake and State Machine
Required message order:
1. Client sends `hello`.
2. Server replies `hello.ack`.
3. Client sends `session.start`.
4. Server replies `session.started`.
5. Client may stream binary audio and/or send `input.text`.
6. Client sends `session.stop` (or closes socket).
If order is violated, server emits `error` with `code = "protocol.order"`.
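
The required order can be modeled as a small state machine. A minimal sketch (hypothetical helper, not part of the published server code; state names and the `ValueError` carrier are illustrative, only the `protocol.order` code comes from the spec):

```python
# Order enforcement for the v1 handshake. Each state maps to the set of
# client message types that are legal next.
EXPECTED = {
    "start": {"hello"},
    "greeted": {"session.start"},
    "active": {"input.text", "response.cancel", "session.stop",
               "tool_call.results"},
}
TRANSITIONS = {"hello": "greeted", "session.start": "active"}


def next_state(state: str, msg_type: str) -> str:
    """Advance the handshake state machine, or raise a protocol.order error."""
    if msg_type not in EXPECTED[state]:
        raise ValueError(f'protocol.order: got "{msg_type}" in state "{state}"')
    # Messages without an entry in TRANSITIONS keep the session in its state.
    return TRANSITIONS.get(msg_type, state)
```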
## Client -> Server Messages
### `hello`
```json
{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}
```
Rules:
- `version` must be `v1`.
- If `WS_API_KEY` is configured on server, `auth.apiKey` must match.
- If `WS_REQUIRE_AUTH=true`, either `auth.apiKey` or `auth.jwt` must be present.
### `session.start`
```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "appId": "assistant_123",
    "channel": "web",
    "configVersionId": "cfg_20260217_01",
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?"
  }
}
```
Rules:
- Client-side `metadata.services` is ignored.
- Service config (including secrets) is resolved server-side (env/backend).
- Client should pass stable IDs (`appId`, `channel`, `configVersionId`) plus small runtime overrides (e.g. `output`, `bargeIn`, greeting/prompt style hints).
Text-only mode:
- Set `metadata.output.mode = "text"`.
- In this mode server still sends `assistant.response.delta/final`, but will not emit audio frames or `output.audio.start/end`.
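
A client-side sketch of building this frame with the MVP audio format (hypothetical convenience helper; only fields defined by the schema are emitted):

```python
import json


def build_session_start(app_id: str, channel: str, config_version_id: str,
                        text_only: bool = False) -> str:
    """Build a v1 session.start JSON frame with the fixed MVP audio format."""
    msg = {
        "type": "session.start",
        # MVP audio format: pcm_s16le, mono, 16 kHz (see Binary Audio Frames).
        "audio": {"encoding": "pcm_s16le", "sample_rate_hz": 16000,
                  "channels": 1},
        "metadata": {
            "appId": app_id,
            "channel": channel,
            "configVersionId": config_version_id,
            # Text-only mode suppresses audio frames and output.audio.* events.
            "output": {"mode": "text" if text_only else "audio"},
        },
    }
    return json.dumps(msg)
```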
### `input.text`
```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```
### `response.cancel`
```json
{
  "type": "response.cancel",
  "graceful": false
}
```
### `session.stop`
```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```
### `tool_call.results`
Client tool execution results returned to server.
Only needed when `assistant.tool_call.executor == "client"` (default execution is server-side).
```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
```
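
A sketch of the client-side executor loop that produces this frame (hypothetical helper; the handler registry and `404` status for an unknown tool are assumptions, the frame shape follows the schema above):

```python
import json


def run_client_tools(tool_calls: list[dict], handlers: dict) -> str:
    """Execute assistant.tool_call requests with executor == "client" locally
    and build the tool_call.results reply frame."""
    results = []
    for call in tool_calls:
        fn = handlers.get(call["tool_name"])
        if fn is None:
            # Unknown tool: report a non-200 status instead of crashing.
            results.append({"tool_call_id": call["tool_call_id"],
                            "name": call["tool_name"],
                            "output": None,
                            "status": {"code": 404, "message": "unknown tool"}})
            continue
        results.append({"tool_call_id": call["tool_call_id"],
                        "name": call["tool_name"],
                        "output": fn(call.get("arguments") or {}),
                        "status": {"code": 200, "message": "ok"}})
    return json.dumps({"type": "tool_call.results", "results": results})
```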
## Server -> Client Events
All server events include an envelope:
```json
{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}
```
Envelope notes:
- `seq` is monotonically increasing within one session (for replay/resume).
- `source` is one of: `asr | llm | tts | tool | system`.
- `data` is structured payload; legacy top-level fields are kept for compatibility.
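
Since `seq` is only guaranteed to be monotonically increasing (not gap-free), a client can use it to discard stale or duplicate events, e.g. after a replay on resume. An illustrative sketch:

```python
def accept_event(last_seq: int, event: dict) -> tuple[int, bool]:
    """Filter server events by the envelope's per-session seq.

    Returns (new_last_seq, accepted). Anything at or below last_seq is a
    duplicate or out-of-order replay and can be ignored.
    """
    seq = event["seq"]
    if seq <= last_seq:
        return last_seq, False  # stale: already processed
    return seq, True
```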
Common events:
- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `tracks`, `audio`
- `config.resolved`
  - Fields: `sessionId`, `trackId`, `config`
  - Sent immediately after `session.started`.
  - Contains effective model/voice/output/tool allowlist/prompt hash, and never includes secrets.
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call`, `tool_call_id`, `tool_name`, `arguments`, `executor`, `timeout_ms`
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`, `tool_call_id`, `tool_name`, `ok`, `error`
  - `error`: `{ code, message, retryable }` when `ok=false`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
Track IDs (MVP fixed values):
- `audio_in`: ASR/VAD input-side events (`input.*`, `transcript.*`)
- `audio_out`: assistant output-side events (`assistant.*`, `output.audio.*`, `response.interrupted`, `metrics.ttfb`)
- `control`: session/control events (`session.*`, `hello.*`, `error`, `config.resolved`)
Correlation IDs (`event.data`):
- `turn_id`: one user-assistant interaction turn.
- `utterance_id`: one ASR final utterance.
- `response_id`: one assistant response generation.
- `tool_call_id`: one tool invocation.
- `tts_id`: one TTS playback segment.
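
A client can use `data.turn_id` to assemble a per-turn transcript view from the event stream. A minimal sketch (hypothetical helper; events without a `turn_id`, such as `heartbeat`, are bucketed under `None`):

```python
from collections import defaultdict


def group_by_turn(events: list[dict]) -> dict:
    """Group server event types into interaction turns via data.turn_id."""
    turns = defaultdict(list)
    for ev in events:
        turn_id = (ev.get("data") or {}).get("turn_id")
        turns[turn_id].append(ev["type"])
    return dict(turns)
```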
## Binary Audio Frames
After `session.started`, client may send binary PCM chunks continuously.
MVP fixed format:
- 16-bit signed little-endian PCM (`pcm_s16le`)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes
Framing rules:
- Binary audio frame unit is 640 bytes.
- A WS binary message may carry one or multiple complete 640-byte frames.
- Non-640-multiple payloads are treated as `audio.frame_size_mismatch` protocol errors.
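
The framing rules reduce to one divisibility check. A sketch of splitting an incoming WS binary message into frames (illustrative; only the 640-byte unit and the `audio.frame_size_mismatch` code come from the spec):

```python
# 20 ms of pcm_s16le mono at 16 kHz: 16000 samples/s * 0.020 s * 2 bytes = 640.
FRAME_BYTES = 640


def split_frames(payload: bytes) -> list[bytes]:
    """Split a WS binary message into complete 640-byte frames.

    Payloads that are not a multiple of 640 bytes are a protocol error.
    """
    if len(payload) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES]
            for i in range(0, len(payload), FRAME_BYTES)]
```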
TTS boundary events:
- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.
## Event Throttling
To keep client rendering and server load stable, v1 applies the following cadences (server defaults noted; clients should expect deltas to arrive pre-merged):
- `transcript.delta`: merge to ~200-500ms cadence (server default: 300ms).
- `assistant.response.delta`: merge to ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond `metrics.ttfb`): emit every ~500-1000ms.
## Error Structure
`error` keeps legacy top-level fields (`code`, `message`) and adds structured info:
- `stage`: `protocol | asr | llm | tts | tool | audio`
- `retryable`: boolean
- `data.error`: `{ stage, code, message, retryable }`
## Compatibility
This endpoint now enforces v1 message schema for JSON control frames.
Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.