# WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the `/ws` endpoint.

Validation policy:
- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- `assistant_id` query parameter is required on `/ws`.

## Transport

- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16 kHz by default).

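As a rough illustration of the split between text and binary frames, a receiver might route each incoming WebSocket message by its Python type. The function name `classify_frame` and the returned tuple shape are hypothetical and not part of the protocol.

```python
import json
from typing import Tuple, Union


def classify_frame(frame: Union[str, bytes, bytearray]) -> Tuple[str, object]:
    """Route one WebSocket message (hypothetical helper, not part of the spec).

    Binary frames are treated as raw PCM audio; text frames must decode
    to a JSON object that carries a "type" field.
    """
    if isinstance(frame, (bytes, bytearray)):
        return ("audio", bytes(frame))
    message = json.loads(frame)
    if not isinstance(message, dict) or "type" not in message:
        raise ValueError("control frame must be a JSON object with a 'type' field")
    return ("control", message)
```

Most WebSocket libraries already deliver text frames as `str` and binary frames as `bytes`, so the type check is usually enough to tell the two apart.
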
## Handshake and State Machine

Required message order:

1. Client connects to `/ws?assistant_id=<id>`.
2. Client sends `session.start`.
3. Server replies `session.started`.
4. Client may stream binary audio and/or send `input.text`, `response.cancel`, `output.audio.played`, `tool_call.results`.
5. Client sends `session.stop` (or closes socket).

If order is violated, server emits `error` with `code = "protocol.order"`.

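The ordering rule above can be sketched as a small state machine over client message types. This is a minimal sketch, not the server's actual implementation: the class name `OrderChecker`, the state names, and the decision to treat a repeated `session.start` or any unlisted type as an order violation are assumptions.

```python
from typing import Optional

# Client message types the document allows between session.start and session.stop.
IN_SESSION_TYPES = {
    "input.text",
    "response.cancel",
    "output.audio.played",
    "tool_call.results",
}


class OrderChecker:
    """Hypothetical order validator for client JSON messages."""

    def __init__(self) -> None:
        self.state = "awaiting_start"

    def check(self, msg_type: str) -> Optional[str]:
        """Return None when the message is allowed, else the error code."""
        if self.state == "awaiting_start" and msg_type == "session.start":
            self.state = "active"
            return None
        if self.state == "active":
            if msg_type == "session.stop":
                self.state = "stopped"
                return None
            if msg_type in IN_SESSION_TYPES:
                return None
        # Assumption: everything else (including a second session.start)
        # is reported as an ordering violation.
        return "protocol.order"
```
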
## Client -> Server Messages

### `session.start`

```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "channel": "web",
    "source": "web-debug",
    "history": {
      "userId": 1
    },
    "overrides": {
      "output": {
        "mode": "audio"
      },
      "systemPrompt": "You are concise.",
      "greeting": "Hi, how can I help?"
    },
    "dynamicVariables": {
      "customer_name": "Alice",
      "plan_tier": "Pro"
    }
  }
}
```

Rules:
- Assistant config is resolved strictly by URL query `assistant_id`.
- `metadata` top-level keys allowed: `overrides`, `dynamicVariables`, `channel`, `source`, `history`, `workflow` (`workflow` is ignored).
- `metadata.overrides` whitelist: `systemPrompt`, `greeting`, `firstTurnMode`, `generatedOpenerEnabled`, `output`, `bargeIn`, `knowledgeBaseId`, `knowledge`, `tools`, `openerAudio`.
- `metadata.services` is rejected with `protocol.invalid_override`.
- `metadata.workflow` is ignored in this MVP protocol version.
- Top-level IDs are forbidden in payload (`assistantId`, `appId`, `app_id`, `configVersionId`, `config_version_id`).
- Secret-like keys are forbidden in metadata (`apiKey`, `token`, `secret`, `password`, `authorization`).
- `metadata.dynamicVariables` is optional and must be an object of string key/value pairs.
  - Key pattern: `^[a-zA-Z_][a-zA-Z0-9_]{0,63}$`
  - Max entries: 30
  - Max value length: 1000 chars
- Placeholder format in `systemPrompt` and `greeting`: `{{variable_name}}`.
- Built-in system variables (always available): `{{system__time}}`, `{{system_utc}}`, `{{system_timezone}}`.
  - `system__time`: current local time (`YYYY-MM-DD HH:mm:ss`)
  - `system_utc`: current UTC time (`YYYY-MM-DD HH:mm:ss`)
  - `system_timezone`: current local timezone
- Missing referenced placeholders reject `session.start` with `protocol.dynamic_variables_missing`.
- Invalid `dynamicVariables` payload rejects `session.start` with `protocol.dynamic_variables_invalid`.

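The `dynamicVariables` limits and the placeholder check listed above could be pre-validated on the client before sending `session.start`. The following is a sketch under stated assumptions: the function names are hypothetical, the placeholder regex assumes no whitespace inside the braces, and the server's real validation may differ in detail.

```python
import re
from typing import List, Optional

KEY_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
# Assumption: placeholders are written exactly as {{name}} with no inner spaces.
PLACEHOLDER = re.compile(r"\{\{([a-zA-Z_][a-zA-Z0-9_]{0,63})\}\}")
BUILTINS = {"system__time", "system_utc", "system_timezone"}
MAX_ENTRIES = 30
MAX_VALUE_LEN = 1000


def validate_dynamic_variables(variables: object) -> Optional[str]:
    """Return None when the payload passes, else the documented error code."""
    if variables is None:  # the field is optional
        return None
    if not isinstance(variables, dict) or len(variables) > MAX_ENTRIES:
        return "protocol.dynamic_variables_invalid"
    for key, value in variables.items():
        if not isinstance(key, str) or not KEY_PATTERN.fullmatch(key):
            return "protocol.dynamic_variables_invalid"
        if not isinstance(value, str) or len(value) > MAX_VALUE_LEN:
            return "protocol.dynamic_variables_invalid"
    return None


def find_missing_placeholders(template: str, variables: dict) -> List[str]:
    """List placeholder names that are neither supplied nor built in."""
    names = set(PLACEHOLDER.findall(template))
    return sorted(n for n in names if n not in variables and n not in BUILTINS)
```

A non-empty result from `find_missing_placeholders` corresponds to the case the server rejects with `protocol.dynamic_variables_missing`.
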
Text-only mode:
- Set `metadata.overrides.output.mode = "text"`.
- In this mode the server still sends `assistant.response.delta` and `assistant.response.final`, but does not emit audio frames or `output.audio.start`/`output.audio.end`.

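For illustration, a client could build a text-only `session.start` as below. The helper name `build_text_only_start` is hypothetical, and keeping the `audio` block in text mode simply mirrors the example payload above; this document does not state whether it is required.

```python
import json
from typing import Optional


def build_text_only_start(system_prompt: Optional[str] = None) -> str:
    """Serialize a session.start that asks for text-only output (sketch)."""
    overrides = {"output": {"mode": "text"}}
    if system_prompt is not None:
        overrides["systemPrompt"] = system_prompt
    message = {
        "type": "session.start",
        # Assumption: the audio block is still sent in text-only mode.
        "audio": {"encoding": "pcm_s16le", "sample_rate_hz": 16000, "channels": 1},
        "metadata": {"overrides": overrides},
    }
    return json.dumps(message)
```
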
### `input.text`

```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```

### `response.cancel`

```json
{
  "type": "response.cancel",
  "graceful": false
}
```

### `output.audio.played`

Client playback ACK, sent after assistant audio has actually drained on the local speakers (including jitter buffer / playback queue).

```json
{
  "type": "output.audio.played",
  "tts_id": "tts_001",
  "response_id": "resp_001",
  "turn_id": "turn_001",
  "played_at_ms": 1730000018450,
  "played_ms": 2520
}
```

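A client might assemble this ACK once its playback queue reports that it is empty. The sketch below assumes that `played_at_ms` is a Unix epoch timestamp in milliseconds and that `played_ms` is the duration of audio actually played; both readings are inferred from the example values rather than stated here. The helper name is hypothetical.

```python
import json
import time
from typing import Optional


def build_audio_played(
    tts_id: str,
    response_id: str,
    turn_id: str,
    played_ms: int,
    played_at_ms: Optional[int] = None,
) -> str:
    """Serialize an output.audio.played ACK (hypothetical helper)."""
    if played_at_ms is None:
        # Assumption: wall-clock epoch milliseconds at drain time.
        played_at_ms = int(time.time() * 1000)
    return json.dumps(
        {
            "type": "output.audio.played",
            "tts_id": tts_id,
            "response_id": response_id,
            "turn_id": turn_id,
            "played_at_ms": played_at_ms,
            "played_ms": played_ms,
        }
    )
```

The three IDs would normally be copied from the correlation IDs on the server events for the same TTS segment.
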
### `session.stop`

```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```

### `tool_call.results`

Client tool execution results returned to server.
Only needed when `assistant.tool_call.executor == "client"` (default execution is server-side).

```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
```

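As a sketch of the client-side round trip, the helper below wraps one tool outcome in a `tool_call.results` message. The function name is hypothetical, and the failure status (`code: 500`) is an assumption; the document only shows the success shape.

```python
import json


def build_tool_results(tool_call_id: str, name: str, output: dict,
                       ok: bool = True, error_message: str = "error") -> str:
    """Serialize a single-result tool_call.results message (sketch)."""
    if ok:
        status = {"code": 200, "message": "ok"}
    else:
        # Assumption: failures reuse the same status object with a non-200 code.
        status = {"code": 500, "message": error_message}
    result = {
        "tool_call_id": tool_call_id,  # copied from the assistant.tool_call event
        "name": name,
        "output": output,
        "status": status,
    }
    return json.dumps({"type": "tool_call.results", "results": [result]})
```
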
## Server -> Client Events

All server events include an envelope:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}
```

Envelope notes:
- `seq` is monotonically increasing within one session (for replay/resume).
- `source` is one of: `asr | llm | tts | tool | system | client | server`.
- For `assistant.tool_result`, `source` may be `client` or `server` to indicate execution side.
- `data` is structured payload; legacy top-level fields are kept for compatibility.

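Because `seq` only increases within a session, a client can use it to discard duplicates or stale events, for example after a reconnect. This is a minimal sketch with a hypothetical class name; it deliberately does not assume that `seq` advances by exactly one, since the document only promises a monotonic increase.

```python
from typing import Optional


class SeqTracker:
    """Track the highest envelope seq seen in one session (sketch)."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def accept(self, event: dict) -> bool:
        """Return True for a new event, False for a duplicate or stale one."""
        seq = event["seq"]
        if self.last_seq is not None and seq <= self.last_seq:
            return False
        self.last_seq = seq
        return True
```
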
Common events:

- `session.started`
  - Fields: `sessionId`, `trackId`, `tracks`, `audio`
- `config.resolved`
  - Fields: `sessionId`, `trackId`, `config`
  - Optional debug event. Disabled by default (`ws_emit_config_resolved=false`).
  - `config` is SaaS-safe and public-only: `channel` (if provided), `output.mode`, `tools.enabled`, `tools.count`, `tracks`.
  - Must not expose internal IDs or runtime internals (`assistantId/appId/configVersionId/services/provider/model/baseUrl/systemPrompt`).
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call`, `tool_call_id`, `tool_name`, `arguments`, `executor`, `timeout_ms`
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`, `tool_call_id`, `tool_name`, `ok`, `error`
  - `error`: `{ code, message, retryable }` when `ok=false`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
  - `trackId` convention:
    - `audio_in` for `stage in {audio, asr}`
    - `audio_out` for `stage in {llm, tts, tool}`
    - `control` otherwise (including protocol errors)

Track IDs (MVP fixed values):
- `audio_in`: ASR/VAD input-side events (`input.*`, `transcript.*`)
- `audio_out`: assistant output-side events (`assistant.*`, `output.audio.*`, `response.interrupted`, `metrics.ttfb`)
- `control`: session/control events (`session.*`, `error`, optional `config.resolved`)

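The track assignment above, together with the stage-based convention for `error` events, can be expressed as two small lookup functions. The function names are hypothetical, and routing `heartbeat` to `control` is an assumption, since it is not listed under any track.

```python
def track_for_event(event_type: str) -> str:
    """Map a non-error event type to its fixed MVP track ID (sketch)."""
    if event_type.startswith(("input.", "transcript.")):
        return "audio_in"
    if event_type.startswith(("assistant.", "output.audio.")):
        return "audio_out"
    if event_type in {"response.interrupted", "metrics.ttfb"}:
        return "audio_out"
    # session.*, config.resolved, and (by assumption) heartbeat land here.
    return "control"


def track_for_error(stage: str) -> str:
    """Map an error's stage to the track ID used on the error event."""
    if stage in {"audio", "asr"}:
        return "audio_in"
    if stage in {"llm", "tts", "tool"}:
        return "audio_out"
    return "control"  # includes protocol errors
```
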
Correlation IDs (`event.data`):
- `turn_id`: one user-assistant interaction turn.
- `utterance_id`: one ASR final utterance.
- `response_id`: one assistant response generation.
- `tool_call_id`: one tool invocation.
- `tts_id`: one TTS playback segment.

## Binary Audio Frames

After `session.started`, client may send binary PCM chunks continuously.

MVP fixed format:
- 16-bit signed little-endian PCM (`pcm_s16le`)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes

Framing rules:
- Binary audio frame unit is 640 bytes.
- A WS binary message may carry one or multiple complete 640-byte frames.
- Non-640-multiple payloads are rejected as `audio.frame_size_mismatch`; that WS message is dropped (no partial buffering/reassembly).

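The 640-byte figure follows from the fixed format: 16000 samples/s × 0.020 s × 2 bytes per sample × 1 channel = 640 bytes. The sketch below derives that constant, splits an incoming payload the way the framing rules describe, and chunks outgoing PCM into whole frames. The function names are hypothetical, and zero-padding a trailing partial frame with silence is an assumption about client behavior, not something this document specifies.

```python
from typing import Iterator, List

SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # pcm_s16le
CHANNELS = 1
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS


def split_frames(payload: bytes) -> List[bytes]:
    """Split one WS binary message into 640-byte frames, or reject it whole."""
    if len(payload) % FRAME_BYTES != 0:
        # The whole message is dropped; nothing is buffered for reassembly.
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES] for i in range(0, len(payload), FRAME_BYTES)]


def chunk_pcm(pcm: bytes, frames_per_message: int = 1) -> Iterator[bytes]:
    """Yield client messages that each hold only complete frames."""
    remainder = len(pcm) % FRAME_BYTES
    if remainder:
        # Assumption: pad the final partial frame with silence.
        pcm += b"\x00" * (FRAME_BYTES - remainder)
    step = FRAME_BYTES * frames_per_message
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]
```
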
TTS boundary events:
- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.
- `output.audio.end` means server-side audio send completed (not guaranteed speaker drain).
- For speaker-drain confirmation, client should send `output.audio.played`.

## Event Throttling

To keep client rendering and server load stable, v1 applies/recommends:
- `transcript.delta`: merge to ~200-500ms cadence (server default: 300ms).
- `assistant.response.delta`: merge to ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond `metrics.ttfb`): emit every ~500-1000ms.

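One way to picture the merging cadence is a small buffer that concatenates deltas and releases them once the interval has elapsed. This is an illustrative sketch, not the server's implementation: the class name is hypothetical, timestamps are passed in by the caller, and it assumes deltas are incremental text chunks (as opposed to cumulative snapshots), which this document does not state.

```python
from typing import List, Optional


class DeltaCoalescer:
    """Merge small text deltas into one emission per interval (sketch)."""

    def __init__(self, interval_ms: int) -> None:
        self.interval_ms = interval_ms
        self._buffer: List[str] = []
        self._last_flush_ms: Optional[int] = None

    def push(self, text: str, now_ms: int) -> Optional[str]:
        """Buffer a delta; return merged text once the interval has elapsed."""
        self._buffer.append(text)
        if self._last_flush_ms is None:
            self._last_flush_ms = now_ms
        if now_ms - self._last_flush_ms >= self.interval_ms:
            return self.flush(now_ms)
        return None

    def flush(self, now_ms: int) -> str:
        """Emit whatever is buffered, e.g. right before a final event."""
        merged = "".join(self._buffer)
        self._buffer.clear()
        self._last_flush_ms = now_ms
        return merged
```

With the 80ms default for `assistant.response.delta`, three chunks arriving at 0, 40, and 80 ms would be sent as a single merged delta at the 80 ms mark.
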
## Error Structure

`error` keeps legacy top-level fields (`code`, `message`) and adds structured info:
- `stage`: `protocol | asr | llm | tts | tool | audio`
- `retryable`: boolean
- `data.error`: `{ stage, code, message, retryable }`

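A client that wants one code path for both shapes could prefer `data.error` and fall back to the legacy top-level fields. The helper name `parse_error` is hypothetical, and defaulting `retryable` to `False` when it is absent is an assumption.

```python
def parse_error(event: dict) -> dict:
    """Normalize an error event, preferring data.error over legacy fields."""
    structured = (event.get("data") or {}).get("error") or {}
    return {
        "stage": structured.get("stage", event.get("stage")),
        "code": structured.get("code", event.get("code")),
        "message": structured.get("message", event.get("message")),
        # Assumption: a missing flag is treated as not retryable.
        "retryable": bool(structured.get("retryable", event.get("retryable", False))),
    }
```
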
## Compatibility

This endpoint now enforces v1 message schema for JSON control frames.
Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.