# WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the `/ws` endpoint.

Validation policy:

- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- `hello.version` is fixed to `"v1"`.

## Transport

- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16 kHz by default).

## Handshake and State Machine

Required message order:

1. Client sends `hello`.
2. Server replies `hello.ack`.
3. Client sends `session.start`.
4. Server replies `session.started`.
5. Client may stream binary audio and/or send `input.text`.
6. Client sends `session.stop` (or closes socket).

If order is violated, server emits `error` with `code = "protocol.order"`.

## Client -> Server Messages

### `hello`

```json
{
  "type": "hello",
  "version": "v1"
}
```

Rules:

- `version` must be `v1`.

### `session.start`

```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "appId": "assistant_123",
    "channel": "web",
    "configVersionId": "cfg_20260217_01",
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "dynamicVariables": {
      "customer_name": "Alice",
      "plan_tier": "Pro"
    }
  }
}
```

Rules:

- Client-side `metadata.services` is ignored.
- Service config (including secrets) is resolved server-side (env/backend).
- Client should pass stable IDs (`appId`, `channel`, `configVersionId`) plus small runtime overrides (e.g. `output`, `bargeIn`, greeting/prompt style hints).
- `metadata.dynamicVariables` is optional and must be an object of string key/value pairs.
  - Key pattern: `^[a-zA-Z_][a-zA-Z0-9_]{0,63}$`
  - Max entries: 30
  - Max value length: 1000 chars
- Placeholder format in `systemPrompt` and `greeting`: `{{variable_name}}`.
- Built-in system variables (always available): `{{system__time}}`, `{{system_utc}}`, `{{system_timezone}}`.
  - `system__time`: current local time (`YYYY-MM-DD HH:mm:ss`)
  - `system_utc`: current UTC time (`YYYY-MM-DD HH:mm:ss`)
  - `system_timezone`: current local timezone
- Missing referenced placeholders reject `session.start` with `protocol.dynamic_variables_missing`.
- Invalid `dynamicVariables` payload rejects `session.start` with `protocol.dynamic_variables_invalid`.

Text-only mode:

- Set `metadata.output.mode = "text"`.
- In this mode server still sends `assistant.response.delta/final`, but will not emit audio frames or `output.audio.start/end`.

### `input.text`

```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```

### `response.cancel`

```json
{
  "type": "response.cancel",
  "graceful": false
}
```

### `session.stop`

```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```

### `tool_call.results`

Client tool execution results returned to server.
Only needed when `assistant.tool_call.executor == "client"` (default execution is server-side).

```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": {
        "temp_c": 21,
        "condition": "sunny"
      },
      "status": {
        "code": 200,
        "message": "ok"
      }
    }
  ]
}
```

## Server -> Client Events

All server events include an envelope:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}
```

Envelope notes:

- `seq` is monotonically increasing within one session (for replay/resume).
- `source` is one of: `asr | llm | tts | tool | system | client | server`.
- For `assistant.tool_result`, `source` may be `client` or `server` to indicate execution side.
- `data` is structured payload; legacy top-level fields are kept for compatibility.
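
A client can use the envelope's `seq` to drop duplicated or out-of-order events. The following is a minimal sketch, not the server's or any SDK's implementation. It assumes `seq` is strictly increasing per `sessionId` (the envelope notes say only "monotonically increasing"), and the names `SeqTracker` and `accept` are hypothetical.

```python
import json
from typing import Dict


class SeqTracker:
    """Remembers the highest `seq` seen per session (hypothetical client helper)."""

    def __init__(self) -> None:
        self._last: Dict[str, int] = {}

    def accept(self, raw: str) -> bool:
        """Return True if the event advances `seq` for its session, else False."""
        event = json.loads(raw)
        session_id = event["sessionId"]
        seq = event["seq"]
        last = self._last.get(session_id)
        if last is not None and seq <= last:
            # Assumed policy: treat a repeated or lower `seq` as stale and drop it.
            return False
        self._last[session_id] = seq
        return True
```

Whether a gap in `seq` should trigger a replay request is not specified here, so the sketch only filters stale events.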

Common events:

- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `tracks`, `audio`
- `config.resolved`
  - Fields: `sessionId`, `trackId`, `config`
  - Sent immediately after `session.started`.
  - Contains effective model/voice/output/tool allowlist/prompt hash, and never includes secrets.
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call`, `tool_call_id`, `tool_name`, `arguments`, `executor`, `timeout_ms`
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`, `tool_call_id`, `tool_name`, `ok`, `error`
  - `error`: `{ code, message, retryable }` when `ok=false`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
  - `trackId` convention:
    - `audio_in` for `stage in {audio, asr}`
    - `audio_out` for `stage in {llm, tts, tool}`
    - `control` otherwise (including protocol errors)

Track IDs (MVP fixed values):

- `audio_in`: ASR/VAD input-side events (`input.*`, `transcript.*`)
- `audio_out`: assistant output-side events (`assistant.*`, `output.audio.*`, `response.interrupted`, `metrics.ttfb`)
- `control`: session/control events (`session.*`, `hello.*`, `error`, `config.resolved`)

Correlation IDs (`event.data`):

- `turn_id`: one user-assistant interaction turn.
- `utterance_id`: one ASR final utterance.
- `response_id`: one assistant response generation.
- `tool_call_id`: one tool invocation.
- `tts_id`: one TTS playback segment.
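
The track ID conventions above can be written as a small lookup, which may help when checking that a received `trackId` matches its event type. This is a sketch under stated assumptions: the function names `event_track` and `error_track` are hypothetical, and event types the lists do not assign to a track (such as `heartbeat`) are reported as `None` rather than guessed.

```python
from typing import Optional

AUDIO_IN_STAGES = {"audio", "asr"}
AUDIO_OUT_STAGES = {"llm", "tts", "tool"}


def error_track(stage: str) -> str:
    """Track ID for an `error` event, following the per-stage convention."""
    if stage in AUDIO_IN_STAGES:
        return "audio_in"
    if stage in AUDIO_OUT_STAGES:
        return "audio_out"
    return "control"  # everything else, including protocol errors


def event_track(event_type: str) -> Optional[str]:
    """Track ID for a non-error event type, or None if the type is not listed.

    For `error` events, use error_track() with the error's stage instead.
    """
    if event_type.startswith(("input.", "transcript.")):
        return "audio_in"
    if event_type.startswith(("assistant.", "output.audio.")) or event_type in (
        "response.interrupted",
        "metrics.ttfb",
    ):
        return "audio_out"
    if event_type.startswith(("session.", "hello.")) or event_type == "config.resolved":
        return "control"
    return None  # not assigned to a track in the lists above
```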

## Binary Audio Frames

After `session.started`, client may send binary PCM chunks continuously.

MVP fixed format:

- 16-bit signed little-endian PCM (`pcm_s16le`)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes

Framing rules:

- Binary audio frame unit is 640 bytes.
- A WS binary message may carry one or multiple complete 640-byte frames.
- Non-640-multiple payloads are rejected as `audio.frame_size_mismatch`; that WS message is dropped (no partial buffering/reassembly).

TTS boundary events:

- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.

## Event Throttling

To keep client rendering and server load stable, v1 applies/recommends:

- `transcript.delta`: merge to ~200-500ms cadence (server default: 300ms).
- `assistant.response.delta`: merge to ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond `metrics.ttfb`): emit every ~500-1000ms.

## Error Structure

`error` keeps legacy top-level fields (`code`, `message`) and adds structured info:

- `stage`: `protocol | asr | llm | tts | tool | audio`
- `retryable`: boolean
- `data.error`: `{ stage, code, message, retryable }`

## Compatibility

This endpoint now enforces v1 message schema for JSON control frames.
Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.
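
As a worked check of the 640-byte figure in the Binary Audio Frames section: 16000 samples/s × 0.020 s × 1 channel × 2 bytes per sample = 640 bytes. The sketch below derives that constant and applies the framing rule to one WS binary message. It is illustrative only: `split_frames` is a hypothetical helper, and returning an empty list for an empty payload is an assumption, since this document does not say how empty messages are handled.

```python
from typing import List

SAMPLE_RATE_HZ = 16000
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # 16-bit signed PCM
FRAME_MS = 20

# 16000 * 20 / 1000 = 320 samples per frame; 320 * 1 * 2 = 640 bytes.
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * CHANNELS * BYTES_PER_SAMPLE


def split_frames(payload: bytes) -> List[bytes]:
    """Split one WS binary message into complete 640-byte frames.

    Raises ValueError when the payload length is not a multiple of the
    frame size, mirroring `audio.frame_size_mismatch`: the whole message
    is rejected and nothing is buffered for later reassembly.
    """
    if len(payload) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES] for i in range(0, len(payload), FRAME_BYTES)]
```

A sender that buffers microphone audio can use the same constant to emit only whole frames, for example by sending `len(buf) // FRAME_BYTES * FRAME_BYTES` bytes at a time and keeping the remainder.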