# WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the `/ws` endpoint.

## Transport

- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16 kHz by default).

## Handshake and State Machine

Required message order:

1. Client sends `hello`.
2. Server replies `hello.ack`.
3. Client sends `session.start`.
4. Server replies `session.started`.
5. Client may stream binary audio and/or send `input.text`.
6. Client sends `session.stop` (or closes socket).

If order is violated, server emits `error` with `code = "protocol.order"`.

## Client -> Server Messages

### `hello`

```json
{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}
```

Rules:

- `version` must be `v1`.
- If `WS_API_KEY` is configured on server, `auth.apiKey` must match.
- If `WS_REQUIRE_AUTH=true`, either `auth.apiKey` or `auth.jwt` must be present.

### `session.start`

```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "client": "web-debug",
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "siliconflow",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "provider": "siliconflow",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}
```

`metadata.services` is optional. If omitted, server defaults to environment configuration.

### `input.text`

```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```

### `response.cancel`

```json
{
  "type": "response.cancel",
  "graceful": false
}
```

### `session.stop`

```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```

### `tool_call.results`

Client tool execution results returned to server.

```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
```

## Server -> Client Events

All server events include:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000
}
```

Common events:

- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `audio`
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call` (`tool_call.executor` is `client` or `server`)
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`

## Binary Audio Frames

After `session.started`, client may send binary PCM chunks continuously.

Recommended format:

- 16-bit signed little-endian PCM.
- 1 channel.
- 16000 Hz.
- 20 ms frames (640 bytes) preferred.

## Compatibility

This endpoint now enforces v1 message schema for JSON control frames. Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.
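
The required message order and the recommended audio framing described in this document can be sketched as a small client-side helper. This is a minimal illustration, not the server's implementation: the state names, `HandshakeTracker`, `ProtocolOrderError`, `pcm_frame_bytes`, `split_pcm`, and `hello_message` are all hypothetical. It also assumes that `auth` may be omitted from `hello` when the server does not require authentication, and it only models the messages named in the numbered handshake steps.

```python
import json

# (state, message type) -> next state. State names are hypothetical;
# only the message order itself comes from the handshake steps.
CLIENT_SENDS = {
    ("connected", "hello"): "awaiting_hello_ack",
    ("ready", "session.start"): "awaiting_session_started",
    ("active", "input.text"): "active",
    ("active", "session.stop"): "stopped",
}
SERVER_SENDS = {
    ("awaiting_hello_ack", "hello.ack"): "ready",
    ("awaiting_session_started", "session.started"): "active",
}


class ProtocolOrderError(Exception):
    """Raised locally where the server would emit error code "protocol.order"."""
    code = "protocol.order"


class HandshakeTracker:
    """Tracks the v1 handshake from the client's point of view."""

    def __init__(self):
        self.state = "connected"

    def on_send(self, msg_type):
        # Refuse to send a control message that is out of order.
        key = (self.state, msg_type)
        if key not in CLIENT_SENDS:
            raise ProtocolOrderError(f"cannot send {msg_type!r} in state {self.state!r}")
        self.state = CLIENT_SENDS[key]

    def on_receive(self, event_type):
        # Events that do not drive the handshake (heartbeat, transcript.delta, ...)
        # leave the state unchanged.
        self.state = SERVER_SENDS.get((self.state, event_type), self.state)

    def can_send_audio(self):
        # Binary PCM frames are only allowed after session.started.
        return self.state == "active"


def hello_message(api_key=None):
    """Build a hello frame; leaving out auth is an assumption (see lead-in)."""
    msg = {"type": "hello", "version": "v1"}
    if api_key is not None:
        msg["auth"] = {"apiKey": api_key}
    return json.dumps(msg)


def pcm_frame_bytes(sample_rate_hz=16000, channels=1, frame_ms=20, bytes_per_sample=2):
    """Size in bytes of one PCM frame of the given duration."""
    samples_per_channel = sample_rate_hz * frame_ms // 1000
    return samples_per_channel * channels * bytes_per_sample


def split_pcm(pcm, frame_bytes):
    """Split raw PCM bytes into frame-sized chunks; the last chunk may be shorter."""
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

A client would call `on_send` before writing each JSON frame and `on_receive` for each server event, and would check `can_send_audio()` before writing binary frames. With the default arguments, `pcm_frame_bytes()` gives the 640-byte frame size recommended under Binary Audio Frames (320 samples of 2 bytes each).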