Init commit

2026-02-17 10:39:23 +08:00
commit 30eb4397c2
56 changed files with 11983 additions and 0 deletions
--- a/docs/ws_v1_schema.md
+++ b/docs/ws_v1_schema.md
@@ -0,0 +1,199 @@
+# WS v1 Protocol Schema (`/ws`)
+
+This document defines the public WebSocket protocol for the `/ws` endpoint.
+
+## Transport
+
+- A single WebSocket connection carries:
+  - JSON text frames for control/events.
+  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16 kHz by default).
+
+## Handshake and State Machine
+
+Required message order:
+
+1. Client sends `hello`.
+2. Server replies `hello.ack`.
+3. Client sends `session.start`.
+4. Server replies `session.started`.
+5. Client may stream binary audio and/or send `input.text`.
+6. Client sends `session.stop` (or closes the socket).
+
+If this order is violated, the server emits an `error` event with `code = "protocol.order"`.
+
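A minimal sketch of how the message order above could be enforced. The state names, the `advance` helper, and the choice to keep the current state after a violation are assumptions for illustration; the source only specifies the required order and the `protocol.order` error code. Allowing `response.cancel` and `tool_call.results` only while a session is active is likewise an assumption.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical state names; the spec only defines the message order.
    AWAIT_HELLO = auto()
    AWAIT_SESSION_START = auto()
    ACTIVE = auto()
    STOPPED = auto()


# Message types accepted in each state, mapped to the state that follows.
TRANSITIONS = {
    State.AWAIT_HELLO: {"hello": State.AWAIT_SESSION_START},
    State.AWAIT_SESSION_START: {"session.start": State.ACTIVE},
    State.ACTIVE: {
        "input.text": State.ACTIVE,
        "response.cancel": State.ACTIVE,    # assumed valid only when active
        "tool_call.results": State.ACTIVE,  # assumed valid only when active
        "session.stop": State.STOPPED,
    },
}


def advance(state, msg_type):
    """Return (next_state, error); error is None when the order is valid."""
    allowed = TRANSITIONS.get(state, {})
    if msg_type not in allowed:
        error = {
            "type": "error",
            "code": "protocol.order",
            "message": f"unexpected {msg_type!r} in state {state.name}",
        }
        # Assumption: the state is left unchanged after a violation.
        return state, error
    return allowed[msg_type], None
```

For example, `advance(State.AWAIT_HELLO, "session.start")` leaves the state unchanged and returns an error payload, because `hello` has not been received yet.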
+## Client -> Server Messages
+
+### `hello`
+
+```json
+{
+  "type": "hello",
+  "version": "v1",
+  "auth": {
+    "apiKey": "optional-api-key",
+    "jwt": "optional-jwt"
+  }
+}
+```
+
+Rules:
+- `version` must be `v1`.
+- If `WS_API_KEY` is configured on the server, `auth.apiKey` must match it.
+- If `WS_REQUIRE_AUTH=true`, at least one of `auth.apiKey` or `auth.jwt` must be present.
+
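The three `hello` rules above can be sketched as a single check. This is an illustration under stated assumptions: `check_hello` and its reason strings are hypothetical, the `ws_api_key` and `require_auth` parameters stand in for `WS_API_KEY` and `WS_REQUIRE_AUTH`, and JWT verification is not shown because the source does not describe it.

```python
import hmac


def check_hello(msg, ws_api_key=None, require_auth=False):
    """Return None if `msg` satisfies the hello rules, else a reason string."""
    if msg.get("type") != "hello":
        return "expected hello"
    if msg.get("version") != "v1":
        return "unsupported version"

    auth = msg.get("auth") or {}
    api_key = auth.get("apiKey")
    jwt = auth.get("jwt")

    if ws_api_key is not None:
        # Compare in constant time so timing does not leak key contents.
        if not isinstance(api_key, str) or not hmac.compare_digest(
            api_key.encode(), ws_api_key.encode()
        ):
            return "invalid api key"

    # Presence check only; validating the JWT itself is out of scope here.
    if require_auth and not (api_key or jwt):
        return "auth required"
    return None
```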
+### `session.start`
+
+```json
+{
+  "type": "session.start",
+  "audio": {
+    "encoding": "pcm_s16le",
+    "sample_rate_hz": 16000,
+    "channels": 1
+  },
+  "metadata": {
+    "client": "web-debug",
+    "output": {
+      "mode": "audio"
+    },
+    "systemPrompt": "You are concise.",
+    "greeting": "Hi, how can I help?",
+    "services": {
+      "llm": {
+        "provider": "openai",
+        "model": "gpt-4o-mini",
+        "apiKey": "sk-...",
+        "baseUrl": "https://api.openai.com/v1"
+      },
+      "asr": {
+        "provider": "openai_compatible",
+        "model": "FunAudioLLM/SenseVoiceSmall",
+        "apiKey": "sf-...",
+        "interimIntervalMs": 500,
+        "minAudioMs": 300
+      },
+      "tts": {
+        "enabled": true,
+        "provider": "openai_compatible",
+        "model": "FunAudioLLM/CosyVoice2-0.5B",
+        "apiKey": "sf-...",
+        "voice": "anna",
+        "speed": 1.0
+      }
+    }
+  }
+}
+```
+
+`metadata.services` is optional. If it is omitted, the server falls back to its environment configuration.
+
+Text-only mode:
+- Set `metadata.output.mode = "text"` OR `metadata.services.tts.enabled = false`.
+- In this mode the server still sends `assistant.response.delta` and `assistant.response.final`, but does not emit audio frames or the `output.audio.start` and `output.audio.end` events.
+
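The text-only condition can be expressed as a small predicate over the `session.start` payload. This is a sketch: `is_text_only` is a hypothetical helper, and treating a missing `tts.enabled` as "enabled" is an assumption, since the source does not state the default.

```python
def is_text_only(session_start):
    """True when the session.start payload selects text-only output."""
    metadata = session_start.get("metadata") or {}
    output = metadata.get("output") or {}
    tts = (metadata.get("services") or {}).get("tts") or {}
    # Either condition alone is sufficient, matching the OR in the rule above.
    # Assumption: an absent `enabled` key does not disable TTS.
    return output.get("mode") == "text" or tts.get("enabled") is False
```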
+### `input.text`
+
+```json
+{
+  "type": "input.text",
+  "text": "What can you do?"
+}
+```
+
+### `response.cancel`
+
+```json
+{
+  "type": "response.cancel",
+  "graceful": false
+}
+```
+
+### `session.stop`
+
+```json
+{
+  "type": "session.stop",
+  "reason": "client_disconnect"
+}
+```
+
+### `tool_call.results`
+
+Results of client-executed tool calls, returned to the server.
+
+```json
+{
+  "type": "tool_call.results",
+  "results": [
+    {
+      "tool_call_id": "call_abc123",
+      "name": "weather",
+      "output": { "temp_c": 21, "condition": "sunny" },
+      "status": { "code": 200, "message": "ok" }
+    }
+  ]
+}
+```
+
+## Server -> Client Events
+
+All server events include:
+
+```json
+{
+  "type": "event.name",
+  "timestamp": 1730000000000
+}
+```
+
+Common events:
+
+- `hello.ack`
+  - Fields: `sessionId`, `version`
+- `session.started`
+  - Fields: `sessionId`, `trackId`, `audio`
+- `session.stopped`
+  - Fields: `sessionId`, `reason`
+- `heartbeat`
+- `input.speech_started`
+  - Fields: `trackId`, `probability`
+- `input.speech_stopped`
+  - Fields: `trackId`, `probability`
+- `transcript.delta`
+  - Fields: `trackId`, `text`
+- `transcript.final`
+  - Fields: `trackId`, `text`
+- `assistant.response.delta`
+  - Fields: `trackId`, `text`
+- `assistant.response.final`
+  - Fields: `trackId`, `text`
+- `assistant.tool_call`
+  - Fields: `trackId`, `tool_call` (`tool_call.executor` is `client` or `server`)
+- `assistant.tool_result`
+  - Fields: `trackId`, `source`, `result`
+- `output.audio.start`
+  - Fields: `trackId`
+- `output.audio.end`
+  - Fields: `trackId`
+- `response.interrupted`
+  - Fields: `trackId`
+- `metrics.ttfb`
+  - Fields: `trackId`, `latencyMs`
+- `error`
+  - Fields: `sender`, `code`, `message`, `trackId`
+
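On the client side, the shared `type` and `timestamp` fields make it straightforward to route events by name. The following is a minimal sketch, not part of the protocol: `parse_event`, `dispatch`, and the `handlers` mapping are hypothetical names, and silently skipping unknown event types is an assumption about reasonable client behavior.

```python
import json


def parse_event(frame):
    """Decode a JSON text frame and check the fields shared by all events."""
    event = json.loads(frame)
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    if not isinstance(event.get("type"), str):
        raise ValueError("event is missing a string 'type'")
    if "timestamp" not in event:
        raise ValueError("event is missing 'timestamp'")
    return event


def dispatch(frame, handlers):
    """Call handlers[event type] with the event; return False if unhandled."""
    event = parse_event(frame)
    handler = handlers.get(event["type"])
    if handler is None:
        return False
    handler(event)
    return True
```

A client might register one handler per event it cares about, for example `{"transcript.final": on_transcript, "error": on_error}`, and ignore the rest.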
+## Binary Audio Frames
+
+After `session.started`, the client may send binary PCM chunks continuously.
+
+Recommended format:
+- 16-bit signed little-endian PCM.
+- 1 channel.
+- 16000 Hz.
+- 20 ms frames (640 bytes each) are preferred.
+
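The 640-byte figure follows from the recommended format: 16000 samples/s × 0.020 s × 1 channel × 2 bytes per sample. A short sketch of that arithmetic, plus a helper that slices a PCM buffer into frames; both function names are hypothetical, and sending a shorter final chunk as-is is an assumption, since the source does not say how partial frames are handled.

```python
def frame_size_bytes(sample_rate_hz=16000, channels=1, frame_ms=20,
                     bytes_per_sample=2):
    """Bytes in one PCM frame of `frame_ms` milliseconds."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * channels * bytes_per_sample


def split_frames(pcm, frame_bytes=None):
    """Yield consecutive slices of `pcm`, each at most `frame_bytes` long."""
    if frame_bytes is None:
        frame_bytes = frame_size_bytes()
    for offset in range(0, len(pcm), frame_bytes):
        # The last slice may be shorter than a full frame.
        yield pcm[offset:offset + frame_bytes]
```

Each yielded slice would be sent as one binary WebSocket frame.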
+## Compatibility
+
+This endpoint now enforces the v1 message schema for JSON control frames.
+Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.