Improve schema

Xin Wang
2026-02-24 05:55:47 +08:00
parent c6c84b5af9
commit 6290fdd60e
4 changed files with 88 additions and 36 deletions

@@ -2,6 +2,11 @@
This document defines the public WebSocket protocol for the `/ws` endpoint.
Validation policy:
- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- `hello.version` is fixed to `"v1"`.
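A minimal sketch of what the strict validation policy could look like for a `hello` message. The `type` discriminator field, the allow-list contents, and the function name are assumptions made for illustration; the real schema may define other fields.

```python
# Hypothetical allow-list of top-level fields per client message type.
# Only "hello" is shown; the field names here are assumptions, not the real schema.
ALLOWED_FIELDS = {"hello": {"type", "version"}}


def validate_client_message(msg: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the message passes."""
    msg_type = msg.get("type")
    if not isinstance(msg_type, str) or msg_type not in ALLOWED_FIELDS:
        return [f"unknown message type: {msg_type!r}"]

    errors = []
    # Strict policy: any top-level field outside the allow-list is rejected.
    unknown = sorted(set(msg) - ALLOWED_FIELDS[msg_type])
    if unknown:
        errors.append(f"unknown top-level fields: {unknown}")
    # hello.version is fixed to "v1".
    if msg_type == "hello" and msg.get("version") != "v1":
        errors.append('hello.version must be "v1"')
    return errors
```

Returning a list of errors rather than raising on the first one lets a server report every problem in a single `error` event, though either design would satisfy the policy.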
## Transport
- A single WebSocket connection carries:
@@ -138,7 +143,8 @@ All server events include an envelope:
Envelope notes:
- `seq` is monotonically increasing within one session (for replay/resume).
-- `source` is one of: `asr | llm | tts | tool | system`.
+- `source` is one of: `asr | llm | tts | tool | system | client | server`.
+- For `assistant.tool_result`, `source` may be `client` or `server` to indicate execution side.
- `data` is structured payload; legacy top-level fields are kept for compatibility.
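The envelope notes can be checked on the receiving side roughly as follows. This is a sketch only: it assumes "monotonically increasing" means strictly increasing, and the class and method names are hypothetical.

```python
# Allowed values for the envelope's `source` field, per the updated list.
VALID_SOURCES = {"asr", "llm", "tts", "tool", "system", "client", "server"}


class EnvelopeChecker:
    """Tracks `seq` for one session and flags envelope problems."""

    def __init__(self) -> None:
        self.last_seq = None  # no event seen yet in this session

    def check(self, event: dict) -> list[str]:
        problems = []
        seq = event.get("seq")
        if not isinstance(seq, int) or isinstance(seq, bool):
            problems.append("seq is missing or not an integer")
        elif self.last_seq is not None and seq <= self.last_seq:
            # Assumption: seq must strictly increase within a session.
            problems.append(f"seq {seq} does not follow {self.last_seq}")
        else:
            self.last_seq = seq
        if event.get("source") not in VALID_SOURCES:
            problems.append(f"unexpected source: {event.get('source')!r}")
        return problems
```

A client that resumes a session could seed `last_seq` with the last value it processed, so replayed events are detected the same way.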
Common events:
@@ -181,6 +187,10 @@ Common events:
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
  - `trackId` convention:
    - `audio_in` for `stage in {audio, asr}`
    - `audio_out` for `stage in {llm, tts, tool}`
    - `control` otherwise (including protocol/auth errors)
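The `trackId` convention for `error` events amounts to a small lookup. A sketch, assuming the error's stage is available as a string (or `None` when no stage applies); the function name is hypothetical.

```python
from typing import Optional


def track_id_for_stage(stage: Optional[str]) -> str:
    """Map an error's stage to the trackId it should be reported on."""
    if stage in {"audio", "asr"}:
        return "audio_in"
    if stage in {"llm", "tts", "tool"}:
        return "audio_out"
    # Everything else, including protocol/auth errors, goes to the control track.
    return "control"
```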
Track IDs (MVP fixed values):
- `audio_in`: ASR/VAD input-side events (`input.*`, `transcript.*`)
@@ -207,7 +217,7 @@ MVP fixed format:
Framing rules:
- Binary audio frame unit is 640 bytes.
- A WS binary message may carry one or multiple complete 640-byte frames.
-- Non-640-multiple payloads are treated as `audio.frame_size_mismatch` protocol errors.
+- Non-640-multiple payloads are rejected as `audio.frame_size_mismatch`; that WS message is dropped (no partial buffering/reassembly).
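The framing rules could be applied to an incoming binary message along these lines. The exception class and function name are hypothetical, and how an empty payload should be handled is not stated in the rules, so this sketch simply returns no frames for it.

```python
FRAME_BYTES = 640  # binary audio frame unit from the framing rules


class FrameSizeMismatch(ValueError):
    """Raised when a payload is not a whole number of 640-byte frames."""

    code = "audio.frame_size_mismatch"


def split_frames(payload: bytes) -> list[bytes]:
    """Split one WS binary message into complete 640-byte frames.

    The whole message is rejected on a size mismatch; nothing is buffered
    or carried over to the next message.
    """
    if len(payload) % FRAME_BYTES != 0:
        raise FrameSizeMismatch(
            f"payload of {len(payload)} bytes is not a multiple of {FRAME_BYTES}"
        )
    return [payload[i:i + FRAME_BYTES] for i in range(0, len(payload), FRAME_BYTES)]
```

A caller would catch `FrameSizeMismatch`, emit an `error` event carrying `code`, and discard the message without keeping any leftover bytes.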
TTS boundary events:
- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.