Init commit
docs/ws_v1_schema.md
# WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the `/ws` endpoint.

## Transport

- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16 kHz by default).

## Handshake and State Machine

Required message order:

1. Client sends `hello`.
2. Server replies `hello.ack`.
3. Client sends `session.start`.
4. Server replies `session.started`.
5. Client may stream binary audio and/or send `input.text`.
6. Client sends `session.stop` (or closes the socket).

If the order is violated, the server emits `error` with `code = "protocol.order"`.
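
The ordering rule above can be sketched as a small state machine. This is a minimal sketch, not the server's implementation: the `State` names, the `ALLOWED` table, and the `advance` helper are hypothetical, and it assumes that `response.cancel` and `tool_call.results` are only valid once the session is active, which this document does not state explicitly.

```python
from enum import Enum, auto


class State(Enum):
    CONNECTED = auto()    # socket open, waiting for `hello`
    HELLO_ACKED = auto()  # `hello` accepted, waiting for `session.start`
    ACTIVE = auto()       # session started, audio and `input.text` allowed
    STOPPED = auto()      # `session.stop` received


# Client message types accepted in each state (assumed mapping).
ALLOWED = {
    State.CONNECTED: {"hello"},
    State.HELLO_ACKED: {"session.start"},
    State.ACTIVE: {"input.text", "response.cancel", "tool_call.results", "session.stop"},
    State.STOPPED: set(),
}

# Messages that move the connection to a new state. The server reply
# (`hello.ack`, `session.started`) is folded into the same transition.
NEXT_STATE = {
    "hello": State.HELLO_ACKED,
    "session.start": State.ACTIVE,
    "session.stop": State.STOPPED,
}


def advance(state, msg_type):
    """Return (new_state, error); error is None or a partial `error` event."""
    if msg_type not in ALLOWED[state]:
        error = {
            "type": "error",
            "code": "protocol.order",
            "message": f"unexpected {msg_type!r} in state {state.name}",
        }
        return state, error
    return NEXT_STATE.get(msg_type, state), None
```

Keeping the table separate from the transition function makes it easy to adjust which messages are legal in which state without touching the control flow.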

## Client -> Server Messages

### `hello`

```json
{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}
```

Rules:
- `version` must be `v1`.
- If `WS_API_KEY` is configured on the server, `auth.apiKey` must match it.
- If `WS_REQUIRE_AUTH=true`, either `auth.apiKey` or `auth.jwt` must be present.
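
The three rules can be read as a single validation pass over the `hello` message. The sketch below is illustrative only: `check_hello` and its parameters are hypothetical names, the returned reason strings are not protocol error codes, and JWT verification is left out because this document does not describe it.

```python
import hmac


def check_hello(msg, ws_api_key=None, require_auth=False):
    """Return None if `msg` satisfies the documented rules, else a reason string.

    ws_api_key   -- value of WS_API_KEY, or None when it is not configured
    require_auth -- True when WS_REQUIRE_AUTH=true
    """
    if msg.get("type") != "hello":
        return "expected hello"
    if msg.get("version") != "v1":
        return "unsupported version"

    auth = msg.get("auth") or {}
    api_key = auth.get("apiKey")
    jwt = auth.get("jwt")

    if ws_api_key is not None:
        # Constant-time comparison avoids leaking key contents through timing.
        if not isinstance(api_key, str) or not hmac.compare_digest(
            api_key.encode(), ws_api_key.encode()
        ):
            return "apiKey mismatch"

    if require_auth and not (api_key or jwt):
        return "auth required"

    return None
```

For example, `check_hello({"type": "hello", "version": "v1"}, require_auth=True)` returns `"auth required"`, while the same message with an `auth.jwt` value passes.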

### `session.start`

```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "openai_compatible",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "enabled": true,
        "provider": "openai_compatible",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}
```

`metadata.services` is optional. If omitted, the server falls back to its environment configuration.

Text-only mode:
- Set `metadata.output.mode = "text"` OR `metadata.services.tts.enabled = false`.
- In this mode the server still sends `assistant.response.delta` and `assistant.response.final`, but does not emit audio frames or `output.audio.start` / `output.audio.end`.
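
The text-only condition is an OR over two optional fields, which is easy to get wrong when either one is missing. Here is a minimal sketch of the check, assuming that an absent `output.mode` means `"audio"` and an absent `tts.enabled` means `true`. Those defaults are assumptions, since the real defaults come from the server's environment configuration. `audio_output_enabled` is a hypothetical helper name.

```python
def audio_output_enabled(session_start):
    """Return False when the session.start message requests text-only mode."""
    metadata = session_start.get("metadata") or {}

    # Assumed default: audio output unless the client asks for "text".
    mode = (metadata.get("output") or {}).get("mode", "audio")

    # Assumed default: TTS enabled unless explicitly set to false.
    tts = (metadata.get("services") or {}).get("tts") or {}
    tts_enabled = tts.get("enabled", True)

    # Text-only if EITHER condition from the spec holds.
    return mode != "text" and tts_enabled is not False
```

A client can use the same predicate to decide whether to expect binary frames and `output.audio.*` events after a response starts.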

### `input.text`

```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```

### `response.cancel`

```json
{
  "type": "response.cancel",
  "graceful": false
}
```

### `session.stop`

```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```

### `tool_call.results`

Carries the results of client-executed tools back to the server.

```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
```
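
As a rough illustration, a client could assemble this message with two small helpers. `make_tool_result` and `make_tool_results_message` are hypothetical names, and the default status of `200` / `"ok"` simply mirrors the example above. This document does not list which other status codes the server accepts.

```python
import json


def make_tool_result(tool_call_id, name, output, code=200, message="ok"):
    """Build one entry of the `results` array."""
    return {
        "tool_call_id": tool_call_id,
        "name": name,
        "output": output,
        "status": {"code": code, "message": message},
    }


def make_tool_results_message(results):
    """Serialize a `tool_call.results` message as a JSON text frame payload."""
    return json.dumps({"type": "tool_call.results", "results": list(results)})


# Reproduces the example message shown above.
frame = make_tool_results_message(
    [make_tool_result("call_abc123", "weather", {"temp_c": 21, "condition": "sunny"})]
)
```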

## Server -> Client Events

All server events include:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000
}
```

Common events:

- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `audio`
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call` (`tool_call.executor` is `client` or `server`)
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
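
Because every event shares the `type` and `timestamp` envelope, a client can validate the envelope once and then route by `type`. The sketch below shows one way to do that. `parse_event` and `Dispatcher` are hypothetical names, and treating `timestamp` as a number is inferred from the example value rather than stated by this document.

```python
import json


def parse_event(frame):
    """Decode a JSON text frame and check the common envelope fields."""
    event = json.loads(frame)
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    if not isinstance(event.get("type"), str):
        raise ValueError("event is missing a string 'type'")
    if not isinstance(event.get("timestamp"), (int, float)):
        raise ValueError("event is missing a numeric 'timestamp'")
    return event


class Dispatcher:
    """Routes parsed events to handlers registered per event type."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type, handler):
        self._handlers[event_type] = handler

    def dispatch(self, frame):
        """Return True if a handler ran, False if the type is unhandled."""
        event = parse_event(frame)
        handler = self._handlers.get(event["type"])
        if handler is None:
            return False
        handler(event)
        return True
```

Unhandled types are reported rather than raised, so a client built this way keeps working if the server adds events that are not listed here.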

## Binary Audio Frames

After `session.started`, the client may send binary PCM chunks continuously.

Recommended format:
- 16-bit signed little-endian PCM.
- 1 channel.
- 16000 Hz.
- 20 ms frames (640 bytes) preferred.
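
The 640-byte figure follows from the format: 16000 samples/s × 0.020 s = 320 samples, at 2 bytes per sample and 1 channel. The following sketch computes the frame size and splits a PCM buffer into frames. `frame_bytes` and `iter_frames` are hypothetical helpers, and sending a shorter final chunk as-is is an assumption, since the 20 ms size is described as preferred rather than required.

```python
def frame_bytes(sample_rate_hz=16000, channels=1, frame_ms=20, bytes_per_sample=2):
    """Size in bytes of one PCM frame of `frame_ms` milliseconds."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * channels * bytes_per_sample


def iter_frames(pcm, size=None):
    """Yield consecutive chunks of `pcm`; the last one may be shorter."""
    if size is None:
        size = frame_bytes()
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# 1500 bytes of silence splits into two full frames and one partial frame.
chunks = list(iter_frames(bytes(1500)))
```

Each yielded chunk would be sent as one binary WebSocket frame.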

## Compatibility

This endpoint now enforces the v1 message schema for JSON control frames.
Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.