Unify db api
This document defines the public WebSocket protocol for the `/ws` endpoint.

Validation policy:

- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- `hello.version` is fixed to `"v1"`.

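A minimal sketch of the strict policy above; the allowed field set here is an assumption for illustration, not the normative `hello` schema:

```python
# Strict v1 validation sketch: reject unknown top-level fields and pin
# `hello.version` to "v1". ALLOWED_HELLO_FIELDS is illustrative only.
ALLOWED_HELLO_FIELDS = {"type", "version", "audio", "metadata"}

def validate_hello(msg: dict) -> None:
    unknown = set(msg) - ALLOWED_HELLO_FIELDS
    if unknown:
        raise ValueError(f"unknown top-level fields: {sorted(unknown)}")
    if msg.get("version") != "v1":
        raise ValueError('hello.version must be "v1"')
```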
## Transport

- A single WebSocket connection carries:

```json
    "channels": 1
  },
  "metadata": {
    "appId": "assistant_123",
    "channel": "web",
    "configVersionId": "cfg_20260217_01",
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "openai_compatible",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "enabled": true,
        "provider": "openai_compatible",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}
```

Rules:

- Client-side `metadata.services` is ignored.
- Service config (including secrets) is resolved server-side (env/backend).
- Client should pass stable IDs (`appId`, `channel`, `configVersionId`) plus small runtime overrides (e.g. `output`, `bargeIn`, greeting/prompt style hints).

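Under these rules, a minimal `metadata` block carries only the stable IDs plus small runtime overrides; the values below are illustrative:

```json
{
  "appId": "assistant_123",
  "channel": "web",
  "configVersionId": "cfg_20260217_01",
  "output": { "mode": "text" }
}
```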

Text-only mode:

- Set `metadata.output.mode = "text"`.
- In this mode the server still sends `assistant.response.delta/final`, but it will not emit audio frames or `output.audio.start/end`.

### `input.text`
### `tool_call.results`

Client tool execution results, returned to the server.
Only needed when `assistant.tool_call.executor == "client"` (server-side execution is the default).

```json
{
  ...
}
```

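Since the example payload is truncated here, the following is only a sketch of how a client-side executor might build this message; field names beyond `type` are assumptions, not the normative schema:

```python
import json

def make_tool_results(tool_call_id: str, ok: bool, result=None, error=None) -> str:
    # Hypothetical `tool_call.results` shape; the exact field names are
    # assumptions since the spec's example is truncated above.
    msg = {
        "type": "tool_call.results",
        "tool_call_id": tool_call_id,
        "ok": ok,
        "result": result,
        "error": error,
    }
    return json.dumps(msg)
```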
## Server -> Client Events

All server events include an envelope:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}
```

Envelope notes:

- `seq` is monotonically increasing within one session (for replay/resume).
- `source` is one of: `asr | llm | tts | tool | system | client | server`.
- For `assistant.tool_result`, `source` may be `client` or `server` to indicate the execution side.
- `data` is the structured payload; legacy top-level fields are kept for compatibility.

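Because `seq` is monotonic within a session, a client can detect missed events before requesting replay. A minimal sketch, assuming contiguous `seq` increments (the spec only guarantees monotonicity; the resume mechanism itself is out of scope here):

```python
class SeqTracker:
    """Detect gaps in the per-session `seq` counter for replay/resume."""

    def __init__(self):
        self.last = None  # last seq seen, None before the first event

    def observe(self, seq: int) -> bool:
        """Return True if the event arrived in order, False if a gap was seen."""
        in_order = self.last is None or seq == self.last + 1
        self.last = seq if self.last is None else max(seq, self.last)
        return in_order
```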
Common events:

- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `tracks`, `audio`
- `config.resolved`
  - Fields: `sessionId`, `trackId`, `config`
  - Sent immediately after `session.started`.
  - Contains the effective model/voice/output/tool allowlist/prompt hash, and never includes secrets.
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call`, `tool_call_id`, `tool_name`, `arguments`, `executor`, `timeout_ms`
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`, `tool_call_id`, `tool_name`, `ok`, `error`
  - `error`: `{ code, message, retryable }` when `ok=false`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
  - `trackId` convention:
    - `audio_in` for `stage in {audio, asr}`
    - `audio_out` for `stage in {llm, tts, tool}`
    - `control` otherwise (including protocol/auth errors)

Track IDs (MVP fixed values):

- `audio_in`: ASR/VAD input-side events (`input.*`, `transcript.*`)
- `audio_out`: assistant output-side events (`assistant.*`, `output.audio.*`, `response.interrupted`, `metrics.ttfb`)
- `control`: session/control events (`session.*`, `hello.*`, `error`, `config.resolved`)

Correlation IDs (`event.data`):

- `turn_id`: one user-assistant interaction turn.
- `utterance_id`: one ASR final utterance.
- `response_id`: one assistant response generation.
- `tool_call_id`: one tool invocation.
- `tts_id`: one TTS playback segment.

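The fixed track IDs make client-side dispatch straightforward; a sketch with illustrative pipeline names (not part of the protocol):

```python
# Route an incoming server event by its trackId; the pipeline names are
# hypothetical client-side handlers, not protocol values.
def route(event: dict) -> str:
    track = event.get("trackId", "control")
    if track == "audio_in":
        return "input_pipeline"    # input.*, transcript.*
    if track == "audio_out":
        return "output_pipeline"   # assistant.*, output.audio.*, metrics.ttfb
    return "control_pipeline"      # session.*, hello.*, error, config.resolved
```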

## Binary Audio Frames

After `session.started`, the client may send binary PCM chunks continuously.

MVP fixed format:

- 16-bit signed little-endian PCM (`pcm_s16le`)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes

Framing rules:

- The binary audio frame unit is 640 bytes.
- A WS binary message may carry one or more complete 640-byte frames.
- Payloads whose size is not a multiple of 640 bytes are rejected as `audio.frame_size_mismatch`; that WS message is dropped (no partial buffering or reassembly).

TTS boundary events:

- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.

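The framing rule can be enforced mechanically. A sketch of splitting one WS binary message into frames (640 bytes = 16000 Hz x 0.02 s x 2 bytes/sample, mono):

```python
FRAME_BYTES = 640  # 20ms of pcm_s16le mono at 16000 Hz

def split_frames(payload: bytes) -> list[bytes]:
    """Split a WS binary message into complete 640-byte frames.

    Raises ValueError for non-multiple sizes, mirroring the
    `audio.frame_size_mismatch` rejection (the whole message is dropped).
    """
    if len(payload) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES] for i in range(0, len(payload), FRAME_BYTES)]
```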
## Event Throttling

To keep client rendering and server load stable, v1 applies (or recommends) the following cadences:

- `transcript.delta`: merge to a ~200-500ms cadence (server default: 300ms).
- `assistant.response.delta`: merge to a ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond `metrics.ttfb`): emit every ~500-1000ms.

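The merge behavior can be sketched as a cadence-gated buffer; the clock is injected here for determinism, whereas a real server would use a monotonic timer:

```python
class DeltaCoalescer:
    """Merge streamed text deltas and emit at a fixed cadence
    (e.g. 80ms for `assistant.response.delta`). Sketch only."""

    def __init__(self, interval_ms: int):
        self.interval_ms = interval_ms
        self.buffer = ""
        self.last_flush_ms = 0

    def push(self, delta: str, now_ms: int):
        """Buffer a delta; return merged text when the cadence elapses, else None."""
        self.buffer += delta
        if now_ms - self.last_flush_ms >= self.interval_ms:
            out, self.buffer = self.buffer, ""
            self.last_flush_ms = now_ms
            return out
        return None
```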
## Error Structure

`error` keeps the legacy top-level fields (`code`, `message`) and adds structured info:

- `stage`: `protocol | asr | llm | tts | tool | audio`
- `retryable`: boolean
- `data.error`: `{ stage, code, message, retryable }`

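A client deciding whether to retry can prefer the structured `data.error` block and fall back to the legacy top-level field; a sketch (the fallback ordering is an assumption based on the compatibility note above):

```python
def should_retry(event: dict) -> bool:
    """Read `retryable` from `data.error` if present, else the legacy
    top-level field; default to False when neither is set."""
    err = event.get("data", {}).get("error") or {}
    if "retryable" in err:
        return bool(err["retryable"])
    return bool(event.get("retryable", False))
```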
## Compatibility