Unify db api
This document defines the public WebSocket protocol for the `/ws` endpoint.

Validation policy:

- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- `hello.version` is fixed to `"v1"`.

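A minimal sketch of the strict policy above; the allowed field set here is an assumption for illustration, not the normative `hello` schema:

```python
# Strict v1 validation sketch: reject unknown top-level fields and pin
# `hello.version` to "v1". ALLOWED_HELLO_FIELDS is illustrative only.
ALLOWED_HELLO_FIELDS = {"type", "version", "audio", "metadata"}

def validate_hello(msg: dict) -> None:
    unknown = set(msg) - ALLOWED_HELLO_FIELDS
    if unknown:
        raise ValueError(f"unknown top-level fields: {sorted(unknown)}")
    if msg.get("version") != "v1":
        raise ValueError('hello.version must be "v1"')
```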
## Transport

- A single WebSocket connection carries:

```json
    "channels": 1
  },
  "metadata": {
    "appId": "assistant_123",
    "channel": "web",
    "configVersionId": "cfg_20260217_01",
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "openai_compatible",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "enabled": true,
        "provider": "openai_compatible",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}
```

Rules:

- Client-side `metadata.services` is ignored.
- Service config (including secrets) is resolved server-side (env/backend).
- Client should pass stable IDs (`appId`, `channel`, `configVersionId`) plus small runtime overrides (e.g. `output`, `bargeIn`, greeting/prompt style hints).

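Under these rules, a minimal `metadata` block carries only the stable IDs plus small runtime overrides; the values below are illustrative:

```json
{
  "appId": "assistant_123",
  "channel": "web",
  "configVersionId": "cfg_20260217_01",
  "output": { "mode": "text" }
}
```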

Text-only mode:

- Set `metadata.output.mode = "text"`.
- In this mode the server still sends `assistant.response.delta/final`, but it will not emit audio frames or `output.audio.start/end`.

### `input.text`
### `tool_call.results`

Client tool execution results, returned to the server.
Only needed when `assistant.tool_call.executor == "client"` (server-side execution is the default).

```json
{
  ...
}
```

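Since the example payload is truncated here, the following is only a sketch of how a client-side executor might build this message; field names beyond `type` are assumptions, not the normative schema:

```python
import json

def make_tool_results(tool_call_id: str, ok: bool, result=None, error=None) -> str:
    # Hypothetical `tool_call.results` shape; the exact field names are
    # assumptions since the spec's example is truncated above.
    msg = {
        "type": "tool_call.results",
        "tool_call_id": tool_call_id,
        "ok": ok,
        "result": result,
        "error": error,
    }
    return json.dumps(msg)
```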
## Server -> Client Events

All server events include an envelope:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}
```

Envelope notes:

- `seq` is monotonically increasing within one session (for replay/resume).
- `source` is one of: `asr | llm | tts | tool | system | client | server`.
- For `assistant.tool_result`, `source` may be `client` or `server` to indicate the execution side.
- `data` is the structured payload; legacy top-level fields are kept for compatibility.

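Because `seq` is monotonic within a session, a client can detect missed events before requesting replay. A minimal sketch, assuming contiguous `seq` increments (the spec only guarantees monotonicity; the resume mechanism itself is out of scope here):

```python
class SeqTracker:
    """Detect gaps in the per-session `seq` counter for replay/resume."""

    def __init__(self):
        self.last = None  # last seq seen, None before the first event

    def observe(self, seq: int) -> bool:
        """Return True if the event arrived in order, False if a gap was seen."""
        in_order = self.last is None or seq == self.last + 1
        self.last = seq if self.last is None else max(seq, self.last)
        return in_order
```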
Common events:

- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `tracks`, `audio`
- `config.resolved`
  - Fields: `sessionId`, `trackId`, `config`
  - Sent immediately after `session.started`.
  - Contains the effective model/voice/output/tool allowlist/prompt hash, and never includes secrets.
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call`, `tool_call_id`, `tool_name`, `arguments`, `executor`, `timeout_ms`
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`, `tool_call_id`, `tool_name`, `ok`, `error`
  - `error`: `{ code, message, retryable }` when `ok=false`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
  - `trackId` convention:
    - `audio_in` for `stage in {audio, asr}`
    - `audio_out` for `stage in {llm, tts, tool}`
    - `control` otherwise (including protocol/auth errors)

Track IDs (MVP fixed values):

- `audio_in`: ASR/VAD input-side events (`input.*`, `transcript.*`)
- `audio_out`: assistant output-side events (`assistant.*`, `output.audio.*`, `response.interrupted`, `metrics.ttfb`)
- `control`: session/control events (`session.*`, `hello.*`, `error`, `config.resolved`)

Correlation IDs (`event.data`):

- `turn_id`: one user-assistant interaction turn.
- `utterance_id`: one ASR final utterance.
- `response_id`: one assistant response generation.
- `tool_call_id`: one tool invocation.
- `tts_id`: one TTS playback segment.

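The fixed track IDs make client-side dispatch straightforward; a sketch with illustrative pipeline names (not part of the protocol):

```python
# Route an incoming server event by its trackId; the pipeline names are
# hypothetical client-side handlers, not protocol values.
def route(event: dict) -> str:
    track = event.get("trackId", "control")
    if track == "audio_in":
        return "input_pipeline"    # input.*, transcript.*
    if track == "audio_out":
        return "output_pipeline"   # assistant.*, output.audio.*, metrics.ttfb
    return "control_pipeline"      # session.*, hello.*, error, config.resolved
```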

## Binary Audio Frames

After `session.started`, the client may send binary PCM chunks continuously.

MVP fixed format:

- 16-bit signed little-endian PCM (`pcm_s16le`)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes

Framing rules:

- The binary audio frame unit is 640 bytes.
- A WS binary message may carry one or more complete 640-byte frames.
- Payloads whose size is not a multiple of 640 bytes are rejected as `audio.frame_size_mismatch`; that WS message is dropped (no partial buffering or reassembly).

TTS boundary events:

- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.

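The framing rule can be enforced mechanically. A sketch of splitting one WS binary message into frames (640 bytes = 16000 Hz x 0.02 s x 2 bytes/sample, mono):

```python
FRAME_BYTES = 640  # 20ms of pcm_s16le mono at 16000 Hz

def split_frames(payload: bytes) -> list[bytes]:
    """Split a WS binary message into complete 640-byte frames.

    Raises ValueError for non-multiple sizes, mirroring the
    `audio.frame_size_mismatch` rejection (the whole message is dropped).
    """
    if len(payload) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES] for i in range(0, len(payload), FRAME_BYTES)]
```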
## Event Throttling

To keep client rendering and server load stable, v1 applies (or recommends) the following cadences:

- `transcript.delta`: merge to a ~200-500ms cadence (server default: 300ms).
- `assistant.response.delta`: merge to a ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond `metrics.ttfb`): emit every ~500-1000ms.

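The merge behavior can be sketched as a cadence-gated buffer; the clock is injected here for determinism, whereas a real server would use a monotonic timer:

```python
class DeltaCoalescer:
    """Merge streamed text deltas and emit at a fixed cadence
    (e.g. 80ms for `assistant.response.delta`). Sketch only."""

    def __init__(self, interval_ms: int):
        self.interval_ms = interval_ms
        self.buffer = ""
        self.last_flush_ms = 0

    def push(self, delta: str, now_ms: int):
        """Buffer a delta; return merged text when the cadence elapses, else None."""
        self.buffer += delta
        if now_ms - self.last_flush_ms >= self.interval_ms:
            out, self.buffer = self.buffer, ""
            self.last_flush_ms = now_ms
            return out
        return None
```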
## Error Structure

`error` keeps the legacy top-level fields (`code`, `message`) and adds structured info:

- `stage`: `protocol | asr | llm | tts | tool | audio`
- `retryable`: boolean
- `data.error`: `{ stage, code, message, retryable }`

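A client deciding whether to retry can prefer the structured `data.error` block and fall back to the legacy top-level field; a sketch (the fallback ordering is an assumption based on the compatibility note above):

```python
def should_retry(event: dict) -> bool:
    """Read `retryable` from `data.error` if present, else the legacy
    top-level field; default to False when neither is set."""
    err = event.get("data", {}).get("error") or {}
    if "retryable" in err:
        return bool(err["retryable"])
    return bool(event.get("retryable", False))
```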
## Compatibility