# WS v1 Protocol Schema (`/ws`)
This document defines the public WebSocket protocol for the `/ws` endpoint.
Validation policy:
- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- `assistant_id` query parameter is required on `/ws`.
## Transport
- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16kHz by default).
## Handshake and State Machine
Required message order:
1. Client connects to `/ws?assistant_id=<id>`.
2. Client sends `session.start`.
3. Server replies `session.started`.
4. Client may stream binary audio and/or send `input.text`.
5. Client sends `session.stop` (or closes socket).
If this order is violated, the server emits an `error` with `code = "protocol.order"`.
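The ordering rule above can be sketched as a small state machine; the `Session` class and its state names are illustrative, not part of the protocol:

```python
class ProtocolOrderError(Exception):
    """Raised when a message arrives out of order (maps to code = "protocol.order")."""
    code = "protocol.order"


class Session:
    def __init__(self):
        # connected -> started -> stopped
        self.state = "connected"

    def on_message(self, msg_type: str):
        if msg_type == "session.start":
            if self.state != "connected":
                raise ProtocolOrderError("session.start after session already started")
            self.state = "started"
        elif msg_type in ("input.text", "response.cancel", "tool_call.results"):
            if self.state != "started":
                raise ProtocolOrderError(f"{msg_type} before session.start")
        elif msg_type == "session.stop":
            if self.state != "started":
                raise ProtocolOrderError("session.stop before session.start")
            self.state = "stopped"
```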
## Client -> Server Messages
### `session.start`
```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "channel": "web",
    "source": "web-debug",
    "history": {
      "userId": 1
    },
    "overrides": {
      "output": {
        "mode": "audio"
      },
      "systemPrompt": "You are concise.",
      "greeting": "Hi, how can I help?"
    },
    "dynamicVariables": {
      "customer_name": "Alice",
      "plan_tier": "Pro"
    }
  }
}
```
Rules:
- Assistant config is resolved strictly by URL query `assistant_id`.
- `metadata` top-level keys allowed: `overrides`, `dynamicVariables`, `channel`, `source`, `history`, `workflow` (`workflow` is ignored).
- `metadata.overrides` whitelist: `systemPrompt`, `greeting`, `firstTurnMode`, `generatedOpenerEnabled`, `output`, `bargeIn`, `knowledgeBaseId`, `knowledge`, `tools`, `openerAudio`.
- `metadata.services` is rejected with `protocol.invalid_override`.
- `metadata.workflow` is ignored in this MVP protocol version.
- Top-level IDs are forbidden in payload (`assistantId`, `appId`, `app_id`, `configVersionId`, `config_version_id`).
- Secret-like keys are forbidden in metadata (`apiKey`, `token`, `secret`, `password`, `authorization`).
- `metadata.dynamicVariables` is optional and must be an object of string key/value pairs.
  - Key pattern: `^[a-zA-Z_][a-zA-Z0-9_]{0,63}$`
  - Max entries: 30
  - Max value length: 1000 chars
- Placeholder format in `systemPrompt` and `greeting`: `{{variable_name}}`.
- Built-in system variables (always available): `{{system__time}}`, `{{system_utc}}`, `{{system_timezone}}`.
  - `system__time`: current local time (`YYYY-MM-DD HH:mm:ss`)
  - `system_utc`: current UTC time (`YYYY-MM-DD HH:mm:ss`)
  - `system_timezone`: current local timezone
- Missing referenced placeholders reject `session.start` with `protocol.dynamic_variables_missing`.
- Invalid `dynamicVariables` payload rejects `session.start` with `protocol.dynamic_variables_invalid`.
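The `dynamicVariables` limits and placeholder check above can be sketched as follows; the function names and the return-an-error-code convention are illustrative assumptions:

```python
import re

# Constants taken from the rules above.
KEY_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
MAX_ENTRIES = 30
MAX_VALUE_LEN = 1000
SYSTEM_VARS = {"system__time", "system_utc", "system_timezone"}


def validate_dynamic_variables(dv) -> "str | None":
    """Return an error code, or None if the payload is valid."""
    if not isinstance(dv, dict) or len(dv) > MAX_ENTRIES:
        return "protocol.dynamic_variables_invalid"
    for key, value in dv.items():
        if not KEY_RE.match(key) or not isinstance(value, str) or len(value) > MAX_VALUE_LEN:
            return "protocol.dynamic_variables_invalid"
    return None


def check_placeholders(template: str, dv: dict) -> "str | None":
    """Reject when a referenced {{placeholder}} is neither user-supplied nor built in."""
    for name in re.findall(r"\{\{([a-zA-Z_][a-zA-Z0-9_]*)\}\}", template):
        if name not in dv and name not in SYSTEM_VARS:
            return "protocol.dynamic_variables_missing"
    return None
```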
Text-only mode:
- Set `metadata.overrides.output.mode = "text"`.
- In this mode the server still sends `assistant.response.delta`/`final` events but emits no audio frames and no `output.audio.start`/`end`.
### `input.text`
```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```
### `response.cancel`
```json
{
  "type": "response.cancel",
  "graceful": false
}
```
### `session.stop`
```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```
### `tool_call.results`
Client-side tool execution results returned to the server.
Only needed when `assistant.tool_call.executor == "client"` (execution is server-side by default).
```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
```
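A client-side helper for assembling this message might look like the following sketch; the `build_tool_results` helper and its default `status` are assumptions, only the wire shape comes from the example above:

```python
import json

def build_tool_results(results: list) -> str:
    """Serialize client tool outputs as a `tool_call.results` text frame."""
    msg = {
        "type": "tool_call.results",
        "results": [
            {
                "tool_call_id": r["tool_call_id"],
                "name": r["name"],
                "output": r.get("output"),
                # Defaulting status to 200/ok is an assumption of this sketch.
                "status": r.get("status", {"code": 200, "message": "ok"}),
            }
            for r in results
        ],
    }
    return json.dumps(msg)
```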
## Server -> Client Events
All server events include an envelope:
```json
{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}
```
Envelope notes:
- `seq` is monotonically increasing within one session (for replay/resume).
- `source` is one of: `asr | llm | tts | tool | system | client | server`.
- For `assistant.tool_result`, `source` may be `client` or `server` to indicate execution side.
- `data` is structured payload; legacy top-level fields are kept for compatibility.
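A minimal sketch of envelope construction with a per-session monotonic `seq`; only the field names come from the schema above, the `EventSender` class is illustrative:

```python
import itertools
import time

class EventSender:
    def __init__(self, session_id: str):
        self.session_id = session_id
        # seq is monotonically increasing within one session (replay/resume).
        self._seq = itertools.count(1)

    def envelope(self, event_type: str, source: str, track_id: str, data: dict) -> dict:
        return {
            "type": event_type,
            "timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "sessionId": self.session_id,
            "seq": next(self._seq),
            "source": source,     # asr | llm | tts | tool | system | client | server
            "trackId": track_id,  # audio_in | audio_out | control
            "data": data,
        }
```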
Common events:
- `session.started`
  - Fields: `sessionId`, `trackId`, `tracks`, `audio`
- `config.resolved`
  - Fields: `sessionId`, `trackId`, `config`
  - Sent immediately after `session.started`.
  - Contains the effective model, voice, output mode, tool allowlist, and prompt hash; never includes secrets.
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call`, `tool_call_id`, `tool_name`, `arguments`, `executor`, `timeout_ms`
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`, `tool_call_id`, `tool_name`, `ok`, `error`
  - `error`: `{ code, message, retryable }` when `ok = false`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
  - `trackId` convention:
    - `audio_in` for `stage in {audio, asr}`
    - `audio_out` for `stage in {llm, tts, tool}`
    - `control` otherwise (including protocol errors)
Track IDs (MVP fixed values):
- `audio_in`: ASR/VAD input-side events (`input.*`, `transcript.*`)
- `audio_out`: assistant output-side events (`assistant.*`, `output.audio.*`, `response.interrupted`, `metrics.ttfb`)
- `control`: session/control events (`session.*`, `error`, `config.resolved`)
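The fixed stage-to-track mapping above can be expressed as a small helper (the function name is an assumption of this sketch):

```python
def track_id_for_stage(stage: str) -> str:
    """Map a pipeline stage to the MVP fixed trackId values."""
    if stage in ("audio", "asr"):
        return "audio_in"
    if stage in ("llm", "tts", "tool"):
        return "audio_out"
    # Session/control events and protocol errors.
    return "control"
```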
Correlation IDs (`event.data`):
- `turn_id`: one user-assistant interaction turn.
- `utterance_id`: one ASR final utterance.
- `response_id`: one assistant response generation.
- `tool_call_id`: one tool invocation.
- `tts_id`: one TTS playback segment.
## Binary Audio Frames
After `session.started`, the client may stream binary PCM chunks continuously.
MVP fixed format:
- 16-bit signed little-endian PCM (`pcm_s16le`)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes
Framing rules:
- Binary audio frame unit is 640 bytes.
- A WS binary message may carry one or multiple complete 640-byte frames.
- Non-640-multiple payloads are rejected as `audio.frame_size_mismatch`; that WS message is dropped (no partial buffering/reassembly).
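The 640-byte figure follows from the fixed format: 16000 Hz × 0.020 s × 2 bytes/sample × 1 channel = 640 bytes. A receive-side check consistent with the rules above might look like this sketch (names are illustrative; rejecting empty payloads is an extra assumption):

```python
# Derived from the MVP fixed audio format.
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # pcm_s16le, mono
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 640

def validate_audio_payload(payload: bytes) -> "str | None":
    """Return an error code unless the WS message holds whole 640-byte frames."""
    if len(payload) == 0 or len(payload) % FRAME_BYTES != 0:
        # The offending WS message is dropped; there is no partial buffering.
        return "audio.frame_size_mismatch"
    return None
```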
TTS boundary events:
- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.
## Event Throttling
To keep client rendering and server load stable, v1 applies (or, for clients, recommends) the following cadences:
- `transcript.delta`: merge to ~200-500ms cadence (server default: 300ms).
- `assistant.response.delta`: merge to ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond `metrics.ttfb`): emit every ~500-1000ms.
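A delta coalescer matching the 300ms `transcript.delta` default might look like this sketch; the `DeltaCoalescer` class is not part of the protocol, and it takes an explicit clock argument for testability:

```python
class DeltaCoalescer:
    """Buffer text deltas and flush merged text on a fixed cadence."""

    def __init__(self, interval_ms: int = 300):
        self.interval_ms = interval_ms
        self._buffer = ""
        self._last_flush_ms = 0

    def push(self, text: str, now_ms: int) -> "str | None":
        """Add a delta; return the merged text once the interval has elapsed."""
        self._buffer += text
        if now_ms - self._last_flush_ms >= self.interval_ms and self._buffer:
            out, self._buffer = self._buffer, ""
            self._last_flush_ms = now_ms
            return out
        return None
```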
## Error Structure
`error` keeps legacy top-level fields (`code`, `message`) and adds structured info:
- `stage`: `protocol | asr | llm | tts | tool | audio`
- `retryable`: boolean
- `data.error`: `{ stage, code, message, retryable }`
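Emitting an error in this dual shape (legacy top-level fields plus structured `data.error`) can be sketched as follows; the builder name is an assumption:

```python
def build_error_event(stage: str, code: str, message: str, retryable: bool) -> dict:
    """Build an `error` event carrying both legacy and structured fields."""
    err = {"stage": stage, "code": code, "message": message, "retryable": retryable}
    return {
        "type": "error",
        "code": code,        # legacy top-level fields kept for compatibility
        "message": message,
        "stage": stage,
        "retryable": retryable,
        "data": {"error": err},
    }
```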
## Compatibility
This endpoint now enforces v1 message schema for JSON control frames.
Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.