AI-VideoAssistant-Engine-V2/docs/ws_v1_schema.md

# WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the `/ws` endpoint.

## Transport

- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16 kHz by default).

## Handshake and State Machine

Required message order:

1. Client sends `hello`.
2. Server replies `hello.ack`.
3. Client sends `session.start`.
4. Server replies `session.started`.
5. Client may stream binary audio and/or send `input.text`.
6. Client sends `session.stop` (or closes the socket).

If this order is violated, the server emits an `error` event with `code = "protocol.order"`.
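The ordering rule above can be sketched as a small state tracker. This is an illustrative sketch only, not the server's actual implementation: `Phase`, `OrderTracker`, and the `message` text are hypothetical names, and allowing `response.cancel` and `tool_call.results` only after `session.start` is an assumption that this document does not state explicitly.

```python
from enum import Enum, auto


class Phase(Enum):
    AWAIT_HELLO = auto()
    AWAIT_SESSION_START = auto()
    ACTIVE = auto()
    STOPPED = auto()


# Client message types accepted in each phase. The type names come from this
# document; which phase accepts response.cancel and tool_call.results is an
# assumption made for illustration.
ALLOWED = {
    Phase.AWAIT_HELLO: {"hello"},
    Phase.AWAIT_SESSION_START: {"session.start"},
    Phase.ACTIVE: {"input.text", "response.cancel", "tool_call.results", "session.stop"},
    Phase.STOPPED: set(),
}

# Messages that move the connection into a new phase.
NEXT_PHASE = {
    "hello": Phase.AWAIT_SESSION_START,
    "session.start": Phase.ACTIVE,
    "session.stop": Phase.STOPPED,
}


class OrderTracker:
    """Tracks the handshake phase of one connection (hypothetical helper)."""

    def __init__(self):
        self.phase = Phase.AWAIT_HELLO

    def accept(self, msg_type):
        """Return None if msg_type is allowed now, else an error event dict."""
        if msg_type not in ALLOWED[self.phase]:
            # The `timestamp` field carried by real server events is omitted here.
            return {
                "type": "error",
                "code": "protocol.order",
                "message": f"unexpected {msg_type!r} in phase {self.phase.name}",
            }
        self.phase = NEXT_PHASE.get(msg_type, self.phase)
        return None
```

A rejected message leaves the phase unchanged in this sketch, so a client that sends `session.start` too early can still recover by sending `hello`.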

## Client -> Server Messages

### `hello`

```json
{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}
```

Rules:
- `version` must be `v1`.
- If `WS_API_KEY` is configured on the server, `auth.apiKey` must match it.
- If `WS_REQUIRE_AUTH=true`, either `auth.apiKey` or `auth.jwt` must be present.
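As a rough illustration, the three rules could be checked as follows. The function name `check_hello`, its parameters, and the returned reason strings are hypothetical; the document does not specify how the server reports a failed `hello`, and JWT verification is out of scope here because the rule only requires the token to be present.

```python
import hmac


def check_hello(msg, ws_api_key=None, require_auth=False):
    """Return None when `hello` passes the rules, else a short failure reason.

    ws_api_key and require_auth stand in for the WS_API_KEY and
    WS_REQUIRE_AUTH server settings (how they are loaded is not shown).
    """
    if msg.get("type") != "hello" or msg.get("version") != "v1":
        return "unsupported version"

    auth = msg.get("auth") or {}
    api_key = auth.get("apiKey")
    jwt = auth.get("jwt")

    if ws_api_key:
        # Constant-time comparison avoids leaking key contents through timing.
        if not isinstance(api_key, str) or not hmac.compare_digest(
            api_key.encode(), ws_api_key.encode()
        ):
            return "apiKey mismatch"

    if require_auth and not (api_key or jwt):
        return "auth required"

    return None
```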

### `session.start`

```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "openai_compatible",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "enabled": true,
        "provider": "openai_compatible",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}
```

`metadata.services` is optional. If it is omitted, the server falls back to its environment configuration.

Text-only mode:
- Set `metadata.output.mode = "text"` OR `metadata.services.tts.enabled = false`.
- In this mode the server still sends `assistant.response.delta` and `assistant.response.final`, but does not emit audio frames or `output.audio.start` / `output.audio.end` events.
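The text-only condition can be expressed as a single predicate over the `session.start` payload. This is a minimal sketch; `is_text_only` is a hypothetical helper name, and treating a missing `metadata` block as audio mode is an assumption.

```python
def is_text_only(session_start):
    """True when the session.start payload requests text-only output."""
    metadata = session_start.get("metadata") or {}
    output = metadata.get("output") or {}
    tts = (metadata.get("services") or {}).get("tts") or {}
    # Either condition alone is enough to disable audio output.
    return output.get("mode") == "text" or tts.get("enabled") is False
```

Note that only an explicit `false` for `tts.enabled` counts here; an absent `tts` block does not switch the session to text-only mode.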

### `input.text`

```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```

### `response.cancel`

```json
{
  "type": "response.cancel",
  "graceful": false
}
```

### `session.stop`

```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```

### `tool_call.results`

Carries the results of client-executed tool calls back to the server.

```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
```
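A client might assemble this message along the following lines. This is a sketch under assumptions: `build_tool_results` and the `handler` callable are hypothetical, and the `500` status used for a failed handler is illustrative only, since the document shows just the `200` / `ok` case.

```python
def build_tool_results(tool_call_id, name, handler, arguments):
    """Run a local tool handler and wrap its outcome as a tool_call.results message."""
    try:
        output = handler(**arguments)
        status = {"code": 200, "message": "ok"}
    except Exception as exc:
        # Assumption: failures are reported with a non-200 code and the error text.
        output = {}
        status = {"code": 500, "message": str(exc)}

    return {
        "type": "tool_call.results",
        "results": [
            {
                "tool_call_id": tool_call_id,
                "name": name,
                "output": output,
                "status": status,
            }
        ],
    }
```

The returned dict can be serialized with `json.dumps` and sent as a JSON text frame.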

## Server -> Client Events

All server events include at least the following fields:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000
}
```

Common events:

- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `audio`
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call` (`tool_call.executor` is `client` or `server`)
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`
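On the client side, these events can be routed by their `type` field. The router below is a minimal sketch: `EventRouter` is a hypothetical class, and rejecting frames that lack `type` or `timestamp` is a choice made for this example rather than a rule stated by the protocol.

```python
import json


class EventRouter:
    """Dispatches decoded JSON text frames to per-event handlers."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type, handler):
        """Register a callable that receives the decoded event dict."""
        self._handlers[event_type] = handler

    def dispatch(self, text_frame):
        """Decode one text frame; return True if a handler was invoked."""
        event = json.loads(text_frame)
        if (
            not isinstance(event, dict)
            or not isinstance(event.get("type"), str)
            or "timestamp" not in event
        ):
            raise ValueError("frame is not a v1 server event")
        handler = self._handlers.get(event["type"])
        if handler is None:
            return False  # Unhandled event types are ignored, not treated as errors.
        handler(event)
        return True
```

Ignoring unregistered event types keeps a client tolerant of events it does not care about, such as `heartbeat` or `metrics.ttfb`.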

## Binary Audio Frames

After `session.started`, the client may send binary PCM chunks continuously.

Recommended format:
- 16-bit signed little-endian PCM.
- 1 channel.
- 16000 Hz.
- 20 ms frames (640 bytes) preferred.
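The 640-byte figure follows from 16000 samples/s × 0.020 s × 2 bytes per sample × 1 channel. The sketch below computes it and splits a PCM buffer into frames of that size; `frame_size_bytes` and `iter_frames` are hypothetical helper names, and sending a shorter final chunk as-is is an assumption.

```python
def frame_size_bytes(sample_rate_hz=16000, channels=1, frame_ms=20, bytes_per_sample=2):
    """Size in bytes of one PCM frame of frame_ms milliseconds."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * channels * bytes_per_sample


def iter_frames(pcm, frame_bytes=640):
    """Yield consecutive frame_bytes-sized chunks; the last one may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

Each yielded chunk would be sent as one binary WebSocket frame.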

## Compatibility

This endpoint now enforces the v1 message schema for JSON control frames.
Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.