
WS v1 Protocol Schema (/ws)

This document defines the public WebSocket protocol for the /ws endpoint.

Transport

  • A single WebSocket connection carries:
    • JSON text frames for control/events.
    • Binary frames for raw PCM audio (pcm_s16le, mono, 16kHz by default).

Handshake and State Machine

Required message order:

  1. Client sends hello.
  2. Server replies hello.ack.
  3. Client sends session.start.
  4. Server replies session.started.
  5. Client may stream binary audio and/or send input.text.
  6. Client sends session.stop (or closes socket).

If this order is violated, the server emits an error event with code = "protocol.order".
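The required order can be enforced with a small guard. The helper below is an illustrative sketch (check_handshake is not part of the protocol), using only the message types defined above:

```python
# Required v1 handshake order, as defined above.
REQUIRED_ORDER = ("hello", "hello.ack", "session.start", "session.started")

def check_handshake(message_types):
    """Verify that a list of message types begins with the required
    handshake sequence; raise ValueError("protocol.order") otherwise.
    Messages after the handshake (binary audio, input.text, ...) are
    not order-checked here."""
    for i, expected in enumerate(REQUIRED_ORDER):
        if i >= len(message_types) or message_types[i] != expected:
            raise ValueError("protocol.order")
    return True
```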

Client -> Server Messages

hello

{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}

Rules:

  • version must be v1.
  • If WS_API_KEY is configured on server, auth.apiKey must match.
  • If WS_REQUIRE_AUTH=true, either auth.apiKey or auth.jwt must be present.
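The three rules can be applied as a single check. This is an illustrative sketch; the rejection-reason strings returned below are hypothetical, not part of the schema:

```python
def validate_hello(msg, server_api_key=None, require_auth=False):
    """Apply the hello rules above. Returns None when the message is
    accepted, or a (hypothetical) rejection reason string."""
    if msg.get("type") != "hello" or msg.get("version") != "v1":
        return "version_mismatch"      # version must be v1
    auth = msg.get("auth") or {}
    # If WS_API_KEY is configured on the server, auth.apiKey must match it.
    if server_api_key is not None and auth.get("apiKey") != server_api_key:
        return "api_key_mismatch"
    # If WS_REQUIRE_AUTH=true, some credential must be present.
    if require_auth and not (auth.get("apiKey") or auth.get("jwt")):
        return "auth_required"
    return None
```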

session.start

{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "appId": "assistant_123",
    "channel": "web",
    "configVersionId": "cfg_20260217_01",
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?"
  }
}

Rules:

  • Client-side metadata.services is ignored.
  • Service config (including secrets) is resolved server-side (env/backend).
  • Client should pass stable IDs (appId, channel, configVersionId) plus small runtime overrides (e.g. output, bargeIn, greeting/prompt style hints).

Text-only mode:

  • Set metadata.output.mode = "text".
  • In this mode the server still sends assistant.response.delta/final but does not emit audio frames or output.audio.start/end.
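A minimal session.start builder, assuming the MVP audio format; field names follow the schema, while the helper itself and its defaults are illustrative:

```python
def make_session_start(app_id, channel, config_version_id, text_only=False):
    """Build a session.start frame with the fixed MVP audio format."""
    return {
        "type": "session.start",
        "audio": {"encoding": "pcm_s16le", "sample_rate_hz": 16000, "channels": 1},
        "metadata": {
            "appId": app_id,
            "channel": channel,
            "configVersionId": config_version_id,
            # Text-only mode: server skips audio frames and output.audio.* events.
            "output": {"mode": "text" if text_only else "audio"},
        },
    }
```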

input.text

{
  "type": "input.text",
  "text": "What can you do?"
}

response.cancel

{
  "type": "response.cancel",
  "graceful": false
}

session.stop

{
  "type": "session.stop",
  "reason": "client_disconnect"
}

tool_call.results

Returns client-side tool execution results to the server. Only needed when assistant.tool_call has executor == "client" (execution is server-side by default).

{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
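A client-side dispatch sketch: execute the tool named in an assistant.tool_call event (when executor == "client") and build the tool_call.results reply. The registry pattern and the 500 status on failure are assumptions, not schema-defined:

```python
def handle_tool_call(event, registry):
    """Run a client-side tool and build the tool_call.results reply.
    `registry` maps tool name -> callable. Returns None when the tool
    is executed server-side (executor != "client")."""
    call = event.get("data", {})
    if call.get("executor") != "client":
        return None
    try:
        output = registry[call["tool_name"]](**call.get("arguments", {}))
        status = {"code": 200, "message": "ok"}
    except Exception as exc:  # report the failure rather than dropping the call
        output, status = None, {"code": 500, "message": str(exc)}
    return {
        "type": "tool_call.results",
        "results": [{
            "tool_call_id": call["tool_call_id"],
            "name": call["tool_name"],
            "output": output,
            "status": status,
        }],
    }
```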

Server -> Client Events

All server events include an envelope:

{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}

Envelope notes:

  • seq is monotonically increasing within one session (for replay/resume).
  • source is one of: asr | llm | tts | tool | system.
  • data is the structured payload; legacy top-level fields are kept for compatibility.
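Because seq is strictly increasing within a session, a client can flag replayed or out-of-order events with a simple tracker; an illustrative sketch (resume semantics themselves are server-defined):

```python
class SeqTracker:
    """Flags envelope seq values that do not strictly increase,
    e.g. as a trigger for replay/resume handling."""

    def __init__(self):
        self.last = None

    def observe(self, event):
        """Return True if the event's seq is in order; False otherwise."""
        seq = event["seq"]
        in_order = self.last is None or seq > self.last
        if in_order:
            self.last = seq
        return in_order
```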

Common events:

  • hello.ack
    • Fields: sessionId, version
  • session.started
    • Fields: sessionId, trackId, tracks, audio
  • config.resolved
    • Fields: sessionId, trackId, config
    • Sent immediately after session.started.
    • Contains the effective model, voice, output mode, tool allowlist, and prompt hash; never includes secrets.
  • session.stopped
    • Fields: sessionId, reason
  • heartbeat
  • input.speech_started
    • Fields: trackId, probability
  • input.speech_stopped
    • Fields: trackId, probability
  • transcript.delta
    • Fields: trackId, text
  • transcript.final
    • Fields: trackId, text
  • assistant.response.delta
    • Fields: trackId, text
  • assistant.response.final
    • Fields: trackId, text
  • assistant.tool_call
    • Fields: trackId, tool_call, tool_call_id, tool_name, arguments, executor, timeout_ms
  • assistant.tool_result
    • Fields: trackId, source, result, tool_call_id, tool_name, ok, error
    • error: { code, message, retryable } when ok=false
  • output.audio.start
    • Fields: trackId
  • output.audio.end
    • Fields: trackId
  • response.interrupted
    • Fields: trackId
  • metrics.ttfb
    • Fields: trackId, latencyMs
  • error
    • Fields: sender, code, message, trackId

Track IDs (MVP fixed values):

  • audio_in: ASR/VAD input-side events (input.*, transcript.*)
  • audio_out: assistant output-side events (assistant.*, output.audio.*, response.interrupted, metrics.ttfb)
  • control: session/control events (session.*, hello.*, error, config.resolved)

Correlation IDs (event.data):

  • turn_id: one user-assistant interaction turn.
  • utterance_id: one ASR final utterance.
  • response_id: one assistant response generation.
  • tool_call_id: one tool invocation.
  • tts_id: one TTS playback segment.
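These IDs make it straightforward to reassemble a session log per turn; a hypothetical grouping helper:

```python
from collections import defaultdict

def group_by_turn(events):
    """Bucket server event types by data.turn_id for per-turn debugging.
    Events without a turn_id (e.g. heartbeat) land under None."""
    turns = defaultdict(list)
    for ev in events:
        turns[ev.get("data", {}).get("turn_id")].append(ev["type"])
    return dict(turns)
```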

Binary Audio Frames

After session.started, client may send binary PCM chunks continuously.

MVP fixed format:

  • 16-bit signed little-endian PCM (pcm_s16le)
  • mono (1 channel)
  • 16000 Hz
  • 20ms frame = 640 bytes

Framing rules:

  • Binary audio frame unit is 640 bytes.
  • A WS binary message may carry one or multiple complete 640-byte frames.
  • Non-640-multiple payloads are treated as audio.frame_size_mismatch protocol errors.
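Per these rules, outbound audio must be cut into whole 640-byte frames before sending; a sketch (the helper name is illustrative):

```python
# 20 ms of pcm_s16le mono at 16 kHz: 16000 samples/s * 0.020 s * 2 bytes = 640.
FRAME_BYTES = 640

def split_frames(pcm: bytes):
    """Split raw PCM into 640-byte frames for WS binary messages.
    Payloads that are not a whole number of frames mirror the server's
    audio.frame_size_mismatch protocol error."""
    if len(pcm) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [pcm[i:i + FRAME_BYTES] for i in range(0, len(pcm), FRAME_BYTES)]
```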

TTS boundary events:

  • output.audio.start and output.audio.end mark assistant playback boundaries.

Event Throttling

To keep client rendering and server load stable, v1 applies (or, for optional streams, recommends) the following cadences:

  • transcript.delta: merge to ~200-500ms cadence (server default: 300ms).
  • assistant.response.delta: merge to ~50-100ms cadence (server default: 80ms).
  • Metrics streams (if enabled beyond metrics.ttfb): emit every ~500-1000ms.
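A delta merger in the spirit of the server defaults above; this class is an illustrative sketch (a real implementation would hook into the event loop's clock rather than take timestamps as arguments):

```python
class DeltaMerger:
    """Coalesces delta text into batches emitted at most once per
    `interval_ms` (e.g. 300 for transcript.delta, 80 for
    assistant.response.delta)."""

    def __init__(self, interval_ms=300):
        self.interval_ms = interval_ms
        self.buffer = ""
        self.last_emit_ms = None

    def push(self, text, now_ms):
        """Buffer `text`; return the merged batch when the interval has
        elapsed, else None."""
        self.buffer += text
        if self.last_emit_ms is None or now_ms - self.last_emit_ms >= self.interval_ms:
            out, self.buffer = self.buffer, ""
            self.last_emit_ms = now_ms
            return out
        return None
```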

Error Structure

The error event keeps the legacy top-level fields (code, message) and adds structured info:

  • stage: protocol | asr | llm | tts | tool | audio
  • retryable: boolean
  • data.error: { stage, code, message, retryable }
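A client can use the structured retryable flag to drive its retry policy; the helper and the attempt cap below are assumptions, not part of the schema:

```python
def should_retry(error_event, attempt, max_attempts=3):
    """Retry only when the server marked the failure retryable and the
    attempt budget is not exhausted."""
    info = error_event.get("data", {}).get("error", {})
    return bool(info.get("retryable")) and attempt < max_attempts
```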

Compatibility

This endpoint now enforces v1 message schema for JSON control frames. Legacy command names (invite, chat, etc.) are no longer part of the public protocol.