
WS v1 Protocol Schema (/ws)

This document defines the public WebSocket protocol for the /ws endpoint.

Transport

  • A single WebSocket connection carries:
    • JSON text frames for control/events.
    • Binary frames for raw PCM audio (pcm_s16le, mono, 16kHz by default).

Handshake and State Machine

Required message order:

  1. Client sends hello.
  2. Server replies hello.ack.
  3. Client sends session.start.
  4. Server replies session.started.
  5. Client may stream binary audio and/or send input.text.
  6. Client sends session.stop (or closes socket).

If this order is violated, the server emits an error event with code = "protocol.order".
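The required order can be enforced with a small guard. The helper below is an illustrative sketch (check_handshake is not part of the protocol), using only the message types defined above:

```python
# Required v1 handshake order, as defined above.
REQUIRED_ORDER = ("hello", "hello.ack", "session.start", "session.started")

def check_handshake(message_types):
    """Verify that a list of message types begins with the required
    handshake sequence; raise ValueError("protocol.order") otherwise.
    Messages after the handshake (binary audio, input.text, ...) are
    not order-checked here."""
    for i, expected in enumerate(REQUIRED_ORDER):
        if i >= len(message_types) or message_types[i] != expected:
            raise ValueError("protocol.order")
    return True
```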

Client -> Server Messages

hello

{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}

Rules:

  • version must be v1.
  • If WS_API_KEY is configured on server, auth.apiKey must match.
  • If WS_REQUIRE_AUTH=true, either auth.apiKey or auth.jwt must be present.
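The three rules can be applied as a single check. This is an illustrative sketch; the rejection-reason strings returned below are hypothetical, not part of the schema:

```python
def validate_hello(msg, server_api_key=None, require_auth=False):
    """Apply the hello rules above. Returns None when the message is
    accepted, or a (hypothetical) rejection reason string."""
    if msg.get("type") != "hello" or msg.get("version") != "v1":
        return "version_mismatch"      # version must be v1
    auth = msg.get("auth") or {}
    # If WS_API_KEY is configured on the server, auth.apiKey must match it.
    if server_api_key is not None and auth.get("apiKey") != server_api_key:
        return "api_key_mismatch"
    # If WS_REQUIRE_AUTH=true, some credential must be present.
    if require_auth and not (auth.get("apiKey") or auth.get("jwt")):
        return "auth_required"
    return None
```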

session.start

{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "appId": "assistant_123",
    "channel": "web",
    "configVersionId": "cfg_20260217_01",
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?"
  }
}

Rules:

  • Client-side metadata.services is ignored.
  • Service config (including secrets) is resolved server-side (env/backend).
  • Client should pass stable IDs (appId, channel, configVersionId) plus small runtime overrides (e.g. output, bargeIn, greeting/prompt style hints).

Text-only mode:

  • Set metadata.output.mode = "text".
  • In this mode the server still sends assistant.response.delta/final but does not emit audio frames or output.audio.start/end.
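A minimal session.start builder, assuming the MVP audio format; field names follow the schema, while the helper itself and its defaults are illustrative:

```python
def make_session_start(app_id, channel, config_version_id, text_only=False):
    """Build a session.start frame with the fixed MVP audio format."""
    return {
        "type": "session.start",
        "audio": {"encoding": "pcm_s16le", "sample_rate_hz": 16000, "channels": 1},
        "metadata": {
            "appId": app_id,
            "channel": channel,
            "configVersionId": config_version_id,
            # Text-only mode: server skips audio frames and output.audio.* events.
            "output": {"mode": "text" if text_only else "audio"},
        },
    }
```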

input.text

{
  "type": "input.text",
  "text": "What can you do?"
}

response.cancel

{
  "type": "response.cancel",
  "graceful": false
}

session.stop

{
  "type": "session.stop",
  "reason": "client_disconnect"
}

tool_call.results

Returns client-side tool execution results to the server. Only needed when assistant.tool_call has executor == "client" (execution is server-side by default).

{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
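A client-side dispatch sketch: execute the tool named in an assistant.tool_call event (when executor == "client") and build the tool_call.results reply. The registry pattern and the 500 status on failure are assumptions, not schema-defined:

```python
def handle_tool_call(event, registry):
    """Run a client-side tool and build the tool_call.results reply.
    `registry` maps tool name -> callable. Returns None when the tool
    is executed server-side (executor != "client")."""
    call = event.get("data", {})
    if call.get("executor") != "client":
        return None
    try:
        output = registry[call["tool_name"]](**call.get("arguments", {}))
        status = {"code": 200, "message": "ok"}
    except Exception as exc:  # report the failure rather than dropping the call
        output, status = None, {"code": 500, "message": str(exc)}
    return {
        "type": "tool_call.results",
        "results": [{
            "tool_call_id": call["tool_call_id"],
            "name": call["tool_name"],
            "output": output,
            "status": status,
        }],
    }
```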

Server -> Client Events

All server events include an envelope:

{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}

Envelope notes:

  • seq is monotonically increasing within one session (for replay/resume).
  • source is one of: asr | llm | tts | tool | system.
  • data is the structured payload; legacy top-level fields are kept for compatibility.
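Because seq is strictly increasing within a session, a client can flag replayed or out-of-order events with a simple tracker; an illustrative sketch (resume semantics themselves are server-defined):

```python
class SeqTracker:
    """Flags envelope seq values that do not strictly increase,
    e.g. as a trigger for replay/resume handling."""

    def __init__(self):
        self.last = None

    def observe(self, event):
        """Return True if the event's seq is in order; False otherwise."""
        seq = event["seq"]
        in_order = self.last is None or seq > self.last
        if in_order:
            self.last = seq
        return in_order
```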

Common events:

  • hello.ack
    • Fields: sessionId, version
  • session.started
    • Fields: sessionId, trackId, tracks, audio
  • config.resolved
    • Fields: sessionId, trackId, config
    • Sent immediately after session.started.
    • Contains the effective model, voice, output mode, tool allowlist, and prompt hash; never includes secrets.
  • session.stopped
    • Fields: sessionId, reason
  • heartbeat
  • input.speech_started
    • Fields: trackId, probability
  • input.speech_stopped
    • Fields: trackId, probability
  • transcript.delta
    • Fields: trackId, text
  • transcript.final
    • Fields: trackId, text
  • assistant.response.delta
    • Fields: trackId, text
  • assistant.response.final
    • Fields: trackId, text
  • assistant.tool_call
    • Fields: trackId, tool_call, tool_call_id, tool_name, arguments, executor, timeout_ms
  • assistant.tool_result
    • Fields: trackId, source, result, tool_call_id, tool_name, ok, error
    • error: { code, message, retryable } when ok=false
  • output.audio.start
    • Fields: trackId
  • output.audio.end
    • Fields: trackId
  • response.interrupted
    • Fields: trackId
  • metrics.ttfb
    • Fields: trackId, latencyMs
  • error
    • Fields: sender, code, message, trackId

Track IDs (MVP fixed values):

  • audio_in: ASR/VAD input-side events (input.*, transcript.*)
  • audio_out: assistant output-side events (assistant.*, output.audio.*, response.interrupted, metrics.ttfb)
  • control: session/control events (session.*, hello.*, error, config.resolved)

Correlation IDs (event.data):

  • turn_id: one user-assistant interaction turn.
  • utterance_id: one ASR final utterance.
  • response_id: one assistant response generation.
  • tool_call_id: one tool invocation.
  • tts_id: one TTS playback segment.
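These IDs make it straightforward to reassemble a session log per turn; a hypothetical grouping helper:

```python
from collections import defaultdict

def group_by_turn(events):
    """Bucket server event types by data.turn_id for per-turn debugging.
    Events without a turn_id (e.g. heartbeat) land under None."""
    turns = defaultdict(list)
    for ev in events:
        turns[ev.get("data", {}).get("turn_id")].append(ev["type"])
    return dict(turns)
```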

Binary Audio Frames

After session.started, client may send binary PCM chunks continuously.

MVP fixed format:

  • 16-bit signed little-endian PCM (pcm_s16le)
  • mono (1 channel)
  • 16000 Hz
  • 20ms frame = 640 bytes

Framing rules:

  • Binary audio frame unit is 640 bytes.
  • A WS binary message may carry one or multiple complete 640-byte frames.
  • Non-640-multiple payloads are treated as audio.frame_size_mismatch protocol errors.
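Per these rules, outbound audio must be cut into whole 640-byte frames before sending; a sketch (the helper name is illustrative):

```python
# 20 ms of pcm_s16le mono at 16 kHz: 16000 samples/s * 0.020 s * 2 bytes = 640.
FRAME_BYTES = 640

def split_frames(pcm: bytes):
    """Split raw PCM into 640-byte frames for WS binary messages.
    Payloads that are not a whole number of frames mirror the server's
    audio.frame_size_mismatch protocol error."""
    if len(pcm) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [pcm[i:i + FRAME_BYTES] for i in range(0, len(pcm), FRAME_BYTES)]
```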

TTS boundary events:

  • output.audio.start and output.audio.end mark assistant playback boundaries.

Event Throttling

To keep client rendering and server load stable, v1 applies (or, for optional streams, recommends) the following cadences:

  • transcript.delta: merge to ~200-500ms cadence (server default: 300ms).
  • assistant.response.delta: merge to ~50-100ms cadence (server default: 80ms).
  • Metrics streams (if enabled beyond metrics.ttfb): emit every ~500-1000ms.
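A delta merger in the spirit of the server defaults above; this class is an illustrative sketch (a real implementation would hook into the event loop's clock rather than take timestamps as arguments):

```python
class DeltaMerger:
    """Coalesces delta text into batches emitted at most once per
    `interval_ms` (e.g. 300 for transcript.delta, 80 for
    assistant.response.delta)."""

    def __init__(self, interval_ms=300):
        self.interval_ms = interval_ms
        self.buffer = ""
        self.last_emit_ms = None

    def push(self, text, now_ms):
        """Buffer `text`; return the merged batch when the interval has
        elapsed, else None."""
        self.buffer += text
        if self.last_emit_ms is None or now_ms - self.last_emit_ms >= self.interval_ms:
            out, self.buffer = self.buffer, ""
            self.last_emit_ms = now_ms
            return out
        return None
```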

Error Structure

The error event keeps the legacy top-level fields (code, message) and adds structured info:

  • stage: protocol | asr | llm | tts | tool | audio
  • retryable: boolean
  • data.error: { stage, code, message, retryable }
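A client can use the structured retryable flag to drive its retry policy; the helper and the attempt cap below are assumptions, not part of the schema:

```python
def should_retry(error_event, attempt, max_attempts=3):
    """Retry only when the server marked the failure retryable and the
    attempt budget is not exhausted."""
    info = error_event.get("data", {}).get("error", {})
    return bool(info.get("retryable")) and attempt < max_attempts
```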

Compatibility

This endpoint now enforces v1 message schema for JSON control frames. Legacy command names (invite, chat, etc.) are no longer part of the public protocol.