AI-VideoAssistant/engine/docs/ws_v1_schema.md
2026-02-10 19:13:54 +08:00

WS v1 Protocol Schema (/ws)

This document defines the public WebSocket protocol for the /ws endpoint.

Transport

  • A single WebSocket connection carries:
    • JSON text frames for control/events.
    • Binary frames for raw PCM audio (pcm_s16le, mono, 16 kHz by default).
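
A receiver can separate the two frame kinds by payload type alone. The following is a minimal sketch, assuming the WebSocket library hands text frames over as `str` and binary frames as `bytes`; `classify_frame` is a hypothetical helper name, not part of the protocol.

```python
import json
from typing import Any, Tuple, Union


def classify_frame(frame: Union[str, bytes, bytearray]) -> Tuple[str, Any]:
    """Return ("audio", bytes) for binary frames and ("json", dict) for text frames."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary frames carry raw PCM and are passed through untouched.
        return "audio", bytes(frame)
    message = json.loads(frame)
    if not isinstance(message, dict) or "type" not in message:
        raise ValueError("control frame must be a JSON object with a 'type' field")
    return "json", message
```

Keeping audio out of the JSON path avoids base64 overhead and lets control messages stay small and human-readable.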

Handshake and State Machine

Required message order:

  1. Client sends hello.
  2. Server replies hello.ack.
  3. Client sends session.start.
  4. Server replies session.started.
  5. Client may stream binary audio and/or send input.text.
  6. Client sends session.stop (or closes the socket).

If the order is violated, the server emits an error event with code = "protocol.order".
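
The ordering rule can be sketched as a small state tracker over client message types. This is an illustration only: the state names (`connected`, `greeted`, `active`, `stopped`) and the `"audio"` pseudo-type for binary frames are hypothetical, and treating response.cancel and tool_call.results as valid only during an active session is an assumption, since the list above does not mention them. The returned error payload is simplified and omits the other error fields.

```python
from typing import Optional

# Transitions taken from the required message order (state names are hypothetical).
TRANSITIONS = {
    ("connected", "hello"): "greeted",
    ("greeted", "session.start"): "active",
    ("active", "session.stop"): "stopped",
}

# Messages assumed to be valid any number of times while a session is active.
# "audio" stands in for a binary PCM frame.
IN_SESSION = {"audio", "input.text", "response.cancel", "tool_call.results"}


class OrderTracker:
    def __init__(self) -> None:
        self.state = "connected"

    def check(self, msg_type: str) -> Optional[dict]:
        """Advance the state; return an error payload if the order is violated."""
        next_state = TRANSITIONS.get((self.state, msg_type))
        if next_state is not None:
            self.state = next_state
            return None
        if self.state == "active" and msg_type in IN_SESSION:
            return None
        return {
            "type": "error",
            "code": "protocol.order",
            "message": f"unexpected {msg_type} in state {self.state}",
        }
```

For example, calling `check("session.start")` on a fresh tracker returns the error payload, because hello has not been sent yet.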

Client -> Server Messages

hello

{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}

Rules:

  • version must be v1.
  • If WS_API_KEY is configured on the server, auth.apiKey must match.
  • If WS_REQUIRE_AUTH=true, either auth.apiKey or auth.jwt must be present.
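
The three rules above could be checked along these lines. This is a sketch under assumptions, not the server's actual implementation: `validate_hello` is a hypothetical function, its parameters stand in for the WS_API_KEY and WS_REQUIRE_AUTH settings, and the returned strings are illustrative reasons rather than protocol error codes. JWT verification is out of scope here; only presence is checked.

```python
import hmac
from typing import Optional


def validate_hello(
    msg: dict,
    api_key: Optional[str] = None,  # stands in for WS_API_KEY
    require_auth: bool = False,     # stands in for WS_REQUIRE_AUTH=true
) -> Optional[str]:
    """Return None if the hello message is acceptable, else a reason string."""
    if msg.get("type") != "hello":
        return "expected hello"
    if msg.get("version") != "v1":
        return "unsupported version"
    auth = msg.get("auth")
    if not isinstance(auth, dict):
        auth = {}
    supplied = auth.get("apiKey")
    if api_key is not None:
        # Constant-time comparison avoids leaking key contents through timing.
        if not isinstance(supplied, str) or not hmac.compare_digest(
            supplied.encode(), api_key.encode()
        ):
            return "apiKey mismatch"
    if require_auth and not (auth.get("apiKey") or auth.get("jwt")):
        return "auth required"
    return None
```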

session.start

{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "client": "web-debug",
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "siliconflow",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "provider": "siliconflow",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}

metadata.services is optional. If it is omitted, the server falls back to its environment configuration.
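
As an illustration of the optional services block, a client might assemble session.start as below. `build_session_start` is a hypothetical helper. Always sending a metadata object with a client field is an assumption made for this sketch, since the schema above only states that metadata.services may be omitted.

```python
import json
from typing import Optional


def build_session_start(
    client: str = "web-debug",
    services: Optional[dict] = None,
    sample_rate_hz: int = 16000,
) -> str:
    """Serialize a session.start message; services is included only when given."""
    metadata: dict = {"client": client}
    if services:
        # Per-session overrides; without them the server uses its own configuration.
        metadata["services"] = services
    message = {
        "type": "session.start",
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate_hz": sample_rate_hz,
            "channels": 1,
        },
        "metadata": metadata,
    }
    return json.dumps(message)
```

Passing `services={"llm": {"provider": "openai", "model": "gpt-4o-mini"}}` would override only the LLM settings shown in the example message, assuming the server merges partial overrides, which this document does not confirm.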

input.text

{
  "type": "input.text",
  "text": "What can you do?"
}

response.cancel

{
  "type": "response.cancel",
  "graceful": false
}

session.stop

{
  "type": "session.stop",
  "reason": "client_disconnect"
}

tool_call.results

Results of client-side tool execution, returned to the server.

{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
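
A client that executes tools locally might wrap each call and package the outcome in the shape shown above. This is a sketch only: `run_tool` and `build_tool_results` are hypothetical helpers, and reporting failures with status code 500 and a null output is an assumption, as this document only shows the success case (200, "ok").

```python
import json
from typing import Any, Callable, Dict, Iterable


def run_tool(tool_call_id: str, name: str, handler: Callable[[], Any]) -> Dict[str, Any]:
    """Run a local tool handler and wrap its outcome as one results entry."""
    try:
        output = handler()
        status = {"code": 200, "message": "ok"}
    except Exception as exc:
        # Assumed error convention; the schema above does not define failure codes.
        output = None
        status = {"code": 500, "message": str(exc)}
    return {
        "tool_call_id": tool_call_id,
        "name": name,
        "output": output,
        "status": status,
    }


def build_tool_results(results: Iterable[Dict[str, Any]]) -> str:
    """Serialize one or more result entries into a tool_call.results message."""
    return json.dumps({"type": "tool_call.results", "results": list(results)})
```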

Server -> Client Events

All server events include:

{
  "type": "event.name",
  "timestamp": 1730000000000
}

Common events:

  • hello.ack
    • Fields: sessionId, version
  • session.started
    • Fields: sessionId, trackId, audio
  • session.stopped
    • Fields: sessionId, reason
  • heartbeat
  • input.speech_started
    • Fields: trackId, probability
  • input.speech_stopped
    • Fields: trackId, probability
  • transcript.delta
    • Fields: trackId, text
  • transcript.final
    • Fields: trackId, text
  • assistant.response.delta
    • Fields: trackId, text
  • assistant.response.final
    • Fields: trackId, text
  • assistant.tool_call
    • Fields: trackId, tool_call (tool_call.executor is client or server)
  • assistant.tool_result
    • Fields: trackId, source, result
  • output.audio.start
    • Fields: trackId
  • output.audio.end
    • Fields: trackId
  • response.interrupted
    • Fields: trackId
  • metrics.ttfb
    • Fields: trackId, latencyMs
  • error
    • Fields: sender, code, message, trackId
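
Because every event carries type and timestamp, a client can route events with a simple lookup table. The sketch below assumes nothing beyond those two common fields; `EventRouter` is a hypothetical class, and silently ignoring event types without a handler is a design choice of this example, not a protocol requirement.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], None]


class EventRouter:
    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, event_type: str, handler: Handler) -> None:
        """Register the handler for one event type, replacing any previous one."""
        self._handlers[event_type] = handler

    def dispatch(self, raw: str) -> bool:
        """Parse a JSON text frame and call its handler; return False if unhandled."""
        event = json.loads(raw)
        if not isinstance(event, dict) or "type" not in event or "timestamp" not in event:
            raise ValueError("server event must include 'type' and 'timestamp'")
        handler = self._handlers.get(event["type"])
        if handler is None:
            return False
        handler(event)
        return True
```

A typical use is to register handlers for transcript.final and error first, then feed every incoming text frame to `dispatch`.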

Binary Audio Frames

After session.started, the client may send binary PCM chunks continuously.

Recommended format:

  • 16-bit signed little-endian PCM.
  • 1 channel.
  • 16000 Hz.
  • 20 ms frames preferred (640 bytes at 16 kHz, mono, 16-bit).
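
The 640-byte figure follows from the format: 16000 samples/s × 0.020 s × 1 channel × 2 bytes per sample. A minimal sketch of that arithmetic and of splitting a PCM buffer into frames is shown below; `frame_bytes` and `chunk_pcm` are hypothetical helper names.

```python
from typing import Iterator


def frame_bytes(
    sample_rate_hz: int = 16000,
    channels: int = 1,
    frame_ms: int = 20,
    bytes_per_sample: int = 2,  # 16-bit samples
) -> int:
    """Size in bytes of one PCM frame of the given duration."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * channels * bytes_per_sample


def chunk_pcm(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `size` bytes; the last slice may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

Each yielded chunk would be sent as one binary WebSocket frame. Whether a short trailing chunk should be padded or sent as-is is not specified in this document.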

Compatibility

This endpoint now enforces the v1 message schema for JSON control frames. Legacy command names (invite, chat, etc.) are no longer part of the public protocol.