wx44wx/AI-VideoAssistant

Xin Wang 6cac24918d Now we have server tool and client tool

2026-02-10 19:13:54 +08:00

WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the /ws endpoint.

Transport

A single WebSocket connection carries:
- JSON text frames for control/events.
- Binary frames for raw PCM audio (pcm_s16le, mono, 16 kHz by default).

Handshake and State Machine

Required message order:

1. Client sends hello.
2. Server replies hello.ack.
3. Client sends session.start.
4. Server replies session.started.
5. Client may stream binary audio and/or send input.text.
6. Client sends session.stop (or closes socket).

If the order is violated, the server emits an error event with code = "protocol.order".
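The required order can be modelled as a small state machine. The following is a minimal sketch, not the server's actual implementation: the `State` enum, the `TRANSITIONS` table, and the `advance` helper are hypothetical names, and allowing `response.cancel` and `tool_call.results` only after `session.started` is an assumption that this document does not state explicitly.

```python
from enum import Enum, auto


class State(Enum):
    CONNECTED = auto()    # socket open, no hello received yet
    HELLO_ACKED = auto()  # hello accepted, hello.ack sent
    STARTED = auto()      # session.start accepted, session.started sent
    STOPPED = auto()      # session.stop received


# Message types accepted in each state and the state they lead to.
# A value of None means "stay in the current state".
# Assumption: response.cancel and tool_call.results are only valid
# once the session has started.
TRANSITIONS = {
    State.CONNECTED: {"hello": State.HELLO_ACKED},
    State.HELLO_ACKED: {"session.start": State.STARTED},
    State.STARTED: {
        "input.text": None,
        "response.cancel": None,
        "tool_call.results": None,
        "session.stop": State.STOPPED,
    },
    State.STOPPED: {},
}


def advance(state, msg_type):
    """Return (new_state, error); error is None when the message is in order."""
    allowed = TRANSITIONS[state]
    if msg_type not in allowed:
        error = {
            "type": "error",
            "code": "protocol.order",
            "message": f"unexpected {msg_type!r} in state {state.name}",
        }
        return state, error
    next_state = allowed[msg_type]
    return (state if next_state is None else next_state), None
```

Keeping the table as data makes it easy to compare against the numbered list above when the protocol changes.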

Client -> Server Messages

`hello`

{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}

Rules:

- version must be v1.
- If WS_API_KEY is configured on server, auth.apiKey must match.
- If WS_REQUIRE_AUTH=true, either auth.apiKey or auth.jwt must be present.
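As an illustration, the three rules could be checked roughly like this. This is a sketch under assumptions: `check_hello` is a hypothetical helper, the environment is passed in as a plain dict, the returned reason strings are invented for the example (this document does not define error codes for failed authentication), and JWT verification is left out because it is not described here.

```python
import hmac
from typing import Optional


def check_hello(msg: dict, env: dict) -> Optional[str]:
    """Return None if the hello is acceptable, else a human-readable reason."""
    if msg.get("type") != "hello":
        return "not a hello message"
    if msg.get("version") != "v1":
        return "version must be v1"

    auth = msg.get("auth") or {}
    api_key = auth.get("apiKey")
    jwt = auth.get("jwt")

    # Rule: if WS_API_KEY is configured, auth.apiKey must match it.
    expected = env.get("WS_API_KEY")
    if expected:
        supplied = api_key if isinstance(api_key, str) else ""
        # Constant-time comparison avoids leaking key contents via timing.
        if not hmac.compare_digest(supplied.encode(), expected.encode()):
            return "auth.apiKey does not match"

    # Rule: if WS_REQUIRE_AUTH=true, one of apiKey or jwt must be present.
    # Assumption: the flag is compared case-insensitively.
    if env.get("WS_REQUIRE_AUTH", "").lower() == "true":
        if not api_key and not jwt:
            return "auth.apiKey or auth.jwt is required"

    return None
```

In a real server, `env` would typically be `os.environ`; passing it explicitly keeps the check easy to test.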

`session.start`

{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "client": "web-debug",
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "siliconflow",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "provider": "siliconflow",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}

metadata.services is optional. If it is omitted, the server falls back to its environment configuration.
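A client might assemble this message with a small helper that only includes `services` when overrides are supplied. The sketch below is illustrative: `build_session_start` and its parameters are hypothetical names, and the audio block simply repeats the defaults shown in the example above.

```python
import json
from typing import Optional


def build_session_start(
    system_prompt: str,
    greeting: str,
    services: Optional[dict] = None,
    client: str = "web-debug",
) -> str:
    """Serialize a session.start message as a JSON text frame."""
    metadata = {
        "client": client,
        "systemPrompt": system_prompt,
        "greeting": greeting,
    }
    # Leave the key out entirely when there are no overrides, so the
    # server uses its environment configuration.
    if services is not None:
        metadata["services"] = services

    message = {
        "type": "session.start",
        "audio": {
            "encoding": "pcm_s16le",
            "sample_rate_hz": 16000,
            "channels": 1,
        },
        "metadata": metadata,
    }
    return json.dumps(message)
```

For example, `build_session_start("You are concise.", "Hi, how can I help?")` yields a frame with no `services` key, while passing `services={"tts": {"voice": "anna"}}` overrides only the listed settings, assuming the server accepts partial overrides.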

`input.text`

{
  "type": "input.text",
  "text": "What can you do?"
}

`response.cancel`

{
  "type": "response.cancel",
  "graceful": false
}

`session.stop`

{
  "type": "session.stop",
  "reason": "client_disconnect"
}

`tool_call.results`

Client tool execution results returned to server.

{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
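One way a client could produce this message is to wrap each local tool invocation and collect the entries. This is a sketch under assumptions: `run_client_tool` and `build_tool_results` are hypothetical helpers, and reporting a failure as status code 500 with a null output is an illustrative convention, since this document only shows the success case (code 200, message "ok").

```python
import json
from typing import Any, Callable


def run_client_tool(tool_call_id: str, name: str, handler: Callable[[], Any]) -> dict:
    """Run one client-side tool and shape its outcome as a results entry."""
    try:
        output = handler()
        status = {"code": 200, "message": "ok"}
    except Exception as exc:
        # Assumption: failures are reported with a 500 status and no output.
        output = None
        status = {"code": 500, "message": str(exc)}
    return {
        "tool_call_id": tool_call_id,
        "name": name,
        "output": output,
        "status": status,
    }


def build_tool_results(results: list) -> str:
    """Serialize a tool_call.results message as a JSON text frame."""
    return json.dumps({"type": "tool_call.results", "results": results})
```

A usage example: `build_tool_results([run_client_tool("call_abc123", "weather", lambda: {"temp_c": 21, "condition": "sunny"})])` reproduces the message shown above.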

Server -> Client Events

All server events include:

{
  "type": "event.name",
  "timestamp": 1730000000000
}

Common events:

- hello.ack
  - Fields: sessionId, version
- session.started
  - Fields: sessionId, trackId, audio
- session.stopped
  - Fields: sessionId, reason
- heartbeat
- input.speech_started
  - Fields: trackId, probability
- input.speech_stopped
  - Fields: trackId, probability
- transcript.delta
  - Fields: trackId, text
- transcript.final
  - Fields: trackId, text
- assistant.response.delta
  - Fields: trackId, text
- assistant.response.final
  - Fields: trackId, text
- assistant.tool_call
  - Fields: trackId, tool_call (tool_call.executor is client or server)
- assistant.tool_result
  - Fields: trackId, source, result
- output.audio.start
  - Fields: trackId
- output.audio.end
  - Fields: trackId
- response.interrupted
  - Fields: trackId
- metrics.ttfb
  - Fields: trackId, latencyMs
- error
  - Fields: sender, code, message, trackId
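Because every event carries a `type`, a client can route JSON text frames through a simple lookup table. The sketch below is one possible shape, not part of the protocol: `EventRouter` is a hypothetical class, and treating `assistant.response.delta` texts as incremental chunks to be concatenated is an assumption about delta semantics.

```python
import json
from typing import Callable, Dict


class EventRouter:
    """Dispatch server events to handlers keyed by their "type" field."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def on(self, event_type: str):
        """Decorator that registers a handler for one event type."""
        def register(fn: Callable[[dict], None]):
            self._handlers[event_type] = fn
            return fn
        return register

    def dispatch(self, frame: str) -> bool:
        """Parse a JSON text frame; return True if a handler ran."""
        event = json.loads(frame)
        if not isinstance(event, dict):
            return False
        handler = self._handlers.get(event.get("type"))
        if handler is None:
            return False  # unknown or unhandled events are ignored
        handler(event)
        return True


router = EventRouter()
reply_parts = []


@router.on("assistant.response.delta")
def _on_delta(event: dict) -> None:
    # Assumption: each delta carries the next chunk of the reply text.
    reply_parts.append(event["text"])
```

Ignoring unregistered event types keeps the client tolerant of events it does not yet handle, such as heartbeat.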

Binary Audio Frames

After session.started, client may send binary PCM chunks continuously.

Recommended format:

- 16-bit signed little-endian PCM.
- 1 channel.
- 16000 Hz.
- 20 ms frames (640 bytes) preferred.
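The 640-byte figure follows from the format: 16000 samples per second × 0.02 s × 2 bytes per sample × 1 channel. A minimal sketch of slicing a PCM buffer into frames of that size is shown below; `split_frames` is a hypothetical helper, and carrying the leftover bytes into the next call is one possible way to handle buffers that are not a multiple of the frame size.

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit signed PCM
CHANNELS = 1
FRAME_MS = 20

# 16000 * 20 / 1000 = 320 samples per frame, 2 bytes each, 1 channel.
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS


def split_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES):
    """Split a PCM buffer into full frames plus a leftover tail.

    Returns (frames, remainder). The remainder holds trailing bytes that
    do not fill a whole frame; prepend it to the next captured buffer.
    """
    usable = len(pcm) - len(pcm) % frame_bytes
    frames = [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]
    return frames, pcm[usable:]
```

Each element of `frames` can then be sent as one binary WebSocket frame after session.started.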

Compatibility

This endpoint now enforces the v1 message schema for JSON control frames. Legacy command names (invite, chat, etc.) are no longer part of the public protocol.
