Xin Wang 3643431565 Enhance WebSocket session configuration by introducing an optional config.resolved event, which provides a public snapshot of the session's configuration. Update the API reference documentation to clarify the conditions under which this event is emitted and the details it includes. Modify session management to respect the new setting for emitting configuration details, ensuring sensitive information remains secure. Update tests to validate the new behavior and ensure compliance with the updated configuration schema.

2026-03-01 23:08:44 +08:00

WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the /ws endpoint.

Validation policy:

WS v1 JSON control messages are validated strictly.
Unknown top-level fields are rejected for all defined client message types.
assistant_id query parameter is required on /ws.

Transport

A single WebSocket connection carries:
- JSON text frames for control/events.
- Binary frames for raw PCM audio (pcm_s16le, mono, 16 kHz by default).

Handshake and State Machine

Required message order:

Client connects to /ws?assistant_id=<id>.
Client sends session.start.
Server replies session.started.
Client may stream binary audio and/or send input.text.
Client sends session.stop (or closes socket).

If this order is violated, the server emits an error event with code = "protocol.order".
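
The ordering rule can be pictured as a small state machine. The following is a minimal sketch, not the server's actual implementation: `State` and `SessionOrderGuard` are hypothetical names, and it assumes that a second session.start, or any message other than session.start before the session has started, counts as an order violation.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    CONNECTED = "connected"  # socket open, session.start not yet received
    STARTED = "started"      # session.started has been sent
    STOPPED = "stopped"      # session.stop has been received


class SessionOrderGuard:
    """Hypothetical helper that tracks message order for one /ws connection."""

    def __init__(self) -> None:
        self.state = State.CONNECTED

    def on_client_message(self, msg_type: str) -> Optional[str]:
        """Return None if the message is in order, else the error code."""
        if msg_type == "session.start":
            if self.state is not State.CONNECTED:
                return "protocol.order"
            self.state = State.STARTED
            return None
        if msg_type == "session.stop":
            if self.state is not State.STARTED:
                return "protocol.order"
            self.state = State.STOPPED
            return None
        # Everything else (input.text, binary audio, ...) needs an active session.
        if self.state is not State.STARTED:
            return "protocol.order"
        return None
```

Keeping the guard separate from message parsing means the same check can be applied to both JSON frames and binary audio frames.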

Client -> Server Messages

`session.start`

{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "channel": "web",
    "source": "web-debug",
    "history": {
      "userId": 1
    },
    "overrides": {
      "output": {
        "mode": "audio"
      },
      "systemPrompt": "You are concise.",
      "greeting": "Hi, how can I help?"
    },
    "dynamicVariables": {
      "customer_name": "Alice",
      "plan_tier": "Pro"
    }
  }
}

Rules:

Assistant config is resolved strictly by URL query assistant_id.
metadata top-level keys allowed: overrides, dynamicVariables, channel, source, history, workflow (workflow is ignored).
metadata.overrides whitelist: systemPrompt, greeting, firstTurnMode, generatedOpenerEnabled, output, bargeIn, knowledgeBaseId, knowledge, tools, openerAudio.
metadata.services is rejected with protocol.invalid_override.
metadata.workflow is ignored in this MVP protocol version.
Top-level IDs are forbidden in payload (assistantId, appId, app_id, configVersionId, config_version_id).
Secret-like keys are forbidden in metadata (apiKey, token, secret, password, authorization).
metadata.dynamicVariables is optional and must be an object of string key/value pairs.
- Key pattern: ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$
- Max entries: 30
- Max value length: 1000 chars
Placeholder format in systemPrompt and greeting: {{variable_name}}.
- Built-in system variables (always available): {{system__time}}, {{system_utc}}, {{system_timezone}}.
  - system__time: current local time (YYYY-MM-DD HH:mm:ss)
  - system_utc: current UTC time (YYYY-MM-DD HH:mm:ss)
  - system_timezone: current local timezone
- Missing referenced placeholders reject session.start with protocol.dynamic_variables_missing.
- Invalid dynamicVariables payload rejects session.start with protocol.dynamic_variables_invalid.
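
The dynamicVariables constraints and placeholder substitution can be sketched as follows. This is an illustration under stated assumptions, not the server's code: `ProtocolError`, `validate_dynamic_variables`, and `render` are hypothetical names, built-in system variables are passed in by the caller as `system_vars`, and built-ins are assumed to take precedence when a client variable has the same name.

```python
import re
from typing import Dict, Optional

KEY_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]{0,63}")
PLACEHOLDER_RE = re.compile(r"\{\{([a-zA-Z_][a-zA-Z0-9_]{0,63})\}\}")
MAX_ENTRIES = 30
MAX_VALUE_LEN = 1000


class ProtocolError(Exception):
    """Hypothetical exception carrying a protocol error code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def validate_dynamic_variables(variables: Optional[object]) -> Dict[str, str]:
    """Check shape, entry count, key pattern, and value length."""
    invalid = "protocol.dynamic_variables_invalid"
    if variables is None:  # the field is optional
        return {}
    if not isinstance(variables, dict):
        raise ProtocolError(invalid, "dynamicVariables must be an object")
    if len(variables) > MAX_ENTRIES:
        raise ProtocolError(invalid, "too many entries")
    for key, value in variables.items():
        if not isinstance(key, str) or KEY_RE.fullmatch(key) is None:
            raise ProtocolError(invalid, f"bad key: {key!r}")
        if not isinstance(value, str) or len(value) > MAX_VALUE_LEN:
            raise ProtocolError(invalid, f"bad value for {key!r}")
    return dict(variables)


def render(template: str, variables: Dict[str, str],
           system_vars: Dict[str, str]) -> str:
    """Substitute {{name}} placeholders; fail if any name is unresolved."""
    available = {**variables, **system_vars}  # assumption: built-ins win
    missing = sorted({name for name in PLACEHOLDER_RE.findall(template)
                      if name not in available})
    if missing:
        raise ProtocolError("protocol.dynamic_variables_missing",
                            f"missing: {missing}")
    return PLACEHOLDER_RE.sub(lambda m: available[m.group(1)], template)
```

For example, `render("Hi {{customer_name}}", {"customer_name": "Alice"}, {})` returns `"Hi Alice"`, while a template that references an undefined name raises `ProtocolError` with the `protocol.dynamic_variables_missing` code.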

Text-only mode:

Set metadata.overrides.output.mode = "text".
In this mode the server still sends assistant.response.delta and assistant.response.final, but does not emit audio frames or output.audio.start / output.audio.end.
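
A client might assemble a text-only session.start as sketched below. `build_session_start` is a hypothetical client-side helper, not part of the protocol. It checks override keys against the whitelist before sending, and it assumes the audio block is still included in text-only mode, as in the example payload above.

```python
import json
from typing import Any, Dict, Optional

# Whitelist copied from the metadata.overrides rule.
ALLOWED_OVERRIDES = {
    "systemPrompt", "greeting", "firstTurnMode", "generatedOpenerEnabled",
    "output", "bargeIn", "knowledgeBaseId", "knowledge", "tools", "openerAudio",
}


def build_session_start(overrides: Optional[Dict[str, Any]] = None,
                        dynamic_variables: Optional[Dict[str, str]] = None,
                        channel: Optional[str] = None) -> str:
    """Serialize a session.start frame, rejecting non-whitelisted overrides."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - ALLOWED_OVERRIDES)
    if unknown:
        # Client-side guard only; the server performs its own validation.
        raise ValueError(f"overrides not in whitelist: {unknown}")
    metadata: Dict[str, Any] = {"overrides": overrides}
    if channel is not None:
        metadata["channel"] = channel
    if dynamic_variables:
        metadata["dynamicVariables"] = dict(dynamic_variables)
    return json.dumps({
        "type": "session.start",
        "audio": {"encoding": "pcm_s16le", "sample_rate_hz": 16000, "channels": 1},
        "metadata": metadata,
    })


# Text-only session: the only override is output.mode = "text".
frame = build_session_start({"output": {"mode": "text"}}, channel="web")
```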

`input.text`

{
  "type": "input.text",
  "text": "What can you do?"
}

`response.cancel`

{
  "type": "response.cancel",
  "graceful": false
}

`session.stop`

{
  "type": "session.stop",
  "reason": "client_disconnect"
}

`tool_call.results`

Results of client-side tool execution, returned to the server. Only needed when assistant.tool_call.executor == "client" (default execution is server-side).

{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
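
A client that executes tools locally could turn an assistant.tool_call event into a tool_call.results frame roughly as follows. This is a sketch under assumptions: `handle_tool_call` and the `handlers` registry are hypothetical, event fields are read from `data` with a fallback to legacy top-level fields, `arguments` is assumed to be already-parsed JSON, and the 500 status for failures is an assumption (the document only shows 200 / "ok").

```python
import json
from typing import Any, Callable, Dict, Optional


def handle_tool_call(event: Dict[str, Any],
                     handlers: Dict[str, Callable[[Any], Any]]) -> Optional[str]:
    """Run a client-side tool and return the tool_call.results frame."""
    data = event.get("data") or {}

    def field(name: str) -> Any:
        # Assumption: prefer the structured payload, then legacy top-level.
        return data.get(name, event.get(name))

    if field("executor") != "client":
        return None  # server-side execution: nothing to send back

    name = field("tool_name")
    try:
        output = handlers[name](field("arguments"))
        status = {"code": 200, "message": "ok"}
    except Exception as exc:  # unknown tool or handler failure
        output = None
        status = {"code": 500, "message": str(exc)}  # assumed failure code

    return json.dumps({
        "type": "tool_call.results",
        "results": [{
            "tool_call_id": field("tool_call_id"),
            "name": name,
            "output": output,
            "status": status,
        }],
    })
```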

Server -> Client Events

All server events include an envelope:

{
  "type": "event.name",
  "timestamp": 1730000000000,
  "sessionId": "sess_xxx",
  "seq": 42,
  "source": "asr",
  "trackId": "audio_in",
  "data": {}
}

Envelope notes:

seq is monotonically increasing within one session (for replay/resume).
source is one of: asr | llm | tts | tool | system | client | server.
- For assistant.tool_result, source may be client or server to indicate execution side.
data is structured payload; legacy top-level fields are kept for compatibility.
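
Because seq only increases within a session, a client can use it to drop duplicated or stale events, for example after a reconnect. A minimal sketch, assuming a hypothetical `EventStream` helper on the client side; it does not try to detect gaps, since the document does not say that seq values are contiguous.

```python
import json
from typing import Any, Dict, Optional


class EventStream:
    """Hypothetical client-side filter keyed on the envelope's seq field."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def accept(self, frame: str) -> Optional[Dict[str, Any]]:
        """Parse a JSON text frame; return None for duplicate or stale events."""
        event = json.loads(frame)
        seq = event["seq"]
        if self.last_seq is not None and seq <= self.last_seq:
            return None  # already seen, or older than the latest event
        self.last_seq = seq
        return event
```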

Common events:

session.started
- Fields: sessionId, trackId, tracks, audio
config.resolved
- Fields: sessionId, trackId, config
- Optional debug event. Disabled by default (ws_emit_config_resolved=false).
- config is SaaS-safe and public-only: channel (if provided), output.mode, tools.enabled, tools.count, tracks.
- Must not expose internal IDs or runtime internals (assistantId/appId/configVersionId/services/provider/model/baseUrl/systemPrompt).
session.stopped
- Fields: sessionId, reason
heartbeat
input.speech_started
- Fields: trackId, probability
input.speech_stopped
- Fields: trackId, probability
transcript.delta
- Fields: trackId, text
transcript.final
- Fields: trackId, text
assistant.response.delta
- Fields: trackId, text
assistant.response.final
- Fields: trackId, text
assistant.tool_call
- Fields: trackId, tool_call, tool_call_id, tool_name, arguments, executor, timeout_ms
assistant.tool_result
- Fields: trackId, source, result, tool_call_id, tool_name, ok, error
- error: { code, message, retryable } when ok=false
output.audio.start
- Fields: trackId
output.audio.end
- Fields: trackId
response.interrupted
- Fields: trackId
metrics.ttfb
- Fields: trackId, latencyMs
error
- Fields: sender, code, message, trackId
- trackId convention:
  - audio_in for stage in {audio, asr}
  - audio_out for stage in {llm, tts, tool}
  - control otherwise (including protocol errors)
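
The trackId convention for error events is a direct lookup on the stage. A short sketch, with `track_id_for_stage` as a hypothetical function name:

```python
def track_id_for_stage(stage: str) -> str:
    """Map an error stage to the trackId used on the error event."""
    if stage in {"audio", "asr"}:
        return "audio_in"
    if stage in {"llm", "tts", "tool"}:
        return "audio_out"
    return "control"  # protocol errors and anything else
```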

Track IDs (MVP fixed values):

audio_in: ASR/VAD input-side events (input.*, transcript.*)
audio_out: assistant output-side events (assistant.*, output.audio.*, response.interrupted, metrics.ttfb)
control: session/control events (session.*, error, optional config.resolved)

Correlation IDs (event.data):

turn_id: one user-assistant interaction turn.
utterance_id: one ASR final utterance.
response_id: one assistant response generation.
tool_call_id: one tool invocation.
tts_id: one TTS playback segment.

Binary Audio Frames

After session.started, client may send binary PCM chunks continuously.

MVP fixed format:

16-bit signed little-endian PCM (pcm_s16le)
mono (1 channel)
16000 Hz
20ms frame = 640 bytes

Framing rules:

Binary audio frame unit is 640 bytes.
A WS binary message may carry one or multiple complete 640-byte frames.
Non-640-multiple payloads are rejected as audio.frame_size_mismatch; that WS message is dropped (no partial buffering/reassembly).
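
The 640-byte figure follows from the fixed format: 16000 samples/s × 0.020 s × 2 bytes × 1 channel. The sketch below derives it and shows both sides of the framing rule. `split_frames` and `pack_messages` are hypothetical helper names, and treating an empty payload as zero frames rather than an error is an assumption.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 16000
FRAME_MS = 20
BYTES_PER_SAMPLE = 2  # 16-bit signed PCM
CHANNELS = 1
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS


def split_frames(payload: bytes) -> List[bytes]:
    """Receiver side: split one WS binary message into 640-byte frames."""
    if len(payload) % FRAME_BYTES != 0:
        # The whole message is dropped; nothing is buffered for reassembly.
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES]
            for i in range(0, len(payload), FRAME_BYTES)]


def pack_messages(pcm: bytes, frames_per_message: int = 1) -> Tuple[List[bytes], bytes]:
    """Sender side: cut PCM into messages of whole frames.

    Returns the messages plus leftover bytes, which the caller should
    prepend to the next chunk of captured audio.
    """
    step = FRAME_BYTES * frames_per_message
    usable = len(pcm) - len(pcm) % FRAME_BYTES
    messages = [pcm[i:min(i + step, usable)] for i in range(0, usable, step)]
    return messages, pcm[usable:]
```

Buffering the remainder on the sender side avoids ever producing a payload that the server would reject.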

TTS boundary events:

output.audio.start and output.audio.end mark assistant playback boundaries.

Event Throttling

To keep client rendering and server load stable, v1 applies/recommends:

transcript.delta: merge to ~200-500ms cadence (server default: 300ms).
assistant.response.delta: merge to ~50-100ms cadence (server default: 80ms).
Metrics streams (if enabled beyond metrics.ttfb): emit every ~500-1000ms.
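
Merging deltas to a fixed cadence can be done with a small buffer that flushes once the interval has elapsed. This is a sketch, not the server's implementation: `DeltaMerger` is a hypothetical name, timestamps are supplied by the caller in milliseconds, and deltas are assumed to be incremental text fragments that can be concatenated.

```python
from typing import List, Optional


class DeltaMerger:
    """Coalesce text deltas so that at most one is emitted per interval."""

    def __init__(self, interval_ms: int) -> None:
        self.interval_ms = interval_ms
        self._buffer: List[str] = []
        self._window_start_ms: Optional[int] = None

    def push(self, text: str, now_ms: int) -> Optional[str]:
        """Buffer a delta; return merged text once the interval has elapsed."""
        self._buffer.append(text)
        if self._window_start_ms is None:
            self._window_start_ms = now_ms  # window opens at the first delta
        if now_ms - self._window_start_ms >= self.interval_ms:
            return self.flush(now_ms)
        return None

    def flush(self, now_ms: int) -> Optional[str]:
        """Emit whatever is buffered, e.g. right before a final event."""
        if not self._buffer:
            return None
        merged = "".join(self._buffer)
        self._buffer.clear()
        self._window_start_ms = now_ms
        return merged


# assistant.response.delta at the documented server default of 80 ms.
merger = DeltaMerger(interval_ms=80)
```

A real sender would also call `flush` from a timer or before emitting the final event, so that trailing text is not held back.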

Error Structure

error keeps legacy top-level fields (code, message) and adds structured info:

stage: protocol | asr | llm | tts | tool | audio
retryable: boolean
data.error: { stage, code, message, retryable }
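
An error payload with both the legacy and the structured fields might be assembled as shown below. `build_error_payload` is a hypothetical helper. The sketch assumes stage and retryable appear at the top level as well as inside data.error, which is one possible reading of the list above. Envelope fields such as timestamp, sessionId, seq, and trackId are omitted.

```python
from typing import Any, Dict

STAGES = {"protocol", "asr", "llm", "tts", "tool", "audio"}


def build_error_payload(stage: str, code: str, message: str,
                        retryable: bool = False) -> Dict[str, Any]:
    """Build the error-specific part of an error event."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    detail = {"stage": stage, "code": code,
              "message": message, "retryable": retryable}
    return {
        "type": "error",
        "code": code,            # legacy top-level field
        "message": message,      # legacy top-level field
        "stage": stage,          # assumed top-level placement
        "retryable": retryable,  # assumed top-level placement
        "data": {"error": detail},
    }
```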

Compatibility

This endpoint now enforces v1 message schema for JSON control frames. Legacy command names (invite, chat, etc.) are no longer part of the public protocol.
