WS v1 Protocol Schema (/ws)
This document defines the public WebSocket protocol for the /ws endpoint.
Validation policy:
- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- The assistant_id query parameter is required on /ws.
Transport
- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (pcm_s16le, mono, 16kHz by default).
Handshake and State Machine
Required message order:
- Client connects to /ws?assistant_id=<id>.
- Client sends session.start.
- Server replies session.started.
- Client may stream binary audio and/or send input.text.
- Client sends session.stop (or closes socket).
If order is violated, server emits error with code = "protocol.order".
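The required order can be mirrored on the client with a small state tracker, so that out-of-order sends fail locally before the server answers with protocol.order. This is only a sketch: the state names, the OrderTracker class, and the "audio" label for binary frames are hypothetical, and treating response.cancel and tool_call.results as valid only after session.started is an assumption.

```python
from enum import Enum


class State(Enum):
    CONNECTED = "connected"  # socket open, session.start not sent yet
    STARTING = "starting"    # session.start sent, waiting for session.started
    ACTIVE = "active"        # session.started received
    STOPPED = "stopped"      # session.stop sent


# Message kinds assumed to be valid only in the ACTIVE state.
# "audio" is a local label for binary frames, not a protocol message type.
ACTIVE_ONLY = {"input.text", "response.cancel", "tool_call.results", "audio"}


class OrderTracker:
    """Client-side guard that mirrors the required message order."""

    def __init__(self):
        self.state = State.CONNECTED

    def on_send(self, msg_type):
        if msg_type == "session.start" and self.state is State.CONNECTED:
            self.state = State.STARTING
        elif msg_type in ACTIVE_ONLY and self.state is State.ACTIVE:
            pass  # allowed, no state change
        elif msg_type == "session.stop" and self.state is State.ACTIVE:
            self.state = State.STOPPED
        else:
            raise RuntimeError(
                f"protocol.order: {msg_type} not allowed in state {self.state.value}"
            )

    def on_receive(self, event_type):
        if event_type == "session.started" and self.state is State.STARTING:
            self.state = State.ACTIVE
```

Calling on_send before every outgoing frame and on_receive for every server event keeps the tracker in step with the connection.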
Client -> Server Messages
session.start
{
"type": "session.start",
"audio": {
"encoding": "pcm_s16le",
"sample_rate_hz": 16000,
"channels": 1
},
"metadata": {
"channel": "web",
"source": "web-debug",
"history": {
"userId": 1
},
"overrides": {
"output": {
"mode": "audio"
},
"systemPrompt": "You are concise.",
"greeting": "Hi, how can I help?"
},
"dynamicVariables": {
"customer_name": "Alice",
"plan_tier": "Pro"
}
}
}
Rules:
- Assistant config is resolved strictly by URL query assistant_id.
- metadata top-level keys allowed: overrides, dynamicVariables, channel, source, history, workflow (workflow is ignored).
- metadata.overrides whitelist: systemPrompt, greeting, firstTurnMode, generatedOpenerEnabled, output, bargeIn, knowledgeBaseId, knowledge, tools, openerAudio.
- metadata.services is rejected with protocol.invalid_override.
- metadata.workflow is ignored in this MVP protocol version.
- Top-level IDs are forbidden in payload (assistantId, appId, app_id, configVersionId, config_version_id).
- Secret-like keys are forbidden in metadata (apiKey, token, secret, password, authorization).
- metadata.dynamicVariables is optional and must be an object of string key/value pairs.
  - Key pattern: ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$
  - Max entries: 30
  - Max value length: 1000 chars
- Placeholder format in systemPrompt and greeting: {{variable_name}}.
  - Built-in system variables (always available): {{system__time}}, {{system_utc}}, {{system_timezone}}.
    - system__time: current local time (YYYY-MM-DD HH:mm:ss)
    - system_utc: current UTC time (YYYY-MM-DD HH:mm:ss)
    - system_timezone: current local timezone
  - Missing referenced placeholders reject session.start with protocol.dynamic_variables_missing.
  - Invalid dynamicVariables payload rejects session.start with protocol.dynamic_variables_invalid.
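A client can pre-check dynamicVariables against these limits before sending session.start. The sketch below is an illustration, not the server's implementation: check_dynamic_variables is a hypothetical helper, and it assumes placeholders are written exactly as {{name}} with no inner whitespace.

```python
import re

KEY_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]{0,63}")
# Assumption: placeholders contain no whitespace inside the braces.
PLACEHOLDER_RE = re.compile(r"\{\{([a-zA-Z_][a-zA-Z0-9_]*)\}\}")
BUILTINS = {"system__time", "system_utc", "system_timezone"}
MAX_ENTRIES = 30
MAX_VALUE_LEN = 1000

INVALID = "protocol.dynamic_variables_invalid"
MISSING = "protocol.dynamic_variables_missing"


def check_dynamic_variables(variables, templates):
    """Return None if the payload passes, else the matching error code.

    variables: the metadata.dynamicVariables object.
    templates: the systemPrompt and greeting strings to scan.
    """
    if not isinstance(variables, dict) or len(variables) > MAX_ENTRIES:
        return INVALID
    for key, value in variables.items():
        if not isinstance(key, str) or not KEY_RE.fullmatch(key):
            return INVALID
        if not isinstance(value, str) or len(value) > MAX_VALUE_LEN:
            return INVALID
    # Every referenced placeholder must be supplied or be a built-in.
    for template in templates:
        for name in PLACEHOLDER_RE.findall(template):
            if name not in variables and name not in BUILTINS:
                return MISSING
    return None
```

For example, check_dynamic_variables({"customer_name": "Alice"}, ["Hi {{customer_name}}"]) passes, while an empty variables object with the same template yields protocol.dynamic_variables_missing.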
Text-only mode:
- Set metadata.overrides.output.mode = "text".
- In this mode server still sends assistant.response.delta/final, but will not emit audio frames or output.audio.start/end.
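A text-only session.start payload could be built as follows. build_text_only_start is a hypothetical helper; it checks extra overrides against the whitelist from the rules above, and it keeps the audio block from the sample message on the assumption that the field is still expected in text-only mode.

```python
ALLOWED_OVERRIDE_KEYS = {
    "systemPrompt", "greeting", "firstTurnMode", "generatedOpenerEnabled",
    "output", "bargeIn", "knowledgeBaseId", "knowledge", "tools", "openerAudio",
}


def build_text_only_start(overrides=None, channel="web"):
    """Return a session.start message with output.mode forced to "text"."""
    merged = dict(overrides or {})
    merged["output"] = {"mode": "text"}
    unknown = set(merged) - ALLOWED_OVERRIDE_KEYS
    if unknown:
        # The server would reject these; fail early on the client instead.
        raise ValueError(f"overrides not in whitelist: {sorted(unknown)}")
    return {
        "type": "session.start",
        # Assumption: the audio block is sent even when no audio is produced.
        "audio": {"encoding": "pcm_s16le", "sample_rate_hz": 16000, "channels": 1},
        "metadata": {"channel": channel, "overrides": merged},
    }
```

Serializing the result with json.dumps gives the text frame to send right after the socket opens.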
input.text
{
"type": "input.text",
"text": "What can you do?"
}
response.cancel
{
"type": "response.cancel",
"graceful": false
}
session.stop
{
"type": "session.stop",
"reason": "client_disconnect"
}
tool_call.results
Client tool execution results returned to server.
Only needed when assistant.tool_call.executor == "client" (default execution is server-side).
{
"type": "tool_call.results",
"results": [
{
"tool_call_id": "call_abc123",
"name": "weather",
"output": { "temp_c": 21, "condition": "sunny" },
"status": { "code": 200, "message": "ok" }
}
]
}
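One way a client might turn an assistant.tool_call event into a tool_call.results message is sketched below. build_tool_results and the handlers mapping are hypothetical, the event's arguments field is assumed to be an already-parsed object, and the 500 status used for failures is an assumption (only the 200/"ok" pair appears in the sample above).

```python
def build_tool_results(call, handlers):
    """Run a client-side tool and wrap its output as tool_call.results.

    call: fields of an assistant.tool_call event (tool_call_id, tool_name,
          arguments, executor).
    handlers: maps a tool name to a callable that takes the arguments object.
    Returns None when the tool is executed server-side.
    """
    if call.get("executor") != "client":
        return None  # default execution is server-side: nothing to send
    name = call["tool_name"]
    try:
        output = handlers[name](call.get("arguments") or {})
        status = {"code": 200, "message": "ok"}
    except Exception as exc:  # unknown tool or handler failure
        output = {}
        status = {"code": 500, "message": str(exc)}  # assumed failure code
    return {
        "type": "tool_call.results",
        "results": [
            {
                "tool_call_id": call["tool_call_id"],
                "name": name,
                "output": output,
                "status": status,
            }
        ],
    }
```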
Server -> Client Events
All server events include an envelope:
{
"type": "event.name",
"timestamp": 1730000000000,
"sessionId": "sess_xxx",
"seq": 42,
"source": "asr",
"trackId": "audio_in",
"data": {}
}
Envelope notes:
- seq is monotonically increasing within one session (for replay/resume).
- source is one of: asr | llm | tts | tool | system | client | server.
  - For assistant.tool_result, source may be client or server to indicate execution side.
- data is structured payload; legacy top-level fields are kept for compatibility.
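A receiver can use these envelope properties to detect stale or replayed events and to read fields from either location. In this sketch SeqChecker and field are hypothetical names, seq is assumed to be strictly increasing (gaps allowed), and reading data first with a top-level fallback is one possible policy rather than a documented one.

```python
import json


class SeqChecker:
    """Rejects events whose seq does not advance within the session."""

    def __init__(self):
        self.last_seq = None

    def accept(self, raw_text):
        event = json.loads(raw_text)
        seq = event["seq"]
        if self.last_seq is not None and seq <= self.last_seq:
            raise ValueError(f"seq {seq} does not follow {self.last_seq}")
        self.last_seq = seq
        return event


def field(event, name, default=None):
    """Prefer the structured data payload, then the legacy top-level field."""
    data = event.get("data") or {}
    return data.get(name, event.get(name, default))
```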
Common events:
- session.started
  - Fields: sessionId, trackId, tracks, audio
- config.resolved
  - Fields: sessionId, trackId, config
  - Optional debug event. Disabled by default (ws_emit_config_resolved=false).
  - config is SaaS-safe and public-only: channel (if provided), output.mode, tools.enabled, tools.count, tracks.
  - Must not expose internal IDs or runtime internals (assistantId/appId/configVersionId/services/provider/model/baseUrl/systemPrompt).
- session.stopped
  - Fields: sessionId, reason
- heartbeat
- input.speech_started
  - Fields: trackId, probability
- input.speech_stopped
  - Fields: trackId, probability
- transcript.delta
  - Fields: trackId, text
- transcript.final
  - Fields: trackId, text
- assistant.response.delta
  - Fields: trackId, text
- assistant.response.final
  - Fields: trackId, text
- assistant.tool_call
  - Fields: trackId, tool_call, tool_call_id, tool_name, arguments, executor, timeout_ms
- assistant.tool_result
  - Fields: trackId, source, result, tool_call_id, tool_name, ok, error
  - error: { code, message, retryable } when ok=false
- output.audio.start
  - Fields: trackId
- output.audio.end
  - Fields: trackId
- response.interrupted
  - Fields: trackId
- metrics.ttfb
  - Fields: trackId, latencyMs
- error
  - Fields: sender, code, message, trackId
  - trackId convention:
    - audio_in for stage in {audio, asr}
    - audio_out for stage in {llm, tts, tool}
    - control otherwise (including protocol errors)
Track IDs (MVP fixed values):
- audio_in: ASR/VAD input-side events (input.*, transcript.*)
- audio_out: assistant output-side events (assistant.*, output.audio.*, response.interrupted, metrics.ttfb)
- control: session/control events (session.*, error, optional config.resolved)
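The fixed track assignment, together with the stage-based trackId convention listed for error events, can be expressed as two lookup functions. Both function names are hypothetical, and routing heartbeat to control is an assumption, since the list above does not place it on a track.

```python
def track_for_event(event_type):
    """Map a non-error event type to its MVP track ID."""
    if event_type.startswith(("input.", "transcript.")):
        return "audio_in"
    if event_type.startswith(("assistant.", "output.audio.")) or event_type in (
        "response.interrupted",
        "metrics.ttfb",
    ):
        return "audio_out"
    # session.*, config.resolved, and (by assumption) heartbeat.
    return "control"


def track_for_error_stage(stage):
    """Map an error's stage to the track ID the error event carries."""
    if stage in ("audio", "asr"):
        return "audio_in"
    if stage in ("llm", "tts", "tool"):
        return "audio_out"
    return "control"  # includes protocol errors
```

A client that routes events per track would use track_for_error_stage for error events and track_for_event for everything else, or simply trust the trackId field the server sends.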
Correlation IDs (event.data):
- turn_id: one user-assistant interaction turn.
- utterance_id: one ASR final utterance.
- response_id: one assistant response generation.
- tool_call_id: one tool invocation.
- tts_id: one TTS playback segment.
Binary Audio Frames
After session.started, client may send binary PCM chunks continuously.
MVP fixed format:
- 16-bit signed little-endian PCM (pcm_s16le)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes
Framing rules:
- Binary audio frame unit is 640 bytes.
- A WS binary message may carry one or multiple complete 640-byte frames.
- Non-640-multiple payloads are rejected as audio.frame_size_mismatch; that WS message is dropped (no partial buffering/reassembly).
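The 640-byte figure follows from 16000 samples/s × 0.020 s × 2 bytes × 1 channel. The sketch below derives it and shows how a sender might pack captured PCM into valid messages while holding back a trailing partial frame. split_frames and pack_messages are hypothetical helpers, and sending five frames per message is an arbitrary choice.

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit signed PCM
CHANNELS = 1
FRAME_MS = 20
# 16000 * 20 / 1000 = 320 samples per frame, times 2 bytes, times 1 channel.
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS


def split_frames(payload):
    """Receiver side: split one binary message into 640-byte frames."""
    if len(payload) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES] for i in range(0, len(payload), FRAME_BYTES)]


def pack_messages(pcm, frames_per_message=5):
    """Sender side: return (messages, remainder).

    Each message holds only whole frames; the remainder is the trailing
    partial frame the caller should keep and prepend to the next capture.
    """
    step = FRAME_BYTES * frames_per_message
    usable = len(pcm) - len(pcm) % FRAME_BYTES
    messages = [pcm[i:min(i + step, usable)] for i in range(0, usable, step)]
    return messages, pcm[usable:]
```

Because the server does no partial buffering, keeping the remainder on the client is what prevents a single odd-sized capture from dropping a whole message.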
TTS boundary events:
- output.audio.start and output.audio.end mark assistant playback boundaries.
Event Throttling
To keep client rendering and server load stable, v1 applies/recommends:
- transcript.delta: merge to ~200-500ms cadence (server default: 300ms).
- assistant.response.delta: merge to ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond metrics.ttfb): emit every ~500-1000ms.
Error Structure
error keeps legacy top-level fields (code, message) and adds structured info:
- stage: protocol | asr | llm | tts | tool | audio
- retryable: boolean
- data.error: { stage, code, message, retryable }
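Since the same information can appear both in data.error and in the legacy top-level fields, a client may want to normalize error events into one shape. normalize_error is a hypothetical helper; preferring data.error over the top-level fields and defaulting retryable to false are assumptions.

```python
def normalize_error(event):
    """Return {stage, code, message, retryable} from an error event."""
    structured = (event.get("data") or {}).get("error") or {}

    def pick(name, default=None):
        # Structured payload first, then the legacy top-level field.
        return structured.get(name, event.get(name, default))

    return {
        "stage": pick("stage"),
        "code": pick("code"),
        "message": pick("message"),
        "retryable": bool(pick("retryable", False)),
    }
```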
Compatibility
This endpoint now enforces v1 message schema for JSON control frames.
Legacy command names (invite, chat, etc.) are no longer part of the public protocol.