WS v1 Protocol Schema (/ws)
This document defines the public WebSocket protocol for the /ws endpoint.
Validation policy:
- WS v1 JSON control messages are validated strictly.
- Unknown top-level fields are rejected for all defined client message types.
- hello.version is fixed to "v1".
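The strict-validation policy above can be sketched as a small check. This is a minimal illustration, not the server's actual validator: the ALLOWED_FIELDS table and the validate_client_message helper are hypothetical names, and the allowed field sets are inferred from the JSON examples in this document.

```python
# Hypothetical allowlist of top-level fields per client message type,
# inferred from the JSON examples in this document (an assumption).
ALLOWED_FIELDS = {
    "hello": {"type", "version", "auth"},
    "session.start": {"type", "audio", "metadata"},
    "input.text": {"type", "text"},
    "response.cancel": {"type", "graceful"},
    "session.stop": {"type", "reason"},
    "tool_call.results": {"type", "results"},
}


def validate_client_message(msg: dict) -> list:
    """Return a list of validation problems; an empty list means the message passes."""
    msg_type = msg.get("type")
    if msg_type not in ALLOWED_FIELDS:
        return [f"unknown message type: {msg_type!r}"]
    problems = []
    # Reject any top-level field that is not defined for this message type.
    unknown = sorted(set(msg) - ALLOWED_FIELDS[msg_type])
    if unknown:
        problems.append(f"unknown top-level fields: {unknown}")
    # hello.version is fixed to "v1".
    if msg_type == "hello" and msg.get("version") != "v1":
        problems.append('hello.version must be "v1"')
    return problems
```

A message with an extra top-level key, such as {"type": "input.text", "text": "hi", "extra": 1}, would be reported as invalid by this sketch.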
Transport
- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (pcm_s16le, mono, 16 kHz by default).
Handshake and State Machine
Required message order:
1. Client sends hello.
2. Server replies hello.ack.
3. Client sends session.start.
4. Server replies session.started.
5. Client may stream binary audio and/or send input.text.
6. Client sends session.stop (or closes socket).
If the order is violated, the server emits an error event with code = "protocol.order".
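The required order can be modeled as a small state machine over client message types. The sketch below is illustrative only: the State names, the TRANSITIONS table, and the advance helper are hypothetical, and it assumes that response.cancel and tool_call.results are only valid once a session is active, which this document does not state explicitly.

```python
from enum import Enum


class State(Enum):
    CONNECTED = "connected"    # socket open, waiting for hello
    HELLO_DONE = "hello_done"  # hello accepted (server replies hello.ack)
    ACTIVE = "active"          # session.start accepted (server replies session.started)
    STOPPED = "stopped"        # session.stop received


# Client message types accepted in each state, mapped to the next state.
TRANSITIONS = {
    State.CONNECTED: {"hello": State.HELLO_DONE},
    State.HELLO_DONE: {"session.start": State.ACTIVE},
    State.ACTIVE: {
        "input.text": State.ACTIVE,
        "response.cancel": State.ACTIVE,    # assumption: only valid in a session
        "tool_call.results": State.ACTIVE,  # assumption: only valid in a session
        "session.stop": State.STOPPED,
    },
    State.STOPPED: {},
}


def advance(state: State, msg_type: str):
    """Return (next_state, error_code); error_code is None when the order is valid."""
    next_state = TRANSITIONS[state].get(msg_type)
    if next_state is None:
        # Out-of-order message: keep the current state and report the error code.
        return state, "protocol.order"
    return next_state, None
```

For example, sending session.start before hello leaves the connection in State.CONNECTED and yields "protocol.order".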
Client -> Server Messages
hello
{
"type": "hello",
"version": "v1",
"auth": {
"apiKey": "optional-api-key",
"jwt": "optional-jwt"
}
}
Rules:
- version must be v1.
- If WS_API_KEY is configured on server, auth.apiKey must match.
- If WS_REQUIRE_AUTH=true, either auth.apiKey or auth.jwt must be present.
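The two auth rules can be expressed as one check. This is a minimal sketch under stated assumptions: check_hello_auth and its parameter names are hypothetical, the function only tests for the presence of a JWT (actual token verification is out of scope here), and the constant-time comparison is a general good practice rather than something this document prescribes.

```python
import hmac


def check_hello_auth(auth, ws_api_key=None, ws_require_auth=False) -> bool:
    """Apply the hello auth rules; `auth` is the optional "auth" object from hello."""
    auth = auth or {}
    api_key = auth.get("apiKey")
    jwt = auth.get("jwt")

    # Rule 1: if WS_API_KEY is configured, auth.apiKey must match it.
    if ws_api_key is not None:
        if not isinstance(api_key, str):
            return False
        # Constant-time comparison to avoid leaking key contents via timing.
        if not hmac.compare_digest(api_key.encode(), ws_api_key.encode()):
            return False

    # Rule 2: if WS_REQUIRE_AUTH=true, either apiKey or jwt must be present.
    if ws_require_auth and not (api_key or jwt):
        return False

    return True
```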
session.start
{
"type": "session.start",
"audio": {
"encoding": "pcm_s16le",
"sample_rate_hz": 16000,
"channels": 1
},
"metadata": {
"appId": "assistant_123",
"channel": "web",
"configVersionId": "cfg_20260217_01",
"client": "web-debug",
"output": {
"mode": "audio"
},
"systemPrompt": "You are concise.",
"greeting": "Hi, how can I help?"
}
}
Rules:
- Client-side metadata.services is ignored.
- Service config (including secrets) is resolved server-side (env/backend).
- Client should pass stable IDs (appId, channel, configVersionId) plus small runtime overrides (e.g. output, bargeIn, greeting/prompt style hints).
Text-only mode:
- Set metadata.output.mode = "text".
- In this mode server still sends assistant.response.delta/final, but will not emit audio frames or output.audio.start/end.
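As an illustration, a client might build session.start with a switch for text-only mode. The build_session_start helper is hypothetical; it keeps the audio block in both modes because this document does not say whether that block may be omitted in text-only mode.

```python
import json


def build_session_start(app_id, channel, config_version_id, text_only=False) -> dict:
    """Build a session.start message; text_only selects metadata.output.mode = "text"."""
    return {
        "type": "session.start",
        # MVP fixed audio format from this document.
        "audio": {"encoding": "pcm_s16le", "sample_rate_hz": 16000, "channels": 1},
        "metadata": {
            "appId": app_id,
            "channel": channel,
            "configVersionId": config_version_id,
            "output": {"mode": "text" if text_only else "audio"},
        },
    }


if __name__ == "__main__":
    # Serialize as a JSON text frame payload.
    print(json.dumps(build_session_start("assistant_123", "web", "cfg_20260217_01", text_only=True)))
```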
input.text
{
"type": "input.text",
"text": "What can you do?"
}
response.cancel
{
"type": "response.cancel",
"graceful": false
}
session.stop
{
"type": "session.stop",
"reason": "client_disconnect"
}
tool_call.results
Client tool execution results returned to server.
Only needed when assistant.tool_call.executor == "client" (default execution is server-side).
{
"type": "tool_call.results",
"results": [
{
"tool_call_id": "call_abc123",
"name": "weather",
"output": { "temp_c": 21, "condition": "sunny" },
"status": { "code": 200, "message": "ok" }
}
]
}
Server -> Client Events
All server events include an envelope:
{
"type": "event.name",
"timestamp": 1730000000000,
"sessionId": "sess_xxx",
"seq": 42,
"source": "asr",
"trackId": "audio_in",
"data": {}
}
Envelope notes:
- seq is monotonically increasing within one session (for replay/resume).
- source is one of: asr | llm | tts | tool | system | client | server.
  - For assistant.tool_result, source may be client or server to indicate execution side.
- data is structured payload; legacy top-level fields are kept for compatibility.
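Because seq increases monotonically within a session, a client can use it to drop duplicate or out-of-order events, for example after a replay. The SeqTracker class below is a hypothetical client-side sketch; it only assumes that seq is strictly increasing per sessionId, not that values are consecutive.

```python
class SeqTracker:
    """Track the highest seq seen per session and reject stale events."""

    def __init__(self):
        self._last_seq = {}  # sessionId -> highest seq accepted so far

    def accept(self, event: dict) -> bool:
        """Return True if the event is newer than anything seen for its session."""
        session_id = event["sessionId"]
        seq = event["seq"]
        last = self._last_seq.get(session_id)
        if last is not None and seq <= last:
            return False  # duplicate or out-of-order event
        self._last_seq[session_id] = seq
        return True
```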
Common events:
- hello.ack
  - Fields: sessionId, version
- session.started
  - Fields: sessionId, trackId, tracks, audio
- config.resolved
  - Fields: sessionId, trackId, config
  - Sent immediately after session.started.
  - Contains effective model/voice/output/tool allowlist/prompt hash, and never includes secrets.
- session.stopped
  - Fields: sessionId, reason
- heartbeat
- input.speech_started
  - Fields: trackId, probability
- input.speech_stopped
  - Fields: trackId, probability
- transcript.delta
  - Fields: trackId, text
- transcript.final
  - Fields: trackId, text
- assistant.response.delta
  - Fields: trackId, text
- assistant.response.final
  - Fields: trackId, text
- assistant.tool_call
  - Fields: trackId, tool_call, tool_call_id, tool_name, arguments, executor, timeout_ms
- assistant.tool_result
  - Fields: trackId, source, result, tool_call_id, tool_name, ok, error
  - error: { code, message, retryable } when ok=false
- output.audio.start
  - Fields: trackId
- output.audio.end
  - Fields: trackId
- response.interrupted
  - Fields: trackId
- metrics.ttfb
  - Fields: trackId, latencyMs
- error
  - Fields: sender, code, message, trackId
  - trackId convention:
    - audio_in for stage in {audio, asr}
    - audio_out for stage in {llm, tts, tool}
    - control otherwise (including protocol/auth errors)
Track IDs (MVP fixed values):
- audio_in: ASR/VAD input-side events (input.*, transcript.*)
- audio_out: assistant output-side events (assistant.*, output.audio.*, response.interrupted, metrics.ttfb)
- control: session/control events (session.*, hello.*, error, config.resolved)
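The track assignment above, together with the stage-based convention for error events, can be sketched as two lookups. The function names are hypothetical, and routing unlisted event types (such as heartbeat) to control is an assumption.

```python
def track_for_event(event_type: str) -> str:
    """Map a non-error event type to its fixed MVP track ID."""
    if event_type.startswith(("input.", "transcript.")):
        return "audio_in"
    if event_type.startswith(("assistant.", "output.audio.")):
        return "audio_out"
    if event_type in {"response.interrupted", "metrics.ttfb"}:
        return "audio_out"
    # session.*, hello.*, config.resolved, and anything unlisted (assumption).
    return "control"


def track_for_error_stage(stage: str) -> str:
    """Map an error's stage to the trackId used on the error event."""
    if stage in {"audio", "asr"}:
        return "audio_in"
    if stage in {"llm", "tts", "tool"}:
        return "audio_out"
    return "control"  # protocol/auth errors and anything else
```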
Correlation IDs (event.data):
- turn_id: one user-assistant interaction turn.
- utterance_id: one ASR final utterance.
- response_id: one assistant response generation.
- tool_call_id: one tool invocation.
- tts_id: one TTS playback segment.
Binary Audio Frames
After session.started, client may send binary PCM chunks continuously.
MVP fixed format:
- 16-bit signed little-endian PCM (pcm_s16le)
- mono (1 channel)
- 16000 Hz
- 20ms frame = 640 bytes
Framing rules:
- Binary audio frame unit is 640 bytes.
- A WS binary message may carry one or multiple complete 640-byte frames.
- Non-640-multiple payloads are rejected as audio.frame_size_mismatch; that WS message is dropped (no partial buffering/reassembly).
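The 640-byte figure follows from the fixed format: 16000 samples/s × 0.020 s × 2 bytes per sample × 1 channel = 640 bytes. A minimal sketch of the framing check is shown below; split_frames is a hypothetical helper, and treating an empty payload as zero frames is an assumption, since this document does not cover that case.

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit signed PCM
CHANNELS = 1
FRAME_MS = 20

# 16000 * 20 / 1000 = 320 samples per frame; 320 * 2 bytes * 1 channel = 640 bytes.
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE * CHANNELS


def split_frames(payload: bytes) -> list:
    """Split one WS binary message into complete 640-byte frames.

    Raises ValueError("audio.frame_size_mismatch") when the payload length is
    not a multiple of FRAME_BYTES; the whole message is then dropped by the
    caller, with no partial buffering or reassembly.
    """
    if len(payload) % FRAME_BYTES != 0:
        raise ValueError("audio.frame_size_mismatch")
    return [payload[i:i + FRAME_BYTES] for i in range(0, len(payload), FRAME_BYTES)]
```

A 1280-byte message yields two frames, while a 700-byte message is rejected as a whole.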
TTS boundary events:
- output.audio.start and output.audio.end mark assistant playback boundaries.
Event Throttling
To keep client rendering and server load stable, v1 applies/recommends:
- transcript.delta: merge to ~200-500ms cadence (server default: 300ms).
- assistant.response.delta: merge to ~50-100ms cadence (server default: 80ms).
- Metrics streams (if enabled beyond metrics.ttfb): emit every ~500-1000ms.
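One way to merge deltas to a fixed cadence is a time-based coalescer. The DeltaCoalescer class below is a hypothetical sketch, not the server's implementation; it assumes deltas are append-only text fragments that can be concatenated, and it takes timestamps as arguments so the logic stays deterministic.

```python
class DeltaCoalescer:
    """Buffer text deltas and release them merged once per interval."""

    def __init__(self, interval_ms: int):
        self.interval_ms = interval_ms
        self._buffer = []
        self._last_flush_ms = None

    def push(self, text: str, now_ms: int):
        """Buffer a delta; return merged text if the interval elapsed, else None."""
        if self._last_flush_ms is None:
            # Start the first window at the first delta.
            self._last_flush_ms = now_ms
        self._buffer.append(text)
        if now_ms - self._last_flush_ms >= self.interval_ms:
            return self.flush(now_ms)
        return None

    def flush(self, now_ms: int):
        """Release whatever is buffered (e.g. before a final event); None if empty."""
        merged = "".join(self._buffer)
        self._buffer.clear()
        self._last_flush_ms = now_ms
        return merged or None
```

With interval_ms=80 (the stated server default for assistant.response.delta), fragments arriving at 0 ms, 50 ms, and 80 ms would be emitted as a single merged delta at 80 ms.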
Error Structure
error keeps legacy top-level fields (code, message) and adds structured info:
- stage: protocol | asr | llm | tts | tool | audio
- retryable: boolean
- data.error: { stage, code, message, retryable }
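On the client side, an error event can be read by preferring data.error and falling back to the legacy top-level fields. The parse_error helper is hypothetical, and defaulting retryable to False when it is absent is an assumption.

```python
def parse_error(event: dict) -> dict:
    """Normalize an error event into { stage, code, message, retryable }."""
    structured = (event.get("data") or {}).get("error") or {}
    return {
        # Prefer the structured payload; fall back to top-level fields.
        "stage": structured.get("stage", event.get("stage")),
        "code": structured.get("code", event.get("code")),
        "message": structured.get("message", event.get("message")),
        # Assumption: treat a missing retryable flag as not retryable.
        "retryable": bool(structured.get("retryable", event.get("retryable", False))),
    }
```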
Compatibility
This endpoint now enforces v1 message schema for JSON control frames.
Legacy command names (invite, chat, etc.) are no longer part of the public protocol.