WS v1 Protocol Schema (/ws)
This document defines the public WebSocket protocol for the /ws endpoint.
Transport
- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (pcm_s16le, mono, 16kHz by default).
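A minimal sketch of how a receiver might separate the two frame kinds. It assumes a WebSocket library that delivers text frames as str and binary frames as bytes; classify_frame is a hypothetical helper, not part of the protocol.

```python
import json
from typing import Any, Tuple, Union


def classify_frame(frame: Union[str, bytes, bytearray]) -> Tuple[str, Any]:
    """Split incoming frames into JSON control messages and raw PCM audio."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary frames carry raw PCM audio and are passed through unparsed.
        return "audio", bytes(frame)
    # Text frames carry a JSON object with a "type" field.
    return "control", json.loads(frame)
```

For example, classify_frame('{"type": "hello"}') yields a "control" tuple holding the parsed object, while any bytes payload yields an "audio" tuple.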
Handshake and State Machine
Required message order:
- Client sends hello.
- Server replies hello.ack.
- Client sends session.start.
- Server replies session.started.
- Client may stream binary audio and/or send input.text.
- Client sends session.stop (or closes socket).
If order is violated, server emits error with code = "protocol.order".
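The ordering rule can be sketched as a small state tracker for client JSON messages. This is an illustration only: the state names are invented, the server replies (hello.ack, session.started) are folded into the transitions, binary audio is not tracked, and allowing response.cancel and tool_call.results only in the active state is an assumption the source does not spell out.

```python
from typing import Optional

# Client message types accepted in each state (state names are illustrative).
ALLOWED = {
    "connected": {"hello"},
    "hello_acked": {"session.start"},
    "active": {"input.text", "response.cancel", "tool_call.results", "session.stop"},
    "stopped": set(),
}

# Messages that move the connection to a new state.
NEXT_STATE = {
    "hello": "hello_acked",
    "session.start": "active",
    "session.stop": "stopped",
}


class OrderTracker:
    def __init__(self) -> None:
        self.state = "connected"

    def accept(self, msg_type: str) -> Optional[dict]:
        """Return None if the message is in order, else an error payload."""
        if msg_type not in ALLOWED[self.state]:
            return {
                "type": "error",
                "code": "protocol.order",
                "message": f"{msg_type} not allowed in state {self.state}",
            }
        self.state = NEXT_STATE.get(msg_type, self.state)
        return None
```

A rejected message leaves the state unchanged, so a client that sends session.start too early can still recover by sending hello.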
Client -> Server Messages
hello
{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}
Rules:
- version must be v1.
- If WS_API_KEY is configured on server, auth.apiKey must match.
- If WS_REQUIRE_AUTH=true, either auth.apiKey or auth.jwt must be present.
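These rules could be checked roughly as follows. check_hello and its reason strings are hypothetical; the parameters server_api_key and require_auth stand in for the WS_API_KEY and WS_REQUIRE_AUTH settings, and JWT verification itself is out of scope here.

```python
from typing import Optional


def check_hello(msg: dict, server_api_key: Optional[str],
                require_auth: bool) -> Optional[str]:
    """Return None if the hello is acceptable, else a short reason string."""
    if msg.get("type") != "hello":
        return "expected hello"
    if msg.get("version") != "v1":
        return "unsupported version"
    auth = msg.get("auth") or {}
    api_key, jwt = auth.get("apiKey"), auth.get("jwt")
    # Stand-in for WS_API_KEY: when configured, the client key must match.
    if server_api_key and api_key != server_api_key:
        return "api key mismatch"
    # Stand-in for WS_REQUIRE_AUTH=true: some credential must be present.
    if require_auth and not (api_key or jwt):
        return "missing credentials"
    return None
```

A production server would likely compare keys in constant time (for example with hmac.compare_digest) rather than with a plain inequality.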
session.start
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "openai_compatible",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "enabled": true,
        "provider": "openai_compatible",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}
metadata.services is optional. If omitted, server defaults to environment configuration.
Text-only mode:
- Set metadata.output.mode = "text" OR metadata.services.tts.enabled = false.
- In this mode server still sends assistant.response.delta/final, but will not emit audio frames or output.audio.start/end.
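The text-only condition can be expressed as a small predicate over the session.start payload. is_text_only is a hypothetical helper, and treating missing fields as "not text-only" is an assumption, since the default is not stated here.

```python
def is_text_only(session_start: dict) -> bool:
    """True when the session.start payload requests text-only output."""
    metadata = session_start.get("metadata") or {}
    output_mode = (metadata.get("output") or {}).get("mode")
    tts = (metadata.get("services") or {}).get("tts") or {}
    # Either condition alone is enough to switch off audio output.
    return output_mode == "text" or tts.get("enabled") is False
```

A client can use the same predicate to decide whether to expect binary frames and output.audio.start/end events at all.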
input.text
{
  "type": "input.text",
  "text": "What can you do?"
}
response.cancel
{
  "type": "response.cancel",
  "graceful": false
}
session.stop
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
tool_call.results
Client tool execution results returned to server.
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
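A client might assemble this message with a helper along these lines. build_tool_results is hypothetical, and the default status of 200/"ok" simply mirrors the example above rather than any documented default.

```python
import json
from typing import Any


def build_tool_results(tool_call_id: str, name: str, output: Any,
                       code: int = 200, message: str = "ok") -> str:
    """Serialize a single tool result as a tool_call.results text frame."""
    payload = {
        "type": "tool_call.results",
        "results": [
            {
                "tool_call_id": tool_call_id,
                "name": name,
                "output": output,
                "status": {"code": code, "message": message},
            }
        ],
    }
    return json.dumps(payload)
```

The tool_call_id would presumably be copied from the assistant.tool_call event that requested the client-side execution.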
Server -> Client Events
All server events include:
{
  "type": "event.name",
  "timestamp": 1730000000000
}
Common events:
- hello.ack
  - Fields: sessionId, version
- session.started
  - Fields: sessionId, trackId, audio
- session.stopped
  - Fields: sessionId, reason
- heartbeat
- input.speech_started
  - Fields: trackId, probability
- input.speech_stopped
  - Fields: trackId, probability
- transcript.delta
  - Fields: trackId, text
- transcript.final
  - Fields: trackId, text
- assistant.response.delta
  - Fields: trackId, text
- assistant.response.final
  - Fields: trackId, text
- assistant.tool_call
  - Fields: trackId, tool_call (tool_call.executor is client or server)
- assistant.tool_result
  - Fields: trackId, source, result
- output.audio.start
  - Fields: trackId
- output.audio.end
  - Fields: trackId
- response.interrupted
  - Fields: trackId
- metrics.ttfb
  - Fields: trackId, latencyMs
- error
  - Fields: sender, code, message, trackId
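On the client side, events like these are commonly routed by their type field. The sketch below is one possible shape; EventRouter and its methods are hypothetical names, and it checks only the two fields that all server events are stated to include.

```python
import json
from typing import Callable, Dict


class EventRouter:
    """Dispatch server events to handlers registered per event type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def on(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type] = handler

    def dispatch(self, text_frame: str) -> bool:
        """Return True if a handler ran, False if the type is unhandled."""
        event = json.loads(text_frame)
        # Every server event carries "type" and "timestamp".
        if "type" not in event or "timestamp" not in event:
            raise ValueError("not a valid server event")
        handler = self._handlers.get(event["type"])
        if handler is None:
            return False
        handler(event)
        return True
```

Returning False for unhandled types, rather than raising, lets a client ignore events such as heartbeat without extra code.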
Binary Audio Frames
After session.started, client may send binary PCM chunks continuously.
Recommended format:
- 16-bit signed little-endian PCM.
- 1 channel.
- 16000 Hz.
- 20ms frames (640 bytes) preferred.
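The 640-byte figure follows from the format: 16000 samples/s × 0.020 s = 320 samples, times 2 bytes per sample, times 1 channel. A small sketch of that arithmetic and of slicing a PCM buffer into frames follows; frame_bytes and iter_frames are hypothetical helper names.

```python
from typing import Iterator


def frame_bytes(sample_rate_hz: int = 16000, channels: int = 1,
                bytes_per_sample: int = 2, frame_ms: int = 20) -> int:
    """Size in bytes of one PCM frame of the given duration."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * channels * bytes_per_sample


def iter_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive chunks of `size` bytes; the last may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

With the defaults, frame_bytes() evaluates to 640, and each chunk from iter_frames can be sent as one binary WebSocket frame.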
Compatibility
This endpoint now enforces v1 message schema for JSON control frames.
Legacy command names (invite, chat, etc.) are no longer part of the public protocol.