Add output.audio.played message handling and update documentation

- Introduced `output.audio.played` message type for client acknowledgment of audio playback completion.
- Updated `DuplexPipeline` to track client playback state and handle playback completion events.
- Enhanced session handling to route `output.audio.played` messages to the pipeline.
- Revised API documentation to include details about the new message type and its fields.
- Updated schema documentation to reflect the addition of `output.audio.played` in the message flow.
Xin Wang
2026-03-04 10:01:34 +08:00
parent 80fff09b76
commit 7d4af18815
8 changed files with 275 additions and 19 deletions

@@ -20,7 +20,7 @@ Required message order:
1. Client connects to `/ws?assistant_id=<id>`.
2. Client sends `session.start`.
3. Server replies `session.started`.
4. Client may stream binary audio and/or send `input.text`.
4. Client may stream binary audio and/or send `input.text`, `response.cancel`, `output.audio.played`, `tool_call.results`.
5. Client sends `session.stop` (or closes socket).
If order is violated, server emits `error` with `code = "protocol.order"`.
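The ordering rule above can be sketched as a small state machine. This is only an illustration, not the server's actual implementation: the `SessionOrderChecker` class, its state names, and the shape of the returned error event are assumptions. Only the message type names and the `protocol.order` code come from the text. Binary audio frames are left out of the sketch because they carry no `type` field.

```python
from typing import Optional

# Message types a client may send between session.start and session.stop,
# taken from step 4 of the required order.
IN_SESSION_TYPES = {
    "input.text",
    "response.cancel",
    "output.audio.played",
    "tool_call.results",
}


class SessionOrderChecker:
    """Hypothetical helper: tracks where a connection is in the required order."""

    def __init__(self) -> None:
        self.state = "connected"  # connected -> started -> stopped

    def check(self, msg_type: str) -> Optional[dict]:
        """Return None if msg_type is allowed now, else an error event."""
        if self.state == "connected" and msg_type == "session.start":
            self.state = "started"
            return None
        if self.state == "started" and msg_type in IN_SESSION_TYPES:
            return None
        if self.state == "started" and msg_type == "session.stop":
            self.state = "stopped"
            return None
        # Anything else violates the order (error event shape is assumed).
        return {"type": "error", "code": "protocol.order"}
```

In this sketch a repeated `session.start` is also treated as an order violation, which the text does not state explicitly.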
@@ -100,6 +100,22 @@ Text-only mode:
}
```
### `output.audio.played`
Client playback ACK after assistant audio is actually drained on local speakers
(including jitter buffer / playback queue).
```json
{
"type": "output.audio.played",
"tts_id": "tts_001",
"response_id": "resp_001",
"turn_id": "turn_001",
"played_at_ms": 1730000018450,
"played_ms": 2520
}
```
### `session.stop`
```json
@@ -223,6 +239,8 @@ Framing rules:
TTS boundary events:
- `output.audio.start` and `output.audio.end` mark assistant playback boundaries.
- `output.audio.end` means server-side audio send completed (not guaranteed speaker drain).
- For speaker-drain confirmation, client should send `output.audio.played`.
## Event Throttling

@@ -46,6 +46,7 @@
- Binary audio
- `input.text` (optional)
- `response.cancel` (optional)
- `output.audio.played` (optional)
- `tool_call.results` (optional)
6. The client sends `session.stop` or simply closes the connection.
@@ -190,7 +191,35 @@
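| `type` | string | Yes | - | Fixed to `"response.cancel"` | Requests interruption of the current response |
| `graceful` | boolean | No | `false` | Cancellation mode | `false` interrupts immediately; `true` is currently used mainly for logging and does not force an interruption |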
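## 3.5 `tool_call.results`
## 3.5 `output.audio.played`
Acknowledgment sent by the client once the audio has actually finished playing on the local speakers (including the jitter buffer / playback queue).
Example: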
```json
{
"type": "output.audio.played",
"tts_id": "tts_001",
"response_id": "resp_001",
"turn_id": "turn_001",
"played_at_ms": 1730000018450,
"played_ms": 2520
}
```
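Field descriptions:
| Field | Type | Required | Constraint | Meaning | Usage notes |
|---|---|---|---|---|---|
| `type` | string | Yes | Fixed to `"output.audio.played"` | Playback-complete acknowledgment | Sent by the client after playback finishes |
| `tts_id` | string | Yes | Non-empty string | TTS segment ID | Should be the same `tts_id` as in `output.audio.start/end` |
| `response_id` | string \| null | No | Any string | Response ID | Recommended to echo back, to ease aggregation |
| `turn_id` | string \| null | No | Any string | Turn ID | Recommended to echo back, to ease aggregation |
| `played_at_ms` | number \| null | No | Millisecond timestamp | Time at which client playback completed | Used for latency analysis |
| `played_ms` | number \| null | No | Non-negative number | Client playback duration | Used for player statistics |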
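The constraints in the field table can be checked with a few lines of code. The following is a minimal sketch, not the server's actual validation: the function name `validate_played` and the wording of the problem strings are hypothetical, and treating an absent optional field the same as `null` is an assumption.

```python
from typing import List


def validate_played(msg: dict) -> List[str]:
    """Return a list of problems; an empty list means the message passes."""
    problems = []
    if msg.get("type") != "output.audio.played":
        problems.append("type must be 'output.audio.played'")
    tts_id = msg.get("tts_id")
    if not isinstance(tts_id, str) or not tts_id:
        problems.append("tts_id must be a non-empty string")
    for key in ("response_id", "turn_id"):
        value = msg.get(key)
        if value is not None and not isinstance(value, str):
            problems.append(f"{key} must be a string or null")
    for key in ("played_at_ms", "played_ms"):
        value = msg.get(key)
        if value is None:
            continue  # absent or null is accepted (assumption)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key} must be a number or null")
        elif key == "played_ms" and value < 0:
            problems.append("played_ms must be non-negative")
    return problems
```

Passing the example message from this section through `validate_played` yields an empty list.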
## 3.6 `tool_call.results`
Used only when tools are executed on the client side (`assistant.tool_call.executor == "client"`).
@@ -228,7 +257,7 @@
- Duplicate results are ignored;
- If no result arrives before the timeout, the server synthesizes a timeout result (`504`).
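The two rules above (ignore duplicates, synthesize a `504` on timeout) can be illustrated with a small tracker. This is a sketch under stated assumptions rather than the server's code: the `ToolCallWaiter` class, its method names, and the shape of the result dictionaries are hypothetical, and the sketch also assumes that a result arriving after the timeout is treated as a duplicate.

```python
from typing import Dict


class ToolCallWaiter:
    """Hypothetical tracker for results of client-executed tool calls."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}

    def on_result(self, tool_call_id: str, result: dict) -> bool:
        """Store the first result for an ID; return False for duplicates."""
        if tool_call_id in self._results:
            return False  # duplicate: ignored
        self._results[tool_call_id] = result
        return True

    def on_timeout(self, tool_call_id: str) -> dict:
        """Return the stored result, or synthesize a 504 if none arrived."""
        if tool_call_id not in self._results:
            # Result shape is assumed; only the 504 code comes from the text.
            self._results[tool_call_id] = {"status": 504}
        return self._results[tool_call_id]
```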
## 3.6 `session.stop`
## 3.7 `session.stop`
Example:
@@ -406,7 +435,7 @@
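- Meaning: start boundary of TTS audio output
6. `output.audio.end`
- Meaning: end boundary of TTS audio output
- Meaning: end boundary of TTS audio output (the server has finished sending; this does not mean the speakers have finished playing)
7. `response.interrupted`
- Meaning: the current response was interrupted (barge-in or cancel)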
@@ -434,6 +463,7 @@
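- Audio is sent as binary PCM frames;
- Sends are aligned to units of `640 bytes` (a short chunk is zero-padded before sending);
- The frontend typically uses `output.audio.start/end` to control playback boundaries;
- If "the speakers have actually finished playing" semantics are required, the frontend should send `output.audio.played` once playback completes;
- After receiving `response.interrupted`, discard any old audio still in the queue that has not finished playing.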
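The 640-byte alignment rule amounts to padding a chunk with zero bytes up to the next multiple of the send unit. A minimal sketch, assuming a hypothetical helper named `pad_to_frame` (the real server code is not shown in this diff):

```python
FRAME_BYTES = 640  # send unit named in the text


def pad_to_frame(pcm: bytes, frame_bytes: int = FRAME_BYTES) -> bytes:
    """Zero-pad pcm so that its length is a multiple of frame_bytes."""
    remainder = len(pcm) % frame_bytes
    if remainder == 0:
        return pcm  # already aligned (including the empty chunk)
    return pcm + b"\x00" * (frame_bytes - remainder)
```

For 16-bit PCM a zero byte pair is digital silence, so the padding adds a short silent tail rather than audible noise.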
---
@@ -502,8 +532,9 @@
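2. Voice input must be strictly 16k/16bit/mono, and the length of every WS binary message must be `640*n`.
3. The UI layer should treat `assistant.response.delta` as the streaming display and `assistant.response.final` as the converged result.
4. The player uses `output.audio.start/end` to manage the lifecycle of one playback round.
5. In tool-calling scenarios, if `executor=client`, be sure to return `tool_call.results` keyed by `tool_call_id`.
6. When an `error` occurs, branch on `code` first rather than looking only at `message`.
5. If your business logic depends on "the speakers have actually finished playing", send `output.audio.played` when playback completes.
6. In tool-calling scenarios, if `executor=client`, be sure to return `tool_call.results` keyed by `tool_call_id`.
7. When an `error` occurs, branch on `code` first rather than looking only at `message`.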
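A client-side sketch of the playback acknowledgment recommended in the list above. The `PlaybackAckTracker` class and its method names are hypothetical. It assumes the `output.audio.start` event carries `tts_id` (and possibly `response_id` and `turn_id`), that `played_ms` is measured from receipt of that event, and that no acknowledgment is sent for audio discarded after `response.interrupted`. None of these details is confirmed by this diff.

```python
import time
from typing import Callable, Optional


class PlaybackAckTracker:
    """Hypothetical client helper: sends output.audio.played after local drain."""

    def __init__(
        self,
        send: Callable[[dict], None],
        now_ms: Callable[[], int] = lambda: int(time.time() * 1000),
    ) -> None:
        self._send = send
        self._now_ms = now_ms
        self._current: Optional[dict] = None

    def on_audio_start(self, event: dict) -> None:
        # Remember the IDs so they can be echoed back in the acknowledgment.
        self._current = {
            "tts_id": event["tts_id"],
            "response_id": event.get("response_id"),
            "turn_id": event.get("turn_id"),
            "started_ms": self._now_ms(),
        }

    def on_interrupted(self) -> None:
        # Queued audio is discarded, so nothing is acknowledged (assumption).
        self._current = None

    def on_speaker_drained(self) -> None:
        # Call this from the audio backend once its queue is truly empty.
        if self._current is None:
            return
        now = self._now_ms()
        self._send({
            "type": "output.audio.played",
            "tts_id": self._current["tts_id"],
            "response_id": self._current["response_id"],
            "turn_id": self._current["turn_id"],
            "played_at_ms": now,
            "played_ms": now - self._current["started_ms"],
        })
        self._current = None
```

The `send` callable would typically serialize the dictionary to JSON and write it to the WebSocket. It is injected here so the tracker stays independent of any particular WebSocket library.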
---
@@ -521,6 +552,7 @@ Server <- assistant.response.delta / assistant.response.final
Server <- output.audio.start
Server <- (binary pcm frames...)
Server <- output.audio.end
Client -> output.audio.played (optional)
Client -> session.stop
Server <- session.stopped
```