Add output.audio.played message handling and update documentation
- Introduced `output.audio.played` message type for client acknowledgment of audio playback completion.
- Updated `DuplexPipeline` to track client playback state and handle playback completion events.
- Enhanced session handling to route `output.audio.played` messages to the pipeline.
- Revised API documentation to include details about the new message type and its fields.
- Updated schema documentation to reflect the addition of `output.audio.played` in the message flow.
@@ -20,7 +20,7 @@ Required message order:
 1. Client connects to `/ws?assistant_id=<id>`.
 2. Client sends `session.start`.
 3. Server replies `session.started`.
-4. Client may stream binary audio and/or send `input.text`.
+4. Client may stream binary audio and/or send `input.text`, `response.cancel`, `output.audio.played`, `tool_call.results`.
 5. Client sends `session.stop` (or closes socket).

 If order is violated, server emits `error` with `code = "protocol.order"`.
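The ordering rule in this hunk can be illustrated with a small state machine. This is only a sketch under stated assumptions: `SessionOrderGuard` and its state names are hypothetical and do not come from the commit, binary audio frames are not modeled, and the real server may attach more fields to the `error` payload than the documented `code`.

```python
from typing import Optional

# Message types that documented step 4 allows once the session has started.
IN_SESSION_TYPES = {
    "input.text",
    "response.cancel",
    "output.audio.played",
    "tool_call.results",
}


class SessionOrderGuard:
    """Hypothetical helper that checks client text messages against the documented order."""

    def __init__(self) -> None:
        # Assumed states: connected -> started -> stopped
        self.state = "connected"

    def check(self, msg_type: str) -> Optional[dict]:
        """Return None when the message is in order, otherwise an `error` payload."""
        if self.state == "connected" and msg_type == "session.start":
            self.state = "started"
            return None
        if self.state == "started" and msg_type in IN_SESSION_TYPES:
            return None
        if self.state == "started" and msg_type == "session.stop":
            self.state = "stopped"
            return None
        # Anything else violates the required order.
        return {"type": "error", "code": "protocol.order"}
```

With this guard, an `output.audio.played` sent before `session.start` is rejected the same way as any other out-of-order message.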
@@ -100,6 +100,22 @@ Text-only mode:
 }
 ```

+### `output.audio.played`
+
+Client playback ACK after assistant audio is actually drained on local speakers
+(including jitter buffer / playback queue).
+
+```json
+{
+  "type": "output.audio.played",
+  "tts_id": "tts_001",
+  "response_id": "resp_001",
+  "turn_id": "turn_001",
+  "played_at_ms": 1730000018450,
+  "played_ms": 2520
+}
+```
+
 ### `session.stop`

 ```json
@@ -223,6 +239,8 @@ Framing rules:

 TTS boundary events:
 - `output.audio.start` and `output.audio.end` mark assistant playback boundaries.
+- `output.audio.end` means server-side audio send completed (not guaranteed speaker drain).
+- For speaker-drain confirmation, client should send `output.audio.played`.

 ## Event Throttling

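To make the distinction between `output.audio.end` and `output.audio.played` concrete, here is a minimal sketch of how a client might build the ACK once its local playback queue has drained. The function name `build_played_ack` and its parameters are assumptions made for illustration; only the JSON field names come from the documentation in this commit.

```python
import json
import time
from typing import Optional


def build_played_ack(
    tts_id: str,
    response_id: Optional[str] = None,
    turn_id: Optional[str] = None,
    played_ms: Optional[int] = None,
    played_at_ms: Optional[int] = None,
) -> str:
    """Hypothetical helper: serialize an `output.audio.played` message."""
    if not tts_id:
        # The field table requires `tts_id` to be a non-empty string.
        raise ValueError("tts_id must be a non-empty string")
    if played_at_ms is None:
        # Assumption: default to the current wall-clock time in milliseconds.
        played_at_ms = int(time.time() * 1000)
    message = {
        "type": "output.audio.played",
        "tts_id": tts_id,
        "response_id": response_id,
        "turn_id": turn_id,
        "played_at_ms": played_at_ms,
        "played_ms": played_ms,
    }
    return json.dumps(message)
```

A client would call this from whatever callback signals that the speaker queue is empty, not on receipt of `output.audio.end`, and send the resulting string as a WebSocket text frame.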
@@ -46,6 +46,7 @@
 - binary audio
 - `input.text` (optional)
 - `response.cancel` (optional)
+- `output.audio.played` (optional)
 - `tool_call.results` (optional)
 6. Client sends `session.stop` or simply closes the connection

@@ -190,7 +191,35 @@
 | `type` | string | yes | - | fixed `"response.cancel"` | requests interruption of the current response |
 | `graceful` | boolean | no | `false` | cancellation mode | `false` interrupts immediately; `true` is currently used mainly for logging and does not force an interruption |

-## 3.5 `tool_call.results`
+## 3.5 `output.audio.played`
+
+Client acknowledgment sent once the audio has actually finished playing on the local speakers (including the jitter buffer / playback queue).
+
+Example:
+
+```json
+{
+  "type": "output.audio.played",
+  "tts_id": "tts_001",
+  "response_id": "resp_001",
+  "turn_id": "turn_001",
+  "played_at_ms": 1730000018450,
+  "played_ms": 2520
+}
+```
+
+Field reference:
+
+| Field | Type | Required | Constraint | Meaning | Usage notes |
+|---|---|---|---|---|---|
+| `type` | string | yes | fixed `"output.audio.played"` | playback-complete acknowledgment | sent by the client after playback finishes |
+| `tts_id` | string | yes | non-empty string | TTS segment ID | recommended to reuse the `tts_id` from `output.audio.start/end` |
+| `response_id` | string \| null | no | any string | response ID | recommended to echo back, for aggregation |
+| `turn_id` | string \| null | no | any string | turn ID | recommended to echo back, for aggregation |
+| `played_at_ms` | number \| null | no | millisecond timestamp | time at which the client finished playback | used for latency analysis |
+| `played_ms` | number \| null | no | non-negative number | client playback duration | used for player statistics |
+
+## 3.6 `tool_call.results`

 Used only when the tool executor is the client (`assistant.tool_call.executor == "client"`).

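The constraints in the `output.audio.played` field table above can be checked mechanically. The sketch below is one possible reading of that table, not the validation the server actually performs; `validate_played` and `_is_number` are hypothetical names, and treating booleans as non-numbers is an assumption.

```python
from typing import Any, List


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_played(msg: dict) -> List[str]:
    """Return the names of fields that violate the documented constraints."""
    problems: List[str] = []
    if msg.get("type") != "output.audio.played":
        problems.append("type")
    tts_id = msg.get("tts_id")
    if not isinstance(tts_id, str) or not tts_id:
        problems.append("tts_id")  # required, non-empty string
    for key in ("response_id", "turn_id"):
        value = msg.get(key)
        if value is not None and not isinstance(value, str):
            problems.append(key)  # optional: string or null
    played_at = msg.get("played_at_ms")
    if played_at is not None and not _is_number(played_at):
        problems.append("played_at_ms")  # optional: number or null
    played = msg.get("played_ms")
    if played is not None and (not _is_number(played) or played < 0):
        problems.append("played_ms")  # optional: non-negative number or null
    return problems
```

An empty list means the message satisfies every constraint in the table; missing optional fields are treated the same as `null`.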
@@ -228,7 +257,7 @@
 - Duplicate results are ignored;
 - If no result arrives before the timeout, the server synthesizes a timeout result (`504`).

-## 3.6 `session.stop`
+## 3.7 `session.stop`

 Example:

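@@ -406,7 +435,7 @@
 - Meaning: start boundary of TTS audio output

 6. `output.audio.end`
-- Meaning: end boundary of TTS audio output
+- Meaning: end boundary of TTS audio output (the server has finished sending; this does not mean the speakers have finished playing)

 7. `response.interrupted`
 - Meaning: the current response was interrupted (barge-in or cancel)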
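@@ -434,6 +463,7 @@
 - Audio is sent as binary PCM frames;
 - Send units are aligned to `640 bytes` (short frames are zero-padded before sending);
 - The frontend typically uses `output.audio.start/end` to control playback boundaries;
+- If "actually finished playing on the speakers" semantics are needed, the frontend should send `output.audio.played` after playback completes;
 - After receiving `response.interrupted`, the frontend should discard any old audio still queued and not yet played.

 ---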
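@@ -502,8 +532,9 @@
 2. Voice input must be strictly 16k/16bit/mono, and every WS binary message must have a length of `640*n`.
 3. In the UI layer, treat `assistant.response.delta` as the streaming display and `assistant.response.final` as the converged result.
 4. The player uses `output.audio.start/end` to manage the lifecycle of one playback round.
-5. In tool-call scenarios, if `executor=client`, be sure to return `tool_call.results` keyed by `tool_call_id`.
-6. When an `error` occurs, branch on `code` first rather than looking only at `message`.
+5. If your business logic depends on audio having actually finished playing on the speakers, send `output.audio.played` when playback completes.
+6. In tool-call scenarios, if `executor=client`, be sure to return `tool_call.results` keyed by `tool_call_id`.
+7. When an `error` occurs, branch on `code` first rather than looking only at `message`.

 ---
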
@@ -521,6 +552,7 @@ Server <- assistant.response.delta / assistant.response.final
 Server <- output.audio.start
 Server <- (binary pcm frames...)
 Server <- output.audio.end
+Client -> output.audio.played (optional)
 Client -> session.stop
 Server <- session.stopped
 ```