Add output.audio.played message handling and update documentation

- Introduced `output.audio.played` message type for client acknowledgment of audio playback completion.
- Updated `DuplexPipeline` to track client playback state and handle playback completion events.
- Enhanced session handling to route `output.audio.played` messages to the pipeline.
- Revised API documentation to include details about the new message type and its fields.
- Updated schema documentation to reflect the addition of `output.audio.played` in the message flow.
2026-03-04 10:01:34 +08:00
parent 80fff09b76
commit 7d4af18815
8 changed files with 275 additions and 19 deletions
--- a/engine/docs/ws_v1_schema.md
+++ b/engine/docs/ws_v1_schema.md
@@ -20,7 +20,7 @@ Required message order:
 1. Client connects to `/ws?assistant_id=<id>`.
 2. Client sends `session.start`.
 3. Server replies `session.started`.
-4. Client may stream binary audio and/or send `input.text`.
+4. Client may stream binary audio and/or send `input.text`, `response.cancel`, `output.audio.played`, `tool_call.results`.
 5. Client sends `session.stop` (or closes socket).

 If order is violated, server emits `error` with `code = "protocol.order"`.
@@ -100,6 +100,22 @@ Text-only mode:
 }
 ```

+### `output.audio.played`
+
+Client playback ACK after assistant audio is actually drained on local speakers
+(including jitter buffer / playback queue).
+
+```json
+{
+  "type": "output.audio.played",
+  "tts_id": "tts_001",
+  "response_id": "resp_001",
+  "turn_id": "turn_001",
+  "played_at_ms": 1730000018450,
+  "played_ms": 2520
+}
+```
+
 ### `session.stop`

 ```json
@@ -223,6 +239,8 @@ Framing rules:

 TTS boundary events:
 - `output.audio.start` and `output.audio.end` mark assistant playback boundaries.
+- `output.audio.end` means server-side audio send completed (not guaranteed speaker drain).
+- For speaker-drain confirmation, client should send `output.audio.played`.

 ## Event Throttling

--- a/engine/docs/ws_v1_schema_zh.md
+++ b/engine/docs/ws_v1_schema_zh.md
@@ -46,6 +46,7 @@
  - Binary audio
  - `input.text` (optional)
  - `response.cancel` (optional)
+  - `output.audio.played` (optional)
  - `tool_call.results` (optional)
 6. Client sends `session.stop` or simply closes the connection

@@ -190,7 +191,35 @@
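 | `type` | string | Yes | - | Fixed `"response.cancel"` | Requests interruption of the current response |
 | `graceful` | boolean | No | `false` | Cancellation mode | `false` interrupts immediately; `true` is currently used mainly for logging and does not force an interruption |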

-## 3.5 `tool_call.results`
+## 3.5 `output.audio.played`
+
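+Client acknowledgment sent once audio has actually finished playing on the local speakers (including the jitter buffer / playback queue).
+
+Example: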
+
+```json
+{
+  "type": "output.audio.played",
+  "tts_id": "tts_001",
+  "response_id": "resp_001",
+  "turn_id": "turn_001",
+  "played_at_ms": 1730000018450,
+  "played_ms": 2520
+}
+```
+
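+Field descriptions:
+
+| Field | Type | Required | Constraint | Meaning | Usage notes |
+|---|---|---|---|---|---|
+| `type` | string | Yes | Fixed `"output.audio.played"` | Playback completion ACK | Sent by the client after playback finishes |
+| `tts_id` | string | Yes | Non-empty string | TTS segment ID | Recommended: reuse the `tts_id` from `output.audio.start/end` |
+| `response_id` | string \| null | No | Any string | Response ID | Recommended to send back; eases aggregation |
+| `turn_id` | string \| null | No | Any string | Turn ID | Recommended to send back; eases aggregation |
+| `played_at_ms` | number \| null | No | Millisecond timestamp | Client playback completion time | Used for latency analysis |
+| `played_ms` | number \| null | No | Non-negative number | Client playback duration | Used for player statistics |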
+
+## 3.6 `tool_call.results`

 Used only when the tool executor is the client (`assistant.tool_call.executor == "client"`).

@@ -228,7 +257,7 @@
 - Duplicate results are ignored;
 - If no result arrives before the timeout, the server synthesizes a timeout result (`504`).

-## 3.6 `session.stop`
+## 3.7 `session.stop`

 Example:

@@ -406,7 +435,7 @@
 - Meaning: start boundary of TTS audio output

 6. `output.audio.end`
-- Meaning: end boundary of TTS audio output
+- Meaning: end boundary of TTS audio output (server-side send completed; not equivalent to the speakers having finished playback)

 7. `response.interrupted`
 - Meaning: the current response was interrupted (barge-in or cancel)
@@ -434,6 +463,7 @@
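 - Audio is sent as binary PCM frames;
 - Send units are aligned to `640 bytes` (short frames are zero-padded before sending);
 - The frontend typically uses `output.audio.start/end` to control playback boundaries;
+- If "speakers have actually finished playing" semantics are required, the frontend should send `output.audio.played` after playback completes;
 - After receiving `response.interrupted`, discard any old audio that is still queued and not yet played.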

 ---
@@ -502,8 +532,9 @@
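 2. Voice input must be strictly 16k/16bit/mono, and the length of every WS binary message must be `640*n`.
 3. The UI layer treats `assistant.response.delta` as streaming display and `assistant.response.final` as the converged result.
 4. The player uses `output.audio.start/end` to manage the lifecycle of one playback round.
-5. In tool-call scenarios, if `executor=client`, be sure to return `tool_call.results` keyed by `tool_call_id`.
-6. When an `error` occurs, branch on `code` first rather than relying only on `message`.
+5. If your business logic depends on the speakers actually finishing playback, send `output.audio.played` when playback completes.
+6. In tool-call scenarios, if `executor=client`, be sure to return `tool_call.results` keyed by `tool_call_id`.
+7. When an `error` occurs, branch on `code` first rather than relying only on `message`.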

 ---

@@ -521,6 +552,7 @@ Server <- assistant.response.delta / assistant.response.final
 Server <- output.audio.start
 Server <- (binary pcm frames...)
 Server <- output.audio.end
+Client -> output.audio.played (optional)
 Client -> session.stop
 Server <- session.stopped
 ```
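
The `output.audio.played` payload documented in this change can be exercised with a small helper. The following is a minimal Python sketch, assuming only the field constraints listed in the schema docs above (`tts_id` required and non-empty; the other fields optional or null; `played_ms` non-negative). The names `build_played_ack` and `validate_played_ack` are hypothetical and are not part of the engine; the engine's actual validation in the session handler or `DuplexPipeline` may differ.

```python
import json
import time
from typing import Any, Dict, List, Optional


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def build_played_ack(
    tts_id: str,
    response_id: Optional[str] = None,
    turn_id: Optional[str] = None,
    played_ms: Optional[float] = None,
    played_at_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Build an `output.audio.played` message (hypothetical client helper)."""
    if not isinstance(tts_id, str) or not tts_id:
        raise ValueError("tts_id must be a non-empty string")
    if played_ms is not None and (not _is_number(played_ms) or played_ms < 0):
        raise ValueError("played_ms must be a non-negative number")
    if played_at_ms is None:
        # Assumption: default to the local wall clock in milliseconds.
        played_at_ms = int(time.time() * 1000)
    return {
        "type": "output.audio.played",
        "tts_id": tts_id,
        "response_id": response_id,
        "turn_id": turn_id,
        "played_at_ms": played_at_ms,
        "played_ms": played_ms,
    }


def validate_played_ack(msg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the message fits the field table."""
    errors: List[str] = []
    if msg.get("type") != "output.audio.played":
        errors.append("type must be 'output.audio.played'")
    tts_id = msg.get("tts_id")
    if not isinstance(tts_id, str) or not tts_id:
        errors.append("tts_id must be a non-empty string")
    for key in ("response_id", "turn_id"):
        value = msg.get(key)
        if value is not None and not isinstance(value, str):
            errors.append(f"{key} must be a string or null")
    played_at = msg.get("played_at_ms")
    if played_at is not None and not _is_number(played_at):
        errors.append("played_at_ms must be a number or null")
    played_ms = msg.get("played_ms")
    if played_ms is not None and (not _is_number(played_ms) or played_ms < 0):
        errors.append("played_ms must be a non-negative number or null")
    return errors


if __name__ == "__main__":
    ack = build_played_ack("tts_001", "resp_001", "turn_001", played_ms=2520)
    # The JSON text is what a client would send as a WS text frame.
    print(json.dumps(ack))
```

A client would call `build_played_ack` only after its playback queue has fully drained for the given `tts_id`, not upon receiving `output.audio.end`, since the latter only signals that the server finished sending.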