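# Engine Architecture in Detail

A closer look at the two engine architectures in RAS: the pipeline engine and the multimodal engine.

---

## Engine Overview

The engine is the core of RAS and handles real-time voice interaction. Two architectures are available, depending on your requirements:

| Architecture | Characteristics | Best suited for |
|--------------|-----------------|-----------------|
| **Pipeline** | Flexible, customizable, predictable cost | Most scenarios |
| **Multimodal** | Low latency, natural, simple | Premium-experience scenarios |

---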
## Pipeline Engine

### Architecture

The pipeline engine chains five stages: **voice activity detection (VAD)**, **speech recognition (ASR)**, **turn detection (TD)**, a **large language model (LLM)**, and **speech synthesis (TTS)**. Each stage can be backed by an **external service** (OpenAI, SiliconFlow, DashScope, or a local model). The LLM can also call **tools** (webhooks, client-side tools, and built-in tools).

```mermaid
flowchart LR
    subgraph Input["Input processing"]
        Audio[User audio] --> VAD[Voice activity detection VAD]
        VAD --> ASR[Speech recognition ASR]
        ASR --> Text[Transcript]
        Text --> TD[Turn detection TD]
    end

    subgraph Process["Semantic processing"]
        TD --> LLM[Large language model LLM]
        LLM --> Response[Reply text]
        LLM --> Tools[Tools]
    end

    subgraph Output["Output generation"]
        Response --> TTS[Speech synthesis TTS]
        TTS --> OutputAudio[Assistant audio]
    end
```

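
The stage order in the diagram can be sketched as a chain of plain functions. This is a minimal illustration, not the RAS implementation: the `Pipeline` class, its field names, and the stub stages are all hypothetical, and a real engine would stream between stages rather than process one buffer at a time.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    """Hypothetical container for the five pipeline stages."""
    vad: Callable[[bytes], bool]           # True if the audio contains speech
    asr: Callable[[bytes], str]            # audio -> transcript
    turn_detector: Callable[[str], bool]   # True if the user's turn is complete
    llm: Callable[[str], str]              # transcript -> reply text
    tts: Callable[[str], bytes]            # reply text -> audio

    def run_turn(self, audio: bytes) -> Optional[bytes]:
        """Run one turn; return assistant audio, or None if no reply is due."""
        if not self.vad(audio):
            return None                    # silence or noise: nothing to do
        text = self.asr(audio)
        if not self.turn_detector(text):
            return None                    # user has not finished speaking
        reply = self.llm(text)
        return self.tts(reply)


# Stub stages, for illustration only; each would call an external service.
demo = Pipeline(
    vad=lambda audio: len(audio) > 0,
    asr=lambda audio: audio.decode("utf-8"),
    turn_detector=lambda text: text.endswith((".", "?")),
    llm=lambda text: f"echo: {text}",
    tts=lambda reply: reply.encode("utf-8"),
)

print(demo.run_turn(b"hello."))
```

Because each stage is an independent callable, swapping one provider for another changes a single field, which is the flexibility the pipeline design aims for.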
### Data Flow

```mermaid
sequenceDiagram
    participant U as User
    participant E as Engine
    participant ASR as ASR service
    participant LLM as LLM service
    participant TTS as TTS service

    U->>E: Audio frames (PCM 16kHz)

    Note over E: VAD detects voice activity
    E->>E: Accumulate audio buffer

    Note over E: Turn detection (TD) decides what input goes to the LLM
    E->>ASR: Send audio
    ASR-->>E: Transcript (streaming)
    E-->>U: transcript.delta
    E-->>U: transcript.final

    E->>LLM: Send conversation history + user input
    LLM-->>E: Reply text (streaming)
    E-->>U: assistant.response.delta

    loop Streaming synthesis
        E->>TTS: Text chunk
        TTS-->>E: Audio chunk
        E-->>U: Audio frames
    end

    E-->>U: assistant.response.final
```

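
On the client side, the four text events in the diagram can be folded into a per-turn state object. The sketch below assumes each event is a dict with a `type` key and an optional `text` key, and that `*.delta` events carry incremental text. Both are assumptions made for illustration; only the event names come from the diagram, so check the WebSocket protocol reference for the actual payload.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    """Hypothetical client-side view of one conversation turn."""
    transcript: str = ""
    transcript_done: bool = False
    reply: str = ""
    reply_done: bool = False


def apply_event(state: TurnState, event: dict) -> TurnState:
    """Update the turn state from one engine event (assumed payload shape)."""
    kind = event["type"]
    text = event.get("text", "")
    if kind == "transcript.delta":
        state.transcript += text
    elif kind == "transcript.final":
        # Prefer the final text if the event carries one.
        state.transcript = text or state.transcript
        state.transcript_done = True
    elif kind == "assistant.response.delta":
        state.reply += text
    elif kind == "assistant.response.final":
        state.reply = text or state.reply
        state.reply_done = True
    # Unknown event types are ignored.
    return state


state = TurnState()
for event in [
    {"type": "transcript.delta", "text": "hello "},
    {"type": "transcript.delta", "text": "world"},
    {"type": "transcript.final", "text": "hello world"},
    {"type": "assistant.response.delta", "text": "Hi"},
    {"type": "assistant.response.delta", "text": " there"},
    {"type": "assistant.response.final"},
]:
    apply_event(state, event)

print(state)
```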
### Latency Analysis

End-to-end latency in the pipeline engine is the sum of the per-stage latencies:

| Stage | Typical latency | How to optimize |
|-------|-----------------|-----------------|
| VAD/EOU | 200-500 ms | Tune sensitivity |
| ASR | 100-300 ms | Choose a fast model |
| LLM TTFT | 200-500 ms | Choose a low-latency model |
| TTS | 100-200 ms | Use streaming synthesis |
| **Total** | **600-1500 ms** | - |

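
The total row is simply the sum of the lower and upper bounds of each stage. A short check, using only the figures from the table (the names `STAGE_LATENCY_MS` and `total_latency_ms` are illustrative):

```python
# (min, max) typical latency per stage in milliseconds, copied from the table.
STAGE_LATENCY_MS = {
    "VAD/EOU": (200, 500),
    "ASR": (100, 300),
    "LLM TTFT": (200, 500),
    "TTS": (100, 200),
}


def total_latency_ms(stages: dict) -> tuple:
    """Sum the per-stage bounds to get the end-to-end range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


print(total_latency_ms(STAGE_LATENCY_MS))  # (600, 1500)
```

Since the stages add up, shaving the largest contributor (VAD/EOU or LLM TTFT in this table) has the most effect on the total.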
### Streaming Optimization

To reduce perceived latency, the engine streams between stages, so each stage starts on partial output from the previous one instead of waiting for it to finish:

```mermaid
gantt
    title Non-streaming vs. streaming processing
    dateFormat X
    axisFormat %s

    section Non-streaming
    ASR complete :a1, 0, 300ms
    LLM complete :a2, after a1, 800ms
    TTS complete :a3, after a2, 500ms
    Playback :a4, after a3, 500ms

    section Streaming
    ASR :b1, 0, 300ms
    LLM starts :b2, after b1, 200ms
    TTS starts :b3, after b2, 100ms
    Generate and play concurrently :b4, after b3, 600ms
```

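
Reading the chart as arithmetic: the user hears the first audio once every bar before playback has elapsed. The sketch below uses the durations from the chart; the dictionary keys and function name are illustrative only.

```python
# Durations in milliseconds before playback can begin, taken from the chart.
NON_STREAMING_MS = {"asr": 300, "llm_complete": 800, "tts_complete": 500}
STREAMING_MS = {"asr": 300, "llm_first_token": 200, "tts_first_chunk": 100}


def time_to_first_audio_ms(stages: dict) -> int:
    """Playback starts only after every listed stage has elapsed."""
    return sum(stages.values())


print(time_to_first_audio_ms(NON_STREAMING_MS))  # 1600
print(time_to_first_audio_ms(STREAMING_MS))      # 600
```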
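---

## Realtime Interaction Engine and Multimodal

### Realtime Interaction Engine Backends

The realtime interaction engine connects to one of the following realtime backends:

| Backend | Description |
|---------|-------------|
| **OpenAI Realtime** | OpenAI's realtime voice model |
| **Gemini Live** | Google's realtime multimodal model |
| **Doubao Realtime** | Doubao's realtime interaction model |

Like the LLM in the pipeline engine, the realtime interaction engine can call **tools**: webhooks, client-side tools, and built-in tools.
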
### Multimodal Engine Architecture

The multimodal engine uses an end-to-end model that consumes and produces audio directly:

```mermaid
flowchart LR
    subgraph Client["Client"]
        Mic[Microphone] --> AudioIn[Audio input]
        AudioOut[Audio output] --> Speaker[Speaker]
    end

    subgraph Engine["Engine"]
        AudioIn --> RT[Realtime Model]
        RT --> AudioOut
        RT --> Tools[Tools]
    end

    subgraph Model["Realtime interaction engine"]
        RT --> GPT4o[OpenAI Realtime]
        RT --> Gemini[Gemini Live]
        RT --> Doubao[Doubao Realtime]
    end
```

### Data Flow

```mermaid
sequenceDiagram
    participant U as User
    participant E as Engine
    participant RT as Realtime Model

    U->>E: Audio frames
    E->>RT: Forward audio

    Note over RT: End-to-end processing

    RT-->>E: Audio response (streaming)
    E-->>U: Play audio

    Note over U,RT: Full duplex supported<br/>The user can interrupt at any time
```

### External Services (Pipeline)

Each stage of the pipeline engine can use the following **external services**:

| Service | Description |
|---------|-------------|
| **OpenAI** | LLM, ASR, TTS, and more |
| **SiliconFlow** | API service hosted in mainland China |
| **DashScope** | Alibaba Cloud's model service (Lingji) |
| **Local models** | Privately deployed models |

### Supported Realtime Models

| Model | Vendor | Characteristics |
|-------|--------|-----------------|
| **OpenAI Realtime** | OpenAI | Most natural voice, very low latency |
| **Gemini Live** | Google | Strong multimodal capabilities |
| **Doubao Realtime** | ByteDance | Available in mainland China, optimized for Chinese |

### Latency Comparison

```mermaid
xychart-beta
    title "End-to-end latency comparison"
    x-axis ["Pipeline (typical)", "Pipeline (optimized)", "Multimodal"]
    y-axis "Latency (ms)" 0 --> 1500
    bar [1200, 700, 300]
```

---
## Intelligent Interruption

Both engines support intelligent interruption, but they implement it differently.

### Interruption in the Pipeline Engine

```mermaid
sequenceDiagram
    participant U as User
    participant E as Engine
    participant TTS as TTS

    Note over E,TTS: TTS is synthesizing and playing
    E->>U: Audio frames...

    U->>E: User speaks (VAD triggers)
    E->>E: Decide whether the interruption is valid

    alt Valid interruption
        E->>TTS: Stop synthesis
        E->>E: Clear audio buffer
        E-->>U: output.audio.interrupted
        Note over E: Process new input
    else Noise or false trigger
        Note over E: Continue playback
    end
```

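
The diagram does not say how the engine decides that an interruption is valid. One common heuristic, shown here purely as an assumption, is to require a minimum run of continuous speech before treating VAD activity as a barge-in. The class name, the 300 ms threshold, and the 20 ms frame size are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    """Hypothetical filter: sustained speech counts, short blips do not."""
    min_speech_ms: int = 300   # assumed threshold, not an RAS default
    frame_ms: int = 20         # assumed duration of one audio frame
    _speech_ms: int = field(default=0, init=False)

    def on_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True once the barge-in is valid."""
        if is_speech:
            self._speech_ms += self.frame_ms
        else:
            self._speech_ms = 0    # any gap resets the run
        return self._speech_ms >= self.min_speech_ms


detector = BargeInDetector(min_speech_ms=60, frame_ms=20)
vad_decisions = [True, True, False, True, True, True]
print([detector.on_frame(d) for d in vad_decisions])
# [False, False, False, False, False, True]
```

When `on_frame` returns `True`, the engine would follow the "valid interruption" branch: stop synthesis, clear the audio buffer, and emit `output.audio.interrupted`.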
### Interruption in the Multimodal Engine

Multimodal models support full duplex natively, so interruption is handled inside the model:

```mermaid
sequenceDiagram
    participant U as User
    participant E as Engine
    participant RT as Realtime Model

    Note over RT: Model is producing output
    RT-->>E: Audio stream...
    E-->>U: Play

    U->>E: User speaks
    E->>RT: Forward user audio

    Note over RT: Model detects the interruption<br/>and stops output automatically

    RT-->>E: New response
    E-->>U: Play new response
```

---
## Choosing an Engine

### Decision Flow

```mermaid
flowchart TD
    Start[Choose an engine] --> Q1{Latency requirement?}

    Q1 -->|< 500ms| Q2{Sufficient budget?}
    Q1 -->|> 500ms acceptable| Pipeline[Pipeline engine]

    Q2 -->|Yes| Q3{Model available?}
    Q2 -->|No| Pipeline

    Q3 -->|GPT-4o/Gemini available| Multimodal[Multimodal engine]
    Q3 -->|Restricted in mainland China| Q4{Step Audio available?}

    Q4 -->|Available| Multimodal
    Q4 -->|Unavailable| Pipeline
```

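
The flowchart translates directly into a small function. The function and parameter names below are illustrative, and the chart does not say what happens at exactly 500 ms, so this sketch assumes that case falls on the pipeline side.

```python
def choose_engine(
    latency_budget_ms: int,
    budget_ok: bool,
    global_model_available: bool,
    domestic_model_available: bool,
) -> str:
    """Follow the decision flow; returns "pipeline" or "multimodal"."""
    # Q1: a latency budget of 500 ms or more is fine for the pipeline engine.
    if latency_budget_ms >= 500:
        return "pipeline"
    # Q2: without sufficient budget, stay on the pipeline engine.
    if not budget_ok:
        return "pipeline"
    # Q3: GPT-4o or Gemini reachable.
    if global_model_available:
        return "multimodal"
    # Q4: fall back to a domestically available realtime model, if any.
    if domestic_model_available:
        return "multimodal"
    return "pipeline"


print(choose_engine(300, True, True, False))   # multimodal
print(choose_engine(300, True, False, False))  # pipeline
```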
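### Recommendations by Scenario

| Scenario | Recommended engine | Reason |
|----------|--------------------|--------|
| **Enterprise customer service** | Pipeline | Predictable cost, customizable ASR |
| **Premium virtual human** | Multimodal | Most natural interaction |
| **Phone bot** | Pipeline | Can integrate with telecom ASR |
| **Voice assistant** | Multimodal | Low latency, natural conversation |
| **Speaking practice** | Pipeline | Needs precise ASR scoring |
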
### Hybrid Approach

You can also route users to different engines based on their tier:

```mermaid
flowchart LR
    User[User request] --> Router{Routing decision}

    Router -->|VIP user| Multimodal[Multimodal engine]
    Router -->|Regular user| Pipeline[Pipeline engine]

    Multimodal --> Response[Response]
    Pipeline --> Response
```

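
A minimal sketch of the routing decision, assuming the user tier arrives as a string. The tier label `"vip"` and the function name are hypothetical; the returned values mirror the `engine` field used in the configuration examples.

```python
def route_engine(user_tier: str) -> str:
    """Send VIP users to the multimodal engine and everyone else to the pipeline."""
    # Normalize so that "VIP", "vip", and " Vip " are treated alike.
    return "multimodal" if user_tier.strip().lower() == "vip" else "pipeline"


print(route_engine("VIP"))      # multimodal
print(route_engine("regular"))  # pipeline
```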
---

## Configuration Examples

### Pipeline Engine Configuration

```json
{
  "engine": "pipeline",
  "asr": {
    "provider": "openai-compatible",
    "model": "FunAudioLLM/SenseVoiceSmall",
    "language": "zh"
  },
  "llm": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "temperature": 0.7
  },
  "tts": {
    "provider": "openai-compatible",
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "voice": "anna"
  }
}
```

### Multimodal Engine Configuration

```json
{
  "engine": "multimodal",
  "model": {
    "provider": "openai",
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy"
  }
}
```

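
Before handing a configuration to the engine, it can help to check its shape. The sketch below infers the required sections and keys from the two examples in this section. That inference is an assumption, not a documented schema, and `validate_engine_config` is a hypothetical helper.

```python
import json

# Sections each engine type is assumed to need, inferred from the examples.
REQUIRED_SECTIONS = {
    "pipeline": ("asr", "llm", "tts"),
    "multimodal": ("model",),
}


def validate_engine_config(raw: str) -> dict:
    """Parse a JSON config and check the assumed required fields."""
    config = json.loads(raw)
    engine = config.get("engine")
    if engine not in REQUIRED_SECTIONS:
        raise ValueError(f"unknown engine: {engine!r}")
    for section in REQUIRED_SECTIONS[engine]:
        block = config.get(section)
        if not isinstance(block, dict):
            raise ValueError(f"missing section: {section}")
        for key in ("provider", "model"):
            if key not in block:
                raise ValueError(f"{section}.{key} is required")
    return config


example = """
{
  "engine": "multimodal",
  "model": {
    "provider": "openai",
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy"
  }
}
"""
print(validate_engine_config(example)["model"]["voice"])  # alloy
```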
---

## Related Documentation

- [System architecture](../overview/architecture.md) - overall architecture design
- [WebSocket protocol](../api-reference/websocket.md) - protocol details
- [Deployment guide](../deployment/index.md) - engine deployment and configuration