Add Mermaid diagram support and update architecture documentation
- Included a new JavaScript file for Mermaid configuration to ensure consistent diagram sizing across documentation.
- Enhanced architecture documentation to reflect the updated pipeline engine structure, including VAD, ASR, TD, LLM, and TTS components.
- Updated various sections to clarify the integration of external services and tools within the architecture.
- Improved styling for Mermaid diagrams to enhance visual consistency and usability.
@@ -19,23 +19,25 @@
|
|||||||
|
|
||||||
### Architecture Design
|
### Architecture Design
|
||||||
|
|
||||||
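The pipeline engine splits voice interaction into three independent stages:
|
The pipeline engine comprises **voice activity detection (VAD)**, **automatic speech recognition (ASR)**, **turn detection (TD)**, a **large language model (LLM)**, and **text-to-speech (TTS)**. Each stage can connect to **external services** (OpenAI, SiliconFlow, DashScope, local models). The LLM can connect to **tools** (webhooks, client tools, built-in tools).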
|
||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
flowchart LR
|
flowchart LR
|
||||||
subgraph Input["Input Processing"]
|
subgraph Input["Input Processing"]
|
||||||
Audio[User Audio] --> VAD[VAD Detection]
|
Audio[User Audio] --> VAD[Voice Activity Detection VAD]
|
||||||
VAD --> ASR[Speech Recognition]
|
VAD --> ASR[Speech Recognition ASR]
|
||||||
ASR --> Text[Transcript]
|
ASR --> Text[Transcript]
|
||||||
|
Text --> TD[Turn Detection TD]
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph Process["Semantic Processing"]
|
subgraph Process["Semantic Processing"]
|
||||||
Text --> LLM[Large Language Model]
|
TD --> LLM[Large Language Model LLM]
|
||||||
LLM --> Response[Reply Text]
|
LLM --> Response[Reply Text]
|
||||||
|
LLM --> Tools[Tools]
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph Output["Output Generation"]
|
subgraph Output["Output Generation"]
|
||||||
Response --> TTS[Speech Synthesis]
|
Response --> TTS[Speech Synthesis TTS]
|
||||||
TTS --> OutputAudio[Assistant Audio]
|
TTS --> OutputAudio[Assistant Audio]
|
||||||
end
|
end
|
||||||
```
|
```
|
||||||
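The five stages in the updated flowchart can be pictured as a simple chain in which each stage consumes the previous stage's output. The following is a minimal sketch under stated assumptions: `Stage`, `Pipeline`, and `build_demo_pipeline` are hypothetical names, and the lambdas are toy stand-ins for real VAD, ASR, TD, LLM, and TTS services, not the project's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Stage:
    """One pipeline stage: a name plus a function applied to the payload."""
    name: str
    run: Callable[[Any], Any]


@dataclass
class Pipeline:
    """Runs stages in order and records which stages were executed."""
    stages: List[Stage]
    trace: List[str] = field(default_factory=list)

    def process(self, payload: Any) -> Any:
        for stage in self.stages:
            payload = stage.run(payload)
            self.trace.append(stage.name)
        return payload


def build_demo_pipeline() -> Pipeline:
    # Toy stand-ins (assumptions): real stages would call external services.
    return Pipeline(stages=[
        Stage("VAD", lambda frames: [f for f in frames if f != "silence"]),
        Stage("ASR", lambda frames: " ".join(frames)),
        Stage("TD", lambda text: text.strip()),
        Stage("LLM", lambda text: f"echo: {text}"),
        Stage("TTS", lambda reply: {"audio_for": reply}),
    ])


pipeline = build_demo_pipeline()
result = pipeline.process(["hello", "silence", "world"])
```

Keeping each stage behind the same callable interface is one way to make stages independently swappable, which matches the diagram's separation of input processing, semantic processing, and output generation.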
@@ -55,7 +57,7 @@ sequenceDiagram
|
|||||||
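Note over E: VAD detects voice activity
|
Note over E: VAD detects voice activity
|
||||||
E->>E: Accumulate audio buffer
|
E->>E: Accumulate audio buffer
|
||||||
|
|
||||||
Note over E: End of utterance (EOU) detected
|
Note over E: Turn detection (TD) determines the input to send to the LLM
|
||||||
E->>ASR: Send audio
|
E->>ASR: Send audio
|
||||||
ASR-->>E: Transcript (streaming)
|
ASR-->>E: Transcript (streaming)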
|
||||||
E-->>U: transcript.delta
|
E-->>U: transcript.delta
|
||||||
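The diff replaces the end-of-utterance note with a turn detection (TD) step that decides what input is ready for the LLM. The source does not say how TD makes that decision, so the sketch below assumes a simple silence-based rule purely for illustration; `TurnDetector`, its method names, and the 600 ms default are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnDetector:
    """Buffers transcript fragments and emits a turn after enough silence."""
    silence_threshold_ms: int = 600  # assumed value, not taken from the project
    _parts: List[str] = field(default_factory=list)
    _silence_ms: int = 0

    def on_transcript(self, text: str) -> None:
        # New speech arrived: buffer it and reset the silence counter.
        if text:
            self._parts.append(text)
            self._silence_ms = 0

    def on_silence(self, duration_ms: int) -> Optional[str]:
        # Accumulate silence; once it reaches the threshold, close the turn.
        self._silence_ms += duration_ms
        if self._parts and self._silence_ms >= self.silence_threshold_ms:
            turn = " ".join(self._parts)
            self._parts.clear()
            self._silence_ms = 0
            return turn
        return None
```

A caller would feed streaming ASR fragments to `on_transcript` and VAD silence durations to `on_silence`, forwarding any returned string to the LLM.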
@@ -111,9 +113,21 @@ gantt
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Multimodal Engine
|
## Realtime Interaction Engine and Multimodal
|
||||||
|
|
||||||
### Architecture Design
|
### Realtime Interaction Engine Connections
|
||||||
|
|
||||||
|
The realtime interaction engine can connect to the following **realtime interaction backends**:
|
||||||
|
|
||||||
|
| Backend | Description |
|
||||||
|
|------|------|
|
||||||
|
| **OpenAI Realtime** | OpenAI realtime speech model |
|
||||||
|
| **Gemini Live** | Google realtime multimodal model |
|
||||||
|
| **Doubao realtime interaction engine** | Doubao realtime interaction |
|
||||||
|
|
||||||
|
Like the LLM in the pipeline engine, the realtime interaction engine can connect to **tools**: webhooks, client tools, and built-in tools.
|
||||||
|
|
||||||
|
### Multimodal Engine Architecture
|
||||||
|
|
||||||
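The multimodal engine uses an end-to-end model that processes audio input and output directly:
|
The multimodal engine uses an end-to-end model that processes audio input and output directly: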
|
||||||
|
|
||||||
@@ -127,12 +141,13 @@ flowchart LR
|
|||||||
subgraph Engine["Engine"]
|
subgraph Engine["Engine"]
|
||||||
AudioIn --> RT[Realtime Model]
|
AudioIn --> RT[Realtime Model]
|
||||||
RT --> AudioOut
|
RT --> AudioOut
|
||||||
|
RT --> Tools[Tools]
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph Model["Multimodal Model"]
|
subgraph Model["Realtime Interaction Engine"]
|
||||||
RT --> GPT4o[GPT-4o Realtime]
|
RT --> GPT4o[OpenAI Realtime]
|
||||||
RT --> Gemini[Gemini Live]
|
RT --> Gemini[Gemini Live]
|
||||||
RT --> Step[Step Audio]
|
RT --> Doubao[Doubao Realtime]
|
||||||
end
|
end
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -155,13 +170,24 @@ sequenceDiagram
|
|||||||
Note over U,RT: Full duplex supported<br/>User can interrupt at any time
|
Note over U,RT: Full duplex supported<br/>User can interrupt at any time
|
||||||
```
|
```
|
||||||
|
|
||||||
### Supported Models
|
### External Services (Pipeline)
|
||||||
|
|
||||||
|
Each stage of the pipeline engine can use the following **external services**:
|
||||||
|
|
||||||
|
| Service | Description |
|
||||||
|
|------|------|
|
||||||
|
| **OpenAI** | LLM / ASR / TTS, etc. |
|
||||||
|
| **SiliconFlow** | API service hosted in mainland China |
|
||||||
|
| **DashScope** | Alibaba Cloud Lingji (DashScope) |
|
||||||
|
| **Local models** | Privately deployed models |
|
||||||
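Because each pipeline stage can be pointed at a different external service, a per-stage provider map is one natural way to express the choice. This is a minimal sketch, assuming hypothetical lowercase identifiers for the four services in the table and an assumed default of `"local"` for unspecified stages; none of these names come from the project's configuration format.

```python
from typing import Dict

# Hypothetical identifiers for the services listed in the table above.
SUPPORTED_PROVIDERS = {"openai", "siliconflow", "dashscope", "local"}
STAGES = ("vad", "asr", "td", "llm", "tts")


def resolve_stage_providers(config: Dict[str, str]) -> Dict[str, str]:
    """Validate a stage -> provider map and fill gaps with an assumed default."""
    unknown_stages = set(config) - set(STAGES)
    if unknown_stages:
        raise ValueError(f"unknown stages: {sorted(unknown_stages)}")
    for stage, provider in config.items():
        if provider not in SUPPORTED_PROVIDERS:
            raise ValueError(f"unsupported provider for {stage}: {provider}")
    # Stages without an explicit choice fall back to a local model (assumption).
    return {stage: config.get(stage, "local") for stage in STAGES}


resolved = resolve_stage_providers({"asr": "siliconflow", "llm": "openai", "tts": "dashscope"})
```

Validating the map up front surfaces a misspelled stage or provider at configuration time rather than in the middle of a call.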
|
|
||||||
|
### Supported Realtime Interaction Models
|
||||||
|
|
||||||
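| Model | Vendor | Highlights |
|
| Model | Vendor | Highlights |
|
||||||
|------|--------|------|
|
|------|--------|------|
|
||||||
| **GPT-4o Realtime** | OpenAI | Most natural voice, very low latency |
|
| **OpenAI Realtime** | OpenAI | Most natural voice, very low latency |
|
||||||
| **Gemini Live** | Google | Strong multimodal capabilities |
|
| **Gemini Live** | Google | Strong multimodal capabilities |
|
||||||
| **Step Audio** | StepFun | Available in mainland China, optimized for Chinese |
|
| **Doubao realtime interaction** | ByteDance | Available in mainland China, optimized for Chinese |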
|
||||||
|
|
||||||
### Latency Comparison
|
### Latency Comparison
|
||||||
|
|
||||||
|
|||||||
@@ -102,16 +102,16 @@ RAS supports two engine architectures for different scenarios.
|
|||||||
|
|
||||||
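### Pipeline Engine
|
### Pipeline Engine
|
||||||
|
|
||||||
Splits voice interaction into three independent stages:
|
Splits voice interaction into multiple stages: **VAD (voice activity detection)**, **ASR (automatic speech recognition)**, **TD (turn detection)**, **LLM (large language model)**, and **TTS (text-to-speech)**. External services can be **OpenAI**, **SiliconFlow**, **DashScope**, or **local models**. Both the LLM and the realtime interaction engine can connect to **tools** (webhooks, client tools, built-in tools).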
|
||||||
|
|
||||||
```
|
```
|
||||||
User speech → [ASR] → text → [LLM] → reply → [TTS] → assistant speech
|
User speech → [VAD] → [ASR] → [TD] → text → [LLM] → reply → [TTS] → assistant speech
|
||||||
```
|
```
|
||||||
|
|
||||||
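**Advantages:**
|
**Advantages:**
|
||||||
|
|
||||||
- Flexible choice of vendor for each stage
|
- Flexible choice of vendor for each stage (OpenAI, SiliconFlow, DashScope, local models)
|
||||||
- Each stage can be optimized independently
|
- VAD, ASR, TD, LLM, and TTS can each be optimized independently
|
||||||
- Controllable cost
|
- Controllable cost
|
||||||
|
|
||||||
**Disadvantages:**
|
**Disadvantages:**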
|
||||||
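@@ -119,9 +119,9 @@ RAS supports two engine architectures for different scenarios.
|
|||||||
- Higher latency (stage latencies add up)
|
- Higher latency (stage latencies add up)
|
||||||
- Requires coordinating multiple services
|
- Requires coordinating multiple services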
|
||||||
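The "latencies add up" drawback is plain arithmetic: in a strictly sequential pipeline the end-to-end delay is the sum of the per-stage delays. The sketch below uses made-up placeholder numbers purely to illustrate the sum; they are not measurements of any service, and `STAGE_LATENCY_MS`, `total_latency_ms`, and `slowest_stage` are hypothetical names.

```python
from typing import Dict

# Placeholder per-stage latencies in milliseconds (illustrative, not measured).
STAGE_LATENCY_MS: Dict[str, int] = {
    "vad": 30,
    "asr": 200,
    "td": 50,
    "llm": 400,
    "tts": 150,
}


def total_latency_ms(latencies: Dict[str, int]) -> int:
    """End-to-end latency of a strictly sequential pipeline is the sum of its stages."""
    return sum(latencies.values())


def slowest_stage(latencies: Dict[str, int]) -> str:
    """The stage with the largest latency is the first candidate for optimization."""
    return max(latencies, key=latencies.get)


total = total_latency_ms(STAGE_LATENCY_MS)       # 30 + 200 + 50 + 400 + 150
bottleneck = slowest_stage(STAGE_LATENCY_MS)
```

In practice, streaming between stages can overlap some of this work, so the plain sum is best read as an upper bound for a fully sequential run.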
|
|
||||||
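### Multimodal Engine
|
### Realtime Interaction Engine and Multimodal (Realtime / Multimodal)
|
||||||
|
|
||||||
Uses an end-to-end model to process audio directly:
|
The realtime interaction engine can connect to **OpenAI Realtime**, **Gemini Live**, the **Doubao realtime interaction engine**, and others, and can likewise connect to tools. It uses an end-to-end model to process audio directly: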
|
||||||
|
|
||||||
```
|
```
|
||||||
User speech → [Realtime Model] → assistant speech
|
User speech → [Realtime Model] → assistant speech
|
||||||
@@ -191,11 +191,13 @@ sequenceDiagram
|
|||||||
|
|
||||||
### Tool Types
|
### Tool Types
|
||||||
|
|
||||||
|
Both the LLM in the pipeline engine and the realtime interaction engine can connect to **tools**, including:
|
||||||
|
|
||||||
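| Type | Description | Examples |
|
| Type | Description | Examples |
|
||||||
|------|------|------|
|
|------|------|------|
|
||||||
| **Webhook** | Calls an external HTTP API | Look up an order, book an appointment |
|
| **Webhook** | Calls an external HTTP API | Look up an order, book an appointment |
|
||||||
| **Client** | Actions executed by the client | Open a page, show a form |
|
| **Client tools** | Actions executed by the client | Open a page, show a form |
|
||||||
| **Built-in** | Tools provided by the platform | Code execution, calculator |
|
| **Built-in tools** | Tools provided by the platform | Code execution, calculator |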
|
||||||
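The three tool types differ mainly in where the work runs: on a remote HTTP endpoint, on the client, or inside the platform. A dispatcher along the following lines could route a tool call accordingly. This is a sketch under stated assumptions: `ToolKind`, `Tool`, and `dispatch` are hypothetical names, and the webhook and client branches only describe the action to take instead of performing network I/O.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Optional


class ToolKind(Enum):
    WEBHOOK = "webhook"
    CLIENT = "client"
    BUILTIN = "builtin"


@dataclass
class Tool:
    name: str
    kind: ToolKind
    url: Optional[str] = None                       # used by webhook tools
    handler: Optional[Callable[..., Any]] = None    # used by built-in tools


def dispatch(tool: Tool, args: Dict[str, Any]) -> Dict[str, Any]:
    """Return a description of what should happen for this tool call."""
    if tool.kind is ToolKind.WEBHOOK:
        # A real engine would issue the HTTP request here.
        return {"action": "http_post", "url": tool.url, "json": args}
    if tool.kind is ToolKind.CLIENT:
        # A real engine would forward this to the connected client.
        return {"action": "send_to_client", "tool": tool.name, "args": args}
    if tool.handler is None:
        raise ValueError(f"built-in tool {tool.name!r} has no handler")
    return {"action": "result", "value": tool.handler(**args)}


calculator = Tool("add", ToolKind.BUILTIN, handler=lambda a, b: a + b)
open_page = Tool("open_page", ToolKind.CLIENT)
```

For example, `dispatch(calculator, {"a": 2, "b": 3})` runs the handler in process, while `dispatch(open_page, {"path": "/orders"})` only produces a message for the client to act on.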
|
|
||||||
### Tool Invocation Flow
|
### Tool Invocation Flow
|
||||||
|
|
||||||
|
|||||||
@@ -38,7 +38,7 @@ Realtime Agent Studio (RAS) is centered on large language models, for building realtime
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
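Pipeline full-duplex architecture with ASR/LLM/TTS pipeline processing, smart interruption support, and end-to-end latency < 500 ms
|
Pipeline full-duplex architecture with VAD/ASR/TD/LLM/TTS pipeline processing, smart interruption support, and end-to-end latency < 500 ms
|
||||||
|
|
||||||
- :brain: **Multimodal model support**
|
- :brain: **Multimodal model support**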
|
||||||
|
|
||||||
@@ -76,38 +76,152 @@ Realtime Agent Studio (RAS) is centered on large language models, for building realtime
|
|||||||
|
|
||||||
## System Architecture
|
## System Architecture
|
||||||
|
|
||||||
|
Platform architecture layers:
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
flowchart TB
|
||||||
|
|
||||||
|
%% ================= ACCESS =================
|
||||||
|
subgraph Access["Access Layer"]
|
||||||
|
direction TB
|
||||||
|
API[API]
|
||||||
|
SDK[SDK]
|
||||||
|
Browser[Browser UI]
|
||||||
|
Embed[Web Embed]
|
||||||
|
end
|
||||||
|
|
||||||
|
|
||||||
|
%% ================= REALTIME ENGINE =================
|
||||||
|
subgraph Runtime["Realtime Interaction Engine"]
|
||||||
|
|
||||||
|
direction LR
|
||||||
|
|
||||||
|
%% -------- Duplex Engine --------
|
||||||
|
subgraph Duplex["Duplex Interaction Engine"]
|
||||||
|
direction LR
|
||||||
|
|
||||||
|
subgraph Pipeline["Pipeline Engine"]
|
||||||
|
direction LR
|
||||||
|
VAD[VAD]
|
||||||
|
ASR[ASR]
|
||||||
|
TD[Turn Detection]
|
||||||
|
LLM[LLM]
|
||||||
|
TTS[TTS]
|
||||||
|
end
|
||||||
|
|
||||||
|
subgraph Multi["Realtime Engine"]
|
||||||
|
MM[Realtime Model]
|
||||||
|
end
|
||||||
|
|
||||||
|
end
|
||||||
|
|
||||||
|
|
||||||
|
%% -------- Capabilities --------
|
||||||
|
subgraph Capability["Agent Capabilities"]
|
||||||
|
|
||||||
|
subgraph Tools["Tool System"]
|
||||||
|
Webhook[Webhook]
|
||||||
|
ClientTool[Client Tools]
|
||||||
|
Builtin[Builtin Tools]
|
||||||
|
end
|
||||||
|
|
||||||
|
subgraph KB["Knowledge System"]
|
||||||
|
Docs[Documents]
|
||||||
|
Vector[(Vector Index)]
|
||||||
|
Retrieval[Retrieval]
|
||||||
|
end
|
||||||
|
|
||||||
|
end
|
||||||
|
|
||||||
|
end
|
||||||
|
|
||||||
|
|
||||||
|
%% ================= PLATFORM =================
|
||||||
|
subgraph Platform["Platform Services"]
|
||||||
|
direction TB
|
||||||
|
Backend[Backend Service]
|
||||||
|
Frontend[Frontend Console]
|
||||||
|
DB[(Database)]
|
||||||
|
end
|
||||||
|
|
||||||
|
|
||||||
|
%% ================= CONNECTIONS =================
|
||||||
|
|
||||||
|
Access --> Runtime
|
||||||
|
|
||||||
|
Runtime <--> Backend
|
||||||
|
Backend <--> DB
|
||||||
|
Backend <--> Frontend
|
||||||
|
|
||||||
|
LLM --> Tools
|
||||||
|
MM --> Tools
|
||||||
|
|
||||||
|
LLM <--> KB
|
||||||
|
MM <--> KB
|
||||||
|
```
|
||||||
|
|
||||||
|
Conversation flow of the pipeline interaction engine:
|
||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
flowchart LR
|
flowchart LR
|
||||||
subgraph Client["Client"]
|
|
||||||
Web[Web Browser]
|
|
||||||
App[Mobile App]
|
|
||||||
SDK[SDK]
|
|
||||||
end
|
|
||||||
|
|
||||||
subgraph RAS["Realtime Agent Studio"]
|
User((User Speech))
|
||||||
Engine[Realtime Interaction Engine]
|
Audio[Audio Stream]
|
||||||
API[API Service]
|
|
||||||
DB[(Database)]
|
|
||||||
end
|
|
||||||
|
|
||||||
subgraph Pipeline["Pipeline Engine"]
|
VAD[VAD\nVoice Activity Detection]
|
||||||
ASR[Speech Recognition]
|
ASR[ASR\nSpeech Recognition]
|
||||||
LLM[Large Language Model]
|
|
||||||
TTS[Speech Synthesis]
|
|
||||||
end
|
|
||||||
|
|
||||||
subgraph External["External Services"]
|
TD[Turn Detection]
|
||||||
OpenAI[OpenAI]
|
|
||||||
Azure[Azure]
|
|
||||||
Local[Local Model]
|
|
||||||
end
|
|
||||||
|
|
||||||
Client -->|WebSocket| Engine
|
LLM[LLM\nReasoning]
|
||||||
Client -->|REST| API
|
|
||||||
Engine --> Pipeline
|
Tools[Tools / APIs]
|
||||||
Engine <--> API
|
|
||||||
API <--> DB
|
TTS[TTS\nSpeech Synthesis]
|
||||||
Pipeline --> External
|
|
||||||
|
AudioOut[Audio Stream Out]
|
||||||
|
|
||||||
|
User --> Audio
|
||||||
|
Audio --> VAD
|
||||||
|
VAD --> ASR
|
||||||
|
ASR --> TD
|
||||||
|
TD --> LLM
|
||||||
|
|
||||||
|
LLM --> Tools
|
||||||
|
Tools --> LLM
|
||||||
|
|
||||||
|
LLM --> TTS
|
||||||
|
TTS --> AudioOut
|
||||||
|
AudioOut --> User
|
||||||
|
```
|
||||||
|
|
||||||
|
Conversation flow based on a realtime interaction model:
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
flowchart LR
|
||||||
|
|
||||||
|
User((User))
|
||||||
|
|
||||||
|
Input[Audio / Video / Text]
|
||||||
|
|
||||||
|
MM[Multimodal Model]
|
||||||
|
|
||||||
|
Tools[Tools / APIs]
|
||||||
|
KB[Knowledge Base]
|
||||||
|
|
||||||
|
Output[Audio / Video / Text]
|
||||||
|
|
||||||
|
User --> Input
|
||||||
|
Input --> MM
|
||||||
|
|
||||||
|
MM --> Tools
|
||||||
|
Tools --> MM
|
||||||
|
|
||||||
|
MM --> KB
|
||||||
|
KB --> MM
|
||||||
|
|
||||||
|
MM --> Output
|
||||||
|
Output --> User
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -119,9 +233,9 @@ flowchart LR
|
|||||||
| **Frontend** | React 18, TypeScript, Tailwind CSS, Zustand |
|
| **Frontend** | React 18, TypeScript, Tailwind CSS, Zustand |
|
||||||
| **Backend** | FastAPI (Python 3.10+) |
|
| **Backend** | FastAPI (Python 3.10+) |
|
||||||
| **Engine** | Python, WebSocket, asyncio |
|
| **Engine** | Python, WebSocket, asyncio |
|
||||||
| **Database** | SQLite / PostgreSQL |
|
| **Database** | SQLite |
|
||||||
| **Knowledge base** | chroma |
|
| **Knowledge base** | chroma |
|
||||||
| **Deployment** | Docker, Nginx |
|
| **Deployment** | Docker |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -204,16 +318,6 @@ ws.onopen = () => {
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
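## Contributing
|
|
||||||
|
|
||||||
We welcome community contributions! See the [contributing guide](https://github.com/your-org/AI-VideoAssistant/blob/main/CONTRIBUTING.md) to learn how to get involved.
|
|
||||||
|
|
||||||
- :star: Star the project to support us
|
|
||||||
- :bug: Open an issue to report problems
|
|
||||||
- :hammer: Submit a PR to contribute code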
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## License
|
## License
|
||||||
|
|
||||||
This project is open source under the [MIT License](https://github.com/your-org/AI-VideoAssistant/blob/main/LICENSE).
|
This project is open source under the [MIT License](https://github.com/your-org/AI-VideoAssistant/blob/main/LICENSE).
|
||||||
|
|||||||
18
docs/content/javascripts/mermaid.mjs
Normal file
@@ -0,0 +1,18 @@
|
|||||||
|
/**
|
||||||
|
* Global Mermaid config for consistent diagram sizing across all docs.
|
||||||
|
* Exposed as window.mermaid so Material for MkDocs uses this instance.
|
||||||
|
*/
|
||||||
|
import mermaid from "https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs";
|
||||||
|
|
||||||
|
mermaid.initialize({
|
||||||
|
startOnLoad: false,
|
||||||
|
securityLevel: "loose",
|
||||||
|
theme: "base",
|
||||||
|
useMaxWidth: false,
|
||||||
|
themeVariables: {
|
||||||
|
fontSize: "14px",
|
||||||
|
fontFamily: "Inter, sans-serif",
|
||||||
|
},
|
||||||
|
});
|
||||||
|
|
||||||
|
window.mermaid = mermaid;
|
||||||
@@ -31,9 +31,16 @@ flowchart TB
|
|||||||
end
|
end
|
||||||
|
|
||||||
subgraph External["External Services"]
|
subgraph External["External Services"]
|
||||||
LLM[LLM Service]
|
OpenAI[OpenAI]
|
||||||
ASR[ASR Service]
|
SiliconFlow[SiliconFlow]
|
||||||
TTS[TTS Service]
|
DashScope[DashScope]
|
||||||
|
LocalModel[Local Model]
|
||||||
|
end
|
||||||
|
|
||||||
|
subgraph Tools["Tools"]
|
||||||
|
Webhook[Webhook]
|
||||||
|
ClientTool[Client Tools]
|
||||||
|
Builtin[Built-in Tools]
|
||||||
end
|
end
|
||||||
|
|
||||||
Browser --> WebApp
|
Browser --> WebApp
|
||||||
@@ -44,9 +51,8 @@ flowchart TB
|
|||||||
API <--> DB
|
API <--> DB
|
||||||
API <--> FileStore
|
API <--> FileStore
|
||||||
Engine <--> API
|
Engine <--> API
|
||||||
Engine --> LLM
|
Engine --> External
|
||||||
Engine --> ASR
|
Engine --> Tools
|
||||||
Engine --> TTS
|
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -60,7 +66,7 @@ flowchart TB
|
|||||||
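| Module | Description |
|
| Module | Description |
|
||||||
|---------|------|
|
|---------|------|
|
||||||
| Assistant management | Create, configure, and test assistants |
|
| Assistant management | Create, configure, and test assistants |
|
||||||
| Resource library | LLM/ASR/TTS model management |
|
| Resource library | Management of LLM/ASR/TTS/VAD and other models |
|
||||||
| Knowledge base | RAG document upload and management |
|
| Knowledge base | RAG document upload and management |
|
||||||
| History | Session log search and playback |
|
| History | Session log search and playback |
|
||||||
| Dashboard | Realtime statistics |
|
| Dashboard | Realtime statistics |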
|
||||||
@@ -103,45 +109,74 @@ flowchart TB
|
|||||||
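SM[Session Manager]
|
SM[Session Manager]
|
||||||
|
|
||||||
subgraph Pipeline["Pipeline Engine"]
|
subgraph Pipeline["Pipeline Engine"]
|
||||||
VAD[VAD Detection]
|
VAD[Voice Activity Detection VAD]
|
||||||
ASR[Speech Recognition]
|
ASR[Speech Recognition ASR]
|
||||||
LLM[Large Language Model]
|
TD[Turn Detection TD]
|
||||||
TTS[Speech Synthesis]
|
LLM[Large Language Model LLM]
|
||||||
|
TTS[Speech Synthesis TTS]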
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph Multimodal["Multimodal Engine"]
|
subgraph Realtime["Realtime Interaction Engine Connections"]
|
||||||
RT[Realtime Model<br/>GPT-4o / Gemini]
|
RTOpenAI[OpenAI Realtime]
|
||||||
|
RTGemini[Gemini Live]
|
||||||
|
RTDoubao[Doubao Realtime Interaction]
|
||||||
|
end
|
||||||
|
|
||||||
|
subgraph Tools["Tools"]
|
||||||
|
Webhook[Webhook]
|
||||||
|
ClientTool[Client Tools]
|
||||||
|
Builtin[Built-in Tools]
|
||||||
end
|
end
|
||||||
end
|
end
|
||||||
|
|
||||||
Client[Client] -->|Audio stream| WS
|
Client[Client] -->|Audio stream| WS
|
||||||
WS --> SM
|
WS --> SM
|
||||||
SM --> Pipeline
|
SM --> Pipeline
|
||||||
SM --> Multimodal
|
SM --> Realtime
|
||||||
|
Pipeline --> LLM
|
||||||
|
LLM --> Tools
|
||||||
|
Realtime --> Tools
|
||||||
Pipeline -->|Text/Audio| WS
|
Pipeline -->|Text/Audio| WS
|
||||||
Multimodal -->|Text/Audio| WS
|
Realtime -->|Text/Audio| WS
|
||||||
```
|
```
|
||||||
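In the updated diagram the session manager hands each session to either the pipeline engine or a realtime backend. One way to picture that routing decision is sketched below, assuming a hypothetical `SessionConfig` with an `engine` field and made-up backend identifiers; the source does not describe how the real session manager is configured.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the three realtime backends named in the docs.
REALTIME_BACKENDS = {"openai_realtime", "gemini_live", "doubao_realtime"}


@dataclass
class SessionConfig:
    engine: str                 # "pipeline" or "realtime" (assumed values)
    realtime_backend: str = ""  # only consulted when engine == "realtime"


def route_session(config: SessionConfig) -> str:
    """Pick the engine path for a session, rejecting unknown settings early."""
    if config.engine == "pipeline":
        return "pipeline"
    if config.engine == "realtime":
        if config.realtime_backend not in REALTIME_BACKENDS:
            raise ValueError(f"unknown realtime backend: {config.realtime_backend!r}")
        return f"realtime:{config.realtime_backend}"
    raise ValueError(f"unknown engine: {config.engine!r}")
```

Both paths end at the same WebSocket, so the routing result only needs to identify which engine produces the text and audio that flow back to the client.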
|
|
||||||
|
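### External Services and Tools
|
||||||
|
|
||||||
|
| Category | Description | Options |
|
||||||
|
|------|------|--------|
|
||||||
|
| **External services** | Cloud or local services that the pipeline stages depend on | OpenAI, SiliconFlow, DashScope, local models |
|
||||||
|
| **Realtime interaction engine** | Backends the realtime interaction engine can connect to | OpenAI Realtime, Gemini Live, Doubao realtime interaction engine |
|
||||||
|
| **Tools** | Callable by both the pipeline LLM and the realtime interaction engine | Webhooks, client tools, built-in tools |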
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
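## Engine Architecture
|
## Engine Architecture
|
||||||
|
|
||||||
### Pipeline Full-Duplex Engine
|
### Pipeline Full-Duplex Engine
|
||||||
|
|
||||||
The traditional approach splits voice interaction into three independent stages:
|
The pipeline engine comprises **voice activity detection (VAD)**, **automatic speech recognition (ASR)**, **turn detection (TD)**, a **large language model (LLM)**, and **text-to-speech (TTS)**. External services can be **OpenAI**, **SiliconFlow**, **DashScope**, or **local models**. The LLM can connect to **tools** (webhooks, client tools, built-in tools).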
|
||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
sequenceDiagram
|
sequenceDiagram
|
||||||
participant C as Client
|
participant C as Client
|
||||||
participant E as Engine
|
participant E as Engine
|
||||||
|
participant VAD as VAD
|
||||||
participant ASR as Speech Recognition
|
participant ASR as Speech Recognition
|
||||||
|
participant TD as Turn Detection
|
||||||
participant LLM as Large Language Model
|
participant LLM as Large Language Model
|
||||||
participant TTS as Speech Synthesis
|
participant TTS as Speech Synthesis
|
||||||
|
participant Tools as Tools
|
||||||
|
|
||||||
C->>E: Audio stream (PCM)
|
C->>E: Audio stream (PCM)
|
||||||
|
E->>VAD: Detect voice activity
|
||||||
|
VAD-->>E: Valid speech segment
|
||||||
E->>ASR: Speech to text
|
E->>ASR: Speech to text
|
||||||
ASR-->>E: Transcript
|
ASR-->>E: Transcript
|
||||||
|
E->>TD: Turn boundary
|
||||||
|
TD-->>E: Input ready for the LLM
|
||||||
E->>LLM: Generate reply
|
E->>LLM: Generate reply
|
||||||
|
LLM->>Tools: Optional tool call
|
||||||
|
Tools-->>LLM: Tool result
|
||||||
LLM-->>E: Reply text (streaming)
|
LLM-->>E: Reply text (streaming)
|
||||||
E->>TTS: Text to speech
|
E->>TTS: Text to speech
|
||||||
TTS-->>E: Audio stream
|
TTS-->>E: Audio stream
|
||||||
@@ -150,10 +185,15 @@ sequenceDiagram
|
|||||||
|
|
||||||
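**Characteristics:**
|
**Characteristics:**
|
||||||
|
|
||||||
- Flexible choice of vendor for each stage
|
- Flexible choice of vendor for each stage (OpenAI, SiliconFlow, DashScope, local models)
|
||||||
- Each stage can be optimized independently
|
- VAD, ASR, TD, LLM, and TTS can each be optimized independently
|
||||||
|
- The LLM works with tools (webhooks, client tools, built-in tools)
|
||||||
- Latency of roughly 500-1500 ms
|
- Latency of roughly 500-1500 ms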
|
||||||
|
|
||||||
|
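### Realtime Interaction Engine
|
||||||
|
|
||||||
|
The realtime interaction engine can connect to **realtime interaction backends**, including **OpenAI Realtime**, **Gemini Live**, and the **Doubao realtime interaction engine**, and can likewise connect to **tools** (webhooks, client tools, built-in tools).
|
||||||
|
|
||||||
### Native Multimodal Engine
|
### Native Multimodal Engine
|
||||||
|
|
||||||
Uses an end-to-end multimodal model (such as GPT-4o Realtime):
|
Uses an end-to-end multimodal model (such as GPT-4o Realtime):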
|
||||||
|
|||||||
@@ -83,9 +83,13 @@
|
|||||||
border: none;
|
border: none;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Mermaid Diagram Styling */
|
/* Mermaid Diagram Styling - consistent element size across diagrams */
|
||||||
.mermaid {
|
.mermaid {
|
||||||
margin: 1.5rem 0;
|
margin: 1.5rem 0;
|
||||||
|
overflow-x: auto;
|
||||||
|
}
|
||||||
|
.mermaid svg {
|
||||||
|
min-width: min-content;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Navigation Enhancement */
|
/* Navigation Enhancement */
|
||||||
|
|||||||
@@ -162,4 +162,5 @@ extra_css:
|
|||||||
- stylesheets/extra.css
|
- stylesheets/extra.css
|
||||||
|
|
||||||
extra_javascript:
|
extra_javascript:
|
||||||
|
- javascripts/mermaid.mjs
|
||||||
- javascripts/extra.js
|
- javascripts/extra.js
|
||||||
|
|||||||