- Corrected phrasing in the introduction of RAS as an open-source alternative. - Added new documentation sections for voice AI and voice agents. - Enhanced the flowchart for assistant components to include detailed configurations. - Updated terminology for engine types to clarify distinctions between Pipeline and Realtime engines. - Introduced a new section on user utterance endpoints (EoU) to explain detection mechanisms and configurations.
5.8 KiB
5.8 KiB
构建实时交互音视频智能体的开源工作平台
什么是 Realtime Agent Studio?
Realtime Agent Studio (RAS) 是一款以大语言模型为核心,构建实时交互音视频智能体的工作平台。支持管线式的全双工交互引擎和原生多模态模型两种架构,覆盖实时交互智能体的配置、测试、发布、监控全流程。
可以将 RAS 看作 Vapi、Retell、ElevenLabs Agents 的开源替代方案。
核心特性
-
⚡ 低延迟实时引擎
管线式全双工架构,VAD/ASR/TD/LLM/TTS 流水线处理,支持智能打断,端到端延迟 < 500ms
-
🧠 多模态模型支持
支持 GPT-4o Realtime、Gemini Live、Step Audio 等原生多模态模型直连
-
🔧 可视化配置
无代码配置助手、提示词、工具调用、知识库关联,所见即所得
-
🔌 开放 API
标准 WebSocket 协议,RESTful 管理接口,支持 Webhook 回调
-
🛡️ 私有化部署
Docker 一键部署,数据完全自主可控,支持本地模型
-
📈 全链路监控
完整会话回放,实时仪表盘,自动化测试与效果评估
系统架构
平台架构层级:
flowchart TB
%% ================= ACCESS =================
subgraph Access["Access Layer"]
direction TB
API[API]
SDK[SDK]
Browser[Browser UI]
Embed[Web Embed]
end
%% ================= REALTIME ENGINE =================
subgraph Runtime["Realtime Interaction Engine"]
direction LR
%% -------- Duplex Engine --------
subgraph Duplex["Duplex Interaction Engine"]
direction LR
subgraph Pipeline["Pipeline Engine"]
direction LR
VAD[VAD]
ASR[ASR]
TD[Turn Detection]
LLM[LLM]
TTS[TTS]
end
subgraph Multi["Realtime Engine"]
MM[Realtime Model]
end
end
%% -------- Capabilities --------
subgraph Capability["Agent Capabilities"]
subgraph Tools["Tool System"]
Webhook[Webhook]
ClientTool[Client Tools]
Builtin[Builtin Tools]
end
subgraph KB["Knowledge System"]
Docs[Documents]
Vector[(Vector Index)]
Retrieval[Retrieval]
end
end
end
%% ================= PLATFORM =================
subgraph Platform["Platform Services"]
direction TB
Backend[Backend Service]
Frontend[Frontend Console]
DB[(Database)]
end
%% ================= CONNECTIONS =================
Access --> Runtime
Runtime <--> Backend
Backend <--> DB
Backend <--> Frontend
LLM --> Tools
MM --> Tools
LLM <--> KB
MM <--> KB
管线式引擎交互引擎对话流程图:
flowchart LR
User((User Speech))
Audio[Audio Stream]
VAD[VAD\nVoice Activity Detection]
ASR[ASR\nSpeech Recognition]
TD[Turn Detection]
LLM[LLM\nReasoning]
Tools[Tools / APIs]
TTS[TTS\nSpeech Synthesis]
AudioOut[Audio Stream Out]
User --> Audio
Audio --> VAD
VAD --> ASR
ASR --> TD
TD --> LLM
LLM --> Tools
Tools --> LLM
LLM --> TTS
TTS --> AudioOut
AudioOut --> User
基于实时交互模型的对话流程图:
flowchart LR
User((User))
Input[Audio / Video / Text]
MM[Multimodal Model]
Tools[Tools / APIs]
KB[Knowledge Base]
Output[Audio / Video / Text]
User --> Input
Input --> MM
MM --> Tools
Tools --> MM
MM --> KB
KB --> MM
MM --> Output
Output --> User
技术栈
| 层级 | 技术 |
|---|---|
| 前端 | React 18, TypeScript, Tailwind CSS, Zustand |
| 后端 | FastAPI (Python 3.10+) |
| 引擎 | Python, WebSocket, asyncio |
| 数据库 | SQLite |
| 知识库 | chroma |
| 部署 | Docker |
快速导航
快速体验
使用 Docker 启动
git clone https://github.com/your-org/AI-VideoAssistant.git
cd docker
docker-compose up -d
# for development
# docker compose --profile dev up -d
访问 http://localhost:3000 即可使用控制台。
WebSocket 连接示例
const ws = new WebSocket('ws://localhost:8000/ws?assistant_id=YOUR_ID');
ws.onopen = () => {
ws.send(JSON.stringify({
type: 'session.start',
audio: { encoding: 'pcm_s16le', sample_rate_hz: 16000, channels: 1 }
}));
};
许可证
本项目基于 MIT 许可证 开源。
