
Xin Wang b300b469dc Update documentation for Realtime Agent Studio with enhanced content and structure

- Revised site name and description for clarity and detail.
- Updated navigation structure to better reflect the organization of content.
- Improved changelog entries for better readability and consistency.
- Migrated assistant configuration and prompt guidelines to new documentation paths.
- Enhanced core concepts section to clarify the roles and capabilities of assistants and engines.
- Streamlined workflow documentation to provide clearer guidance on configuration and usage.

2026-03-09 05:38:43 +08:00


Realtime Engine

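The Realtime engine connects directly to end-to-end realtime models. It suits scenarios where low latency and a natural voice experience are the top priority.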

Runtime Path

```mermaid
flowchart LR
    Input[Audio / video / text input] --> RT[Realtime Model]
    RT --> Output[Audio / text output]
    RT --> Tools[Tools]
```

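Unlike the Pipeline engine, the Realtime engine does not expose ASR, turn detection, the LLM, and TTS as separate stages. It relies on the realtime model to handle them as a whole.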

Common Backends

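| Backend | Characteristics |
|---|---|
| OpenAI Realtime | Natural voice interaction, low latency |
| Gemini Live | Strong multimodal capabilities |
| Doubao realtime interaction | Better suited to deployments in mainland China and Chinese-language scenarios |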

When to Use It

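- Experiences that need a high degree of naturalness, such as voice assistants, practice partners, and virtual characters
- Entry points with strict requirements for time to first response and smooth, repeated barge-in
- Teams that want to avoid assembling a multi-stage pipeline and plug in an end-to-end model directly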

Data Flow

```mermaid
sequenceDiagram
    participant U as User
    participant E as Engine
    participant RT as Realtime Model

    U->>E: Audio / video / text input
    E->>RT: Forward the realtime stream
    RT-->>E: Streamed text / audio output
    E-->>U: Play or render the result
```
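The flow above can be sketched as a small asyncio relay. This is a minimal sketch under stated assumptions. `fake_realtime_model` is a hypothetical stand-in for a real realtime session that emits a tagged reply per input. `run_engine` is an illustrative name, not this project's API. The sketch shows that forwarding input and relaying output run concurrently, so output can start streaming before the input side has finished.

```python
import asyncio
from typing import AsyncIterator, List, Optional


async def fake_realtime_model(inbox: "asyncio.Queue[Optional[str]]") -> AsyncIterator[str]:
    # Hypothetical model session: streams two output chunks for every input item.
    while True:
        item = await inbox.get()
        if item is None:  # sentinel: the session has ended
            return
        yield f"reply-to:{item}"
        yield "<end-of-turn>"


async def run_engine(user_inputs: List[str]) -> List[str]:
    inbox: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    rendered: List[str] = []  # what the user would hear or see

    async def forward_input() -> None:
        # E->>RT: forward the user's stream to the model, chunk by chunk
        for chunk in user_inputs:
            await inbox.put(chunk)
            await asyncio.sleep(0)  # yield so output can flow while input continues
        await inbox.put(None)

    async def relay_output() -> None:
        # RT-->>E, then E-->>U: pass each streamed chunk straight through to the user
        async for chunk in fake_realtime_model(inbox):
            rendered.append(chunk)

    await asyncio.gather(forward_input(), relay_output())
    return rendered


print(asyncio.run(run_engine(["hello", "how are you"])))
```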

Advantages of Realtime

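- Lower latency: the path is shorter, so the interaction feels more natural to the user
- Smoother full duplex: when the user cuts in, the model can more easily handle the interruption internally
- More direct multimodality: well suited to mixed audio, video, and text input and output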

Trade-offs of Realtime

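- Greater dependence on the capabilities and limits of the realtime model vendor
- Harder to swap out ASR, TTS, or turn detection independently
- Cost and observability usually cannot be broken down stage by stage the way they can in a Pipeline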
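The following toy comparison makes the observability trade-off concrete. It rests on stated assumptions: the ASR, LLM, TTS, and realtime functions are hypothetical stand-ins that only transform strings, and `run_pipeline` and `run_realtime` are illustrative names. The pipeline version can time each stage separately. The realtime version wraps a single opaque call and can only report a total.

```python
import time
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[str], str]]


def run_pipeline(stages: List[Stage], data: str) -> Tuple[str, Dict[str, float]]:
    """Run each stage in order and record its wall-clock time separately."""
    timings: Dict[str, float] = {}
    for name, fn in stages:
        start = time.perf_counter()
        data = fn(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def run_realtime(model: Callable[[str], str], data: str) -> Tuple[str, Dict[str, float]]:
    """An end-to-end model is one opaque call, so only the total is observable."""
    start = time.perf_counter()
    out = model(data)
    return out, {"total": time.perf_counter() - start}


# Hypothetical stand-ins for real ASR / LLM / TTS services.
pipeline: List[Stage] = [
    ("asr", lambda audio: audio.replace("<audio>", "")),
    ("llm", lambda text: f"answer({text})"),
    ("tts", lambda text: f"<audio>{text}"),
]


def realtime_model(audio: str) -> str:
    # Hypothetical end-to-end model: same result, but no intermediate stages to inspect.
    return "<audio>answer(" + audio.replace("<audio>", "") + ")"


print(list(run_pipeline(pipeline, "<audio>hi")[1]))  # one timing entry per stage
print(list(run_realtime(realtime_model, "<audio>hi")[1]))  # a single total
```

Per-stage cost accounting would follow the same pattern. A pipeline can attribute usage to each stage, while a realtime session usually reports usage for the session as a whole.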

Smart Interruption

Realtime models usually support full duplex and interruption (barge-in) natively:

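```mermaid
sequenceDiagram
    participant U as User
    participant E as Engine
    participant RT as Realtime Model

    Note over RT: Model is producing output
    RT-->>E: Audio stream...
    E-->>U: Playback
    U->>E: User starts speaking
    E->>RT: Forward the new input
    Note over RT: Model handles the interruption internally and switches replies
    RT-->>E: New response
    E-->>U: Play the new response
```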

This approach feels more natural, but you usually see only the model's overall behavior, not the details of each intermediate stage.
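Below is a minimal engine-side barge-in sketch using asyncio task cancellation. The diagram only shows the model switching replies internally. Stopping the stale playback on the engine side is an assumption in this sketch. The helpers `play` and `barge_in_demo` are hypothetical, and the fixed trigger (the user speaks after hearing two chunks) stands in for real voice activity detection.

```python
import asyncio
from typing import List


async def play(chunks: List[str], played: List[str]) -> None:
    """E-->>U: stream response chunks to the user one at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0)  # hand control back, as real audio I/O would


async def barge_in_demo() -> List[str]:
    played: List[str] = []
    # Note over RT: the model is mid-answer, so start playing its (long) response.
    old_reply = asyncio.create_task(play([f"old-{i}" for i in range(100)], played))

    # U->>E: hypothetical VAD fires once the user has heard two chunks.
    while len(played) < 2:
        await asyncio.sleep(0)

    # Assumed engine-side step: drop the stale audio instead of letting it finish.
    old_reply.cancel()
    try:
        await old_reply
    except asyncio.CancelledError:
        pass

    # RT-->>E: stand-in for the new response the model produces after switching.
    await play(["new-0", "new-1"], played)
    return played


print(asyncio.run(barge_in_demo()))
```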

Configuration Example

```json
{
  "engine": "multimodal",
  "model": {
    "provider": "openai",
    "model": "gpt-4o-realtime-preview",
    "voice": "alloy"
  }
}
```
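The JSON above could be loaded into typed objects along these lines. This is a minimal sketch that assumes the three `model` fields shown are required. `EngineConfig`, `RealtimeModelConfig`, and `load_engine_config` are hypothetical names, not part of Realtime Agent Studio. The real loader may accept more fields or apply defaults.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RealtimeModelConfig:
    # Hypothetical typed view of the "model" object in the example config.
    provider: str
    model: str
    voice: str


@dataclass(frozen=True)
class EngineConfig:
    engine: str
    model: RealtimeModelConfig


def load_engine_config(text: str) -> EngineConfig:
    """Parse the example JSON; a missing key raises KeyError (assumed: all fields required)."""
    raw = json.loads(text)
    m = raw["model"]
    return EngineConfig(
        engine=raw["engine"],
        model=RealtimeModelConfig(provider=m["provider"], model=m["model"], voice=m["voice"]),
    )


CONFIG = """
{
  "engine": "multimodal",
  "model": {"provider": "openai", "model": "gpt-4o-realtime-preview", "voice": "alloy"}
}
"""

cfg = load_engine_config(CONFIG)
print(cfg.model.provider, cfg.model.model, cfg.model.voice)
```

Failing fast on a missing key keeps a misconfigured assistant from starting a session with an unexpected default voice or model.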
