AI-VideoAssistant/engine/docs/high_level_architecture.md
Xin Wang 4e2450e800 Refactor backend integration and service architecture
- Removed the backend client compatibility wrapper and associated methods to streamline backend integration.
- Updated session management to utilize control plane gateways and runtime configuration providers.
- Adjusted TTS service implementations to remove the EdgeTTS service and simplify service dependencies.
- Enhanced documentation to reflect changes in backend integration and service architecture.
- Updated configuration files to remove deprecated TTS provider options and clarify available settings.
2026-03-06 09:00:43 +08:00

Engine High-Level Architecture

This document describes the runtime architecture of the engine for realtime voice/text assistant interactions.

Goals

  • Low-latency duplex interaction (the user can keep speaking while the assistant responds)
  • Clear separation between transport, orchestration, and model/service integrations
  • Backend-optional runtime (works with or without an external backend)
  • Protocol-first interoperability through strict WS v1 control messages

Top-Level Components

flowchart LR
  C[Client\nWeb / Mobile / Device] <-- WS v1 + PCM --> A[FastAPI App\napp/main.py]
  A --> S[Session\ncore/session.py]
  S --> D[Duplex Pipeline\ncore/duplex_pipeline.py]

  D --> P[Processors\nVAD / EOU / Tracks]
  D --> R[Workflow Runner\ncore/workflow_runner.py]
  D --> E[Event Bus + Models\ncore/events.py + models/*]

  R --> SV[Service Layer\nservices/asr.py\nservices/llm.py\nservices/tts.py]
  R --> TE[Tool Executor\ncore/tool_executor.py]

  S --> HB[History Bridge\ncore/history_bridge.py]
  S --> BA[Control Plane Port\ncore/ports/control_plane.py]
  BA --> AD[Adapters\napp/backend_adapters.py]

  AD --> B[(External Backend API\noptional)]
  SV --> M[(ASR/LLM/TTS Providers)]

Request Lifecycle (Simplified)

  1. Client connects to /ws?assistant_id=<id> and sends session.start.
  2. App creates a Session with resolved assistant config (backend or local YAML).
  3. Binary PCM frames enter the duplex pipeline.
  4. VAD/EOU processors detect speech segments and trigger ASR finalization.
  5. ASR text is routed into workflow + LLM generation.
  6. Optional tool calls are executed either server-side or by the client, which returns the result to the engine.
  7. LLM output streams as text deltas; TTS produces audio chunks for playback.
  8. Session emits structured events (transcript.*, assistant.*, output.audio.*, error).
  9. History bridge persists conversation data asynchronously.
  10. On session.stop (or disconnect), session finalizes and drains pending writes.

Layering and Responsibilities

1) Transport / API Layer

  • Entry point: app/main.py
  • Responsibilities:
    • WebSocket lifecycle management
    • WS v1 message validation and order guarantees
    • Session creation and teardown
    • Converting raw WS frames into internal events
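
The ordering guarantee can be pictured as a small per-connection guard. The sketch below is illustrative only: the class name OrderGuard, the phase names, and the assumption that control messages carry a string "type" field are not taken from the engine source; the authoritative message shapes are in the WS v1 schema.

```python
from dataclasses import dataclass


class ProtocolError(Exception):
    """Raised when a control message is malformed or arrives out of order."""


@dataclass
class OrderGuard:
    # Hypothetical phases; the real engine may track more states.
    phase: str = "connected"

    def accept(self, message: dict) -> str:
        """Validate one control message and return the resulting phase."""
        msg_type = message.get("type")  # assumed field name
        if not isinstance(msg_type, str):
            raise ProtocolError("control message needs a string 'type'")
        if self.phase == "connected":
            # Nothing is allowed before session.start.
            if msg_type != "session.start":
                raise ProtocolError(f"expected session.start, got {msg_type}")
            self.phase = "started"
        elif self.phase == "started":
            if msg_type == "session.start":
                raise ProtocolError("duplicate session.start")
            if msg_type == "session.stop":
                self.phase = "stopped"
        else:
            raise ProtocolError("session already stopped")
        return self.phase
```

Rejecting at this layer keeps malformed or out-of-order traffic away from the session and pipeline code.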

2) Session + Orchestration Layer

  • Core: core/session.py, core/duplex_pipeline.py, core/conversation.py
  • Responsibilities:
    • Per-session state machine
    • Turn boundaries and interruption/cancel handling
    • Event sequencing (seq) and envelope consistency
    • Bridging input/output tracks (audio_in, audio_out, control)
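
As a rough illustration of event sequencing, a per-session counter can stamp every outgoing event with a monotonically increasing seq. Only the seq field is named in this document; the class name EventSequencer and the other envelope fields (sessionId, timestamp, data) are assumptions made for the sketch.

```python
import itertools
import time
from typing import Any


class EventSequencer:
    """Stamps outgoing events with a per-session, strictly increasing seq."""

    def __init__(self, session_id: str) -> None:
        self._session_id = session_id
        self._counter = itertools.count(1)

    def wrap(self, event_type: str, data: dict[str, Any]) -> dict[str, Any]:
        # Envelope field names other than "seq" are hypothetical.
        return {
            "type": event_type,
            "seq": next(self._counter),
            "sessionId": self._session_id,
            "timestamp": time.time(),
            "data": data,
        }
```

With one sequencer per session, a client can detect gaps or reordering by checking that seq increases by one for each event it receives.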

3) Processing Layer

  • Modules: processors/vad.py, processors/eou.py, processors/tracks.py
  • Responsibilities:
    • Speech activity detection
    • End-of-utterance decisioning
    • Track-oriented routing and timing-sensitive pre/post processing
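
One common way to make an end-of-utterance decision is to count trailing silence after speech has been observed. The sketch below shows that technique under stated assumptions: it is not the engine's processors/eou.py, and the class name, the 600 ms threshold, and the 20 ms frame size are illustrative defaults.

```python
from dataclasses import dataclass, field


@dataclass
class SilenceEou:
    silence_ms_threshold: int = 600  # assumed default, not from the engine
    frame_ms: int = 20               # assumed frame duration
    _in_speech: bool = field(default=False, init=False)
    _silence_ms: int = field(default=0, init=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one per-frame VAD decision; return True when the utterance ends."""
        if is_speech:
            self._in_speech = True
            self._silence_ms = 0
            return False
        if not self._in_speech:
            # Silence before any speech never ends an utterance.
            return False
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.silence_ms_threshold:
            self._in_speech = False
            self._silence_ms = 0
            return True
        return False
```

A shorter threshold lowers response latency but cuts off speakers who pause mid-sentence, so the value is usually tuned per deployment.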

4) Workflow + Tooling Layer

  • Modules: core/workflow_runner.py, core/tool_executor.py
  • Responsibilities:
    • Assistant workflow execution
    • Tool call planning/execution and timeout handling
    • Tool result normalization into protocol events
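
Timeout handling and result normalization can be sketched with asyncio.wait_for. The function name run_tool and the result shape (tool, status, result) are hypothetical; the real protocol events are defined by the WS v1 schema.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolFn = Callable[[dict[str, Any]], Awaitable[Any]]


async def run_tool(
    name: str, fn: ToolFn, args: dict[str, Any], timeout_s: float = 5.0
) -> dict[str, Any]:
    """Run one tool call and normalize every outcome into the same shape."""
    try:
        result = await asyncio.wait_for(fn(args), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The tool coroutine is cancelled by wait_for when the timeout fires.
        return {"tool": name, "status": "timeout", "result": None}
    except Exception as exc:
        return {"tool": name, "status": "error", "result": str(exc)}
    return {"tool": name, "status": "ok", "result": result}
```

Because every outcome maps to the same shape, the caller can turn it into a protocol event without special-casing failures.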

5) Service Integration Layer

  • Modules: services/*
  • Responsibilities:
    • Abstracting ASR/LLM/TTS provider differences
    • Streaming token/audio adaptation
    • Provider-specific adapters (OpenAI-compatible, DashScope, SiliconFlow, etc.)
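
A provider abstraction of this kind is often expressed as a structural interface that yields output incrementally. The sketch below assumes a hypothetical TTSService protocol with a synthesize method; the names are not taken from services/tts.py, and FakeTTS is a stand-in that emits silent bytes rather than a real provider adapter.

```python
import asyncio
from typing import AsyncIterator, Protocol


class TTSService(Protocol):
    """Hypothetical interface: any provider that streams audio chunks."""

    def synthesize(self, text: str) -> AsyncIterator[bytes]: ...


class FakeTTS:
    """Stand-in provider: one chunk of silence per word."""

    def __init__(self, chunk_bytes: int = 320) -> None:
        self.chunk_bytes = chunk_bytes

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        for _ in text.split():
            yield b"\x00" * self.chunk_bytes


async def collect(tts: TTSService, text: str) -> list[bytes]:
    # Callers depend only on the interface, so providers are swappable.
    return [chunk async for chunk in tts.synthesize(text)]
```

The pipeline can then forward each chunk as soon as it arrives instead of waiting for the full utterance.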

6) Backend Integration Layer (Optional)

  • Port: core/ports/control_plane.py
  • Adapters: app/backend_adapters.py
  • Responsibilities:
    • Fetching assistant runtime config
    • Persisting call/session metadata and history
    • Supporting BACKEND_MODE=auto|http|disabled
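
The three BACKEND_MODE values suggest a small factory that picks an adapter behind the port. The sketch below is an assumption-heavy illustration: the class and method names are hypothetical, and the reading of auto as "use HTTP when a backend URL is configured, otherwise run without a backend" is a guess, not confirmed by this document.

```python
from typing import Optional, Protocol


class ControlPlanePort(Protocol):
    """Hypothetical port; the real one lives in core/ports/control_plane.py."""

    def fetch_assistant_config(self, assistant_id: str) -> Optional[dict]: ...


class NullControlPlane:
    """Used when no backend is available; callers fall back to local config."""

    def fetch_assistant_config(self, assistant_id: str) -> Optional[dict]:
        return None


class HttpControlPlane:
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def fetch_assistant_config(self, assistant_id: str) -> Optional[dict]:
        raise NotImplementedError("HTTP call omitted in this sketch")


def build_control_plane(mode: str, backend_url: Optional[str]) -> ControlPlanePort:
    if mode == "disabled":
        return NullControlPlane()
    if mode == "http":
        if not backend_url:
            raise ValueError("BACKEND_MODE=http requires a backend URL")
        return HttpControlPlane(backend_url)
    if mode == "auto":
        # Assumed semantics: prefer HTTP when a URL is configured.
        return HttpControlPlane(backend_url) if backend_url else NullControlPlane()
    raise ValueError(f"unknown BACKEND_MODE: {mode}")
```

Session code that receives a ControlPlanePort never needs to know which mode was selected.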

7) Persistence / Reliability Layer

  • Module: core/history_bridge.py
  • Responsibilities:
    • Non-blocking queue-based history writes
    • Retry with backoff on backend failures
    • Best-effort drain on session finalize
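
These three responsibilities fit a familiar pattern: an asyncio queue drained by a background worker, with exponential backoff on failures and a bounded wait at shutdown. The sketch below shows that pattern under assumed names (HistoryBridge, enqueue, drain) and assumed retry parameters; it is not the code in core/history_bridge.py.

```python
import asyncio
from typing import Awaitable, Callable


class HistoryBridge:
    def __init__(
        self,
        write: Callable[[dict], Awaitable[None]],
        max_retries: int = 3,
        base_delay_s: float = 0.05,
    ) -> None:
        self._write = write
        self._max_retries = max_retries
        self._base_delay_s = base_delay_s
        self._queue: asyncio.Queue = asyncio.Queue()
        self.dropped = 0  # records given up on after all retries

    def enqueue(self, record: dict) -> None:
        # Never blocks the realtime path; the queue is unbounded in this sketch.
        self._queue.put_nowait(record)

    async def run(self) -> None:
        """Background worker; run it as a task for the session's lifetime."""
        while True:
            record = await self._queue.get()
            try:
                await self._write_with_retry(record)
            finally:
                self._queue.task_done()

    async def _write_with_retry(self, record: dict) -> None:
        for attempt in range(self._max_retries + 1):
            try:
                await self._write(record)
                return
            except Exception:
                if attempt == self._max_retries:
                    self.dropped += 1  # fail soft: count and move on
                    return
                await asyncio.sleep(self._base_delay_s * 2 ** attempt)

    async def drain(self, timeout_s: float = 1.0) -> bool:
        """Best-effort wait for pending writes; False if the timeout expires."""
        try:
            await asyncio.wait_for(self._queue.join(), timeout=timeout_s)
            return True
        except asyncio.TimeoutError:
            return False
```

The bridge should be created inside the running event loop, and drain would be called from session finalization so that a slow backend delays teardown by at most the timeout.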

Key Design Principles

  • Dependency inversion for backend: session/pipeline depend on port interfaces, not concrete clients.
  • Streaming-first: text/audio are emitted incrementally to minimize perceived latency.
  • Fail-soft behavior: backend/history failures should not block realtime interaction paths.
  • Protocol strictness: WS v1 rejects malformed/out-of-order control traffic early.
  • Explicit event model: all client-observable state changes are represented as typed events.

Configuration Boundaries

  • Runtime environment settings live in app/config.py.
  • Assistant-specific behavior is loaded by assistant_id:
    • backend mode: from backend API
    • engine-only mode: local engine/config/agents/<assistant_id>.yaml
  • Client-provided metadata.overrides and dynamicVariables can alter runtime behavior within protocol constraints.
  • WS protocol: engine/docs/ws_v1_schema.md
  • Backend integration details: engine/docs/backend_integration.md
  • Duplex interaction diagram: engine/docs/duplex_interaction.svg
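
The engine-only lookup by assistant_id and the constrained client overrides described under Configuration Boundaries can be sketched as two small helpers. The function names, the ID validation rule, and the ALLOWED_OVERRIDES set are assumptions for illustration; the keys a client may actually override are defined by the WS v1 schema, and YAML parsing is left out.

```python
from pathlib import Path
from typing import Any

CONFIG_ROOT = Path("engine/config/agents")

# Hypothetical whitelist; the real constraints come from the protocol.
ALLOWED_OVERRIDES = {"voice", "language", "systemPrompt"}


def local_config_path(assistant_id: str, root: Path = CONFIG_ROOT) -> Path:
    """Map an assistant_id to its local YAML file, rejecting path tricks."""
    bad = (
        not assistant_id
        or assistant_id in {".", ".."}
        or any(sep in assistant_id for sep in "/\\")
    )
    if bad:
        raise ValueError(f"invalid assistant_id: {assistant_id!r}")
    return root / f"{assistant_id}.yaml"


def apply_overrides(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base with only whitelisted override keys applied."""
    merged = dict(base)
    for key, value in overrides.items():
        if key in ALLOWED_OVERRIDES:
            merged[key] = value
    return merged
```

Validating the ID before building the path matters because assistant_id arrives from the client in the connection URL.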