AI-VideoAssistant/engine/docs/high_level_architecture.md

# Engine High-Level Architecture

This document describes the runtime architecture of `engine` for realtime voice/text assistant interactions.

## Goals

- Low-latency duplex interaction (the user can keep speaking while the assistant responds)
- Clear separation between transport, orchestration, and model/service integrations
- Backend-optional runtime (works with or without an external backend)
- Protocol-first interoperability through strict WS v1 control messages

## Top-Level Components

```mermaid
flowchart LR
  C[Client\nWeb / Mobile / Device] <-- WS v1 + PCM --> A[FastAPI App\napp/main.py]
  A --> S[Session\ncore/session.py]
  S --> D[Duplex Pipeline\ncore/duplex_pipeline.py]

  D --> P[Processors\nVAD / EOU / Tracks]
  D --> R[Workflow Runner\ncore/workflow_runner.py]
  D --> E[Event Bus + Models\ncore/events.py + models/*]

  R --> SV[Service Layer\nservices/asr.py\nservices/llm.py\nservices/tts.py]
  R --> TE[Tool Executor\ncore/tool_executor.py]

  S --> HB[History Bridge\ncore/history_bridge.py]
  S --> BA[Control Plane Port\ncore/ports/control_plane.py]
  BA --> AD[Adapters\napp/backend_adapters.py]

  AD --> B[(External Backend API\noptional)]
  SV --> M[(ASR/LLM/TTS Providers)]
```

## Request Lifecycle (Simplified)

1. Client connects to `/ws?assistant_id=<id>` and sends `session.start`.
2. The app creates a `Session` with the resolved assistant config (from the backend or from local YAML).
3. Binary PCM frames enter the duplex pipeline.
4. `VAD`/`EOU` processors detect speech segments and trigger ASR finalization.
5. The finalized ASR text is routed into the workflow runner and LLM generation.
6. Optional tool calls are executed either server-side or by the client, which returns the result.
7. LLM output streams as text deltas; TTS produces audio chunks for playback.
8. Session emits structured events (`transcript.*`, `assistant.*`, `output.audio.*`, `error`).
9. History bridge persists conversation data asynchronously.
10. On `session.stop` (or disconnect), session finalizes and drains pending writes.
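
Step 1 of the lifecycle can be sketched from the client side as follows. This is a minimal illustration only: `build_ws_url` and `control_message` are hypothetical helpers, and the JSON envelope with a `type` field is an assumption; the authoritative message shapes are in `engine/docs/ws_v1_schema.md`.

```python
import json
from urllib.parse import urlencode


def build_ws_url(base: str, assistant_id: str) -> str:
    """Build the /ws endpoint URL; urlencode escapes unusual ids."""
    query = urlencode({"assistant_id": assistant_id})
    return f"{base.rstrip('/')}/ws?{query}"


def control_message(msg_type: str, **fields) -> str:
    # Assumed envelope: a JSON object carrying a "type" field plus payload fields.
    return json.dumps({"type": msg_type, **fields})


url = build_ws_url("ws://localhost:8000", "demo")
start = control_message("session.start")
stop = control_message("session.stop")
```

A client would open `url`, send `start` as a text frame, stream binary PCM frames, and finish with `stop`.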

## Layering and Responsibilities

### 1) Transport / API Layer

- Entry point: `app/main.py`
- Responsibilities:
  - WebSocket lifecycle management
  - WS v1 message validation and ordering enforcement
  - Session creation and teardown
  - Converting raw WS frames into internal events
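
The ordering enforcement listed above could look roughly like the guard below. It is a sketch under assumptions, not the code in `app/main.py`: `OrderGuard`, `WsState`, and `ProtocolError` are hypothetical names, and the only ordering rules modeled are the ones this document implies (`session.start` comes first, nothing follows `session.stop`).

```python
from enum import Enum


class WsState(Enum):
    CONNECTED = "connected"  # socket open, no session.start yet
    STARTED = "started"      # session.start accepted
    STOPPED = "stopped"      # session.stop accepted


class ProtocolError(Exception):
    """Raised for malformed or out-of-order control messages."""


class OrderGuard:
    def __init__(self) -> None:
        self.state = WsState.CONNECTED

    def accept(self, msg: dict) -> None:
        msg_type = msg.get("type")
        if not isinstance(msg_type, str):
            raise ProtocolError("missing or non-string 'type'")
        if self.state is WsState.STOPPED:
            raise ProtocolError(f"{msg_type} received after session.stop")
        if self.state is WsState.CONNECTED:
            if msg_type != "session.start":
                raise ProtocolError(f"{msg_type} received before session.start")
            self.state = WsState.STARTED
            return
        # From here on the session is started.
        if msg_type == "session.start":
            raise ProtocolError("duplicate session.start")
        if msg_type == "session.stop":
            self.state = WsState.STOPPED
```

Rejecting bad traffic at this layer keeps the session and pipeline code free of defensive checks.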

### 2) Session + Orchestration Layer

- Core: `core/session.py`, `core/duplex_pipeline.py`, `core/conversation.py`
- Responsibilities:
  - Per-session state machine
  - Turn boundaries and interruption/cancel handling
  - Event sequencing (`seq`) and envelope consistency
  - Bridging input/output tracks (`audio_in`, `audio_out`, `control`)
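
Event sequencing can be illustrated with a small per-session counter. The envelope fields other than `seq` (`session_id`, `ts_ms`, `data`) and the class name `EventSequencer` are assumptions made for this sketch, as are the concrete event names used in the example.

```python
import itertools
import time
from typing import Any, Dict, Optional


class EventSequencer:
    """Stamps each outgoing event with a per-session, strictly increasing seq."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self._next_seq = itertools.count(1)

    def wrap(self, event_type: str, data: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Every event gets the same set of keys so clients can parse uniformly.
        return {
            "type": event_type,
            "seq": next(self._next_seq),
            "session_id": self.session_id,
            "ts_ms": int(time.time() * 1000),
            "data": data or {},
        }


sequencer = EventSequencer("sess-1")
first = sequencer.wrap("transcript.delta", {"text": "hel"})
second = sequencer.wrap("transcript.final", {"text": "hello"})
```

With a monotonic `seq`, a client can detect dropped or reordered events without relying on timestamps.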

### 3) Processing Layer

- Modules: `processors/vad.py`, `processors/eou.py`, `processors/tracks.py`
- Responsibilities:
  - Speech activity detection
  - End-of-utterance detection
  - Track-oriented routing and timing-sensitive pre/post processing
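
To make the VAD/EOU interplay concrete, here is a deliberately simple stand-in: an energy threshold plays the role of VAD, and a silence timer decides end of utterance. The real processors may well use a model-based VAD; the class name, thresholds, and the 16-bit little-endian mono PCM format are all assumptions of this sketch.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class EndOfUtterance:
    def __init__(self, energy_threshold: float = 500.0, silence_ms: int = 600, frame_ms: int = 20) -> None:
        self.energy_threshold = energy_threshold
        self.silence_ms = silence_ms
        self.frame_ms = frame_ms
        self._in_speech = False
        self._silence_acc = 0

    def push(self, frame: bytes) -> bool:
        """Feed one frame; return True when an utterance has just ended."""
        if rms(frame) >= self.energy_threshold:
            self._in_speech = True
            self._silence_acc = 0
            return False
        if not self._in_speech:
            return False  # leading silence never triggers an end of utterance
        self._silence_acc += self.frame_ms
        if self._silence_acc >= self.silence_ms:
            self._in_speech = False
            self._silence_acc = 0
            return True
        return False
```

A `True` return is the point at which the pipeline would ask the ASR service to finalize the current segment.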

### 4) Workflow + Tooling Layer

- Modules: `core/workflow_runner.py`, `core/tool_executor.py`
- Responsibilities:
  - Assistant workflow execution
  - Tool call planning/execution and timeout handling
  - Tool result normalization into protocol events
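
Timeout handling and result normalization can be sketched with `asyncio.wait_for`. The function name `run_tool`, the event type `assistant.tool_result`, and its fields are hypothetical; only the general pattern (bound the call, always return a well-formed event) is the point.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

ToolFn = Callable[[Dict[str, Any]], Awaitable[Any]]


async def run_tool(name: str, fn: ToolFn, args: Dict[str, Any], timeout_s: float = 5.0) -> Dict[str, Any]:
    """Run one tool call and normalize success, timeout, and failure into one shape."""
    event: Dict[str, Any] = {"type": "assistant.tool_result", "name": name}
    try:
        result = await asyncio.wait_for(fn(args), timeout=timeout_s)
        event.update(ok=True, result=result)
    except asyncio.TimeoutError:
        event.update(ok=False, error="timeout")
    except Exception as exc:  # tool bugs must not crash the session
        event.update(ok=False, error=f"{type(exc).__name__}: {exc}")
    return event
```

Because every outcome maps to the same event shape, the LLM turn can continue with an error result instead of stalling on a hung tool.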

### 5) Service Integration Layer

- Modules: `services/*`
- Responsibilities:
  - Abstracting ASR/LLM/TTS provider differences
  - Streaming token/audio adaptation
  - Provider-specific adapters (OpenAI-compatible, DashScope, SiliconFlow, etc.)
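
One way to abstract provider differences is a structural interface that every adapter satisfies. The sketch below assumes a hypothetical `LLMService` protocol with a single `stream` method; `EchoLLM` is a fake provider for illustration, not one of the real adapters.

```python
import asyncio
from typing import AsyncIterator, List, Protocol


class LLMService(Protocol):
    """What the workflow runner needs from any LLM provider."""

    def stream(self, prompt: str) -> AsyncIterator[str]: ...


class EchoLLM:
    """Fake provider that streams the prompt back word by word."""

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for word in prompt.split():
            await asyncio.sleep(0)  # yield control, as a network read would
            yield word + " "


async def collect(llm: LLMService, prompt: str) -> str:
    """Consume text deltas as they arrive and join them."""
    parts: List[str] = []
    async for delta in llm.stream(prompt):
        parts.append(delta)
    return "".join(parts)
```

Callers written against `LLMService` never import a concrete provider, so swapping adapters is a configuration change rather than a code change.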

### 6) Backend Integration Layer (Optional)

- Port: `core/ports/control_plane.py`
- Adapters: `app/backend_adapters.py`
- Responsibilities:
  - Fetching assistant runtime config
  - Persisting call/session metadata and history
  - Supporting `BACKEND_MODE=auto|http|disabled`
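
The `BACKEND_MODE` switch might be resolved along these lines. The semantics of `auto` (use HTTP when a backend URL is configured, otherwise run engine-only), the default value, and the `BACKEND_URL` variable name are assumptions of this sketch and are not confirmed by this document.

```python
import os
from typing import Mapping, Optional

VALID_MODES = {"auto", "http", "disabled"}


def resolve_backend_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the effective mode: 'http' or 'disabled'."""
    env = os.environ if env is None else env
    mode = env.get("BACKEND_MODE", "auto").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"BACKEND_MODE must be one of {sorted(VALID_MODES)}, got {mode!r}")
    has_url = bool(env.get("BACKEND_URL"))  # hypothetical variable name
    if mode == "auto":
        return "http" if has_url else "disabled"
    if mode == "http" and not has_url:
        raise ValueError("BACKEND_MODE=http requires BACKEND_URL")
    return mode
```

Resolving the mode once at startup lets the app pick an adapter for the control plane port before any session is created.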

### 7) Persistence / Reliability Layer

- Module: `core/history_bridge.py`
- Responsibilities:
  - Non-blocking queue-based history writes
  - Retry with backoff on backend failures
  - Best-effort drain on session finalize
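
The three responsibilities above map naturally onto an `asyncio.Queue` with a background worker. The following is a minimal sketch, not the contents of `core/history_bridge.py`: the class shape, parameter names, and the choice to drop a record after the final retry are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class HistoryBridge:
    def __init__(self, write: Callable[[dict], Awaitable[None]],
                 max_retries: int = 3, base_delay_s: float = 0.1, max_queue: int = 1000) -> None:
        self._write = write
        self._max_retries = max_retries
        self._base_delay_s = base_delay_s
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
        self._worker: Optional[asyncio.Task] = None

    def start(self) -> None:
        self._worker = asyncio.create_task(self._run())

    def enqueue(self, record: dict) -> bool:
        """Non-blocking: returns False instead of waiting when the queue is full."""
        try:
            self._queue.put_nowait(record)
            return True
        except asyncio.QueueFull:
            return False

    async def _run(self) -> None:
        while True:
            record = await self._queue.get()
            try:
                await self._write_with_retry(record)
            finally:
                self._queue.task_done()

    async def _write_with_retry(self, record: dict) -> None:
        for attempt in range(self._max_retries):
            try:
                await self._write(record)
                return
            except Exception:
                if attempt < self._max_retries - 1:
                    # Exponential backoff: base, 2*base, 4*base, ...
                    await asyncio.sleep(self._base_delay_s * 2 ** attempt)
        # All attempts failed: the record is dropped (fail-soft).

    async def drain(self, timeout_s: float = 2.0) -> bool:
        """Best-effort: wait for pending writes up to timeout_s, then stop the worker."""
        try:
            await asyncio.wait_for(self._queue.join(), timeout=timeout_s)
            return True
        except asyncio.TimeoutError:
            return False
        finally:
            if self._worker is not None:
                self._worker.cancel()
```

The realtime path only ever calls `enqueue`, which never awaits, so a slow or failing backend cannot add latency to a turn.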

## Key Design Principles

- Dependency inversion for backend: session/pipeline depend on port interfaces, not concrete clients.
- Streaming-first: text/audio are emitted incrementally to minimize perceived latency.
- Fail-soft behavior: backend/history failures should not block realtime interaction paths.
- Protocol strictness: WS v1 rejects malformed/out-of-order control traffic early.
- Explicit event model: all client-observable state changes are represented as typed events.

## Configuration Boundaries

- Runtime environment settings live in `app/config.py`.
- Assistant-specific behavior is loaded by `assistant_id`:
  - backend mode: from backend API
  - engine-only mode: local `engine/config/agents/<assistant_id>.yaml`
- Client-provided `metadata.overrides` and `dynamicVariables` can alter runtime behavior within protocol constraints.
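
For engine-only mode, locating the local config and applying client overrides could be sketched as below. The id validation pattern and the allow-list approach to overrides are assumptions (the document only says overrides act "within protocol constraints"), and the key names in the example are hypothetical.

```python
import re
from pathlib import Path
from typing import Any, Dict, Iterable

AGENTS_DIR = Path("engine/config/agents")
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def local_config_path(assistant_id: str) -> Path:
    """Map an assistant_id to its YAML file, refusing ids that could escape the directory."""
    if not _SAFE_ID.fullmatch(assistant_id):
        raise ValueError(f"invalid assistant_id: {assistant_id!r}")
    return AGENTS_DIR / f"{assistant_id}.yaml"


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any], allowed: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of base with only allow-listed override keys applied."""
    allowed_keys = set(allowed)
    merged = dict(base)
    for key, value in overrides.items():
        if key in allowed_keys:
            merged[key] = value
    return merged


config = apply_overrides(
    {"voice": "default", "system_prompt": "You are helpful."},
    {"voice": "alto", "system_prompt": "ignore previous rules"},
    allowed=["voice"],
)
```

Here `config["voice"]` becomes `"alto"` while `system_prompt` keeps its configured value, because only `voice` is on the allow-list.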

## Related Docs

- WS protocol: `engine/docs/ws_v1_schema.md`
- Backend integration details: `engine/docs/backend_integration.md`
- Duplex interaction diagram: `engine/docs/duplex_interaction.svg`