Add backend api and engine

Xin Wang
2026-02-06 14:01:34 +08:00
parent 590014e821
commit d5c1ab34b3
61 changed files with 10351 additions and 1 deletions

@@ -0,0 +1,96 @@
<svg width="1200" height="620" viewBox="0 0 1200 620" xmlns="http://www.w3.org/2000/svg">
<defs>
<style>
.box { fill:#11131a; stroke:#3a3f4b; stroke-width:1.2; rx:10; ry:10; }
.title { font: 600 14px 'Arial'; fill:#f2f3f7; }
.text { font: 12px 'Arial'; fill:#c8ccd8; }
.arrow { stroke:#7aa2ff; stroke-width:1.6; marker-end:url(#arrow); fill:none; }
.arrow2 { stroke:#2dd4bf; stroke-width:1.6; marker-end:url(#arrow); fill:none; }
.arrow3 { stroke:#ff6b6b; stroke-width:1.6; marker-end:url(#arrow); fill:none; }
.label { font: 11px 'Arial'; fill:#9aa3b2; }
</style>
<marker id="arrow" markerWidth="8" markerHeight="8" refX="7" refY="4" orient="auto">
<path d="M0,0 L8,4 L0,8 Z" fill="#7aa2ff"/>
</marker>
</defs>
<rect x="40" y="40" width="250" height="120" class="box"/>
<text x="60" y="70" class="title">Web Client</text>
<text x="60" y="95" class="text">WS JSON commands</text>
<text x="60" y="115" class="text">WS binary PCM audio</text>
<rect x="350" y="40" width="250" height="120" class="box"/>
<text x="370" y="70" class="title">FastAPI /ws</text>
<text x="370" y="95" class="text">Session + Transport</text>
<rect x="660" y="40" width="250" height="120" class="box"/>
<text x="680" y="70" class="title">DuplexPipeline</text>
<text x="680" y="95" class="text">process_audio / process_text</text>
<rect x="920" y="40" width="240" height="120" class="box"/>
<text x="940" y="70" class="title">ConversationManager</text>
<text x="940" y="95" class="text">turns + state</text>
<rect x="660" y="200" width="180" height="100" class="box"/>
<text x="680" y="230" class="title">VADProcessor</text>
<text x="680" y="255" class="text">speech/silence</text>
<rect x="860" y="200" width="180" height="100" class="box"/>
<text x="880" y="230" class="title">EOU Detector</text>
<text x="880" y="255" class="text">end-of-utterance</text>
<rect x="1060" y="200" width="120" height="100" class="box"/>
<text x="1075" y="230" class="title">ASR</text>
<text x="1075" y="255" class="text">transcripts</text>
<rect x="920" y="350" width="240" height="110" class="box"/>
<text x="940" y="380" class="title">LLM (stream)</text>
<text x="940" y="405" class="text">llmResponse events</text>
<rect x="660" y="350" width="220" height="110" class="box"/>
<text x="680" y="380" class="title">TTS (stream)</text>
<text x="680" y="405" class="text">PCM audio</text>
<rect x="40" y="350" width="250" height="110" class="box"/>
<text x="60" y="380" class="title">Web Client</text>
<text x="60" y="405" class="text">audio playback + UI</text>
<path d="M290 80 L350 80" class="arrow"/>
<text x="300" y="70" class="label">JSON / PCM</text>
<path d="M600 80 L660 80" class="arrow"/>
<text x="615" y="70" class="label">dispatch</text>
<path d="M910 80 L920 80" class="arrow"/>
<text x="880" y="70" class="label">turn mgmt</text>
<path d="M750 160 L750 200" class="arrow"/>
<text x="705" y="190" class="label">audio chunks</text>
<path d="M840 250 L860 250" class="arrow"/>
<text x="835" y="240" class="label">vad status</text>
<path d="M1040 250 L1060 250" class="arrow"/>
<text x="1010" y="240" class="label">audio buffer</text>
<path d="M950 300 L950 350" class="arrow2"/>
<text x="930" y="340" class="label">EOU -> LLM</text>
<path d="M920 405 L880 405" class="arrow2"/>
<text x="870" y="395" class="label">text stream</text>
<path d="M660 405 L290 405" class="arrow2"/>
<text x="430" y="395" class="label">PCM audio</text>
<path d="M660 450 L350 450" class="arrow"/>
<text x="420" y="440" class="label">events: trackStart/End</text>
<path d="M350 450 L290 450" class="arrow"/>
<text x="315" y="440" class="label">UI updates</text>
<path d="M750 200 L750 160" class="arrow3"/>
<text x="700" y="145" class="label">barge-in detection</text>
<path d="M760 170 L920 170" class="arrow3"/>
<text x="820" y="160" class="label">interrupt event + cancel</text>
</svg>

engine/docs/proejct_todo.md Normal file

@@ -0,0 +1,187 @@
# OmniSense: 12-Week Sprint Board + Tech Stack (Python Backend) — TODO
## Scope
- [ ] Build a realtime AI SaaS (OmniSense) focused on web-first audio + video with WebSocket + WebRTC endpoints
- [ ] Deliver assistant builder, tool execution, observability, evals, optional telephony later
- [ ] Keep scope aligned to 2-person team, self-hosted services
---
## Sprint Board (12 weeks, 2-week sprints)
Team assumption: 2 engineers. Scope prioritized to web-first audio + video, with BYO-SFU adapters.
### Sprint 1 (Weeks 1-2): Realtime Core MVP (WebSocket + WebRTC Audio)
- Deliverables
  - [ ] WebSocket transport: audio in/out streaming (1:1)
  - [ ] WebRTC transport: audio in/out streaming (1:1)
  - [ ] Adapter contract wired into runtime (transport-agnostic session core)
  - [ ] ASR → LLM → TTS pipeline, streaming both directions
  - [ ] Basic session state (start/stop, silence timeout)
  - [ ] Transcript persistence
- Acceptance criteria
  - [ ] < 1.5s median round-trip for short responses
  - [ ] Stable streaming for 10+ minute session
### Sprint 2 (Weeks 3-4): Video + Realtime UX
- Deliverables
  - [ ] WebRTC video capture + streaming (assistant can “see” frames)
  - [ ] WebSocket video streaming for local/dev mode
  - [ ] Low-latency UI: push-to-talk, live captions, speaking indicator
  - [ ] Recording + transcript storage (web sessions)
- Acceptance criteria
  - [ ] Video < 2.5s end-to-end latency for analysis
  - [ ] Audio quality acceptable (no clipping, jitter handling)
### Sprint 3 (Weeks 5-6): Assistant Builder v1
- Deliverables
  - [ ] Assistant schema + versioning
  - [ ] UI: Model/Voice/Transcriber/Tools/Video/Transport tabs
  - [ ] “Test/Chat/Talk to Assistant” (web)
- Acceptance criteria
  - [ ] Create/publish assistant and run a live web session
  - [ ] All config changes tracked by version
### Sprint 4 (Weeks 7-8): Tooling + Structured Outputs
- Deliverables
  - [ ] Tool registry + custom HTTP tools
  - [ ] Tool auth secrets management
  - [ ] Structured outputs (JSON extraction)
- Acceptance criteria
  - [ ] Tool calls executed with retries/timeouts
  - [ ] Structured JSON stored per call/session
### Sprint 5 (Weeks 9-10): Observability + QA + Dev Platform
- Deliverables
  - [ ] Session logs + chat logs + media logs
  - [ ] Evals engine + test suites
  - [ ] Basic analytics dashboard
  - [ ] Public WebSocket API spec + message schema
  - [ ] JS/TS SDK (connect, send audio/video, receive transcripts)
- Acceptance criteria
  - [ ] Reproducible test suite runs
  - [ ] Log filters by assistant/time/status
  - [ ] SDK demo app runs end-to-end
### Sprint 6 (Weeks 11-12): SaaS Hardening
- Deliverables
  - [ ] Org/RBAC + API keys + rate limits
  - [ ] Usage metering + credits
  - [ ] Stripe billing integration
  - [ ] Self-hosted DB ops (migrations, backup/restore, monitoring)
- Acceptance criteria
  - [ ] Metered usage per org
  - [ ] Credits decrement correctly
  - [ ] Optional telephony spike documented (defer build)
  - [ ] Enterprise adapter guide published (BYO-SFU)
---
## Tech Stack by Service (Self-Hosted, Web-First)
### 1) Transport Gateway (Realtime)
- [ ] WebRTC (browser) + WebSocket (lightweight/dev) protocols
- [ ] BYO-SFU adapter (enterprise) + LiveKit optional adapter + WS transport server
- [ ] Python core (FastAPI + asyncio) + Node.js mediasoup adapters when needed
- [ ] Media: Opus/VP8, jitter buffer, VAD, echo cancellation
- [ ] Storage: S3-compatible (MinIO) for recordings
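
As a starting point for the VAD item above, a minimal energy-threshold sketch is shown here. It is not the engine's VADProcessor: the function names, the `threshold` value, and the assumption of native-endian 16-bit mono PCM frames are all hypothetical, and a production VAD would likely use a trained model instead.

```python
import array
import math


def frame_rms(pcm16_bytes):
    """Root-mean-square amplitude of one frame of 16-bit PCM (native byte order assumed)."""
    samples = array.array("h")
    samples.frombytes(pcm16_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(pcm16_bytes, threshold=500.0):
    """Classify a frame as speech when its RMS energy reaches the threshold.

    The default threshold is an illustrative guess and needs tuning per microphone.
    """
    return frame_rms(pcm16_bytes) >= threshold
```

A frame of all-zero bytes is classified as silence, while a frame of alternating ±4000 samples has an RMS of 4000 and is classified as speech.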
### 2) ASR Service
- [ ] Whisper (self-hosted) baseline
- [ ] gRPC/WebSocket streaming transport
- [ ] Python native service
- [ ] Optional cloud provider fallback (later)
### 3) TTS Service
- [ ] Piper or Coqui TTS (self-hosted)
- [ ] gRPC/WebSocket streaming transport
- [ ] Python native service
- [ ] Redis cache for common phrases
### 4) LLM Orchestrator
- [ ] Self-hosted (vLLM + open model)
- [ ] Python (FastAPI + asyncio)
- [ ] Streaming, tool calling, JSON mode
- [ ] Safety filters + prompt templates
### 5) Assistant Config Service
- [ ] PostgreSQL
- [ ] Python (SQLAlchemy or SQLModel)
- [ ] Versioning, publish/rollback
### 6) Session Service
- [ ] PostgreSQL + Redis
- [ ] Python
- [ ] State machine, timeouts, events
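
One possible shape for the session state machine with a silence timeout is sketched below. The state names, the transition table, the 30-second default, and the `Session` class itself are assumptions for illustration; the TODO only names start/stop, timeouts, and events. Time is passed in explicitly so the logic can be tested without a clock.

```python
from dataclasses import dataclass, field
from enum import Enum


class SessionState(Enum):
    IDLE = "idle"
    ACTIVE = "active"
    STOPPED = "stopped"


# Allowed transitions (hypothetical; not taken from the engine code).
_TRANSITIONS = {
    SessionState.IDLE: {SessionState.ACTIVE, SessionState.STOPPED},
    SessionState.ACTIVE: {SessionState.STOPPED},
    SessionState.STOPPED: set(),
}


@dataclass
class Session:
    silence_timeout_s: float = 30.0  # assumed default
    state: SessionState = SessionState.IDLE
    last_audio_at: float = 0.0
    events: list = field(default_factory=list)

    def _move(self, new_state, now, reason):
        if new_state not in _TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        # Record every transition so it can later be emitted as a session event.
        self.events.append((now, self.state.value, new_state.value, reason))
        self.state = new_state

    def start(self, now):
        self._move(SessionState.ACTIVE, now, "start")
        self.last_audio_at = now

    def on_audio(self, now):
        # Any incoming audio chunk resets the silence window.
        if self.state is SessionState.ACTIVE:
            self.last_audio_at = now

    def stop(self, now, reason="stop"):
        self._move(SessionState.STOPPED, now, reason)

    def tick(self, now):
        """Stop the session if no audio arrived within the timeout; return True if it fired."""
        if (
            self.state is SessionState.ACTIVE
            and now - self.last_audio_at >= self.silence_timeout_s
        ):
            self.stop(now, reason="silence_timeout")
            return True
        return False
```

In an asyncio service, `tick` could be driven by a periodic task that passes `time.monotonic()`; keeping the clock outside the class is a design choice that makes the timeout deterministic in tests.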
### 7) Tool Execution Layer
- [ ] PostgreSQL
- [ ] Python
- [ ] Auth secret vault, retry policies, tool schemas
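
A retry policy with per-attempt timeouts might look like the sketch below. The helper name `call_tool_with_retry`, its parameters, and the choice of which exceptions count as retryable are assumptions; only `asyncio.wait_for` and `asyncio.sleep` are standard-library calls.

```python
import asyncio


async def call_tool_with_retry(tool, args, *, attempts=3, timeout_s=2.0, base_delay_s=0.1):
    """Run an async tool callable with a per-attempt timeout and exponential backoff.

    `tool` is any coroutine function; `args` is a dict of keyword arguments.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(tool(**args), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Only timeouts and connection failures are retried here (an assumption);
            # other exceptions propagate immediately.
            last_error = exc
            if attempt < attempts - 1:
                await asyncio.sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError(f"tool failed after {attempts} attempts") from last_error
```

Treating only transient failures as retryable keeps a tool that rejects its input from being called repeatedly.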
### 8) Observability + Logs
- [ ] Postgres (metadata), ClickHouse (logs/metrics)
- [ ] OpenSearch for search
- [ ] Prometheus + Grafana metrics
- [ ] OpenTelemetry tracing
### 9) Billing + Usage Metering
- [ ] Stripe billing
- [ ] PostgreSQL
- [ ] NATS JetStream (events) + Redis counters
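
To make the metering idea concrete, here is an in-memory sketch of per-org usage counters and credit balances. It stands in for the Redis counters and PostgreSQL tables listed above and does not talk to either; the `CreditLedger` class, its method names, and per-minute pricing are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


class CreditLedger:
    """In-memory stand-in for per-org usage counters and credit balances."""

    def __init__(self, price_per_minute):
        # Decimal avoids float rounding drift in money-like values.
        self.price_per_minute = Decimal(price_per_minute)
        self.balances = defaultdict(Decimal)
        self.usage_seconds = defaultdict(int)

    def top_up(self, org_id, credits):
        self.balances[org_id] += Decimal(credits)

    def record_usage(self, org_id, seconds):
        """Charge an org for `seconds` of session time and return the new balance."""
        cost = self.price_per_minute * Decimal(seconds) / Decimal(60)
        if cost > self.balances[org_id]:
            raise ValueError("insufficient credits")
        self.usage_seconds[org_id] += seconds
        self.balances[org_id] -= cost
        return self.balances[org_id]
```

With a price of 0.60 credits per minute and a balance of 1.00, recording 30 seconds leaves 0.70 credits; a further 90 seconds would cost 0.90 and is rejected without changing the counters.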
### 10) Web App (Dashboard)
- [ ] React + Next.js
- [ ] Tailwind or Radix UI
- [ ] WebRTC client + WS client; adapter-based RTC integration
- [ ] ECharts/Recharts
### 11) Auth + RBAC
- [ ] Keycloak (self-hosted) or custom JWT
- [ ] Org/user/role tables in Postgres
### 12) Public WebSocket API + SDK
- [ ] WS API: versioned schema, binary audio frames + JSON control messages
- [ ] SDKs: JS/TS first, optional Python/Go clients
- [ ] Docs: quickstart, auth flow, session lifecycle, examples
---
## Infrastructure (Self-Hosted)
- [ ] Docker Compose → k3s (later)
- [ ] Redis Streams or NATS
- [ ] MinIO object store
- [ ] GitHub Actions + Helm or kustomize
- [ ] Self-hosted Postgres + pgbackrest backups
- [ ] Vault for secrets
---
## Suggested MVP Sequence
- [ ] WebRTC demo + ASR/LLM/TTS streaming
- [ ] Assistant schema + versioning (web-first)
- [ ] Video capture + multimodal analysis
- [ ] Tool execution + structured outputs
- [ ] Logs + evals + public WS API + SDK
- [ ] Telephony (optional, later)
---
## Public WebSocket API (Minimum Spec)
- [ ] Auth: API key or JWT in initial `hello` message
- [ ] Core messages: `session.start`, `session.stop`, `audio.append`, `audio.commit`, `video.append`, `transcript.delta`, `assistant.response`, `tool.call`, `tool.result`, `error`
- [ ] Binary payloads: PCM/Opus frames with metadata in control channel
- [ ] Versioning: `v1` schema with backward compatibility rules
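
The control-message side of this spec could be validated roughly as follows. The message type names come from the list above; the envelope keys (`version`, `type`, `payload`) and both function names are assumptions, since the spec does not yet define a wire format.

```python
import json

# Message types listed in the minimum spec.
MESSAGE_TYPES = frozenset({
    "session.start", "session.stop", "audio.append", "audio.commit",
    "video.append", "transcript.delta", "assistant.response",
    "tool.call", "tool.result", "error",
})

SCHEMA_VERSION = "v1"


def parse_control_message(raw):
    """Decode one JSON control message and validate its (hypothetical) envelope.

    Returns a (type, payload) tuple or raises ValueError.
    """
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(msg, dict):
        raise ValueError("control message must be a JSON object")
    if msg.get("version") != SCHEMA_VERSION:
        raise ValueError(f"unsupported version: {msg.get('version')!r}")
    msg_type = msg.get("type")
    if not isinstance(msg_type, str) or msg_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {msg_type!r}")
    payload = msg.get("payload", {})
    if not isinstance(payload, dict):
        raise ValueError("payload must be an object")
    return msg_type, payload


def error_message(reason):
    """Build an `error` control message in the same envelope."""
    return json.dumps(
        {"version": SCHEMA_VERSION, "type": "error", "payload": {"reason": reason}}
    )
```

A server loop could call `parse_control_message` on each text frame and reply with `error_message(str(exc))` on failure, while binary frames bypass this path and go straight to the audio pipeline.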
---
## Self-Hosted DB Ops Checklist
- [ ] Postgres in Docker/k3s with persistent volumes
- [ ] Migrations: `alembic` or `atlas`
- [ ] Backups: `pgbackrest` nightly + on-demand
- [ ] Monitoring: postgres_exporter + alerts
---
## RTC Adapter Contract (BYO-SFU First)
- [ ] Keep RTC pluggable; LiveKit optional, not core dependency
- [ ] Define adapter interface (TypeScript sketch)