Init commit

docs/duplex_interaction.svg (new file)
<svg width="1200" height="620" viewBox="0 0 1200 620" xmlns="http://www.w3.org/2000/svg">
  <defs>
    <style>
      .box { fill:#11131a; stroke:#3a3f4b; stroke-width:1.2; rx:10; ry:10; }
      .title { font: 600 14px 'Arial'; fill:#f2f3f7; }
      .text { font: 12px 'Arial'; fill:#c8ccd8; }
      .arrow { stroke:#7aa2ff; stroke-width:1.6; marker-end:url(#arrow); fill:none; }
      .arrow2 { stroke:#2dd4bf; stroke-width:1.6; marker-end:url(#arrow); fill:none; }
      .arrow3 { stroke:#ff6b6b; stroke-width:1.6; marker-end:url(#arrow); fill:none; }
      .label { font: 11px 'Arial'; fill:#9aa3b2; }
    </style>
    <marker id="arrow" markerWidth="8" markerHeight="8" refX="7" refY="4" orient="auto">
      <path d="M0,0 L8,4 L0,8 Z" fill="#7aa2ff"/>
    </marker>
  </defs>

  <rect x="40" y="40" width="250" height="120" class="box"/>
  <text x="60" y="70" class="title">Web Client</text>
  <text x="60" y="95" class="text">WS JSON commands</text>
  <text x="60" y="115" class="text">WS binary PCM audio</text>

  <rect x="350" y="40" width="250" height="120" class="box"/>
  <text x="370" y="70" class="title">FastAPI /ws</text>
  <text x="370" y="95" class="text">Session + Transport</text>

  <rect x="660" y="40" width="250" height="120" class="box"/>
  <text x="680" y="70" class="title">DuplexPipeline</text>
  <text x="680" y="95" class="text">process_audio / process_text</text>

  <rect x="920" y="40" width="240" height="120" class="box"/>
  <text x="940" y="70" class="title">ConversationManager</text>
  <text x="940" y="95" class="text">turns + state</text>

  <rect x="660" y="200" width="180" height="100" class="box"/>
  <text x="680" y="230" class="title">VADProcessor</text>
  <text x="680" y="255" class="text">speech/silence</text>

  <rect x="860" y="200" width="180" height="100" class="box"/>
  <text x="880" y="230" class="title">EOU Detector</text>
  <text x="880" y="255" class="text">end-of-utterance</text>

  <rect x="1060" y="200" width="120" height="100" class="box"/>
  <text x="1075" y="230" class="title">ASR</text>
  <text x="1075" y="255" class="text">transcripts</text>

  <rect x="920" y="350" width="240" height="110" class="box"/>
  <text x="940" y="380" class="title">LLM (stream)</text>
  <text x="940" y="405" class="text">llmResponse events</text>

  <rect x="660" y="350" width="220" height="110" class="box"/>
  <text x="680" y="380" class="title">TTS (stream)</text>
  <text x="680" y="405" class="text">PCM audio</text>

  <rect x="40" y="350" width="250" height="110" class="box"/>
  <text x="60" y="380" class="title">Web Client</text>
  <text x="60" y="405" class="text">audio playback + UI</text>

  <path d="M290 80 L350 80" class="arrow"/>
  <text x="300" y="70" class="label">JSON / PCM</text>

  <path d="M600 80 L660 80" class="arrow"/>
  <text x="615" y="70" class="label">dispatch</text>

  <path d="M910 80 L920 80" class="arrow"/>
  <text x="880" y="70" class="label">turn mgmt</text>

  <path d="M750 160 L750 200" class="arrow"/>
  <text x="705" y="190" class="label">audio chunks</text>

  <path d="M840 250 L860 250" class="arrow"/>
  <text x="835" y="240" class="label">vad status</text>

  <path d="M1040 250 L1060 250" class="arrow"/>
  <text x="1010" y="240" class="label">audio buffer</text>

  <path d="M950 300 L950 350" class="arrow2"/>
  <text x="930" y="340" class="label">EOU -> LLM</text>

  <path d="M880 405 L920 405" class="arrow2"/>
  <text x="870" y="395" class="label">text stream</text>

  <path d="M660 405 L290 405" class="arrow2"/>
  <text x="430" y="395" class="label">PCM audio</text>

  <path d="M660 450 L350 450" class="arrow"/>
  <text x="420" y="440" class="label">events: trackStart/End</text>

  <path d="M350 450 L290 450" class="arrow"/>
  <text x="315" y="440" class="label">UI updates</text>

  <path d="M750 200 L750 160" class="arrow3"/>
  <text x="700" y="145" class="label">barge-in detection</text>

  <path d="M760 170 L920 170" class="arrow3"/>
  <text x="820" y="160" class="label">interrupt event + cancel</text>
</svg>
docs/proejct_todo.md (new file)
# OmniSense: 12-Week Sprint Board + Tech Stack (Python Backend) — TODO

## Scope

- [ ] Build a realtime AI SaaS (OmniSense) focused on web-first audio + video with WebSocket + WebRTC endpoints
- [ ] Deliver the assistant builder, tool execution, observability, evals, and optional telephony later
- [ ] Keep scope aligned to a 2-person team and self-hosted services

---

## Sprint Board (12 weeks, 2-week sprints)

Team assumption: 2 engineers. Scope is prioritized to web-first audio + video, with BYO-SFU adapters.

### Sprint 1 (Weeks 1–2) — Realtime Core MVP (WebSocket + WebRTC Audio)

- Deliverables
  - [ ] WebSocket transport: audio in/out streaming (1:1)
  - [ ] WebRTC transport: audio in/out streaming (1:1)
  - [ ] Adapter contract wired into the runtime (transport-agnostic session core)
  - [ ] ASR → LLM → TTS pipeline, streaming in both directions
  - [ ] Basic session state (start/stop, silence timeout)
  - [ ] Transcript persistence
- Acceptance criteria
  - [ ] < 1.5 s median round-trip for short responses
  - [ ] Stable streaming for a 10+ minute session

### Sprint 2 (Weeks 3–4) — Video + Realtime UX

- Deliverables
  - [ ] WebRTC video capture + streaming (assistant can “see” frames)
  - [ ] WebSocket video streaming for local/dev mode
  - [ ] Low-latency UI: push-to-talk, live captions, speaking indicator
  - [ ] Recording + transcript storage (web sessions)
- Acceptance criteria
  - [ ] Video < 2.5 s end-to-end latency for analysis
  - [ ] Acceptable audio quality (no clipping, jitter handled)

### Sprint 3 (Weeks 5–6) — Assistant Builder v1

- Deliverables
  - [ ] Assistant schema + versioning
  - [ ] UI: Model/Voice/Transcriber/Tools/Video/Transport tabs
  - [ ] “Test/Chat/Talk to Assistant” (web)
- Acceptance criteria
  - [ ] Create/publish an assistant and run a live web session
  - [ ] All config changes tracked by version

### Sprint 4 (Weeks 7–8) — Tooling + Structured Outputs

- Deliverables
  - [ ] Tool registry + custom HTTP tools
  - [ ] Tool auth secrets management
  - [ ] Structured outputs (JSON extraction)
- Acceptance criteria
  - [ ] Tool calls executed with retries/timeouts
  - [ ] Structured JSON stored per call/session

### Sprint 5 (Weeks 9–10) — Observability + QA + Dev Platform

- Deliverables
  - [ ] Session logs + chat logs + media logs
  - [ ] Evals engine + test suites
  - [ ] Basic analytics dashboard
  - [ ] Public WebSocket API spec + message schema
  - [ ] JS/TS SDK (connect, send audio/video, receive transcripts)
- Acceptance criteria
  - [ ] Reproducible test-suite runs
  - [ ] Log filters by assistant/time/status
  - [ ] SDK demo app runs end-to-end

### Sprint 6 (Weeks 11–12) — SaaS Hardening

- Deliverables
  - [ ] Org/RBAC + API keys + rate limits
  - [ ] Usage metering + credits
  - [ ] Stripe billing integration
  - [ ] Self-hosted DB ops (migrations, backup/restore, monitoring)
- Acceptance criteria
  - [ ] Metered usage per org
  - [ ] Credits decrement correctly
  - [ ] Optional telephony spike documented (build deferred)
  - [ ] Enterprise adapter guide published (BYO-SFU)

---

## Tech Stack by Service (Self-Hosted, Web-First)

### 1) Transport Gateway (Realtime)

- [ ] WebRTC (browser) + WebSocket (lightweight/dev) protocols
- [ ] BYO-SFU adapter (enterprise) + optional LiveKit adapter + WS transport server
- [ ] Python core (FastAPI + asyncio) + Node.js mediasoup adapters when needed
- [ ] Media: Opus/VP8, jitter buffer, VAD, echo cancellation
- [ ] Storage: S3-compatible (MinIO) for recordings

### 2) ASR Service

- [ ] Whisper (self-hosted) baseline
- [ ] gRPC/WebSocket streaming transport
- [ ] Python-native service
- [ ] Optional cloud provider fallback (later)

### 3) TTS Service

- [ ] Piper or Coqui TTS (self-hosted)
- [ ] gRPC/WebSocket streaming transport
- [ ] Python-native service
- [ ] Redis cache for common phrases

### 4) LLM Orchestrator

- [ ] Self-hosted (vLLM + open model)
- [ ] Python (FastAPI + asyncio)
- [ ] Streaming, tool calling, JSON mode
- [ ] Safety filters + prompt templates

### 5) Assistant Config Service

- [ ] PostgreSQL
- [ ] Python (SQLAlchemy or SQLModel)
- [ ] Versioning, publish/rollback

### 6) Session Service

- [ ] PostgreSQL + Redis
- [ ] Python
- [ ] State machine, timeouts, events

### 7) Tool Execution Layer

- [ ] PostgreSQL
- [ ] Python
- [ ] Auth secret vault, retry policies, tool schemas

### 8) Observability + Logs

- [ ] Postgres (metadata), ClickHouse (logs/metrics)
- [ ] OpenSearch for search
- [ ] Prometheus + Grafana metrics
- [ ] OpenTelemetry tracing

### 9) Billing + Usage Metering

- [ ] Stripe billing
- [ ] PostgreSQL
- [ ] NATS JetStream (events) + Redis counters

### 10) Web App (Dashboard)

- [ ] React + Next.js
- [ ] Tailwind or Radix UI
- [ ] WebRTC client + WS client; adapter-based RTC integration
- [ ] ECharts/Recharts

### 11) Auth + RBAC

- [ ] Keycloak (self-hosted) or custom JWT
- [ ] Org/user/role tables in Postgres

### 12) Public WebSocket API + SDK

- [ ] WS API: versioned schema, binary audio frames + JSON control messages
- [ ] SDKs: JS/TS first, optional Python/Go clients
- [ ] Docs: quickstart, auth flow, session lifecycle, examples

---

## Infrastructure (Self-Hosted)

- [ ] Docker Compose → k3s (later)
- [ ] Redis Streams or NATS
- [ ] MinIO object store
- [ ] GitHub Actions + Helm or kustomize
- [ ] Self-hosted Postgres + pgbackrest backups
- [ ] Vault for secrets

---

## Suggested MVP Sequence

- [ ] WebRTC demo + ASR/LLM/TTS streaming
- [ ] Assistant schema + versioning (web-first)
- [ ] Video capture + multimodal analysis
- [ ] Tool execution + structured outputs
- [ ] Logs + evals + public WS API + SDK
- [ ] Telephony (optional, later)

---

## Public WebSocket API (Minimum Spec)

- [ ] Auth: API key or JWT in the initial `hello` message
- [ ] Core messages: `session.start`, `session.stop`, `audio.append`, `audio.commit`, `video.append`, `transcript.delta`, `assistant.response`, `tool.call`, `tool.result`, `error`
- [ ] Binary payloads: PCM/Opus frames, with metadata in the control channel
- [ ] Versioning: `v1` schema with backward-compatibility rules

---

## Self-Hosted DB Ops Checklist

- [ ] Postgres in Docker/k3s with persistent volumes
- [ ] Migrations: `alembic` or `atlas`
- [ ] Backups: `pgbackrest` nightly + on-demand
- [ ] Monitoring: postgres_exporter + alerts

---

## RTC Adapter Contract (BYO-SFU First)

- [ ] Keep RTC pluggable; LiveKit optional, not a core dependency
- [ ] Define the adapter interface (TypeScript sketch)
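The checklist above calls for a TypeScript sketch of the adapter interface; as an illustration of the same idea in the backend's language, here is a hypothetical Python `Protocol` version. Every name and method signature here is an assumption, not a decided contract:

```python
# Hypothetical sketch of the transport-agnostic RTC adapter contract.
# All names and signatures are illustrative assumptions; the real
# interface is still a TODO (and is planned as a TypeScript sketch).
from typing import AsyncIterator, Protocol, runtime_checkable


@runtime_checkable
class RTCAdapter(Protocol):
    async def connect(self, room: str, token: str) -> None:
        """Join a room on the underlying SFU (LiveKit, mediasoup, ...)."""
        ...

    async def publish_audio(self, pcm_frames: AsyncIterator[bytes]) -> None:
        """Push outbound PCM frames to the transport."""
        ...

    def subscribe_audio(self) -> AsyncIterator[bytes]:
        """Yield inbound PCM frames from the transport."""
        ...

    async def close(self) -> None:
        """Tear down the transport session."""
        ...
```

The session core would depend only on this protocol, so LiveKit, a BYO-SFU, or the plain WS transport can be swapped in without touching pipeline code.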
docs/ws_v1_schema.md (new file)
# WS v1 Protocol Schema (`/ws`)

This document defines the public WebSocket protocol for the `/ws` endpoint.

## Transport

- A single WebSocket connection carries:
  - JSON text frames for control/events.
  - Binary frames for raw PCM audio (`pcm_s16le`, mono, 16kHz by default).

## Handshake and State Machine

Required message order:

1. Client sends `hello`.
2. Server replies `hello.ack`.
3. Client sends `session.start`.
4. Server replies `session.started`.
5. Client may stream binary audio and/or send `input.text`.
6. Client sends `session.stop` (or closes the socket).

If this order is violated, the server emits `error` with `code = "protocol.order"`.
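The order rules above amount to a small state machine. A minimal server-side sketch (the state names, transition table, and `ProtocolError` class are illustrative assumptions; only the `protocol.order` error code comes from the spec):

```python
# Sketch of server-side enforcement of the required message order.
# State names and the transition table are assumptions for illustration.

class ProtocolError(Exception):
    code = "protocol.order"

# Allowed client message types per connection state.
TRANSITIONS = {
    "new":     {"hello": "greeted"},
    "greeted": {"session.start": "active"},
    "active":  {"input.text": "active",
                "response.cancel": "active",
                "tool_call.results": "active",
                "session.stop": "stopped"},
}

def next_state(state: str, msg_type: str) -> str:
    """Advance the handshake state, or raise with code 'protocol.order'."""
    allowed = TRANSITIONS.get(state, {})
    if msg_type not in allowed:
        raise ProtocolError(f"{msg_type!r} not allowed in state {state!r}")
    return allowed[msg_type]
```

Binary audio frames would bypass this table and instead be rejected whenever the state is not `active`.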
## Client -> Server Messages

### `hello`

```json
{
  "type": "hello",
  "version": "v1",
  "auth": {
    "apiKey": "optional-api-key",
    "jwt": "optional-jwt"
  }
}
```

Rules:

- `version` must be `v1`.
- If `WS_API_KEY` is configured on the server, `auth.apiKey` must match.
- If `WS_REQUIRE_AUTH=true`, either `auth.apiKey` or `auth.jwt` must be present.
### `session.start`

```json
{
  "type": "session.start",
  "audio": {
    "encoding": "pcm_s16le",
    "sample_rate_hz": 16000,
    "channels": 1
  },
  "metadata": {
    "client": "web-debug",
    "output": {
      "mode": "audio"
    },
    "systemPrompt": "You are concise.",
    "greeting": "Hi, how can I help?",
    "services": {
      "llm": {
        "provider": "openai",
        "model": "gpt-4o-mini",
        "apiKey": "sk-...",
        "baseUrl": "https://api.openai.com/v1"
      },
      "asr": {
        "provider": "openai_compatible",
        "model": "FunAudioLLM/SenseVoiceSmall",
        "apiKey": "sf-...",
        "interimIntervalMs": 500,
        "minAudioMs": 300
      },
      "tts": {
        "enabled": true,
        "provider": "openai_compatible",
        "model": "FunAudioLLM/CosyVoice2-0.5B",
        "apiKey": "sf-...",
        "voice": "anna",
        "speed": 1.0
      }
    }
  }
}
```

`metadata.services` is optional. If it is omitted, the server falls back to its environment configuration.
Text-only mode:

- Set `metadata.output.mode = "text"` OR set `metadata.services.tts.enabled = false`.
- In this mode the server still sends `assistant.response.delta/final`, but does not emit audio frames or `output.audio.start/end`.
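The two triggers combine into a single predicate. A tiny sketch (the helper name and the defaults when fields are absent are assumptions; the two triggering settings come from the spec):

```python
# Sketch of the text-only condition: audio output is on unless the client
# asked for text mode or disabled TTS. Defaults for missing fields are
# assumed ("audio" mode, TTS enabled).

def audio_output_enabled(metadata: dict) -> bool:
    mode = metadata.get("output", {}).get("mode", "audio")
    tts = metadata.get("services", {}).get("tts", {})
    return mode != "text" and tts.get("enabled", True)
```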
### `input.text`

```json
{
  "type": "input.text",
  "text": "What can you do?"
}
```

### `response.cancel`

```json
{
  "type": "response.cancel",
  "graceful": false
}
```

### `session.stop`

```json
{
  "type": "session.stop",
  "reason": "client_disconnect"
}
```

### `tool_call.results`

Client tool execution results returned to the server.

```json
{
  "type": "tool_call.results",
  "results": [
    {
      "tool_call_id": "call_abc123",
      "name": "weather",
      "output": { "temp_c": 21, "condition": "sunny" },
      "status": { "code": 200, "message": "ok" }
    }
  ]
}
```
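A client would send this message after executing a tool it received via the `assistant.tool_call` event (when `tool_call.executor` is `client`). A minimal sketch of building the reply; note the field layout of the incoming `tool_call` object (`id`, `name`) is assumed from the examples in this document, not separately specified:

```python
# Sketch of a client answering an `assistant.tool_call` event with a
# `tool_call.results` message. The incoming tool_call's `id`/`name`
# fields are assumptions based on the examples in this document.
import json

def build_tool_results(tool_call: dict, output: dict) -> str:
    """Serialize a single successful tool result for the server."""
    return json.dumps({
        "type": "tool_call.results",
        "results": [{
            "tool_call_id": tool_call["id"],
            "name": tool_call["name"],
            "output": output,
            "status": {"code": 200, "message": "ok"},
        }],
    })
```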
## Server -> Client Events

All server events include:

```json
{
  "type": "event.name",
  "timestamp": 1730000000000
}
```

Common events:

- `hello.ack`
  - Fields: `sessionId`, `version`
- `session.started`
  - Fields: `sessionId`, `trackId`, `audio`
- `session.stopped`
  - Fields: `sessionId`, `reason`
- `heartbeat`
- `input.speech_started`
  - Fields: `trackId`, `probability`
- `input.speech_stopped`
  - Fields: `trackId`, `probability`
- `transcript.delta`
  - Fields: `trackId`, `text`
- `transcript.final`
  - Fields: `trackId`, `text`
- `assistant.response.delta`
  - Fields: `trackId`, `text`
- `assistant.response.final`
  - Fields: `trackId`, `text`
- `assistant.tool_call`
  - Fields: `trackId`, `tool_call` (`tool_call.executor` is `client` or `server`)
- `assistant.tool_result`
  - Fields: `trackId`, `source`, `result`
- `output.audio.start`
  - Fields: `trackId`
- `output.audio.end`
  - Fields: `trackId`
- `response.interrupted`
  - Fields: `trackId`
- `metrics.ttfb`
  - Fields: `trackId`, `latencyMs`
- `error`
  - Fields: `sender`, `code`, `message`, `trackId`

## Binary Audio Frames

After `session.started`, the client may send binary PCM chunks continuously.

Recommended format:

- 16-bit signed little-endian PCM.
- 1 channel.
- 16000 Hz.
- 20 ms frames (640 bytes) preferred.
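The 640-byte figure follows from the format above: 16000 samples/s × 0.020 s × 2 bytes/sample × 1 channel. A small sketch of computing the frame size and chunking a capture buffer (function names are illustrative, and dropping a short trailing remainder is an assumed policy; padding or buffering it instead is equally valid):

```python
# Frame size for the recommended format:
# 16000 samples/s * 0.020 s * 2 bytes/sample * 1 channel = 640 bytes.

def frame_bytes(sample_rate_hz: int = 16000, channels: int = 1,
                frame_ms: int = 20, bytes_per_sample: int = 2) -> int:
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample * channels

def iter_frames(pcm: bytes, size: int = 640):
    """Yield fixed-size frames; the short trailing remainder is dropped
    here (whether to pad or buffer it is an implementation choice)."""
    for off in range(0, len(pcm) - size + 1, size):
        yield pcm[off:off + size]
```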
## Compatibility

This endpoint now enforces the v1 message schema for JSON control frames.
Legacy command names (`invite`, `chat`, etc.) are no longer part of the public protocol.