wx44wx/py-active-call

Xin Wang 14013608a9 Init Projecto

2026-01-28 10:19:04 +08:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a Python implementation of active-call (originally written in Rust), a high-performance Voice AI Gateway that bridges telephony protocols (WebSocket, WebRTC) with AI pipelines (LLM, ASR, TTS). The system follows a decoupled architecture: the Media Gateway (this service) handles low-level audio and signaling, while the Business Logic (an external AI Agent) controls it through a WebSocket API.

Technology Stack:

Python 3.11+ with asyncio for all I/O
FastAPI + Uvicorn for WebSocket/WebRTC endpoints
aiortc for WebRTC media transport (optional dependency)
Silero VAD for voice activity detection (optional dependency)
Pydantic for protocol validation
Loguru for structured logging

Common Development Commands

Running the Server

# Start development server (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

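# Start with specific host/port (the uvicorn CLI takes these as flags;
# it does not read plain HOST/PORT environment variables)
uvicorn app.main:app --host 0.0.0.0 --port 8080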

# Using Docker
docker-compose up --build

Testing

# Run WebSocket test client (sine wave generation)
python scripts/test_websocket.py --url ws://localhost:8000/ws --sine

# Run WebSocket test client (with audio file)
python scripts/test_websocket.py --url ws://localhost:8000/ws --file test_audio.wav

# Run WebRTC test client
python scripts/test_webrtc.py --url ws://localhost:8000/webrtc

# Run unit tests
pytest tests/ -v --cov=app --cov=core

# Run specific test file
pytest tests/test_session.py -v

# Run with coverage report
pytest tests/ --cov=app --cov=core --cov-report=html

Code Quality

# Format code
black app/ core/ models/ processors/ utils/ scripts/

# Lint code
ruff check app/ core/ models/ processors/ utils/ scripts/

# Type checking
mypy app/ core/

Dependency Management

# Install all dependencies
pip install -r requirements.txt

# Install development dependencies
pip install -r requirements-dev.txt

# Update dependencies
pip install --upgrade -r requirements.txt

Architecture Overview

Decoupled Design Pattern

The system implements a decoupled architecture separating concerns:

Media Gateway Layer (app/, core/, processors/)
- Handles low-level audio transport (WebSocket, WebRTC)
- Manages session lifecycle and state
- Processes audio through pipeline (VAD, resampling)
- Emits events (speaking, silence, error) to control layer
Business Logic Layer (External AI Agent)
- Connects via WebSocket
- Receives real-time events (speech detection, ASR transcripts)
- Sends commands (tts, play, interrupt, hangup)

Key Architecture Components

Transport Abstraction (core/transports.py):

BaseTransport - Abstract interface with send_event() and send_audio()
SocketTransport - WebSocket with mixed text/binary frames, uses asyncio.Lock to prevent frame interleaving
WebRtcTransport - WebSocket signaling + aiortc RTCPeerConnection for media
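
As a rough illustration of the abstraction above, the sketch below defines the two-method interface and a recording transport for tests. Only BaseTransport, send_event() and send_audio() come from this document; InMemoryTransport and the demo function are hypothetical names, not part of the codebase.

```python
import asyncio
import json
from abc import ABC, abstractmethod


class BaseTransport(ABC):
    """Interface the session layer talks to, independent of the wire protocol."""

    @abstractmethod
    async def send_event(self, event: dict) -> None: ...

    @abstractmethod
    async def send_audio(self, pcm_bytes: bytes) -> None: ...


class InMemoryTransport(BaseTransport):
    """Hypothetical test double: records frames instead of writing to a socket."""

    def __init__(self) -> None:
        self.frames: list[tuple[str, object]] = []

    async def send_event(self, event: dict) -> None:
        # Events travel as JSON text frames.
        self.frames.append(("text", json.dumps(event)))

    async def send_audio(self, pcm_bytes: bytes) -> None:
        # Audio travels as raw binary frames.
        self.frames.append(("binary", pcm_bytes))


async def demo() -> list:
    transport = InMemoryTransport()
    await transport.send_event({"event": "answer"})
    await transport.send_audio(b"\x00\x00" * 320)  # one 20 ms chunk of silence
    return transport.frames


frames = asyncio.run(demo())
```

Because Session code only depends on the interface, a double like this lets session logic be unit-tested without opening a socket.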

Session Management (core/session.py):

Each WebSocket/WebRTC connection creates a Session with unique UUID
Routes incoming JSON commands to handlers via parse_command() from models/commands.py
Routes binary audio data to AudioPipeline
Manages session state: created → invited → accepted → ringing → hungup
Cleanup on disconnect

Audio Pipeline (core/pipeline.py):

Processes audio through VAD (Voice Activity Detection)
Emits events to global event bus when VAD state changes
Supports interruption for barge-in scenarios
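
Emitting events only when the VAD state changes amounts to edge detection over per-chunk speech decisions. The following is a minimal sketch of that idea, not the pipeline's actual code: VadEdgeDetector is a hypothetical name, and debouncing or hangover timing (which a production pipeline would likely add) is left out.

```python
from typing import Optional


class VadEdgeDetector:
    """Turns a stream of per-chunk speech decisions into state-change events."""

    def __init__(self) -> None:
        self.speaking = False

    def update(self, is_speech: bool) -> Optional[str]:
        """Return "speaking" or "silence" on a transition, otherwise None."""
        if is_speech and not self.speaking:
            self.speaking = True
            return "speaking"
        if not is_speech and self.speaking:
            self.speaking = False
            return "silence"
        return None


detector = VadEdgeDetector()
decisions = [False, True, True, False, False, True]
events = [detector.update(d) for d in decisions]
```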

Event Bus (core/events.py):

Global pub/sub system for inter-component communication
Subscribe to specific event types (speaking, silence, error)
Async notification to all subscribers

Protocol Compatibility

The implementation must maintain protocol compatibility with the original Rust API. All commands and events are strictly defined in:

models/commands.py - Command models (invite, accept, reject, tts, play, interrupt, hangup, chat)
models/events.py - Event models (answer, speaking, silence, trackStart, trackEnd, error)
models/config.py - Configuration models (CallOption, VADOption, TTSOption, ASROption, etc.)

Important: Always use parse_command() from models/commands.py to parse incoming JSON - never manually parse command strings. This ensures type safety and validation.
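
To show why a single parsing entry point helps, here is a stdlib-only sketch of the registry pattern. It is an assumption about the shape of parse_command(), not a copy of it: dataclasses stand in for the Pydantic models, the field names (text, reason) are illustrative, and unlike Pydantic this version does not check field types.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class TtsCommand:
    text: str  # illustrative field name
    command: str = "tts"


@dataclass
class HangupCommand:
    reason: str = ""  # illustrative field name
    command: str = "hangup"


# Registry keyed by the "command" field, in the spirit of COMMAND_TYPES.
COMMAND_TYPES = {"tts": TtsCommand, "hangup": HangupCommand}


def parse_command(raw: str):
    """Decode a JSON text frame into a typed command, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("command must be a JSON object")
    name = data.get("command")
    if not isinstance(name, str) or name not in COMMAND_TYPES:
        raise ValueError(f"unknown command: {name!r}")
    model = COMMAND_TYPES[name]
    unknown = set(data) - {f.name for f in fields(model)}
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    try:
        return model(**data)
    except TypeError as exc:  # a required field is missing
        raise ValueError(str(exc)) from exc


cmd = parse_command('{"command": "tts", "text": "hello"}')
```

Every malformed frame surfaces as one exception type, so the caller can map it to a single error event instead of scattering checks across handlers.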

WebSocket Protocol (`/ws` endpoint)

Mixed Frame Handling:

Text frames → JSON commands (invite, tts, play, interrupt, hangup, etc.)
Binary frames → Raw PCM audio (16kHz, 16-bit, mono)

Flow:

Client connects and sends invite command with codec configuration
Server responds with answer event
Client streams binary audio frames
Server processes audio and emits events (speaking, silence)
Client can send commands at any time (tts, play, interrupt, hangup)

WebRTC Protocol (`/webrtc` endpoint)

Signaling Flow:

Client connects via WebSocket
Client sends SDP offer (JSON with sdp and type fields)
Server creates RTCPeerConnection and generates SDP answer
Server responds with answer event containing SDP
WebRTC media flows via UDP (managed by aiortc)
Commands can be sent via WebSocket text frames at any time

Audio Track Handling:

When pc.on("track") fires, wrap received track with Resampled16kTrack
Pull frames from track and convert to bytes
Feed bytes to session.handle_audio()

Session Lifecycle

1. Connection → WebSocket/WebRTC endpoint accepts
2. Session creation → New Session(uuid, transport)
3. Invite → Client sends invite command
4. Answer → Server sends answer event
5. Audio streaming → Client sends binary audio / WebRTC media
6. Commands → Client sends JSON commands (tts, play, interrupt)
7. Hangup → Client sends hangup command OR connection closes
8. Cleanup → Session cleanup, remove from active_sessions

Optional Dependencies

The following dependencies are optional - the code gracefully degrades without them:

aiortc + av (PyAV) - Required for WebRTC functionality. Without them:
- /webrtc endpoint will reject connections
- WebRTC transport cannot be used
- WebSocket endpoint still works fine
onnxruntime - Required for VAD functionality. Without it:
- VAD always returns "Speech" with probability 1.0
- speaking/silence events still emitted but not accurate
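
The degradation described above can be sketched with a guarded import and a stand-in object. FallbackVAD, create_vad and the real_factory parameter are hypothetical names; only the "Speech" / 1.0 fallback result comes from this document.

```python
try:
    import onnxruntime  # optional: only needed for the real Silero model
    HAS_ONNX = True
except ImportError:
    onnxruntime = None
    HAS_ONNX = False


class FallbackVAD:
    """Stand-in used when onnxruntime is missing: treats every chunk as speech."""

    def process(self, pcm_bytes: bytes) -> tuple[str, float]:
        return "Speech", 1.0


def create_vad(real_factory=None):
    """Return the real VAD when the dependency and a factory exist, else the fallback."""
    if HAS_ONNX and real_factory is not None:
        return real_factory()
    return FallbackVAD()


vad = create_vad()
result = vad.process(b"\x00\x00" * 320)
```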

Important Implementation Details

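Concurrency Safety in WebSocket Transport

The SocketTransport guards every send with an asyncio.Lock() because FastAPI WebSocket's send_text() and send_bytes() are not safe to call concurrently from several tasks. Without the lock, rapidly sent text and binary frames can interleave, causing protocol violations. Note that asyncio.Lock only serializes coroutines running on the same event loop; it does not make the transport safe to call from other OS threads.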

async def send_event(self, event: dict):
    async with self.lock:  # Serialize sends so frames from concurrent tasks cannot interleave
        await self.ws.send_text(json.dumps(event))

async def send_audio(self, pcm_bytes: bytes):
    async with self.lock:
        await self.ws.send_bytes(pcm_bytes)
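
The effect of the lock can be seen with a contrived stand-in socket whose send yields to the event loop halfway through. FakeWebSocket and run are hypothetical; a real WebSocket write is more complex, so treat this as an illustration of the ordering problem only.

```python
import asyncio


class FakeWebSocket:
    """Hypothetical socket whose send yields mid-write, as a network write might."""

    def __init__(self) -> None:
        self.log: list[str] = []

    async def send(self, label: str) -> None:
        self.log.append(f"{label}:start")
        await asyncio.sleep(0)  # give other tasks a chance to run mid-send
        self.log.append(f"{label}:end")


async def run(use_lock: bool) -> list[str]:
    ws = FakeWebSocket()
    lock = asyncio.Lock()

    async def send(label: str) -> None:
        if use_lock:
            async with lock:  # only one send may be in flight at a time
                await ws.send(label)
        else:
            await ws.send(label)

    # Two tasks send at the same time, like an event and an audio chunk.
    await asyncio.gather(send("text"), send("binary"))
    return ws.log


unlocked = asyncio.run(run(use_lock=False))
locked = asyncio.run(run(use_lock=True))
```

Without the lock the second send starts before the first finishes; with it, each start is immediately followed by its own end.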

Event Bus Usage

Components subscribe to event types and are notified asynchronously:

event_bus = get_event_bus()

# Subscribe to speaking events
event_bus.subscribe("speaking", my_callback)

# Publish events
await event_bus.publish("speaking", {"trackId": session_id, "probability": 0.9})

Error Handling Pattern

All errors are sent as error events to the client:

await self.transport.send_event({
    "event": "error",
    "trackId": self.current_track_id,
    "timestamp": self._get_timestamp_ms(),
    "sender": "server",  # or "asr", "tts", "media", etc.
    "error": error_message
})
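
One way to apply this pattern consistently is to build the event in a helper and wrap handlers so that any exception becomes an error event. The sketch below assumes epoch milliseconds for the timestamp; make_error_event and guarded are hypothetical helpers, not functions from the codebase.

```python
import asyncio
import time


def make_error_event(track_id: str, error_message: str, sender: str = "server") -> dict:
    """Build an error event with the fields shown above."""
    return {
        "event": "error",
        "trackId": track_id,
        "timestamp": int(time.time() * 1000),  # assumption: epoch milliseconds
        "sender": sender,
        "error": error_message,
    }


async def guarded(send_event, track_id: str, handler, sender: str = "server") -> None:
    """Run a handler and report any failure to the client as an error event."""
    try:
        await handler()
    except Exception as exc:
        await send_event(make_error_event(track_id, str(exc), sender))


async def demo() -> list:
    sent = []

    async def send_event(event: dict) -> None:
        sent.append(event)

    async def failing_handler() -> None:
        raise RuntimeError("tts backend unavailable")

    await guarded(send_event, "track-1", failing_handler, sender="tts")
    return sent


sent = asyncio.run(demo())
```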

Configuration Management

Configuration is loaded from:

Environment variables
.env file (gitignored)
Default values in app/config.py

Never commit .env - it may contain sensitive keys. Use .env.example as a template.
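
If the three sources are read as a precedence order (environment variables over .env over defaults, which is an assumption here rather than something this document states), the lookup can be sketched as follows. The setting names and default values are illustrative, and the hand-rolled parse_dotenv is a simplified stand-in for python-dotenv.

```python
import os

# Illustrative defaults; the real ones live in app/config.py.
DEFAULTS = {"HOST": "0.0.0.0", "PORT": "8000", "LOG_LEVEL": "INFO"}


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(dotenv_text: str = "", environ=None) -> dict[str, str]:
    """Layer the sources: defaults first, then .env, then the environment."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    dotenv_values = parse_dotenv(dotenv_text)
    settings.update({k: v for k, v in dotenv_values.items() if k in DEFAULTS})
    settings.update({k: environ[k] for k in DEFAULTS if k in environ})
    return settings


settings = load_settings("LOG_LEVEL=DEBUG\nPORT=9000\n", environ={"PORT": "8080"})
```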

Audio Format Specifications

Input/Output Audio:

Sample rate: 16kHz
Bit depth: 16-bit (PCM)
Channels: Mono
Chunk size: 640 bytes (20ms at 16kHz)

Format: Little-endian signed 16-bit integers (int16)
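
The chunk size follows from the format: 16,000 samples/s × 0.020 s = 320 samples, and 320 samples × 2 bytes = 640 bytes. The sketch below derives those constants and produces test audio in this format; sine_chunks is a hypothetical helper and is not taken from scripts/test_websocket.py.

```python
import math
import struct

SAMPLE_RATE = 16_000      # Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_MS = 20

SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000   # 320 samples
CHUNK_BYTES = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE   # 640 bytes


def sine_chunks(freq_hz: float, duration_ms: int, amplitude: float = 0.5) -> list[bytes]:
    """Return a mono sine tone as little-endian int16 PCM, split into 20 ms chunks."""
    total = SAMPLE_RATE * duration_ms // 1000
    samples = [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(total)
    ]
    pcm = struct.pack(f"<{total}h", *samples)  # "<" = little-endian, "h" = int16
    return [pcm[i:i + CHUNK_BYTES] for i in range(0, len(pcm), CHUNK_BYTES)]


chunks = sine_chunks(440.0, 100)  # 100 ms of a 440 Hz tone
```

Each element of chunks can be sent as one binary WebSocket frame.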

Key Files Reference

When working with this codebase, these files are the most critical:

app/main.py - FastAPI endpoints, session lifecycle, event hooks
core/transports.py - Transport abstraction and WebSocket/WebRTC handling
core/session.py - Command routing, session state management
core/pipeline.py - Audio processing, VAD integration, event emission
models/commands.py - Protocol command definitions and parsing
models/events.py - Protocol event definitions
processors/vad.py - Silero VAD implementation
reference/active-call/docs/api.md - Complete API specification from original Rust implementation

Testing Strategy

When implementing new features:

Unit tests - Test individual components (transports, session, pipeline)
Integration tests - Test endpoint behavior with test clients
Protocol tests - Verify commands/events match API specification
Manual testing - Use scripts/test_websocket.py and scripts/test_webrtc.py

Reference Implementations

Original Rust implementation: reference/active-call/ - Complete feature set with SIP, ASR, TTS
Python reference: reference/py-active-call/ - Partial implementation with bot integration

Use these as references for:

Protocol specification details
Architecture patterns
Testing approaches
Edge case handling

Common Patterns

Creating a new command:

Add model to models/commands.py
Add to COMMAND_TYPES dict
Add handler method in core/session.py (e.g., _handle_mycommand)
Route in Session.handle_text() under the command type
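
The handler and routing steps might look like the sketch below, which assumes the frame has already been turned into a typed object by parse_command(). MyCommand, its value field, the dispatch method and the _handlers table are all hypothetical; the real Session.handle_text() may route differently.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class MyCommand:  # hypothetical model that would live in models/commands.py
    value: str
    command: str = "mycommand"


class Session:
    def __init__(self) -> None:
        self.results: list[str] = []
        # Routing table: command name -> bound handler method.
        self._handlers = {"mycommand": self._handle_mycommand}

    async def dispatch(self, cmd) -> None:
        """Route an already-parsed command to its handler."""
        handler = self._handlers.get(cmd.command)
        if handler is None:
            raise ValueError(f"no handler for {cmd.command!r}")
        await handler(cmd)

    async def _handle_mycommand(self, cmd: MyCommand) -> None:
        self.results.append(f"handled {cmd.value}")


session = Session()
asyncio.run(session.dispatch(MyCommand(value="ping")))
```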

Adding a new event:

Add model to models/events.py
Add to EVENT_TYPES dict
Emit via transport.send_event() or event_bus.publish()

Adding a new processor:

Create in processors/myprocessor.py
Integrate into core/pipeline.py AudioPipeline
Emit events through event bus

Session State Management

Sessions track state through these transitions:

created - Initial state
invited - Invite command received
accepted - Accept command received
ringing - Ringing command sent
hungup - Hangup command or disconnect

The state attribute is updated in each handler and logged for debugging.
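
A small tracker can keep the state values in one place and make illegal updates fail loudly. The sketch below only uses the five state names listed above; treating hungup as terminal and recording a history are assumptions added for illustration, and StateTracker is a hypothetical name.

```python
from enum import Enum


class SessionState(str, Enum):
    CREATED = "created"
    INVITED = "invited"
    ACCEPTED = "accepted"
    RINGING = "ringing"
    HUNGUP = "hungup"


class StateTracker:
    """Holds the current state and a history that can be logged for debugging."""

    def __init__(self) -> None:
        self.state = SessionState.CREATED
        self.history = [self.state]

    def transition(self, new_state) -> SessionState:
        new_state = SessionState(new_state)  # raises ValueError for unknown names
        if self.state is SessionState.HUNGUP:
            # Assumption: hungup is terminal, so nothing may follow it.
            raise RuntimeError("session already hung up")
        self.state = new_state
        self.history.append(new_state)
        return new_state


tracker = StateTracker()
tracker.transition("invited")
tracker.transition(SessionState.HUNGUP)
history = [s.value for s in tracker.history]
```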

Testing Endpoints Without Full Dependencies

The WebSocket endpoint (/ws) works without aiortc, av, or onnxruntime. Use this for testing core functionality:

# Install minimal dependencies
pip install fastapi uvicorn numpy pydantic python-dotenv loguru aiohttp

# Start server
uvicorn app.main:app

# Test with basic client
python scripts/test_websocket.py

The WebRTC endpoint requires aiortc+av (PyAV) which can be challenging to install on Windows. Consider Linux/macOS for full WebRTC development.

Logging

Logs are written to:

Console (stdout) - Real-time output
logs/active_call_YYYY-MM-DD.log - Rotated daily, retained for 7 days

Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL

Set via LOG_LEVEL environment variable or in .env.

Dependencies Note

On Windows with Python 3.11, aiortc and av (PyAV) may have installation issues due to:

Missing C compilers
Incompatible binary wheel versions
FFmpeg/library dependencies

The code gracefully handles missing optional dependencies with try/except imports and runtime checks. Consider using Docker for consistent development environments.
