CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is a Python implementation of active-call (originally Rust), a high-performance Voice AI Gateway that bridges telephony protocols (WebSocket, WebRTC) with AI pipelines (LLM, ASR, TTS). The system follows a decoupled architecture where the Media Gateway (this service) handles low-level audio/signaling, while Business Logic (AI Agent) controls it via WebSocket API.
Technology Stack:
- Python 3.11+ with asyncio for all I/O
- FastAPI + Uvicorn for WebSocket/WebRTC endpoints
- aiortc for WebRTC media transport (optional dependency)
- Silero VAD for voice activity detection (optional dependency)
- Pydantic for protocol validation
- Loguru for structured logging
Common Development Commands
Running the Server
# Start development server (with auto-reload)
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
# Start with specific host/port
HOST=0.0.0.0 PORT=8080 uvicorn app.main:app
# Using Docker
docker-compose up --build
Testing
# Run WebSocket test client (sine wave generation)
python scripts/test_websocket.py --url ws://localhost:8000/ws --sine
# Run WebSocket test client (with audio file)
python scripts/test_websocket.py --url ws://localhost:8000/ws --file test_audio.wav
# Run WebRTC test client
python scripts/test_webrtc.py --url ws://localhost:8000/webrtc
# Run unit tests
pytest tests/ -v --cov=app --cov=core
# Run specific test file
pytest tests/test_session.py -v
# Run with coverage report
pytest tests/ --cov=app --cov=core --cov-report=html
Code Quality
# Format code
black app/ core/ models/ processors/ utils/ scripts/
# Lint code
ruff check app/ core/ models/ processors/ utils/ scripts/
# Type checking
mypy app/ core/
Dependency Management
# Install all dependencies
pip install -r requirements.txt
# Install development dependencies
pip install -r requirements-dev.txt
# Update dependencies
pip install --upgrade -r requirements.txt
Architecture Overview
Decoupled Design Pattern
The system implements a decoupled architecture separating concerns:
- Media Gateway Layer (app/, core/, processors/)
  - Handles low-level audio transport (WebSocket, WebRTC)
  - Manages session lifecycle and state
  - Processes audio through pipeline (VAD, resampling)
  - Emits events (speaking, silence, error) to control layer
- Business Logic Layer (External AI Agent)
  - Connects via WebSocket
  - Receives real-time events (speech detection, ASR transcripts)
  - Sends commands (tts, play, interrupt, hangup)
Key Architecture Components
Transport Abstraction (core/transports.py):
- BaseTransport - Abstract interface with send_event() and send_audio()
- SocketTransport - WebSocket with mixed text/binary frames, uses asyncio.Lock to prevent frame interleaving
- WebRtcTransport - WebSocket signaling + aiortc RTCPeerConnection for media
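The interface described above might look roughly like the following. This is a minimal sketch, not the contents of core/transports.py: only the names BaseTransport, send_event() and send_audio() come from this document, while RecordingTransport is a hypothetical in-memory stand-in that can be handy in unit tests.

```python
import asyncio
import json
from abc import ABC, abstractmethod


class BaseTransport(ABC):
    """Interface every transport is assumed to expose."""

    @abstractmethod
    async def send_event(self, event: dict) -> None: ...

    @abstractmethod
    async def send_audio(self, pcm_bytes: bytes) -> None: ...


class RecordingTransport(BaseTransport):
    """Hypothetical test double: records frames instead of writing to a socket."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.frames: list[tuple[str, object]] = []

    async def send_event(self, event: dict) -> None:
        async with self.lock:  # same locking discipline as a socket transport
            self.frames.append(("text", json.dumps(event)))

    async def send_audio(self, pcm_bytes: bytes) -> None:
        async with self.lock:
            self.frames.append(("binary", pcm_bytes))


async def demo() -> list[tuple[str, object]]:
    transport = RecordingTransport()
    # Send one event and one 20 ms audio chunk concurrently.
    await asyncio.gather(
        transport.send_event({"event": "answer"}),
        transport.send_audio(b"\x00\x00" * 320),
    )
    return transport.frames


frames = asyncio.run(demo())
```

A Session written against BaseTransport can then be tested without opening a real WebSocket.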
Session Management (core/session.py):
- Each WebSocket/WebRTC connection creates a Session with unique UUID
- Routes incoming JSON commands to handlers via parse_command() from models/commands.py
- Routes binary audio data to AudioPipeline
- Manages session state: created → invited → accepted → ringing → hungup
- Cleanup on disconnect
Audio Pipeline (core/pipeline.py):
- Processes audio through VAD (Voice Activity Detection)
- Emits events to global event bus when VAD state changes
- Supports interruption for barge-in scenarios
Event Bus (core/events.py):
- Global pub/sub system for inter-component communication
- Subscribe to specific event types (speaking, silence, error)
- Async notification to all subscribers
Protocol Compatibility
The implementation must maintain protocol compatibility with the original Rust API. All commands and events are strictly defined in:
- models/commands.py - Command models (invite, accept, reject, tts, play, interrupt, hangup, chat)
- models/events.py - Event models (answer, speaking, silence, trackStart, trackEnd, error)
- models/config.py - Configuration models (CallOption, VADOption, TTSOption, ASROption, etc.)
Important: Always use parse_command() from models/commands.py to parse incoming JSON - never manually parse command strings. This ensures type safety and validation.
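To illustrate why a single parsing entry point helps, here is a minimal sketch of a registry-based parser. It is not the repository's implementation: the real models are Pydantic classes, whereas this example uses stdlib dataclasses as a stand-in, and the field names (command, option, text, reason) and the assumption that parse_command() accepts a raw JSON string are illustrative.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class InviteCommand:
    command: str
    option: Optional[dict] = None  # assumed field name


@dataclass
class TtsCommand:
    command: str
    text: str  # assumed to be required


@dataclass
class HangupCommand:
    command: str
    reason: Optional[str] = None  # assumed field name


# Maps the "command" discriminator to its model, mirroring the COMMAND_TYPES idea.
COMMAND_TYPES = {
    "invite": InviteCommand,
    "tts": TtsCommand,
    "hangup": HangupCommand,
}


def parse_command(raw: str):
    """Parse a JSON text frame into a typed command or raise ValueError."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("command must be a JSON object")
    name = data.get("command")
    model = COMMAND_TYPES.get(name)
    if model is None:
        raise ValueError(f"unknown command: {name!r}")
    try:
        return model(**data)
    except TypeError as exc:  # missing or unexpected fields
        raise ValueError(f"invalid fields for {name!r}: {exc}") from exc
```

Handlers then receive a typed object and never inspect raw strings, so malformed input fails in one place.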
WebSocket Protocol (/ws endpoint)
Mixed Frame Handling:
- Text frames → JSON commands (invite, tts, play, interrupt, hangup, etc.)
- Binary frames → Raw PCM audio (16kHz, 16-bit, mono)
Flow:
- Client connects and sends invite command with codec configuration
- Server responds with answer event
- Client streams binary audio frames
- Server processes audio and emits events (speaking, silence)
- Client can send commands at any time (tts, play, interrupt, hangup)
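The mixed-frame handling above can be sketched as a small classifier over the messages returned by the WebSocket receive() call. The message shapes follow the ASGI WebSocket convention (a type key plus text or bytes); the function name classify_frame and its return labels are hypothetical and not taken from the repository.

```python
def classify_frame(message: dict) -> tuple[str, object]:
    """Decide how one received ASGI WebSocket message should be routed."""
    if message.get("type") == "websocket.disconnect":
        return ("disconnect", message.get("code"))
    if message.get("text") is not None:
        # Text frame: a JSON command for the session's command parser.
        return ("command", message["text"])
    if message.get("bytes") is not None:
        # Binary frame: raw PCM for the audio pipeline.
        return ("audio", message["bytes"])
    return ("ignore", None)
```

An endpoint loop would call this on each received message, pass command payloads to the session's text handler and audio payloads to its audio handler, and break on disconnect.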
WebRTC Protocol (/webrtc endpoint)
Signaling Flow:
- Client connects via WebSocket
- Client sends SDP offer (JSON with sdp and type fields)
- Server creates RTCPeerConnection and generates SDP answer
- Server responds with answer event containing SDP
- WebRTC media flows via UDP (managed by aiortc)
- Commands can be sent via WebSocket text frames at any time
Audio Track Handling:
- When pc.on("track") fires, wrap received track with Resampled16kTrack
- Pull frames from track and convert to bytes
- Feed bytes to session.handle_audio()
Session Lifecycle
1. Connection → WebSocket/WebRTC endpoint accepts
2. Session creation → New Session(uuid, transport)
3. Invite → Client sends invite command
4. Answer → Server sends answer event
5. Audio streaming → Client sends binary audio / WebRTC media
6. Commands → Client sends JSON commands (tts, play, interrupt)
7. Hangup → Client sends hangup command OR connection closes
8. Cleanup → Session cleanup, remove from active_sessions
Optional Dependencies
The following dependencies are optional - the code gracefully degrades without them:
- aiortc + av (PyAV) - Required for WebRTC functionality. Without them:
  - /webrtc endpoint will reject connections
  - WebRTC transport cannot be used
  - WebSocket endpoint still works fine
- onnxruntime - Required for VAD functionality. Without it:
  - VAD always returns "Speech" with probability 1.0
  - speaking/silence events are still emitted but are not accurate
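The degradation described above is commonly built from a guarded import plus a fallback object. The sketch below assumes that pattern; PassthroughVAD, create_vad and the real_factory parameter are hypothetical names, and the construction of the actual Silero-backed detector is deliberately left to the caller because it is not shown in this document.

```python
from typing import Callable, Optional

try:
    import onnxruntime  # noqa: F401  (only probing for availability)
    HAS_ONNX = True
except ImportError:
    HAS_ONNX = False


class PassthroughVAD:
    """Fallback detector: reports every chunk as speech with probability 1.0."""

    def process(self, pcm_bytes: bytes) -> tuple[str, float]:
        return ("Speech", 1.0)


def create_vad(real_factory: Optional[Callable[[], object]] = None):
    """Return the real detector when possible, otherwise the fallback."""
    if HAS_ONNX and real_factory is not None:
        return real_factory()
    return PassthroughVAD()


vad = create_vad()  # no factory supplied, so this is always the fallback
label, probability = vad.process(b"\x00\x00" * 320)
```

Keeping the probe at import time means the rest of the pipeline can call process() without checking which detector it received.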
Important Implementation Details
Thread Safety in WebSocket Transport
The SocketTransport holds an asyncio.Lock around every send because FastAPI WebSocket's send_text() and send_bytes() are not safe to call concurrently from multiple tasks. Without the lock, text and binary frames sent in quick succession can interleave, causing protocol violations. Note that asyncio.Lock coordinates coroutines on a single event loop; it does not protect against calls made from other OS threads.
    async def send_event(self, event: dict):
        async with self.lock:  # Serializes concurrent sends
            await self.ws.send_text(json.dumps(event))

    async def send_audio(self, pcm_bytes: bytes):
        async with self.lock:
            await self.ws.send_bytes(pcm_bytes)
Event Bus Usage
Components subscribe to event types and are notified asynchronously:
event_bus = get_event_bus()
# Subscribe to speaking events
event_bus.subscribe("speaking", my_callback)
# Publish events
await event_bus.publish("speaking", {"trackId": session_id, "probability": 0.9})
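For orientation, a pub/sub bus with the subscribe/publish shape used above could be sketched as follows. This is an assumption-laden illustration rather than the contents of core/events.py: in particular, the callback signature (event type plus payload) and the choice to isolate failing subscribers are guesses.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

Callback = Callable[[str, dict], Awaitable[None]]  # assumed signature


class EventBus:
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callback]] = defaultdict(list)

    def subscribe(self, event_type: str, callback: Callback) -> None:
        self._subscribers[event_type].append(callback)

    async def publish(self, event_type: str, payload: dict) -> int:
        """Notify all subscribers concurrently; return how many were called."""
        callbacks = list(self._subscribers.get(event_type, []))
        # return_exceptions=True keeps one failing subscriber from
        # cancelling delivery to the others.
        await asyncio.gather(
            *(cb(event_type, payload) for cb in callbacks),
            return_exceptions=True,
        )
        return len(callbacks)


received: list[tuple[str, str]] = []


async def on_speaking(event_type: str, payload: dict) -> None:
    received.append((event_type, payload["trackId"]))


async def demo() -> None:
    bus = EventBus()
    bus.subscribe("speaking", on_speaking)
    await bus.publish("speaking", {"trackId": "abc", "probability": 0.9})
    await bus.publish("silence", {"trackId": "abc"})  # no subscribers


asyncio.run(demo())
```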
Error Handling Pattern
All errors are sent as error events to the client:
await self.transport.send_event({
    "event": "error",
    "trackId": self.current_track_id,
    "timestamp": self._get_timestamp_ms(),
    "sender": "server",  # or "asr", "tts", "media", etc.
    "error": error_message
})
Configuration Management
Configuration is loaded from:
- Environment variables
- .env file (gitignored)
- Default values in app/config.py
Never commit .env - it may contain sensitive keys. Use .env.example as a template.
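A layered lookup of this kind can be sketched with the standard library alone. The example assumes that environment variables take precedence over built-in defaults; the Settings class, the load_settings name and the default values are illustrative (HOST, PORT and LOG_LEVEL are the variable names used elsewhere in this file), and app/config.py may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    host: str = "0.0.0.0"    # assumed default
    port: int = 8000         # assumed default
    log_level: str = "INFO"  # assumed default


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from an environment mapping, falling back to defaults."""
    # If python-dotenv is used, calling load_dotenv() before this point
    # copies .env entries into os.environ without overriding existing ones.
    env = os.environ if env is None else env
    defaults = Settings()
    return Settings(
        host=env.get("HOST", defaults.host),
        port=int(env.get("PORT", defaults.port)),
        log_level=env.get("LOG_LEVEL", defaults.log_level).upper(),
    )


settings = load_settings({"PORT": "8080", "LOG_LEVEL": "debug"})
```

Passing an explicit mapping, as in the last line, keeps tests independent of the real process environment.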
Audio Format Specifications
Input/Output Audio:
- Sample rate: 16kHz
- Bit depth: 16-bit (PCM)
- Channels: Mono
- Chunk size: 640 bytes (20ms at 16kHz)
Format: Little-endian signed 16-bit integers (int16)
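The 640-byte figure follows directly from the format: 16,000 samples per second times 0.020 seconds gives 320 samples, and each mono sample occupies 2 bytes. The sketch below reproduces that arithmetic and packs one chunk of a sine tone as little-endian int16; the sine_chunk helper and its parameters are hypothetical and only meant for generating test audio.

```python
import math
import struct

SAMPLE_RATE = 16_000   # Hz
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono
FRAME_MS = 20

SAMPLES_PER_CHUNK = SAMPLE_RATE * FRAME_MS // 1000
CHUNK_BYTES = SAMPLES_PER_CHUNK * BYTES_PER_SAMPLE * CHANNELS


def sine_chunk(freq_hz: float = 440.0, amplitude: float = 0.5,
               start_sample: int = 0) -> bytes:
    """Return one 20 ms chunk of a sine tone as little-endian int16 PCM."""
    samples = [
        int(amplitude * 32767
            * math.sin(2 * math.pi * freq_hz * (start_sample + n) / SAMPLE_RATE))
        for n in range(SAMPLES_PER_CHUNK)
    ]
    # "<" selects little-endian byte order, "h" is a signed 16-bit integer.
    return struct.pack(f"<{SAMPLES_PER_CHUNK}h", *samples)


chunk = sine_chunk()
```

Advancing start_sample by SAMPLES_PER_CHUNK for each successive call keeps the tone phase-continuous across chunks.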
Key Files Reference
When working with this codebase, these files are the most critical:
- app/main.py - FastAPI endpoints, session lifecycle, event hooks
- core/transports.py - Transport abstraction and WebSocket/WebRTC handling
- core/session.py - Command routing, session state management
- core/pipeline.py - Audio processing, VAD integration, event emission
- models/commands.py - Protocol command definitions and parsing
- models/events.py - Protocol event definitions
- processors/vad.py - Silero VAD implementation
- reference/active-call/docs/api.md - Complete API specification from original Rust implementation
Testing Strategy
When implementing new features:
- Unit tests - Test individual components (transports, session, pipeline)
- Integration tests - Test endpoint behavior with test clients
- Protocol tests - Verify commands/events match API specification
- Manual testing - Use scripts/test_websocket.py and scripts/test_webrtc.py
Reference Implementations
- Original Rust implementation: reference/active-call/ - Complete feature set with SIP, ASR, TTS
- Python reference: reference/py-active-call/ - Partial implementation with bot integration
Use these as references for:
- Protocol specification details
- Architecture patterns
- Testing approaches
- Edge case handling
Common Patterns
Creating a new command:
- Add model to models/commands.py
- Add to COMMAND_TYPES dict
- Add handler method in core/session.py (e.g., _handle_mycommand)
- Route in Session.handle_text() under the command type
Adding a new event:
- Add model to models/events.py
- Add to EVENT_TYPES dict
- Emit via transport.send_event() or event_bus.publish()
Adding a new processor:
- Create in processors/myprocessor.py
- Integrate into core/pipeline.py AudioPipeline
- Emit events through event bus
Session State Management
Sessions track state through these transitions:
- created - Initial state
- invited - Invite command received
- accepted - Accept command received
- ringing - Ringing command sent
- hungup - Hangup command or disconnect
The state attribute is updated in each handler and logged for debugging.
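One way to make these transitions explicit is a small state tracker that rejects backward moves. This is a sketch under assumptions: the document lists the states and their order but not the allowed transitions, so the rule used here (a session may only move forward in the listed order, and hungup is reachable from any earlier state) is a guess, and SessionStateTracker is a hypothetical name.

```python
from enum import Enum


class SessionState(str, Enum):
    CREATED = "created"
    INVITED = "invited"
    ACCEPTED = "accepted"
    RINGING = "ringing"
    HUNGUP = "hungup"


# Order as listed in this document; forward-only movement is an assumption.
ORDER = list(SessionState)


def can_transition(current: SessionState, target: SessionState) -> bool:
    return ORDER.index(target) > ORDER.index(current)


class SessionStateTracker:
    def __init__(self) -> None:
        self.state = SessionState.CREATED

    def transition(self, target: SessionState) -> None:
        if not can_transition(self.state, target):
            raise ValueError(
                f"illegal transition {self.state.value} -> {target.value}"
            )
        self.state = target


tracker = SessionStateTracker()
tracker.transition(SessionState.INVITED)
tracker.transition(SessionState.HUNGUP)  # disconnect can end any live session
```

Centralizing the check in one method also gives a single place to log each state change.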
Testing Endpoints Without Full Dependencies
The WebSocket endpoint (/ws) works without aiortc, av, or onnxruntime. Use this for testing core functionality:
# Install minimal dependencies
pip install fastapi uvicorn numpy pydantic python-dotenv loguru aiohttp
# Start server
uvicorn app.main:app
# Test with basic client
python scripts/test_websocket.py
The WebRTC endpoint requires aiortc+av (PyAV) which can be challenging to install on Windows. Consider Linux/macOS for full WebRTC development.
Logging
Logs are written to:
- Console (stdout) - Real-time output
- logs/active_call_YYYY-MM-DD.log - Rotated daily, retained for 7 days
Log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
Set via LOG_LEVEL environment variable or in .env.
Dependencies Note
On Windows with Python 3.11, aiortc and av (PyAV) may have installation issues due to:
- Missing C compilers
- Incompatible binary wheel versions
- FFmpeg/library dependencies
The code gracefully handles missing optional dependencies with try/except imports and runtime checks. Consider using Docker for consistent development environments.