Compare commits
22 Commits
kompfner-p
...
hush/claud
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
c9f5f44287 | ||
|
|
ecf2e69f3f | ||
|
|
1542d922e7 | ||
|
|
15d5d1159e | ||
|
|
884630a6bd | ||
|
|
1cf137c6a8 | ||
|
|
9c6b11cecf | ||
|
|
061a0dc43d | ||
|
|
328bbe069f | ||
|
|
dc32ecc872 | ||
|
|
f94a60f381 | ||
|
|
a4acc12f91 | ||
|
|
e93112e76e | ||
|
|
680bcaac66 | ||
|
|
d2ac9006a2 | ||
|
|
bcb019e8ab | ||
|
|
4ea546785f | ||
|
|
3b3c7aa8cc | ||
|
|
0b1a4792b8 | ||
|
|
14bd3b1b32 | ||
|
|
f733e77496 | ||
|
|
38506f51f7 |
8
.claude/.gitignore
vendored
Normal file
8
.claude/.gitignore
vendored
Normal file
@@ -0,0 +1,8 @@
|
||||
# Claude Code temporary files
|
||||
*.tmp
|
||||
*.log
|
||||
.claude-cache/
|
||||
|
||||
# OS files
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
200
.claude/QUICKSTART.md
Normal file
200
.claude/QUICKSTART.md
Normal file
@@ -0,0 +1,200 @@
|
||||
# Claude Code Quick Start for Pipecat
|
||||
|
||||
This guide helps you get started using Claude Code with the Pipecat project.
|
||||
|
||||
## Initial Setup
|
||||
|
||||
1. **Install Claude Code** (if not already installed):
|
||||
```bash
|
||||
# Follow instructions at https://claude.ai/claude-code
|
||||
```
|
||||
|
||||
2. **Install project dependencies**:
|
||||
```bash
|
||||
uv sync --group dev --all-extras --no-extra gstreamer --no-extra krisp --no-extra local
|
||||
```
|
||||
|
||||
3. **Install pre-commit hooks**:
|
||||
```bash
|
||||
uv run pre-commit install
|
||||
```
|
||||
|
||||
## Common Commands
|
||||
|
||||
### Testing
|
||||
- "Run all tests"
|
||||
- "Run tests for [specific file]"
|
||||
- "Run tests and show coverage"
|
||||
|
||||
### Code Quality
|
||||
- "Format the code"
|
||||
- "Fix linting issues"
|
||||
- "Run type checking"
|
||||
- "Run pre-commit hooks"
|
||||
|
||||
### Development
|
||||
- "Add a new TTS service for [provider]"
|
||||
- "Create a new processor that [does something]"
|
||||
- "Add a new frame type for [purpose]"
|
||||
- "Document the [ClassName] class" (uses `/docstring` skill)
|
||||
|
||||
### Documentation
|
||||
- "Document this module using Google style"
|
||||
- "Add docstrings to [file/class]"
|
||||
- Use `/docstring ClassName` for comprehensive class documentation
|
||||
|
||||
### Git Operations
|
||||
- "Create a commit for these changes"
|
||||
- "Create a pull request"
|
||||
- Use `/pr-description` skill for detailed PR descriptions
|
||||
- Use `/changelog` skill for changelog entries
|
||||
|
||||
## Custom Skills
|
||||
|
||||
### `/docstring [ClassName]`
|
||||
Automatically documents a Python class and its methods following Google-style conventions.
|
||||
|
||||
**Example:**
|
||||
```
|
||||
/docstring AudioProcessor
|
||||
```
|
||||
|
||||
This will:
|
||||
- Find the class in the codebase
|
||||
- Add module docstring if missing
|
||||
- Add class docstring with purpose and event handlers
|
||||
- Document all public methods
|
||||
- Document constructor parameters
|
||||
- Skip private methods and already-documented code
|
||||
|
||||
### `/changelog`
|
||||
Generates changelog entries using towncrier.
|
||||
|
||||
### `/pr-description`
|
||||
Creates comprehensive pull request descriptions based on your changes.
|
||||
|
||||
## Project-Specific Tips
|
||||
|
||||
### Understanding Pipecat Architecture
|
||||
|
||||
When asking Claude Code to help with development:
|
||||
|
||||
1. **Frame-Based System**: All data flows through frames
|
||||
- Ask: "Explain how frames work in this pipeline"
|
||||
- Reference: `src/pipecat/frames/frames.py`
|
||||
|
||||
2. **Processor Pattern**: Everything is a processor
|
||||
- Ask: "Show me how to create a custom processor"
|
||||
- Reference: `src/pipecat/processors/frame_processor.py`
|
||||
|
||||
3. **Service Integrations**: Many AI service integrations
|
||||
- Ask: "How do I add a new TTS service?"
|
||||
- Reference: `src/pipecat/services/tts/`
|
||||
|
||||
### Working with Examples
|
||||
|
||||
- "Show me examples of [feature]"
|
||||
- "Create a simple example that [does something]"
|
||||
- Examples are in `examples/foundational/` (building blocks) and `examples/` (complete apps)
|
||||
|
||||
### Debugging
|
||||
|
||||
- "Help me debug this pipeline"
|
||||
- "Why isn't my processor receiving frames?"
|
||||
- "Trace the flow of this frame type through the pipeline"
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Be Specific**: Instead of "fix this", say "fix the audio dropouts in the TTS processor"
|
||||
|
||||
2. **Context**: Provide context about what you're building
|
||||
- "I'm building a voice assistant that needs to interrupt TTS"
|
||||
- "I want to add vision capabilities to this chatbot"
|
||||
|
||||
3. **Reference Examples**: Point to existing patterns
|
||||
- "Similar to how DeepgramTTS works"
|
||||
- "Following the pattern in OpenAILLMService"
|
||||
|
||||
4. **Test-Driven**: Ask for tests
|
||||
- "Create tests for this processor"
|
||||
- "Add test coverage for the error handling"
|
||||
|
||||
5. **Documentation**: Keep docs updated
|
||||
- "Update the docstrings for these changes"
|
||||
- "Add a usage example to the class docstring"
|
||||
|
||||
## Example Conversations
|
||||
|
||||
### Adding a New Feature
|
||||
```
|
||||
You: "I need to add a processor that detects when the user says 'hello' and triggers an event"
|
||||
|
||||
Claude Code will:
|
||||
1. Create the processor class
|
||||
2. Implement frame processing logic
|
||||
3. Add event emission
|
||||
4. Create tests
|
||||
5. Add documentation
|
||||
```
|
||||
|
||||
### Debugging an Issue
|
||||
```
|
||||
You: "The audio is cutting out in my pipeline. Here's the code: [paste code]"
|
||||
|
||||
Claude Code will:
|
||||
1. Analyze the pipeline structure
|
||||
2. Check for common issues (buffer sizes, async handling, etc.)
|
||||
3. Suggest fixes
|
||||
4. Explain the root cause
|
||||
```
|
||||
|
||||
### Refactoring
|
||||
```
|
||||
You: "Refactor the XYZ service to use the new WebSocket pattern from ABC service"
|
||||
|
||||
Claude Code will:
|
||||
1. Analyze both services
|
||||
2. Identify the pattern differences
|
||||
3. Apply the refactoring
|
||||
4. Update tests
|
||||
5. Maintain backward compatibility if needed
|
||||
```
|
||||
|
||||
## Useful Prompts
|
||||
|
||||
- "Explain how [feature] works in this codebase"
|
||||
- "Add error handling for [scenario]"
|
||||
- "Create an example that demonstrates [feature]"
|
||||
- "Optimize this processor for [use case]"
|
||||
- "Add logging to help debug [issue]"
|
||||
- "Make this code more maintainable"
|
||||
- "Add type hints to this file"
|
||||
- "Create a comprehensive test suite for [component]"
|
||||
|
||||
## Configuration Reference
|
||||
|
||||
All Claude Code settings are in [.claude/settings.json](.claude/settings.json):
|
||||
- Project commands (test, lint, format, etc.)
|
||||
- Coding standards
|
||||
- File patterns
|
||||
- Important files and directories
|
||||
|
||||
For detailed architecture info, see [.claude/README.md](.claude/README.md).
|
||||
|
||||
## Getting Help
|
||||
|
||||
- **Project docs**: https://docs.pipecat.ai
|
||||
- **Discord**: https://discord.gg/pipecat
|
||||
- **GitHub Issues**: https://github.com/pipecat-ai/pipecat/issues
|
||||
- **Examples**: https://github.com/pipecat-ai/pipecat-examples
|
||||
|
||||
## Tips for Success
|
||||
|
||||
1. Start with small, specific tasks
|
||||
2. Use the custom skills (`/docstring`, `/pr-description`, etc.)
|
||||
3. Reference existing code patterns
|
||||
4. Ask for explanations when confused
|
||||
5. Request tests and documentation
|
||||
6. Run pre-commit hooks before committing
|
||||
|
||||
Happy coding with Claude! 🎙️🤖
|
||||
177
.claude/README.md
Normal file
177
.claude/README.md
Normal file
@@ -0,0 +1,177 @@
|
||||
# Claude Code Setup for Pipecat
|
||||
|
||||
This directory contains configuration and custom skills for working with the Pipecat project using Claude Code.
|
||||
|
||||
## Project Overview
|
||||
|
||||
Pipecat is an open-source Python framework for building real-time voice and multimodal conversational agents. It provides a composable, frame-based architecture for orchestrating audio, video, AI services, and conversation pipelines.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Core Concepts
|
||||
|
||||
1. **Frames** - The fundamental data units in Pipecat (audio, text, images, system messages, etc.)
|
||||
- Located in: `src/pipecat/frames/frames.py`
|
||||
- Different frame types for different data: `AudioRawFrame`, `TextFrame`, `ImageRawFrame`, etc.
|
||||
|
||||
2. **Processors** - Processing units that receive, transform, and emit frames
|
||||
- Base class: `src/pipecat/processors/frame_processor.py`
|
||||
- Can be chained to form pipelines
|
||||
- Examples: STT services, LLMs, TTS services, aggregators, etc.
|
||||
|
||||
3. **Pipelines** - Chains of processors that define data flow
|
||||
- Created using the `Pipeline` class
|
||||
- Processors linked using `link()` method or `|` operator
|
||||
|
||||
4. **Transports** - Handle input/output for audio/video streams
|
||||
- WebRTC (Daily), WebSocket, Local audio, etc.
|
||||
- Located in: `src/pipecat/transports/`
|
||||
|
||||
### Key Directories
|
||||
|
||||
- `src/pipecat/` - Main source code
|
||||
- `frames/` - Frame definitions and utilities
|
||||
- `processors/` - Base processors and common processors
|
||||
- `services/` - AI service integrations (STT, TTS, LLM, etc.)
|
||||
- `transports/` - Transport implementations
|
||||
- `audio/` - Audio processing utilities
|
||||
- `examples/` - Example applications and foundational examples
|
||||
- `tests/` - Test suite
|
||||
- `docs/` - Documentation source
|
||||
|
||||
## Development Workflow
|
||||
|
||||
### Setup
|
||||
|
||||
```bash
|
||||
# Install dependencies
|
||||
uv sync --group dev --all-extras --no-extra gstreamer --no-extra krisp --no-extra local
|
||||
|
||||
# Install pre-commit hooks
|
||||
uv run pre-commit install
|
||||
```
|
||||
|
||||
### Running Tests
|
||||
|
||||
```bash
|
||||
# All tests
|
||||
uv run pytest
|
||||
|
||||
# Specific test file
|
||||
uv run pytest tests/test_name.py
|
||||
|
||||
# With coverage
|
||||
uv run coverage run --module pytest
|
||||
uv run coverage report
|
||||
```
|
||||
|
||||
### Code Quality
|
||||
|
||||
```bash
|
||||
# Format code
|
||||
uv run ruff format .
|
||||
|
||||
# Lint code
|
||||
uv run ruff check .
|
||||
|
||||
# Fix linting issues
|
||||
uv run ruff check --fix .
|
||||
|
||||
# Type checking
|
||||
uv run pyright
|
||||
|
||||
# Run all pre-commit hooks
|
||||
uv run pre-commit run --all-files
|
||||
```
|
||||
|
||||
### Building
|
||||
|
||||
```bash
|
||||
# Build package
|
||||
uv build
|
||||
```
|
||||
|
||||
## Custom Skills
|
||||
|
||||
This project includes custom Claude Code skills:
|
||||
|
||||
### `/docstring`
|
||||
Document Python modules and classes using Google-style docstrings.
|
||||
|
||||
Usage: `/docstring ClassName`
|
||||
|
||||
### `/changelog`
|
||||
Generate changelog entries using towncrier.
|
||||
|
||||
### `/pr-description`
|
||||
Generate comprehensive PR descriptions based on changes.
|
||||
|
||||
## Coding Standards
|
||||
|
||||
1. **Docstrings** - Use Google-style docstrings for all public APIs
|
||||
- Module docstrings required
|
||||
- Class docstrings with purpose and event handlers
|
||||
- Method docstrings with Args/Returns/Raises
|
||||
- Constructor (`__init__`) must document all parameters
|
||||
|
||||
2. **Type Hints** - Required for all function signatures
|
||||
- Use `from typing import ...` for complex types
|
||||
- Dataclasses should have field type annotations
|
||||
|
||||
3. **Async/Await** - Consistent use of async patterns
|
||||
- Most processors use async methods
|
||||
- Tests use pytest-asyncio
|
||||
|
||||
4. **Code Style**
|
||||
- Line length: 100 characters max
|
||||
- Ruff for linting and formatting
|
||||
- Follow existing patterns in the codebase
|
||||
|
||||
5. **Testing**
|
||||
- Write tests for new features
|
||||
- Use pytest fixtures for common setups
|
||||
- Mock external services when appropriate
|
||||
|
||||
## Contributing
|
||||
|
||||
1. Fork the repository
|
||||
2. Create a feature branch
|
||||
3. Make changes following coding standards
|
||||
4. Add tests for new functionality
|
||||
5. Run pre-commit hooks: `uv run pre-commit run --all-files`
|
||||
6. Submit a pull request
|
||||
|
||||
## Common Tasks
|
||||
|
||||
### Adding a New Service Integration
|
||||
|
||||
1. Create service file in `src/pipecat/services/<category>/`
|
||||
2. Inherit from appropriate base class (e.g., `TTSService`, `LLMService`)
|
||||
3. Implement required abstract methods
|
||||
4. Add service to `pyproject.toml` optional dependencies
|
||||
5. Add documentation
|
||||
6. Add tests in `tests/`
|
||||
|
||||
### Adding a New Processor
|
||||
|
||||
1. Create processor in `src/pipecat/processors/`
|
||||
2. Inherit from `FrameProcessor` or appropriate subclass
|
||||
3. Override `process_frame()` method
|
||||
4. Handle relevant frame types
|
||||
5. Emit frames using `await self.push_frame()`
|
||||
6. Add tests
|
||||
|
||||
### Adding a New Frame Type
|
||||
|
||||
1. Add frame definition to `src/pipecat/frames/frames.py`
|
||||
2. Inherit from appropriate base frame class
|
||||
3. Use `@dataclass` decorator for data frames
|
||||
4. Document the frame type and its fields
|
||||
5. Update processors that should handle this frame type
|
||||
|
||||
## Resources
|
||||
|
||||
- [Documentation](https://docs.pipecat.ai)
|
||||
- [GitHub Repository](https://github.com/pipecat-ai/pipecat)
|
||||
- [Examples](https://github.com/pipecat-ai/pipecat-examples)
|
||||
- [Discord Community](https://discord.gg/pipecat)
|
||||
81
.claude/settings.json
Normal file
81
.claude/settings.json
Normal file
@@ -0,0 +1,81 @@
|
||||
{
|
||||
"description": "Pipecat - Open-source Python framework for real-time voice and multimodal AI agents",
|
||||
"conventions": {
|
||||
"language": "Python",
|
||||
"version": ">=3.10",
|
||||
"package_manager": "uv",
|
||||
"code_style": "Google docstrings, Ruff formatting",
|
||||
"test_framework": "pytest",
|
||||
"async": true
|
||||
},
|
||||
"project_info": {
|
||||
"type": "python_library",
|
||||
"framework": "pipecat",
|
||||
"main_source": "src/pipecat",
|
||||
"examples": "examples/",
|
||||
"tests": "tests/",
|
||||
"docs": "docs/"
|
||||
},
|
||||
"commands": {
|
||||
"install": "uv sync --group dev --all-extras --no-extra gstreamer --no-extra krisp --no-extra local",
|
||||
"test": "uv run pytest",
|
||||
"test_file": "uv run pytest {file}",
|
||||
"lint": "uv run ruff check .",
|
||||
"lint_fix": "uv run ruff check --fix .",
|
||||
"format": "uv run ruff format .",
|
||||
"format_check": "uv run ruff format --check .",
|
||||
"type_check": "uv run pyright",
|
||||
"pre_commit": "uv run pre-commit run --all-files",
|
||||
"build": "uv build",
|
||||
"changelog": "uv run towncrier build --version {version}"
|
||||
},
|
||||
"coding_standards": [
|
||||
"Use Google-style docstrings for all public classes and methods",
|
||||
"Follow Ruff linting rules (see pyproject.toml)",
|
||||
"Maintain type hints for all function signatures",
|
||||
"Use async/await patterns consistently",
|
||||
"Keep line length at 100 characters maximum",
|
||||
"Use dataclasses with type annotations for configuration classes",
|
||||
"Prefer composition over inheritance where appropriate",
|
||||
"Write comprehensive pytest tests with asyncio support",
|
||||
"Document event handlers in class docstrings with Example:: sections"
|
||||
],
|
||||
"file_patterns": {
|
||||
"source_files": "src/pipecat/**/*.py",
|
||||
"test_files": "tests/**/*.py",
|
||||
"example_files": "examples/**/*.py",
|
||||
"config_files": "pyproject.toml"
|
||||
},
|
||||
"important_files": [
|
||||
"pyproject.toml - Project configuration and dependencies",
|
||||
"CONTRIBUTING.md - Contributing guidelines",
|
||||
"README.md - Project overview and quick start",
|
||||
"src/pipecat/__init__.py - Main package exports",
|
||||
"src/pipecat/frames/frames.py - Core frame definitions",
|
||||
"src/pipecat/processors/frame_processor.py - Base processor class"
|
||||
],
|
||||
"documentation": {
|
||||
"style": "Google",
|
||||
"build_command": "cd docs && make html",
|
||||
"skip_private_methods": true,
|
||||
"skip_simple_dunders": true,
|
||||
"require_module_docstrings": true,
|
||||
"require_class_docstrings": true,
|
||||
"require_init_docstrings": true
|
||||
},
|
||||
"git": {
|
||||
"main_branch": "main",
|
||||
"commit_style": "Conventional Commits",
|
||||
"pre_commit_hooks": true
|
||||
},
|
||||
"ai_assistance_notes": [
|
||||
"This project is a real-time voice and multimodal AI framework",
|
||||
"Core concepts: Frames (data units), Processors (processing units), Pipelines (chains of processors)",
|
||||
"Heavy use of async/await for real-time processing",
|
||||
"WebRTC and WebSocket transports for audio/video streaming",
|
||||
"Integration with many AI services (OpenAI, Anthropic, Deepgram, ElevenLabs, etc.)",
|
||||
"Frame-based architecture allows composable, modular pipeline construction",
|
||||
"Tests use pytest-asyncio for async test support",
|
||||
"Pre-commit hooks enforce code quality (run 'uv run pre-commit install')"
|
||||
]
|
||||
}
|
||||
7
.gitignore
vendored
7
.gitignore
vendored
@@ -61,4 +61,9 @@ docs/api/api
|
||||
.python-version
|
||||
|
||||
# Pipecat
|
||||
whisker_setup.py
|
||||
whisker_setup.py
|
||||
|
||||
# Claude Code - exclude temporary files but keep configuration
|
||||
.claude/.claude-cache/
|
||||
.claude/**/*.tmp
|
||||
.claude/**/*.log
|
||||
1
changelog/3406.fixed.md
Normal file
1
changelog/3406.fixed.md
Normal file
@@ -0,0 +1 @@
|
||||
- Fixed an issue where if you were using `OpenRouterLLMService` with a Gemini model, it wouldn't handle multiple `"system"` messages as expected (and as we do in `GoogleLLMService`), which is to convert subsequent ones into `"user"` messages. Instead, the latest `"system"` message would overwrite the previous ones.
|
||||
1
changelog/3495.changed.2.md
Normal file
1
changelog/3495.changed.2.md
Normal file
@@ -0,0 +1 @@
|
||||
- `SarvamSTTService` now defaults `vad_signals` and `high_vad_sensitivity` to `None` (omitted from connection parameters), improving latency by ~300ms compared to the previous defaults.
|
||||
1
changelog/3495.changed.md
Normal file
1
changelog/3495.changed.md
Normal file
@@ -0,0 +1 @@
|
||||
- Improved the STT TTFB (Time To First Byte) measurement, reporting the delay between when the user stops speaking and when the final transcription is received. Note: Unlike traditional TTFB which measures from a discrete request, STT services receive continuous audio input—so we measure from speech end to final transcript, which captures the latency that matters for voice AI applications. In support of this change, added `finalized` field to `TranscriptionFrame` to indicate when a transcript is the final result for an utterance.
|
||||
@@ -1 +1 @@
|
||||
- Added new `SMART_TURN_LOG_DATA` environment variable, which causes Smart Turn input data to be saved to disk
|
||||
- Added new `PIPECAT_SMART_TURN_LOG_DATA` environment variable, which causes Smart Turn input data to be saved to disk
|
||||
|
||||
@@ -4,7 +4,7 @@ This directory contains examples showing how to build voice and multimodal agent
|
||||
|
||||
## Setup
|
||||
|
||||
1. Follow the [README](../../README.md#%EF%B8%8F-contributing-to-the-framework) steps to get your local environment configured.
|
||||
1. Follow the [README](https://github.com/pipecat-ai/pipecat/blob/main/README.md#%EF%B8%8F-contributing-to-the-framework) steps to get your local environment configured.
|
||||
|
||||
> **Run from root directory**: Make sure you are running the steps from the root directory.
|
||||
|
||||
@@ -140,4 +140,4 @@ uv run python <example-name> --host 0.0.0.0 --port 8080
|
||||
- **Connection errors**: Verify API keys in `.env` file
|
||||
- **Port conflicts**: Use `--port` to change the port
|
||||
|
||||
For more examples, visit our the [`pipecat-examples repository](https://github.com/pipecat-ai/pipecat-examples).
|
||||
For more examples, visit our the [pipecat-examples repository](https://github.com/pipecat-ai/pipecat-examples).
|
||||
|
||||
@@ -54,7 +54,7 @@ assemblyai = [ "pipecat-ai[websockets-base]" ]
|
||||
asyncai = [ "pipecat-ai[websockets-base]" ]
|
||||
aws = [ "aioboto3~=15.5.0", "pipecat-ai[websockets-base]" ]
|
||||
aws-nova-sonic = [ "aws_sdk_bedrock_runtime~=0.2.0; python_version>='3.12'" ]
|
||||
azure = [ "azure-cognitiveservices-speech~=1.44.0"]
|
||||
azure = [ "azure-cognitiveservices-speech~=1.47.0"]
|
||||
cartesia = [ "cartesia~=2.0.3", "pipecat-ai[websockets-base]" ]
|
||||
camb = [ "camb-sdk>=1.5.4" ]
|
||||
cerebras = []
|
||||
|
||||
@@ -49,7 +49,7 @@ class LocalSmartTurnAnalyzerV3(BaseSmartTurn):
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
|
||||
self._log_data = env_truthy("SMART_TURN_LOG_DATA", default=False)
|
||||
self._log_data = env_truthy("PIPECAT_SMART_TURN_LOG_DATA", default=False)
|
||||
|
||||
if not smart_turn_model_path:
|
||||
# Load bundled model
|
||||
|
||||
@@ -426,12 +426,15 @@ class TranscriptionFrame(TextFrame):
|
||||
timestamp: When the transcription occurred.
|
||||
language: Detected or specified language of the speech.
|
||||
result: Raw result from the STT service.
|
||||
finalized: Whether this is the final transcription for an utterance.
|
||||
Set by STT services that support commit/finalize signals.
|
||||
"""
|
||||
|
||||
user_id: str
|
||||
timestamp: str
|
||||
language: Optional[Language] = None
|
||||
result: Optional[Any] = None
|
||||
finalized: bool = False
|
||||
|
||||
def __str__(self):
|
||||
return f"{self.name}(user: {self.user_id}, text: [{self.text}], language: {self.language}, timestamp: {self.timestamp})"
|
||||
|
||||
@@ -833,7 +833,7 @@ class LLMAssistantAggregator(LLMContextAggregator):
|
||||
"id": frame.tool_call_id,
|
||||
"function": {
|
||||
"name": frame.function_name,
|
||||
"arguments": json.dumps(frame.arguments),
|
||||
"arguments": json.dumps(frame.arguments, ensure_ascii=False),
|
||||
},
|
||||
"type": "function",
|
||||
}
|
||||
@@ -866,7 +866,7 @@ class LLMAssistantAggregator(LLMContextAggregator):
|
||||
|
||||
# Update context with the function call result
|
||||
if frame.result:
|
||||
result = json.dumps(frame.result)
|
||||
result = json.dumps(frame.result, ensure_ascii=False)
|
||||
self._update_function_call_result(frame.function_name, frame.tool_call_id, result)
|
||||
else:
|
||||
self._update_function_call_result(frame.function_name, frame.tool_call_id, "COMPLETED")
|
||||
|
||||
@@ -161,7 +161,7 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self.start_ttfb_metrics()
|
||||
pass
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
if (
|
||||
self._vad_force_turn_endpoint
|
||||
@@ -354,7 +354,6 @@ class AssemblyAISTTService(WebsocketSTTService):
|
||||
"""Handle transcription results."""
|
||||
if not message.transcript:
|
||||
return
|
||||
await self.stop_ttfb_metrics()
|
||||
if message.end_of_turn and (
|
||||
not self._connection_params.formatted_finals or message.turn_is_formatted
|
||||
):
|
||||
|
||||
@@ -158,7 +158,6 @@ class AWSTranscribeSTTService(WebsocketSTTService):
|
||||
await self._websocket.send(event_message)
|
||||
# Start metrics after first chunk sent
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Error sending audio: {e}")
|
||||
|
||||
@@ -470,7 +469,6 @@ class AWSTranscribeSTTService(WebsocketSTTService):
|
||||
is_final = not result.get("IsPartial", True)
|
||||
|
||||
if transcript:
|
||||
await self.stop_ttfb_metrics()
|
||||
if is_final:
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
|
||||
@@ -116,7 +116,6 @@ class AzureSTTService(STTService):
|
||||
"""
|
||||
try:
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
if self._audio_stream:
|
||||
self._audio_stream.write(audio)
|
||||
yield None
|
||||
@@ -191,7 +190,6 @@ class AzureSTTService(STTService):
|
||||
self, transcript: str, is_final: bool, language: Optional[Language] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing."""
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
def _on_handle_recognized(self, event):
|
||||
|
||||
@@ -90,7 +90,7 @@ class AzureBaseTTSService:
|
||||
emphasis: Emphasis level for speech ("strong", "moderate", "reduced").
|
||||
language: Language for synthesis. Defaults to English (US).
|
||||
pitch: Voice pitch adjustment (e.g., "+10%", "-5Hz", "high").
|
||||
rate: Speech rate multiplier. Defaults to "1.05".
|
||||
rate: Speech rate adjustment (e.g., "1.0", "1.25", "slow", "fast").
|
||||
role: Voice role for expression (e.g., "YoungAdultFemale").
|
||||
style: Speaking style (e.g., "cheerful", "sad", "excited").
|
||||
style_degree: Intensity of the speaking style (0.01 to 2.0).
|
||||
@@ -100,7 +100,7 @@ class AzureBaseTTSService:
|
||||
emphasis: Optional[str] = None
|
||||
language: Optional[Language] = Language.EN_US
|
||||
pitch: Optional[str] = None
|
||||
rate: Optional[str] = "1.05"
|
||||
rate: Optional[str] = None
|
||||
role: Optional[str] = None
|
||||
style: Optional[str] = None
|
||||
style_degree: Optional[str] = None
|
||||
@@ -185,7 +185,9 @@ class AzureBaseTTSService:
|
||||
if self._settings["volume"]:
|
||||
prosody_attrs.append(f"volume='{self._settings['volume']}'")
|
||||
|
||||
ssml += f"<prosody {' '.join(prosody_attrs)}>"
|
||||
# Only wrap in prosody tag if there are prosody attributes
|
||||
if prosody_attrs:
|
||||
ssml += f"<prosody {' '.join(prosody_attrs)}>"
|
||||
|
||||
if self._settings["emphasis"]:
|
||||
ssml += f"<emphasis level='{self._settings['emphasis']}'>"
|
||||
@@ -195,7 +197,8 @@ class AzureBaseTTSService:
|
||||
if self._settings["emphasis"]:
|
||||
ssml += "</emphasis>"
|
||||
|
||||
ssml += "</prosody>"
|
||||
if prosody_attrs:
|
||||
ssml += "</prosody>"
|
||||
|
||||
if self._settings["style"]:
|
||||
ssml += "</mstts:express-as>"
|
||||
@@ -277,6 +280,11 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
self._started = False
|
||||
self._first_chunk = True
|
||||
self._cumulative_audio_offset: float = 0.0 # Cumulative audio duration in seconds
|
||||
self._current_sentence_base_offset: float = 0.0 # Base offset for current sentence
|
||||
self._current_sentence_duration: float = 0.0 # Duration from Azure callback
|
||||
self._current_sentence_max_word_offset: float = (
|
||||
0.0 # Max word boundary offset seen in current sentence (for 8kHz workaround)
|
||||
)
|
||||
self._last_word: Optional[str] = None # Track last word for punctuation merging
|
||||
self._last_timestamp: Optional[float] = None # Track last timestamp
|
||||
|
||||
@@ -386,8 +394,14 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
word = evt.text
|
||||
sentence_relative_seconds = evt.audio_offset / 10_000_000.0
|
||||
|
||||
# Add cumulative offset to get absolute timestamp across sentences
|
||||
absolute_seconds = self._cumulative_audio_offset + sentence_relative_seconds
|
||||
# Use base offset captured at start of run_tts to avoid race conditions
|
||||
# with callbacks from overlapping TTS requests
|
||||
absolute_seconds = self._current_sentence_base_offset + sentence_relative_seconds
|
||||
|
||||
# Track max word offset for accurate cumulative timing
|
||||
# (audio_duration from Azure doesn't always match word boundary offsets at 8kHz)
|
||||
if sentence_relative_seconds > self._current_sentence_max_word_offset:
|
||||
self._current_sentence_max_word_offset = sentence_relative_seconds
|
||||
|
||||
if not word:
|
||||
return
|
||||
@@ -492,9 +506,9 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
self._last_word = None
|
||||
self._last_timestamp = None
|
||||
|
||||
# Update cumulative audio offset for next sentence
|
||||
# Store duration for cumulative offset calculation
|
||||
if evt.result and evt.result.audio_duration:
|
||||
self._cumulative_audio_offset += evt.result.audio_duration.total_seconds()
|
||||
self._current_sentence_duration = evt.result.audio_duration.total_seconds()
|
||||
|
||||
self._audio_queue.put_nowait(None) # Signal completion
|
||||
|
||||
@@ -530,6 +544,9 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
self._started = False
|
||||
self._first_chunk = True
|
||||
self._cumulative_audio_offset = 0.0
|
||||
self._current_sentence_base_offset = 0.0
|
||||
self._current_sentence_duration = 0.0
|
||||
self._current_sentence_max_word_offset = 0.0
|
||||
self._last_word = None
|
||||
self._last_timestamp = None
|
||||
|
||||
@@ -604,6 +621,12 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
self._started = True
|
||||
self._first_chunk = True
|
||||
|
||||
# Capture base offset BEFORE starting synthesis to avoid race conditions
|
||||
# Word boundary callbacks will use this value
|
||||
self._current_sentence_base_offset = self._cumulative_audio_offset
|
||||
self._current_sentence_duration = 0.0
|
||||
self._current_sentence_max_word_offset = 0.0
|
||||
|
||||
ssml = self._construct_ssml(text)
|
||||
self._speech_synthesizer.speak_ssml_async(ssml)
|
||||
await self.start_tts_usage_metrics(text)
|
||||
@@ -627,6 +650,16 @@ class AzureTTSService(WordTTSService, AzureBaseTTSService):
|
||||
)
|
||||
yield frame
|
||||
|
||||
# Update cumulative offset for next sentence
|
||||
# At 8kHz, Azure's audio_duration doesn't match word boundary offsets,
|
||||
# so we use max_word_offset as a workaround. At other sample rates,
|
||||
# audio_duration is accurate.
|
||||
# TODO: Remove after Azure fixes word boundary timing at 8kHz
|
||||
if self.sample_rate == 8000:
|
||||
self._cumulative_audio_offset += self._current_sentence_max_word_offset
|
||||
else:
|
||||
self._cumulative_audio_offset += self._current_sentence_duration
|
||||
|
||||
except Exception as e:
|
||||
yield ErrorFrame(error=f"Unknown error occurred: {e}")
|
||||
yield TTSStoppedFrame()
|
||||
|
||||
@@ -207,9 +207,8 @@ class CartesiaSTTService(WebsocketSTTService):
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def start_metrics(self):
|
||||
async def _start_metrics(self):
|
||||
"""Start performance metrics collection for transcription processing."""
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
@@ -222,7 +221,7 @@ class CartesiaSTTService(WebsocketSTTService):
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self.start_metrics()
|
||||
await self._start_metrics()
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
# Send finalize command to flush the transcription session
|
||||
if self._websocket and self._websocket.state is State.OPEN:
|
||||
@@ -342,7 +341,6 @@ class CartesiaSTTService(WebsocketSTTService):
|
||||
pass
|
||||
|
||||
if len(transcript) > 0:
|
||||
await self.stop_ttfb_metrics()
|
||||
if is_final:
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
|
||||
@@ -659,6 +659,8 @@ class DeepgramFluxSTTService(WebsocketSTTService):
|
||||
average_confidence = self._calculate_average_confidence(data)
|
||||
|
||||
if not self._params.min_confidence or average_confidence > self._params.min_confidence:
|
||||
# EndOfTurn means Flux has determined the turn is complete,
|
||||
# so this TranscriptionFrame is always finalized
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
transcript,
|
||||
@@ -666,6 +668,7 @@ class DeepgramFluxSTTService(WebsocketSTTService):
|
||||
time_now_iso8601(),
|
||||
self._language,
|
||||
result=data,
|
||||
finalized=True,
|
||||
)
|
||||
)
|
||||
else:
|
||||
|
||||
@@ -276,9 +276,8 @@ class DeepgramSTTService(STTService):
|
||||
# GH issue: https://github.com/deepgram/deepgram-python-sdk/issues/570
|
||||
await self._connection.finish()
|
||||
|
||||
async def start_metrics(self):
|
||||
"""Start TTFB and processing metrics collection."""
|
||||
await self.start_ttfb_metrics()
|
||||
async def _start_metrics(self):
|
||||
"""Start processing metrics collection for this utterance."""
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def _on_error(self, *args, **kwargs):
|
||||
@@ -292,7 +291,7 @@ class DeepgramSTTService(STTService):
|
||||
await self._connect()
|
||||
|
||||
async def _on_speech_started(self, *args, **kwargs):
|
||||
await self.start_metrics()
|
||||
await self._start_metrics()
|
||||
await self._call_event_handler("on_speech_started", *args, **kwargs)
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
if self._should_interrupt:
|
||||
@@ -320,8 +319,12 @@ class DeepgramSTTService(STTService):
|
||||
language = result.channel.alternatives[0].languages[0]
|
||||
language = Language(language)
|
||||
if len(transcript) > 0:
|
||||
await self.stop_ttfb_metrics()
|
||||
if is_final:
|
||||
# Check if this response is from a finalize() call.
|
||||
# Only mark as finalized when both we requested it AND Deepgram confirms it.
|
||||
from_finalize = getattr(result, "from_finalize", False)
|
||||
if from_finalize:
|
||||
self.confirm_finalize()
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
transcript,
|
||||
@@ -356,8 +359,10 @@ class DeepgramSTTService(STTService):
|
||||
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame) and not self.vad_enabled:
|
||||
# Start metrics if Deepgram VAD is disabled & pipeline VAD has detected speech
|
||||
await self.start_metrics()
|
||||
await self._start_metrics()
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
# https://developers.deepgram.com/docs/finalize
|
||||
# Mark that we're awaiting a from_finalize response
|
||||
self.request_finalize()
|
||||
await self._connection.finalize()
|
||||
logger.trace(f"Triggered finalize event on: {frame.name=}, {direction=}")
|
||||
|
||||
@@ -363,9 +363,6 @@ class DeepgramSageMakerSTTService(STTService):
|
||||
if not transcript.strip():
|
||||
return
|
||||
|
||||
# Stop TTFB metrics on first transcript
|
||||
await self.stop_ttfb_metrics()
|
||||
|
||||
is_final = parsed.get("is_final", False)
|
||||
speech_final = parsed.get("speech_final", False)
|
||||
|
||||
@@ -417,9 +414,8 @@ class DeepgramSageMakerSTTService(STTService):
|
||||
"""
|
||||
pass
|
||||
|
||||
async def start_metrics(self):
|
||||
"""Start TTFB and processing metrics collection."""
|
||||
await self.start_ttfb_metrics()
|
||||
async def _start_metrics(self):
|
||||
"""Start processing metrics collection."""
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
@@ -433,7 +429,7 @@ class DeepgramSageMakerSTTService(STTService):
|
||||
|
||||
# Start metrics when user starts speaking (if VAD is not provided by Deepgram)
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self.start_metrics()
|
||||
await self._start_metrics()
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
# Send finalize message to Deepgram when user stops speaking
|
||||
# This tells Deepgram to flush any remaining audio and return final results
|
||||
|
||||
@@ -310,7 +310,6 @@ class ElevenLabsSTTService(SegmentedSTTService):
|
||||
self, transcript: str, is_final: bool, language: Optional[str] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing."""
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
|
||||
@@ -328,7 +327,6 @@ class ElevenLabsSTTService(SegmentedSTTService):
|
||||
"""
|
||||
try:
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
# Upload audio and get transcription result directly
|
||||
result = await self._transcribe_audio(audio)
|
||||
@@ -539,9 +537,8 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
await super().cancel(frame)
|
||||
await self._disconnect()
|
||||
|
||||
async def start_metrics(self):
|
||||
async def _start_metrics(self):
|
||||
"""Start performance metrics collection for transcription processing."""
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
@@ -555,7 +552,7 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
# Start metrics when user starts speaking
|
||||
await self.start_metrics()
|
||||
await self._start_metrics()
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
# Send commit when user stops speaking (manual commit mode)
|
||||
if self._params.commit_strategy == CommitStrategy.MANUAL:
|
||||
@@ -764,8 +761,6 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
if not text:
|
||||
return
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
|
||||
# Get language if provided
|
||||
language = data.get("language_code")
|
||||
|
||||
@@ -803,7 +798,6 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
if not text:
|
||||
return
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
# Get language if provided
|
||||
@@ -845,7 +839,6 @@ class ElevenLabsRealtimeSTTService(WebsocketSTTService):
|
||||
if not text:
|
||||
return
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
# Get language if provided
|
||||
|
||||
@@ -249,7 +249,6 @@ class FalSTTService(SegmentedSTTService):
|
||||
self, transcript: str, is_final: bool, language: Optional[str] = None
|
||||
):
|
||||
"""Handle a transcription result with tracing."""
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
|
||||
@@ -267,7 +266,6 @@ class FalSTTService(SegmentedSTTService):
|
||||
"""
|
||||
try:
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
# Send to Fal directly (audio is already in WAV format from base class)
|
||||
data_uri = fal_client.encode(audio, "audio/x-wav")
|
||||
|
||||
@@ -385,7 +385,6 @@ class GladiaSTTService(WebsocketSTTService):
|
||||
Yields:
|
||||
None (processing is handled asynchronously via WebSocket).
|
||||
"""
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
|
||||
# Add audio to buffer
|
||||
@@ -513,7 +512,6 @@ class GladiaSTTService(WebsocketSTTService):
|
||||
async def _handle_transcription(
|
||||
self, transcript: str, is_final: bool, language: Optional[str] = None
|
||||
):
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
async def _on_speech_started(self):
|
||||
|
||||
@@ -1674,7 +1674,7 @@ class GeminiLiveLLMService(LLMService):
|
||||
# start a timeout task to flush it later
|
||||
if self._user_transcription_buffer:
|
||||
self._transcription_timeout_task = self.create_task(
|
||||
await self._transcription_timeout_handler()
|
||||
self._transcription_timeout_handler()
|
||||
)
|
||||
|
||||
async def _handle_msg_output_transcription(self, message: LiveServerMessage):
|
||||
|
||||
@@ -823,7 +823,6 @@ class GoogleSTTService(STTService):
|
||||
"""
|
||||
if self._streaming_task:
|
||||
# Queue the audio data
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
await self._request_queue.put(audio)
|
||||
yield None
|
||||
@@ -875,7 +874,6 @@ class GoogleSTTService(STTService):
|
||||
)
|
||||
else:
|
||||
self._last_transcript_was_final = False
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.push_frame(
|
||||
InterimTranscriptionFrame(
|
||||
transcript,
|
||||
|
||||
@@ -122,7 +122,6 @@ class GradiumSTTService(WebsocketSTTService):
|
||||
None (processing handled via WebSocket messages).
|
||||
"""
|
||||
self._audio_buffer.extend(audio)
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
|
||||
while len(self._audio_buffer) >= self._chunk_size_bytes:
|
||||
|
||||
@@ -111,7 +111,6 @@ class HathoraSTTService(SegmentedSTTService):
|
||||
"""
|
||||
try:
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
url = f"{self._base_url}"
|
||||
|
||||
@@ -153,7 +152,6 @@ class HathoraSTTService(SegmentedSTTService):
|
||||
result=response,
|
||||
)
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
except Exception as e:
|
||||
|
||||
@@ -307,7 +307,6 @@ class NvidiaSTTService(STTService):
|
||||
|
||||
transcript = result.alternatives[0].transcript
|
||||
if transcript and len(transcript) > 0:
|
||||
await self.stop_ttfb_metrics()
|
||||
if result.is_final:
|
||||
await self.stop_processing_metrics()
|
||||
await self.push_frame(
|
||||
@@ -344,7 +343,6 @@ class NvidiaSTTService(STTService):
|
||||
Yields:
|
||||
None - transcription results are pushed to the pipeline via frames.
|
||||
"""
|
||||
await self.start_ttfb_metrics()
|
||||
await self.start_processing_metrics()
|
||||
await self._queue.put(audio)
|
||||
yield None
|
||||
@@ -598,12 +596,10 @@ class NvidiaSegmentedSTTService(SegmentedSTTService):
|
||||
assert self._config is not None, "Recognition config not created"
|
||||
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
# Process audio with NVIDIA Riva ASR - explicitly request non-future response
|
||||
raw_response = self._asr_service.offline_recognize(audio, self._config, future=False)
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
# Process the response - handle different possible return types
|
||||
|
||||
@@ -10,7 +10,7 @@ This module provides an OpenAI-compatible interface for interacting with OpenRou
|
||||
extending the base OpenAI LLM service functionality.
|
||||
"""
|
||||
|
||||
from typing import Optional
|
||||
from typing import Any, Dict, Optional
|
||||
|
||||
from loguru import logger
|
||||
|
||||
@@ -61,3 +61,35 @@ class OpenRouterLLMService(OpenAILLMService):
|
||||
"""
|
||||
logger.debug(f"Creating OpenRouter client with api {base_url}")
|
||||
return super().create_client(api_key, base_url, **kwargs)
|
||||
|
||||
def build_chat_completion_params(self, params_from_context: Dict[str, Any]) -> Dict[str, Any]:
|
||||
"""Builds chat parameters, handling model-specific constraints.
|
||||
|
||||
Args:
|
||||
params_from_context: Parameters from the LLM context.
|
||||
|
||||
Returns:
|
||||
Transformed parameters ready for the API call.
|
||||
"""
|
||||
params = super().build_chat_completion_params(params_from_context)
|
||||
model = getattr(self, "model_name", getattr(self, "model", "")).lower()
|
||||
if "gemini" in model:
|
||||
messages = params.get("messages", [])
|
||||
if not messages:
|
||||
return params
|
||||
transformed_messages = []
|
||||
system_message_seen = False
|
||||
for msg in messages:
|
||||
if msg.get("role") == "system":
|
||||
if not system_message_seen:
|
||||
transformed_messages.append(msg)
|
||||
system_message_seen = True
|
||||
else:
|
||||
new_msg = msg.copy()
|
||||
new_msg["role"] = "user"
|
||||
transformed_messages.append(new_msg)
|
||||
else:
|
||||
transformed_messages.append(msg)
|
||||
params["messages"] = transformed_messages
|
||||
|
||||
return params
|
||||
|
||||
@@ -15,9 +15,15 @@ from pipecat.frames.frames import (
|
||||
CancelFrame,
|
||||
EndFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
UserStartedSpeakingFrame,
|
||||
UserStoppedSpeakingFrame,
|
||||
VADUserStartedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.sarvam._sdk import sdk_headers
|
||||
from pipecat.services.stt_service import STTService
|
||||
from pipecat.transcriptions.language import Language, resolve_language
|
||||
@@ -75,14 +81,14 @@ class SarvamSTTService(STTService):
|
||||
language: Target language for transcription. Defaults to None (required for saarika models).
|
||||
prompt: Optional prompt to guide translation style/context for STT-Translate models.
|
||||
Only applicable to saaras (STT-Translate) models. Defaults to None.
|
||||
vad_signals: Enable VAD signals in response. Defaults to True.
|
||||
high_vad_sensitivity: Enable high VAD (Voice Activity Detection) sensitivity. Defaults to False.
|
||||
vad_signals: Enable VAD signals in response. Defaults to None.
|
||||
high_vad_sensitivity: Enable high VAD (Voice Activity Detection) sensitivity. Defaults to None.
|
||||
"""
|
||||
|
||||
language: Optional[Language] = None
|
||||
prompt: Optional[str] = None
|
||||
vad_signals: bool = True
|
||||
high_vad_sensitivity: bool = False
|
||||
vad_signals: bool = None
|
||||
high_vad_sensitivity: bool = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
@@ -155,6 +161,7 @@ class SarvamSTTService(STTService):
|
||||
self._websocket_context = None
|
||||
self._socket_client = None
|
||||
self._receive_task = None
|
||||
logger.info(f"Sarvam STT initialized with SDK headers: {self._sdk_headers}")
|
||||
|
||||
def language_to_service_language(self, language: Language) -> str:
|
||||
"""Convert pipecat Language enum to Sarvam's language code.
|
||||
@@ -175,6 +182,22 @@ class SarvamSTTService(STTService):
|
||||
"""
|
||||
return True
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process incoming frames.
|
||||
|
||||
Handles VAD frames for TTFB tracking when using Pipecat's VAD
|
||||
instead of Sarvam's built-in VAD.
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
# Only handle VAD frames when not using Sarvam's VAD signals
|
||||
if not self._vad_signals:
|
||||
if isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self._start_metrics()
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
if self._socket_client:
|
||||
await self._socket_client.flush()
|
||||
|
||||
async def set_language(self, language: Language):
|
||||
"""Set the recognition language and reconnect.
|
||||
|
||||
@@ -411,16 +434,18 @@ class SarvamSTTService(STTService):
|
||||
logger.debug(f"VAD Signal: {signal}, Occurred at: {timestamp}")
|
||||
|
||||
if signal == "START_SPEECH":
|
||||
await self.start_metrics()
|
||||
await self._start_metrics()
|
||||
logger.debug("User started speaking")
|
||||
await self._call_event_handler("on_speech_started")
|
||||
await self.broadcast_frame(UserStartedSpeakingFrame)
|
||||
await self.push_interruption_task_frame_and_wait()
|
||||
|
||||
elif signal == "END_SPEECH":
|
||||
logger.debug("User stopped speaking")
|
||||
await self._call_event_handler("on_speech_stopped")
|
||||
await self.broadcast_frame(UserStoppedSpeakingFrame)
|
||||
|
||||
elif message.type == "data":
|
||||
await self.stop_ttfb_metrics()
|
||||
transcript = message.data.transcript
|
||||
language_code = message.data.language_code
|
||||
# Prefer language from message (auto-detected for translate models). Fallback to configured.
|
||||
@@ -482,7 +507,6 @@ class SarvamSTTService(STTService):
|
||||
}
|
||||
return mapping.get(language_code, Language.HI_IN)
|
||||
|
||||
async def start_metrics(self):
|
||||
"""Start TTFB and processing metrics collection."""
|
||||
await self.start_ttfb_metrics()
|
||||
async def _start_metrics(self):
|
||||
"""Start processing metrics collection."""
|
||||
await self.start_processing_metrics()
|
||||
|
||||
@@ -21,7 +21,7 @@ from pipecat.frames.frames import (
|
||||
InterimTranscriptionFrame,
|
||||
StartFrame,
|
||||
TranscriptionFrame,
|
||||
UserStoppedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.stt_service import WebsocketSTTService
|
||||
@@ -162,7 +162,7 @@ class SonioxSTTService(WebsocketSTTService):
|
||||
sample_rate: Audio sample rate.
|
||||
params: Additional configuration parameters, such as language hints, context and
|
||||
speaker diarization.
|
||||
vad_force_turn_endpoint: Listen to `UserStoppedSpeakingFrame` to send finalize message to Soniox. If disabled, Soniox will detect the end of the speech.
|
||||
vad_force_turn_endpoint: Listen to `VADUserStoppedSpeakingFrame` to send finalize message to Soniox. If disabled, Soniox will detect the end of the speech.
|
||||
**kwargs: Additional arguments passed to the STTService.
|
||||
"""
|
||||
super().__init__(sample_rate=sample_rate, **kwargs)
|
||||
@@ -247,7 +247,7 @@ class SonioxSTTService(WebsocketSTTService):
|
||||
"""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
if isinstance(frame, UserStoppedSpeakingFrame) and self._vad_force_turn_endpoint:
|
||||
if isinstance(frame, VADUserStoppedSpeakingFrame) and self._vad_force_turn_endpoint:
|
||||
# Send finalize message to Soniox so we get the final tokens asap.
|
||||
if self._websocket and self._websocket.state is State.OPEN:
|
||||
await self._websocket.send(FINALIZE_MESSAGE)
|
||||
@@ -374,12 +374,15 @@ class SonioxSTTService(WebsocketSTTService):
|
||||
async def send_endpoint_transcript():
|
||||
if self._final_transcription_buffer:
|
||||
text = "".join(map(lambda token: token["text"], self._final_transcription_buffer))
|
||||
# Soniox only pushes TranscriptionFrame when an end token is received,
|
||||
# so every TranscriptionFrame is inherently finalized
|
||||
await self.push_frame(
|
||||
TranscriptionFrame(
|
||||
text=text,
|
||||
user_id=self._user_id,
|
||||
timestamp=time_now_iso8601(),
|
||||
result=self._final_transcription_buffer,
|
||||
finalized=True,
|
||||
)
|
||||
)
|
||||
await self._handle_transcription(text, is_final=True)
|
||||
|
||||
@@ -8,7 +8,6 @@
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import time
|
||||
from enum import Enum
|
||||
from typing import Any, AsyncGenerator
|
||||
|
||||
@@ -598,9 +597,6 @@ class SpeechmaticsSTTService(STTService):
|
||||
if segments:
|
||||
await self._send_frames(segments)
|
||||
|
||||
# Update metrics
|
||||
await self._emit_metrics(message.get("metadata", {}).get("processing_time", 0.0))
|
||||
|
||||
async def _handle_segment(self, message: dict[str, Any]) -> None:
|
||||
"""Handle AddSegment events.
|
||||
|
||||
@@ -804,28 +800,6 @@ class SpeechmaticsSTTService(STTService):
|
||||
yield ErrorFrame(f"Speechmatics error: {e}")
|
||||
await self._disconnect()
|
||||
|
||||
async def _emit_metrics(self, processing_time: float) -> None:
|
||||
"""Create TTFB metrics.
|
||||
|
||||
The TTFB is the seconds between the person speaking and the STT
|
||||
engine emitting the first partial. This is only calculated at the
|
||||
start of an utterance.
|
||||
"""
|
||||
# Skip if metrics not available
|
||||
if not self._metrics or processing_time == 0.0:
|
||||
return
|
||||
|
||||
# Calculate time as time.time() - ttfb (which is seconds)
|
||||
start_time = time.time() - processing_time
|
||||
|
||||
# Update internal metrics
|
||||
self._metrics._start_ttfb_time = start_time
|
||||
self._metrics._start_processing_time = start_time
|
||||
|
||||
# Stop TTFB metrics
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
# ============================================================================
|
||||
# HELPERS
|
||||
# ============================================================================
|
||||
|
||||
@@ -6,7 +6,9 @@
|
||||
|
||||
"""Base classes for Speech-to-Text services with continuous and segmented processing."""
|
||||
|
||||
import asyncio
|
||||
import io
|
||||
import time
|
||||
import wave
|
||||
from abc import abstractmethod
|
||||
from typing import Any, AsyncGenerator, Dict, Mapping, Optional
|
||||
@@ -17,12 +19,17 @@ from pipecat.frames.frames import (
|
||||
AudioRawFrame,
|
||||
ErrorFrame,
|
||||
Frame,
|
||||
InterruptionFrame,
|
||||
MetricsFrame,
|
||||
SpeechControlParamsFrame,
|
||||
StartFrame,
|
||||
STTMuteFrame,
|
||||
STTUpdateSettingsFrame,
|
||||
TranscriptionFrame,
|
||||
VADUserStartedSpeakingFrame,
|
||||
VADUserStoppedSpeakingFrame,
|
||||
)
|
||||
from pipecat.metrics.metrics import TTFBMetricsData
|
||||
from pipecat.processors.frame_processor import FrameDirection
|
||||
from pipecat.services.ai_service import AIService
|
||||
from pipecat.services.websocket_service import WebsocketService
|
||||
@@ -61,6 +68,8 @@ class STTService(AIService):
|
||||
audio_passthrough=True,
|
||||
# STT input sample rate
|
||||
sample_rate: Optional[int] = None,
|
||||
# STT TTFB timeout - time to wait after VAD stop before reporting TTFB
|
||||
stt_ttfb_timeout: float = 2.0,
|
||||
**kwargs,
|
||||
):
|
||||
"""Initialize the STT service.
|
||||
@@ -70,6 +79,12 @@ class STTService(AIService):
|
||||
Defaults to True.
|
||||
sample_rate: The sample rate for audio input. If None, will be determined
|
||||
from the start frame.
|
||||
stt_ttfb_timeout: Time in seconds to wait after VAD stop before reporting
|
||||
TTFB. This delay allows the final transcription to arrive. Defaults to 2.0.
|
||||
Note: STT "TTFB" differs from traditional TTFB (which measures from a discrete
|
||||
request to first response byte). Since STT receives continuous audio, we measure
|
||||
from when the user stops speaking to when the final transcript arrives—capturing
|
||||
the latency that matters for voice AI applications.
|
||||
**kwargs: Additional arguments passed to the parent AIService.
|
||||
"""
|
||||
super().__init__(**kwargs)
|
||||
@@ -81,6 +96,16 @@ class STTService(AIService):
|
||||
self._muted: bool = False
|
||||
self._user_id: str = ""
|
||||
|
||||
# STT TTFB tracking state
|
||||
self._stt_ttfb_timeout = stt_ttfb_timeout
|
||||
self._ttfb_timeout_task: Optional[asyncio.Task] = None
|
||||
self._vad_stop_secs: Optional[float] = None
|
||||
self._speech_end_time: Optional[float] = None
|
||||
self._user_speaking: bool = False
|
||||
self._last_transcription_time: Optional[float] = None
|
||||
self._finalize_pending: bool = False
|
||||
self._finalize_requested: bool = False
|
||||
|
||||
self._register_event_handler("on_connected")
|
||||
self._register_event_handler("on_disconnected")
|
||||
self._register_event_handler("on_connection_error")
|
||||
@@ -94,6 +119,31 @@ class STTService(AIService):
|
||||
"""
|
||||
return self._muted
|
||||
|
||||
def request_finalize(self):
|
||||
"""Mark that a finalize request has been sent, awaiting server confirmation.
|
||||
|
||||
For providers that have explicit server confirmation of finalization
|
||||
(e.g., Deepgram's from_finalize field), call this when sending the finalize
|
||||
request. Then call confirm_finalize() when the server confirms.
|
||||
|
||||
For providers without server confirmation, don't call this method - just
|
||||
send the finalize/flush/commit command and rely on the TTFB timeout.
|
||||
"""
|
||||
self._finalize_requested = True
|
||||
|
||||
def confirm_finalize(self):
|
||||
"""Confirm that the server has acknowledged the finalize request.
|
||||
|
||||
Call this when the server response confirms finalization (e.g., Deepgram's
|
||||
from_finalize=True). The next TranscriptionFrame pushed will be marked
|
||||
as finalized.
|
||||
|
||||
Only has effect if request_finalize() was previously called.
|
||||
"""
|
||||
if self._finalize_requested:
|
||||
self._finalize_pending = True
|
||||
self._finalize_requested = False
|
||||
|
||||
@property
|
||||
def sample_rate(self) -> int:
|
||||
"""Get the current sample rate for audio processing.
|
||||
@@ -144,6 +194,11 @@ class STTService(AIService):
|
||||
self._sample_rate = self._init_sample_rate or frame.audio_in_sample_rate
|
||||
self._tracing_enabled = frame.enable_tracing
|
||||
|
||||
async def cleanup(self):
|
||||
"""Clean up STT service resources."""
|
||||
await super().cleanup()
|
||||
await self._cancel_ttfb_timeout()
|
||||
|
||||
async def _update_settings(self, settings: Mapping[str, Any]):
|
||||
logger.info(f"Updating STT settings: {self._settings}")
|
||||
for key, value in settings.items():
|
||||
@@ -206,14 +261,168 @@ class STTService(AIService):
|
||||
await self.process_audio_frame(frame, direction)
|
||||
if self._audio_passthrough:
|
||||
await self.push_frame(frame, direction)
|
||||
elif isinstance(frame, SpeechControlParamsFrame):
|
||||
await self._handle_speech_control_params(frame)
|
||||
await self.push_frame(frame, direction)
|
||||
elif isinstance(frame, VADUserStartedSpeakingFrame):
|
||||
await self._handle_vad_user_started_speaking(frame)
|
||||
await self.push_frame(frame, direction)
|
||||
elif isinstance(frame, VADUserStoppedSpeakingFrame):
|
||||
await self._handle_vad_user_stopped_speaking(frame)
|
||||
await self.push_frame(frame, direction)
|
||||
elif isinstance(frame, STTUpdateSettingsFrame):
|
||||
await self._update_settings(frame.settings)
|
||||
elif isinstance(frame, STTMuteFrame):
|
||||
self._muted = frame.mute
|
||||
logger.debug(f"STT service {'muted' if frame.mute else 'unmuted'}")
|
||||
elif isinstance(frame, InterruptionFrame):
|
||||
await self._reset_stt_ttfb_state()
|
||||
await self.push_frame(frame, direction)
|
||||
else:
|
||||
await self.push_frame(frame, direction)
|
||||
|
||||
async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
|
||||
"""Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.
|
||||
|
||||
Stores the timestamp of each TranscriptionFrame for TTFB calculation.
|
||||
If the frame is marked as finalized (via request_finalize/confirm_finalize),
|
||||
reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is
|
||||
reported after a timeout.
|
||||
|
||||
Args:
|
||||
frame: The frame to push.
|
||||
direction: The direction to push the frame.
|
||||
"""
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
# Store the transcription time for TTFB calculation
|
||||
self._last_transcription_time = time.time()
|
||||
|
||||
# Set finalized from pending state and auto-reset
|
||||
if self._finalize_pending:
|
||||
frame.finalized = True
|
||||
self._finalize_pending = False
|
||||
|
||||
# If this is a finalized transcription, report TTFB immediately
|
||||
if frame.finalized and self._speech_end_time is not None:
|
||||
ttfb = self._last_transcription_time - self._speech_end_time
|
||||
await self._emit_stt_ttfb_metric(ttfb)
|
||||
# Cancel the timeout since we've already reported
|
||||
await self._cancel_ttfb_timeout()
|
||||
# Clear state
|
||||
self._speech_end_time = None
|
||||
self._last_transcription_time = None
|
||||
|
||||
await super().push_frame(frame, direction)
|
||||
|
||||
async def _handle_speech_control_params(self, frame: SpeechControlParamsFrame):
|
||||
"""Handle speech control parameters frame to extract VAD stop_secs.
|
||||
|
||||
Args:
|
||||
frame: The speech control parameters frame.
|
||||
"""
|
||||
if frame.vad_params is not None:
|
||||
self._vad_stop_secs = frame.vad_params.stop_secs
|
||||
|
||||
async def _cancel_ttfb_timeout(self):
|
||||
"""Cancel any pending TTFB timeout task."""
|
||||
if self._ttfb_timeout_task:
|
||||
await self.cancel_task(self._ttfb_timeout_task)
|
||||
self._ttfb_timeout_task = None
|
||||
|
||||
async def _reset_stt_ttfb_state(self):
|
||||
"""Reset STT TTFB measurement state.
|
||||
|
||||
Called when starting a new utterance or on interruption to ensure
|
||||
we don't use stale state for TTFB calculations. This specifically guards
|
||||
against the case where a TranscriptionFrame is received without corresponding
|
||||
VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame frames.
|
||||
|
||||
Note: Does not reset _user_speaking since InterruptionFrame can arrive
|
||||
while user is still speaking.
|
||||
"""
|
||||
await self._cancel_ttfb_timeout()
|
||||
self._speech_end_time = None
|
||||
self._last_transcription_time = None
|
||||
|
||||
async def _handle_vad_user_started_speaking(self, frame: VADUserStartedSpeakingFrame):
|
||||
"""Handle VAD user started speaking frame to start tracking transcriptions.
|
||||
|
||||
Cancels any pending TTFB timeout, resets TTFB tracking state, and marks user as speaking.
|
||||
Also resets finalization state to prevent stale finalization from a previous utterance.
|
||||
|
||||
Args:
|
||||
frame: The VAD user started speaking frame.
|
||||
"""
|
||||
await self._reset_stt_ttfb_state()
|
||||
self._user_speaking = True
|
||||
self._finalize_requested = False
|
||||
self._finalize_pending = False
|
||||
|
||||
async def _handle_vad_user_stopped_speaking(self, frame: VADUserStoppedSpeakingFrame):
|
||||
"""Handle VAD user stopped speaking frame.
|
||||
|
||||
Calculates the actual speech end time and starts a timeout task to wait
|
||||
for the final transcription before reporting TTFB.
|
||||
|
||||
Args:
|
||||
frame: The VAD user stopped speaking frame.
|
||||
"""
|
||||
self._user_speaking = False
|
||||
|
||||
# Skip TTFB measurement if we don't have VAD params
|
||||
if self._vad_stop_secs is None:
|
||||
return
|
||||
|
||||
# Calculate the actual speech end time (current time minus VAD stop delay).
|
||||
# This approximates when the last user audio was sent to the STT service,
|
||||
# which we use to measure against the eventual transcription response.
|
||||
self._speech_end_time = time.time() - self._vad_stop_secs
|
||||
|
||||
# Start timeout task (any previous timeout was cancelled by VADUserStartedSpeakingFrame
|
||||
# or InterruptionFrame)
|
||||
self._ttfb_timeout_task = self.create_task(
|
||||
self._ttfb_timeout_handler(), name="stt_ttfb_timeout"
|
||||
)
|
||||
|
||||
async def _ttfb_timeout_handler(self):
|
||||
"""Wait for timeout then report TTFB using the last transcription timestamp.
|
||||
|
||||
This timeout allows the final transcription to arrive before we calculate
|
||||
and report TTFB. If no transcription arrived, no TTFB is reported.
|
||||
"""
|
||||
try:
|
||||
await asyncio.sleep(self._stt_ttfb_timeout)
|
||||
|
||||
# Report TTFB if we have both speech end time and transcription time
|
||||
if self._speech_end_time is not None and self._last_transcription_time is not None:
|
||||
ttfb = self._last_transcription_time - self._speech_end_time
|
||||
await self._emit_stt_ttfb_metric(ttfb)
|
||||
|
||||
# Clear state after reporting
|
||||
self._speech_end_time = None
|
||||
self._last_transcription_time = None
|
||||
except asyncio.CancelledError:
|
||||
# Task was cancelled (new utterance or interruption), which is expected behavior
|
||||
pass
|
||||
finally:
|
||||
self._ttfb_timeout_task = None
|
||||
|
||||
async def _emit_stt_ttfb_metric(self, ttfb: float):
|
||||
"""Emit STT TTFB metric if value is non-negative.
|
||||
|
||||
Args:
|
||||
ttfb: The TTFB value in seconds.
|
||||
"""
|
||||
if ttfb >= 0:
|
||||
logger.debug(f"{self} TTFB: {ttfb:.3f}s")
|
||||
if self.metrics_enabled:
|
||||
ttfb_data = TTFBMetricsData(
|
||||
processor=self.name,
|
||||
model=self.model_name,
|
||||
value=ttfb,
|
||||
)
|
||||
await super().push_frame(MetricsFrame(data=[ttfb_data]))
|
||||
|
||||
|
||||
class SegmentedSTTService(STTService):
|
||||
"""STT service that processes speech in segments using VAD events.
|
||||
@@ -250,6 +459,20 @@ class SegmentedSTTService(STTService):
|
||||
await super().start(frame)
|
||||
self._audio_buffer_size_1s = self.sample_rate * 2
|
||||
|
||||
async def push_frame(self, frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM):
|
||||
"""Push a frame, marking TranscriptionFrames as finalized.
|
||||
|
||||
Segmented STT services process complete speech segments and return a single
|
||||
TranscriptionFrame per segment, so every transcription is inherently finalized.
|
||||
|
||||
Args:
|
||||
frame: The frame to push.
|
||||
direction: The direction of frame flow in the pipeline.
|
||||
"""
|
||||
if isinstance(frame, TranscriptionFrame):
|
||||
frame.finalized = True
|
||||
await super().push_frame(frame, direction)
|
||||
|
||||
async def process_frame(self, frame: Frame, direction: FrameDirection):
|
||||
"""Process frames, handling VAD events and audio segmentation."""
|
||||
await super().process_frame(frame, direction)
|
||||
|
||||
@@ -204,11 +204,9 @@ class BaseWhisperSTTService(SegmentedSTTService):
|
||||
"""
|
||||
try:
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
response = await self._transcribe(audio)
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
text = response.text.strip()
|
||||
|
||||
@@ -289,7 +289,6 @@ class WhisperSTTService(SegmentedSTTService):
|
||||
return
|
||||
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
# Divide by 32768 because we have signed 16-bit data.
|
||||
audio_float = np.frombuffer(audio, dtype=np.int16).astype(np.float32) / 32768.0
|
||||
@@ -303,7 +302,6 @@ class WhisperSTTService(SegmentedSTTService):
|
||||
if segment.no_speech_prob < self._no_speech_prob:
|
||||
text += f"{segment.text} "
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
if text:
|
||||
@@ -388,7 +386,6 @@ class WhisperSTTServiceMLX(WhisperSTTService):
|
||||
import mlx_whisper
|
||||
|
||||
await self.start_processing_metrics()
|
||||
await self.start_ttfb_metrics()
|
||||
|
||||
# Divide by 32768 because we have signed 16-bit data.
|
||||
audio_float = np.frombuffer(audio, dtype=np.int16).astype(np.float32) / 32768.0
|
||||
@@ -413,7 +410,6 @@ class WhisperSTTServiceMLX(WhisperSTTService):
|
||||
if len(text.strip()) == 0:
|
||||
text = None
|
||||
|
||||
await self.stop_ttfb_metrics()
|
||||
await self.stop_processing_metrics()
|
||||
|
||||
if text:
|
||||
|
||||
@@ -1733,7 +1733,7 @@ class DailyInputTransport(BaseInputTransport):
|
||||
message: The message data to send.
|
||||
sender: ID of the message sender.
|
||||
"""
|
||||
await self.broadcast_frame_class(
|
||||
await self.broadcast_frame(
|
||||
DailyInputTransportMessageFrame, message=message, participant_id=sender
|
||||
)
|
||||
|
||||
|
||||
@@ -698,7 +698,7 @@ class SmallWebRTCInputTransport(BaseInputTransport):
|
||||
message: The application message to process.
|
||||
"""
|
||||
logger.debug(f"Received app message inside SmallWebRTCInputTransport {message}")
|
||||
await self.broadcast_frame_class(InputTransportMessageFrame, message=message)
|
||||
await self.broadcast_frame(InputTransportMessageFrame, message=message)
|
||||
|
||||
# Add this method similar to DailyInputTransport.request_participant_image
|
||||
async def request_participant_image(self, frame: UserImageRequestFrame):
|
||||
|
||||
16
uv.lock
generated
16
uv.lock
generated
@@ -531,18 +531,18 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "azure-cognitiveservices-speech"
|
||||
version = "1.44.0"
|
||||
version = "1.47.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "azure-core" },
|
||||
]
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/0b/0d/0752835f079e8d2cc42bb634f3ccd761c8d6e9d0d46a2d6cf7b3ed8e714c/azure_cognitiveservices_speech-1.44.0-py3-none-macosx_10_14_x86_64.whl", hash = "sha256:78037a147ba72abb57e8c10b693d43a1bb029986fae0918f1f9b7d6342737bfe", size = 7492396, upload-time = "2025-05-19T15:46:11.318Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/76/1d/d0ed4ec0f51303a2a532dc845eeb72c7729a3c8639b08050f3c1cd96db79/azure_cognitiveservices_speech-1.44.0-py3-none-macosx_11_0_arm64.whl", hash = "sha256:2c9b436326cd8dd82dfa88454b7b68359dfc7149e2ac9029f9bcff155ebd5c95", size = 7347577, upload-time = "2025-05-19T15:46:13.644Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/89/c8/f0a4ea8bea014b912046f737e429378ceadad68258395454d62acf7f65bb/azure_cognitiveservices_speech-1.44.0-py3-none-manylinux1_x86_64.whl", hash = "sha256:e5f07fc0587067850288c17aebf33d307d2c1ef9e0b2d11d9f44bff2af400568", size = 40977193, upload-time = "2025-05-19T15:46:15.878Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/6a/0d/0a0394e8102d6660afeec6b780c451401f6074b1e19f00e90785529e459e/azure_cognitiveservices_speech-1.44.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:3461e22cf04816f69a964d936218d920240f987c0656fdaaf46571529ff0f7e6", size = 40747860, upload-time = "2025-05-19T15:46:19.316Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/55/ad/3b7f6eca73040821358ce01f22067446a03d876bfed41cd784291706db4c/azure_cognitiveservices_speech-1.44.0-py3-none-win32.whl", hash = "sha256:a3fe7fd67ba7db281ae490de3d71b5a22648454ec2630eb6a70797f666330586", size = 2164045, upload-time = "2025-05-19T15:46:22.373Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/83/ac/f491487d7d0e25ae2929b4f07e7f9b7456feb38e65b36fb605b2c9685b10/azure_cognitiveservices_speech-1.44.0-py3-none-win_amd64.whl", hash = "sha256:77cfb5dd40733b7ccc21edc427e9fb4720997832ea8a1ba460dc94345f3588ae", size = 2422937, upload-time = "2025-05-19T15:46:23.657Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/cc/e3/b6a3d1ef4f135f8ef00ed084b9284e65409e9cd52bc96cd0453a5c6637c6/azure_cognitiveservices_speech-1.47.0-py3-none-macosx_10_14_x86_64.whl", hash = "sha256:656577ed01ed4b8cd7c70fab2c921b300181b906f101758a16406bc99b133681", size = 3574346, upload-time = "2025-11-11T21:13:37.717Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/82/fa/9cc0c5400e9d433bd98a1239bedf97b34abf410dbc8932a50886ae43e115/azure_cognitiveservices_speech-1.47.0-py3-none-macosx_11_0_arm64.whl", hash = "sha256:afd91653ceca482ccea5459eedda1ec9aa95ee07df12a15fc588c42d4f90f0a9", size = 3506219, upload-time = "2025-11-11T21:13:39.702Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/6b/d6/b8f55421b8cb40b478f4fb793c52b1bb0ed794263a5475ae2a6490a4cd53/azure_cognitiveservices_speech-1.47.0-py3-none-manylinux1_x86_64.whl", hash = "sha256:577b702ee30d35ecc581e7e2ac23f4387782f93c241d7f8f3c86f72bb883d02d", size = 35399363, upload-time = "2025-11-11T21:13:41.915Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/98/91/c36be146824797f57b194128a173baf289a260c2540c86c166f8c7fbebe3/azure_cognitiveservices_speech-1.47.0-py3-none-manylinux2014_aarch64.whl", hash = "sha256:ff72c74abe4b4c0f5a527eabf8511a8c0e689d884a95c54a46495b293e302e73", size = 35196906, upload-time = "2025-11-11T21:13:45.31Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/fb/19/dd6f08dc623f2b336cc9cd5cf765712df5262fd675583e701922491e455d/azure_cognitiveservices_speech-1.47.0-py3-none-win_amd64.whl", hash = "sha256:ecfce57d66907afe305fb2950cc781ea8f327274facd2db66950e701b6cfd715", size = 2182376, upload-time = "2025-11-11T21:13:47.753Z" },
|
||||
{ url = "https://files.pythonhosted.org/packages/1b/16/a6d1f7ab7eae21b00da2eee7186a7db9c9a2434e0ef833f071ff686b833f/azure_cognitiveservices_speech-1.47.0-py3-none-win_arm64.whl", hash = "sha256:4351734cf240d11340a057ecb388397e5ecf40e97e4b67a6a990fffe2791b56c", size = 1978493, upload-time = "2025-11-11T21:13:49.445Z" },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -4504,7 +4504,7 @@ requires-dist = [
|
||||
{ name = "audioop-lts", marker = "python_full_version >= '3.13'", specifier = "~=0.2.1" },
|
||||
{ name = "aws-sdk-bedrock-runtime", marker = "python_full_version >= '3.12' and extra == 'aws-nova-sonic'", specifier = "~=0.2.0" },
|
||||
{ name = "aws-sdk-sagemaker-runtime-http2", marker = "python_full_version >= '3.12' and extra == 'sagemaker'" },
|
||||
{ name = "azure-cognitiveservices-speech", marker = "extra == 'azure'", specifier = "~=1.44.0" },
|
||||
{ name = "azure-cognitiveservices-speech", marker = "extra == 'azure'", specifier = "~=1.47.0" },
|
||||
{ name = "camb-sdk", marker = "extra == 'camb'", specifier = ">=1.5.4" },
|
||||
{ name = "cartesia", marker = "extra == 'cartesia'", specifier = "~=2.0.3" },
|
||||
{ name = "coremltools", marker = "extra == 'local-smart-turn'", specifier = ">=8.0" },
|
||||
|
||||
Reference in New Issue
Block a user