pipecat/scripts/evals
Paul Kompfner 272532a3ea Update examples, wherever possible, to use LLMContext and associated machinery instead of OpenAILLMContext and associated machinery.
With all these examples updated, we no longer need dedicated examples illustrating `LLMContext`, so they're removed.

Here’s where we *don’t* yet use `LLMContext` and associated machinery:
- Realtime services: OpenAI Realtime, Gemini Live, and AWS Nova Sonic (support coming soon)
- `GoogleLLMOpenAIBetaService` (it’s deprecated, so we didn’t bother adding support)
- `LLMLogObserver` (support coming soon)
- `GatedOpenAILLMContextAggregator` (support coming soon)
- `LangchainProcessor` (support coming soon)
- `Mem0MemoryService` (support coming soon)
- Examples that use LLM-specific tool definitions as opposed to `ToolsSchema` (these will be updated soon)
- Examples that rely on `GoogleLLMContext.upgrade_to_google` (TBD what to do with these)

Examples that use `LLMLogObserver`:
- 30-

Examples that use `GatedOpenAILLMContextAggregator`:
- 22-

Examples that use `LangchainProcessor`:
- 07b-

Examples that use `Mem0MemoryService`:
- 37-

Examples that need updating to use `ToolsSchema`:
- 15-
- 15a-
- 20a-
- 20c-
- 20d-
- 22b-
- 22c-
- 33-
- 36-

Examples that use `GoogleLLMContext.upgrade_to_google`:
- 22d-
- 25-
2025-09-22 16:21:35 -04:00

Pipecat Evals

This directory contains a set of utilities to help test Pipecat, specifically its examples.

Release Evals

Before any Pipecat release, we make sure that all (or most) of the examples work flawlessly. We have 100+ examples, and checking each one manually was very time-consuming (and painful!), especially because we aim to release often.

To make this process easier, we designed these "release evals," which do the following:

  • Start one of the foundational examples (the user bot)
  • Start an eval bot

The user bot (i.e. the example) introduces itself, and the eval bot then asks a question. The user bot replies, and the eval bot verifies the response.

For example, the eval bot might ask:

"What's 2 plus 2?"

The user bot replies:

"2 plus 2 is 4."

The eval bot (powered by an LLM) evaluates the response and emits a result. It also explains why it thinks the answer is valid or invalid.
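
The exchange described above can be sketched as a small function. This is a minimal illustration, not the code in this directory: `run_eval`, `EvalResult`, `toy_user_bot`, and `toy_judge` are hypothetical names, and the judge is a plain string check standing in for the LLM-powered eval bot.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    """Outcome of one eval: a verdict plus the judge's explanation."""
    passed: bool
    reasoning: str


def run_eval(
    question: str,
    user_bot: Callable[[str], str],
    judge: Callable[[str, str], EvalResult],
) -> EvalResult:
    """Ask the user bot a question, then let the judge verify the reply."""
    answer = user_bot(question)
    return judge(question, answer)


# Hypothetical stand-in for the example under test (the "user bot").
def toy_user_bot(question: str) -> str:
    return "2 plus 2 is 4."


# Hypothetical stand-in for the eval bot. The real eval bot uses an LLM;
# this one only checks that the expected digit appears in the reply.
def toy_judge(question: str, answer: str) -> EvalResult:
    passed = "4" in answer
    verdict = "contains" if passed else "does not contain"
    return EvalResult(passed, f"The answer {verdict} the expected value 4: {answer!r}")


result = run_eval("What's 2 plus 2?", toy_user_bot, toy_judge)
print(result.passed, "-", result.reasoning)
```

Keeping the verdict and the explanation together in one result mirrors the behavior described above, where the eval bot reports both whether the answer is valid and why.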

To run the release evals:

uv run run-release-evals.py -a -v

This runs all the evals and stores logs and audio (-a) for each test.

You can also specify which tests to run. For example, to run all tests in the 07 series:

uv run run-release-evals.py -p 07 -a -v
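
As a rough sketch of what selecting a series means, the snippet below filters a list of example names by their leading characters. It assumes the pattern is compared against the start of each name; the real script may match differently (for example, with a regular expression). `select_tests` and the file names are hypothetical.

```python
def select_tests(names: list[str], pattern: str) -> list[str]:
    """Return the names that start with the given pattern (assumed prefix match)."""
    return [name for name in names if name.startswith(pattern)]


# Hypothetical example names, used only for illustration.
examples = ["07-example.py", "07b-example.py", "14-example.py", "22-example.py"]

selected = select_tests(examples, "07")
print(selected)
```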

Script Evals

You can also run evals for a single example (not part of the release set):

uv run run-eval.py -p "A simple math addition" -a -v YOUR_EXAMPLE_SCRIPT

Your script needs to follow the same pattern as the foundational examples.