Files
pipecat/scripts/evals
Paul Kompfner 1a4a6f4edf refactor(gemini-live): bring tool-result handling in line with the canonical realtime pattern
Lays groundwork for cancel_on_interruption=False support on Gemini Live by
restructuring _process_completed_function_calls to match the shape used by
AWSNovaSonicLLMService and OpenAIRealtimeLLMService in #4441: a single-pass
forward iteration over raw context messages that detects async-tool
messages via async_tool_messages.parse_message and routes them — started
skipped silently, intermediate logged-as-error and surfaced via push_error,
final delivered via the formal FunctionResponse channel.

Replaces the prior two-pass structure that went through the adapter for
sync results — the service now uses a lightweight self._tool_call_id_to_name
map (populated when the model issues tool calls) for the name lookup the
adapter used to provide. Extracts a new GeminiLLMAdapter.to_function_response_dict
static method for the dict-coercion logic that wraps non-dict tool returns
as {value: <result>} for Gemini's FunctionResponse.response field; the
adapter's existing inline copy in _from_standard_message uses it too.

Example consolidation:

- Folds realtime-gemini-live-function-calling.py into the base
  realtime-gemini-live.py example so the base exercises function calling
  out of the box (matching realtime-openai.py and realtime-aws-nova-sonic.py).
- Renames realtime-gemini-live-vertex-function-calling.py to
  realtime-gemini-live-vertex.py, mirroring the consolidation.
- Adds realtime-gemini-live-async-tool.py.
- Updates scripts/evals/run-release-evals.py for the renames.

This commit alone doesn't make cancel_on_interruption=False fully work on
Gemini Live — additional investigation is pending. This is foundational
work to be built on.
2026-05-08 16:42:54 -04:00
..
2025-08-11 20:06:24 -07:00

Pipecat Evals

This directory contains a set of utilities to help test Pipecat, specifically its examples.

Release Evals

Before any Pipecat release, we make sure that all (or most) of the examples work flawlessly. We have 100+ examples, and checking each one manually was very time-consuming (and painful!), especially because we aim to release often.

To make this process easier, we designed these "release evals," which do the following:

  • Start one of the foundational examples (the user bot)
  • Start an eval bot

The user bot (i.e. the example) introduces itself, and the eval bot then asks a question. The user bot replies, and the eval bot verifies the response.

For example, the eval bot might ask:

"What's 2 plus 2?"

The user bot replies:

"2 plus 2 is 4."

The eval bot (powered by an LLM) evaluates the response and emits a result. It also explains why it thinks the answer is valid or invalid.

To run the release evals:

uv run run-release-evals.py -a -v

This runs all the evals and stores logs and audio (-a) for each test.

You can also specify which tests to run. For example, to run all 07 series tests:

uv run run-release-evals.py -p 07 -a -v

Script Evals

You can also run evals for a single example (not part of the release set):

uv run run-eval.py -p "A simple math addition" -a -v YOUR_EXAMPLE_SCRIPT

Your script needs to follow any of the foundation examples pattern.