
Koichi

Building an Agent Evaluation Pipeline with Google ADK

Intro

If you've built an agent, you've probably noticed that "it works" is surprisingly hard to prove. LLMs are probabilistic, so the deterministic pass/fail model of unit tests doesn't quite apply. This post walks through why agents need their own evaluation story, and how to build one with ADK in three steps: prepare test cases → prepare criteria → run the evaluation.

The sample code is in this repository:

Koichi73 / adk-tool-trajectory-eval-sample

sample of using tool_trajectory_avg_score in ADK.

adk-tool-trajectory-eval-sample

A minimal sample for evaluating an ADK agent with the tool_trajectory_avg_score criterion.

Layout

.
├── sample_agent/         # the agent under evaluation
│   └── agent.py
└── tests/
    ├── eval_config.json  # evaluation criteria
    ├── evalsets/         # test cases
    └── test_sample_agent.py

Setup

uv sync
cp .env.example .env  # fill in the values

Run the evaluation

uv run pytest



Why evaluate agents?

In traditional software, tests return a clean pass/fail, which gives you both quality gates and regression detection for free. Agents don't behave that way — the same input can produce different outputs, so deterministic pass/fail alone can't capture quality.

If you ignore that gap and keep shipping, agent quality tends to plateau, and you have no clear signal for where to improve next. In Your AI Product Needs Evals, Hamel Husain calls out three symptoms he saw in products built without an eval system:

  1. Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.
  2. There was limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.
  3. Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.

He sums up the root cause:

I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust evaluation systems.

The flip side: evals are the foundation that lets you run a quality-improvement loop. Without them, you don't even know which direction to improve in.

What do you actually measure?

LLM evaluation focuses on the quality of the final answer. Agent evaluation has to go further and look at the decision-making process itself. The ADK docs split agent evaluation into two axes:

  • Trajectory evaluation: analyze the steps the agent took to reach a solution. Which tools did it call, with which arguments, in what order?
  • Final response evaluation: judge the quality, relevance, and accuracy of the final answer.

Imagine the agent is asked about the weather and gets the answer right — but along the way it called three unrelated tools. The output is right, but the agent's behavior clearly isn't. Trajectory evaluation is how you catch this.

The ADK evaluation pipeline

With ADK, you can break the evaluation pipeline into three steps:

  1. Prepare test cases
  2. Prepare evaluation criteria
  3. Run the evaluation

We'll go through each in turn.

1. Prepare test cases

ADK supports two file formats for writing test cases:

  • Test file (.test.json)
  • Eval set file (.evalset.json)

The official docs frame the former as a format for single, simple sessions (a form of unit testing) and the latter as ideal for multi-turn conversations. In practice, the two share a schema, and you can perfectly well write simple tests in an eval set file. I'll use eval set files throughout this post.

An eval set file looks like this:

{
  "eval_set_id": "basic_eval",
  "name": "Basic Agent Evaluation",
  "description": "Sample evaluation set.",
  "eval_cases": [
    {
      "eval_id": "weather_query",
      "conversation": [
        {
          "user_content": {
            "parts": [{"text": "What's the weather in Tokyo?"}]
          },
          "final_response": {
            "parts": [{"text": "Today is sunny in Tokyo."}]
          }
        }
      ],
      "session_input": {
        "app_name": "app",
        "user_id": "eval_user",
        "state": {}
      }
    }
  ]
}

Each case in eval_cases describes a test case through fields such as the user input (user_content) and the expected output (final_response). session_input.state lets you seed the initial state for the run — useful when the agent depends on something like a user profile being present from the start.
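
For instance, if the agent personalizes its answers based on a home city kept in session state, you could seed that value for the test run (home_city is just a hypothetical key for illustration):

"session_input": {
  "app_name": "app",
  "user_id": "eval_user",
  "state": {"home_city": "Tokyo"}
}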

2. Prepare evaluation criteria

ADK ships with 11 built-in criteria (and you can define your own) — see the criteria docs for the full list. I'll cover the four most representative ones below.

| Criterion | What it checks | Needs ground truth | Uses a judge LLM |
| --- | --- | --- | --- |
| tool_trajectory_avg_score | Whether the tool-call trajectory matches the expected one | Yes | No |
| response_match_score | ROUGE-1 similarity between the final response and ground truth | Yes | No |
| final_response_match_v2 | Semantic match between the final response and ground truth | Yes | Yes |
| rubric_based_final_response_quality_v1 | Quality of the final response against a rubric | No | Yes |

tool_trajectory_avg_score

Mechanically checks whether the tool-call trajectory (tool name, arguments, order) matches a predefined ground truth. Each call gets a 1.0 (match) or 0.0 (no match), and the final score is the average.
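
To give this criterion something to compare against, each invocation in the eval set also declares the expected tool calls. As I understand the eval set schema, these go under intermediate_data.tool_uses; the get_weather tool and its city argument below are placeholders for whatever tools your agent actually exposes:

{
  "user_content": {
    "parts": [{"text": "What's the weather in Tokyo?"}]
  },
  "final_response": {
    "parts": [{"text": "Today is sunny in Tokyo."}]
  },
  "intermediate_data": {
    "tool_uses": [
      {"name": "get_weather", "args": {"city": "Tokyo"}}
    ]
  }
}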

You can choose how matches are determined with match_type. The default is EXACT.

  • EXACT: the actual trajectory must match exactly — no extra calls, no missing calls.
  • IN_ORDER: expected calls must appear in the same relative order; additional calls may be interleaved.
  • ANY_ORDER: all expected calls must appear in any order; additional calls may be interleaved.

response_match_score

Computes the ROUGE-1 similarity between the agent's final response and the ground truth. The score ranges from 0.0 to 1.0; closer to 1.0 is better.

ROUGE-1 tokenizes both strings and computes an F1 score from word overlap. Take this example:

  • Ground truth: the cat sat on the mat
  • Output: the cat sat on a mat

Each splits into 6 tokens, and 5 tokens (the, cat, sat, on, mat) overlap. Precision and recall are both 5/6, giving an F1 of about 0.83.
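
If you want to see that arithmetic spelled out, here is a rough sketch in Python (purely an illustration of the idea; ADK computes ROUGE with a proper implementation internally):

from collections import Counter

# Tokenize the ground truth and the agent's output.
reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()

# Overlap is the multiset intersection of tokens: 5 in this example.
overlap = sum((Counter(reference) & Counter(candidate)).values())

precision = overlap / len(candidate)  # 5/6
recall = overlap / len(reference)     # 5/6
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.83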

final_response_match_v2

Uses a judge LLM (LLM-as-a-Judge) to evaluate semantic similarity between the agent's final response and the ground truth. Where response_match_score works at the token level, this one asks "are they saying the same thing?" — so it's robust to phrasing differences.

For example, if the ground truth is "It's sunny in Tokyo today." and the output is "Today's Tokyo weather is clear.", the judge will treat them as equivalent. The score ranges from 0.0 to 1.0, and you can set both the pass threshold and the judge model in eval_config.json (covered below).
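
As a sketch of what that configuration can look like (the threshold and judge_model_options field names below are my reading of the criteria docs, so double-check them against your ADK version):

{
  "criteria": {
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    }
  }
}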

A note on cost and latency

final_response_match_v2 calls a judge LLM internally, so every evaluation incurs an API cost, and latency is higher than response_match_score. Be mindful of call volume and cost when wiring this into CI.

rubric_based_final_response_quality_v1

The three criteria above all require a predefined "correct answer". This rubric-based criterion is different: instead of defining the answer itself, you define the properties the answer should have (the rubric). The judge LLM rates each rubric as satisfied (1.0) or not (0.0), and the score is the average.

This is a good fit for open-ended tasks where there's no single right answer — summarization, suggestions, conversational responses, and so on. Rubrics look like this:

{
  "criteria": {
    "rubric_based_final_response_quality_v1": {
      "rubrics": [
        {
          "rubric_id": "conciseness",
          "rubric_content": {
            "text_property": "The response is concise and avoids unnecessary preamble or repetition."
          }
        },
        {
          "rubric_id": "intent_inference",
          "rubric_content": {
            "text_property": "Even for ambiguous questions, the response infers and addresses the user's actual intent."
          }
        }
      ]
    }
  }
}

Writing the criteria

I'd recommend putting criteria in eval_config.json (you can also pass them from code, but I won't cover that here). The format:

{
  "criteria": {
    "response_match_score": 0.8
  }
}

The value is the pass threshold. Any test case scoring below it is treated as a failure.
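
You can also list several criteria side by side, and each test case is checked against every threshold. For example, to require a perfect tool trajectory and a reasonably close response (the thresholds are only illustrative):

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}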

3. Run the evaluation

ADK gives you three ways to run evaluations:

  1. Web UI (adk web)
  2. Programmatic, via pytest
  3. CLI (adk eval)

This post focuses on option 2 — it's easy to put in CI, and the flow is explicit in code. For the others, see the official docs.

Writing evals with pytest

ADK exposes the AgentEvaluator class for this. The basic shape:

from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_config import EvalConfig
from google.adk.evaluation.eval_set import EvalSet
import pytest


@pytest.mark.asyncio
async def test_sample_agent():
    with open("tests/evalsets/basic.evalset.json", "r") as f:
        eval_set = EvalSet.model_validate_json(f.read())

    with open("tests/eval_config.json", "r") as f:
        eval_config = EvalConfig.model_validate_json(f.read())

    await AgentEvaluator.evaluate_eval_set(
        agent_module="sample_agent",
        eval_set=eval_set,
        eval_config=eval_config,
    )

agent_module takes the Python module name (e.g. sample_agent) of the agent you want to evaluate. Load the eval set and config, pass them to evaluate_eval_set, and it runs every case you defined.

AgentEvaluator.evaluate also exists and handles the loading for you, so the code above can be made even shorter. I'm using evaluate_eval_set here on purpose because it keeps the flow visible.
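
For reference, the shorter form looks roughly like the sketch below. The exact parameter names (eval_dataset_file_path_or_dir in particular) and which config file it discovers on disk depend on your ADK version, so treat this as an assumption to verify rather than a drop-in:

from google.adk.evaluation.agent_evaluator import AgentEvaluator
import pytest


@pytest.mark.asyncio
async def test_sample_agent_short():
    # evaluate() loads the eval set and criteria from disk itself.
    await AgentEvaluator.evaluate(
        agent_module="sample_agent",
        eval_dataset_file_path_or_dir="tests/evalsets/basic.evalset.json",
    )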

Summary

  • Agents are probabilistic, so traditional unit and integration tests can't fully guarantee quality. Building an eval setup early gives you the foundation to actually run a quality-improvement loop.
  • ADK structures the evaluation pipeline into three steps: prepare test cases, prepare criteria, run the evaluation.
  • There are 11 built-in criteria, ranging from trajectory checks (tool_trajectory_avg_score) to semantic match (final_response_match_v2) to rubric-based judgments (rubric_based_final_response_quality_v1) — pick whichever fits the task.
  • Because it plugs into pytest, it's easy to put on CI.

A reasonable starting point is to take a small agent, write 2-3 test cases against it, and grow from there as you need more coverage.

I've put a minimal tool_trajectory_avg_score setup in this repo. Run uv sync, then uv run pytest, and you're ready to go:

Koichi73 / adk-tool-trajectory-eval-sample





That's it — hope this gives you a useful starting point for your own agent evals!

Top comments (1)

Murilo Augusto

Nice one, brother!