Evaluating and Testing Agents

Metrics, tracing, and eval frameworks for measuring how reliable your agent actually is.

Agent evaluation should test outcomes, not just fluent text. Measure task success, tool correctness, latency, cost, and safety behavior. A good-looking trace full of confident Thoughts can still hide a wrong final answer.

Here is a real, minimal eval harness, adapted from `agentic/research-agent/eval.py` in the companion repo. It has two layers on purpose: fast, free unit checks on the parser that run on every commit, and slower, real end-to-end checks that cost API calls and run less often.

```python
import time

# parse_action and run_agent are defined elsewhere in the companion repo;
# import them here from wherever they live in your checkout.

# Layer 1: free, deterministic, no API calls - catches parser regressions
PARSER_CASES = [
    {
        "input": 'Thought: need data.\nAction: search\nAction Input: {"query": "AAPL price"}',
        "expected_action": "search",
        "expected_input": {"query": "AAPL price"},
    },
    {
        "input": "Thought: trying.\nAction: search\nAction Input: {not valid json}",
        "expected_action": "search",
        "expected_input": {},  # malformed JSON should degrade gracefully, not crash
    },
]

def run_parser_checks() -> bool:
    passed = 0
    for case in PARSER_CASES:
        result = parse_action(case["input"])
        # A None result (no action found) counts as a failure, not an exception.
        ok = result is not None and result == (case["expected_action"], case["expected_input"])
        passed += int(ok)
    return passed == len(PARSER_CASES)

# Layer 2: real API calls, opt-in, graded against expected keywords + a time budget
TASK_CASES = [
    {
        "task": "What is RAG in AI and how does it work?",
        "expected_keywords": ["retrieval", "generation"],
        "max_seconds": 60,  # assumed budget for illustration; tune per task
    },
]

def run_task_checks() -> bool:
    passed = 0
    for case in TASK_CASES:
        start = time.perf_counter()
        report = run_agent(case["task"]).lower()
        elapsed = time.perf_counter() - start
        has_keywords = all(kw in report for kw in case["expected_keywords"])
        # A case passes only if the answer is on topic and arrived within budget.
        passed += int(has_keywords and elapsed <= case["max_seconds"])
    return passed == len(TASK_CASES)
```
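The parser checks above call `parse_action`, which is not shown here. To make the two cases concrete, the following is a hypothetical stand-in, not the companion repo's implementation: it assumes the agent emits `Action:` and `Action Input:` lines, and it falls back to an empty dict when the input is not valid JSON.

```python
import json
import re
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the repo's parse_action; the real one may differ.
_ACTION_RE = re.compile(r"^Action:\s*(.+?)\s*$", re.MULTILINE)
_INPUT_RE = re.compile(r"^Action Input:\s*(.*)$", re.MULTILINE | re.DOTALL)

def parse_action(text: str) -> Optional[Tuple[str, Dict]]:
    """Return (action_name, action_input), or None if there is no Action line."""
    action = _ACTION_RE.search(text)
    if action is None:
        return None
    raw = _INPUT_RE.search(text)
    try:
        parsed = json.loads(raw.group(1)) if raw else {}
    except json.JSONDecodeError:
        parsed = {}  # malformed JSON degrades to an empty input instead of raising
    if not isinstance(parsed, dict):
        parsed = {}  # valid JSON that is not an object is treated the same way
    return action.group(1), parsed
```

Returning `None` for a missing action, rather than raising, is what lets the harness count that case as a failed check instead of crashing the whole run.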

Core Metrics

Track success rate, steps-to-completion, tool error rate, retry count, average token usage, and human override frequency. Tracked over time, these numbers expose regressions that a single pass/fail result hides, and the layered harness above is the natural place to record them on every change instead of guessing.
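As a rough sketch of how the metrics listed above could be aggregated, assume each agent run is logged as one record. The `RunRecord` fields and the `summarize` helper are illustrative names, not part of the companion repo.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

@dataclass
class RunRecord:
    # Hypothetical per-run log entry; field names are illustrative.
    succeeded: bool
    steps: int
    tool_calls: int
    tool_errors: int
    retries: int
    tokens: int
    human_override: bool

def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate per-run records into the core reliability metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    successes = [r for r in runs if r.succeeded]
    total_tool_calls = sum(r.tool_calls for r in runs)
    return {
        "success_rate": len(successes) / len(runs),
        # Steps-to-completion only makes sense for runs that completed.
        "avg_steps_to_completion": mean(r.steps for r in successes) if successes else float("nan"),
        "tool_error_rate": sum(r.tool_errors for r in runs) / total_tool_calls if total_tool_calls else 0.0,
        "avg_retries": mean(r.retries for r in runs),
        "avg_tokens": mean(r.tokens for r in runs),
        "human_override_rate": sum(r.human_override for r in runs) / len(runs),
    }
```

Comparing the output of `summarize` between two commits is one simple way to spot a trend, for example a success rate that holds steady while average tokens per run climbs.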

Run `python eval.py` in the companion repo to execute the free layer, which finishes in under a second, or `python eval.py --live` to also run the real task checks. Treat this like CI for agent behavior: re-run it on every prompt or tool change rather than relying on a one-time manual QA pass before launch.
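One way the `--live` switch could gate the paid layer is sketched below. This is an assumption about how such an entry point might be wired, not the repo's actual `eval.py`; the two check functions are passed in as parameters so the sketch runs on its own.

```python
import argparse
from typing import Callable, List, Optional

def main(
    argv: Optional[List[str]] = None,
    parser_checks: Callable[[], bool] = lambda: True,  # stand-in for run_parser_checks
    task_checks: Callable[[], bool] = lambda: True,    # stand-in for run_task_checks
) -> int:
    """Run the free layer always and the live layer only on request; return an exit code."""
    ap = argparse.ArgumentParser(description="Agent eval harness (sketch)")
    ap.add_argument("--live", action="store_true",
                    help="also run checks that make real API calls")
    args = ap.parse_args(argv)

    ok = parser_checks()
    if args.live:
        # Short-circuit: if the free layer already failed, skip the paid calls.
        ok = ok and task_checks()
    return 0 if ok else 1

# In a script, the exit code would typically be passed to sys.exit(main()).
```

Returning a nonzero exit code on failure is what lets a CI job treat a failed eval the same way it treats a failed unit test.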

Flight Simulator

You do not test pilots only in perfect weather. Evaluate agents on edge cases, ambiguous inputs, and partial failures to see how they recover under pressure.
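A partial failure can be simulated by wrapping a tool so that its first few calls fail. The sketch below uses hypothetical names throughout (`FlakyTool`, `call_with_retries`), and the retry loop is only a toy stand-in for whatever recovery logic your agent actually has.

```python
from typing import Callable, Dict

class FlakyTool:
    """Wrap a tool so its first `failures` calls raise, simulating a partial outage."""

    def __init__(self, tool: Callable[[Dict], str], failures: int) -> None:
        self.tool = tool
        self.failures_left = failures
        self.calls = 0

    def __call__(self, tool_input: Dict) -> str:
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise TimeoutError("injected failure")
        return self.tool(tool_input)

def call_with_retries(tool: Callable[[Dict], str], tool_input: Dict, max_retries: int = 2) -> str:
    """Toy recovery logic: retry on timeout, then return an explicit error string."""
    for _ in range(max_retries + 1):
        try:
            return tool(tool_input)
        except TimeoutError:
            continue
    return "ERROR: tool unavailable"

def search(tool_input: Dict) -> str:
    # Hypothetical tool used only for this illustration.
    return f"results for {tool_input['query']}"
```

An eval case built this way checks two things: that the agent recovers when the outage is short (`FlakyTool(search, failures=2)`), and that it reports the failure plainly instead of inventing an answer when the outage outlasts its retries.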

What's Next

Evaluation tells you an agent is broken. Production guardrails are what stop 'broken' from becoming expensive or dangerous while you fix it.