Evaluating and Testing Agents
Metrics, tracing, and eval frameworks for measuring how reliable your agent actually is.
Agent evaluation should test outcomes, not just fluent text. Measure task success, tool correctness, latency, cost, and safety behavior. A good-looking trace full of confident Thoughts can still hide a wrong final answer.
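As a rough illustration of tracking those five dimensions together, the sketch below records one row per agent run and aggregates them. The `RunRecord` fields and the `summarize` helper are hypothetical names chosen for this example, not part of the companion repo.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunRecord:
    """One evaluated agent run (hypothetical schema for illustration)."""
    task_succeeded: bool     # did the final answer meet the task's pass criteria?
    tool_calls: int          # total tool calls the agent made
    bad_tool_calls: int      # calls that used the wrong tool or malformed arguments
    latency_s: float         # wall-clock time for the whole run
    cost_usd: float          # API spend for the run
    safety_violation: bool   # did the run break a safety rule?


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate per-run records into the headline metrics."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    n = len(runs)
    total_calls = sum(r.tool_calls for r in runs)
    bad_calls = sum(r.bad_tool_calls for r in runs)
    return {
        "task_success_rate": sum(r.task_succeeded for r in runs) / n,
        "tool_error_rate": bad_calls / total_calls if total_calls else 0.0,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
        "safety_violation_rate": sum(r.safety_violation for r in runs) / n,
    }
```

Keeping success, tool errors, latency, cost, and safety in one record makes it harder for a fluent but wrong run to pass unnoticed, because the success flag is graded separately from the text the agent produced.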
Here is a real, minimal eval harness, adapted from `agentic/research-agent/eval.py` in the companion repo. It has two layers on purpose: fast, free unit checks on the parser that run on every commit, and slower, real end-to-end checks that cost API calls and run less often.
# parse_action and run_agent are the agent's own functions; import them from
# wherever they live in the companion repo before running these checks.

# Layer 1: free, deterministic, no API calls - catches parser regressions
PARSER_CASES = [
    {
        "input": 'Thought: need data.\nAction: search\nAction Input: {"query": "AAPL price"}',
        "expected_action": "search",
        "expected_input": {"query": "AAPL price"},
    },
    {
        "input": "Thought: trying.\nAction: search\nAction Input: {not valid json}",
        "expected_action": "search",
        "expected_input": {},  # malformed JSON should degrade gracefully, not crash
    },
]


def run_parser_checks() -> bool:
    passed = 0
    for case in PARSER_CASES:
        result = parse_action(case["input"])
        # parse_action is expected to return an (action, input) tuple, or None on failure
        ok = result is not None and result == (case["expected_action"], case["expected_input"])
        passed += ok
    return passed == len(PARSER_CASES)


# Layer 2: real API calls, opt-in, graded against expected keywords
# (the time-budget check is not shown in this excerpt)
TASK_CASES = [
    {"task": "What is RAG in AI and how does it work?", "expected_keywords": ["retrieval", "generation"]},
]


def run_task_checks() -> bool:
    passed = 0
    for case in TASK_CASES:
        # Lowercase the report so keyword matching is case-insensitive
        report = run_agent(case["task"]).lower()
        passed += all(kw in report for kw in case["expected_keywords"])
    return passed == len(TASK_CASES)

Run `python eval.py` in the companion repo to execute the free layer in under a second, or `python eval.py --live` to also run the real task checks. Treat this like CI for agent behavior: re-run it on every prompt or tool change, rather than doing a one-time manual QA pass before launch.
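The excerpt does not show how the `--live` switch is wired up. One plausible arrangement, sketched here with the standard `argparse` module, is an entry point that always runs the free layer and only runs the paid layer when asked. The `main` function and its `parser_checks` / `task_checks` parameters are assumptions for this example; in the real script they would presumably be `run_parser_checks` and `run_task_checks`.

```python
import argparse
from typing import Callable, Optional, Sequence


def main(
    argv: Optional[Sequence[str]] = None,
    parser_checks: Callable[[], bool] = lambda: True,  # stand-in for run_parser_checks
    task_checks: Callable[[], bool] = lambda: True,    # stand-in for run_task_checks
) -> int:
    """Return a process exit code: 0 if every selected layer passed, else 1."""
    cli = argparse.ArgumentParser(description="Agent eval harness (sketch)")
    cli.add_argument(
        "--live",
        action="store_true",
        help="also run the checks that make real API calls",
    )
    args = cli.parse_args(argv)

    # Layer 1 always runs. If it fails, stop before spending money on layer 2.
    if not parser_checks():
        return 1

    # Layer 2 is opt-in because it is slow and costs API calls.
    if args.live and not task_checks():
        return 1

    return 0
```

Calling `raise SystemExit(main())` at the bottom of the script would turn the return value into the exit code, so a CI job fails whenever a layer fails. Skipping the live layer after a parser failure is a design choice in this sketch: a broken parser usually makes the end-to-end results meaningless anyway.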
You do not test pilots only in perfect weather. Evaluate agents on edge cases, ambiguous inputs, and partial failures to see how they recover under pressure.
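One way to exercise partial failures without touching a real API is to inject faults into a tool and check what comes out the other side. The sketch below is a minimal illustration under stated assumptions: `flaky`, `call_with_retry`, `fake_search`, and `RECOVERY_CASES` are hypothetical names, and `call_with_retry` only stands in for whatever recovery logic the actual agent has.

```python
from typing import Callable


def flaky(tool: Callable[[str], str], fail_first: int) -> Callable[[str], str]:
    """Wrap a tool so that its first `fail_first` calls raise TimeoutError."""
    remaining = fail_first

    def wrapped(query: str) -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise TimeoutError("injected tool failure")
        return tool(query)

    return wrapped


def call_with_retry(tool: Callable[[str], str], query: str, max_attempts: int = 3) -> str:
    """Stand-in for the agent's recovery path: retry, then fail with an explicit message."""
    for _ in range(max_attempts):
        try:
            return tool(query)
        except TimeoutError:
            continue
    return "ERROR: tool unavailable"


def fake_search(query: str) -> str:
    """Deterministic fake tool so the check needs no network access."""
    return f"results for {query}"


# Each case says how many times the tool fails before working, and what should come back.
RECOVERY_CASES = [
    {"fail_first": 0, "expected": "results for AAPL price"},   # happy path
    {"fail_first": 2, "expected": "results for AAPL price"},   # recovers within the retry limit
    {"fail_first": 5, "expected": "ERROR: tool unavailable"},  # gives up cleanly, no crash
]


def run_recovery_checks() -> bool:
    return all(
        call_with_retry(flaky(fake_search, case["fail_first"]), "AAPL price") == case["expected"]
        for case in RECOVERY_CASES
    )
```

Because the faults are injected deterministically, checks like these can sit in the free layer and run on every commit, while ambiguous-input cases that need a real model stay in the live layer.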