build-eval
Write rigorous evals for LLM agents, multi-agent systems, skills, MCP servers, and prompts. Use when: building test suites, measuring agent effectiveness, evaluating coordination, or choosing eval frameworks. Covers: DeepEval, Braintrust, RAGAS, precision/recall, F1, task completion, pass@k, iterative metrics, multi-agent coordination.
Eval-1337
Write evals that measure what matters. Not vanity metrics.
The key to success is measuring performance and iterating (Anthropic 2026).
The Core Problem
# This tells you NOTHING
activation_rate = 100% # Activates on every prompt = useless
Single metrics lie. You need to measure BOTH failure modes: firing when it should not (false positives) and failing to fire when it should (false negatives).
Three Grader Types
| Type | Examples | Use When |
|---|---|---|
| Code-based | String match, regex, test suites, outcome verification | Deterministic checks (speed, objectivity) |
| Model-based | LLM rubric scoring, pairwise comparison, multi-judge | Open-ended tasks (flexibility, nuance) |
| Human | Expert review, crowdsourced judgment, spot-check | Gold standard (expensive, slow) |
Code-based first: Prefer deterministic graders; use model-based for flexibility; apply partial credit for multi-component tasks. "Grade what the agent produced, not the path it took" (Anthropic 2026).
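A minimal sketch of a code-based grader with partial credit, assuming the harness hands the grader a dictionary describing the final outcome. The `Check` class, the check names, the weights, and the outcome keys are all hypothetical and only illustrate grading what was produced rather than the path taken.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """One deterministic check on the final outcome (hypothetical structure)."""
    name: str
    weight: float
    passed: Callable[[dict], bool]  # inspects the outcome only, never the transcript


def grade(outcome: dict, checks: list[Check]) -> float:
    """Return weighted partial credit in [0, 1]."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry a positive total weight")
    earned = sum(c.weight for c in checks if c.passed(outcome))
    return earned / total


# Illustrative checks for a multi-component coding task (names are assumptions).
checks = [
    Check("tests_pass", 0.6, lambda o: o["failed_tests"] == 0),
    Check("report_written", 0.2, lambda o: "report.md" in o["files"]),
    Check("lint_clean", 0.2, lambda o: o["lint_errors"] == 0),
]

outcome = {"failed_tests": 0, "files": ["report.md"], "lint_errors": 3}
score = grade(outcome, checks)
print(f"partial credit: {score:.2f}")
```

Because every check reads only the outcome, two agents that reach the same end state by different routes receive the same score.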
Match Eval to Agent Type
| Agent Type | Primary Grader | Key Metrics | Benchmark |
|---|---|---|---|
| Coding | Code (test suites) | Tests pass, no regressions | SWE-bench Verified |
| Conversational | Multi (state + transcript + rubric) | Resolution, turn limits, tone | τ2-Bench |
| Research | Model (groundedness, coverage) | Claim support, source quality | Custom |
| Computer Use | Code (screenshot, state inspection) | GUI state, file system | WebArena, OSWorld |
| Skills | Code (activation) + Model (methodology) | F1 + adherence rubric | Custom |
| Multi-Agent | Multi (milestones + coordination) | Task score, handoff success | MultiAgentBench |
| Pipeline | Per-stage + handoffs + end-to-end | Stage success, bottleneck | Custom |
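For the Pipeline row, one rough way to locate the bottleneck is to compare per-stage success rates. The sketch below assumes each rate was measured with known-good input to that stage and that stage failures are independent, which real pipelines may violate. The stage names and numbers are made up for illustration.

```python
from math import prod


def pipeline_report(stage_success: dict[str, float]) -> dict:
    """Summarize a sequential pipeline from per-stage success rates.

    Assumes independent stage failures, so the end-to-end estimate is the product.
    """
    if not stage_success:
        raise ValueError("at least one stage is required")
    return {
        "end_to_end_estimate": prod(stage_success.values()),
        "bottleneck": min(stage_success, key=stage_success.get),
    }


report = pipeline_report({"retrieve": 0.95, "plan": 0.80, "execute": 0.90})
print(report["bottleneck"], round(report["end_to_end_estimate"], 3))
```

If the measured end-to-end rate falls well below this product, that gap points at handoff problems or error propagation rather than at any single stage.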
Non-Determinism Metrics
LLMs are stochastic. Run 5+ trials per task.
| Metric | Formula | Use When |
|---|---|---|
| pass@k | P(≥1 success in k trials) | One success is enough |
| pass^k | P(all k trials succeed) | Reliability-critical |
Example: 75% per-trial success → pass@3 = 1 − 0.25³ ≈ 98%, pass^3 = 0.75³ ≈ 42%.
Use pass@k for exploration; pass^k for production reliability.
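The two formulas can be checked with a few lines of Python. This sketch assumes trials are independent and that the per-trial success probability `p` is already known; in practice `p` is itself an estimate from repeated trials. The function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one success in k independent trials)."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """P(all k independent trials succeed)."""
    return p ** k


p = 0.75
print(f"pass@3 = {pass_at_k(p, 3):.3f}")   # 0.984
print(f"pass^3 = {pass_pow_k(p, 3):.3f}")  # 0.422
```

The gap widens as k grows: pass@k climbs toward 1 while pass^k decays toward 0, which is why the same agent can look excellent for exploration and unacceptable for production.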
Iterative Metrics (Ralph Pattern)
Traditional pass@k treats trials as independent. Iterative eval uses failures as feedback:
| Metric | Formula | Question |
|---|---|---|
| pass@k (iterative) | Success within k retries with feedback | Can it recover? |
| iterations_to_pass | Retries until success | Learning speed |
| recovery_rate | (pass@k - pass@1) / (1 - pass@1) | % failures that recover |
| feedback_sensitivity | Δscore per iteration | Does guidance help? |
Use case: Agent has 60% pass@1. Is that its ceiling, or can it do better with feedback?
Iterative eval result:
├── pass@1: 60% (baseline)
├── pass@3: 91% (with retry + feedback)
└── recovery_rate: 78% → Deploy with retry loop, not better prompts
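The numbers in the result above follow from the recovery_rate formula in the table. A minimal sketch, with illustrative function names and the assumption that each task's attempts are recorded as an ordered list of pass/fail booleans:

```python
from typing import Optional


def recovery_rate(pass_at_1: float, pass_at_k: float) -> float:
    """Fraction of first-attempt failures that succeed within k retries."""
    if not 0.0 <= pass_at_1 <= pass_at_k <= 1.0:
        raise ValueError("expected 0 <= pass@1 <= pass@k <= 1")
    if pass_at_1 == 1.0:
        return 0.0  # nothing failed, so there is nothing to recover
    return (pass_at_k - pass_at_1) / (1.0 - pass_at_1)


def iterations_to_pass(attempts: list[bool]) -> Optional[int]:
    """1-based index of the first passing attempt, or None if none passed."""
    for i, ok in enumerate(attempts, start=1):
        if ok:
            return i
    return None


print(f"recovery_rate = {recovery_rate(0.60, 0.91):.1%}")  # 77.5%, reported above as 78%
print(iterations_to_pass([False, False, True]))            # 3
```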
Multi-Agent Metrics
Single-agent metrics miss coordination failures:
| Metric | Formula | Measures |
|---|---|---|
| Task Score | Σ(milestone × weight) | Goal achievement |
| Handoff Success | Completed / expected | Task transfers work? |
| Comm Efficiency | Useful messages / total | Signal vs noise |
| Role Adherence | On-role actions / total | Staying specialized? |
| ToM Score | Passed scenarios / total | Theory of mind |
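As a rough illustration of the table, the sketch below computes the task score and the ratio-style metrics from simple counts. The milestone weights and message counts are invented, and treating an empty denominator as 0.0 is an assumption, not something the table specifies.

```python
def task_score(milestones: list[tuple[bool, float]]) -> float:
    """Sum of milestone x weight, where a milestone is 1 if achieved and 0 otherwise."""
    return sum(weight for achieved, weight in milestones if achieved)


def ratio(numerator: int, denominator: int) -> float:
    """Shared helper for handoff success, comm efficiency, and role adherence."""
    return numerator / denominator if denominator else 0.0


# Hypothetical run: three weighted milestones, ten expected handoffs, 40 messages.
score = task_score([(True, 0.5), (True, 0.3), (False, 0.2)])
handoff_success = ratio(9, 10)
comm_efficiency = ratio(26, 40)

print(f"task score {score:.2f}, handoffs {handoff_success:.0%}, comms {comm_efficiency:.0%}")
```

Reporting these side by side helps separate a run that failed on the task from one that reached the goal with wasteful coordination.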
Match Metric to Target
| Target | What to Measure | Metric | Framework |
|---|---|---|---|
| Agents | Task completion | pass@k / accuracy | DeepEval |
| Agents | Tool usage | ToolCorrectnessMetric | DeepEval |
| Skills | Activation (L1) | Precision/Recall/F1 | Custom |
| Skills | Methodology (L2) | LLM rubric | DeepEval GEval |
| MCP Servers | Tool calls | ToolCallAccuracy | RAGAS |
| MCP Servers | Reliability | MCPGauge 4-dim | Custom |
| Prompts | Output quality | LLM-as-judge | Braintrust |
| Any | Traces | Span analysis | Phoenix (local) |
| Any | Behavioral | Bloom scenarios | Anthropic Bloom |
Building Evals: The Roadmap
From Anthropic's agent evaluation guide (2026):
Step 0: Start early with 20-50 tasks from actual failures
Step 1: Write unambiguous tasks (pass expert test)
Step 2: Build balanced problem sets (positive AND negative)
Step 3: Robust harness with clean environments per trial
Step 4: Thoughtful grader design (deterministic preferred)
Step 5: Read transcripts (verify graders, understand failures)
Step 6: Monitor saturation (an eval at 100% can only track regressions, not improvements)
Step 7: Maintain as living artifact (dedicated ownership)
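Step 3 can be sketched with the standard library alone: give every trial its own temporary working directory so leftover files cannot leak between runs. `run_trials` and `toy_task` are hypothetical names; a real harness would likely also need to isolate processes, network access, and caches.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_trials(task: Callable[[Path], bool], n_trials: int = 5) -> list[bool]:
    """Run a task n_trials times, each in a fresh directory that is deleted afterwards."""
    results = []
    for _ in range(n_trials):
        with tempfile.TemporaryDirectory() as workdir:
            results.append(bool(task(Path(workdir))))
    return results


def toy_task(workdir: Path) -> bool:
    """Stand-in for an agent run: passes only if no state leaked from an earlier trial."""
    marker = workdir / "done.txt"
    leaked = marker.exists()
    marker.write_text("ok")
    return not leaked


results = run_trials(toy_task)
print(results)
```

Running the same toy task against one shared directory would pass only the first trial, which is the correlation described under "Shared state between runs" in the pitfalls table.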
Common Pitfalls
| Trap | Fix |
|---|---|
| Single test run | 5+ runs - stochastic outputs |
| One-sided test set | Balance positive/negative - prevents overfitting |
| Measuring recall only | Add precision - high recall + low precision = noise |
| "Forced eval" inflation | Realistic conditions - forced mode inflates scores |
| No ground truth | Label expectations - must_trigger, should_not |
| Grader too rigid | Accept valid variations - grade outcome, not path |
| Shared state between runs | Isolate environments - leftover files cause correlation |
| Bypass vulnerabilities | Design to require solving - agents exploit loopholes |
| Eval saturation | Expand difficulty - high pass rates mask improvements |
Classification Metrics (F1)
Use when you have TWO failure modes:
| | Actual: Yes | Actual: No |
|---|---|---|
| Expected: Yes | TP (correct) | FN (missed) |
| Expected: No | FP (noise) | TN (correct) |
Precision = TP / (TP + FP) "when it fires, is it right?"
Recall = TP / (TP + FN) "when it should fire, does it?"
F1 = 2×(P×R)/(P+R) "balanced score"
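The formulas translate directly to code. In this sketch, returning 0.0 when a denominator is zero is a convention chosen for illustration; some teams prefer to report such cases as undefined. The example counts are hypothetical and model a skill that fires on every one of 100 prompts, only 10 of which warranted it.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Always-on skill: 10 correct activations, 90 noisy ones, no misses.
precision, recall, f1 = precision_recall_f1(tp=10, fp=90, fn=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall is perfect here, yet precision and F1 are poor, which is exactly the case that a 100% activation rate hides.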
Labeled Test Cases
{"input": "What crate for CLI args?", "expectation": "must_trigger"}
{"input": "Write a haiku", "expectation": "should_not_trigger"}
{"input": "Explain ownership", "expectation": "acceptable"}
| Label | Meaning | Measures |
|---|---|---|
| must_trigger | Should definitely fire | Recall (misses) |
| should_not_trigger | Must not fire | Precision (noise) |
| acceptable | Either outcome fine | Excluded |
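One way to turn labeled cases into confusion-matrix counts is sketched below. It reuses the three example cases above; the `fired` mapping stands in for whatever the harness actually observed and is an assumption, as is the function name.

```python
import json
from collections import Counter

CASES_JSONL = """\
{"input": "What crate for CLI args?", "expectation": "must_trigger"}
{"input": "Write a haiku", "expectation": "should_not_trigger"}
{"input": "Explain ownership", "expectation": "acceptable"}
"""


def confusion_counts(cases: list[dict], fired: dict[str, bool]) -> Counter:
    """Count tp/fn/fp/tn from labels, skipping cases labeled 'acceptable'."""
    counts = Counter()
    for case in cases:
        label = case["expectation"]
        did_fire = fired[case["input"]]
        if label == "acceptable":
            continue  # either outcome is fine, so it is excluded from scoring
        if label == "must_trigger":
            counts["tp" if did_fire else "fn"] += 1
        elif label == "should_not_trigger":
            counts["fp" if did_fire else "tn"] += 1
        else:
            raise ValueError(f"unknown expectation: {label!r}")
    return counts


cases = [json.loads(line) for line in CASES_JSONL.splitlines()]
# Hypothetical observation: the skill fired on every input.
fired = {case["input"]: True for case in cases}
counts = confusion_counts(cases, fired)
print(dict(counts))
```

The resulting counts plug into the Precision, Recall, and F1 formulas above.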
Defense in Depth (Swiss Cheese)
No single eval method catches everything. Layer them:
| Method | Speed | Coverage | Best For |
|---|---|---|---|
| Automated evals | Fast | Narrow | Regression prevention |
| Production monitoring | Real-time | Broad | Real behavior |
| A/B testing | Days/weeks | Statistical | Outcome measurement |
| Manual transcript review | Slow | Deep | Building intuition |
| Human studies | Very slow | Gold-standard | Subjective quality |
Use multiple methods; each layer catches what others miss.
Framework Decision
| Situation | Use | Why |
|---|---|---|
| Python agent evals | DeepEval | TaskCompletionMetric, ToolCorrectness |
| TypeScript/Node | Braintrust | Identical Python/TS API |
| RAG pipelines | RAGAS | ToolCallF1, context metrics |
| Skill activation | Custom | Precision/recall with labeled expectations |
| Behavioral evals | Bloom | Automated scenario generation |
| Infrastructure | Harbor, Promptfoo | Containerized, YAML-based |
Quick Reference
AGENTS
Coding: Test suites (SWE-bench pattern)
Conversational: Multi-grader (state + transcript + rubric)
Research: LLM groundedness + coverage
Metrics: pass@k (exploration), pass^k (reliability)
MULTI-AGENT
Task: Milestone-weighted task score
Coordination: Handoff success, comm efficiency
Roles: Role adherence, work duplication
Advanced: Theory of Mind (ToM) scenarios
PIPELINE (Sequential A → B → C)
Level 1: Single-agent metrics per stage
Level 2: Handoff quality between stages
Level 3: End-to-end pipeline metrics
Key: Find bottleneck stage, error propagation
ITERATIVE (Ralph Pattern)
When: Deciding retry loop vs better prompts
Metrics: iterations_to_pass, recovery_rate, feedback_sensitivity
Key insight: pass@1 ≠ capability ceiling
SKILLS
Level 1 (Activation): F1 with labeled expectations
Level 2 (Methodology): GEval rubric (evidence, WHY, verification)
Observable: skill_check, skill_match spans
MCP SERVERS
RAGAS: ToolCallAccuracy, ToolCallF1
MCPGauge: proactivity, compliance, effectiveness, overhead
BEHAVIORAL
Bloom: Automated scenario generation for alignment properties
Targets: sycophancy, self-preservation, sabotage
OBSERVABILITY
ALL extensions should have OTel spans
Skills: skill_check, skill_match
Agents: agent_run, llm_call, tool_call
MCP: mcp_server, mcp_call
Domain Routing
| Detected | Load |
|---|---|
| agent, task completion, pass@k | agents.md |
| multi-agent, coordination, handoff | multi-agent.md |
| pipeline, sequential, stage, chain | multi-agent.md |
| iterative, retry, recovery, ralph | iterative.md |
| skill, activation, trigger | skills.md |
| methodology, behavioral, adherence | methodology.md |
| MCP, tool call, server | mcp.md |
| prompt, quality, judge | prompts.md |
| trace, debug, analyze spans | observability.md |
| security, red team, adversarial | security.md |
| benchmark, SWE-bench, WebArena | benchmarks.md |
| dataset, labeling | datasets.md |
| DeepEval, Braintrust, RAGAS | frameworks.md |
| Full citations | sources.md |