Agent Skill
2/7/2026

build-eval

Write rigorous evals for LLM agents, multi-agent systems, skills, MCP servers, and prompts. Use when: building test suites, measuring agent effectiveness, evaluating coordination, or choosing eval frameworks. Covers: DeepEval, Braintrust, RAGAS, precision/recall, F1, task completion, pass@k, iterative metrics, multi-agent coordination.

yzavyas
npx skills add yzavyas/claude-1337

SKILL.md

name: build-eval
description: "Write rigorous evals for LLM agents, multi-agent systems, skills, MCP servers, and prompts. Use when: building test suites, measuring agent effectiveness, evaluating coordination, or choosing eval frameworks. Covers: DeepEval, Braintrust, RAGAS, precision/recall, F1, task completion, pass@k, iterative metrics, multi-agent coordination."

Eval-1337

Write evals that measure what matters. Not vanity metrics.

The key to success is measuring performance and iterating (Anthropic 2026).

The Core Problem

# This tells you NOTHING
activation_rate = 1.0  # 100%: activates on every prompt = useless

Single metrics lie. You need to measure BOTH failure modes: firing when it should not (noise) and failing to fire when it should (misses).

Three Grader Types

| Type | Examples | Use When |
|------|----------|----------|
| Code-based | String match, regex, test suites, outcome verification | Deterministic checks (speed, objectivity) |
| Model-based | LLM rubric scoring, pairwise comparison, multi-judge | Open-ended tasks (flexibility, nuance) |
| Human | Expert review, crowdsourced judgment, spot-check | Gold standard (expensive, slow) |

Code-based first: Prefer deterministic graders; use model-based for flexibility; apply partial credit for multi-component tasks. "Grade what the agent produced, not the path it took" (Anthropic 2026).
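
As a rough illustration of a code-based grader with partial credit, the sketch below scores only the final output against a weighted list of deterministic checks. The `Check` structure, the check names, and the weights are hypothetical choices for this example, not part of any framework named in this skill.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    """One deterministic check on the agent's final output (hypothetical structure)."""
    name: str
    passed: Callable[[str], bool]
    weight: float = 1.0


def grade(output: str, checks: list[Check]) -> float:
    """Return weighted partial credit in [0, 1], looking only at the final output."""
    total = sum(c.weight for c in checks)
    earned = sum(c.weight for c in checks if c.passed(output))
    return earned / total if total else 0.0


# Illustrative checks for a "which crate parses CLI args?" task.
checks = [
    Check("mentions_clap", lambda out: "clap" in out.lower()),
    Check("has_cargo_command", lambda out: re.search(r"cargo add \w+", out) is not None),
    Check("under_500_chars", lambda out: len(out) < 500, weight=0.5),
]

score = grade("Use clap. Run `cargo add clap` to install it.", checks)
print(f"{score:.2f}")  # 1.00
```

Because the checks inspect the output rather than the transcript, an agent that reaches the same answer by a different route receives the same score.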

Match Eval to Agent Type

| Agent Type | Primary Grader | Key Metrics | Benchmark |
|------------|----------------|-------------|-----------|
| Coding | Code (test suites) | Tests pass, no regressions | SWE-bench Verified |
| Conversational | Multi (state + transcript + rubric) | Resolution, turn limits, tone | τ2-Bench |
| Research | Model (groundedness, coverage) | Claim support, source quality | Custom |
| Computer Use | Code (screenshot, state inspection) | GUI state, file system | WebArena, OSWorld |
| Skills | Code (activation) + Model (methodology) | F1 + adherence rubric | Custom |
| Multi-Agent | Multi (milestones + coordination) | Task score, handoff success | MultiAgentBench |
| Pipeline | Per-stage + handoffs + end-to-end | Stage success, bottleneck | Custom |

Non-Determinism Metrics

LLMs are stochastic. Run 5+ trials per task.

| Metric | Formula | Use When |
|--------|---------|----------|
| pass@k | P(≥1 success in k trials) | One success is enough |
| pass^k | P(all k trials succeed) | Reliability-critical |

Example: 75% per-trial success → pass@3 ≈ 98%, pass^3 ≈ 42%.

Use pass@k for exploration; pass^k for production reliability.
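
The arithmetic behind the 75% example can be checked with a short sketch. It assumes trials are independent and share one per-trial success rate `p`, which is a simplification. The function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one success in k independent trials with success rate p)."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """P(all k independent trials succeed), written pass^k above."""
    return p ** k


p = 0.75
print(f"pass@3 = {pass_at_k(p, 3):.0%}")   # pass@3 = 98%
print(f"pass^3 = {pass_hat_k(p, 3):.0%}")  # pass^3 = 42%
```

In practice `p` is itself estimated from a finite number of trials, so both figures carry sampling noise.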

Iterative Metrics (Ralph Pattern)

Traditional pass@k treats trials as independent. Iterative eval uses failures as feedback:

| Metric | Formula | Question |
|--------|---------|----------|
| pass@k (iterative) | Success within k retries with feedback | Can it recover? |
| iterations_to_pass | Retries until success | Learning speed |
| recovery_rate | (pass@k - pass@1) / (1 - pass@1) | % failures that recover |
| feedback_sensitivity | Δscore per iteration | Does guidance help? |

Use case: Agent has 60% pass@1. Is that its ceiling, or can it do better with feedback?

Iterative eval result:
├── pass@1: 60%         (baseline)
├── pass@3: 91%         (with retry + feedback)
└── recovery_rate: 78%  → Deploy with retry loop, not better prompts
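
The recovery rate in the result above follows directly from the formula in the table. A minimal sketch, with hypothetical helper names and a per-task attempt log represented as a list of booleans:

```python
from typing import Optional


def recovery_rate(pass_at_1: float, pass_at_k: float) -> float:
    """Fraction of first-attempt failures that succeed within k retries."""
    if pass_at_1 >= 1.0:
        return 0.0  # nothing failed on the first attempt, so nothing to recover
    return (pass_at_k - pass_at_1) / (1.0 - pass_at_1)


def iterations_to_pass(attempts: list[bool]) -> Optional[int]:
    """1-based index of the first successful attempt, or None if none succeeded."""
    for i, ok in enumerate(attempts, start=1):
        if ok:
            return i
    return None


print(f"{recovery_rate(0.60, 0.91):.1%}")        # 77.5%, reported as 78% above
print(iterations_to_pass([False, False, True]))  # 3
```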

Multi-Agent Metrics

Single-agent metrics miss coordination failures:

| Metric | Formula | Measures |
|--------|---------|----------|
| Task Score | Σ(milestone × weight) | Goal achievement |
| Handoff Success | Completed / expected | Task transfers work? |
| Comm Efficiency | Useful messages / total | Signal vs noise |
| Role Adherence | On-role actions / total | Staying specialized? |
| ToM Score | Passed scenarios / total | Theory of mind |
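
A sketch of how these ratios might be computed from a run log. The milestone names, weights, and counts are made up for illustration, and dividing the task score by the total weight is an assumption added here so the result stays in [0, 1].

```python
def task_score(milestones: list[tuple[str, bool, float]]) -> float:
    """Sum of weights for reached milestones, divided by total weight."""
    total = sum(weight for _, _, weight in milestones)
    earned = sum(weight for _, reached, weight in milestones if reached)
    return earned / total if total else 0.0


def rate(count: int, total: int) -> float:
    """Count/total ratio, usable for handoff success, comm efficiency, role adherence."""
    return count / total if total else 0.0


# Hypothetical milestones: (name, reached, weight).
milestones = [
    ("plan_agreed", True, 0.2),
    ("patch_written", True, 0.5),
    ("tests_green", False, 0.3),
]

print(f"task_score      = {task_score(milestones):.2f}")  # 0.70
print(f"handoff_success = {rate(3, 4):.2f}")              # 0.75
print(f"comm_efficiency = {rate(12, 30):.2f}")            # 0.40
```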

Match Metric to Target

| Target | What to Measure | Metric | Framework |
|--------|-----------------|--------|-----------|
| Agents | Task completion | pass@k / accuracy | DeepEval |
| Agents | Tool usage | ToolCorrectnessMetric | DeepEval |
| Skills | Activation (L1) | Precision/Recall/F1 | Custom |
| Skills | Methodology (L2) | LLM rubric | DeepEval GEval |
| MCP Servers | Tool calls | ToolCallAccuracy | RAGAS |
| MCP Servers | Reliability | MCPGauge 4-dim | Custom |
| Prompts | Output quality | LLM-as-judge | Braintrust |
| Any | Traces | Span analysis | Phoenix (local) |
| Any | Behavioral | Bloom scenarios | Anthropic Bloom |

Building Evals: The Roadmap

From Anthropic's agent evaluation guide (2026):

Step 0: Start early with 20-50 tasks from actual failures
Step 1: Write unambiguous tasks (pass expert test)
Step 2: Build balanced problem sets (positive AND negative)
Step 3: Robust harness with clean environments per trial
Step 4: Thoughtful grader design (deterministic preferred)
Step 5: Read transcripts (verify graders, understand failures)
Step 6: Monitor saturation (100% → only tracks regressions)
Step 7: Maintain as living artifact (dedicated ownership)
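
Step 3 can be sketched with the standard library alone: each trial gets a fresh temporary directory, so files left by one run cannot influence the next. `example_task` is a hypothetical stand-in for a real agent invocation.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_trials(task: Callable[[Path], bool], n_trials: int = 5) -> list[bool]:
    """Run `task` n_trials times, each in its own fresh temporary directory."""
    results = []
    for _ in range(n_trials):
        # The directory and everything in it is deleted when the block exits.
        with tempfile.TemporaryDirectory() as workdir:
            results.append(task(Path(workdir)))
    return results


def example_task(workdir: Path) -> bool:
    """Stand-in for an agent run: fails if an earlier trial left state behind."""
    marker = workdir / "output.txt"
    if marker.exists():
        return False
    marker.write_text("done")
    return True


print(run_trials(example_task))  # [True, True, True, True, True]
```

A real harness would likely also need to reset databases, caches, and any external services the agent touches. A temporary directory isolates only the file system.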

Common Pitfalls

| Trap | Fix |
|------|-----|
| Single test run | 5+ runs - stochastic outputs |
| One-sided test set | Balance positive/negative - prevents overfitting |
| Measuring recall only | Add precision - high recall + low precision = noise |
| "Forced eval" inflation | Realistic conditions - forced mode inflates scores |
| No ground truth | Label expectations - must_trigger, should_not |
| Grader too rigid | Accept valid variations - grade outcome, not path |
| Shared state between runs | Isolate environments - leftover files cause correlation |
| Bypass vulnerabilities | Design to require solving - agents exploit loopholes |
| Eval saturation | Expand difficulty - high pass rates mask improvements |

Classification Metrics (F1)

Use when you have TWO failure modes:

                      ACTUAL
                      Yes         No
                  +-----------+-----------+
EXPECTED    Yes   |    TP     |    FN     |
                  |  Correct  |  Missed   |
                  +-----------+-----------+
            No    |    FP     |    TN     |
                  |   Noise   |  Correct  |
                  +-----------+-----------+

Precision = TP / (TP + FP)   "when it fires, is it right?"
Recall    = TP / (TP + FN)   "when it should fire, does it?"
F1        = 2×(P×R)/(P+R)    "balanced score"
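
The three formulas translate directly into code. The sketch below also shows why a skill that fires on every prompt scores poorly despite perfect recall. The counts (10 prompts that should trigger, 40 that should not) are invented for the example.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# A skill that activates on every prompt: no misses, but 40 false positives.
p, r, f1 = precision_recall_f1(tp=10, fp=40, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.20 recall=1.00 f1=0.33
```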

Labeled Test Cases

{"input": "What crate for CLI args?", "expectation": "must_trigger"}
{"input": "Write a haiku", "expectation": "should_not_trigger"}
{"input": "Explain ownership", "expectation": "acceptable"}

| Label | Meaning | Measures |
|-------|---------|----------|
| must_trigger | Should definitely fire | Recall (misses) |
| should_not_trigger | Must not fire | Precision (noise) |
| acceptable | Either outcome fine | Excluded |
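
One possible way to turn labeled cases into confusion-matrix counts is sketched below. The `fired` mapping (whether the skill activated on each input) is a hypothetical observation format. How activation is actually detected depends on your harness.

```python
import json
from collections import Counter

CASES_JSONL = """\
{"input": "What crate for CLI args?", "expectation": "must_trigger"}
{"input": "Write a haiku", "expectation": "should_not_trigger"}
{"input": "Explain ownership", "expectation": "acceptable"}
"""


def tally(cases: list[dict], fired: dict[str, bool]) -> Counter:
    """Count TP/FN/FP/TN, skipping cases labeled 'acceptable'."""
    counts: Counter = Counter()
    for case in cases:
        did_fire = fired[case["input"]]
        label = case["expectation"]
        if label == "must_trigger":
            counts["TP" if did_fire else "FN"] += 1
        elif label == "should_not_trigger":
            counts["FP" if did_fire else "TN"] += 1
        # "acceptable": either outcome is fine, so it is excluded from the counts.
    return counts


cases = [json.loads(line) for line in CASES_JSONL.splitlines() if line.strip()]
# Hypothetical observation: the skill fired on every prompt.
fired = {case["input"]: True for case in cases}
print(dict(tally(cases, fired)))  # {'TP': 1, 'FP': 1}
```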

Defense in Depth (Swiss Cheese)

No single eval method catches everything. Layer them:

| Method | Speed | Coverage | Best For |
|--------|-------|----------|----------|
| Automated evals | Fast | Narrow | Regression prevention |
| Production monitoring | Real-time | Broad | Real behavior |
| A/B testing | Days/weeks | Statistical | Outcome measurement |
| Manual transcript review | Slow | Deep | Building intuition |
| Human studies | Very slow | Gold-standard | Subjective quality |

Use multiple methods; each layer catches what others miss.

Framework Decision

| Situation | Use | Why |
|-----------|-----|-----|
| Python agent evals | DeepEval | TaskCompletionMetric, ToolCorrectnessMetric |
| TypeScript/Node | Braintrust | Identical Python/TS API |
| RAG pipelines | RAGAS | ToolCallF1, context metrics |
| Skill activation | Custom | Precision/recall with labeled expectations |
| Behavioral evals | Bloom | Automated scenario generation |
| Infrastructure | Harbor, Promptfoo | Containerized, YAML-based |

Quick Reference

AGENTS
  Coding: Test suites (SWE-bench pattern)
  Conversational: Multi-grader (state + transcript + rubric)
  Research: LLM groundedness + coverage
  Metrics: pass@k (exploration), pass^k (reliability)

MULTI-AGENT
  Task: Milestone-weighted task score
  Coordination: Handoff success, comm efficiency
  Roles: Role adherence, work duplication
  Advanced: Theory of Mind (ToM) scenarios

PIPELINE (Sequential A → B → C)
  Level 1: Single-agent metrics per stage
  Level 2: Handoff quality between stages
  Level 3: End-to-end pipeline metrics
  Key: Find bottleneck stage, error propagation

ITERATIVE (Ralph Pattern)
  When: Deciding retry loop vs better prompts
  Metrics: iterations_to_pass, recovery_rate, feedback_sensitivity
  Key insight: pass@1 ≠ capability ceiling

SKILLS
  Level 1 (Activation): F1 with labeled expectations
  Level 2 (Methodology): GEval rubric (evidence, WHY, verification)
  Observable: skill_check, skill_match spans

MCP SERVERS
  RAGAS: ToolCallAccuracy, ToolCallF1
  MCPGauge: proactivity, compliance, effectiveness, overhead

BEHAVIORAL
  Bloom: Automated scenario generation for alignment properties
  Targets: sycophancy, self-preservation, sabotage

OBSERVABILITY
  ALL extensions should have OTel spans
  Skills: skill_check, skill_match
  Agents: agent_run, llm_call, tool_call
  MCP: mcp_server, mcp_call

Domain Routing

| Detected | Load |
|----------|------|
| agent, task completion, pass@k | agents.md |
| multi-agent, coordination, handoff | multi-agent.md |
| pipeline, sequential, stage, chain | multi-agent.md |
| iterative, retry, recovery, ralph | iterative.md |
| skill, activation, trigger | skills.md |
| methodology, behavioral, adherence | methodology.md |
| MCP, tool call, server | mcp.md |
| prompt, quality, judge | prompts.md |
| trace, debug, analyze spans | observability.md |
| security, red team, adversarial | security.md |
| benchmark, SWE-bench, WebArena | benchmarks.md |
| dataset, labeling | datasets.md |
| DeepEval, Braintrust, RAGAS | frameworks.md |
| Full citations | sources.md |

Skills Info
Original Name: build-eval
Author: yzavyas