validate-skill
Validate skill directories against AgentSkills spec
SKILL.md
| Name | validate-skill |
| Description | Validate skill directories against AgentSkills spec |
Web Search Agent Evaluations
Evaluate multiple agents (Claude Code, Gemini, Droid, Codex) with different web search tools (builtin, You.com MCP) in isolated Docker containers.
Overview
This evaluation system runs a matrix comparison: 4 agents Γ 2 tools = 8 pairings, capturing full trajectories for analysis.
Key Features:
- Headless adapters - No custom code, just JSON schemas (@plaited/agent-eval-harness)
- Flag-based architecture - Single service per agent, MCP mode selected via environment variable
- Type-safe constants - MCP server definitions in
mcp-servers.ts - TypeScript entrypoint - Bun shell script for runtime MCP configuration
- Isolated execution - Each pairing runs in its own Docker container
Evaluation Pipeline
flowchart TD
Env[SEARCH_PROVIDER env var] --> Entrypoint[docker/entrypoint]
Entrypoint -->|builtin| SkipMCP[Skip MCP setup]
Entrypoint -->|you| ConfigMCP[Configure MCP via CLI]
SkipMCP --> Harness[agent-eval-harness trials]
ConfigMCP --> Harness
Prompts[prompts.jsonl] --> Harness
Schemas[agent-schemas/*.json] --> Harness
Harness --> Results[data/results/YYYY-MM-DD/agent/tool.jsonl]
Analysis Pipeline
flowchart LR
Results[Trial Results<br/>YYYY-MM-DD/agent/tool.jsonl] --> Compare[compare.ts]
Grader[inline-grader.ts] --> Compare
Compare -->|weighted| Weighted[*-weighted.json<br/>Capability + Reliability]
Compare -->|statistical| Statistical[*-statistical.json<br/>Bootstrap CIs]
Weighted --> Report[report.ts]
Statistical --> Report
Report --> REPORT[REPORT.md<br/>Rankings + Analysis]
style Compare fill:#e1f5ff
style Report fill:#e1f5ff
style REPORT fill:#d4edda
Latest Results
π View Latest Evaluation Report - Comprehensive analysis with quality rankings, performance metrics, tool call statistics, and recommendations
Quick Start
1. Install Dependencies
bun install
2. Set API Keys
Create .env file (gitignored):
cp .env.example .env
nano .env
Required keys:
ANTHROPIC_API_KEY- Claude Code agentGEMINI_API_KEY- Gemini agent + inline grader LLM scoringFACTORY_API_KEY- Droid agentOPENAI_API_KEY- Codex agentYDC_API_KEY- You.com MCP tool
3. Run Evaluations
# Full dataset (151 prompts), k=5 β all 8 agentΓprovider scenarios
bun run trials
# Quick smoke test (5 random prompts, single trial)
bun run trials -- --count 5 -k 1
# Specific agent or provider
bun run trials -- --agent droid --search-provider builtin
# Different trial types
bun run trials -- --trial-type capability # k=10, deep capability exploration
bun run trials -- --trial-type regression # k=3, fast regression check
# Control parallelism
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- --prompt-concurrency 4 # 4 prompts per container
4. Analyze Results
Compare pass@k performance across agents:
# Run weighted comparison (latest date auto-detected)
bun run compare
# Statistical analysis with bootstrap confidence intervals
bun run compare:stat
# Specific date or agent
bun run compare -- --run-date 2026-02-18
bun run compare -- --agent droid --search-provider builtin
# Generate comprehensive REPORT.md
bun run report
View raw results:
cat data/comparisons/2026-02-18/all-weighted.json | jq '.capability'
cat data/results/2026-02-18/droid/builtin.jsonl | jq '{id, passRate, passAtK, passExpK}'
Pass@k Analysis
Run multiple trials per prompt across all agents and search providers to measure reliability:
# Full dataset (151 prompts), k=5 β all 8 combinations
bun run trials
# Quick sample for exploration
bun run trials -- --count 5 # 5 random prompts, k=5
bun run trials -- --count 5 -k 1 # Smoke test (fastest)
# Trial type presets
bun run trials -- --trial-type capability # k=10, deep exploration
bun run trials -- --trial-type regression # k=3, fast regression
# Filter by agent or provider
bun run trials -- --agent gemini
bun run trials -- --search-provider you
bun run trials -- --agent claude-code --search-provider builtin
# Custom k value
bun run trials -- -k 7
View pass@k metrics:
cat data/results/2026-02-18/*/builtin.jsonl | jq '{id, passRate, passAtK, passExpK}'
cat data/results/2026-02-18/droid/builtin.jsonl | jq '.passRate'
Metrics:
passAtK- Capability (can it do the task at all? =1 - (1-p)^k)passExpK- Reliability (does it always succeed? =p^k)
Output: Results written to data/results/YYYY-MM-DD/{agent}/{provider}.jsonl
Architecture
Agent Schemas (agent-schemas/)
Headless adapter schemas - no custom code, just JSON configuration:
| Schema | Agent | Mode | Status |
|---|---|---|---|
claude-code.json | Claude Code | stream | β Tested |
gemini.json | Gemini CLI | iterative | β Tested |
droid.json | Droid CLI | stream | β Tested |
codex.json | Codex CLI | stream | β Tested |
Session Modes:
- stream: Process stays alive, multi-turn via stdin
- iterative: New process per turn, history accumulated
MCP Configuration (mcp-servers.ts)
Single source of truth for MCP server configurations. The TypeScript entrypoint (docker/entrypoint) imports these constants and configures agents at runtime via their official CLI commands.
Available Tools:
builtin- Agent's native search (no MCP config)you- You.com MCP server (requiresYDC_API_KEY)- Expected tools:
you-search,you-express,you-contents
- Expected tools:
To add new MCP tools, see .claude/skills/web-search-agent-evals/SKILL.md.
CLI Scripts (scripts/)
| Script | Purpose |
|---|---|
run-trials.ts | Run pass@k trials (bun run trials) β full dataset, k=5 default |
compare.ts | Compare trial results across agents and providers |
report.ts | Generate comprehensive REPORT.md from comparison data |
inline-grader.ts | Hybrid grader (deterministic + LLM scoring) |
calibrate.ts | Interactive grader calibration tool |
generate-mcp-prompts.ts | Generate MCP variant prompts with metadata |
See "Analyze Results" in Quick Start for comparison usage examples.
Docker Infrastructure
Isolated execution for reproducibility:
docker/
βββ base.Dockerfile # Shared base (Bun + Node 24)
βββ claude-code.Dockerfile
βββ gemini.Dockerfile
βββ droid.Dockerfile
βββ codex.Dockerfile
βββ entrypoint # TypeScript entrypoint (Bun shell)
βββ docker-compose.yml # 4 services (one per agent)
The entrypoint script:
- Reads
SEARCH_PROVIDERenvironment variable (builtinoryou) - Configures MCP via agent CLI if needed (skips for
builtin) - If
PROMPT_COUNTis set, shuffles full prompts and samples N into a temp file - Runs
@plaited/agent-eval-harness trialswith the selected prompts
Prompts
Prompts live in a single full/ directory with builtin and MCP variants:
| File | Prompts | Format | Use With |
|---|---|---|---|
full/prompts.jsonl | 151 | Standard | SEARCH_PROVIDER=builtin |
full/prompts-you.jsonl | 151 | MCP variant | SEARCH_PROVIDER=you |
Builtin prompts are plain queries. MCP prompts add "Use {server-name} and answer\n" prefix and MCP metadata (mcpServer, expectedTools).
Metadata structure (MCP variants only):
{
"mcpServer": "ydc-server",
"expectedTools": ["you-search", "you-express", "you-contents"]
}
To sample a subset at runtime, use --count N (shuffles full dataset, no pre-generation needed):
bun run trials -- --count 5 # 5 random prompts from full dataset
All prompts are designed to trigger web search with time-sensitive queries and recent events.
Results
All trial results are written to dated directories:
data/results/
βββ 2026-02-18/
βββ claude-code/
β βββ builtin.jsonl
β βββ you.jsonl
βββ gemini/
βββ droid/
βββ codex/
Versioning: Each run is committed with a dated directory.
Compare runs:
bun run compare # Latest date auto-detected
bun run compare -- --run-date 2026-02-18
Each result line is a TrialResult with pass@k metrics and full trajectory per trial.
Comparisons
Comparison analyses are versioned alongside the raw results they evaluate.
Comparison Metrics
Each comparison output includes:
Quality Metrics (per agent+tool pairing):
avgScore- Mean inline grader score (0-1 scale)passRate- Percentage of prompts that passed (score β₯ 0.7)passCount/failCount- Number of passing/failing promptsscoreDistribution- Histogram of scores by quintile
Performance Metrics:
latency.p50/p90/p99- Response time percentiles (milliseconds)firstResponse- Time to first outputtotalDuration- Total execution time across all prompts
Head-to-Head Analysis:
- Pairwise win/loss/tie records
- Statistical significance (when using
--strategy statistical)
Comparison Strategies
Weighted Strategy (default): Balances multiple dimensions with configurable weights:
COMPARE_QUALITY(default: 0.6) - Inline grader scoresCOMPARE_LATENCY(default: 0.3) - Response speedCOMPARE_RELIABILITY(default: 0.1) - Pass rate consistency
Statistical Strategy: Uses bootstrap sampling (1000 iterations by default) to compute:
- Confidence intervals for mean scores
- Statistical significance testing (p<0.05)
- Reduces false conclusions from small sample sizes
Configure via COMPARE_BOOTSTRAP_ITERATIONS environment variable.
Output Structure
data/comparisons/
βββ 2026-02-18/
βββ all-builtin-weighted.json # Builtin-only comparison
βββ all-builtin-statistical.json
βββ all-you-weighted.json # MCP-only comparison
βββ builtin-vs-you-weighted.json # Cross-provider comparison
βββ builtin-vs-you-statistical.json
βββ REPORT.md # Comprehensive analysis report
View Comparison Results
# Capability and reliability rankings
jq '.capability' data/comparisons/2026-02-18/all-builtin-weighted.json
jq '.reliability' data/comparisons/2026-02-18/builtin-vs-you-weighted.json
# Head-to-head win rates
jq '.headToHead.capability' data/comparisons/2026-02-18/all-builtin-statistical.json
# Generate full report
bun run report -- --run-date 2026-02-18
Inline Grader
The project uses a hybrid grading approach in scripts/inline-grader.ts that evaluates agent responses on a 100-point scale.
Scoring Breakdown (100 points total)
Deterministic Scoring (70 points maximum):
- 10 pts - Basic output: Has substantial content (β₯40 characters)
- 25 pts - Tool usage: Called correct tool (partial credit for wrong tool if MCP expected)
- 25 pts - Clean execution: No errors or timeouts
- 10 pts - Sources bonus: Includes URLs or source references
LLM Scoring (30 points maximum): Uses Gemini Flash 3.0 to evaluate search result quality:
- 0-15 pts - Query match: Does it answer the search query?
- 0-5 pts - Source evidence: Are sources/URLs cited?
- 0-5 pts - Content substance: Specific info or generic fluff?
- 0-5 pts - Format quality: Well-organized structure?
Pass Threshold: 65/100 (normalized score β₯ 0.65)
Automatic Failures:
- Execution timeouts β score 0
- Tool execution errors β score 0
Fallback Mode: Works without GEMINI_API_KEY (deterministic-only, max 60 points)
MCP Tool Detection
The grader tracks whether agents used the expected MCP tools by checking trajectory metadata:
- Claude Code: Detects
mcp__<server>__<tool>format - Codex: Checks
mcpServerfield in trajectory - DROID: Detects
<server>___<tool>format (triple underscore) - GEMINI: Matches tool names against expected tools list
This metadata enables analysis of tool selection patterns (e.g., Express vs. Search usage).
Calibration
The LLM component may hallucinate facts. Always:
- Review sampled failures manually before trusting scores
- Use
bunx @plaited/agent-eval-harness calibrateto validate grader accuracy - Check for systematic biases in LLM scoring
For detailed grading concepts (validation, calibration, best practices), see the agent-eval-harness skill documentation.
Development
Code Quality
# Type check
bun run typecheck
# Lint and format
bun run check
# Auto-fix
bun run check:write
# Run tests
bun test
Adding Agents
- Create adapter schema (
agent-schemas/<agent>.json) - Create Dockerfile (
docker/<agent>.Dockerfile) - Add Docker Compose service
- Update TypeScript entrypoint (
docker/entrypoint)
See .claude/skills/web-search-agent-evals/SKILL.md for detailed guide.
Adding MCP Tools
- Add to mcp-servers.ts - Define server configuration with name, URL, auth, and expectedTools
- Update docker/entrypoint - Add case to
configureMcp()function for each agent CLI - Update .env and .env.example - Add required API keys
- Generate MCP prompts - Run
bun scripts/generate-mcp-prompts.ts --mcp-key <key>to createfull/prompts-<key>.jsonl
See .claude/skills/web-search-agent-evals/SKILL.md for detailed guide.
Note: All scripts automatically pick up new MCP servers from mcp-servers.ts.
Troubleshooting
MCP Config Issues
# Verify API keys
cat .env | grep API_KEY
# Test inside container
docker compose run --rm -e SEARCH_PROVIDER=you claude-code bash -c "cat ~/.mcp.json"
Agent Schema Issues
# Test adapter compliance
bunx @plaited/agent-eval-harness adapter:check -- \
bunx @plaited/agent-eval-harness headless --schema agent-schemas/<agent>.json
Docker Build Failures
# Check base image
docker build -t base -f docker/base.Dockerfile .
docker run --rm base bun --version
# Check agent CLI
docker build -t test-<agent> -f docker/<agent>.Dockerfile .
docker run --rm test-<agent> <agent> --version
Project Structure
evals/
βββ agent-schemas/ # Headless schemas
β βββ claude-code.json
β βββ gemini.json
β βββ droid.json
β βββ codex.json
β
βββ mcp-servers.ts # MCP configuration (TypeScript constants)
β
βββ scripts/ # CLI tools
β βββ run-trials.ts # Pass@k trials runner (bun run trials)
β βββ compare.ts # Comparison tool (bun run compare)
β βββ report.ts # Report generator (bun run report)
β βββ inline-grader.ts # Hybrid grader
β βββ calibrate.ts # Grader calibration tool
β βββ generate-mcp-prompts.ts # MCP variant generator
β
βββ docker/ # Container infrastructure
β βββ base.Dockerfile
β βββ {agent}.Dockerfile # One per agent
β βββ entrypoint # TypeScript entrypoint
β βββ docker-compose.yml
β
βββ data/
β βββ prompts/ # Evaluation prompts (151 prompts)
β βββ results/ # Trial outputs by date (gitignored)
β βββ comparisons/ # Comparison reports by date
β
βββ .claude/skills/web-search-agent-evals/ # Development assistant skill
Skills
This project uses AgentSkills for agent-first development:
- web-search-agent-evals - Development assistant for this evaluation system
- agent-eval-harness - Capture, trials, and analysis commands
See @AGENTS.md for development rules and conventions.
Built With
- @plaited/agent-eval-harness - Trajectory capture framework
- Zod - TypeScript-first schema validation with runtime type checking (schemas in
scripts/schemas/) - Bun - Fast TypeScript runtime
- Docker - Isolated execution