eval-patterns
This skill provides common evaluation patterns and integration guidance. Use when: - Integrating eval-framework with other plugins - Designing evaluation workflows - Choosing between content vs behavior evaluation - Setting up project-local rubrics
SKILL.md
| Name | eval-patterns |
| Description | This skill provides common evaluation patterns and integration guidance. Use when: - Integrating eval-framework with other plugins - Designing evaluation workflows - Choosing between content vs behavior evaluation - Setting up project-local rubrics |
name: eval-patterns description: | This skill provides common evaluation patterns and integration guidance. Use when:
- Integrating eval-framework with other plugins
- Designing evaluation workflows
- Choosing between content vs behavior evaluation
- Setting up project-local rubrics version: 1.0.0
Evaluation Patterns & Integration
Common patterns for using the eval-framework effectively in different contexts.
Evaluation Types
Content Evaluation
Evaluates static content: copy, documentation, code files.
Use for:
- Marketing copy review
- Documentation quality
- Code style/patterns
- Configuration validation
Invocation:
/eval-run brand-voice app/routes/sell-on-vouchline.tsx
Behavior Evaluation
Evaluates actions and outputs: what Claude did, not just what exists.
Use for:
- Code review after implementation
- Commit message quality
- Test coverage verification
- API response validation
Invocation:
Judge agent triggered: "Review what I just implemented against the code-security rubric"
Combined Evaluation
Evaluates both content and behavior together.
Use for:
- Full code review (style + security + behavior)
- Documentation with examples (accuracy + completeness)
- Feature implementation review
Project-Local Setup
Directory Structure
your-project/
├── .claude/
│ └── evals/
│ ├── brand-voice.yaml # Project rubrics
│ ├── code-security.yaml
│ └── api-design.yaml
Quick Setup
- Create directory:
mkdir -p .claude/evals - Create rubric:
/eval-create brand-voice --from docs/brand/voice.md - Run evaluation:
/eval-run brand-voice
Rubric Discovery
The judge agent automatically discovers rubrics in:
.claude/evals/*.yaml(project-local).claude/evals/*.yml(alternate extension)- Explicit paths passed to commands
Integration Patterns
Pattern 1: Post-Implementation Review
After completing significant work, invoke judge for quality check:
User: "I just finished the authentication module"
Claude: [Uses judge agent to evaluate against code-security rubric]
The judge agent's when_to_use description enables proactive triggering after code review requests.
Pattern 2: Command-Based Validation
Explicit validation during development:
/eval-run brand-voice app/routes/sell-on-vouchline.tsx
Returns structured feedback before committing.
Pattern 3: Plugin Integration
Other plugins can invoke the judge programmatically:
## In your plugin's agent/command:
Invoke the eval-framework judge agent with:
- Rubric: [name or path]
- Content: [what to evaluate]
- Context: [additional context]
The judge will return structured evaluation results.
Pattern 4: Pre-Commit Workflow
Manual pre-commit check (not automated hook):
User: "Check my changes before I commit"
Claude: [Runs relevant rubrics against staged files]
Choosing Rubrics
By Content Type
| Content | Recommended Rubric |
|---|---|
| Marketing copy | brand-voice |
| API code | code-security, api-design |
| Documentation | docs-quality |
| Test files | test-coverage |
| Config files | config-validation |
By Quality Gate
| Gate | Threshold | Required Criteria |
|---|---|---|
| Draft review | 60% | None |
| PR review | 75% | Core criteria |
| Production | 85% | All security |
Rubric Composition
Layered Rubrics
Create focused rubrics that can be run together:
# code-style.yaml - formatting, naming
# code-security.yaml - vulnerabilities
# code-perf.yaml - performance patterns
Run multiple: /eval-run code-style && /eval-run code-security
Domain-Specific Rubrics
Create rubrics for specific features:
# auth-flow.yaml - authentication patterns
# payment-handling.yaml - financial code
# user-input.yaml - input validation
Best Practices
Start Simple: Begin with 2-3 criteria, add more as needed.
Iterate Rubrics: Version your rubrics and refine based on false positives/negatives.
Context Matters: Include file patterns in scope to auto-filter relevant files.
Required vs Optional: Use required_criteria for must-pass items, let others contribute to score.
Actionable Feedback: Every check message should tell how to fix the issue.
Troubleshooting
Rubric not found: Check .claude/evals/ exists and rubric name matches file.
False positives: Refine regex patterns or use custom checks for nuance.
Score too low: Review thresholds - they might be too strict for your context.
Slow evaluation: Reduce custom checks (LLM-evaluated) where pattern checks work.
Reference Files
See references/ for additional patterns:
integration-examples.md- Real-world integration examples