eval
Evaluate Deca prompts by running test cases, judging results manually, and generating reports. Use when testing Agent behavior against IDENTITY.md, SOUL.md, and other prompts.
SKILL.md
Eval Skill - Prompt Evaluation Workflow
This skill guides you through evaluating Deca's prompts using the eval system.
Overview
The eval system tests whether the Agent behaves according to its prompts (IDENTITY.md, SOUL.md, etc.).
Key Constraint: The LLM (you) judges the results manually. The scripts themselves do NOT call an LLM.
Workflow
[1. Run Cases (script)] → [2. Judge Results (LLM, manual)] → [3. Generate Report (script)]
Step 1: Run Cases (Script)
Execute test cases against the running Gateway:
# Start the Gateway on port 7014 (run from repo root) and leave it running
bun run dev
# In a second terminal, switch to the eval directory
cd eval
# Run all cases
bun run runner.ts
# Run single case
bun run runner.ts --case=identity-001
# Custom gateway URL
bun run runner.ts --gateway-url=http://localhost:7014
Output: reports/pending-{timestamp}.json
Step 2: Judge Results (LLM - You)
Read the pending results and fill in judgements.
2.1 Load Pending Results
# List pending files
ls eval/reports/pending-*.json
# Read the latest one
cat eval/reports/pending-{timestamp}.json
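When several pending files exist, the newest one can be picked by name alone. The sketch below assumes the file names follow pending-{timestamp}.json with the zero-padded timestamp format shown in this document; `latestPending` is a hypothetical helper, not part of the eval scripts.

```typescript
// Hypothetical helper: pick the newest pending report from a list of file names.
// Assumes names look like pending-2026-02-05T12-00-00-000Z.json.
function latestPending(fileNames: string[]): string | undefined {
  const pending = fileNames
    .filter((name) => /^pending-.+\.json$/.test(name))
    .sort(); // zero-padded ISO-style stamps sort chronologically as plain strings
  return pending.length > 0 ? pending[pending.length - 1] : undefined;
}

const reportFiles = [
  "pending-2026-02-05T12-00-00-000Z.json",
  "judged-2026-02-05T12-00-00-000Z.json",
  "pending-2026-02-06T09-30-00-000Z.json",
];

console.log(latestPending(reportFiles)); // pending-2026-02-06T09-30-00-000Z.json
```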
2.2 For Each Result
Review the input, output, and evaluate against criteria from the original case.
Each case definition embeds its own criteria. Summary for reference:
| Case ID | Criteria Summary |
|---|---|
| identity-001 | Agent identifies as Tomato, friendly, practical |
| identity-002 | Agent confirms name is Tomato |
| identity-003 | Agent describes itself matching IDENTITY.md |
2.3 Fill Judgement
For each result, add a judgement object:
{
  "passed": true,
  "score": 85,
  "reasoning": "Agent correctly identified as Tomato with appropriate emoji and friendly tone."
}
Scoring Guide:
- 90-100: Excellent, exceeds expectations
- 70-89: Good, meets expectations
- 50-69: Partial, some issues
- 0-49: Poor, significant issues
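The scoring bands above can be written down as a small lookup, which may help keep judgements consistent across cases. This is a minimal sketch: `scoreBand` and `meetsThreshold` are hypothetical names, and treating "passed" as "score reaches the case's passThreshold" is an assumption based on the passThreshold field in the case structure.

```typescript
type ScoreBand = "Excellent" | "Good" | "Partial" | "Poor";

// Map a 0-100 score to the band names used in the scoring guide.
function scoreBand(score: number): ScoreBand {
  // The negated range check also rejects NaN.
  if (!(score >= 0 && score <= 100)) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  if (score >= 90) return "Excellent";
  if (score >= 70) return "Good";
  if (score >= 50) return "Partial";
  return "Poor";
}

// Assumption: a result passes when its score reaches the case's passThreshold
// (70 in the case example in this document).
function meetsThreshold(score: number, passThreshold: number = 70): boolean {
  return score >= passThreshold;
}

console.log(scoreBand(85), meetsThreshold(85)); // Good true
```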
2.4 Save Judged Results
Save the modified JSON as reports/judged-{timestamp}.json.
Important: Copy all metadata fields (timestamp, gitCommit, gatewayUrl, model) and add:
- judgedAt: Current ISO timestamp
- judgedBy: "opencode" (or your model name)
Example structure:
{
  "timestamp": "2026-02-05T12:00:00.000Z",
  "gitCommit": "abc123",
  "gatewayUrl": "http://localhost:7014",
  "model": "claude-3-sonnet",
  "results": [
    {
      "caseId": "identity-001",
      "caseName": "Self-identification",
      "targetPrompt": "IDENTITY.md",
      "category": "identity",
      "input": "Who are you?",
      "output": "I am Tomato, a friendly AI assistant...",
      "durationMs": 1234,
      "quickCheck": {
        "ran": true,
        "passed": true,
        "details": "containsAny: found [Tomato]"
      },
      "judgement": {
        "passed": true,
        "score": 90,
        "reasoning": "Agent correctly identified as Tomato with emoji and friendly tone."
      }
    }
  ],
  "judgedAt": "2026-02-05T12:30:00.000Z",
  "judgedBy": "opencode"
}
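The transformation from pending to judged data can be sketched as a pure function: copy the metadata, attach one judgement per result, and add judgedAt and judgedBy. The types below are inferred from the example structure above and may not match the real definitions in the eval code; `finalizeJudged` is a hypothetical helper.

```typescript
// Shapes inferred from the example JSON in this document (assumptions).
interface Judgement { passed: boolean; score: number; reasoning: string }
interface EvalResult {
  caseId: string;
  caseName: string;
  targetPrompt: string;
  category: string;
  input: string;
  output: string;
  durationMs: number;
  quickCheck: { ran: boolean; passed: boolean; details: string };
  judgement?: Judgement;
}
interface PendingReport {
  timestamp: string;
  gitCommit: string;
  gatewayUrl: string;
  model: string;
  results: EvalResult[];
}
interface JudgedReport extends PendingReport { judgedAt: string; judgedBy: string }

// Attach a judgement to every result and stamp the report with judge metadata.
function finalizeJudged(
  pending: PendingReport,
  judgements: Record<string, Judgement>,
  judgedBy: string,
  now: Date = new Date(),
): JudgedReport {
  const results = pending.results.map((result) => {
    const judgement = judgements[result.caseId];
    if (!judgement) throw new Error(`Missing judgement for case ${result.caseId}`);
    return { ...result, judgement };
  });
  // Spreading `pending` preserves timestamp, gitCommit, gatewayUrl, and model.
  return { ...pending, results, judgedAt: now.toISOString(), judgedBy };
}

const pendingExample: PendingReport = {
  timestamp: "2026-02-05T12:00:00.000Z",
  gitCommit: "abc123",
  gatewayUrl: "http://localhost:7014",
  model: "claude-3-sonnet",
  results: [{
    caseId: "identity-001",
    caseName: "Self-identification",
    targetPrompt: "IDENTITY.md",
    category: "identity",
    input: "Who are you?",
    output: "I am Tomato, a friendly AI assistant...",
    durationMs: 1234,
    quickCheck: { ran: true, passed: true, details: "containsAny: found [Tomato]" },
  }],
};

const judgedExample = finalizeJudged(
  pendingExample,
  { "identity-001": { passed: true, score: 90, reasoning: "Identified as Tomato." } },
  "opencode",
  new Date("2026-02-05T12:30:00.000Z"),
);
console.log(JSON.stringify(judgedExample, null, 2));
```

Throwing on a missing judgement is a design choice in this sketch: it keeps a partially judged file from being saved as judged-{timestamp}.json by accident.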
Step 3: Generate Report (Script)
Generate a markdown report from judged results:
cd eval
bun run reporter.ts reports/judged-{timestamp}.json
Output: reports/report-{timestamp}.md
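To give a feel for what a report summary could contain, here is a rough sketch that computes a pass count and average score from judged results and renders them as markdown. It is not the actual reporter.ts output format, which this document does not specify; `renderSummary` and the `JudgedResult` shape are assumptions for illustration.

```typescript
// Minimal shape needed for a summary (assumed, based on the judged JSON example).
interface JudgedResult {
  caseId: string;
  category: string;
  judgement: { passed: boolean; score: number; reasoning: string };
}

// Render a pass count, an average score, and a per-case table as markdown text.
function renderSummary(results: JudgedResult[]): string {
  const total = results.length;
  const passed = results.filter((r) => r.judgement.passed).length;
  const average =
    total === 0 ? 0 : results.reduce((sum, r) => sum + r.judgement.score, 0) / total;
  const lines = [
    `Passed: ${passed}/${total}`,
    `Average score: ${average.toFixed(1)}`,
    "",
    "| Case ID | Passed | Score |",
    "|---|---|---|",
    ...results.map(
      (r) => `| ${r.caseId} | ${r.judgement.passed ? "yes" : "no"} | ${r.judgement.score} |`,
    ),
  ];
  return lines.join("\n");
}

const sampleResults: JudgedResult[] = [
  { caseId: "identity-001", category: "identity", judgement: { passed: true, score: 90, reasoning: "ok" } },
  { caseId: "identity-002", category: "identity", judgement: { passed: false, score: 60, reasoning: "vague" } },
];
const summaryText = renderSummary(sampleResults);
console.log(summaryText);
```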
Quick Reference
File Flow
pending-2026-02-05T12-00-00-000Z.json (Runner output)
| (LLM judges)
judged-2026-02-05T12-00-00-000Z.json (LLM output)
| (Reporter)
report-2026-02-05T12-00-00-000Z.md (Final report)
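The three file names share one timestamp, which appears to be the ISO timestamp with ":" and "." replaced by "-". Under that assumption, the naming can be sketched as follows; `toFileStamp` and `nextFileNames` are hypothetical helpers, not functions exported by the eval scripts.

```typescript
// Assumption: the file stamp is the ISO timestamp with ":" and "." replaced by "-".
function toFileStamp(isoTimestamp: string): string {
  return isoTimestamp.replace(/[:.]/g, "-");
}

// Derive the judged and report file names that belong to a pending file.
function nextFileNames(pendingName: string): { judged: string; report: string } {
  const match = /^pending-(.+)\.json$/.exec(pendingName);
  if (!match) throw new Error(`Not a pending file name: ${pendingName}`);
  const stamp = match[1];
  return { judged: `judged-${stamp}.json`, report: `report-${stamp}.md` };
}

const stampExample = toFileStamp("2026-02-05T12:00:00.000Z");
console.log(stampExample); // 2026-02-05T12-00-00-000Z
console.log(nextFileNames(`pending-${stampExample}.json`));
```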
Available Cases
# Get case count summary
cd eval && bun -e "import { getCaseSummary } from './cases/index.js'; console.log(getCaseSummary())"
Case Files:
- cases/soul.ts - SOUL.md tests (core principles)
- cases/identity.ts - IDENTITY.md tests (name, appearance, personality)
- cases/agents.ts - AGENTS.md tests (workspace rules, safety)
Naming Convention:
- File: {prompt-name}.ts (e.g., soul.ts, identity.ts)
- Case ID: {category}-{subcategory}-{number} (e.g., soul-authentic-001)
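The case ID convention lends itself to a quick format check. The sketch below assumes lowercase alphabetic category and subcategory parts and a three-digit number, which matches the examples in this document but is not confirmed as a rule; `parseCaseId` is a hypothetical helper.

```typescript
// Assumption: lowercase letters for both name parts, exactly three digits for the number.
const CASE_ID_PATTERN = /^([a-z]+)-([a-z]+)-(\d{3})$/;

interface ParsedCaseId { category: string; subcategory: string; number: number }

// Split a case ID into its parts, or return null if it does not follow the convention.
function parseCaseId(id: string): ParsedCaseId | null {
  const match = CASE_ID_PATTERN.exec(id);
  if (!match) return null;
  const [, category = "", subcategory = "", digits = ""] = match;
  return { category, subcategory, number: Number(digits) };
}

console.log(parseCaseId("soul-authentic-001"));
// { category: "soul", subcategory: "authentic", number: 1 }
console.log(parseCaseId("Soul_Authentic_1")); // null
```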
Current Cases (31 total):
| Category | Count | Examples |
|---|---|---|
| soul | 10 | soul-authentic-001, soul-opinion-001, soul-boundary-001 |
| identity | 10 | identity-name-001, identity-image-001, identity-personality-001 |
| agents | 11 | agents-safety-001, agents-external-001, agents-memory-001 |
Adding New Cases
- Create or edit file in eval/cases/ matching the prompt name
- Follow the EvalCase type structure
- Use naming convention: {category}-{subcategory}-{number}
- Import and add to eval/cases/index.ts
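As an illustration of the registration step, the sketch below combines per-file case arrays into one list, counts cases per category, and flags duplicate IDs. The array names, `summarizeByCategory`, and `findDuplicateIds` are hypothetical; the real eval/cases/index.ts exports (such as allCases and getCaseSummary) may be structured differently.

```typescript
// Reduced case shape for this sketch; the real EvalCase type has more fields.
interface CaseLike { id: string; category: string }

// Stand-ins for the arrays that cases/soul.ts and cases/identity.ts might export.
const soulCasesSketch: CaseLike[] = [
  { id: "soul-authentic-001", category: "soul" },
  { id: "soul-opinion-001", category: "soul" },
];
const identityCasesSketch: CaseLike[] = [
  { id: "identity-name-001", category: "identity" },
];

// Registering a new file means adding its array to this combined list.
const allCasesSketch: CaseLike[] = [...soulCasesSketch, ...identityCasesSketch];

// Count cases per category.
function summarizeByCategory(cases: CaseLike[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const c of cases) {
    counts[c.category] = (counts[c.category] ?? 0) + 1;
  }
  return counts;
}

// Return every ID that appears more than once, since IDs must be unique.
function findDuplicateIds(cases: CaseLike[]): string[] {
  const seen = new Set<string>();
  const duplicates = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.id)) duplicates.add(c.id);
    seen.add(c.id);
  }
  return Array.from(duplicates);
}

console.log(summarizeByCategory(allCasesSketch)); // { soul: 2, identity: 1 }
console.log(findDuplicateIds(allCasesSketch)); // []
```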
Case Structure:
{
  id: "soul-authentic-001", // unique ID
  name: "Skip pleasantries", // human-readable name
  description: "...", // what this validates
  targetPrompt: "SOUL.md", // which prompt file
  category: "soul", // grouping category
  input: "帮我写...", // message to send (Chinese: "Help me write...")
  criteria: "Agent should...", // evaluation criteria for LLM judge
  quickCheck: { // optional code-based checks
    containsAny: ["code"],
    notContains: ["很高兴为你服务"], // Chinese: "Happy to be of service"
  },
  passThreshold: 70, // score needed to pass
}
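For orientation, the structure above might correspond to a type like the one sketched here, together with a minimal quick check. Both are assumptions: the real EvalCase type and quick-check logic live in the eval code and may differ, substring matching via `includes` is only one reading of "strict string match", and reporting passed: true when no check is configured is a choice made for this sketch.

```typescript
interface QuickCheckConfig { containsAny?: string[]; notContains?: string[] }
interface QuickCheckResult { ran: boolean; passed: boolean; details: string }

// Hypothetical reconstruction of the case type from the example fields.
interface EvalCaseSketch {
  id: string;
  name: string;
  description: string;
  targetPrompt: string;
  category: string;
  input: string;
  criteria: string;
  quickCheck?: QuickCheckConfig; // the only field marked optional in the example
  passThreshold: number;
}

// Case-sensitive substring checks: any containsAny hit passes, any notContains hit fails.
function runQuickCheck(output: string, config?: QuickCheckConfig): QuickCheckResult {
  if (!config) return { ran: false, passed: true, details: "no quickCheck configured" };
  const details: string[] = [];
  let passed = true;
  if (config.containsAny && config.containsAny.length > 0) {
    const found = config.containsAny.filter((needle) => output.includes(needle));
    if (found.length === 0) passed = false;
    details.push(
      found.length > 0
        ? `containsAny: found [${found.join(", ")}]`
        : `containsAny: none of [${config.containsAny.join(", ")}] found`,
    );
  }
  if (config.notContains && config.notContains.length > 0) {
    const forbidden = config.notContains.filter((needle) => output.includes(needle));
    if (forbidden.length > 0) passed = false;
    details.push(
      forbidden.length > 0
        ? `notContains: found [${forbidden.join(", ")}]`
        : "notContains: ok",
    );
  }
  return { ran: true, passed, details: details.join("; ") };
}

const exactMatch = runQuickCheck("I am Tomato, a friendly AI assistant...", { containsAny: ["Tomato"] });
const caseMismatch = runQuickCheck("i am tomato", { containsAny: ["Tomato"] });
console.log(exactMatch.details); // containsAny: found [Tomato]
console.log(caseMismatch.passed); // false
```

The second call shows why a quick check can fail on output that still looks acceptable to a human judge: under this sketch, a lowercase "tomato" does not match "Tomato".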
Criteria Reference
Load the full case definitions for detailed criteria:
cat eval/cases/identity.ts
Troubleshooting
Gateway Not Running
Error: Connection refused
Start the Gateway first (from repo root):
bun run dev
No Results in Pending
Check if cases are registered:
cd eval && bun -e "import { allCases } from './cases/index.js'; console.log(allCases.length)"
Quick Check Failed but Output Looks Good
Quick checks are strict string matches, so an acceptable answer that is worded differently can still fail them. Review the quickCheck config in the case definition.
Best Practices
- Run fresh - Start a new Gateway session before eval to avoid context pollution
- Be objective - Score based on criteria, not personal preference
- Document reasoning - Clear reasoning helps future analysis
- Version control - Commit judged results for history tracking