AI Skills

name: testing-e2e description: "Execute programmatic E2E tests for all .claude/ components. Use when validating skills, commands, agents, hooks, or MCP servers. Not for unit tests or CI/CD pipelines."

<mission_control> <objective>Execute programmatic E2E tests for all .claude/ components using claude -p in isolated sandbox environments</objective> <success_criteria>All tests complete with pass/fail report, logs captured, hallucination scenarios detected</success_criteria> </mission_control>

Quick Start

If you need to run all E2E tests: Use the inline run pattern below.

If you need to test a specific component: Use the single-test pattern.

If you need to understand test patterns: MUST READ ## PATTERN: Test Definition before creating any tests.

If you need to debug a test: Use the verbose inline pattern.

Inline E2E Run Pattern

#!/bin/bash
# Inline E2E Runner - No external scripts needed

set -euo pipefail

PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
SANDBOX_DIR="$PROJECT_ROOT/.sandbox"
LOGS_DIR="$(dirname "${BASH_SOURCE[0]}")/logs"
CLAUDE_BIN="${CLAUDE_BIN:-claude}"

# Setup sandbox
rm -rf "$SANDBOX_DIR"
mkdir -p "$SANDBOX_DIR/.claude/skills" "$SANDBOX_DIR/.claude/commands"
mkdir -p "$SANDBOX_DIR/.claude/agents" "$SANDBOX_DIR/Custom_MCP"
mkdir -p "$SANDBOX_DIR/tests" "$LOGS_DIR"

cp -r "$PROJECT_ROOT/.claude/skills" "$SANDBOX_DIR/.claude/"
cp -r "$PROJECT_ROOT/.claude/commands" "$SANDBOX_DIR/.claude/"
cp -r "$PROJECT_ROOT/.claude/agents" "$SANDBOX_DIR/.claude/"
cp "$PROJECT_ROOT/.claude/settings.json" "$SANDBOX_DIR/.claude/" 2>/dev/null || true
cp -r "$PROJECT_ROOT/Custom_MCP" "$SANDBOX_DIR/" 2>/dev/null || true

# Create test scenario
TEST_DIR="$SANDBOX_DIR/tests/skill-development"
mkdir -p "$TEST_DIR"
cat > "$TEST_DIR/test-skill.md" << 'EOF'
# Test: Skill-Development Creates Valid Skill

## Use Cases
| Scenario | Trigger | Expected |
|----------|---------|----------|
| Create minimal skill | Use skill-development | Valid SKILL.md created |

## Validation Conditions
| Condition | Check | Pass |
|-----------|-------|------|
| File created | `[ -f "$OUTPUT_FILE" ]` | Exit 0 |
EOF

# Execute claude -p with CWD in sandbox/.claude/
PROMPT_FILE="$TEST_DIR/test-skill.md"
LOG_FILE="$LOGS_DIR/e2e-$(date +%s).log"

cd "$SANDBOX_DIR/.claude"
"$CLAUDE_BIN" -p "@$PROMPT_FILE" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "*" \
  2>&1 | tee "$LOG_FILE"

echo "Logs: $LOG_FILE"

Single-Test Pattern

# Run one specific test
cd "$(dirname "${BASH_SOURCE[0]}")/../.."
SANDBOX_DIR=".sandbox"
CLAUDE_BIN="claude"

cd "$SANDBOX_DIR/.claude"
"$CLAUDE_BIN" -p "@tests/skill-development/test-skill.md" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "*" \
  2>&1 | tee "logs/test.log"

Verbose Debug Pattern

# Debug with full output
SANDBOX_DIR=".sandbox"
PROMPT="tests/skill-development/test-skill.md"
LOG="logs/debug-$(date +%s).log"

cd "$SANDBOX_DIR/.claude"
claude -p "@$PROMPT" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "*" \
  2>&1 | tee "$LOG"

# Check for hallucinations
grep -i "read.*skill\|look at.*skill" "$LOG" && echo "HALLUCINATION: Reading instead of invoking"


## Navigation

| If you need...                       | Read...                                 |
| :----------------------------------- | :-------------------------------------- |
| Test definition format               | ## PATTERN: Test Definition             |
| Claude -p invocation                 | ## PATTERN: Claude Invocation           |
| Sandbox management                   | ## PATTERN: Sandbox Management          |
| Hallucination detection              | ## PATTERN: Hallucination Detection     |
| Validation conditions                | ## PATTERN: Validation Conditions       |
| Component test scenarios             | ## PATTERN: Component Tests             |
| Anti-patterns to avoid               | ## ANTI-PATTERN: Common Mistakes        |
| When to use references               | ## QUALITY PATTERN: When to Use References |

## PATTERN: Test Definition

Every test case MUST be defined in a `test-skill.md` file following this structure:

```markdown
# Test: [Component Name] - [Test Purpose]

## Use Cases
| Scenario | Trigger | Expected Behavior |
|----------|---------|-------------------|
| [Name]   | [What invokes] | [What should happen] |

## Validation Conditions
| Condition | Check Command | Pass Criteria |
|-----------|---------------|---------------|
| [Desc]    | `[command]`   | [Expected result] |

## Risks
| Risk | Mitigation |
|------|------------|
| [What could fail] | [How to prevent] |

## Test Prompts
```yaml
prompt: |
  [Claude instruction]
allowedTools: "Tool1,Tool2"
expectedExitCode: 0

Hallucination Scenarios

Scenario	Detection
[What to test]	[How to verify]


### Required Elements

| Element | Purpose | Required? |
|---------|---------|-----------|
| `Use Cases` | What scenarios are tested | Yes |
| `Validation Conditions` | How to verify pass/fail | Yes |
| `Risks` | What could go wrong | Yes |
| `Test Prompts` | Claude -p input | Yes |
| `Hallucination Scenarios` | Anti-hallucination tests | Yes |

---

## PATTERN: Claude Invocation

### Inline Claude -p Execution

```bash
#!/bin/bash
# Claude -p invocation without external scripts

CLAUDE_BIN="${CLAUDE_BIN:-claude}"
SANDBOX_DIR="${1:-.sandbox}"
PROMPT_FILE="$2"
LOG_FILE="${3:-logs/claude-$(date +%s).log}"

# Validate arguments
[ -d "$SANDBOX_DIR/.claude" ] || { echo "Error: Sandbox missing .claude/"; exit 1; }
[ -f "$PROMPT_FILE" ] || { echo "Error: Prompt file not found"; exit 1; }

# Execute with CWD in sandbox/.claude/
cd "$SANDBOX_DIR/.claude"

"$CLAUDE_BIN" -p "@$PROMPT_FILE" \
  --output-format stream-json \
  --verbose \
  --dangerously-skip-permissions \
  --allowedTools "*" \
  2>&1 | tee "$LOG_FILE"

Required Flags

Flag	Purpose	Why
`--output-format stream-json`	JSON-line output	Parseable, streamable results
`--verbose`	Detailed logging	Debug test failures
`--dangerously-skip-permissions`	No prompts	Automated testing
`--allowedTools "*"`	All tools allowed	Skills need full access

Error Handling

# Capture exit code
set +e
claude -p "@$PROMPT" --output-format stream-json ... > "$LOG" 2>&1
EXIT_CODE=$?
set -e

# Check for errors
if [ $EXIT_CODE -ne 0 ]; then
  echo "Test failed with exit code: $EXIT_CODE"
  tail -50 "$LOG"
fi

PATTERN: Sandbox Management

Sandbox Structure (Mirrors Project)

.sandbox/                      # Regenerated each run
├── .claude/                   # MUST be CWD for claude -p
│   ├── skills/               # Copy of skills to test
│   ├── commands/             # Copy of commands to test
│   ├── agents/               # Copy of agents to test
│   └── settings.json         # Copy of hooks config
├── Custom_MCP/               # Copy of MCP servers to test
├── tests/                    # Pre-created test scenarios
│   ├── skill-review/
│   │   ├── test-skill.md
│   │   └── input.md
│   └── ...
├── logs/                     # Symbolic link to permanent logs
└── work/                     # Working directory

Inline Sandbox Setup

#!/bin/bash
# Setup sandbox without external scripts

SANDBOX_DIR=".sandbox"
PROJECT_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
SKILL_DIR="$(dirname "${BASH_SOURCE[0]}")"
LOGS_DIR="$SKILL_DIR/logs"

# Clean and recreate
rm -rf "$SANDBOX_DIR"
mkdir -p "$SANDBOX_DIR/.claude/skills"
mkdir -p "$SANDBOX_DIR/.claude/commands"
mkdir -p "$SANDBOX_DIR/.claude/agents"
mkdir -p "$SANDBOX_DIR/Custom_MCP"
mkdir -p "$SANDBOX_DIR/tests"
mkdir -p "$LOGS_DIR"

# Copy project structure
cp -r "$PROJECT_ROOT/.claude/skills" "$SANDBOX_DIR/.claude/"
cp -r "$PROJECT_ROOT/.claude/commands" "$SANDBOX_DIR/.claude/"
cp -r "$PROJECT_ROOT/.claude/agents" "$SANDBOX_DIR/.claude/"
cp "$PROJECT_ROOT/.claude/settings.json" "$SANDBOX_DIR/.claude/" 2>/dev/null || true
cp -r "$PROJECT_ROOT/Custom_MCP" "$SANDBOX_DIR/" 2>/dev/null || true

echo "Sandbox created at: $SANDBOX_DIR"

PATTERN: Hallucination Detection

Hallucination Scenarios to Test

Scenario	Detection Method	Pass Criteria
Read instead of invoke	Check prompt for "read" vs "invoke"	Skill triggers correctly
Skip verification	Count validation checks run	All checks executed
Wrong skill triggered	Compare invoked skill to description	Description matches behavior
Skipped critical step	Trace tool calls	All required steps present
Generic output	Check for vague/non-specific responses	Contains specific details

Inline Hallucination Detection

#!/bin/bash
# Detect hallucinations without external scripts

LOG_FILE="${1:-logs/latest.log}"
PROMPT_FILE="${2:-}"

HALLUCINATION_FOUND=0

# Check 1: Read instead of invoke
if [ -n "$PROMPT_FILE" ] && [ -f "$PROMPT_FILE" ]; then
  if grep -qi "read.*skill\|look at.*skill\|check.*skill\|examine.*skill" "$PROMPT_FILE" 2>/dev/null; then
    if ! grep -qi "invoke.*skill\|use.*skill\|call.*skill" "$PROMPT_FILE" 2>/dev/null; then
      echo "HALLUCINATION: Prompt suggests reading instead of invoking skill"
      HALLUCINATION_FOUND=1
    fi
  fi
fi

# Check 2: Insufficient verification steps
VERIFICATION_COUNT=$(grep -ci "verify\|validate\|check\|confirm\|ensure\|assert" "$LOG_FILE" 2>/dev/null || echo "0")
if [ "$VERIFICATION_COUNT" -lt 2 ]; then
  echo "HALLUCINATION: Insufficient verification steps ($VERIFICATION_COUNT found, need >= 2)"
  HALLUCINATION_FOUND=1
fi

# Check 3: Generic/vague output
if grep -qi "I understand\|I see\|looks good\|seems right\|that should work\|done\." "$LOG_FILE" 2>/dev/null; then
  LINE_COUNT=$(wc -l < "$LOG_FILE" 2>/dev/null || echo "0")
  if [ "$LINE_COUNT" -lt 10 ]; then
    echo "HALLUCINATION: Generic output detected (too brief)"
    HALLUCINATION_FOUND=1
  fi
fi

# Check 4: Too many errors
ERROR_COUNT=$(grep -ci "error\|failed\|exception\|cannot\|unable" "$LOG_FILE" 2>/dev/null || echo "0")
if [ "$ERROR_COUNT" -gt 5 ]; then
  echo "HALLUCINATION: Too many errors in output ($ERROR_COUNT)"
  HALLUCINATION_FOUND=1
fi

if [ $HALLUCINATION_FOUND -eq 1 ]; then
  echo "RESULT: Hallucination detected"
  exit 1
else
  echo "RESULT: No hallucination detected"
  exit 0
fi

Quick Hallucination Check

# One-liner: check for common hallucinations
grep -qi "read.*skill\|look at.*skill\|I understand\|looks good" logs/latest.log && echo "POTENTIAL HALLUCINATION"

# Check verification count
grep -ci "verify\|validate\|check" logs/latest.log

PATTERN: Validation Conditions

Validation Check Types

Type	Example	Pass Criteria
File existence	`[ -f "file.md" ]`	Exit code 0
YAML validity	`yaml-frontmatter-valid file.md`	No errors
Pattern match	`grep -q '<mission_control>'`	Pattern found
JSON validity	`jq . file.json`	Valid JSON
Exit code	`claude -p ...; echo $?`	Expected code

Validation Runner

#!/bin/bash
# scripts/validate.sh

TEST_FILE="$1"
SANDBOX_DIR="$2"

# Source test file to get validation conditions
source <(grep -A 100 "## Validation Conditions" "$TEST_FILE" | \
         grep -B 100 "## Risks" | head -n -1)

# Execute each validation
while IFS='|' read -r CONDITION CHECK PASS_CRITERIA; do
  if [ -z "$CONDITION" ] || [[ "$CONDITION" == Condition* ]]; then
    continue
  fi

  eval "$CHECK" > /dev/null 2>&1
  if [ $? -eq 0 ]; then
    echo "PASS: $CONDITION"
  else
    echo "FAIL: $CONDITION (expected: $PASS_CRITERIA)"
  fi
done <<< "$(grep "|.*|.*|" "$TEST_FILE" | sed '1d')"

PATTERN: Component Tests

Skills Tests

Test	Purpose	Hallucination Check
Invocation	Skill triggers on correct phrase	No wrong skill
Verification	All steps executed	No skipped checks
Portability	Zero external dependencies	No external refs
Description	Non-spoiling	Body not in desc

Commands Tests

Test	Purpose	Hallucination Check
Execution	Command runs successfully	Exit code 0
Description	Triggers correctly	Matches behavior
Output	Format correct	Expected structure

Agents Tests

Test	Purpose	Hallucination Check
Autonomy	Doesn't spawn other agents	No Task() calls
Trigger	Spawns only on conditions	Correct trigger
Context	Uses provided context	No missing info

Hooks Tests

Test	Purpose	Hallucination Check
Event	Fires on correct event	Event triggered
Action	Performs expected action	Action executed

MCP Tests

Test	Purpose	Hallucination Check
Tool call	MCP tool returns structure	Valid JSON
Integration	Works with skill/agent	Tool used correctly

ANTI-PATTERN: Common Mistakes

Anti-Pattern: Missing Sandbox CWD

❌ Wrong:
cd /project/root
claude -p "test skill"  # Loads real skills, not sandbox

✅ Correct:
cd .sandbox/.claude
claude -p "test skill"  # Loads sandbox skills

Fix: Always run claude -p with CWD in .sandbox/.claude/

Anti-Pattern: No Pre-created Tests

❌ Wrong:
claude -p "review a skill"  # No test skill created

✅ Correct:
Create tests/skill-review/input.md with test content first
claude -p "Review tests/skill-review/input.md"

Fix: Always pre-create test scenarios before invocation

Anti-Pattern: Ignoring Exit Codes

❌ Wrong:
claude -p "..." > log.txt  # Ignores failure
echo "Test complete"

✅ Correct:
set +e
claude -p "..." > log.txt 2>&1
EXIT=$?
set -e
if [ $EXIT -ne 0 ]; then
  echo "FAILED: Exit code $EXIT"
  exit 1
fi

Fix: Capture and check exit codes

Anti-Pattern: Temporary Logs

❌ Wrong:
claude -p "..." > /tmp/test.log  # Lost after session

✅ Correct:
mkdir -p .claude/skills/testing-e2e/logs
claude -p "..." | tee .claude/skills/testing-e2e/logs/test-$(date +%s).log

Fix: Logs to permanent folder

EDGE: Cross-Component Integration

Integration Test Scenarios

Scenario	Components	Validation
Skill uses command	skill + command	Command invoked correctly
Agent invokes skill	agent + skill	Skill triggered
Hook blocks action	hook + tool	Action blocked
MCP with skill	skill + MCP	Tool returns valid
Multi-component	skill + agent + hook	All work together

Integration Test Example

# Test: Skill-Authoring with self-learning

## Use Cases
| Scenario | Trigger | Expected |
|----------|---------|----------|
| Create and refine | skill-development + self-learning | Valid skill + refinements |

## Validation
| Check | Command | Pass |
|-------|---------|------|
| Skill created | `[ -f "output/SKILL.md" ]` | Exit 0 |
| Refinement output | `grep "Issue:" output.md` | Found |

Recognition Questions

Question	Answer Should Be...
Is CWD in `.sandbox/.claude/`?	Yes, for all claude -p calls
Are test scenarios pre-created?	Yes, before each test
Are logs permanent?	Yes, in logs folder
Is hallucination tested?	Yes, all scenarios covered
Are all components tested?	Skills, commands, agents, hooks, MCP

Validation Checklist

Before claiming test execution complete:

Sandbox created with mirrored structure
CWD verified in .sandbox/.claude/ before each run
Test scenarios pre-created (no on-the-fly creation)
Claude -p invoked with correct flags (stream-json, verbose, skip-permissions)
Logs captured to permanent folder
Pass/fail reported for each test
Hallucination scenarios detected and reported
Cross-component integration tested
Exit codes captured and checked

QUALITY PATTERN: When to Use References

MUST READ references/patterns_test-cases.md when:

You need 20+ test case patterns for inspiration
You need templates for hallucination detection
You need validation condition templates (file, YAML, JSON, exit code)
You need component-specific test scenarios (skills, commands, agents, hooks, MCP)

Keep in SKILL.md (don't move to references):

Core workflow and quick start
Sandbox invariant (CWD requirement)
Error handling patterns
Recognition questions

Why references/ exists: patterns_test-cases.md has 509 lines with 20+ examples—putting this in SKILL.md would create a skill that no one reads. References are for "ultra-situational" lookup, not core workflow.

<critical_constraint> E2E Testing Invariant:

Sandbox MUST mirror project structure (.sandbox/.claude/ and .sandbox/Custom_MCP/)
CWD for claude -p MUST be inside .sandbox/.claude/
Test scenarios MUST be pre-created before invocation (no late creation)
Logs MUST go to permanent folder (not /tmp or /dev/null)
All tests MUST include hallucination detection scenarios
Exit codes MUST be captured and checked (never ignore failures)

Claude -p loads skills/commands from relative paths—wrong CWD = wrong results. </critical_constraint>

Common Mistakes to Avoid

Mistake 1: Wrong CWD for Claude -p

❌ Wrong:

cd /tmp
claude -p "test prompt"

✅ Correct:

cd .sandbox/.claude/
claude -p "test prompt"

Mistake 2: Creating Test Scenarios On-The-Fly

❌ Wrong:

# Test creates file mid-execution
claude -p "Create a new skill"

✅ Correct:

# Pre-create scenario first
# Scenario already exists at .sandbox/.claude/skills/my-test-skill/
claude -p "Test the skill"

Mistake 3: Skipping Hallucination Detection

❌ Wrong:

# Only checks if output exists
if [ -f output.md ]; then echo "PASS"; fi

✅ Correct:

# Check for hallucinations (false claims, wrong file paths)
# Inline detection - no external script needed
grep -qi "read.*skill\|look at.*skill\|I understand" output.md && echo "HALLUCINATION"

Mistake 4: Ignoring Exit Codes

❌ Wrong:

claude -p "test" > output.md
# Ignores exit code

✅ Correct:

claude -p "test" > output.md
if [ $? -ne 0 ]; then echo "FAIL: claude -p failed"; exit 1; fi

Mistake 5: Temporary Logs

❌ Wrong:

claude -p "test" > output.md
# Log lost after session

✅ Correct:

claude -p "test" 2>&1 | tee logs/run-$(date +%Y%m%d-%H%M%S).log

Best Practices Summary

✅ DO:

Create sandbox with mirrored .claude/ structure
Run claude -p from inside .sandbox/.claude/
Pre-create all test scenarios before invocation
Capture logs to permanent folder
Test for hallucinations (false claims, wrong assumptions)
Check exit codes for all commands
Test all component types (skills, commands, agents, hooks, MCP)

❌ DON'T:

Run from wrong directory (breaks relative paths)
Create test scenarios mid-test (changes variables)
Skip hallucination detection (misses false outputs)
Ignore exit codes (misses failures)
Use /tmp or /dev/null for logs (lose evidence)
Test only one component type (misses integration issues)

Name	testing-e2e
Description	Execute programmatic E2E tests for all .claude/ components. Use when validating skills, commands, agents, hooks, or MCP servers. Not for unit tests or CI/CD pipelines.