Agent Skill
2/7/2026

plan-validate-structure

DEPRECATED: The notes/plan/ directory has been removed. Planning is now done directly through GitHub issues. See gh-read-issue-context and gh-post-issue-update skills instead.

homericintelligence
npx skills add HomericIntelligence/ProjectScylla

SKILL.md

Name: plan-validate-structure
Description: DEPRECATED: The notes/plan/ directory has been removed. Planning is now done directly through GitHub issues. See gh-read-issue-context and gh-post-issue-update skills instead.

ProjectScylla



🎯 What is ProjectScylla?

ProjectScylla is a comprehensive testing framework for AI agent workflows that:

  • 🔬 Measures agent performance under constrained conditions
  • 📈 Analyzes results with rigorous statistical methods
  • ⚖️ Optimizes agent decisions through trade-off evaluation
  • 📋 Generates publication-ready reports, figures, and tables

Key Output: Publication-quality statistical reports with 27 figures and 11 tables from a single command.

"In Homer's Odyssey, Scylla represents one of the greatest challenges on the journey home: a monster that forced sailors to navigate perilous straits where every choice carried risk. ProjectScylla provides the same proving ground for AI agents."

Quick Start Guide

🚀 5-Minute Setup

# 1. Install prerequisites
curl -fsSL https://pixi.sh/install.sh | bash

# 2. Clone and setup
git clone https://github.com/HomericIntelligence/ProjectScylla.git
cd ProjectScylla

# 3. Run your first analysis
pixi run python --version  # Verify installation
pixi run python scripts/generate_all_results.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis  # Without this flag, outputs go to the default docs/

# 4. View results (27 figures + 11 tables generated)
open results/analysis/figures/*.png  # macOS
xdg-open results/analysis/figures/*.png  # Linux

That's it! All outputs appear in the results/analysis/ directory.

💡 Usage Examples

Compare Two Agent Configurations:

pixi run python scripts/generate_all_results.py \
  --data-dir ~/experiments/ \
  --output-dir comparison_results/ \
  --exclude test001-dryrun

Fast Development Mode (No Rendering):

# Quick iteration - generates Vega-Lite specs only
pixi run python scripts/generate_all_results.py \
  --data-dir ~/quick_test \
  --no-render \
  --skip-data  # Skip if CSVs already exist

📊 System Requirements

Minimum Requirements:

  • Python 3.10+
  • 8GB RAM for full dataset analysis
  • 2GB disk space for results

Typical Performance:

  • Full analysis: 10-15 minutes (10,000 bootstrap samples)
  • Figures only: 2-3 minutes
  • Tables only: 1-2 minutes

Scale: Handles experiments with 1000+ runs efficiently


Core Concepts

  • ⚖️ Trade-Off Evaluation: Agents face scenarios where every decision has a cost, mirroring the Scylla and Charybdis dilemma
  • 📊 Metrics & Benchmarks: Structured measurement across adaptability, efficiency, and reliability
  • 🔄 Iterative Optimization: Continuous refinement through repeated trials
  • 🧭 Resilience Testing: Assessment under uncertainty, constraints, and risks

Ecosystem

  • ProjectOdyssey → Training and capability development
  • ProjectKeystone → Communication and distributed agent coordination
  • ProjectScylla → Testing, measurement, and optimization under trial

Together, the three projects form a cohesive ecosystem for building, connecting, and refining agent workflows.


Running the Analysis Pipeline

Full Analysis (Recommended)

Generate all outputs (data exports, figures, tables):

pixi run python scripts/generate_all_results.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis

Key Options:

  • --data-dir → Directory with experiment results (default: ~/fullruns)
  • --output-dir → Base output directory (default: docs/)
  • --no-render → Skip PNG/PDF rendering (faster; Vega-Lite specs only)
  • --skip-data, --skip-figures, --skip-tables → Skip the named step so that only the remaining components are generated
  • --exclude → Filter out experiments by name (e.g., --exclude test001-dryrun)

# Development mode - no rendering
pixi run python scripts/generate_all_results.py \
  --no-render \
  --exclude test001-dryrun test001-debug

# Regenerate tables only (assumes data/figures exist)
pixi run python scripts/generate_all_results.py \
  --skip-data --skip-figures
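
When the pipeline is launched from another Python script (a notebook or a CI job, for example), the flags listed above can be assembled programmatically. The following is a minimal sketch: build_pipeline_command is a hypothetical helper, not part of ProjectScylla, and it only emits the flags documented in this section.

```python
from __future__ import annotations

from pathlib import Path


def build_pipeline_command(
    data_dir: str = "~/fullruns",
    output_dir: str | None = None,
    no_render: bool = False,
    skip: tuple[str, ...] = (),
    exclude: tuple[str, ...] = (),
) -> list[str]:
    """Assemble the argv list for scripts/generate_all_results.py (hypothetical helper)."""
    allowed = {"data", "figures", "tables"}
    unknown = set(skip) - allowed
    if unknown:
        raise ValueError(f"unknown components to skip: {sorted(unknown)}")

    cmd = [
        "pixi", "run", "python", "scripts/generate_all_results.py",
        "--data-dir", str(Path(data_dir).expanduser()),
    ]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if no_render:
        cmd.append("--no-render")
    for component in skip:
        cmd.append(f"--skip-{component}")  # --skip-data, --skip-figures, --skip-tables
    if exclude:
        cmd += ["--exclude", *exclude]
    return cmd


# Example: the "development mode" invocation shown above.
print(" ".join(build_pipeline_command(no_render=True, exclude=("test001-dryrun", "test001-debug"))))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.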

Individual Pipeline Steps

1. Export Data Only

pixi run python scripts/export_data.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/data

Outputs: runs.csv, judges.csv, criteria.csv, subtests.csv, summary.json, statistical_results.json

2. Generate Figures Only (27 figures × 5 formats)

pixi run python scripts/generate_figures.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/figures

Outputs: *.vl.json, *.csv, *.png (300 DPI), *.pdf, *_include.tex

3. Generate Tables Only (11 tables × 2 formats)

pixi run python scripts/generate_tables.py \
  --data-dir ~/fullruns \
  --output-dir results/analysis/tables

Outputs: *.md (human-readable), *.tex (LaTeX, booktabs formatted)

Output Structure

results/analysis/
├── data/
│   ├── runs.csv                      # Per-run metrics
│   ├── judges.csv                    # Judge evaluations
│   ├── criteria.csv                  # Criterion-level scores
│   ├── subtests.csv                  # Subtest metadata
│   ├── summary.json                  # Experiment summary
│   └── statistical_results.json      # Statistical analysis
├── figures/                          # 27 figures × 5 formats
│   ├── fig01_score_variance.*
│   ├── fig02_grade_distribution.*
│   └── ... (27 total)
└── tables/                           # 11 tables × 2 formats
    ├── table01_tier_summary.md
    ├── table01_tier_summary.tex
    └── ... (11 total)
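
After a run, it can be useful to confirm that the data export produced every file in the tree above before starting on figures or a paper draft. The sketch below is an illustration only; missing_data_files is a hypothetical helper, and the file names are copied from the data/ listing above.

```python
from pathlib import Path

# File names taken from the data/ directory in the output tree above.
EXPECTED_DATA_FILES = (
    "runs.csv",
    "judges.csv",
    "criteria.csv",
    "subtests.csv",
    "summary.json",
    "statistical_results.json",
)


def missing_data_files(output_dir):
    """Return the expected data files that are absent under <output_dir>/data."""
    data_dir = Path(output_dir) / "data"
    return [name for name in EXPECTED_DATA_FILES if not (data_dir / name).is_file()]


missing = missing_data_files("results/analysis")
print("all data files present" if not missing else f"missing: {missing}")
```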

Using the Outputs

LaTeX Integration:

\begin{figure}
  \centering
  \input{results/analysis/figures/fig04_pass_rate_by_tier_include.tex}
  \caption{Pass rate by tier with 95\% bootstrap confidence intervals.}
  \label{fig:pass-rate}
\end{figure}

\input{results/analysis/tables/table02_tier_comparison.tex}

Python/Jupyter:

import pandas as pd
import json

# Load data
runs_df = pd.read_csv('results/analysis/data/runs.csv')
judges_df = pd.read_csv('results/analysis/data/judges.csv')

# Load statistical results
with open('results/analysis/data/statistical_results.json') as f:
    stats = json.load(f)

Experiment Management Scripts

ProjectScylla provides comprehensive scripts for running, managing, and analyzing experiments.

🧪 Running Experiments

Primary Experiment Runner:

# Run full experiment
pixi run python scripts/manage_experiment.py run --config config/test.yaml

# Run specific tiers
pixi run python scripts/manage_experiment.py run \
  --tiers-dir tests/fixtures/tests/test-001 \
  --tiers T0 T1 --runs 10 -v

Container-Based Execution:

./scripts/setup_api_key.sh
./scripts/run_experiment_in_container.sh \
  --tiers-dir tests/fixtures/tests/test-001 \
  --tiers T0 --runs 5 --verbose

🔄 Recovery & Re-running

# Re-run failed agents
pixi run python scripts/manage_experiment.py rerun-agents \
  ~/fullruns/test_experiment --tier T0 T1

# Re-run failed judges
pixi run python scripts/manage_experiment.py rerun-judges \
  ~/fullruns/test_experiment

📊 Results Management

# Regenerate all results
pixi run python scripts/manage_experiment.py regenerate \
  ~/fullruns/test_experiment

# Repair corrupt checkpoint
pixi run python scripts/manage_experiment.py repair \
  ~/fullruns/test_experiment/checkpoint.json

Analysis Pipeline Architecture

Statistical Methodology

Rigorous non-parametric methods for bounded, ordinal, non-normal data:

  • Bootstrap Confidence Intervals: BCa with 10,000 resamples
  • Omnibus Testing: Kruskal-Wallis H test (controls FWER)
  • Pairwise Comparisons: Mann-Whitney U + Holm-Bonferroni correction
  • Effect Sizes: Cliff's delta with bootstrapped CIs
  • Inter-Rater Reliability: Krippendorff's alpha for judge agreement
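
To make two of the methods in this list concrete, here is a small standard-library sketch of Cliff's delta and the Holm-Bonferroni step-down adjustment. It is an illustration of the textbook definitions, not the code in scylla/analysis; the function names are hypothetical.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def holm_bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags), both in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Step-down multiplier: m for the smallest p-value, then m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted p-values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted, [p <= alpha for p in adjusted]
```

A delta of +1 or -1 means the two groups do not overlap at all, while 0 means neither group tends to score higher. The quadratic pair count is fine for a few hundred runs per group; larger samples would call for a rank-based formulation.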

Configuration: scylla/analysis/config.yaml (all parameters externalized)

Metrics

Quality:

  • Pass-Rate (functional test coverage)
  • Implementation Rate (semantic satisfaction)
  • Score (weighted rubric evaluation)
  • Consistency (1 - Coefficient of Variation)
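
The Consistency formula can be written out directly. This sketch assumes the coefficient of variation is the sample standard deviation divided by the mean; whether ProjectScylla uses the sample or population form is not stated here, so treat that choice as an assumption.

```python
import statistics


def consistency(scores):
    """Return 1 - CV, where CV = sample standard deviation / mean (assumed form)."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return 1.0 - statistics.stdev(scores) / mean


# Identical scores give perfect consistency; spread lowers the value.
print(consistency([0.8, 0.8, 0.8]))
print(consistency([0.6, 0.8, 1.0]))
```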

Economic:

  • Cost-of-Pass (expected cost per success)
  • Frontier CoP (minimum CoP across configs)
  • Token Distribution (cost breakdown)
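
One common way to read "expected cost per success" is the mean cost of a single attempt divided by the probability that an attempt passes. The sketch below uses that reading as an assumption; the function names and the example numbers are invented for illustration and are not taken from the project.

```python
import math


def cost_of_pass(mean_cost_per_run, pass_rate):
    """Expected cost per success: cost of one attempt divided by its success probability."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    if pass_rate == 0.0:
        return math.inf  # no run ever passes, so a pass cannot be bought at any price
    return mean_cost_per_run / pass_rate


def frontier_cop(configs):
    """Minimum Cost-of-Pass over a {name: (mean_cost, pass_rate)} mapping."""
    cops = {name: cost_of_pass(cost, rate) for name, (cost, rate) in configs.items()}
    best = min(cops, key=cops.get)
    return best, cops[best]


# Made-up numbers: a cheap but unreliable config versus a pricier, more reliable one.
print(frontier_cop({"T0": (0.50, 0.25), "T1": (1.00, 0.80)}))
```

Under this definition a more expensive configuration can still sit on the frontier if its pass rate is high enough, which is the trade-off the metric is meant to expose.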

Process:

  • Latency (query to resolution time)
  • Judge Agreement (Krippendorff's alpha)

Data Requirements

Expected structure:

fullruns/{experiment_name}/{timestamp}/
├── config/experiment.json            # Metadata
└── T0-T6/{subtest_id}/run_{01-10}/
    ├── run_result.json              # Outcomes
    └── judge/judge_{01-03}/judgment.json  # Evaluations

Required in run_result.json:

  • run_number (integer)
  • exit_code (0 = success)
  • judges (list with grades & criteria)

Schema: scylla/analysis/schemas/run_result.schema.json
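
A quick pre-flight check of the required fields can catch bad data before the full pipeline runs. The sketch below is not the project's validator (the JSON schema above remains authoritative); it assumes each entry in judges is a dict with a "grade" key, and it reuses the grade pattern quoted in the troubleshooting section.

```python
import json
import re
from pathlib import Path

GRADE_PATTERN = re.compile(r"^[SABCDF]$")  # pattern quoted under Troubleshooting
REQUIRED_FIELDS = ("run_number", "exit_code", "judges")


def check_run_result(result):
    """Return a list of human-readable problems found in one parsed run result."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in result]
    if "run_number" in result and not isinstance(result["run_number"], int):
        problems.append("run_number must be an integer")
    judges = result.get("judges", [])
    if not isinstance(judges, list):
        problems.append("judges must be a list")
        judges = []
    for index, judge in enumerate(judges):
        # ASSUMPTION: each judge entry is a dict carrying a "grade" string.
        grade = judge.get("grade") if isinstance(judge, dict) else None
        if not isinstance(grade, str) or not GRADE_PATTERN.fullmatch(grade):
            problems.append(f"judges[{index}]: invalid grade {grade!r}")
    return problems


def check_file(path):
    """Load one run_result.json file and validate it."""
    return check_run_result(json.loads(Path(path).read_text(encoding="utf-8")))
```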


Development

🧪 Testing

ProjectScylla has a comprehensive test suite covering all functionality. To see the current test count:

pixi run pytest tests/ --collect-only -q | tail -1

Test Categories

  • Unit Tests: Analysis (incl. integration-style tests), adapters, config, executors, judges, metrics, reporting
  • E2E Tests (1 file): Full pipeline validation
  • Test Fixtures (47+ scenarios): Complete test cases with expected outputs

Running Tests

# All tests (comprehensive)
pixi run pytest tests/ --verbose

# Unit tests only (fastest)
pixi run pytest tests/unit/ -v

# Specific modules
pixi run pytest tests/unit/analysis/ -v
pixi run pytest tests/unit/adapters/ -v
pixi run pytest tests/unit/config/ -v

# Coverage analysis
pixi run pytest tests/ --cov=scylla --cov-report=html

# Specific test file
pixi run pytest tests/unit/analysis/test_stats.py -v

Test Quality Assurance

# Code quality (linting + formatting)
pixi run ruff check scylla/
pixi run ruff format scylla/ --check

Git Hooks

Git hooks enforce quality checks locally before code reaches CI. Install them once after cloning:

bash scripts/install_hooks.sh

Hook | Trigger | What it does
pre-push | Every git push | Runs the full test suite with coverage; aborts the push if tests fail or coverage drops below the threshold in pyproject.toml

The coverage threshold is read directly from pyproject.toml; update it there and the hook stays in sync automatically.

Hook source files live in scripts/hooks/ and are version-controlled. See scripts/README.md for details.

Adding Components

New Figures:

  1. Create module in scylla/analysis/figures/
  2. Implement function following existing pattern
  3. Register in scripts/generate_figures.py
  4. Add tests in tests/unit/analysis/test_figures.py
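
As a rough idea of what a figure function might produce, the sketch below builds a Vega-Lite bar chart spec as a plain dict and writes it with the .vl.json suffix listed among the figure outputs. The function names, field names, and chart are hypothetical; the project's actual pattern should be taken from the existing modules in scylla/analysis/figures/.

```python
import json
from pathlib import Path


def pass_rate_bar_spec(rows):
    """Build a Vega-Lite bar chart spec from [{"tier": ..., "pass_rate": ...}, ...]."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "description": "Pass rate by tier (illustrative)",
        "data": {"values": list(rows)},
        "mark": "bar",
        "encoding": {
            "x": {"field": "tier", "type": "nominal", "title": "Tier"},
            "y": {
                "field": "pass_rate",
                "type": "quantitative",
                "title": "Pass rate",
                "scale": {"domain": [0, 1]},
            },
        },
    }


def write_spec(spec, output_dir, name):
    """Write the spec to <output_dir>/<name>.vl.json and return the path."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{name}.vl.json"
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path
```

Keeping the spec as a dict makes it easy to assert on in a unit test without rendering anything.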

New Tables:

  1. Add function to module in scylla/analysis/tables/
  2. Register in scripts/generate_tables.py
  3. Add tests in tests/unit/analysis/test_tables.py
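
Each table is emitted in two formats, Markdown and booktabs-style LaTeX. The sketch below shows one way to render the same rows both ways; the helper names are hypothetical, cells are not LaTeX-escaped, and the real table functions in scylla/analysis/tables/ may be structured differently.

```python
def to_markdown(headers, rows):
    """Render rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)


def to_booktabs(headers, rows):
    """Render rows as a LaTeX tabular using booktabs rules (cells are not escaped)."""
    out = [
        r"\begin{tabular}{" + "l" * len(headers) + "}",
        r"\toprule",
        " & ".join(headers) + r" \\",
        r"\midrule",
    ]
    out += [" & ".join(str(cell) for cell in row) + r" \\" for row in rows]
    out += [r"\bottomrule", r"\end{tabular}"]
    return "\n".join(out)


# Made-up example rows.
print(to_markdown(["Tier", "Pass rate"], [["T0", 0.5], ["T1", 0.75]]))
print(to_booktabs(["Tier", "Pass rate"], [["T0", 0.5], ["T1", 0.75]]))
```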

Code Quality

# Linting
pixi run ruff check scylla/analysis/

# Auto-fix and format
pixi run ruff check --fix scylla/analysis/
pixi run ruff format scylla/analysis/

🔧 Troubleshooting

Quick Reference

Symptom | Solution
Schema validation failed: 'N/A' does not match | Ensure grades are S, A, B, C, D, or F only
[Errno 2] No such file or directory | Run: find ~/fullruns -name "run_result.json"
TypeError: unsupported operand | Fix type coercion in criterion.achieved values
Empty outputs | Check: ≥2 experiments, ≥1 completed run each
Slow performance | Use --no-render flag for faster iteration

Common Issues

1. Data Validation Errors

Schema validation failed: 'N/A' does not match '^[SABCDF]$'

Fix: Review the problematic runs and ensure every grade is one of S, A, B, C, D, or F, or update the schema if other values are legitimate.

2. Missing Files

Failed to load: [Errno 2] No such file or directory

Fix: Incomplete runs are skipped with warnings. To list the run directories that lack a run_result.json:

find ~/fullruns -name "run_*" -type d -exec sh -c 'test -f "$1/run_result.json" || echo "Missing: $1"' _ {} \;
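
The same check can be done from Python, which is handy inside a notebook or on systems without find. This is a minimal sketch with a hypothetical function name; it mirrors the shell command above by looking at every directory whose name starts with run_.

```python
from pathlib import Path


def runs_missing_result(root):
    """Yield run_* directories under root that contain no run_result.json."""
    for run_dir in sorted(Path(root).expanduser().rglob("run_*")):
        # rglob also matches files such as run_result.json, so keep directories only.
        if run_dir.is_dir() and not (run_dir / "run_result.json").is_file():
            yield run_dir


for path in runs_missing_result("~/fullruns"):
    print(f"Missing: {path}")
```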

3. Type Errors

TypeError: unsupported operand type(s) for +: 'float' and 'str'

Fix: Some criterion.achieved values are strings rather than numbers. Correct them in the data generation step, or coerce them to numbers when loading.
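
If coercion at load time is the chosen route, something along these lines would do. The helper names are hypothetical, and silently skipping non-numeric entries is an assumption; depending on the analysis, raising an error may be the safer behavior.

```python
def coerce_achieved(value):
    """Convert an achieved value to float, or return None if it is not numeric."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return None
    return None


def total_achieved(values):
    """Sum achieved values, skipping entries that cannot be coerced."""
    numbers = (coerce_achieved(v) for v in values)
    return sum(n for n in numbers if n is not None)


# Mixed input of the kind that triggers the TypeError above.
print(total_achieved([1.5, "2", "N/A", None]))
```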

Getting Help

  • Documentation: docs/research.md for methodology
  • Examples: tests/unit/analysis/ for usage patterns
  • Issues: GitHub Issues
  • Support: Create an issue with error message and steps to reproduce

Publication Readiness

✅ Rigorous non-parametric statistics (Kruskal-Wallis, Mann-Whitney U, Cliff's delta)

✅ Multiple comparison correction (Holm-Bonferroni throughout)

✅ Bootstrap confidence intervals (BCa, 10K resamples, seed=42)

✅ Effect sizes with confidence intervals

✅ 300 DPI publication-quality figures

✅ LaTeX-ready tables with booktabs formatting

✅ Reproducible configuration (all parameters in config.yaml)

✅ Comprehensive test suite

✅ Documented methodology with citations

See docs/research.md for complete research methodology and metric definitions.
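
For readers unfamiliar with bootstrap intervals, the sketch below shows the basic resampling idea using the simpler percentile method with a fixed seed. It is deliberately not BCa: the bias and acceleration corrections that the pipeline applies are omitted, so the numbers will differ from the published intervals. The function name and the sample scores are invented for illustration.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.fmean, n_resamples=10_000, level=0.95, seed=42):
    """Percentile bootstrap CI for stat(values); no BCa bias/acceleration correction."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so repeated calls give the same interval
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the estimates.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = 1.0 - level
    lo_index = int((alpha / 2) * (n_resamples - 1))
    hi_index = int((1 - alpha / 2) * (n_resamples - 1))
    return estimates[lo_index], estimates[hi_index]


# Made-up scores in [0, 1].
scores = [0.2, 0.4, 0.6, 0.8, 1.0] * 4
print(bootstrap_ci(scores))
```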

LaTeX Dependencies

Required packages for document compilation:

\documentclass{article}
\usepackage{booktabs}        % Professional tables
\usepackage{longtable}       % Multi-page tables
\usepackage{threeparttable}  % Table notes
\usepackage{graphicx}        % Figure inclusion
\usepackage{amsmath}         % Statistical symbols

\begin{document}
% Your content here
\end{document}

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:

  • Development setup and environment configuration
  • Git workflow and branch management
  • Code quality standards and testing requirements
  • Pull request and code review process
  • Issue reporting guidelines

Quick Start for Contributors:

  1. Fork the repository and clone locally
  2. Copy .env.example to .env and configure API keys
  3. Install pixi, which manages the project's dependencies: curl -fsSL https://pixi.sh/install.sh | bash
  4. Install git hooks: bash scripts/install_hooks.sh
  5. Run tests: pixi run pytest tests/ -v
  6. Check CONTRIBUTING.md for detailed workflow

Areas for contribution:

  • Additional statistical methods and metrics
  • New visualization types and formats
  • Performance optimizations
  • Documentation improvements
  • Bug fixes and feature requests

Visit our GitHub repository (https://github.com/HomericIntelligence/ProjectScylla) to get started.


License


Citation

@software{projectscylla2026,
  title = {ProjectScylla: A Testing and Optimization Framework for Agentic Workflows},
  author = {Micah Villmow},
  year = {2026},
  url = {https://github.com/HomericIntelligence/ProjectScylla}
}

Skills Info
Original Name: plan-validate-structure
Author: homericintelligence