plan-validate-structure
DEPRECATED: The notes/plan/ directory has been removed. Planning is now done directly through GitHub issues. See gh-read-issue-context and gh-post-issue-update skills instead.
SKILL.md
ProjectScylla
Table of Contents
- What is ProjectScylla?
- Core Concepts
- Quick Start
- System Requirements
- Analysis Pipeline Architecture
- Development
- Troubleshooting
- Publication Readiness
- Contributing
What is ProjectScylla?
ProjectScylla is a comprehensive testing framework for AI agent workflows that:
- Measures agent performance under constrained conditions
- Analyzes results with rigorous statistical methods
- Optimizes agent decisions through trade-off evaluation
- Generates publication-ready reports, figures, and tables
Key Output: Publication-quality statistical reports with 27 figures and 11 tables from a single command.
"In Homer's Odyssey, Scylla represents one of the greatest challenges on the journey home: a monster that forced sailors to navigate perilous straits where every choice carried risk. ProjectScylla provides the same proving ground for AI agents."
Quick Start Guide
5-Minute Setup
# 1. Install prerequisites
curl -fsSL https://pixi.sh/install.sh | bash
# 2. Clone and setup
git clone https://github.com/HomericIntelligence/ProjectScylla.git
cd ProjectScylla
# 3. Run your first analysis
pixi run python --version # Verify installation
pixi run python scripts/generate_all_results.py --data-dir ~/fullruns
# 4. View results (27 figures + 11 tables generated)
open results/analysis/figures/*.png # macOS
xdg-open results/analysis/figures/*.png # Linux
That's it! All outputs appear in the results/analysis/ directory.
Usage Examples
Compare Two Agent Configurations:
pixi run python scripts/generate_all_results.py \
--data-dir ~/experiments/ \
--output-dir comparison_results/ \
--exclude test001-dryrun
Fast Development Mode (No Rendering):
# Quick iteration - generates Vega-Lite specs only
pixi run python scripts/generate_all_results.py \
--data-dir ~/quick_test \
--no-render \
--skip-data # Skip if CSVs already exist
System Requirements
Minimum Requirements:
- Python 3.10+
- 8GB RAM for full dataset analysis
- 2GB disk space for results
Typical Performance:
- Full analysis: 10-15 minutes (10,000 bootstrap samples)
- Figures only: 2-3 minutes
- Tables only: 1-2 minutes
Scale: Handles experiments with 1000+ runs efficiently
Core Concepts
- Trade-Off Evaluation: Agents face scenarios where every decision has a cost, mirroring the Scylla and Charybdis dilemma
- Metrics & Benchmarks: Structured measurement across adaptability, efficiency, and reliability
- Iterative Optimization: Continuous refinement through repeated trials
- Resilience Testing: Assessment under uncertainty, constraints, and risks
Ecosystem
- ProjectOdyssey: Training and capability development
- ProjectKeystone: Communication and distributed agent coordination
- ProjectScylla: Testing, measurement, and optimization under trial
Together, they form a cohesive ecosystem for building, connecting, and refining agent workflows.
Running the Analysis Pipeline
Full Analysis (Recommended)
Generate all outputs (data exports, figures, tables):
pixi run python scripts/generate_all_results.py \
--data-dir ~/fullruns \
--output-dir results/analysis
Key Options:
- --data-dir: Directory with experiment results (default: ~/fullruns)
- --output-dir: Base output directory (default: docs/)
- --no-render: Skip PNG/PDF (faster, Vega-Lite specs only)
- --skip-data / --skip-figures / --skip-tables: Generate specific components only
- --exclude: Filter experiments (e.g., --exclude test001-dryrun)
# Development mode - no rendering
pixi run python scripts/generate_all_results.py \
--no-render \
--exclude test001-dryrun test001-debug
# Regenerate tables only (assumes data/figures exist)
pixi run python scripts/generate_all_results.py \
--skip-data --skip-figures
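For orientation, the documented options could be declared with argparse roughly as follows. This is a hypothetical sketch, not the actual parser in scripts/generate_all_results.py: the function name build_parser is invented here, the defaults are copied from the option list above, and letting --exclude accept several names is an assumption based on the multi-name example.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented above."""
    parser = argparse.ArgumentParser(description="Generate data, figures, and tables")
    parser.add_argument("--data-dir", type=Path, default=Path("~/fullruns").expanduser())
    parser.add_argument("--output-dir", type=Path, default=Path("docs/"))
    parser.add_argument("--no-render", action="store_true", help="Vega-Lite specs only")
    parser.add_argument("--skip-data", action="store_true")
    parser.add_argument("--skip-figures", action="store_true")
    parser.add_argument("--skip-tables", action="store_true")
    # nargs="*" lets one flag take several experiment names (assumed behavior)
    parser.add_argument("--exclude", nargs="*", default=[])
    return parser


args = build_parser().parse_args(
    ["--no-render", "--exclude", "test001-dryrun", "test001-debug"]
)
print(args.no_render, args.exclude)
```

Boolean store_true flags default to False, so omitting every --skip-* flag corresponds to generating all components.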
Individual Pipeline Steps
1. Export Data Only
pixi run python scripts/export_data.py \
--data-dir ~/fullruns \
--output-dir results/analysis/data
Outputs: runs.csv, judges.csv, criteria.csv, subtests.csv, summary.json, statistical_results.json
2. Generate Figures Only (27 figures × 5 formats)
pixi run python scripts/generate_figures.py \
--data-dir ~/fullruns \
--output-dir results/analysis/figures
Outputs: *.vl.json, *.csv, *.png (300 DPI), *.pdf, *_include.tex
3. Generate Tables Only (11 tables × 2 formats)
pixi run python scripts/generate_tables.py \
--data-dir ~/fullruns \
--output-dir results/analysis/tables
Outputs: *.md (human-readable), *.tex (LaTeX, booktabs formatted)
Output Structure
results/analysis/
├── data/
│   ├── runs.csv                  # Per-run metrics
│   ├── judges.csv                # Judge evaluations
│   ├── criteria.csv              # Criterion-level scores
│   ├── subtests.csv              # Subtest metadata
│   ├── summary.json              # Experiment summary
│   └── statistical_results.json  # Statistical analysis
├── figures/                      # 27 figures × 5 formats
│   ├── fig01_score_variance.*
│   ├── fig02_grade_distribution.*
│   └── ... (27 total)
└── tables/                       # 11 tables × 2 formats
    ├── table01_tier_summary.md
    ├── table01_tier_summary.tex
    └── ... (11 total)
Using the Outputs
LaTeX Integration:
\begin{figure}
\centering
\input{results/analysis/figures/fig04_pass_rate_by_tier_include.tex}
\caption{Pass rate by tier with 95\% bootstrap confidence intervals.}
\label{fig:pass-rate}
\end{figure}
\input{results/analysis/tables/table02_tier_comparison.tex}
Python/Jupyter:
import pandas as pd
import json
# Load data
runs_df = pd.read_csv('results/analysis/data/runs.csv')
judges_df = pd.read_csv('results/analysis/data/judges.csv')
# Load statistical results
with open('results/analysis/data/statistical_results.json') as f:
    stats = json.load(f)
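Once the CSVs are loaded, a typical next step is a per-tier aggregate. The sketch below uses only the standard library and an inline stand-in for runs.csv so it runs anywhere. The column names tier and passed are assumptions for illustration, not the documented schema, so check the real header before adapting it.

```python
import csv
import io
from collections import defaultdict

# Stand-in for results/analysis/data/runs.csv; the column names
# "tier" and "passed" are assumptions, not the documented schema.
SAMPLE_RUNS_CSV = """tier,passed
T0,1
T0,0
T1,1
T1,1
"""


def pass_rate_by_tier(csv_file) -> dict[str, float]:
    """Return the fraction of passing runs for each tier."""
    passes: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["tier"]] += 1
        passes[row["tier"]] += int(row["passed"])
    return {tier: passes[tier] / totals[tier] for tier in totals}


rates = pass_rate_by_tier(io.StringIO(SAMPLE_RUNS_CSV))
print(rates)  # {'T0': 0.5, 'T1': 1.0}
```

To run it against real output, pass an open file handle for runs.csv instead of the StringIO object.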
Experiment Management Scripts
ProjectScylla provides comprehensive scripts for running, managing, and analyzing experiments.
Running Experiments
Primary Experiment Runner:
# Run full experiment
pixi run python scripts/manage_experiment.py run --config config/test.yaml
# Run specific tiers
pixi run python scripts/manage_experiment.py run \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 T1 --runs 10 -v
Container-Based Execution:
./scripts/setup_api_key.sh
./scripts/run_experiment_in_container.sh \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 --runs 5 --verbose
Recovery & Re-running
# Re-run failed agents
pixi run python scripts/manage_experiment.py rerun-agents \
~/fullruns/test_experiment --tier T0 T1
# Re-run failed judges
pixi run python scripts/manage_experiment.py rerun-judges \
~/fullruns/test_experiment
Results Management
# Regenerate all results
pixi run python scripts/manage_experiment.py regenerate \
~/fullruns/test_experiment
# Repair corrupt checkpoint
pixi run python scripts/manage_experiment.py repair \
~/fullruns/test_experiment/checkpoint.json
Analysis Pipeline Architecture
Statistical Methodology
Rigorous non-parametric methods for bounded, ordinal, non-normal data:
- Bootstrap Confidence Intervals: BCa with 10,000 resamples
- Omnibus Testing: Kruskal-Wallis H test (controls FWER)
- Pairwise Comparisons: Mann-Whitney U + Holm-Bonferroni correction
- Effect Sizes: Cliff's delta with bootstrapped CIs
- Inter-Rater Reliability: Krippendorff's alpha for judge agreement
Configuration: scylla/analysis/config.yaml (all parameters externalized)
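Two of the methods listed above are small enough to sketch in plain Python. The functions below are illustrative, dependency-free versions of Cliff's delta and the Holm-Bonferroni adjustment. They are not the implementations in scylla/analysis, and the function names are hypothetical.

```python
def cliffs_delta(xs: list[float], ys: list[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def holm_bonferroni(p_values: list[float]) -> list[float]:
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


print(cliffs_delta([1, 2, 3], [4, 5, 6]))  # -1.0: every x is below every y
print(holm_bonferroni([0.01, 0.04, 0.03]))
```

Cliff's delta ranges from -1 to 1 and needs no distributional assumptions, which suits bounded ordinal scores. The pairwise loop is quadratic in group size, which is fine for small groups of runs.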
Metrics
Quality:
- Pass-Rate (functional test coverage)
- Implementation Rate (semantic satisfaction)
- Score (weighted rubric evaluation)
- Consistency (1 - Coefficient of Variation)
Economic:
- Cost-of-Pass (expected cost per success)
- Frontier CoP (minimum CoP across configs)
- Token Distribution (cost breakdown)
Process:
- Latency (query to resolution time)
- Judge Agreement (Krippendorff's alpha)
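The quality and economic metrics above reduce to short formulas. The sketch below reads the parenthetical descriptions literally: Cost-of-Pass as mean cost per run divided by pass rate, Frontier CoP as the minimum across configurations, and Consistency as one minus the coefficient of variation. The function names, the use of the population standard deviation, and the zero-mean fallback are assumptions; docs/research.md holds the authoritative definitions.

```python
from statistics import mean, pstdev


def cost_of_pass(mean_cost_per_run: float, pass_rate: float) -> float:
    """Expected cost per success: cost per attempt over success probability."""
    if pass_rate <= 0:
        return float("inf")  # no successes, so a pass is never reached
    return mean_cost_per_run / pass_rate


def frontier_cop(cop_by_config: dict[str, float]) -> float:
    """Minimum Cost-of-Pass across configurations."""
    return min(cop_by_config.values())


def consistency(scores: list[float]) -> float:
    """1 - coefficient of variation (population std / mean)."""
    mu = mean(scores)
    if mu == 0:
        return 0.0  # CV is undefined at zero mean; this fallback is an assumption
    return 1.0 - pstdev(scores) / mu


cops = {"T0": cost_of_pass(0.50, 0.25), "T1": cost_of_pass(0.75, 0.50)}
print(cops, frontier_cop(cops))  # {'T0': 2.0, 'T1': 1.5} 1.5
```

In this example the tier with the higher per-run cost still wins on Cost-of-Pass because it passes twice as often.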
Data Requirements
Expected structure:
fullruns/{experiment_name}/{timestamp}/
├── config/experiment.json                      # Metadata
└── T0-T6/{subtest_id}/run_{01-10}/
    ├── run_result.json                         # Outcomes
    └── judge/judge_{01-03}/judgment.json       # Evaluations
Required in run_result.json:
- run_number (integer)
- exit_code (0 = success)
- judges (list with grades & criteria)
Schema: scylla/analysis/schemas/run_result.schema.json
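A quick pre-flight check of the required fields can catch bad runs before the pipeline does. The sketch below is a hand-rolled approximation, not a replacement for the JSON schema: the function name is hypothetical, and the assumption that each judge entry carries a grade key is for illustration only. The allowed grade letters match the pattern quoted in the Troubleshooting section.

```python
import re

# Same letter set as the '^[SABCDF]$' pattern in the validation error message.
GRADE_PATTERN = re.compile(r"[SABCDF]")


def validate_run_result(run: dict) -> list[str]:
    """Return a list of problems found in one run result payload."""
    problems = []
    run_number = run.get("run_number")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(run_number, int) or isinstance(run_number, bool):
        problems.append("run_number must be an integer")
    if "exit_code" not in run:
        problems.append("exit_code is missing")
    judges = run.get("judges")
    if not isinstance(judges, list):
        problems.append("judges must be a list")
        return problems
    for i, judge in enumerate(judges):
        # The "grade" key is an assumed field name for illustration.
        grade = judge.get("grade") if isinstance(judge, dict) else None
        if not isinstance(grade, str) or not GRADE_PATTERN.fullmatch(grade):
            problems.append(f"judges[{i}].grade {grade!r} is not one of S/A/B/C/D/F")
    return problems


print(validate_run_result({"run_number": 1, "exit_code": 0, "judges": [{"grade": "N/A"}]}))
```

An empty list means the payload passed these checks. Anything else is a human-readable description of what to fix.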
Development
Testing
ProjectScylla has a comprehensive test suite covering all functionality. To see the current test count:
pixi run pytest tests/ --collect-only -q | tail -1
Test Categories
- Unit Tests: Analysis (incl. integration-style tests), adapters, config, executors, judges, metrics, reporting
- E2E Tests (1 file): Full pipeline validation
- Test Fixtures (47+ scenarios): Complete test cases with expected outputs
Running Tests
# All tests (comprehensive)
pixi run pytest tests/ --verbose
# Unit tests only (fastest)
pixi run pytest tests/unit/ -v
# Specific modules
pixi run pytest tests/unit/analysis/ -v
pixi run pytest tests/unit/adapters/ -v
pixi run pytest tests/unit/config/ -v
# Coverage analysis
pixi run pytest tests/ --cov=scylla --cov-report=html
# Specific test file
pixi run pytest tests/unit/analysis/test_stats.py -v
Test Quality Assurance
# Code quality (linting + formatting)
pixi run ruff check scylla/
pixi run ruff format scylla/ --check
Git Hooks
Git hooks enforce quality checks locally before code reaches CI. Install them once after cloning:
bash scripts/install_hooks.sh
| Hook | Trigger | What it does |
|---|---|---|
| pre-push | Every git push | Runs the full test suite with coverage; aborts the push if tests fail or coverage drops below the threshold in pyproject.toml |
The coverage threshold is read directly from pyproject.toml, so updating it there keeps the hook in sync automatically.
Hook source files live in scripts/hooks/ and are version-controlled. See scripts/README.md for details.
Adding Components
New Figures:
- Create module in scylla/analysis/figures/
- Implement function following existing pattern
- Register in scripts/generate_figures.py
- Add tests in tests/unit/analysis/test_figures.py
New Tables:
- Add function to module in scylla/analysis/tables/
- Register in scripts/generate_tables.py
- Add tests in tests/unit/analysis/test_tables.py
Code Quality
# Linting
pixi run ruff check scylla/analysis/
# Auto-fix and format
pixi run ruff check --fix scylla/analysis/
pixi run ruff format scylla/analysis/
Troubleshooting
Quick Reference
| Symptom | Solution |
|---|---|
| Schema validation failed: 'N/A' does not match | Ensure grades are S, A, B, C, D, or F only |
| [Errno 2] No such file or directory | Run: find ~/fullruns -name "run_result.json" |
| TypeError: unsupported operand | Fix type coercion in criterion.achieved values |
| Empty outputs | Check: ≥2 experiments, ≥1 completed run each |
| Slow performance | Use --no-render flag for faster iteration |
Common Issues
1. Data Validation Errors
Schema validation failed: 'N/A' does not match '^[SABCDF]$'
Fix: Review the problematic runs and ensure every grade is one of S/A/B/C/D/F, or update the schema.
2. Missing Files
Failed to load: [Errno 2] No such file or directory
Fix: Incomplete runs are skipped with warnings. To list the run directories that lack a result file:
find ~/fullruns -name "run_*" -type d -exec sh -c 'test -f "$1/run_result.json" || echo "Missing: $1"' _ {} \;
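The same check can be done from Python, which is handy inside a notebook or a pre-flight script. This sketch mirrors the find command above using pathlib. The function name is hypothetical, and the demo builds a throwaway directory tree instead of touching ~/fullruns.

```python
import tempfile
from pathlib import Path


def find_incomplete_runs(root: Path) -> list[Path]:
    """Return run_* directories under root that lack run_result.json."""
    return sorted(
        run_dir
        for run_dir in root.rglob("run_*")
        if run_dir.is_dir() and not (run_dir / "run_result.json").is_file()
    )


# Demo on a temporary tree: run_01 is complete, run_02 is not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "T0" / "01" / "run_01").mkdir(parents=True)
    (root / "T0" / "01" / "run_01" / "run_result.json").write_text("{}")
    (root / "T0" / "01" / "run_02").mkdir(parents=True)
    for path in find_incomplete_runs(root):
        print(f"Missing: {path.relative_to(root)}")
```

The is_dir() filter matters because the glob run_* also matches the run_result.json files themselves.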
3. Type Errors
TypeError: unsupported operand type(s) for +: 'float' and 'str'
Fix: Some criterion.achieved values are strings. Fix this in data generation or add coercion when loading.
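If fixing the data at the source is not practical, a small coercion helper at load time avoids the float-plus-string error. The sketch below is one possible approach. The helper name and the shape of the criteria records are assumptions for illustration.

```python
def coerce_achieved(value) -> float:
    """Convert a criterion's achieved value to float, raising on junk."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        return float(value.strip())  # raises ValueError for non-numeric text
    raise TypeError(f"unsupported achieved value: {value!r}")


# Mixed numeric and string values, as described in the error above.
criteria = [{"achieved": 1.5}, {"achieved": "2"}, {"achieved": " 0.5 "}]
total = sum(coerce_achieved(c["achieved"]) for c in criteria)
print(total)  # 4.0
```

Failing loudly on non-numeric text such as "N/A" is deliberate: silently mapping it to zero would bias the scores.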
Getting Help
- Documentation: docs/research.md for methodology
- Examples: tests/unit/analysis/ for usage patterns
- Issues: GitHub Issues
- Support: Create an issue with error message and steps to reproduce
Publication Readiness
- Rigorous non-parametric statistics (Kruskal-Wallis, Mann-Whitney U, Cliff's delta)
- Multiple comparison correction (Holm-Bonferroni throughout)
- Bootstrap confidence intervals (BCa, 10K resamples, seed=42)
- Effect sizes with confidence intervals
- 300 DPI publication-quality figures
- LaTeX-ready tables with booktabs formatting
- Reproducible configuration (all parameters in config.yaml)
- Comprehensive test suite
- Documented methodology with citations
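To illustrate why a fixed seed makes the bootstrap intervals reproducible, here is a minimal percentile bootstrap for the mean. It is deliberately simpler than the BCa method the project reports, which adds bias and acceleration corrections. The function name and the sample scores are hypothetical.

```python
import random


def bootstrap_ci(data, n_resamples=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean (not the BCa variant)."""
    rng = random.Random(seed)  # fixed seed makes the interval reproducible
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int(alpha / 2 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


scores = [0.62, 0.71, 0.55, 0.80, 0.67, 0.74, 0.59, 0.69]
low, high = bootstrap_ci(scores)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

Calling the function twice with the same seed returns the same interval, which is the property that lets published numbers be regenerated exactly.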
See docs/research.md for complete research methodology and metric definitions.
LaTeX Dependencies
Required packages for document compilation:
\documentclass{article}
\usepackage{booktabs} % Professional tables
\usepackage{longtable} % Multi-page tables
\usepackage{threeparttable} % Table notes
\usepackage{graphicx} % Figure inclusion
\usepackage{amsmath} % Statistical symbols
\begin{document}
% Your content here
\end{document}
Contributing
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:
- Development setup and environment configuration
- Git workflow and branch management
- Code quality standards and testing requirements
- Pull request and code review process
- Issue reporting guidelines
Quick Start for Contributors:
- Fork the repository and clone locally
- Copy .env.example to .env and configure API keys
- Install dependencies: curl -fsSL https://pixi.sh/install.sh | bash
- Install git hooks: bash scripts/install_hooks.sh
- Run tests: pixi run pytest tests/ -v
- Check CONTRIBUTING.md for detailed workflow
Areas for contribution:
- Additional statistical methods and metrics
- New visualization types and formats
- Performance optimizations
- Documentation improvements
- Bug fixes and feature requests
Visit our GitHub Repository to get started.
License
Citation
@software{projectscylla2026,
title = {ProjectScylla: A Testing and Optimization Framework for Agentic Workflows},
author = {Micah Villmow},
year = {2026},
url = {https://github.com/HomericIntelligence/ProjectScylla}
}