dspy-optimization-workflow
Complete 3-phase guide for optimizing DSPy programs in Skills-Fleet. Use when implementing quality improvements, using optimization API endpoints, or troubleshooting DSPy issues.
DSPy Optimization Workflow for Skills-Fleet
When to Use
Load this skill when you need to:
- Optimize DSPy programs in Skills-Fleet for better quality
- Implement production patterns (monitoring, error handling, ensemble methods)
- Use the optimization API endpoints (`/optimization/start`, `/optimization/status`)
- Design effective DSPy signatures with Literal types and constraints
- Create or expand training datasets for robust optimization
- Troubleshoot optimization issues (low scores, API failures, type errors)
- Implement advanced patterns (versioning, A/B testing, caching)
This skill documents the complete 3-phase DSPy quality improvement workflow successfully implemented in January 2026.
Quick Start
Run Optimization (Simplest)
# Using the quick optimization script
uv run python scripts/run_optimization.py
# Expected: Runs GEPA optimization with trainset_v4.json (50 examples)
# Saves to: config/optimized/skill_program_gepa_v1.pkl
Run Optimization via API
# 1. Start server
uv run skill-fleet serve
# 2. Trigger optimization
curl -X POST http://localhost:8000/api/v1/optimization/start \
  -H "Content-Type: application/json" \
  -d '{
    "optimizer": "miprov2",
    "trainset_file": "config/training/trainset_v4.json",
    "auto": "medium"
  }'
# 3. Check status (use job_id from response)
curl http://localhost:8000/api/v1/optimization/status/{job_id}
Test Your Implementation
# Run comprehensive validation
uv run python scripts/test_phase_implementation.py
# Expected: 10/10 tests pass
3-Phase Implementation Guide
Phase 1: Foundation (Week 1)
Goal: Enhance signatures, expand training data, add monitoring
Tasks:
- Enhance Signatures → See references/phase1-implementation.md
  - Add Literal types for constrained outputs
  - Specific OutputField constraints with quality indicators
  - Concise, actionable docstrings
- Expand Training Data → See references/phase1-implementation.md
  - Target: 50-100 examples (DSPy recommendation)
  - Extract from existing skills
  - Generate synthetic examples for diversity
  - Use `scripts/expand_training_data.py` and `scripts/generate_synthetic_examples.py`
- Add Monitoring → See references/phase1-implementation.md
  - ModuleMonitor: Wrap modules for tracking
  - ExecutionTracer: Collect detailed traces
  - MLflowLogger: Optional experiment tracking
Example: examples/example_signature.py
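To make the "wrap modules for tracking" idea concrete, here is a minimal sketch of a monitoring wrapper. `SimpleMonitor` and `CallStats` are hypothetical names used only for illustration; this is not the project's ModuleMonitor, whose interface is documented in references/phase1-implementation.md. It assumes the wrapped module is any Python callable.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CallStats:
    """Aggregate counters for one wrapped callable."""

    calls: int = 0
    errors: int = 0
    total_seconds: float = 0.0

    @property
    def mean_seconds(self) -> float:
        return self.total_seconds / self.calls if self.calls else 0.0


class SimpleMonitor:
    """Hypothetical stand-in for a monitoring wrapper (not the real ModuleMonitor)."""

    def __init__(self, module: Callable[..., Any]) -> None:
        self.module = module
        self.stats = CallStats()

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        self.stats.calls += 1
        try:
            return self.module(*args, **kwargs)
        except Exception:
            # Count the failure, then let the caller see the original error.
            self.stats.errors += 1
            raise
        finally:
            # Runs on success and failure, so latency covers every call.
            self.stats.total_seconds += time.perf_counter() - start
```

Wrapping is transparent to callers: `monitored = SimpleMonitor(module)` is invoked exactly like `module`, and `monitored.stats` can be read at any point.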
Phase 2: Optimization (Week 2)
Goal: Run optimization, implement custom metrics, add error handling
Tasks:
- Run Optimization → See references/phase2-optimization.md
  - Use MIPROv2 with `auto="medium"` for balanced cost/quality
  - Or GEPA for faster, reflection-based optimization
  - Configure via API or CLI
- Enhanced Metrics → See references/phase2-optimization.md
  - taxonomy_accuracy_metric
  - metadata_quality_metric
  - skill_style_alignment_metric
  - comprehensive_metric (weighted combination)
- Error Handling → See references/phase2-optimization.md
  - RobustModule: Retry with exponential backoff
  - ValidatedModule: Output validation
  - Phase-specific fallbacks
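The retry-with-exponential-backoff pattern named above can be sketched in a few lines. `call_with_backoff` and its parameters are hypothetical and are not RobustModule's actual interface; the sketch assumes every exception is worth retrying, which a real wrapper would likely narrow to transient errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...).
            delay = base_delay * 2 ** (attempt - 1)
            # Up to 10% jitter keeps concurrent callers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so tests can pass a recorder instead of actually waiting.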
Example: examples/example_metric.py
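A weighted combination like comprehensive_metric can be sketched as below. The component scorers, the field names (`category`, `description`, `style`), and the weights are all assumptions made for illustration; the project's real metrics live in enhanced_metrics.py and may differ. The `(example, pred, trace=None)` argument shape follows the usual DSPy metric convention.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[object, object], float]


def taxonomy_score(example, pred) -> float:
    """1.0 when the predicted category matches the labeled one."""
    return 1.0 if pred.category == example.category else 0.0


def metadata_score(example, pred) -> float:
    """Toy check: reward a description of reasonable length."""
    return 1.0 if 20 <= len(pred.description) <= 200 else 0.0


def style_score(example, pred) -> float:
    """1.0 when the predicted style matches the labeled one."""
    return 1.0 if pred.style == example.style else 0.0


# Illustrative weights only; tune them to what matters for your skills.
COMPONENTS: List[Tuple[Scorer, float]] = [
    (taxonomy_score, 0.4),
    (metadata_score, 0.3),
    (style_score, 0.3),
]


def weighted_metric(example, pred, trace=None) -> float:
    """Weighted average of the component scores, always within [0, 1]."""
    total_weight = sum(weight for _, weight in COMPONENTS)
    weighted_sum = sum(weight * scorer(example, pred) for scorer, weight in COMPONENTS)
    return weighted_sum / total_weight
```

Dividing by the total weight keeps the result normalized even if the weights are later edited and no longer sum to 1.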
Phase 3: Advanced Patterns (Week 3)
Goal: Implement ensemble, versioning, caching for production
Tasks:
- Ensemble Methods → See references/phase3-advanced.md
  - EnsembleModule: Multiple models, best selection
  - BestOfN: Generate N, pick highest quality
  - MajorityVote: Classification consensus
- Versioning → See references/phase3-advanced.md
  - ProgramRegistry: Manage multiple versions
  - ABTestRouter: Gradual rollout, A/B testing
- Caching → See references/phase3-advanced.md
  - CachedModule: Multi-level caching (memory + disk)
  - Significant performance gains (30-50% faster)
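As a rough illustration of memory + disk caching, the sketch below checks an in-process dict first and a JSON file second before calling the wrapped function. `TwoLevelCache` is a hypothetical name, not CachedModule's actual interface. It assumes keyword-only calls with JSON-serializable inputs and outputs, and it defaults to a fresh temporary directory when no cache directory is given.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Optional


class TwoLevelCache:
    """Memory-then-disk cache for a function with JSON-serializable I/O."""

    def __init__(self, fn: Callable[..., Any], cache_dir: Optional[Path] = None) -> None:
        self.fn = fn
        # Assumption: a fresh temp directory is an acceptable default location.
        self.cache_dir = cache_dir or Path(tempfile.mkdtemp(prefix="cache-sketch-"))
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.memory: Dict[str, Any] = {}

    def _key(self, kwargs: Dict[str, Any]) -> str:
        # Sorted keys make equal inputs hash to the same cache entry.
        payload = json.dumps(kwargs, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def __call__(self, **kwargs: Any) -> Any:
        key = self._key(kwargs)
        if key in self.memory:  # level 1: in-process dict
            return self.memory[key]
        path = self.cache_dir / f"{key}.json"
        if path.exists():  # level 2: survives process restarts
            result = json.loads(path.read_text(encoding="utf-8"))
        else:  # miss on both levels: compute and persist
            result = self.fn(**kwargs)
            path.write_text(json.dumps(result), encoding="utf-8")
        self.memory[key] = result
        return result
```

A second `TwoLevelCache` pointed at the same directory reuses the disk entries, which is what lets results outlive a single process.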
Example: examples/example_ensemble.py
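The two simplest ensemble ideas, majority voting and best-of-N selection, reduce to a few lines each. The function names and signatures below are hypothetical and only show the selection logic; they are not the project's MajorityVote or BestOfN classes.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most common label among several classification outputs."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels).most_common(1)[0][0]


def best_of_n(generate: Callable[[], T], score: Callable[[T], float], n: int = 3) -> T:
    """Generate n candidates and keep the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

In practice `generate` would call a DSPy module and `score` would be a quality metric, so best-of-N costs N model calls per request.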
API Usage Patterns
Complete reference: references/api-reference.md
Key Endpoints
POST /api/v1/optimization/start
- Trigger background optimization job
- Supports MIPROv2, GEPA, BootstrapFewShot
- Uses trainset JSON files or skill paths
GET /api/v1/optimization/status/{job_id}
- Check optimization progress
- Returns: status, progress (0-1), result, error
GET /api/v1/optimization/optimizers
- List available optimizers with parameters
- Useful for discovering configuration options
Integration Example
# Start optimization programmatically
import asyncio

import httpx

BASE_URL = "http://localhost:8000/api/v1/optimization"


async def main() -> None:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{BASE_URL}/start",
            json={
                "optimizer": "miprov2",
                "trainset_file": "config/training/trainset_v4.json",
                "auto": "medium",
            },
        )
        response.raise_for_status()
        job_id = response.json()["job_id"]

        # Poll for completion
        while True:
            status = await client.get(f"{BASE_URL}/status/{job_id}")
            status.raise_for_status()
            data = status.json()
            if data["status"] == "completed":
                print(f"Quality score: {data['result']['quality_score']}")
                break
            if data.get("error"):
                # Stop polling once the job reports an error
                raise RuntimeError(f"Optimization failed: {data['error']}")
            await asyncio.sleep(5)


asyncio.run(main())
Best Practices & Patterns
Complete guide: references/best-practices.md
DSPy Signature Design
✅ DO:
- Use Literal types for enums/categories
- Add specific constraints to OutputField descriptions
- Include quality indicators ("quality >0.80", "3-5 examples")
- Keep docstrings concise and actionable
❌ DON'T:
- Use generic `str` types when Literal would work
- Write verbose explanations in docstrings
- Skip OutputField descriptions
- Use underscores or spaces in field names
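To show why a Literal type is tighter than a plain `str`, the sketch below validates a value against a Literal's allowed options using only the standard `typing` module. The three style names come from this guide; `SkillStyle` and `validate_style` are hypothetical names, and the project's signatures may declare the constraint differently.

```python
from typing import Literal, get_args

# The same Literal can annotate a signature field and drive runtime checks.
SkillStyle = Literal["comprehensive", "navigation_hub", "minimal"]


def validate_style(value: str) -> str:
    """Return value unchanged if it is one of the Literal's options, else raise."""
    allowed = get_args(SkillStyle)
    if value not in allowed:
        raise ValueError(f"style must be one of {allowed}, got {value!r}")
    return value
```

With a bare `str` annotation, a typo such as "minimial" would pass silently; with the Literal, both a type checker and a check like this one can reject it.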
Training Data
✅ DO:
- Aim for 50-100 diverse examples
- Include all skill styles (comprehensive, navigation_hub, minimal)
- Cover all major categories
- Use both golden and synthetic examples
❌ DON'T:
- Rely on <20 examples (insufficient for robust optimization)
- Duplicate examples (reduces effective dataset size)
- Skip validation of JSON structure
- Ignore category distribution
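Most of these checks can be automated before an optimization run. The sketch below assumes the training set has already been loaded as a list of dicts and that each example carries a `category` key; both are assumptions about the file layout, so adjust them to the actual structure of your trainset. `audit_trainset` is a hypothetical helper name.

```python
import json
from collections import Counter
from typing import Any, Dict, List


def audit_trainset(examples: List[Dict[str, Any]], min_examples: int = 50) -> Dict[str, Any]:
    """Summarize size, exact duplicates, and category balance of a training set."""
    # Serializing with sorted keys makes dicts with equal content compare equal.
    fingerprints = Counter(json.dumps(ex, sort_keys=True) for ex in examples)
    duplicates = sum(count - 1 for count in fingerprints.values())
    categories = Counter(ex.get("category", "<missing>") for ex in examples)
    return {
        "total": len(examples),
        "unique": len(fingerprints),
        "duplicates": duplicates,
        # Duplicates do not count toward the minimum.
        "enough_examples": len(fingerprints) >= min_examples,
        "categories": dict(categories),
    }
```

A heavily skewed `categories` count is a prompt to generate synthetic examples for the underrepresented categories.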
Optimization Strategy
✅ DO:
- Start with GEPA for quick iteration (fast, cheap)
- Use MIPROv2 `auto="medium"` for production (balanced)
- Monitor costs and quality during optimization
- Evaluate on separate test set
❌ DON'T:
- Jump straight to `auto="heavy"` (expensive, often unnecessary)
- Optimize on entire dataset without train/test split
- Ignore baseline evaluation
- Skip monitoring/logging during long runs
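A reproducible train/test split is the smallest piece of this strategy to get right. The helper below is a hypothetical sketch that holds out a fixed fraction with a seeded shuffle; the 20% default is an illustrative choice, not a project setting.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_examples(
    examples: Sequence[T], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A private Random instance keeps the split stable across runs
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Evaluate the baseline and the optimized program on the same held-out list so the two scores are comparable.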
Troubleshooting
Complete guide: references/troubleshooting.md
Common Issues
Low Quality Scores (<0.70)
- ✓ Check training data diversity (need 50+ examples)
- ✓ Verify signature constraints are specific
- ✓ Review metric function (might be too strict)
- ✓ Try MIPROv2 instead of BootstrapFewShot
API Optimization Job Fails
- ✓ Check trainset JSON structure
- ✓ Verify GOOGLE_API_KEY is set
- ✓ Check server logs for specific error
- ✓ Ensure enough memory (optimization is CPU/memory intensive)
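The first two checks are easy to run before submitting a job. The sketch below is a hypothetical preflight helper; it assumes the trainset file is a non-empty top-level JSON array, which may not match the real file layout, so treat that last check as a placeholder.

```python
import json
import os
from pathlib import Path
from typing import List, Mapping, Optional


def preflight(trainset_path: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems; an empty list means ready to submit."""
    env = os.environ if env is None else env
    problems: List[str] = []
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    path = Path(trainset_path)
    if not path.is_file():
        problems.append(f"trainset file not found: {path}")
        return problems
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"trainset is not valid JSON: {exc}")
        return problems
    # Assumed layout: a non-empty array of examples at the top level.
    if not isinstance(data, list) or not data:
        problems.append("trainset should be a non-empty JSON array (assumed layout)")
    return problems
```

Calling `preflight("config/training/trainset_v4.json")` before the POST request turns a failed background job into an immediate, readable error list.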
Type Errors in Signatures
- ✓ Add `from __future__ import annotations`
- ✓ Import types from `typing` (Literal, etc.)
- ✓ Run `uv run ty check src/` to validate
- ✓ Check for unresolved references
Slow Optimization
- ✓ Use GEPA instead of MIPROv2 for faster iteration
- ✓ Reduce `num_candidate_programs` (default 16 → 8)
- ✓ Lower `max_bootstrapped_demos` (4 → 2)
- ✓ Use `auto="light"` instead of "medium"
Utilities & Scripts
Quick Optimization Runner
# Run optimization with sensible defaults
.skills/dspy-optimization-workflow/scripts/quick_optimize.py \
--trainset config/training/trainset_v4.json \
--optimizer gepa
Test Custom Metrics
# Test metric against examples
.skills/dspy-optimization-workflow/scripts/test_metrics.py \
--metric comprehensive_metric \
--examples 10
Compare Program Versions
# Compare two optimized versions
.skills/dspy-optimization-workflow/scripts/compare_versions.py \
--v1 config/optimized/program_v1.pkl \
--v2 config/optimized/program_v2.pkl
Export Monitoring Traces
# Export traces for analysis
.skills/dspy-optimization-workflow/scripts/export_traces.py \
--output traces_analysis.json
Key Files Reference
Configuration
- `config/training/trainset_v4.json` - 50 training examples (ready to use)
- `config/config.yaml` - LLM configuration (roles, models)
Core Implementation
- `src/skill_fleet/core/dspy/signatures/` - Enhanced signature definitions
- `src/skill_fleet/core/dspy/metrics/enhanced_metrics.py` - Evaluation metrics
- `src/skill_fleet/core/dspy/monitoring/` - Monitoring infrastructure
- `src/skill_fleet/core/dspy/modules/error_handling.py` - Error handling wrappers
- `src/skill_fleet/core/dspy/modules/ensemble.py` - Ensemble methods
- `src/skill_fleet/core/dspy/versioning.py` - Version management
- `src/skill_fleet/core/dspy/caching.py` - Caching strategies
API
- `src/skill_fleet/api/routes/optimization.py` - Optimization endpoints
Scripts
- `scripts/run_optimization.py` - Main optimization runner
- `scripts/test_phase_implementation.py` - Comprehensive tests
- `scripts/expand_training_data.py` - Training data extraction
- `scripts/generate_synthetic_examples.py` - Synthetic example generation
Expected Results
With complete Phase 1-3 implementation:
- Quality Score: 0.70-0.75 → 0.85-0.90 (+15-20 points on a 0-100 scale)
- Obra Compliance: ~60% → ~85% (+25 percentage points)
- Consistency: Much improved with Literal type constraints
- Performance: 30-50% faster with strategic caching
- Reliability: Improved with retry logic and fallbacks
- Observability: Full monitoring and tracing in production
Next Steps
After loading this skill:
- For new optimization: Start with Phase 1 (signatures + training data)
- For existing setup: Jump to Quick Start and run optimization
- For troubleshooting: Check references/troubleshooting.md
- For API integration: See references/api-reference.md
- For advanced patterns: Review references/phase3-advanced.md
Implementation Status: ✅ All phases complete (Jan 19, 2026)
Test Results: 10/10 tests passing
Type Checks: ✅ Passing (11 expected MLflow warnings)
Ready for Production: Yes