domain-retrospective
Turn experiment reports and development notes into summaries and reusable skills. Adapts behavior based on project domain (research, unsloth, cuda) by reading registry.json. Triggers on <retrospective> or requests for lessons learned.
SKILL.md
---
name: domain-retrospective
description: >
  Turn experiment reports and development notes into summaries and reusable skills.
  Adapts behavior based on project domain (research, unsloth, cuda) by reading registry.json.
  Triggers on <retrospective> or requests for lessons learned.
metadata:
  short-description: "Summarize findings and distill them into skills"
  tags:
    - documentation
    - retrospective
    - knowledge-capture
---
Skill: domain-retrospective
When to use
Use this skill when:
- The user message starts with `<retrospective>`, or
- The user requests a summary or lessons learned across experiments/development.
Initialization
- Read `.codex/skills/registry.json` to determine:
  - `domain`: research | unsloth | cuda
  - `paths.reports`: where to find experiment/benchmark reports
  - `paths.experiment_log`: path to experiment log
  - `paths.troubleshooting`: path to troubleshooting guide
  - `paths.templates`: path to templates directory
- Adapt behavior based on domain (see Domain-Specific Behavior below).
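A minimal sketch of the initialization step, assuming `registry.json` is a JSON object with a top-level `domain` string and a `paths` object holding the four keys listed above. That layout is inferred from the dotted names and is not confirmed by the source; `parse_registry` and `load_registry` are hypothetical helper names.

```python
import json
from pathlib import Path

VALID_DOMAINS = {"research", "unsloth", "cuda"}
REQUIRED_PATHS = ("reports", "experiment_log", "troubleshooting", "templates")


def parse_registry(text: str) -> dict:
    """Parse registry JSON text and check the fields this skill relies on.

    Assumed layout: {"domain": "...", "paths": {"reports": "...", ...}}.
    """
    registry = json.loads(text)
    domain = registry.get("domain")
    if domain not in VALID_DOMAINS:
        raise ValueError(f"unsupported domain: {domain!r}")
    paths = registry.get("paths", {})
    missing = [key for key in REQUIRED_PATHS if key not in paths]
    if missing:
        raise ValueError(f"registry is missing paths: {missing}")
    return {
        "domain": domain,
        "paths": {key: Path(paths[key]) for key in REQUIRED_PATHS},
    }


def load_registry(registry_file: str = ".codex/skills/registry.json") -> dict:
    """Read the registry from disk and validate it."""
    return parse_registry(Path(registry_file).read_text(encoding="utf-8"))
```

Failing early on an unknown domain or a missing path keeps the later steps from silently reading the wrong files.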
Behavior
- Select inputs
  - Use the user's description to identify relevant:
    - Reports from the `paths.reports` directory
    - Sections of `paths.experiment_log`
  - If ambiguous, list candidate reports and ask the user to choose.
- Summarize findings
  - For each report, extract:
    - Setup and configuration
    - Key parameters/settings
    - Metrics and results
    - What worked (successes)
    - What failed (with reasons)
  - Write a markdown summary with:
    - "What we tried"
    - "Key findings"
    - "What failed"
    - "Open questions"
- Update troubleshooting (if needed)
  - If experiments reveal new error patterns and fixes:
    - Propose new entries for `paths.troubleshooting`
    - Use the template from `templates/references/troubleshooting-entry-template.md`
    - Ask the user for confirmation before editing.
- Propose or update result skills
  - Decide what result skills should capture these findings.
  - For each skill:
    - If new: start from `templates/skills/result-skill-template.md`
    - If existing: identify which sections to update
  - Draft SKILL.md content including:
    - General description and context
    - When to apply this knowledge
    - Results summary with concrete numbers
    - Recommended practice
    - Failure modes to avoid
  - Use domain-appropriate terminology and focus areas.
- Ask before writing
  - Present the proposed skill changes.
  - Only create or modify files under `.codex/skills/` with user approval.
- Log the retrospective
  - Append a summarized entry to `paths.experiment_log`
  - Example: "2025-01-12 – Retrospective on LoRA rank experiments"
  - Include a short "General description" line for context.
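The "Select inputs", "Summarize findings", and "Log the retrospective" steps above could be sketched as three small helpers. This is an illustration under stated assumptions, not the skill's actual implementation: the function names are hypothetical, keyword overlap is a stand-in heuristic for matching reports to the user's description, and the `## Retrospective:` heading layout is borrowed from the example output later in this document.

```python
import re
from datetime import date

SUMMARY_HEADINGS = ("What we tried", "Key findings", "What failed", "Open questions")


def rank_candidates(description: str, report_names: list[str]) -> list[str]:
    """Select inputs: names sharing a keyword with the description, best first."""
    keywords = set(re.findall(r"[a-z0-9]+", description.lower()))
    scored = []
    for name in report_names:
        overlap = len(keywords & set(re.findall(r"[a-z0-9]+", name.lower())))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


def render_summary(title: str, sections: dict[str, list[str]]) -> str:
    """Summarize findings: markdown summary; headings with no items are omitted."""
    lines = [f"## Retrospective: {title}"]
    for heading in SUMMARY_HEADINGS:
        items = sections.get(heading, [])
        if items:
            lines.append(f"### {heading}")
            lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


def format_log_entry(topic: str, description: str, on: date) -> str:
    """Log the retrospective: dated entry plus a 'General description' line."""
    return (
        f"{on.isoformat()} – Retrospective on {topic}\n"
        f"General description: {description}\n"
    )
```

When `rank_candidates` returns more than one name, the skill would list them and ask the user to choose, as the first step requires. The string from `format_log_entry` is meant to be appended to the file at `paths.experiment_log`.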
Domain-Specific Behavior
Research Domain
When `domain: research`:
What to extract from reports:
- Model architecture details
- Training hyperparameters (lr, batch_size, epochs, warmup)
- Dataset configurations and mixtures
- Evaluation metrics (accuracy, loss, perplexity)
- Training dynamics (convergence speed, stability)
Result skill focus:
- Hyperparameter recommendations for specific tasks
- Dataset mixture recipes
- Model architecture insights
- Training tips and tricks
Skill naming convention:
`{task}-{finding}`, e.g., `colbert-chunking-optimal`, `gpt2-lr-schedule`
Unsloth Domain
When `domain: unsloth`:
What to extract from reports:
- LoRA configuration (rank, alpha, target_modules)
- Quantization settings
- Memory usage and batch sizes achieved
- Fine-tuning duration and throughput
- Model-specific quirks
Result skill focus:
- Optimal LoRA configurations for model families
- Memory-efficient training recipes
- Quantization tradeoffs
- Common fine-tuning pitfalls
Skill naming convention:
`{model}-{config}`, e.g., `llama3-lora-optimal`, `mistral-4bit-recipe`
CUDA Domain
When `domain: cuda`:
What to extract from reports:
- Kernel configurations (block sizes, grid dims)
- Memory access patterns
- Bandwidth and FLOPS achieved
- Occupancy and register usage
- Profiling metrics (from nsight/ncu)
Result skill focus:
- Optimal tiling strategies for operations
- Memory coalescing patterns
- Warp-level optimization techniques
- Triton autotuning configurations
Skill naming convention:
`{operation}-{optimization}`, e.g., `softmax-online`, `matmul-tiled`, `attention-flash`
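The naming conventions in the three domain sections share one shape: two lowercase, hyphen-separated parts. A minimal sketch of a helper that builds such names follows. `slugify` and `skill_name` are hypothetical, and the rule of collapsing non-alphanumeric characters into hyphens is an assumption read off the examples rather than a stated requirement.

```python
import re

# Labels for the two name parts, taken from each domain's convention above.
NAME_PARTS = {
    "research": ("task", "finding"),
    "unsloth": ("model", "config"),
    "cuda": ("operation", "optimization"),
}


def slugify(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def skill_name(domain: str, first: str, second: str) -> str:
    """Join two slugified parts according to the domain's convention."""
    if domain not in NAME_PARTS:
        raise ValueError(f"unknown domain: {domain!r}")
    slugs = []
    for label, value in zip(NAME_PARTS[domain], (first, second)):
        slug = slugify(value)
        if not slug:
            raise ValueError(f"{label} must contain letters or digits")
        slugs.append(slug)
    return "-".join(slugs)
```

For instance, `skill_name("unsloth", "Llama3", "LoRA optimal")` yields `llama3-lora-optimal`, matching the example given for the Unsloth domain.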
Result Skill Template
The generated result skill should follow this structure:
---
name: {skill-name}
description: >
  {One-line description with trigger conditions}
  Use when: {specific scenarios}
metadata:
  short-description: "{Brief tagline}"
  tags:
    - {tag1}
    - {tag2}
  domain: {research|unsloth|cuda}
  created: {YYYY-MM-DD}
  author: {name}
---
# {Skill Name}
## General Description
{2-3 sentences on what this skill captures and why it matters}
## When to Apply
Use this knowledge when:
- {Condition 1}
- {Condition 2}
## Results Summary
| Metric | Value | Notes |
|--------|-------|-------|
| {metric1} | {value1} | {notes1} |
## Recommended Practice
{Concrete, actionable recommendations with specific values}
## Failure Modes
| What Failed | Why | Lesson |
|-------------|-----|--------|
| {attempt1} | {reason1} | {lesson1} |
## Configuration
{Copy-paste ready configuration, if applicable}
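The "Results Summary" and "Failure Modes" sections of the template are both plain markdown tables, so one helper can fill either from extracted rows. The sketch below is illustrative only: `markdown_table` is a hypothetical name, and the cells are assumed to contain no `|` characters.

```python
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render a markdown table in the same style as the template's tables."""
    lines = [
        "| " + " | ".join(headers) + " |",
        # Separator cells are as wide as the header text plus its two spaces.
        "|" + "|".join("-" * (len(header) + 2) for header in headers) + "|",
    ]
    for row in rows:
        if len(row) != len(headers):
            raise ValueError(f"expected {len(headers)} cells, got {len(row)}")
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

Calling `markdown_table(["What Failed", "Why", "Lesson"], rows)` with one list per failed attempt produces the "Failure Modes" table. Rows with the wrong number of cells are rejected instead of being rendered as a misaligned table.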
Example Output
Research Retrospective
## Retrospective: Attention Head Experiments (Jan 2025)
### What we tried
- Varied attention heads from 4 to 12 on GPT-2 small architecture
- Fixed: lr=1e-4, batch_size=32, 10 epochs
### Key findings
- 6 heads achieved 91.5% accuracy (vs 92% baseline with 8 heads)
- 4 heads dropped to 87% - too aggressive
- Wider FFN (4096) partially compensated for fewer heads
### What failed
- 4 heads without FFN compensation: 87% accuracy
- 12 heads: no improvement, just slower training
### Open questions
- Would 6 heads + deeper network work better?
- Do these results hold at larger model scales?
---
**Proposed skill:** `attention-head-scaling`
Unsloth Retrospective
## Retrospective: Llama-3 Fine-tuning (Jan 2025)
### What we tried
- LoRA ranks: 8, 16, 32 on Llama-3 8B
- Quantization: 4-bit vs 8-bit
- Gradient checkpointing variations
### Key findings
- rank=16 + 4-bit optimal for A100 40GB
- rank=32 needed CPU offload, 2x slower
- 8-bit gave marginal quality improvement, not worth memory cost
### What failed
- rank=8: underfitting on complex tasks
- Full fine-tune: OOM even with offload
---
**Proposed skill:** `llama3-lora-optimal`
CUDA Retrospective
## Retrospective: Softmax Kernel Optimization (Jan 2025)
### What we tried
- 1D tiling (baseline)
- 2D tiling with various block sizes
- Warp-level reduction
- Online softmax algorithm
### Key findings
- 2D tiling (64x64) achieved 95% bandwidth utilization
- Online softmax 1.5x faster for attention fusion
- Warp shuffles eliminated shared memory bank conflicts
### What failed
- BLOCK_M=128: register spilling, 30% slowdown
- Naive reduction: bank conflicts killed performance
---
**Proposed skill:** `softmax-online`