Agent Skill
2/7/2026

root-cause-tracing

Trace execution errors to their original triggers using systematic debugging techniques. Use when debugging complex issues, investigating production errors, or analyzing failure chains.

allanninal
npx skills add allanninal/claude-code-skills

SKILL.md


name: root-cause-tracing
description: Trace execution errors to their original triggers using systematic debugging techniques. Use when debugging complex issues, investigating production errors, or analyzing failure chains.

Root Cause Tracing

When to Use This Skill

  • Investigating production errors
  • Debugging complex multi-step failures
  • Analyzing error chains and cascading failures
  • Understanding why a specific state occurred
  • Post-mortem analysis of incidents

Tracing Methodology

1. Start from the Symptom

## Error Chain Template

SYMPTOM: [What the user/system reported]
↓
IMMEDIATE CAUSE: [Direct technical cause]
↓
CONTRIBUTING FACTOR: [What enabled the immediate cause]
↓
ROOT CAUSE: [The fundamental issue to fix]

Example Trace

SYMPTOM: User sees "500 Internal Server Error"
↓
IMMEDIATE CAUSE: Unhandled null pointer exception in UserService.getProfile()
↓
CONTRIBUTING FACTOR: Database returned null for user that should exist
↓
ROOT CAUSE: Race condition during user registration - DB write not committed before redirect
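
The same chain can be kept as structured data so it can be attached to a ticket or an incident log. This is a minimal sketch, assuming a hypothetical `ErrorChain` dataclass that is not part of this skill; its field names simply mirror the template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErrorChain:
    """Hypothetical container mirroring the error chain template."""
    symptom: str
    immediate_cause: str
    root_cause: str
    contributing_factors: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the chain in the same top-down layout as the template."""
        links = [
            f"SYMPTOM: {self.symptom}",
            f"IMMEDIATE CAUSE: {self.immediate_cause}",
        ]
        links += [f"CONTRIBUTING FACTOR: {factor}" for factor in self.contributing_factors]
        links.append(f"ROOT CAUSE: {self.root_cause}")
        return "\n↓\n".join(links)
```

Allowing several contributing factors is a design choice of the sketch; real incidents often have more than one.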

Trace Techniques

Stack Trace Analysis

# Given this stack trace:
# Traceback (most recent call last):
#   File "api/handlers.py", line 45, in get_user
#     profile = user_service.get_profile(user_id)
#   File "services/user.py", line 23, in get_profile
#     return self.repo.find(user_id).to_dict()
#   File "models/user.py", line 67, in to_dict
#     'email': self.email.lower()
# AttributeError: 'NoneType' object has no attribute 'lower'

# Trace backwards:
# 1. self.email is None (immediate cause)
# 2. User model was created without email validation
# 3. API endpoint doesn't validate email before save
# 4. ROOT CAUSE: Missing input validation
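
Reading a trace from the bottom up can also be done programmatically with the standard `traceback` module. The sketch below is illustrative only: `frames_innermost_first` and the tiny `User` class are hypothetical stand-ins for the code in the stack trace above.

```python
import traceback


def frames_innermost_first(exc):
    """Return (file, line, function, source) tuples, deepest frame first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [(f.filename, f.lineno, f.name, f.line) for f in reversed(frames)]


class User:
    """Hypothetical model that reproduces the failure from the trace."""

    def __init__(self, email):
        self.email = email  # no validation: this is where None slips in

    def to_dict(self):
        return {"email": self.email.lower()}


try:
    User(email=None).to_dict()
except AttributeError as exc:
    # The first line printed is where the error surfaced; each following
    # line is one step closer to the code that let the bad value in.
    for filename, lineno, func, source in frames_innermost_first(exc):
        print(f"{func} ({filename}:{lineno}): {source}")
```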

Log Correlation

# Find related logs by request ID
grep "req_abc123" /var/log/app/*.log | sort -t: -k2

# Timeline reconstruction
grep -h "2024-01-15T10:3" error.log access.log | sort

# Find first occurrence of error pattern
grep -n "NullPointerException" app.log | head -1
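
When the logs are already loaded in memory, the same correlation can be done in Python. This is a sketch under stated assumptions: each line is assumed to start with an ISO-8601 timestamp followed by the message, and `correlate` is a hypothetical helper, not an existing tool.

```python
import re
from datetime import datetime

# Assumed line format: "<ISO-8601 timestamp> <message>"; adjust to your logs.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+(.*)$")


def correlate(lines_by_source, request_id):
    """Merge lines mentioning request_id from several sources into one timeline."""
    events = []
    for source, lines in lines_by_source.items():
        for line in lines:
            match = LINE_RE.match(line)
            if match and request_id in match.group(2):
                timestamp = datetime.fromisoformat(match.group(1))
                events.append((timestamp, source, match.group(2)))
    return sorted(events)  # tuples sort by timestamp first
```

The first entry of the returned list is the earliest event for that request, which is usually the best place to start tracing.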

State Inspection

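# Add trace points to understand state flow
import logging

logger = logging.getLogger(__name__)

# validate_order, calculate_totals and save_order are the application's own steps
def process_order(order):
    # vars(obj) returns obj.__dict__; %s arguments skip string formatting when DEBUG is off
    logger.debug("[TRACE] Input state: %s", vars(order))

    validated = validate_order(order)
    logger.debug("[TRACE] After validation: %s", vars(validated))

    calculated = calculate_totals(validated)
    logger.debug("[TRACE] After calculation: %s", vars(calculated))

    saved = save_order(calculated)
    logger.debug("[TRACE] After save: %s", vars(saved))

    return saved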
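
Trace points can also be attached without editing each function body. This is a minimal sketch, assuming a hypothetical `traced` decorator; it logs arguments and return values at DEBUG level and leaves the wrapped function's behavior unchanged.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def traced(func):
    """Log the inputs and the result of every call to func at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("[TRACE] %s args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.debug("[TRACE] %s returned %r", func.__name__, result)
        return result
    return wrapper


@traced
def apply_discount(total, percent):
    """Hypothetical step used only to demonstrate the decorator."""
    return total * (100 - percent) / 100
```

Removing the decorator later removes the trace points in one place, which keeps temporary instrumentation out of the final fix.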

Debugging Patterns

Binary Search Debugging

# When a long process fails somewhere, add checkpoints between stages.
# With many stages, check the midpoint first and halve the range each time.

def long_process(data):
    step1_result = step1(data)
    print(f"CHECKPOINT 1: {step1_result is not None}")  # Pass

    step2_result = step2(step1_result)
    print(f"CHECKPOINT 2: {step2_result is not None}")  # Pass

    step3_result = step3(step2_result)
    print(f"CHECKPOINT 3: {step3_result is not None}")  # FAIL: the fault is in step3 or in its input

    step4_result = step4(step3_result)
    # ...
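
For a pipeline with many steps, the narrowing itself can be automated as a binary search. The sketch below rests on three assumptions: the steps are deterministic and safe to re-run, the input is valid, and once an intermediate result is invalid it stays invalid. `first_bad_step` and `run_prefix` are hypothetical helpers.

```python
from typing import Any, Callable, Optional, Sequence

Step = Callable[[Any], Any]


def run_prefix(steps: Sequence[Step], data: Any, count: int) -> Any:
    """Run only the first `count` steps and return the intermediate result."""
    for step in steps[:count]:
        data = step(data)
    return data


def first_bad_step(steps: Sequence[Step], data: Any,
                   is_valid: Callable[[Any], bool]) -> Optional[int]:
    """Return the index of the first step whose output is invalid, or None."""
    if not is_valid(data):
        raise ValueError("input is already invalid; nothing to bisect")
    lo, hi = 0, len(steps)
    if is_valid(run_prefix(steps, data, hi)):
        return None  # the whole pipeline produces a valid result
    # Invariant: output after `lo` steps is valid, after `hi` steps invalid.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_valid(run_prefix(steps, data, mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: None, lambda x: x]
print(first_bad_step(steps, 1, lambda v: v is not None))  # 2
```

Re-running the prefix for every probe trades extra execution time for simplicity; caching intermediate results would avoid that when steps are expensive.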

Delta Debugging

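# Find which commit introduced a bug
git bisect start
git bisect bad HEAD        # the current commit shows the bug
git bisect good v1.0.0     # last version known to work
# Git checks out a commit halfway between the two; test it, then mark it:
git bisect good            # or: git bisect bad
# Repeat until Git reports the first bad commit, then return to your branch:
git bisect reset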
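
The marking step can be automated with `git bisect run`, which runs a script at each commit and reads its exit code: 0 means good, 125 means the commit cannot be tested and should be skipped, and any other code from 1 to 127 means bad. The sketch below maps a build command and a test command onto that convention; `classify`, the script name and the `make` targets are assumptions for illustration.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(build_cmd, test_cmd):
    """Map build and test results onto git bisect's exit-code convention."""
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP  # commit cannot be built, so let bisect pick another one
    return GOOD if subprocess.run(test_cmd).returncode == 0 else BAD


def main():
    # Hypothetical commands; substitute your project's build and test.
    sys.exit(classify(["make", "build"], ["make", "test"]))
```

Saved as a hypothetical `bisect_check.py` that calls `main()` at the bottom, this could be driven with `git bisect run python bisect_check.py` after the good and bad commits have been marked.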

Rubber Duck Tracing

## Explain the flow out loud:

1. User clicks "Submit Order"
2. Frontend sends POST to /api/orders
3. Backend validates the payload... WAIT
   - Does it validate the discount code?
   - What if discount code is empty string vs null?
4. Found it: Empty string "" passes validation but fails lookup
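
One way to close the gap found in step 4 is to normalize the value at the boundary, so that empty, whitespace-only and missing codes all mean "no code". This is a minimal sketch with a hypothetical `normalize_discount_code` helper; the real payload handling is not shown in the source.

```python
from typing import Optional


def normalize_discount_code(raw: Optional[str]) -> Optional[str]:
    """Collapse None, "" and whitespace-only input into a single 'no code' value."""
    if raw is None:
        return None
    code = raw.strip()
    return code or None  # "" becomes None, so the lookup is skipped entirely
```

Callers then only need one check (`if code is None`) before the lookup, instead of guarding against each empty variant separately.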

Error Pattern Recognition

Null/Undefined Errors

SYMPTOM: Cannot read property 'X' of null/undefined

TRACE QUESTIONS:
1. What variable is null?
2. Where was it supposed to be set?
3. What condition would leave it unset?
4. Is there a race condition?
5. Is there a missing await/callback?

COMMON ROOT CAUSES:
- Async operation not awaited
- Conditional initialization with edge case
- Object destructuring with missing keys
- Database query returning no results

Race Conditions

SYMPTOM: Intermittent failures, works on retry

TRACE QUESTIONS:
1. Are there multiple async operations?
2. Is there shared state?
3. Are there assumptions about order of execution?
4. Are database transactions being used?

COMMON ROOT CAUSES:
- Missing database transaction
- Read-after-write without waiting
- Multiple requests modifying same resource
- Cache invalidation timing
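
The "multiple requests modifying same resource" case comes down to an unguarded read-modify-write. The sketch below shows the pattern and one possible fix in a single process using `threading.Lock`; the `Counter` class and `run` helper are hypothetical, and a database-backed system would reach for transactions or row locks instead.

```python
import threading


class Counter:
    """Shared state updated by several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        current = self.value      # read
        self.value = current + 1  # write: another thread may have written in between

    def increment_safe(self):
        with self._lock:          # read and write happen as one unit
            self.value += 1


def run(worker, threads=8, iterations=10_000):
    """Call worker `iterations` times from each of `threads` threads."""
    def loop():
        for _ in range(iterations):
            worker()

    pool = [threading.Thread(target=loop) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
```

With `increment_safe` the final value always equals `threads * iterations`. With `increment_unsafe` it may come up short on some runs and not on others, which is the intermittent, works-on-retry symptom described above.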

Resource Exhaustion

SYMPTOM: System slows/crashes under load

TRACE QUESTIONS:
1. What resources are being consumed?
2. Are connections being closed?
3. Are there memory leaks?
4. Is there unbounded growth?

COMMON ROOT CAUSES:
- Database connection pool exhaustion
- Memory leaks in long-running processes
- Unbounded queues or caches
- Missing cleanup in error paths
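
Missing cleanup in error paths is often the easiest of these to demonstrate and fix. The sketch below uses a hypothetical fixed-size `Pool` as a stand-in for a real connection pool, and a context manager that returns the connection even when the body raises.

```python
from contextlib import contextmanager


class Pool:
    """Hypothetical fixed-size pool; stands in for a real connection pool."""

    def __init__(self, size):
        self.free = list(range(size))

    def acquire(self):
        if not self.free:
            raise RuntimeError("pool exhausted")
        return self.free.pop()

    def release(self, conn):
        self.free.append(conn)


@contextmanager
def borrowed(pool):
    """Yield a connection and always hand it back, even if the body raises."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)
```

Without the `finally`, every exception inside the `with` body would leak one connection, and the pool would be exhausted after `size` failures.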

Systematic Trace Template

## Root Cause Analysis: [Issue Title]

### 1. Incident Summary
- **Date/Time**:
- **Duration**:
- **Impact**:
- **Detected by**:

### 2. Timeline
| Time | Event |
|------|-------|
| 10:00 | First error logged |
| 10:05 | Alert triggered |
| 10:10 | Investigation started |
| 10:30 | Root cause identified |
| 10:45 | Fix deployed |

### 3. Error Chain

[Symptom]
↓
[Immediate Cause]
↓
[Contributing Factor]
↓
[Root Cause]

### 4. Evidence
- Log snippets
- Stack traces
- Metrics/graphs
- Reproduction steps

### 5. Root Cause
[Clear statement of the fundamental issue]

### 6. Fix
[What was done to resolve]

### 7. Prevention
- [ ] Add validation for X
- [ ] Add monitoring for Y
- [ ] Update documentation for Z

Tools for Tracing

Distributed Tracing

# Using OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_request(request):
    with tracer.start_as_current_span("process_request") as span:
        span.set_attribute("user_id", request.user_id)

        with tracer.start_as_current_span("validate"):
            validate(request)

        with tracer.start_as_current_span("process"):
            result = process(request)
            span.set_attribute("result_count", len(result))

        return result

Error Aggregation Query

-- Find error patterns
SELECT
  error_type,
  error_message,
  COUNT(*) as occurrences,
  MIN(timestamp) as first_seen,
  MAX(timestamp) as last_seen
FROM error_logs
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY error_type, error_message
ORDER BY occurrences DESC
LIMIT 20;

Checklist

  • Capture exact error message and stack trace
  • Identify timestamp and affected users/requests
  • Gather relevant logs around the timeframe
  • Reproduce in isolation if possible
  • Trace backwards from symptom to root
  • Document the error chain
  • Identify fix AND prevention
  • Create regression test