java-performance-profiling
A reproducible Java performance profiling workflow using JFR and async-profiler concepts: capture evidence (CPU/alloc/locks), analyze flamegraphs, form hypotheses, ship fixes, and verify regressions. Use when CPU is high or latency spikes.
SKILL.md
| Name | java-performance-profiling |
| Description | A reproducible Java performance profiling workflow using JFR and async-profiler concepts: capture evidence (CPU/alloc/locks), analyze flamegraphs, form hypotheses, ship fixes, and verify regressions. Use when CPU is high or latency spikes. |
name: java-performance-profiling description: A reproducible Java performance profiling workflow using JFR and async-profiler concepts: capture evidence (CPU/alloc/locks), analyze flamegraphs, form hypotheses, ship fixes, and verify regressions. Use when CPU is high or latency spikes. license: CC-BY-4.0 compatibility: "JDK 17+ (recommended JDK 21). Production-safe approach uses JFR. async-profiler usage depends on deployment permissions." metadata: owner: "backend" version: "1.0" tags: [java, performance, profiling, jfr, async-profiler, flamegraph, latency]
Java Performance Profiling (JFR + async-profiler workflow)
Intent
This skill gives you a repeatable profiling loop:
- Confirm the symptom and isolate a reproducible scenario
- Capture evidence with the lowest-risk tool first (JFR)
- Analyze: CPU hot paths, allocation hotspots, lock contention, thread states
- Make a small fix
- Re-measure and produce a concise report
Goal: ship performance fixes with confidence, not guesswork.
Scope
In scope
- CPU profiling (sampling)
- Allocation profiling (where memory is allocated)
- Lock contention analysis
- Thread state analysis (runnable/blocked/waiting)
- JFR capture and analysis
- async-profiler-style flamegraph workflow (conceptual + practical guidance)
- Verification via A/B load tests
Out of scope
- Microbenchmark-only optimization (use JMH skill if needed)
- OS kernel perf tuning
- Full distributed tracing diagnosis (covered by observability skill)
When to use
Triggers:
- CPU high in production
- p95/p99 latency spikes
- throughput regression after release
- frequent GC due to allocation storms
- lock contention symptoms (threads blocked, long waits)
Required inputs (context to attach in Cursor)
- The hot endpoints / services
- Recent changes (PR diff)
- Baseline metrics dashboards (latency, CPU, GC, error rate)
- Deployment constraints:
- can you run JFR in prod?
- can you attach profilers?
- can you reproduce in staging?
Safety-first tool selection
Use tools in this order:
- Metrics + logs + traces to locate where to profile
- JFR for production-safe profiling
- async-profiler for deeper CPU/alloc/lock signals when allowed
- Heap dump / GC deep dive if memory-leak suspected
Procedure (step-by-step)
Step 1 — Write a profiling question (avoid fishing)
Examples:
- “Which methods consume the most CPU during /search at p99?”
- “Where are we allocating the most objects per request?”
- “Which lock is contended during peak traffic?”
Deliverable: a single profiling question + success metric.
Step 2 — Reproduce and minimize noise
- Use a stable load scenario (same input size, same dataset, same concurrency)
- Pin down:
- request path
- concurrency level
- duration
- Confirm the symptom exists under the scenario.
Deliverable: “Repro recipe” (commands, parameters, expected symptom).
Step 3 — Capture JFR (preferred baseline)
Start a bounded recording:
- short duration (30–300s)
- include CPU, allocation, locks, thread events
Capture methods vary:
jcmd <pid> JFR.start ...jfrtool (depending on environment)- JVM startup flags in a controlled environment (not first choice)
Always store:
- recording file name
- time window
- workload parameters
- build/version hash
Deliverable: a JFR recording + metadata.
Step 4 — Analyze JFR: a structured checklist
When opening JFR (JMC or other tooling), check:
CPU / Execution
- hottest methods and call stacks
- suspicious loops
- high cost serialization/deserialization
- regex backtracking hotspots
- logging overhead (string concat, JSON encode)
Allocation
- top allocation sites (per request)
- big object arrays, large maps/lists
- repeated parsing of the same data
- missing caching for derived results
Locks
- monitors and contended locks
- synchronized blocks with I/O inside
- thread parking patterns
Thread states
- many threads blocked on DB pool?
- many runnable threads competing for CPU?
- many waiting threads due to missing timeouts?
Deliverable: “Top 3 findings” with evidence pointers (screen captures or notes).
Step 5 — Use async-profiler-style flamegraphs when permitted
If you can attach a profiler in staging or a safe prod window:
- Generate a CPU flamegraph for the same scenario
- Optionally allocation or lock flamegraphs (depending on tool support)
Rules:
- keep capture windows short
- run during controlled load
- avoid collecting sensitive data
- get explicit approval for production attaches
Deliverable: flamegraph(s) + short interpretation.
Step 6 — Convert findings into hypotheses and fixes
For each finding, produce:
- Hypothesis: “X is expensive because Y”
- Fix idea: “Change A to B”
- Expected impact: “Reduce allocations by ~N per request” or “Reduce CPU in method M”
- Risk: correctness risk and rollback plan
Prefer small fixes:
- avoid “rewrite everything”
- isolate changes
- add micro-level regression tests when appropriate
Deliverable: a small PR with focused perf fix.
Step 7 — Verify with the same workload and report
Re-run the same test:
- baseline vs new
- compare p95/p99, throughput, CPU, allocation rate, GC time, errors
Report template:
- Context (service/version)
- Workload
- Baseline metrics
- Changes
- After metrics
- Evidence (JFR/flamegraph deltas)
- Risk / rollback plan
Deliverable: perf report + artifacts in references/ folder (or attachment store).
Outputs / Artifacts
- Repro recipe
- JFR recording + metadata
- Optional flamegraphs
- “Top 3 findings” summary
- PR with measured improvements
- Report template filled
Definition of Done (DoD)
- Profiling question defined and answered with evidence
- Recording captured and stored with metadata
- Fix is small and has rollback plan
- Re-measurement confirms improvement or documents no-change
- No new correctness regressions (tests passed)
- Report written (baseline vs after)
Common failure modes & fixes
-
Symptom: flamegraph shows “native” or “unknown”
- Cause: missing symbols, container restrictions
- Fix: use JFR; ensure correct permissions; profile in staging
-
Symptom: results not reproducible
- Cause: workload not controlled, noisy environment
- Fix: stabilize inputs, duration, concurrency; repeat runs
-
Symptom: you optimize the wrong thing
- Cause: not tying profiling to p95/p99 paths
- Fix: start from metrics/traces; profile where pain exists
Guardrails (What NOT to do)
- Do NOT profile production with high-overhead tooling without explicit approval.
- Do NOT “optimize” without measurements.
- Do NOT micro-optimize before fixing algorithmic or I/O bottlenecks.
- Do NOT commit profiler configs that leak secrets or sensitive paths.
References (primary)
- Java Flight Recorder (JDK tooling): https://docs.oracle.com/en/java/javase/21/jfapi/using-java-flight-recorder.html
jcmdand diagnostics overview: https://docs.oracle.com/en/java/javase/21/troubleshoot/diagnostic-tools.htmljfrcommand (Oracle tool reference): https://docs.oracle.com/en/java/javase/21/docs/specs/man/jfr.html- Flight Recorder (OpenJDK JEP 328): https://openjdk.org/jeps/328
- async-profiler (project): https://github.com/jvm-profiling-tools/async-profiler