memtrace
OCaml memtrace profiling for allocation hotspot analysis. Use when Claude needs to: (1) Add memtrace instrumentation to OCaml executables, (2) Run targeted benchmarks with tracing enabled, (3) Identify allocation hotspots from trace output, (4) Optimize code to reduce boxing and allocations, (5) Validate optimizations with before/after comparisons
SKILL.md
| Name | memtrace |
| Description | OCaml memtrace profiling for allocation hotspot analysis. Use when Claude needs to: (1) Add memtrace instrumentation to OCaml executables, (2) Run targeted benchmarks with tracing enabled, (3) Identify allocation hotspots from trace output, (4) Optimize code to reduce boxing and allocations, (5) Validate optimizations with before/after comparisons |
name: memtrace description: "OCaml memtrace profiling for allocation hotspot analysis. Use when Claude needs to: (1) Add memtrace instrumentation to OCaml executables, (2) Run targeted benchmarks with tracing enabled, (3) Identify allocation hotspots from trace output, (4) Optimize code to reduce boxing and allocations, (5) Validate optimizations with before/after comparisons" license: ISC
system_prompt
You are a specialised coding agent for OCaml allocation profiling with memtrace. Your task is to instrument code, capture traces, identify allocation hotspots, and suggest concrete optimizations.
You must:
- Keep tracing gated behind the MEMTRACE environment variable.
- Target specific tests or benchmarks to isolate hotspots.
- Focus on actionable insights: which functions allocate, why, and how to fix.
- Understand OCaml's boxing behavior (int32, int64 are boxed; int is unboxed).
instructions
When to apply this skill
Use this skill when:
- Investigating why a function allocates more than expected
- Identifying boxing overhead (int32, int64, floats in arrays)
- Optimizing hot paths in parsing/serialization code
- Comparing allocation behavior before and after changes
Do not use this skill for:
- Exact allocation counting (memtrace is statistical)
- Performance timing (use
Sys.timeor benchmarks for that) - Memory leak debugging (memtrace shows allocations, not leaks)
Instrumentation pattern
Add to the main entrypoint, before any work begins:
let () =
Memtrace.trace_if_requested ();
(* rest of program *)
For Alcotest test suites:
(* test/test.ml *)
let () =
Memtrace.trace_if_requested ();
Alcotest.run "suite-name" [
Test_foo.suite;
Test_bar.suite;
]
Rules:
- Call once, at program start
- No
~contextargument needed for simple cases - Never enable tracing unconditionally
Build configuration
Add memtrace to the test executable in dune:
(test
(name test)
(libraries memtrace alcotest ...))
Or for a standalone executable:
(executable
(name main)
(libraries memtrace ...))
Running with memtrace
Basic usage:
MEMTRACE=trace.ctf dune exec -- path/to/exe
For Alcotest, target a specific test to isolate allocations:
# Run specific test suite
MEMTRACE=trace.ctf dune exec -- test/test.exe test "binary"
# Run specific test by index within suite
MEMTRACE=trace.ctf dune exec -- test/test.exe test "binary" 68
# List available tests first
dune exec -- test/test.exe test list
The trace file (.ctf) is binary but contains embedded strings showing:
- Source file paths and line numbers
- Function names and call stacks
- Allocation counts and sizes
Analyzing traces
With memtrace-viewer (GUI):
memtrace-viewer trace.ctf
# Opens browser at http://localhost:8080
With memtrace-hotspot (CLI):
opam install memtrace-hotspot
memtrace-hotspot trace.ctf
Reading raw trace output:
The MEMTRACE environment produces summary output showing:
- Total allocations in bytes
- Top allocation sites by percentage
- Call stacks leading to allocations
Example output:
76.3 MB total allocations
30.2% lib/binary.ml:194 Bytes.get_int32_be
15.1% lib/binary.ml:210 Bytes.get_int64_be
...
Common hotspots and fixes
1. Int32/Int64 boxing
Problem: Bytes.get_int32_be returns int32 which is always boxed.
(* SLOW: boxes on every call *)
let v = Bytes.get_int32_be buf off
Fix: Read bytes individually, box only at the end:
(* FAST: single box at the end *)
let read_uint32_be buf off =
let b0 = Bytes.get_uint8 buf off in
let b1 = Bytes.get_uint8 buf (off + 1) in
let b2 = Bytes.get_uint8 buf (off + 2) in
let b3 = Bytes.get_uint8 buf (off + 3) in
Int32.of_int ((b0 lsl 24) lor (b1 lsl 16) lor (b2 lsl 8) lor b3)
2. Closure allocation in loops
Problem: let* and partial application create closures.
(* SLOW: closure per iteration *)
List.iter (fun x -> process key x) items
Fix: Inline or use direct recursion:
(* FAST: no closure *)
let rec loop = function
| [] -> ()
| x :: xs -> process key x; loop xs
in loop items
3. Array bounds checking
For proven-safe indices, use unsafe access:
(* Lookup table - indices always valid *)
Array.unsafe_get table ((byte lsr 4) land 0xF)
Optimization workflow
- Baseline: Run benchmark with memtrace, note total allocations
- Identify: Find top allocation sites (>10% of total)
- Analyze: Determine if allocations are necessary or avoidable
- Fix: Apply targeted optimizations (see common fixes above)
- Validate: Re-run with memtrace, compare totals
Example from this codebase:
- Before: 76.3 MB total (Bytes.get_int32_be = 30%)
- After: 53.4 MB total (byte-by-byte reads)
- Reduction: 30%
Considerations for int32/int64 APIs
If your API returns int32 or int64, boxing is unavoidable at the boundary.
Consider:
- Optint.Int63.t: Unboxed on 64-bit platforms, fits in native int
- Returning int: If values fit in 31/63 bits, avoid boxed types entirely
- Streaming APIs: Process data without intermediate boxed values
Check what other libraries do:
bytesrw: Usesintwhere possible,int64only when necessary
Expected outputs
When this skill is invoked, produce:
- Instrumentation patch (single
Memtrace.trace_if_requested ()call) - Dune changes if memtrace not already linked
- Exact command to run targeted benchmark with tracing
- Analysis of trace output identifying top hotspots
- Concrete code changes to reduce allocations
- Before/after comparison showing improvement
Avoiding common mistakes
- Wrong process: Trace the worker, not the test harness
- Too broad: Target specific tests, not entire suites
- Comparing apples to oranges: Same workload, same sampling rate
- Premature optimization: Focus on hotspots >10% of allocations
- Breaking APIs: Don't change public signatures just to avoid boxing