Agent Skill
2/7/2026

memtrace

OCaml memtrace profiling for allocation hotspot analysis. Use when Claude needs to: (1) Add memtrace instrumentation to OCaml executables, (2) Run targeted benchmarks with tracing enabled, (3) Identify allocation hotspots from trace output, (4) Optimize code to reduce boxing and allocations, (5) Validate optimizations with before/after comparisons

aresbit
npx skills add aresbit/MateBot

SKILL.md


name: memtrace
description: "OCaml memtrace profiling for allocation hotspot analysis. Use when Claude needs to: (1) Add memtrace instrumentation to OCaml executables, (2) Run targeted benchmarks with tracing enabled, (3) Identify allocation hotspots from trace output, (4) Optimize code to reduce boxing and allocations, (5) Validate optimizations with before/after comparisons"
license: ISC

system_prompt

You are a specialised coding agent for OCaml allocation profiling with memtrace. Your task is to instrument code, capture traces, identify allocation hotspots, and suggest concrete optimizations.

You must:

  • Keep tracing gated behind the MEMTRACE environment variable.
  • Target specific tests or benchmarks to isolate hotspots.
  • Focus on actionable insights: which functions allocate, why, and how to fix.
  • Understand OCaml's boxing behavior (int is always unboxed; int32 and int64 are boxed unless the compiler can keep them unboxed inside a single function).

instructions

When to apply this skill

Use this skill when:

  • Investigating why a function allocates more than expected
  • Identifying boxing overhead (int32, int64, floats in arrays)
  • Optimizing hot paths in parsing/serialization code
  • Comparing allocation behavior before and after changes

Do not use this skill for:

  • Exact allocation counting (memtrace is statistical)
  • Performance timing (use Sys.time or benchmarks for that)
  • Memory leak debugging (this workflow ranks allocation sites by volume; it does not track which objects stay live)
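
To make "statistical" concrete, here is a minimal sketch of how a sample count maps back to an estimated allocation volume. It assumes Poisson-style sampling of allocated words, and the rate of 1e-6 is used only as an illustrative value (check the sampling rate your memtrace version actually uses). The function names are hypothetical and not part of the memtrace API.

```ocaml
(* Each sample stands for roughly [1 / sampling_rate] allocated words. *)
let estimated_words ~samples ~sampling_rate =
  if sampling_rate <= 0.0 then invalid_arg "estimated_words: rate must be positive";
  float_of_int samples /. sampling_rate

(* For Poisson-distributed counts, the relative standard error is about
   1 / sqrt(number of samples). *)
let relative_error ~samples =
  if samples <= 0 then infinity else 1.0 /. sqrt (float_of_int samples)

let () =
  let rate = 1e-6 (* illustrative value, an assumption *) in
  List.iter
    (fun n ->
      Printf.printf "%6d samples ~ %.3g words (+/- %.0f%%)\n" n
        (estimated_words ~samples:n ~sampling_rate:rate)
        (100.0 *. relative_error ~samples:n))
    [ 10; 100; 10_000 ]
```

Sites with only a handful of samples carry large error bars, which is one reason to focus on the largest hotspots.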

Instrumentation pattern

Add to the main entrypoint, before any work begins:

let () =
  Memtrace.trace_if_requested ();
  (* rest of program *)

For Alcotest test suites:

(* test/test.ml *)
let () =
  Memtrace.trace_if_requested ();
  Alcotest.run "suite-name" [
    Test_foo.suite;
    Test_bar.suite;
  ]

Rules:

  • Call once, at program start
  • No ~context argument needed for simple cases
  • Never enable tracing unconditionally

Build configuration

Add memtrace to the test executable in dune:

(test
 (name test)
 (libraries memtrace alcotest ...))

Or for a standalone executable:

(executable
 (name main)
 (libraries memtrace ...))

Running with memtrace

Basic usage:

MEMTRACE=trace.ctf dune exec -- path/to/exe

For Alcotest, target a specific test to isolate allocations:

# Run specific test suite
MEMTRACE=trace.ctf dune exec -- test/test.exe test "binary"

# Run specific test by index within suite
MEMTRACE=trace.ctf dune exec -- test/test.exe test "binary" 68

# List available tests first
dune exec -- test/test.exe list

The trace file (.ctf) is binary but contains embedded strings showing:

  • Source file paths and line numbers
  • Function names and call stacks
  • Allocation counts and sizes

Analyzing traces

With memtrace-viewer (GUI):

memtrace-viewer trace.ctf
# Serves the viewer over HTTP (port 8080 by default); open the printed URL in a browser

With memtrace-hotspot (CLI):

opam install memtrace-hotspot
memtrace-hotspot trace.ctf

Reading raw trace output:

The MEMTRACE environment variable only selects where the trace file is written; the summary comes from the analysis tools, which show:

  • Total allocations in bytes
  • Top allocation sites by percentage
  • Call stacks leading to allocations

Example output:

76.3 MB total allocations
  30.2% lib/binary.ml:194 Bytes.get_int32_be
  15.1% lib/binary.ml:210 Bytes.get_int64_be
  ...
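
As a rough illustration of picking out sites above a threshold, the sketch below parses lines shaped like the example output. The line format "<pct>% <file:line> <function>" is an assumption copied from that example, not a documented output format, and parse_site and hotspots are hypothetical helper names.

```ocaml
(* Parse one "<pct>% <file:line> <function>" line; return None for headers,
   ellipses, or anything else that does not match. *)
let parse_site line =
  try Scanf.sscanf line " %f%% %s %s" (fun pct loc fn -> Some (pct, loc, fn))
  with Scanf.Scan_failure _ | Failure _ | End_of_file -> None

(* Keep only the sites whose share of allocations exceeds [threshold] percent. *)
let hotspots ?(threshold = 10.0) lines =
  List.filter_map parse_site lines
  |> List.filter (fun (pct, _, _) -> pct > threshold)

let () =
  let report =
    [ "76.3 MB total allocations";
      "  30.2% lib/binary.ml:194 Bytes.get_int32_be";
      "  15.1% lib/binary.ml:210 Bytes.get_int64_be";
      "  ..." ]
  in
  List.iter
    (fun (pct, loc, fn) -> Printf.printf "%5.1f%%  %s  %s\n" pct loc fn)
    (hotspots report)
```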

Common hotspots and fixes

1. Int32/Int64 boxing

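Problem: Bytes.get_int32_be returns an int32, which is heap-allocated (boxed) whenever the call is not inlined and unboxed by the compiler.

(* SLOW: typically boxes on every call *)
let v = Bytes.get_int32_be buf off

Fix: Read bytes individually, box only at the end:

(* FAST: single box at the end. Assumes a 64-bit platform, where b0 lsl 24 fits in int. *)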
let read_uint32_be buf off =
  let b0 = Bytes.get_uint8 buf off in
  let b1 = Bytes.get_uint8 buf (off + 1) in
  let b2 = Bytes.get_uint8 buf (off + 2) in
  let b3 = Bytes.get_uint8 buf (off + 3) in
  Int32.of_int ((b0 lsl 24) lor (b1 lsl 16) lor (b2 lsl 8) lor b3)
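
Before swapping readers in a hot path, it is worth checking that the byte-by-byte version returns the same bits as the stdlib call. A minimal sketch, assuming a 64-bit platform; the test buffer is arbitrary:

```ocaml
(* Same reader as above, repeated so this block is self-contained. *)
let read_uint32_be buf off =
  let b0 = Bytes.get_uint8 buf off in
  let b1 = Bytes.get_uint8 buf (off + 1) in
  let b2 = Bytes.get_uint8 buf (off + 2) in
  let b3 = Bytes.get_uint8 buf (off + 3) in
  Int32.of_int ((b0 lsl 24) lor (b1 lsl 16) lor (b2 lsl 8) lor b3)

let () =
  let buf = Bytes.of_string "\xDE\xAD\xBE\xEF\x00\x00\x00\x2A" in
  (* High bit set: Int32.of_int keeps the low 32 bits, matching the stdlib result. *)
  assert (Int32.equal (read_uint32_be buf 0) (Bytes.get_int32_be buf 0));
  assert (Int32.equal (read_uint32_be buf 4) 42l)
```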

2. Closure allocation in loops

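Problem: let* and partial application create closures. A closure that captures local variables is allocated every time the enclosing expression is evaluated.

(* SLOW in a hot path: allocates a closure capturing key on every evaluation *)
List.iter (fun x -> process key x) items

Fix: Inline or use direct recursion, passing captured values as arguments:

(* FAST: loop takes key as an argument, so it captures nothing and needs no closure *)
let rec loop key = function
  | [] -> ()
  | x :: xs -> process key x; loop key xs
in loop key items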
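
For a quick cross-check outside memtrace, the two styles can be compared with Gc.minor_words from the standard library. This is a sketch under assumptions: the summing workload is a hypothetical stand-in for process, and the word counts depend on the compiler, flags (flambda, bytecode vs native), and inlining, so treat them as indicative only.

```ocaml
(* Closure style: the anonymous function captures [total] and [key]. *)
let sum_with_closure key items =
  let total = ref 0 in
  List.iter (fun x -> total := !total + (key * x)) items;
  !total

(* Direct recursion: [loop] receives everything as arguments and captures nothing. *)
let sum_with_loop key items =
  let rec loop key acc = function
    | [] -> acc
    | x :: xs -> loop key (acc + (key * x)) xs
  in
  loop key 0 items

(* Words allocated on the minor heap while running [f]. *)
let minor_words_during f =
  let before = Gc.minor_words () in
  f ();
  Gc.minor_words () -. before

let () =
  let items = List.init 1_000 (fun i -> i) in
  let w_closure = minor_words_during (fun () -> ignore (sum_with_closure 3 items)) in
  let w_loop = minor_words_during (fun () -> ignore (sum_with_loop 3 items)) in
  Printf.printf "closure: %.0f words, loop: %.0f words\n" w_closure w_loop
```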

3. Array bounds checking

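Bounds checks cost CPU time rather than allocations, so removing them will not change memtrace totals. For indices that are provably in range, use unsafe access: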

(* Lookup table - indices always valid *)
Array.unsafe_get table ((byte lsr 4) land 0xF)
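
A slightly fuller sketch of the lookup-table pattern: a 16-entry hex table where masking with 0xF guarantees the index is in range, which is what justifies the unsafe access. The names hex_digits and hex_of_byte are illustrative.

```ocaml
(* 16-entry table; any index produced by [land 0xF] is in 0..15. *)
let hex_digits = Array.init 16 (fun i -> "0123456789abcdef".[i])

(* Encode the low 8 bits of [byte] as two lowercase hex characters. *)
let hex_of_byte byte =
  let b = Bytes.create 2 in
  Bytes.unsafe_set b 0 (Array.unsafe_get hex_digits ((byte lsr 4) land 0xF));
  Bytes.unsafe_set b 1 (Array.unsafe_get hex_digits (byte land 0xF));
  Bytes.unsafe_to_string b

let () = print_endline (hex_of_byte 0xAB)
```

Because the mask is applied at the point of use, even an out-of-range input cannot index outside the table.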

Optimization workflow

  1. Baseline: Run benchmark with memtrace, note total allocations
  2. Identify: Find top allocation sites (>10% of total)
  3. Analyze: Determine if allocations are necessary or avoidable
  4. Fix: Apply targeted optimizations (see common fixes above)
  5. Validate: Re-run with memtrace, compare totals

Example from this codebase:

  • Before: 76.3 MB total (Bytes.get_int32_be = 30%)
  • After: 53.4 MB total (byte-by-byte reads)
  • Reduction: 30%
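
The reduction figure follows directly from the two totals. A small helper for the before/after comparison (reduction_pct is a hypothetical name), using the numbers above:

```ocaml
(* Percentage reduction in total allocations between two runs of the same workload. *)
let reduction_pct ~before ~after =
  if before <= 0.0 then invalid_arg "reduction_pct: before must be positive";
  (before -. after) /. before *. 100.0

let () =
  (* (76.3 - 53.4) / 76.3 is about 0.300 *)
  Printf.printf "Reduction: %.1f%%\n" (reduction_pct ~before:76.3 ~after:53.4)
```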

Considerations for int32/int64 APIs

If your API returns int32 or int64, boxing is unavoidable at the boundary. Consider:

  • Optint.Int63.t: Unboxed on 64-bit platforms, fits in native int
  • Returning int: If values fit in 31/63 bits, avoid boxed types entirely
  • Streaming APIs: Process data without intermediate boxed values
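
As one possible shape for the "returning int" option, the reader below keeps an unsigned 32-bit value in a native int and never produces an int32. It is a sketch that assumes a 64-bit platform (63-bit int); the function name is hypothetical.

```ocaml
(* Read a big-endian unsigned 32-bit value as a plain int (no boxing). *)
let read_u32_be_as_int buf off =
  if Sys.int_size < 33 then failwith "read_u32_be_as_int: needs a 64-bit platform";
  let b0 = Bytes.get_uint8 buf off in
  let b1 = Bytes.get_uint8 buf (off + 1) in
  let b2 = Bytes.get_uint8 buf (off + 2) in
  let b3 = Bytes.get_uint8 buf (off + 3) in
  (b0 lsl 24) lor (b1 lsl 16) lor (b2 lsl 8) lor b3

let () =
  let buf = Bytes.of_string "\xDE\xAD\xBE\xEF" in
  Printf.printf "0x%X\n" (read_u32_be_as_int buf 0)
```

Unlike the int32 version, the result is always non-negative, so callers that relied on signed semantics need adjusting.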

Check what other libraries do:

  • bytesrw: Uses int where possible, int64 only when necessary

Expected outputs

When this skill is invoked, produce:

  1. Instrumentation patch (single Memtrace.trace_if_requested () call)
  2. Dune changes if memtrace not already linked
  3. Exact command to run targeted benchmark with tracing
  4. Analysis of trace output identifying top hotspots
  5. Concrete code changes to reduce allocations
  6. Before/after comparison showing improvement

Avoiding common mistakes

  • Wrong process: Trace the worker, not the test harness
  • Too broad: Target specific tests, not entire suites
  • Comparing apples to oranges: Same workload, same sampling rate
  • Premature optimization: Focus on hotspots >10% of allocations
  • Breaking APIs: Don't change public signatures just to avoid boxing