Agent Skill
2/7/2026

go-performance

Go performance profiling and optimization. Load when analyzing bottlenecks, running benchmarks, profiling CPU/memory, or optimizing hot paths. Triggers: pprof, benchmarks, -bench, -memprofile, -cpuprofile, escape analysis, sync.Pool, allocation reduction, or performance tuning.

d3fvxl
npx skills add d3fvxl/.dotfiles

SKILL.md


Go Performance Profiling

Performance analysis and optimization patterns for Go applications.

When This Skill MUST Be Used

ALWAYS invoke this skill when the user's request involves ANY of these:

  • Running benchmarks (go test -bench)
  • Profiling CPU or memory usage
  • Analyzing pprof output
  • Optimizing hot paths or reducing allocations
  • Escape analysis questions
  • Using sync.Pool or object reuse
  • Memory leak investigation
  • Latency optimization

If you're about to optimize Go code, STOP and use this skill first.

Critical Safety Rules

NEVER:

  • Optimize without benchmarks proving the problem
  • Use sync.Pool for small objects (overhead exceeds benefit)
  • Ignore escape analysis when reducing allocations
  • Profile binaries built with optimizations disabled (-gcflags="all=-N -l"); profile the default optimized build
  • Assume CPU is the bottleneck (often it's memory/IO)

ALWAYS:

  • Benchmark before and after optimization
  • Use go test -benchmem to track allocations
  • Profile production-like workloads
  • Check escape analysis with -gcflags="-m"
  • Consider readability vs performance tradeoff

Quick Reference

| Task | Command |
| --- | --- |
| Run benchmarks | go test -bench=. ./... |
| With memory stats | go test -bench=. -benchmem ./... |
| CPU profile | go test -bench=. -cpuprofile=cpu.prof |
| Memory profile | go test -bench=. -memprofile=mem.prof |
| Escape analysis | go build -gcflags="-m" ./... |
| View profile | go tool pprof cpu.prof |
| Compare benchmarks | benchstat old.txt new.txt |

Benchmarking

Writing Benchmarks

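func BenchmarkOrderCreation(b *testing.B) {
    // Setup outside the loop
    ctx := context.Background() // Assumes CreateOrder takes a context.Context
    repo := NewMockRepository()
    svc := NewOrderService(repo)
    // testOrder is assumed to be a package-level fixture

    b.ResetTimer() // Exclude setup time from the measurement

    for i := 0; i < b.N; i++ {
        svc.CreateOrder(ctx, testOrder)
    }
}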
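
One thing the benchmark above does not show is how to stop the compiler from discarding a call whose result is unused. The sketch below is a minimal illustration, not part of the original skill: joinWords and sink are hypothetical names, and testing.Benchmark is used only so the example runs as a plain program instead of through go test.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink is a package-level variable. Assigning results to it keeps the
// compiler from treating the benchmarked call as dead code.
var sink string

// joinWords is a hypothetical function under test.
func joinWords(words []string) string {
	return strings.Join(words, ",")
}

// benchmarkJoin follows the usual shape: setup, ResetTimer, then the b.N loop.
func benchmarkJoin(b *testing.B) {
	words := []string{"alpha", "beta", "gamma"} // setup outside the loop
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = joinWords(words)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	res := testing.Benchmark(benchmarkJoin)
	fmt.Printf("%d iterations, %d ns/op, %d allocs/op\n",
		res.N, res.NsPerOp(), res.AllocsPerOp())
}
```

In a real test file the function would be named BenchmarkJoin and run with go test -bench=Join -benchmem.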

Benchmark Flags

# Run all benchmarks
go test -bench=. ./...

# Run specific benchmark
go test -bench=BenchmarkOrderCreation ./...

# With memory allocation stats
go test -bench=. -benchmem ./...

# Run multiple times for statistical significance
go test -bench=. -count=10 ./...

# Set minimum run time
go test -bench=. -benchtime=5s ./...

# Run with GOMAXPROCS=1 (-cpu takes a comma-separated list of GOMAXPROCS values, e.g. -cpu=1,2,4)
go test -bench=. -cpu=1 ./...

Sub-benchmarks

func BenchmarkProcess(b *testing.B) {
    sizes := []int{10, 100, 1000, 10000}
    
    for _, size := range sizes {
        b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
            data := generateData(size)
            b.ResetTimer()
            
            for i := 0; i < b.N; i++ {
                process(data)
            }
        })
    }
}

Comparing Benchmarks

# Install benchstat
go install golang.org/x/perf/cmd/benchstat@latest

# Run benchmarks, save results
go test -bench=. -count=10 ./... > old.txt

# Make changes, run again
go test -bench=. -count=10 ./... > new.txt

# Compare
benchstat old.txt new.txt

Profiling with pprof

Generate Profiles

# Profile flags work on one package at a time; go test rejects them
# when the pattern matches several packages, so name a single package

# CPU profile
go test -bench=. -cpuprofile=cpu.prof .

# Memory profile (allocations)
go test -bench=. -memprofile=mem.prof .

# Block profile (contention)
go test -bench=. -blockprofile=block.prof .

# Mutex profile
go test -bench=. -mutexprofile=mutex.prof .

Analyze Profiles

# Interactive CLI
go tool pprof cpu.prof

# Web UI (recommended)
go tool pprof -http=:8080 cpu.prof

# Top functions by CPU
go tool pprof -top cpu.prof

# Show specific function
go tool pprof -list=ProcessOrder cpu.prof

pprof Commands

| Command | Description |
| --- | --- |
| top | Show top functions by sample count |
| top -cum | Show top by cumulative time |
| list <func> | Show annotated source |
| web | Open call graph in browser |
| peek <func> | Show callers and callees |
| disasm <func> | Show assembly |

HTTP pprof (Runtime Profiling)

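package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // Exposes /debug/pprof/* endpoints; bind to localhost so they are not publicly reachable
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Your application...
}

# Capture 30-second CPU profile from running app (quote the URL so the shell does not interpret "?")
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"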

# Memory profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Goroutine profile
go tool pprof http://localhost:6060/debug/pprof/goroutine

Escape Analysis

Understanding Escape

Variables "escape" to the heap when the compiler cannot prove they're safe on the stack:

# Show escape analysis decisions
go build -gcflags="-m" ./...

# More verbose
go build -gcflags="-m -m" ./...

Common Escape Causes

| Pattern | Escapes? | Why |
| --- | --- | --- |
| Return pointer to local | Yes | Outlives function |
| Store in interface | Often | Runtime type info needed |
| Closure captures variable | Often | Closure outlives scope |
| Slice grows beyond capacity | Yes | Reallocation |
| Large stack object | Yes | Exceeds stack limit |

Preventing Escapes

// BAD: Returned pointer usually forces a heap allocation
func NewUser(name string) *User {
    return &User{Name: name}  // Escapes to heap unless the call is inlined
}

// GOOD: Return value (if struct is small)
func NewUser(name string) User {
    return User{Name: name}  // Stays on stack
}

// BAD: Interface conversion can cause escape
func process(v interface{}) { ... }
process(myStruct)  // myStruct may escape when boxed

// GOOD: Use generics (Go 1.18+)
func process[T any](v T) { ... }
process(myStruct)  // May stay on stack

Reducing Allocations

Preallocate Slices

// BAD: Multiple allocations as slice grows
var results []Result
for _, item := range items {
    results = append(results, process(item))
}

// GOOD: Single allocation
results := make([]Result, 0, len(items))
for _, item := range items {
    results = append(results, process(item))
}

String Building

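// BAD: Multiple allocations
// (the input slice is named parts; naming it strings would shadow the strings package)
result := ""
for _, s := range parts {
    result += s  // New allocation each iteration
}

// GOOD: Single allocation when Grow is given the final length
var b strings.Builder
b.Grow(totalLen)  // Optional: preallocate (totalLen is the sum of len(s) over parts)
for _, s := range parts {
    b.WriteString(s)
}
result := b.String()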
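
A self-contained version of the builder approach, including the length computation that feeds Grow, might look like the following. The function name concat is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// concat joins parts using one strings.Builder sized for the final string.
func concat(parts []string) string {
	totalLen := 0
	for _, p := range parts {
		totalLen += len(p)
	}

	var b strings.Builder
	b.Grow(totalLen) // reserve the full length before writing
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	fmt.Println(concat([]string{"go", "-", "perf"})) // prints "go-perf"
}
```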

Reuse Buffers

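// BAD: New buffer each call
func process(data []byte) []byte {
    buf := make([]byte, 1024)
    // use buf to compute result...
    return result
}

// GOOD: Reuse with sync.Pool
// Pool a *[]byte: storing a bare slice in an interface allocates on every Put
var bufPool = sync.Pool{
    New: func() interface{} {
        b := make([]byte, 1024)
        return &b
    },
}

func process(data []byte) []byte {
    bp := bufPool.Get().(*[]byte)
    defer bufPool.Put(bp)
    buf := *bp
    // use buf to compute result...
    return result // Must not alias buf, which goes back to the pool on return
}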

Avoid Boxing

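// BAD: Converting a value to interface{} can allocate (boxing)
func Log(v interface{}) { ... }
Log(n)  // May allocate to box n (constants and single-byte values are exempt)

// GOOD: Use generics or specific types
func Log[T any](v T) { ... }
Log(n)  // No boxing at the call site; confirm with -benchmem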

sync.Pool

When to Use

  • Frequently allocated objects
  • Objects are similar size
  • Allocation shows up in memory profile
  • Object creation is expensive

When NOT to Use

  • Small objects (< 64 bytes)
  • Objects with varying sizes
  • Objects whose state is hard to reset reliably
  • Low allocation rate

Pattern

var pool = sync.Pool{
    New: func() interface{} {
        return &Buffer{data: make([]byte, 4096)}
    },
}

func process() {
    buf := pool.Get().(*Buffer)
    defer pool.Put(buf)
    
    buf.Reset()  // IMPORTANT: Reset state before use
    // use buf...
}

Gotchas

  • Pooled objects can be dropped during any GC cycle - don't rely on the pool for caching
  • Always reset objects before returning to pool
  • Objects may be collected between Put and Get
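
The reset rule is easiest to follow with a concrete type. Below is a minimal sketch using bytes.Buffer from the standard library; render and the greeting text are hypothetical, chosen only to show that no state leaks from one call to the next.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render formats a greeting with a pooled buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // clear state so the next user starts with an empty buffer
		bufPool.Put(buf)
	}()

	buf.WriteString("hello, ")
	buf.WriteString(name)
	// String copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	fmt.Println(render("ann")) // prints "hello, ann"
	fmt.Println(render("bob")) // prints "hello, bob"
}
```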

Memory Optimization

Struct Field Ordering

// BAD: 24 bytes (padding between fields)
type Bad struct {
    a bool   // 1 byte + 7 padding
    b int64  // 8 bytes
    c bool   // 1 byte + 7 padding
}

// GOOD: 16 bytes (fields ordered by size)
type Good struct {
    b int64  // 8 bytes
    a bool   // 1 byte
    c bool   // 1 byte + 6 padding
}

# Check struct sizes
go install golang.org/x/tools/go/analysis/passes/fieldalignment/cmd/fieldalignment@latest
fieldalignment ./...

Slice vs Array

// Array: fixed size, value type, can stay on stack
var arr [100]int

// Slice: dynamic, reference type, backing array may escape
slice := make([]int, 100)

Map Optimization

// Preallocate if size known
m := make(map[string]int, expectedSize)

// Clear map without reallocating (on Go 1.21+, the built-in clear(m) does the same)
for k := range m {
    delete(m, k)
}

CPU Optimization

Inlining

// Small functions get inlined automatically
// Check with: go build -gcflags="-m"

// Force no inline (for debugging)
//go:noinline
func process() { ... }

Bounds Check Elimination

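// A plain loop over one slice needs no hint: the compiler already proves i < len(slice)
for i := 0; i < len(slice); i++ {
    _ = slice[i]  // No bounds check
}

// BAD: One bounds check per access
func sum4(s []int) int {
    return s[0] + s[1] + s[2] + s[3]
}

// GOOD: Hint to compiler
func sum4(s []int) int {
    _ = s[3]  // Bounds check once; panics early if len(s) < 4
    return s[0] + s[1] + s[2] + s[3]  // No bounds checks
}

// Check with: go build -gcflags="-d=ssa/check_bce/debug=1"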

Loop Optimization

// BAD: Function call in loop condition
for i := 0; i < len(getItems()); i++ { ... }

// GOOD: Cache length
items := getItems()
for i := 0; i < len(items); i++ { ... }

// BEST: Range (idiomatic)
for i, item := range items { ... }

Concurrency Performance

Worker Pool

func processItems(items []Item, workers int) []Result {
    jobs := make(chan Item, len(items))
    results := make(chan Result, len(items))
    
    // Start workers
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for item := range jobs {
                results <- process(item)
            }
        }()
    }
    
    // Send jobs
    for _, item := range items {
        jobs <- item
    }
    close(jobs)
    
    // Wait and collect
    go func() {
        wg.Wait()
        close(results)
    }()
    
    out := make([]Result, 0, len(items)) // Preallocate; result order is not guaranteed
    for r := range results {
        out = append(out, r)
    }
    return out
}

Reduce Lock Contention

// BAD: Single lock for all operations
type Cache struct {
    mu   sync.Mutex
    data map[string]Value
}

// GOOD: Sharded locks
type shard struct {
    mu   sync.Mutex
    data map[string]Value // Initialize with make before first use
}

type Cache struct {
    shards [256]shard
}

// shard picks the shard for a key by hashing it (requires hash/fnv)
func (c *Cache) shard(key string) *shard {
    h := fnv.New32a()
    h.Write([]byte(key))
    return &c.shards[h.Sum32()%256]
}

sync.RWMutex for Read-Heavy Workloads

type Cache struct {
    mu   sync.RWMutex
    data map[string]Value
}

func (c *Cache) Get(key string) (Value, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    v, ok := c.data[key]
    return v, ok
}
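
The read path above needs a matching write path that takes the exclusive lock, and the map has to be initialized before use. A minimal runnable sketch, assuming string values and a hypothetical NewCache constructor:

```go
package main

import (
	"fmt"
	"sync"
)

type Cache struct {
	mu   sync.RWMutex
	data map[string]string
}

// NewCache returns a cache with an initialized map.
func NewCache() *Cache {
	return &Cache{data: make(map[string]string)}
}

// Get takes the shared lock, so many readers can proceed at once.
func (c *Cache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// Set takes the exclusive lock, blocking readers and other writers.
func (c *Cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

func main() {
	c := NewCache()
	c.Set("lang", "go")
	v, ok := c.Get("lang")
	fmt.Println(v, ok) // prints "go true"
}
```

Whether RWMutex beats a plain Mutex depends on the read/write ratio and critical-section length, so it is worth benchmarking both under the expected workload.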

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| High memory usage | Allocations in hot path | Use sync.Pool, preallocate slices |
| GC pauses | Many small allocations | Reduce allocations, use value types |
| Slow benchmarks | Setup in loop | Move setup before b.ResetTimer() |
| Profile shows nothing | Too few samples, or a build with optimizations disabled | Run longer (e.g. -benchtime=5s) and profile the default optimized build; do not pass -gcflags="-N -l" |
| Escape to heap | Pointer return, interface | Return value types, use generics |
| Lock contention | Single mutex | Shard locks, use RWMutex |
| Benchmark variance | External factors | Use -count=10, use benchstat |

Anti-Patterns

| Pattern | Problem | Fix |
| --- | --- | --- |
| Premature optimization | Waste of time | Profile first, prove problem exists |
| Micro-benchmarks only | Miss real bottlenecks | Profile production workloads |
| Optimize allocations blindly | May not be bottleneck | Check if GC is actually a problem |
| Copy large structs | Stack pressure | Pass by pointer |
| Pass small structs by pointer | Extra indirection | Pass by value |
| Global sync.Pool | May not help | Benchmark before and after |

Example Requests

| User Request | Action |
| --- | --- |
| "Profile this code" | Generate CPU/memory profile with pprof |
| "Why is this slow?" | Run benchmarks, profile, check escape analysis |
| "Reduce allocations" | Use -benchmem, check escape, use sync.Pool |
| "Optimize this loop" | Preallocate, bounds check elimination, cache length |
| "Memory leak" | Heap profile, check goroutine leaks, pprof heap |
| "Compare performance" | Use benchstat with multiple runs |
| "This allocates too much" | Escape analysis, preallocate, reuse buffers |
Skills Info
Original Name: go-performance
Author: d3fvxl