rust-performance
Write high-performance, computational Rust code that parallelizes well and allocates memory efficiently. Use when writing performance-critical Rust code, optimizing existing Rust code for speed or memory, implementing parallel/concurrent algorithms, working with SIMD, or when the user mentions performance, optimization, allocation, or parallelism in a Rust context. Covers build configuration, profiling, memory allocation patterns, parallelism with rayon/crossbeam, and micro-optimization techniques.
SKILL.md
| Name | rust-performance |
| Description | Write high-performance, computational Rust code that parallelizes well and allocates memory efficiently. Use when writing performance-critical Rust code, optimizing existing Rust code for speed or memory, implementing parallel/concurrent algorithms, working with SIMD, or when the user mentions performance, optimization, allocation, or parallelism in a Rust context. Covers build configuration, profiling, memory allocation patterns, parallelism with rayon/crossbeam, and micro-optimization techniques. |
name: rust-performance description: Write high-performance, computational Rust code that parallelizes well and allocates memory efficiently. Use when writing performance-critical Rust code, optimizing existing Rust code for speed or memory, implementing parallel/concurrent algorithms, working with SIMD, or when the user mentions performance, optimization, allocation, or parallelism in a Rust context. Covers build configuration, profiling, memory allocation patterns, parallelism with rayon/crossbeam, and micro-optimization techniques.
High-Performance Rust
This skill covers writing Rust code optimized for computational performance, parallelism, and memory efficiency.
Build Configuration for Performance
Always use release builds with optimizations:
[profile.release]
codegen-units = 1 # Better optimization, slower compile
lto = "fat" # Link-time optimization
panic = "abort" # Smaller binary, no unwinding
For maximum speed, also add:
RUSTFLAGS="-C target-cpu=native" cargo build --release
Consider alternative allocators for allocation-heavy workloads:
// jemalloc (Linux/macOS) or mimalloc (cross-platform)
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
Error Handling: Avoid unwrap()
Never use .unwrap() in performance-critical code paths. Beyond the panic risk, unwrap() inserts panic-handling code that can inhibit optimizations.
Prefer these patterns:
// GOOD: let-else for early return
let Some(value) = maybe_value else {
return Err("Value not found".into());
};
// GOOD: ok_or_else with lazy error creation
let value = maybe_value.ok_or_else(|| Error::NotFound)?;
// GOOD: match for complex handling
let value = match maybe_value {
Some(v) => v,
None => return default_value,
};
For fallback values, use unwrap_or or unwrap_or_else:
let value = maybe_value.unwrap_or(0);
let value = maybe_value.unwrap_or_else(|| compute_default());
See: https://corrode.dev/blog/rust-option-handling-best-practices/
Memory Allocation Best Practices
Pre-allocate When Size is Known
// BAD: Multiple reallocations
let mut v = Vec::new();
for i in 0..1000 { v.push(i); }
// GOOD: Single allocation
let mut v = Vec::with_capacity(1000);
for i in 0..1000 { v.push(i); }
Reuse Buffers
let mut buffer = Vec::with_capacity(4096);
for item in items {
buffer.clear(); // Keeps capacity
process_into(&mut buffer, item);
}
Use Stack Allocation for Small Collections
use smallvec::{smallvec, SmallVec};
// Stack-allocates up to 8 elements
let mut v: SmallVec<[u32; 8]> = SmallVec::new();
Avoid Cloning in Hot Paths
// BAD: Always clones
fn process(data: Vec<u8>) -> Result<()>
// GOOD: Borrows when possible
fn process(data: &[u8]) -> Result<()>
// GOOD: Copy-on-write for conditional mutation
use std::borrow::Cow;
fn process(data: Cow<[u8]>) -> Result<()>
For detailed memory patterns, see: references/memory-allocation.md
Parallelism
Rayon for Data Parallelism
use rayon::prelude::*;
// Parallel iteration
let results: Vec<_> = data
.par_iter()
.map(|x| expensive_computation(x))
.collect();
// Parallel sort
data.par_sort();
Chunked Processing for Cache Efficiency
data.par_chunks(1024)
.map(|chunk| {
// Sequential within chunk for cache locality
chunk.iter().map(process).sum::<f64>()
})
.sum::<f64>()
Crossbeam for Fine-Grained Concurrency
use crossbeam::scope;
scope(|s| {
for item in &mut items {
s.spawn(move |_| process(item));
}
}).unwrap();
For detailed parallelism patterns, see: references/parallelism.md
Micro-Optimizations
Bounds Check Elimination
// Help compiler eliminate bounds checks
assert!(index < slice.len());
let value = slice[index];
// Or use iteration instead of indexing
for value in &slice[start..end] {
// No bounds checks in loop
}
Inlining Hot Functions
#[inline(always)]
fn hot_inner_loop_fn() { /* ... */ }
#[inline(never)]
fn cold_error_handling() { /* ... */ }
#[cold]
fn rarely_called() { /* ... */ }
Faster Hashing
use rustc_hash::FxHashMap; // Faster than std HashMap
let mut map: FxHashMap<K, V> = FxHashMap::default();
Avoid Iterator Chain Overhead in Hot Loops
// May be slower due to chain overhead
iter.filter(pred).map(f).collect()
// Consider manual loop for hot paths
let mut result = Vec::with_capacity(hint);
for item in iter {
if pred(&item) {
result.push(f(item));
}
}
Profiling
Before optimizing, profile to find actual bottlenecks:
# Compile with debug info for profiling
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
# Linux: Use perf + flamegraph
cargo install flamegraph
cargo flamegraph
# Cross-platform: Use samply
cargo install samply
samply record ./target/release/myapp
For heap profiling, use DHAT:
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;
Reference Material
- For specific topics, run:
scripts/extract_section.py "<section>"- Available: Benchmarking, Build Configuration, Profiling, Inlining, Hashing, Heap Allocations, Type Sizes, Iterators, Parallelism, etc.
- Parallelism patterns:
references/parallelism.md - Memory allocation patterns:
references/memory-allocation.md - Full performance book:
assets/perf-book.md
Quick Checklist
Before shipping performance-critical code:
- Release build with
codegen-units = 1andlto = "fat" - No
unwrap()in hot paths (uselet-elseorok_or_else) - Pre-allocated collections where size is known
- Buffers reused instead of reallocated
- Hot functions marked
#[inline]or#[inline(always)] - Cold/error paths marked
#[cold]or#[inline(never)] - FxHashMap/FxHashSet instead of std HashMap/HashSet
- Profiled with real workloads to confirm optimizations help
- Parallel iteration with rayon where appropriate
- No unnecessary cloning in hot paths