Chapter 11: Benchmarking & Profiling

Go provides excellent tools for measuring and improving performance. Optimization is a journey from measurement to understanding to targeted improvement, and Go’s tooling makes that journey straightforward and scientific.

Performance matters in Go applications, but premature optimization wastes time. The Go philosophy: write clear code first, measure to find real bottlenecks, then optimize what matters. Go’s runtime is already fast - most code doesn’t need optimization. But when performance counts (high-throughput services, real-time systems, resource-constrained environments), Go gives you the tools to understand and improve it.

This chapter covers the essential performance tools: benchmarks for measurement, pprof for profiling CPU and memory usage, and the trace tool for understanding concurrency. You’ll learn to write meaningful benchmarks, interpret profiling data, identify common performance bottlenecks, and apply targeted optimizations. The goal isn’t making everything fast - it’s making the right things fast enough.

Benchmarks measure code performance scientifically. Instead of guessing which approach is faster, you measure both and let data decide. Go’s testing package makes benchmarking as easy as writing tests - benchmarks live alongside tests in _test.go files with the same tooling.

A benchmark function accepts *testing.B and runs the code being measured b.N times. The testing framework automatically determines an appropriate N to get statistically significant results - it starts small and increases until the benchmark runs long enough to measure accurately. You don’t choose N; Go does.

The key insight: benchmarks measure relative performance, not absolute. Comparing two approaches in the same benchmark environment reveals which is faster, but the exact numbers depend on your hardware. Focus on ratios (2x faster, 50% fewer allocations) rather than absolute values (234ns per operation).
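To quantify those ratios across a change, one common workflow (a sketch, assuming the external benchstat tool from golang.org/x/perf) is to save several runs before and after the change and let benchstat report the deltas with significance testing:

# Record baseline runs, apply the change, then record again
go test -bench=. -count=10 > old.txt
# ... make the optimization ...
go test -bench=. -count=10 > new.txt

# Install benchstat: go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt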

Benchmark functions start with Benchmark and use *testing.B:

// In a _test.go file (imports: strings, testing):

// concatWithBuilder is the function under test.
func concatWithBuilder(strs []string) string {
	var sb strings.Builder
	for _, s := range strs {
		sb.WriteString(s)
	}
	return sb.String()
}

func BenchmarkConcat(b *testing.B) {
	strs := []string{"hello", "world", "foo", "bar"}
	b.ResetTimer() // exclude setup from the measurement
	for i := 0; i < b.N; i++ {
		concatWithBuilder(strs)
	}
}

// Run with: go test -bench=. -benchmem
// Output: BenchmarkConcat-8   5000000   234 ns/op   48 B/op   2 allocs/op

Memory allocations are often the hidden performance killer in Go programs. Every allocation means work for the garbage collector. Frequent allocations in hot code paths can trigger GC more often, causing latency spikes and reducing throughput. Understanding allocation patterns is as important as understanding CPU usage.

The -benchmem flag adds allocation statistics to benchmark output: bytes allocated per operation and number of allocations. These numbers reveal optimization opportunities. An algorithm might seem fast but allocate heavily - reducing allocations often speeds up the algorithm and reduces GC pressure simultaneously.

Allocation-free code is the gold standard for performance-critical paths. This doesn’t mean avoiding all allocations - it means being intentional about them. Preallocate buffers, reuse objects with sync.Pool, work with []byte instead of strings, and return values instead of pointers for small types. These techniques dramatically reduce allocation rates.
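As a minimal sketch of the preallocation idea (the function names are illustrative, not from this chapter), compare a slice that grows on demand with one whose capacity is known up front:

// growNaive appends into a nil slice; append must repeatedly
// reallocate and copy the backing array as the slice grows.
func growNaive(n int) []int {
	var out []int
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

// growPrealloc allocates the backing array once with make;
// every append then reuses it.
func growPrealloc(n int) []int {
	out := make([]int, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, i)
	}
	return out
}

Benchmarking the pair with -benchmem should show the preallocated version performing a single allocation regardless of n, while the naive version allocates every time the capacity is exhausted.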

pprof is Go’s built-in profiler for identifying CPU hotspots and memory bottlenecks. Unlike benchmarks that measure specific functions in isolation, pprof analyzes entire programs to show where time is spent and memory is allocated. It answers the critical question: “What should I optimize?”

CPU profiling samples your program periodically (100 times per second by default) to record which functions are executing. After collection, pprof aggregates the data to show time spent per function, including time spent in called functions (cumulative) versus time in the function itself (flat). The top functions by cumulative time are your optimization targets.

Memory profiling tracks allocations, showing which functions allocate the most bytes and objects. This reveals unexpected allocation patterns - maybe a function called rarely allocates huge amounts, or a frequently-called function has small but numerous allocations. Both problems have different solutions, and pprof helps you identify them.

# CPU profiling
go test -cpuprofile=cpu.prof -bench=.
go tool pprof cpu.prof

# Memory profiling
go test -memprofile=mem.prof -bench=.
go tool pprof mem.prof

# Common commands at the pprof prompt:
#   top10          - show the top 10 functions
#   list funcName  - show annotated source for a function
#   web            - open an interactive call graph in the browser

For long-running services, a blank import of net/http/pprof registers profiling endpoints on the default HTTP mux:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// Your application...
}

// Access at:
// http://localhost:6060/debug/pprof/
// http://localhost:6060/debug/pprof/heap
// http://localhost:6060/debug/pprof/goroutine
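With the server running, profiles can be pulled straight from those endpoints, for example:

# Sample 30 seconds of CPU from the live service
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Inspect the live heap
go tool pprof http://localhost:6060/debug/pprof/heap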

Once profiling reveals bottlenecks, you apply targeted optimizations. The following patterns appear repeatedly in performance-critical Go code. They’re not appropriate everywhere - use them where profiling shows they matter, not preemptively.

These optimizations share a theme: reduce allocations, minimize copying, and leverage Go’s efficient primitives. sync.Pool reuses temporary objects. Preallocation eliminates growth overhead. Value receivers avoid pointer indirection for small types. Working with bytes instead of strings avoids conversions. Each technique has specific use cases where it shines.
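As a minimal sync.Pool sketch (the buffer-reuse scenario and names here are illustrative), a hot path that needs a temporary bytes.Buffer can recycle one instead of allocating per call:

package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers instead of allocating one per call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(data []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // pooled objects keep old contents; always reset
	defer bufPool.Put(buf) // return the buffer for reuse
	buf.Write(data)
	return buf.String() // String copies, so reusing buf later is safe
}

func main() {
	fmt.Println(render([]byte("hello, pool")))
}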

The art of optimization is knowing when to apply these patterns. A function called once per request doesn’t need sync.Pool. A slice of 10 items doesn’t need preallocation. A 64-byte struct passed by value is fine. Profile first, understand the bottleneck, then apply the appropriate technique.

The trace tool visualizes program execution over time, showing goroutine scheduling, GC activity, and system interactions. Unlike pprof, which aggregates data, traces show the timeline of events - you can see exactly when goroutines run, block, and communicate. This is invaluable for understanding concurrency issues.

Traces excel at revealing concurrency problems that profiling misses. Is your program underutilizing CPUs because goroutines block on channels? Are goroutines creating contention for locks? Is the GC pausing your application at critical moments? The trace timeline makes these patterns visible.

The interactive trace viewer shows multiple timelines: per-processor goroutine execution, heap size, GC events, and goroutine creation/blocking. Click events to see details, zoom in on interesting periods, and correlate across timelines. Common insights: goroutines spending too much time blocked, inadequate parallelism, or GC triggering too frequently. The trace points to root causes that profiling data alone can’t reveal.

# Generate a trace while running benchmarks
go test -trace=trace.out -bench=.

# Open the interactive trace viewer
go tool trace trace.out
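For a standalone program rather than a benchmark, the runtime/trace package captures the same data; here is a minimal sketch:

package main

import (
	"log"
	"os"
	"runtime/trace"
)

func main() {
	// Write the trace to a file for later viewing with go tool trace.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... the workload you want to trace ...
}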

The trace shows:

  • Goroutine execution timeline
  • GC events
  • Syscalls
  • Network blocking

Key practices for performance work:

  1. Benchmark first - measure before optimizing
  2. Use -benchmem - track allocations, not just time
  3. Profile in production - pprof over HTTP
  4. Preallocate - slices and maps when size is known
  5. Avoid allocations - sync.Pool, strconv, []byte
  6. Trace for concurrency - go tool trace

Exercise: Optimize a Slow Function (hard)

Given a function that counts word frequencies in a text, identify and fix the performance bottlenecks. The optimized version should be at least 3x faster.



Next up: Chapter 12: Patterns & Gotchas