Go’s Memory Allocator

How Go hands out memory at high speed while keeping your latency low

Dec 07, 2025

“Fast programs aren’t just about fast code. They’re about getting memory without waiting in line.”
— An allocator engineer who spent too much time staring at flame graphs

I used to assume memory allocation was a simple request–response operation: ask the runtime for a few bytes, get a pointer back, move on with life. Then I looked at latency traces from a heavily loaded Go service and realized something important. Allocation wasn’t just a background detail. It was one of the most active, performance-critical parts of the entire system.


Go hides the complexity well, but under the hood the allocator is working constantly, juggling object sizes, managing per-P caches, recycling memory, and maintaining tight cooperation with the garbage collector. It’s a balancing act designed to keep allocations so cheap that you barely think about them.

If the garbage collector is responsible for reclaiming memory, the allocator is responsible for handing it out efficiently. Go’s allocator is deeply integrated with the GC and the scheduler, forming a memory subsystem designed for high concurrency, predictable latency, and fast allocation paths. Although it is inspired by classic allocators like tcmalloc, Go’s version includes its own refinements for goroutines, stack behavior, and language semantics.


At a high level, the allocator revolves around three concepts: spans, size classes, and a three-tier cache hierarchy. Memory comes from the OS as large chunks and gets carved into spans, which the runtime splits into objects of specific size classes. These objects are then distributed through per-P caches for extremely fast allocation that avoids global locks.

Spans and Size Classes

Go organizes heap memory into regions called spans. A span is simply a contiguous run of pages, where a page is the runtime’s own 8 KB unit rather than the OS page size. Each span that serves small objects belongs to a size class, which fixes the size of every object slot in that span. The runtime predefines dozens of size classes (about 67 in recent releases), ranging from tiny 8-byte cells up to 32 KB blocks.


When you allocate a small object:

type Point struct {
    x, y int
}

p := &Point{1, 2}

The allocator determines that Point fits in a particular size class, finds a span from that class with free space, and hands back a pointer. This is extremely quick because the allocator already knows which spans are available and how many free slots they contain.
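To make the rounding step concrete, here is a minimal sketch of a size-class lookup. The table is deliberately abbreviated and `roundToClass` is a hypothetical helper written for this post, not a runtime function; the real runtime uses a much longer generated table and a faster lookup.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Point mirrors the struct from the example above.
type Point struct {
	x, y int
}

// sizeClasses is an abbreviated, illustrative table (assumption: only the
// first few classes are listed; the real table continues up to 32 KB).
var sizeClasses = []uintptr{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundToClass returns the smallest listed class that can hold size bytes,
// or 0 if the request does not fit in this abbreviated table.
func roundToClass(size uintptr) uintptr {
	for _, c := range sizeClasses {
		if size <= c {
			return c
		}
	}
	return 0
}

func main() {
	size := unsafe.Sizeof(Point{})
	fmt.Printf("Point is %d bytes and fits the %d-byte class\n", size, roundToClass(size))
	fmt.Println(roundToClass(17)) // a 17-byte request rounds up to the 24-byte class
}
```

The gap between the requested size and the class size is the price of this scheme: a 17-byte object occupies a 24-byte slot, trading a little internal waste for a very cheap lookup.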

The Per-P Cache: Fast Allocation for Goroutines

Go’s scheduler maintains logical processors (P structures). Each P has its own local cache for small object allocation. This per-P cache is the secret behind Go’s famously fast allocations: most allocations do not involve global locks or shared structures. Instead, the P grabs objects directly from its local span cache.

If a local cache runs out of objects for a size class, it refills from the central cache. If the central cache runs low, it requests more spans from the heap. This tiered design means that goroutines running on different Ps very rarely contend with one another during allocation.
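The refill chain can be modeled with a toy simulation. Everything here is hypothetical: the type names, the batch of 64 slots, and the 256-slot spans are arbitrary choices for illustration. The real runtime moves whole spans between tiers and protects the shared tiers with locks, which this single-goroutine sketch leaves out.

```go
package main

import "fmt"

// toyHeap stands in for the global heap: the last resort that hands out spans.
type toyHeap struct{ spansGiven int }

func (h *toyHeap) allocSpan(slots int) int {
	h.spansGiven++
	return slots
}

// toyCentral stands in for the central cache of a single size class.
type toyCentral struct {
	free    int
	refills int
	backing *toyHeap
}

// take hands a batch of slots to a local cache, growing from the heap if short.
func (c *toyCentral) take(batch int) int {
	if c.free < batch {
		c.free += c.backing.allocSpan(4 * batch)
	}
	c.free -= batch
	c.refills++
	return batch
}

// toyLocal stands in for a per-P cache of the same size class.
type toyLocal struct {
	free int
	src  *toyCentral
}

// alloc is the fast path: it only touches the central cache when empty.
func (l *toyLocal) alloc() {
	if l.free == 0 {
		l.free = l.src.take(64)
	}
	l.free--
}

// simulate performs n allocations and reports how often each slower tier was hit.
func simulate(n int) (refills, spans int) {
	h := &toyHeap{}
	c := &toyCentral{backing: h}
	l := &toyLocal{src: c}
	for i := 0; i < n; i++ {
		l.alloc()
	}
	return c.refills, h.spansGiven
}

func main() {
	refills, spans := simulate(1000)
	fmt.Println(refills, spans) // 16 4
}
```

In this model, 1000 allocations reach the central tier 16 times and the heap 4 times. The exact ratios are an artifact of the made-up batch sizes, but the shape is the point: each tier is consulted far less often than the one above it.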

A small example that benefits from this:

func makePoints(n int) []*Point {
    out := make([]*Point, n)
    for i := 0; i < n; i++ {
        out[i] = &Point{x: i, y: i}
    }
    return out
}

Even though this function allocates thousands of objects, most of them come straight from the per-P allocator. Only occasionally does the P talk to the central allocator, keeping the fast path extremely fast.
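One way to see how many heap objects a call like this creates is to diff `runtime.MemStats.Mallocs` around it. The sketch below assumes a hypothetical helper, `countMallocs`, and a package-level `sink` so the results escape to the heap; the count also includes any unrelated allocations that happen in the same window, so treat it as approximate.

```go
package main

import (
	"fmt"
	"runtime"
)

type Point struct {
	x, y int
}

func makePoints(n int) []*Point {
	out := make([]*Point, n)
	for i := 0; i < n; i++ {
		out[i] = &Point{x: i, y: i}
	}
	return out
}

// sink keeps the result reachable so the compiler cannot keep it on the stack.
var sink []*Point

// countMallocs reports how many heap objects were allocated while f ran.
// It reads cumulative counters, so background allocations are included.
func countMallocs(f func()) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	return after.Mallocs - before.Mallocs
}

func main() {
	n := countMallocs(func() { sink = makePoints(1000) })
	fmt.Println("heap allocations observed:", n)
}
```

`runtime.ReadMemStats` briefly stops the world, so this is a diagnostic tool rather than something to leave in a hot path.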


Go’s per-P cache (known as mcache) is similar in concept to the per-thread/per-CPU caches in tcmalloc, as the Go memory allocator was originally based on tcmalloc. Both use a thread-local (or processor-local in Go’s case) cache to reduce lock contention during small object allocations.

Large Objects

Small objects come from size-class spans, but large objects bypass the size classes entirely. If an allocation request exceeds the largest size class (about 32 KB), Go allocates a dedicated span big enough to hold the object. These spans are tracked separately and freed independently during sweeping.

This explains why very large slices or byte buffers immediately request new spans:

buf := make([]byte, 10*1024*1024) // 10 MB buffer

The runtime grabs a large span directly from the heap rather than using size classes, which avoids excessive fragmentation.
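The arithmetic behind a dedicated span is simple enough to sketch. The constants below are assumptions chosen to match the figures commonly cited for the runtime (8 KB pages, a 32 KB small-object limit), and `isLarge` and `pagesFor` are hypothetical helpers, not runtime APIs.

```go
package main

import "fmt"

const (
	pageSize     = 8 << 10  // assumed runtime page size: 8 KB
	maxSmallSize = 32 << 10 // assumed small/large threshold: 32 KB
)

// isLarge reports whether a request of size bytes would bypass the size classes.
func isLarge(size int) bool {
	return size > maxSmallSize
}

// pagesFor returns how many whole pages a dedicated span needs for size bytes.
func pagesFor(size int) int {
	return (size + pageSize - 1) / pageSize
}

func main() {
	size := 10 * 1024 * 1024 // the 10 MB buffer from the example above
	fmt.Println(isLarge(size), pagesFor(size)) // true 1280
}
```

Because a large object gets a whole number of pages, the waste is at most one page per object, which is negligible at these sizes.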

A goroutine first checks its local cache (mcache) for small allocations, then moves to the central pool (mcentral), then to the global heap (mheap), which finally requests more memory from the operating system if needed.

How Go’s allocator works.

mcache (Local cache). Each logical processor (P) has its own cache (mcache) of pre-allocated memory blocks, which are grouped by size class (e.g., 8 bytes, 16 bytes). This allows for very fast allocation for small objects without using locks.
mcentral (Central pool). If an mcache is depleted, it requests a new memory block from a central pool (mcentral). The mcentral maintains lists of memory “spans” (contiguous blocks of memory) for each size class.
Spans. A span is a contiguous block of memory made of one or more 8 KB pages. When a new span is needed, the mcentral gets it from the mheap. The span is divided into equal slots of the requested size class, and the slots that are not handed out yet stay available for later allocations of that class.
mheap (Global heap). If the mcentral cannot satisfy a request, it asks the mheap for a new span. The mheap is the central component that requests larger blocks of memory from the operating system. This is the final fallback when all other levels are exhausted.

Talking to the OS

When the allocator itself runs out of free spans, it requests memory from the OS. Go uses virtual memory mapping calls (like mmap on Unix) to reserve new regions. This relationship is cooperative: the runtime releases unused pages back to the OS when possible (in current releases a background scavenger handles this), and the OS lazily backs virtual pages with physical memory only as they are touched.

This ties back to virtual memory principles. Go may reserve hundreds of megabytes of virtual memory during runtime growth, but physical memory is allocated gradually as the program actually writes to pages. This is why large-but-mostly-unused slices are not disastrous for RAM usage.
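The runtime exposes its own accounting of this through `runtime.MemStats`. The sketch below reads a few of those fields after making a large slice that is never written to; `heapView` and `readHeap` are hypothetical names. Note that these numbers describe what the runtime has obtained from and released to the OS, not the process's resident set size, which has to be checked with OS tools.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapView collects the runtime.MemStats fields relevant to OS interaction.
type heapView struct {
	Sys      uint64 // heap memory obtained from the OS
	Inuse    uint64 // bytes in spans that hold at least one object
	Idle     uint64 // bytes in spans that hold no objects
	Released uint64 // idle bytes already returned to the OS
}

func readHeap() heapView {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapView{m.HeapSys, m.HeapInuse, m.HeapIdle, m.HeapReleased}
}

// buf is package-level so the slice lives on the heap.
var buf []byte

func main() {
	buf = make([]byte, 64<<20) // 64 MB that this program never writes to
	v := readHeap()
	fmt.Printf("sys=%d MB inuse=%d MB idle=%d MB released=%d MB\n",
		v.Sys>>20, v.Inuse>>20, v.Idle>>20, v.Released>>20)
}
```

Comparing these figures with the resident size reported by the OS is a reasonable first step when a service's memory footprint looks larger or smaller than expected.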

Integration with the Garbage Collector

The allocator and garbage collector share metadata about spans so the GC knows which objects contain pointers, which contain none, and which slots are free. Spans holding pointerless objects skip scanning entirely, which significantly reduces GC work. Because every object in a span has the same size, the GC can find object boundaries with simple arithmetic, and it reads pointer locations from compact metadata kept alongside the heap rather than from a full header on every object.

The allocator also collaborates with sweeping. As a P allocates new objects, it may perform a little sweeping work on the side, helping the system reclaim memory incrementally instead of waiting for long pauses.

Why Go’s Allocator Feels Fast

Developers often notice that small allocations in Go are surprisingly cheap compared to allocators that rely on a global heap lock, and competitive with the bump-pointer allocators that other runtimes pair with compacting GCs. Go’s secret is that the actual hot path is usually just:

Grab a free slot from the P’s local span.
Zero the memory (required by the spec).
Return the pointer.
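Those three steps can be sketched on a toy span. `toySpan` is a hypothetical, heavily simplified model: it only bumps an index through never-used slots, whereas the real runtime also tracks freed slots with per-span bitmaps so they can be reused after sweeping.

```go
package main

import "fmt"

// toySpan is a simplified span: one backing array carved into equal slots.
type toySpan struct {
	mem      []byte
	elemSize int
	next     int // index of the next never-used slot
}

func newToySpan(spanBytes, elemSize int) *toySpan {
	return &toySpan{mem: make([]byte, spanBytes), elemSize: elemSize}
}

// alloc follows the hot path: grab a slot, zero it, return it.
// It reports false when the span is full; a real allocator would refill here.
func (s *toySpan) alloc() ([]byte, bool) {
	start := s.next * s.elemSize
	end := start + s.elemSize
	if end > len(s.mem) {
		return nil, false
	}
	s.next++
	slot := s.mem[start:end:end] // cap-limited so callers cannot spill into neighbors
	for i := range slot {
		slot[i] = 0 // zeroing, in case the slot held stale data
	}
	return slot, true
}

func main() {
	s := newToySpan(8192, 16)
	n := 0
	for {
		if _, ok := s.alloc(); !ok {
			break
		}
		n++
	}
	fmt.Println(n) // 512 slots of 16 bytes fit in one 8 KB span
}
```

Nothing in `alloc` takes a lock or makes a system call, which is the property the real fast path shares with this sketch.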

No kernel calls. No global lock. No stop-the-world behavior.

And because spans are reused aggressively, memory fragmentation stays reasonable without requiring compaction.

Putting It All Together

Go’s allocator is not simply a memory distribution system. It is a carefully tuned component in Go’s whole runtime ecosystem. It keeps goroutines fast, it cooperates with the garbage collector, it plays well with virtual memory, and it ensures that high-concurrency programs can allocate rapidly without bottlenecks.

Understanding it helps explain why some patterns allocate more than you expect, why pointer-heavy structures increase GC pressure, and why preallocating slices can significantly reduce span churn*. With this knowledge, you can write Go programs that align with the runtime’s strengths instead of fighting them.

* Preallocating slices significantly reduces “churn” by avoiding frequent reallocations, memory copying, and the associated garbage collection overhead that occurs as slices grow beyond their initial capacity.
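The effect of preallocation is easy to observe by counting how often `append` has to replace the backing array. `countGrowths` is a hypothetical helper written for this illustration; the exact number of growths without preallocation depends on the Go version's growth policy, so the sketch only prints it rather than assuming a value.

```go
package main

import "fmt"

// countGrowths appends n ints and counts how often the capacity changed,
// which is when append allocated a new backing array and copied the data.
func countGrowths(n int, prealloc bool) int {
	var s []int
	if prealloc {
		s = make([]int, 0, n) // one allocation up front
	}
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	fmt.Println("growths without preallocation:", countGrowths(1000, false))
	fmt.Println("growths with preallocation:", countGrowths(1000, true))
}
```

Each growth leaves the old backing array behind as garbage, so fewer growths means fewer short-lived objects for the allocator to hand out and the GC to reclaim.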


The Coding Gopher
