Why custom allocators/pools are hard

In languages with manually managed memory such as C and C++ as well as in garbage-collected languages, you sometimes want to roll your own memory allocator. Some common reasons are:

  • Speed: return &pool[last++] is faster than malloc. (A real pool would usually be slower than that, but still faster than malloc; especially since your "free", ready-to-be-allocated objects in the pool could have a lot of state initialized already since the last time they were used, unlike a malloc'd buffer – in OO terms, you don't need to call the constructor after allocating).
  • Predictability: people usually refer to "the pool advantage" as "lower fragmentation" and hence less chances of running out of memory due to "sudden" fragmentation in unexpected circumstances. Actually, fragmentation is higher with pools: a pool of 100 objects of type A can not be used to allocate objects of type B, even if you're using just one (or zero) A objects right now – so your memory is very much fragmented. However, it's fragmented predictably, leading to predictable allocation times.
  • Stability: Another things which higher fragmentation buys. Pools let you allocate B objects after running out of A objects from the predictably available "B fragment" (pool). This means you can actually handle out-of-memory conditions if you can live without another A object. A malloc-based program "runs out of everything" when it runs out of memory, so it's very unlikely to survive.

How hard are pools? Algorithmically, they're misleadingly easy, unlike malloc which is rather clearly hard. Malloc must be able to allocate chunks of many different sizes in an unpredictable order. A fast algorithm tending to have low fragmentation – or implementing garbage collection and heap defragmentation – is between hard and impossible (where "impossible" means that you'll always have pathological workloads leading to horrible behavior, and you're trying to select an algorithm such that real-life workloads are well-supported and the pathological ones remain theoretical).

Pools, however, usually allocate same-sized chunks. How hard can it be? You keep an array of free pointers and a last index; allocate() pops from this free pointer stack, and free() pushes a pointer back into it, and you're done.

The problem with pools isn't the allocation algorithm but the fact that a new memory management interface has been created. Memory management is fundamental to programming. Built-in memory management interfaces come with a lot of tool support which is rather hard for a custom interface to match.

Consider a garbage-collected language. Most often, such languages provide you a rather strong correctness guarantee: as long as an object is referenced, it will be kept alive and won't be reclaimed and reused for something else. In other words, you're promised to never have problems with dangling references (of course, a subset of these will turn into memory leak problems – too many objects that are no longer "really" needed but referenced – but these are generally way easier to debug).

However, if you implement a pool in a language with GC, that guarantee is gone. The language runtime doesn't know that pool.free(obj) "frees" something – as far as it can tell, the object is very much alive. If someone frees an object and then accesses it, it may very well be that the object has since been reused for something else, and now you have a nasty dangling reference problem.

Your only guarantee now is that you'll only get the "type-safe" variant of dangling references – you'd be fiddling with someone else's object of the same type as yours – but this doesn't necessarily make debugging easier (because changes to the wrong object of the right type may look "too sensible" to provoke the suspicion that they deserve).

Can you tell the runtime, "pool.free actually frees, and I want you to call it instead of your normal reclaiming procedure when the object is no longer referenced?" Perhaps some GC languages have this; it's certainly not a trivial thing to support, because part of the point of pools is to keep hairy, already-constructed objects in them, which point to other objects, some of which might be themselves allocated from pools and some not.

What about languages with manually managed memory? At first glance, the problem seems irrelevant to these because of their "advantage" of not providing any guarantees anyway. You very much can have dangling references with malloc, and pools don't change this.

However, there are tools such as Valgrind which flag a large share of these conditions, by marking chunks passed to free as "inaccessible", and chunks returned by malloc as "undefined" (inaccessible for reading until the first write which initializes the data). The trouble with pools is that, again, Valgrind doesn't know that pool.free frees, and hence it can't flag accesses through dangling references any more.

Is there a workaround? The answer depends on your situation and disposition:

  • Valgrind has a client request mechanism which lets you mark memory regions as "inaccessible" or "undefined", and your pools can issue these requests using Valgrind macros.
  • However, this isn't something that can be done in the pool implementation if the pool keeps constructed objects rather than plain memory chunks. You'll need a per-object-type function marking some of the memory as inaccessible/undefined – but not all of it. For instance, if the object keeps a pointer to a pre-allocated buffer, then maybe the buffer data become undefined when the object is freed and then reallocated, but the pointer to the buffer is defined, because it's already valid. For hairy objects, this can mean a lot of code for making Valgrind work as well as with malloc, and this code can have bugs, marking the wrong things as "defined".
  • If you're using tools other than Valgrind, you'll need to find an equivalent mechanism for these. If you use several tools, then you need to support several mechanisms. There's no standard interface for custom allocators (there could be – there is, in many languages, a standard interface for specifying custom operators, so it's not like there can't be standard ways for doing custom things; there just isn't for pools, at least there isn't in many real languages).

The main point I'm trying to make is, don't have every developer roll their own pool, unless it's for a specific type of objects used briefly and locally in some "private" bit of code. If you need pools for many different kinds of objects and these objects have long, non-trivial lifecycles and are accessed in many different contexts, standardize on the pools.

In a whole lot of cases, code reuse actually isn't worth the trouble and it's fine for people to do their own slightly different version of something which could become a common library – but it'd take too much effort and coordination and misunderstandings during that coordination.

Pools aren't one of these places. Their algorithmic simplicity actually makes it easy to standardize on a few common variants (what variant can one desire that others don't also need?) – and their non-algorithmic complications make standardization very worthwhile.

There are a bunch of other non-algorithmic problems you can have with pools besides having to describe your live objects to tools – for example:

  • Thread safety is another potentially non-portable aspect of memory allocation which is already handled by the language's built-in allocator and will become a headache for a custom one. You could use OS locks, or spinlocks, or a combination, or you could have a per-thread arena to avoid locking if it's too slow, in which case you'll need to handle deallocation by a thread different from the allocating one. Or perhaps you could do lock-free allocation if, say, there's an atomic increment and it's sufficient.
  • Placement new is something you might want to use in C++ that rather many C++ programmers aren't aware of. If you want to have your pool initialize objects in a memory chunk that's passed to it from outside, and you intend to use the pool with classes with non-empty constructors and destructors, then you'll want to do something like for(i=0;i<n;++i) new (buf+i*sizeof(T)) T(args) or what-not, and call ~T directly when the pool shuts down. If everyone rolls their own pools, a lot will do this bit wrong.

The upshot is that pools are surprisingly gnarly, and are really best avoided; there's a very good reason to build memory allocation into a programming language. To the extent that circumstances dictate the use of pools, it's a very good idea to standardize on a few common implementations, debug them, and leave them alone (though unfortunately a closed implementation likely can not deal with live objects tracking, and bugs will appear in user-implemented parts doing that).

The algorithmic simplicity of a pool, prompting people to just declare an object array and a pointer array and use these as a quick-and-dirty pool, is really quite deceiving.

19 comments ↓

#1 Barry Kelly on 12.04.12 at 12:15 am

Four occasions where I've seen custom allocators / pools / arenas:

1) I wrote one for .NET, for recycling largish byte arrays used as network buffers amongst other things. These were large enough that they lived in the Large Object Heap, meaning they weren't collected until expensive gen2 collections, but they were "allocated" every single request.

2) In a compiler, every translation unit had its own arena for allocations. Allocations were never released independently. Arenas were disposed of whole when a translation unit needed recompilation owing to a change in the IDE editor.

3) A C++ compiler that used arenas not just for allocations, but also for precompiled headers. The arena was streamed out, and for loads streamed in and pointers fixed up. Faster than rebuilding an object graph.

4) I wrote a simple custom allocator for debugging memory corruption errors on Windows with ASLR, which would cause addresses from malloc to vary from one run to the next. A trick for finding out how a pointer got a certain value is to set a hardware breakpoint on the pointer's location, then run the program, and potentially perform a binary search on the breakpoint hits to find the bad write. But it usually requires predictable allocation addresses, something you don't get with ASLR. So a simple custom allocator can fix this.

#2 Yossi Kreinin on 12.04.12 at 2:28 am

Is there no way to simply turn off ASLR on Windows?

#3 Prakhar Goel on 12.04.12 at 3:50 am

I think the PostgreSQL RDBMS has an interesting use of pools.

They use it to prevent memory leaks in their (considerable) C codebase. A function gets an arena. All allocation goes through that arena. When the function is done, the entire arena is de-allocated. Arenas are nested so even if a function forgets about an arena, the next level up function can throw the entire set out.

I'm oversimplifying a bit but I think that's the basic idea.

(Not sure if the above is analogous to region-based allocation).

#4 Yossi Kreinin on 12.04.12 at 5:07 am

This is fine as long as things allocated by a function don't continue to live past the exit point.

#5 Jerry on 12.04.12 at 1:56 pm

The times I use pools are for performance due to allocating and deallocating memory through pointers. The primary example for me is objects with vector members. Each new() actually implies a cascade of allocation, and each delete throws that collection of memory chunks away.

I also end up needing each object to have a clear() function to return it to a reusable state.

#6 Jean-Marc on 12.04.12 at 11:16 pm

Another interesting use of custom allocators is to roll out your own pointer type (allocator::pointer), if the maximum number of objects is known, one can use pointers smaller than the default 32 or 64 bits. This will reduce the memory usage particularly in data structures using lot of pointers (list, graphs,…).
Note that not all containers honor the allocator pointer type, some use void* internally with disastrous consequences.

#7 Yossi Kreinin on 12.04.12 at 11:25 pm

@Jerry: yeah, that; trouble is, all the dangling references you could have into those "zombie" objects – though of course it's no big deal as long as the program and the team are small enough.

@Jean-Mark: if I actually wanted to roll my own pointer type (or use indexes instead of pointers, I guess that's what's going on there, sort of), I wouldn't dare to hope that any off the shelf library or interface would cooperate with me… I'd write it myself from scratch, or "not tell" the library that my indexes were "logically pointers"…

#8 Leo Sutic on 12.05.12 at 1:19 am

I would add one more use case: garbage disposal in a GC language. Since GC isn't predictable, you may be forced to call a dispose() method on each object, particularly if it maps to a scarce system resource.

Using a pool (or a stack) as a bookkeeping system works well if the objects have bounded lifetimes.

You start with the pool point at

int mark = pool_pointer;

Then you just take objects from the pool, never returning them. Finally, you dispose() all objects between mark and the current pool_pointer, and set the pool_pointer to mark.

I did this for buffers in a render loop.

int mark = pool.mark();
renderShape (…);
pool.reset (mark);

Of course, since renderShape doesn't put anything back into the pool, the consumption of buffers must be bounded, but this really simplified the code of renderShape, as I didn't have to check that dispose() was called once on every path, for every object.

Ok, this may seem more like a stack allocator, but the buffer object need not be completely disposed, making it a pool of sorts.

#9 neleai on 12.14.12 at 10:03 am

As you mentioned threads they totally defeat purpose of using memory pool in first place.

With locks, atomic operations… you hit a constraint that single atomic compare and swap takes ~100 cycles.

As current c malloc/free is amortized 200-400 cycles per call you won't get big speedup.

Avoiding this cost is possible but complex. You need TLS or equivalent and several tricks how amortize free.

I have several ideas how improve malloc instead.

#10 Yossi Kreinin on 12.14.12 at 11:55 am

You don't necessarily use pools for the speed-up – you may care more about predictability. Then speed varies depending on what sort of threads/cores, etc. But yeah, threads hurt performance of all shared allocators.

#11 Sanjoy Das on 12.29.12 at 1:49 pm

Apart from the benefit that pools help freeing complex nested structures (which has already been mentioned), I find them interesting because they reduce cache misses. A linked list of integers (say) is likely to incur more cache misses on each deref of the "next" pointer if the individual elements have been malloced than if they were allocated from a memory pool (and hence are more likely to lie close to each other). Of course, this only applies to languages like C or C++; languages like Java can have a cleverly compacting GC take care of this.

#12 Yossi Kreinin on 12.29.12 at 2:48 pm

Languages like Java keep what, 8 bytes of overhead per object, minimum? (The method table pointer and the lock, if I recall correctly.) Which inflates the number of cache lines you need before you consider overhead of gc metadata.

What languages like Java do nicely is keep arrays of integers (incidentally, so do most languages – even Python has numpy to store arrays without its usual ridiculous per-object overheads).

Not that I have anything against languages with plenty of overhead, I just don't think that cleverness and a high level of abstraction are the path to efficiency; I think brute force and less abstraction are that path. But that's an eternal "philosophical" dispute, there being no way to truly end it with data.

#13 Sanjoy Das on 12.29.12 at 3:01 pm

I can't but agree that the per-object overhead must adversely affect the cache performance. What I was really driving at is that memory pools make less sense in managed languages since you don't have to (or cannot adequately, as you point out) deal with two important problems they take care of — deallocation of nested structures and better cache performance.

#14 Yossi Kreinin on 01.01.13 at 10:22 am

I do think that if you have hairy objects in a managed language, then it can make sense to keep them in a pool because taking one from a pool is faster than initializing them from scratch.

#15 Stefan on 07.24.13 at 4:02 am

Pools *are* hard to implement, but not so much for the reasons you stated. As you said, speed and maybe fragmentation is almost always the reason for using a pool. But, either way, the standard memory allocator is only then not sufficient, if you are allocating and deallocating _very_ many objects.

That means, a pool only makes sense if it is fast for huge numbers of objects. Consequently, every single operation must have a complexity of O(1), or amortized O(1) (i. e. averaged over all calls). At the same time, you can't afford a lot of memory overhead, or else you cannot sensibly use it for comparitively small objects: you do need some additional memory to maintain the information about free and occupied memory spots.

Finding fast, O(1) algorithms, that can keep your organizational data up to date at all times, and with a minimum of memory overhead, is indeed a complex task.

#16 Mate Soos on 02.22.14 at 11:12 am

You just saved my ass — I didn't know about the Valgrind client request mechanism and I was using a pool. So I read up on it, marked my free memory and found a bug. Nice! Thanks a lot!

By the way, another good reason to use a custom allocator is to place data that is frequently accessed in unison next to one another (& use a 32b index, to make the 'pointer' 32b instead of 64b). This is the case for SAT solvers' boolean clauses: all modern SAT solvers use a custom memory manager for these two reasons. We only free clauses in a very predictable manner, in large batches, relatively rarely. The speedup is more than 10% of overall runtime thanks to the better memory&cache behaviour.

#17 Yossi Kreinin on 03.03.14 at 12:38 pm

Glad to hear it was helpful.

SAT solvers rule. I've been working with a guy who draws a SAT solver to shoot at every second problem he meets. Much of the time the solver gets the shit done.

#18 Matt on 04.18.14 at 6:43 pm

Object finalization is what you sound like you want here. I'm not sure you can call Common Lisp a "real language" anymore due to it's lack of use, but I know that the trivial-garbage package there abstracts implementation specific finalizers which get called when the object is deallocated. I used this in a foreign function interface I was writing one time. Basically, I wrapped a foreign (C) struct pointer in a Lisp class, a finalizer for the object as an :after method to initialize-instance. You had to be careful not to reference the object in the finalizer lambda, or else it would never be collected. I think it looked something like this, (but I also know this is wrong; it's just for an idea)

(defmethod initialize-instance :after (obj &key)
(with-slots (cptr) obj
(add-finalizer cptr #'(lambda () (foreign-free cptr)))))

This made the GC "just work" with the C code. I can't imagine that only lisp has this… But I could be wrong I guess…

#19 Yossi Kreinin on 04.18.14 at 11:40 pm

I think you can do this kind of thing in most languages with an FFI; I'm not sure if it's trivial to use with a custom allocator written in that same language though, where some objects are allocated from pools and some not, and those allocated from pools are "partially constructed" and keep references to objects allocated from other pools, and there may be circular references within those partially constructed objects, etc. Maybe it works out just fine, I'm not concentrating on this sufficiently right now… What's obvious to me is it works fine for an opaque blob allocated in C that you need to deallocate when you run out of references to it, and it may get harder if it's not an opaque blob but a native data structure and you get to worry about its guts referencing stuff and it's no longer mostly C's problem but fully your problem.

Leave a Comment