Why custom allocators/pools are hard
In languages with manually managed memory such as C and C++ as well as in garbage-collected languages, you sometimes want to
roll your own memory allocator. Some common reasons are:
- Speed: return &pool[last++] is faster than malloc. (A real pool would usually be slower than that, but
still faster than malloc – especially since the "free", ready-to-be-allocated objects in the pool can have a lot of state
still initialized from the last time they were used, unlike a malloc'd buffer; in OO terms, you don't need to call the
constructor after allocating.)
- Predictability: people usually refer to "the pool advantage" as "lower fragmentation" and hence less
chance of running out of memory due to "sudden" fragmentation in unexpected circumstances. Actually, fragmentation is
higher with pools: a pool of 100 objects of type A cannot be used to allocate objects of type B, even if you're using
just one (or zero) A objects right now – so your memory is very much fragmented. However, it's fragmented predictably,
leading to predictable allocation times.
- Stability: Another thing that higher fragmentation buys. Pools let you allocate B objects from the predictably
available "B fragment" (pool) even after running out of A objects. This means you can actually handle out-of-memory conditions
if you can live without another A object. A malloc-based program "runs out of everything" when it runs out of memory, so it's
very unlikely to survive.
How hard are pools? Algorithmically, they're misleadingly easy, unlike malloc which is rather clearly hard. Malloc must be
able to allocate chunks of many different sizes in an unpredictable order. A fast algorithm tending to have low fragmentation –
or implementing garbage collection and heap defragmentation – is between hard and impossible (where "impossible" means that
you'll always have pathological workloads leading to horrible behavior, and you're trying to select an algorithm such that
real-life workloads are well-supported and the pathological ones remain theoretical).
Pools, however, usually allocate same-sized chunks. How hard can it be? You keep an array of free pointers and a last index;
allocate() pops from this free pointer stack, and free() pushes a pointer back into it, and you're done.
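In C++, that free-pointer-stack pool can be sketched as follows (the names and the fixed capacity are illustrative, not a definitive implementation):

```cpp
#include <cstddef>

// Minimal fixed-size pool: a stack of free pointers over a preallocated
// object array. allocate() pops from the stack, free() pushes back.
template <typename T, std::size_t N>
class Pool {
    T objects[N];
    T* free_stack[N];
    std::size_t last; // number of free pointers currently on the stack
public:
    Pool() : last(N) {
        for (std::size_t i = 0; i < N; ++i)
            free_stack[i] = &objects[i];
    }
    T* allocate() {
        return last ? free_stack[--last] : nullptr; // nullptr: pool exhausted
    }
    void free(T* p) {
        free_stack[last++] = p; // caller promises p came from this pool
    }
};
```

Note that an allocation is a decrement and an array read; there is no search, which is where the speed and the predictability come from.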
The problem with pools isn't the allocation algorithm but the fact that a new memory management interface has been
created. Memory management is fundamental to programming. Built-in memory management interfaces come with a lot of tool
support which is rather hard for a custom interface to match.
Consider a garbage-collected language. Most often, such languages provide you a rather strong correctness guarantee: as long
as an object is referenced, it will be kept alive and won't be reclaimed and reused for something else. In other words, you're
promised to never have problems with dangling references (of course, a subset of these will turn into memory leak problems – too
many objects that are no longer "really" needed but referenced – but these are generally way easier to debug).
However, if you implement a pool in a language with GC, that guarantee is gone. The language runtime doesn't know that
pool.free(obj) "frees" something – as far as it can tell, the object is very much alive. If someone frees an object and then
accesses it, it may very well be that the object has since been reused for something else, and now you have a nasty dangling
reference problem.
Your only guarantee now is that you'll only get the "type-safe" variant of dangling references – you'd be fiddling with
someone else's object of the same type as yours – but this doesn't necessarily make debugging easier (because changes to the
wrong object of the right type may look "too sensible" to arouse the suspicion they deserve).
Can you tell the runtime, "pool.free actually frees, and I want you to call it instead of your normal reclaiming procedure
when the object is no longer referenced?" Perhaps some GC languages have this; it's certainly not a trivial thing to support,
because part of the point of pools is to keep hairy, already-constructed objects in them, which point to other objects, some of
which might be themselves allocated from pools and some not.
What about languages with manually managed memory? At first glance, the problem seems irrelevant to these because of their
"advantage" of not providing any guarantees anyway. You very much can have dangling references with malloc, and pools don't
change this.
However, there are tools such as Valgrind which flag a large share of these conditions, by marking chunks passed to free as
"inaccessible", and chunks returned by malloc as "undefined" (inaccessible for reading until the first write which initializes
the data). The trouble with pools is that, again, Valgrind doesn't know that pool.free frees, and hence it can't flag accesses
through dangling references any more.
Is there a workaround? The answer depends on your situation and disposition:
- Valgrind has a client request mechanism which lets you mark memory regions as "inaccessible" or "undefined", and your pools
can issue these requests using Valgrind macros.
- However, this isn't something that can be done in the pool implementation if the pool keeps constructed objects rather than
plain memory chunks. You'll need a per-object-type function marking some of the memory as inaccessible/undefined – but not all
of it. For instance, if the object keeps a pointer to a pre-allocated buffer, then maybe the buffer data become undefined when
the object is freed and then reallocated, but the pointer to the buffer is defined, because it's already valid. For
hairy objects, this can mean a lot of code for making Valgrind work as well as with malloc, and this code can have bugs, marking
the wrong things as "defined".
- If you're using tools other than Valgrind, you'll need to find an equivalent mechanism for these. If you use several tools,
then you need to support several mechanisms. There's no standard interface for custom allocators (there could be – there is, in
many languages, a standard interface for specifying custom operators, so it's not like there can't be standard ways for
doing custom things; there just isn't for pools, at least there isn't in many real languages).
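As a sketch of the first workaround: assuming Memcheck's client request macros from <valgrind/memcheck.h> (guarded below so the snippet compiles even without Valgrind's headers installed), a pool could mark its chunks the way malloc and free do:

```cpp
#include <cstddef>

// Valgrind's client request macros live in <valgrind/memcheck.h>; fall back
// to no-ops so this sketch builds without Valgrind installed.
#ifdef __has_include
# if __has_include(<valgrind/memcheck.h>)
#  include <valgrind/memcheck.h>
# endif
#endif
#ifndef VALGRIND_MAKE_MEM_NOACCESS
# define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void)0)
# define VALGRIND_MAKE_MEM_UNDEFINED(addr, len) ((void)0)
#endif

// Hypothetical hooks a pool could call from its free()/allocate():
void on_pool_free(void* obj, std::size_t size) {
    // Any access through a dangling reference is now flagged by Memcheck.
    VALGRIND_MAKE_MEM_NOACCESS(obj, size);
}
void on_pool_allocate(void* obj, std::size_t size) {
    // Addressable again, but reading before writing is flagged.
    VALGRIND_MAKE_MEM_UNDEFINED(obj, size);
}
```

Outside Valgrind the macros cost a few cycles, so they can usually stay compiled in.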
The main point I'm trying to make is, don't have every developer roll their own pool, unless it's for a
specific type of objects used briefly and locally in some "private" bit of code. If you need pools for many different kinds of
objects and these objects have long, non-trivial lifecycles and are accessed in many different contexts, standardize on the
pools.
In a whole lot of cases, code reuse actually isn't worth the trouble, and it's fine for people to do their own slightly
different version of something which could become a common library – making it common would take too much effort and
coordination, and misunderstandings during that coordination.
Pools aren't one of these places. Their algorithmic simplicity actually makes it easy to standardize on a few common variants
(what variant can one desire that others don't also need?) – and their non-algorithmic complications make standardization very
worthwhile.
There are a bunch of other non-algorithmic problems you can have with pools besides having to describe your live objects to
tools – for example:
- Thread safety is another potentially non-portable aspect of memory allocation which is already handled by
the language's built-in allocator and will become a headache for a custom one. You could use OS locks, or spinlocks, or a
combination, or you could have a per-thread arena to avoid locking if it's too slow, in which case you'll need to handle
deallocation by a thread different from the allocating one. Or perhaps you could do lock-free allocation if, say, there's an
atomic increment and it's sufficient.
- Placement new is something you might want to use in C++ which many C++ programmers aren't aware of.
If you want to have your pool initialize objects in a memory chunk that's passed to it from outside, and you intend to use the
pool with classes with non-empty constructors and destructors, then you'll want to do something like
for(size_t i = 0; i < n; ++i)
    new (buf + i*sizeof(T)) T(args);
or what-not, and call ~T directly when the pool shuts down. If everyone rolls their own
pools, a lot will do this bit wrong.
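A cleaned-up sketch of that placement-new pattern – a hypothetical pool type constructing objects into caller-provided storage and calling ~T itself on shutdown:

```cpp
#include <cstddef>
#include <new> // placement new

// Constructs `count` T objects into a raw buffer handed in from outside,
// and destroys them explicitly: with placement new, nobody calls ~T for you.
template <typename T>
class ConstructedPool {
    unsigned char* buf;
    std::size_t n;
public:
    template <typename... Args>
    ConstructedPool(unsigned char* storage, std::size_t count, const Args&... args)
        : buf(storage), n(count) {
        for (std::size_t i = 0; i < n; ++i)
            new (buf + i * sizeof(T)) T(args...); // construct in place
    }
    T* get(std::size_t i) { return reinterpret_cast<T*>(buf + i * sizeof(T)); }
    ~ConstructedPool() {
        for (std::size_t i = 0; i < n; ++i)
            get(i)->~T(); // explicit destructor call on shutdown
    }
};
```

The caller is responsible for the storage being large enough and suitably aligned (e.g. declared with alignas(T)) – forgetting alignment, or forgetting the ~T loop, are exactly the bits people get wrong.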
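On the thread-safety point above: if allocation is just bumping an index, an atomic increment can indeed be sufficient. A minimal sketch (a hypothetical type; note it never frees, which is what keeps it this simple):

```cpp
#include <atomic>
#include <cstddef>

// Lock-free allocation via atomic fetch-and-add: each thread claims the
// next slot without taking a lock. Deallocation is deliberately absent.
template <typename T, std::size_t N>
class AtomicBumpPool {
    T objects[N];
    std::atomic<std::size_t> next{0};
public:
    T* allocate() {
        std::size_t i = next.fetch_add(1, std::memory_order_relaxed);
        return i < N ? &objects[i] : nullptr; // nullptr: out of objects
    }
};
```

Supporting free() lock-free is where the real complexity starts (a lock-free stack brings ABA problems and so on), which is why per-thread arenas are a common escape hatch.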
The upshot is that pools are surprisingly gnarly, and are really best avoided; there's a very good reason to build memory
allocation into a programming language. To the extent that circumstances dictate the use of pools, it's a very good idea to
standardize on a few common implementations, debug them, and leave them alone (though unfortunately a closed implementation
likely cannot deal with live-object tracking, and bugs will appear in the user-implemented parts doing that).
The algorithmic simplicity of a pool, prompting people to just declare an object array and a pointer array and use these as a
quick-and-dirty pool, is really quite deceiving.
Four occasions where I've seen custom allocators / pools / arenas:
1) I wrote one for .NET, for recycling largish byte arrays used as
network buffers amongst other things. These were large enough that they
lived in the Large Object Heap, meaning they weren't collected until
expensive gen2 collections, but they were "allocated" every single
request.
2) In a compiler, every translation unit had its own arena for
allocations. Allocations were never released independently. Arenas were
disposed of whole when a translation unit needed recompilation owing to
a change in the IDE editor.
3) A C++ compiler that used arenas not just for allocations, but also
for precompiled headers. The arena was streamed out to disk, and on load
streamed back in with pointers fixed up. Faster than rebuilding an object
graph.
4) I wrote a simple custom allocator for debugging memory corruption
errors on Windows with ASLR, which would cause addresses from malloc to
vary from one run to the next. A trick for finding out how a pointer got
a certain value is to set a hardware breakpoint on the pointer's
location, then run the program, and potentially perform a binary search
on the breakpoint hits to find the bad write. But it usually requires
predictable allocation addresses, something you don't get with ASLR. So
a simple custom allocator can fix this.
Is there no way to simply turn off ASLR on Windows?
I think the PostgreSQL RDBMS has an interesting use of pools.
They use them to prevent memory leaks in their (considerable) C
codebase. A function gets an arena. All allocation goes through that
arena. When the function is done, the entire arena is de-allocated.
Arenas are nested so even if a function forgets about an arena, the next
level up function can throw the entire set out.
I'm oversimplifying a bit but I think that's the basic idea.
(Not sure if the above is analogous to region-based allocation).
This is fine as long as things allocated by a function don't continue
to live past the exit point.
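A minimal sketch of such nested arenas (illustrative names only; PostgreSQL's actual MemoryContext machinery is far richer). Destroying a parent arena frees its own chunks and every child arena, which is what catches the "function forgot its arena" case:

```cpp
#include <cstdlib>
#include <vector>

// Each arena owns its chunks and its child arenas; tearing down a parent
// tears down everything below it, leaked or not.
class Arena {
    std::vector<void*> chunks;
    std::vector<Arena*> children;
public:
    void* alloc(std::size_t size) {
        void* p = std::malloc(size);
        chunks.push_back(p);
        return p;
    }
    Arena* make_child() { // e.g. one child arena per function call
        Arena* child = new Arena;
        children.push_back(child);
        return child;
    }
    ~Arena() {
        for (Arena* c : children) delete c;  // children first
        for (void* p : chunks) std::free(p); // then our own chunks
    }
};
```

The caveat in the comment below holds here too: nothing allocated from a function's arena may outlive that arena.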
The times I use pools, it's for performance: allocating and
deallocating memory through pointers is costly. The primary example for me is
objects with vector members. Each new() actually implies a cascade of
allocations, and each delete throws that collection of memory chunks
away.
I also end up needing each object to have a clear() function to
return it to a reusable state.
Another interesting use of custom allocators is to roll your own
pointer type (allocator::pointer): if the maximum number of objects is
known, one can use pointers smaller than the default 32 or 64 bits. This
will reduce memory usage, particularly in data structures using a lot
of pointers (lists, graphs, ...).
Note that not all containers honor the allocator pointer type, some use
void* internally with disastrous consequences.
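A sketch of that index-as-pointer idea, assuming at most 2^16 nodes and a hand-rolled structure rather than a standard container (for exactly the reason the comment gives):

```cpp
#include <cstdint>
#include <vector>

// A linked list where "next" is a 16-bit index into a node array instead
// of a 64-bit pointer; 0xFFFF plays the role of the null pointer.
struct Node {
    int value;
    std::uint16_t next;
};
constexpr std::uint16_t NIL = 0xFFFF;

// Push a value onto the front of the list rooted at `head`; returns the
// new head index. The vector doubles as the node arena.
std::uint16_t push_front(std::vector<Node>& nodes, std::uint16_t head, int v) {
    nodes.push_back(Node{v, head});
    return static_cast<std::uint16_t>(nodes.size() - 1);
}
```

Two bytes per link instead of eight also helps cache density, on top of the raw memory savings.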
@Jerry: yeah, that; trouble is, all the dangling references you could
have into those "zombie" objects – though of course it's no big deal as
long as the program and the team are small enough.
@Jean-Mark: if I actually wanted to roll my own pointer type (or use
indexes instead of pointers, I guess that's what's going on there, sort
of), I wouldn't dare to hope that any off the shelf library or interface
would cooperate with me... I'd write it myself from scratch, or "not
tell" the library that my indexes were "logically pointers"...
I would add one more use case: garbage disposal in a GC language.
Since GC isn't predictable, you may be forced to call a dispose() method
on each object, particularly if it maps to a scarce system resource.
Using a pool (or a stack) as a bookkeeping system works well if the
objects have bounded lifetimes.
You start by recording the pool pointer:
int mark = pool_pointer;
Then you just take objects from the pool, never returning them.
Finally, you dispose() all objects between mark and the current
pool_pointer, and set the pool_pointer to mark.
I did this for buffers in a render loop.
int mark = pool.mark();
renderShape (...);
pool.reset (mark);
Of course, since renderShape doesn't put anything back into the pool,
the consumption of buffers must be bounded, but this really simplified
the code of renderShape, as I didn't have to check that dispose() was
called once on every path, for every object.
Ok, this may seem more like a stack allocator, but the buffer object
need not be completely disposed, making it a pool of sorts.
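The mark/dispose/reset scheme above might look like this (hypothetical names; Buffer stands in for whatever wraps the scarce resource):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for an object mapping to a scarce resource that needs dispose().
struct Buffer {
    bool disposed = false;
    void dispose() { disposed = true; } // e.g. release a system handle
};

// Objects are taken from the pool and never returned individually;
// reset(mark) disposes everything taken since the matching mark().
class MarkPool {
    std::vector<Buffer> buffers;
    std::size_t pool_pointer = 0;
public:
    explicit MarkPool(std::size_t n) : buffers(n) {}
    std::size_t mark() const { return pool_pointer; }
    Buffer* take() { return &buffers[pool_pointer++]; }
    void reset(std::size_t mark) {
        for (std::size_t i = mark; i < pool_pointer; ++i)
            buffers[i].dispose(); // one dispose() per object, on every path
        pool_pointer = mark;
    }
};
```

As the comment says, consumption between mark() and reset() must stay bounded, since nothing goes back until the reset.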
As you mentioned threads: they totally defeat the purpose of using a memory
pool in the first place.
With locks, atomic operations... you hit the constraint that a single
atomic compare-and-swap takes ~100 cycles.
As the current C malloc/free is amortized 200-400 cycles per call, you
won't get a big speedup.
Avoiding this cost is possible but complex. You need TLS or
equivalent and several tricks to amortize free.
I have several ideas how to improve malloc instead.
You don't necessarily use pools for the speed-up – you may care more
about predictability. Then speed varies depending on what sort of
threads/cores, etc. But yeah, threads hurt performance of all shared
allocators.
Apart from the benefit that pools help freeing complex nested
structures (which has already been mentioned), I find them interesting
because they reduce cache misses. A linked list of integers (say) is
likely to incur more cache misses on each deref of the "next" pointer if
the individual elements have been malloced than if they were allocated
from a memory pool (and hence are more likely to lie close to each
other). Of course, this only applies to languages like C or C++;
languages like Java can have a cleverly compacting GC take care of
this.
Languages like Java keep what, 8 bytes of overhead per object,
minimum? (The method table pointer and the lock, if I recall correctly.)
Which inflates the number of cache lines you need before you consider
overhead of gc metadata.
What languages like Java do nicely is keep arrays of integers
(incidentally, so do most languages – even Python has numpy to store
arrays without its usual ridiculous per-object overheads).
Not that I have anything against languages with plenty of overhead, I
just don't think that cleverness and a high level of abstraction are the
path to efficiency; I think brute force and less abstraction are that
path. But that's an eternal "philosophical" dispute, there being no way
to truly end it with data.
I can't but agree that the per-object overhead must adversely affect
the cache performance. What I was really driving at is that memory pools
make less sense in managed languages since you don't have to (or cannot
adequately, as you point out) deal with two important problems they take
care of — deallocation of nested structures and better cache
performance.
I do think that if you have hairy objects in a managed language, then
it can make sense to keep them in a pool because taking one from a pool
is faster than initializing them from scratch.
Pools *are* hard to implement, but not so much for the reasons you
stated. As you said, speed and maybe fragmentation is almost always the
reason for using a pool. But either way, the standard memory allocator
is insufficient only if you are allocating and deallocating _very_ many
objects.
That means a pool only makes sense if it is fast for huge numbers of
objects. Consequently, every single operation must have a complexity of
O(1), or amortized O(1) (i.e. averaged over all calls). At the same
time, you can't afford a lot of memory overhead, or else you cannot
sensibly use it for comparatively small objects: you do need some
additional memory to maintain the information about free and occupied
memory spots.
Finding fast, O(1) algorithms, that can keep your organizational data
up to date at all times, and with a minimum of memory overhead, is
indeed a complex task.
You just saved my ass — I didn't know about the Valgrind client
request mechanism and I was using a pool. So I read up on it, marked my
free memory and found a bug. Nice! Thanks a lot!
By the way, another good reason to use a custom allocator is to place
data that is frequently accessed in unison next to one another (&
use a 32b index, to make the 'pointer' 32b instead of 64b). This is the
case for SAT solvers' boolean clauses: all modern SAT solvers use a
custom memory manager for these two reasons. We only free clauses in a
very predictable manner, in large batches, relatively rarely. The
speedup is more than 10% of overall runtime thanks to the better
memory&cache behaviour.
Glad to hear it was helpful.
SAT solvers rule. I've been working with a guy who draws a SAT solver
to shoot at every second problem he meets. Much of the time the solver
gets the shit done.
Object finalization is what you sound like you want here. I'm not
sure you can call Common Lisp a "real language" anymore due to its lack
of use, but I know that the trivial-garbage package there abstracts
implementation specific finalizers which get called when the object is
deallocated. I used this in a foreign function interface I was writing
one time. Basically, I wrapped a foreign (C) struct pointer in a Lisp
class, adding a finalizer for the object in an :after method on
initialize-instance. You had to be careful not to reference the object
in the finalizer lambda, or else it would never be collected. I think it
looked something like this, (but I also know this is wrong; it's just
for an idea)
(defmethod initialize-instance :after ((obj wrapper) &key)
  (with-slots (cptr) obj
    (let ((ptr cptr)) ; close over the raw pointer, not obj itself
      (tg:finalize obj (lambda () (foreign-free ptr))))))
This made the GC "just work" with the C code. I can't imagine that
only lisp has this... But I could be wrong I guess...
I think you can do this kind of thing in most languages with an FFI;
I'm not sure if it's trivial to use with a custom allocator written in
that same language though, where some objects are allocated from pools
and some not, and those allocated from pools are "partially constructed"
and keep references to objects allocated from other pools, and there may
be circular references within those partially constructed objects, etc.
Maybe it works out just fine, I'm not concentrating on this sufficiently
right now... What's obvious to me is it works fine for an opaque blob
allocated in C that you need to deallocate when you run out of
references to it, and it may get harder if it's not an opaque blob but a
native data structure and you get to worry about its guts referencing
stuff and it's no longer mostly C's problem but fully your problem.
Yes, pool allocators are not so powerful in Java since they can't
allocate objects of different types.
Pool allocators shine in C/C++, where allocation is just syntactic
sugar around a function which takes the number of bytes you want and gives
back a pointer.
So pool allocation is just: return the current position in the pool and
increment it by the number of bytes requested (plus a check whether the
pool is full).
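That bump allocation could be sketched like so (illustrative; a real one would also round sizes up for alignment):

```cpp
#include <cstddef>

// Bump allocator over a fixed region: hand out the current position and
// advance it by the requested size, with a fullness check.
class BumpPool {
    unsigned char* base;
    std::size_t size;
    std::size_t pos = 0;
public:
    BumpPool(unsigned char* mem, std::size_t n) : base(mem), size(n) {}
    void* alloc(std::size_t bytes) {
        if (pos + bytes > size) return nullptr; // pool is full
        void* p = base + pos;
        pos += bytes;
        return p;
    }
};
```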
What you describe only works if all the objects in the pool are freed
together. If you need to free them separately, then allocating objects
of different sizes from the same memory area means you need to roll your
own malloc in that memory area. The alternative is a pool of memory
chunks sized for an object of one type, which is exactly what
Java's/similar type system would make you do.