After some discussion on reddit recently the issue of uniform representation came up, and how it’s seemingly necessary to do GC, and what problems it causes. While I’m 99% sure I didn’t invent any of this, I can’t seem to find it while searching so I’ll just write down a quite note about how to do GC without enfocing uniform representation of objects.
So what is uniform representation? Basically, it means that all the members in an object are stored as “references” to their own separate allocations instead of embedded directly in their owner. Why do we need this? Well, the GC needs to be able to trace through the object graph in order to mark them as in-use. This means that all pointers need to point to the start of an object allocation so that we can easily retrieve its meta data, mark bit, and so on. If we were to allow objects to be stored embedded directly within a parent object, then any reference to that object would be pointing to some random place in the middle of an allocation and we wouldn’t be able to find the corresponding meta data in order to recursively trace the target object.
The problem with uniform representation
So while uniform representation makes the GC’s life a bit easier, it also makes it a lot harder. In particular it leads to very pointer-heavy data structures. Since every member of a class that isn’t a value type has to be a reference, you typically get lots of tiny classes that just contains a bunch of pointers to other tiny classes and so on. The heap becomes full of pointers. This wasteful from a storage perspective (all those pointers, and allocation meta data overhead, soak up memory and cache line space that could be better used elsewhere). It’s also expensive from a tracing perspective because there are just more pointers to trace.
Thus it would be far better if objects could be stored within other objects (like in C++), without requiring a separate heap allocation. This would lead to less space wasted storing pointers, fewer cache misses due to things being stored more compactly, and less space wasted in allocation meta data. It would also make the GC run faster due to there being fewer edges in the object graph.
Again, I’m pretty sure I didn’t invent either of these techinques, but maybe I did. Here are the options I can think of in order to support the storage of objects “embedded” within other objects, and still let code treat them like any other object (e.g. it’s fine to store a reference to an object that’s embedded within another object, the GC will appropriately keep the parent object alive to ensure the allocation isn’t freed before its time).
Use a back-chain of header words
Basically, store a header word even for objects that are stored within another allocation. But mark it to be “dummy” somehow, and just have it refer back to the “real” header word (the one that belongs to the “parent” allocation). The construction of the parent object is responsible for setting up all these “back-chaining” header words. When the GC follows a pointer to an object it would detect such a dummy header and then just do an extra hop to get the real allocation header. From there it would get the rest of the meta data it needs to continue tracing (e.g. type maps and so on) just like normal.
This is simple, but also potentially a bit wasteful. Since we’re storing header words even for embedded objects we’re giving up one of the key benefits we were hoping to get from allowing objectcs to be stored embedded in other objects.
Store a start-of-allocation bitmap
This is also pretty simple. We allocate a bitmap for the entire heap that’s zero everywhere exept for the start of each allocation. If we enforce that allocations are e.g. 16 byte aligned, then this adds only 1/128=0.8% overhead, so in practice it’s likely to cause less overhead than the previous technique.
When the GC follows a pointer, it looks at the corresponding bitmap position and scans backwards to find the closest set bit (using _BitScan intrinsics). That’s the start of the parent object. Once we have this start we can find the header word and other meta data as normal.
Doing this kind of open ended reverse scan might seem a bit costly, but remember that a single 64 bit _BitScan can cover 64*16=1024 bytes of data, so for most objects we’d expect to find the start within a single test. If the first such _BitScan is unable to find a start-of-allocation bit set, then we could switch to an SSE routine that tests 128 or 256 bits at a time in order to very rapidly find make our way backwards in the heap and find the start of the allocation.
We could also treat large allocations separately just to avoid worst-case scenarios here. For those objects we could them in a segregated adress space so we can identify large objects by just looking at the pointer adress. For this large object heap we can increase the minimum allocation size so that each bit in the start-of-allocation bitmap corresponds to a large chunk of memory (e.g. 4KB). That would speed up the bit-scanning operation. We could also use a completely different strategy altogether and just store “shortcut” pointers in a separate table that maps each 4KB page to a pointer to the start of the corresponding allocation (this implies that these “large objects” are 4KB aligned on both ends so that a page only ever refers to one allocation).
Storea a is-pointer bitmap
This is similar to the previous idea, but in this case we assume that there are no header words in our heap. This can save even more memory, but might not be as useful in systems that use header words for other book-keeping (e.g. locking, reflection, etc.).
We still need a start-of-allocation bit like before so that we know where the allocation starts. Just like before this is necessary because when we follow a pointer we might be pointing into the middle of an allocation so to be sure that we’re tracing the whole object we have to “rewind” the pointer a bit.
We also need a mark bit, but since we no longer have a header where this can go, we simply store a mark bit for every QWORD. This might be a bit wasteful since you’re storing bits for every QWORD even though you only need one for each allocation, but depending on typical object sizes the overhead could be managable (it probalby leads to a faster “sweep” phase of a GC since you can do those entirely within the bitmaps, not touching the actual heap at all).
So for every QWORD we store two bits. The first one is the start-of-allocation bit, and the second one is the mark bit. However, since the mark bit is only needed if the start-of-allocation bit is set, we can reuse it to indicate the end of the allocation when the start-of-allocation bit isn’t set. I.e. if the first bit in a pair is set then you’re looking at the start of the object (and in that case the second bit is the makr bit), and if it the first bit is clear then the second bit will tell if you’re looking at the last QWORD of the object (note: this implies that an allocation is always at least 2 QWORDs).
So, with this scheme we have a way to “rewind” a pointer to the start of an allocation by looking at a bitmask, and we also have a way of finding the end of the object. Then to trace the object we look at yet another bitmap that for every 8 bytes indicates whether it represents a pointer. These bits get filled in when an object is allocated using per-type meta data (meaning we don’t need to find any per-type meta data during tracing, it all gets “baked in” to this is-a-pointer bitmap).
So while tracing, we find the start of the allocation a reference is pointing to, and then just look at the is-a-pointer bitmap to find pointers until we hit the “end of object” bit pattern.
While the is-a-pointer bitmap has to store 1 bit for every 8 bytes (on a 64 bit system), the start-of-allocation and mark bitmaps can be at a coarser granularity for reduced bitmap overhead (at the expense of inflating the minimum allocation size). Even at 3 bits per 8 bytes, we’re “only” talking about 4.7% overhead which isn’t a whole ton in comparison to what you might see for a header word (at least for small-to-medium sized objects).