R-trees – adapting out-of-core techniques to modern memory architectures

It’s been a couple of months since GDC, and I’ve been meaning to do a blog post about the talk I gave there (and if you have a membership to the GDC Vault you can even see a video).

An R-tree is a data structure for spatial indexing that’s optimized for out-of-core scenarios (basically databases). Spatial indexing is used for storing “stuff” in space. For example, a game might store all of the game objects in a spatial index so that we can efficiently query for things like “give me all objects that are within the view frustum”, or “give me all pairs of monsters and exploding barrels within 5m of each other”.
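
To make those kinds of queries concrete, here’s a sketch of what the interface could look like. The names (`SpatialIndex`, `queryFrustum`, `queryPairs`) are hypothetical, not from any particular engine:

```cpp
#include <utility>
#include <vector>

struct AABB { float min[3], max[3]; };
struct Frustum { float planes[6][4]; };  // each plane stored as (nx, ny, nz, d)

// Hypothetical interface: the kinds of queries a game-side spatial index serves.
struct SpatialIndex {
    // "Give me all objects whose bounds touch the view frustum."
    void queryFrustum(const Frustum& f, std::vector<int>& outIds) const;

    // "Give me all pairs of objects, one from each index, within maxDist
    // of each other" -- e.g. monsters vs. exploding barrels within 5m.
    void queryPairs(const SpatialIndex& other, float maxDist,
                    std::vector<std::pair<int, int>>& outPairs) const;
};
```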

So, spatial indices are useful for lots of things, especially in games. The problem is that most spatial index structures use tons of small nodes and pointers, and on architectures like the Xbox 360 or PS3, where the L2 cache is small, memory access is slow, and the CPU is too dumb to reorder instructions on the fly, the cost of following a pointer can be massive. An L2 cache miss on the 360 is around 600 cycles, and that’s when you have immediate access to the memory system; if something else is putting pressure on it, you may have to wait even longer. In comparison, an L1 cache miss that hits in L2 costs about 40 cycles. As Allan Murphy said in one of the other GDC talks: “How many things do you know of that can routinely give you a 15x speed boost?”. In fact, the ratio is even higher than that, since L2-friendly algorithms tend to be reasonably L1-friendly too, so you generally don’t get an L1 miss on each memory access if you’ve made sure the L2 cache is hot.

So, the main idea of my talk is that we need to use different algorithms on these “pointer-challenged” architectures in order to minimize L2 cache misses. Basically, we need to start treating memory like we treat disk. People have been writing disk-based data structures for out-of-core scenarios (e.g. databases) for decades, and R-trees are an example of one such data structure. So what I’ve done is essentially take that data structure and shift it up the memory hierarchy one level, so that it minimizes memory access instead of disk access.

That’s the rough overview. I won’t repeat my whole talk here, so I’ll just link the slides and ask that you flip through them if you didn’t see the talk, before continuing on to the next section. Here they are.

Hindsight

Looking back on this, there are a number of things I would’ve wanted to talk about but simply didn’t have time for in the 25 minutes I had available. This is essentially a brain dump of various things I wanted to flesh out, not necessarily a thorough treatment of each of the issues.

First, it’s worth discussing the breadth-first versus depth-first traversal issue in more detail. Breadth-first does mean we can pre-fetch nodes, because unless the “traversal queue” is empty we know ahead of time what the next node will be. What I glossed over in the talk is that breadth-first traversal tends to need more storage for the traversal queue itself, and may therefore cause extra cache misses of its own. Switching between BFS and DFS is a simple matter of switching between a queue and a stack though, so it’s easy enough to benchmark your particular scenario to see if you happen to hit a case where BFS doesn’t lead to a speedup anymore. In general the queue doesn’t really get that full anyway: the branching factor is so high that an R-tree has comparatively few internal nodes.
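
Here’s a sketch of that queue/stack switch, with the BFS prefetch included. The node layout and the visit function are left abstract, and the prefetch macro is a stand-in for the platform intrinsic:

```cpp
#include <cstddef>
#include <vector>

struct RTreeNode;  // a wide node: many children, sized to whole cache lines

// Tests a node's children against the query and pushes surviving children
// onto 'work'. Declaration only; the body isn't the point here.
void visit(const RTreeNode* node, std::vector<const RTreeNode*>& work);

// Stand-in for the platform prefetch instruction
// (e.g. __dcbt on Xbox 360, _mm_prefetch on x86).
#define PREFETCH(p) __builtin_prefetch(p)

enum class Order { BreadthFirst, DepthFirst };

// BFS and DFS differ only in which end of the container we pop from.
// With BFS the next node to visit is already known, so it can be
// prefetched while the current node is being tested.
void traverse(const RTreeNode* root, Order order) {
    std::vector<const RTreeNode*> work;
    work.push_back(root);
    std::size_t head = 0;  // BFS front index (cheaper than erasing the front)
    while (head < work.size()) {
        const RTreeNode* node;
        if (order == Order::BreadthFirst) {
            node = work[head++];
            if (head < work.size())
                PREFETCH(work[head]);  // we know what comes next: prefetch it
        } else {
            node = work.back();        // DFS: just pop from the other end
            work.pop_back();
        }
        visit(node, work);
    }
}
```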

Then there were a couple of optimizations that I briefly mentioned but didn’t go into detail on, for example unrolling the AABB tests to avoid VMX latency. It’s worth noting that you need two different AABB/frustum tests here. For internal nodes you need to test both the “near” and “far” points on the bounding box, because it matters whether an AABB is entirely outside, entirely inside, or intersecting. If it’s entirely inside the frustum you can skip any further tests and just add all of the children directly (see further discussion on this later), but if it’s intersecting you need to add the children to the traversal queue/stack. For leaf nodes, however, you do not care whether the frustum is intersecting an AABB; you only care whether the AABB is entirely outside the frustum or not, which means you can test only one point instead of two. This simplified test also means you have to unroll more nodes in order to hide all the VMX latency (because we’re doing less work per AABB). Because of the way the hierarchy looks, you’ll typically spend the vast majority of your time testing leaf nodes, so optimizing for that case with the simpler test gives a major speed boost.
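
Here’s a scalar sketch of the two tests. The “near” and “far” points correspond to what’s often called the n-vertex and p-vertex of the box; the plane convention (normals pointing inward) is my assumption, and the talk’s version would of course be unrolled over several AABBs in VMX rather than written like this:

```cpp
struct AABB { float min[3], max[3]; };
struct Plane { float n[3], d; };  // inside when dot(n, x) + d >= 0

enum TestResult { Outside, Intersecting, Inside };

// Internal nodes: test both the "near" (n-vertex) and "far" (p-vertex)
// corners against each plane, because fully-inside vs intersecting matters.
TestResult classify(const AABB& b, const Plane* planes, int numPlanes) {
    TestResult r = Inside;
    for (int i = 0; i < numPlanes; ++i) {
        const Plane& p = planes[i];
        float pv = p.d, nv = p.d;
        for (int a = 0; a < 3; ++a) {
            // p-vertex: corner furthest along the plane normal;
            // n-vertex: corner furthest against it.
            pv += p.n[a] * (p.n[a] >= 0 ? b.max[a] : b.min[a]);
            nv += p.n[a] * (p.n[a] >= 0 ? b.min[a] : b.max[a]);
        }
        if (pv < 0) return Outside;    // even the far corner is outside
        if (nv < 0) r = Intersecting;  // box straddles this plane
    }
    return r;
}

// Leaf nodes: only outside vs not-outside matters, so testing the
// p-vertex alone is enough -- half the work per AABB.
bool outside(const AABB& b, const Plane* planes, int numPlanes) {
    for (int i = 0; i < numPlanes; ++i) {
        const Plane& p = planes[i];
        float pv = p.d;
        for (int a = 0; a < 3; ++a)
            pv += p.n[a] * (p.n[a] >= 0 ? b.max[a] : b.min[a]);
        if (pv < 0) return true;
    }
    return false;
}
```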

Once we have determined that an internal node is entirely within the frustum, we need to add all of the leaf nodes underneath it to the output without further testing. Currently I do this by just putting it on a separate queue and traversing it at the end without tests. Unfortunately, because there are no tests, there’s no work available to hide the memory latency here. One optimization would be to link together all of the leaf nodes in a linked list (in pre-order traversal order), and then in each internal node store pointers to the first and last leaves in that subtree. The closer to the root you get, the larger this range would be. This way we can skip the traversal required to actually find the leaf nodes – when an internal node is determined to be entirely within the frustum, we simply add its two “range pointers” to the queue, and then do a linked-list traversal without having to process any more internal nodes.
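
A sketch of what that could look like (the field and function names here are mine, not from the talk):

```cpp
#include <vector>

// All leaves are linked together in pre-order across the whole tree;
// each internal node stores the first and last leaf of its subtree.
struct Leaf {
    Leaf* next;        // next leaf in the global pre-order chain
    // ... object bounds / ids ...
};

struct Internal {
    Leaf* firstLeaf;   // first leaf under this subtree
    Leaf* lastLeaf;    // last leaf under this subtree
    // ... child pointers, child AABBs ...
};

// When an internal node tests fully inside the frustum, emit its whole
// leaf range without touching any more internal nodes.
void emitRange(const Internal& n, std::vector<const Leaf*>& out) {
    for (const Leaf* l = n.firstLeaf; ; l = l->next) {
        out.push_back(l);
        if (l == n.lastLeaf) break;
    }
}
```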

Another interesting thing that struck me while working on this is how neglected spatial joins usually are in games. We have tree-like structures everywhere, but if we want to find the closest cover for each enemy, or the N closest lights for each object, we generally do an O(n log n)-style search where we loop through each of the objects and do an O(log n) search in the tree for each one. Reading about R-trees, which come from the database world, you’re bound to come across joins, as they are the bread and butter of database operations, and that’s how I realised that we’ve been neglecting this really cool tool. We should start adding spatial joins to our spatial indices, as they’re pretty useful and can prune far more aggressively than doing a separate O(log n) query per object. Basically, the implementation just steps down two trees in lock step, and at each level scans through all the possible pairs of sub-trees, determines if those two sub-trees are close enough to need further processing, and if so adds them to the stack/queue. The reason this is faster is that we can say, effectively, “this big part of the level is so far away from that big part of the level that no object in one could interact with any object in the other”.
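
A minimal sketch of such a join, assuming a simple pointer-based node layout rather than the flat, cache-friendly layout from the talk:

```cpp
#include <utility>
#include <vector>

struct AABB { float min[3], max[3]; };

struct Node {
    AABB bounds;
    std::vector<Node*> children;  // empty => leaf
    int objectId;                 // valid for leaves in this sketch
};

// True if the two boxes are within maxDist of each other on every axis.
bool closeEnough(const AABB& a, const AABB& b, float maxDist) {
    for (int i = 0; i < 3; ++i) {
        if (a.min[i] - b.max[i] > maxDist) return false;
        if (b.min[i] - a.max[i] > maxDist) return false;
    }
    return true;
}

// Lock-step descent: expand a pair of subtrees at a time and keep only
// the child pairs that are still close enough to possibly contain results.
void spatialJoin(Node* a, Node* b, float maxDist,
                 std::vector<std::pair<int, int>>& out) {
    std::vector<std::pair<Node*, Node*>> stack{{a, b}};
    while (!stack.empty()) {
        auto [x, y] = stack.back();
        stack.pop_back();
        if (!closeEnough(x->bounds, y->bounds, maxDist)) continue;
        if (x->children.empty() && y->children.empty()) {
            out.emplace_back(x->objectId, y->objectId);   // candidate pair
        } else if (x->children.empty()) {
            for (Node* c : y->children) stack.push_back({x, c});
        } else if (y->children.empty()) {
            for (Node* c : x->children) stack.push_back({c, y});
        } else {
            for (Node* cx : x->children)
                for (Node* cy : y->children)
                    stack.push_back({cx, cy});
        }
    }
}
```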

One aspect of AABBs that I didn’t really touch on is the actual “AABB” part of them. It’s tempting to think that you should save space and ALU by using a bounding sphere instead of an AABB, but this is almost certainly a mistake. The reason is that AABBs are pretty good at containing other AABBs with minimal space wastage, whereas spheres absolutely suck at containing other spheres, and in a hierarchical setting this is a key property. You might go even further in this direction and use an 8-DOP, which is even better at packing a bunch of 8-DOP children (and only requires two more floats per node), but I do think that AABBs are probably about as simple as you can go without seriously sacrificing query performance due to wasteful bounding volumes. That said, spheres can be very good at packing the leaf objects themselves, so you could potentially switch to spheres in the leaves, but then use AABBs for the internal bounding volumes. I haven’t tried this yet, but it seems like it could save some space and improve the efficiency of the leaf-node tests (which constitute the bulk of the tests).
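
A quick worked example of why sphere-in-sphere packing is so bad: take two unit spheres side by side and compare the wasted volume of their enclosing sphere against the enclosing AABB of their two bounding boxes:

```cpp
#include <cstdio>

// Two unit spheres at (+/-1, 0, 0): the enclosing sphere must reach
// x = +/-2, so its radius doubles and its volume grows 8x. The enclosing
// AABB of the two 2x2x2 boxes, by contrast, fits exactly.
int main() {
    const float pi = 3.14159265f;
    float childSpheres = 2.0f * (4.0f / 3.0f) * pi;                  // r = 1, twice
    float parentSphere = (4.0f / 3.0f) * pi * 2.0f * 2.0f * 2.0f;    // r = 2
    std::printf("sphere waste factor: %.1fx\n", parentSphere / childSpheres); // 4.0x
    float childBoxes = 2.0f * 8.0f;      // two 2x2x2 boxes
    float parentBox = 4.0f * 2.0f * 2.0f; // 4x2x2 union, exact fit
    std::printf("aabb waste factor:   %.1fx\n", parentBox / childBoxes);      // 1.0x
    return 0;
}
```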

Also, I’d obviously like to implement this on the GPU. It seems very well suited to that kind of thing (since each node is very wide, you could even process nodes in a SIMD fashion). In particular I’d like to try it for ray tracing (tracing a single ray using all N lanes of the ALU cores to process each node, one lane per child AABB, instead of resorting to things like packet tracing), and maybe even a FEM-based global illumination renderer (“find me all pairs of hierarchical surface elements that are close enough to either transmit radiance or occlude each other”).

Other benefits

There are two major benefits that I feel I didn’t really emphasize enough in the talk. The first is that R-trees are just plain easy to implement. I mean, there are tons of variations in all the different parts, and some of these variations are quite tricky to get right, but the final configuration I ended up with is very simple. So even if you’re not interested in cache performance but need a spatial index, I’d still recommend R-trees over just about any other system. Wide trees are usually considered more complicated than binary ones, but in practice it turns out to be the other way around. Once a node has enough children, they give you a good estimate of what the entire subtree beneath it looks like, which means you can do node splitting by looking only at the immediate children. With a binary tree, however, the immediate children tell you almost nothing, so if you want to split a node you’d have to do some serious analysis of the whole subtree. In the end, this leads to R-trees having a very simple, dynamic, and incremental construction strategy based on node splitting that still produces very good hierarchies.
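
To make that concrete, here’s one simple split strategy that only looks at the immediate children. This is a sketch of the general idea, not necessarily the exact variant I ended up with:

```cpp
#include <algorithm>
#include <vector>

struct AABB { float min[3], max[3]; };

float center(const AABB& b, int axis) {
    return 0.5f * (b.min[axis] + b.max[axis]);
}

// When a node overflows, pick the axis where the children's centers are
// most spread out, sort the children along it, and give each half to one
// of the two resulting nodes. No subtree analysis needed.
void splitNode(std::vector<AABB>& children,
               std::vector<AABB>& left, std::vector<AABB>& right) {
    int bestAxis = 0;
    float bestSpread = -1.0f;
    for (int axis = 0; axis < 3; ++axis) {
        float lo = 1e30f, hi = -1e30f;
        for (const AABB& c : children) {
            lo = std::min(lo, center(c, axis));
            hi = std::max(hi, center(c, axis));
        }
        if (hi - lo > bestSpread) { bestSpread = hi - lo; bestAxis = axis; }
    }
    std::sort(children.begin(), children.end(),
              [bestAxis](const AABB& a, const AABB& b) {
                  return center(a, bestAxis) < center(b, bestAxis);
              });
    std::size_t half = children.size() / 2;
    left.assign(children.begin(), children.begin() + half);
    right.assign(children.begin() + half, children.end());
}
```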

The other benefit that I didn’t focus enough on isn’t really unique to R-trees but applies to AABB-trees in general, and it’s that they’re pretty bullet-proof. Other strategies have all sorts of corner cases that are painful to deal with (KD-trees need splitting, uniform grids need some way of dealing with objects that cover multiple cells and with large empty areas, etc.), but AABB-trees just kind of adapt flexibly to the input and work just as well for pretty much anything you throw at them. That’s a very nice property to have for your main scene organisation structure.

Performance

Of course, the biggest omission in my talk was rigorous performance comparisons. One reason for this was simply lack of time in the talk itself. I only really had time for one slide, so I tried to pick one that would illustrate the general “shape” of the curve for various parameters and trade-offs for R-trees themselves. The problem with that slide is that it only compares R-trees against other R-trees. I would’ve liked to compare them to other strategies, in particular binary AABB-trees or BIH-trees (since they have roughly the same strengths as R-trees), implemented with all the usual cache tricks to improve performance. I suspect that R-trees are a lot faster, at least on the Xbox 360 with its high cost of cache misses. The other techniques try to improve cache performance very indirectly (by compressing nodes, rearranging them in memory, etc., to improve the probability of a cache hit, without really attempting to maximize memory bandwidth), whereas R-trees improve cache performance directly, by explicitly fetching “blocks” of memory into the cache at near-maximum memory bandwidth, just when needed, and not fetching (much) data that’s not actually used.

So, does this mean that I have no clue how fast R-trees are? No, of course not, I just don’t have a rigorous and publishable clue. The next paragraph is therefore very tentative and should be taken with a few grains of salt.

I know they’re faster than the old method of visibility tests in Red (the engine used for all Rare games these days, based on the Banjo engine), because I can do rough comparisons of roughly the same workloads. That system is essentially just a uniform grid. The problem is that it’s a uniform grid that hasn’t been optimized with nearly the same effort as the R-trees have (in particular, it makes no effort to pre-fetch memory, relying entirely on its relatively benign data layout instead), so I wouldn’t feel comfortable concluding that R-trees are better than uniform grids in general. For our own engine, though, I feel fairly comfortable saying that R-trees perform better than the old system (as well as scaling better, using less memory, and generally having no real annoying corner cases that have to be worked around, unlike uniform grids). Prior to the uniform grid (which was implemented for Banjo) we had a fairly standard AABB system, and because we switched away from that for performance reasons, I’d transitively expect R-trees to beat that system in a real benchmark too. But again, I couldn’t claim that the old AABB implementation was in any way optimized, so I’m not prepared to make any definite statements about it in general.

This may seem like a shocking lack of academic rigour on my part, and maybe it is, but at the end of the day I’m not a researcher, and I don’t always have time to implement every conceivable data structure for performance evaluation, so I have to make an educated guess at where the best performance will be found, and then go for it. If someone is looking for an interesting MSc thesis, though, I’d be very interested in seeing up-to-date benchmarks on recent consoles (against the data structures mentioned above, but also against others like hash grids or regular uniform grids, even though they aren’t necessarily comparable w.r.t. flexibility and features). It would also be interesting to see how these cache-oriented optimizations fare on modern PCs, which aren’t nearly as “pointer-challenged” – perhaps you could even identify a specific threshold (in “cycles per cache miss”) where you’re better off using cache-oriented data structures.
