Thoughts on light culling for clustered shading

A while back I came across the presentation Improved Culling for Tiled and Clustered Shading by Michal Drobot and it’s really clever. It inspired a bunch of thoughts that I’ve been meaning to try out but since it’s now several months later and I still haven’t got around to it, I figured I’d just jot down some notes here instead (in other words: Caution, untested braindumps ahead!).

Review

First, let’s review the basic idea. A common case of storing lights in tiled/clustered lighting is to construct a 3D grid in view space where the Z slices are placed along some roughly logarithmic spacing and X/Y is uniformly space. Then typically each cell contains a “light list” of visible lights in that cell. At rendering time all you have to do is look up the corresponding light list and shade.

That works and is easy to understand but the problem is that the memory cost scales as O(X*Y*Z*N) where X, Y, Z is the resolution of the grid and N is the number of lights. Suffice to say you typically end up seeing relatively large cells for this approach, and thus poor culling efficiency.

The aforementioned slide deck proposes something different:

  1. First store a single light list sorted by Z distance in camera space.
  2. Split the frustum along Z into an array of “Z bins” (uniformly). Each bin represents a range of Z values and stores the start and end index of lights that intersect that bin. Again, there’s just one of these.
  3. Subdivide the screen in a 2D grid and store a bitmask that indicates which lights in the light list intersect the 2D tile.

At runtime all you have to do is look up the current Z bin, which gives you a range of lights that might intersect the current pixel (based on its Z coordinate). Then you find the visibility mask from the current X/Y cell and use that to quickly reject lights that are in the right Z bin but do not intersect the current 2D tile.

In other words, the visibility gets factored into two orthogonal pieces - the Z visibility and the XY visibility, and then they are combined at runtime. The benefit of this is that the memory cost is now O(X*Y*N + Z + N), which scales a lot better and allows you to have much denser grids (and thus better culling).

This is already really clever and would probably be my “go to” if I was implementing light culling today, but the key takeaway for me is the more general idea of splitting up visibility into multiple orthogonal visibility systems and then intersecting them at runtime. Here are a few random thoughts for how to extend this idea in various ways.

Idea #1 - Split X and Y visibility as well

The core algorithm splits Z visibility and X/Y visibility. A simple extension would be to split up X and Y visibility as well.

The way this would work is that you sort lights by Z as before, but instead of storing a 2D grid of bitmasks, you store two independent 1D arrays of bitmasks for X and Y. So basically split the screen into a number of 1D ranges along X and also along Y. So for the X direction each slice represents a vertical stripe of pixels and stores a bit mask of all lights that intersect it, and analogously for the Y direction. At runtime you’d fetch both the X and Y bitmasks from the appropriate bins, AND them together and then proceed as normal.

What are the implications of this? Well, it would reduce culling accuracy in some cases because you’re now testing only axis aligned extents of the lights in 2D, rather than testing lights against 2D rectangles. On the other hand, the memory use (and culling complexity) is reduced to O(X*N + Y*N + Z + N), which may mean you could crank up the grid density even more. One extreme version is to store a bitmask for each row and one for each column of pixels (may be overkill).

Idea #2 - Go beyond 1D visibility for Z

The original idea singles out Z for special treatment, giving it and it alone an actual spatial organization which facilitiates culling. But we could also dice things differently. For example, most lights tend to live roughly in the X/Z plane (i.e. near the ground) so we could decide to organize our lights in the X/Z plane, and then use bit fields only for Y-visibility.

One way would be to sort the light list by their Z bin index instead of distance, then lights with the same Z bin index are sorted by X coordinate. Then instead of a Z-binning array, we would store a 2D grid in the X/Z plane where each cell contains a start and end index into the light list. At runtime you’d find your X/Z cell to retrieve candidate light range, and then find visibilitiy bits based on your Y coordinate to prune this set further.

This could reduce the number of lights you have to check by using a richer data structure to organize the “light ranges” while keeping memory reasonably low (it’s now O(X*Z + N + Y*N) which compares favorably to the original algorithm, but is not as good as idea #1 on this metric).

Idea #3 - Use a more flexible primary “sort axis”

The original algorithm decides to treat Z specially by sorting lists along the Z axis and using Z bins to prune down the initial set of lights (that are then pruned further by bit fields). Why does that make sense? Why single out Z and not X or Y? The reason Z is a good choice is because view frusta tend to be deeper than they are wide or tall, which means that after view frustum culling the positions of your lights will vary more in the (view space) Z direction than they will in X or Y. Thus, if you organize your lights according to their Z coordinate you’ll be able to more effectively pare down the inital range of lights to check.

However, while Z is a better default than X or Y, there’s no reason we have to pick one of the coordinate axes.

So maybe we take our lights (culled against the frustum) and perform Principal Component Analysis on their positions to extract the dominant direction of our lights. Use the 1st principal component to sort our lights, and the next 2 gives us the axes for the 2D visibility bit fields.

This would help with pathological scenarios for the original algorithm. For example if you’re staring down a highway with 1000 street lights stretching off in the horizon Z-binning would work great, most bins would only have a few lights in them. However if you move off to the side of the highway and look across it instead of down it, all of the lights end up in the same Z-bin and now you have to exhaustively check them against your X/Y bit fields (this is not as bad as it sounds since most of the lights will now be culled by the frustum culling, but still). By using the first principal component as the sort direction our lights would always be sorted along the direction of the highway, regardless of the view direction.

Idea #4 - Remove the view dependence entirely

This is really riffing on idea #3. Now that the sort direction for the lights is independent from the view direction, we could go one step further and go completely view-independent.

First, let’s add another level of organization on top. Split up the world into a rough uniform 3D grid. Each cell in this grid would be large enough that the overhead is kept low, but small enough that you could reasonably detect a dominant direction for the lights in each cell. For example, maybe each cell is 64m3 or something, or maybe you can tie to this to any other kind of top-level organization you might have (e.g. based on portals, rooms, or whatever). Each cell would then have its own independent light list and bit fields. Perform PCA on the lights inside a cell to find the optimal sort axis (and the two axes for bit fields too). At runtime first find the right 3D cell, then proceed as before by reading start and end indices according to the Z bin (which is no longer guaranteed to be aligned to Z, but let’s keep the name), etc.

The nice part about this is that we can reuse the same light organization over multiple frames (with incremental updates for dynamic lights, and occasional PCA/re-sorting).

This could be combined with the other ideas. For example a nice fit might be idea #1, having independent X and Y bit fields. In fact you may have to do so in order to keep the memory requirements managable now that we’re storing all of the lights, not just the visible ones.

Conclusion

As you can probably tell I’m pretty excited about the algorithm proposed by Drobot (you should absolutely look over the slides - lots of other good stuff too). But more than anything I got excited about the general idea behind the algorithm - splitting visibility apart into multiple pieces to reduce the dimensionality and then combininng them at runtime. Above are just a few variants on this approach but I doubt it’s exhaustive, so feel free to share more in the comments or twitter or whatever!

More Reading
sebastiansylvan