Devlog 33 - Sorting
Lately I’ve been a bit busy, but I managed to do some research on sorting.
I wanted to make sure Lepre does not weight on the user with something so basic like sorting. I searched the web for some benchmarks, but I wasn’t lucky.
I’ve found this radix sort implementation http://erik.gorset.no/2011/04/radix-sort-is-faster-than-quicksort.html claiming to be faster than quick sort. So I’ve implemented quicksort, radix sort and insertion sort and I benchmarked them. The results were positive. Luckily Casey Muratori started coding the z sorting too in Handmade Hero, so on the its forums: https://hero.handmadedev.org/forum/code-discussion/1020-day-229-what-about-radix-sort Interestingly faster radix sort that one! Using an auxiliary array improves cache locality and makes radix sort really faster! I implemented that one too and here are the results. (time in ns on the y-axis, number of ordered elements on the x-axis)
Note: below 65336 the algorithms use 16bit unsigned and above they use 32bit ones.
As you can see radix sort without additional memory (radix) is way slower than quick sort iterative (qsi) on the 32bit range. Radix with additional memory (radix_mc) is instead faster than qsi in that range.
But let’s take a look closer in the 16bit range.
Hey, look! Radix is faster than qsi! But still radix_mc is faster. Maybe log scale could help.
So, insertion sort (ins) is actually faster than all of them below a certain threshold. Radix_mc is the faster one, above the ins threshold. So we have a winner!
Note: radix_mc times account for the reordering of the keys, so it is actually faster if you don’t need the keys to be reordered!
A question remains: do the 16bit specialization actually shortens times for radix_mc? For radix it’s clear that it does. Let’s see.
A little bit, above a certain threshold. And yeah, radix has a great benefit from that.
All these data refers to reordering an array of random data, generated with a xorshift128plus. Generally in real life usage you’ll not be adding quads with a random z. What if there are repetitions?
Qsi is the one that benefits more from repetitions. The others are practically the same.
Conclusion
The fastest sorting algorithm for 16bit unsigned values is radix_mc. We can sort 2^16 elements in under a ms. This is pretty good.
It occurs to me that rendering sprites from a 3D world would require to use 32 bit floating point numbers. We could map the z of the sprites in view to the 16bit unsigned domain. It has been done here http://stereopsis.com/radix.html for float32 to u32. So we could use 32 bit values or convert float32 to float16 ( like https://www.mathworks.com/matlabcentral/fileexchange/23173-ieee-754r-half-precision-floating-point-converter )and then to u16. I’ll have to build a test scenario and verify which is the best solution.
[ I’ve taken the qsi from https://en.wikibooks.org/wiki/Algorithm_Implementation/Sorting/Quicksort#Iterative_Quicksort ]
The last commit: https://bitbucket.org/theGiallo/liblepre/commits/99aee5b563a4e2dbda472df366b488c2a2bb65cd. (there are other new things, but I’m going to split the devlog in two)
* jumps away *









