Quote: I'll be honest guys, I'm incredibly disappointed! I thought I had something near optimal, but I must be missing something obvious! My meager 15 million per second is not much compared to your dozens of millions!
Anyway, in case any of you are interested, this is the best way to sort 7 cards. Please note, this sort algorithm I wrote beats all the classic sorting algorithms (quicksort, merge sort, insertion sort, std::sort). It's basically a precomputed LUT for a pigeonhole sort.
There seem to be a lot of dereferences and increments in the above method (i.e. even though pigeonhole sort is O(n), the constant factors add up for such a small n).
Have you compared it to a hard-coded sorting network, e.g.:
Even though a 7-input network takes 16 compare-and-swap operations, it can actually be done in 6 parallel steps (i.e. its depth), and a modern compiler should be able to get close to this by exploiting CPU pipelining. Overall, for very small n I've always found sorting networks to be the most efficient sort in terms of CPU cycles.
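For concreteness, here is a rough sketch of what a hard-coded 7-input network could look like in C++. The comparator layout is one known arrangement with 16 comparators and depth 6, and it is not necessarily the same network referred to above. The names Hand, cswap and sort7 are made up for illustration, and storing the cards as plain ints is an assumption. The compare-exchange is written with std::min/std::max in the hope that the compiler emits branch-free code, which is not guaranteed.

```cpp
#include <algorithm>
#include <array>

// Hypothetical representation: 7 cards stored as plain ints.
using Hand = std::array<int, 7>;

// Compare-exchange: afterwards v[i] <= v[j].
inline void cswap(Hand& v, int i, int j) {
    const int lo = std::min(v[i], v[j]);
    const int hi = std::max(v[i], v[j]);
    v[i] = lo;
    v[j] = hi;
}

// 16 comparators in 6 layers. Comparators within a layer touch
// disjoint positions, so they do not depend on each other.
inline void sort7(Hand& v) {
    cswap(v, 0, 6); cswap(v, 2, 3); cswap(v, 4, 5);  // layer 1
    cswap(v, 0, 2); cswap(v, 1, 4); cswap(v, 3, 6);  // layer 2
    cswap(v, 0, 1); cswap(v, 2, 5); cswap(v, 3, 4);  // layer 3
    cswap(v, 1, 2); cswap(v, 4, 6);                  // layer 4
    cswap(v, 2, 3); cswap(v, 4, 5);                  // layer 5
    cswap(v, 1, 2); cswap(v, 3, 4); cswap(v, 5, 6);  // layer 6
}
```

Whether this actually beats the LUT approach will depend on the compiler and the CPU, so it's worth benchmarking both on the same data before drawing conclusions.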