![]() |
|
#81
|
|||
|
|||
|
[ QUOTE ]
I'll be honest guys, I'm incredibly dissapointed! I thought I had something near optimum, but I must be missing something obvious! My meak 15 million per second is not much compared to your dozens of millions! Anyway, incase any of you are interested, this is the best way to sort 7 cards. Please note, this sort algorithm I wrote beats all the classic sorting algorithms (quick sort, merge, insertion, std::Sort). It's basically a precomputed LUT for a pigeonhole sort. // Represent the hand sevenCardHand = ((__int64)1 << c1) | ((__int64)1 << c2) | ((__int64)1 << c3) | ((__int64)1 << c4) | ((__int64)1 << c5) | ((__int64)1 << c6) | ((__int64)1 << c7); // Split 64 bit int into 4 chunks of 16 bits firstChunk = (sevenCardHand & 0xFFFF); secondChunk = (sevenCardHand & 0xFFFF0000) >> 16; thirdChunk = (sevenCardHand & 0xFFFF00000000) >> 32; fourthChunk = (sevenCardHand & 0xFFFF000000000000) >> 48; // Reset pointer to sorted hand array sortedPointer = 0; // Retrieve records numToLoop = intToIndexArr_L1[firstChunk][0]; for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++) sortedHand[sortedPointer++] = intToIndexArr_L1[firstChunk][anotherCounter]; numToLoop = intToIndexArr_L1[secondChunk][0]; for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++) sortedHand[sortedPointer++] = intToIndexArr_L1[secondChunk][anotherCounter] + 16; numToLoop = intToIndexArr_L1[thirdChunk][0]; for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++) sortedHand[sortedPointer++] = intToIndexArr_L1[thirdChunk][anotherCounter] + 32; numToLoop = intToIndexArr_L1[fourthChunk][0]; for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++) sortedHand[sortedPointer++] = intToIndexArr_L1[fourthChunk][anotherCounter] + 48; [/ QUOTE ] There seems to be alot of dereferences and increments in the above method (ie: even though pigeonhole sort is O(n) there are alot of constant factors to be added for such a small n...). Have you compared it to a hard compiled Sorting Network, eg: [ QUOTE ] Network for N=7, using Bose-Nelson Algorithm: SWAP(1, 2); SWAP(0, 2); SWAP(0, 1); SWAP(3, 4); SWAP(5, 6); SWAP(3, 5); SWAP(4, 6); SWAP(4, 5); SWAP(0, 4); SWAP(0, 3); SWAP(1, 5); SWAP(2, 6); SWAP(2, 5); SWAP(1, 3); SWAP(2, 4); SWAP(2, 3); [/ QUOTE ] Even though it takes 16 compare and swap operations, it can actually be done in 6 parallel steps (ie: it's depth) and a modern compiler should be able to get close to this by utilizing CPU pipelines. Overall, for very small n I've always found them to be the most efficient sort in terms of CPU cycles. Juk [img]/images/graemlins/smile.gif[/img] |
|
#82
|
|||
|
|||
|
[ QUOTE ]
Here is my demo (210 Kb). http://www.pokerbolide.com/files/handeval7cards.exe Andrzej Nironen [/ QUOTE ] 2.53 secs on my 3.4GHz P4. My Hand_7_Eval, the source code for which is at http://www.stevebrecher.com/Software/software.html takes 3.69 secs when exercised as shown below. It uses some lookup tables whose total size is 254KB. <font class="small">Code:</font><hr /><pre> //deck is a 52-element array of __int64 //with one bit set per element int main(void) { clock_t timer; int ix1, ix2, ix3, ix4, ix5, ix6, ix7; Hand_T board1, board2, board3, board4, board5, board6; int handTypeSum[9]; Init_Hand_Eval(); InitDeck(); for (ix1 = 0; ix1 < 9; ix1++) handTypeSum[ix1] = 0; timer = clock(); for (ix1 = 0; ix1 < 46; ix1++) { board1 = deck[ix1]; for (ix2 = ix1+1; ix2 < 47; ix2++) { board2 = board1 | deck[ix2]; for (ix3 = ix2+1; ix3 < 48; ix3++) { board3 = board2 | deck[ix3]; for (ix4 = ix3+1; ix4 < 49; ix4++) { board4 = board3 | deck[ix4]; for (ix5 = ix4+1; ix5 < 50; ix5++) { board5 = board4 | deck[ix5]; for (ix6 = ix5+1; ix6 < 51; ix6++) { board6 = board5 | deck[ix6]; for (ix7 = ix6+1; ix7 < 52; ix7++) { handTypeSum[Hand_7_Eval(board6 | deck[ix7]) >> 24]++; } } } } } } } timer = clock() - timer; for (ix1 = 0; ix1 < 9; ix1++) printf("%d\n", handTypeSum[ix1]); printf("seconds = %.2f\n", (float)timer/CLOCKS_PER_SEC); } </pre><hr /> |
|
#83
|
|||
|
|||
|
<font class="small">Code:</font><hr /><pre>// Test7SortingNet.cpp : Defines the entry point for the console application.
// #include "stdafx.h" #include "random.h" #include <iostream> using namespace std; inline unsigned long long RDTSC(void) { _asm rdtsc; } #define NUM_TESTS 100000 #define NUM_ITERS 100000000 int _tmain(int argc, _TCHAR* argv[]) { // Create the tests. char Tests[NUM_TESTS][7]; for (int I=0;I<NUM_TESTS;I++) { for (int J=0;J<7;J++) { Tests[I][J]=RandInt(51); } } cout << "Tests created." << endl; // Read start cycle. unsigned long long StartTime=RDTSC(); // Run the tests. for (int I=0;I<NUM_ITERS;I++) { char* V=Tests[I%NUM_TESTS]; // Using XOR. //#define SWAP(I,J) {if (V[I]>V[J]) {V[I]^=V[J]; V[J]^=V[I]; V[I]^=V[J];}} // Using swaps. int T; #define SWAP(I,J) {if (V[I]>V[J]) {T=V[I]; V[I]=V[J]; V[J]=T;}} SWAP(0, 4); SWAP(1, 5); SWAP(2, 6); SWAP(0, 2); SWAP(1, 3); SWAP(4, 6); SWAP(2, 4); SWAP(3, 5); SWAP(0, 1); SWAP(2, 3); SWAP(4, 5); SWAP(1, 4); SWAP(3, 6); SWAP(1, 2); SWAP(3, 4); SWAP(5, 6); } // cout << (double)(RDTSC()-StartTime)/(double)NUM_ITERS << " cycles per sort." << endl; return 0; }</pre><hr /> Runs at about 70 cycles per sort atm, but the MS VC compiler isn't generating what I want: <font class="small">Code:</font><hr /><pre>;SWAP(1, 5): mov cl, BYTE PTR [eax+1] mov dl, BYTE PTR [eax+5] cmp cl, dl jle SHORT $LN15@wmain movsx ecx, cl mov BYTE PTR [eax+1], dl mov BYTE PTR [eax+5], cl</pre><hr /> The idea is that you can fit the 7 elements in 2 (or more) registers and then you don't need the "mov xx, BYTE PTR [eax+1]" instructions, but the compiler can't seem to see this and insists on storing as a vector (even if you create a 64 bit structure of 8x8 bytes and use the "register" definition prefix it does this too...). This messing about using just cl and dl is also causing pipelines to stall which wouldn't happen as much if it were using registers to hold each of the 7 elements. If I get more time I'll try and get it working as I'm pretty sure it should be possible to do in 30-40 cycles when optimized. One other thing to try is to see if the XOR swap is any faster for an AMD, as for a P4 it is exactly the same even though it compiled down to an extra instruction per SWAP(). Juk [img]/images/graemlins/smile.gif[/img] |
|
#84
|
|||
|
|||
|
that won't run in VB will it [img]/images/graemlins/smile.gif[/img] darn.
|
|
#85
|
|||
|
|||
|
Guys you discuss different "weight categories". It's obvious if not to care about RAM used a huge lookup table works best. And it's much harder to create a fast algorithm using any reasonable amount of RAM.
Andrzej Nironen |
|
#86
|
|||
|
|||
|
[ QUOTE ]
[ QUOTE ] [ QUOTE ] <font class="small">Code:</font><hr /><pre> 133784560 hands 1.188 secs 112613302 hands/sec hc 23294460 pr 58627800 tp 31433400 th 6461620 st 6180020 fl 4047644 fh 3473184 fr 224848 sf 41584 </pre><hr /> This was on an AMD 3000+ with 1 gig memory running Win2k This computer produced 133784560 hands 5.156 secs 25947355 hands/sec hc 23294460 pr 58627800 tp 31433400 th 6461620 st 6180020 fl 4047644 fh 3473184 fr 224848 sf 41584 using Andrzej Nironen's "Hand Evaluator Speed Demo" [/ QUOTE ] Very impressive! (I think you have the results mixed up though - 112.5mil when doing extra work of incrementing the hand type counts, 50mil when not... [img]/images/graemlins/smile.gif[/img] ) Have you considered posting this method on the pokersource yahoo group? Juk [img]/images/graemlins/smile.gif[/img] EDIT: Sorry, I just noticed you print the hand type counts in both procedures... BTW: How fast does it run without doing this (ie: just assigning an integer the hand evaluation rather than incrementing totals[]?) [/ QUOTE ] 0.953 secs 140382574 hands/sec However I'm doubting the granularity of the timing process used. I'll switch to the RDTSC method and repost relative values. Using RDTSC as a timer 133784560 hands 2.640 secs 50675964 hands/sec becomes 133784560 hands 5752663748 cycles 42.999 cycles/hand 133784560 hands 1.188 secs 112613302 hands/sec becomes 133784560 hands 2645152642 cycles 19.772 cycles/hand and 133784560 hands 0.953 secs 140382574 hands/sec becomes 133784560 hands 2077269818 cycles 15.527 cycles/hand [/ QUOTE ] I've kinda lost track of this thread now, but so far (out of all the algorithms listed here), which are the current best algorithms (in cycles per second & assuming no "cheating" - ie: algorithm speed measured against a 7-card hand selected uniformly randomly from all possible 7-card hands) for: a) A "packed" 64bit integer representation. b) A unordered 7-integer representation. Most of my work has used a packed representation in the past, but after seeing mykey1961's nested lookup idea running at 15 cycles per hand, I am seriously considering moving everything over to an unpacked representation to take advantage of this (~20mil hands/sec -> ~200mil hands/sec can't be ignored!). Overall this thread has opened my eyes; I'd always just taken foregranted that the pokersource lib was highly optimized and assumed only marginal speed improvements would be possible in the future... Great thread - Juk [img]/images/graemlins/smile.gif[/img] |
|
#87
|
|||
|
|||
|
Jukofyork: Thank you for the sorting network idea! Never seen them before, very interesting! It sped my code up from 15million per second to 35 million per second, I think I can squeeze a little more out of it as well!
|
|
#88
|
|||
|
|||
|
D&L, you can use the sorting network idea as well, just copy the defines into the SWAP statment.
|
|
#89
|
|||
|
|||
|
Hey mykey question on your evaluator
[ QUOTE ] Totals[Table[Table[Table[Table[Table[Table[Table[c1+52]+c2]+c3]+c4]+c5]+c6]+c7] shr 20]); [/ QUOTE ] not good with C++, but what are you representing hand strength as? a digit between 1 and 9? I notice your counters are highly efficient, because you can just use the hand rank as an index to your array. But to do that, is your hand rank just a number 1 through 9? as opposed to say, defining wich "2 pair" beat another "2 pair." Perhaps i'm just misunderstand, cause I don't know c++ P.s. how fast was your mapped array when incrementing counters, and when not? i think you flipped your times around..reporting a higher time with counters? At anyrate, i'm 2/3 of the way done making your mapped table array for comparison purposes. Mapping one 4MB array to another takes like trillions of calculations (maybe you had some shortcut)...my computers grinding away.... Only one more array to map. |
|
#90
|
|||
|
|||
|
Just realised however, in the test given the cards are all sorted, therfore it would run faster. I'm doing a real world test now with unsorted hands to test performance.
Edit: Tests sucesful! The precomputed sorting network works 2-3x faster. Thanks! |
![]() |
|
|