7 Card Hand Evaluators - Page 9

jukofyork · #81 12-28-2006, 06:25 PM

[ QUOTE ]
I'll be honest guys, I'm incredibly dissapointed! I thought I had something near optimum, but I must be missing something obvious! My meak 15 million per second is not much compared to your dozens of millions!

Anyway, incase any of you are interested, this is the best way to sort 7 cards. Please note, this sort algorithm I wrote beats all the classic sorting algorithms (quick sort, merge, insertion, std::Sort). It's basically a precomputed LUT for a pigeonhole sort.

// Represent the hand
sevenCardHand = ((__int64)1 << c1) | ((__int64)1 << c2) | ((__int64)1 << c3) | ((__int64)1 << c4) | ((__int64)1 << c5) | ((__int64)1 << c6) | ((__int64)1 << c7);

// Split 64 bit int into 4 chunks of 16 bits
firstChunk = (sevenCardHand & 0xFFFF);
secondChunk = (sevenCardHand & 0xFFFF0000) >> 16;
thirdChunk = (sevenCardHand & 0xFFFF00000000) >> 32;
fourthChunk = (sevenCardHand & 0xFFFF000000000000) >> 48;

// Reset pointer to sorted hand array
sortedPointer = 0;

// Retrieve records
numToLoop = intToIndexArr_L1[firstChunk][0];
for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++)
sortedHand[sortedPointer++] = intToIndexArr_L1[firstChunk][anotherCounter];

numToLoop = intToIndexArr_L1[secondChunk][0];
for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++)
sortedHand[sortedPointer++] = intToIndexArr_L1[secondChunk][anotherCounter] + 16;

numToLoop = intToIndexArr_L1[thirdChunk][0];
for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++)
sortedHand[sortedPointer++] = intToIndexArr_L1[thirdChunk][anotherCounter] + 32;

numToLoop = intToIndexArr_L1[fourthChunk][0];
for(anotherCounter = 1; anotherCounter < numToLoop; anotherCounter++)
sortedHand[sortedPointer++] = intToIndexArr_L1[fourthChunk][anotherCounter] + 48;

[/ QUOTE ]
There seems to be alot of dereferences and increments in the above method (ie: even though pigeonhole sort is O(n) there are alot of constant factors to be added for such a small n...).

Have you compared it to a hard compiled Sorting Network, eg:

[ QUOTE ]
Network for N=7, using Bose-Nelson Algorithm:
SWAP(1, 2);
SWAP(0, 2);
SWAP(0, 1);
SWAP(3, 4);
SWAP(5, 6);
SWAP(3, 5);
SWAP(4, 6);
SWAP(4, 5);
SWAP(0, 4);
SWAP(0, 3);
SWAP(1, 5);
SWAP(2, 6);
SWAP(2, 5);
SWAP(1, 3);
SWAP(2, 4);
SWAP(2, 3);

[/ QUOTE ]

Even though it takes 16 compare and swap operations, it can actually be done in 6 parallel steps (ie: it's depth) and a modern compiler should be able to get close to this by utilizing CPU pipelines. Overall, for very small n I've always found them to be the most efficient sort in terms of CPU cycles.

Juk [img]/images/graemlins/smile.gif[/img]

Steve Brecher · #82 12-28-2006, 06:39 PM

[ QUOTE ]

Here is my demo (210 Kb).

http://www.pokerbolide.com/files/handeval7cards.exe

Andrzej Nironen

[/ QUOTE ]
2.53 secs on my 3.4GHz P4.

My Hand_7_Eval, the source code for which is at
http://www.stevebrecher.com/Software/software.html
takes 3.69 secs when exercised as shown below. It uses some lookup tables whose total size is 254KB.

Code:<hr /><pre>
//deck is a 52-element array of __int64
//with one bit set per element
int main(void)
{
clock_t timer;

int ix1, ix2, ix3, ix4, ix5, ix6, ix7;
Hand_T board1, board2, board3, board4, board5, board6;
int handTypeSum[9];

Init_Hand_Eval();
InitDeck();
for (ix1 = 0; ix1 < 9; ix1++)
handTypeSum[ix1] = 0;

timer = clock();

for (ix1 = 0; ix1 < 46; ix1++) {
board1 = deck[ix1];
for (ix2 = ix1+1; ix2 < 47; ix2++) {
board2 = board1 | deck[ix2];
for (ix3 = ix2+1; ix3 < 48; ix3++) {
board3 = board2 | deck[ix3];
for (ix4 = ix3+1; ix4 < 49; ix4++) {
board4 = board3 | deck[ix4];
for (ix5 = ix4+1; ix5 < 50; ix5++) {
board5 = board4 | deck[ix5];
for (ix6 = ix5+1; ix6 < 51; ix6++) {
board6 = board5 | deck[ix6];
for (ix7 = ix6+1; ix7 < 52; ix7++) {
handTypeSum[Hand_7_Eval(board6 | deck[ix7]) >> 24]++;
}
}
}
}
}
}
}

timer = clock() - timer;

for (ix1 = 0; ix1 < 9; ix1++)
printf("%d\n", handTypeSum[ix1]);
printf("seconds = %.2f\n", (float)timer/CLOCKS_PER_SEC);
}
</pre><hr />

jukofyork · #83 12-28-2006, 07:51 PM

Code:<hr /><pre>// Test7SortingNet.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "random.h"
#include <iostream>

using namespace std;

inline unsigned long long RDTSC(void)
{
_asm rdtsc;
}

#define NUM_TESTS 100000
#define NUM_ITERS 100000000

int _tmain(int argc, _TCHAR* argv[])
{

// Create the tests.
char Tests[NUM_TESTS][7];
for (int I=0;I<NUM_TESTS;I++) {
for (int J=0;J<7;J++) {
Tests[I][J]=RandInt(51);
}
}
cout << "Tests created." << endl;

// Read start cycle.
unsigned long long StartTime=RDTSC();

// Run the tests.
for (int I=0;I<NUM_ITERS;I++) {

char* V=Tests[I%NUM_TESTS];

// Using XOR.
//#define SWAP(I,J) {if (V[I]>V[J]) {V[I]^=V[J]; V[J]^=V[I]; V[I]^=V[J];}}

// Using swaps.
int T;
#define SWAP(I,J) {if (V[I]>V[J]) {T=V[I]; V[I]=V[J]; V[J]=T;}}

SWAP(0, 4);
SWAP(1, 5);
SWAP(2, 6);
SWAP(0, 2);
SWAP(1, 3);
SWAP(4, 6);
SWAP(2, 4);
SWAP(3, 5);
SWAP(0, 1);
SWAP(2, 3);
SWAP(4, 5);
SWAP(1, 4);
SWAP(3, 6);
SWAP(1, 2);
SWAP(3, 4);
SWAP(5, 6);

}

//
cout << (double)(RDTSC()-StartTime)/(double)NUM_ITERS << " cycles per sort." << endl;

return 0;
}</pre><hr />

Runs at about 70 cycles per sort atm, but the MS VC compiler isn't generating what I want:

Code:<hr /><pre>;SWAP(1, 5):
mov cl, BYTE PTR [eax+1]
mov dl, BYTE PTR [eax+5]
cmp cl, dl
jle SHORT $LN15@wmain
movsx ecx, cl
mov BYTE PTR [eax+1], dl
mov BYTE PTR [eax+5], cl</pre><hr />
The idea is that you can fit the 7 elements in 2 (or more) registers and then you don't need the "mov xx, BYTE PTR [eax+1]" instructions, but the compiler can't seem to see this and insists on storing as a vector (even if you create a 64 bit structure of 8x8 bytes and use the "register" definition prefix it does this too...).

This messing about using just cl and dl is also causing pipelines to stall which wouldn't happen as much if it were using registers to hold each of the 7 elements.

If I get more time I'll try and get it working as I'm pretty sure it should be possible to do in 30-40 cycles when optimized.

One other thing to try is to see if the XOR swap is any faster for an AMD, as for a P4 it is exactly the same even though it compiled down to an extra instruction per SWAP().

Juk [img]/images/graemlins/smile.gif[/img]

_D&L_ · #84 12-28-2006, 08:49 PM

that won't run in VB will it [img]/images/graemlins/smile.gif[/img] darn.

A.Nironen · #85 12-29-2006, 02:07 AM

Guys you discuss different "weight categories". It's obvious if not to care about RAM used a huge lookup table works best. And it's much harder to create a fast algorithm using any reasonable amount of RAM.

Andrzej Nironen

jukofyork · #86 12-29-2006, 04:23 AM

[ QUOTE ]
[ QUOTE ]
[ QUOTE ]

Code:<hr /><pre>
133784560 hands
1.188 secs
112613302 hands/sec
hc 23294460
pr 58627800
tp 31433400
th 6461620
st 6180020
fl 4047644
fh 3473184
fr 224848
sf 41584

</pre><hr />

This was on an AMD 3000+ with 1 gig memory running Win2k

This computer produced

133784560 hands
5.156 secs
25947355 hands/sec
hc 23294460
pr 58627800
tp 31433400
th 6461620
st 6180020
fl 4047644
fh 3473184
fr 224848
sf 41584

using Andrzej Nironen's "Hand Evaluator Speed Demo"

[/ QUOTE ]
Very impressive! (I think you have the results mixed up though - 112.5mil when doing extra work of incrementing the hand type counts, 50mil when not... [img]/images/graemlins/smile.gif[/img] )

Have you considered posting this method on the pokersource yahoo group?

Juk [img]/images/graemlins/smile.gif[/img]

EDIT: Sorry, I just noticed you print the hand type counts in both procedures... BTW: How fast does it run without doing this (ie: just assigning an integer the hand evaluation rather than incrementing totals[]?)

[/ QUOTE ]

0.953 secs
140382574 hands/sec

However I'm doubting the granularity of the timing process used.

I'll switch to the RDTSC method and repost relative values.

Using RDTSC as a timer

133784560 hands
2.640 secs
50675964 hands/sec

becomes

133784560 hands
5752663748 cycles
42.999 cycles/hand

133784560 hands
1.188 secs
112613302 hands/sec

becomes

133784560 hands
2645152642 cycles
19.772 cycles/hand

and

133784560 hands
0.953 secs
140382574 hands/sec

becomes

133784560 hands
2077269818 cycles
15.527 cycles/hand

[/ QUOTE ]
I've kinda lost track of this thread now, but so far (out of all the algorithms listed here), which are the current best algorithms (in cycles per second & assuming no "cheating" - ie: algorithm speed measured against a 7-card hand selected uniformly randomly from all possible 7-card hands) for:

a) A "packed" 64bit integer representation.
b) A unordered 7-integer representation.

Most of my work has used a packed representation in the past, but after seeing mykey1961's nested lookup idea running at 15 cycles per hand, I am seriously considering moving everything over to an unpacked representation to take advantage of this (~20mil hands/sec -> ~200mil hands/sec can't be ignored!).

Overall this thread has opened my eyes; I'd always just taken foregranted that the pokersource lib was highly optimized and assumed only marginal speed improvements would be possible in the future...

Great thread - Juk [img]/images/graemlins/smile.gif[/img]

Gullanian · #87 12-29-2006, 11:12 AM

Jukofyork: Thank you for the sorting network idea! Never seen them before, very interesting! It sped my code up from 15million per second to 35 million per second, I think I can squeeze a little more out of it as well!

Gullanian · #88 12-29-2006, 11:13 AM

D&L, you can use the sorting network idea as well, just copy the defines into the SWAP statment.

_D&L_ · #89 12-29-2006, 11:17 AM

Hey mykey question on your evaluator

[ QUOTE ]

Totals[Table[Table[Table[Table[Table[Table[Table[c1+52]+c2]+c3]+c4]+c5]+c6]+c7] shr 20]);

[/ QUOTE ]

not good with C++, but what are you representing hand strength as? a digit between 1 and 9? I notice your counters are highly efficient, because you can just use the hand rank as an index to your array. But to do that, is your hand rank just a number 1 through 9? as opposed to say, defining wich "2 pair" beat another "2 pair." Perhaps i'm just misunderstand, cause I don't know c++

P.s. how fast was your mapped array when incrementing counters, and when not? i think you flipped your times around..reporting a higher time with counters?

At anyrate, i'm 2/3 of the way done making your mapped table array for comparison purposes. Mapping one 4MB array to another takes like trillions of calculations (maybe you had some shortcut)...my computers grinding away.... Only one more array to map.

Gullanian · #90 12-29-2006, 11:28 AM

Just realised however, in the test given the cards are all sorted, therfore it would run faster. I'm doing a real world test now with unsorted hands to test performance.

Edit: Tests sucesful! The precomputed sorting network works 2-3x faster. Thanks!