Two Plus Two Newer Archives - View Single Post

jukofyork · #83 12-28-2006, 07:51 PM

<font class="small">Code:</font><hr /><pre>// Test7SortingNet.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "random.h"
#include <iostream>

using namespace std;

inline unsigned long long RDTSC(void)
{
_asm rdtsc;
}

#define NUM_TESTS 100000
#define NUM_ITERS 100000000

int _tmain(int argc, _TCHAR* argv[])
{

// Create the tests.
char Tests[NUM_TESTS][7];
for (int I=0;I<NUM_TESTS;I++) {
for (int J=0;J<7;J++) {
Tests[I][J]=RandInt(51);
}
}
cout << "Tests created." << endl;

// Read start cycle.
unsigned long long StartTime=RDTSC();

// Run the tests.
for (int I=0;I<NUM_ITERS;I++) {

char* V=Tests[I%NUM_TESTS];

// Using XOR.
//#define SWAP(I,J) {if (V[I]>V[J]) {V[I]^=V[J]; V[J]^=V[I]; V[I]^=V[J];}}

// Using swaps.
int T;
#define SWAP(I,J) {if (V[I]>V[J]) {T=V[I]; V[I]=V[J]; V[J]=T;}}

SWAP(0, 4);
SWAP(1, 5);
SWAP(2, 6);
SWAP(0, 2);
SWAP(1, 3);
SWAP(4, 6);
SWAP(2, 4);
SWAP(3, 5);
SWAP(0, 1);
SWAP(2, 3);
SWAP(4, 5);
SWAP(1, 4);
SWAP(3, 6);
SWAP(1, 2);
SWAP(3, 4);
SWAP(5, 6);

}

//
cout << (double)(RDTSC()-StartTime)/(double)NUM_ITERS << " cycles per sort." << endl;

return 0;
}</pre><hr />

Runs at about 70 cycles per sort atm, but the MS VC compiler isn't generating what I want:

<font class="small">Code:</font><hr /><pre>;SWAP(1, 5):
mov cl, BYTE PTR [eax+1]
mov dl, BYTE PTR [eax+5]
cmp cl, dl
jle SHORT $LN15@wmain
movsx ecx, cl
mov BYTE PTR [eax+1], dl
mov BYTE PTR [eax+5], cl</pre><hr />
The idea is that you can fit the 7 elements in 2 (or more) registers and then you don't need the "mov xx, BYTE PTR [eax+1]" instructions, but the compiler can't seem to see this and insists on storing as a vector (even if you create a 64 bit structure of 8x8 bytes and use the "register" definition prefix it does this too...).

This messing about using just cl and dl is also causing pipelines to stall which wouldn't happen as much if it were using registers to hold each of the 7 elements.

If I get more time I'll try and get it working as I'm pretty sure it should be possible to do in 30-40 cycles when optimized.

One other thing to try is to see if the XOR swap is any faster for an AMD, as for a P4 it is exactly the same even though it compiled down to an extra instruction per SWAP().

Juk [img]/images/graemlins/smile.gif[/img]