3DPM: Can we fix it? Yes we can!

Yes. Bay Trail-D beat out some AMD processors in 3DPM that had/have faster 32-bit fp throughput according to some other software (AIDA64, etc). In light of observations made about cache size/speed on Bay Trail-D and the thread linked in the OP where Dr. Cutress’ code snippet is held up for some review, it is logical to conclude that there is more to 3DPM than a simple fp benchmark.

Isn’t this just “OMP parallel for” looped over the num particles?

Yes. It should create one separate thread each time it loops over the constant “particles”. The release post says that it creates 10k threads, so it’s reasonable to assume that the value of particles is 10000.
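For comparison, here is roughly what that loop shape looks like in Java. This is only a sketch: the class name, the run method, and the movement rule are placeholders I made up for illustration, not Dr. Cutress’ code and not the 3DPMRedux source. Note that a Java parallel stream shares the iterations among a pool of worker threads rather than spawning one thread per particle.

```java
import java.util.stream.IntStream;

final class ParallelParticlesSketch {
    // Assumed from the "10k threads" remark in the release post.
    static final int PARTICLES = 10000;

    // Moves every particle `steps` times. Each particle is independent,
    // so the outer loop can be split across worker threads.
    static float[] run(int steps) {
        float[] x = new float[PARTICLES];
        IntStream.range(0, PARTICLES).parallel().forEach(i -> {
            float pos = 0.0f;
            for (int s = 0; s < steps; s++) {
                pos += 0.5f; // placeholder movement, not the real 3DPM math
            }
            x[i] = pos; // each index is written by exactly one iteration
        });
        return x;
    }

    public static void main(String[] args) {
        float[] x = run(4);
        System.out.println("x[0] = " + x[0] + ", x[last] = " + x[PARTICLES - 1]);
    }
}
```

Because no two iterations touch the same array slot, no locking is needed here.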

My first attempt to improve on Dr. Cutress’ code snippet is now available. I have updated the OP to reflect this fact, but for those reading this far down in the thread, it is here:

https://www.dropbox.com/s/4q4f9unhwhtswcf/3DPMRedux662015.zip?dl=0

As with my other Java software, there’s a simple file called runme.bat inside the archive. Just extract the archive somewhere and double-click runme.bat and it should start right up. Java 8 required. Source is here:

https://www.dropbox.com/s/jrpvwxr8aq3au8a/3DPMReduxSource662015.zip?dl=0

The program creates a small text file called output.txt which lists the final x, y, and z coordinates for all 10000 particles handled by the benchmark. It will attempt to delete old output files before writing a new one. The user is free to select how many “steps” of movement are applied to each particle. Be warned that selections past 4 can be a little slow.

You may notice selections 1 and 2 running faster the second time and all subsequent times if you select them repeatedly during the same program execution. This is likely due to JVM “warmup” (the JIT compiler optimizing code once it has run for a while), though you will probably not notice it in selections 3 or higher.

This program will probably run a lot faster on AVX2-enabled Haswell processors than anything else. Java 8 is capable of producing AVX/AVX2 instructions via autovectorization through the JVM. The code itself blatantly hints to the JVM that autovectorization should happen, and the same techniques have borne fruit in other software I’ve written in the past.
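To give an idea of what such a “hint” can look like: a plain counted loop over primitive float arrays, with no branches or calls inside the body, is the kind of code a JIT compiler can turn into SIMD instructions. The sketch below is generic and uses made-up names (VectorFriendlySketch, step); it is not the actual 3DPMRedux code, and whether it really gets vectorized depends on the JVM version and the CPU.

```java
final class VectorFriendlySketch {
    // Adds dx[i] * scale to x[i] for every element.
    // Assumes dx is at least as long as x.
    static void step(float[] x, float[] dx, float scale) {
        for (int i = 0; i < x.length; i++) {
            x[i] += dx[i] * scale;
        }
    }

    public static void main(String[] args) {
        float[] x = {1.0f, 2.0f, 3.0f};
        float[] dx = {1.0f, 1.0f, 1.0f};
        step(x, dx, 0.5f);
        System.out.println(java.util.Arrays.toString(x));
    }
}
```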

3DPMRedux deviates in a few ways from Dr. Cutress’ code snippet, which is meant to be representative of 3DPM. Key differences:

1). Instead of using the stock sin/cos functions, I implemented a truncated Taylor series for both functions. The “fakesin” and “fakecos” methods should still be accurate to about 6 significant figures, which is probably “good enough” when using 32-bit floating point numbers.

2). I allowed negative values for z-axis particle movement.

3). I allowed user-selection of the “steps” value.

4). Instead of reporting a performance number without discernible units, I track the total time of execution for the entire benchmark process (save the process of initializing the particles themselves when the program is first opened).
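For item 1, a truncated Taylor series looks something like the sketch below. This is not copied from the 3DPMRedux source: the number of terms and the assumption that the input already lies within [-pi/2, pi/2] are my choices for illustration. On that range it should agree with Math.sin and Math.cos to roughly 6 decimal places.

```java
final class FakeTrigSketch {
    // sin(x) ~ x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11!
    // written in nested form to keep the multiplication count low.
    // Assumes x is within [-pi/2, pi/2]; no range reduction is done.
    static float fakesin(float x) {
        float x2 = x * x;
        return x * (1.0f - x2 / 6.0f * (1.0f - x2 / 20.0f * (1.0f - x2 / 42.0f
                * (1.0f - x2 / 72.0f * (1.0f - x2 / 110.0f)))));
    }

    // cos(x) ~ 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! - x^10/10!
    // Same input assumption as fakesin.
    static float fakecos(float x) {
        float x2 = x * x;
        return 1.0f - x2 / 2.0f * (1.0f - x2 / 12.0f * (1.0f - x2 / 30.0f
                * (1.0f - x2 / 56.0f * (1.0f - x2 / 90.0f))));
    }

    public static void main(String[] args) {
        float t = 1.0f;
        System.out.println("fakesin: " + fakesin(t) + "  Math.sin: " + Math.sin(t));
        System.out.println("fakecos: " + fakecos(t) + "  Math.cos: " + Math.cos(t));
    }
}
```

Outside that range the error grows quickly, so a real implementation would need to fold the angle back into range first.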

Possible bugs:

Repeated operation of the program may eventually cause the Particles objects to store X and Y values that can no longer increase. This flaw stems from the limited precision of 32-bit fp: once a value is large enough, an increment that never exceeds some fixed maximum size is too small to change it.

For example, if you have a 32-bit fp variable and you repeatedly add 1.0f to it, it eventually stalls at 16,777,216 (2^24). At that point the spacing between adjacent representable values is 2, so adding 1.0f rounds straight back to the same value and the variable stops going up.
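This is easy to check for yourself. The little program below (names are mine, it is not part of 3DPMRedux) keeps adding 1.0f to a float until the sum stops changing; for an IEEE 754 single-precision float that happens at 2^24 = 16,777,216.

```java
final class FloatSaturationSketch {
    // Repeatedly adds 1.0f until the addition no longer changes the value,
    // then returns the value where it got stuck.
    static float saturate() {
        float f = 0.0f;
        while (f + 1.0f != f) {
            f += 1.0f;
        }
        return f;
    }

    public static void main(String[] args) {
        float stuck = saturate();
        System.out.println("stuck at " + stuck);
        System.out.println("stuck + 1.0f == stuck: " + (stuck + 1.0f == stuck));
    }
}
```

With smaller increments the stall point is proportionally lower, which is why large values of steps are where this would show up first.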

3DPMRedux makes no attempt to compensate for this problem. Later versions could reinitialize all Particles objects to reset the starting values to 0, but for now, it doesn’t seem to be a big issue. It might be for very large values of steps.