java test: unexpected results on AMD CPUs

cytg111 · Sep 24, 2014

JoeRambo said:
Without seeing generated assembly it is barely a guess, but in my opinion HT helps floating point much more due to these reasons:

- Yea, was pondering writing a C++ equivalent. The black-box bytecode->JIT->machine code pipeline is making it hard to profile.

1) Workload dominated by division and no chance for out of order execution since everything depends on everything

OoO.. This might be an error in my understanding of the concept(s), but since everything is this straightforward, and the minuscule amount of data should fit nicely in L1, is there an actual benefit from OoO here? I am thinking the same thing about HT: with a good continuous run and no stalls, there is nothing for HT to gain.

3) Float version is working on 32bit values and executing DIVSS instructions that are very tight in what ports they use, leaving a lot of opportunities for HT sibling to execute.

Maybe it is a misconception that HT only works when a stall occurs? If not, maybe it is a consequence of all the context switching (and your Counter class would do away with it). But still, even for 48 threads, the data segment is pretty minuscule and should all fit nicely in L1.

Too much guesswork. If I'm not too wasted come evening, I may try to put a C++ version together.

Also, just as a side note to the concept
https://software.intel.com/en-us/forums/topic/473413
With HT disabled, pegging a single-threaded task to a specific core significantly reduces cache misses (and thus stalls).

Interesting stuff

SlowSpyder · Sep 24, 2014

Ok, I'm lowering my clock and voltage a little. Doing so is improving my numbers. I am running with my case open, I think I might be running into VRM throttling since there is very little airflow over the heatsinks without the side cover on. Funny how when you overclock you'll find different oddities at times. I've gamed for hours on end at 5.1+ GHz, never noticed any issues. But it is what it is. 208 x 24 = 4.992GHz, what I just ran. Still seeing cores drop a bit, but not nearly as much. I'll add a stock clock run shortly.

Starting OriginalCode run!
It took 42287 milliseconds to complete the Integer loop.
It took 57886 milliseconds to complete the Float loop.
Starting OriginalCodeNoDiv run!
It took 30468 milliseconds to complete the Integer loop.
It took 38378 milliseconds to complete the Float loop.
Starting LatchEnabled run!
It took 53483 milliseconds to complete the Integer loop.
It took 74191 milliseconds to complete the Float loop.
Starting LatchEnabledNoDiv run!
It took 33917 milliseconds to complete the Integer loop.
It took 36888 milliseconds to complete the Float loop.

Cerb · Sep 24, 2014

cytg111 said:
OoO.. This might be an error in my understanding of the concept(s), but since everything is this straightforward, and the minuscule amount of data should fit nicely in L1, is there an actual benefit from OoO here? I am thinking the same thing about HT: with a good continuous run and no stalls, there is nothing for HT to gain.

OOOE works well, period, if the code can use it, and the data is in any part of the cache. This code might or might not be able to get much out of it, but for code that can, fitting in some level of cache has nothing to do with it, at least as far as whether it is or isn't useful (which is why in-order fast attempts to do it all in software have consistently been rather poor).

Maybe it is a misconception that HT only works when a stall occurs? If not maybe it is a consequence of all the context switching..

HT needs free resources, and that's it. There is some partitioning for fairness (I forget where, in Haswell, TBH), but instructions from both threads can be issued and executed in the same clock cycle. HT today and HT in Willamette Xeons are far cries from one another. For the most part, the core gets each thread's registers renamed, and then from there on in, which instruction goes with which thread doesn't matter, except to the branch predictor and various loading and storing sections.

SlowSpyder · Sep 24, 2014

I put some air over the VRM's, clocked down a bit more, stayed at factory voltage of 1.5 this time. 208x23 = 4.784GHz. I did not see any throttling this time, so I must have been triggering some kind of protection / throttling before with this. Anyway, my 4.784GHz numbers:

Starting OriginalCode run!
It took 43917 milliseconds to complete the Integer loop.
It took 62788 milliseconds to complete the Float loop.
Starting OriginalCodeNoDiv run!
It took 27448 milliseconds to complete the Integer loop.
It took 34409 milliseconds to complete the Float loop.
Starting LatchEnabled run!
It took 44446 milliseconds to complete the Integer loop.
It took 70787 milliseconds to complete the Float loop.
Starting LatchEnabledNoDiv run!
It took 27396 milliseconds to complete the Integer loop.
It took 28607 milliseconds to complete the Float loop

UnmskUnderflow · Sep 24, 2014

float1++
float2++

Ok, bothered me enough to post. If this is an abused int-land shorthand for "float1 + 1.0f" then after 24 bits of increment your floats will have the 1.0 land in sticky and be by default rounded down. Thus for most of the loop it will be moving the float nums by zero. A smart compiler or result cache would notice and subsequently dispatch NOPs depending on settings, etc.

If you don't mind, could you do a non-float-trap increment to see if results differ? That would eliminate some possible sneaky optimizations, if they were detected in the first place.

DrMrLordX · Sep 24, 2014

UnmskUnderflow said:
float1++
float2++

Ok, bothered me enough to post. If this is an abused int-land shorthand for "float1 + 1.0f" then after 24 bits of increment your floats will have the 1.0 land in sticky and be by default rounded down. Thus for most of the loop it will be moving the float nums by zero. A smart compiler or result cache would notice and subsequently dispatch NOPs depending on settings, etc.

If you don't mind, could you do a non-float-trap increment to see if results differ? That would eliminate some possible sneaky optimizations, if they were detected in the first place.

I was going to wait to post until finishing testing on other stuff, but I wanted to let you know that I'll look into the possibility that the float increments might be failing to increment and how that might affect performance. Main reason I wanted to keep the majority of the operations in the FloatLoop class fp-related was to avoid making it a combined integer/float test (there is still the while loop incrementation, but nobody's perfect).

edit: good catch! float1 and float2 both fail to increment beyond 16777216 with the existing code. That is contrary to my intentions. Oops! I'll uh, do something to fix that.
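For anyone who wants to see the saturation point for themselves, here is a minimal sketch (the class and method names are made up for illustration and are not from the Mathtester source). It counts a float up from zero until adding 1.0f no longer changes the value, which happens at 2^24 = 16777216 because a float only carries 24 bits of significand.

```java
// Sketch only: FloatSaturationDemo is a hypothetical name, not part of the benchmark.
class FloatSaturationDemo {
    // Counts up from 0 and returns the first value where f + 1.0f == f.
    static float firstStuckValue() {
        float f = 0.0f;
        while (f + 1.0f != f) {
            f++;
        }
        return f;
    }

    public static void main(String[] args) {
        float stuck = firstStuckValue();
        System.out.println("float stops incrementing at " + stuck);
        System.out.println("2^24 as an int is " + (1 << 24));
    }
}
```

Past that point every further float1++ rounds back to the same value, so the remaining iterations of a 2^31 loop do no useful float work.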

UnmskUnderflow · Sep 24, 2014

DrMrLordX said:
edit: good catch! float1 and float2 both fail to increment beyond 16777216 with the existing code. That is contrary to my intentions. Oops! I'll uh, do something to fix that.

No worries, HPC guys get that wrong all the time.

Classic solutions since you just want linear number movement
1) change the round mode to round to infinity (easiest to keep this consistent)
2) try just adding the number to itself. That will scale, but has exponential side effects...you may eventually INF/NaN out.
3) Case handling per range. No more pretty loops, but representative of real ugly float code

I recommend figuring out how to round to inf, since that is your intention.

Keep this up, translate to C++, and maybe vectorize, you have yourself a pretty nice benchmark imho. (I'd be soooo happy if you vectorized. Put that cache system to work!)

DrMrLordX · Sep 24, 2014

UnmskUnderflow said:
No worries, HPC guys get that wrong all the time.

Classic solutions since you just want linear number movement
1) change the round mode to round to infinity (easiest to keep this consistent)
2) try just adding the number to itself. That will scale, but has exponential side effects...you may eventually INF/NaN out.
3) Case handling per range. No more pretty loops, but representative of real ugly float code

I recommend figuring out how to round to inf, since that is your intention.

Keep this up, translate to C++, and maybe vectorize, you have yourself a pretty nice benchmark imho. (I'd be soooo happy if you vectorized. Put that cache system to work!)

I tried several solutions with different results. In fact, I'll break it down like this:

First solution I tried was a rolling increment. I did float1 += float2 and float2 -= float1, which produces a 6-increment cycle after which the float values float1 and float2 repeat themselves (I also had to set the initial values to something other than 0. I started float1 at 2 and float2 at 1). Still too easy for the VM to optimize that, I think (not like the previous pattern was all that devious or sneaky). So, I scrapped that. Might go back to it for academic purposes to see if it would be really that much faster than the solution I've settled on below.
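The 6-step cycle is easy to reproduce in isolation. A small sketch, assuming the two statements run in the order given above (the class and method names are hypothetical, and the step cap is only there as a safety net):

```java
// Sketch only: reproduces the float1 += float2; float2 -= float1 pattern.
class RollingIncrementDemo {
    // Returns how many steps pass before both values are back at their start.
    static int cycleLength(float start1, float start2) {
        float float1 = start1;
        float float2 = start2;
        int steps = 0;
        do {
            float1 += float2;
            float2 -= float1;
            steps++;
        } while ((float1 != start1 || float2 != start2) && steps < 1000);
        return steps;
    }

    public static void main(String[] args) {
        // Starting values from the post: float1 = 2, float2 = 1.
        System.out.println("cycle length: " + cycleLength(2.0f, 1.0f));
    }
}
```

Starting from (2, 1) the pairs go (3, -2), (1, -3), (-2, -1), (-3, 2), (-1, 3) and then back to (2, 1), so the values never grow and the pattern stays very predictable.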

Second solution I tried was self-adding which did in fact hit Infinity. That was not 100% precisely what I had intended, though it really didn't speed things up (in fact, the test got pretty slow at that point). Might add it as an option later on.
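To put a number on how quickly self-adding runs out of range, here is a sketch that doubles a float until it overflows (names are hypothetical; the starting value of 1.0f is an assumption, the post does not say what the real code started from):

```java
// Sketch only: counts how many times a float can be added to itself.
class SelfAddDemo {
    // Doubles f until it becomes Infinity and returns the number of doublings.
    static int doublingsUntilInfinity(float start) {
        float f = start;
        int count = 0;
        // The cap guards against inputs such as 0 or NaN that never reach Infinity.
        while (!Float.isInfinite(f) && count < 1000) {
            f += f;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("doublings from 1.0f: " + doublingsUntilInfinity(1.0f));
    }
}
```

From 1.0f it takes 128 doublings, since 2^128 is just past Float.MAX_VALUE. In a loop of millions of iterations the value therefore sits at Infinity almost the whole time.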

After that, I just threw up my hands and said, "Let's spawn more threads!". So, instead of spawning 48 threads and running each loop 2^31 times, we now spawn 6144 threads and run each loop 2^23 times. Same number of loops as before, but now we have float1 and float2 changing value after each iteration which was (mostly) not the case in any of the previous code.

The new test is still running on my slow-arsed machines, so I won't have results for a wee bit. Once those come out I'll discuss the alternate threading tests I ran.

DrMrLordX · Sep 24, 2014

Ramses said:
It's possible it's a CnQ/motherboard difference. I'm on a Sabertooth R2 and owned an Asrock Extreme9 before, they both treated thermal control very differently, possibly clock related functions too. Just a thought.

I expect that turbo and power-saving features both could affect the reported results from this test.

If I can assist further yell, I got cores to spare...

I'll probably roll out new versions of the test every now and then, so just running those would be great!

richierich1212 said:
3570K @ 4.3GHz (10 tabs open in Chrome, youtube streaming while test was running. Win7 64-bit):

Starting OriginalCode run!
It took 90794 milliseconds to complete the Integer loop.
It took 89126 milliseconds to complete the Float loop.

Starting OriginalCodeNoDiv run!
It took 31110 milliseconds to complete the Integer loop.
It took 37568 milliseconds to complete the Float loop.

Starting LatchEnabled run!
It took 90872 milliseconds to complete the Integer loop.
It took 89602 milliseconds to complete the Float loop.

Starting LatchEnabledNoDiv run!
It took 30918 milliseconds to complete the Integer loop.
It took 36732 milliseconds to complete the Float loop.

Thanks, now we have an Ivy Bridge result. That will be nice to have at least for comparison in case you decide to run the next version, which is quite a bit different thanks to UnmskUnderflow's observations about the float increment problem.

cytg111 said:
- Yea, was pondering writing a C++ equivalent. The black-box bytecode->JIT->machine code pipeline is making it hard to profile.

By all means, that would be great to have for comparative purposes.

SlowSpyder said:
I put some air over the VRM's, clocked down a bit more, stayed at factory voltage of 1.5 this time. 208x23 = 4.784GHz. I did not see any throttling this time, so I must have been triggering some kind of protection / throttling before with this. Anyway, my 4.784GHz numbers:

Starting OriginalCode run!
It took 43917 milliseconds to complete the Integer loop.
It took 62788 milliseconds to complete the Float loop.
Starting OriginalCodeNoDiv run!
It took 27448 milliseconds to complete the Integer loop.
It took 34409 milliseconds to complete the Float loop.
Starting LatchEnabled run!
It took 44446 milliseconds to complete the Integer loop.
It took 70787 milliseconds to complete the Float loop.
Starting LatchEnabledNoDiv run!
It took 27396 milliseconds to complete the Integer loop.
It took 28607 milliseconds to complete the Float loop

Ah hah, now we're talking. Eliminating throttling behavior is a good thing. Was the throttling affecting anything else you were running?

DrMrLordX · Sep 24, 2014

Okay, lots of things to put here. First off, some minutiae:

The x2's temp sensors are hosed, so I did temp testing on the E1-2500 using Maximilian's GUI version. The E1 was idling at about 56-57C, and Awesomeballs ran it up to 65C during the FloatLoop portions of the test. I tried Prime95 Blend and hit 68C within about 5 minutes of run time. Obviously, that version of Awesomeballs is not the stressiest of stress tests, though it can cause a little heat.

I experimented with Math.pow performance. It is awful. Avoid Math.pow if you care about Java code performance. Oh, the agony. Need to check Math.random() and Math.round() later to see how bad they are. It's probably been done already under older versions of the VM, just can't be bothered with it right now.

Using simple Thread.start() and Thread.join() is really no faster than using an ExecutorService. In some cases it seems a little slower, but not by much.

Throwing an insane number of threads into system memory (6144!) and reducing the number of while loop iterations per thread sped things up by quite a bit. See notes below:

After spending a lot of time trying to squeeze more performance out of the test by reducing threading overhead, I gave up. Nothing was working on my test machines. Spawning 48 threads in a static thread pool was about as fast as it was going to get, especially with Thread.setPriority() calls establishing the worker threads at higher priority. But, I did rewrite the FloatLoop code to iterate only ~16 million times per thread rather than ~2 billion, and I increased the number of threads to compensate. This change allowed me to avoid the problem that UnmskUnderflow pointed out earlier today. To be even-handed, I made the same change to the IntegerLoop code.

Here are the class files for the latest revision that reflects these changes. Included is probably the best Thread/Join example code I put together as "experimental code".

You can get the source here.

If you are interested in using the .class files, just put them in a directory, navigate to the parent of that directory (one level above the .class files themselves), and then type: java mathtester.Mathtester.

Be warned that there are two problems with the new code that may require further work later on, especially if they crop up on other people's machines:

The results for the non-latch test portions may be improbably low. Long story short, I'm having problems with the reporting code executing before its time. I redid the reporting variable as volatile in hopes that I'd force the VM to execute operations involving the timer in proper order, but the reported times for the non-latch code still seem very low. Don't get me wrong, the new code is a LOT faster than the old code, but the reported times still seem off.

As a consequence of making the reporting variable volatile, I've also reintroduced part of the shutdown time into the reported time, which is unfortunate but possibly necessary.

If there are too many problems on account of this issue, I have an alternate plan to fix it.

Also, spawning 6144 threads can use a lot of heap space. My x2 test machine seems prone to OutOfMemory exceptions during the latch tests. I added forced Runtime.getRuntime().gc() calls along with old-fashioned null reference assignments since it was obvious that the GC wasn't doing its darn job when queueing up all of the tests at once. The E1-2500 has not had this problem yet (same amount of memory, interestingly enough). Still, you can avoid the OutOfMemory by running one test at a time instead of trying to run them all in sequence. You may have to exit the program and restart it to really get memory back.

I'll post some times from the x2 and E1 using the new code later.

edit: ach the number of threads is too few. I'll have to fix that pronto.

edit edit: fixed. It's the same code as before, with batch mode disabled (Maximilian may re-enable it anyway, who knows) that runs the tests twice in sequence. So, 6144 threads followed by 6144 threads. The links above point to the new code, and the inadequate code has been deleted. Same instructions for operation as before.

SlowSpyder · Sep 24, 2014

DrMrLordX said:
Ah hah, now we're talking. Eliminating throttling behavior is a good thing. Was this affecting anything else you were running?

Not that I noticed. I game and browse a lot. No game I play ever gets the CPU up to 100% load on all cores.

Enigmoid · Sep 24, 2014

3630qm @ 3.2 ghz
Original bench

Starting Integer bench!
It took 114815 milliseconds to complete the Integer loop.
Starting Float bench!
It took 70992 milliseconds to complete the Float loop.
Stopped, total execution time was 3 minutes and 5 seconds.

DrMrLordX · Sep 24, 2014

SlowSpyder said:
Not that I noticed. I game and browse a lot. No game I play ever gets the CPU up to 100% load on all cores.

Hmm. Well, I'm sure you'll sort it out to your satisfaction. Seems like you have things well in hand as it is.

Enigmoid said:
3630qm @ 3.2 ghz
Original bench

Okay, more Ivy Bridge. Unlike the 3570k, you have HT, and it seems to be making a difference here in much the same fashion as it does on Haswell.

Scores for my machines using the 9/24 "fixed" code that runs the appropriate number of iterations (6144 + 6144):

Stars chip:

Original code:
IntegerLoop: 87502 ms
FloatLoop: 34900 ms

Original code, no division:
IntegerLoop: 7299 ms
FloatLoop: 19856 ms

Latch code:
IntegerLoop: 26308 ms
FloatLoop: 34526 ms

Latch code, no division:
IntegerLoop: 7804 ms
FloatLoop: 15152 ms

Jaguar chip:

Original code:
IntegerLoop: 40739 ms
FloatLoop: 129324 ms

Original code, no division:
IntegerLoop: 23813 ms
FloatLoop: 52725 ms

Latch code:
IntegerLoop: 31140 ms
FloatLoop: 41829 ms

Latch code, no division:
IntegerLoop: 5781 ms
FloatLoop: 7579 ms

I must revise what I said about the non-latch results being unreliable. Frankly I think all of the numbers should be viewed with some skepticism. What is clear from running the new code is that the test is now much faster than it was before it went massively-parallel. You can time the overall run with a stopwatch if you must. The internal reporting code is suffering from the fact that it is running in a low-priority thread among literally thousands. Before I used the volatile keyword, I would get some report times of 0 ms elapsed (really).

Nevertheless, the Jaguar chip used to take upwards of 9 minutes just to complete the IntegerLoop portion of the original code test. Now it takes maybe 2-3 minutes. Assuming the latch is functioning the way it's supposed to (namely, that it isn't popping in error), all 13194139533312 loop iterations are taking place as before, at least in the latch tests. So, something is improved here, or something is just very, very broken.

As an aside, I did modify the IntegerLoop and FloatLoop classes during the redesign to have them report completion of their assigned 2^31 while loop iterations, and sure enough, they did. The 6144 console output commands did slow things down a bit, but it was still a lot faster than when I used only 48 threads, on both the Integer and Float side.

UnmskUnderflow · Sep 24, 2014

For what it's worth, I tried looking up the quick way to force Java floats to round to infinity (or up, or ceiling, or away from zero, or anything besides nearest even). If you did this, you could keep the thread count small, and after it saturated, each addition would increment the LSB naturally.

However, seems the answer is non trivial, which as a hardware guy, I find rather infuriating. ISAs have programmer visible registers for this type of thing, so I'm a bit peeved Java dances around it and force-feeds everyone nearest-even.

Anyway, thanks for being so open to feedback, been a while since I was interested in a home-brew bench. Keep it up!
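One workaround that stays inside plain Java, offered as a sketch and not as a real rounding-mode switch: once f + 1.0f stops changing the value, step to the next representable float with Math.nextUp. This is a swapped-in technique, not true round-to-infinity, and the class and method names are hypothetical.

```java
// Sketch only: emulates "always move upward" without touching the rounding mode.
class NextUpDemo {
    // Adds 1.0f when that changes the value, otherwise moves up by one ULP.
    static float bump(float f) {
        float next = f + 1.0f;
        return (next != f) ? next : Math.nextUp(f);
    }

    public static void main(String[] args) {
        System.out.println(bump(5.0f));        // ordinary increment
        System.out.println(bump(16777216.0f)); // saturated case, steps to next float
    }
}
```

The catch is the extra comparison per increment, which changes what the loop measures, so it may not suit a benchmark that wants to keep the loop body free of branches.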

DrMrLordX · Sep 24, 2014

UnmskUnderflow said:
For what it's worth, I tried looking up the quick way to force Java floats to round to infinity (or up, or ceiling, or away from zero, or anything besides nearest even). If you did this, you could keep the thread count small, and after it saturated, each addition would increment the LSB naturally.

However, seems the answer is non trivial, which as a hardware guy, I find rather infuriating. ISAs have programmer visible registers for this type of thing, so I'm a bit peeved Java dances around it and force-feeds everyone nearest-even.

Anyway, thanks for being so open to feedback, been a while since I was interested in a home-brew bench. Keep it up!

Always glad to learn something new, so thank you for your input. The program runs a lot faster now (rather by accident) on account of my own efforts to escape the float rounding issue. I may revisit the issue to lower thread count and make the memory requirements less insane, though aggressive forced garbage collection seems to be helping in that department, at least a little.

cytg111 · Sep 25, 2014

DrMrLordX said:
Before I used the volatile keyword, I would get some report times of 0 ms elapsed (really)

Use System.nanoTime instead, much higher resolution.
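Something along these lines, as a minimal sketch (the helper name timeNanos is made up). System.nanoTime() is meant for measuring elapsed time and has nanosecond precision, though not necessarily nanosecond accuracy.

```java
// Sketch only: a tiny elapsed-time helper built on System.nanoTime().
class NanoTimerDemo {
    // Runs the given work once and returns the elapsed time in nanoseconds.
    static long timeNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long elapsed = timeNanos(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            // Print the sum so the loop has an observable result.
            System.out.println("sum = " + sum);
        });
        System.out.println("elapsed: " + elapsed + " ns (" + (elapsed / 1_000_000) + " ms)");
    }
}
```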

JoeRambo · Sep 25, 2014

This non benchmark just went from using a rather okayish version with ThreadExecutor (even if initialized in the absurd way of having more threads than there are CPUs to burn CPU time on contention), to using an outright stupid amount of threads (each is initialized, has its own thread stack and makes the Windows scheduler go haywire if there are too many)

One can invent any excuse why you are doing this, but it is year 2014 and by now even most conservative Java communities realised that worker thread pools beat thread per task model.

Creating 6k threads to work around programming problems is absurd.
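For reference, the pool-based shape looks roughly like this. It is a sketch under assumptions: the pool is sized to the number of available processors, the 6144 small tasks are queued instead of each getting a thread, and all names are hypothetical and not taken from the benchmark source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: many small tasks on a fixed pool instead of one thread per task.
class WorkerPoolDemo {
    // Stand-in for one worker's loop; returns its iteration count.
    static long spin(int iterations) {
        long count = 0;
        for (int i = 0; i < iterations; i++) {
            count++;
        }
        return count;
    }

    // Queues taskCount tasks and returns the total iterations completed.
    static long runTasks(int taskCount, int iterationsPerTask) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < taskCount; t++) {
                Callable<Long> task = () -> spin(iterationsPerTask);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // blocks until that task has finished
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("total iterations: " + runTasks(6144, 1000));
    }
}
```

Only as many threads as the pool size ever exist, so the per-thread stack cost does not grow with the task count.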

DrMrLordX · Sep 25, 2014

cytg111 said:
Use System.nanoTime instead, much higher resolution.

May try that, but I'm not sure that it would help. The main issue isn't the execution times being reported by individual threads (when I have that enabled, which I don't in the latest version of the code offered here), but the behavior of the counter/totaltime variable in OriginalCode/OriginalCodeNoDiv/LatchEnabled/LatchEnabledNoDiv.

Yes, if I had the individual threads reporting their complete times, then using ns would be better since the first few threads to complete do finish in under 1 ms. The problem is that in OriginalCode (for example), the line

Code:

totaltime = System.currentTimeMillis() - totaltime;

will sometimes execute while some of the worker threads are still going. I had thought the latch would provide better protection from that, but it really doesn't do that at all. Anyway, during some testing, I had the worker threads set up to report their individual completion times. There were many instances of the last few threads reporting completion times well after counter (now named totaltime/totaltime2) had recorded the alleged execution time for all tasks in the ExecutorService.

The actual System.out.println that displays totaltime's contents always seems to come after the threads are complete, but not so with the recording. Very interesting.
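A sketch of one way to make sure the timer is only read after every worker is done: shut the pool down, then block on awaitTermination before computing the elapsed time. This is an assumption about what would help, not a diagnosis of the actual code; the names and the one-hour timeout are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: records elapsed time strictly after all workers have finished.
class TimedPoolDemo {
    // Returns { elapsed nanoseconds, number of tasks that finished }.
    static long[] runAndTime(int threads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger finished = new AtomicInteger();
        long start = System.nanoTime();
        for (int t = 0; t < tasks; t++) {
            pool.execute(() -> {
                long busy = 0;
                for (int i = 0; i < 100_000; i++) {
                    busy += i;
                }
                // Only count the task if the loop really produced its sum.
                if (busy > 0) {
                    finished.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // no new tasks; queued ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        long elapsed = System.nanoTime() - start; // read only after termination
        return new long[] { elapsed, finished.get() };
    }

    public static void main(String[] args) {
        long[] result = runAndTime(4, 48);
        System.out.println(result[1] + " tasks in " + (result[0] / 1_000_000) + " ms");
    }
}
```

Calling shutdown() alone returns immediately, so a timestamp taken right after it can still be early; the blocking wait is what closes that gap.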

JoeRambo said:
This non benchmark just went from using a rather okayish version with ThreadExecutor (even if initialized in the absurd way of having more threads than there are CPUs to burn CPU time on contention), to using an outright stupid amount of threads (each is initialized, has its own thread stack and makes the Windows scheduler go haywire if there are too many)

Win8.1 handled it with aplomb. Well, sort of, if you count the entire OS seizing up for the duration of the test "handling it with aplomb". I think that's more the hardware's fault though (E1-2500), since it likes to seize up on all kinds of things. Overclock.net or slashdot.org, for example. Or even 48 worker threads from the earlier version of the test.

Linux pitched a fit when I tried 12k threads and required a reboot. Okay, THAT was a genuinely bad idea. But I had to try anyway.

One can invent any excuse why you are doing this, but it is year 2014 and by now even most conservative Java communities realised that worker thread pools beat thread per task model.

Then I believe what I am doing would be best described as, "reinventing the wheel". Or retesting it anyway.

Creating 6k threads to work around programming problems is absurd.

It is! That's what's so funny about it. Actually the funny part was 6k threads (twice, in sequence) making the test run faster. Having thought about it some, I can only attribute that fact to the possibility that integer rollovers are expensive (int3 rolled over a lot) and that any operation involving a float that has reached Infinity is also expensive (my own limited testing does confirm this. In any case, float3 spends less time at Infinity now than it did before). Or, due to the overwrought parallelism, it may be that some operations are not completing as they should. I should test for that . . .
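For clarity on what "rollover" and "reaching Infinity" mean at the language level, here is a small sketch (hypothetical names). It only shows the semantics; it says nothing about how expensive either case is on a given CPU.

```java
// Sketch only: int wraparound and float overflow behaviour in Java.
class OverflowDemo {
    // int arithmetic wraps silently on overflow.
    static int plusOne(int value) {
        return value + 1;
    }

    // float arithmetic overflows to Infinity instead of wrapping.
    static float doubled(float value) {
        return value * 2.0f;
    }

    public static void main(String[] args) {
        System.out.println(plusOne(Integer.MAX_VALUE));            // wraps to Integer.MIN_VALUE
        System.out.println(doubled(Float.MAX_VALUE));              // Infinity
        System.out.println(doubled(Float.POSITIVE_INFINITY));      // stays at Infinity
        float inf = Float.POSITIVE_INFINITY;
        System.out.println(inf - inf);                             // NaN
    }
}
```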

In any case I'll probably move away from the mass-parallelism and try nesting loops inside the worker objects themselves, or something along those lines. Less memory footprint.

DrMrLordX · Sep 25, 2014

Quick post: found some errors in the 9/24 build (aside from the mass thread parallelism). It was iterating 1/5th the loops it should have been.

Essentially, there should be around 103 billion total loop iterations. I was getting around 20 billion. Oops!

Code in testing has been revamped so as to go back to 48 threads. Instead, each thread now has a while loop that iterates 2^24 times nested inside a for loop that iterates 128 times. Initial testing indicates that this code is much, much faster. Might need to switch to nanoseconds after all.
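The arithmetic behind those figures, as a quick sanity check (a sketch with hypothetical names): 48 threads, each running a 128-pass for loop around a 2^24-pass while loop, gives 48 x 2^31 iterations, which is where the "around 103 billion" comes from.

```java
// Sketch only: checks the loop-count arithmetic with long values.
class IterationCountDemo {
    // Total while-loop iterations across all threads.
    static long totalIterations(int threads, long outerPasses, long innerPasses) {
        return threads * outerPasses * innerPasses;
    }

    public static void main(String[] args) {
        long perThread = 128L * (1L << 24);              // 2^7 * 2^24 = 2^31
        long total = totalIterations(48, 128L, 1L << 24);
        System.out.println("per thread: " + perThread);
        System.out.println("all threads: " + total);
    }
}
```

The counts need long here, since 2^31 on its own is already one more than Integer.MAX_VALUE.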

UnmskUnderflow · Sep 25, 2014

JoeRambo said:
This non benchmark just went from using a rather okayish version with ThreadExecutor (even if initialized in the absurd way of having more threads than there are CPUs to burn CPU time on contention), to using an outright stupid amount of threads (each is initialized, has its own thread stack and makes the Windows scheduler go haywire if there are too many)

One can invent any excuse why you are doing this, but it is year 2014 and by now even most conservative Java communities realised that worker thread pools beat thread per task model.

Creating 6k threads to work around programming problems is absurd.

In DrMrLordX's defense, if anything this shows what absurd crap can happen to simple loops in VMs or JITs. Overflow of int tanking perf? No easy IEEE 754 control? Weird as crap reactions to massive thread counts?

His code is open source and is becoming more messy the more real it gets. This is why it is more interesting than mickey mouse benches that are proprietary where no one validates functionality and forums like these slurp up results between ISAs as gospel comparisons.

I wish SPEC would come back. Till then, keep up these efforts. This exposes the massive flaws in the weak benchmarking we have available, and if he iterates enough with open feedback, he may yet have a more valid JIT cross compare than is avail elsewhere.

DrMrLordX · Sep 25, 2014

UnmskUnderflow said:
In DrMrLordX's defense, if anything this shows what absurd crap can happen to simple loops in VMs or JITs. Overflow of int tanking perf? No easy IEEE 754 control? Weird as crap reactions to massive thread counts?

I may have over-estimated the expense of int overflow. It's something I need to isolate and test on an individual basis.

The 9/24 build is problematic. Not only was it running too few loop iterations, it also had serious issues with report time fluctuations, probably due to some flaws left over in my loop code from when I was experimenting with ways to get away from rounding failure at float1/float2 = 2^24. Some of my loop code was still doing float1 += float2 and float2 -= float1 instead of float1++ and float2++. So, the 9/24 results are at least partially garbage. The cool part was that spawning such a large number of threads still worked. Pity I hosed up some other parts of the code.

This problem carried over to what was going to be the 9/25 build. On top of that problem, I found yet another problem with how I arranged my nested while loop: all variable declarations moved inside the new for loop. Apparently this caused the VM to decide that the contents of the for loop were something it could afford to simply not carry out. It's easy to understand why. The while loop no longer performed any reads from or writes to variables or objects outside the for loop or even the runloops() method. I knew there was a problem when increasing the number of for loop iterations to over 600000 barely dented performance. That's over ten trillion while loop iterations per thread. No way that's getting done in a few seconds on an x2 220, unlocked or no.

Basically, if you've got a nested loop arrangement like this:

Code:

for (int j = 0; j < 1337; j++)
{
     int stuff = 0;
     int morestuff = 0;
     while (stuff < 31337)
     {
           stuff++;
           morestuff++;
     }
}

the VM is going to ignore the while loop. Or, at least, it certainly APPEARS to do that.

The feared float1++/float2++ NOP problem became a very real problem in which the entire loop was apparently reduced to a NOP. To correct the problem, I moved all variable declarations out of runloops() and, voila, the VM is no longer skipping the while loop, despite the fact that each iteration of the for loop resets all variables to 0.
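Another common way to keep a JIT from discarding a loop like this is to make its result observable, for example by folding it into a value that the method returns and the caller prints. This is a sketch of that alternative, not the fix described above, and the names are hypothetical. Whether a given VM would really eliminate the original loop depends on the VM and its settings.

```java
// Sketch only: the same nested loop, but its result escapes the method.
class LiveLoopDemo {
    // Returns a checksum built from every pass of the inner loop.
    static long runLoops(int outerPasses, int innerPasses) {
        long checksum = 0;
        for (int j = 0; j < outerPasses; j++) {
            int stuff = 0;
            int morestuff = 0;
            while (stuff < innerPasses) {
                stuff++;
                morestuff++;
            }
            checksum += morestuff; // the result is used, so it cannot simply be dropped
        }
        return checksum;
    }

    public static void main(String[] args) {
        System.out.println("checksum: " + runLoops(1337, 31337));
    }
}
```

A sufficiently clever compiler could still shortcut a loop this simple by computing the answer directly, so a checksum guards against the work being skipped outright, not against it being simplified.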

Code performance is now falling more in line with what it was in versions from before 9/24, though it is still a bit faster . . . which is interesting. You'd think it would get slower now that float1 and float2 are iterating properly, but no, it's faster. That's probably the real, measurable effect of reducing the number of int rollovers and operations involving a float at Infinity. Maybe. If I can find an elegant way to stop int3 from rolling over itself and float3 from reaching Infinity without adding if statements or increasing the number of operations per while loop iteration, I may do that next.

His code is open source and is becoming more messy the more real it gets. This is why it is more interesting than mickey mouse benches that are proprietary where no one validates functionality and forums like these slurp up results between ISAs as gospel comparisons.

Messy? Tell me about it. Of course, I'm the one messing it up, so I have no real reason to complain. Now that you mention it, I should put this code (and future versions) under GPL or Creative Commons or something.

I wish SPEC would come back. Till then, keep up these efforts. This exposes the massive flaws in the weak benchmarking we have available, and if he iterates enough with open feedback, he may yet have a more valid JIT cross compare than is avail elsewhere.

Hah! Well, we'll see where it all leads. For now it's just a tinker-toy. But it is fun tinkering with it. Speaking of which . . .

The new not-messed-up version of the Awesomeballs test is here! Hooray!

Here are the .class files. I've included a handy .bat file to run the command-line program for all you Windows users out there. Just unzip the contents and run the .bat file and away you go.

The source code is here.

I do appreciate all the people who have taken the time to run this, and I'll keep working on it as time allows to make it more interesting and informative where possible.

Run times for the 9/25 build:

Stars chip:

It took 545133 milliseconds to complete IntegerLoop.
It took 141815 milliseconds to complete FloatLoop.
It took 26995 milliseconds to complete IntegerLoopNoDiv.
It took 76675 milliseconds to complete FloatLoopNoDiv.
It took 544263 milliseconds to complete IntegerLoopWithLatch.
It took 139666 milliseconds to complete FloatLoopWithLatch.
It took 27023 milliseconds to complete IntegerLoopWithLatchNoDiv.
It took 76350 milliseconds to complete FloatLoopWithLatchNoDiv.

Total execution time for your selection is 1577920 milliseconds.

Jaguar chip:

It took 608504 milliseconds to complete IntegerLoop.
It took 583426 milliseconds to complete FloatLoop.
It took 125737 milliseconds to complete IntegerLoopNoDiv.
It took 231941 milliseconds to complete FloatLoopNoDiv.
It took 600236 milliseconds to complete IntegerLoopWithLatch.
It took 579095 milliseconds to complete FloatLoopWithLatch.
It took 125735 milliseconds to complete IntegerLoopWithLatchNoDiv.
It took 228563 milliseconds to complete FloatLoopWithLatchNoDiv.

Total execution time for your selection is 3083237 milliseconds.

Hmm. Looks like some stuff actually got slower on the Jaguar, such as IntegerLoop.

DrMrLordX · Sep 26, 2014

Finally, it looks like the test is running as originally intended.

In the 9/26 build, I've made two simple changes:

FloatLoop/FloatLoopNoDiv/FloatLoopWithLatch/FloatLoopWithLatchNoDiv all increment their loops with float variables instead of integer variables (j and increment are now float instead of int).

I swapped the number of loop iterations between the while loops and the for loops of every worker variant. Now each worker iterates the for loop 2 ^ 24 times and the while loop 2 ^ 7 times. The end result is that int3 does not roll over, and float3 never reaches Infinity. Both values top out at ~700k.
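One thing worth keeping in mind with a float loop counter and 2 ^ 24 iterations: a Java float can only represent every whole number exactly up to 2 ^ 24 (16777216). The sketch below is a standalone illustration, not the benchmark code. It assumes the counter starts at 0 and steps by 1.0f, which may not match what j and increment actually do in the workers, and FloatCounterDemo is a made-up name.

```java
// Hypothetical demo class; the step of 1.0f is an assumption, not the benchmark's value.
class FloatCounterDemo {
    // True when adding 1.0f no longer changes the value (the counter is stuck).
    static boolean stallsAt(float value) {
        return value + 1.0f == value;
    }

    // Counts iterations of a float-driven for loop.
    // Only safe for limits up to 2^24: above that, j stops advancing
    // and the loop would never terminate.
    static long countIterations(float limit) {
        long count = 0;
        for (float j = 0.0f; j < limit; j += 1.0f) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        float twoTo24 = 16777216.0f; // 2^24
        System.out.println(stallsAt(twoTo24 - 1.0f)); // false: still counts exactly
        System.out.println(stallsAt(twoTo24));        // true: adding 1.0f has no effect
        System.out.println(countIterations(twoTo24));
    }
}
```

So under those assumptions a float counter running to 2 ^ 24 sits right at the edge of what it can count exactly. Going any higher with a step of 1.0f would need a double or an int counter.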

Here are my results from 9/26:

Stars chip:

It took 377182 milliseconds to complete IntegerLoop.
It took 199608 milliseconds to complete FloatLoop.
It took 40406 milliseconds to complete IntegerLoopNoDiv.
It took 100853 milliseconds to complete FloatLoopNoDiv.
It took 353902 milliseconds to complete IntegerLoopWithLatch.
It took 171373 milliseconds to complete FloatLoopWithLatch.
It took 86568 milliseconds to complete IntegerLoopWithLatchNoDiv.
It took 99511 milliseconds to complete FloatLoopWithLatchNoDiv.

Total execution time for your selection is 1429403 milliseconds.

Jaguar chip:

It took 614220 milliseconds to complete IntegerLoop.
It took 592927 milliseconds to complete FloatLoop.
It took 148752 milliseconds to complete IntegerLoopNoDiv.
It took 466238 milliseconds to complete FloatLoopNoDiv.
It took 604581 milliseconds to complete IntegerLoopWithLatch.
It took 586611 milliseconds to complete FloatLoopWithLatch.
It took 149063 milliseconds to complete IntegerLoopWithLatchNoDiv.
It took 473314 milliseconds to complete FloatLoopWithLatchNoDiv.

Total execution time for your selection is 3635706 milliseconds.

9/26 is slower all across the board on Jaguar than 9/25. Since I made two changes between versions, it is difficult to assess exactly what caused the dramatic increase in execution time for FloatLoopNoDiv and FloatLoopWithLatchNoDiv. I would say that some of the decrease in speed comes from increased reliance on yet another incrementation variable (j). That may also be the source of some of the slowdown on the Stars chip.

The big change is in the increase in performance on Stars when running IntegerLoop/IntegerLoopWithLatch. Apparently mixing integer division with integer rollover is just bad news.
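To put numbers on those changes, here is a small sketch that divides the 9/26 time by the 9/25 time for a few of the results posted above. The millisecond values are copied from the run times in this thread; the class and method names are made up for illustration.

```java
// Hypothetical helper for comparing two builds; timings are copied from the posted results.
class BuildComparison {
    // Ratio of new time to old time: below 1.0 is faster, above 1.0 is slower.
    static double ratio(long oldMillis, long newMillis) {
        return (double) newMillis / (double) oldMillis;
    }

    public static void main(String[] args) {
        // Stars, IntegerLoop: 9/25 vs 9/26
        System.out.println("Stars IntegerLoop: " + ratio(545133L, 377182L));
        // Stars, IntegerLoopWithLatch: 9/25 vs 9/26
        System.out.println("Stars IntegerLoopWithLatch: " + ratio(544263L, 353902L));
        // Jaguar, FloatLoopNoDiv: 9/25 vs 9/26
        System.out.println("Jaguar FloatLoopNoDiv: " + ratio(231941L, 466238L));
        // Jaguar, FloatLoopWithLatchNoDiv: 9/25 vs 9/26
        System.out.println("Jaguar FloatLoopWithLatchNoDiv: " + ratio(228563L, 473314L));
    }
}
```

A ratio of about 0.69 for IntegerLoop on Stars means the 9/26 build takes roughly 31% less time there, while a ratio of about 2.0 for FloatLoopNoDiv on Jaguar means it takes roughly twice as long.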

If anyone has time to run this test variant, please do. The .class files for 9/26 are now available, and so is the source.

Ramses · Sep 26, 2014

if you can tell my dense self what the command is to run it I'd be happy to

Vesku · Sep 26, 2014

Chrome thinks your dropbox link file is malicious. I know that stuff gives false positives but it is enough to put me off running it, sorry. =S

DrMrLordX · Sep 26, 2014

Ramses said:
if you can tell my dense self what the command is to run it I'd be happy to

Okay, no problem.

Assuming you are running Windows, download the classes file, and extract it anywhere you like. Then double-click runme.bat and you should get a command prompt with the text version of the program.

Vesku said:
Chrome thinks your dropbox link file is malicious. I know that stuff gives false positives but it is enough to put me off running it, sorry. =S

Really? Weird. Does it give the same report on the source file? This system shouldn't be infected with anything, but I'll double-check the archive just to be sure . . .

Okay, I uploaded it to Virustotal and they didn't find anything. My guess is the .bat file is making Chrome throw a false positive. Just a guess.

virscan also reports the file as being clean.

Since Chrome is being funny about the file, I suppose I'll have to work on a GUI version so I can slap it into a self-executing .jar. Sadly, those don't work for console apps (well, not without using Swing or something to make a textbox).
