[ Bloomberg ] AMD Facing Bleak Future


NTMBK

Lifer
Nov 14, 2011
10,322
5,352
136
So why is this code running so badly on AMD's uarch compared to Intel's? Is there any optimization that could be done to take advantage of AMD's uarch without advanced extensions?

It probably depends to a large degree on the implementation of the standard library by Dinkumware. If their implementation of those trigonometric functions happens to be more optimised for Intel processors than AMD processors, it will show up in code like that.

(Note, this isn't me being conspiratorial and claiming that Dinkumware intentionally crippled AMD performance- just that they probably developed their code on Intel powered workstations, and hence optimised for the platform they knew best without even realising they were doing it.)
 

nenforcer

Golden Member
Aug 26, 2008
1,767
1
76
I just have to wonder what an evolution of Phenom II X4/X6 would have been like on 32nm. Transistors capable of 5 GHz coupled with a move to a four-wide front end (up from three-wide) probably would have been pretty exciting. (With any luck we will see a return of this evolution.)

 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
In response to claims that 3DPM is biased against AMD made by Abwx.

3DPM is written, at its base level, very simply.

A for loop is declared to be multithreaded, and the code within the loop deals with x/y/z co-ordinates for trigonometric transforms on a struct with three main float class members.



I used the code to publish several scientific papers regarding electrochemical motion and interaction with surfaces. This is code written by a scientist, rather than a computer scientist with a background in code or programming languages. For lack of a better word, a self-taught noob. I'm a physical chemist first, programmer second.

So one could argue that the loops involved require integers, and the random number generator is predominantly bitshifts, but the bulk of the mathematics that takes time is basic floating point trig functions.

Disclosure: I'm the Senior Editor Ian Cutress on the main site. I don't visit the forums that much(!) If anyone wants to double confirm, I use this handle on Twitter as a secondary account as well as over at OCN. You can tweet me at @IanCutress or @borandi with this link and I'll respond.

What this bench states is that Bay Trail has 50-60% better FP IPC than Piledriver/Trinity/Vishera. Do you stand by this number? You have to if you want to insist that this bench is neutral.

Of course this number is totally bogus when we compare the same CPUs in Cinebench 11.5, which uses SSE2 at a rate of 70%, or POV-Ray. You have those benches in your database; didn't the ridiculous score below ring a bell when compared to said FP CB11.5 and POV-Ray benches?

Anyway, you are giving a completely wrong picture of AMD's products with this bizarro bench. I would go as far as calling it a viral bench, whether this is due to ignorance when designing it or anything else:



 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136

This benchmark is heavily cache oriented.

Celeron J1900 is 2x faster in multithread than Celeron J1800 simply because the J1900 has 2MB of L2 cache vs 1MB on the J1800.

And performance doubles again with the Atom C2750, which has 4MB of L2 cache and 8 cores. Also, the Atom C2750 is faster than the Core i5 2500K.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
This benchmark is heavily cache oriented.

Celeron J1900 is 2x faster in multithread than Celeron J1800 simply because the J1900 has 2MB of L2 cache vs 1MB on the J1800.

And performance doubles again with the Atom C2750, which has 4MB of L2 cache and 8 cores. Also, the Atom C2750 is faster than the Core i5 2500K.

Nonsense.

The J1800 is 2 cores, J1900 4 cores.
 

SlickR12345

Senior member
Jan 9, 2010
542
44
91
www.clubvalenciacf.com
Their Bulldozer bet really backfired. I mean, their Phenom processors were really competitive and they had 6-core processors on the market, but they decided to scrap that design win and go for a terribly slow lane design that relied on applications supporting more than 4 cores at the time, and they failed.

Their design came to market too soon and couldn't take advantage of any software. I don't remember any program using more than 4 cores, and even 4-core programs were rare.

Most programs used 1 core, a decent number used 2 cores, and a very small number used 4 cores. So AMD's bet on many cores didn't pay off. Since no program supported 6 or 8 cores, they had to scrap their plans for 12- and 16-core processors and decided to try to speed up the single lanes by ramping up the frequency, but it was too little, too late.

I think they should have just continued with their strategy and not backed off: release the 12-core and 16-core processors and just hope and pray programmers code their software to make use of all those cores.

I still think they are in a decent position. They are not out yet and only need one design win. With how slow the refresh process is these days, one design win can put them on top for a year.
 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
This benchmark is heavily cache oriented.

Celeron J1900 is 2x faster in multithread than Celeron J1800 simply because the J1900 has 2MB of L2 cache vs 1MB on the J1800.

And performance doubles again with the Atom C2750, which has 4MB of L2 cache and 8 cores. Also, the Atom C2750 is faster than the Core i5 2500K.

As pointed out by Shintai, the core count matters, but there must be a hell of a lot of CPU dispatching going on given the Bay Trail score relative to the 5800K. It is just impossible that it has 50-60% better IPC in FP. If I were the bench designer I would double check everything, as it's obvious that it's literally a viral marketing bench in its current form. In his post Ian Cutress more or less admits that he did only the mathematical part and that he doesn't actually know how this bench is optimised or unoptimised with respect to code paths. For me it's obvious that the Intel CPUs have a much better optimised path than Piledriver. For the record, I get a single-core score of 71 with a 2.2 GHz Pentium T4400. This suggests that the ST FP perf of my T4400 is as good as a 4.1 GHz Piledriver's, hence it has at least 80% more FP IPC, which is just ridiculous.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
One of the algorithms looks like this, in mixed C++/pseudocode using OpenMP:
.

sorry, but this code makes no sense. For example, the induction variable "j" isn't used in the inner loop; there is just a series of overwrites to particle instead

also using a statement like
if(particle.z < 0) {particle.z -= particle.z;}

for a clamp to 0.0 (or a buggy absolute value ?) looks beyond clumsy

I used the code to publish several scientific papers

I really hope this was a fixed version of this nonsense, after proper peer review
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
As pointed out by Shintai, the core count matters, but there must be a hell of a lot of CPU dispatching going on given the Bay Trail score relative to the 5800K. It is just impossible that it has 50-60% better IPC in FP. If I were the bench designer I would double check everything, as it's obvious that it's literally a viral marketing bench in its current form. In his post Ian Cutress more or less admits that he did only the mathematical part and that he doesn't actually know how this bench is optimised or unoptimised with respect to code paths. For me it's obvious that the Intel CPUs have a much better optimised path than Piledriver. For the record, I get a single-core score of 71 with a 2.2 GHz Pentium T4400. This suggests that the ST FP perf of my T4400 is as good as a 4.1 GHz Piledriver's, hence it has at least 80% more FP IPC, which is just ridiculous.

Piledriver has only two FPUs compared to Bay Trail's four. Piledriver and all the modular CPUs simply suck at this test; PII is significantly further ahead.

Again, as I said before, this is complex mathematical code involving a lot of high-latency instructions.

sorry, but this code makes no sense. For example, the induction variable "j" isn't used in the inner loop; there is just a series of overwrites to particle instead

also using a statement like
if(particle.z < 0) {particle.z -= particle.z;}

for a clamp to 0.0 (or a buggy absolute value ?) looks beyond clumsy

I really hope this was a fixed version of this nonsense, after proper peer review


Hint hint: nobody really cares about how optimized the code is, only that it works and returns the correct values.

Cutress's code, though it's being blasted (perhaps as it should be), is perfectly indicative of what you would find in small-scale research groups. Nobody in the group is a coder, and the programs are clumsily written but do what they need to do. I've seen many examples of poorly written code that continues to be used because it works, and that is all people really care about. Case in point: programs that iterate over x^2 values when the iteration only needed to be over x, or single-threaded code dealing with matrix algebra when it would be fairly simple to multithread it or use a GPU. Nobody has the time or knowledge to implement such changes, though it would be fairly trivial to do so.

3DPM is not indicative of large-scale scientific calculations or simulations. It is, however, very similar to how a chemist or physicist with some programming knowledge is going to implement their code for small-scale research work.

Take it or leave it. I think it's nice to know but not at all relevant for consumers or anyone outside the field.
 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
Piledriver has only two FPUs compared to Bay Trail's four. Piledriver and all the modular CPUs simply suck at this test; PII is significantly further ahead.

Again, as I said before, this is complex mathematical code involving a lot of high-latency instructions.

If it were an Intel CPU that was that disadvantaged we would hear a totally different discourse, wouldn't we? It just shows that ultra-biased benches suit you as long as it's AMD that is handicapped...

Your explanations are just irrelevant. If Piledriver were that weak compared to Bay Trail in FP, this would show in Cinebench or POV-Ray.

You have Cinebench R10, 11.5 and R15, and POV-Ray, which are FP. In the former three the 5800K is twice as fast as a J1900, and 2.5x with POV-Ray. But keep on spreading your thoughts that 3D particle could be an accurate representation of those two CPUs' respective FP perf; sure, that will help increase your credibility.

http://www.anandtech.com/bench/product/675?vs=1227
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136

bronxzv

Senior member
Jun 13, 2011
460
0
71
Hint hint: nobody really cares about how optimized the code is, only that it works and returns the correct values.

this code is very obviously wrong, so it can't return correct values

moreover, since most computations are pure waste (overwritten at each step), only the last step's values are kept, so no one can tell what is measured. For example, if the compiler knows that the sin/cos functions have no side effects, it can call them only once, at the last step, and call the random generator steps times
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
If it were an Intel CPU that was that disadvantaged we would hear a totally different discourse, wouldn't we? It just shows that ultra-biased benches suit you as long as it's AMD that is handicapped...

Your explanations are just irrelevant. If Piledriver were that weak compared to Bay Trail in FP, this would show in Cinebench or POV-Ray.

You have Cinebench R10, 11.5 and R15, and POV-Ray, which are FP. In the former three the 5800K is twice as fast as a J1800, and 2.5x with POV-Ray. But keep on spreading your thoughts that 3D particle could be an accurate representation of those two CPUs' respective FP perf; sure, that will help increase your credibility.

http://www.anandtech.com/bench/product/675?vs=1227

First Part: No I never said that. In this particular test AMD doesn't do well. That's that.

Again, CB and POV-Ray are composed of relatively simple instructions. There are no trig calculations, which take a very long time compared to MULs and ADDs.

And I never said that either. I said that it was a fairly good indication of the type of performance I could expect from similar code with minimal optimization.

Kabini with four(4) FPUs which are faster than Baytrail also sucks running this code.

And it seems Intel HT is working very well, unlike AMD's CMT.

Also, AMD Phenom X6 is faster than Quad Core Haswell

http://www.anandtech.com/show/8427/amd-fx-8370e-cpu-review-vishera-95w/2

P II x6 is significantly faster than similarly clocked FX 8 core too.

As for CMT, this code was written before AMD's CMT and contains no processor-specific optimizations. It was likely written for the PII/Core 2 architecture (i.e. the compiler of that period), and as Intel's modern architecture is much more similar to Core 2 than PD is to PII, that is likely the reason why it does so badly on PD.

Again, what people don't seem to understand is that the bulk of the time is spent on the trig calculations. If CPU A has a latency of 50 cycles and CPU B has a latency of 80 cycles for trig calculations, then CPU A is going to be faster than CPU B in trig-dominated code even if it has only 65% of the IPC of CPU B in general-purpose code.
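A rough way to sanity-check the latency argument above is a two-term cost model: cycles spent in trig calls plus cycles spent on everything else. This is only a sketch. The function names, the baseline IPC of 1.0 for CPU B, and the assumption that trig calls run back to back without overlapping other work are all illustrative; only the 50- and 80-cycle latencies and the 65% IPC figure come from the post.

```cpp
#include <cassert>

// Hypothetical cost model: serialized trig calls plus all other
// instructions executed at a given average IPC.
double estimated_cycles(double trig_calls, double trig_latency,
                        double other_instructions, double ipc) {
    assert(ipc > 0.0);
    return trig_calls * trig_latency + other_instructions / ipc;
}

// CPU A: 50-cycle trig, 65% of CPU B's general-purpose IPC.
double cycles_cpu_a(double trig_calls, double other_instructions) {
    return estimated_cycles(trig_calls, 50.0, other_instructions, 0.65);
}

// CPU B: 80-cycle trig, baseline IPC of 1.0 (illustrative value).
double cycles_cpu_b(double trig_calls, double other_instructions) {
    return estimated_cycles(trig_calls, 80.0, other_instructions, 1.0);
}
```

Under these assumptions CPU A comes out ahead for 1,000 trig calls plus 20,000 other instructions (about 80,800 vs 100,000 cycles). The advantage flips once there are more than roughly 56 other instructions per trig call, so the argument only holds while trig dominates the runtime.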


this code is very obviously wrong, so it can't return correct values

Again are you looking at the entirety of the code? This may be a roundabout and obtuse way of doing this but if the correct answer is returned then that is all that matters.

You are not looking at the code correctly. j is a dummy variable used to iterate over steps. i is the particle number. j is reset to 0 for each particle i. You don't like how it looks, fine, but will it work?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
but will it work?

obviously not, if you replace the "=" with "+=" it makes some sense, though

hint: see the "starting position" comment where positions are initialized. This should be a summation, and the code, as is, is clearly buggy as hell
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
obviously not, if you replace the "=" with "+=" it makes some sense, though

hint: see the "starting position" comment where positions are initialized. This should be a summation, and the code, as is, is clearly buggy as hell

It's very likely he simply extracted code from the program and it's missing pieces.

I still don't understand why you do not like that line. The variable is negative. The line subtracts the negative variable from itself (-=) giving 0. Perhaps setting directly to 0 would be better but this looks to me like it should work (I am no expert on programming).
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
It's very likely he simply extracted code from the program and it's missing pieces.

the inner loop makes no sense, even if extracted from tons of other code, it is maybe a summation as in the last example in my post below ?

I still don't understand why you do not like that line. The variable is negative. The line subtracts the negative variable from itself (-=) giving 0. Perhaps setting directly to 0 would be better but this looks to me like it should work (I am no expert on programming).

indeed this one will work, provided that clamp to 0.0 is its intent
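For reference, the disputed line can be checked in isolation. The sketch below wraps it in a function next to a direct clamp and an absolute value, to show which one it matches. All function names here are made up for illustration and are not from the 3DPM source.

```cpp
#include <cassert>
#include <cmath>

// The statement under discussion: for a negative finite z, z - z is 0,
// so this clamps negative values to zero.
float clamp_by_self_subtract(float z) {
    if (z < 0) { z -= z; }
    return z;
}

// A more direct spelling of the same clamp.
float clamp_direct(float z) {
    return z < 0.0f ? 0.0f : z;
}

// What an absolute value would do instead (reflect rather than clamp).
float reflect_abs(float z) {
    return std::fabs(z);
}

// Small self-check; call it from main() to try the functions out.
void check_clamp() {
    assert(clamp_by_self_subtract(-2.5f) == clamp_direct(-2.5f));
    assert(clamp_by_self_subtract(3.0f) == 3.0f);
    assert(reflect_abs(-2.5f) == 2.5f);
}
```

One edge case does differ: for z equal to negative infinity, `z -= z` evaluates to NaN rather than 0 under IEEE 754 arithmetic, so assigning 0 directly is the safer spelling.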
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
It's very likely he simply extracted code from the program and it's missing pieces.

consider the simplified example below (where I replaced "steps" by a value)

Code:
float particle_i_x = 0.0;
for (int j=0; j < 100; j++)
{
  float alpha = 2 * pi * randgen();
  particle_i_x  = cosf(alpha);
}
the code can be compiled to do :
100 times in a row
- call the random generator function
- call the cos function
- rewrite the destination (old value lost, computed in pure waste)

the code can also be compiled to do :
100 times in a row
- call the random generator function
1 time
- call the cos function
- write the destination

1) the 99/100 unused computations are a sure sign that this is a buggy example
2) the fact that it can be compiled in several *very different* ways makes it very bad as a benchmark test


this code with a summation will make more sense IMHO:

Code:
double particle_i_x = 0.0;
for (int j=0; j < 100; j++)
{
  float alpha = 2 * pi * randgen();
  particle_i_x  += cosf(alpha);
}
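
The two readings of the loop can also be compared as complete functions. The sketch below is a stand-in written under assumptions, not the 3DPM source: `Rng` is a hypothetical xorshift32 replacement for `randgen()`, `kTwoPi` replaces `2 * pi`, and the function names are invented for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for randgen(): xorshift32 mapped to [0, 1).
struct Rng {
    std::uint32_t state;
    explicit Rng(std::uint32_t seed) : state(seed) { assert(seed != 0); }
    float next() {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        // Keep the top 24 bits so the value is exactly representable.
        return (state >> 8) * (1.0f / 16777216.0f);
    }
};

const float kTwoPi = 6.28318530718f;

// Overwrite form: only the final iteration's cosine survives, so a
// compiler may skip every earlier cos call.
float last_step_only(Rng& rng, int steps) {
    float x = 0.0f;
    for (int j = 0; j < steps; j++) {
        float alpha = kTwoPi * rng.next();
        x = std::cos(alpha);
    }
    return x;
}

// Summation form: every iteration contributes to the result, so none of
// the cos calls can be dropped.
double random_walk_x(Rng& rng, int steps) {
    double x = 0.0;
    for (int j = 0; j < steps; j++) {
        float alpha = kTwoPi * rng.next();
        x += std::cos(alpha);
    }
    return x;
}
```

With the same seed, `last_step_only(rng, 100)` gives the same value as advancing the generator 99 times and taking a single cosine, which is exactly the shortcut a compiler is allowed to take. Returning the accumulated sum from `random_walk_x` keeps all 100 cosines live.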
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
I think his example code is missing something. I understand what you are saying but this is programming and even if true would not cause performance differences between architectures.
 

inf64

Diamond Member
Mar 11, 2011
3,864
4,546
136
I agree with bronxzv, it's very, very bad programming. There is no way around that fact, I'm afraid.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I agree with bronxzv, it's very, very bad programming. There is no way around that fact, I'm afraid.

btw, out of curiosity I quickly profiled the 1st stage "TRIG" of SingleThread.exe (downloaded from http://www.borandi.co.uk/3DPM) with VTune, and the main hotspot function consists of x87 code such as fild, fstp, fadd, etc. and a lot of integer code (shrd, add, adc, ...)

it will make more sense to compile it for >= SSE2 targets with an >= SSE2 math library for benching 2014 CPUs
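
One way to confirm which floating-point target a build actually used is to check the compiler's predefined macros. The sketch below is a minimal check under assumptions: it relies on `__SSE2__` (GCC/Clang) and `_M_X64` / `_M_IX86_FP` (MSVC), and the function names are invented for illustration. It says nothing about which math library the binary links against, which is a separate choice.

```cpp
#include <cassert>
#include <string>

// True when this translation unit was compiled with SSE2 (or later)
// code generation enabled, judged by common predefined macros.
bool built_for_sse2() {
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return true;
#else
    return false;
#endif
}

// Human-readable summary of the result above.
std::string fp_codegen_target() {
    return built_for_sse2() ? "SSE2 or later"
                            : "x87, SSE1, or a non-x86 target";
}
```

As far as I know, 32-bit MSVC builds from before VS 2012 generate x87 code unless `/arch:SSE2` is passed, which would be consistent with the x87 hotspot seen in the profile.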

some code is called in the CRT in MSVCR90.dll which is a VS 2008 DLL, this is probably a VC++ 2008 compilation, not a VC++ 2012 compilation as stated by some people in this thread

let me know if I used the wrong download link
 