[ Bloomberg ] AMD Facing Bleak Future

bronxzv

Senior member
Jun 13, 2011
460
0
71
Code:
  particle[i].x = r*cosf(alpha);
  particle[i].y = r*sinf(alpha);
  particle[i].z = newz;

with code like this the main issue with the multi-thread variant is probably the false cache line sharing due to your struct of 3 floats (12 B assumed since the actual struct isn't visible in your example)

there are a lot of writes to the same cache line by groups of 5+ threads (64/12 ≈ 5.3 structs per 64-byte line)

if my hypothesis is true, your benchmark is a cache synchronization test not a floating point test

a simple solution will be to pad the struct to 16 B and to do 4 particles per OMP thread instead of 1 particle per OMP thread
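
A minimal sketch of the padding fix suggested above, assuming a three-float struct (the real one was never posted). The names `Particle` and `place_particles` are hypothetical, and `r`, `alpha` and `newz` are passed as plain parameters here only to keep the example short. It also assumes the array starts on a 64-byte boundary, which `std::vector` does not guarantee.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical layout: three floats (12 B) padded to 16 B by alignas(16),
// so exactly four particles fit in one 64-byte cache line.
struct alignas(16) Particle {
    float x, y, z;
};
static_assert(sizeof(Particle) == 16, "Particle should be padded to 16 bytes");

// Writes the same position into every particle, as in the posted snippet.
// schedule(static, 4) hands out iterations in chunks of four, so each
// group of four particles (one cache line, if the array is line-aligned)
// is written by a single thread.
void place_particles(std::vector<Particle>& particles,
                     float r, float alpha, float newz) {
    assert(r >= 0.0f);
    const long n = static_cast<long>(particles.size());
    #pragma omp parallel for schedule(static, 4)
    for (long i = 0; i < n; ++i) {
        particles[i].x = r * std::cos(alpha);
        particles[i].y = r * std::sin(alpha);
        particles[i].z = newz;
    }
}
```

Without OpenMP enabled the pragma is ignored and the loop runs serially, so the function behaves the same either way; only the timing changes.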
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
with code like this the main issue with the multi-thread variant is probably the false cache line sharing due to your struct of 3 floats (12 B assumed since the actual struct isn't visible in your example)

there are a lot of writes to the same cache line by groups of 5+ threads (64/12 ≈ 5.3 structs per 64-byte line)

if my hypothesis is true, your benchmark is a cache synchronization test not a floating point test

a simple solution will be to pad the struct to 16 B and to do 4 particles per OMP thread instead of 1 particle per OMP thread

Execution time seems to be limited primarily by the trig calculations. Writing to L1 takes around 3-4 cycles. Trig calculations and sqrt take around 32-36 cycles.

I confess I am not a programmer.
 

BigDaveX

Senior member
Jun 12, 2014
440
216
116
If it was an Intel CPU that was that disadvantaged we would hear a totally different discourse, wouldn't we? It just shows that ultra-biased benches suit you as long as it's AMD that is handicapped...

Actually, you're wrong there. Some of us remember back in 2000-2001, when the early Pentium 4s had just been brought to the market. And on this type of loose, unoptimized coding - which at the time was mostly pure integer and x87 math, with maybe some SSE1 code if you were lucky - the P4 was absolutely annihilated by the Athlon and Pentium III. As in, barely half the per-clock performance of its rivals.

And guess what? People didn't blame the programmers, they blamed Intel for designing a bad chip.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Execution time seems to be limited primarily by the trig calculations.

not so obvious to me, since the speedup with hyperthreading of the "TRIG" component is 1.98x on my Core i7-4770K with a very small working set; it should be 1.2x or less with typical fp-intensive code

have a look here http://en.wikipedia.org/wiki/False_sharing for some basic understanding of the false cache line sharing issue
you can also google for "false cache line sharing"

each core's L1D being shared by two hw threads would explain the unusual speedup with HT: there is 2x less ping-pong of falsely shared lines between the L1Ds of different cores; SMT should also help to maximize throughput despite all the L1D misses caused by other cores continuously invalidating the shared lines

more or less all components of the benchmark seem to suffer from the same basic issue, IMHO there is no point in wasting more time analysing it before the author addresses this weakness
 

NTMBK

Lifer
Nov 14, 2011
10,322
5,352
136
I would recommend using something like TBB to get better threading behaviour. It has useful built in helpers to subdivide your task into ranges, and does a decent job of keeping your threads balanced.
 

Elixer

Lifer
May 7, 2002
10,371
762
126
I just got the program to see what it is about, and it was compiled with an old compiler.
IMO, it should be cross-compiled for Windows using gcc, or compiled under a recent version of MSVC.
Of course, without the source to examine (BTW, any consideration of open sourcing this benchmark?), and only going by the disassembly, it doesn't look that well optimized (which could be because of the older compiler being used, or because the compile flags were not set correctly).
If this had been compiled with the latest version of gcc (or clang), I am betting the results would be vastly different from what is shown for both AMD & Intel.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Some real nice conspiracy theories here


But the reality is that the MS compiler generates code just like that in 32-bit mode with default switches. I doubt you can blame AT, when pros like Bethesda released Skyrim compiled with disastrous switches (I know because I was involved in the Skyboost project, where the Skyrim exe was patched in memory to replace hot x87 code with SSE2 code and hand-tuned SSE4/AVX code). So there is plenty of code like that; you need to care very much, and profile even more, which is not realistic for every project.

Back on the topic of 3DPM - it involves x87 trigonometry instructions. They are usually the slowest part of any loop involving them. They are microcoded in CPUs (i.e. broken down into the dozens of operations required to calculate a trigonometric function with the required precision), and that means few OoO (out-of-order) opportunities and lots of scheduling pain inside (because that fcos might be executing some mini loop inside that needs to be tracked, etc.).
HT is a perfect fit here, because the CPU's execution ports are underutilized and the 2nd thread can make progress while more of the broken-down microcoded ops are being tracked...

It really puts AMD in a bad light, because it ends up being a benchmark of the money put into x87 microcode quality and scheduling, but it's a real-world benchmark; there are plenty of programs with code like that getting executed.

But where performance really matters (like getting results out in time before a sci paper deadline), no one will use code like that. Intel has some amazing stuff in their libraries where they use SIMD instructions, or a GPU will get used to pump out even more values/s.
 

cytg111

Lifer
Mar 17, 2008
23,998
13,522
136
..(I know because I was involved in the Skyboost project, where the Skyrim exe was patched in memory to replace hot x87 code with SSE2 code and hand-tuned SSE4/AVX code). ..

just googled that, a ~2x increase in fps. 1. That is pretty impressive 2. The profiling/reversing job, that is pretty amazing. You into rce?
 

schmuckley

Platinum Member
Aug 18, 2011
2,335
1
0
The only thing I want to hear from AMD is:
"Here's a good, powerful desktop/server chip that beats/matches the competition"
I don't want to hear about onboard graphics or power usage.
2016 is too far away. I surely will not buy any motherboard before they release a chip that shows promise and is reviewed properly. They've spent 6 years going down the wrong path.
If your job is to make good CPUs... and you can't find something that's better than what you had before within 6 years... forget it.
 

borandi

Member
Feb 27, 2011
138
117
116
sorry but this code makes no sense, e.g. the induction variable "j" isn't used in the inner loop but there is a series of overwrites to particle instead,

also using a statement like


if(particle.z < 0) {particle.z -= particle.z;}

for a clamp to 0.0 (or a buggy absolute value ?) looks beyond clumsy

I really hope this was a fixed version of this nonsense, after proper peer review


My apologies, I made a mistake in that line. It should be, essentially:

if (a < 0) {a = -a;}

I put the negative sign in the wrong place.

The code there is a mix of pseudo and RW code, just to show what goes on.

As Enigmoid points out, I used code like this in my small-scale research group projects back in 2008-2011. 3DPM was created out of my attempts to configure the CPU code before I ported it onto CUDA. This was back in 2010-ish, when the CPU version of the program was last compiled. I used the CUDA version, on a GTX 280/460 no less, for generating the bigger (1wk+) simulation results for research and the CPU version for the more nuanced edge-cases that only took a few minutes to run.

It is interesting about the x87 code paths. As a non-CS student, I don't really know what that is, nor how to code for SSE2 and up. The compiler flags were set for speed and SSE2 (iirc) and the compiler did what it does. I do know that there's a trick with x87 code on AMD, causing SuperPi to speed up:

http://forum.hwbot.org/showthread.php?t=78490

Though I remember testing it once, ages ago, with no real difference in the result. I should have some new FM2+ CPUs inbound, but I do not have an AMD system currently set up (limited space, can't keep everything set up forever). So if/when the CPUs arrive, I will try TheStilts software again to see if it makes a difference. That software isn't something that everyone is going to use, though, especially not a small-scale researcher.
 

DrMrLordX

Lifer
Apr 27, 2000
22,035
11,620
136
Hey Ian, thanks for posting that code snippet. If that is at all indicative of what is in the x86 version of the code, then I might have some ideas as to why things go poorly on certain AMD chips (possibly all of them).

In my limited experience, AMD chips choke badly on standard math library trig functions. The Java trig implementations are ungodly slow. You could probably do better with lookup tables if memory is not an issue (how precise did this have to be anyway?). I ran into problems with atan and atan2 recently on an A10-7700K and solved the problem by avoiding the stock math libraries in favor of numerical approximations that I knew were amenable to SIMD optimization. Granted, I stuck with a solution that was not very accurate, but there were other similar options available.
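
A rough sketch of the lookup-table idea mentioned above, not anything taken from 3DPM. The class name `SinTable`, the table size and the nearest-entry rounding are all assumptions made for illustration; the worst-case error is about half the table step (roughly 8e-4 for 4096 entries), so whether it is precise enough depends on the simulation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr double kTwoPi = 6.283185307179586;

// Hypothetical sine lookup table covering one full turn [0, 2*pi).
class SinTable {
public:
    explicit SinTable(std::size_t size) : table_(size) {
        assert(size > 0);
        for (std::size_t i = 0; i < size; ++i) {
            table_[i] = static_cast<float>(std::sin(kTwoPi * i / size));
        }
    }

    // Returns the table entry nearest to `angle` (radians, any sign).
    float lookup(float angle) const {
        const double turns = angle / kTwoPi;
        const double frac = turns - std::floor(turns);  // wrapped into [0, 1]
        const std::size_t idx =
            static_cast<std::size_t>(frac * table_.size() + 0.5) % table_.size();
        return table_[idx];
    }

private:
    std::vector<float> table_;
};
```

A cosine can be read from the same table by looking up `angle + pi/2`. Interpolating between neighbouring entries would cut the error considerably at the cost of a second load.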

I also found stock random # generation to be very slow, particularly if I dropped random number generation into the middle of my code. Again, if memory is not an issue, you might be better off using a pregenerated list of random #s and piping them in as necessary, or pre-generating a bunch of numbers at launch before you actually start timing the benchmark (which is my solution to the problem).
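
The pre-generation approach described above could look roughly like this. It is a sketch under assumptions: the function names are made up, `std::mt19937` stands in for whatever generator the benchmark actually uses, and the summing loop is only a placeholder for the real timed work.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Fills a buffer with uniform random floats before any timing starts.
std::vector<float> pregenerate_uniform(std::size_t count, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> values(count);
    for (float& v : values) {
        v = dist(rng);
    }
    return values;
}

// Times only the loop that consumes the pre-generated numbers.
// Returns elapsed seconds and writes the (placeholder) result to sum_out.
double timed_sum(const std::vector<float>& values, double& sum_out) {
    assert(!values.empty());
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (float v : values) {
        sum += v;
    }
    const auto stop = std::chrono::steady_clock::now();
    sum_out = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

A fixed seed also makes runs repeatable, which helps when comparing results between machines.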

Regardless, looking at your code snippet, I think you should be able to AVX/AVX2 that sucker. Not exactly sure what OpenMP or VS2012 responds to the most when it comes to SIMD optimization, but I know how to get Sun's JVM to do it.
 

Puffnstuff

Lifer
Mar 9, 2005
16,148
4,848
136
Seems like this topic resurfaces every couple of weeks and it's always the same stuff rehashed. Too bad that AMD hasn't made some kind of tremendous breakthrough to increase their CPU performance like it was back in the days of the original Athlon. At least they've moderated Intel's pricing over the years.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
It should be, essentially:
if (a < 0) {a = -a;}

thank you for the fix, though it isn't the most important issue I have raised

is it right to assume that when you wrote:

Code:
particle[i].x = r*cosf(alpha);
particle[i].y = r*sinf(alpha);
particle[i].z = newz;

you were actually meaning:

Code:
particle[i].x += r*cosf(alpha);
particle[i].y += r*sinf(alpha);
particle[i].z += newz;
?

now, my most important concern is this one about the false cache line sharing:

http://forums.anandtech.com/showpost.php?p=37026913&postcount=226


you should be able to enjoy sizable speedups by addressing this issue, I'll be interested to learn how much faster it is after a very small change
 

DrMrLordX

Lifer
Apr 27, 2000
22,035
11,620
136
Hey, learn something new every day. That would also indicate why the software runs so poorly on AMD processors compared to Intel processors. Their cache architecture is inferior, and has been for some time.

edit: is the cache write issue true for all multithreaded apps with large numbers of threads, or is this more of an issue specific to OpenMP?
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
is the cache write issue true for all multithreaded apps with large numbers of threads, or is this more of an issue specific to OpenMP?

it is true for any threading framework whenever the user specifies the data layout in an inflexible way, in this case an array of C/C++ structs (aka AoS)

Xn will be 12 bytes apart from Xn+1, etc.; imagine the damage with 64-byte cache lines and all the "+=" (assuming that my fix stands) doing a read then a write access

if one wants to write multithreaded code without having to deal with such issues, C/C++ is not a sensible choice. A high-level tool like Matlab (typically slower than well-written C++ code but good for multithreading overall) can be even faster than C/C++ with poorly written code like the fragment discussed here, where AoS blocks vectorization and there is probably a *massive slowdown* due to the cache issue I mentioned (several months ago)
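
For contrast with the AoS layout discussed above, here is a hedged sketch of a struct-of-arrays (SoA) version. `ParticlesSoA` and `displace` are hypothetical names, and the per-particle `r` and `alpha` arrays are an assumption about how the real code feeds the update.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical SoA layout: each coordinate lives in its own contiguous array,
// so consecutive x values are 4 bytes apart instead of 12.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Applies the "+=" update from the thread to every particle.
// With the default static schedule each thread gets one contiguous block of
// indices, so threads only touch a common cache line at block boundaries.
void displace(ParticlesSoA& p, const std::vector<float>& r,
              const std::vector<float>& alpha, float dz) {
    assert(r.size() == p.x.size() && alpha.size() == p.x.size());
    const long n = static_cast<long>(p.x.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        p.x[i] += r[i] * std::cos(alpha[i]);
        p.y[i] += r[i] * std::sin(alpha[i]);
        p.z[i] += dz;
    }
}
```

The contiguous arrays are also the shape a compiler's auto-vectorizer handles most easily, although whether the trig calls themselves vectorize depends on the compiler and math library.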
 

DrMrLordX

Lifer
Apr 27, 2000
22,035
11,620
136
I think I understand what you're saying. 12-byte chunks of data won't align with 64-byte cache lines unless you're writing those chunks 16 at a time (that'd be 3 cache lines). But the fix you recommended would pad each struct in the array to 16 bytes, and you'd be handling four of them per thread, so that would line up neatly.
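
The arithmetic in that paragraph can be checked with a few helper functions. This is only an illustration: the names are made up, the array is assumed to start on a cache-line boundary, and 64 bytes is the line size used throughout the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>  // std::lcm (C++17)

constexpr std::size_t kCacheLine = 64;

// Number of structs after which the layout repeats relative to cache lines.
constexpr std::size_t structs_per_period(std::size_t struct_size) {
    return std::lcm(struct_size, kCacheLine) / struct_size;
}

// True when element i of a tightly packed array crosses a line boundary.
constexpr bool straddles_line(std::size_t i, std::size_t struct_size) {
    const std::size_t first = i * struct_size;
    const std::size_t last = first + struct_size - 1;
    return first / kCacheLine != last / kCacheLine;
}

// Cache lines covered by `count` consecutive structs starting at offset 0.
std::size_t lines_touched(std::size_t count, std::size_t struct_size) {
    assert(count > 0 && struct_size > 0);
    return (count * struct_size + kCacheLine - 1) / kCacheLine;
}

static_assert(structs_per_period(12) == 16, "12 B structs realign every 16");
static_assert(structs_per_period(16) == 4, "16 B structs realign every 4");
```

With 12-byte structs, element 5 occupies bytes 60 to 71 and so sits in two lines at once, which is why neighbouring threads collide; with 16-byte structs no element ever straddles a boundary.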

Um, this seems to be drifting way off-topic. I don't want to stop Ian from coming in here and talking about 3DPM, so if it's okay with everyone else maybe we can start a new topic on the subject with input from Ian and others?
 

Fjodor2001

Diamond Member
Feb 6, 2010
3,989
440
126
AMD vs Intel desktop market share:




On par with Intel in 2006. How did AMD manage to screw that up? And could the situation look different today if they had made some other decisions in 2002 and beyond? What decisions would those be?
 

Puffnstuff

Lifer
Mar 9, 2005
16,148
4,848
136
I think that their decision to buy ATI saved them by diversifying their product portfolio. Lately they're at it again with memory and SSDs; even if they're re-branded, they still generate revenue. Intel did the same thing many years earlier, which has also helped them.
 

BigDaveX

Senior member
Jun 12, 2014
440
216
116
On par with Intel in 2006. How did AMD manage to screw that up? And could the situation look different today if they had made some other decisions in 2002 and beyond? What decisions would those be?

The Core 2, and especially the fact that Intel were able to have its quad-core version out so quickly, really wrong-footed AMD. I'd like to think that the AMD of late 2005 were more savvy than their fanboys - who were confident that Yonah would just be two Pentium Ms crudely slapped together, and Merom/Conroe the same again but with 64-bit support - but judging by some of the noises made by ex-employees, that may be a little too generous.

Phenom was a good chip on paper, but it came out a year too late and clocked far too slow, with the TLB fiasco just making things worse. Phenom 2 got things back on track, but by then Intel had the i7, which nullified the Phenom's main advantages. In retrospect, going for the quad-core die across the entire range was their big mistake; they should have kept that for Opteron, and used a dual-chip MCM configuration on the desktop.

Llano had a similar problem. Against the Nehalem-based i3s and Pentium Dual-Cores it would have been extremely competitive, but instead had to go up against Sandy Bridge, a far tougher opponent.

Bulldozer... at best, it was the result of over-confidence, thinking that they could succeed where Intel had failed with NetBurst. At worst, it was a foolish gamble.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
AMD got arrogant in the K8 days. They thought that Intel could never catch K8. AMD dreamed about only selling high-end CPUs at $300+ and servers, while Intel could have the low end. 65nm got delayed on purpose, fabs got delayed, the new uarch got delayed, and money was wasted left and right with nothing spent on improving. The sky was the limit, until the sky came crashing down.

Had AMD had different leadership, it might be a completely different company today. And it's a real shame they have sunk this deep due to one incompetent management team after another and a company leadership culture that still hasn't changed a bit.
 

Amol S.

Platinum Member
Mar 14, 2015
2,411
713
136
I do not think AMD is a good processor. One of my professors in college told me in computer science class that the reason why AMD makes their products cheap is because their products do not last long.
 
Apr 20, 2008
10,064
984
126
I do not think AMD is a good processor. One of my professors in college told me in computer science class that the reason why AMD makes their products cheap is because their products do not last long.

I heard they only last something like three months and then they burn out and start blue screening. That and they catch viruses really easy too.

 
Mar 10, 2006
11,715
2,012
126
I do not think AMD is a good processor. One of my professors in college told me in computer science class that the reason why AMD makes their products cheap is because their products do not last long.

Yeahh, no. AMD's products are reliable and last long. Built a lot of Athlon II X3/X4 systems for friends back in my college days years ago, and they're all still running fine.
 