[OFFICIAL] Bulldozer Reviews Thread - AnandTech Review Posted


Elixer

Lifer
May 7, 2002
10,376
762
126
They have to take a Risk and price them where they are to start.

The 8150's are sold out everywhere.

NCIX.com has no bulldozers in stock.
That doesn't mean anything; they are sold out because their allocation was low.
They still have yield issues, so there is no short term fix.
With all new products, there is a specific class of people that will get it no matter what.

AMD doesn't need to manufacture and sell a massive number of their FX Bulldozer models, especially with GloFo's 32nm capacity being what it is.

Bulldozer has two purposes that are probably more important to AMD than the enthusiast market:
1) Servers. Interlagos should be quite effective at handling server workloads, as long as power consumption is reasonable.
2) Trinity. Since a modified form of Bulldozer will form the CPU portion of Trinity, AMD needed to see what the Bulldozer architecture's current strengths and weaknesses are in the realm of home computing. Now they have a better idea of what to address before Trinity tapes out.

As long as AMD can expect to sell every single Llano that they can possibly produce, it doesn't really matter all that much how Bulldozer sells; they can just produce more Llanos instead. Bulldozer may have more markup, but that doesn't mean Llano isn't profitable as well.
IIRC, Trinity has already taped out, unless they go back and redo some stuff.
I am also pretty sure all their internal tests showed what the CPU was capable of. The question is, why bother with the first iteration of BD (branded FX) when it is so lackluster for your average workload?
Will Piledriver fix it? No idea.
The smart move would have been to keep churning out their cash cow (Llano), tell everyone that BD II (aka Piledriver) will be here Q1 '12, and let it be.

Instead, they have a PR nightmare.
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
What should I be comparing? Rather than just telling me I'm wrong, how about correcting me.

If you look at the two right-hand columns, they compare 2 modules/4 threads
vs. 4 modules/4 threads.

The latter case is a classic CMP configuration,
since the single active core in each module has all the
front-end resources to itself.

It's obvious that they largely exceeded their expectation
of 80% of the performance of a theoretical BD with CMP, so JF-AMD
was spot on when he wrote 180%, which in fact means
two CMT cores performing at 90% of the same two cores
running in CMP....
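
A rough sketch of the arithmetic behind that reading, for anyone who wants to plug in numbers from the slide. The scores below are made-up placeholders, not figures from any review, and the helper names `cmt_efficiency` and `module_percent_of_one_core` are hypothetical:

```python
def cmt_efficiency(score_shared: float, score_spread: float) -> float:
    """Ratio of N threads packed two-per-module to N threads spread one-per-module."""
    if score_spread <= 0:
        raise ValueError("score_spread must be positive")
    return score_shared / score_spread


def module_percent_of_one_core(efficiency: float) -> float:
    """Express a fully loaded two-core module as a percentage of one unshared core."""
    return 2 * efficiency * 100


# Placeholder scores (assumed, NOT measured):
# the same 4-thread workload on 2 modules (CMT) and on 4 modules (CMP-like).
score_2m_4t = 90.0
score_4m_4t = 100.0

eff = cmt_efficiency(score_2m_4t, score_4m_4t)
print(f"CMT efficiency: {eff:.2f}")
print(f"Module vs. one unshared core: {module_percent_of_one_core(eff):.0f}%")
```

With these placeholder inputs the 90% and 180% figures are the same claim expressed two ways; real scores would have to come from the 2M/4T and 4M/4T columns themselves.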



 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
So BD is a multi-billion dollar experiment?

Strengths and weaknesses are part of the design phase (simulation). You don't just design a product and throw it out there to see how it works. You should have a pretty clear understanding of how things work/perform before you ever have silicon in hand.


well, yes and no...

AMD will still follow the APU road; Llano, Bulldozer and GCN are just steps.

The problem is, if Trinity and GCN fail like Bulldozer, they may never get there.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
If you look at the two right-hand columns, they compare 2 modules/4 threads
vs. 4 modules/4 threads.

The latter case is a classic CMP configuration,
since the single active core in each module has all the
front-end resources to itself.

It's obvious that they largely exceeded their expectation
of 80% of the performance of a theoretical BD with CMP, so JF-AMD
was spot on when he wrote 180%, which in fact means
two CMT cores performing at 90% of the same two cores
running in CMP....




And yet I look at the uplift between 4M/4C and 4M/8C. Only one test scales 80% with the doubling of cores.

It's obvious that they largely failed to meet their expectation of 80% of the performance of a theoretical BD with CMP.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
And yet I look at the uplift between 4M/4C and 4M/8C. Only one test scales 80% with the doubling of cores.

It's obvious that they largely failed to meet their expectation of 80% of the performance of a theoretical BD with CMP.

This is true, the 4C/4M comparison to 8C/4M for Bulldozer is similar to what we'd expect to see in a 4T/4C vs. 8T/4C comparison on the 2600K with HT.

We were told to expect to see "180% on average", not a best case niche-app corner case bench.

Maybe once they get the L1$ BW fixed then we'll see this delivered, although the comments by DrWho on XS leave me concerned.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
Maybe once they get the L1$ BW fixed then we'll see this delivered, although the comments by DrWho on XS leave me concerned.

Do you have a link? Thanks, I'm not on XS. I would hope cache bandwidth (which sucks across the board) would be priority one with Piledriver, so I'm curious to see what 'DrWho' has to say.
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
This is true, the 4C/4M comparison to 8C/4M for Bulldozer is similar to what we'd expect to see in a 4T/4C vs. 8T/4C comparison on the 2600K with HT.

We were told to expect to see "180% on average", not a best case niche-app corner case bench.

Maybe once they get the L1$ BW fixed then we'll see this delivered, although the comments by DrWho on XS leave me concerned.

You are following Phynaz in his twisted metric.

As already pointed out, two cores with two threads perform
at 90% of two modules with a thread each; this is what
this slide obviously shows.


For the test of 4M/8 threads, you would need an 8M/8 threads
comparison, but even then, tell me which of the benchmarks
in the list above support more than four threads..

You can see that the ones meeting this condition
show more than 50% better performance when going 8C/8T...
 

intangir

Member
Jun 13, 2005
113
0
76
Time has not told (and writing it bold wouldn't make it true either). Engineers talk about execution of code optimized for a CPU, not for P6, Atom or SB. It hasn't even been tested except for one single app (Cray, as already linked above):
http://ht4u.net/reviews/2011/amd_bulldozer_fx_prozessoren/index17.php

It's not true because it's bold, it's bold because it's true.

Engineers should not be basing their performance projections on code that doesn't exist yet. Making customers recompile their code, and hoping for uarch-based compiler improvements is a horrible strategy. That was pretty much exactly Itanium's mistake. Bob Colwell covers the issue at length in his Stanford presentation "Things CPU Architects Need to Think About".

http://www.stanford.edu/class/ee380/ay0304.html

He relates how the strategic decision at Intel to pursue Itanium as an architecture was based on performance comparison between x86 and Itanium on a hand-coded instruction stream, written by someone with intimate knowledge of the planned microarchitecture. Colwell just shook his head, saying that basing any decision on that was ludicrous because it is completely non-representative of most code. The vast majority of programs will not be coded for any particular architecture, and any assumption that it is will doom the architecture to failure.

Time continues to tell.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
Do you have a link? Thanks, I'm not on XS. I would hope cache bandwidth (which sucks across the board) would be priority one with Piledriver, so I'm curious to see what 'DrWho' has to say.

If I come across my post here with the XS links then I'll PM it to you. It's a post I made sometime today here in the CPU forum.

You are following Phynaz in his twisted metric.

As already pointed out, two cores with two threads perform
at 90% of two modules with a thread each; this is what
this slide obviously shows.


For the test of 4M/8 threads, you would need an 8M/8 threads
comparison, but even then, tell me which of the benchmarks
in the list above support more than four threads..

You can see that the ones meeting this condition
show more than 50% better performance when going 8C/8T...

Why is it a twisted metric? I thought it was the metric JFAMD wanted us to use.

1 thread = X performance
2 thread = 1.8 X performance on average (meaning some do even better)

Was this not the metric?
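
To make the "on average" part concrete, here is a minimal sketch of checking that metric across several benchmarks. The benchmark names and scores are hypothetical placeholders (not measurements), chosen only to show how one best-case result can exceed 1.8x while the average falls short:

```python
from statistics import fmean

# Hypothetical scores (assumed for illustration, NOT measurements):
# benchmark name -> (score with 1 thread in a module, score with 2 threads in it)
scores = {
    "bench_a": (100.0, 185.0),
    "bench_b": (100.0, 160.0),
    "bench_c": (100.0, 150.0),
}

CLAIMED_AVERAGE = 1.8  # the "180% on average" figure under discussion


def scaling(one_thread: float, two_threads: float) -> float:
    """Throughput ratio when the second thread in the module is added."""
    return two_threads / one_thread


per_bench = {name: scaling(*pair) for name, pair in scores.items()}
average = fmean(per_bench.values())

for name, ratio in per_bench.items():
    print(f"{name}: {ratio:.2f}x")
print(f"average: {average:.2f}x, meets claim: {average >= CLAIMED_AVERAGE}")
```

An arithmetic mean is used here for simplicity; whether a geometric mean or a weighted mix is the fairer average is a separate argument.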
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
If I come across my post here with the XS links then I'll PM it to you. It's a post I made sometime today here in the CPU forum.



Why is it a twisted metric? I thought it was the metric JFAMD wanted us to use.

1 thread = X performance
2 thread = 1.8 X performance on average (meaning some do even better)

Was this not the metric?

He's saying a lot of those programs can't fully leverage 8 threads, so the gain from 4 to 8 is going to be less than from 1 to 4. Bibble at least seems able to fully use all 8C. The 4C/4M and 4C/2M comparison does show they met their module loss metric.

I am interested to see if there is a fundamental cache or shared front end problem with fully feeding all 8 cores though.
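
The point about programs that can't use 8 threads can be illustrated with Amdahl's law, which is a standard model brought in here for illustration rather than anything the reviews measured. It ignores the shared-module penalty entirely and only shows how the parallel fraction of a program caps the 4-to-8 thread gain; the fractions below are assumed values:

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Amdahl's law: speedup over one thread for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads)


# Assumed parallel fractions, for illustration only.
for p in (0.5, 0.9, 1.0):
    gain_4_to_8 = amdahl_speedup(p, 8) / amdahl_speedup(p, 4)
    print(f"parallel fraction {p:.1f}: 4T -> 8T gain = {gain_4_to_8:.2f}x")
```

Only a program that is parallel from start to finish doubles when going from 4 to 8 threads in this model, so a sub-2x result on a partly serial benchmark says little about module scaling by itself.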
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
He's saying a lot of those programs can't fully leverage 8 threads, so the gain from 4 to 8 is going to be less than from 1 to 4. Bibble at least seems able to fully use all 8C. The 4C/4M and 4C/2M comparison does show they met their module loss metric.

I am interested to see if there is a fundamental cache or shared front end problem with fully feeding all 8 cores though.

Ah yes, I see now.

So really they should be testing 1M/2T vs. 2M/2T, the same way we test the efficacy of HT.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Ah yes, I see now.

So really they should be testing 1M/2T vs. 2M/2T, the same way we test the efficacy of HT.

Yes, the best approach would be to start at 1M/2T and 2M/2T and work all the way up to the full 4M/8T. With programs that can handle each level of threading. Then see what is the impact of tweaking things like the clock speed, the NB speed, and so on.
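
A minimal sketch of that test ladder as a list of configurations. It assumes logical CPUs 2m and 2m+1 are the two cores of module m, which is a hypothetical numbering that would need checking against the actual topology of the machine; `build_ladder` is an illustrative name, not part of any tool:

```python
from typing import List, Tuple


def build_ladder(max_modules: int = 4) -> List[Tuple[str, List[int]]]:
    """List (label, cpu_ids) test points from 1M/2T up to the full chip.

    Assumes logical CPUs 2*m and 2*m + 1 share module m (an assumption).
    """
    ladder = []
    for threads in range(2, 2 * max_modules + 1, 2):
        # Packed: two threads per module, using as few modules as possible.
        packed_modules = threads // 2
        packed = [cpu for m in range(packed_modules) for cpu in (2 * m, 2 * m + 1)]
        ladder.append((f"{packed_modules}M/{threads}T", packed))
        # Spread: one thread per module, only possible while modules remain.
        if threads <= max_modules:
            spread = [2 * m for m in range(threads)]
            ladder.append((f"{threads}M/{threads}T", spread))
    return ladder


for label, cpus in build_ladder(4):
    print(f"{label}: run on CPUs {cpus}")
```

On Linux, each CPU list could then be passed to something like `os.sched_setaffinity` or `taskset -c` before launching a benchmark with the matching thread count, and the whole ladder repeated at different clock and NB speeds.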
 

Dadofamunky

Platinum Member
Jan 4, 2005
2,184
0
0
It's not true because it's bold, it's bold because it's true.

Engineers should not be basing their performance projections on code that doesn't exist yet. Making customers recompile their code, and hoping for uarch-based compiler improvements is a horrible strategy. That was pretty much exactly Itanium's mistake. Bob Colwell covers the issue at length in his Stanford presentation "Things CPU Architects Need to Think About".

http://www.stanford.edu/class/ee380/ay0304.html

He relates how the strategic decision at Intel to pursue Itanium as an architecture was based on performance comparison between x86 and Itanium on a hand-coded instruction stream, written by someone with intimate knowledge of the planned microarchitecture. Colwell just shook his head, saying that basing any decision on that was ludicrous because it is completely non-representative of most code. The vast majority of programs will not be coded for any particular architecture, and any assumption that it is will doom the architecture to failure.

Time continues to tell.

Ahhh. All you get is a bio when you follow the link. I was getting excited there for a second.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
We were told to expect to see "180% on average", not a best case niche-app corner case bench.

Maybe once they get the L1$ BW fixed then we'll see this delivered, although the comments by DrWho on XS leave me concerned.
The 180% sounded more like 160% later on (2*80%).

Cache write BW might be limited w/ streaming stores. Copy BW interestingly is much higher (no streaming stores used?).

Anyway decode can only deliver 1 store per cycle for 2 threads or 0.5 stores per cycle per thread on average.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
Yes, the best approach would be to start at 1M/2T and 2M/2T and work all the way up to the full 4M/8T. With programs that can handle each level of threading. Then see what is the impact of tweaking things like the clock speed, the NB speed, and so on.

Cinebench lets you do this. So does LinX.

I think I am going to get a 8150 in a few weeks and work it over like I have this 2600K.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
The 180% sounded more like 160% later on (2*80%).

Cache write BW might be limited w/ streaming stores. Copy BW interestingly is much higher (no streaming stores used?).

Anyway decode can only deliver 1 store per cycle for 2 threads or 0.5 stores per cycle per thread on average.

I tried to get this one "locked down" on more than one occasion, it was something like "90% second core scaling" or so.

But then there was an AMD marketing slide which clearly stated "180%" in writing in the slide. That's when I decided it was official.

If I can find the link I'll post it up; it's here in the forums somewhere. Maybe someone else remembers it?
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
But then there was an AMD marketing slide which clearly stated "180%" in writing in the slide. That's when I decided it was official.

If I can find the link I'll post it up; it's here in the forums somewhere. Maybe someone else remembers it?



It was stated as early as August 2010 by AMD themselves that a "Bulldozer module can achieve 80% of the performance of two complete cores of the same capability."

Of course most people ignored this and tried to come up with very complex explanations of 1*100 + 1*80% for 2nd core....etc.

The statement was actually very clear to me. I always took it as 2 Bulldozer cores in 1 module will only provide up to 80% of the performance of 2 Bulldozer cores on their own (i.e., dedicated cores). From this statement, on several occasions I stated that if in theory 1 Bulldozer core had an IPC increase of say 20-25% over Phenom II, at the same clock speeds, then Bulldozer would still not be any faster than Phenom II since it would need to overcome the 20% penalty (i.e., 0.8 x 1 Module (2 BD cores) x 1.25x IPC increase = 1 x 2 Phenom II cores). I even took a best case scenario and used an average of 10-20% penalty (i.e., assumed 90% base case not 80%).

So you can just imagine that if 1 BD core was slower in IPC than 1 Phenom II core, subtract another 20% for every module and performance starts to fall drastically --> hence the requirement for 4.0ghz+ Turbo just to maintain Phenom II levels....

Technically, JF-AMD can still be right. If BD core is 10% faster per clock than Phenom II, if you have a program that runs 2 threads on 2 BD cores and on 2 Phenom II cores, the 20% module penalty would still make the Phenom II faster despite BD's increase in IPC. Unfortunately we don't have a Bulldozer CPU with dedicated cores, which makes it impossible to test the accuracy of this hypothesis. I suppose we can test some single-threaded benchmarks and extrapolate.
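
The break-even arithmetic above can be written out as a small sketch. The 80% module efficiency and the IPC gains are the hypothetical figures from this post, not measurements, and the function names are illustrative:

```python
def module_throughput(ipc_gain: float, module_efficiency: float = 0.8,
                      cores_per_module: int = 2) -> float:
    """Throughput of one module in units of one reference core at equal clocks.

    ipc_gain is the assumed per-core IPC advantage over the reference core
    (0.25 means 25% faster per clock); module_efficiency is the assumed
    fraction of two dedicated cores that a shared module delivers.
    """
    return module_efficiency * cores_per_module * (1.0 + ipc_gain)


def break_even_ipc_gain(module_efficiency: float) -> float:
    """IPC gain needed for one module to match two dedicated reference cores."""
    return 1.0 / module_efficiency - 1.0


two_reference_cores = 2.0  # e.g. two Phenom II cores at the same clock

for gain in (0.10, 0.25):
    module = module_throughput(gain)
    print(f"IPC +{gain:.0%}: module = {module:.2f} vs. {two_reference_cores:.2f}")

print(f"break-even at 80% efficiency: +{break_even_ipc_gain(0.8):.0%} IPC")
print(f"break-even at 90% efficiency: +{break_even_ipc_gain(0.9):.1%} IPC")
```

Under these assumptions a 25% IPC gain only ties two dedicated reference cores, and a 10% gain leaves the module behind, which is the scenario described above.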
 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106


It was stated as early as August 2010 by AMD themselves that a "Bulldozer module can achieve 80% of the performance of two complete cores of the same capability."

Of course most people ignored this and tried to come up with very complex explanations of 1*100 + 1*80% for 2nd core....etc.

The statement was actually very clear to me. I always took it as 2 Bulldozer cores in 1 module will only provide up to 80% of the performance of 2 Bulldozer cores on their own (i.e., dedicated cores). From this statement, on several occasions I stated that if in theory 1 Bulldozer core had an IPC increase of say 20-25% over Phenom II, at the same clock speeds, then Bulldozer would still not be any faster than Phenom II since it would need to overcome the 20% penalty (i.e., 0.8 x 1 Module (2 BD cores) x 1.25x IPC increase = 1 x 2 Phenom II cores). I even took a best case scenario and used an average of 10-20% penalty (i.e., assumed 90% base case not 80%).

So you can just imagine that if 1 BD core was slower in IPC than 1 Phenom II core, subtract another 20% for every module and performance starts to fall drastically --> hence the requirement for 4.0ghz+ Turbo just to maintain Phenom II levels....

Technically, JF-AMD can still be right. If BD core is 10% faster per clock than Phenom II, if you have a program that runs 2 threads on 2 BD cores and on 2 Phenom II cores, the 20% module penalty would still make the Phenom II faster despite BD's increase in IPC. Unfortunately we don't have a Bulldozer CPU with dedicated cores, which makes it impossible to test the accuracy of this hypothesis. I suppose we can test some single-threaded benchmarks and extrapolate.


JFAMD stated at one point it was 100%+80%. Of course, now it seems like he must have either been given bad info or misunderstood himself.

I'm on the side that believes there is something that's not working like it's supposed to. That's why the delays; they just couldn't nail it down and had to go with the silicon they have as it is. If so, then hopefully they and/or GloFo can nail it down. This is assuming it's some design-fab conflict, which, with the rumored yield issues, could be it. I noticed no 8150's at NewEgg, but they have the 8120 and 6100, which IIRC are lower-binned pieces of the same chip.
 