Can AMD "rescue" the Bulldozer?


Zor Prime

Golden Member
Nov 7, 1999
1,023
588
136
I have 2 ASUS M4A89GTD PRO/USB3 boards with AM3 sockets that I bought last February. There is a beta BIOS (3027) for these boards that adds support for BD. The socket is definitely AM3 and not AM3+. I'm not going to upgrade (downgrade?) to BD as I already have a 1090T in both and see no need to change atm. If they improve BD within the next year, then maybe, but I don't think that will happen.

My understanding is that an AM3+ chip will physically fit in an AM3 socket; it's up to the vendor to add support for it, as AMD will not officially support that option.

Got the same AM3 board here, too.

It's incorrect to say that Bulldozer cannot run on AM3 - it very well can. AMD did not lie about Bulldozer being able to function on AM3 boards. AMD cannot very well force motherboard companies to provide support.

Motherboard companies basically get nothing out of it except expense when there are "official" AM3+ boards to sell. Some companies like ASUS, however, previously pledged support and have followed through with their commitment.
 

ed29a

Senior member
Mar 15, 2011
212
0
0
AMD could do something to save the sinking ship:
(1) Beg for MS to patch Windows 7 ASAP with scheduling optimizations for Bulldozer, this alone could be around 10% gain.
(2) Work on a new stepping (if work hasn't already started) to make small improvements in IPC and/or thermals.
(3) Lower prices of current SKUs, as they are now, they make absolutely no sense.

Lower prices + Windows optimization + small improvements in IPC = we have a decent alternative to SB, especially if priced low.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
So basically you want them to make 2 full cores again... waste a lot of die space in the process and call it a day.

Two of the improved Stars cores that are in Llano, the ones that get 3-6% better IPC than the cores already in Thuban, are roughly the same size as one Bulldozer module when you include the L2$.

One can make the argument they could swap out 4 Bulldozer modules in exchange for 8 Llano cores and Zambezi would remain nearly the same die size.

1 BD Module + 2MB L2$ = 30.9 mm^2

2 Llano Cores + 2MB L2$ = 35.4 mm^2
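
For anyone who wants the swap argument as numbers, here is a rough sketch using only the two per-unit areas quoted above. It covers core plus L2 area only, not L3, northbridge or I/O, and the function name is made up for illustration.

```python
# Per-unit areas quoted in the post (mm^2), each including 2MB of L2.
BD_MODULE_MM2 = 30.9   # one Bulldozer module (two integer cores)
LLANO_PAIR_MM2 = 35.4  # two Llano cores


def core_area_totals(units=4):
    """Compare `units` Bulldozer modules against `units` pairs of Llano cores."""
    bd_total = units * BD_MODULE_MM2
    llano_total = units * LLANO_PAIR_MM2
    return {
        "bulldozer_mm2": round(bd_total, 1),
        "llano_mm2": round(llano_total, 1),
        "extra_mm2": round(llano_total - bd_total, 1),
        "extra_pct": round(100 * (llano_total / bd_total - 1), 1),
    }


if __name__ == "__main__":
    print(core_area_totals())
```

With the default of 4 units (8 cores either way), the Llano option comes out about 18 mm^2 larger, roughly 15% more core-plus-L2 area.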







The problem with this analysis is that for whatever reasons those Llano cores suck in terms of clockspeed and power-consumption. A mere 2.9GHz and the quadcore gulps down the juice when you do CPU intensive stuff.

So what would an 8-core Llano look like in terms of clockspeed and power-consumption? It would not have been pretty. Something is not right with GloFo's process and I think it shows in both Llano and Bulldozer.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
(1) Beg for MS to patch Windows 7 ASAP with scheduling optimizations for Bulldozer, this alone could be around 10% gain.

AMD should have avoided the whole CMT thing, reduced the core count from 8 to 6 but made them real cores.

Of course we are talking about people who couldn't count transistors correctly, nor realize in advance that their microarchitecture was going to suck if they didn't get MS to tune their scheduler before the chip was launched.

So we can't really hold their feet to the fire, it's obvious they put their A-team on Brazos, their B-team on Llano, and the leftovers got shoved onto operation derpdozer.
 

Riek

Senior member
Dec 16, 2008
409
14
76
Two of the improved Stars cores that are in Llano, the ones that get 3-6% better IPC than the cores already in Thuban, are roughly the same size as one Bulldozer module when you include the L2$.

One can make the argument they could swap out 4 Bulldozer modules in exchange for 8 Llano cores and Zambezi would remain nearly the same die size.

1 BD Module + 2MB L2$ = 30.9 mm^2

2 Llano Cores + 2MB L2$ = 35.4 mm^2

When I was reading up to this point, I wanted to point out the same thing you point out below.

The problem with this analysis is that for whatever reasons those Llano cores suck in terms of clockspeed and power-consumption. A mere 2.9GHz and the quadcore gulps down the juice when you do CPU intensive stuff.

So what would an 8-core Llano look like in terms of clockspeed and power-consumption? It would not have been pretty. Something is not right with GloFo's process and I think it shows in both Llano and Bulldozer.

It does seem that they need >1.2V to reach decent frequencies on Llano and BD. But if the problem is due to the gate-first approach or related to it, fixing or dramatically improving it will not happen in a jiffy, I assume.

On the other side, when they run at a decent voltage, they seem to have great power characteristics, so something is good.


Edit: I'm not sure if they should have avoided CMT. I think it is a very good approach in theory. I just don't think Bulldozer is the best execution of the CMT approach atm.. but I think the blame here lies with the process and the other design choices within BD, not the CMT principle (which actually works very well within BD.. it's the only thing AMD said that was on the mark).

Also, am I the only one who doesn't expect a lot from scheduler changes? What AMD wants is for as many threads as possible to be crammed into one module. This will give a higher % of Turbo Core than when threads are hopped onto other modules. But it also has an impact on performance due to the mandatory sharing. My guess is that besides better power consumption we will see performance increases and decreases across the board.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Edit: I'm not sure if they should have avoided CMT. I think it is a very good approach in theory. I just don't think Bulldozer is the best execution of the CMT approach atm.. but I think the blame here lies with the process and the other design choices within BD, not the CMT principle (which actually works very well within BD.. it's the only thing AMD said that was on the mark).

Also, am I the only one who doesn't expect a lot from scheduler changes? What AMD wants is for as many threads as possible to be crammed into one module. This will give a higher % of Turbo Core than when threads are hopped onto other modules. But it also has an impact on performance due to the mandatory sharing. My guess is that besides better power consumption we will see performance increases and decreases across the board.

Long time reader that had to reregister

By setting affinity for modules and threads they were able to essentially bypass the windows horse-with-blinders-on approach that it currently has when scheduling threads on BD chips.

http://techreport.com/articles.x/21865/1
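
To make the two placement styles concrete, here is a minimal sketch of how the CPU lists and affinity bitmasks would differ. It only computes the masks and does not touch the OS. The numbering (logical CPUs 2m and 2m+1 sharing module m) is an assumption for illustration, as are the function and policy names.

```python
MODULES = 4           # a four-module chip such as the 8-core FX parts
CORES_PER_MODULE = 2  # two integer cores share each module


def pick_cpus(n_threads, policy):
    """Return logical CPU indices for n_threads under a placement policy.

    Assumption (hypothetical numbering): CPUs 2*m and 2*m+1 share module m.
    "spread" gives each thread its own module for as long as possible;
    "pack" fills both cores of a module before moving to the next one.
    """
    total = MODULES * CORES_PER_MODULE
    if not 1 <= n_threads <= total:
        raise ValueError("n_threads must be between 1 and %d" % total)
    if policy == "spread":
        # First core of every module, then the second cores.
        order = [m * CORES_PER_MODULE + c
                 for c in range(CORES_PER_MODULE)
                 for m in range(MODULES)]
    elif policy == "pack":
        order = list(range(total))
    else:
        raise ValueError("unknown policy: %r" % policy)
    return sorted(order[:n_threads])


def affinity_mask(cpus):
    """Fold a list of CPU indices into a bitmask (bit i set means CPU i allowed)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    for policy in ("spread", "pack"):
        cpus = pick_cpus(4, policy)
        print(policy, cpus, hex(affinity_mask(cpus)))
```

For four threads, "spread" is the one-thread-per-module case discussed here, and "pack" is the shared-module case that AMD reportedly prefers for Turbo Core headroom.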

Frankly, it isn't very good. The resource-sharing approach that AMD prioritized isn't bad in theory, especially considering the amount of cores and cache crammed on a 1.2B transistor chip, but the execution is piss-poor. I'd wager that with the better scheduling you'll see very very meager performance increases but a hefty decrease in power consumption. The turbo core on bulldozer is one of the few things that they did right and outmatches Intel in its effectiveness, but with the long pipelines and poor IPC the turbo would have to be reaching over 5ghz to provide a noticeable increase with the threads 6 or under being crammed in such a way as to maximize its effectiveness. Basically, it ain't gonna happen...

Either there has to be serious improvement in cache timing (and size, really) or AMD's turbo will have to bump speeds up by 2ghz to see CMT become a feasible alternative to a straight thuban core in terms of relative performance -- relative to the competition and their previous generation. What TR showed is that AMD's belief that CMT is comparable to their straight cores isn't true; at least not now.

EDIT: TL/DR version: Windows 8/7 scheduler won't help make up for the downfalls of the CMT design, and Turbo, as good as it is, doesn't help make up for that flaw.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
In the year running up to actual Bulldozer launch a lot of people were asking about Windows 7 scheduling. I believe the general AMD attitude was that it didn't matter whether it loaded modules or spread the load since you would get extra clockspeed from loading full modules versus not sharing resources by using threads on different modules. It's as if the software guys at AMD don't really communicate that well with the hardware, because Windows scheduler throws threads around like a bit of a madman since that is what Intel's HT does alright with. It also happens to be the worst way for an OS to treat threads for AMD's CMT design...

I never thought about it until IDC mentioned it, but this would make a lot of sense if their C team was working on Bulldozer DESKTOP. I think for the server team it was legitimately low on the issues list especially considering the areas where BD will make any headway are mostly outside of the Windows eco-system.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
In the year running up to actual Bulldozer launch a lot of people were asking about Windows 7 scheduling. I believe the general AMD attitude was that it didn't matter whether it loaded modules or spread the load since you would get extra clockspeed from loading full modules versus not sharing resources by using threads on different modules. It's as if the software guys at AMD don't really communicate that well with the hardware, because Windows scheduler throws threads around like a bit of a madman since that is what Intel's HT does alright with. It also happens to be the worst way for an OS to treat threads for AMD's CMT design...

Well it doesn't help, certainly. But AMD's preference for sharing threads on the same module, assuming low # of threads ofc, isn't a good idea in terms of performance, either -- perhaps for power consumption -- considering the 1-thread-per-module approach seems to be yielding the better results. Consequently the CMT design has to be scrutinized even more, given that the "within module" and "per module" numbers are so far apart.

Can't help but think that maybe the resource-sharing wasn't such a bright idea given how long the pipeline and how high the clocks have become and have to be increased, respectively.

If the chip were treated as a 4 core with "advanced hyperthreading" it would fare far better in terms of performance, and maybe the reviews wouldn't have been so harsh. Given the results I linked a couple of posts above, along with the comparatively impressive numbers the Thuban has shown (even in the face of AVX, astonishingly enough), the chip looks more akin to a 4 core 8 thread chip than an 8 core 8 thread. Frankly, 8 straight isolated cores would have been better, given a decreased cache size and a less lengthy but more complex pipeline and subsequently lower clocks (see Llano). The more Bulldozer articles I read, the more I realize just how awesome that Thuban is (or was, given the end of production). Those things were badass when they first came out and even today offer very respectable performance if they can be had in the ~$160 range. Thankfully you can still snag a 960T for like $120 and hope it unlocks.

If you're one of the people with high hopes for Piledriver and the new win8 scheduler I'd suggest you probably start saving up for Ivy Bridge =P
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Well it doesn't help, certainly. But AMD's preference for sharing threads on the same module, assuming low # of threads ofc, isn't a good idea in terms of performance, either -- perhaps for power consumption -- considering the 1-thread-per-module approach seems to be yielding the better results. Consequently the CMT design has to be scrutinized even more, given that the "within module" and "per module" numbers are so far apart.

My point was that you would have to work at it to make a worse scheduler for bulldozer than the one Windows uses now. Not sure why everyone is on this whole "hey look if you run a single thread per module you don't pay the cluster penalty", of course it works that way. The clustering part of the design appears to be working well for the first shipping silicon. It's just AMD obviously missed their thermal+clock targets. (Edit: and possibly their IPC targets, but AMD is being a bit coy regarding that)

In my view there are two main WTFs regarding Bulldozer:

1. Their clock targets seem to be just as aggressive as Intel's for Pentium 4.

2. Working on gate first 32nm and I would guess 28nm seems to be the fab equivalent of fighting during a Russian winter. IBM should be sending Global Foundries chocolates and flowers for all the trailblazing they are doing.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
The problem with this analysis is that for whatever reasons those Llano cores suck in terms of clockspeed and power-consumption. A mere 2.9GHz and the quadcore gulps down the juice when you do CPU intensive stuff.

Yeah, but we also know that a large percentage of Llanos, if not nearly all, can be undervolted to 1.1~1.15V and still run at 3GHz. At those voltages 8 cores could be had under 120W. It wouldn't be great, but it would be faster than BD. And cheap. Very very cheap.
 

ed29a

Senior member
Mar 15, 2011
212
0
0
AMD should have avoided the whole CMT thing, reduced the core count from 8 to 6 but made them real cores.

Of course we are talking about people who couldn't count transistors correctly, nor realize in advance that their microarchitecture was going to suck if they didn't get MS to tune their scheduler before the chip was launched.

So we can't really hold their feet to the fire, it's obvious they put their A-team on Brazos, their B-team on Llano, and the leftovers got shoved onto operation derpdozer.

I would have gone a different way: don't try to cram an obvious server CPU into consumer desktops. Whether CMT was a mistake or not, it's done. AMD can't go back in time and fix it, but they could take a few steps to make it less painful for desktop users.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
My point was that you would have to work at it to make a worse scheduler for bulldozer than the one Windows uses now. Not sure why everyone is on this whole "hey look if you run a single thread per module you don't pay the cluster penalty", of course it works that way. The clustering part of the design appears to be working well for the first shipping silicon. It's just AMD obviously missed their thermal+clock targets. (Edit: and possibly their IPC targets, but AMD is being a bit coy regarding that)

In my view there are two main WTFs regarding Bulldozer:

1. Their clock targets seem to be just as aggressive as Intel's for Pentium 4.

2. Working on gate first 32nm and I would guess 28nm seems to be the fab equivalent of fighting during a Russian winter. IBM should be sending Global Foundries chocolates and flowers for all the trailblazing they are doing.

They would have had to increase the clock rates far too high to reach what they were aiming for. In fact, they originally planned a 30% increase in clock speed, which would have put them at the mid 4ghz mark. Not sure whether this was intended with turbo or without it. 5ghz with turbo seems just a bit too high to me, but ya never know.

I agree with you. It's just that when you look at their claimed "minimal performance decrease" for two threads sharing a single module rather than running on 2 modules, where each thread has free rein over all the resources without sharing (all your cache are belong to me), their claims as far as CMT goes just don't hold up when tested.

It's like they half-assed it. They were going for 8 but didn't technically get there, yet they're also far closer to 8 than they are to 4 with hyperthreading. CMT is stuck somewhere where it's sharing too much, and as a result it's hurt IPC by roughly 10-15% and they can't clock it up to make up for that difference (netburst anyone?).

In the server space it isn't all that impressive either, but depending on the workload and the fact that compilers are more common it may end up being okay. Now for all those power consumption figures...

Like I said, given smaller caches ('cept for L1. They should consider increasing that), tighter timings and a more complex but shorter pipeline, this could have been a great chip. Granted, it wouldn't have CMT and would've looked much like the Thuban
 

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
So basically you want them to make 2 full cores again... waste a lot of die space in the process and call it a day.

Reducing L2 to 256 is bad. 1MB sounds better. Reduce latency of this cache and implement an L0 cache-like structure or other means to bypass branch hits. (think this is already in the pipeline for Steamroller).
Decoding width is fine; in combination with some enhancements like SB's, it should do the trick.
Bumping execution resources might be an option, although they would be far better off making their AGU do more basic calculations.
Doubling the FPU is pure nonsense, affecting execution times would have more effect...

And what exactly has this approach gained them? Two incomplete cores sandwiched together and called a module hasn't exactly blown any performance metrics off the doors.

Another popular CPU maker uses 256KB of L2 cache - any guesses who it might be?

No need to waste so much die space on L2 cache when you already have a large L3 cache. My guess is that they couldn't get performance high enough so they dumped extra cache onto each core to try to shore up performance. Didn't work, obviously.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
In their pre-launch technical discussions they were saying roughly 80% of non-CMT design. They seem to have hit that target if not exceeded it even with current Bulldozer (although there are some quirks), it's the IPC*Clockspeed=Wattage part that is disappointing.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
In fact, they originally planned a 30% increase in clock speed, which would have put them at the mid 4ghz mark.

I have read this claim multiple times. It's probably off from their ISSCC presentation. But clock speed targets aren't always met, and they don't tell you which market segment they are talking about.

That's the thing about clock speed targets. Maybe 30% claim was met. Just that Thuban did too well. SKU segmentations, TurboCore, make it all the more complicated.

The web is saying there will be a 300MHz higher clocked part called an FX-8170. The single core Turbo frequency is at 4.5GHz, while the multi core one is at 4.2GHz. That's pretty close.
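
A quick sanity check on "pretty close", as a rough sketch. The thread never says which clock the 30% figure was measured against, so the 3.3GHz baseline below (the Phenom II X6 1100T base clock) is purely an assumption, and the rumored FX-8170 clocks are taken from this post. Comparing a turbo clock against a base clock is not strictly apples to apples.

```python
ASSUMED_BASELINE_GHZ = 3.3  # assumption: Thuban 1100T base clock, not stated in the thread
TARGET_UPLIFT = 0.30        # the "30% increase in clock speed" claim


def target_clock(baseline_ghz, uplift=TARGET_UPLIFT):
    """Clock implied by a fractional uplift over the baseline."""
    return round(baseline_ghz * (1 + uplift), 2)


def uplift_pct(new_ghz, baseline_ghz):
    """Percentage by which new_ghz exceeds baseline_ghz."""
    return round(100 * (new_ghz / baseline_ghz - 1), 1)


if __name__ == "__main__":
    print("target:", target_clock(ASSUMED_BASELINE_GHZ), "GHz")
    # Rumored FX-8170 turbo clocks quoted in the post above.
    for label, ghz in (("multi core turbo", 4.2), ("single core turbo", 4.5)):
        print(label, ghz, "GHz ->", uplift_pct(ghz, ASSUMED_BASELINE_GHZ), "%")
```

Under that assumed baseline the 30% target lands at about 4.3GHz, which sits between the two rumored turbo clocks. A different baseline would move the result.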
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
The web is saying there will be a 300MHz higher clocked part called an FX-8170. The single core Turbo frequency is at 4.5GHz, while the multi core one is at 4.2GHz. That's pretty close.

The yet to appear 8170, if they manage to deliver it at those specs and 125W and ~$250 then it will take some of the sharp edges off of Bulldozer.

The also as yet unseen unlocked Llanos are apparently starting to show up on distributor lists. Perhaps they will actually be purchasable soon. I'm interested in seeing what people can get out of them.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Bulldozer isn't a server-oriented CPU. In SpecInt and FP, Interlagos outperforms the Xeon X5690 by 14% and 30% respectively.

The 6 core Core i7 3960X, gets 34% and 65% better in the same metric over Core i7 990X, which is essentially the same chip as X5690. Then there's the 8 core Xeon E5's, which should add 10-15% on top of that because of 33% more cores but lower clocks.

What doesn't work in one area doesn't work anywhere else.
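
To put the two sets of percentages above on one scale, here is a rough sketch. It assumes, as the post does, that the 990X and the X5690 are interchangeable baselines, and that all scores were measured under comparable conditions, which the thread does not confirm. The function name is made up for illustration.

```python
def relative_gain_pct(a_gain_pct, b_gain_pct):
    """How far chip A is ahead of chip B, given each one's % gain over a shared baseline."""
    a = 1 + a_gain_pct / 100
    b = 1 + b_gain_pct / 100
    return round(100 * (a / b - 1), 1)


if __name__ == "__main__":
    # (3960X gain over 990X, Interlagos gain over X5690), as quoted in the post.
    figures = {"int": (34, 14), "fp": (65, 30)}
    for name, (sbe, interlagos) in figures.items():
        print(name, relative_gain_pct(sbe, interlagos), "% ahead")
```

On those assumptions the 3960X ends up roughly 17.5% ahead in the integer figure and about 27% ahead in FP, before the extra 10-15% the post expects from the 8 core Xeon E5s.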
 

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
AMD could do something to save the sinking ship:
(1) Beg for MS to patch Windows 7 ASAP with scheduling optimizations for Bulldozer, this alone could be around 10% gain.
(2) Work on a new stepping (if work hasn't already started) to make small improvements in IPC and/or thermals.
(3) Lower prices of current SKUs, as they are now, they make absolutely no sense.

Lower prices + Windows optimization + small improvements in IPC = we have a decent alternative to SB, especially if priced low.

(1) is on the way (according to [H]) - I wonder if it'll be part of SP2??.
(2) is on the way (B3), I think it's due out 1Q12.
....
(4) BDII will be out this time next year and should be competitive with SB - oh, but IB will be out by then, so still no joy.



Hey, but GF's 22nm node will be Gate Last and by then AMD should be able to wring some decent performance improvements out of BD - damn, Intel will be @ 14nm by then

It's been a sordid tale at AMD since the K8 'glory' days.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
I look at it as further evidence that the biggest flaw with Bulldozer is that the design doesn't match the fabrication process. Brutal seeing them selling 12 core Interlagos. First the fact they are making 12 cores, meaning they are MCMing two 8 cores with 4 cores burned off or some more frankenstein combinations. Second that it is in the same TDP category as their 16 core and with not much in the way of clocks to show for it.

Bulldozer isn't a server-oriented CPU. In SpecInt and FP, Interlagos outperforms the Xeon X5690 by 14% and 30% respectively.

The 6 core Core i7 3960X, gets 34% and 65% better in the same metric over Core i7 990X, which is essentially the same chip as X5690. Then there's the 8 core Xeon E5's, which should add 10-15% on top of that because of 33% more cores but lower clocks.

What doesn't work in one area doesn't work anywhere else.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
Yeah, but we also know that a large percentage of Llanos, if not nearly all, can be undervolted to 1.1~1.15V and still run at 3GHz. At those voltages 8 cores could be had under 120W. It wouldn't be great, but it would be faster than BD. And cheap. Very very cheap.

That's called binning yield, and it works that way for everyone.

There's a reason AMD set the Vcc for Llano where they did. It's not like some dude at AMD pressed the space bar twice instead of once by accident and because of that mistake all the Llanos have excess Vcc.

The biggest reason people find their Llanos can be undervolted is because they aren't trying to validate them as fully functioning (no errors) when the operating temps are near the max allowed CPU temp.

If AMD could have gotten away with spec'ing the max operating temp for Llano to be 50C or 55C then they could have decreased the spec'ed Vcc by quite a bit too.

But back to the general statement about Vcc and yields, Intel could do that too if they wanted craptastic yields, just bin out the low Vcc ones and recycle the silicon for the rest.

My 2600K has a stock Vcc of 1.35 V at stock clocks, but it only needs 1.118 V to be LinX stable at 3.8GHz on all cores. But there's nothing magical there either, it can be undervolted so much because I'm not requiring the chip to be stable at 98C, it only needs to be stable up to the ~55C that it gets to with LinX at that Vcc and clockspeed.
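
As a rough feel for why that much undervolting matters, here is a minimal sketch using the textbook approximation that dynamic power scales with V^2 times f. It ignores leakage and temperature effects, so treat the output as a ballpark, not a measurement. The function name is made up; the two voltages are the ones quoted above.

```python
def dynamic_power_ratio(v_new, v_old, f_new=1.0, f_old=1.0):
    """Approximate new/old dynamic power ratio, assuming P ~ C * V^2 * f.

    Leakage is ignored, which makes this a simplification for illustration only.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


if __name__ == "__main__":
    # Voltages from the post: 1.35 V stock vs 1.118 V undervolted.
    same_clock = dynamic_power_ratio(1.118, 1.35)
    print("same clock: %.2f of stock dynamic power" % same_clock)
```

At equal clocks this comes out to roughly 0.69 of stock dynamic power. Passing the actual clocks through `f_new` and `f_old` scales the ratio linearly.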

Here's another example of what I am talking about, this is at 4.5GHz:



AMD already has a supply issue with Llano, they got to bin real wide. That they have to do this is just as telling as the GHz and clockspeed issue in the first place.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
This is why GF went straight to trying to sell 28nm production well before AMD started selling 32nm chips. Hope for us computer enthusiasts that it is not as much of a boondoggle.
 

Stuka87

Diamond Member
Dec 10, 2010
6,240
2,559
136
The yet to appear 8170, if they manage to deliver it at those specs and 125W and ~$250 then it will take some of the sharp edges off of Bulldozer.

The also as yet unseen unlocked Llanos are apparently starting to show up on distributor lists. Perhaps they will actually be purchasable soon. I'm interested in seeing what people can get out of them.

I am interested in the unlocked Llano with the GPU disabled. I would really love to see a 32nm STARS based chip come out without the GPU. I still feel this would have been a better idea than Bulldozer.
 