Can AMD "rescue" the Bulldozer?


Cerb

Elite Member
Aug 26, 2000
17,484
33
86
From my perspective, there isn't a "bug" in the CMT implementation; rather, the BD cores are so unimpressive because, with CMT and heavily-threaded applications in mind, AMD thought they could afford to be.
If that's the case--and we don't know for sure that it wasn't the idea--then BD is in fact AMD's P4. CMT itself, though, should work every bit as well with cores that perform far better, giving both great performance with a smaller number of threads (what we want on the desktop, usually), and good performance with more threads (enough that it is consistently worth it to run more threads). If they felt more cores at lower IPC could be successful, it would have been at least as bad as a plain old CMP of 8 complete cores.

That kind of thinking is just plain wrong, unless the resultant CPU will offer far superior performance/Watt (while maybe not a great enthusiast chip, if BD had 80% of Stars perf/clock, but 50% the load power consumption, it would probably be a killer server chip, even if limited in GHz).

Now, if low IPC was not the goal, but happened due to miscalculations, trying to produce something usable while stretching budgets very thin, or 32nm issues, then I would hope for and expect substantial improvements, especially in future mobile APUs. For example, with a much faster (much smaller!) L2, the L1D wouldn't seem out of place for a desktop or notebook, and it could be that a few minor changes like that, on top of generally working out the kinks, could make it a pretty nice CPU.
 
Last edited:

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
If that's the case--and we don't know for sure that it wasn't the idea--then BD is in fact AMD's P4. CMT itself, though, should work every bit as well with cores that perform far better, giving both great performance with a smaller number of threads (what we want on the desktop, usually), and good performance with more threads (enough that it is consistently worth it to run more threads).

Well, that's what I've been wondering too. It's not a secret that AMD wanted to "hold the line" with IPC while offering more threads and cores.

If they felt more cores at lower IPC could be successful, it would have been at least as bad as a plain old CMP of 8 complete cores.

^^ Unfortunately they didn't get to Phenom II level, and the clock speeds didn't reach the mid-4GHz range at stock. I would definitely call this AMD's Pentium 4. Holding the line on IPC while offering more threads is all fine and dandy for a workstation or server, but on the desktop? Also, AMD has neither the resources nor the market share to make the push on the software side to see Bulldozer-favored implementations through, nor a history of pushing for that. Releasing Bulldozer without an optimized scheduler for Windows 7, when they'd seen their chips underperforming, is a prime example.

Though you do bring up a good point in that fewer cores with CMT may look entirely different. The downside to that is that the core designs, too, will look entirely different and likely something we're not going to see any time soon.

By the way, this has turned into a rather fruitful BD discussion
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Well, that's what I've been wondering too. It's not a secret that AMD wanted to "hold the line" with IPC while offering more threads and cores.

^^ Unfortunately they didn't get to Phenom II level, and the clock speeds didn't reach the mid-4GHz range at stock. I would definitely call this AMD's Pentium 4. Holding the line on IPC while offering more threads is all fine and dandy for a workstation or server
No dice. Among the usual means of improving IPC are techniques to reduce and hide latencies, and to improve speculation (branch prediction and data prefetching behavior), in addition to wider execution units and more work being put upon the execution schedulers. Such improvements are as good for servers as they are for desktops, even though the stalls that take up the most time may be substantially different.

While the degree of improvement will vary, enough servers perform enough different workloads that improving performance across the board is pretty much a must.

Limited performance, but high performance per Watt, can be good.
Limited performance, with OK performance per Watt, and high performance per dollar, can be good.
Limited performance, with low performance per Watt, simply is not good.
but on the desktop?
It may perform a bit better or worse in one arena than in another, and that's fine. But if performance is hardly acceptable for the high-volume mass-market SKUs, an attempt to frame it as some server v. desktop v. mobile thing is either distraction from or rationalization for lackluster performance. All performance-driven markets need higher performance per thread per Watt than Stars.

Also, AMD doesn't have the resources nor the market share to make the push on the software side to see some Bulldozer-favored implementations through. Nor do they have a history of pushing for that either. Releasing Bulldozer without an optimized scheduler for Windows 7 when they've seen their chips' underperforming is a prime example.
Intel had the same issue with Hyperthreading, and Windows 7 had to get multiple scheduler updates for Core 2s, which were not new at the time (not without real need, either, I might add, being affected by some of the pausing bugs that got fixed!). The scheduler issues are important, certainly, but if the single-thread performance were higher, they could be glossed over and looked forward to, rather than looked at as code changes needed to fix AMD's new CPU.

AMD chose to share L1I and L2, and to make L2 big, and as such, which core (int execution unit) gets which thread matters. On the desktop, it's 99% hindrance, too (that is the kind of feature decision that makes sense for a server, likely to be running near code paths in two threads of the same application, and likely for those threads to be working on near sets of data).
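Since which int core gets which thread matters on a module design like this, a module-aware scheduler spreads threads one per module before doubling up (roughly what the later Windows 7 hotfixes aimed for). A minimal sketch, assuming a hypothetical numbering where cores 2m and 2m+1 share module m:

```python
def assign_threads(n_threads, n_modules=4):
    """Spread runnable threads across modules first (one int core per
    module), then fill each module's second core. Assumes cores 2m and
    2m+1 share module m's front end and L2 (illustrative numbering)."""
    placement = []
    for t in range(n_threads):
        module = t % n_modules
        second_core = t // n_modules  # 0 = first int core, 1 = second
        placement.append(module * 2 + second_core)
    return placement

# Four threads land on separate modules; only threads 5-8 share a module.
print(assign_threads(4))  # [0, 2, 4, 6]
print(assign_threads(8))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

A naive scheduler that fills cores 0, 1, 2... in order would instead pack two threads per module while other modules sit idle, which is exactly the desktop hindrance described above.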

Though you do bring up a good point in that fewer cores with CMT may look entirely different. The downside to that is that the core designs, too, will look entirely different and likely something we're not going to see any time soon.
Not fewer cores, but fewer execution units (int portion of a core) per set of front ends. IE, what we have with BD is approximately 2 4-wide cores sitting behind a single 4-wide front end, which can fully serve either core at any given time, but not fully serve both (neither '4' is actually quite so definite, but that's been the case for 15 years or more).

So, given a very high-ILP loop whose instruction stream is well-formed for BD, that is not terribly dependent on RAM or cache bandwidth, that is not limited by cache misses, and which has good (IE, easily predictable) D$/DTLB behavior...running one thread should get about the same performance as with no CMT. Meanwhile, two such threads should each get about half the performance. At 8 cores (4 modules), that becomes four front ends serving eight execution units. Overall performance should scale 1->4 at 400% (with one thread per module), but 1->8 will also be stuck at 400%, because the front ends are starving the rest of the CPU.

Well, in reality, such loops will only exist in synthetic benchmarks, and sustainable IPC tends to be pretty low, resulting in scaling past 1:1 execution:front-end ratio (IE, 5-8 threads per 8-core CPU) that is less than 100% for some applications, but still scaling up much better than if the next set of execution units were left inactive.

With a low-ILP loop otherwise like the exhaustively-described hypothetical one, scaling would be 800% at 8 threads for 4 modules, because the front ends would never be saturated for even one cycle.
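The scaling arithmetic in the two hypotheticals above can be written out in a few lines. This is a deliberate simplification that treats each module's shared 4-wide front end as the only bottleneck; the widths and per-thread demands are illustrative, not measured:

```python
def chip_throughput(threads, demand, fe_width=4, modules=4):
    """Issue slots used per cycle across the chip, assuming each module's
    shared front end (fe_width slots/cycle) is the only bottleneck.
    `demand` is the slots/cycle one thread could sustain on its own."""
    total = 0
    for m in range(modules):
        # threads are spread one per module first, then doubled up
        on_module = threads // modules + (1 if m < threads % modules else 0)
        total += min(on_module * demand, fe_width)
    return total

# High-ILP threads (4 slots/cycle each): 8 threads gain nothing over 4.
print(chip_throughput(4, demand=4))  # 16 ("400%")
print(chip_throughput(8, demand=4))  # 16 (still "400%": front ends saturated)

# Low-ILP threads (2 slots/cycle each): 8 threads double 4 threads.
print(chip_throughput(4, demand=2))  # 8
print(chip_throughput(8, demand=2))  # 16 ("800%")
```

Real workloads fall between the two `demand` extremes, which is why scaling past one thread per module is less than perfect but still well worth having.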

It is a trade-off, like SMT (used for BD's FP), but one geared towards making the most use of a small number of high-ILP threads, while also being able to serve many (2x, in BD's case) low-ILP threads, with very little in the way of resource conflict stalls (IE, the bane of shared SMT in a fast CPU). And, finally, that part seems to work just fine. Where it may be a limiting factor in higher thread-count cases, there's not enough data for us to be able to, at this time, easily separate its effects from those of the caches.
 
Last edited:

wlee15

Senior member
Jan 7, 2009
313
31
91
I wonder if it would make any sense to duplicate only the floating-point unit. You wouldn't save much space, but you would be able to simplify the design and eliminate some of the queues and buffers in the fetch and decode stages. Or they could go full SMT and eliminate the 2nd integer core, since they've pretty much done most of the legwork by going to CMT.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
I wonder if it would make any sense to duplicate only the floating-point unit. You wouldn't save much space, but you would be able to simplify the design and eliminate some of the queues and buffers in the fetch and decode stages. Or they could go full SMT and eliminate the 2nd integer core, since they've pretty much done most of the legwork by going to CMT.

CMT has different goals from SMT. SMT tries to reduce a CPU core's time spent waiting, which results in variable second-thread throughput. This has worked out pretty well for Intel, since the development of the Core line was producing beefy cores. CMT tries to offer consistent thread throughput with less die space, allowing flexibility in creating multi-threaded processors with reasonably predictable characteristics.

I'd guess that the value purchasers attribute to Intel's Hyper Threading depends highly on their planned usage. Just going off of Bulldozer's design choices and AMD's server marketing, they are targeting server usage that relies on large quantities of RAM as well as those that are looking for core to power density. Perhaps these are areas where HT isn't very useful?
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
If AMD had their own fabs to do rapid design tweaks and re-spins, I say yes, it should theoretically be possible to fix BD's major faults. IMO the shared module design is the biggest drawback, so the best real fix would be a pretty dramatic change. Duplicate the missing portions on the 'cores' so that each core is complete in and of itself as far as FPU/INT.

Without their own fabs, this work will take too long to accomplish, and AMD's quickest path to reasonable success really lies in a die-shrunk 8C Phenom III with further IPC tweaks.

I think neither of these is likely, and AMD looks poised to make a steady retreat from the desktop space.
AMD would have better control over the process, but if technical difficulties are to be solved, the guys doing that are still the same. Only management and working on other processes for different customers might slow that down.

Development, fixing, improvement of the chips is done outside of the fabs.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Perhaps these are areas where HT isn't very useful?
For code where how much time each thread's work takes makes a difference, not merely the aggregate throughput, SMT can be a negative: performance is fairly unpredictable, and each thread takes longer. OLTP and OLAP alone are going to have as many users gaining from SMT as will not want to touch it.

Without better performance from each thread, though, and/or much improved power consumption, Xeons without HT still look like very good options, against BD Opterons. They'll sell, I'm sure, but enough to start regaining server market share, thus making it worthwhile to develop future server processors? Based on the good typical scaling with server apps, if IPC were significantly better than Stars, they would have had a winner on their hands.
 

denev2004

Member
Dec 3, 2011
105
1
0
I'd guess that the value purchasers attribute to Intel's Hyper Threading depends highly on their planned usage. Just going off of Bulldozer's design choices and AMD's server marketing, they are targeting server usage that relies on large quantities of RAM as well as those that are looking for core to power density. Perhaps these are areas where HT isn't very useful?
Actually I think what they're looking for is high performance achieved through TLP. In that sense, SMT and CMT are both useful.

Without better performance from each thread, though, and/or much improved power consumption, Xeons without HT still look like very good options, against BD Opterons. They'll sell, I'm sure, but enough to start regaining server market share, thus making it worthwhile to develop future server processors?

Seeing how AMD promotes Interlagos, we can see the difference...they again concentrate their effort on talking about performance per $...
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Seeing how AMD promotes Interlagos, we can see the difference...they again concentrate their effort on talking about performance per $...
With a physically massive chip, whose load power is not great, and an uphill pricing battle v. Intel even if that weren't the case, that all reads like marketing spin.

If the Intel-based server would be $14k, and the AMD one $13k, and operating costs are in the tens of thousands per year, the CPU cost savings aren't enough to worry about, and AMD can thank themselves for allowing that to happen, starting with the K8. AMD needs some kind of merit that makes their CPU look superior at the same price, ideally in situations where Intel's CPUs may not be the best. Right now, there are a handful of niches where that can be true, but I'm not sure if it will be enough. I'd very much rather we not go back to having only Intel to choose from.
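The point about CPU cost savings being noise can be worked out directly. The $14k/$13k capex figures come from the post above; the yearly operating cost is an assumed midpoint of "tens of thousands per year," and the 4-year service life is also an assumption:

```python
# Server TCO sketch; capex from the post, opex and lifetime assumed.
intel_capex, amd_capex = 14_000, 13_000
yearly_opex, years = 30_000, 4   # assumed: "tens of thousands per year"

intel_tco = intel_capex + yearly_opex * years   # 134,000
amd_tco = amd_capex + yearly_opex * years       # 133,000
savings = (intel_tco - amd_tco) / intel_tco

print(f"AMD saves {savings:.1%} of total cost of ownership")  # ~0.7%
```

Under any similar assumptions the capex gap vanishes into the operating costs, which is why AMD needs a merit other than sticker price.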

Since CMT is working at least well enough that we can't yet isolate it from caches (scheduling w/ shared cache, and a fairly far L2 from the small L1D), and BD can scale very well with generally scalable applications, it would probably be an ideal business application CPU (DBMS, Java, .NET, Erlang, PHP, etc.), if the per-thread performance could be sufficiently improved. The problem is that those niches where CMT could improve performance by saving space and power per thread also tend to be niches that are bound by latencies.

This time around, Intel doesn't have craptastic Netburst Xeons (your average benchmark did not do justice to how much better even slow Opterons were in real 3-tier apps), nor hobbling Core 2 Xeons (exploitable RAM and FSB limitations), so AMD's marketing spin is only going to get them so far. If they could be 15%+ faster where BD is weak now (IE, =>Stars across the board), BD could be kicking ass and taking names, even with clock speed/power limitations in their way (and, again, AMD should expect high speeds to be a problem). 10%+ better where it is weak now, and it could at least have good value from the added real execution units.
 
Last edited:

cbn

Lifer
Mar 27, 2009
12,968
221
106
For code where how much time each thread's work takes makes a difference

Just wondering, for a video game (as an example).....

Is the game coded as one very long continuous single thread.....that just keeps going?

Or is the game actually made up of a series of very short threads strung together in rapid succession?

I am asking, in part, because it seems that we are entering an age where there are just too many cpu cores. Running most of these single or dual threaded programs (games, as an example) consistently on the same cores (cores 0,1) seems like a great way to concentrate heat in only one area of the die.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Just wondering, for a video game (as an example).....

Is the game coded as one very long continuous single thread.....that just keeps going?

Or is the game actually made up of a series of very short threads strung together in rapid succession?
A thread in a typical CPU is a unique program counter and set of registers. A process which uses several threads will be running the same binary, but each thread can run different code and data (or the same data, but that should be avoided whenever possible), though all the instructions and data reside in the same virtual memory space.

IOW, I'm not sure you quite grasp how the CPU works. A single basic CPU has registers and/or stacks, which hold data separately from main memory, and memory. It begins by executing from a set spot in memory, and then does what it is told. It has a program counter, which is the location that the current or next instruction is at. This increments as instructions in series are executed, but can also change by jumping (AKA branching) to some other location. Calling it a counter is historical, but it's probably better thought of as a cursor.

In a modern CPU and OS, each program is a process (though it may spawn other processes), which has its own memory space, virtualized from physical memory, so it can usually act like it is all that's running on the CPU.

A process may have one or more threads, each of which is just a unique CPU state, and limited to that process' memory space. With common programming languages, the programmer is tasked with creating and managing all important details of any more threads than the first. Any given thread may voluntary stop itself, or may be forced to stop what it's doing by the OS, at any time (not always true, but close enough for an abstract description).

Each thread can only be doing one thing at one time. To do more than that, you need more things to be able to do, more threads, and hardware that can execute the added threads.
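The process/thread picture above can be shown concretely. In this sketch, each thread gets its own program counter and stack (its locals), while all of them share the process's memory, here a plain dict; the names and numbers are made up for illustration:

```python
import threading

results = {}  # shared process memory: every thread can write here

def worker(name, start):
    # `total` lives on this thread's own stack; each thread runs the
    # same code but over its own data.
    total = sum(range(start, start + 5))
    results[name] = total

threads = [threading.Thread(target=worker, args=(f"t{i}", i * 10))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait until every thread has finished

print(sorted(results.items()))  # [('t0', 10), ('t1', 60), ('t2', 110)]
```

Whether these three threads actually run on three cores, bounce between two, or timeslice on one is the OS scheduler's call, which is exactly what the rest of this discussion is about.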

I am asking, in part, because it seems that we are entering an age where there are just too many cpu cores. Running most of these single or dual threaded programs (games, as an example) consistently on the same cores (cores 0,1) seems like a great way to concentrate heat in only one area of the die.
That part, at least, is right on, and is why turning cores entirely off is used to help speed and power consumption.
 
Last edited:

cbn

Lifer
Mar 27, 2009
12,968
221
106
A thread in a typical CPU is a unique program counter and set of registers. A process which uses several threads will be running the same binary, but each thread can run different code and data (or the same data, but that should be avoided whenever possible), though all the instructions and data reside in the same virtual memory space.

IOW, I'm not sure you quite grasp how the CPU works. A single basic CPU has registers and/or stacks, which hold data separately from main memory, and memory. It begins by executing from a set spot in memory, and then does what it is told. It has a program counter, which is the location that the current or next instruction is at. This increments as instructions in series are executed, but can also change by jumping (AKA branching) to some other location. Calling it a counter is historical, but it's probably better thought of as a cursor.

In a modern CPU and OS, each program is a process (though it may spawn other processes), which has its own memory space, virtualized from physical memory, so it can usually act like it is all that's running on the CPU.

A process may have one or more threads, each of which is just a unique CPU state, and limited to that process' memory space. With common programming languages, the programmer is tasked with creating and managing all important details of any more threads than the first. Any given thread may voluntary stop itself, or may be forced to stop what it's doing by the OS, at any time (not always true, but close enough for an abstract description).

Each thread can only be doing one thing at one time. To do more than that, you need more things to be able to do, more threads, and hardware that can execute the added threads.

How about getting the work to jump around to different cores? Rather than staying on the same one or two cores constantly?

For example, on an eight-core CPU.....I am thinking of a situation where the 1-2 threads could constantly rotate among the different cores so as to keep heat from building up in one area. (This would allow faster clock speeds/Turbo.)
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
They've done that since as far back as Windows 95, or even 3.1.
I'm pretty sure SMP wasn't even supported outside of NT. Ironically, bouncing work around cores has historically been somewhat accidental, and not good for server performance. With shared near caches, it will affect any type of application's performance.

I don't know if they rotate which cores are off when doing turbo or not, but doing it very quickly would not be good for performance. Turning cores on and off is not very quick, and will have significant memory bandwidth needs per core turned on (grab state, start executing, stall a bunch, fill unshared caches...).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
How about getting the work to jump around to different cores? Rather than staying on the same one or two cores constantly?

For example, on an eight-core CPU.....I am thinking of a situation where the 1-2 threads could constantly rotate among the different cores so as to keep heat from building up in one area. (This would allow faster clock speeds/Turbo.)

This is called thread-migration and it is actually a problem from a performance standpoint. People tend to do what they can to minimize thread-migration because of the performance impact.

It does serve to lower the operating temperatures though.

Affinity locked thread processing: (screenshot omitted)

Thread-migration allowed: (screenshot omitted)

But if the goal is simply to reduce operating temperatures and you are willing to do so at the expense of reducing performance then there are far more elegant methods available for that - reduce the clockspeed (i.e. lower the TJmax value such that throttling occurs at an even lower temperature).
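Affinity locking of the kind shown in the first screenshot can be requested explicitly. A minimal sketch using `os.sched_setaffinity`, which is a Linux-only API, guarded so it degrades to normal (migration-allowed) scheduling elsewhere:

```python
import os

def run_pinned(fn, core_id=0):
    """Run fn with this process pinned to one core, then restore the
    old affinity mask. On non-Linux platforms, where the affinity API
    is unavailable, fn just runs under normal migrating scheduling."""
    if not hasattr(os, "sched_setaffinity"):
        return fn()
    old_mask = os.sched_getaffinity(0)      # pid 0 = current process
    os.sched_setaffinity(0, {core_id})      # allow only this core
    try:
        return fn()
    finally:
        os.sched_setaffinity(0, old_mask)   # let migration resume

# Example: a small hot loop that stays on core 0 while it runs.
print(run_pinned(lambda: sum(range(100)), core_id=0))  # 4950
```

Pinning like this avoids the migration performance hit Idontcare describes, at the cost of concentrating heat on one core, which is the trade-off under discussion.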
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
LOL

Bulldozer was a great codename until we realized that it meant a huge slow power-guzzling machine designed to push around piles of dirt and boulders.

I'm hopeful for "piledriver" but I am also curious to see what manner of misguided analogies are forthcoming if in hindsight it too sucks teh ballz.
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
It runs rich, I will give it that much.

AMD is seriously stupid. Their CEOs are acting like morons. Why they scrapped the Phenom II X6 lineup and discontinued their old chips that were better is beyond me.

They need to own up to the fact that they made serious mistakes with Bulldozer and move on from it. They should be selling Llano on 32nm like hotcakes right now. An X6 variant would be great.
 

IonusX

Senior member
Dec 25, 2011
392
0
0
According to my ninjas, those wanting AMD who don't want Bulldozer might be able to settle for Trinity. ATM it's looking to be at the 1090T result level + 25% or so (in fluids). If this translates over, that puts it well into Nehalem quad-core i5 turf, which might not be too bad. And on top of that, if your GPU ever dies, you get something on par with a 4870, which is an amazing substitute while you wait on an RMA.
Or you wait on Piledriver and get that, which if nothing else will be a very effective server CPU.
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
and you think amd can execute this one well and release it on time?

i don't believe any of their leaked hogwash any more. they're full of lies and they don't treat their current customer base well.

i look at amd as the lesser of two evils right now simply because you do still get a little bit more bang for your buck with them, but really, they aren't the company that they once were.
 

IonusX

Senior member
Dec 25, 2011
392
0
0
and you think amd can execute this one well and release it on time?

i don't believe any of their leaked hogwash any more. they're full of lies and they don't treat their current customer base well.

i look at amd as the lesser of two evils right now simply because you do still get a little bit more bang for your buck with them, but really, they aren't the company that they once were.
baiting me will get you anything you demand..
 