I'm staking the bet on JFAMD's job solely on the claim that single-threaded IPC will go up compared to Phenom II.
I haven't seen that claim rebutted yet, as we've not seen any working Bulldozer silicon, let alone any benchmarks.
If I turn out to be right, I have every right to call JFAMD a liar (just like Randy Allen when he claimed Barcelona would be 40% faster:
http://www.youtube.com/watch?v=G_n3wvsfq4Y), and then I'll demand that he quit his job, or that AMD fire him, for deliberately false advertising of AMD's products (which is against the law in most countries).
I also have every right to say the AMD PR people aren't doing their job well, when Anand and various other sites *repeatedly* misinterpret their information.
I have known Dave Baumann (another AMD-employed LIAR, just like Randy Allen) to deliberately refer to misinterpreted information on third-party sites. Those are just some of the underhanded tactics that AMD employs. You guys should be attacking AMD, not me. I'm just calling it as I see it.
You do realize that companies give performance projections long before they have the final product in-hand, right? Maybe Randy sincerely expected higher performance. Maybe he didn't...I don't know, because I wasn't personally involved and I didn't see what he saw. Did you?
Yeah, right. I know a whole lot more than most of you, being an industry insider.
I am willing to bet I know more than JFAMD as well, because I'm an engineer and he's just a PR guy who gets told what to say by engineers (and sometimes gets confused because he's not very good at his job).
People like jvroig and Idontcare can vouch for this. They are some of the few people who also know what they're talking about, and some of the few people who usually agree with what I say.
But if you come here as some kind of PR spin doctor and try to have a discussion with engineers such as myself, you're in trouble. We don't just buy what you say; we can make up our own minds from the information presented, combined with our own knowledge and experience. And when you try to argue, you're at a disadvantage, because you're the PR guy and we're the techies. You need to go back and consult with your engineers first, try to figure out what we're saying, and work out what your answer should be. Not a situation you'd want to be in.
I find it surprising how forthright JFAMD is. He seems to genuinely want to correct misunderstandings. Frankly, Scali, you seem to think you're hot sh*t...but no matter how good an engineer you may be, JFAMD has sources who know the complete truth. You have watered-down marketing slides from which to speculate.
Personally, I think getting riled up about area figures is pretty silly. It's like when someone says an x86 decoder is 5% of the area of a core. Sure, the decoder is only 5%, but the load-store unit has to handle segmented memory. The floating-point unit has to store and process 80-bit data. The instruction cache, decoder, load-store unit, and data cache all have to work together to handle self-modifying code. The decoder has to handle the fact that you can legally run a 4-byte instruction starting at address 0x123, and later start executing at address 0x124 and have the same bytes mean something different. The processor has to handle misaligned memory operations that cross page boundaries and have different memory types (e.g. write-back vs. non-cacheable) for the two pages. It has to handle one or both of those pages faulting. The integer datapath has to have hooks to handle all sorts of funky irregularities (e.g. shift-by-zero doesn't set flags; the multiplier has to handle every crazy combination of operand sizes and signed/unsigned numbers). You'll all go and fight about whether the decoder overhead is 4% or 6%, while missing the forest for that one particular tree.
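Just to make the overlapping-decode point concrete, here's a minimal sketch. It's a toy two-opcode decoder, nothing like a real one; the byte values are standard x86 encodings, but everything else is invented for illustration:

```python
# Toy illustration (not a real decoder): the same x86 byte stream decodes to
# completely different instructions depending on where you start reading.
# Only two opcodes are modeled; real x86 decode is vastly more involved.

def decode(stream, offset):
    """Decode one instruction at `offset`; returns (text, length)."""
    op = stream[offset]
    if op == 0x05:  # ADD EAX, imm32: opcode byte + 4 immediate bytes
        imm = int.from_bytes(stream[offset + 1:offset + 5], "little")
        return f"add eax, 0x{imm:x}", 5
    if op == 0x01:  # ADD r/m32, r32: opcode byte + ModRM byte
        modrm = stream[offset + 1]
        if modrm == 0x00:  # mod=00, reg=eax, rm=[eax]
            return "add [eax], eax", 2
    raise NotImplementedError("toy decoder only knows two encodings")

code = bytes([0x05, 0x01, 0x00, 0x00, 0x00])
print(decode(code, 0))  # ('add eax, 0x1', 5)   -- one 5-byte instruction
print(decode(code, 1))  # ('add [eax], eax', 2) -- same bytes, new meaning
```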
I wonder if AMD will come out with a second-generation Bobcat that is optimized by hand, instead of being designed completely by a program, as Bobcat is said to be? It is that second part that keeps me from being too excited about this release, as I wonder how truly efficient the design can be.
I saw a really interesting poster from Intel at DAC comparing semicustom hand-implementation to fully automated synthesis, place & route. They found that hand implementation actually didn't buy them anything - in fact, the semicustom design consumed dramatically more power and area while gaining only a trivial amount of performance (~1%?). If you think about it, there are a few reasons that place & route can beat a human:
1) Humans can design fantastic bit-slices, but bit-slices aren't always optimal: hand design tends to leave a lot of empty space and waste a lot of power. For example, if you have a shifter feeding an adder (like some ugly instruction sets allow), the adder needs the lower bits to be available before the upper bits. A human isn't going to be able to optimize the shifting logic separately at every bit, and is either going to plop down one high-speed shifter optimized for bit 0 everywhere, or, best case, break the datapath into a few chunks and use progressively smaller (lower-power, slower) shifters for each block of e.g. 16 bits. A tool can optimize every bit differently (see the toy timing sketch after this list).
Some structures are really pathological for humans, like multipliers. The most straightforward way to place them is a giant parallelogram, which leaves two large unused triangles. You can get into some funky methods of folding multipliers to cut down on wasted space, but it gets complicated fast (worrying about routing tracks, making sure you are still keeping the important wires short, etc). A place&route tool can create a big, dense blob of logic that uses area very efficiently.
2) Modern place&route tools have huge libraries of implementations for common structures that they can select. For example, Synopsys has something called DesignWare, which provides an unbelievable selection of circuits for (random example) adders, targeting every possible combination of constraints (latency, power, area, probably tradeoffs of wire delay vs. gate delay, who knows what else). A human doing semicustom implementation doesn't actually have to beat a computer - he has to beat every other human who has attacked the problem before, and had their solution incorporated into these libraries.
3) An automated design can adapt quickly to changes. You have to break a semicustom design up into pieces and create a floorplan for the design, giving each piece an area budget and planning which directions its data comes from/goes to (e.g. "the multiplier's operands come from the left"). Once the designs are done, you have to jiggle things around to handle parts that came in over/under budget, and you end up with a lot of whitespace. If, halfway through the project, you realize you want to make a large change, you may find that too much rework is required and you're stuck with a suboptimal design.
Plop a quarter-micron K7 on top of a 32nm Llano... is it really likely that the same floorplan has been optimal since the days when transistors were slow and wires were fast, through to the days when wires are slow and transistors are fast? Engineers always talk about logic and SRAM scaling differently, yet the L1 caches appear to take a pretty similar amount of area. Shouldn't 7 process generations have caused enough churn that a complete redesign would look pretty different, even from a very high level? With an autoplaced design, you can try all sorts of crazy large-scale floorplan changes with minimal effort. If you try a new floorplan with a hand-placed design, you won't know for sure that it works until you've redesigned every last piece. You could discover a nasty timing path pretty late and suddenly be in big trouble. It's interesting to see how, on that original K7, the area was used pretty efficiently - pretty much every horizontal slice is the same width. The Llano image doesn't look quite as nice. For what it's worth, you can do similar comparisons with Pentium Pro/P2/P3/Banias/etc. On a related note, the AMD website used to have a bunch of great high-res photos of various processors. Anyone know where to find them now?
4) Not all engineers are the best engineers. You might be able to design the most amazing multiplier in the world, but a company might have a hard time finding 100 of you, and big custom designs require big teams.
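To put point 1 in rough numbers, here's a toy timing model. Every figure is made up, and the power tradeoff is a placeholder; the real numbers depend on the process and circuit family. It just shows how much later the upper bits' shifter slices are allowed to arrive when they feed a ripple-carry adder, and how much power that slack could buy:

```python
# Toy timing model (all numbers invented): a shifter feeding a ripple-carry
# adder. The carry ripples up from bit 0, so input bit i isn't needed until
# the carry chain has reached it -- the higher the bit, the later (and thus
# slower and lower-power) its shifter slice is allowed to be.

WIDTH = 16          # datapath bits (illustrative)
CARRY_DELAY = 1.0   # delay per ripple-carry stage, arbitrary units

# Latest time input bit i may arrive without lengthening the adder's
# critical path: the carry into bit i shows up at roughly i * CARRY_DELAY.
slack = [i * CARRY_DELAY for i in range(WIDTH)]

# Pretend we can trade slice speed for power: a slice allowed to be s units
# slower burns (say) 1/(1+s) of the fast slice's power. Purely illustrative.
power = [1.0 / (1.0 + s) for s in slack]

for i in range(0, WIDTH, 4):
    print(f"bit {i:2}: may arrive {slack[i]:4.1f} units late, "
          f"rel. power {power[i]:.2f}")

print(f"uniform fast slices: {WIDTH:.1f} units of power")
print(f"per-bit optimized:   {sum(power):.1f} units of power")
```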
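And for point 2, a toy sketch of constraint-driven selection in the spirit of such libraries. The adder menu and all latency/area/power numbers below are invented for illustration, not taken from DesignWare:

```python
# Toy constraint-driven selection: pick, from a menu of adder
# implementations, the cheapest one that meets the latency budget.
# All figures are made-up relative numbers.

adders = [
    # (name, latency, area, power)
    ("ripple-carry",   16.0, 1.0, 1.0),
    ("carry-select",    6.0, 1.8, 1.6),
    ("carry-lookahead", 4.0, 2.5, 2.2),
    ("kogge-stone",     2.5, 4.0, 3.5),
]

def pick_adder(max_latency, weight_area=1.0, weight_power=1.0):
    """Cheapest implementation meeting the latency constraint."""
    feasible = [a for a in adders if a[1] <= max_latency]
    if not feasible:
        raise ValueError("no implementation meets the latency constraint")
    return min(feasible, key=lambda a: weight_area * a[2] + weight_power * a[3])

print(pick_adder(max_latency=20))  # relaxed timing -> ripple-carry wins
print(pick_adder(max_latency=5))   # tight timing   -> carry-lookahead
print(pick_adder(max_latency=3))   # very tight     -> kogge-stone
```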
If you look carefully at die photos of some mainstream Intel processors, it looks like they've actually been using a lot of automated place & route since at least as far back as Prescott. This blurry photo of Prescott shows a mix of what appears to be custom or semi-custom logic at the bottom and top-right, as well as a lot of what appears to be auto-placed logic (note the curvy boundary between logic and what looks like darker whitespace, left of and above the center... humans just don't do that). I've also read a paper by a company involved in Cell (I think it was Toshiba) that found that an autoplaced version of Cell was faster and smaller than the original semicustom implementation.
Who knows, maybe they wouldn't even have to do that. If Bobcat has a small enough die size, maybe they could go mega-multicore with it and introduce their own speculative-threading technology to better leverage all those cores. But that's just out in left field.
Personally, I don't think speculative threading will go anywhere any time soon. There are a lot of cool-sounding ideas out there, but there are a couple of fundamental problems:
A) You're burning a tremendous amount of extra power, and we're already in a power-constrained world. For speculation to be power-efficient, you need extremely high accuracy in guessing when you can safely jump ahead.
B) Checking that the speculation was safe involves too much complexity. Code can access any memory address at any time... a speculative thread has to monitor every address it reads to make sure older instructions don't write to them. Real-world code isn't going to be friendly to that. If you specifically wrote code with speculative multithreading in mind, it might be doable... but I don't see that happening. For example, if you ensured that an interesting region of your program accessed no more than 1KB of memory (16 64-byte cache lines), it would be reasonable to track. If your program is jumping through a large or sparse dataset, it just gets too difficult. Hardware CAMs are power-hungry. Alternatives to CAMs are slow.
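Here's a minimal sketch of that tracking problem. It's pure illustration: the class, the squash policy, and the 16-line budget are my own toy stand-ins for what would really be hardware CAMs or signatures:

```python
# Minimal sketch of read-set tracking for a speculative thread: record every
# cache line speculatively read; if an architecturally older instruction
# writes any of those lines before the speculation commits, squash. A small,
# dense footprint (a few lines) is trackable; a sparse one blows the budget.

LINE = 64  # bytes per cache line

class SpeculativeThread:
    def __init__(self, max_lines=16):      # 16 x 64B = 1 KB of tracked state
        self.read_set = set()              # cache lines speculatively read
        self.max_lines = max_lines
        self.squashed = False

    def spec_read(self, addr):
        self.read_set.add(addr // LINE)
        if len(self.read_set) > self.max_lines:
            # Footprint exceeds what the hardware could track: give up.
            self.squashed = True

    def observe_older_write(self, addr):
        # An architecturally older instruction wrote this address.
        if addr // LINE in self.read_set:
            self.squashed = True           # ordering violation detected

spec = SpeculativeThread()
spec.spec_read(0x1000)            # speculatively read one line
spec.observe_older_write(0x1008)  # older write hits the same 64B line
print(spec.squashed)              # True -> speculation must be discarded

spec2 = SpeculativeThread()
for i in range(32):               # sparse dataset: 32 distinct lines
    spec2.spec_read(0x10000 + i * 4096)
print(spec2.squashed)             # True -> footprint too big to track
```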
Alrighty, let's give this another shot. This isn't just for you, in case you start feeling I'm singling you out, but I'm quoting you since you've quoted me.
[Warning for all - long wall of text incoming. SKIP if no interest in the Bulldozer 5% issue]
<snip>
That was an excellent post, jvroig.