I'm staking the bet on JFAMD's job solely on the claim that single-threaded IPC will go up compared to Phenom II.
I haven't seen that claim rebutted yet, as we've not seen any working Bulldozer silicon, let alone any benchmarks.
If I turn out to be right, I have every right to call JFAMD a liar (just like Randy Allen when he claimed Barcelona would be 40% faster:
http://www.youtube.com/watch?v=G_n3wvsfq4Y), and then I'll demand that he quit his job, or that AMD fire him, for deliberately false advertising of AMD's products (which is against the law in most countries).
I also have every right to say the AMD PR people aren't doing their job well, when Anand and various other sites *repeatedly* misinterpret their information.
I have known Dave Baumann (another AMD-employed LIAR, just like Randy Allen) to deliberately refer to misinterpreted information on third-party sites. Those are just some of the underhanded tactics that AMD employs. You guys should be attacking AMD, not me. I'm just calling it as I see it.
You do realize that companies give performance projections long before they have the final product in-hand, right? Maybe Randy sincerely expected higher performance. Maybe he didn't...I don't know, because I wasn't personally involved and I didn't see what he saw. Did you?
Yeah, right. I know a whole lot more than most of you, being an industry insider.
I am willing to bet I know more than JFAMD as well, because I'm an engineer and he's just a PR guy who gets told what to say by engineers (and sometimes gets confused because he's not very good at his job).
People like jvroig and Idontcare can vouch for this. They are some of the few people who also know what they're talking about, and some of the few people who usually agree with what I say.
But if you come here as some kind of PR spin doctor and try to have a discussion with engineers such as myself, you're in trouble. We don't just buy what you say; we can make up our own minds from the information presented, combined with our own knowledge and experience. And when you try to argue, you're at a disadvantage, because you're the PR guy and we're the techies. You need to go back and consult with your engineers first, try to figure out what we're saying, and work out what your answer should be. Not a situation you'd want to be in.
I find it surprising how forthright JFAMD is. He seems to genuinely want to correct misunderstandings. Frankly, Scali, you seem to think you're hot sh*t...but no matter how good an engineer you may be, JFAMD has sources who know the complete truth. You have watered-down marketing slides from which to speculate.
Personally, I think getting riled up about area figures is pretty silly. It's like when someone says an x86 decoder is 5% of the area of a core. Sure, the decoder is only 5%, but the load-store unit has to handle segmented memory. The floating-point unit has to store and process 80-bit data. The instruction cache, decoder, load-store unit, and data cache all have to work together to handle self-modifying code. The decoder has to handle the fact that you can legally run a 4-byte instruction starting at address 0x123, and later start executing at address 0x124 and have the same bytes mean something different. The processor has to handle misaligned memory operations that cross page boundaries and have different memory types (e.g. write-back vs. non-cacheable) for the two pages. It has to handle one or both of those pages faulting. The integer datapath has to have hooks to handle all sorts of funky irregularities (e.g. shift-by-zero doesn't set flags; the multiplier has to handle every crazy combination of operand sizes and signed/unsigned numbers). You'll all go and fight about whether the decoder overhead is 4% or 6%, while missing the forest for that one particular tree.
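Just to make the overlapping-decode point concrete, here's a minimal sketch. It's a toy two-opcode decoder, nothing like a real one; the byte values are standard x86 encodings, but everything else is invented for illustration:

```python
# Toy illustration (not a real decoder): the same x86 byte stream decodes to
# completely different instructions depending on where you start reading.
# Only two opcodes are modeled; real x86 decode is vastly more involved.

def decode(stream, offset):
    """Decode one instruction at `offset`; returns (text, length)."""
    op = stream[offset]
    if op == 0x05:  # ADD EAX, imm32: opcode byte + 4 immediate bytes
        imm = int.from_bytes(stream[offset + 1:offset + 5], "little")
        return f"add eax, 0x{imm:x}", 5
    if op == 0x01:  # ADD r/m32, r32: opcode byte + ModRM byte
        modrm = stream[offset + 1]
        if modrm == 0x00:  # mod=00, reg=eax, rm=[eax]
            return "add [eax], eax", 2
    raise NotImplementedError("toy decoder only knows two encodings")

code = bytes([0x05, 0x01, 0x00, 0x00, 0x00])
print(decode(code, 0))  # ('add eax, 0x1', 5)   -- one 5-byte instruction
print(decode(code, 1))  # ('add [eax], eax', 2) -- same bytes, new meaning
```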
I wonder if AMD will come out with a second-generation Bobcat that is optimized by hand, instead of being designed completely by a program, as Bobcat is said to be? It is that second part that keeps me from being too excited about this release, as I wonder how truly efficient the design can be.
I saw a really interesting poster from Intel at DAC comparing semicustom hand-implementation to fully automated synthesis, place & route. They found that hand implementation actually didn't buy them anything - in fact, the semicustom design consumed dramatically more power and area while gaining only a trivial amount of performance (~1%?). If you think about it, there are a few reasons that place & route can beat a human:
1) Humans can design fantastic bit-slices, but bit-slices aren't always optimal: hand design tends to leave a lot of empty space and waste a lot of power. For example, if you have a shifter feeding an adder (like some ugly instruction sets allow), the adder needs the lower bits to be available before the upper bits. A human isn't going to be able to optimize the shifting logic separately at every bit, and is either going to plop down one high-speed shifter optimized for bit 0 everywhere, or, best case, break the datapath into a few chunks and use progressively smaller (lower-power, slower) shifters for each block of e.g. 16 bits. A tool can optimize every bit differently (see the toy timing sketch after this list).
Some structures are really pathological for humans, like multipliers. The most straightforward way to place them is a giant parallelogram, which leaves two large unused triangles. You can get into some funky methods of folding multipliers to cut down on wasted space, but it gets complicated fast (worrying about routing tracks, making sure you are still keeping the important wires short, etc). A place&route tool can create a big, dense blob of logic that uses area very efficiently.
2) Modern place&route tools have huge libraries of implementations for common structures that they can select. For example, Synopsys has something called DesignWare, which provides an unbelievable selection of circuits for (random example) adders, targeting every possible combination of constraints (latency, power, area, probably tradeoffs of wire delay vs. gate delay, who knows what else). A human doing semicustom implementation doesn't actually have to beat a computer - he has to beat every other human who has attacked the problem before, and had their solution incorporated into these libraries.
3) An automated design can adapt quickly to changes. You have to break a semicustom design up into pieces and create a floorplan for the design, giving each piece an area budget and planning which directions its data comes from/goes to (e.g. "the multiplier's operands come from the left"). Once the designs are done, you have to jiggle things around to handle parts that came in over/under budget, and you end up with a lot of whitespace. If, halfway through the project, you realize you want to make a large change, you may find that too much rework is required and you're stuck with a suboptimal design.
Plop a quarter-micron K7 on top of a 32nm Llano... is it really likely that the same floorplan has been optimal since the days when transistors were slow and wires were fast, through to the days when wires are slow and transistors are fast? Engineers always talk about logic and SRAM scaling differently, yet the L1 caches appear to take a pretty similar amount of area. Shouldn't 7 process generations have caused enough churn that a complete redesign would look pretty different, even from a very high level? With an autoplaced design, you can try all sorts of crazy large-scale floorplan changes with minimal effort. If you try a new floorplan with a hand-placed design, you won't know for sure that it works until you've redesigned every last piece. You could discover a nasty timing path pretty late and suddenly be in big trouble. It's interesting to see how, on that original K7, the area was used pretty efficiently - pretty much every horizontal slice is the same width. The Llano image doesn't look quite as nice. For what it's worth, you can do similar comparisons with Pentium Pro/P2/P3/Banias/etc. On a related note, the AMD website used to have a bunch of great high-res photos of various processors. Anyone know where to find them now?
4) Not all engineers are the best engineers. You might be able to design the most amazing multiplier in the world, but a company might have a hard time finding 100 of you, and big custom designs require big teams.
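To put point 1 in rough numbers, here's a toy timing model. Every figure is made up, and the power tradeoff is a placeholder; the real numbers depend on the process and circuit family. It just shows how much later the upper bits' shifter slices are allowed to arrive when they feed a ripple-carry adder, and how much power that slack could buy:

```python
# Toy timing model (all numbers invented): a shifter feeding a ripple-carry
# adder. The carry ripples up from bit 0, so input bit i isn't needed until
# the carry chain has reached it -- the higher the bit, the later (and thus
# slower and lower-power) its shifter slice is allowed to be.

WIDTH = 16          # datapath bits (illustrative)
CARRY_DELAY = 1.0   # delay per ripple-carry stage, arbitrary units

# Latest time input bit i may arrive without lengthening the adder's
# critical path: the carry into bit i shows up at roughly i * CARRY_DELAY.
slack = [i * CARRY_DELAY for i in range(WIDTH)]

# Pretend we can trade slice speed for power: a slice allowed to be s units
# slower burns (say) 1/(1+s) of the fast slice's power. Purely illustrative.
power = [1.0 / (1.0 + s) for s in slack]

for i in range(0, WIDTH, 4):
    print(f"bit {i:2}: may arrive {slack[i]:4.1f} units late, "
          f"rel. power {power[i]:.2f}")

print(f"uniform fast slices: {WIDTH:.1f} units of power")
print(f"per-bit optimized:   {sum(power):.1f} units of power")
```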
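And for point 2, a toy sketch of constraint-driven selection in the spirit of such libraries. The adder menu and all latency/area/power numbers below are invented for illustration, not taken from DesignWare:

```python
# Toy constraint-driven selection: pick, from a menu of adder
# implementations, the cheapest one that meets the latency budget.
# All figures are made-up relative numbers.

adders = [
    # (name, latency, area, power)
    ("ripple-carry",   16.0, 1.0, 1.0),
    ("carry-select",    6.0, 1.8, 1.6),
    ("carry-lookahead", 4.0, 2.5, 2.2),
    ("kogge-stone",     2.5, 4.0, 3.5),
]

def pick_adder(max_latency, weight_area=1.0, weight_power=1.0):
    """Cheapest implementation meeting the latency constraint."""
    feasible = [a for a in adders if a[1] <= max_latency]
    if not feasible:
        raise ValueError("no implementation meets the latency constraint")
    return min(feasible, key=lambda a: weight_area * a[2] + weight_power * a[3])

print(pick_adder(max_latency=20))  # relaxed timing -> ripple-carry wins
print(pick_adder(max_latency=5))   # tight timing   -> carry-lookahead
print(pick_adder(max_latency=3))   # very tight     -> kogge-stone
```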
If you look carefully at die photos of some mainstream Intel processors, it looks like they've actually been using a lot of automated place & route since at least as far back as Prescott. This blurry photo of Prescott shows a mix of what appears to be custom or semi-custom logic at the bottom and top-right, as well as a lot of what appears to be auto-placed logic (note the curvy boundary between logic and what looks like darker whitespace, left of and above the center... humans just don't do that). I've also read a paper by a company involved in Cell (I think it was Toshiba) that found that an autoplaced version of Cell was faster and smaller than the original semicustom implementation.
Who knows, maybe they wouldn't even have to do that. If Bobcat has a small enough die size, maybe they could go mega-multicore with it and introduce their own speculative-threading technology to better leverage all those cores. But that's just out in left field.
Personally, I don't think speculative threading will go anywhere any time soon. There are a lot of cool-sounding ideas out there, but there are a couple of fundamental problems:
A) You're burning a tremendous amount of extra power, and we're already in a power-constrained world. For speculation to be power-efficient, you need extremely high accuracy in guessing when you can safely jump ahead.
B) Checking that the speculation was safe involves too much complexity. Code can access any memory address at any time... a speculative thread has to monitor every address it reads to make sure older instructions don't write to them. Real-world code isn't going to be friendly to that. If you specifically wrote code with speculative multithreading in mind, it might be doable... but I don't see that happening. For example, if you ensured that an interesting region of your program accessed no more than 1KB of memory (16 64-byte cache lines), it would be reasonable to track. If your program is jumping through a large or sparse dataset, it just gets too difficult. Hardware CAMs are power-hungry. Alternatives to CAMs are slow.
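Here's a minimal sketch of that tracking problem. It's pure illustration: the class, the squash policy, and the 16-line budget are my own toy stand-ins for what would really be hardware CAMs or signatures:

```python
# Minimal sketch of read-set tracking for a speculative thread: record every
# cache line speculatively read; if an architecturally older instruction
# writes any of those lines before the speculation commits, squash. A small,
# dense footprint (a few lines) is trackable; a sparse one blows the budget.

LINE = 64  # bytes per cache line

class SpeculativeThread:
    def __init__(self, max_lines=16):      # 16 x 64B = 1 KB of tracked state
        self.read_set = set()              # cache lines speculatively read
        self.max_lines = max_lines
        self.squashed = False

    def spec_read(self, addr):
        self.read_set.add(addr // LINE)
        if len(self.read_set) > self.max_lines:
            # Footprint exceeds what the hardware could track: give up.
            self.squashed = True

    def observe_older_write(self, addr):
        # An architecturally older instruction wrote this address.
        if addr // LINE in self.read_set:
            self.squashed = True           # ordering violation detected

spec = SpeculativeThread()
spec.spec_read(0x1000)            # speculatively read one line
spec.observe_older_write(0x1008)  # older write hits the same 64B line
print(spec.squashed)              # True -> speculation must be discarded

spec2 = SpeculativeThread()
for i in range(32):               # sparse dataset: 32 distinct lines
    spec2.spec_read(0x10000 + i * 4096)
print(spec2.squashed)             # True -> footprint too big to track
```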
Alrighty, let's give this another shot. This isn't just for you, in case you start feeling I'm singling you out, but I'm quoting you since you've quoted me.
[Warning for all - long wall of text incoming. SKIP if no interest in the Bulldozer 5% issue]
<snip>
That was an excellent post, jvroig.