Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

vas flam · May 25, 2011

http://news.softpedia.com/news/AMD-...each-Retail-Until-Q3-2011-Report-202051.shtml

Anybody know if the delay is true or not?! I'm crying here

Arkadrel · May 25, 2011

AtenRa said:
I believe we will not see more than 3.2GHz (base frequency) for an 8 Core BD (16MB caches) at 95W TDP, at least in the beginning.

Phenom II X6 1065T is a 2.9GHz (3.4GHz turbo), 6-core CPU (95W TDP) on 45nm.

Bulldozer will have a node advantage and was designed to run at high clock speeds; its cores are also made to share resources to lower power use. The modular approach lowers not only the space taken by 2 cores, but also the power used when 2 cores are active.

I think the Bulldozer 8-core will be able to push higher speeds than the Phenom II X6 cores.
Still, 3.2GHz (base speed) + 500MHz (turbo that's almost always on) = 3.7GHz most of the time, with 8 cores.

That doesn't sound so bad.

Gundark · May 25, 2011

Absolution75 said:
I didn't read this entire thread, but do the leaked prices of say the 8core BD refer to BD modules or actual cores (ex 2 cores per module). If the BD was 8-module.... that would be interesting.

8core means 8 cores.

Voo · May 25, 2011

Khato said:
I'm nuts for other reasons sure, but not that one. Especially when it comes to high performance blocks that are not just straight synthesized logic, any duplication does amount to a copy/paste. Yes, after that's done the automated tools for signal wire routing and timings come into play, but that doesn't negate the starting point. It's far easier than having to design a high performance block that's twice the size.

Well maybe Intel and AMD do have such great tools that allow them to throw the design and constraints into it and get the perfect result back, but in my experience (well not my personal expertise - sure I had the courses but I just say what I've seen and heard from people doing these things for a living) getting optimal clock trees and co takes enough tweaking. Sure the tools are essential but you can't just throw the design at them and get the perfect result back.

But if you meant that you copy+paste the blocks into the design (well more increase some loop counts) and then still have to do the rest of the work, that we can agree on. And surely that makes it easier - just as always in software modular design has its advantages. It just sounded to me as if you advocated that you could just add some blocks, throw the new design into automated tools and get back with a production ready product.

AtenRa · May 25, 2011

http://cpu.zol.com.cn/231/2310866.html

translate: http://translate.google.com/transla...l=en&u=http://cpu.zol.com.cn/231/2310866.html

http://detail.zol.com.cn/picture_index_647/index6469206.shtml

http://detail.zol.com.cn/picture_index_647/index6469207.shtml

ehm, Bulldozer Launch in September ??

Abwx · May 25, 2011

AtenRa said:
http://cpu.zol.com.cn/231/2310866.html

translate: http://translate.google.com/translate?js=n&prev=_t&hl=en&ie=UTF-8&layout=2&eotf=1&sl=zh-CN&tl=en&u=http%3A%2F%2Fcpu.zol.com.cn%2F231%2F2310866.html

http://detail.zol.com.cn/picture_index_647/index6469206.shtml

http://detail.zol.com.cn/picture_index_647/index6469207.shtml

ehm, Bulldozer Launch in September ??

Maybe...

dma0991 · May 25, 2011

AtenRa said:
ehm, Bulldozer Launch in September ??

AMD, I am disappointed with your never ending delays.

Dresdenboy · May 25, 2011

AtenRa said:
http://cpu.zol.com.cn/231/2310866.html

translate: http://translate.google.com/transla...=http%3A//cpu.zol.com.cn/231/2310866.html

http://detail.zol.com.cn/picture_index_647/index6469206.shtml

http://detail.zol.com.cn/picture_index_647/index6469207.shtml

ehm, Bulldozer Launch in September ??

Looks to me like launch dates for specific models 8150 (no "P" at 125W TDP?), 8100, 6100, 4100, while models 8130P, 8110, 6110, 4110 might already be launched then. Could also be a fake again.

In the same sense we might cry about Intel delaying i7-2700K.

Abwx · May 25, 2011

Dresdenboy said:
Looks to me like launch dates for specific models 8150 (no "P" at 125W TDP?), 8100, 6100, 4100, while models 8130P, 8110, 6110, 4110 might already be launched then. Could also be a fake again.

In the same sense we might cry about Intel delaying i7-2700K.

An admin at Hardware.fr said that he contacted AMD and that it's really for September.

http://forum.hardware.fr/viewbbcode.php?config=hfr.inc&cat=1&numreponse=7915288

Hope you are the one who's right, though...

Edit : he did contact a mobo manufacturer........

dma0991 · May 25, 2011

From what I've heard earlier someone said that it is the 2nd batch of Bulldozer but I am unsure if it is the 2nd batch or the first launch. :hmm:

AtenRa · May 25, 2011

Dresdenboy said:
Looks to me like launch dates for specific models 8150 (no "P" at 125W TDP?), 8100, 6100, 4100, while models 8130P, 8110, 6110, 4110 might already be launched then. Could also be a fake again.

Ahh yes, those are different models, thx

iCyborg · May 25, 2011

When did AMD say Bulldozer would launch in June? I can only find one article from March, which is speculation (it says that AMD didn't comment on the news story). Has this one piece of FUD been taken for granted here since people are disappointed with the "delay"?

busydude · May 25, 2011

iCyborg said:
When did AMD say Bulldozer would launch in June?

They never said June.. but they said Q2 2011 for Desktop and Q3 for server parts.

Cerb · May 25, 2011

Khato said:
I'm guessing that my statement was not properly understood, since that's the only explanation for that response. Let me attempt to rephrase in similar terms. If a fictional design in the intended workloads averaged 90% utilization of its integer resources and 45% utilization on everything else... Then it doesn't make sense to halve everything else in order to have a more balanced core design - sure it would increase the utilization of everything else, simply because there's less of it, heh. It'd also decrease integer resource utilization due to dependencies and drastically decrease performance (aka kill.)

Why would this be necessarily so? If you were to halve the total size/complexity of the logic, and just try to add enough I/O to handle the rest, sure. The reduced/consolidated set would not have linear savings, because what would be halved would be the total performance capability, which is clearly being wasted.

Eh, okay. I know I sure wasn't there in the high level architecture design meetings 5+ years ago when those decisions were made.

Nor was I. So far, there's only what AMD has officially said, for BD, and what people working on it have said at various places.

I'll pretend to be an academic for a moment and proclaim that AMD has invented SCMT! Or maybe even P-SCMT if they did the separate issue queues like hyper threading. After all, according to that paper, bulldozer is neither SMT nor CMT because of the shared FPU, quoting from page 2, "The primary difference between the P-SMT and CMT approaches is that the former assigns threads to execution units at issue time, while in the more highly partitioned CMT processor, this assignment is done at dispatch time by steering each instruction to a particular cluster." As for the actual performance and energy conclusions of that paper, it's unfortunate that they only compared various 16-thread designs, none of which are comparable to the processor's we're interested in.

More like both, than neither, hence qualifying every CMT v. CMP+SMT bit as only being about integer.

P-SMT would be like if Intel made HT only able to use half the CPU's resources at any given time. So, all the cons of big complex units everywhere, but half the performance. It may fix the problem of resource contention, but only really makes sense if single-thread performance means nothing, in which case, why not just share caches and have more, simpler cores? OK, in some corner cases, there are reasons, but not for any code we might care about.

The shared front-end acts very much like SMT, up until the instructions need to be sent off for execution, which are not SMT for int, but seem to be so for FPU.

It's unfortunate that they were using slower, older CPUs to start from, not x86, etc., etc....meh. Ultimately, unless AMD and Intel were both in a race for the same kind of design, nothing else would be directly comparable. The points are that CMT is not SMT, though you can have both, and that divided execution units can be quite efficient, compared to all being duplicated, or all being shared. The real useful details are going to be specific to whatever AMD had in the pipe that wasn't tied just to BD's design, and then whatever has happened over the years, as BD has been delayed.

Did Apple steal it again? Sorry, couldn't resist... Continuing with the rest.

What do you know about my pixie dust?!

Pretty sure I didn't say anything about such a design being more power hungry or having less threads... But you are correct in that I should have stated it as far higher potential single-threaded performance, which likely wouldn't be realized often at all. The point being that the same resources in an equivalent SMT configuration could hit the same multi-threaded performance while offering no constraints to single-threaded potential. It's just markedly more difficult to design an adequate scheduler of that width.

Could AMD have made a CPU that ran anything any faster, without the shared front-end and FP (likely, the FP would be half width for each core, too)? It is quite probable that they couldn't, IMO, (or at least couldn't be highly confident that they could), and that this method gets them just as good single-threaded performance, at least as good multithreaded performance, and lets one thread hog enough FPU resources for two cores. Better single-threaded performance makes the assumption that their development budgets are not limited. At a certain level of difficulty (how many man-hours? What are the chances it makes our schedule slip again?), a feature likely has to get canned, and one easier to do within some time frame will get priority. I'm working on the assumption that AMD has very limited resources, and that potential risk and ROI affect every decision to a fair degree.

Okay, it's a copy, reflect, and paste. At least that's what the die shot implies, and is the only sensible way to implement the design. (I did the same thing on an ALU layout for one of my VLSI courses back in college.) Doing it any other way vastly increases the amount of back-end work necessary for no purpose. Oh, and guess I should have been more specific that I'm talking in terms of design implementation.

So, because there are double the execution units, the rest of the module must be double the space as if they had made it as a single core (with half the FP width)?

My claim of nonsense and reference back to point number 1 was that there's no merit of 'CMT' that increases its potential multi-threaded performance in comparison to SMT unless you give it more execution units to work with.

That's like saying the sky isn't blue, unless you look between the clouds.

Having more dedicated execution resources, and sharing elsewhere for better performance with limited resources, is the point to CMT, and this must be defined exclusively from sharing execution to be 'v. SMT'. Both are to make better use of limited resources, by sharing portions between threads, within a known-multithreaded environment: R&D budgets, development time, power envelope, clock speed, etc. An ideal situation, IMO, would be for both Intel and AMD to have SMT on int, whether AMD uses CMT or not, and let it be easily turned on and off during run-time. I'd bet a 4-thread BD module, using SMT for int, would turn out pretty good, for cases where response time doesn't matter, and you have plenty of threads to run.

Now the statement of SMT only providing an increase in performance with inefficient code would be correct, but it's still quite 'effective' at keeping execution units busy when running efficient code. On the flip-side, depending upon its implementation, a 'CMT' design could easily find inefficient code resulting in idle execution units - everything available thus far implies that this could well be the case with bulldozer.

SMT, used with efficient code, guarantees that each thread will run slower, even if the total is faster (which is not always the case, though it's not much of an issue on Nehalem and newer). If response time matters, SMT will get to be worse, as your code gets better. If you consider service time per thread always to be secondary, then you're probably not going to like anything AMD comes up with, because they have made a point of this being a way to differentiate their products. If idle execution units, due to low IPC code, are your main problem, SMT is by far the best way to fix that problem, short of ditching deep pipelines (which, for x86, is not likely to net as much gain as higher speeds).

Execution units idle because the rest of the CPU is bottlenecking them are always bad. Execution units idle because the code is cache-constrained, or is just very low IPC, is not bad, so long as they can put enough cores, running fast enough, on a chip that doesn't lose them money when selling it cheaper than Intel (and that the problem is not one of bad decisions regarding caches).

Intel, OTOH, just needs to keep their management's head out of the clouds, and the rest more or less takes care of itself, for them.

Smaller and simpler does indeed run faster. But again, how does that turn into an advantage of 'CMT' vs SMT? Sure compared to a huge CMP there's an advantage. But all the rest is more a function of other design decisions rather than some superiority of 'CMT'.

That not having unneeded redundancy should be part of the design, where this redundancy is defined by what is well or poorly used when multiple threads are active. Then, that it would be used, where possible, to facilitate those other improvements, including some that could take up more die space or xtors that might be allowed, if they'd gone normal CMP. If it is just done in a vacuum, relative to other features they could improve or add, just to save space/xtors/power, and not using any of those gains to improve overall performance, it will be a failure, because with only that, Intel's process advantage will still be too great. CMT v. just a CMP of the same, or just v. a SMT of the same execution resources, etc., is all to show what happens in a controlled environment, which can show merit, lack thereof, and what may be good directions to take, if there is merit. AMD still needs to replace the guts of the Athlon on steroids (STARS) with something far superior, that can be incrementally improved over the next decade, because they aren't dealing with such a controlled and contrived situation.

OTOH, if they can take one or more units shared between cores, make them powerful enough to handle both completely, yet smaller/simpler, or done in a more reasonable development time frame, or made even faster than if they had them duplicated, or some other thing that could get them a better CPU for their time and money, it could give them just that smidgen of an edge, to remain competitive, as a rodent-sized company when compared to Intel.

Phynaz · May 25, 2011

busydude said:
They never said June.. but they said Q2 2011 for Desktop and Q3 for server parts.

I don't recall, did they say retail availability?

Khato · May 25, 2011

Voo said:
Well maybe Intel and AMD do have such great tools that allow them to throw the design and constraints into it and get the perfect result back, but in my experience (well not my personal expertise - sure I had the courses but I just say what I've seen and heard from people doing these things for a living) getting optimal clock trees and co takes enough tweaking. Sure the tools are essential but you can't just throw the design at them and get the perfect result back.

Eh, unfortunately I can't get into too deep of a discussion here... But in general terms, the clock tree actually can be replicated as well so long as the trunk connection points are equal. It's actually quite easy when a logic block is simply mirrored. It's the I/O to a block that's more troublesome, since it's not likely that the surrounding logic/points of connection will be all that similar.

Voo said:
But if you meant that you copy+paste the blocks into the design (well more increase some loop counts) and then still have to do the rest of the work, that we can agree on. And surely that makes it easier - just as always in software modular design has its advantages. It just sounded to me as if you advocated that you could just add some blocks, throw the new design into automated tools and get back with a production ready product.

Nope, certainly not advocating that it's -that- simple. After all, more care has to be put into both any block of logic meant for repetition as well as any logic surrounding such blocks. However, it's still markedly easier than going through the entire layout design flow for each individual block. And, as said initially, die photos of bulldozer show a marked amount of 'messy' logic between the modules/L3 blocks and such, which implies that they're relying much more heavily on automated tools/synthesized logic for interconnection. Not that that's really bad or anything, just a likely necessary implementation choice.

Idontcare · May 25, 2011

Phynaz said:
I don't recall, did they say retail availability?

Do they ever? Neither Intel nor AMD speak to availability in the end-customer sense.

They will speak to terms such as "shipping for revenue", "shipping samples", and "launch" but I don't recall either one of them actually communicating the actionable information we all are seeking. (when can I order it from Newegg and have it not be on back-order when I place the order?)

Khato · May 25, 2011

Cerb said:
http://forums.anandtech.com/showpost.php?p=31759086&postcount=2440

Intel, OTOH, just needs to keep their management's head out of the clouds, and the rest more or less takes care of itself, for them.

Doing the lazy quote again since I sadly don't have time to reply to each individual point at current. Well, except for the one part that I did directly quote, which I can't help but laugh at - so true!

Suffice it to say, excellent responses over all. While I still don't agree with some points, that's more a matter of looking at them from a different perspective than actual disagreement.

dma0991 · May 25, 2011

iCyborg said:
When did AMD say Bulldozer would launch in June? I can only find one article from March, which is speculation (it says that AMD didn't comment on the news story). Has this one piece of FUD been taken for granted here since people are disappointed with the "delay"?

Even if there was no official announcement by AMD on the release date, there is no advantage to a late release. Maybe I exaggerated by saying disappointment, but that is just how I feel about the situation.

Tuna-Fish · May 25, 2011

I was thinking about the AGLUs, notably, why inc and not add? Then it hit me -- inc and dec both need only a single register read port.
So, if the design is:

MUL pipe = 2 read ports, 1 write port
DIV pipe = 2 read ports, 1 write port
AGLU 1 = 1 read port, 1 write port, access to IP and stack engine
AGLU 2 = 1 read port, 1 write port, access to IP and stack engine
(separate unit for data read on cache write, write ports on AGLU's double as the units used for data write on cache read)

total = 7 read ports, 4 write ports, the read port for cache write is not latency-sensitive and thus can be outside the forwarding network.
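
The tally above can be checked with a few lines of Python. This is only a sketch of the arithmetic in this post: the pipe names and port counts are the guesses listed above, not confirmed Bulldozer specifications, and the dictionary layout and the `tally` helper are made up for illustration.

```python
# Port tally for the speculative layout listed above.
# Counts come from the post; the data structure is purely illustrative.
PIPES = {
    "MUL pipe": {"read": 2, "write": 1},
    "DIV pipe": {"read": 2, "write": 1},
    "AGLU (first)": {"read": 1, "write": 1},
    "AGLU (second)": {"read": 1, "write": 1},
}

# Separate read port for store data (data read on cache write).
# The post assumes it can sit outside the forwarding network.
STORE_DATA_READ_PORTS = 1


def tally(pipes, extra_read_ports=0):
    """Return (total read ports, total write ports) on the register file."""
    reads = sum(p["read"] for p in pipes.values()) + extra_read_ports
    writes = sum(p["write"] for p in pipes.values())
    return reads, writes


reads, writes = tally(PIPES, STORE_DATA_READ_PORTS)
print(f"total = {reads} read ports, {writes} write ports")
```

Calling `tally(PIPES)` without the extra port gives the number of read ports that would have to be latency-sensitive under this guess.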

Then suddenly a lot of the things I've read make sense. The design itself is sensible because it optimizes for clock speed while keeping the most common operations as one-cycle -- most memory references are IP-relative (static data), based off stack pointer or base pointer (the C stack) or register + immediate (OOP object references), with only the odd one out needing two registers. It would explain why inc and dec and not add, and why complex LEA requires the ALU. The write ports on the AGLUs will block any one-cycle operations that need them when data arrives, but that's exactly what happens with mul add add add on the ALU pipes.

And keeping the port counts on the register file low makes it fly. Now, you can even make it simpler still by replicating it, leaving 3 read ports and 4 write ports per partition (plus the single read port that can be naive). Perhaps divide the forwarding network in two -- you can only zero-cycle forward inside the partition. This would make the scheduler a bit more complex, while making the forwarding network a lot simpler and cheaper, and keeping performance good in most cases, with one cycle of additional latency inserted whenever you mul after a div (or an add at the wrong port), or vice versa, while still keeping most operations at one cycle latency.
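
A rough sketch of the replication idea, again in Python. Everything here is an assumption for illustration: the grouping of pipes into partitions (MUL with one AGLU, DIV with the other), the rule that every pipe writes to every copy, and the `ports_per_partition` helper are hypothetical, not anything AMD has described.

```python
# Each entry: (pipe name, latency-sensitive read ports, write ports).
# Order matters: consecutive pipes are grouped into the same partition.
PIPES = [("MUL", 2, 1), ("AGLU A", 1, 1), ("DIV", 2, 1), ("AGLU B", 1, 1)]


def ports_per_partition(pipes, partitions):
    """Return a (read, write) port count for each register-file copy.

    Assumptions: pipes are split evenly across the copies, each copy only
    serves reads for its own pipes, and every pipe writes its result to
    every copy so that all copies hold the same values.
    """
    size = len(pipes) // partitions
    total_writes = sum(w for _, _, w in pipes)
    result = []
    for i in range(partitions):
        group = pipes[i * size:(i + 1) * size]
        result.append((sum(r for _, r, _ in group), total_writes))
    return result


print(ports_per_partition(PIPES, 1))  # one monolithic register file
print(ports_per_partition(PIPES, 2))  # two replicated copies
```

With two copies the read ports per copy are halved while the write ports stay the same, which is the trade described above; the store-data read port is left out because it is assumed not to need forwarding.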

drizek · May 25, 2011

Idontcare said:
Do they ever? Neither Intel nor AMD speak to availability in the end-customer sense.

They will speak to terms such as "shipping for revenue", "shipping samples", and "launch" but I don't recall either one of them actually communicating the actionable information we all are seeking. (when can I order it from Newegg and have it not be on back-order when I place the order?)

I'm sorry, but saying "we will launch in Q2" should only mean one thing.

Otherwise, they should have just said "we will have a booth at E3. Come hang out, we'll give you some stickers if you ask nicely".

Regardless, I will buy a 900 series motherboard and not waste my money on a new CPU. I'll get 8 or 16GB of RAM for it instead and hope to unlock my fourth core, and get a better NB OC on my Phenom. Then I will spend the CPU money on a new SLR and wait for 10-core Bulldozer, assuming they plan on releasing it next summer...

OCGuy · May 25, 2011

Wow, Q3 now? :/

Duke Nukem is coming out before Bulldozer....maybe the rapture is coming!

drizek · May 25, 2011

Isn't this to be expected?

Llano was shipping already, for a June launch. No word on Bulldozer shipping other than ESs.

Dresdenboy · May 25, 2011

AtenRa said:
Ahh yes, those are different models, thx

That's my POV

Oliverda posted this DH slide at XS, which supports this view and could even mean the "delay" is actually a "pull in".

Idontcare · May 25, 2011

drizek said:
I'm sorry, but saying "we will launch in Q2" should only mean one thing.

It means different things to different businesses and customer groups at their various positions along the supply chain.

Where the angst comes in terms of us end-user enthusiasts is that we are the ones who fail to properly align our expectations with reality.

Instead of coming to terms with the vernacular involved with the statement "launch in Q2" we instead opt to replace the reality of those words with that of a false expectation based on something we are hoping for.

If AMD directly sold their product to us end-users then their timeline communications could be reasonably interpreted as an attempt on their behalf at setting our expectations. But they don't, and they aren't.

When AMD, or Intel, says "launch in Q2" that is a statement that means something to their supply partners. It is setting expectations for board makers, resellers, OEMs, etc. for the timeline of when they can expect to start placing orders for the chips.

AMD has no control over how quickly or slowly Dell and HP are going to bring their end-customer products to retail. Likewise with boxed retail CPUs, sold in the thousands to resellers (not end-customers).

AMD has no idea how Newegg intends to manage their internal inventory and supply distribution network. So why would AMD take on the risk of Newegg screwing up? All AMD can do is set expectations as to when the Neweggs and the HPs of the world can place orders.

The lag between that time and the time for end-users to take delivery from Fedex is out of AMD's control, so why would they ever project a committed timeline for it?

I agree this is the information we end-users are seeking to divine from the tea leaves that we sift from analyst meetings and blogger comments. But the reality is that it is we who are deluding ourselves into setting unrealistic expectations by choosing to misuse information that is intended to be applicable only within its prescribed context.

I do not expect the Fedex guy to be able to deliver me a retail bulldozer cpu or a llano-based laptop from newegg until mid-July and late-August respectively.
