HW2050Plus
Member
- Jan 12, 2011
According to this graph, a C2D runs at an average of about 1.2 IPC.
SB must be at roughly 1.6 IPC, so we can assume that BD's
two integer execution units are unlikely to be saturated, and also
that they would have to run at about 80% of their max throughput
to equal SB's IPC.
We have information that the BD architecture is optimised to sustain
a high IPC, that is, one as constant as possible and close to
the max throughput.
Of course, things are different depending on whether we take integer or FP code
to check the actual IPC.
Let me try to explain this fundamental misunderstanding of absolute IPC.
If you have an IPC with the absolute value x.y, that actually means nothing unless it is used as a relative comparison with exactly the same code.
E.g. I have a program whose main loop does 20 adds, 5 muls and 1 div (26 instructions). Then I get an IPC on:
cycles: (20 * 1 + 5 * 4 + 1 * 25) = 65
on a 1-wide superscalar architecture: 26 / 65 * 1 = 0.4
on a 2-wide superscalar architecture: 26 / 65 * 2 = 0.8
on a 3-wide superscalar architecture: 26 / 65 * 3 = 1.2
As you see, the absolute number has nothing to do with unit utilisation: all units are completely busy at all times.
Yes, that is simplified, but it shows the point.
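The arithmetic above can be checked with a small sketch. The latencies of 1, 4 and 25 cycles for add, mul and div are the example's hypothetical values, not real measurements for any particular CPU:

```python
# Instruction mix from the example: 20 adds (1 cycle), 5 muls (4 cycles),
# 1 div (25 cycles). On a machine whose units are always fully busy,
# widening the machine multiplies throughput.
instructions = 20 + 5 + 1                 # 26 instructions total
cycles = 20 * 1 + 5 * 4 + 1 * 25          # 65 cycles on a 1-wide machine

for width in (1, 2, 3):
    ipc = instructions / cycles * width   # units 100% busy in every case
    print(f"{width}-wide: IPC = {ipc:.1f}")
# 1-wide: IPC = 0.4
# 2-wide: IPC = 0.8
# 3-wide: IPC = 1.2
```

The point the numbers make: the same fully saturated workload reports three different "absolute" IPC values depending only on machine width.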
Another point:
Execution of memory instructions, e.g. a dependent chain of `mov rax, [address]` loads with a 4-cycle latency (assuming L1 cache hits):
on a 1-wide superscalar architecture: 1 / 4 * 1 = 0.25
on a 2-wide superscalar architecture: 1 / 4 * 2 = 0.5
on a 3-wide superscalar architecture: 1 / 4 * 3 = 0.75
And then you can have partial memory stalls, where the above values are just lower but all units remain busy.
And you have the situation of full pipeline stalls:
e.g. executing 1-cycle instructions at full rate for 95% of instructions and stalling (15 cycles) on the other 5%:
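The load case follows the same model; the 4-cycle L1 load latency is again the example's assumed figure:

```python
# A dependent chain of loads that each hit L1 with a hypothetical
# 4-cycle latency: each lane completes one load every 4 cycles.
load_latency = 4

for width in (1, 2, 3):
    ipc = 1 / load_latency * width
    print(f"{width}-wide: IPC = {ipc}")
# 1-wide: IPC = 0.25
# 2-wide: IPC = 0.5
# 3-wide: IPC = 0.75
```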
on a 1-wide superscalar architecture: 1 / (0.95 / 1 + 0.05 * 15) = 0.59
on a 2-wide superscalar architecture: 1 / (0.95 / 2 + 0.05 * 15) = 0.82
on a 3-wide superscalar architecture: 1 / (0.95 / 3 + 0.05 * 15) = 0.94
Again the IPC is only 0.94, yet all three units run 100% busy for 95% of the executed instructions.
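The stall model works out like this (the 5% stall fraction and 15-cycle stall cost are the example's assumed numbers):

```python
# 95% of instructions retire at the machine's full width; 5% stall for a
# hypothetical 15 cycles each. Average cycles per instruction (CPI) is
# 0.95 / width + 0.05 * 15, and IPC is its reciprocal.
stall_fraction = 0.05
stall_cycles = 15

for width in (1, 2, 3):
    cpi = (1 - stall_fraction) / width + stall_fraction * stall_cycles
    print(f"{width}-wide: IPC = {1 / cpi:.2f}")
# 1-wide: IPC = 0.59
# 2-wide: IPC = 0.82
# 3-wide: IPC = 0.94
```

Note how quickly the fixed stall term dominates: going from 2-wide to 3-wide only lifts IPC from 0.82 to 0.94 even though the extra unit is busy whenever work is available.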
So I say it again and again: don't conclude from absolute IPC numbers the usefulness or busyness of pipelines/units. That does not work out.
IPC is only useful when compared relatively.
All this shows exactly that there are three ways to improve IPC:
a) reduce instruction latency
b) more pipelines / units
c) less stalls (branch prediction, memory access: more cache, faster cache, prefetcher)
Again if we check for Bulldozer:
a) increased latencies
b) less pipelines / units
c) stalls about the same (improved: prefetchers and more instructions in flight, presumably better branch prediction; worse: cache size, cache speed, higher misprediction penalty)
So it is crystal clear that Bulldozer will have significantly lower IPC than K8/K10. By how much, relatively, we will have to see. AMD itself claims a loss of only 10% in IPC. I personally doubt that and think they lose more.
AMD focused on throughput (more cores) and on raw frequency to improve speed; they did not focus on IPC with Bulldozer.