Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Status
Not open for further replies.

HW2050Plus

Member
Jan 12, 2011
168
0
0
According to this graph, a C2D runs at an average of 1.2 IPC.
SB must be roughly at 1.6 IPC, so we can assume that BD's
two integer execution units are unlikely to be saturated and also
that they must be running at 80% of their max throughput
to equal SB's IPC.

We have information that the BD architecture is optimised to sustain
high IPC, that is, as constant as possible and close to
the max throughput.

Of course, things differ depending on whether we take INTEGER or FP
to check the actual IPC.

Let me try to explain this fundamental misunderstanding of absolute IPC.

An absolute IPC value of x.y means nothing on its own; it is only meaningful as a relative comparison on exactly the same code.

Say I have a program whose main loop does 20 adds, 5 muls and 1 div, with latencies of 1, 4 and 25 cycles respectively. Then the IPC is:
cycles: (20 * 1 + 5 * 4 + 1 * 25) = 65
on a 1-wide superscalar architecture: 26 / 65 * 1 = 0.4
on a 2-wide superscalar architecture: 26 / 65 * 2 = 0.8
on a 3-wide superscalar architecture: 26 / 65 * 3 = 1.2

As you can see, the absolute number has nothing to do with unit utilization. All units are completely busy at all times.

Yes, that is simplified, but it shows the point.
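
The simplified model above can be written out as a short Python sketch. It assumes, as the post does, that every instruction occupies a unit for its full latency and that a W-wide core simply does W times the work per cycle. The names `LATENCY`, `LOOP` and `simplified_ipc` are made up for illustration.

```python
# Latencies in cycles and instruction counts, taken from the loop in the post.
LATENCY = {"add": 1, "mul": 4, "div": 25}
LOOP = {"add": 20, "mul": 5, "div": 1}


def simplified_ipc(width, loop=LOOP, latency=LATENCY):
    """IPC under the post's simplified model: serial cycle count, scaled by width."""
    instructions = sum(loop.values())
    serial_cycles = sum(count * latency[op] for op, count in loop.items())
    return instructions / serial_cycles * width


for width in (1, 2, 3):
    print(f"{width}-wide: IPC = {simplified_ipc(width):.2f}")
```

In this model the units are never idle, yet the absolute IPC ranges from 0.4 to 1.2 purely as a function of width.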

Another point:

Execution of memory instructions, e.g. mov ryx, [address], and likewise add with a memory operand (assuming L1 cache hits at 4 cycles):

on a 1-wide superscalar architecture: 1 / 4 * 1 = 0.25
on a 2-wide superscalar architecture: 1 / 4 * 2 = 0.5
on a 3-wide superscalar architecture: 1 / 4 * 3 = 0.75

You can also have partial memory stalls, where the above values are simply lower while all units are still busy.

Then there are full pipeline stalls:

e.g. executing 1-cycle instructions at full rate for 95% of the instructions and stalling for 15 cycles on the other 5%:

on a 1-wide superscalar architecture: 1 / (0.95 / 1 + 0.05 * 15) = 0.59
on a 2-wide superscalar architecture: 1 / (0.95 / 2 + 0.05 * 15) = 0.82
on a 3-wide superscalar architecture: 1 / (0.95 / 3 + 0.05 * 15) = 0.94

Again, the IPC is only 0.94, but all three units are 100% busy for 95% of the executed instructions.
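
The stall arithmetic above can be checked with a small Python sketch. It assumes the same model as the post: non-stalling instructions retire at the full width, and each stalling instruction costs a fixed 15 cycles. The function name and parameter names are invented for this example.

```python
def ipc_with_stalls(width, stall_fraction=0.05, stall_cycles=15):
    """Average IPC when a fraction of instructions causes a full pipeline stall."""
    # Average cycles spent per instruction: full-rate part plus stalled part.
    cycles_per_instruction = (1 - stall_fraction) / width + stall_fraction * stall_cycles
    return 1 / cycles_per_instruction


for width in (1, 2, 3):
    print(f"{width}-wide: IPC = {ipc_with_stalls(width):.2f}")
```

Tripling the width raises IPC from about 0.59 to about 0.94, because the fixed stall term dominates the cycle count.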

So I'll say it again: don't draw conclusions about the usefulness or utilization of pipelines/units from absolute IPC numbers. That does not work.

IPC is only useful when compared relatively.

All this shows that there are exactly three ways to improve IPC:
a) reduce instruction latency
b) more pipelines / units
c) fewer stalls (branch prediction, memory access: more cache, faster cache, prefetcher)

Again, if we check this for Bulldozer:
a) increased latencies
b) fewer pipelines / units
c) stalls about the same (improved: prefetcher and more in flight, better prediction assumed; worse: cache size, cache speed, higher misprediction penalty)

So it is crystal clear that Bulldozer will have significantly lower IPC than K8/K10. By how much, relatively, we will have to see. AMD itself claims a loss of only 10% in IPC. I personally doubt that and think they lose more.

AMD focused on throughput (more cores) and on raw frequency to improve speed; they did not focus on IPC with Bulldozer.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
Again, if we check this for Bulldozer:
a) increased latencies
b) fewer pipelines / units
c) stalls about the same (improved: prefetcher and more in flight, better prediction assumed; worse: cache size, cache speed, higher misprediction penalty)

So it is crystal clear that Bulldozer will have significantly lower IPC than K8/K10. By how much, relatively, we will have to see. AMD itself claims a loss of only 10% in IPC. I personally doubt that and think they lose more.

Oh for f***s sake.

a) Based on the compiler documentation provided, the latency of most common instructions is going to stay the same (with the exception of L1 loads).

b) BD has more pipelines than Stars. BD has 4 pipelines, Stars has 3. BD's pipelines are less generic, so it only has 2 ALUs. Most x86 code doesn't have enough ALU ops to keep 3 ALUs busy -- it does, however, have enough interleaved memory ops that separating the ALUs and the memory units is worthwhile. This is one of the reasons current Intel processors are faster than the AMD ones -- they can do memory ops at the same time as they are using their ALUs.

c) The prefetchers, and especially the branch predictors, have enough room for improvement that the speed increase they can give would completely dwarf any other change to the processor. Based on die shots, BD finally has sufficient SRAM room (separate from the L1 cache) to store proper branch prediction data.

Each and every one of these points has been made to you, most in this very thread. Please respond to them, or stop spreading bad analysis.
 

Arkadrel

Diamond Member
Oct 19, 2010
3,681
2
0
So it is crystal clear that Bulldozer will have significantly lower IPC than K8/K10. By how much, relatively, we will have to see. AMD itself claims a loss of only 10% in IPC. I personally doubt that and think they lose more.

1)
I'm pretty sure JF-AMD has been saying IPC increases a tiny bit (so no, AMD doesn't claim they lose IPC with Bulldozer).

2)
Who cares if an Intel CPU with higher IPC is faster @ 2,000 MHz than an AMD one?

These Bulldozers were designed to run fast... So the solution is to just run AMD at higher MHz than the Intel ones. IPC alone is meaningless; it's about overall performance, and by running your CPU faster than the competitor's you can still have a faster CPU even with lower IPC.
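
The trade-off between IPC and clock speed can be sketched in a few lines of Python, assuming single-thread performance scales roughly as IPC times clock. The IPC and clock figures below are hypothetical placeholders, not measurements of any real chip, and the function names are invented for this example.

```python
def relative_performance(ipc, clock_ghz):
    """Rough single-thread throughput in billions of instructions per second."""
    return ipc * clock_ghz


def break_even_clock(ipc_low, ipc_high, clock_high_ghz):
    """Clock the lower-IPC chip needs in order to match the higher-IPC chip."""
    return clock_high_ghz * ipc_high / ipc_low


# Hypothetical numbers for illustration only.
high_ipc_chip = relative_performance(ipc=1.25, clock_ghz=3.2)
low_ipc_chip = relative_performance(ipc=1.0, clock_ghz=4.2)

print(f"high-IPC chip: {high_ipc_chip:.2f}")
print(f"low-IPC chip:  {low_ipc_chip:.2f}")
print(f"break-even clock: {break_even_clock(1.0, 1.25, 3.2):.2f} GHz")
```

With these made-up figures, the lower-IPC chip comes out ahead once it clocks above the break-even frequency.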
 

Riek

Senior member
Dec 16, 2008
409
14
76
Let me try to explain this fundamental misunderstanding of absolute IPC.

An absolute IPC value of x.y means nothing on its own; it is only meaningful as a relative comparison on exactly the same code.

Say I have a program whose main loop does 20 adds, 5 muls and 1 div, with latencies of 1, 4 and 25 cycles respectively. Then the IPC is:
cycles: (20 * 1 + 5 * 4 + 1 * 25) = 65
on a 1-wide superscalar architecture: 26 / 65 * 1 = 0.4
on a 2-wide superscalar architecture: 26 / 65 * 2 = 0.8
on a 3-wide superscalar architecture: 26 / 65 * 3 = 1.2

As you can see, the absolute number has nothing to do with unit utilization. All units are completely busy at all times.

Yes, that is simplified, but it shows the point.

Everybody else already debunked your stuff, but here goes one more on your 'calculations'.


Let's take the loop you talk about:
20 ADDs, each taking 1 cycle.
5 MULs, each taking 4 cycles.
1 DIV, taking 20 cycles.

We are using a pipelined architecture.
1-wide => 1 pipeline for all instructions
2-wide => 1 pipeline for ADD, one for ADD/MUL/DIV
3-wide => 1 ADD, 1 ADD/shuffle, 1 ADD/MUL/DIV


1-wide => schedule DIV + 5 MULs (4 cycles each, = 9 cycles) = 20 cycles + 10 cycles for the adds (10 are scheduled during the DIV and after the MULs; supposedly it's OoO with non-dependent execution) = 30 cycles for those instructions, 26/30 = 0.87

2-wide => 20 cycles in pipe1 and 20 in pipe0 -> 26/20 cycles = 1.3

3-wide => 20 cycles in pipe2, 10 in pipe0 and 10 in pipe1 -> 26/20 cycles = 1.3

Or let's take the following structure:
3-wide => 1 ALU, 1 ADD/MUL, 1 ADD/DIV
-> 20 cycles of DIV in pipe2, 9 cycles in pipe1 (MULs, +4 ADDs) + 3 cycles for 3 ADDs, 13 cycles in pipe0
-> 26/20 cycles = 1.3.


Yes, a huuuge improvement on this one.
And of course, if they are dependent, then having multiple execution resources brings no benefit either.
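
The hand schedule above can be approximated with a small greedy scheduler in Python. This is only a sketch under idealized assumptions that go beyond the post: all 26 instructions are independent, every pipe is fully pipelined and issues one instruction per cycle, and "ALU" means an ADD-only pipe. The pipe layouts follow the post; the names `schedule_cycles` and `LAYOUTS` are made up. Under these assumptions the 1-wide case finishes in 26 cycles rather than the 30 counted by hand, while the wider layouts are all bound by the 20-cycle DIV.

```python
LATENCY = {"add": 1, "mul": 4, "div": 20}
LOOP = {"add": 20, "mul": 5, "div": 1}

# Each layout lists, per pipe, the set of instruction kinds that pipe accepts.
LAYOUTS = {
    "1-wide": [{"add", "mul", "div"}],
    "2-wide": [{"add"}, {"add", "mul", "div"}],
    "3-wide": [{"add"}, {"add"}, {"add", "mul", "div"}],
    "3-wide alt": [{"add"}, {"add", "mul"}, {"add", "div"}],
}


def schedule_cycles(pipes, op_counts=LOOP, latency=LATENCY):
    """Greedy schedule of independent ops; returns the cycle the last op finishes."""
    # Issue the longest-latency instructions first.
    ops = sorted(
        (kind for kind, count in op_counts.items() for _ in range(count)),
        key=lambda kind: latency[kind],
        reverse=True,
    )
    next_issue = [0] * len(pipes)  # next free issue slot per pipe
    finish = 0
    for kind in ops:
        capable = [i for i, accepts in enumerate(pipes) if kind in accepts]
        pipe = min(capable, key=lambda i: next_issue[i])
        finish = max(finish, next_issue[pipe] + latency[kind])
        next_issue[pipe] += 1
    return finish


total = sum(LOOP.values())
for name, pipes in LAYOUTS.items():
    cycles = schedule_cycles(pipes)
    print(f"{name}: {cycles} cycles, IPC = {total / cycles:.2f}")
```

Going from 2-wide to 3-wide changes nothing here, because the single long DIV sets the loop time.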

Execution resources have diminishing returns. Above 2, the return gets lower and lower as the cost/complexity increases and the performance benefit shrinks. Of course you can still increase performance by adding more execution resources, implementing a higher L/S and widening your OoO window while increasing your decoding bandwidth. But all that costs a lot of additional die space and complexity, which might not be warranted by the extra performance per cycle (since you will get fewer cycles, a bigger die, longer development time, etc.).
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
So it is crystal clear that Bulldozer will have significantly lower IPC than K8/K10. By how much, relatively, we will have to see. AMD itself claims a loss of only 10% in IPC. I personally doubt that and think they lose more.

AMD focused on throughput (more cores) and on raw frequency to improve speed; they did not focus on IPC with Bulldozer.

Do you anticipate the quad-core Llano desktop CPUs (the 100W TDP ones) to outperform the quad-core (2-module) Zambezi CPUs?

Considering the clockspeed and core counts of PhII X4s and X6s at 45nm, I would not argue that PhII was designed for low clockspeeds.

If what you say is true, then why would AMD bother developing Bulldozer at all, considering they also already went to the expense of shrinking the K10 core to 32nm?

Surely PhII on 32nm is going to clock even higher than it does on 45nm, and with already small cores shrunk even further at 32nm they could easily pack 8 cores onto one die and run with that.

Somehow I just don't see AMD having the options you would have us believe they had (low-IPC Bulldozer versus high-IPC 32nm K10) and making the choices they made.
 

996GT2

Diamond Member
Jun 23, 2005
5,212
0
76
Do you anticipate the quad-core Llano desktop CPUs (the 100W TDP ones) to outperform the quad-core (2-module) Zambezi CPUs?

Considering the clockspeed and core counts of PhII X4s and X6s at 45nm, I would not argue that PhII was designed for low clockspeeds.

If what you say is true, then why would AMD bother developing Bulldozer at all, considering they also already went to the expense of shrinking the K10 core to 32nm?

Surely PhII on 32nm is going to clock even higher than it does on 45nm, and with already small cores shrunk even further at 32nm they could easily pack 8 cores onto one die and run with that.

Somehow I just don't see AMD having the options you would have us believe they had (low-IPC Bulldozer versus high-IPC 32nm K10) and making the choices they made.

Since when was K10 considered high IPC? It's got the same IPC as Core 2 Quad, if not lower.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
I'm sure he meant relative to the Bulldozer that HW2050Plus was talking about.

Yep, that was the context of the post to which I was responding.

And while the K10 does not have the highest IPC of all time, save for a few microarchitectures it does have higher IPC than just about everything else produced by humankind in the 20 years prior, which is an accomplishment in its own right.
 

Abwx

Lifer
Apr 2, 2011
11,172
3,869
136
And while the K10 does not have the highest IPC of all time, save for a few microarchitectures it does have higher IPC than just about everything else produced by humankind in the 20 years prior, which is an accomplishment in its own right.

And it still has very good IPC by current standards.
Intel's processors have the benefit of software that is better optimised
for their architectures.

A very interesting analysis of the latencies and throughput
of all current architectures:
http://gmplib.org/~tege/x86-timing.pdf
 

Mopetar

Diamond Member
Jan 31, 2011
8,024
6,483
136
New socket next year ??

They've said they're going to start using Bulldozer cores in their next generation of APUs so it does make sense that the socket will change. However, I'm not certain if the AM3+ is going anywhere for a while, unless AMD stops selling anything but APUs in the consumer space.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106


New socket next year ??

It raises the question: why release a new socket if you're only going to use it for a year? But they may have realized that they need more than dual-channel memory if they're stuffing 10+ cores in a box...
 

drizek

Golden Member
Jul 7, 2005
1,410
0
71
It raises the question: why release a new socket if you're only going to use it for a year? But they may have realized that they need more than dual-channel memory if they're stuffing 10+ cores in a box...

What new socket? AM3+ is almost identical to AM3.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
What new socket? AM3+ is almost identical to AM3.

Key word is almost. Supposedly, Bulldozer ver. 1 is going to be physically compatible with AM3, even though AM3+ adds pins. If Bulldozer ver. 2 is using yet another socket, why bother with AM3+ at all if those pins are never going to be used?
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
Interesting. I can't recall the timeline for AMD's 22nm node. Will Komodo be built on that node?

jimbo posted a link to a GloFo timeline that showed they intended to have 22nm SHP (the stuff AMD uses) available for customers in 2H 2012.
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,655
136
Key word is almost. Supposedly, Bulldozer ver. 1 is going to be physically compatible with AM3, even though AM3+ adds pins. If Bulldozer ver. 2 is using yet another socket, why bother with AM3+ at all if those pins are never going to be used?

True platform vs. upgrade path, also because AMD spent a year thinking they weren't going to be able to pull off AM3 support.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
jimbo posted a link to a GloFo timeline that showed they intended to have 22nm SHP (the stuff AMD uses) available for customers in 2H 2012.


I sure hope AMD's Komodo will come out in 22nm and in 2H12. I have some doubts about Komodo because of the extra engineering resources that were probably needed for the re-spins of Zambezi.
 

Mopetar

Diamond Member
Jan 31, 2011
8,024
6,483
136
jimbo posted a link to a GloFo timeline that showed they intended to have 22nm SHP (the stuff AMD uses) available for customers in 2H 2012.

Of course GloFo, TSMC, et al. haven't had problems meeting timelines before. :sneaky: :hmm:
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
Komodo and Trinity will be made on GloFo's 32nm process; we will not see 22nm from GloFo before Q2 2013.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
Pffft... So with 22nm SB-E/Ci7s coming out the end of this year - AMD will be relegated to good bang/buck category as usual instead of producing 'enthusiast' chips.
 