HW2050Plus
Member
- Jan 12, 2011
According to this graph, a C2D runs at an average of about 1.2 IPC.
SB must be at roughly 1.6 IPC, so we can assume that BD's
two integer execution units are unlikely to be saturated, and also
that they would have to run at about 80% of their max throughput
to equal SB's IPC.
We have information that the BD architecture is optimised to sustain
a high IPC, that is, one as constant as possible and close to
the max throughput.
Of course, things are different depending on whether we take integer or FP code
to check the actual IPC.
Let me try to explain this fundamental misunderstanding of absolute IPC.
If you have an IPC with the absolute value x.y, that actually means nothing unless it is used as a relative comparison with exactly the same code.
E.g. I have a program whose main loop does 20 adds, 5 muls and 1 div (26 instructions). Then I get an IPC on:
cycles: (20 * 1 + 5 * 4 + 1 * 25) = 65
on a 1-wide superscalar architecture: 26 / 65 * 1 = 0.4
on a 2-wide superscalar architecture: 26 / 65 * 2 = 0.8
on a 3-wide superscalar architecture: 26 / 65 * 3 = 1.2
As you see, the absolute number has nothing to do with unit utilisation: all units are completely busy at all times.
Yes, that is simplified, but it shows the point.
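The arithmetic above can be checked with a small sketch. The latencies of 1, 4 and 25 cycles for add, mul and div are the example's hypothetical values, not real measurements for any particular CPU:

```python
# Instruction mix from the example: 20 adds (1 cycle), 5 muls (4 cycles),
# 1 div (25 cycles). On a machine whose units are always fully busy,
# widening the machine multiplies throughput.
instructions = 20 + 5 + 1                 # 26 instructions total
cycles = 20 * 1 + 5 * 4 + 1 * 25          # 65 cycles on a 1-wide machine

for width in (1, 2, 3):
    ipc = instructions / cycles * width   # units 100% busy in every case
    print(f"{width}-wide: IPC = {ipc:.1f}")
# 1-wide: IPC = 0.4
# 2-wide: IPC = 0.8
# 3-wide: IPC = 1.2
```

The point the numbers make: the same fully saturated workload reports three different "absolute" IPC values depending only on machine width.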
Another point:
Execution of memory instructions, e.g. a dependent chain of `mov rax, [address]` loads with a 4-cycle latency (assuming L1 cache hits):
on a 1-wide superscalar architecture: 1 / 4 * 1 = 0.25
on a 2-wide superscalar architecture: 1 / 4 * 2 = 0.5
on a 3-wide superscalar architecture: 1 / 4 * 3 = 0.75
And then you can have partial memory stalls, where the above values are just lower but all units remain busy.
And you have the situation of full pipeline stalls:
e.g. executing 1-cycle instructions at full rate for 95% of instructions and stalling (15 cycles) on the other 5%:
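The load case follows the same model; the 4-cycle L1 load latency is again the example's assumed figure:

```python
# A dependent chain of loads that each hit L1 with a hypothetical
# 4-cycle latency: each lane completes one load every 4 cycles.
load_latency = 4

for width in (1, 2, 3):
    ipc = 1 / load_latency * width
    print(f"{width}-wide: IPC = {ipc}")
# 1-wide: IPC = 0.25
# 2-wide: IPC = 0.5
# 3-wide: IPC = 0.75
```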
on a 1-wide superscalar architecture: 1 / (0.95 / 1 + 0.05 * 15) = 0.59
on a 2-wide superscalar architecture: 1 / (0.95 / 2 + 0.05 * 15) = 0.82
on a 3-wide superscalar architecture: 1 / (0.95 / 3 + 0.05 * 15) = 0.94
Again the IPC is only 0.94, yet all three units run 100% busy for 95% of the executed instructions.
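The stall model works out like this (the 5% stall fraction and 15-cycle stall cost are the example's assumed numbers):

```python
# 95% of instructions retire at the machine's full width; 5% stall for a
# hypothetical 15 cycles each. Average cycles per instruction (CPI) is
# 0.95 / width + 0.05 * 15, and IPC is its reciprocal.
stall_fraction = 0.05
stall_cycles = 15

for width in (1, 2, 3):
    cpi = (1 - stall_fraction) / width + stall_fraction * stall_cycles
    print(f"{width}-wide: IPC = {1 / cpi:.2f}")
# 1-wide: IPC = 0.59
# 2-wide: IPC = 0.82
# 3-wide: IPC = 0.94
```

Note how quickly the fixed stall term dominates: going from 2-wide to 3-wide only lifts IPC from 0.82 to 0.94 even though the extra unit is busy whenever work is available.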
So I say it again and again: don't conclude from absolute IPC numbers the usefulness or busyness of pipelines/units. That does not work out.
IPC is only useful when compared relatively.
All this shows exactly that there are three ways to improve IPC:
a) reduce instruction latency
b) more pipelines / units
c) less stalls (branch prediction, memory access: more cache, faster cache, prefetcher)
Again if we check for Bulldozer:
a) increased latencies
b) less pipelines / units
c) stalls about the same (improved: prefetchers and more instructions in flight, presumably better branch prediction; worse: cache size, cache speed, higher misprediction penalty)
So it is crystal clear that Bulldozer will have significantly lower IPC than K8/K10. By how much, relatively, we will have to see. AMD itself claims a loss of only 10% in IPC. I personally doubt that and think they lose more.
AMD focused on throughput (more cores) and on raw frequency to improve speed; they did not focus on IPC with Bulldozer.