Originally posted by: TuxDave
So if AMD were to double A64's efficiency which is easily realistic given A64's dual state 64-bit instruction pipeline, lack of L3 cache, and less efficient memory controller, it'd be roughly 50% more efficient than C2D per clock cycle.
I'll have to quote you on that.... but due to lack of information I can't make a guess on my own.
Sure
"However, Barcelona is far more than a quad-core K8 with an L3 cache. We estimate the
number of non-cache transistors in a dual-core Athlon 64 X2 to be approximately 94M, and the Barcelona core is around 247M; even doubling the dual-core K8 figure won't get you close to Barcelona. Note that simply doubling the 94M number also isn't an accurate comparison as Barcelona only features a single on-die Northbridge. In essence, there are more than 60M additional transistors (or more than 15M per core) that went into architectural enhancements outside of more cores and cache in Barcelona."
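A quick sanity check of the arithmetic in that quote, as a rough sketch. It uses only the two estimates quoted above; the size of the Northbridge is not given, so it is left out rather than guessed. Naive doubling counts the Northbridge twice, which means the difference computed here is a lower bound, consistent with the article's "more than 60M".

```python
# Estimates quoted above, in millions of non-cache transistors.
K8_DUAL_CORE_M = 94   # dual-core Athlon 64 X2
BARCELONA_M = 247     # quad-core Barcelona
CORES = 4

# Naive "quad-core K8": two dual-core K8s. This double-counts the
# Northbridge, so it overstates what a plain quad-core K8 would need.
naive_quad_k8_m = 2 * K8_DUAL_CORE_M

# Transistors left over for architectural changes (lower bound).
extra_m = BARCELONA_M - naive_quad_k8_m
extra_per_core_m = extra_m / CORES

print(naive_quad_k8_m)    # 188
print(extra_m)            # 59
print(extra_per_core_m)   # 14.75
```

Removing the duplicated Northbridge from the 188M figure pushes the leftover above 59M, which is how the article gets to "more than 60M" and "more than 15M per core".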
Originally posted by: dmens
Originally posted by: RussianSensation
So if AMD were to double A64's efficiency which is easily realistic
Double? Yeah right. It ain't 1990.
That kind of improvement may exist for some specialized FP/vector codes where the only real limiting factor is machine bandwidth. Like that SpecFP demo the other day, pretty slick. For general code? No chance.
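To put rough numbers on the distinction between bandwidth-bound code and general code, here is a minimal Amdahl's-law sketch. The fractions are hypothetical placeholders, not measurements of K8, Barcelona, or any benchmark; the function name is made up for illustration.

```python
def overall_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only part of the run time gets faster."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / local_speedup)


# Hypothetical fractions of run time limited by a resource that gets doubled.
for fraction in (0.1, 0.5, 0.9, 1.0):
    print(f"{fraction:.0%} of run time doubled -> {overall_speedup(fraction, 2.0):.2f}x overall")
```

Only when essentially all of the run time is limited by the doubled resource does the overall speedup approach 2x; at 50% it is about 1.33x.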
Core 2 Duo is 90-100% more efficient per clock cycle than the P4 NetBurst architecture. Here is why doubling is realistic for AMD:
AMD Architecture Comparison
K8 vs. Barcelona
1. SSE Execution Width: 64-bit vs. 128-bit (double)
2. Instruction Fetch Bandwidth: 16 bytes/cycle vs. 32 bytes/cycle (double)
3. Data Cache Bandwidth: 2 x 64-bit loads/cycle vs. 2 x 128-bit loads/cycle (double)
4. L2/Northbridge Bandwidth: 64 bits/cycle vs. 128 bits/cycle (double)
5. FP Scheduler Depth: 36 dedicated x 64-bit ops vs. 36 dedicated x 128-bit ops (double)
6. Barcelona adds a 512-entry indirect predictor. For reference, in the 253.perlbmk test of SPEC CPU2000 the reduction in mispredicted branches with Prescott was significant, reaching almost 55%.
7. The inclusion of an indirect predictor wasn't the only crystal ball improvement in Barcelona; the size of the return stack in the new core is double what it was in K8.
8. One major aspect of Intel's Core micro-architecture advantage is its ability to allow load instructions to bypass previous load and store instructions. On average, about 1/3 of all instructions in a program end up being loads, thus if you can improve load performance you can generally impact overall application performance pretty significantly. AMD's K8 architecture had no equivalent scheme for allowing the out of order execution of loads ahead of other loads and stores!!! Barcelona can now re-order loads ahead of other loads, just like Core 2 can. It can also execute loads ahead of other stores.
Barcelona can generate up to three store addresses per clock as it has three AGUs (Address Generation Units) compared to Intel's one for stores.
9. The K8 core featured a single memory controller that was 128 bits wide, but in Barcelona AMD has split up the DRAM controller into two separate 64-bit controllers. Each controller can be operated independently and thus you get some improvements in efficiency, especially when dealing with quad core implementations where the individual cores working on independent threads all have their own memory access patterns. Now, instead of executing writes as soon as they show up, writes are stored in a buffer and once the buffer reaches a preset threshold the controller bursts the writes sequentially. What this avoids is the costly read/write switch penalty, helping improve bandwidth efficiency and reduce latency. [Each Barcelona core gets its own set of data and instruction prefetchers, but the major improvement is that there's a new prefetcher in town - a DRAM prefetcher. The DRAM prefetcher doesn't pull data into the CPU's L2 or L3 caches either; instead it features its own buffer to avoid polluting the caches.]
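The write buffering described in item 9 can be illustrated with a toy model. This is only a sketch under loud assumptions: the request stream, the threshold of 4, and all function names are hypothetical, and it ignores timing, addresses, and the need to service a read that hits a buffered write. It simply counts how often the bus has to switch between reading and writing.

```python
from typing import Iterable, List


def count_turnarounds(issued: Iterable[str]) -> int:
    """Count read<->write direction switches in an issued command stream."""
    switches = 0
    previous = None
    for op in issued:
        if previous is not None and op != previous:
            switches += 1
        previous = op
    return switches


def issue_in_order(requests: str) -> List[str]:
    """Issue every request as soon as it shows up."""
    return list(requests)


def issue_with_write_buffer(requests: str, threshold: int) -> List[str]:
    """Issue reads immediately; hold writes and burst them at the threshold."""
    issued: List[str] = []
    pending_writes = 0
    for op in requests:
        if op == "W":
            pending_writes += 1
            if pending_writes >= threshold:
                issued.extend("W" * pending_writes)  # burst the buffered writes
                pending_writes = 0
        else:
            issued.append(op)
    issued.extend("W" * pending_writes)  # drain whatever is left at the end
    return issued


# Hypothetical worst case: reads and writes strictly alternating.
requests = "RW" * 8
in_order = issue_in_order(requests)
buffered = issue_with_write_buffer(requests, threshold=4)

print(count_turnarounds(in_order))   # 15
print(count_turnarounds(buffered))   # 3
```

In this made-up stream the same 16 requests need 15 direction switches when issued in arrival order and 3 when writes are batched four at a time. Real traffic will not be this extreme, so treat the ratio as illustrative only.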
I am not even touching on the major changes in power management, such as Barcelona's Northbridge now running on a separate power plane and the ability of each core to run at a different clock speed depending on load.
[AMD's first quad core part to operate within the same thermal envelope as current Opterons.]
Source:
AMD on the Counterattack