Originally posted by: Peter
The reason is rather the reverse. The Athlon has a much shorter processing pipeline and much better caching and prefetching (partly due to its much larger L1 cache). Thus its rather slow front side bus has little negative effect.
Pipeline length has little if anything to do with memory bandwidth needs. You did, however, hit the nail on the head about caching. As for "better", that depends on what you deem "better". The Athlon's cache fetches 64 bytes per cacheline, and its relatively large, exclusive L1 cache means it can access memory a bit less often, with each access transferring less.
On the flip side, in a program with high data locality, transferring a bigger cacheline means you get more of the information you need at once. The P4 has a 128-byte cacheline (fetched in 64-byte sectors). In programs with high data locality, its cache can fetch more useful information per access and feed it to the processor. This is why it thrives on massive bandwidth: more room to prefetch more and more data into cache.
Both approaches have advantages and disadvantages. Smaller cacheline size means more granularity and less "waste" of memory bandwidth in programs with low data locality. Larger cacheline size means you are able to effectively use memory bandwidth to increase cache hit rates and mask memory latency.
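To make the tradeoff concrete, here's a toy model of the two cacheline sizes. All numbers are illustrative only, not measurements of real Athlon or P4 hardware:

```python
# Toy model of the cacheline-size tradeoff described above.
# Numbers are made up for illustration, not real hardware figures.

LINE_SMALL = 64    # Athlon-style cacheline, bytes
LINE_LARGE = 128   # P4-style cacheline, bytes

def useful_fraction(bytes_needed_per_miss, line_size):
    """Fraction of each fetched line the program actually touches."""
    return min(bytes_needed_per_miss, line_size) / line_size

# Low data locality: the program only wants 8 bytes per miss.
# The bigger line "wastes" twice the bandwidth on unused data.
print(useful_fraction(8, LINE_SMALL))   # 0.125  -> 12.5% of the fetch is useful
print(useful_fraction(8, LINE_LARGE))   # 0.0625 -> only 6.25% is useful

# High data locality: the program streams through 128 contiguous bytes.
# The large line satisfies that in one fetch; the small line needs two.
print(128 // LINE_LARGE)   # 1 memory access
print(128 // LINE_SMALL)   # 2 memory accesses
```

The same bandwidth budget either gets wasted (low locality, big lines) or buys you fewer memory accesses (high locality, big lines), which is exactly the tradeoff described above.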
The P4, on the other hand, is very allergic to pipeline stalls, thus requiring the fastest and bestest path to RAM it can get to show some good performance. Since Intel doesn't currently put the RAM controller directly into the CPU like AMD is now doing, they need this very high FSB speed and the dual channel RAM arrangement.
Yes and no. Again, a common misconception is that a long-but-narrow design somehow suffers more from waiting than a short-but-wide design. Not really true. Both need to be fed data and both suffer when that data isn't fed. Both waste the same amount of transistor idle time; the only difference is that the statistical IPC of the short-but-wide design will be higher.
To illustrate, let's use a simplistic example. You have processor A, running at 2 GHz and able to issue at peak 3 instructions per cycle. You have processor B, running at 1 GHz and able to issue at peak 6 instructions per cycle. You have a memory subsystem that runs at 100 MHz, and (for simplicity's sake) it takes 5 memory cycles to get data from memory. On the 2 GHz machine, one memory cycle spans 2 GHz / 100 MHz = 20 CPU cycles, so a memory fetch means waiting 20 * 5 = 100 cycles for the data. In that time, your transistors sit idle. In that same time, they could have executed 100 * 3 = 300 instructions.
Now let's look at processor B. For it, one memory cycle spans only 10 CPU cycles, so the same fetch costs 10 * 5 = 50 cycles, during which it could have executed 50 * 6 = 300 instructions. Either way, the transistors (all those parallel execution, decoding, scheduling units, etc.) that could have been doing work sit idle, wasting power.
Both processors waste the same amount of potential performance per second. The only difference is that, since processor B waits fewer cycles, its statistical IPC is higher than processor A's (exactly double, in this idealized example).
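The two-processor example above can be checked as straight arithmetic:

```python
# The stall-cost example above, as arithmetic.
# Processor A: 2 GHz, peak 3 instructions/cycle.
# Processor B: 1 GHz, peak 6 instructions/cycle.
# Memory: 100 MHz, 5 memory cycles per fetch.

MEM_HZ = 100e6
MEM_CYCLES_PER_FETCH = 5

def stall_cost(cpu_hz, peak_ipc):
    """CPU cycles spent waiting on one fetch, and instructions lost in that time."""
    cpu_cycles_per_mem_cycle = cpu_hz / MEM_HZ
    stall_cycles = cpu_cycles_per_mem_cycle * MEM_CYCLES_PER_FETCH
    lost_instructions = stall_cycles * peak_ipc
    return stall_cycles, lost_instructions

print(stall_cost(2e9, 3))   # A: (100.0, 300.0) -> 100 idle cycles, 300 lost instructions
print(stall_cost(1e9, 6))   # B: (50.0, 300.0)  -> 50 idle cycles, same 300 lost
```

Both designs forfeit 300 instructions per fetch; B just spreads the loss over half as many (slower) cycles, which is why its measured IPC comes out higher.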
So no: long pipeline, short pipeline, whatever, it doesn't matter for achieving full performance. All that matters is how much potential throughput your processor *could* have (be it an IPC of 6 at 1 GHz or an IPC of 3 at 2 GHz) versus how much it actually achieves once memory bottlenecks take their toll.
Different caching methods aimed at different types of memory subsystems deal with this in different ways. The P4's caching system targets high-bandwidth, high-latency memory subsystems. It spends large amounts of bandwidth fetching extra data that could potentially be useful, and so achieves a better cache hit rate with less cache. However, this is also very demanding on the memory subsystem.
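The payoff of buying hit rate with bandwidth can be sketched with the standard average-memory-access-time formula. The cycle counts and hit rates below are invented for illustration:

```python
# Back-of-the-envelope: how cache hit rate masks memory latency.
# AMAT (average memory access time) = hit_time + miss_rate * miss_penalty.
# All cycle counts and rates here are invented, not real P4/K7 numbers.

def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    """Average cycles per memory access, given a cache hit time and miss penalty."""
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# Suppose aggressive prefetching into bigger lines cuts the miss rate
# from 5% to 2%; against a 100-cycle miss penalty, that matters a lot.
print(amat(2, 0.05, 100))   # 7.0 cycles per access on average
print(amat(2, 0.02, 100))   # 4.0 cycles per access on average
```

A few percentage points of hit rate, bought with extra bandwidth, translate directly into fewer average cycles per access, which is the whole point of the P4's approach.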
The other approach is something like the K7's: use memory bandwidth more frugally, at the cost of some peak performance. And if your memory subsystem already supplies more than your prefetching scheme can use, extra bandwidth won't help.
Those who think the K7 simply isn't memory-limited should look at the wonders the integrated memory controller has done for the K8 (which uses a very similar core, almost unchanged save for the 64-bit extension and more efficient integer scheduling method).