The reason is simple. The memory footprint used to be large enough that it would basically only fit into the cache of large-cache CPUs. Now it's smaller, and fits into the cache of smaller-cache CPUs. To make up for the reduced time it would take to complete a WU on an "average" smaller-cache CPU, they gave it more work to do (they have waaay more processing power available than they need at this point, so why not soak it up doing more useful stuff, right?). So while the net effect is that it's faster for the smaller-cache CPUs, the larger-cache ones were just fine before. But now, they have more work to do.
Basically, the smaller-cache CPUs used to be bound by the fact that the memory footprint was just too large. That's not the case anymore, so they added more work to do. The large-cache ones were nearly as fast as they were going to get, because the memory footprint/bandwidth issue, well, wasn't an issue for them. Now, they have more work to do.
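To make the trade-off concrete, here's a toy model of it. Every number in it (cache size, footprint, work amount, miss penalty) is made up for illustration, not a real figure from the project:

```python
# Toy model: a WU is roughly compute-bound unless its memory footprint
# spills out of cache, in which case every access gets much slower.
# All numbers are hypothetical, chosen only to illustrate the argument.

CACHE_KB = 256  # hypothetical L2 size of a "smaller cached" CPU

def wu_time(footprint_kb, compute_units, cache_kb=CACHE_KB):
    """Crude WU time estimate: 1 time unit per unit of work,
    with a 10x penalty (arbitrary) when the footprint exceeds cache."""
    memory_penalty = 10 if footprint_kb > cache_kb else 1
    return compute_units * memory_penalty

# Old core: big footprint, modest work -> memory-bound on the small cache.
old_small = wu_time(footprint_kb=512, compute_units=100)   # 100 * 10 = 1000

# New core: footprint fits in cache, so they piled on more work.
new_small = wu_time(footprint_kb=128, compute_units=400)   # 400 * 1 = 400

# A large-cache CPU never paid the penalty, so it only sees the extra work.
old_big = wu_time(512, 100, cache_kb=2048)   # 100
new_big = wu_time(128, 400, cache_kb=2048)   # 400

print(old_small, new_small, old_big, new_big)
```

The small-cache CPU goes from 1000 to 400 (a big win), while the large-cache CPU goes from 100 to 400, which matches the post's point: it was already fine, and now it simply has more work to do.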
[EDIT]
It'll certainly be interesting. The P4 has as much L2 cache as the P3, and the P4's L2 cache has the same latencies (per clock) but twice the bandwidth per clock. This is because it transfers data EVERY cycle, whereas the Cumine only did it every other cycle.
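The bandwidth claim works out with back-of-the-envelope arithmetic. The every-cycle vs. every-other-cycle transfer rates are from the post; the 256-bit (32-byte) bus width is my assumption for illustration only:

```python
# Back-of-the-envelope L2 bandwidth per clock.
# Assumes a 256-bit (32-byte) cache bus on both parts -- an illustrative
# assumption, not a verified spec.

BUS_BYTES = 32  # 256-bit bus

def l2_bandwidth(clock_hz, cycles_per_transfer):
    """Bytes per second delivered over the L2 bus."""
    return BUS_BYTES * clock_hz / cycles_per_transfer

cumine = l2_bandwidth(1_000_000_000, 2)  # transfers every other cycle
p4     = l2_bandwidth(1_000_000_000, 1)  # transfers every cycle

print(p4 / cumine)  # same clock, half the cycles per transfer
```

At the same clock, halving the cycles per transfer doubles the bytes per clock, which is exactly the "twice the bandwidth per clock" figure above.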
The latencies on RAMBUS aren't quite as bad as you'd expect at first, because it's in dual-channel form. I don't fully understand it, but adding another channel does more than double the bandwidth: it also lowers latencies. It's like a beefed-up i840, as far as I know.
But it's not RAMBUS technology that sucks. Generally speaking, it's Intel's implementation of it that sucks. If RAMBUS were really that bad, API wouldn't be using it. I've read a post where an API architect even said the only problem with RAMBUS on x86 is that Intel doesn't know how to deal with it, and that the chipsets for it are crap (I'm paraphrasing a lot there).