Originally posted by: funbun
Here is a sports car/sport motorcycle analogy:
Dodge Viper = V10 engine
Sport motorcycle = V2 (V-twin)
It's obvious that the Viper has tons more horsepower than the motorcycle. However, the motorcycle has a better power-to-weight ratio. In other words, the motorcycle doesn't have to work as hard to overcome its own weight as the big clunky Viper does. The motorcycle can out-accelerate, out-corner, and flat-out outperform the Viper.
I'm not fond of that analogy.
Consider a 2 litre 4 cyl engine at 3400rpm (3.4GHz P4) vs. a 3 litre 6 cyl engine at 2400rpm (AMD @ 2.4GHz) instead.
There may be something wrong with that picture; bear with me, guys, I don't know much about cars. I think you get the idea anyway.
Basically, the question posed here is how clock rate can be substituted. First, there are two comments to be made about the question as such: One is that, historically, there has never been much correlation between clock speed and the rate of instructions executed. It's just an unenlightened assumption.
The clock's purpose is just to synchronize the switching inside the CPU. The clock is _NOT_ some count of work performed!
Secondly, there is still something like a point to that intuitive assumption. But clock rate can be substituted the same way engine rpm can be substituted: by volume (displacement).
Early CPUs needed many clock cycles just to finish a single instruction. As the scale of integration grew, making more transistors available on the chip, the number of clock cycles needed to perform an instruction gradually decreased.
One of the most important techniques for accomplishing this is the pipeline. It works like the assembly line in a car factory. (Like the "McDonald's analogy": it is illustrative to think of instructions as items that have to be assembled.) The CPU works simultaneously on a sequence of instructions. The instructions 'travel' down the pipe, while different parts of the work are performed at stages along the pipe. Each instruction still needs many cycles to execute, but a finished instruction 'comes off' the end of the pipeline every few clock cycles, or even every cycle.
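To make the assembly-line point concrete, here's a back-of-the-envelope sketch (the 5-stage depth and instruction count are just illustrative numbers, not any particular CPU):

```python
# Toy pipeline arithmetic: each instruction takes `stages` cycles end to
# end, but once the pipe is full, one instruction completes per clock.

def cycles_unpipelined(n_instr, stages=5):
    # no overlap: every instruction occupies the whole CPU for all its stages
    return n_instr * stages

def cycles_pipelined(n_instr, stages=5):
    # fill the pipe once (stages cycles), then retire one per cycle after that
    return stages + (n_instr - 1)

print(cycles_unpipelined(1000))  # 5000 cycles
print(cycles_pipelined(1000))    # 1004 cycles
```

Even though any single instruction still takes 5 cycles, throughput approaches one instruction per clock as the run gets longer.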
At first it may seem like you can't execute more than one instruction per clock cycle, even on the best pipeline. And if that were true, clock rate would then finally become a factor. So here's the trick: First, the incoming chain is split, and instructions enter several parallel lines, not just one. These lines just figure out what should be done and what is needed. The instructions then enter a pool, where they wait until all the parts needed for their execution are available and ready. Once they are ready, they are dispatched, out of order (not wasting time waiting on latecomers), into the next stages of the 'assembly line', and, guess what, we have multiple parallel lines here again. These are the execution units. Simply speaking, the K7 and K8 each have three identical logical/integer units, and three individually different, specialized floating-point units. Once finished, the instructions enter a pool again. This is the queue where they are reordered into the correct sequence again, before the results are finalized/written.
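A minimal sketch of that "pool and dispatch" idea, assuming single-cycle latencies and the three identical integer units mentioned above (the sample program and register names are made up):

```python
# Toy out-of-order scheduler: instructions wait in a pool until all their
# source registers are ready, then up to `units` of them dispatch per cycle.
# Each instruction is (dest_register, [source_registers]).

def run_ooo(program, units=3):
    ready = set()          # registers whose values are available
    pool = list(program)   # instructions awaiting dispatch
    cycles = 0
    while pool:
        cycles += 1
        # pick up to `units` instructions whose sources are all ready
        issue = [i for i in pool if all(s in ready for s in i[1])][:units]
        if not issue:
            raise RuntimeError("unsatisfiable dependency in program")
        for dest, _ in issue:
            ready.add(dest)                    # result becomes available
        pool = [i for i in pool if i not in issue]
    return cycles

prog = [("r1", []), ("r2", []), ("r3", ["r1"]), ("r4", ["r2"]),
        ("r5", ["r3", "r4"]), ("r6", [])]
print(run_ooo(prog))  # 3 cycles for 6 instructions
```

A strict in-order, single-issue machine would need 6 cycles for this program; letting the independent `r6` dispatch early alongside `r1` and `r2` is exactly the "not waiting on latecomers" effect.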
Nothing I have mentioned concerns caches or handling I/O bottlenecks in any way at all. This is just raw execution at full speed.
The key to making it perform well is to break the sequential order of the instructions: "Out of Order execution". This is not a simple thing to accomplish, though. For one thing, each instruction needs its own version of the registers and their contents.
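That "own version of the registers" is what register renaming provides. A minimal sketch, with made-up register names (real CPUs do this in hardware with a finite physical register file):

```python
# Register renaming: every write to an architectural register gets a fresh
# physical register, so a later, independent reuse of the same register name
# no longer has to wait behind earlier instructions that used it.

def rename(program):
    mapping = {}       # architectural register -> current physical register
    next_phys = 0
    renamed = []
    for dest, sources in program:
        srcs = [mapping[s] for s in sources]  # read the current versions
        mapping[dest] = f"p{next_phys}"       # fresh version for this write
        next_phys += 1
        renamed.append((mapping[dest], srcs))
    return renamed

# the second write to r1 is independent of the first r1 and its reader
prog = [("r1", []), ("r2", ["r1"]), ("r1", [])]
print(rename(prog))  # [('p0', []), ('p1', ['p0']), ('p2', [])]
```

After renaming, the third instruction writes `p2` instead of `r1`, so it can execute out of order without clobbering the value the second instruction still needs.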
There are many details left out. I've tried to explain it simply, so the principles are transparent. I hope anyone can now see that, with these techniques, performance is pretty independent of clock rate. To recap: AMD's CPUs have 3 decoder lines, then roughly(sic) four schedulers dispatch into 6 lines of execution.
*********************
Many here have mentioned Intel's long pipelines vs. AMD's shorter ones as an explanation. That is also somewhat true, and I want to comment on that too.
The primary reason for having a very long pipeline is to make it possible to reach higher clock rates! High clock rate requires a long pipeline. This is because, in order to sync faster, the chains and lattices of transistors need to finish their switching faster. This is accomplished by keeping the transistor 'chains' shorter and simpler. This, however, also means that less can be done at each stage in the pipe, and the pipeline grows in length.
There seems to be some confusion about this "less work done": remember that the pipe still ticks off a completed instruction each clock!
The advantage (somewhat illusory) of the higher clock rate is that the pipe can dispatch finished instructions at a higher rate. There are a number of disadvantages with long/deep pipelines, though. One of them is that every time a branch enters the pipe, the CPU makes a guess about which string of instructions should follow it into the pipe. Only at the end of the pipe is the result of the branch at hand. If the guess was wrong, we lose all the work in the entire pipe. Code can be full of branches that are hard to guess. This cripples P4 performance on general code, compared to its excellent performance on menial loops (benchmarks and media). This is part of the reason for the poor Intel performance you can see in, for instance, Business Winstone 2002 and 2003, relative to AMD. But only part of the reason. The way P4 performance on that realistic type of application benchmark scales with FSB speed indicates that prefetch is hard to do well when you're concentrating on a deep pipe and high clock rates. So I would say AMD has better prefetch and branch prediction. AMD seems to get away much better with lower memory bandwidth, as well as smaller caches. There is, for instance, not much difference in performance between socket 754 and 939/940 so far. As the Athlons develop more muscles, there eventually will be. But not much for now.
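The trade-off can be put in rough numbers. Assuming a full-pipe flush on every mispredicted branch, and using illustrative figures (a ~30-stage deep pipe at 3.4GHz vs. a ~12-stage pipe at 2.4GHz; the misprediction rates are made up to show the trend, not measured):

```python
# Back-of-the-envelope: effective instruction rate when some fraction of
# instructions are mispredicted branches that flush the whole pipeline.

def eff_ghz(clock_ghz, depth, mispredicts_per_instr):
    # average cycles per instruction = 1 (steady-state pipe)
    # + depth cycles of lost work per misprediction
    cpi = 1 + mispredicts_per_instr * depth
    return clock_ghz / cpi

for miss in (0.01, 0.05):  # branchy code hurts the deep pipe far more
    deep = eff_ghz(3.4, 30, miss)      # deep pipe, high clock
    shallow = eff_ghz(2.4, 12, miss)   # shallow pipe, lower clock
    print(f"miss rate {miss}: deep {deep:.2f} vs shallow {shallow:.2f} Ginstr/s")
```

With rare mispredictions the deep/fast design wins on raw dispatch rate; as branches get harder to guess, the shallow/slower design pulls ahead, which is the "loops vs. general code" split described above.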
P.S. If you used the benchmark suites from, like, 4 years ago, AMD (the AthlonXP too) would destroy Intel completely. Today, most benchmarking (PCMark, SYSMark, media, etc.) concentrates on branchless SSE2 vector loops that fit into the cache. This is to some extent true of benchmarking the matrix multiplications inside games' 3D engines and 3D rendering software too. They too benefit Intel's vector processing. I wonder, for instance, how true the typical game benchmark is to real game performance. The benchmark only plays a 'movie' through the 3D engine. The thing that seems wrong to me about that is not only that the CPU will be tasked with other things in a real game, like AI, pathfinding, and physics simulation, but also that the caches will not be the exclusive playground of the 3D engine's vector code. Just looking at flight simulators, for example, AMD tends to humble Intel. Even though that is a CPU-intensive physics simulation, I'm wondering if that effect doesn't spill over to real games too?
AMD must be bitterly disappointed with how benchmarking has changed in what it shows of their CPUs and Intel's. But I think it's their own fault. I think Intel was absolutely right to concentrate on performance on menial loops, because this is the performance most users will mostly notice on a modern media-rich PC. The 'general' logic code that AMD is so good at mostly lasts only for milliseconds, between the huge chunks of data processed by the loops the P4 is good at.
On a side track: Intel has made efforts on branch prediction and prefetch in the Prescott. To some extent that (just like the 1MB cache) is probably eaten up by the even longer pipe. But maybe, just maybe, popular *Intel-biased* benchmarks are doing it an injustice compared to the P4C.
Even the Athlon64 has difficulties against the P4 in vector computing (in 32-bit mode). And that is not due to GHz; it is due to AMD putting too wimpy SSE2 vector processing into the K8. AMD may have thought the K8 was 'balanced' or something, but that's not good enough for the benchmarking battlefield. I have a notion we might see massive improvement in vector and FP performance in the next AMD core, while integer/general performance stays much the same. That might kill off media encoding as a popular benchmark; it would also set up Opteron2's SPEC_fp2003 for a really devastating blow against the Itanium.