*cracks knuckles*
Okay, I'm back.
Most previous x86 generational changes introduced not only a new microarchitecture but also a new execution paradigm. Unfortunately, this has led most people to expect the same thing from every subsequent microarchitecture. Ignoring the 80286, the major (Intel) x86 microarchitectures brought the following:
80386: (more) orthogonal register set and paging (perhaps the two most important items), 32-bit flat addressing, translation lookaside buffers
80486: Fully pipelined integer execution, integrated level 1 cache (unified instruction/data), integrated floating-point unit
Pentium: 2-way issue statically scheduled superscalar
Pentium Pro: 3-way issue dynamically scheduled superscalar and speculative execution (and all the goodies that come with it: advanced dynamic branch prediction, branch target buffers, decoupled execution).
But since 1995-1996, when most MPU manufacturers introduced their dynamically scheduled superscalar processors, no one has really introduced any new paradigms (except Itanium / Itanium 2, but that's based on an older idea and is a different tangent altogether). The "second-generation" dynamically scheduled superscalar processors that have come out in the last year or two, or will come out soon (Athlon, Hammer, P4, EV7, POWER4, etc.), have merely improved upon the idea.
The P4 perhaps took the most radical approach, attempting to reduce wire delays as much as possible (through the trace cache, pipeline stages devoted to signal propagation, and double-speed ALUs to reduce data bypass delays) in light of the increasing gap between wire delay and gate delay. Obviously it has proven to be an acceptable design decision, though by no means the only viable route.
Aside from the changes around the Athlon core (routing links and integrated DDR controller), the Hammer core is very similar; a few integer reservation stations were added, the fetch and decode stages were modified, and the TLB and branch predictor were improved. This is by no means a bad thing; the Alpha 21364, due out soon, takes the 6-year-old 21264 core with almost no changes, and adds a 1.5 MB on-die L2 cache, 2 x 64-bit RDRAM controllers (12.8 GB/sec of memory bandwidth), 4 x 6.4 GB/sec routing links, and 1 x 6.4 GB/sec IO bus. The 21364 should be at the top of SPECint and SPECfp performance, at least until Madison (Itanium 2 follow-up) is released next year.
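For anyone who wants to sanity-check the 21364 numbers above, here is a quick back-of-the-envelope sketch in Python. It only uses the figures quoted in this post; the function and variable names are my own, and the "implied transfer rate" is derived from those figures rather than taken from any spec sheet.

```python
# Figures as quoted in the post for the Alpha 21364 (not from a datasheet).
MEM_CONTROLLERS = 2      # number of RDRAM controllers
MEM_BUS_BITS = 64        # width of each controller's data path
MEM_TOTAL_GB_S = 12.8    # quoted aggregate memory bandwidth
ROUTER_LINKS = 4         # number of routing links
LINK_GB_S = 6.4          # quoted bandwidth per routing link


def implied_transfer_rate_mt_s(total_gb_s, controllers, bus_bits):
    """Millions of transfers per second each controller needs to hit total_gb_s."""
    bytes_per_transfer = bus_bits // 8
    per_controller_mb_s = total_gb_s / controllers * 1000  # GB/s -> MB/s
    return per_controller_mb_s / bytes_per_transfer


implied_rate_mt_s = implied_transfer_rate_mt_s(
    MEM_TOTAL_GB_S, MEM_CONTROLLERS, MEM_BUS_BITS
)
total_link_gb_s = ROUTER_LINKS * LINK_GB_S

print(f"Implied rate per memory controller: {implied_rate_mt_s:.0f} MT/s")
print(f"Aggregate routing link bandwidth:   {total_link_gb_s:.1f} GB/s")
```

So the quoted 12.8 GB/sec works out to 6.4 GB/sec per controller, which is the same figure quoted for each routing link and for the IO bus.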
The really cool new ideas that academia has been toying with for a decade probably won't show up in commercial microprocessors for a few years: data value prediction, load address prediction, trace processors, multithreading processors, data flow, multiscalar processors. The transistor requirements for many of these techniques are still too high. The P4's trace cache and upcoming simultaneous multithreading (Hyperthreading) put it the closest to some of these ideas, though both are a bit cut down compared to the real deal.
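To give a flavor of what one of these ideas looks like, here is a toy sketch of data value prediction in Python. It models a last-value predictor, one of the simplest schemes in the literature: guess that an instruction will produce the same value it produced the last time it ran. The class, the table indexed by instruction address, and the trace format are all my own illustrative assumptions; real proposals use bounded hardware tables with confidence counters and recovery logic, none of which is modeled here.

```python
class LastValuePredictor:
    """Toy last-value predictor: predict that an instruction repeats its last result."""

    def __init__(self):
        # Hypothetical unbounded table keyed by instruction address (PC).
        self.table = {}

    def predict(self, pc):
        # Returns None when this instruction has not been seen yet.
        return self.table.get(pc)

    def update(self, pc, actual):
        # Remember the value actually produced, for the next prediction.
        self.table[pc] = actual


def hit_rate(trace):
    """trace: list of (pc, actual_value) pairs; returns fraction predicted correctly."""
    predictor = LastValuePredictor()
    correct = 0
    total = 0
    for pc, actual in trace:
        if predictor.predict(pc) == actual:
            correct += 1
        predictor.update(pc, actual)
        total += 1
    return correct / total if total else 0.0


# A load that keeps returning the same value is easy to predict...
constant_load = [(0x40, 7)] * 8
# ...while a loop counter that changes every iteration never matches its last value.
loop_counter = [(0x44, i) for i in range(8)]

print(hit_rate(constant_load))
print(hit_rate(loop_counter))
```

The point of the real hardware versions is that a correct prediction lets dependent instructions start before the producing instruction finishes, at the cost of the table storage and the machinery to squash work after a wrong guess.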
Unfortunately, there are not a lot of mainstream articles written yet about the upcoming processing paradigms; the material mainly consists of a large number of academic papers written over the last ten years. Here are a few important ones if anybody is interested (you may be able to find them on various universities' web sites using Google):
L. Hammond, M. Willey, and K. Olukotun, "Data speculation support for a chip multiprocessor."
G. S. Sohi, S. E. Breach, T. N. Vijaykumar, "Multiscalar processors."
J. G. Steffan, T. C. Mowry, "The potential for using thread-level data speculation to facilitate automatic parallelization."
J. B. Dennis and D. P. Misunas, "A preliminary architecture for a basic dataflow processor."
D. M. Tullsen, S. J. Eggers, and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism."
A. Roth and G. S. Sohi, "Speculative multithreaded processors."