I can't argue with you on CPU architecture. I just don't have the knowledge.
I'd like to point out, though, that according to this source, Itanium does not operate strictly "in-order".
Poulson isn't out yet. There's some chance it may move Itanium from comical to OK, if you're using VMS, HP-UX, NonStop, etc. No one in their right mind would use Itanium without needing a feature only it has.
IA64 going OOO is very much a white flag of defeat (though I'll bet the R&D work on it will help future non-Itanium developments).
From the link:
"The irony is that Poulson departs from the principles behind Itanium and follows a much more nuanced approach to computer architecture. The Itanium architecture and early implementations were a reaction to the increasing hardware complexity in the early 1990's. They were based around the theory that the hardware should be very simple and almost totally managed by software."
The thesis behind HP-WideWord/EPIC/IA64 was that RISC architectures would, as time went on, hit a ceiling of about 1 IPC. That was never proven, but under the assumptions of the time (late 80s/early 90s) it wasn't unlikely, if you assumed clock speeds would keep ramping up (like the 10 GHz NetBurst CPU by 2011 that never came to be). Also, at the time, superscalar processors (able to execute more than one instruction of an in-order stream at once) were fairly new, and register renaming more sophisticated than very simple shadowing was practically unheard of; the Pentium Pro had yet to show the world the future. Even with all that in mind, though, their solution was bass-ackwards.
The answer, as seen by the guys now at HP, was to have the compiler tell the CPU exactly what, when, where, and how instructions can be executed. By doing so, with VLIW, instructions are neatly aligned and easy to process, and the processor itself can be very simple (because it has to manage very little of the state!).
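To make that concrete, here's a rough sketch of what "the compiler tells the CPU" means in practice. This is made-up C with the scheduling in comments, not real IA64 code or bundle encodings:

```c
/* Hypothetical illustration of explicit, compiler-managed ILP --
   the grouping below is my own, not a real IA64 encoding. */
int dot4(const int *a, const int *b)
{
    /* No data dependencies between these four multiplies, so an EPIC
       compiler can mark them as parallel and pack them into bundles;
       the hardware just issues them, with no dependency checking. */
    int p0 = a[0] * b[0];   /* group 1 */
    int p1 = a[1] * b[1];   /* group 1 */
    int p2 = a[2] * b[2];   /* group 1 */
    int p3 = a[3] * b[3];   /* group 1 */
    /* The adds consume the products, so the compiler must place an
       explicit "stop" before them -- the CPU never works this out
       on its own. */
    return (p0 + p1) + (p2 + p3);   /* group 2, then group 3 */
}
```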
Such hardware needs great compilers to run effectively, because the whole idea is that the hardware is fast but dumb. And it's not merely that: while memory space gets cheaper all the time, memory bandwidth stays very expensive. All that explicitly telling the CPU what to do makes for absolutely massive binaries (an IA64 bundle packs three 41-bit instructions plus a 5-bit template into 128 bits, so even before counting the NOPs for slots the compiler can't fill, code runs roughly a third bigger than 32-bit RISC), necessitating massive instruction caches, and that's just the tip of the iceberg.
The reality is that ideal compilers don't exist, and may never exist. A compiler can help the CPU out through good static analysis, or through profiling, but that's as far as it can go. What the CPU needs changes; it is dynamic. The best way we've yet figured out to predict a processor's future needs is to make the processor do its own analysis of recent events.
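A trivial, made-up example of why the compiler can only go so far:

```c
/* The compiler can see this branch just fine, but whether it's taken
   depends entirely on runtime data. A branch predictor watching recent
   history adapts to each input; static analysis and even profiling can
   only capture "typical" behavior. (Illustrative code, nothing more.) */
int count_matches(const int *data, int n, int key)
{
    int hits = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] == key)   /* taken/not-taken pattern *is* the data */
            hits++;
    }
    return hits;
}
```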
Well, at an even lower level than branching, this is true. Register renaming (roughly: every write to a register gets a fresh physical register behind the scenes, so reusing a register name doesn't create a false dependency) helps enable effective reverse-engineering of ILP from an otherwise in-order stream of instructions. Instructions not dependent upon each other can be executed in any order. So, if instruction A's data needs to be fetched, but B, C, D, and E do not depend on A, they can run ahead while A waits. The end result is performance similar to what an in-order CPU would get if A never had to wait. In reality, which instruction will be the one waiting is often unknown, and with data structures that are difficult to fit in L1, or that cross a few cache lines over the course of just a few instructions, you can pretty much be assured that some instructions are going to wait while others can run, and it could be different ones each time, or different ones on different CPU families.
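As an illustration of that A-through-E case (made-up C, not any particular pipeline), the source is written in order, but only the final line actually consumes A's result:

```c
/* An OOO core with renaming can execute B..E while A's load is still
   stalled on memory, because nothing reads A's result until the end. */
long run_ahead(const long *p, long x)
{
    long a = *p;        /* A: load; may miss cache and wait a while */
    long b = x + 1;     /* B: independent of A                      */
    long c = x * 3;     /* C: independent of A                      */
    long d = b ^ c;     /* D: needs only B and C                    */
    long e = d << 2;    /* E: needs only D                          */
    return a + e;       /* the join point: now A must be done       */
}
```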
Implementing OOOE is complicated and intricate, but it works better than any alternative yet found, and could even improve the alternatives (such as VLIW, and ISAs that try to map data dependencies). The increase in potential IPC from OOOE has a side effect, however: far memory is now effectively even farther away, because groups of instructions are being completed faster (a 100-cycle miss costs about 100 instructions' worth of work at 1 IPC, but about 400 at 4 IPC). Bring on speculation and caches (xtors, man-hours, heavy bags of cash!)!
The core bet of EPIC was that these kinds of technologies would become too difficult to implement and run out of steam, while compilers could keep advancing performance, so simpler, faster hardware would be better in the long run. Implementation difficulty is certainly an issue, and we are now in an era of severely diminishing returns, but those techniques haven't nearly run out of steam. Meanwhile, RISC-type developments (simpler, faster hardware; more complicated software) have run out of steam instead, and all the great compiler work done to make RISC CPUs fly has ended up helping CISC every bit as much as RISC. The modern surviving CISC and RISC ISAs, the fairly good ones anyway, bear little resemblance to old CISC or RISC, each having taken the better of the other's features over the years.
As such, Itanium was late, hot, and underwhelming when the first Merceds came out, and except for certain niches within the niche that is HPC, it has remained late, hot, and underwhelming. While compilers have advanced, they haven't advanced enough to get rid of the need for increasing hardware-level complexity.
Instead Intel will likely introduce a new non-x86 CPU to replace Atom. Something lightweight... a new smartphone-specific CPU that doesn't need to be burdened with all the complex instructions Intel will integrate into its future highest-performance server/ultra-performance-desktop x86 CPUs.
Intel wants to use x86 to help make them successful. I could see them dropping IA32 support and forcing UEFI, but that would be the extent of the ISA reductions they could make without shooting themselves in the foot. Complicated instructions are not a burden, you see; that's 80s thinking. They are a method of keeping commonly executed code from getting too big :awe:.
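A rough illustration of that point (made-up example; the code-size figures are ballpark, not measured):

```c
#include <stddef.h>

/* On x86, this entire loop can be expressed as a single "complicated"
   instruction, rep movsb (count in RCX, src in RSI, dst in RDI): just
   a few bytes of code. A pure load/store RISC needs an explicit loop
   of several instructions. Dense encodings keep hot code small, and
   small hot code stays resident in the instruction cache. */
void copy_bytes(char *dst, const char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}
```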
If the current scheme allows for $250,000 software fees on top of $10,000 hardware... maybe AMD or Intel will find justifiable means to increase the price of server CPUs?
No, because the market won't pay more. I guarantee you there are more CentOS 5.x LAMP or LAPP installations running on Xeons or Opterons than the entire set of all servers in the world running software purchased on a per-processor-core basis.
Don't look at it from the perspective of CPU value. Look at it from the perspective of an organization where spending millions of dollars to upgrade software, using talented certified professionals, is a bargain compared to risking potentially cheaper options. That the pricing is per-core is just how MS can get away with charging that kind of money: 99% of the time, small/lean outfits won't need many-core MS SQL servers, and huge entrenched businesses have been used to that kind of crap, or worse, since the old super-proprietary mainframe and minicomputer eras.
Therefore couldn't there conceivably exist two tiers of server x86 CPUs for both Intel and AMD? The low tier would be a continuation of what exists today (Xeon and Opteron)... and the future "Wide-like-Itanium" (for lack of a better description) higher-IPC x86 server chips could make up what will become tier 2. <-- Obviously these wouldn't be as expensive as IBM z196, but I would imagine a reasonable mark-up could still be applied.
No. Intel and AMD already have three tiers each (counting PhII with ECC as equivalent to a socket 1155 Xeon). But it's not by any kind of IPC metric, because everyone benefits from higher IPC now that higher clock speeds mean too much heat. The tiers are by number of CPUs (sockets) supported, number of memory channels, and RAS features.