<<
I didn't see how the article you referenced made the case for IA-64 >>
Where did I say DeMone's article was meant to make a case for IA-64? Perhaps I should have been clearer: what I wanted you to garner from the article was the history behind IA-64 on the first page (since I was addressing Merced's history and purpose).
<<
In the end I'm thinking Intel is yet again proving that VLIW is better on paper than in real life (as a general purpose CPU) >>
Ah, please read the last few paragraphs in my last post again, where I've highlighted the benefits and drawbacks of VLIW....VLIW is already going to prove itself capable with McKinley. I've noticed time and time again that the enthusiast crowd thinks there is one and only one correct way to do anything in microprocessor design. After an education in computer architecture (I'm speaking to you as a graduate student in comp arch) it becomes clear that there are merely design decisions and trade-offs made to reach the final product. It should be no surprise that Computer Architecture by Hennessy and Patterson (the "Bible" in comp arch, written by the creators of RISC as well as MIPS and SPARC, respectively) devotes one chapter to dynamic scheduling and another to static scheduling/VLIW.
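To make the static-scheduling idea concrete, here's a toy sketch (my own illustration, not any real compiler or the Itanium bundle format) of what a VLIW compiler does at its core: it, rather than out-of-order hardware, finds independent instructions and packs them into fixed-width bundles that issue together.

```python
# Toy greedy list scheduler illustrating the VLIW/static-scheduling idea:
# the *compiler* packs independent ops into fixed-width issue bundles,
# so the hardware needs no dynamic dependency-checking logic.

def schedule_vliw(ops, deps, width=3):
    """Pack ops into bundles of at most `width` slots.
    `deps` maps an op to the set of ops it depends on."""
    done = set()        # ops completed in earlier bundles
    bundles = []
    remaining = list(ops)
    while remaining:
        bundle = []
        for op in list(remaining):
            # Ready only if every dependency finished in an *earlier*
            # bundle (ops within one bundle issue simultaneously).
            if deps.get(op, set()) <= done and len(bundle) < width:
                bundle.append(op)
        for op in bundle:
            remaining.remove(op)
        done |= set(bundle)
        bundles.append(bundle)
    return bundles

# a = 1; b = 2; d = a * 2; c = a + b; e = c + d
ops = ["a", "b", "d", "c", "e"]
deps = {"c": {"a", "b"}, "d": {"a"}, "e": {"c", "d"}}
print(schedule_vliw(ops, deps))  # [['a', 'b'], ['d', 'c'], ['e']]
```

The trade-off the textbook chapters cover falls right out of this sketch: all the scheduling intelligence moves to compile time, which is cheap in silicon but leaves the bundles fixed when run-time events (cache misses, branches) change the picture.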
<<
If you look again at the article you referenced you should notice the following line in their 2nd CPU comparison table:
| EV7 | P4 | IA-64
System Bandwidth (GB/s) | 44.8 | 92 | 6.4 >>
Again you're confusing an instruction set (IA-64) with a particular implementation of that instruction set (McKinley). Are you going to judge the Pentium 4 and Hammer based on the 25-year-old 8086 design?
<<
This demonstrates how hard of a time Intel will have scaling their CPU's compared to IBM/Compaq (and Sun). >>
I get the impression that you are confusing local bus design with MP system design. "Scalability" is determined by much more than mere external MPU bandwidth. Perhaps if all MP systems were designed as shared-memory bus-based systems that would be true, but this is not the case. Shared-memory multiprocessing systems, whether they are SMP or NUMA, can be, and are, implemented in a variety of ways that are not necessarily dependent on their local bus design. Shared-memory multiprocessors can be arranged through, for example, crossbar switches, multistage interconnection networks, or bus interconnects. In the former two cases, which are very popular for systems from 8 to 64 MPUs, communication latency through the routing ICs is a major factor. Beyond 64 MPUs, message-passing systems are used, in which case message latency (which can be on the order of tens of microseconds), rather than local system bandwidth, is the more important factor.
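A back-of-the-envelope sketch of the point about routing ICs (the per-hop figure below is a made-up illustrative number, not a vendor spec): in a multistage interconnection network built from k-port routing ICs, a message crosses roughly log-base-k(N) router stages, so communication latency grows with system size independently of each node's local bus.

```python
# Illustrative-only model: latency across a multistage interconnection
# network depends on how many routing-IC stages a message must traverse,
# not on the local bus bandwidth of any single node.

def interconnect_hops(n_cpus, radix=8):
    """Router stages needed to reach any of n_cpus nodes with k-port ICs
    (integer loop instead of log() to avoid float rounding)."""
    hops, reach = 1, radix
    while reach < n_cpus:
        hops += 1
        reach *= radix
    return hops

def message_latency_ns(n_cpus, radix=8, per_hop_ns=50):
    # per_hop_ns is a hypothetical figure for one routing-IC traversal.
    return interconnect_hops(n_cpus, radix) * per_hop_ns

for n in (8, 64, 512):
    print(n, "CPUs ->", interconnect_hops(n), "hop(s),",
          message_latency_ns(n), "ns")
```

This is why "scalability" is mostly a property of the interconnect design between nodes: going from 8 to 512 CPUs here only adds router stages, and says nothing about the 6.4 GB/sec local bus inside each node.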
Like I said before, McKinley shares the same local bus design with HP's PA-8700. Each local node may have 6.4 GB/sec of bandwidth, but the design of the entire MP system determines the system bandwidth between the nodes. HP's 64-way Superdome series, which uses McKinley's local bus and is drop-in compatible with it, has an aggregate system bandwidth of 64 GB/sec. The difference between the EV7, Power4, and McKinley in that table entry you quoted is that the EV7 and Power4 have integrated routing links (which are responsible for their large system bandwidth figures), whereas a large-scale shared-address McKinley design would use discrete routing ICs. Integrated routing links have the obvious benefit of lower communication latency, but it is not absolutely detrimental that McKinley does not feature them at this time. After all, large-scale shared-address systems have been using McKinley's style of parallel system design for well over a decade.
<<
Even AMD's hammer should scale much better than IA-64 with it's built in memory controllers and Hypertransport tunnels >>
That very well may be true (keeping in mind the last two paragraphs), but likely we'll never know with Sledgehammer/K8. Sledgehammer is designed for 2- to 8-way MP systems, whereas McKinley is already at home in, for example, HP's 64-way Superdome servers. AMD does not have the support from the likes of IBM, Compaq, HP, Unisys and others at this time to design MP systems past 8 CPUs....not that it matters, since I can't imagine why anyone would design an x86 MP system past 8 CPUs. Sledgehammer is capable, on the other hand, of being used in clustered systems...but like any other cluster, IO performance and communication latency between each 2- to 8-way node are determined by the network interconnect and routing IC design.
<<
bandwidth increases with the # of CPU's as it should >>
Aggregate bandwidth can increase in any NUMA-based system using routing ICs (you could build a similar NUMA system design out of original Pentiums if you wanted to); the routing links do not have to be integrated on the MPU. NUMA is a far older design philosophy than Hammer...the advantage of integrated routing links is that they decrease communication latency with other nodes.
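The arithmetic behind that point can be sketched in a few lines (the bandwidth figures below are illustrative stand-ins, not measured numbers for any real system): a shared bus is one fixed pipe no matter how many CPUs hang off it, while any NUMA design adds link bandwidth with every node it adds, whether those links are on the MPU die or in discrete routing ICs.

```python
# Toy arithmetic: aggregate bandwidth of a shared bus vs. a NUMA system.
# All GB/s figures are hypothetical, chosen only to show the scaling.

def shared_bus_bandwidth(n_nodes, bus_gb_s=6.4):
    # Every node contends for the same bus, so the aggregate is constant
    # regardless of node count.
    return bus_gb_s

def numa_aggregate_bandwidth(n_nodes, links_per_node=2, link_gb_s=3.2):
    # Each added node brings its own routing links, so aggregate
    # bandwidth grows linearly with the node count.
    return n_nodes * links_per_node * link_gb_s

for n in (4, 16, 64):
    print(f"{n:3d} nodes: bus {shared_bus_bandwidth(n):5.1f} GB/s, "
          f"NUMA {numa_aggregate_bandwidth(n):6.1f} GB/s")
```

Note that nothing in the NUMA line depends on where the links live; integration on the MPU changes the latency per hop, not this aggregate-bandwidth scaling.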