<<
I didn't see how the article you referenced made the case for IA-64 >>
Where did I say DeMone's article was meant to make a case for IA-64? Perhaps I should have been clearer: what I wanted you to garner from the article was the history behind IA-64 on the first page (since I was addressing Merced's history and purpose).
<<
In the end I'm thinking Intel is yet again proving that VLIW is better on paper than in real life (as a general purpose CPU) >>
Ah, please read the last few paragraphs in my last post again, where I've highlighted the benefits and drawbacks of VLIW....VLIW is already going to prove itself capable with McKinley. I've noticed time and time again that the enthusiast crowd thinks there is one and only one correct way to do anything in microprocessor design. After an education in computer architecture (I'm speaking to you as a graduate student in comp arch) it becomes clear that there are merely design decisions and trade-offs made to reach the final product. It should be no surprise that Computer Architecture by Hennessy and Patterson (the "Bible" in comp arch, written by the creators of RISC as well as MIPS and SPARC, respectively) devotes one chapter to dynamic scheduling and another to static scheduling/VLIW.
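To make the static-scheduling idea concrete, here's a toy sketch (my own illustration, not any real compiler or the Itanium bundle format) of what a VLIW compiler does at its core: it, rather than out-of-order hardware, finds independent instructions and packs them into fixed-width bundles that issue together.

```python
# Toy greedy list scheduler illustrating the VLIW/static-scheduling idea:
# the *compiler* packs independent ops into fixed-width issue bundles,
# so the hardware needs no dynamic dependency-checking logic.

def schedule_vliw(ops, deps, width=3):
    """Pack ops into bundles of at most `width` slots.
    `deps` maps an op to the set of ops it depends on."""
    done = set()        # ops completed in earlier bundles
    bundles = []
    remaining = list(ops)
    while remaining:
        bundle = []
        for op in list(remaining):
            # Ready only if every dependency finished in an *earlier*
            # bundle (ops within one bundle issue simultaneously).
            if deps.get(op, set()) <= done and len(bundle) < width:
                bundle.append(op)
        for op in bundle:
            remaining.remove(op)
        done |= set(bundle)
        bundles.append(bundle)
    return bundles

# a = 1; b = 2; d = a * 2; c = a + b; e = c + d
ops = ["a", "b", "d", "c", "e"]
deps = {"c": {"a", "b"}, "d": {"a"}, "e": {"c", "d"}}
print(schedule_vliw(ops, deps))  # [['a', 'b'], ['d', 'c'], ['e']]
```

The trade-off the textbook chapters cover falls right out of this sketch: all the scheduling intelligence moves to compile time, which is cheap in silicon but leaves the bundles fixed when run-time events (cache misses, branches) change the picture.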
<<
If you look again at the article you referenced you should notice the following line in their 2nd CPU comparison table:
| EV7 | P4 | IA-64
System Bandwidth (GB/s) | 44.8 | 92 | 6.4 >>
Again you're confusing an instruction set (IA-64) with a particular implementation of that instruction set (McKinley). Are you going to judge the Pentium 4 and Hammer based on the 25-year-old 8086 design?
<<
This demonstrates how hard of a time Intel will have scaling their CPU's compared to IBM/Compaq (and Sun). >>
I get the impression that you are confusing local bus design with MP system design. "Scalability" is determined by much more than mere external MPU bandwidth. Perhaps if all MP systems were designed as shared-memory bus-based systems that would be true, but this is not the case. Shared-memory multiprocessing systems, whether they are SMP or NUMA, can be, and are, implemented in a variety of ways that are not necessarily dependent on their local bus design. Shared-memory multiprocessors can be arranged through, for example, crossbar switches, multistage interconnection networks, or bus interconnects. In the former two cases, which are very popular for systems from 8 to 64 MPUs, communication latency through the routing ICs is a major factor. Beyond 64 MPUs, message-passing systems are used, in which case message latency (which can be on the order of tens of microseconds), rather than local system bandwidth, is the more important factor.
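A back-of-the-envelope sketch of the point about routing ICs (the per-hop figure below is a made-up illustrative number, not a vendor spec): in a multistage interconnection network built from k-port routing ICs, a message crosses roughly log-base-k(N) router stages, so communication latency grows with system size independently of each node's local bus.

```python
# Illustrative-only model: latency across a multistage interconnection
# network depends on how many routing-IC stages a message must traverse,
# not on the local bus bandwidth of any single node.

def interconnect_hops(n_cpus, radix=8):
    """Router stages needed to reach any of n_cpus nodes with k-port ICs
    (integer loop instead of log() to avoid float rounding)."""
    hops, reach = 1, radix
    while reach < n_cpus:
        hops += 1
        reach *= radix
    return hops

def message_latency_ns(n_cpus, radix=8, per_hop_ns=50):
    # per_hop_ns is a hypothetical figure for one routing-IC traversal.
    return interconnect_hops(n_cpus, radix) * per_hop_ns

for n in (8, 64, 512):
    print(n, "CPUs ->", interconnect_hops(n), "hop(s),",
          message_latency_ns(n), "ns")
```

This is why "scalability" is mostly a property of the interconnect design between nodes: going from 8 to 512 CPUs here only adds router stages, and says nothing about the 6.4 GB/sec local bus inside each node.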
Like I said before, McKinley shares the same local bus design with HP's PA-8700. Each local node may have 6.4 GB/sec of bandwidth, but the design of the entire MP system determines the system bandwidth between the nodes. HP's 64-way Superdome series, which uses McKinley's local bus and is drop-in compatible with it, has an aggregate system bandwidth of 64 GB/sec. The difference between the EV7, Power4, and McKinley in that table entry you quoted is that the EV7 and Power4 have integrated routing links (which are responsible for their large system bandwidth figures), whereas a large-scale shared-address McKinley design would use discrete routing ICs. Integrated routing links have the obvious benefit of lower communication latency, but it is not absolutely detrimental that McKinley does not feature them at this time. After all, large-scale shared-address systems have been using McKinley's style of parallel system design for well over a decade.
<<
Even AMD's hammer should scale much better than IA-64 with it's built in memory controllers and Hypertransport tunnels >>
That very well may be true (keeping in mind the last two paragraphs), but likely we'll never know with Sledgehammer/K8. Sledgehammer is designed for 2- to 8-way MP systems, whereas McKinley is already at home in, for example, HP's 64-way Superdome servers. AMD does not have the support from the likes of IBM, Compaq, HP, Unisys and others at this time to design MP systems past 8 CPUs....not that it matters, since I can't imagine why anyone would design an x86 MP system past 8 CPUs. Sledgehammer is capable, on the other hand, of being used in clustered systems...but like any other cluster, IO performance and communication latency between each 2- to 8-way node are determined by the network interconnect and routing IC design.
<<
bandwidth increases with the # of CPU's as it should >>
Aggregate bandwidth can increase in any NUMA-based system using routing ICs (you could build a similar NUMA system design out of original Pentiums if you wanted to); the routing links do not have to be integrated on the MPU. NUMA is a far older design philosophy than Hammer...the advantage of integrated routing links is that they decrease communication latency with other nodes.
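The arithmetic behind that point can be sketched in a few lines (the bandwidth figures below are illustrative stand-ins, not measured numbers for any real system): a shared bus is one fixed pipe no matter how many CPUs hang off it, while any NUMA design adds link bandwidth with every node it adds, whether those links are on the MPU die or in discrete routing ICs.

```python
# Toy arithmetic: aggregate bandwidth of a shared bus vs. a NUMA system.
# All GB/s figures are hypothetical, chosen only to show the scaling.

def shared_bus_bandwidth(n_nodes, bus_gb_s=6.4):
    # Every node contends for the same bus, so the aggregate is constant
    # regardless of node count.
    return bus_gb_s

def numa_aggregate_bandwidth(n_nodes, links_per_node=2, link_gb_s=3.2):
    # Each added node brings its own routing links, so aggregate
    # bandwidth grows linearly with the node count.
    return n_nodes * links_per_node * link_gb_s

for n in (4, 16, 64):
    print(f"{n:3d} nodes: bus {shared_bus_bandwidth(n):5.1f} GB/s, "
          f"NUMA {numa_aggregate_bandwidth(n):6.1f} GB/s")
```

Note that nothing in the NUMA line depends on where the links live; integration on the MPU changes the latency per hop, not this aggregate-bandwidth scaling.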