RISC vs. x86


lexxmac

Member
Nov 25, 2003
85
0
0
You all have said very interesting things, and I can agree with most of it. But who said there isn't such a thing as x86 RISC? What I've heard over and over in discussions of architectures is that most x86 chips now break complex instructions down in a translation unit and execute the smaller internal instructions (though not to the extent Transmeta does). The P6 architecture, which debuted with the old Pentium Pro (the latest variant of the P6 would be the Pentium M, aka 'Dothan'), used a RISC-like inner core. The first x86 chip built around a RISC core came from NexGen (its parts were fabricated at IBM), and AMD eventually acquired NexGen. For supporting evidence see www.sandpile.org
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
There are two parts to the "RISC" definition. One part is the microarchitecture (the implementation). In that sense, modern x86 MPUs are "RISC": the idea was that by processing simpler instructions, just more of them, you could execute them extremely fast.
The other part of "RISC" is in the ISA. Keeping instructions simple in the ISA offers a simpler software interface, one in which compilers/assemblers have more freedom and can work more effectively to optimize for performance. In this respect, x86 chips will never be "RISC," as they still use the x86 ISA.
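
To make the microarchitecture half of this concrete, here's a minimal C sketch (hypothetical names, not modeled on any real decoder) of the idea: a single x86 read-modify-write instruction gets cracked into three simple, RISC-like micro-ops inside a P6-style core.

[code]
#include <stdio.h>

/* Hypothetical micro-op trace, for illustration only. */
int main(void)
{
    /* x86 CISC form:  add [mem], eax
     * One instruction reads memory, adds, and writes memory back.
     * A P6-style core might crack it into three RISC-like micro-ops: */
    const char *uops[] = {
        "load   tmp   <- [mem]",
        "add    tmp   <- tmp + eax",
        "store  [mem] <- tmp",
    };
    for (int i = 0; i < 3; i++)
        printf("uop %d: %s\n", i, uops[i]);
    return 0;
}
[/code]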
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: CTho9305
You can do up to 8-way SMP with Opterons, but the limiting factor is not the instruction set, but rather bus bandwidth issues.

I'm not entirely sure about Itanium, but I think you can't go above 2-way or 4-way SMP without killing performance, because the CPUs share their memory bandwidth. I think these aren't just "simple" 128-way SMP, but rather made up of modules with 2 CPUs each. This AMD page claims you can't go above 4-way SMP with Itaniums.

I'd hesitate to call MP Opterons "SMP", even though the name has become synonymous with multiprocessing, as memory access is not uniform across processors. Except for some early experimentation with multi-level trees and crossbar links, SMPs have almost exclusively been shared buses.

There was a time when shared-bus SMPs scaled to large numbers of processors (such as the 36-way SGI Challenge and 30-way Sun Enterprise 6000), but the electrical difficulty of increasing the bus frequency has limited designs to 5-drop buses (4 CPUs plus a memory controller), and even that is becoming hard. ccNUMA has been the popular way of making big shared-memory multiprocessors for the last decade, and until recently each NUMA node consisted of two or four processors on a shared bus...each added shared-bus node scales the memory bandwidth in the system. This is true for the HP Superdome (128-way PA-RISC 8800 or Itanium 2), SGI Altix 3700 (256-way Itanium 2, soon to be 512-way), and Sun Fire E25K (IIRC 144-way US-IV), among others. Only more recent systems, such as the EV7 Marvel and POWER4 p690, have used integrated links in large-scale ccNUMA systems.

As I understand it, the way the Opteron's HyperTransport is set up, once you go above 8 CPUs, some CPUs become multiple "hops" apart, so the performance gains drop rapidly.
I don't think hop distance is the reason Opteron's glueless support stops at 8 CPUs...big MP systems rarely use fully connected topologies. Big MP systems involve dramatically different design decisions than smaller systems, so since only a small fraction of processors go into 16+ processor servers (even if those systems account for a large portion of total server system revenue), it makes sense for Intel and AMD to optimize their processors for 8-way and smaller servers. You can then leave it up to the OEMs to design scalable chipsets for big servers.

As I understand it, Opteron uses broadcast-based coherence on its ccNUMA links. This is suitable for small systems, but broadcast traffic is probably what limits the scheme to 8 processors. Big ccNUMA systems use directory-based cache coherence...Itanium 2, for example, includes some coherence protocol operations to assist directory coherence, since large MP systems are an important part of its market.
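
A minimal sketch of the difference, assuming a simple bit-vector directory (all names here are hypothetical; real protocols are far more involved): with broadcast, every write probes every node, while a directory only probes the recorded sharers.

[code]
#include <stdio.h>
#include <stdint.h>

#define NODES 64

/* One directory entry per cache line: bit i set => node i has a copy. */
typedef struct { uint64_t sharers; } dir_entry;

/* Broadcast coherence: a write probes every other node, shared or not. */
static int broadcast_invalidate(int writer)
{
    int msgs = 0;
    for (int n = 0; n < NODES; n++)
        if (n != writer)
            msgs++;
    return msgs;
}

/* Directory coherence: only the recorded sharers get an invalidate. */
static int directory_invalidate(dir_entry *e, int writer)
{
    int msgs = 0;
    for (int n = 0; n < NODES; n++)
        if (n != writer && ((e->sharers >> n) & 1))
            msgs++;
    e->sharers = 1ULL << writer;   /* writer now holds the line exclusively */
    return msgs;
}

int main(void)
{
    dir_entry line = { .sharers = 0x7ULL };           /* nodes 0..2 share */
    printf("broadcast: %d probes\n", broadcast_invalidate(3));       /* 63 */
    printf("directory: %d probes\n", directory_invalidate(&line, 3)); /* 3 */
    return 0;
}
[/code]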

There is no limit on instruction parallelism. Parallel groups are separated by a stop bit. The back-end then schedules all instructions in a group for execution across all available resources. Instruction bundles exist to speed up fetch/decoding. I'm not sure I agree with the whole template scheme, as it does not cover all possible instruction groups.
The instruction bundle template greatly assists in the design of the instruction dispersal logic. It still results in a rather small number of NOPs, especially considering its VLIW roots. IIRC, recent figures put NOPs at about 20% of (dynamic) instructions in SPEC CPU, with the number of useful instructions slightly lower than for Alpha.

I also attended a practice talk recently ("Comparing OLTP Scaling Behavior on Intel Xeon and Intel Itanium 2 Processors") that compared the performance of Itanium and Xeon on Oracle 10g using a TPC-C-like workload. It turned out that, as the database size increased, the number of useful instructions executed on Itanium was actually less than on Xeon (IIRC, something like 1.7 million instructions/transaction on Itanium and 2 million instructions/transaction on Xeon). The author said NOP instructions added another 15% or so, and predicated-false instructions another few percent. He didn't mention how many NOPs Xeon had; presumably they would add another few percent to its instruction count. Admittedly, some of the difference was due to the fact that the Itanium system had more main memory (the authors had to use what they had access to), which resulted in less buffer manager overhead on the Itanium system. But I still think it's very impressive that the Itanium system had better code density, based on executed instruction count, than the x86 system.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
There is no limit on instruction parallelism. Parallel groups are separated by a stop bit. The back-end then schedules all instructions in a group for execution across all available resources. Instruction bundles exist to speed up fetch/decoding. I'm not sure I agree with the whole template scheme, as it does not cover all possible instruction groups.

The instruction bundle template greatly assists in the design of the instruction dispersal logic. It still results in a rather small number of NOPs, especially considering its VLIW roots. IIRC, recent figures put NOPs at about 20% of (dynamic) instructions in SPEC CPU, with the number of useful instructions slightly lower than for Alpha.

I can see how a template model would help in decoding, but are the benefits worth the cost in decoding bandwidth? Perhaps decoding bandwidth isn't a limitation on IA-64, but with main memory being the bottleneck, code density should be a pretty big issue (less code means less memory usage). Even a 10% reduction in code size could mean more cache hits and/or less memory usage.

Just how much does the template scheme help in decoding, and how? It seems to me it would add to the latency of decoding, as you'd first have to parse the template bits and then figure out which template to decode against. Couldn't you achieve a faster decode (less decode latency) by just parsing an "instruction type" field at the beginning of each instruction in parallel?

I also attended a practice talk recently ("Comparing OLTP Scaling Behavior on Intel Xeon and Intel Itanium 2 Processors") that compared the performance of Itanium and Xeon on Oracle 10g using a TPC-C-like workload. It turned out that, as the database size increased, the number of useful instructions executed on Itanium was actually less than on Xeon (IIRC, something like 1.7 million instructions/transaction on Itanium and 2 million instructions/transaction on Xeon). The author said NOP instructions added another 15% or so, and predicated-false instructions another few percent. He didn't mention how many NOPs Xeon had; presumably they would add another few percent to its instruction count. Admittedly, some of the difference was due to the fact that the Itanium system had more main memory (the authors had to use what they had access to), which resulted in less buffer manager overhead on the Itanium system. But I still think it's very impressive that the Itanium system had better code density, based on executed instruction count, than the x86 system.

Does that take into account actual code size? Again, with greater and greater memory bottlenecks, less code size could make the difference between the running loop fitting into cache or not.
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: imgod2u
The instruction bundle template greatly assists in the design of the instruction dispersal logic. It still results in a rather small number of NOPs, especially considering its VLIW roots. IIRC, recent figures put NOPs at about 20% of (dynamic) instructions in SPEC CPU, with the number of useful instructions slightly lower than for Alpha.

I can see how a template model would help in decoding, but are the benefits worth the cost in decoding bandwidth? Perhaps decoding bandwidth isn't a limitation on IA-64, but with main memory being the bottleneck, code density should be a pretty big issue (less code means less memory usage). Even a 10% reduction in code size could mean more cache hits and/or less memory usage.

Just how much does the template scheme help in decoding, and how? It seems to me it would add to the latency of decoding, as you'd first have to parse the template bits and then figure out which template to decode against. Couldn't you achieve a faster decode (less decode latency) by just parsing an "instruction type" field at the beginning of each instruction in parallel?

The template isn't necessarily for decoding purposes, but for instruction dispersal, when instructions are assigned to issue ports. Allowing any instruction to issue to any issue port would increase the fan-out on most of the instruction buffer slots by quite a bit. Keep in mind that most integer ALU instructions can go in either an M slot or an I slot...I really don't think allowing any instruction to occupy any slot would decrease the code size by any noticeable amount, as the current templates cover the possibilities quite well. Eight of the 32 possible templates are still reserved, so there is room for expansion if necessary.
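
To illustrate the dispersal point, here's a C sketch of a template lookup. The slot-type encodings below are my best recollection of the published IA-64 tables, so treat the exact values as illustrative; the point is that one 5-bit lookup names all three slot types (and the reserved entries) without touching a single opcode.

[code]
#include <stdio.h>

/* Map a 5-bit IA-64 bundle template to its slot types (M/I/F/B/L+X).
 * Bit 0 marks a stop at the end of the bundle, so pairs share a row.
 * Encodings from memory -- illustrative, not authoritative. */
static const char *template_slots(unsigned t)
{
    switch (t & ~1u) {
    case 0x00: return "M I I";
    case 0x02: return "M I I (stop after slot 1)";
    case 0x04: return "M L X";
    case 0x08: return "M M I";
    case 0x0A: return "M M I (stop after slot 0)";
    case 0x0C: return "M F I";
    case 0x0E: return "M M F";
    case 0x10: return "M I B";
    case 0x12: return "M B B";
    case 0x16: return "B B B";
    case 0x18: return "M M B";
    case 0x1C: return "M F B";
    default:   return "(reserved)";   /* 8 of the 32 encodings */
    }
}

int main(void)
{
    for (unsigned t = 0; t < 32; t += 2)
        printf("template 0x%02X/0x%02X -> %s\n", t, t + 1, template_slots(t));
    return 0;
}
[/code]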

I also attended a practice talk recently ("Comparing OLTP Scaling Behavior on Intel Xeon and Intel Itanium 2 Processors") that compared the performance of Itanium and Xeon on Oracle 10g using a TPC-C-like workload. It turned out that, as the database size increased, the number of useful instructions executed on Itanium was actually less than on Xeon (IIRC, something like 1.7 million instructions/transaction on Itanium and 2 million instructions/transaction on Xeon). The author said NOP instructions added another 15% or so, and predicated-false instructions another few percent. He didn't mention how many NOPs Xeon had; presumably they would add another few percent to its instruction count. Admittedly, some of the difference was due to the fact that the Itanium system had more main memory (the authors had to use what they had access to), which resulted in less buffer manager overhead on the Itanium system. But I still think it's very impressive that the Itanium system had better code density, based on executed instruction count, than the x86 system.

Does that take into account actual code size? Again, with greater and greater memory bottlenecks, less code size could make the difference between the running loop fitting into cache or not.

Code footprint is going to be around 40%-50% larger on Itanium, but I think it's a valid trade-off. The larger instruction size is necessary to have 128 registers and 4-operand instructions...the former enables rotating and stacked registers, which have a pretty significant impact on integer and floating-point performance. The 4-operand instructions allow the fused multiply-add for floating point, which, along with Itanium 2's cache design, is a large contributing factor in its linear algebra performance.
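
A quick C illustration of the 4-operand point: C99's fma() computes a*b + c as a single fused operation with one rounding step, which is what Itanium's hardware FMA exposes. Nothing here is Itanium-specific; it compiles anywhere with a C99 libm (link with -lm).

[code]
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.5, b = 2.0, c = 0.25;
    double fused    = fma(a, b, c);   /* one 4-operand op, one rounding */
    double separate = a * b + c;      /* two ops, two rounding steps    */
    /* For these inputs both are exact; in general they can differ by an
     * ulp, and on FMA hardware the fused form is one instruction, not two. */
    printf("fused = %g, separate = %g\n", fused, separate);
    return 0;
}
[/code]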

So Itanium trades a larger code size for a dramatically reduced number of memory references. Check out this paper...page 7 shows that the number of memory instructions is 40% less than on Alpha in SPEC CPU, which in turn is going to be less than on x86. Code footprints are typically much smaller than data footprints...page 17 shows that the L1I and L2 caches take care of the vast majority of instruction accesses across all SPEC CPU programs, so by the time you hit off-chip bandwidth, instruction references make up a small percentage. Compare that to the data read latency on page 20. Page 15 shows the cycle breakdown...as you can see, instruction access, attributed to L1I misses, is much less significant than data access.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: Sohcan
Originally posted by: imgod2u
The instruction bundle template greatly assists in the design of the instruction dispersal logic. It still results in a rather small number of NOPs, especially considering its VLIW roots. IIRC, recent figures put NOPs at about 20% of (dynamic) instructions in SPEC CPU, with the number of useful instructions slightly lower than for Alpha.

I can see how a template model would help in decoding, but are the benefits worth the cost in decoding bandwidth? Perhaps decoding bandwidth isn't a limitation on IA-64, but with main memory being the bottleneck, code density should be a pretty big issue (less code means less memory usage). Even a 10% reduction in code size could mean more cache hits and/or less memory usage.

Just how much does the template scheme help in decoding, and how? It seems to me it would add to the latency of decoding, as you'd first have to parse the template bits and then figure out which template to decode against. Couldn't you achieve a faster decode (less decode latency) by just parsing an "instruction type" field at the beginning of each instruction in parallel?

The template isn't necessarily for decoding purposes, but for instruction dispersal, when instructions are assigned to issue ports. Allowing any instruction to issue to any issue port would increase the fan-out on most of the instruction buffer slots by quite a bit. Keep in mind that most integer ALU instructions can go in either an M slot or an I slot...I really don't think allowing any instruction to occupy any slot would decrease the code size by any noticeable amount, as the current templates cover the possibilities quite well. Eight of the 32 possible templates are still reserved, so there is room for expansion if necessary.

Shouldn't the assignment already be known after decoding? I mean, an integer instruction is an integer instruction, a load/store is a load/store, and an FP instruction is an FP instruction. Why would you need templates for this?
 

lexxmac

Member
Nov 25, 2003
85
0
0
Although nobody has specifically debated what I posted earlier, I would like to add to what I said. First off, it should be understood that x86 is not (my opinion here) a true RISC architecture. I remember reading something that showed that most true RISC computers have more instructions going into the CPU than data, and I could almost guarantee that nowadays a chip calling itself 'RISC' on its own data sheet will have a larger L1 instruction cache than L1 data cache. I have to say I'm very impressed with the Itanium, simply because it shows that Intel is willing to ditch the x86 instruction set on one large-scale CPU.

A question though...

Does the concept of SIMD contradict RISC principles? SIMD/VLIW instructions are by nature 'complex', but does that mean that a true RISC chip shouldn't have SIMD at its disposal?

I know that when Apple first started using the G4 (Motorola 7400), they could beat everything else on the market. Why? The 7400 was the first time (for Apple at least; in general I'm not sure) that what would be considered a true RISC chip used SIMD, which Motorola implemented as 4x 32-bit lanes and called AltiVec (Apple called it the 'Velocity Engine'). Is the combination of true RISC with heavy parallelism (such as dual cores and/or more registers per core) plus SIMD the way to go? Do I need to seek professional mental help?
 

Sohcan

Platinum Member
Oct 10, 1999
2,127
0
0
Originally posted by: imgod2u
Originally posted by: Sohcan
The template isn't necessarily for decoding purposes, but for instruction dispersal, when instructions are assigned to issue ports. Allowing any instruction to issue to any issue port would increase the fan-out on most of the instruction buffer slots by quite a bit. Keep in mind that most integer ALU instructions can go in either an M slot or an I slot...I really don't think allowing any instruction to occupy any slot would decrease the code size by any noticeable amount, as the current templates cover the possibilities quite well. Eight of the 32 possible templates are still reserved, so there is room for expansion if necessary.

Shouldn't the assignment already be known after decoding? I mean, an integer instruction is an integer instruction, a load/store is a load/store, and an FP instruction is an FP instruction. Why would you need templates for this?

Well, strictly speaking, this isn't possible on Itanium. The instruction operation is defined by the 4-bit opcode in each instruction in combination with the template...you can't look at the instruction alone to determine what it does. Even if that weren't the case, it's likely easier to look at the 5-bit template and know how to disperse the instructions to the appropriate issue slots than to decode each instruction opcode and decide how to issue it. I'm not positive (I can find out), but I think the instruction opcodes are not looked at in the instruction dispersal stage...the template alone should be enough.

The template also serves to indicate if/where the stops occur in the bundle, which is especially important in the few cases where the stop is not at the end of the bundle. I'm guessing that the template information is pretty important in enabling the three instructions to fit in a 128-bit bundle. If the 5-bit template were removed, and four bits were added to each instruction opcode (3 to encode the instruction type, 1 to indicate the presence of a stop), the bundle would be 135 bits.
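
The arithmetic is easy to check in code: a 128-bit bundle is a 5-bit template plus three 41-bit slots (5 + 3*41 = 128). A minimal extraction sketch, assuming the architected packing with the template in the low-order bits:

[code]
#include <stdio.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } bundle128;   /* 128-bit IA-64 bundle */

/* Template: bits 0..4 of the bundle. */
static unsigned template_of(bundle128 b)
{
    return (unsigned)(b.lo & 0x1F);
}

/* Slot i (i = 0, 1, 2): 41 bits starting at bit 5 + 41*i. */
static uint64_t slot_of(bundle128 b, int i)
{
    unsigned lsb = 5 + 41 * i;
    uint64_t v;
    if (lsb + 41 <= 64)
        v = b.lo >> lsb;                              /* slot 0          */
    else if (lsb >= 64)
        v = b.hi >> (lsb - 64);                       /* slot 2          */
    else
        v = (b.lo >> lsb) | (b.hi << (64 - lsb));     /* slot 1 spans words */
    return v & ((1ULL << 41) - 1);
}

int main(void)
{
    bundle128 b = { 0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL }; /* arbitrary bits */
    printf("template = 0x%02X\n", template_of(b));
    for (int i = 0; i < 3; i++)
        printf("slot %d   = 0x%011llX\n", i, (unsigned long long)slot_of(b, i));
    return 0;
}
[/code]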

Does the concept of SIMD contradict RISC principles?

Personally, I consider RISC/CISC to be orthogonal to SIMD/SISD and to VLIW/sequential architecture. The first VLIW architectures were obviously CISCy, since they predated RISC. Although Itanium's instruction set architecture packs in a lot of things, its instruction atoms are more RISCy than CISCy, and it even takes some RISC principles to an extreme. It has a large number of general-purpose registers, the instructions are relatively simple (all ALU and integer operations take one cycle to execute, all FP operations take four cycles), it has only one addressing mode, and the instructions are fixed-length with relatively few, regularly composed instruction formats.

SIMD/VLIW instructions are by nature 'complex', but does that mean that a true RISC chip shouldn't have SIMD at its disposal?

Some would probably say that the simple SIMD implementations we've seen aren't necessary on RISC processors, given their better floating-point architecture compared to x87. Given Itanium's FP performance using its scalar architecture, and the fact that its multimedia SIMD architecture was almost an afterthought, I tend to agree with this view...with a well-designed floating-point architecture, a simple SIMD extension shouldn't be necessary. Of course, a "true" vector architecture, à la Cray with long 64+ element vectors and a beefy memory system to support it, can do wonders for scientific computing and linear algebra routines...but its purpose is otherwise pretty limited.

On the other hand, all the major RISC architectures (PA-RISC, POWER, SPARC, MIPS; I don't know about ARM) have had SIMD extensions, some even before x86...but I wonder how much of the motivation was just a fad, and whether the extensions are widely used.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: Sohcan

Shouldn't the assignment already be known after decoding? I mean, an integer instruction is an integer instruction, a load/store is a load/store, and an FP instruction is an FP instruction. Why would you need templates for this?

Well, strictly speaking, this isn't possible on Itanium. The instruction operation is defined by the 4-bit opcode in each instruction in combination with the template...you can't look at the instruction alone to determine what it does. Even if that weren't the case, it's likely easier to look at the 5-bit template and know how to disperse the instructions to the appropriate issue slots than to decode each instruction opcode and decide how to issue it. I'm not positive (I can find out), but I think the instruction opcodes are not looked at in the instruction dispersal stage...the template alone should be enough.

The template also serves to indicate if/where the stops occur in the bundle, which is especially important in the few cases where the stop is not at the end of the bundle. I'm guessing that the template information is pretty important in enabling the three instructions to fit in a 128-bit bundle. If the 5-bit template were removed, and four bits were added to each instruction opcode (3 to encode the instruction type, 1 to indicate the presence of a stop), the bundle would be 135 bits.

I see, so it has to do with packing instructions. I guess that's a good reason.

Does the concept of SIMD contradict RISC principles?

Personally, I consider RISC/CISC to be orthogonal to SIMD/SISD and to VLIW/sequential architecture. The first VLIW architectures were obviously CISCy, since they predated RISC. Although Itanium's instruction set architecture packs in a lot of things, its instruction atoms are more RISCy than CISCy, and it even takes some RISC principles to an extreme. It has a large number of general-purpose registers, the instructions are relatively simple (all ALU and integer operations take one cycle to execute, all FP operations take four cycles), it has only one addressing mode, and the instructions are fixed-length with relatively few, regularly composed instruction formats.

SIMD/VLIW instructions are by nature 'complex', but does that mean that a true RISC chip shouldn't have SIMD at its disposal?

Some would probably say that the simple SIMD implementations we've seen aren't necessary on RISC processors, given their better floating-point architecture compared to x87. Given Itanium's FP performance using its scalar architecture, and the fact that its multimedia SIMD architecture was almost an afterthought, I tend to agree with this view...with a well-designed floating-point architecture, a simple SIMD extension shouldn't be necessary. Of course, a "true" vector architecture, à la Cray with long 64+ element vectors and a beefy memory system to support it, can do wonders for scientific computing and linear algebra routines...but its purpose is otherwise pretty limited.

On the other hand, all the major RISC architectures (PA-RISC, POWER, SPARC, MIPS; I don't know about ARM) have had SIMD extensions, some even before x86...but I wonder how much of the motivation was just a fad, and whether the extensions are widely used.

Well, look no further than the PS2 for wide usage of the MIPS SIMD extensions....
Or the PPC970 in the PowerMacs. Not sure about SPARC though. SIMD is mostly there to assist single-precision floating-point operations, last I checked. Using SIMD could potentially boost throughput significantly, even compared to traditional RISC FP processing. It's difficult, very difficult, to extract 4 FP instructions in parallel from code, I'm guessing.
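
To make the 4x 32-bit idea concrete, here's a minimal sketch using x86 SSE intrinsics (chosen only because they're widely available to compile; AltiVec's vec_add is the analogous PowerPC form): one instruction performs four single-precision adds at once.

[code]
#include <stdio.h>
#include <xmmintrin.h>   /* SSE: 128-bit registers, 4 x 32-bit float lanes */

int main(void)
{
    float a[4] = {  1.0f,  2.0f,  3.0f,  4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float r[4];

    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);   /* four FP adds in one instruction */
    _mm_storeu_ps(r, vr);

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}
[/code]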
 