VLIW Processors - Transmeta

zensafari · Aug 12, 2004

I've long been interested in the Transmeta VLIW processor architecture. From what I understand, the Transmeta Crusoe processor uses some type of shell that combines four or more instructions into one 128-bit instruction.

Some claim that VLIW architectures can be faster, but I don't see how. How do you gain throughput with this type of architecture when running a standard 32-bit Windows program? I don't see how the speed-up can be very large. Where is the big advantage?

Also, I've heard that the shell used to morph regular instructions into VLIW format is software-based. Is this software stored on the HDD or in some other type of off-chip memory?

My gut reaction tells me that the Transmeta Crusoe VLIW architecture does not outperform standard processors. I imagine that there are reservation stations within the processor to handle out-of-order execution, but how are the 128-bit instructions divided up if need be?

Thanks to anyone who has some answers.

Fandu · Aug 12, 2004

Best articles that I've read on VLIW:
http://arstechnica.com/cpu/1q00/crusoe/crusoe-1.html
http://arstechnica.com/wankerdesk/01q4/transmeta/transmeta.html
http://arstechnica.com/cpu/003/mpf-2003/mpf-2003-2.html

xQuasi · Aug 12, 2004

Well...
I believe that PA-RISC was a VLIW processor, and Intel says that their IA-64 architecture is EPIC, but actually... VLIW and EPIC are basically the same.

Anyway...
The good thing with VLIW is that many instructions can finish simultaneously.
The compiler sorts the instructions and packs them into the long words, in this case 128 bits.
When it's time to execute, the processor takes the whole word and executes all of it at the same time.

It executes batches of instructions instead of one instruction after another.
Superscalar CPUs work through the instruction stream in order; if the CPU sees that two instructions don't affect each other, it executes both of them at once. But the CPU can't see as far ahead as the compiler, which sorts the instructions in advance.
The CPU can only see the instructions that are in the "pipes".
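
To make the packing idea concrete, here is a minimal Python sketch of a greedy, in-order packer. It is only an illustration under stated assumptions: the `(dest, src1, src2)` tuple format, the `pack_long_words` name and the four-slot width are all made up for this example, and a real compiler would also reorder instructions rather than just cut the stream where a dependency shows up.

```python
# Hypothetical three-address operations: (dest, src1, src2).
# Register names and the 4-slot width are assumptions for illustration.

def pack_long_words(ops, width=4):
    """Greedily pack operations, in program order, into long words.

    A new long word is started when the current one is full, or when the
    next operation reads or rewrites a register that an operation already
    in the current word writes (read-after-write or write-after-write).
    Unused slots are padded with "nop".
    """
    words, current, written = [], [], set()
    for dest, src1, src2 in ops:
        depends = src1 in written or src2 in written or dest in written
        if depends or len(current) == width:
            words.append(current + ["nop"] * (width - len(current)))
            current, written = [], set()
        current.append((dest, src1, src2))
        written.add(dest)
    if current:
        words.append(current + ["nop"] * (width - len(current)))
    return words


program = [
    ("r1", "a", "b"),
    ("r2", "c", "d"),
    ("r3", "r1", "r2"),  # needs r1 and r2, so it cannot share their word
    ("r4", "e", "f"),
]
long_words = pack_long_words(program)
for word in long_words:
    print(word)
```

With this input the four operations end up in two long words, each half filled with no-ops, which is the kind of padding discussed later in the thread.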

In the Itanium's case, if an instruction depends on something that the compiler could not predict, the processor is able to re-schedule the instructions. This is very complex and requires lots of transistors,
which makes development more expensive and the processor much larger, which in turn increases the production cost...
The good thing is that it doesn't "stall" as often as VLIW/EPIC processors that don't have dynamic scheduling.

I believe that a VLIW/EPIC processor guarantees a higher amount of instructions to be executed in parallel.

I don't know how Transmeta's CPUs actually work or where the code-morphing takes place.
But keep in mind that Transmeta's CPUs consume about the same amount of power as VIA's C3s do. I think the C3s consume a little bit less...

Correct me if I'm wrong.

imgod2u · Aug 13, 2004

Originally posted by: Quasi
Well...
I believe that PA-RISC were a VLIW processor, and Intel says that their IA-64 architecture is EPIC, but actually... VLIW and EPIC are basically the same.

Anyway...
The good thing with VLIW is that many instructions can finish simultaneously.
The compiler sorts the instructions and packs them into the long words, in this case 128 bits.
When its time to execute then it takes the whole word and executes it at the same time.

It executes batches of instructions instead of one instruction after the other.
Super-scalar CPU's works with one instruction after another, if it sees that two intructions doesn't affect each other then it executes both of them. But, the CPU's can't see as far as the actual compiler does which will sort the instructions.
CPU's can only see the instructions that's in the "pipes".

In the Itanium's case, if an instruction affects a factor that could not be predicted by the compiler then the processor is able to re-schedule the instructions. This is very complex and requires lots of transistors.
Which makes the development more expensive and the processor much larger which increases the production cost...
The good thing is that it doesn't "stall" as often as VLIW/EPIC processors that doesn't have dynamic scheduling.

Unfortunately (or perhaps fortunately), no. Itanium doesn't do dynamic out-of-order execution. It's completely in-order and if an instruction stalls, the processor stalls. IA-64 does have a few more flexible advantages over classic VLIW, one of which allows dynamic issuing to multiple execution units without the need for the compiler to specify which one to use. This allows more flexible bundling of instructions.
However, IA-64 implementations are, still, in-order, and so they are much more compiler-dependent in terms of scheduling instructions for execution. In predictable instruction sequences (such as most FP code), the compiler does a very good job and Itanium's performance is very good. In other less than predictable code, it's not so hot.

Yomicron · Aug 13, 2004

More Transmeta info:
Crusoe Exposed: Reverse Engineering the Transmeta TM5xxx Architecture I
Crusoe Exposed: Reverse Engineering the Transmeta TM5xxx Architecture II

zensafari · Aug 13, 2004

I'm sure that all superscalar processors differ in how they handle data dependencies, and that a mix of data forwarding techniques and stalls is used. I guess my gripe with VLIW is that I just don't see how data dependencies (or out-of-order instruction issues) can be handled without stalling. As mentioned, the compiler has to be pretty intense to get these things to work. And if a series of parallel execution units is used to execute VLIW commands, there must be some way for the processor to account for a dependency if the compiler missed it. And now, if we assume that each parallel execution unit is based on a pipelined design, we have an entire new can of worms to worry about...

Does anyone out there have any clues about the functionality behind the Transmeta code morphing? Seems like a headache to do; I wonder how much *benefit* is gained using this scheme? And what types of programs or CPU processes benefit the most from parallel execution units and/or VLIW architectures?

Thanks for the useful replies and the links. This is a subject I find fascinating and am not well-versed in.

Fandu · Aug 13, 2004

One really interesting thing that I would like to see Transmeta do is release some non x86 code-morphing addons for the Crusoe. It would be really neat to be able to have one CPU that could switch between PPC, ARM, x86, Alpha, etc.

imgod2u · Aug 13, 2004

Originally posted by: zensafari
I'm sure that all super-scalar processors are different in how they handle data dependencies and a mix of data forwarding techniques and stalls are used. I guess my pick with VLIW is I just don't see how data dependencies (or out of order instruction issues) can be handled without stalling. As mentioned, the compiler has to pretty intense to get these things to work. And if a series of parallel execution units are used to execute VLIW commands, there must be some way for the processor to account for a dependency if the compiler missed it. And now, if we assume that each parallel execution unit is based off a pipelined design, we have an entire new can of worms to worry about. . . .

Does anyone out there have any clues about the functionality behind the Transmetta code morphing? Seems like a headache to do, I wonder much *benefit* is gained using this scheme...? And, what types of programs or cpu processes benefit the most from either parallel execution units and/or VLIW architectures?

Thanks for the useful replies and the links. This is a subject I find fascinating and am not well-versed in.

Apparently pretty well, I would assume. Seeing as the new Astro processor provides a wider VLIW backend and outperforms the previous Crusoe, I'd say that the code-morphing layer is doing a pretty good job of feeding instructions to the backend. The advantage isn't so much in performance as it is in flexibility. It's much easier to make a chip with updated firmware than to redesign the logic of an OoOE backend. That's really the whole concept of VLIW. Not that similar performance couldn't necessarily be reached by superscalar means, but the extra cost in logic transistors would be orders of magnitude greater. A VLIW core is very simple and small and yet provides a lot of execution power. The front end just needs to feed it. This can be done with firmware or (as I'm suspecting later on in IA-64 families) through a runtime environment.

Sahakiel · Aug 14, 2004

Originally posted by: imgod2u
Apparantly pretty well I would assume. Seeing as the new Astro processor provides a wider VLIW backend and outperforms the previous Crusoe, I'd say that the code-morphing layer is doing a pretty good job of feeding instructions to the backend. The advantage isn't so much in performance as it is in flexibility.

The concept of VLIW is an alternative to superscalar and vector. The whole point of having VLIW is to offload work to the compiler on the premise that the information available at compile time is adequate. VLIW provides the parallelism of vector processing while balancing transistor budgets. In terms of flexibility, only vector processors are more limited than VLIW.
VLIW has the advantage of a simpler design and front end. Each long instruction is actually a bundle of multiple operations that are fetched all at once. The key to VLIW performance is that each operation in one very long instruction corresponds to one execution unit on the CPU, and the operations in that long instruction are guaranteed to have no data dependencies on each other.
The key handicap of VLIW is that not all instructions can be executed in parallel. That means a lot of no-ops are inserted into each long instruction, which basically wastes precious memory capacity and bandwidth.
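
As a rough illustration of that no-op cost, here is a small Python sketch. The four-slot, 16-byte long word and the sample trace are assumptions picked for the example, not measurements from any real chip, and real encodings may compress empty slots instead of storing them.

```python
def slot_utilization(useful_ops_per_word, width=4):
    """Fraction of slots that hold real work rather than no-ops."""
    total_slots = width * len(useful_ops_per_word)
    return sum(useful_ops_per_word) / total_slots


def wasted_bytes(useful_ops_per_word, width=4, word_bytes=16):
    """Bytes of fetched instruction memory taken up by no-op slots."""
    slot_bytes = word_bytes / width
    return sum((width - n) * slot_bytes for n in useful_ops_per_word)


# Hypothetical trace: number of useful operations in four consecutive long words.
trace = [4, 1, 2, 1]
utilization = slot_utilization(trace)
wasted = wasted_bytes(trace)
print(utilization, wasted)
```

For this made-up trace, half of the 16 slots carry no-ops, so half of the 64 bytes fetched do no useful work.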
VLIW never really took off. By the time it hit the market, superscalar had arrived and compiler technology had matured to the point where it was much easier and cheaper to go with superscalar. Now that compiler technology has stalled, more and more companies are looking at VLIW as an alternative. However, the primary obstacle now is the cost of porting software and providing support. Code-morphing is how Transmeta decided to deal with the problem.

It's much easier to make a chip with an updated firmware rather than redesign the logic of the OoOE backend. That's really the whole concept of VLIW.

No, that's really the premise of code-morphing. Transmeta's code-morphing technology is little more than emulation. The software presents a hardware model (in this case, the x86 ISA) which allows literally any hardware to run software written for that hardware model. This approach allows Transmeta to use very simple hardware, which lowers power consumption. However, it is entirely dependent on the software to provide adequate performance.
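
One reason a software layer like this can perform acceptably is that translated code is kept around and reused, so a hot block is only translated once. Here is a toy Python sketch of that caching idea. The `CodeMorpher` class, the string "instructions" and the `toy_translate` function are all hypothetical and bear no relation to Transmeta's actual software.

```python
class CodeMorpher:
    """Toy translate-once, reuse-many-times layer (illustration only)."""

    def __init__(self, translate):
        self.translate = translate   # function: guest block -> native block
        self.cache = {}              # guest address -> translated block
        self.translations = 0        # how many times we paid the translation cost

    def fetch_native(self, address, guest_block):
        """Return native code for a guest block, translating on first use."""
        if address not in self.cache:
            self.cache[address] = self.translate(guest_block)
            self.translations += 1
        return self.cache[address]


def toy_translate(guest_block):
    # Stand-in for a real translator: just tags each guest instruction.
    return ["native:" + instr for instr in guest_block]


morpher = CodeMorpher(toy_translate)
loop_body = ["add eax, ebx", "dec ecx", "jnz loop"]
for _ in range(1000):
    native = morpher.fetch_native(0x401000, loop_body)
print(morpher.translations)
```

In this sketch the loop body runs a thousand times but is translated only once, so the translation cost is spread over every later execution.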

Not that similar performance couldn't neccessarily be reached by superscalar means, but the extra cost of logic transistors would be orders of magnitude greater. A VLIW core is very simple and small and yet provides a lot of execution power. The front end just needs to feed it. This can be done with firmware or (as I'm suspecting later on in IA-64 families) through a runtime environment.

IA-64 is not VLIW. It is an implementation of the EPIC ISA, which allows for scalability in future processors. Instead of fixed instruction widths, EPIC implements fixed instruction bundles. Like VLIW, each instruction bundle contains singular instructions that are guaranteed to be independent. The scalability comes from the implementation of a stop bit that signals the beginning of code that is not guaranteed independent. That means future revisions can fetch multiple bundles and simply issue every bundle on one cycle up until it reaches a stop bit or runs out of resources.

xQuasi · Aug 14, 2004

Originally posted by: imgod2u

Originally posted by: Quasi
Well...
I believe that PA-RISC were a VLIW processor, and Intel says that their IA-64 architecture is EPIC, but actually... VLIW and EPIC are basically the same.

Anyway...
The good thing with VLIW is that many instructions can finish simultaneously.
The compiler sorts the instructions and packs them into the long words, in this case 128 bits.
When its time to execute then it takes the whole word and executes it at the same time.

It executes batches of instructions instead of one instruction after the other.
Super-scalar CPU's works with one instruction after another, if it sees that two intructions doesn't affect each other then it executes both of them. But, the CPU's can't see as far as the actual compiler does which will sort the instructions.
CPU's can only see the instructions that's in the "pipes".

In the Itanium's case, if an instruction affects a factor that could not be predicted by the compiler then the processor is able to re-schedule the instructions. This is very complex and requires lots of transistors.
Which makes the development more expensive and the processor much larger which increases the production cost...
The good thing is that it doesn't "stall" as often as VLIW/EPIC processors that doesn't have dynamic scheduling.

Unfortunately (or perhaps fortunately), no. Itanium doesn't do dynamic out-of-order execution. It's completely in-order and if an instruction stalls, the processor stalls. IA-64 does have a few more flexible advantages over classic VLIW, one of which allows dynamic issuing to multiple execution units without the need for the compiler to specify which one to use. This allows more flexible bundling of instructions.
However, IA-64 implementations are, still, in-order, and so they are much more compiler-dependent in terms of scheduling instructions for execution. In predictable instruction sequences (such as most FP code), the compiler does a very good job and Itanium's performance is very good. In other less than predictable code, it's not so hot.

I've read that it does, but I stand corrected. It might be that someone didn't explain the dynamic issuing correctly. I'm 100% sure that they (the articles) said "dynamic scheduling".
I'll look further into that at a later time. Thanks!

imgod2u · Aug 14, 2004

Originally posted by: Sahakiel

Originally posted by: imgod2u
Apparantly pretty well I would assume. Seeing as the new Astro processor provides a wider VLIW backend and outperforms the previous Crusoe, I'd say that the code-morphing layer is doing a pretty good job of feeding instructions to the backend. The advantage isn't so much in performance as it is in flexibility.

The concept of VLIW is an alternative to superscalar and vector. The whole point of having VLIW is to offload work to the compiler on the premise that runtime information is adequate. VLIW provides the parallelism of vector processing while balancing transistor budgets. In terms of flexibility, only vector processors are more limited than VLIW.
VLIW has the advantage of simpler designs and front end. Each instruction is actually a compilation of multiple instructions which are fetched all at once. The key to VLIW performance is the fact that each instruction in one very long instruction corresponds to one execution engine on the CPU and are guaranteed to have no data dependencies for that long instruction.
The key to VLIW handicap is that not all instructions can be executed in parallel. That means a lot of noops are inserted into each long instruction, which basically wastes precious memory capacity and bandwidth.
VLIW never really took off. By the time it hit the market, superscalar had arrived and compiler technology had matured to the point where it was much easier and cheaper to go with superscalar. Now that compiler technology has stalled, more and more companies are looking at VLIW as an alternative. However, now the primary inhibition is costs for porting software and support. Code-morphing is how Transmeta decided to deal with the problem.

Think beyond general purpose MPU's. VLIW is used very widely in embedded systems and FP powerhouse chips. Practically all modern GPU's use a VLIW native ISA. VLIW does require a good compiler, but the means to produce a good-performing runtime environment is here and it matches or exceeds native code. And as we've seen with GPU's, when you're not bounded in hardware by the ISA, you can really make things very fast. Just change the drivers (the JIT) to support the new GPU. ISA support in software is cheap.

It's much easier to make a chip with an updated firmware rather than redesign the logic of the OoOE backend. That's really the whole concept of VLIW.

No, that's really the premise of code-morphing. Transmeta's code-morphing technology is little more than emulation. The software presents a hardware model (in this case, the x86 ISA) which allows literally any hardware to run software written for that hardware model. This approach allows Transmeta to use very simple hardware, which lowers ppwer consumption. However, it is entirely dependent on the software to provide adequate performance.

Cut off at the wrong point and completely out of context I'm afraid, see below.

Not that similar performance couldn't neccessarily be reached by superscalar means, but the extra cost of logic transistors would be orders of magnitude greater. A VLIW core is very simple and small and yet provides a lot of execution power. The front end just needs to feed it. This can be done with firmware or (as I'm suspecting later on in IA-64 families) through a runtime environment.

IA-64 not VLIW. It is an implementation of the EPIC ISA, which allows for scalability in future processors. Instead of fixed instruction widths, EPIC implements fixed instruction bundles.

It is very much VLIW. It does, indeed, use very long instruction words. The fact that each instruction word does not specify explicit parallelism doesn't mean it's not a bundled, very long instruction word. As I've pointed out above, IA-64 does differ in some ways from "classic" VLIW, but the concept is still there. Explicit, compiler-generated parallelism vs superscalar.

Like VLIW, each instruction bundle contains singular instructions that are guaranteed to be independent.

Not really. The bundled instructions (128 bits, 3 instructions per bundle) are really there to help decoding and dispatch. There's really no guarantee that all 3 are explicitly parallel. Explicit parallelism in IA-64 is expressed by means of a stop bit that marks where the parallel "group" ends.

The scalability comes from the implementation of a stop bit that signals the beginning of code that is not guaranteed independent. That means future revisions can fetch multiple bundles and simply issue every bundle on one cycle up until it reaches a stop bit or runs out of resources.

It already does. In fact, even Merced fetched and decoded 2 bundles at a time IIRC.

Sahakiel · Aug 15, 2004

Originally posted by: imgod2u
Think beyond general purpose MPU's. VLIW is used very widely in embedded systems and FP powerhouse chips. Practically all modern GPU's use a VLIW native ISA. VLIW does require a good compiler, but the means to produce a good-performing runtime environment is here and it matches or exceeds native code.

Hm... I was sure graphics processors use vector processors. However, since I'm relatively new to the field, I won't comment further.

And as we've seen with GPU's, when you're not bounded in hardware by the ISA, you can really make things very fast. Just change the drivers (the JIT) to support the new GPU. ISA support in software is cheap.

There is a marked difference between an ISA specification and an API specification. This thread started out with a discussion over ISA design. Last I checked, mainstream graphics engines are written to specific API specifications.
ISA support in software is relatively cheap. However, it is also slower in just about every single instance.

Cut off at the wrong point and completely out of context I'm afraid, see below.

No, it was cut off correctly and in-context. I was simply correcting your statement about code-morphing being the concept of VLIW.

It is very much VLIW. It does, indeed, use very long instruction words. The fact that each instruction word does not specify explicit parallelism doesn't mean it's not a bundled, very long instruction word. As I've pointed out above, IA-64 does differ in some ways from "classic" VLIW, but the concept is still there. Explicit, compiler-generated parallelism vs superscalar.

Explicit, compiler-generated parallelism is the idea behind vector processors as well. Going by that criteria, vector processors are VLIW architectures, which is incorrect.
EPIC shares concepts with VLIW, that is true. EPIC also shares concepts with superscalar.

Not really. The bundled instructions (128-bits, 3 instructions per bundle) are really there to help decoding and dispatch. There's really no guarantee that all 3 are explicitly parallel. Explicit parallelism through IA-64 is done by the means of a stop-bit in the instruction that the parallel "group" ends.

Hm... I was a bit off about the bundles. There are four instruction bundles with instructions not guaranteed independent. The other 20 instruction bundles in the IA-64 ISA have 3 explicitly parallel instructions.

imgod2u · Aug 15, 2004

Originally posted by: Sahakiel

Originally posted by: imgod2u
Think beyond general purpose MPU's. VLIW is used very widely in embedded systems and FP powerhouse chips. Practically all modern GPU's use a VLIW native ISA. VLIW does require a good compiler, but the means to produce a good-performing runtime environment is here and it matches or exceeds native code.

Hm... I was sure graphics processors use vector processors. However, since I'm relatively new to the field, I won't comment further.

Combination of both really. Each instruction can operate on a vector of data, but multiple instructions are issued in parallel to a VLIW core in most cases.

And as we've seen with GPU's, when you're not bounded in hardware by the ISA, you can really make things very fast. Just change the drivers (the JIT) to support the new GPU. ISA support in software is cheap.

There is a marked difference between an ISA specification and an API specification. This thread started out with a discussion over ISA design. Last I checked, mainstream graphics engines are written to specific API specifications.
ISA support in software is relatively cheap. However, it is also slower in just about every single instance.

API specifications are compiled by the JIT to native VLIW before execution. The added flexibility of not having to support older ISA's in hardware more than makes up for the overhead of having to compile on-the-fly (and with runtime profiling, the overhead is even further trivial). The API need not change even if the hardware ISA changes. That's one of the prime advantages of a runtime environment.

Cut off at the wrong point and completely out of context I'm afraid, see below.

No, it was cut off correctly and in-context. I was simply correcting your statement about code-morphing being the concept of VLIW.

If you read the sentence afterwards, you'd see I was referring to the simplicity of VLIW cores.

It is very much VLIW. It does, indeed, use very long instruction words. The fact that each instruction word does not specify explicit parallelism doesn't mean it's not a bundled, very long instruction word. As I've pointed out above, IA-64 does differ in some ways from "classic" VLIW, but the concept is still there. Explicit, compiler-generated parallelism vs superscalar.

Explicit, compiler-generated parallelism is the idea behind vector processors as well. Going by that criteria, vector processors are VLIW architectures, which is incorrect.

Except vector processors are limited in the type of parallelism; VLIW is not. That's the primary difference. IA-64 has the flexibility to generate parallelism via the compiler that is characteristic of VLIW. Furthermore, it does, indeed, use a very long instruction word ISA. The two are simply not related (the VLIW ISA isn't there to offer explicit parallelism).

EPIC shares concepts with VLIW, that is true. EPIC also shares concepts with superscalar.

Erm, no it doesn't. IA-64 does not support OoOE. It absolutely cannot run instructions which would normally be sequential in parallel (which is what superscalar does). It can only obtain parallelism via explicit instructions from the compiler.

Not really. The bundled instructions (128-bits, 3 instructions per bundle) are really there to help decoding and dispatch. There's really no guarantee that all 3 are explicitly parallel. Explicit parallelism through IA-64 is done by the means of a stop-bit in the instruction that the parallel "group" ends.

Hm... I was a bit off about the bundles. There are four instruction bundles with instructions not guaranteed independent. The other 20 instruction bundles in the IA-64 ISA have 3 explicitly parallel instructions.

Erm. There is a vast permutation of instruction bundles. There are the templates though:
http://www.intel.com/design/itanium/manuals/245319.pdf

If you look, the templates do not specify nor imply that there is any explicit parallelism between the instructions. Parallelism is solely defined by the stop bit.

Sohcan · Aug 15, 2004

Originally posted by: Quasi
Well...
I believe that PA-RISC were a VLIW processor

PA-RISC is a sequential RISC architecture...the current PA 8x00 series is an out-of-order implementation.

In the Itanium's case, if an instruction affects a factor that could not be predicted by the compiler then the processor is able to re-schedule the instructions. This is very complex and requires lots of transistors.

The Itanium 2 core pipeline is relatively simple...this is the opinion that I've heard many times from other architects and circuit designers on my team who have had long experience with out-of-order PA-RISC and Pentium 3/4 cores. This is especially true considering it can issue up to six instructions per cycle, whereas 3-4 is typical for most OOOE superscalar cores, as well as given the large register files and fully bypassed execution units. The Itanium 2's high transistor count is an artifact of being targeted towards throughput-oriented server workloads...a vast majority of the FETs are outside the main integer pipeline, such as in the memory hierarchy and TLBs (translation lookaside buffers), the system interface, the large branch predictor, the ALAT (advanced load address table), the FPU, etc.

Which makes the development more expensive and the processor much larger which increases the production cost...

While I can't speak for AMD, believe me that the Itanium design teams are fewer in number and (in some cases much) smaller in size than the x86 design teams at Intel.

Originally posted by: imgod2u
In predictable instruction sequences (such as most FP code), the compiler does a very good job and Itanium's performance is very good. In other less than predictable code, it's not so hot.

On-line transaction processing is a server workload with notoriously high cache miss rates and poor predictability, yet the 1.5 GHz Itanium 2 only recently lost the top spot in 4-way TPC-C (the de facto OLTP benchmark) to POWER5, after holding it for over a year. It is still nearly 30% ahead of Opteron and Xeon.

Originally posted by: zensafari
I guess my pick with VLIW is I just don't see how data dependencies (or out of order instruction issues) can be handled without stalling. As mentioned, the compiler has to pretty intense to get these things to work.

Finding data dependencies is a no-brainer...any optimizing compiler, for any architecture, produces a dependency graph, from which global optimizations and register allocation are performed. So moving most instructions around in the program flow and bundling them is easy. On top of that, Itanium 2's pipeline is very predictable (assuming an L1 cache hit), making scheduling relatively easy. All integer instructions take 1 cycle to execute, all multimedia instructions take 2 cycles, and all FP instructions take 4 cycles. Itanium 2 also does scoreboarding in hardware to track register usage and allow out-of-order instruction completion.
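
A toy Python sketch shows why fixed latencies make static scheduling tractable. It uses the 1/2/4-cycle latencies and six-wide issue quoted in this thread, but everything else (the `schedule` function, the instruction tuples, the assumption that every load hits the cache) is made up for illustration and is not a model of the real pipeline.

```python
LATENCY = {"int": 1, "mm": 2, "fp": 4}  # cycle counts quoted in the post above


def schedule(instrs, issue_width=6):
    """Compute issue cycles for a simple in-order machine.

    instrs: list of (name, kind, [names it depends on]) in program order.
    An instruction issues no earlier than the one before it, and only
    once all of its inputs are ready. Returns {name: issue_cycle}.
    """
    ready_at = {}            # name -> cycle at which its result is available
    issue = {}
    cycle, issued_this_cycle = 0, 0
    for name, kind, deps in instrs:
        earliest = max([ready_at[d] for d in deps], default=0)
        if earliest > cycle:                      # wait for an operand
            cycle, issued_this_cycle = earliest, 0
        elif issued_this_cycle == issue_width:    # issue slots exhausted
            cycle, issued_this_cycle = cycle + 1, 0
        issue[name] = cycle
        ready_at[name] = cycle + LATENCY[kind]
        issued_this_cycle += 1
    return issue


naive = [
    ("a", "int", []),
    ("b", "fp", []),
    ("c", "int", ["a"]),
    ("d", "fp", ["b"]),   # waits 4 cycles for b
    ("e", "int", ["c"]),  # stuck behind d even though its input is ready
]
# The same instructions with e hoisted above d, as a compiler could do.
reordered = [naive[0], naive[1], naive[2], naive[4], naive[3]]

print(schedule(naive))
print(schedule(reordered))
```

In the naive order `e` cannot issue until cycle 4 because it sits behind `d`; after the reorder it issues at cycle 2, and `d` still issues at cycle 4. Since the latencies are known in advance, the compiler can work this out without any help from the hardware.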

The difficult part is performing load and control speculation (moving loads up in the program flow, in some cases above branches). Load speculation has to be selectively performed in order to increase performance, which means speculating on what address the load is being performed on...that's not so easy. But as far as its effect on the performance of Itanium binaries, consider this: on Merced and McKinley (the 800 MHz Itanium and 1 GHz Itanium 2), misperforming a load or control speculation had to be cleaned up by the OS, which means a penalty of a few thousand cycles at a minimum. As a result, binaries for McKinley didn't use load speculation. It's only Madison (1.5 GHz Itanium 2) that reduced the penalty for a mis-speculation to around 15-20 cycles as I recall, so load speculation is only starting to be used in Itanium binaries.
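
That trade-off can be put as rough arithmetic. If a correctly speculated load saves `cycles_saved` cycles and a failed one costs `penalty` cycles, speculation only pays off while the failure rate stays below `cycles_saved / (cycles_saved + penalty)`. The Python sketch below assumes a saving of 10 cycles per successful speculation, which is a made-up figure; the penalties are the approximate ones quoted above.

```python
def break_even_miss_rate(cycles_saved, penalty):
    """Failure rate at which speculation stops being a net win.

    Expected gain per speculated load is
        (1 - p) * cycles_saved - p * penalty,
    which is zero when p = cycles_saved / (cycles_saved + penalty).
    """
    return cycles_saved / (cycles_saved + penalty)


ASSUMED_SAVING = 10  # hypothetical cycles saved per correct speculation

os_cleanup = break_even_miss_rate(ASSUMED_SAVING, 2000)  # "a few thousand cycles"
hw_cleanup = break_even_miss_rate(ASSUMED_SAVING, 20)    # "around 15-20 cycles"
print(f"{os_cleanup:.4f} {hw_cleanup:.4f}")
```

Under that assumption, a penalty in the thousands of cycles means the compiler has to be right more than 99% of the time, while a 20-cycle penalty tolerates being wrong about a third of the time. That fits with speculation only starting to show up in binaries once the penalty dropped.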

Does anyone out there have any clues about the functionality behind the Transmetta code morphing?

Here's a good article on Code Morphing: Wolves in CISC Clothing

Originally posted by: imgod2u
Erm, not it doesn't. IA-64 does not support OoOE.

I'm sure you didn't mean it, but superscalar is not synonymous with OOOE...see the Pentium, UltraSPARC III, Alpha 21164, etc. Itanium does share a lot of features with superscalar that are not present in VLIW: dynamic allocation of instructions to execution units, register scoreboarding, and dynamic branch prediction.

It absolutely cannot run instructions which would normally be sequential in parallel (which is what superscalar does).

I don't follow you here...OOOE superscalar cannot arbitrarily execute dependent instructions out-of-order any more than Itanium can. If an instruction stalls in an OOOE processor due to a long latency event such as cache miss, only independent instructions may continue to execute...all dependent instructions stall as well.

Sahakiel · Aug 16, 2004

Originally posted by: imgod2u
If you read the sentence afterwards, you'd see I was refering to the simplicity of VLIW cores.

I'm not questioning the simplicity of VLIW cores. I simply pointed out you took the wrong path. Transmeta chose a VLIW core for code-morphing precisely because it would be simpler to implement. That meant lower power consumption. VLIW was not developed specifically for Transmeta's code-morphing.

Except vector processors are limited in the type of parallelism, VLIW is not. That's the primary difference. IA-64 has the flexibility to generate parallelism via the compiler that is characteristic of VLIW. Furthermore, it does, indeed, use a very long instruction wording ISA. The two are simply not related (the VLIW ISA isn't there to offer explicit parallelism).

So, basically, you agree that IA-64, while similar in some respects, is not VLIW.

Erm. There is a vast permutation of instruction bundles. There are the templates though:
http://www.intel.com/design/itanium/manuals/245319.pdf

My apologies for not being clear.
Instruction bundles are defined by 5-bit templates stored in the low 5 bits of each 128-bit bundle. Each template specifies the type of instruction in each of the three slots, with each type having multiple possibilities. The lowest-order bit of the template (I'm sure it was lowest) indicates the presence of a stop at the end of the bundle, so the 32 encodings collapse to 16 unique templates. Eight encodings (four of those 16 templates) are reserved for future use, which leaves 12 unique templates plus the stop-bit variants. Templates without stop bits sit in the middle of streams of independent instructions.
Here's the kicker: out of those 12 templates, only two have a stop before the end of the bundle. Between stops, the instructions are guaranteed independent by the compiler. That's why there is a stop bit. Sequences of independent code can run from 1 instruction to an infinite number of instructions, depending on where the stops are located.
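
For anyone who wants to poke at this, here is a rough Python sketch of the bundle layout described above: a 5-bit template in the low bits followed by three 41-bit slots, with instruction groups running from one stop to the next. The field positions and template table are as I understand them from the Intel manual linked above; the helper names (`decode_bundle`, `instruction_groups`, `make_bundle`) are made up for this example, and the slot contents are dummy numbers rather than real encodings.

```python
SLOT_MASK = (1 << 41) - 1

# Even template encodings -> (slot types, slots followed by a mid-bundle stop).
# Each odd encoding is the even one below it plus a stop after slot 2.
# In MLX the L and X slots together hold a single long instruction.
BASE_TEMPLATES = {
    0x00: ("MII", ()),
    0x02: ("MII", (1,)),   # stop after slot 1
    0x04: ("MLX", ()),
    0x08: ("MMI", ()),
    0x0A: ("MMI", (0,)),   # stop after slot 0
    0x0C: ("MFI", ()),
    0x0E: ("MMF", ()),
    0x10: ("MIB", ()),
    0x12: ("MBB", ()),
    0x16: ("BBB", ()),
    0x18: ("MMB", ()),
    0x1C: ("MFB", ()),
}


def decode_bundle(bundle):
    """Split a 128-bit integer into (slot types, stop positions, slot bits)."""
    template = bundle & 0x1F
    base = template & ~1
    if base not in BASE_TEMPLATES:
        raise ValueError(f"reserved template {template:#04x}")
    types, stops = BASE_TEMPLATES[base]
    if template & 1:                      # low bit set: stop at end of bundle
        stops = stops + (2,)
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return types, stops, slots


def instruction_groups(bundles):
    """Yield lists of (type, bits) lying between consecutive stops."""
    group = []
    for bundle in bundles:
        types, stops, slots = decode_bundle(bundle)
        for i, (kind, bits) in enumerate(zip(types, slots)):
            group.append((kind, bits))
            if i in stops:
                yield group
                group = []
    if group:                             # trailing group with no final stop
        yield group


def make_bundle(template, s0, s1, s2):
    """Assemble a bundle from a template and three dummy slot values."""
    return template | (s0 << 5) | (s1 << 46) | (s2 << 87)


bundles = [
    make_bundle(0x00, 1, 2, 3),   # MII, no stop
    make_bundle(0x02, 4, 5, 6),   # MI;I, stop after slot 1
    make_bundle(0x01, 7, 8, 9),   # MII;, stop at end
]
group_sizes = [len(g) for g in instruction_groups(bundles)]
```

Note how the first group spans a bundle boundary: a group is defined by the stops, not by the bundle edges.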

xQuasi · Aug 18, 2004

So... IA-64 for desktop computers wouldn't be that complex after all?

So can IA-64 replace x86?

How easy would it be for Transmeta to add support for another instruction set and code-morph it?

Tiamat · Aug 18, 2004

Originally posted by: zensafari
I've often been interested in the Transmetta VLIW processor architecture. From what I understand the Transmetta Crusoe processor uses some type of shell that combines 4 or more instructions into one-128 bit instruction.

Some claim that VLIW architectures can be faster but I don't see how. How do you gain throughput with this type of architectures when running a standard 32-bit windows program? I don't see how speed-up can be affected greatly. Where is the big advantage?

Also, I've heard that the shell used to morph regular instructions into VLIW format is software-based. Is this software stored on the HDD or in some type of other off-chip memory?

My gut reaction tells me that the Transmetta Crusoe VLIW architecture is does not out-perform standard processors. I imagine that there are reservation stations within the processor that account for out-of-order execution but how are the 128-bit instructions divided up if need be?

Thanks to anyone who has some answers.

I read somewhere that this process is done from 8MB of your system ram that is locked to doing this specific task. It might have been on Fujitsu's product page...

zensafari · Aug 18, 2004

Thanks for all of the knowledgeable responses. I do appreciate it.

I'm not familiar with the Itanium design, to be honest. How is it similar to a Crusoe? Maybe a brief comparison of the two will help me figure out what lies behind the VLIW machine. I still can't decide whether the *power* of the processor lies in the compiler's complexity or in the design of the core itself. Again, I'm referring to a VLIW architecture used for executing programs in an x86 environment (Windows, etc).

Thanks in advance.

imgod2u · Aug 18, 2004

Originally posted by: imgod2u
Erm, not it doesn't. IA-64 does not support OoOE.

I'm sure you didn't mean it, but superscalar is not synonymous with OOOE...see the Pentium, UltraSPARC III, Alpha 21164, etc. Itanium does share a lot of features with superscalar that are not present in VLIW: dynamic allocation of instructions to execution units, register scoreboarding, and dynamic branch prediction.

As I understand it, superscalar processors all need to be able to support some form of reordering. If you execute 2 instructions in parallel, where normally they would be sequential (in terms of the code, but not necessarily dependent), then those instructions aren't being executed in program order (as they're supposed to go one after the other, even if they're independent). This would, surely, require some form of tracking to be kept, even if it's not as complex as a re-order window.

It absolutely cannot run instructions which would normally be sequential in parallel (which is what superscalar does).

I don't follow you here...OOOE superscalar cannot arbitrarily execute dependent instructions out-of-order any more than Itanium can. If an instruction stalls in an OOOE processor due to a long latency event such as cache miss, only independent instructions may continue to execute...all dependent instructions stall as well.

Not dependent, but sequential. That is, in program order. Even if instructions are independent, they are still presented in normal assembly programs as a sequence of instructions one after the other. I was pointing out that Itanium doesn't have the ability to take sequential instruction streams and process those instructions in parallel (as it doesn't ever check for dependencies) and relies completely on the compiler to do such things. Thus, it has little, if anything, in common with superscalar designs.

So, basically, you agree that IA-64, while similar in some respects, is not VLIW.

No, as I've pointed out, it does use Very Long Instruction Words. It is VLIW by every meaning of the term. It simply doesn't go the same route as the "traditional" VLIW designs have gone. If you restrict the definition of VLIW to only what other architectures have done with VLIW ISAs, then no, IA-64 isn't VLIW. But that's like saying PPC isn't a RISC processor simply because it doesn't go the same implementation route as other traditional RISC chips.

Sohcan · Aug 18, 2004

Originally posted by: imgod2u

Originally posted by: imgod2u
Erm, not it doesn't. IA-64 does not support OoOE.

I'm sure you didn't mean it, but superscalar is not synonymous with OOOE...see the Pentium, UltraSPARC III, Alpha 21164, etc. Itanium does share a lot of features with superscalar that are not present in VLIW: dynamic allocation of instructions to execution units, register scoreboarding, and dynamic branch prediction.

As I understand it, superscalar processors all need to be able to support some form of reordering. If you execute 2 instructions in parallel, where normally they would be sequential (in terms of the code, but not necessarily dependent), then those instructions aren't being executed in program order (as they're supposed to go one after the other, even if they're independent). This would, surely, require some form of tracking to be kept, even if it's not as complex as a re-order window.

OOOE universally refers to the "second-generation" superscalar processors that can execute instructions that occur later in the program order ahead of earlier ones. The processors I mentioned, among others, are always called in-order.

In-order processors do not need to support re-ordering; they are blocking designs just like Itanium. If an instruction stalls due to a hazard, all subsequent instructions in the program flow stall even if they are independent. Depending on the design, instruction resources hardly even need to be tracked (if at all) while they are in the pipeline, such as with the Pentium. Going to an out-of-order design not only has huge implications on the design of instruction issue and retirement, but also changes the programming model if not handled correctly.
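
To make the blocking behaviour concrete, here is a toy Python model (a sketch under heavy simplifying assumptions, not a description of any real pipeline). It assumes a 2-wide machine, each register written at most once, an unlimited out-of-order window, and no structural hazards. The function name `issue_cycles` and the four-instruction program are invented for the illustration.

```python
def issue_cycles(program, width, in_order):
    """Return the cycle at which each instruction issues.

    program: list of (dst, srcs, latency); a result becomes usable
    `latency` cycles after its producer issues.
    """
    # Registers produced inside the program are unavailable until issued;
    # any other register is treated as ready from cycle 0.
    ready_at = {dst: float("inf") for dst, _, _ in program}
    issued = [None] * len(program)
    cycle = 0
    while None in issued:
        slots = width
        for i, (dst, srcs, latency) in enumerate(program):
            if issued[i] is not None:
                continue
            ok = all(ready_at.get(s, 0) <= cycle for s in srcs)
            if ok and slots:
                issued[i] = cycle
                ready_at[dst] = cycle + latency
                slots -= 1
            elif in_order:
                break          # blocking: nothing younger may pass a stall
        cycle += 1
    return issued


PROGRAM = [
    ("r1", ["r0"], 10),   # long-latency op, e.g. a load that misses cache
    ("r2", ["r1"], 1),    # depends on the load
    ("r3", ["r0"], 1),    # independent of the load
    ("r4", ["r3"], 1),    # independent of the load
]

in_order_result = issue_cycles(PROGRAM, width=2, in_order=True)
out_of_order_result = issue_cycles(PROGRAM, width=2, in_order=False)
```

In the in-order run the two independent instructions wait behind the stalled consumer; in the out-of-order run they slip past it. The dependent instruction issues at the same cycle either way.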

Your reasoning for calling in-order superscalar "OOOE" is that it changes the programming model by allowing otherwise sequential, independent instructions to be executed in parallel. If you're going to use that reasoning, then you're going to have to call a pipelined, scalar design "OOOE" as well, since its instruction-level parallelism technique also changes the architecture's programming model with respect to a non-pipelined, scalar design. Strict program order means not only sequential issue, but also the completion of one instruction before the next is started. Think about how a pipeline changes the programming model in the presence of program interrupts/exceptions and instructions with differing execution latencies.

Read "Architecture of the Pentium Microprocessor" (Donald Alpert and Dror Avnon, IEEE Micro, June 1993), "Tuning the Pentium Pro Microarchitecture" (David Papworth, IEEE Micro, April 1996) and "The MIPS R10000 Superscalar Microprocessor" (Ken Yeager, IEEE Micro, April 1996)...you can probably find them on Google. They give good descriptions of what goes into the implementations of in-order and out-of-order superscalar designs and how they keep the programming model consistent.

It absolutely cannot run instructions which would normally be sequential in parallel (which is what superscalar does).

I don't follow you here...OOOE superscalar cannot arbitrarily execute dependent instructions out-of-order any more than Itanium can. If an instruction stalls in an OOOE processor due to a long latency event such as cache miss, only independent instructions may continue to execute...all dependent instructions stall as well.

Not dependent, but sequential. That is, in program order. Even if instructions are independent, they are still presented in normal assembly programs as a sequence of instructions one after the other. I was pointing out that Itanium doesn't have the ability to take sequential instruction streams and process those instructions in parallel (as it doesn't ever check for dependencies) and relies completely on the compiler to do such things. Thus, it has little, if anything, in common with superscalar designs.

That's simply not true....there's a lot more that goes into building an instruction schedule than just finding data dependencies. Like current superscalar designs, Itanium dynamically resolves resource hazards and control dependencies, something that VLIW designs typically don't do.

In addition, a VLIW design might be non-interlocking...since VLIW exposes the structure of the pipeline to the software, this means that the schedule produced by the compiler needs to account for all data, control and structural hazards that might occur in the pipeline. For example, let's say that a VLIW processor has a 2 cycle latency for multiplies. If a multiply is followed by an instruction that uses its result, that instruction must be scheduled 2 cycles after the multiply. This presents obvious problems if a later implementation has a 3 cycle multiply latency.

Like superscalar designs, Itanium is fully interlocking...no knowledge of the pipeline implementation is required to build a schedule. As in the previous example, an Itanium compiler would optimally schedule the consuming instruction two groups after the multiply (with independent instructions in between). But if a future design had a multiply latency of 1 or 3 cycles, the software would still function...Itanium's scoreboarding would correctly issue the consuming instruction after the multiply had finished executing.
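
The multiply example can be played out in a few lines of Python. This is only a sketch of the idea: a made-up single-issue machine where a multiply result lands in its register `mul_latency` cycles after issue, and the `interlocked` flag stands in for a scoreboard that stalls a reader until its source is ready. The register names, initial values and the `run` helper are all assumptions for the illustration (write-after-write checks are left out).

```python
def run(schedule, mul_latency, interlocked):
    """Execute one schedule entry per cycle and return the final registers.

    Entries are ("mul", dst, a, b), ("add", dst, a, b) or ("nop",).
    Without interlocks, reading a register too early sees the old value.
    """
    regs = {"r1": 3, "r2": 4, "r3": 10, "r4": 0, "r5": 0}
    pending = {}                       # dst -> (ready_cycle, value)
    cycle = 0
    for op in schedule:
        while True:
            # Write back any multiply that has finished by this cycle.
            for dst, (ready, value) in list(pending.items()):
                if ready <= cycle:
                    regs[dst] = value
                    del pending[dst]
            if interlocked and any(src in pending for src in op[2:]):
                cycle += 1             # scoreboard stall, retry next cycle
                continue
            break
        if op[0] == "mul":
            _, dst, a, b = op
            pending[dst] = (cycle + mul_latency, regs[a] * regs[b])
        elif op[0] == "add":
            _, dst, a, b = op
            regs[dst] = regs[a] + regs[b]
        cycle += 1
    for dst, (_, value) in pending.items():   # drain anything still in flight
        regs[dst] = value
    return regs


# Schedule built for a 2-cycle multiply: the consumer sits 2 cycles later.
SCHEDULE = [
    ("mul", "r4", "r1", "r2"),
    ("nop",),
    ("add", "r5", "r4", "r3"),
]

as_compiled = run(SCHEDULE, mul_latency=2, interlocked=False)["r5"]
slower_no_interlock = run(SCHEDULE, mul_latency=3, interlocked=False)["r5"]
slower_interlocked = run(SCHEDULE, mul_latency=3, interlocked=True)["r5"]
```

With the latency the schedule was built for, the add sees the product. Bump the latency to 3 without interlocks and the add silently reads the stale `r4`; with the scoreboard it just stalls a cycle and still gets the right answer.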

Sahakiel · Aug 19, 2004

Originally posted by: imgod2u
No, as I've pointed out, it does use Very Long Instruction Words. It is VLIW by every meaning of the term. It simply doesn't go the same route as the "traditional" VLIW designs have gone. If you restrict the definition of VLIW to only what other architectures have done with VLIW ISAs, then no, IA-64 isn't VLIW.

EPIC is not VLIW by every meaning of the term. It uses VLIW-formatted instructions, which is where you're getting stuck. It exposes almost as much of the hardware as VLIW, which is where you're getting confused. It's not as if every processor that uses 32-bit operands with 6-bit opcodes is guaranteed to be the same processor.
VLIW processors expose the entire processor to the software. EPIC exposes a lot, far more than superscalar, but it still allows for binary compatibility across generations. Programs written for a VLIW processor cannot be run on a different processor with a different number of execution engines. At the same time, EPIC reserves the ability to issue multiple instruction bundles (or VLIWs) at once, whereas VLIW processors cannot because it would literally violate program robustness.
VLIW processors also do not share execution units. Each functional unit corresponds to one instruction slot in a VLIW. VLIW ISAs always assume that all functional units are used during each clock cycle. By implication, all instructions in one VLIW must be independent of the others. If the compiler cannot find enough independent instructions, it must insert no-ops. There is no other recourse.
EPIC allows for flexibility here by specifying instruction bundle templates with only three instruction fields, implementing the stop bit, and issuing instructions through ports connected to multiple execution units. The compiler is less likely to insert no-ops with only three instructions per VLIW and the ability to issue multiple instruction types in the same instruction field.
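
A quick way to see the no-op padding is a toy packer in Python. This is a deliberately naive sketch: it keeps program order, treats every slot as able to hold any op (real VLIW slots are tied to specific functional units), and only checks read-after-write and write-after-write within a word. The `pack_vliw` name and the four ops are invented for the example.

```python
def pack_vliw(ops, width):
    """Greedily pack ops, kept in program order, into fixed-width words.

    ops: list of (name, dst, srcs). An op may join the current word only
    if it neither reads nor writes a register written earlier in that word.
    """
    words, current, written = [], [], set()
    for name, dst, srcs in ops:
        conflict = dst in written or any(s in written for s in srcs)
        if conflict or len(current) == width:
            # Close the word, filling unused slots with no-ops.
            words.append(current + ["nop"] * (width - len(current)))
            current, written = [], set()
        current.append(name)
        written.add(dst)
    if current:
        words.append(current + ["nop"] * (width - len(current)))
    return words


OPS = [
    ("a", "r1", ["r0"]),
    ("b", "r2", ["r1"]),        # needs a
    ("c", "r3", ["r0"]),        # independent of a and b
    ("d", "r4", ["r2", "r3"]),  # needs b and c
]

wide = pack_vliw(OPS, width=4)
narrow = pack_vliw(OPS, width=3)
wide_nops = sum(word.count("nop") for word in wide)
narrow_nops = sum(word.count("nop") for word in narrow)
```

Both widths need three words for this dependency chain, but the 4-wide format wastes more slots on padding than the 3-wide one, which is the effect described above.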

But that's like saying PPC isn't a RISC processor simply because it doesn't go the same implementation route as other traditional RISC chips.

Funny you should mention that..
