General purpose CPUs, pipelining sweet spot?

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
Do you people who work with this stuff think there is a special sweet spot for how deep pipelines will become, assuming CPUs continue along the same principles as they do today?

Say, when (if) the Pentium 10 comes along, will it continue the trend of deeper pipes with a 150-stage pipeline, or will Intel, AMD, etc. have to just stop at some point and either rethink how CPUs work completely, or just gun for higher ILP/TLP to achieve better performance?
 

AndyHui

Administrator Emeritus | Elite Member | AT FAQ M
Oct 9, 1999
13,140
6
81
I remember Sohcan or Patrick mentioning an article about pipelines 40-60 stages deep a little while ago.

That would be interesting to see.
 

sao123

Lifer
May 27, 2002
12,648
201
106
Pipelines are extremely efficient for parallel computing. However, pipelines become less efficient when jumps and loops are involved.
Every time a jump or loop occurs, the pipe must be flushed and refilled before computing can continue. Therefore the rule should be to keep pipelining to a happy medium: too little and you lose performance gains, too deep and you're always flushing. 4-6 stages is probably the deepest optimum pipe.
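To put rough numbers on that (a toy calculation, not real data; the branch and mispredict rates below are invented), in C:

#include <stdio.h>

/* Toy model: a mispredicted branch costs roughly one full pipe's
   worth of cycles, so effective CPI grows with depth.
   Rates are assumed, not measured. */
int main(void) {
    double branch_rate = 0.20;  /* assumed: 1 in 5 instructions branches */
    double mispredict  = 0.10;  /* assumed mispredict rate */
    for (int depth = 4; depth <= 32; depth *= 2)
        printf("depth %2d: effective CPI %.2f\n",
               depth, 1.0 + branch_rate * mispredict * depth);
    return 0;
}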

This is one reason hardware companies claim advanced pipelining features make newer processors so much faster. In a sense they have the potential to be faster... but machine code has to be optimized during compiling and linking to allow the program to take advantage of the pipelined features.

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: AndyHui
I remember Sohcan or Patrick mentioning an article about pipelines 40-60 stages deep a little while ago.

That would be interesting to see.

It was pm, and he cited an article that said, IIRC, that 56 stages is the max for x86, but he did say it was an old article.
 

SuperTool

Lifer
Jan 25, 2000
14,000
2
0
Also keep in mind that power becomes a concern with these deep pipelines. The deeper the pipeline, the smaller (hopefully) each stage is, the faster the clock, and the more power you use in the clock. You have to charge and discharge the whole clock tree once every cycle, and that's a lot of power.
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
While bandwidth restrictions can always be overcome by making the bus wider or adding extra buffers between the CPU and memory (ack! this makes latency even worse, but will eventually probably be necessary), latency really cannot: there is a limit to how fast a signal can travel from the memory to the CPU core without the memory actually being placed inside the CPU. The latency is already pretty extreme. PC2700 DDR SDRAM takes about 70ns for the first read in a burst; on a 2GHz processor that's an incredible 140 CPU cycles spent waiting before the CPU receives the first byte to start processing. RDRAM latency is even worse.
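The arithmetic behind that 140-cycle figure is just latency times clock rate; a quick check in C (the 3GHz row is my extrapolation, not a measured figure):

#include <stdio.h>

/* stall cycles = DRAM latency (ns) x CPU clock (GHz);
   the 70ns latency figure is from the post above */
int main(void) {
    double latency_ns = 70.0;             /* first word of a PC2700 burst */
    double clocks[] = { 1.0, 2.0, 3.0 };  /* GHz; 3.0 is a guess at "end of year" */
    for (int i = 0; i < 3; i++)
        printf("%.1f GHz: %.0f cycles\n", clocks[i], latency_ns * clocks[i]);
    return 0;
}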

Longer pipelines make the CPU more bandwidth hungry to get the same performance (at least from bandwidth getting eaten by speculated execution paths that don't get taken), but make it less susceptible to ill effects from higher latency (which is why the P4 is a good candidate for RDRAM).

The benefit of adding extra CPUs is also more dependent on bandwidth than latency. I think future CPUs will be heavily hyperthreaded in order to keep gaining performance (i.e. perhaps your CPU will have 8-16 copies of each register to simulate an 8-16 processor system, not just 2 of each register), and not just the ones from Intel. Common software applications will be rewritten to be massively multithreaded to take advantage of this; e.g. when you zip a file it might start 10 threads, each working on a section of the file, rather than just using 1 compression thread on a multiprocessor system. Whether the pipelines get longer, with instruction fetches for each thread in larger bursts for fewer high-latency random accesses from all the virtual CPUs, or become shorter and simpler to fit more of each type of processing unit on the die, remains to be seen.

Really, the only long-term options I see to keep CPU performance increasing are virtual CPUs (hyperthreading) or incredibly large low-latency on-chip caches (most likely a lot more expensive).
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
In normal cases, where the overall design is not limited by any particular ISA, pipelining seems to be optimal at around 6-8 stages. With an x86 MPU, however, there is the limitation of decoding only 3 x86 instructions per clock. This makes clockspeed extremely important, as you'll get to a point where you can't get any higher IPC (3 x86 instructions per clock). From a design point of view, trying to get closer and closer to 3 x86 instructions per clock is a tedious job and requires a redesign of the processor every time you increase the IPC. The better solution is to just maintain a high enough IPC but make it so that the clockspeed can scale very high. This is the reasoning behind extremely deep pipelining in the x86 world. As long as you maintain minimal loss due to branch mispredicts (prediction algorithms do this quite well), you can up the clockspeed significantly with a 20-stage design, and even more with a longer integer pipeline and an even longer FP pipeline. This would allow much higher clockrates, and further optimizations to the prediction algorithms as well as hyperthreading would help feed and maintain the pipeline.
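To see why that trade can win, here is the perf = IPC x clock arithmetic with illustrative numbers (all values invented, not measurements of any real chip):

#include <stdio.h>

/* A deeper pipe that gives up some IPC can still come out ahead
   if it clocks high enough. All values are assumed for illustration. */
int main(void) {
    double ipc_short = 2.5, ghz_short = 1.2;  /* assumed shallow design */
    double ipc_deep  = 1.9, ghz_deep  = 2.4;  /* assumed deep (20-stage) design */
    printf("shallow: %.2f G-instr/s\n", ipc_short * ghz_short);
    printf("deep:    %.2f G-instr/s\n", ipc_deep  * ghz_deep);
    return 0;
}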
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Why can you only decode 3 instructions per clock? I realize that with variable-length instructions it could be VERY difficult to decode more, but why do you say 3?
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
The reason only 3 x86 instructions are decoded per clock is the data dependencies and branches that appear in almost all x86 code. This is not to say that, if you really tried, you couldn't have more decoding units, and maybe 10% of the time that extra 4th decoding unit would actually do work instead of standing there idle. But the optimum decode width for x86 code is 3 instructions per clock: just enough that you can usually find 3 instructions to decode and execute separately or right after each other, and not so much that extra die space is wasted.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
Maybe I should mention I wasn't specifically asking about x86 CPUs, but rather CPUs in general, including various RISCs, VLIWs, etc.
 

rummyPPG

Member
Dec 23, 2001
48
0
0
Okay, so I remember my professor talking about this, and I went back to review his notes online, but I don't remember exactly what he was talking about and would need to review my own notes to find out. Basically, when adding more stages to a pipeline, you add more latches (per gate delay within a stage), which somehow affects the overhead of having stalls (for jumps, branches, etc.); this is the part I'm fuzzy on. From there you can look at adding stages to get higher throughput, but at some point the overhead starts taking over; in other words, represent the throughput quantitatively in a formula and find the max.
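Something like this sketch of the "find the max" exercise (all constants are invented; real models are more careful about where the flush penalty comes from):

#include <stdio.h>

/* Cycle time = logic delay / depth + latch overhead; each hazard
   flushes roughly `depth` stages. Throughput = 1 / (cycle time x CPI).
   t_logic, t_latch and the hazard rate are made-up numbers. */
int main(void) {
    double t_logic = 10.0;   /* assumed: total logic delay through the machine, ns */
    double t_latch = 0.5;    /* assumed: latch/skew overhead per stage, ns */
    double hazard  = 0.05;   /* assumed: flushes per instruction */
    int best_n = 1;
    double best_tp = 0.0;
    for (int n = 1; n <= 40; n++) {
        double cycle = t_logic / n + t_latch;
        double cpi   = 1.0 + hazard * n;       /* flush cost grows with depth */
        double tp    = 1.0 / (cycle * cpi);
        if (tp > best_tp) { best_tp = tp; best_n = n; }
    }
    printf("best depth: %d stages (%.2f instr/ns)\n", best_n, best_tp);
    return 0;
}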

I'm pretty sure we determined this happens at 7-10 stages, but that depends on what exactly a pipeline stage means for the particular project. I mean, the PIV has 2 of its 20 pipeline stages devoted just to getting signals across the chip, because the wire delays don't scale down well; perhaps those don't add as much latch overhead, since there isn't any gate-delay logic within the stage?! I don't really know; I'll see if I can find more notes.
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
Originally posted by: imgod2u
the optimum decoding for x86 code is 3 instructions per clock.
We're already past the point where we care about die-space efficiency for decoders. It seems to me that we can pack enough transistors on a die nowadays that even if the fourth decoder you mention only gets 10% usage, it's perfectly acceptable.

But in a more general sense, you're right. With all the branching in average code, having many more decoders than we already do is pointless even if we had die space to burn.

On a different note, I've read that research done in the early 90s said that, given the quality of branch prediction then extant, the approximate max useful length of a RISC pipeline for normal code would be about 14 stages. The x86 has to spend some extra stages decoding because of its complex instructions, and branch prediction is somewhat better now, so Intel can get away with having 20 stages on the P4.

I was thinking, though, that if you had enough complete speculative execution units, you could attain a longer pipeline. The way it works is that instead of relying on branch prediction for the first branch, you'd just execute all possible branch paths. Unfortunately, this still doesn't relieve data dependencies, so you wouldn't be able to take this idea very far.
 

rummyPPG

Member
Dec 23, 2001
48
0
0
Emailed my prof, and he gave me some published papers to look at. These, while lengthy, should answer your questions:

http://systems.cs.colorado.edu/ISCA2002/FinalPapers/hartsteina_optimum_pipeline_color.pdf
http://systems.cs.colorado.edu/ISCA2002/FinalPapers/hrishikeshm_optimal_revised.ps
http://systems.cs.colorado.edu/ISCA2002/FinalPapers/Deep%20Pipes.pdf

Hope this helps

Edit: Kind of a side note, but I was rereading this thread and realized that some people had the misconception that pipelining helps with parallel applications. This is simply not true. In a pipeline, instructions are still executed in order, overlapped but not in parallel: while it is possible for some to be resolved before the end of the pipeline (namely branches and jumps), no instruction gets to the end of the pipe before one that was issued before it.

What pipelining DOES do is increase throughput by reusing hardware that would otherwise be idle. In other words, pipelining is analogous to starting the next load of clothes in the washer while the one ahead is in the dryer. You aren't doing them in parallel, since the clothes you started washing later will be done later (they have to wait until the dryer is free); i.e. no two loads of clothes are simultaneously washing.

An analogy for parallel computing (either a superscalar processor or multiple processors) would be to have more than one washer/dryer set, which would let you operate on two loads of wash simultaneously, and thus be parallel.
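The laundry math, for anyone who wants to see the throughput gain (the times are made up):

#include <stdio.h>

/* Pipelining doesn't finish any single load sooner; it overlaps the
   washer and dryer so the total time for many loads drops. */
int main(void) {
    int loads = 10, wash = 30, dry = 40;  /* minutes, invented */
    int sequential = loads * (wash + dry);
    int pipelined  = wash + loads * (dry > wash ? dry : wash); /* bottleneck stage sets the rate */
    printf("sequential: %d min, pipelined: %d min\n", sequential, pipelined);
    return 0;
}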
 

sao123

Lifer
May 27, 2002
12,648
201
106
While your analysis of parallel computing is correct, it is not the subject matter I was discussing.

When software is written for parallel pipelined computing, that means the assembly language is optimized in such a way as to prevent pipe stalls. There are two main ways to stall a pipe: 1) branching, 2) serial (dependent) assembly instructions.

Parallel code is when a pipelined instruction does not have to depend on another pipelined or currently executing instruction for data.
Example:

Parallel:
D=1+2
C=B+A
F=E+G
A=2
All of these instructions are independent and do not cause a pipe stall.

Serial:
A=1
B=A+1
C=A+B
D=3

Instruction 3 cannot execute until instructions 1 & 2 have finished, thus causing a pipe stall: the pipe must sit and wait (inserting bubbles) until the operands are ready.
Now the cool thing here is that certain architectures can execute parallel instructions out of order. Thus instruction 4 may execute before instruction 3, giving instruction 2 more time to complete before instruction 3 is attempted.
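In C terms, that reordering is the same thing a scheduling compiler does statically. A trivial illustration (hypothetical scheduling, using the serial example above):

#include <stdio.h>

int main(void) {
    int a, b, c, d;
    a = 1;
    b = a + 1;  /* depends on a */
    d = 3;      /* independent: moved here to fill the stall slot before c needs b */
    c = a + b;  /* depends on a and b */
    printf("%d %d %d %d\n", a, b, c, d);
    return 0;
}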

My point was, the level of compiler optimization will help determine the most effective pipe depth.
The Pentium 3 was piped at a depth of 12, the AMD K6 at 10.
The P4 is at 20. Note that at first many people didn't find any speed advantage in the new P4 20-stage pipe, because the software wasn't optimized for it yet: code optimized for a 12-stage pipe had its branches and dependent instructions too close together for a 20-stage pipe. Lots of pipe stalls = bad performance.

See relevant articles here:
http://www.chipcenter.com/eexpert/code-opt/dgilbertC001.html
http://www.aceshardware.com/Spades/read.php?article_id=113
http://www.aceshardware.com/Spades/read.php?article_id=50
 

glugglug

Diamond Member
Jun 9, 2002
5,340
1
81
Random memory access times that equate to 100+ CPU cycles will cause all pipelines used by a logical execution thread to stall no matter how good your compiler optimizations are. And high end P4s will have memory access times exceeding 200 clock cycles before the end of this year.

Deeper pipelining will help reduce the performance hit of waiting for L2 cache because while one instruction waits on a data fetch, instructions in the other stages of the pipe can continue processing. But for the higher latency fetches from off the chip, we are beyond the point where that is significant.
 

rummyPPG

Member
Dec 23, 2001
48
0
0
Originally posted by: glugglug
Deeper pipelining will help reduce the performance hit of waiting for L2 cache because while one instruction waits on a data fetch, instructions in the other stages of the pipe can continue processing.

Actually, that's not true at all. In a plain pipeline, nothing behind a stalled instruction can process while that instruction waits. True, in a pipeline you could have the request for the memory access happen early in the pipe and then use the result later to hide the latency, but if the memory isn't ready when the instruction in the pipeline absolutely needs it, then everything behind it stalls. Namely, no other instruction behind it will get to any stage in the pipeline at or past the stalled instruction.

What you are describing is out-of-order execution, which, while aiming at some of the same goals as pipelining (efficient use of execution units, higher throughput), is not pipelining. For example, you can have an out-of-order processor that uses a pipelined integer adder. An instruction following an add could complete before that add, thus hiding latency if the add is waiting on memory or other instructions. However, that is a property of the out-of-order execution engine; an additional addition instruction entering the pipelined integer adder could not finish before a previous addition instruction already in there.

It is really analogous to a pipe in a straight line: within the pipe nothing can pass anything else, in either direction, but more than one thing can be in there at once. Hope this helps.
 

sao123

Lifer
May 27, 2002
12,648
201
106
The workflow of an x86 chip that is based on the P6 microarchitecture proceeds as follows:
Instructions come into the Fetch/Decode Unit in program order.
Here, they are broken down into micro-ops, which are sent to the Re-Order Buffer.
From the Re-Order Buffer, they are taken (in any order) into the Reservation Station of the Dispatch/Execute Unit.
Once in the Reservation Station, they pass through any of five ports to an available functional unit, where they are executed.
The results are then sent back to the Re-Order Buffer, where the Retire Unit pulls the instruction out and retires it in program order once all the conditions of the instruction's execution are met.

The Processor Pipeline Stages - Intel P3
There are a total of 12 pipeline stages that an instruction must proceed through before it is completed. They are as follows: two Branch Prediction stages, three Instruction Fetch stages, two Instruction Decode stages, one Register Allocation stage, one Re-order Buffer Read stage, one Reservation Station stage, one Re-Order Buffer Write-back stage, and one Register Retirement File stage.

The first nine stages work with instructions in program order, and feed the results to the other stages, which can work with the instructions out of order. The Branch Prediction stages, when properly filled, can avoid processor pipeline stalls arising from branches in code. This is accomplished by use of a Branch Target Buffer (BTB) and a Return Stack Buffer (RSB). The BTB holds the address that the Program Counter needs to jump to when a branch is executed, and the RSB keeps track of the address that the Program Counter is to return to when the branch is completed.

The most important issue that affects the efficiency of the Instruction Fetch stages is the alignment of instructions in memory. The Instruction Decode stages are where the breakdown of instructions into micro-ops occurs. It should be noted that not all decoders are capable of handling long instructions; this is where some slowdown can occur. There are one complex and two simple decoders, which can handle instructions of up to four micro-ops and one micro-op each, respectively. The complex instructions are broken down according to the Microcode Instruction Sequencer, which is part of the Fetch/Decode Unit.

The Register Allocation stage of the pipeline is where operations concerning the register renaming take place. This feature is one way of reducing data dependencies when speculative execution is being used. This is even more important in a system such as the P6 that is capable of out-of-order execution.

The Re-Order Buffer stage marks the end of the in-order procession of instructions through the pipeline. From here, the processor can "look ahead" as much as 20 to 40 instructions in the program flow to execute instructions that have data and/or functional units available. This is done to keep the processor busy and to avoid stalls in the pipeline.

The Reservation Station stage is important since this is where instructions are scheduled for execution. The flow of instructions through the ports of the Reservation Station is dependent on the availability of the proper data and functional units. By paying close attention to which functional units are in use at any given time, and what data are in the cache and registers, the speed of program execution can be increased. It would really behoove the programmer to be aware of what specific instructions each functional unit is responsible for, since this can give us a means of "timing" the availability of functional units.

After execution, completed instructions are sent back to the Re-Order Buffer, where the Retire Unit will pull these completed instructions out for retirement in program order. Some slowdown can occur here because, before any given instruction can be retired, all the instructions that came before it in program order must be completed and retired. This means that the out-of-order execution capability is only as effective as the Re-Order Buffer allows. The number of instructions it can hold is our "window of opportunity" for increasing the speed of the program execution.

Locality of Reference is not something that "just happens"; it is "built" by the programmer. The instruction set capability of the Pentium III processor makes this task a bit easier, with the addition of pre-fetching capabilities, but careful analysis of instruction and data flow is a must.

The most important factors here to keep track of would be such things as:

How many cache lines are available?
How many outstanding transactions do we have?
Is the data that will be needed already in the L1 cache?
Can we take advantage of the Pentium III's cache control instructions?


There are some phenomena that, if left unchecked, will render the architectural enhancements of the P6 series of processors ineffective. Among these are branch mispredictions, memory misalignments, poor organization of instructions and/or data in memory, inefficient instruction scheduling, and lack of inherent parallelism in our code.

Branches that are not correctly predicted using the P6's mechanisms cause a high number of processor cycles to be wasted due to avoidable pipeline flushing and filling, and execution of unnecessary code. Memory misalignments also need extra cycles to be resolved. Poor organization of data and/or instructions in memory causes cache misses and inherently costly memory accesses that could have been avoided. Poorly scheduled instructions waste the advantage given us by the Reservation Station's five ports and multiple execution units. Lack of inherent parallelism fails to take advantage of the pipelining and decoupled instruction fetching/retirement.

One way to take advantage of the out-of-order execution and superscalar vector processing capabilities is to unroll loops within a segment of code. This allows the processor to execute instructions in an order that is more appropriate for the available resources and therefore execute the routine in fewer clock cycles by filling more functional units per cycle. Making use of "software pipelining" has a similar effect.
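A minimal sketch of that unrolling in C (hand-rolled here for illustration; a good compiler can often do this itself):

#include <stdio.h>

/* 4x-unrolled sum: four independent accumulators expose four adds
   per iteration for the out-of-order core to overlap. */
double sum_unrolled(const double *x, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)   /* leftover elements */
        s += x[i];
    return s;
}

int main(void) {
    double x[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    printf("%.0f\n", sum_unrolled(x, 10));  /* 55 */
    return 0;
}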

Try looking at the assembly "dump" to see why the compiler might not be creating the most efficient code possible for that application. If you see what amounts to a string of NOPs, then look for a data dependency that may be alleviated by rearranging your code in a certain place.

Do not issue consecutive instructions that attempt to use the same port on the Reservation Station. This means you will need to be aware of what functional unit is responsible for executing a particular instruction, and what port that functional unit is attached to. Ports 0 and 1 are typically heavily used, so be aware of this.

When working with floating-point numbers, try to avoid underflow exceptions by rounding to zero when possible. Exceptions can cause instructions to not be retired, which means that your Re-Order Buffer is using space for instructions that have already been executed but cannot be retired.
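One concrete way to get that round-toward-zero-on-underflow behavior in SSE code is the flush-to-zero mode. The intrinsic below is real, but treat the snippet as a sketch; it only affects SSE arithmetic (not x87), so on 32-bit targets you'd compile with something like -msse -mfpmath=sse:

#include <stdio.h>
#include <xmmintrin.h>  /* SSE control intrinsics, available since the P3 */

int main(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);  /* denormal results become 0 */
    volatile float tiny = 1e-30f;
    volatile float r = tiny * 1e-10f;  /* 1e-40 would be denormal; now flushed to 0 */
    printf("%g\n", r);
    return 0;
}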

Ensure that each CALL instruction has a matching RET instruction to take advantage of the Return Stack Buffer. This will help to increase the efficiency of the processor's branch prediction capabilities. The P6, in addition to its dynamic branch prediction, also uses a simple static branch prediction algorithm. This algorithm assumes that:

Unconditional branches are to be taken,
Backward conditional branches are to be taken, and
Forward conditional branches are not taken.
If you write your code to follow this static algorithm, then efficiency will increase. Also try to eliminate branches from your code wherever possible, since the BTB has a limited number of entries.
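For example (illustrative C; the function is hypothetical), code laid out to match those rules makes the loop's conditional branch backward (taken) and the rare error check a forward branch (not taken):

#include <stdio.h>

int process(const int *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {  /* backward branch: taken on every iteration but the last */
        if (buf[i] < 0)            /* forward branch: rarely taken on the common path */
            return -1;             /* error path kept out of line */
        sum += buf[i];
    }
    return sum;
}

int main(void) {
    int buf[4] = {1, 2, 3, 4};
    printf("%d\n", process(buf, 4));
    return 0;
}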

The alignment of data is important since instructions are fetched in blocks of 16 bytes, and cache lines are 32 bytes in size. Data that are not properly aligned will cost unnecessary clock cycles. Note that data blocks of different sizes (8 bits or 16 bits or 32 bits, etc.) will have a different alignment boundary. The clock-cycle penalties are caused by a split occurring in a cache line, or data being fetched over multiple cycles instead of a single cycle. You will need to check your compiler's output carefully for data alignment, or you may want to use assembly code for more control.
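In C you can force the alignment rather than hoping the compiler gets it right. A sketch: alignas is the C11 spelling, and compilers of this era would use something like GCC's __attribute__((aligned(32))) instead:

#include <stdio.h>
#include <stdint.h>
#include <stdalign.h>

int main(void) {
    alignas(32) char buf[64];  /* starts on a 32-byte cache-line boundary */
    printf("%p aligned to 32? %s\n", (void *)buf,
           ((uintptr_t)buf % 32 == 0) ? "yes" : "no");
    return 0;
}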

If you recall the functionality of the Fetch/Decode Unit, you will realize that instruction scheduling within the program will be important. If you pair two "simple" instructions with an instruction that breaks down into two to four micro-ops, it is possible to send all three instructions into the pipeline at once; otherwise, efficiency will be lost. Try to avoid very long instructions (more than four micro-ops), but don't "overdo it" with the single micro-op instructions, either.

The trick to all of this is that the conditions illustrated in these guidelines are not isolated; they interact with each other. For instance, you shouldn't schedule two simple instructions together if they have a data dependency.
 

rummyPPG

Member
Dec 23, 2001
48
0
0
That was a fairly good, albeit long, description of an out-of-order superscalar processor. However, to stay on topic, I think this question should be asked:

Did Sunner want to know the optimal length for pipelining in the strictest sense (i.e. direct extensions of the original MIPS 5-stage pipeline)? Or was he asking about "pipelining" as in the PIV's 20 "pipeline" stages, which really means a max of 20 clock cycles (with no stalls, cache misses, data dependencies, etc.) for an instruction to be retired? I mean, saying a PIV has a 20-stage pipeline is misleading by the original definition, but who knows, maybe we're seeing a redefinition in the process. The distinction makes a big difference in what the pros and cons of a longer "pipeline" are.
 

sao123

Lifer
May 27, 2002
12,648
201
106
I think the real question is:
Is there an optimum pipe depth for computer hardware, where you get maximum performance with minimum penalty?

The answer is no:
It depends on which architecture you are using: RISC vs. CISC, x86 vs. SPARC vs. A64, etc.
It depends on what type of hardware optimizations you can use along with the pipeline in the CPU (out-of-order execution, caching, etc.).
It depends on how well you can optimize the binary-coded instructions during compilation of the software to prevent stalls.


 

ynotravid

Senior member
Jun 20, 2002
754
0
0
Correct me 'cuz I'm wrong, but:

There can't be a "general" sweet spot for pipeline depth. The effectiveness of the pipeline is determined by the optimization of the code (branch misprediction, etc...), and different CPUs are used for different tasks (e.g. modeling nuclear explosions, playing Nuke, etc...). So doesn't that mean that, unless you are more specific about what kind of CPU and what kind of task you are trying to accomplish, you cannot start talking about what the optimal depth for pipelines would be?

Go ahead, I'm tough skinned.
 

Sunner

Elite Member
Oct 9, 1999
11,641
0
76
Ok, just x86 as the example then.

For just running Windows, playing Doom X, etc.
Or as I stated in the first post, do you think there will ever be a Pentium something with a pipeline 150 stages deep?

Or an Itanium something for that matter, should Intel choose to move it into the mainstream anytime soon.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
IA-64 is a completely different ISA and does not have the limitations of x86 as far as data dependencies, etc. go, so it can fetch and decode (without much waste of the decoding units) many more than just 3 instructions per clock. There the optimal pipeline length would be around 7-9 stages.

What I'm wondering is: with Hyperthreading enabled, would Intel be able to bypass the decode limitation of x86 code? After all, 2 streams of instructions from 2 different threads aren't going to have dependencies on each other, so they can be decoded independently. If so, Intel can afford to put 6 decoding units on Prescott, since HT will be enabled on it and they won't be a total waste of die space. A 6-way superscalar processor, just imagine......
 

ynotravid

Senior member
Jun 20, 2002
754
0
0
Sunner, I have no knowledge of how branch prediction algorithms are implemented, but I do know that every time we think we know what's going on, some geek comes up with a totally revolutionary way of doing things. So as soon as we finally realize that 32 stages is the perfect depth for our CPU, we're going to have to completely rethink our architecture.

P.S. Dang, sao123, I have textbooks shorter than that.
 

jasonroehm

Member
Dec 1, 2001
97
0
0
Branch predictors are getting to the point where they really can't get much better; some of the more advanced ones are accurate over 95% of the time. The challenge, though, is balancing the algorithms against the die space they take up, as the more accurate ones can consume a lot of space. I'm not sure what kind of algorithm the Athlon or P4 uses, though; I'm sure it's tightly guarded proprietary information.
 