The workflow of an x86 chip that is based on the P6 microarchitecture proceeds as follows:
Instructions come into the Fetch/Decode Unit in program order.
Here, they are broken down into micro-ops, which are sent to the Re-Order Buffer.
From the Re-Order Buffer, they are taken (in any order) into the Reservation Station of the Dispatch/Execute Unit.
Once in the Reservation Station, they pass through one of five ports to an available functional unit, where they are executed.
The results are then sent back to the Re-Order Buffer, where the Retire Unit pulls each instruction out and retires it, in program order, once all the conditions of the instruction's execution are met.
The Processor Pipeline Stages - Intel P3
There are a total of 12 pipeline stages that an instruction must proceed through before it is completed. They are as follows: two Branch Prediction stages, three Instruction Fetch stages, two Instruction Decode stages, one Register Allocation stage, one Re-Order Buffer Read stage, one Reservation Station stage, one Re-Order Buffer Write-back stage, and one Register Retirement File stage.
The first nine stages work with instructions in program order, and feed the results to the other stages, which can work with the instructions out of order. The Branch Prediction stages, when properly filled, can avoid processor pipeline stalls arising from branches in code. This is accomplished by use of a Branch Target Buffer (BTB) and a Return Stack Buffer (RSB). The BTB holds the address that the Program Counter needs to jump to when a branch is executed, and the RSB keeps track of the address that the Program Counter is to return to when the branch is completed.
The most important issue that affects the efficiency of the Instruction Fetch stages is the alignment of instructions in memory. The Instruction Decode stages are where the breakdown of instructions into micro-ops occurs. It should be noted that not all decoders are capable of handling long instructions; this is where some slowdown can occur. There are one complex and two simple decoders, which can handle instructions of up to four micro-ops and one micro-op each, respectively. The complex instructions are broken down according to the Microcode Instruction Sequencer, which is part of the Fetch/Decode Unit.
The Register Allocation stage of the pipeline is where operations concerning the register renaming take place. This feature is one way of reducing data dependencies when speculative execution is being used. This is even more important in a system such as the P6 that is capable of out-of-order execution.
The Re-Order Buffer stage marks the end of the in-order procession of instructions through the pipeline. From here, the processor can "look ahead" as many as 20 to 40 instructions in the program flow to execute instructions that have data and/or functional units available. This is done to keep the processor busy and to avoid stalls in the pipeline.
The Reservation Station stage is important since this is where instructions are scheduled for execution. The flow of instructions through the ports of the Reservation Station is dependent on the availability of the proper data and functional units. By paying close attention to which functional units are in use at any given time, and what data are in the cache and registers, the speed of program execution can be increased. The programmer should therefore know which specific instructions each functional unit is responsible for, since this provides a means of "timing" the availability of functional units.
After execution, completed instructions are sent back to the Re-Order Buffer, where the Retire Unit will pull these completed instructions out for retirement in program order. Some slowdown can occur here because, before any given instruction can be retired, all the instructions that came before it in program order must be completed and retired. This means that the out-of-order execution capability is only as effective as the Re-Order Buffer allows. The number of instructions it can hold is our "window of opportunity" for increasing the speed of the program execution.
Locality of Reference is not something that "just happens"; it is "built" by the programmer. The instruction set capability of the Pentium III processor makes this task a bit easier, with the addition of pre-fetching capabilities, but careful analysis of instruction and data flow is a must.
The most important factors to keep track of here include:
How many cache lines are available?
How many outstanding transactions do we have?
Is the data that will be needed already in the L1 cache?
Can we take advantage of the Pentium III's cache control instructions?
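The last question above can be answered in the affirmative with SSE's software prefetch hint, which the Pentium III introduced. The sketch below is a hypothetical illustration, not measured code: the function name and the `PREFETCH_DIST` value are assumptions chosen for this example (here, two 32-byte P6 cache lines ahead), and real tuning would require profiling.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_prefetch (Pentium III and later) */

/* PREFETCH_DIST is a tuning guess for this sketch, not a measured value. */
#define PREFETCH_DIST 64  /* bytes: two 32-byte P6 cache lines ahead */

/* Sum an array while hinting the hardware to pull future cache lines
 * into L1 (the T0 hint).  Prefetch is only a hint, so running past the
 * end of the array is harmless. */
float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)&data[i] + PREFETCH_DIST, _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```

Because the prefetch is advisory, the function's result is identical to a plain summation loop; the difference is only in how early data arrives in the cache.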
There are some phenomena that, if left unchecked, will render the architectural enhancements of the P6 series of processors ineffective. Among these are branch mispredictions, memory misalignments, poor organization of instructions and/or data in memory, inefficient instruction scheduling, and lack of inherent parallelism in our code.
Branches that are not correctly predicted using the P6's mechanisms waste a high number of processor cycles on avoidable pipeline flushing and filling, and on execution of unnecessary code. Memory misalignments also need extra cycles to be resolved. Poor organization of data and/or instructions in memory causes cache misses and inherently costly memory accesses that could have been avoided. Poorly scheduled instructions waste the advantage given by the Reservation Station's five ports and multiple execution units. Lack of inherent parallelism fails to exploit the pipelining and the decoupled instruction fetching/retirement.
One way to take advantage of the out-of-order execution and superscalar vector processing capabilities is to unroll loops within a segment of code. This allows the processor to execute instructions in an order that is more appropriate for the available resources and therefore execute the routine in fewer clock cycles by filling more functional units per cycle. Making use of "software pipelining" has a similar effect.
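The idea above can be sketched in C. This is an illustrative example, not the only way to unroll: the function name is hypothetical, and the key point is the four *independent* accumulators, which break the serial dependency a single `sum` variable would impose and let the out-of-order core fill more functional units per cycle.

```c
/* Dot product, unrolled by four with independent accumulators so the
 * out-of-order core can keep several functional units busy per cycle. */
float dot_unrolled(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {      /* four independent multiply-adds */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)               /* remainder loop for n not divisible by 4 */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);    /* pairwise combine at the end */
}
```

Note that floating-point addition is not associative, so the unrolled version may round slightly differently than a straight loop; for most optimization work that trade-off is acceptable and should simply be made knowingly.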
Try looking at the assembly "dump" to see why the compiler might not be creating the most efficient code possible for that application. If you see what amounts to a string of NOPs, then look for a data dependency that may be alleviated by rearranging your code in a certain place.
Do not issue consecutive instructions that attempt to use the same port on the Reservation Station. This means you will need to be aware of what functional unit is responsible for executing a particular instruction, and what port that functional unit is attached to. Ports 0 and 1 are typically heavily used, so be aware of this.
When working with floating-point numbers, try to avoid underflow exceptions by rounding to zero when possible. Exceptions can cause instructions to not be retired, which means that your Re-Order Buffer is using space for instructions that have already been executed but cannot be retired.
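A portable way to sketch this "round tiny results to zero" idea is to clamp values whose magnitude would fall below the smallest normalized float before they feed later arithmetic. The function below is a hypothetical illustration of the technique, not a hardware feature; on SSE-capable parts the analogous hardware route is the flush-to-zero mode bit, but the clamp shows the intent in plain C.

```c
#include <float.h>  /* FLT_MIN: smallest normalized single-precision value */

/* Portable stand-in for flush-to-zero: any result whose magnitude is
 * below FLT_MIN (i.e. would be denormal) is replaced with 0.0f so it
 * never triggers underflow handling in later operations. */
static inline float flush_to_zero(float x)
{
    return (x > -FLT_MIN && x < FLT_MIN) ? 0.0f : x;
}
```

Applying this at the points where underflow is plausible keeps exception handling out of the hot path, at the cost of discarding denormal precision that is rarely meaningful anyway.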
Ensure that each CALL instruction has a matching RET instruction to take advantage of the Return Stack Buffer. This will help to increase the efficiency of the processor's branch prediction capabilities. The P6, in addition to its dynamic branch prediction, also uses a simple static branch prediction algorithm. This algorithm assumes that:
Unconditional branches are to be taken,
Backward conditional branches are to be taken, and
Forward conditional branches are not to be taken.
If you write your code to follow this static algorithm, then efficiency will increase. Also try to eliminate branches from your code wherever possible, since the BTB has a limited number of entries.
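One way to cooperate with the static algorithm is to keep the common case on the straight-line path and the rare case behind a forward branch. The sketch below assumes a compiler that follows the source layout (the function name and the "negative values are rare" premise are both illustrative); under that assumption, the loop's backward branch is statically predicted taken and the rare-input test is a forward branch predicted not-taken.

```c
/* Sum the non-negative elements of an array, assuming negative
 * values are rare.  The loop's backward branch (predicted taken)
 * dominates; the rare case sits behind a forward branch
 * (predicted not-taken), so the common case falls through. */
long sum_nonnegative(const int *v, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {   /* backward loop branch: predicted taken */
        if (v[i] < 0)               /* rare input: forward branch */
            continue;
        sum += v[i];                /* common case: straight-line path */
    }
    return sum;
}
```

The same layout discipline applies to error checks: test for the failure case and branch forward to the handler, letting success fall through.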
The alignment of data is important since instructions are fetched in blocks of 16 bytes, and cache lines are 32 bytes in size. Data that are not properly aligned will cost unnecessary clock cycles. Note that data blocks of different sizes (8 bits or 16 bits or 32 bits, etc.) will have a different alignment boundary. The clock-cycle penalties are caused by a split occurring in a cache line, or data being fetched over multiple cycles instead of a single cycle. You will need to check your compiler's output carefully for data alignment, or you may want to use assembly code for more control.
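If assembly is not an option, C11 gives some of that control directly. The sketch below is illustrative (the array name and size are arbitrary): `alignas(32)` places the buffer on a 32-byte boundary so no element straddles a P6 cache line, and a small helper checks any pointer's alignment at run time.

```c
#include <stdalign.h>  /* C11 alignas */
#include <stddef.h>
#include <stdint.h>    /* uintptr_t for pointer arithmetic */

/* Force the buffer onto a 32-byte boundary so no element straddles a
 * P6 cache line; a 16-byte boundary would similarly keep data inside
 * one 16-byte instruction-fetch-sized block. */
static alignas(32) float samples[256];

/* Report whether a pointer sits on the given power-of-two boundary. */
int is_aligned(const void *p, size_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}
```

A check like `is_aligned(buf, 32)` dropped into a debug build is a cheap way to confirm the compiler honored the request before trusting the cache-line math.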
If you recall the functionality of the Fetch/Decode Unit, you will realize that instruction scheduling within the program will be important. If you use two "simple" instructions next to each other, along with an instruction that will break down into two to four micro-ops, it will be possible to send all three instructions into the pipeline at once. Otherwise, efficiency will be lost. Try to avoid very long instructions (more than four micro-ops), but don't "overdo it" with the single micro-op instructions, either.
The trick to all of this is that the conditions illustrated in these guidelines are not isolated; they interact with each other. For instance, you shouldn't schedule two simple instructions together if they have a data dependency.