Pipeline Stall/Flush

Gamingphreek · Mar 18, 2009

We are studying the MIPS architecture right now in one of my upper level CS courses and I had a quick question.

When a beq is followed by another instruction, there is obviously a hazard that cannot be forwarded around (especially if the branch was predicted incorrectly). The pipeline then either stalls or flushes to prevent an incorrect memory address from being loaded.

The way it was explained to us, all the control lines are set to 0, so the instructions that have already passed through stages 1-4 are simply lost. After that, the buffer between the Instruction Fetch and Instruction Decode stages is used to restore the previous value of the PC.

Instead of setting all control lines to 0, is it not possible to power down certain parts (i.e. the ALU, forwarding control, the registers in question) to save power, rather than sending a 0 (which admittedly is a 0 V signal, so those parts lose power)? Pipeline stalls/flushes happened a lot, especially on the old NetBurst architecture, so why can you not just shut down certain individual components?

Thanks,
-Kevin

BrownTown · Mar 18, 2009

Well, I'm pretty sure it's converted into a "nop", which does pretty much the minimal amount of switching and therefore uses less power than a real operation. You can't power sections on and off for a few cycles anyway.

BEL6772 · Mar 18, 2009

Good question!

Currently power planes in chips aren't that granular. It would be cool to see if building a big 'ol power control network on the chip would be worth it. How much power could you really save by only activating stages that were working on valid instructions? How much latency would you introduce waiting for stable power before moving an instruction into the recently powered-on stage? How many transistors/how much 'real estate' would it take to build a power control network with that level of granularity and intelligence? How would you test it?

Of course once you start running multiple threads through the pipeline I imagine the opportunities for rolling a blackout through the pipe would drop. Maybe the power control network could shut down unused 'lanes' ... shut down the floating point hardware while working with integers, for example.

I'm guessing that the current strategy of shutting down idle cores is about as good as it gets. I wouldn't think the power 'reward' for opportunistically shutting down a few transistors here and there on an active core is worth the effort ... still might be a fun project to tackle, though

Gamingphreek · Mar 18, 2009

I just figured with pipelines larger than the 5 stages of the MIPS pipeline, that branch mispredictions and stalls would happen much more frequently - therefore the reward might be something significant.

As you mentioned though, having a control unit to detect all of that would probably be somewhat difficult and would introduce more latency into the system.

Someone has probably thought of the whole concept already - I was just wondering if I was completely out there on my thought.

Thanks
-Kevin

Born2bwire · Mar 19, 2009

Originally posted by: Gamingphreek

I just figured with pipelines larger than the 5 stages of the MIPS pipeline, that branch mispredictions and stalls would happen much more frequently - therefore the reward might be something significant.

Thanks
-Kevin

I don't think that's an assumption you should be making unqualified. I don't recall the percentage of branch misses that occur in real-world code, but designers do study these things in detail in order to make exactly these kinds of decisions.

http://www.google.com/url?sa=U...wHLpNbpHdFClsEwEcWJGFg

No idea if the above is accurate or not, but they show that a simple bimodal predictor had an accuracy better than 87% for the various programs they benchmarked against. A lot of branches are very easy to predict accurately; consider how often we run for/while loops. A simple bimodal or even an appropriate static predictor will get very good results under such conditions.
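
As a rough illustration of why loop branches are easy, here is a minimal sketch of a single 2-bit saturating counter, the building block of a bimodal predictor. The trace (a loop branch taken 9 times, then falling through, repeated 100 times) and the names `bimodal_accuracy` and `loop_trace` are made up for this example and are not from the linked study; a real bimodal predictor keeps a table of such counters indexed by branch address.

```python
def bimodal_accuracy(outcomes, start_state=2):
    """Predict each outcome with one 2-bit saturating counter.

    States 0-1 predict not taken, states 2-3 predict taken.
    """
    state = start_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Saturating update toward the actual outcome.
        if taken:
            state = min(state + 1, 3)
        else:
            state = max(state - 1, 0)
    return correct / len(outcomes)


# Hypothetical trace: loop branch taken 9 times, then not taken once,
# with the whole loop run 100 times.
loop_trace = ([True] * 9 + [False]) * 100
accuracy = bimodal_accuracy(loop_trace)
print(f"accuracy = {accuracy:.3f}")
```

On this trace the counter only misses the loop exit, because a single not-taken outcome is not enough to flip its prediction for the next run of the loop.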

Thus, one would need to analyze the effectiveness of shutting down a section of the pipe. The percentage of branch misses seems to be fairly low. Of those that miss, the depth to which the incorrect ops have gone into the pipeline will vary. The penalty for missing on a boolean check of a constant against a short kept in a register will be much different than for a boolean check between data from two different indirect loads. So, to measure the feasibility of such a feature, one needs to consider the average number of misses, the average depth of the pipe that is flushed on these misses, the average time needed to stop and restart a section, and the average time needed to decide when to shut down and start up a section.
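
A back-of-the-envelope version of that analysis might look like the sketch below. It is a deliberately crude model, not a measurement: the function name, the formula, and every input except the miss rate are assumptions made up for illustration (the 0.13 miss rate is just the complement of the "better than 87%" figure above).

```python
def idle_stage_cycles_per_branch(miss_rate, flushed_depth, overhead_cycles):
    """Expected stage-cycles per branch that could actually be powered down.

    Crude model: each miss leaves `flushed_depth` stage-cycles with no
    useful work, but `overhead_cycles` of them are lost to deciding on,
    stopping, and restarting the section.
    """
    usable = max(flushed_depth - overhead_cycles, 0)
    return miss_rate * usable


# Hypothetical inputs: 13% misses, 3 stages flushed, 2 cycles of overhead.
saving = idle_stage_cycles_per_branch(0.13, 3, 2)
print(f"{saving:.2f} stage-cycles saved per branch")
```

With these made-up numbers the saving is a small fraction of one stage-cycle per branch, and it drops to zero as soon as the shutdown/startup overhead matches the flush depth.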

Also, when it comes to CMOS technology, the static power dissipation is very, very low. That is one of the key advantages of the technology. A nop, which is what they should be doing with the control lines when flushing the pipe, will just leave the sections in a static state. So as long as flushing the pipeline does not require the entire section to reset itself, just to hold its state, the power dissipation is very low. Letting these sections idle for the required cycles until the pipeline catches up to them may be rather similar to shutting down the entire sections. It may even be better if the new ops use datapaths similar to those in use when the section was flushed. Only a change in a gate's state will create a large power dissipation. So another question may be whether, on average, there are more gate changes in a section between ops than in going from a total reset to an op.
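
For reference, the usual first-order textbook expression for CMOS switching power makes the "only a change in state costs power" point explicit. The symbols are the conventional ones and are not taken from this thread: the activity factor alpha is the fraction of capacitance that switches per cycle.

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V_{DD}^{2} \, f
% \alpha : activity factor (fraction of nodes switching per clock cycle)
% C      : total switched capacitance
% V_{DD} : supply voltage
% f      : clock frequency
```

A section that simply holds its state has an activity factor near zero, so its dynamic power is close to zero without removing its supply.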

dmens · Mar 19, 2009

Instead of setting all control lines to 0, is it not possible to power down certain parts (ie: ALU, Forwarding Control, Registers in question) instead of sending a 0 (Which admittedly is a 0V signal so those parts lose power) to save power. Pipeline stalls/flushes, especially on the old Netburst Architecture happened a lot, why can you not just shut down certain individual components?

in a real silicon design, flushing to '0 nop can possibly cause additional switching and waste power.

if i remember correctly with the MIPS uarch, the '0 force on a branch bubble simply sets the destination register to r0, which is hardwired to zero, so the write has no effect. however, a '0 force on all the control logic requires switching. worst case scenario it can cause transitions on the large data busses, depending on your design.

a better solution for power is to stop the clocks on the subsequent pipestages, so the sequentials hold the same value. that way there is no switching in the combinational gates. to cancel the incorrect writeback, stage a one-bit signal that kills the write clock. in addition, you save one cycle of switching power for all the sequentials per pipestage.
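
a toy way to see the difference is to count bit flips in one pipeline latch across a single bubble cycle. this is only a sketch: the two 32-bit latch values are made up, and the model ignores the clock network and the downstream combinational logic.

```python
def toggles(values):
    """Count bit flips between consecutive latch values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


# Hypothetical 32-bit latch contents before and after the bubble.
before, after = 0x8FA40010, 0x8FA50014

zeroed = toggles([before, 0, after])     # bubble forces the latch to all zeros
held = toggles([before, before, after])  # clock stopped: latch keeps its value
print(zeroed, held)
```

holding can never flip more bits than zeroing in this model, since the number of bits that differ between two values is at most the number of set bits in both combined.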

stopping clocks is almost always better than data squishing imo, if timing allows.

soydios · Mar 25, 2009

MIPS as an architecture does not specify a number of pipeline stages; that is an implementation choice, and the hardware handles hazards by interlocking (flushing/stalling with nops). I'm in an introductory computer architecture class right now; are you using the Patterson & Hennessy 5-stage pipeline?

There is no dedicated "nop" opcode in MIPS. The closest you can get is something like "add $0, $0, $0", which adds register zero to register zero and stores the result in register zero. An instruction like this will not store a result because register zero is hardwired to zero and cannot be written, and data hazard detection logic will ignore dependencies on register zero because of it. For example, a MIPS32 instruction of all zeroes corresponds to "sll $0, $0, 0", that is, shift register zero left by zero and store the result in register zero. This is effectively a no-operation instruction, because it causes no change in the register file. Really, all a no-operation instruction has to do is not write anything anywhere.

Stalls/bubbles are used if you're waiting for a data dependency to clear. Flushes are used to discard the instructions that were fetched down the wrong path when you take a branch or jump. IIRC though, the MIPS architecture actually specifies that the next instruction after a branch (the branch delay slot) does get executed (i.e. is not flushed), and that the compiler must account for this.
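
To check the all-zeroes claim, here is a small sketch that splits a 32-bit word into the standard MIPS R-type fields. The helper name `decode_rtype` is made up for this example; the field layout (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit funct) is the standard R-type format.

```python
def decode_rtype(word):
    """Split a 32-bit MIPS R-type instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


# All zeroes: opcode 0 (SPECIAL) with funct 0 is sll, with every register
# field equal to 0, i.e. "sll $0, $0, 0".
nop_fields = decode_rtype(0x00000000)

# "add $0, $0, $0" differs only in the funct field (0x20 selects add).
add_fields = decode_rtype(0x00000020)
print(nop_fields)
print(add_fields)
```

Either encoding names register zero as the destination, so neither changes the register file.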

dmens and Born2bwire are correct, IMO, in that simply stopping the clock prevents any switching from happening, and thus conserves power in CMOS.

Idea I just had: disabling the write port on the register file until the instruction retires also effectively creates nops for flushing purposes without you having to replace all the instructions in the entire pipeline.
