I thought they used the same predictor in both the K5 and K6... hmmm, maybe my mind is failing me, I should check up on that...
Edit:
Correct you are. The K5 uses a dynamic branch prediction scheme tied to its 16 KB I-cache, covering up to 1024 branches, and does not maintain a BHT or use a prediction algorithm. Instead, it assigns a "branch bit" to each branch as it passes through the pipeline, and it changes this bit based on the instructions ahead of the branch in the pipeline.
<< Branch prediction is handled a little differently than in other advanced microprocessors. Instead of maintaining a separate branch target buffer to hold the addresses of predicted branches, the K5 appends the predicted address to the branch instruction during predecode. This 10-bit tag, called a successor index, points to a target within the I-cache.
At first, all predecoded branch instructions are predicted not taken. Later, if speculative execution reveals that the prediction was wrong, the prediction is reversed by writing a new successor index that points to the correct cache block. That prediction remains in effect until it's wrong again. In other words, the prediction is reversed every time it's wrong.
This is one reason why the cache blocks are only 16 bytes in size. The K5 can predict only one taken branch per block, so a smaller block reduces the chance that an instruction will branch to another branch in the same block. A 32-byte cache block would reduce performance, according to AMD's simulations.
Although the branch prediction is "dynamic" in the sense that it adapts to wrong predictions at run time, it does so merely by reversing its predictions in a binary flip-flop. In contrast, some of the latest RISC processors use algorithms that dynamically predict the outcome of branches by keeping track of how often a particular branch is actually taken. But RISC chips don't have to bother with complicated x86 decoding. By adopting a somewhat simpler form of branch prediction, the K5 keeps an already complicated decoder from becoming even more labyrinthine.
There is another advantage to the K5's approach: In effect, it predicts branches over a larger sample of the program than other methods. Branch target buffers have a limited number of entries, usually a few dozen. However, the K5 can theoretically predict a branch in every cache block. Since the block size is 16 bytes and the I-cache is 16 KB, that's potentially 1024 branches. This larger sample--coupled with the K5's flexible cache fetching--partly offsets its less sophisticated predictions. Of course, when the cache is flushed, all the prediction states are lost, too, because they're tagged to the instructions instead of being held in a branch target buffer.
To make this whole mechanism complete, the K5's byte queue can trigger a special signal called BQ confused. It waves this flag when the predecoded instructions don't appear to make sense because of a mispredicted branch or some other anomaly. The signal wipes out the incoherent cache blocks and reloads them with freshly predecoded instructions. Johnson says this rarely happens, but it is so reliable that it once masked a bug in the K5's critical logic path during the chip's early development. Even though not even AMD would claim the K5 is a fault-tolerant processor, it's comforting to know there's a mechanism of last resort that is robust enough to handle a logic glitch and confused code.
>>
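
To make the quoted scheme a bit more concrete, here is a rough Python sketch of a "reverse it every time it's wrong" predictor with one slot per 16-byte block of a 16 KB cache. This is only a toy model under my own assumptions, not how the K5 is actually wired: the class name, the direct address-to-slot mapping, the example addresses, and the loop pattern are all made up for illustration. The 2-bit saturating counter is the generic textbook scheme, included only as a stand-in for the "keeps track of how often a branch is taken" predictors the article contrasts against.

```python
BLOCK_BYTES = 16            # cache block size given in the article
ICACHE_BYTES = 16 * 1024    # 16 KB I-cache given in the article
NUM_BLOCKS = ICACHE_BYTES // BLOCK_BYTES  # one prediction slot per block


class FlipPredictor:
    """Toy model: one prediction per cache block, reversed on every miss."""

    def __init__(self):
        # None means "predicted not taken"; an int stands in for the
        # successor index (the predicted target).
        self.successor = [None] * NUM_BLOCKS

    def block_of(self, addr):
        # Hypothetical direct mapping from address to slot (an assumption).
        return (addr // BLOCK_BYTES) % NUM_BLOCKS

    def update(self, addr, taken, target):
        """Record the real outcome; return True if the prediction was wrong."""
        slot = self.block_of(addr)
        actual = target if taken else None
        mispredicted = self.successor[slot] != actual
        if mispredicted:
            # Rewrite the slot so it matches what actually happened.
            self.successor[slot] = actual
        return mispredicted


def run_flip(outcomes, addr=0x1000, target=0x0F00):
    """Count mispredictions of the flip scheme for a single branch."""
    predictor = FlipPredictor()
    return sum(predictor.update(addr, taken, target) for taken in outcomes)


def run_two_bit(outcomes):
    """Count mispredictions of a textbook 2-bit saturating counter."""
    counter = 0  # 0..3, starting at "strongly not taken"
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# A loop branch: taken 9 times, then falls through, repeated 10 times.
loop = ([True] * 9 + [False]) * 10

print(NUM_BLOCKS)         # 1024
print(run_flip(loop))     # 20
print(run_two_bit(loop))  # 12
```

The flip scheme misses twice per pass through the loop (once on the exit, once on re-entry), while the counter only misses on the exit once it has warmed up. That is the "less sophisticated" part the article mentions, traded against having up to 1024 slots instead of a few dozen BTB entries.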