I'm not qualified to explain how branch prediction works, but I can tell you why it exists. All modern processors that I know of perform branch prediction, some better than others. The reason is pipelining. Pipelining involves breaking the execution of each instruction into separate stages so that more than one instruction can be worked on at a time.
A pipeline is like an assembly line. Instead of working on one car at a time, and not starting the second car until the first is completely finished, you work on many cars simultaneously. While the first car is being painted, the second one is being welded, or whatever. The time to finish one car is the same, but the throughput of cars per hour is much higher.
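To put rough numbers on the assembly-line picture, here is a small sketch in Python. It assumes an idealized pipeline where every stage takes exactly one cycle and nothing ever stalls; the function names and the 5-stage figure are just illustrative, not taken from any particular processor.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """Idealized pipeline: the first instruction needs num_stages cycles to
    come out the far end, and after that one instruction finishes per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


# 100 instructions through a hypothetical 5-stage pipeline
print(unpipelined_cycles(100, 5))  # 500
print(pipelined_cycles(100, 5))    # 104
```

Each individual instruction still takes 5 cycles from start to finish, just like each car still takes the same time to build. What changes is how many come off the end of the line per unit of time.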
So how does this relate to branch prediction? Suppose you add A + B to get C. Then you branch if C = D. Because of the pipeline, the processor will start the branch instruction before the addition is finished. It doesn't know the answer yet, but it has to decide which instructions to fetch next. So it guesses. It chooses one alternative and keeps working on it. When the addition is done, it checks to see if it guessed right. If so, everything proceeds as normal. If not, the processor must throw away the work it did on the wrong path, back up, and start over from the branch. In modern processors, with very deep pipelines, this can be a serious performance hit. So accurate branch prediction is very important, and a good processor is right most of the time.
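The guess-then-check cost can be sketched as a toy model. To be clear, this is not how any real processor predicts branches: the "guess whatever happened last time" rule and the 15-cycle penalty are assumptions I picked only to make the arithmetic visible, and `count_cycles` is a made-up name.

```python
def count_cycles(outcomes, flush_penalty):
    """Toy cost model for a sequence of branches.

    outcomes: list of bools, the actual result of each branch (True = taken).
    flush_penalty: extra cycles lost when a guess is wrong (assumed value).
    Returns (total_cycles, mispredictions).
    """
    cycles = 0
    mispredictions = 0
    guess = True  # arbitrary initial guess
    for actual in outcomes:
        cycles += 1  # one cycle per branch while the pipeline stays full
        if guess != actual:
            mispredictions += 1
            cycles += flush_penalty  # wrong-path work is discarded
        guess = actual  # assumed rule: next time, guess what happened last time
    return cycles, mispredictions


# A loop branch: taken 99 times, then falls through once.
predictable = [True] * 99 + [False]
# A branch that flips every single time.
alternating = [True, False] * 50

print(count_cycles(predictable, flush_penalty=15))  # (115, 1)
print(count_cycles(alternating, flush_penalty=15))  # (1585, 99)
```

Same number of branches, but the pattern that defeats the guesser costs more than ten times as many cycles in this model. The deeper the pipeline, the bigger that penalty gets, which is why processors put so much effort into guessing well.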
I'd recommend any of the Hennessy and Patterson computer architecture books for much more extensive information on this. A good starting point is "Computer Organization and Design: The Hardware/Software Interface".