As the issue rate and pipeline depth of high-performance superscalar processors increase, the amount of speculative work performed on the basis of branch prediction grows much larger. Since all such work must be thrown away if the prediction is incorrect, an excellent branch predictor is vital to delivering the potential performance of a wide-issue, deeply pipelined microarchitecture. Even a misprediction rate of 5 percent results in a substantial loss in performance, given the number of instructions fetched each cycle and the number of cycles those instructions spend in the pipeline before an incorrect branch prediction becomes known. Using a simple model of branch penalty, Lilja [8] shows that the curve of `Fraction of maximum performance as a function of effective probability and branch penalty' is very steep over the first few percent of misprediction. The effective probability is the probability that a given instruction is a mispredicted branch, expressed as the product of the probability of an instruction being a branch and the probability of the prediction being wrong. Lilja [8] thus emphasizes that as prediction accuracy approaches perfection, every small improvement matters, say from 97.1% to 98.2%.
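The steepness Lilja describes can be illustrated with a minimal sketch. The model form below (an ideal CPI of 1 stretched by the expected misprediction cycles per instruction) and the specific numbers (20% branch frequency, 10-cycle penalty) are illustrative assumptions, not figures taken from [8].

```python
def effective_probability(p_branch: float, p_mispredict: float) -> float:
    """Probability that a given instruction is a mispredicted branch."""
    return p_branch * p_mispredict

def fraction_of_max_performance(p_eff: float, penalty_cycles: float) -> float:
    """Ideal CPI of 1, stretched by the expected penalty cycles per instruction."""
    return 1.0 / (1.0 + p_eff * penalty_cycles)

# Assumed workload: 20% of instructions are branches, 10-cycle penalty.
for accuracy in (0.971, 0.982):
    p_eff = effective_probability(0.20, 1.0 - accuracy)
    frac = fraction_of_max_performance(p_eff, 10)
    print(f"{accuracy:.1%} accurate -> {frac:.3f} of maximum performance")
# → 97.1% accurate -> 0.945 of maximum performance
# → 98.2% accurate -> 0.965 of maximum performance
```

Even the seemingly small jump from 97.1% to 98.2% accuracy recovers about two percentage points of peak performance under these assumptions, which is why the curve is so steep near perfect prediction.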
Conventional processor architectures, particularly superscalar designs, are extremely sensitive to changes in control flow. A simplified processor pipeline can be divided into fetch, decode, execute, memory access, and write-back stages. Changes in control flow, be they conditional or unconditional branches, direct or indirect function calls, or returns, are not detected until those instructions are decoded. To keep the pipeline fully utilized, processors typically fetch from the address immediately following the most recently fetched address. If the decoded instruction turns out to be a break in control flow, the previously fetched instruction cannot be used and a new instruction must be fetched, introducing a "pipeline bubble," an unused pipeline slot. This is called the instruction misfetch penalty.
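The misfetch behavior can be sketched very simply: since fetch always proceeds sequentially, every taken break in control flow squashes one already-fetched instruction. The toy program encoding below (a list of (opcode, target) pairs, where target is None for fall-through) is invented purely for illustration.

```python
def count_misfetch_bubbles(program):
    """Count squashed sequential fetches for a toy program.

    program: list of (opcode, target) pairs; target is None for a
    fall-through instruction, or the index of the next instruction
    for a taken break in control flow.
    """
    pc, executed, bubbles = 0, 0, 0
    while pc < len(program):
        opcode, target = program[pc]
        executed += 1
        if target is not None and target != pc + 1:
            bubbles += 1  # the sequentially fetched successor is discarded
            pc = target
        else:
            pc += 1
    return executed, bubbles

prog = [("add", None), ("jmp", 3), ("never-run", None), ("sub", None)]
print(count_misfetch_bubbles(prog))  # → (3, 1): one bubble for the taken jump
```

With a one-cycle bubble per taken transfer, the misfetch cost scales directly with how often control flow breaks, independent of prediction accuracy.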
The final destinations of conditional branches, indirect function calls, and returns are typically not available until the memory access stage of the pipeline completes; by this point the branch has been fully evaluated in the execute stage, and the program counter is updated after the memory access stage. As with instruction fetch, the processor may elect to fetch and decode instructions on the assumption that the eventual branch target can be accurately predicted. If the processor mispredicts the branch destination, the instructions fetched from the incorrect instruction stream must be discarded, leading to several "pipeline bubbles." In practice, pipeline bubbles due to mispredicted breaks in control flow degrade a program's performance more than the misfetch penalty does. For example, the combined branch misprediction penalty for both pipelines of the Digital AXP 21064 processor is 10 cycles, whereas the processor loses only two instruction issues to an instruction misfetch [1]. Calder et al. [1] also note that as processors issue more instructions concurrently, misfetch penalties increase, and the instruction fetch penalty becomes increasingly important.
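To see why misprediction dominates, one can compare the two penalties' contributions to the average cycles lost per instruction. The 10-cycle and 2-issue figures come from the 21064 numbers quoted above [1]; the branch frequency and the misfetch and misprediction rates below are illustrative assumptions.

```python
MISPREDICT_PENALTY = 10  # cycles, combined for both 21064 pipelines [1]
MISFETCH_PENALTY = 2     # lost instruction issues per misfetch [1]

def penalty_per_instruction(f_branch, misfetch_rate, mispredict_rate):
    """Average pipeline-bubble cycles each instruction pays for control flow."""
    return f_branch * (misfetch_rate * MISFETCH_PENALTY
                       + mispredict_rate * MISPREDICT_PENALTY)

# Assumed workload: 20% branches; 20% of them misfetched, only 5% mispredicted.
misfetch_cost = 0.20 * 0.20 * MISFETCH_PENALTY
mispredict_cost = 0.20 * 0.05 * MISPREDICT_PENALTY
print(f"misfetch: {misfetch_cost:.2f}, mispredict: {mispredict_cost:.2f} cycles/instr")
# → misfetch: 0.08, mispredict: 0.10 cycles/instr
```

Under these assumed rates, mispredictions are four times rarer than misfetches yet still cost more cycles per instruction, matching the paper's observation that misprediction bubbles dominate in practice.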
The BTB architecture is complex and also requires a large chip area. Computer technology, however, is advancing at a rapid pace: advanced VLSI technology makes it possible to build larger branch prediction tables and more complicated schemes. At the same time, advances in programming languages make programs larger and more complicated, and allow more cross-references between branches through more elaborate procedure calls. Multiprocessing and multithreading are also becoming important with the rise of Multiple Instruction stream, Multiple Data stream (MIMD) machines. Simpler, space-efficient prediction mechanisms are therefore definitely required.
The rest of the paper review is organized as follows. The Background section gives references to previous work on branch prediction and explains the purpose of the paper. The Improvements to BTB Architectures section presents the improvements the paper makes to the BTB architecture, namely ``Decoupling prediction and fall through'' and ``Only taken allocate,'' and discusses the comparison and feasibility of these methods. The Dispensing with the BTB: an alternative architecture section reviews the alternative architecture proposed in the paper, which dispenses with the BTB, and discusses its shortcomings and the need for more experimentation and a detailed design. Finally, the Summary and Conclusion section summarizes the work done in the paper and concludes with prospective future work.