Next: Improvements to BTB Architectures Up: ECE 3391 Computer Architecture Previous: Introduction

Background

There are two sources of pipeline stalls: instruction misfetch penalty and conditional branch misprediction penalty. Instruction misfetch penalty can be overcome, by using branch delay slots or branch target buffers(BTB's). A BTB can eliminate misfetch stalls by storing the branch destination. For unconditional branches, indirect jumps or functions calls, this destination can be immediately fetched. For conditional branches, either the ``fall-through'' or the destination stored in the BTB is selected and some form of branch prediction is needed to select between the fall-through and taken address. To differentiate between actions for the different branch types, the branch type is to be identified and some BTBs store the branch type in the BTB. For function calls (either direct or indirect), the previous function address is stored in the `destination' field of the BTB. This can also be done for return instructions, but a return stack [5] is much more accurate. When using a return stack the BTB provides no useful information for returns, but it does indicate the instruction is a return instruction so the return stack can be used, avoiding the misfetch penalty.

Branch prediction techniques are classified as static or dynamic. Static branch prediction information does not change during the execution of a program, while dynamic prediction may change, reflecting the time-varying activity of the program.

The literature is full of suggested branch prediction schemes [4, 9, 7, 12]. Branch prediction schemes can be classified into static schemes and dynamic schemes by the way the prediction is made. Static prediction schemes can be (i) simple like predicting branches as always taken. Lee and Smith [7] reported that this simple strategy can predict correctly 68 % of the time. (ii) using the direction of the branches to make a prediction. If the branch is backward, i.e., the displacement is negative, then the branch is predicted as taken, otherwise as not taken. This scheme works well with programs with many loops and does not work well with programs with many irregular branches. This scheme is also called as backward-taken, forward not-taken (BTFNT) architecture (iii) profiling , which uses previous runs of a program to collect information on the tendencies of a given branch to be taken or not taken and presets a static prediction bit in the opcode of the given branch. This strategy, the most successful among static predictors, suffers from the fact that runs of a program with different input data sets usually results in different branch behaviors. (iv) using static correlated branch prediction (SCBP) proposed by Young and Smith [14]. Young et al. [15] in a comparative analysis of various correlated branch prediction schemes, show that a large percentage of branches are strongly biased and hence explain the accuracy of profiled per-branch static branch prediction. They also point out the possible shortcoming of dynamic correlation schemes, due to the aliasing effect, which means that it is practically impossible to assign each per-branch stream to a unique 2-bit counter. However, it should be noted that both Yeh et al. [12] (through an aggressive policy of providing more substreams) and McFarling [10] (through his spreading of the indexes by using the exclusive-or of the global history register and the branch address as the index of the PHT(Pattern History Table)) were largely able to overcome this situation.

Most processors now use some form of dynamic conditional branch prediction to improve the branch architecture performance. The Alpha 21064 used in the DEC 3000-500 adds one bit to each instruction in the 8K direct mapped instruction cache to dynamically predict the direction of conditional branches. This bit is initialized with the sign-bit of the PC-relative displacement when the cache line is read in, initializing the dynamic l-bit predictors to BTFNT prediction. Thus, the BTFNT rule is used the first time the branch is encountered, and a dynamic one-bit predictor is used thereafter [2].

Two-bit up-down saturating counters have been shown to effectively predict the direction of conditional branches and to outperform l-bit branch predictors. The pattern history table (PHT) branch architecture is an example of an architecture using two-bit saturating up-down counters. It contains a table of two-bit counters used to predict the direction for conditional branches. In the direct mapped PHT architecture (PHT-Direct), the branch address is used to directly index into the pattern history table to find the two-bit counter to use when predicting the branch. Pan et al. [11] and Yeh et al. [12, 13] investigated branch-correlation or two-level branch prediction mechanisms. Although there are a number of variants, these mechanisms generally combine the history of several recent branches to predict the outcome of a branch. The degenerate method (GAg) of Pan et al. [11] uses a global history register to record the branch correlation.

When using a entry table, the processor maintains a k-bit global history register that records the outcome of the previous k branches (e.g., a taken branch is encoded as a 1, and a not-taken branch as a 0) . The register is used as an index into the pattern history table (PHT), much as the program counter is used for a direct-mapped PHT. This provides contextual information and correlation about particular patterns of branches. McFarling [10] used the exclusive-or of the global history register and the branch address as the index into the PHT (called the PHT-GAg). Branch target buffers (BTB) have been used as a mechanism for branch and instruction fetch prediction, effectively predicting the behavior of a branch. The Intel Pentium is an example of an architecture using a BTB - it has a 256-entry BTB organized as a four-way associative cache. Only branches that are taken are entered into the BTB. If a branch address appears in the BTB and the branch is predicted as taken, the stored address is used to fetch future instructions, otherwise the fall-through address is used. This is called as "BTB-2Bit" architecture in [2]. Each BTB entry contains a two-bit saturating counter to predict the direction of a conditional branch [7]. The Intel P6 architecture adds to this design by increasing the BTB to a 512 entry 4-way associative design and replaces the 2-bit counters with 4-bit history registers which is similar to Yeh et al. [12]. This is called as "BTB-PAs" architecture in [2]. The history register is used as the lower bits of an index into a PHT when predicting conditional branches. The upper bits of the index are the lower bits of the branch address. In both BTB-2Bit and BTB-PAs architectures, the branch prediction information (the two-bit counter and 6-bit history register), is associated or coupled with the BTB entry. Therefore, the dynamic conditional branch prediction information can only be used for branches in the BTB, and branches that miss in the BTB must use less accurate static prediction.

By coupling the branch prediction information with the BTB, we avoid both misfetch and misprediction penalties. However, we can only do this for branches that have been entered in the BTB. Typically, a BTB contains from 32 to 512 entries with varying degrees of associativity. A BTB requires a lot of storage, because it stores the address of the branch and the address of the probable destination. Some BTB's also include additional storage to encode the branch type and prediction information.

Calder et al. [1] suggest an equally effective scheme without the expensive BTB's. In the absence of a BTB, conditional branches can be predicted using much simpler mechanisms. A pattern history table eliminates the site and target addresses from the table; hence the table only predicts the direction for conditional branches. These designs use the branch site address as an index into a table of prediction bits. Different branch addresses may index into the same table entry and several conditional branches share the same prediction information. The branch correlation or two-level branch prediction mechanisms available are those of Pan et al. [11] and Yeh et al. [12, 13] and McFarling's [10] modifications discussed above. In [1], Calder et al. look at 2-level branch predictors and focus on avoiding the fetch penalty associated with identifying what type of break has occurred and computing its target address. While doing so, they also identify the simpler `GAg' architecture to be equally efficient to the `PAs' architecture (which has been decifered by the authors of the paper as being deceptively better, in giving lesser misprediction, but more overall penalty. They do this, by using the BEP measure).

Next: Improvements to BTB Architectures Up: ECE 3391 Computer Architecture Previous: Introduction

Annamalai Ramanathan
Fri Apr 4 20:04:23 EST 1997