

In many floating-point programs, such as Program 20.4a, the basic blocks are long, the instructions are long-latency floating-point operations, and the branches are very predictable for-loop exit conditions. In such programs the problem, as described in the previous sections, is to schedule the long-latency instructions. But in many programs - such as compilers, operating systems, window systems, word processors - the basic blocks are short, the instructions are quick integer operations, and the branches are harder to predict. Here the main problem is fetching the instructions fast enough to be able to decode and execute them. Image 20.11 illustrates the pipeline stages of a COMPARE, BRANCH, and ADD instruction. Until the BRANCH has executed, the instruction-fetch of the successor instruction cannot be performed because the address to fetch is unknown.

Image 20.11: Dependence of ADD's instruction-fetch on result of BRANCH.

Suppose a superscalar machine can issue four instructions at once. Then, in waiting three cycles after the BRANCH is fetched before the ADD can be fetched, 11 instruction-issue slots are wasted (3 × 4 minus the slot that the BRANCH occupies). Some machines solve this problem by fetching the instructions immediately following the branch; then if the branch is not taken, these fetched-and-decoded instructions can be used immediately. Only if the branch is taken are there stalled instruction slots. Other machines assume the branch will be taken, and begin fetching the instructions at the target address; then if the branch falls through, there is a stall. Some machines even fetch from both addresses simultaneously, though this requires a very complex interface between processor and instruction-cache. Modern machines rely on branch prediction to make the right guess about which instructions to fetch. The branch prediction can be static - the compiler predicts which way the branch is likely to go and places its prediction in the branch instruction itself; or dynamic - the hardware remembers, for each recently executed branch, which way it went last time, and predicts that it will go the same way.
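The slot-counting arithmetic above can be sketched directly. In this small Java sketch (the class and method names are illustrative, not from the text), the branch's own fetch group loses all but one of its slots, and each further stall cycle loses a full issue group:

```java
// Sketch: issue slots wasted while a superscalar machine waits for a
// branch to resolve before fetching its successor.
public class BranchStall {
    // Each stall cycle costs a full group of issueWidth slots; the
    // BRANCH itself occupies one slot, so subtract 1.
    static int wastedSlots(int issueWidth, int stallCycles) {
        return stallCycles * issueWidth - 1;
    }

    public static void main(String[] args) {
        // 4-wide issue, 3 cycles between fetching the BRANCH and
        // fetching the ADD: 3 * 4 - 1 = 11 wasted slots.
        System.out.println(wastedSlots(4, 3)); // prints 11
    }
}
```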


The compiler can communicate predictions to the hardware by a 1-bit field of the branch instruction that encodes the predicted direction. To save this bit, or for compatibility with old instruction sets, some machines use a rule such as "backward branches are assumed to be taken, forward branches are assumed to be not-taken." The rationale for the first part of this rule is that backward branches are (often) loop branches, and a loop is more likely to continue than to exit. The rationale for the second part is that it is often useful to have predicted-not-taken branches for exceptional conditions; if all branches were predicted taken, we could reverse the sense of the condition to make the exceptional case "fall through" and the normal case take the branch, but this leads to worse instruction-cache performance, as discussed in . When generating code for machines that use forward/backward branch direction as the prediction mechanism, the compiler can order the basic blocks of the program so that the predicted-taken branches go to lower addresses. Several simple heuristics help predict the direction of a branch. Some of these heuristics make intuitive sense, but all have been validated empirically:
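The fixed backward/forward rule is simple enough to state as code. This hypothetical sketch predicts from instruction addresses alone (the addresses in the example are invented byte offsets):

```java
// Sketch of the fixed rule: a branch whose target is at or below the
// branch itself (a backward branch, often a loop branch) is predicted
// taken; a forward branch is predicted not-taken.
public class StaticRule {
    static boolean predictTaken(long branchAddr, long targetAddr) {
        return targetAddr <= branchAddr;
    }

    public static void main(String[] args) {
        System.out.println(predictTaken(0x1040, 0x1000)); // backward: true
        System.out.println(predictTaken(0x1040, 0x1080)); // forward: false
    }
}
```

This is also why a compiler targeting such a machine lays out blocks so that branches it wants predicted taken jump to lower addresses.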

Pointer: If a loop performs an equality comparison on pointers (p = null or p = q), then predict the condition as false.

Call: A branch is less likely to go to a successor that dominates a procedure call (many conditional calls are to handle exceptional situations).

Return: A branch is less likely to go to a successor that dominates a return-from-procedure.

Loop: A branch is more likely to go to the successor (if any) that is the header of the loop containing the branch.

Loop: A branch is more likely to go to the successor (if any) that is a loop preheader, if it does not postdominate the branch. This catches the results of the optimization described in Image 18.7, where the iteration count is more likely to be > 0 than = 0. (B postdominates A if every path from A to program-exit must go through B; see .)

Guard: If some value r is used as an operand of the branch (as part of the conditional test), then the branch is more likely to go to a successor in which r is live and which does not postdominate the branch.

There are some branches to which more than one of the heuristics applies. A simple approach in such cases is to give the heuristics a priority order and use the first applicable heuristic (the order in which they are listed above is a reasonable prioritization, based on empirical measurements).
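The priority-order scheme can be sketched as a first-match search. In this sketch the Branch class and its hint fields are hypothetical stand-ins; a real compiler would compute each hint by querying its flow graph, dominator tree, and liveness information:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class HeuristicPredictor {
    // Hypothetical branch summary: each hint is true (predict taken),
    // false (predict not-taken), or null (heuristic does not apply).
    static class Branch {
        Boolean pointerHint, callHint, returnHint,
                loopHeaderHint, loopPreheaderHint, guardHint;
    }

    // Try the heuristics in priority order; use the first that applies.
    static boolean predict(Branch b) {
        List<Function<Branch, Boolean>> order = Arrays.asList(
            x -> x.pointerHint, x -> x.callHint, x -> x.returnHint,
            x -> x.loopHeaderHint, x -> x.loopPreheaderHint,
            x -> x.guardHint);
        for (Function<Branch, Boolean> h : order) {
            Boolean hint = h.apply(b);
            if (hint != null) return hint;
        }
        return true; // no heuristic applies; arbitrary fallback
    }

    public static void main(String[] args) {
        Branch b = new Branch();
        b.loopHeaderHint = true; // loop heuristic: predict taken
        b.guardHint = false;     // guard heuristic would say not-taken
        System.out.println(predict(b)); // loop outranks guard: true
    }
}
```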

Another approach is to index a table by every possible subset of conditions that might apply, and decide (based on empirical measurements) what to do for each subset.


Perfect static prediction results in a dynamic mispredict rate of about 9% (for C programs) or 6% (for Fortran programs). The "perfect" mispredict rate is not zero because, on average, a given branch goes in the same direction only about 91% of the time. If a branch always went the same direction, there would be little need for it! Fortran programs tend to have more predictable branches because more of the branches are loop branches, and the loops have longer iteration counts. Profile-based prediction - in which a program is compiled with extra instructions to count the number of times each branch is taken, executed on sample data, and recompiled with prediction based on the counts - approaches the accuracy of perfect static prediction. Prediction based on the heuristics described above results in a dynamic mispredict rate of about 20% (for C programs), or about half as good as perfect (or profile-based) static prediction. A typical hardware-based branch-prediction scheme uses two bits for every branch in the instruction cache, recording how the branch went the last two times it executed. This leads to misprediction rates of about 11% (for C programs), which is about as good as profile-based prediction.
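The two-bit scheme can be sketched as a saturating counter per branch, a common way of realizing "how the branch went the last two times"; the point of the second bit is that one anomalous outcome (such as a loop exit) does not flip a well-established prediction:

```java
// Sketch of a two-bit dynamic predictor: states 0-1 predict
// not-taken, states 2-3 predict taken.
public class TwoBitPredictor {
    private int counter = 2; // start "weakly taken"

    boolean predict() { return counter >= 2; }

    void update(boolean taken) {
        counter = taken ? Math.min(3, counter + 1)
                        : Math.max(0, counter - 1);
    }

    public static void main(String[] args) {
        TwoBitPredictor p = new TwoBitPredictor();
        int mispredicts = 0;
        // A loop branch taken 9 times and then falling through once,
        // run twice: only the two loop exits are mispredicted, because
        // one not-taken outcome leaves the counter at 2 ("weakly taken").
        for (int i = 0; i < 20; i++) {
            boolean taken = (i % 10 != 9);
            if (p.predict() != taken) mispredicts++;
            p.update(taken);
        }
        System.out.println(mispredicts); // prints 2
    }
}
```

A one-bit "last outcome" predictor would mispredict twice per loop exit (the exit itself and the re-entry), which is why the two-bit version does noticeably better on loop-heavy code.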

A mispredict rate of 10% can result in very many stalled instructions - if each mispredict stalls 11 instruction slots, as described in the example on page 456, and there is one mispredict every 10 branches, and one-sixth of all instructions are branches, then 18% of the processor's time is spent waiting for mispredicted instruction-fetches. Therefore it will be necessary to do better, using some combination of hardware and software techniques. Relying on heuristics that mispredict 20% of the branches is better than no predictions at all, but will not suffice in the long run.
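The back-of-envelope figure above follows directly from the stated assumptions; this sketch (names are illustrative) multiplies them out:

```java
// Wasted issue slots per instruction executed, from the text's
// figures: 1/6 of instructions are branches, 1 in 10 branches is
// mispredicted, and each mispredict wastes 11 issue slots.
public class MispredictCost {
    static double wastedSlotsPerInstruction(double branchFraction,
                                            double mispredictRate,
                                            int slotsPerMispredict) {
        return branchFraction * mispredictRate * slotsPerMispredict;
    }

    public static void main(String[] args) {
        // (1/6) * 0.10 * 11 = 11/60, about 0.18 wasted slots per
        // instruction, i.e. roughly 18% of the machine's capacity.
        System.out.printf("%.3f%n",
            wastedSlotsPerInstruction(1.0 / 6, 0.10, 11));
    }
}
```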
