
Hennessy and Patterson [1996] explain the design and implementation of high-performance machines, instruction-level parallelism, pipeline structure, functional units, caches, out-of-order execution, register renaming, branch prediction, and many other computer-architecture issues, with comparisons of compiler versus run-time-hardware techniques for optimization. Kane and Heinrich [1992] describe the pipeline constraints of the MIPS R4000 computer, from which Figures 20.1 and 20.2 are adapted.

CISC computers of the 1970s implemented complex instructions sequentially using an internal microcode that could do several operations simultaneously; it was not possible for the compiler to interleave parts of several macroinstructions for increased parallelism. Fisher [1981] developed an automatic scheduling algorithm for microcode, using the idea of trace scheduling to optimize frequently executed paths, and then proposed a very-long-instruction-word (VLIW) architecture [Fisher 1983] that could expose the microoperations directly to user programs, using the compiler to schedule.

Aiken and Nicolau [1988] were among the first to point out that a single loop iteration need not be scheduled in isolation, and presented an algorithm for optimal (ignoring resource constraints) parallelization of loops. Many variations of the multiprocessor scheduling problem are NP-complete [Garey and Johnson 1979; Ullman 1975]. The *iterative modulo scheduling* algorithm [Rau 1994] gets good results in practice. In the absence of resource constraints, it is equivalent to the Bellman-Ford shortest-path algorithm [Ford and Fulkerson 1962]. Optimal schedules can be obtained (in principle) by expressing the constraints as an integer linear program [Govindarajan et al. 1996], but integer-linear-program solvers can take exponential time (the problem is NP-complete), and the register-allocation constraint is still difficult to express in linear inequalities.
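To illustrate the Bellman-Ford connection mentioned above: with no resource constraints, finding the earliest start cycle of each instruction under data-precedence edges is a path-relaxation problem over the dependence graph. The sketch below is illustrative only (the instruction names, latencies, and the `earliestStart` helper are invented for this example, not taken from Rau's algorithm); it repeatedly relaxes precedence edges, Bellman-Ford style, until the start times stabilize.

```java
import java.util.Arrays;

public class PrecedenceSchedule {
    // edges[i] = list of {successor, latency}: the successor may not
    // start until `latency` cycles after instruction i starts.
    static int[] earliestStart(int n, int[][][] edges) {
        int[] time = new int[n];                 // every instruction starts at cycle 0
        for (int pass = 0; pass < n; pass++) {   // at most n-1 relaxation passes suffice
            boolean changed = false;
            for (int i = 0; i < n; i++)
                for (int[] e : edges[i]) {
                    int succ = e[0], lat = e[1];
                    if (time[i] + lat > time[succ]) {   // relax the precedence edge
                        time[succ] = time[i] + lat;
                        changed = true;
                    }
                }
            if (!changed) break;                 // fixpoint reached
        }
        return time;
    }

    public static void main(String[] args) {
        // Hypothetical 4-instruction block:
        // 0: load; 1: add (needs load, latency 2); 2: mul (needs load, latency 2);
        // 3: store (needs add and mul, latency 1 each)
        int[][][] edges = {
            {{1, 2}, {2, 2}},   // load -> add, load -> mul
            {{3, 1}},           // add -> store
            {{3, 1}},           // mul -> store
            {}
        };
        System.out.println(Arrays.toString(earliestStart(4, edges)));
        // prints [0, 2, 2, 3]
    }
}
```

With resource constraints added (only so many functional units per cycle), the problem becomes NP-complete, which is why iterative modulo scheduling resorts to heuristics.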

Ball and Larus [1993] describe and measure the static branch-prediction heuristics shown in this chapter. Young and Smith [1994] show a profile-based static branch-prediction algorithm that does *better* than optimal static prediction; the apparent contradiction in this statement is explained by the fact that their algorithm replicates some basic blocks, so that a branch that's 80% taken (with a 20% misprediction rate) might become two different branches, one almost-always taken and one almost-always not taken.
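The arithmetic behind that trick can be made concrete. The numbers below are hypothetical, chosen only so the aggregate taken-rate works out to 80%: after replication, suppose 70% of executions reach a copy where the branch is always taken and 30% reach a copy where it is taken only one-third of the time. Each copy then gets its own static prediction, and the combined misprediction rate drops below the 20% floor of the single unreplicated branch.

```java
public class BranchSplit {
    // Misprediction rate of the best static prediction for a branch
    // taken with probability p: predict the more likely direction,
    // so we miss on the less likely one.
    static double miss(double p) { return Math.min(p, 1.0 - p); }

    public static void main(String[] args) {
        // Single branch, taken 80% of the time: the best static choice
        // ("predict taken") still mispredicts 20% of executions.
        System.out.println(miss(0.8));                        // approximately 0.2

        // Hypothetical split: 70% of executions reach a copy that is
        // always taken, 30% reach a copy taken 1/3 of the time.
        // Aggregate taken-rate is still 0.7*1.0 + 0.3*(1/3) = 0.8,
        // but per-copy prediction cuts the combined miss rate in half.
        double combined = 0.7 * miss(1.0) + 0.3 * miss(1.0 / 3.0);
        System.out.println(combined);                         // approximately 0.1
    }
}
```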
