Choosing an optimal schedule subject to data-dependence constraints and resource hazards is difficult; in fact, the problem is NP-complete. Although NP-completeness should never scare the compiler writer (graph coloring is NP-complete, but the approximation algorithm for graph coloring described earlier is very successful), it remains the case that resource-bounded loop scheduling is hard to do in practice. We will first describe an algorithm that ignores the resource constraints and finds an optimal schedule subject only to the data-dependence constraints. This algorithm is not useful in practice, but it illustrates the kind of opportunities there are in instruction-level parallelism. The Aiken-Nicolau loop pipelining algorithm has several steps:
Unroll the loop;
Schedule each instruction from each iteration as early as its data dependences allow;
Plot the resulting schedule in a tableau of iterations versus cycles;
Find groups of instructions that recur at a constant slope, separated by growing gaps;
Coalesce the slopes and reroll the schedule into a pipelined loop.
We use Program 20.4a as an example to explain the notions of tableau, slope, and coalesce. Let us assume that every instruction can be completed in one cycle, and that arbitrarily many instructions can be issued in the same cycle, subject only to data-dependence constraints.
PROGRAM 20.4: (a) A for-loop to be software-pipelined. (b) After a scalar-replacement optimization (in the definition of a), and with scalar variables labeled with their iteration number.
for i ← 1 to N
    a ← j ⊕ V[i − 1]
    b ← a ⊕ f
    c ← e ⊕ j
    d ← f ⊕ c
    e ← b ⊕ d
    f ← U[i]
    g : V[i] ← b
    h : W[i] ← d
    j ← X[i]

(a)

for i ← 1 to N
    a_{i} ← j_{i−1} ⊕ b_{i−1}
    b_{i} ← a_{i} ⊕ f_{i−1}
    c_{i} ← e_{i−1} ⊕ j_{i−1}
    d_{i} ← f_{i−1} ⊕ c_{i}
    e_{i} ← b_{i} ⊕ d_{i}
    f_{i} ← U[i]
    g : V[i] ← b_{i}
    h : W[i] ← d_{i}
    j_{i} ← X[i]

(b)
Data dependence through memory
For optimal scheduling of stores and fetches, we need to trace the data dependence as each value is stored into memory and then fetched back. As discussed on page 424, dependence analysis of memory references is not trivial! To illustrate loop scheduling for Program 20.4a without full-fledged dependence analysis, we can use scalar replacement to replace the reference to V[i − 1] with the (equivalent) b; in the resulting Program 20.4b, all memory references are now independent of each other, assuming the arrays U, V, W, X do not overlap. Next we mark each variable in the loop body to indicate whether this iteration's value or the previous iteration's value is used, as shown in Program 20.4b. We can construct a data-dependence graph as a visual aid in scheduling; solid edges are data dependences within an iteration, and dotted edges are loop-carried dependences, as shown in Graph 20.5a.
GRAPH 20.5: Data-dependence graph for Program 20.4b: (a) original graph, in which solid edges are same-iteration dependences and dotted edges are loop-carried dependences; (b) acyclic dependences of the unrolled loop.

Now suppose we unroll the loop; the data-dependence graph is then a DAG, as shown in Graph 20.5b. Scheduling a DAG is easy if there are no resource constraints: starting from the operations with no predecessors, each operation is issued as soon as all its predecessors have completed:
Cycle   Instructions
  1     a_{1} c_{1} f_{1} j_{1} f_{2} j_{2} f_{3} j_{3} …
  2     b_{1} d_{1}
  3     e_{1} g_{1} h_{1} a_{2}
  4     b_{2} c_{2}
  5     d_{2} g_{2} a_{3}
  ⋮     ⋮
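The earliest-possible-cycle rule can be sketched in Java. This is our illustration, not the book's code: the dependence edges are transcribed from Program 20.4b, and the recurrence places each instruction one cycle after the latest of its predecessors, with values from "iteration 0" counted as available before cycle 1.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of unconstrained ASAP scheduling for the unrolled dependence
// DAG of Program 20.4b (our illustration, not the book's code).
public class AsapSchedule {
    static Map<String, Integer> memo = new HashMap<>();

    // cycle(x, i): earliest cycle at which instruction x of iteration i can issue.
    static int cycle(String x, int i) {
        if (i < 1) return 0;              // iteration-0 values are ready before cycle 1
        String key = x + "#" + i;
        Integer cached = memo.get(key);
        if (cached != null) return cached;
        int t;
        switch (x) {
            // dependence edges transcribed from Program 20.4b
            case "a": t = max(cycle("j", i - 1), cycle("b", i - 1)); break;
            case "b": t = max(cycle("a", i),     cycle("f", i - 1)); break;
            case "c": t = max(cycle("e", i - 1), cycle("j", i - 1)); break;
            case "d": t = max(cycle("f", i - 1), cycle("c", i));     break;
            case "e": t = max(cycle("b", i),     cycle("d", i));     break;
            case "g": t = cycle("b", i); break;   // g: V[i] <- b_i
            case "h": t = cycle("d", i); break;   // h: W[i] <- d_i
            default:  t = 0;                      // f and j are loads with no predecessors
        }
        int result = t + 1;               // issue one cycle after the latest predecessor
        memo.put(key, result);
        return result;
    }

    static int max(int p, int q) { return p > q ? p : q; }

    public static void main(String[] args) {
        System.out.println(cycle("a", 1)); // 1
        System.out.println(cycle("e", 1)); // 3
        System.out.println(cycle("d", 2)); // 5
        System.out.println(cycle("a", 3)); // 5
    }
}
```

The printed cycles match the table above: a_{1} at cycle 1, e_{1} at cycle 3, and d_{2}, a_{3} at cycle 5.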
It is convenient to write this schedule in a tableau where the rows are successive cycles and the columns are successive iterations of the original loop, as shown in Table 20.6a. After a few iterations are scheduled, we notice a pattern in the tableau: there is a group of instructions c d e h racing down to the lower-right corner with a slope of three cycles per iteration, another group a b g with a more moderate slope of two cycles per iteration, and a third group f j with zero slope. The key observation is that there are gaps in the schedule, separating identical groups, and that these gaps grow larger at a constant rate. In this case the groups of instructions at iteration i ≥ 4 are identical to the groups at iteration i + 1. In general the groups at iteration i will be identical to the groups at iteration i + c, where sometimes c > 1; see Exercise 20.1. Theorems:
If there are K instructions in the loop, the pattern of identical groups separated by gaps will always appear within K^{2} iterations (and usually much sooner).
The loop schedule obtained by rolling this repeating pattern back up into a loop is optimal.
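A small experiment (ours, not the book's) can confirm the stabilizing pattern: in the unconstrained schedule computed by the same earliest-cycle recurrence, each instruction's cycle eventually advances by a fixed amount per iteration — zero for the f, j group, two for a, b, g, and three for c, d, e, h.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (ours, not the book's): measure each instruction's slope in
// the unconstrained ASAP schedule and observe that it stabilizes.
public class SlopeCheck {
    static Map<String, Integer> memo = new HashMap<>();

    static int cycle(String x, int i) {
        if (i < 1) return 0;
        String key = x + "#" + i;
        Integer cached = memo.get(key);
        if (cached != null) return cached;
        int t;
        switch (x) {   // same dependence edges as Program 20.4b
            case "a": t = max(cycle("j", i - 1), cycle("b", i - 1)); break;
            case "b": t = max(cycle("a", i),     cycle("f", i - 1)); break;
            case "c": t = max(cycle("e", i - 1), cycle("j", i - 1)); break;
            case "d": t = max(cycle("f", i - 1), cycle("c", i));     break;
            case "e": t = max(cycle("b", i),     cycle("d", i));     break;
            case "g": t = cycle("b", i); break;
            case "h": t = cycle("d", i); break;
            default:  t = 0;
        }
        memo.put(key, t + 1);
        return t + 1;
    }

    static int max(int p, int q) { return p > q ? p : q; }

    // slope(x): cycles per iteration once the repeating pattern has appeared
    static int slope(String x) {
        return cycle(x, 21) - cycle(x, 20);
    }

    public static void main(String[] args) {
        System.out.println(slope("f")); // 0: group f j
        System.out.println(slope("a")); // 2: group a b g
        System.out.println(slope("d")); // 3: group c d e h, the steepest
    }
}
```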
See the Further Reading section for references to proofs. To see why the schedule is optimal, consider that the data-dependence DAG of the unrolled loop has some path of length P leading to the last instruction to execute, and the scheduled loop executes that instruction at time P. The result, for our example, is shown in Table 20.6b. Now we can find a repeating pattern of three cycles (since three is the slope of the steepest group). In this case the pattern does not begin until cycle 8; it is shown in a box. This pattern constitutes the body of the scheduled loop. Irregularly scheduled instructions before the loop body constitute a prologue, and those after it constitute an epilogue. Now we can generate the multiple-instruction-issue program for this loop, as shown in Image 20.7. However, the variables still have subscripts in this "program": the variable j_{i+1} is live at the same time as j_{i}. To encode this program in real instructions, we must insert MOVE instructions between the different variables, as shown in Image 20.8.
Image 20.7: the scheduled loop, still with iteration-subscripted variables.

a_{1} ← j_{0} ⊕ b_{0} 
c_{1} ← e_{0} ⊕ j_{0} 
f_{1} ← U[1] 
j_{1} ← X[1] 

b_{1} ← a_{1} ⊕ f_{0} 
d_{1} ← f_{0} ⊕ c_{1} 
f_{2} ← U[2] 
j_{2} ← X[2] 

e_{1} ← b_{1} ⊕ d_{1} 
V[1] ← b_{1} 
W[1] ← d_{1} 
a_{2} ← j_{1} ⊕ b_{1} 

b_{2} ← a_{2} ⊕ f_{1} 
c_{2} ← e_{1} ⊕ j_{1} 
f_{3} ← U[3] 
j_{3} ← X[3] 

d_{2} ← f_{1} ⊕ c_{2} 
V[2] ← b_{2} 
a_{3} ← j_{2} ⊕ b_{2} 

e_{2} ← b_{2} ⊕ d_{2} 
W[2] ← d_{2} 
b_{3} ← a_{3} ⊕ f_{2} 
f_{4} ← U[4] 
j_{4} ← X[4] 

c_{3} ← e_{2} ⊕ j_{2} 
V[3] ← b_{3} 
a_{4} ← j_{3} ⊕ b_{3} 
i ← 3 

L : 
d_{i} ← f_{i−1} ⊕ c_{i} 
b_{i+1} ← a_{i+1} ⊕ f_{i} 

e_{i} ← b_{i} ⊕ d_{i} 
W[i] ← d_{i} 
V[i+1] ← b_{i+1} 
f_{i+2} ← U[i+2] 
j_{i+2} ← X[i+2] 

c_{i+1} ← e_{i} ⊕ j_{i} 
a_{i+2} ← j_{i+1} ⊕ b_{i+1} 
i ← i + 1 
if i < N − 2 goto L 

d_{N−1} ← f_{N−2} ⊕ c_{N−1} 
b_{N} ← a_{N} ⊕ f_{N−1} 

e_{N−1} ← b_{N−1} ⊕ d_{N−1} 
W[N − 1] ← d_{N−1} 
V[N] ← b_{N} 

c_{N} ← e_{N−1} ⊕ j_{N−1} 

d_{N} ← f_{N−1} ⊕ c_{N} 

e_{N} ← b_{N} ⊕ d_{N} 
W[N] ← d_{N} 
Image 20.8: the loop with MOVE instructions inserted in place of subscripts.

a_{1} ← j_{0} ⊕ b_{0} 
c_{1} ← e_{0} ⊕ j_{0} 
f_{1} ← U[1] 
j_{1} ← X[1] 

b_{1} ← a_{1} ⊕ f_{0} 
d_{1} ← f_{0} ⊕ c_{1} 
f′′ ← U[2] 
j_{2} ← X[2] 

e_{1} ← b_{1} ⊕ d_{1} 
V[1] ← b_{1} 
W[1] ← d_{1} 
a_{2} ← j_{1} ⊕ b_{1} 

b_{2} ← a_{2} ⊕ f_{1} 
c_{2} ← e_{1} ⊕ j_{1} 
f′ ← U[3] 
j′ ← X[3] 

d_{2} ← f_{1} ⊕ c_{2} 
V[2] ← b_{2} 
a ← j_{2} ⊕ b_{2} 

e_{2} ← b_{2} ⊕ d_{2} 
W[2] ← d_{2} 
b ← a ⊕ f′′ 
f ← U[4] 
j ← X[4] 

c ← e_{2} ⊕ j_{2} 
V[3] ← b 
a ← j′ ⊕ b 
i ← 3 

L : 
d ← f′′ ⊕ c 
b ← a′ ⊕ f′ 
b′ ← b; a′ ← a; f′′ ← f′; f′ ← f; j′′ ← j′; j′ ← j 

e ← b′ ⊕ d 
W[i] ← d 
V[i+1] ← b 
f ← U[i+2] 
j ← X[i+2] 

c ← e ⊕ j′′ 
a ← j′ ⊕ b 
i ← i + 1 
if i < N − 2 goto L 

d ← f′′ ⊕ c 
b ← a ⊕ f′ 
b′ ← b 

e ← b′ ⊕ d 
W[N − 1] ← d 
V[N] ← b 

c ← e ⊕ j′ 

d ← f′ ⊕ c 

e ← b ⊕ d 
W[N] ← d 
This loop is optimally scheduled, assuming the machine can execute eight instructions at a time, including four simultaneous loads and stores.

Multicycle instructions
Although we have illustrated an example in which each instruction takes exactly one cycle, the algorithm is easily extended to the situation where some instructions take multiple cycles.
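As a sketch of that extension (ours, not the book's; the three-cycle load latency is an assumption chosen for illustration): an instruction may issue only after each predecessor's result is ready, that is, after the predecessor's start cycle plus its latency minus one.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of ASAP scheduling with multicycle latencies (our
// illustration; the latencies are assumptions, not from the book).
public class MulticycleAsap {
    static Map<String, Integer> memo = new HashMap<>();

    // assumed latencies: the loads f and j take 3 cycles, everything else 1
    static int latency(String x) {
        return (x.equals("f") || x.equals("j")) ? 3 : 1;
    }

    // ready(x, i): cycle by the end of which x's result is available
    static int ready(String x, int i) {
        return i < 1 ? 0 : start(x, i) + latency(x) - 1;
    }

    static int start(String x, int i) {
        String key = x + "#" + i;
        Integer cached = memo.get(key);
        if (cached != null) return cached;
        int t;
        switch (x) {   // same dependence edges as Program 20.4b
            case "a": t = max(ready("j", i - 1), ready("b", i - 1)); break;
            case "b": t = max(ready("a", i),     ready("f", i - 1)); break;
            case "c": t = max(ready("e", i - 1), ready("j", i - 1)); break;
            case "d": t = max(ready("f", i - 1), ready("c", i));     break;
            case "e": t = max(ready("b", i),     ready("d", i));     break;
            case "g": t = ready("b", i); break;
            case "h": t = ready("d", i); break;
            default:  t = 0;             // loads have no predecessors
        }
        int result = t + 1;              // issue once every predecessor's result is ready
        memo.put(key, result);
        return result;
    }

    static int max(int p, int q) { return p > q ? p : q; }

    public static void main(String[] args) {
        System.out.println(start("a", 2)); // 4: delayed by the 3-cycle load of j_1
    }
}
```

With unit latencies this recurrence reduces to the one used earlier; with the assumed 3-cycle loads, a_{2} slips from cycle 3 to cycle 4 because it must wait for j_{1}.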