
LOOP SCHEDULING WITHOUT RESOURCE BOUNDS

Choosing an optimal schedule subject to data-dependence constraints and resource hazards is difficult - it is NP-complete, for example. Although NP-completeness should never scare the compiler writer (graph coloring is NP-complete, but the graph-coloring approximation algorithm described earlier in this book is very successful), resource-bounded loop scheduling remains hard to do well in practice. We will first describe an algorithm that ignores the resource constraints and finds an optimal schedule subject only to the data-dependence constraints. This algorithm is not useful in practice, but it illustrates the kind of opportunities there are in instruction-level parallelism. The Aiken-Nicolau loop pipelining algorithm has several steps:

  1. Unroll the loop;

  2. Schedule each instruction from each iteration at the earliest possible time;
  3. Plot the instructions in a tableau of iteration-number versus time;
  4. Find separated groups of instructions at given slopes;
  5. Coalesce the slopes;
  6. Reroll the loop.

We use Program 20.4a as an example to explain the notions of tableau, slope, and coalescing. Let us assume that every instruction can be completed in one cycle, and that arbitrarily many instructions can be issued in the same cycle, subject only to data-dependence constraints.

PROGRAM 20.4: (a) A for-loop to be software-pipelined. (b) After a scalar-replacement optimization (in the definition of a), with scalar variables labeled with their iteration numbers.

(a) for i ← 1 to N              (b) for i ← 1 to N
        a ← j ⊕ V[i − 1]                ai ← ji−1 ⊕ bi−1
        b ← a ⊕ f                       bi ← ai ⊕ fi−1
        c ← e ⊕ j                       ci ← ei−1 ⊕ ji−1
        d ← f ⊕ c                       di ← fi−1 ⊕ ci
        e ← b ⊕ d                       ei ← bi ⊕ di
        f ← U[i]                        fi ← U[i]
        g : V[i] ← b                    g : V[i] ← bi
        h : W[i] ← d                    h : W[i] ← di
        j ← X[i]                        ji ← X[i]

Data dependence through memory
For optimal scheduling of stores and fetches, we need to trace the data dependence as each value is stored into memory and then fetched back. As discussed on page 424, dependence analysis of memory references is not trivial! To illustrate loop scheduling for Program 20.4a without full-fledged dependence analysis, we can use scalar replacement to replace the fetch of V[i − 1] with the (equivalent) scalar b; in the resulting Program 20.4b, all memory references are independent of each other, assuming that the arrays U, V, W, X do not overlap. Next we mark each variable in the loop body to indicate whether this iteration's value or the previous iteration's value is used, as shown in Program 20.4b. As a visual aid in scheduling, we can construct a data-dependence graph in which solid edges are data dependences within an iteration and dotted edges are loop-carried dependences, as shown in Graph 20.5a.
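As a concrete sanity check, here is the transformation in executable form: a hypothetical Python rendering of the loop before and after scalar replacement, with ordinary + standing in for the generic operator ⊕ and invented input data. The two versions store identical values into V and W.

```python
# Scalar replacement on the pattern of Program 20.4a: instruction g stores
# b into V[i], and the next iteration's a fetches V[i-1] (the very value
# the scalar b still holds), so the fetch can be replaced by b itself.
def original(U, X, N):
    V = [0] * (N + 1); W = [0] * (N + 1)
    a = b = c = d = e = f = j = 0
    for i in range(1, N + 1):
        a = j + V[i - 1]          # memory dependence on last iteration's store
        b = a + f
        c = e + j
        d = f + c
        e = b + d
        f = U[i]
        V[i] = b                  # g
        W[i] = d                  # h
        j = X[i]
    return V, W

def scalar_replaced(U, X, N):     # corresponds to Program 20.4b
    V = [0] * (N + 1); W = [0] * (N + 1)
    a = b = c = d = e = f = j = 0
    for i in range(1, N + 1):
        a = j + b                 # V[i-1] replaced by the scalar b
        b = a + f
        c = e + j
        d = f + c
        e = b + d
        f = U[i]
        V[i] = b
        W[i] = d
        j = X[i]
    return V, W

U = [0] + [2, 3, 5, 7, 11, 13, 17, 19]    # invented data; index 0 unused
X = [0] + [1, 4, 9, 16, 25, 36, 49, 64]
assert original(U, X, 8) == scalar_replaced(U, X, 8)
```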

GRAPH 20.5: Data-dependence graph for Program 20.4b: (a) original graph, in which solid edges are same-iteration dependences and dotted edges are loop-carried dependences; (b) acyclic dependences of the unrolled loop.

Now suppose we unroll the loop; the data-dependence graph is a DAG, as shown in Graph 20.5b. Scheduling DAGs is easy if there are no resource constraints; starting from operations with no predecessors, each operation goes as soon as its predecessors have all completed:

Table 20.6: (a) Tableau of software-pipelined loop schedule; there is a group of instructions fj with slope 0, another group abg with slope 2, and a third group cdeh with slope 3. (b) The smaller-slope groups are pushed down to slope 3, and a pattern is found (boxed) that constitutes the pipelined loop.


Cycle   Instructions
  1     a1 c1 f1 j1 f2 j2 f3 j3
  2     b1 d1
  3     e1 g1 h1 a2
  4     b2 c2
  5     d2 g2 a3
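This schedule can be reproduced mechanically. The sketch below is a minimal ASAP (as-soon-as-possible) scheduler; the dependence lists transcribe Program 20.4b, with an iteration offset of 1 meaning "the previous iteration's value", assuming unit latencies and unbounded issue width:

```python
# ASAP scheduling of the unrolled loop's dependence DAG: each instruction
# issues one cycle after the latest of its predecessors; values from
# before iteration 1 are available at time 0.
uses = {
    'a': [('j', 1), ('b', 1)],   # ai <- j(i-1) (+) b(i-1)
    'b': [('a', 0), ('f', 1)],
    'c': [('e', 1), ('j', 1)],
    'd': [('f', 1), ('c', 0)],
    'e': [('b', 0), ('d', 0)],
    'f': [],                     # fi <- U[i]
    'g': [('b', 0)],             # V[i] <- bi
    'h': [('d', 0)],             # W[i] <- di
    'j': [],                     # ji <- X[i]
}

def asap(n_iters):
    """Earliest cycle for each instruction of iterations 1..n_iters."""
    time = {}
    for i in range(1, n_iters + 1):
        for x in 'abcdefghj':    # topological order within one iteration
            preds = (time.get((v, i - off), 0) for v, off in uses[x])
            time[(x, i)] = 1 + max(preds, default=0)
    return time

t = asap(3)
print(t[('e', 3)])   # e3 is scheduled in cycle 9
```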

It is convenient to write this schedule in a tableau where the rows are successive cycles and the columns are successive iterations of the original loop, as shown in Table 20.6a. After a few iterations are scheduled, we notice a pattern in the tableau: there is a group of instructions cdeh racing down to the lower-right corner with a slope of three cycles per iteration, another group abg with a more moderate slope of two cycles per iteration, and a third group fj with zero slope. The key observation is that there are gaps in the schedule, separating identical groups, that grow larger at a constant rate. In this case, for i ≥ 4, the group of instructions at iteration i is identical to the group at iteration i + 1. In general the groups at iteration i will be identical to the groups at iteration i + c, where sometimes c > 1; see Exercise 20.1. Theorems: the tableau always settles into such a repeating pattern, and the loop obtained by rolling the pattern back up is optimally scheduled.
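The slopes can be computed mechanically from the earliest-start times. The sketch below uses the same transcription of Program 20.4b's dependences (unit latency, unbounded issue width) and reproduces the three groups:

```python
# Measure each instruction's slope: schedule several iterations ASAP, then
# take the per-iteration growth of its cycle number once the pattern settles.
uses = {'a': [('j', 1), ('b', 1)], 'b': [('a', 0), ('f', 1)],
        'c': [('e', 1), ('j', 1)], 'd': [('f', 1), ('c', 0)],
        'e': [('b', 0), ('d', 0)], 'f': [],
        'g': [('b', 0)], 'h': [('d', 0)], 'j': []}
time = {}
for i in range(1, 9):
    for x in 'abcdefghj':
        preds = (time.get((v, i - off), 0) for v, off in uses[x])
        time[(x, i)] = 1 + max(preds, default=0)

# by iteration 7 the pattern has settled, so t(i+1) - t(i) is the slope
slope = {x: time[(x, 8)] - time[(x, 7)] for x in 'abcdefghj'}
print(slope)   # f and j have slope 0; a, b, g slope 2; c, d, e, h slope 3
```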

See the Further Reading section for references to the proofs. But to see why the loop is optimal, consider that the data-dependence DAG of the unrolled loop has some path of length P leading to the last instruction to execute, and the scheduled loop executes that instruction at time P.

The result of coalescing the slopes, for our example, is shown in Table 20.6b. Now we can find a repeating pattern of three cycles (since three is the slope of the steepest group). In this case the pattern does not begin until cycle 8; it is shown in a box. This pattern will constitute the body of the scheduled loop. The irregularly scheduled instructions before the loop body constitute a prologue, and those after it constitute an epilogue.

Now we can generate the multiple-instruction-issue program for this loop, as shown in Image 20.7. However, the variables still have subscripts in this "program": the variable ji+1 is live at the same time as ji. To encode this program in instructions, we need to put in MOVE instructions between the different variables, as shown in Image 20.8.
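The coalescing step can also be checked mechanically. The sketch below (an assumption-laden model, not the book's own code) delays the flatter groups so that, from iteration 4 on, every instruction advances at the steepest slope of 3; the assertions confirm that all dependences remain satisfied and that a 3-cycle window of the steady state contains exactly one copy of each instruction - the rerolled loop body:

```python
# Coalescing the slopes for Program 20.4b: take ASAP times for the first
# four iterations, then extend every instruction linearly at the steepest
# slope M = 3.  Dependences still cross at least one cycle, and each
# 3-cycle steady-state window holds one copy of all nine instructions.
uses = {'a': [('j', 1), ('b', 1)], 'b': [('a', 0), ('f', 1)],
        'c': [('e', 1), ('j', 1)], 'd': [('f', 1), ('c', 0)],
        'e': [('b', 0), ('d', 0)], 'f': [],
        'g': [('b', 0)], 'h': [('d', 0)], 'j': []}
time = {}
for i in range(1, 5):                      # ASAP times for iterations 1..4
    for x in 'abcdefghj':
        preds = (time.get((v, i - off), 0) for v, off in uses[x])
        time[(x, i)] = 1 + max(preds, default=0)

M = 3                                      # the steepest slope
def t(x, i):                               # coalesced schedule
    return time[(x, 4)] + M * (i - 4)

# every dependence still takes at least one cycle
for i in range(6, 30):
    for x, vs in uses.items():
        assert all(t(x, i) - t(v, i - off) >= 1 for v, off in vs)

# one 3-cycle window of the steady state: one copy of every instruction
kernel = sorted(x for x in 'abcdefghj' for i in range(1, 40) if 12 <= t(x, i) <= 14)
print(kernel)   # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j']
```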

a1 ← j0 ⊕ b0        c1 ← e0 ⊕ j0        f1 ← U[1]         j1 ← X[1]
b1 ← a1 ⊕ f0        d1 ← f0 ⊕ c1        f2 ← U[2]         j2 ← X[2]
e1 ← b1 ⊕ d1        V[1] ← b1           W[1] ← d1         a2 ← j1 ⊕ b1
b2 ← a2 ⊕ f1        c2 ← e1 ⊕ j1        f3 ← U[3]         j3 ← X[3]
d2 ← f1 ⊕ c2        V[2] ← b2           a3 ← j2 ⊕ b2
e2 ← b2 ⊕ d2        W[2] ← d2           b3 ← a3 ⊕ f2      f4 ← U[4]      j4 ← X[4]
c3 ← e2 ⊕ j2        V[3] ← b3           a4 ← j3 ⊕ b3
i ← 3
L : di ← fi−1 ⊕ ci      bi+1 ← ai+1 ⊕ fi
    ei ← bi ⊕ di        W[i] ← di           V[i + 1] ← bi+1     fi+2 ← U[i + 2]    ji+2 ← X[i + 2]
    ci+1 ← ei ⊕ ji      ai+2 ← ji+1 ⊕ bi+1      i ← i + 1       if i < N − 2 goto L
dN−1 ← fN−2 ⊕ cN−1      bN ← aN ⊕ fN−1
eN−1 ← bN−1 ⊕ dN−1      W[N − 1] ← dN−1     V[N] ← bN
cN ← eN−1 ⊕ jN−1
dN ← fN−1 ⊕ cN
eN ← bN ⊕ dN            W[N] ← dN

Image 20.7: Pipelined schedule. Assignments in each row happen simultaneously; each right-hand side refers to the value before the assignment. The loop exit test i < N + 1 has been "moved past" three increments of i, so appears as i < N − 2.
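The arithmetic behind the moved exit test is easy to verify: testing i < N + 1 after three further increments of i is the same as testing i < N + 1 − 3, that is, i < N − 2, on the current value:

```python
# Moving a loop-exit test past increments: evaluating "i < N + 1" after
# three increments of i equals evaluating "i < N - 2" before them.
for N in (5, 10, 100):
    for i in range(0, N + 5):
        assert (i + 3 < N + 1) == (i < N - 2)
print("moved exit test agrees with the original")
```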
a1 ← j0 ⊕ b0        c1 ← e0 ⊕ j0        f1 ← U[1]         j1 ← X[1]
b1 ← a1 ⊕ f0        d1 ← f0 ⊕ c1        f′′ ← U[2]        j2 ← X[2]
e1 ← b1 ⊕ d1        V[1] ← b1           W[1] ← d1         a2 ← j1 ⊕ b1
b2 ← a2 ⊕ f1        c2 ← e1 ⊕ j1        f′ ← U[3]         j′ ← X[3]
d ← f1 ⊕ c2         V[2] ← b2           a ← j2 ⊕ b2
e2 ← b2 ⊕ d2        W[2] ← d2           b ← a ⊕ f′′       f ← U[4]       j ← X[4]
c ← e2 ⊕ j2         V[3] ← b            a ← j′ ⊕ b
i ← 3
L : d ← f′′ ⊕ c     b ← a′ ⊕ f      b′ ← b; a′ ← a; f′′ ← f′; f′ ← f; j′′ ← j′; j′ ← j
    e ← b′ ⊕ d      W[i] ← d        V[i + 1] ← b      f ← U[i + 2]      j ← X[i + 2]
    c ← e ⊕ j′′     a ← j′ ⊕ b      i ← i + 1         if i < N − 2 goto L
d ← f′′ ⊕ c         b ← a ⊕ f       b′ ← b
e ← b′ ⊕ d          W[N − 1] ← d    V[N] ← b
c ← e ⊕ j
d ← f′ ⊕ c
e ← b ⊕ d           W[N] ← d

Image 20.8: Pipelined schedule, with move instructions.
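A small simulation shows why the rotation works: ji, ji+1, and ji+2 are all live at once in the kernel, so one register cannot hold "j". Instead three registers (written j′′, j′, j in Image 20.8) hold the three live values, and the parallel moves shift each value down one position per iteration. The array contents here are invented:

```python
# Rotating registers by MOVEs: jpp, jp, j model j'', j', j and hold the
# three simultaneously live values j_i, j_{i+1}, j_{i+2} of the kernel.
X = [None] + [10 * k for k in range(1, 11)]   # X[1..10]; index 0 unused
js = {i: X[i] for i in range(1, 11)}          # reference: subscripted j_i

jpp, jp, j = X[1], X[2], X[3]                 # register state after the prologue
for i in range(1, 8):
    # at the top of kernel iteration i the registers hold j_i, j_{i+1}, j_{i+2}
    assert (jpp, jp, j) == (js[i], js[i + 1], js[i + 2])
    jpp, jp, j = jp, j, X[i + 3]              # the moves j'' <- j'; j' <- j; j <- X[i+3]
print("register rotation matches the subscripted schedule")
```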

This loop is optimally scheduled - assuming the machine can execute eight instructions at a time, including four simultaneous loads and stores.

Multicycle instructions
Although we have illustrated an example in which every instruction takes exactly one cycle, the algorithm is easily extended to handle instructions that take multiple cycles.
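With multicycle instructions, an operation may issue only after each predecessor has completed, so the earliest-start recurrence becomes start(x) = 1 + max over predecessors u of (start(u) + latency(u) − 1). A minimal sketch, with invented instruction kinds and latencies:

```python
# ASAP scheduling with per-instruction latencies: a tiny DAG in which two
# 2-cycle loads feed a 1-cycle add.  (Latencies and names are invented.)
latency = {'load': 2, 'add': 1}
uses = {'x': [], 'y': [], 's': ['x', 'y']}    # s depends on x and y
kind = {'x': 'load', 'y': 'load', 's': 'add'}
start = {}
for v in ('x', 'y', 's'):                     # topological order
    preds = (start[u] + latency[kind[u]] - 1 for u in uses[v])
    start[v] = 1 + max(preds, default=0)
print(start)   # {'x': 1, 'y': 1, 's': 3}
```

The loads start in cycle 1 and occupy cycles 1-2, so the add cannot start until cycle 3, as the recurrence computes.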

