LOOP SCHEDULING WITHOUT RESOURCE BOUNDS
Choosing an optimal schedule subject to data-dependence constraints and resource hazards is difficult - it is NP-complete, for example. Although NP-completeness should never scare the compiler writer (graph coloring is NP-complete, but the approximation algorithm for graph coloring described earlier is very successful), it remains the case that resource-bounded loop scheduling is hard to do in practice. We will first describe an algorithm that ignores the resource constraints and finds an optimal schedule subject only to the data-dependence constraints. This algorithm is not useful in practice, but it illustrates the kind of opportunities there are in instruction-level parallelism. The Aiken-Nicolau loop-pipelining algorithm has several steps:
- Unroll the loop;
- Schedule each instruction from each iteration at the earliest possible time;
- Plot the instructions in a tableau of iteration-number versus time;
- Find separated groups of instructions at given slopes;
- Coalesce the slopes;
- Reroll the loop.
We use Program 20.4a as an example to explain the notions of tableau, slope, and coalesce. Let us assume that every instruction can be completed in one cycle, and that arbitrarily many instructions can be issued in the same cycle, subject only to data-dependence constraints.
PROGRAM 20.4: (a) A for-loop to be software-pipelined. (b) After a scalar-replacement optimization (in the definition of a), with the scalar variables labeled by iteration number.
(a)
for i ← 1 to N
    a ← j ⊕ V[i − 1]
    b ← a ⊕ f
    c ← e ⊕ j
    d ← f ⊕ c
    e ← b ⊕ d
    f ← U[i]
    g : V[i] ← b
    h : W[i] ← d
    j ← X[i]

(b)
for i ← 1 to N
    ai ← ji−1 ⊕ bi−1
    bi ← ai ⊕ fi−1
    ci ← ei−1 ⊕ ji−1
    di ← fi−1 ⊕ ci
    ei ← bi ⊕ di
    fi ← U[i]
    g : V[i] ← bi
    h : W[i] ← di
    ji ← X[i]
Data dependence through memory
For optimal scheduling of stores and fetches, we need to trace the data dependence of each value that is stored into memory and later fetched back. As discussed on page 424, dependence analysis of memory references is not trivial! To illustrate loop scheduling for Program 20.4a without full-fledged dependence analysis, we can use scalar replacement to replace the reference to V[i − 1] with the (equivalent) b; in the resulting Program 20.4b all memory references are independent of each other, assuming that the arrays U, V, W, X do not overlap. Next we mark each variable in the loop body to indicate whether this iteration's value or the previous iteration's value is used, as shown in Program 20.4b. As a visual aid in scheduling we can construct a data-dependence graph, in which solid edges are data dependences within an iteration and dotted edges are loop-carried dependences, as shown in Graph 20.5a.
GRAPH 20.5: Data-dependence graph for Program 20.4b: (a) original graph, in which solid edges are same-iteration dependences and dotted edges are loop-carried dependences; (b) acyclic dependences of the unrolled loop.
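For concreteness, the edges of Graph 20.5a can be written down directly from Program 20.4b. The following sketch (my own encoding, not from the book) lists them in Python, assuming as above that U, V, W, X do not overlap:

```python
# A sketch of the dependences of Program 20.4b as an edge list.
# An entry (src, dst, carry) means instruction dst reads the value that
# src defined `carry` iterations earlier: carry = 0 is a solid
# (same-iteration) edge in Graph 20.5a, carry = 1 is a dotted
# (loop-carried) edge.  The loads f and j have no predecessors.
DEPS = [
    ("j", "a", 1), ("b", "a", 1),   # a_i <- j_{i-1} (+) b_{i-1}
    ("a", "b", 0), ("f", "b", 1),   # b_i <- a_i (+) f_{i-1}
    ("e", "c", 1), ("j", "c", 1),   # c_i <- e_{i-1} (+) j_{i-1}
    ("f", "d", 1), ("c", "d", 0),   # d_i <- f_{i-1} (+) c_i
    ("b", "e", 0), ("d", "e", 0),   # e_i <- b_i (+) d_i
    ("b", "g", 0),                  # g: V[i] <- b_i
    ("d", "h", 0),                  # h: W[i] <- d_i
]
# The instructions of one iteration, listed in same-iteration dependence order.
OPS = ["a", "b", "c", "d", "e", "f", "g", "h", "j"]
```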
Now suppose we unroll the loop; the data-dependence graph of the unrolled loop is a DAG, as shown in Graph 20.5b. Scheduling a DAG is easy if there are no resource constraints: starting from operations with no predecessors, each operation is scheduled as soon as all of its predecessors have completed:
Cycle   Instructions
1       a1 c1 f1 j1 f2 j2 f3 j3 …
2       b1 d1
3       e1 g1 h1 a2
4       b2 c2
5       d2 g2 a3
⋮       ⋮
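The greedy schedule above is easy to compute. Here is a minimal sketch (my own code, not from the book) that reproduces the table, using the DEPS and OPS lists from the previous sketch; with one-cycle instructions and unbounded issue width, an instruction's cycle is one more than the latest cycle of any of its predecessors:

```python
# Earliest-possible-time (ASAP) schedule of the unrolled loop.
# Iteration-0 values (j0, b0, e0, f0) are assumed available before the
# loop, at time 0.

def asap(deps, ops, n_iters):
    time = {}                                  # (op, iteration) -> cycle
    for i in range(1, n_iters + 1):
        for op in ops:                         # ops in dependence order
            preds = [(u, i - carry) for (u, v, carry) in deps if v == op]
            time[(op, i)] = 1 + max(
                (time.get(p, 0) for p in preds), default=0)
    return time

times = asap(DEPS, OPS, n_iters=6)
for cycle in range(1, 9):                      # print the tableau row by row
    row = [f"{op}{i}" for (op, i), t in sorted(times.items(),
                                               key=lambda kv: kv[1])
           if t == cycle]
    print(cycle, " ".join(row))
```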
It is convenient to write this schedule in a tableau where the rows are successive cycles and the columns are successive iterations of the original loop, as shown in Table 20.6a. After a few iterations are scheduled, a pattern emerges in the tableau: there is a group of instructions cdeh racing down toward the lower-right corner with a slope of three cycles per iteration, another group abg with a more moderate slope of two cycles per iteration, and a third group fj with zero slope. The key observation is that the gaps in the schedule separating identical groups grow larger at a constant rate. In this case, for i ≥ 4 the groups at iteration i are identical to the groups at iteration i + 1. In general the groups at iteration i will be identical to the groups at iteration i + c, where sometimes c > 1; see Exercise 20.1. Theorems:
- If there are K instructions in the loop, the pattern of identical groups separated by gaps will always appear within K² iterations (and usually much sooner).
- We can increase the slopes of the less steeply sloped groups, thereby either closing the gaps or at least making them small and nonincreasing, without violating data-dependence constraints.
- The resulting tableau has a repeating set of m identical cycles, which can constitute the body of a pipelined loop.
- The resulting loop is optimally scheduled (it runs in the least possible time).
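The grouping can also be seen mechanically. As a rough sketch (my own illustration, building on the asap sketch above; not part of the Aiken-Nicolau presentation), compute each instruction's slope, the cycle difference between consecutive iterations, once the schedule has settled:

```python
# Per-instruction slopes in the earliest-time schedule.
# slope(op, i) = time(op, i+1) - time(op, i); once the slopes stop
# changing, instructions with equal slope form the groups described in
# the text (cdeh at 3 cycles/iteration, abg at 2, fj at 0).

def slopes(times, ops, i):
    return {op: times[(op, i + 1)] - times[(op, i)] for op in ops}

times = asap(DEPS, OPS, n_iters=10)            # from the previous sketch
print(slopes(times, OPS, 5))
# {'a': 2, 'b': 2, 'c': 3, 'd': 3, 'e': 3, 'f': 0, 'g': 2, 'h': 3, 'j': 0}
```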
See the Further Reading section for references to proofs. To see why the loop is optimal, consider that the data-dependence DAG of the unrolled loop has some path of length P ending at the last instruction to execute, and the scheduled loop executes that instruction at time P. The result, for our example, is shown in Table 20.6b. Now we can find a repeating pattern of three cycles (since three is the slope of the steepest group). In this case the pattern does not begin until cycle 8; it is shown in a box. This pattern will constitute the body of the scheduled loop. The irregularly scheduled instructions before the loop body constitute its prologue, and those after it constitute its epilogue. Now we can generate the multiple-instruction-issue program for this loop, as shown in Image 20.7. However, the variables in this "program" still carry subscripts: the variable ji+1 is live at the same time as ji. To encode the program in real instructions, we need to put in MOVE instructions between the different variables, as shown in Image 20.8.
a1 ← j0 ⊕ b0        c1 ← e0 ⊕ j0        f1 ← U[1]           j1 ← X[1]
b1 ← a1 ⊕ f0        d1 ← f0 ⊕ c1        f2 ← U[2]           j2 ← X[2]
e1 ← b1 ⊕ d1        V[1] ← b1           W[1] ← d1           a2 ← j1 ⊕ b1
b2 ← a2 ⊕ f1        c2 ← e1 ⊕ j1        f3 ← U[3]           j3 ← X[3]
d2 ← f1 ⊕ c2        V[2] ← b2           a3 ← j2 ⊕ b2
e2 ← b2 ⊕ d2        W[2] ← d2           b3 ← a3 ⊕ f2        f4 ← U[4]        j4 ← X[4]
c3 ← e2 ⊕ j2        V[3] ← b3           a4 ← j3 ⊕ b3        i ← 3
L : di ← fi−1 ⊕ ci      bi+1 ← ai+1 ⊕ fi
    ei ← bi ⊕ di        W[i] ← di           V[i+1] ← bi+1       fi+2 ← U[i+2]    ji+2 ← X[i+2]
    ci+1 ← ei ⊕ ji      ai+2 ← ji+1 ⊕ bi+1  i ← i + 1           if i < N − 2 goto L
dN−1 ← fN−2 ⊕ cN−1      bN ← aN ⊕ fN−1
eN−1 ← bN−1 ⊕ dN−1      W[N − 1] ← dN−1     V[N] ← bN
cN ← eN−1 ⊕ jN−1
dN ← fN−1 ⊕ cN
eN ← bN ⊕ dN            W[N] ← dN
Image 20.7: Pipelined schedule. Assignments in each row happen simultaneously; each right-hand side refers to the value before the assignment. The loop exit test i < N + 1 has been "moved past" three increments of i, so appears as i < N − 2.
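A quick way to confirm the bound (my own check, not from the text): the branch sits in the same cycle as i ← i + 1 and, like every right-hand side in the figure, reads the old value of i. The last full pass of the kernel must be the one that computes dN−2, since the epilogue supplies only dN−1 and dN; that pass runs with i = N − 2, which is exactly where i < N − 2 first fails, while the pass with i = N − 3 still satisfies it.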
a1 ← j0 ⊕ b0        c1 ← e0 ⊕ j0        f1 ← U[1]           j1 ← X[1]
b1 ← a1 ⊕ f0        d1 ← f0 ⊕ c1        f′′ ← U[2]          j2 ← X[2]
e1 ← b1 ⊕ d1        V[1] ← b1           W[1] ← d1           a2 ← j1 ⊕ b1
b2 ← a2 ⊕ f1        c2 ← e1 ⊕ j1        f′ ← U[3]           j′ ← X[3]
d2 ← f1 ⊕ c2        V[2] ← b2           a ← j2 ⊕ b2
e2 ← b2 ⊕ d2        W[2] ← d2           b ← a ⊕ f′′         f ← U[4]         j ← X[4]
c ← e2 ⊕ j2         V[3] ← b            a ← j′ ⊕ b          i ← 3
L : d ← f′′ ⊕ c         b ← a ⊕ f′          b′ ← b; a′ ← a; f′′ ← f′; f′ ← f; j′′ ← j′; j′ ← j
    e ← b′ ⊕ d          W[i] ← d            V[i+1] ← b          f ← U[i+2]       j ← X[i+2]
    c ← e ⊕ j′′         a ← j′ ⊕ b          i ← i + 1           if i < N − 2 goto L
d ← f′′ ⊕ c         b ← a ⊕ f′          b′ ← b
e ← b′ ⊕ d          W[N − 1] ← d        V[N] ← b
c ← e ⊕ j′
d ← f′ ⊕ c
e ← b ⊕ d           W[N] ← d
Image 20.8: Pipelined schedule, with move instructions.
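The renaming pattern in Image 20.8 can be described mechanically. As a minimal sketch (my own illustration, not the book's algorithm): a value that must survive k more iterations of the kernel gets k "primed" copies, and the kernel carries a chain of copies each time around:

```python
# Sketch of the move chains used in Image 20.8: a register whose value
# must outlive `depth` further kernel iterations gets `depth` primed
# copies, aged by one step per iteration.

def move_chain(var, depth):
    """Copies that age var's recent values by one kernel iteration."""
    names = [var + "'" * k for k in range(depth + 1)]   # x, x', x'', ...
    # Listed oldest-destination first, so even sequential execution
    # preserves the older values (the figure issues them in parallel).
    return [f"{names[k + 1]} <- {names[k]}" for k in reversed(range(depth))]

print(move_chain("f", 2))   # ["f'' <- f'", "f' <- f"]
print(move_chain("b", 1))   # ["b' <- b"]
```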
This loop is optimally scheduled - assuming the machine can execute eight instructions at a time, including four simultaneous loads and stores.
Multicycle instructions
Although we have illustrated an example in which each instruction takes exactly one cycle, the algorithm extends easily to the situation where some instructions take multiple cycles.
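For instance, if each operation has a latency, an operation can start only once every predecessor has completed. Here is a sketch under assumed latencies (reusing DEPS and OPS from the earlier sketches; the two-cycle load latency is just an example, not from the text):

```python
# ASAP scheduling with per-instruction latencies:
# start(v) = max over predecessors u of (start(u) + latency(u)),
# counting cycles from 0; iteration-0 values are available at time 0.

LATENCY = {"f": 2, "j": 2}        # e.g. loads take 2 cycles; default 1

def asap_multicycle(deps, ops, n_iters, latency):
    start = {}
    for i in range(1, n_iters + 1):
        for op in ops:            # ops in same-iteration dependence order
            preds = [(u, i - carry) for (u, v, carry) in deps if v == op]
            start[(op, i)] = max(
                (start[p] + latency.get(p[0], 1) for p in preds if p in start),
                default=0)
    return start

times = asap_multicycle(DEPS, OPS, 6, LATENCY)
```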