RESOURCE-BOUNDED LOOP PIPELINING

A real machine can issue only a limited number of instructions at a time, and has only a limited number of load/store units, adders, and multipliers. To be practically useful, a scheduling algorithm must take account of resource constraints. The input to the scheduling algorithm must be in three parts:

A program to be scheduled;
A description of what resources each instruction uses in each of its pipeline stages (similar to Image 20.1);
A description of the resources available on the machine (how many of each kind of functional unit, how many instructions may be issued at once, restrictions on what kinds of instructions may be issued simultaneously, and so on).

Resource-bounded scheduling is NP-complete, meaning that there is unlikely to be an efficient optimal algorithm. As usual in this situation, we use an approximation algorithm that does reasonably well in "typical" cases.

MODULO SCHEDULING

Iterative modulo scheduling is a practical, though not optimal, algorithm for resource-bounded loop scheduling. The idea is to use iterative backtracking to find a good schedule that obeys the functional-unit and data-dependence constraints, and then perform register allocation. The algorithm tries to place all the instructions of the loop body in a schedule of Δ cycles, assuming that there will also be a prologue and epilogue of the kind used by the Aiken-Nicolau algorithm. The algorithm tries increasing values of Δ until it reaches a value for which it can make a schedule. A key idea of modulo scheduling is that if an instruction violates functional-unit constraints at time t, then it will not fit at time t + Δ, or at any time t′ where t′ ≡ t′ modulo Δ. Suppose, for example, we are trying to schedule Program 20.4b with Δ = 3 on a machine that can perform only one load instruction at a time. The following loop-body schedule is illegal, with two different loads at cycle 1: Java ScreenShot

We can move f_i from cycle 1 of the loop to cycle 0, or cycle 2: Java ScreenShot

Either one avoids the resource conflict. We could move f_i even earlier, to cycle −1, where (in effect) we are computing f_i+1, or even later, to cycle 3, where we are computing f_i−1: Java ScreenShot

But with Δ = 3 we can never solve the resource conflict by moving f_i from cycle 1 to cycle 4 (or to cycle -2), because 1 ≡ 4 modulo 3; the calculation of f would still conflict with the calculation of j: Java ScreenShot

Effects on register allocation Consider the calculation of d ← f ⊕ c, which occurs at cycle 0 of the schedule in Image 20.7. If we place the calculation of d in a later cycle, then the data-dependence edges from the definitions of f and c to this instruction would lengthen, and the data-dependence edges from this instruction to the use of d in W[i] ← d would shrink. If a data-dependence edge shrinks to less than zero cycles, then a data-dependence constraint has been violated; this can be solved by also moving the calculations that use d to a later cycle.

Conversely, if a data-dependence edge grows many cycles long, then we must carry several "versions" of a value around the loop (as we carry f; f′, f′′ around the loop of Image 20.8), and this means that we are using more temporaries, so that register allocation may fail. In fact, an optimal loop-scheduling algorithm should consider register allocation simultaneously with scheduling; but it is not clear whether optimal algorithms are practical, and the iterated modulo scheduling algorithm described in this section first schedules, then does register allocation and hopes for the best.

FINDING THE MINIMUM INITIATION INTERVAL

Modulo scheduling begins by finding a lower bound for the number of cycles in the pipelined loop body:

Resource estimator: For any kind of functional unit, such as a multiplier or a memory-fetch unit, we can see how many cycles such units will be used by the corresponding instructions (e.g., multiply or load, respectively) in the loop body. This, divided by the number of that kind of functional unit provided by the hardware, gives a lower bound for Δ. For example, if there are 6 multiply instructions that each use a multiplier for 3 cycles, and there are two multipliers, then Δ ≥ 6 · 3/2.
Data-dependence estimator: For any data-dependencecycle in the data-dependence graph, where some value x_i depends on a chain of other calculations that depends on x_i−1, the total latency of the chain gives a lower bound for Δ.

Let Δ_min be the maximum of these estimators. Let us calculate Δ_min for Program 20.4b. For simplicity, we assume that one ⊕-arithmetic instruction and one load/store can be issued at a time, and every instruction finishes in one cycle; and we will not consider the scheduling of i ← i + 1 or the conditional branch. Then the arithmetic resource estimator is 5 ⊕-instructions in the loop body divided by 1 issuable arithmetic instructions per cycle, or Δ ≥ 5. The load/store resource estimator is 4 load/store instructions in the loop body divided by 1 issuable memory operations per cycle, or Δ ≥ 4. The data-dependence estimator comes from the cycle c_i → d_i → e_i → c_i+1 in Graph 20.5a, whose length gives Δ ≥ 3. Next, we prioritize the instructions of the loop body by some heuristic that decides which instructions to consider first. For example, instructions that are in critical data-dependence cycles, or instructions that use a lot of scarce resources, should be placed in the schedule first, and then other instructions can be filled in around them. Let H₁,…, H_n be the instructions of the loop body, in (heuristic) priority order. In our example, we could use H = [c, d, e, a, b, f, j, g, h], putting early the instructions that are in the critical recurrence cycle or that use the arithmetic functional unit (since the resource estimators for this loop tell us that arithmetic is in more demand than load/stores). The scheduling algorithm maintains a set S of scheduled instructions, each scheduled for a particular time t. The value of SchedTime[h] = none if h ∉ S, otherwise SchedTime[h] is the currently scheduled time for h. The members of S obey all resource and data-dependence constraints. Each iteration of Algorithm 20.9 places the highest-priority unscheduled instruction h into S, as follows:

In the earliest time slot (if there is one) that obeys all dependence constraints with respect to already-placed predecessors of h, and respects all resource constraints.
But if there is no slot in Δ consecutive cycles that obeys resource constraints, then there can never be such a slot, because the functional units available at time t are the same as those at t + c · Δ. In this case, h is placed without regard to resource constraints, in the earliest time slot that obeys dependence constraints (with respect to already-placed predecessors), and is later than any previous attempt to place h.

ALGORITHM 20.9: Iterative modulo scheduling.

for Δ ← Δ_min to ∞
 Budget ← n · 3
 for i ← 1 to n
 LastTime[i] ← 0
 SchedTime[i] ← none
 while Budget > 0 and there are any unscheduled instructions
 Budget ← Budget - 1
 let h be the highest-priority unscheduled instruction
 t_min ← 0
 for each predecessor p of h
 if SchedTime[p]≠ none
 t_min ← max(t_min, SchedTime[p] + Delay(p, h))
 for t ← t_min to t_min + Δ - 1
 if SchedTime[h] = none
 if h can be scheduled without resource conflicts
 SchedTime[h] ← t
 if SchedTime[h] = none
 SchedTime[h] ← max(t_min, 1 + LastTime[h])
 LastTime[h] ← SchedTime[h]
 for each successor s of h
 if SchedTime[s] ≠ none
 if SchedTime[h] + Delay(h, s) > SchedTime[s]
 SchedTime[s] ← none
 while the current schedule has resource conflicts
 let s be some instruction (other than h) involved in a resource conflict
 SchedTime[s] ← none
 if all instructions are scheduled
 RegisterAllocate()
 if register allocation succeeded without spilling
 return and report a successfully scheduled loop.
 Delay(h, s) =
 Given a dependence edge h_i → s_i+k, so that h uses the value of s from the kth previous iteration
 (where k = 0 means that h uses the current iteration's value of s);
 Given that the latency of the instruction that computes s is l cycles;
 return l - kΔ

Once h is placed, other instructions are removed to make the subset schedule S legal again: any successors of h that now don't obey data-dependence constraints, or any instructions that have resource conflicts with h. This placement-and-removal could iterate forever, but most of the time either it finds a solution quickly or there is no solution, for a given Δ. To cut the algorithm off if it does not find a quick solution, a Budget of c · n schedule placements is allowed (for c = 3 or some similar number), after which this value of Δ is abandoned and the next one is tried. When a def-use edge associated with variable j becomes longer than Δ cycles, it becomes necessary to have more than one copy of j, with MOVE instructions copying the different-iteration versions in bucket-brigade style. This is illustrated in Image 20.8 for variables a, b, f, j, but we will not show an explicit algorithm for inserting the moves. Checking for resource conflicts is done with a resource reservation table, an array of length Δ. The resources used by an instruction at time t can be entered in the array at position t mod Δ; adding and removing resource-usage from the table, and checking for conflicts, can be done in constant time. This algorithm is not guaranteed to find an optimal schedule in any sense. There may be an optimal, register-allocable schedule with initiation interval Δ, and the algorithm may fail to find any schedule with time Δ, or it may find a schedule for which register-allocation fails. The only consolation is that it is reported to work very well in practice. The operation of the algorithm on our example is shown in Image 20.10.
Image 20.10: Iterative modulo scheduling applied to Program 20.4b. Graph 20.5a is the data-dependence graph; Δ_min = 5 (see page 451); H = [c, d, e, a, b, f, j, g, h].

OTHER CONTROL FLOW

We have shown scheduling algorithms for simple straight-line loop bodies. What if the loop contains internal control flow, such as a tree of if-then-else statements? One approach is to compute both branches of the loop, and then use a conditional move instruction (provided on many high-performance machines) to produce the right result. For example, the loop at left can be rewritten into the loop at right, using a conditional move:

for i ← 1 to N for i ← 1 to N x ← M[i] x ← M[i]
 if x > 0 u′ ← z ⋆ x
 u ← z ⋆ x u ← A[i]
 else u ← A[i] if x > 0 move u ← u′
 s ← s + u s ← s + u

The resulting loop body is now straight-line code that can be scheduled easily. But if the two sides of the if differ greatly in size, and the frequently executed branch is the small one, then executing both sides in every iteration will be slower than optimal. Or if one branch of the if has a side effect, it must not be executed unless the condition is true.

To solve this problem we use trace scheduling: We pick some frequently executed straight-line path through the branches of control flow, schedule this path efficiently, and suffer some amount of ineffiency at those times where we must jump into or out of the trace. See and also the Further Reading section of this chapter.

SHOULD THE COMPILER SCHEDULE INSTRUCTIONS?

Many machines have hardware that does dynamic instruction rescheduling at run time. These machines do out-of-order execution, meaning that there may be several decoded instructions in a buffer, and whichever instruction's operands are available can execute next, even if other instructions that appeared earlier in the program code are still awaiting operands or resources. Such machines first appeared in 1967 (the IBM 360/91), but did not become common until the mid-1990s. Now it appears that most high-performance processors are being designed with dynamic (run-time) scheduling. These machines have several advantages and disadvantages, and it is not yet clear whether static (compile-time) scheduling or out-of-order execution will become standard. Advantages of static scheduling Out-of-order execution uses expensive hardware resources and tends to increase the chip's cycle time and wattage. The static scheduler can schedule earlier the instructions whose future data-dependence path is longest; a real-time scheduler cannot know the length of the data-dependence path leading from an instruction (see Exercise 20.3). The scheduling problem is NP-complete, so compilers - which have no real-time constraint on their scheduling algorithms - should in principle be able to find better schedules. Advantages of dynamic scheduling Some aspects of the schedule are unpredictable at compile time, such as cache misses, and can be better scheduled when their actual latency is known (see Image 21.5). Highly pipelined schedules tend to use many registers; typical machines have only 32 register names in a five-bit instruction field, but out-of-order execution with run-time register renaming can use hundreds of actual registers with a few static names (see the Further Reading section). Optimal static scheduling depends on knowing the precise pipeline state that will be reached by the hardware, which is sometimes difficult to determine in practice. Finally, dynamic scheduling does not require that the program be recompiled (i.e., rescheduled) for each different implementation of the same instruction set.