Pipelining and Scheduling
sched-ule: a procedural plan that indicates the time and sequence of each operation
Webster's Dictionary
OVERVIEW
A simple computer can process one instruction at a time. First it fetches the instruction, then decodes it into opcode and operand specifiers, then reads the operands from the register bank (or memory), then performs the arithmetic denoted by the opcode, then writes the result back to the register back (or memory), and then fetches the next instruction. Modern computers can execute parts of many different instructions at the same time. At the same time the processor is writing results of two instructions back to registers, it may be doing arithmetic for three other instructions, reading operands for two more instructions, decoding four others, and fetching yet another four. Meanwhile, there may be five instructions delayed, awaiting the results of memory-fetches. Such a processor usually fetches instructions from a single flow of control; it's not that several programs are running in parallel, but the adjacent instructions of a single program are decoded and executed simultaneously. This is called instruction-level parallelism (ILP), and is the basis for much of the astounding advance in processor speed in the last decade of the twentieth century. A pipelined machine performs the write-back of one instruction in the same cycle as the arithmetic "execute" of the next instruction and the operandread of the previous one, and so on. A very-long-instruction-word (VLIW) issues several instructions in the same processor cycle; the compiler must ensure that they are not data-dependent on each other. A superscalar machine issues two or more instructions in parallel if they are not related by data dependence (which it can check quickly in the instruction-decode hardware); otherwise it issues the instructions sequentially - thus, the program will still operate correctly if data-dependent instructions are adjacent, but it will run faster if the compiler has not scheduled non-data-dependent instructions adjacent to each other. A dynamic-scheduling machine reorders the instructions as they are being executed, so that it can issue several non-data-dependent instructions simultaneously, and may need less help from the compiler. Any of these techniques produces instruction-level parallism. The more instructions can be executed simultaneously, the faster the program will run. But why can't all the instructions of the program be executed in parallel? After all, that would be the fastest possible execution. There are several kinds of constraints on instruction execution; we can optimize the program for instruction-level parallelism by finding the best schedule that obeys these constraints:
Data dependence: If instruction A calculates a result that's used as an operand of instruction B,then B cannot execute before A is finished. Functional unit: If there are kfu multipliers (adders, etc.) on the chip, then at most kfu multiplication (addition, etc.) instructions can execute at once. Instruction issue: The instruction-issue unit can issue at most kii instructions at a time.
Register: At most kr registers can be in use at a time; more specifically, any schedule must have some valid register allocation.
The functional-unit, instruction-issue, and register constraints are often lumped together as resource constraints or resource hazards. On a pipelined machine, even if "B cannot execute before A", there may be some parts of B's execution (such as instruction-fetch) that can proceed concurrently with A; Images 20.2 and 20.3 give details. There are also pseudo-constraints that can often be made to disappear by renaming variables:
Write-after-write: If instruction A writes to a register or memory location, and B writes to the same location, then the order of A and B must not be changed. But often it is possible to modify the program so that A and B write to different locations.
Write-after-read: If A must read from a location before B writes to it, then A and B's order of execution must not be swapped, unless renaming can be done so that they use different locations.
Resource usage of an instruction We might describe an instruction in terms of the number of cycles it takes to execute, and the resources it uses at different stages of execution. Image 20.1 shows such a description for three instructions of a hypothetical machine.
Image 20.1: Functional unit requirements of instructions (on the MIPS R4000 processor). This machine's Floating-point ADD instruction uses the instruction-fetch unit for one cycle; reads registers for one cycle; unpacks exponent and mantissa; then for the next cycle uses a shifter and an adder; then uses both the adder and a rounding unit; then the rounding unit and a shifter; then writes a result back to the register file. The MULT and CONV instructions use functional units in a different order.
If the ith cycle of instruction A uses a particular resource, and the jth cycle of instruction B uses the same resource, then B cannot be scheduled exactly i − j cycles after A, as illustrated in Image 20.2.
Image 20.2: If there is only one functional unit of each kind, then an ADD cannot be started at the same time as a MULT (because of numerous resource hazards shown in boldface); nor three cycles after the MULT (because of Add, Round, and Write hazards); nor four cycles later (because of Add and Round hazards). But if there were two adders and two rounding units, then an ADD could be started four cycles after a MULT. Or with dual fetch units, multiple-access register file, and dual unpackers, the MULT and ADD could be started simultaneously.
However, some machines have several functional units of each kind (e.g., more than one adder); on such a machine it does not suffice to consider instructions pairwise, but we must consider all the instructions scheduled for a given time. Data-dependence of an instruction The same considerations apply to data-dependence constraints. The result of some instruction A is written back to the register file during the Write stage of its execution (see Image 20.1); if instruction B uses this register, then the Read stage of B must be after the Write stage of A. Some machines have bypass circuitry that may allow the arithmetic stage of B to follow immediately after the arithmetic stage of A; for example, the Shift/Add stage of an ADD instruction might be able to immediately follow the Round stage of a MULT. These situations are shown in Image 20.3.
Image 20.3: Data dependence. (Above) If the MULT produces a result that is an operand to ADD, the MULT must write its result to the register file before the ADD can read it. (Below) Special bypassing circuitry can route the result of MULT directly to the Shift and Add units, skipping the Write, Read, and Unpack stages.