
PREFETCHING

If a load instruction misses the primary (or secondary) cache, there will be a 7-10 cycle delay (or a 70-100 cycle delay, respectively) while the datum is fetched from the next level of the memory hierarchy. In some cases, the need for that datum is predictable many cycles earlier, and the compiler can insert prefetch instructions to start the fetch earlier. A prefetch instruction is a hint to the hardware to start bringing the data at address x from main memory into the cache. A prefetch never stalls the processor - but on the other hand, if the hardware finds that some exceptional condition (such as a page fault) would occur, the prefetch can be ignored. When prefetch(x) is successful, it means that the next load from x will hit the cache; an unsuccessful prefetch might cause the next load to miss the cache, but the program will still execute correctly. Many machines now have some form of prefetch instruction.

Of course, one reasonable alternative is - instead of starting the fetch earlier - to just delay the instruction that uses the result of the fetch until later, using software-pipelining techniques. In fact, processors that dynamically reorder instructions (to account for operands not ready) achieve this effect without any special work by the compiler. The problem with using software pipelining or dynamic rescheduling to hide secondary-cache misses is that it increases the number of live temporaries. Consider the following dot-product loop as an example:

 L1 : x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i < N goto L1
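
For concreteness, the same loop might be written in C roughly as follows (a sketch only: the array names x and y and the float element type are assumptions; the pseudocode above addresses memory directly in bytes, which is why i and j advance by 4):

 /* Plain dot product.  When x and y are not already cached, every new
    cache block touched causes a miss. */
 float dot(const float *x, const float *y, int n) {
     float s = 0.0f;
     for (int i = 0; i < n; i++)
         s += x[i] * y[i];    /* z <- x*y ; s <- s+z */
     return s;
 }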


If the arrays addressed by i and j are not in the primary cache, or if N is large (> 8 kilobytes or so) so that the arrays cannot possibly fit in the cache, then each time i or j crosses into a new multiple-of-B boundary (into a new cache block) there will be a cache miss. In effect, the miss rate will be exactly W/B, where W is the word size and B is the block size. Typical values for W/B are 1/4 or 1/8, and this is a rather high miss rate. The penalty for a primary cache miss is perhaps 7 cycles, or (on a dual-instruction-issue-per-cycle machine) 14 instructions. This would stall the processor of an early-'90s machine for 14 instructions, but a good late-'90s machine with out-of-order execution will find some other instruction to execute that is not data-dependent on the load. The effective order of execution on a dynamic-instruction-reordering machine is shown in Image 21.5a. When x1 ← M[i0] is fetched there is a cache miss, so instructions data-dependent on x1 cannot be issued for 11 cycles. In the meantime, i1 and j1, and even i2 and j2, can be computed; and the fetch x2 ← M[i1] can be issued.

Image 21.5: Execution of a dot-product loop, with 4-word cache blocks.
(a) Without prefetching, on a machine with dynamic instruction reordering, the number of outstanding instructions (reserved registers) grows proportionally to the cache-miss latency.
(b) With prefetching, the hardware reservation table never grows large. (Steady-state behavior is shown here, not the initial transient.)

As the number of uncompleted loop iterations increases, the number of live or reserved registers increases proportionately. The cache misses for x2, x3, x4 are the same miss as for x1, because they are all in the same cache block, so x1, x2, x3, x4 all become available at about the same time. Iterations 5-8 (which use the next cache block) would be dynamically scheduled like iterations 1-4, and so on.

The primary-cache latency illustrated here is usually small enough to handle without prefetching techniques. But with a secondary-cache-miss latency of 200 instructions (i.e., 29 loop iterations), there will be about 116 outstanding instructions (computations of x, y, z, s waiting for the cache miss), which may exceed the capacity of the machine's instruction-issue hardware.

Prefetch instructions Suppose the compiler inserts a prefetch instruction for address a, in advance of the time a will be fetched. This is a hint to the computer that it should start transferring a from main memory into the cache. Then, when a is fetched a few cycles later by an ordinary load instruction, it will hit the cache and there will be no delay.

Many machines don't have a prefetch instruction as such, but many do have a nonblocking load instruction. That is, when r3 ← M[r7] is performed, the processor does not stall on a cache miss until r3 is used as an operand of some other instruction. If we want to prefetch address a, we can just do rt ← M[a] and then never use the value of rt. This will start the load, bringing the value into the cache if necessary, without delaying any other instruction. Later, when we fetch M[a] again, it will hit the cache. Of course, if the computation is already memory-bound - fully utilizing the load/store unit while the arithmetic units are often idle - then prefetching with ordinary load instructions may not help.

If the computation accesses every word of an array sequentially, it uses several words from each cache block, so we don't need to prefetch every word - one prefetch per cache block is enough. Assuming a 4-byte word and a 16-byte cache block, the dot-product loop with prefetching looks something like this:

 L1 : if i mod 16 = 0 then prefetch M[i + K]
      if j mod 16 = 0 then prefetch M[j + K]
      x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i < N goto L1
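
With GCC or Clang, the same idea can be written in C using the __builtin_prefetch intrinsic. The sketch below is an illustrative translation, not the book's code: it assumes 4-byte floats, 16-byte cache blocks (so one prefetch per 4 elements), and a prefetch distance K expressed in elements rather than bytes.

 /* Dot product with software prefetching (sketch).
    K = prefetch distance in elements, chosen to cover the miss latency. */
 float dot_prefetch(const float *x, const float *y, int n, int K) {
     float s = 0.0f;
     for (int i = 0; i < n; i++) {
         if ((i & 3) == 0) {                /* once per 16-byte cache block */
             __builtin_prefetch(&x[i + K]); /* hint only: never stalls; prefetching
                                               past the end of the array is harmless */
             __builtin_prefetch(&y[i + K]);
         }
         s += x[i] * y[i];
     }
     return s;
 }

On a machine without a prefetch instruction but with nonblocking loads, a similar effect can be approximated by an ordinary load whose result is never used, such as (void)*(volatile const float *)&x[i + K]; unlike a true prefetch, though, such a load can fault, so it must not be allowed to run past the end of the array.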


The value K is chosen to match the expected cache-miss latency. For a secondary-cache-miss latency of 200 instructions, when each loop iteration executes 7 instructions and advances i by 4, we would use K = 200 · 4/7 ≈ 114, rounded up to the nearest multiple of the block size, that is, K = 128. Image 21.5b uses prefetching to "hide" a cache latency of 11 instructions, so K = 16, the block size. An additional improvement that may help on some machines, when K is small, is to avoid overlapping the prefetch latencies, so that the memory hardware need not process two misses simultaneously.

In practice, we don't want to test i mod 16 = 0 in each iteration, so we unroll the loop, or nest a loop within a loop, as shown in Program 21.6. The loop-unrolled version could be further improved - in ways unrelated to prefetching - by removing some of the intermediate if statements.

Program 21.6: Inserting prefetches using loop unrolling or nested loops.

Loop unrolling:

 L1 : prefetch M[i + K]
      prefetch M[j + K]
      x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i ≥ N goto L2
      x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i ≥ N goto L2
      x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i ≥ N goto L2
      x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i < N goto L1
 L2 :

Nested loops:

 L1 : n ← i + 16
      if n + K ≥ N goto L3
      prefetch M[i + K]
      prefetch M[j + K]
 L2 : x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i < n goto L2
      goto L1
 L3 : x ← M[i]
      y ← M[j]
      z ← x × y
      s ← s + z
      i ← i + 4
      j ← j + 4
      if i < N goto L3
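
Under the same assumptions as before (4-byte floats, 16-byte blocks, __builtin_prefetch available), the nested-loop version might be sketched in C as follows; the names BLOCK, limit, and dot_prefetch_nested are illustrative, not from the text:

 /* Nested-loop prefetching (sketch).  BLOCK = elements per cache block
    (16 bytes / 4-byte float = 4); K = prefetch distance in elements,
    e.g. 32 for the 128-byte distance computed above. */
 enum { BLOCK = 4 };

 float dot_prefetch_nested(const float *x, const float *y, int n, int K) {
     float s = 0.0f;
     int i = 0;
     while (i + BLOCK + K <= n) {       /* outer loop: safe to prefetch ahead */
         __builtin_prefetch(&x[i + K]);
         __builtin_prefetch(&y[i + K]);
         int limit = i + BLOCK;         /* corresponds to  n <- i + 16 */
         for (; i < limit; i++)         /* inner loop over one cache block */
             s += x[i] * y[i];
     }
     for (; i < n; i++)                 /* cleanup loop, no prefetches */
         s += x[i] * y[i];
     return s;
 }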




Prefetching for stores Sometimes we can predict at compile time that a store instruction will miss the cache. Consider the following loop:

 L1 : A[i] ← …
      i ← i + 4
      if i < N goto L1

If the array A is larger than the cache, or if A has not recently been accessed, then each time i crosses into a new cache block there will be a write miss. If the write-miss policy is write-validate, then this is no problem, as the processor will not be stalled and all the marked-invalid words will be quickly overwritten with valid data. If the policy is fetch-on-write, then the stalls at each new cache block will significantly slow down the program. But prefetching can be used here:

 L1 : if i mod 16 = 0 then prefetch A[i + K]
      A[i] ← …
      i ← i + 4
      if i < N goto L1

As usual, unrolling the loop will remove the if-test. The A[i + K] value that is prefetched will contain garbage - dead data that we know will be overwritten. We perform the prefetch only to avoid the write-miss stall. If the write-miss policy is write-around, then we should prefetch only if we expect the A[i] values to be fetched soon after they are stored.

Summary Prefetching is applicable when

- the data being accessed are too large to fit in the cache (or have not been accessed recently), so cache misses can be expected;
- the compiler can predict, sufficiently far in advance, which fetches or stores are likely to miss; and
- the machine has a prefetch instruction, or a nonblocking load instruction that can serve as one.

We will not describe the algorithm for inserting prefetch instructions in loops, but see the Further Reading section.
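
To make the store-prefetch idea concrete, here is a minimal GCC/Clang sketch; the placeholder store A[i] = i, the function name fill, and the element/block sizes are assumptions for illustration, and the prefetch is only worthwhile on a fetch-on-write cache (the second argument of __builtin_prefetch, 1, marks the prefetch as being for a write):

 /* Prefetching for stores (sketch).  Every element of A is overwritten,
    so the prefetched data are dead; the prefetch only hides the
    write-miss stall.  K = prefetch distance in elements. */
 void fill(int *A, int n, int K) {
     for (int i = 0; i < n; i++) {
         if ((i & 3) == 0)                      /* once per 16-byte block (4 ints) */
             __builtin_prefetch(&A[i + K], 1);  /* 1 = prefetch for writing */
         A[i] = i;                              /* placeholder store */
     }
 }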

