
BLOCKING

The technique of blocking reorders a computation so that all the computations that use one portion of the data are completed before moving on to the next portion. The following nested loop for matrix multiplication, C = AB, illustrates the need for blocking:

for i ← 0 to N - 1
    for j ← 0 to N - 1
        for k ← 0 to N - 1
            C[i, j] ← C[i, j] + A[i, k] · B[k, j]
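
For concreteness, the same loop nest can be written as an ordinary Java method. This is only a sketch, assuming square N × N matrices stored as double[N][N] arrays with C zero-initialized; the method name matmul is ours, not the book's:

static void matmul(double[][] C, double[][] A, double[][] B, int N) {
    // Straightforward i-j-k loop nest; each C[i][j] accumulates a row-by-column dot product.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
}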


If both A and B fit into the cache simultaneously, then the k loop will run without cache misses, and there may be only one cache miss for C[i, j] on each iteration of the j loop. But suppose the cache is large enough to hold only 2 · c · N matrix elements (floating-point numbers), where 1 < c < N. For example, when multiplying 50 × 50 matrices of 8-byte floating-point numbers on a machine with an 8-kilobyte cache, c = 10 (since 2 · 10 · 50 elements of 8 bytes each come to 8000 bytes). Then every reference to B[k, j] in the inner loop will be a cache miss, because, since the last time that particular cell of B was accessed, the entire B matrix will have been marched through the cache, dumping out the "old" values. Thus, each iteration of the inner loop will have a cache miss. Loop interchange cannot help here: if the j loop is outermost, then A will suffer cache misses, and if the k loop is outermost, then C will suffer misses. The solution is to reuse rows of the A matrix and columns of the B matrix while they are still in cache. A c × c block of the matrix C can be calculated from c rows of A and c columns of B, as follows (see also Image 21.7):

for i ← i0 to i0 + c - 1
    for j ← j0 to j0 + c - 1
        for k ← 0 to N - 1
            C[i, j] ← C[i, j] + A[i, k] · B[k, j]


Image 21.7: Matrix multiplication. Each element of C is computed from a row of A and a column of B. With blocking, a c × c block of the C matrix is computed from a c × N block of A and an N × c block of B.

Only c · N elements of A and c · N elements of B are used in this loop, and each element is used c times. Thus, at a cost of 2 · c · N cache misses to bring this portion of A and B into cache, we are able to compute c · c · N iterations of the inner loop, for a miss rate of 2/c misses per iteration. With N = 50 and c = 10, that is 1000 misses for 5000 inner-loop iterations, or 0.2 misses per iteration, instead of a miss on every iteration. All that remains is to nest this set of loops inside outer loops that compute each c × c block of C:

for i0 ← 0 to N - 1 by c
    for j0 ← 0 to N - 1 by c
        for i ← i0 to min(i0 + c - 1, N - 1)
            for j ← j0 to min(j0 + c - 1, N - 1)
                for k ← 0 to N - 1
                    C[i, j] ← C[i, j] + A[i, k] · B[k, j]
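
The whole blocked computation can be sketched in Java as follows, under the same assumptions as before (square double[N][N] matrices; the method name blockedMatmul is ours, and c would be chosen from the cache size). The Math.min calls play the role of min in the pseudocode, keeping the block loops inside the matrix when c does not divide N:

static void blockedMatmul(double[][] C, double[][] A, double[][] B, int N, int c) {
    for (int i0 = 0; i0 < N; i0 += c)            // one c × c block of C at a time
        for (int j0 = 0; j0 < N; j0 += c)
            for (int i = i0; i < Math.min(i0 + c, N); i++)
                for (int j = j0; j < Math.min(j0 + c, N); j++)
                    for (int k = 0; k < N; k++)
                        C[i][j] = C[i][j] + A[i][k] * B[k][j];
}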


This optimization is called blocking because it computes one block of the iteration space at a time. There are many nested-loop programs on which an optimizing compiler can automatically perform the blocking transformation. Crucial to the situation are loops whose iterations are not data-dependent on each other; in matrix multiplication, the calculation of C[i, j] does not depend on C[i′, j′], for example.

Scalar replacement

Even though the access to C[i, j] in the matrix-multiply program will almost always hit the cache (since the same word is being used repeatedly in the k loop), we can still bring it up one level in the memory hierarchy, from primary cache into registers, by the scalar replacement optimization. That is, when a particular array element is used as a scalar for repeated computations, we can "cache" it in a register:

for i ← i0 to i0 + c - 1
    for j ← j0 to j0 + c - 1
        s ← C[i, j]
        for k ← 0 to N - 1
            s ← s + A[i, k] · B[k, j]
        C[i, j] ← s
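
In Java, the scalar-replaced kernel for one block might look like the sketch below (i0 and j0 are the block origin; for brevity the sketch assumes the block lies entirely inside the matrix, and the method name blockKernel is ours):

static void blockKernel(double[][] C, double[][] A, double[][] B,
                        int N, int c, int i0, int j0) {
    for (int i = i0; i < i0 + c; i++)
        for (int j = j0; j < j0 + c; j++) {
            double s = C[i][j];           // "cache" C[i, j] in a local variable
            for (int k = 0; k < N; k++)   // the k loop now updates s, not memory for C
                s = s + A[i][k] * B[k][j];
            C[i][j] = s;                  // one store per (i, j) instead of N
        }
}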


This reduces the number of fetches and stores in the innermost loop by a factor of 2.

Blocking at every level of the memory hierarchy

To do blocking optimizations, the compiler must know how big the cache is - this determines the best value of c, the block size. If there are several levels of the memory hierarchy, then blocking can be done at each level. Even the machine's registers should be considered as a level of the memory hierarchy. Taking again the example of matrix multiply, we suppose there are 32 floating-point registers, and we want to use d of them as a kind of cache. We can rewrite the c × c loop (of the blocked matrix multiply) as follows:

for i ← i0 to i0 + c - 1
    for k0 ← 0 to N - 1 by d
        for k ← k0 to k0 + d - 1
            T[k - k0] ← A[i, k]
        for j ← j0 to j0 + c - 1
            s ← C[i, j]
            for k ← k0 to k0 + d - 1
                s ← s + T[k - k0] · B[k, j]
            C[i, j] ← s
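
A Java sketch of this register-blocked kernel is shown below; the temporary array T stands in for the d registers, the sketch assumes d divides N and that the c × c block lies inside the matrix, and the method name registerBlockedKernel is ours:

static void registerBlockedKernel(double[][] C, double[][] A, double[][] B,
                                  int N, int c, int d, int i0, int j0) {
    double[] t = new double[d];
    for (int i = i0; i < i0 + c; i++)
        for (int k0 = 0; k0 < N; k0 += d) {
            for (int k = k0; k < k0 + d; k++)    // copy a strip of row i of A into t
                t[k - k0] = A[i][k];
            for (int j = j0; j < j0 + c; j++) {  // reuse that strip for every j in the block
                double s = C[i][j];
                for (int k = k0; k < k0 + d; k++)
                    s = s + t[k - k0] * B[k][j];
                C[i][j] = s;
            }
        }
}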


Unroll and jam

Loop unrolling must be used for register-level blocking, since registers cannot be indexed by subscripts. So we unroll the k-loops d times and keep each T[k] in a separate scalar temporary variable (for illustration, we will use d = 3, though d = 25 would be more realistic):

for i ← i0 to i0 + c - 1
    for k0 ← 0 to N - 1 by 3
        t0 ← A[i, k0];  t1 ← A[i, k0 + 1];  t2 ← A[i, k0 + 2]
        for j ← j0 to j0 + c - 1
            C[i, j] ← C[i, j] + t0 · B[k0, j] + t1 · B[k0 + 1, j] + t2 · B[k0 + 2, j]
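
A Java sketch of the unrolled-and-jammed kernel with d = 3 follows (again assuming 3 divides N and that the block lies inside the matrix; the method name unrollAndJamKernel is ours):

static void unrollAndJamKernel(double[][] C, double[][] A, double[][] B,
                               int N, int c, int i0, int j0) {
    for (int i = i0; i < i0 + c; i++)
        for (int k0 = 0; k0 < N; k0 += 3) {
            // three A values held in scalars, which the register allocator can keep in registers
            double t0 = A[i][k0], t1 = A[i][k0 + 1], t2 = A[i][k0 + 2];
            for (int j = j0; j < j0 + c; j++)
                C[i][j] = C[i][j] + t0 * B[k0][j] + t1 * B[k0 + 1][j] + t2 * B[k0 + 2][j];
        }
}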


The register allocator will ensure, of course, that the temporaries t0, t1, and t2 are kept in registers. Every value of A[i, k] fetched from the cache is used c times; the B values still need to be fetched, so the number of memory accesses in the inner loop goes down by almost a factor of two.

A high-tech compiler would perform - on the same loop! - blocking transformations for the primary cache and for the secondary cache, and scalar replacement and unroll-and-jam for the register level of the memory hierarchy.

