
The technique of *blocking* reorders a computation so that all the computations that use one portion of the data are completed before moving on to the next portion. The following nested loop for matrix multiplication, *C* = *AB*, illustrates the need for blocking:

    for i ← 0 to N - 1
        for j ← 0 to N - 1
            for k ← 0 to N - 1
                C[i, j] ← C[i, j] + A[i, k] · B[k, j]
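In C, the same loop nest might be written as follows (a sketch assuming row-major `double` arrays; the function name is ours):

```c
#include <stddef.h>

/* Naive matrix multiplication C = C + A*B for N x N row-major matrices,
   in the same i-j-k loop order as the pseudocode above. */
void matmul_naive(size_t N, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}
```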

If both *A* and *B* fit into the cache simultaneously, then the *k* loop will run without cache misses; there may be only one cache miss for *C*[*i*, *j*] on each iteration of the *j* loop. But suppose the cache is only large enough to hold 2 · *c* · *N* matrix elements (floating-point numbers), where 1 < *c* < *N*. For example, when multiplying 50 × 50 matrices of 8-byte floats on a machine with an 8-kilobyte cache, *c* = 10. Then *every reference to B*[*k*, *j*] *in the inner loop will be a cache miss*, because, since the last time that particular cell of *B* was accessed, the entire *B* matrix will have been marched through the cache, evicting the "old" values. Thus, each iteration of the inner loop will have a cache miss. Loop interchange cannot help here: if the *j* loop is outermost, then *A* will suffer cache misses, and if the *k* loop is outermost, then *C* will suffer misses. The solution is to reuse rows of the *A* matrix and columns of the *B* matrix while they are still in cache. A *c* × *c* block of the matrix *C* can be calculated from *c* rows of *A* and *c* columns of *B*, as follows (see also Image 21.7):
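The arithmetic behind *c* = 10 can be sketched as a small helper (the function name is ours; it simply solves 2 · *c* · *N* · elem_bytes ≤ cache_bytes for *c*):

```c
#include <stddef.h>

/* Largest block factor c such that 2*c*N matrix elements
   fit in a cache of cache_bytes bytes. */
size_t block_factor(size_t cache_bytes, size_t N, size_t elem_bytes)
{
    return cache_bytes / (2 * N * elem_bytes);
}
```

For the example in the text, `block_factor(8192, 50, 8)` yields 10.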

    for i ← i0 to i0 + c - 1
        for j ← j0 to j0 + c - 1
            for k ← 0 to N - 1
                C[i, j] ← C[i, j] + A[i, k] · B[k, j]

Image 21.7: Matrix multiplication. Each *c* × *c* block of *C* is computed from *c* rows of *A* and *c* columns of *B*.

Only *c* · *N* elements of *A* and *c* · *N* elements of *B* are used in this loop, and *each element is used c times*. Thus, at a cost of 2 · *c* · *N* cache misses to bring this portion of *A* and *B* into cache, we are able to compute *c* · *c* · *N* iterations of the inner loop, for a miss rate of 2/*c* misses per iteration. All that remains is to nest this set of loops inside outer loops that compute each *c* × *c* block of *C*:

    for i0 ← 0 to N - 1 by c
        for j0 ← 0 to N - 1 by c
            for i ← i0 to min(i0 + c - 1, N - 1)
                for j ← j0 to min(j0 + c - 1, N - 1)
                    for k ← 0 to N - 1
                        C[i, j] ← C[i, j] + A[i, k] · B[k, j]
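A C sketch of the fully blocked multiplication (row-major `double` arrays assumed; the `min` on the block bounds handles the case where *c* does not divide *N*):

```c
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked matrix multiplication C = C + A*B: each c x c block of C
   is computed from c rows of A and c columns of B while they are
   still in cache. */
void matmul_blocked(size_t N, size_t c,
                    const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < N; i0 += c)
        for (size_t j0 = 0; j0 < N; j0 += c)
            for (size_t i = i0; i < min_sz(i0 + c, N); i++)
                for (size_t j = j0; j < min_sz(j0 + c, N); j++)
                    for (size_t k = 0; k < N; k++)
                        C[i*N + j] += A[i*N + k] * B[k*N + j];
}
```

Because each (*i*, *j*, *k*) triple is still executed exactly once, the result is identical to the naive loop nest; only the order of the iterations changes.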

This optimization is called *blocking* because it computes one block of the iteration space at a time. There are many nested-loop programs on which an optimizing compiler can automatically perform the blocking transformation. The crucial requirement is that the loop iterations not be data-dependent on one another; in matrix multiplication, for example, the calculation of *C*[*i*, *j*] does not depend on *C*[*i*′, *j*′].

**Scalar replacement** Even though the access to *C*[*i*, *j*] in the matrix-multiply program will almost always hit the cache (since the same word is used repeatedly in the *k* loop), we can still bring it up one level in the memory hierarchy - from primary cache into registers! - by the *scalar replacement* optimization. That is, when a particular array element is used as a scalar for repeated computations, we can "cache" it in a register:

    for i ← i0 to i0 + c - 1
        for j ← j0 to j0 + c - 1
            s ← C[i, j]
            for k ← 0 to N - 1
                s ← s + A[i, k] · B[k, j]
            C[i, j] ← s
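In C, scalar replacement on one *c* × *c* block might look like this sketch (function and parameter names are ours; `i0`, `j0` locate the block):

```c
#include <stddef.h>

/* One c x c block with scalar replacement: C[i,j] is held in the
   local variable s across the whole k loop, so the register
   allocator can keep it in a register instead of memory. */
void block_scalar_replaced(size_t N, size_t c, size_t i0, size_t j0,
                           const double *A, const double *B, double *C)
{
    for (size_t i = i0; i < i0 + c; i++)
        for (size_t j = j0; j < j0 + c; j++) {
            double s = C[i*N + j];          /* one fetch of C[i,j]  */
            for (size_t k = 0; k < N; k++)
                s += A[i*N + k] * B[k*N + j];
            C[i*N + j] = s;                 /* one store of C[i,j]  */
        }
}
```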

This reduces the number of fetches and stores in the innermost loop by a factor of 2.

**Blocking at every level of the memory hierarchy** To do blocking optimizations, the compiler must know how big the cache is - this determines the best value of *c*, the block size. If there are several levels of the memory hierarchy, then blocking can be done at each level. Even the machine's registers should be considered as a level of the memory hierarchy. Taking again the example of matrix multiply, we suppose there are 32 floating-point registers, and we want to use *d* of them as a kind of cache. We can rewrite the *c* × *c* loop (of the blocked matrix multiply) as follows:

    for i ← i0 to i0 + c - 1
        for k0 ← 0 to N - 1 by d
            for k ← k0 to k0 + d - 1
                T[k - k0] ← A[i, k]
            for j ← j0 to j0 + c - 1
                s ← C[i, j]
                for k ← k0 to k0 + d - 1
                    s ← s + T[k - k0] · B[k, j]
                C[i, j] ← s
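A C sketch of this register-tiled kernel (names are ours; for simplicity it assumes *d* divides *N*). Note that `T` is still an array here, which is exactly why unroll-and-jam is needed next:

```c
#include <stddef.h>

/* Register-blocked c x c kernel at (i0, j0): d elements of row i of A
   are copied into the small buffer T and reused across the whole
   j loop. Assumes d divides N. */
void block_register_tiled(size_t N, size_t c, size_t d,
                          size_t i0, size_t j0,
                          const double *A, const double *B, double *C)
{
    double T[32];   /* d <= 32, matching the 32 floating-point registers */
    for (size_t i = i0; i < i0 + c; i++)
        for (size_t k0 = 0; k0 < N; k0 += d) {
            for (size_t k = k0; k < k0 + d; k++)
                T[k - k0] = A[i*N + k];
            for (size_t j = j0; j < j0 + c; j++) {
                double s = C[i*N + j];
                for (size_t k = k0; k < k0 + d; k++)
                    s += T[k - k0] * B[k*N + j];
                C[i*N + j] = s;
            }
        }
}
```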

**Unroll and jam** *Loop unrolling must be used for register-level blocking*, since registers cannot be indexed by subscripts. So we unroll the *k*-loops *d* times and keep each *T*[*k*] in a separate scalar temporary variable (for illustration, we will use *d* = 3, though *d* = 25 would be more realistic):

    for i ← i0 to i0 + c - 1
        for k0 ← 0 to N - 1 by 3
            t0 ← A[i, k0];  t1 ← A[i, k0 + 1];  t2 ← A[i, k0 + 2]
            for j ← j0 to j0 + c - 1
                C[i, j] ← C[i, j] + t0 · B[k0, j] + t1 · B[k0 + 1, j] + t2 · B[k0 + 2, j]
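The unroll-and-jam version translates to C as follows (a sketch with *d* = 3; it assumes 3 divides *N*, whereas a compiler would also emit a cleanup loop for leftover iterations):

```c
#include <stddef.h>

/* Unroll-and-jam with d = 3 on the c x c block at (i0, j0): the k loop
   is unrolled three times, and each A[i,k] is held in a scalar
   (t0..t2) that the register allocator can keep in a register.
   Assumes 3 divides N. */
void block_unroll_jam(size_t N, size_t c, size_t i0, size_t j0,
                      const double *A, const double *B, double *C)
{
    for (size_t i = i0; i < i0 + c; i++)
        for (size_t k0 = 0; k0 < N; k0 += 3) {
            double t0 = A[i*N + k0];
            double t1 = A[i*N + k0 + 1];
            double t2 = A[i*N + k0 + 2];
            for (size_t j = j0; j < j0 + c; j++)
                C[i*N + j] += t0 * B[k0*N + j]
                            + t1 * B[(k0+1)*N + j]
                            + t2 * B[(k0+2)*N + j];
        }
}
```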

The register allocator will ensure, of course, that the *t*_{k} are kept in registers. Every value of *A*[*i*, *k*] is now fetched into a register once and then used *c* times, in the *c* iterations of the *j* loop.

A high-tech compiler would perform - on the same loop! - blocking transformations for the primary cache and for the secondary cache, and scalar replacement and unroll-and-jam for the register level of the memory hierarchy.
