CACHE-BLOCK ALIGNMENT

The typical cache-block size (B = about 8 words) is similar to the typical data-object size. We may expect that an algorithm that fetches one field of an object will probably fetch other fields as well. If an object x straddles a multiple-of-B boundary, then it occupies portions of two different cache blocks, both of which are likely to be active at the same time. On the other hand, if x does not cross a multiple-of-B boundary, then accessing all the fields of x uses up only one cache block. To improve performance by using the cache effectively, the compiler should arrange that data objects are not unnecessarily split across blocks. There are simple ways to accomplish this:

Allocate objects sequentially; if the next object does not fit in the remaining portion of the current block, skip to the beginning of the next block.
Allocate size-2 objects in one area of memory, all aligned on multiple-of-2 boundaries; size-4 objects in another area, aligned on multiple-of-4 boundaries, and so on. This eliminates block-crossing for many common-sized objects, without wasted space between the objects.
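
The first strategy can be sketched as a small bump allocator. This is only an illustration under stated assumptions: the arena is indexed in words, B is fixed at 8, and the names Arena, arena_alloc, and ARENA_FULL are hypothetical rather than part of any particular runtime system.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK_WORDS = 8 };            /* assumed cache-block size B, in words */
#define ARENA_FULL ((size_t)-1)      /* hypothetical "out of space" result */

/* Hypothetical bump allocator over a word-indexed arena. */
typedef struct {
    size_t next;    /* index of the next free word */
    size_t limit;   /* total number of words in the arena */
} Arena;

/* Returns the word index of the new object, or ARENA_FULL. */
size_t arena_alloc(Arena *a, size_t size)
{
    assert(size > 0);
    size_t start  = a->next;
    size_t offset = start % BLOCK_WORDS;

    /* If the object does not fit in the rest of the current block,
       skip to the beginning of the next block. */
    if (offset != 0 && size > BLOCK_WORDS - offset)
        start += BLOCK_WORDS - offset;

    if (start > a->limit || size > a->limit - start)
        return ARENA_FULL;

    a->next = start + size;
    return start;
}
```

With an empty 64-word arena, three successive requests for 3 words are placed at words 0, 3, and 8: the third request skips the two words left at the end of the first block rather than straddling the boundary.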

Block alignment can waste some space, leaving unused words at the end of some blocks, as shown in Image 21.3. However, the execution speed may improve; for a given phase of the program, there is a set S of frequently accessed objects, and alignment may reduce the number of cache blocks occupied by S from a number greater than the cache size to a number that fits in the cache.
Image 21.3: Alignment of data objects (or basic blocks) to avoid crossing cache-block boundaries is often worthwhile, even at the cost of empty space between objects.
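
To make the effect on the set S concrete, the following sketch counts the cache blocks touched by a few frequently accessed objects under two layouts. It assumes word-indexed addresses and assumes that the hot objects are far enough apart that no two of them share a block; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache blocks touched by an object that occupies
   words start .. start+size-1, for block size B (in words). */
size_t blocks_touched(size_t start, size_t size, size_t B)
{
    assert(size > 0 && B > 0);
    return (start + size - 1) / B - start / B + 1;
}

/* Total blocks occupied by n hot objects of the same size.
   Assumption: no two hot objects fall in the same block. */
size_t hot_blocks(const size_t *starts, size_t n, size_t size, size_t B)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += blocks_touched(starts[i], size, B);
    return total;
}
```

For example, with B = 8 and 5-word objects, hot objects at the packed addresses 5, 45, and 85 occupy six blocks in total, while the same three objects at the block-aligned addresses 8, 72, and 136 occupy three.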

Alignment can be applied both to global, static data and to heap-allocated data. For global data, the compiler can use assembly-language alignment directives to instruct the linker. For heap-allocated records and objects, it is not the compiler but the memory allocator within the runtime system that must place objects on cache-block boundaries, or otherwise minimize the number of cache-block crossings.
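
At the source level, C11 exposes both mechanisms. The sketch below assumes a 64-byte cache block, which is machine-dependent, and uses a hypothetical Stats record. The alignas specifier asks the compiler to request the alignment for a static object, and aligned_alloc asks the allocator for a block-aligned heap object.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed cache-block size in bytes */

typedef struct { long hits, misses, evictions; } Stats;  /* hypothetical record */

/* Static data: the compiler requests block alignment for this object. */
static alignas(CACHE_LINE) Stats global_stats;

Stats *stats_global(void) { return &global_stats; }

/* Heap data: the allocator must supply the alignment. The size is rounded
   up to a multiple of the alignment, as C11 aligned_alloc requires. */
Stats *stats_new(void)
{
    size_t size = (sizeof(Stats) + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, size);
}

int is_block_aligned(const void *p)
{
    return (uintptr_t)p % CACHE_LINE == 0;
}
```

Rounding every object up to a full block is the most wasteful policy; an allocator that only avoids block crossings wastes less space.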

ALIGNMENT IN THE INSTRUCTION CACHE

Instruction "objects" (basic blocks) occupy cache blocks just as do data records, and the same considerations of block-crossing and alignment apply to instructions. Aligning the beginning of frequently executed basic blocks on multiple-of-B boundaries increases the number of basic blocks that fit simultaneously in the instruction cache. Infrequently executed instructions should not be placed on the same cache blocks as frequently executed instructions. Consider the program

P;
if x then Q;
R;

where x is rarely true. We could generate code for it in either of the ways shown in Image 21.4. Placing Q out of line means that this series of statements usually occupies only two cache blocks. Placing Q in line, straddling cache blocks between P and R, means that even in the common case, where Q is not executed, this part of the program occupies three blocks in the cache.
Image 21.4: If x is rarely true, basic-block placement (a) will occupy three in-cache blocks, while (b) will usually occupy only two.
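
A programmer can suggest this kind of placement to some compilers. The sketch below uses the GCC and Clang extensions __builtin_expect and the cold function attribute; on other compilers the macros expand to nothing. These are hints only, so whether Q actually ends up out of line depends on the compiler and its optimization settings. The bodies standing in for P, Q, and R are placeholders.

```c
#include <assert.h>

#if defined(__GNUC__)
#  define UNLIKELY(c) __builtin_expect(!!(c), 0)   /* condition is rarely true */
#  define COLD __attribute__((cold, noinline))     /* rarely executed function */
#else
#  define UNLIKELY(c) (c)
#  define COLD
#endif

static int q_runs = 0;

/* Q: the rarely executed statement, kept in its own function so that
   the compiler may place it away from the hot path. */
static COLD void Q(void)
{
    q_runs++;
}

int run(int x, int acc)
{
    acc += 1;            /* P */
    if (UNLIKELY(x))
        Q();             /* Q, expected to be skipped */
    acc *= 2;            /* R */
    return acc;
}
```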

On some machines it is particularly important to align the target of a branch instruction on a power-of-2 boundary. A modern processor fetches an aligned block of k (2 or 4 or more) words. If the program branches to some address that is not on a multiple-of-k boundary, then the instruction fetch delivers fewer than k useful instructions, because the words that precede the branch target within the fetched block are not executed.
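
The cost of a misaligned branch target can be estimated with a little arithmetic. This sketch assumes addresses measured in instruction words, a fetch width k that is a power of 2, and a fetch that always starts at the aligned block containing the target; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Useful instructions delivered by the first fetch after a branch to
   `target`, when the processor fetches aligned blocks of k words. */
size_t useful_fetch_words(size_t target, size_t k)
{
    assert(k != 0 && (k & (k - 1)) == 0);   /* k must be a power of 2 */
    return k - (target & (k - 1));
}

/* Smallest multiple of k that is >= addr: where a compiler could place
   the branch target, padding with no-ops if necessary. */
size_t align_up(size_t addr, size_t k)
{
    assert(k != 0 && (k & (k - 1)) == 0);
    return (addr + k - 1) & ~(k - 1);
}
```

With k = 4, a branch to word 11 yields only one useful instruction from the first fetch, whereas moving the target to align_up(11, 4), which is word 12, yields four.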

An optimizing compiler should have a basic-block-ordering phase, after instruction selection and register allocation. Trace scheduling (as described in ) can then be used to order a frequently executed path through a contiguous set of cache blocks; in constructing a trace through a conditional branch, it is important to follow the most-likely-taken out-edge, as determined by branch prediction (as described in ).
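
A much-simplified version of such an ordering phase is sketched below. It is a greedy trace selection, not a full trace scheduler: starting from each unplaced block in turn, it repeatedly follows the most likely out-edge to a block that has not yet been placed. The BasicBlock representation, the limit of two successors, and the edge probabilities are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SUCC = 2, MAX_BLOCKS = 64 };

typedef struct {
    int    succ[MAX_SUCC];   /* successor block ids, -1 if absent */
    double prob[MAX_SUCC];   /* estimated probability of each out-edge */
} BasicBlock;

/* Writes a layout order for the n blocks of cfg into order[] and returns
   the number of blocks placed (always n). Block 0 is taken as the entry. */
size_t order_blocks(const BasicBlock *cfg, size_t n, int *order)
{
    unsigned char placed[MAX_BLOCKS] = {0};
    size_t count = 0;
    assert(n <= MAX_BLOCKS);

    for (size_t seed = 0; seed < n; seed++) {
        int b = (int)seed;
        while (b >= 0 && !placed[b]) {       /* extend the current trace */
            placed[b] = 1;
            order[count++] = b;

            /* Follow the most likely out-edge to an unplaced block. */
            int best = -1;
            double best_p = -1.0;
            for (int i = 0; i < MAX_SUCC; i++) {
                int s = cfg[b].succ[i];
                if (s >= 0 && !placed[s] && cfg[b].prob[i] > best_p) {
                    best = s;
                    best_p = cfg[b].prob[i];
                }
            }
            b = best;                        /* -1 ends the trace */
        }
    }
    return count;
}
```

For the program P; if x then Q; R with blocks numbered P = 0, Q = 1, R = 2 and a 5% chance of entering Q, the resulting order is P, R, Q, which places Q out of line. A production compiler would more plausibly seed traces in order of execution frequency rather than block number.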