
CACHE ORGANIZATION

A direct-mapped cache is organized in the following way to do this quickly. There are 2^m blocks, each holding 2^l words of 2^w bytes; thus, the cache holds 2^(w+l+m) bytes in all, arranged in an array Data[block][word][byte]. Each block is a copy of some main-memory data, and there is a tag array indicating where in memory the current contents come from. Typically, the word size 2^w might be 4 bytes, the block size 2^(w+l) might be 32 bytes, and the cache size 2^(w+l+m) might be as small as 8 kilobytes or as large as 2 megabytes.


Given an address x, the cache unit must be able to find whether x is in the cache. The address x is composed of n bits, x_{n-1} x_{n-2} ... x_2 x_1 x_0 (see Image 21.1). In a direct-mapped cache organization, we take the middle bits as the key = x_{w+l+m-1} x_{w+l+m-2} ... x_{w+l}, and hold the data for x in Data[key]. The high bits x_{n-1} x_{n-2} ... x_{w+l+m} form the tag, and if Tags[key] ≠ tag, then there is a cache miss: the word we require is not in the cache. In this case, the contents of Data[key] are sent back to main memory, and the block of memory at address x_{n-1} ... x_{w+l} 0 ... 0 (that is, x with its low w + l bits cleared) is fetched into the keyth cache block (and also sent to the CPU). Access time for main memory is much longer than the cache access time, so frequent misses are undesirable.

Image 21.1: Organization of a direct-mapped cache. The key field of the address is used to index the Tags array and the data blocks; if Tags[key] matches the tag field of the address, then the data is valid (a cache hit). The word index is used to select a word from the cache block.
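
To make the field boundaries concrete, here is a minimal Java sketch of the address decomposition just described. The field widths (w = 2, l = 3, m = 8) and all names are hypothetical, chosen only for illustration:

    public class AddressFields {
        static final int W = 2, L = 3, M = 8;   // hypothetical field widths

        static int byteOffset(int x) { return x & ((1 << W) - 1); }          // x_{w-1} ... x_0
        static int wordIndex(int x)  { return (x >>> W) & ((1 << L) - 1); }  // x_{w+l-1} ... x_w
        static int key(int x)        { return (x >>> (W + L)) & ((1 << M) - 1); }
        static int tag(int x)        { return x >>> (W + L + M); }

        public static void main(String[] args) {
            int x = 0x12345678;
            System.out.printf("tag=%x key=%x word=%x byte=%x%n",
                              tag(x), key(x), wordIndex(x), byteOffset(x));
        }
    }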

The next time address x is fetched, if no intervening instruction has fetched another address with the same key but a different tag, there will be a cache hit: Tags[key] = tag, and bits x_{w+l-1} ... x_w will address a word within the keyth block: The contents of Data[key][x_{w+l-1} ... x_w] are transferred to the processor. This is much faster than going all the way to main memory for the data. If the fetching instruction is a byte-fetch (instead of a word-fetch), then (typically) the processor takes care of selecting the byte x_{w-1} ... x_0 from the word.

Another common organization is the set-associative cache, which is quite similar but can hold more than one block with the same key value. The compiler optimization strategies presented in this chapter are valid for both direct-mapped and set-associative caches, but they are a bit more straightforward to analyze for direct-mapped caches.

Write-hit policy: The paragraphs above explain what happens on a read, when the CPU asks for data at address x. But what happens when the CPU writes data at address x? If x is in the cache, this is a write hit, which is easy and efficient to process. On a write hit, main memory may be updated now (write-through), or only when the cache block is about to be flushed from the cache (write-back); the choice of write-hit policy does not much affect the compilation and optimization of sequential programs.

Write-miss policy: If the CPU writes data at an address not in the cache, this is a write miss. Different machines have different write-miss policies:

Fetch-on-write: Word x is written to the cache. But now the other words in the same cache block belong to some other address (one that had the same key as x), so to make a valid cache block the other words are fetched from main memory. Meanwhile, the processor is stalled.

Write-validate: Word x is written to the cache. The other words in the same cache block are marked invalid; nothing is fetched from main memory, so the processor is not stalled.

Write-around: Word x is written directly to main memory, and not to the cache. The processor is not stalled, as no response is required from the memory system. Unfortunately, the next time x is fetched there will be a read miss, which will delay the processor.
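
As a concrete illustration of these mechanisms, here is a toy Java model of a direct-mapped cache's read path, paired with a write-through/write-around store policy. All sizes, names, and the software representation are invented for illustration; a real cache does this in hardware:

    public class ToyCache {
        static final int W = 2, L = 3, M = 4;        // small hypothetical field widths
        static final int BLOCKS = 1 << M;            // 16 blocks of 8 four-byte words

        int[] tags = new int[BLOCKS];
        boolean[] valid = new boolean[BLOCKS];
        int[][] data = new int[BLOCKS][1 << L];      // Data[block][word]
        int[] memory = new int[1 << 16];             // word-addressed "main memory"
        int hits = 0, misses = 0;

        int readWord(int addr) {
            int key  = (addr >>> (W + L)) & (BLOCKS - 1);
            int tag  = addr >>> (W + L + M);
            int word = (addr >>> W) & ((1 << L) - 1);
            if (valid[key] && tags[key] == tag) {    // cache hit
                hits++;
            } else {                                 // cache miss: fetch the whole block
                misses++;
                int base = (addr >>> (W + L)) << L;  // first word of the aligned block
                for (int i = 0; i < (1 << L); i++)
                    data[key][i] = memory[base + i];
                tags[key] = tag;
                valid[key] = true;
            }
            return data[key][word];
        }

        void writeWord(int addr, int value) {
            int key  = (addr >>> (W + L)) & (BLOCKS - 1);
            int tag  = addr >>> (W + L + M);
            int word = (addr >>> W) & ((1 << L) - 1);
            memory[addr >>> W] = value;              // write-through: memory always updated
            if (valid[key] && tags[key] == tag)
                data[key][word] = value;             // write hit: update the cached copy
            // write miss: write-around leaves the cache untouched,
            // so the next read of addr will miss
        }

        public static void main(String[] args) {
            ToyCache c = new ToyCache();
            c.readWord(0x1000);                      // miss: block fetched from memory
            c.readWord(0x1004);                      // hit: same block, next word
            System.out.println("hits=" + c.hits + " misses=" + c.misses);
        }
    }

A fetch-on-write or write-validate machine would differ only in writeWord: the former would fetch the rest of the block on a miss, while the latter would install the word and mark the block's other words invalid (which requires per-word valid bits, omitted here).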

The write-miss policy can affect how programs should be optimized (see pages 475 and 480).

Several layers of cache: A modern machine has a memory hierarchy of several layers, as shown in Image 21.2: Inside the processor are registers, which can typically hold about 200 bytes in all and can be accessed in 1 processor cycle; a bit farther away is the primary cache, which can typically hold 8-64 kilobytes and be accessed in about 2-3 cycles; then the secondary cache can hold about a megabyte and be accessed in 7-10 cycles; main memory can hold 100 megabytes and be accessed in 100 cycles. The primary cache is usually split into an instruction cache, from which the processor fetches instructions to execute, and a data cache, from which the processor fetches and stores operands of instructions. The secondary cache usually holds both instructions and data.

Image 21.2: The memory hierarchy.

Many processors can issue several instructions per cycle; the number of useful instructions in a cycle varies, depending on data-dependence and resource constraints (see page 441), but let us suppose that two useful instructions can be completed in each cycle, on the average. Then a primary-cache miss is a 15-instruction delay (7-10 cycles, times 2), and a secondary-cache miss is a 200-instruction delay. This cache organization has several consequences of interest to the programmer (and often to the compiler):

Byte fetch: Fetching a single byte is often more expensive than fetching a whole word, because the memory interface delivers a whole word at a time, so the processor must do extra shifting.

Byte store: Storing a single byte is usually more expensive than storing a whole word, because the other bytes of that word must be fetched from the cache and stored back into it.

Temporal locality: Accessing (fetching or storing) a word that has been recently accessed will usually be a cache hit.

Spatial locality: Accessing a word in the same cache block as one that has been accessed recently will usually be a cache hit.

Cache conflict: If address a and address a + i · 2^(w+l+m) are both frequently accessed, there will be many cache misses, because accessing one will throw the other out of the cache.
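
To see why such strides collide, this small hypothetical snippet (reusing the field widths from the sketches above) computes the key and tag for two addresses exactly i · 2^(w+l+m) bytes apart: the keys are equal while the tags differ, so in a direct-mapped cache each access evicts the other's block:

    public class ConflictDemo {
        public static void main(String[] args) {
            final int W = 2, L = 3, M = 4;           // cache size 2^(w+l+m) = 512 bytes
            int cacheSize = 1 << (W + L + M);
            int a = 0x1000;
            int b = a + 3 * cacheSize;               // i = 3 in the formula above
            int keyA = (a >>> (W + L)) & ((1 << M) - 1);
            int keyB = (b >>> (W + L)) & ((1 << M) - 1);
            // Prints equal keys but different tags: the two addresses
            // contend for the same cache block.
            System.out.printf("keyA=%d keyB=%d tagA=%d tagB=%d%n",
                              keyA, keyB, a >>> (W + L + M), b >>> (W + L + M));
        }
    }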

The compiler can do optimizing transformations that do not decrease the number of instructions executed, but that decrease the number of cache misses (or other memory stalls) that the program encounters.

