GARBAGE COLLECTION & THE MEMORY HIERARCHY
Garbage-collected systems have had the reputation as cache-thrashers with bad cache locality: After all, it would appear that a garbage collection touches all of memory in random-access fashion. But a garbage collector is really a kind of memory manager, and we can organize it to manage memory for improved locality of reference.
Generations: When generational copying garbage collection is used, the youngest generation (allocation space) should be made to fit inside the secondary cache. Then each memory allocation will be a cache hit, and each youngest-generation garbage collection will operate almost entirely within the cache as well - only the objects promoted to another generation may cause cache-write misses. (Keeping the youngest generation inside the primary cache is impractical, since that cache is usually so small that too-frequent garbage collections would be required.) Sequential allocation: With copying collection, new objects are allocated from a large contiguous free space, sequentially in order of address. The sequential pattern of stores to initialize these objects is easy for most modern write-buffers to handle. Few conflicts: The most frequently referenced objects tend to be the newer ones. With sequential allocation of objects in the youngest generations, the keys of these newer objects (in a direct-mapped cache) will be all different. Consequently, garbage-collected programs have significantly lower conflict-miss rates than programs that use explicit freeing. Prefetching for allocation: The sequential initializing stores cause cache-write misses (in the primary cache, which is much smaller than the allocation space) at the rate of one miss per B/W stores, where B is the cache block size and W is the word size. On most modern machines (those with write-validate cache policies) these misses are not costly, because a write miss does not cause the processor to wait for any data. But on some machines (those with fetch-on-write or write-around policies) a write miss is costly. One solution is to prefetch the block well in advance of storing into it. This does not require analysis of any loops in the program (like the technique shown in ) - instead, as the allocator creates a new object at address a, it prefetches word a + K. The value K is related to the cache-miss latency and also the frequency of allocation versus other computation, but a value of K = 100 should work well in almost all circumstances.
Grouping related objects: If object x points to object y, an algorithm that accesses x will likely access y soon, so it is profitable to put the two objects in the same block. A copying collector using depth-first search to traverse the live data will automatically tend to put related objects together; a collector using breadth-first search will not. Copying in depth-first order improves cache performance - but only if the cache blocks are larger than the objects.
These cache-locality improvement techniques are all applicable to copying collection. Mark-and-sweep collectors, which cannot move the live objects, are less amenable to cache management; but see the Further Reading section.