EXERCISES

*21.1 Write a program in C for multiplying 1000 × 1000 double-precision floating-point matrices. Run it on your machine and measure the time it takes.
1. Find out the number of floating-point registers on your machine, the size of the primary cache, and the size of the secondary cache.
2. Write a matrix-multiply program that uses blocking transformations at the secondary cache level only. Measure its run time.
3. Modify your program to optimize on both levels of the cache; measure its run time.
4. Modify the program again to optimize over both levels of the cache and use registers via unroll-and-jam; view the output of the C compiler to verify that the register allocator is keeping your temporary variables in floating-point registers. Measure the run time.
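As a possible starting point for Exercise 21.1, the sketch below pairs a straightforward triple loop with a version blocked at a single cache level, plus a small timing helper. It is only an illustration, not a reference solution: the function names, the row-major layout, and the single block-size parameter bs are assumptions made here, and a suitable bs has to be worked out from the cache sizes found in part 1.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Baseline: C = A * B for n x n row-major matrices (plain i-j-k loops). */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked at one cache level: the three outer loops walk bs x bs tiles,
 * the three inner loops multiply one tile of A by one tile of B.
 * bs is a hypothetical tuning parameter; choose it so that the tiles
 * in use fit in the cache level being targeted. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t imax = min_size(ii + bs, n);
                size_t kmax = min_size(kk + bs, n);
                size_t jmax = min_size(jj + bs, n);
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}

/* CPU time in seconds for one blocked multiply, measured with clock(). */
double time_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    clock_t start = clock();
    matmul_blocked(n, bs, a, b, c);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

For the 1000 × 1000 case the three matrices take about 24 MB in total, so they are better allocated with malloc than placed on the stack. Blocking for both cache levels (part 3) would add a second, smaller tile size nested inside the first, and unroll-and-jam (part 4) would unroll the i and j loops of the innermost tile so that several elements of c are held in scalar temporaries.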
*21.2 Write a program in C for multiplying 1000 × 1000 double-precision floating-point matrices. Use the C compiler to print out assembly language for your loop. If your machine has a prefetch instruction, or a nonstalling load instruction that can serve as a prefetch, insert prefetch instructions to hide secondary-cache misses. Show the calculations you made to account for the cache-miss latency. How much faster is your program with prefetching?
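The latency calculation asked for in Exercise 21.2 can be sketched at the C level before moving to assembly. The example below is a minimal illustration under stated assumptions: the names prefetch_distance, dot_prefetch, and PREFETCH_READ are hypothetical, the prefetch itself relies on the GCC/Clang builtin __builtin_prefetch (other compilers get a no-op), and the cycle counts are placeholders to be replaced by figures measured on your own machine. The idea is to issue each prefetch far enough ahead that the miss latency is covered by the intervening loop iterations.

```c
#include <stddef.h>

/* Number of iterations to prefetch ahead:
 * ceil(miss_latency_cycles / cycles_per_iter). */
size_t prefetch_distance(size_t miss_latency_cycles, size_t cycles_per_iter)
{
    if (cycles_per_iter == 0)
        return 0;
    return (miss_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Assumption: GCC or Clang. Elsewhere the macro expands to nothing. */
#if defined(__GNUC__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Inner loop of a matrix multiply: dot product of a row x (stride 1)
 * with a column y (stride ystride elements), prefetching the operands
 * needed dist iterations from now. The bounds check keeps the address
 * expressions inside the arrays. Prefetching x on every iteration is
 * redundant, since one prefetch per cache line would do; it is kept
 * here only for simplicity. */
double dot_prefetch(size_t n, const double *x,
                    const double *y, size_t ystride, size_t dist)
{
    double sum = 0.0;
    for (size_t k = 0; k < n; k++) {
        if (k + dist < n) {
            PREFETCH_READ(&x[k + dist]);
            PREFETCH_READ(&y[(k + dist) * ystride]);
        }
        sum += x[k] * y[k * ystride];
    }
    return sum;
}
```

For example, with an assumed 100-cycle secondary-cache miss and a loop body of 8 cycles, prefetch_distance(100, 8) gives 13 iterations. Whether the builtin actually becomes a prefetch instruction should be checked in the compiler's assembly output, as the exercise asks.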