Caching

Caching involves keeping frequently used data "close" to where it is needed, or preloading data in anticipation of future operations. Data read from disks may be cached until a subsequent write makes it invalid, and data written to disk is usually cached so that many consecutive changes to the same file may be written out in a single operation. In NFS, data caching means not having to send an RPC request over the network to a server: the data is cached on the NFS client and can be read out of local memory instead of from a remote disk. Depending upon the filesystem structure and usage, some cache schemes may be prohibited for certain operations to guarantee data integrity or consistency with multiple processes reading or writing the same file. Cache policies in NFS ensure that performance is acceptable while also preventing the introduction of state into the client-server relationship.

File attribute caching

Not all filesystem operations touch the data in files; many of them either get or set the attributes of the file, such as its length, owner, modification time, and inode number. Because these attribute-only operations are frequent and do not affect the data in a file, they are prime candidates for caching. ls -l is a classic example of an attribute-only operation: it gets information about directories and files, but doesn't look at the contents of the files.

NFS caches file attributes on the client side so that every getattr operation does not have to go all the way to the NFS server. When a file's attributes are read, they remain valid on the client for some minimum period of time, typically three seconds. If the file's attributes remain static for some maximum period, normally 60 seconds, they are flushed from the cache. When an application on the NFS client modifies a file attribute, the change is immediately written back to the server. The only exception is an implicit change to the file's size as a result of writing to the file. As we will see in the next section, data written by the application is not immediately written to the server, so neither is the file's size attribute.

The same mechanism is used for directory attributes, although they are given a longer minimum lifespan. The usual defaults for directory attributes are a minimum cache time of 30 seconds and a maximum of 60 seconds. The longer minimum cache period reflects the typical pattern of intense filesystem activity: files themselves are modified almost continuously, but directory updates (adding or removing files) happen much less frequently.

The attribute cache can also be updated by NFS operations that include attributes in their results, and nearly all of NFS Version 3's RPC procedures do. Attribute caching allows a client to make a steady stream of accesses to a file without having to constantly get attributes from the server.
Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call.

In the previous section, we saw how the async thread fills and drains the NFS client's buffer or page cache. This presents a cache consistency problem: if an async thread performs read-ahead on a file, and the client accesses that information at some later time, how does the client know that the cached copy of the data is valid? What guarantees are there that another client hasn't changed the file, making the copy of the file's data in the buffer cache invalid?

An NFS client needs to maintain cache consistency with the copy of the file on the NFS server. It uses file attributes to perform the consistency check. The file's modification time is used as a cache validity check; if the cached data is newer than the modification time then it remains valid. As soon as the file's modification time is newer than the time at which the async thread read data, the cached data must be flushed. In page-mapped systems, the modification time becomes a "valid bit" for cached pages. If a client reads a file that never gets modified, it can cache the file's pages for as long as needed.

This feature explains the "accelerated make" phenomenon seen on NFS clients when compiling code. The second and successive times that a software module (located on an NFS fileserver) is compiled, the make process is faster than the first build. The reason is that the first make reads in header files and causes them to be cached. Subsequent builds of the same modules or other files using the same headers pick up the cached pages instead of having to read them from the NFS server. As long as the header files are not modified, the client's cached pages remain valid.
The first compilation requires many more RPC requests to be sent to the server; the second and successive compilations only send RPC requests to read those files that have changed.

The cache consistency checks themselves are mediated by the file attribute cache. When a cache validity check is done, the kernel compares the modification time of the file to the timestamp on its cached pages; normally this would require reading the file's attributes from the NFS server. Since file attributes are kept in the file's inode (which is itself cached on the NFS server), reading file attributes is much less "expensive" than going to disk to read part of the file. However, if the file attributes are not changing frequently, there is no reason to re-read them from the server on every cache validity check. The data cache algorithms therefore use the file attribute cache to speed up modification time comparisons.

Keeping previously read data blocks cached on the client does not introduce state into the NFS system, since nothing is being modified on the client caching the data. Long-lived cache data does, however, introduce consistency problems if one or more other clients have the file open for writing, which is one of the motivations for limiting the attribute cache validity period. If the attribute cache data never expired, clients that opened files for reading only would never have reason to check the server for possible modifications by other clients. Stateless NFS operation requires each client to be oblivious to all others and to rely only on its attribute cache for ensuring consistency. Of course, if clients are using different attribute cache aging schemes, then machines with longer attribute cache lifetimes are more likely to have stale data. Attribute caching and its effects on NFS performance are revisited in "Attribute caching".
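The interplay between the attribute cache and the data cache can be sketched as a small Python model. This is a toy model under stated assumptions, not how any kernel implements it. FakeServer, NfsClientCache, ACMIN and ACMAX are hypothetical names, and the clock is injected so that the example is deterministic. The timeout policy is also an assumption: it starts at the minimum, doubles each time a revalidation finds the file unchanged, and resets when the file has changed. That policy was chosen only because it stays within the minimum and maximum lifetimes described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

ACMIN = 3.0   # hypothetical minimum attribute lifetime, seconds ("typically three seconds")
ACMAX = 60.0  # hypothetical maximum attribute lifetime, seconds ("normally 60 seconds")


@dataclass
class FakeServer:
    """Hypothetical NFS server: file handle -> (modification time, contents)."""
    files: Dict[str, Tuple[float, bytes]] = field(default_factory=dict)
    getattr_rpcs: int = 0
    read_rpcs: int = 0

    def getattr(self, fh: str) -> float:
        self.getattr_rpcs += 1
        return self.files[fh][0]  # only the modification time matters here

    def read(self, fh: str) -> bytes:
        self.read_rpcs += 1
        return self.files[fh][1]


@dataclass
class AttrEntry:
    mtime: float       # cached modification time
    fetched_at: float  # when the attributes were last read from the server
    timeout: float     # how long they are trusted without asking again


class NfsClientCache:
    def __init__(self, server: FakeServer, clock: Callable[[], float]):
        self.server = server
        self.clock = clock
        self.attrs: Dict[str, AttrEntry] = {}
        self.pages: Dict[str, Tuple[float, bytes]] = {}  # fh -> (mtime stamp, data)

    def get_mtime(self, fh: str) -> float:
        now = self.clock()
        entry = self.attrs.get(fh)
        if entry is not None and now - entry.fetched_at < entry.timeout:
            return entry.mtime  # attribute cache hit: no RPC
        mtime = self.server.getattr(fh)  # missing or expired: GETATTR RPC
        if entry is not None and entry.mtime == mtime:
            timeout = min(entry.timeout * 2, ACMAX)  # unchanged: trust it longer
        else:
            timeout = ACMIN  # new or changed file: start at the minimum
        self.attrs[fh] = AttrEntry(mtime, now, timeout)
        return mtime

    def read(self, fh: str) -> bytes:
        mtime = self.get_mtime(fh)
        cached = self.pages.get(fh)
        if cached is not None and cached[0] >= mtime:
            return cached[1]  # modification time not newer: cached pages still valid
        data = self.server.read(fh)  # stale or missing: READ RPC, restamp the pages
        self.pages[fh] = (mtime, data)
        return data
```

Suppose the same header is read at 0, 1 and 5 seconds. That costs one READ RPC and two GETATTR RPCs, which is the "accelerated make" effect in miniature. A change the server makes while an attribute entry is still within its timeout goes unnoticed until that entry expires. That window is the stale-data exposure that comes from attribute cache aging, and it widens as the timeout approaches ACMAX.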

Client data caching

In the previous section, we looked at the async threads' management of an NFS client's buffer cache. The async threads perform read-ahead and write-behind for the NFS client processes. We also saw how NFS moves data in NFS buffers, rather than in page- or buffer cache-sized chunks. The use of NFS buffers allows NFS operations to take advantage of some of the sequential disk I/O optimizations of Unix disk device drivers.

Reading in buffers that are multiples of the local filesystem block size allows NFS to reduce the cost of getting file blocks from a server. The overhead of performing an RPC call to read just a few bytes from a file is significant compared to the cost of reading that data from the server's disk, so it is to the client's and server's advantage to spread the RPC cost over as many data bytes as possible. If an application sequentially reads data from a file in 128-byte buffers, the first read operation brings over a full NFS buffer (8 kilobytes for NFS Version 2, usually more for NFS Version 3) from the server. If the file is smaller than the buffer size, the entire file is read from the NFS server. The next read( ) picks up data that is in the buffer (or page) cache, and subsequent reads walk through the entire buffer. With 8-kilobyte buffers, only one of every 64 such read( ) calls requires a read RPC. When the application reads data that is not cached, another full NFS buffer is read from the server. If async threads are performing read-ahead on the client, the next buffer may already be present on the NFS client by the time the process needs data from it. Performing reads in NFS buffer-sized operations improves NFS performance significantly by decoupling the client application's system call buffer size from the VFS implementation's buffer size.

Going the other way, small write operations to the same file are buffered until they fill a complete page or buffer. When a full buffer is written, the operating system gives it to an async thread, and async threads try to cluster write buffers together so they can be sent in NFS buffer-sized requests.
The eventual write RPC call is performed synchronously with respect to the async thread; that is, the async thread does not continue execution (and start another write or read operation) until the RPC call completes. What happens on the server depends on which version of NFS is being used.

For NFS Version 2, the write RPC operation does not return to the client's async thread until the file block has been committed to stable, nonvolatile storage. All write operations are performed synchronously on the server to ensure that no state information is left in volatile storage, where it would be lost if the server crashed.
For NFS Version 3, the write RPC operation typically is done with the stable flag set to off. The server will return as soon as the write is stored in volatile or nonvolatile storage. Recall from "NFS Version 3" that the client can later force the server to synchronously write the data to stable storage via the commit operation.
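The two write semantics above, and the reason COMMIT matters, show up in a toy model of a server that keeps unstable data only in memory. SketchServer and its methods are hypothetical and are not an NFS API. The model ignores the inode and indirect-block updates a real server performs. It tracks only one thing: whether acknowledged data would survive a server crash.

```python
from typing import Dict


class SketchServer:
    """Hypothetical model of how an NFS server handles WRITE and COMMIT."""

    def __init__(self) -> None:
        self.stable: Dict[int, bytes] = {}    # offset -> data on disk or NVRAM
        self.volatile: Dict[int, bytes] = {}  # offset -> data held only in server memory

    def write(self, offset: int, data: bytes, stable: bool) -> str:
        if stable:
            # NFS Version 2 behavior: data reaches stable storage before the reply
            self.volatile.pop(offset, None)
            self.stable[offset] = data
        else:
            # NFS Version 3 with the stable flag off: reply once data is in memory
            self.volatile[offset] = data
        return "OK"  # either way, the client sees a successful reply

    def commit(self) -> None:
        # NFS Version 3 COMMIT: force all unstable data to stable storage
        self.stable.update(self.volatile)
        self.volatile.clear()

    def crash_and_reboot(self) -> None:
        # Anything not yet on stable storage is lost
        self.volatile.clear()


v2 = SketchServer()
v2.write(0, b"a" * 8192, stable=True)
v2.crash_and_reboot()
print(0 in v2.stable)  # True: acknowledged data survived the crash

v3 = SketchServer()
v3.write(0, b"a" * 8192, stable=False)
v3.crash_and_reboot()
print(0 in v3.stable)  # False: acknowledged but uncommitted data is gone
```

Both writes were acknowledged, but only the stable one survived. For this reason, an NFS Version 3 client has to keep its own copy of data written with the stable flag off until a COMMIT covering that data succeeds, so it can resend the data if the server reboots first.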

There are elements of a write-back cache in the async threads. Queueing small write operations until they can be done in buffer-sized RPC calls leaves the client with data that is not yet on any disk, and a client failure before the data is written to the server would leave the server with an old copy of the file. This behavior is similar to that of the Unix buffer cache, or of the page cache in memory-mapped systems. If a client is writing to a local file, blocks of the file are cached in memory and are not flushed to disk until the operating system schedules them. If the machine crashes between the time the data is updated in a file cache page and the time that page is flushed to disk, the file on disk is not changed by the write. This is the expected behavior on systems with local disks: applications running at the time of a crash may not leave disk files in well-known states.

Having file blocks cached on the server during writes poses a problem if the server crashes. The client cannot determine which RPC write operations completed before the crash, violating the stateless nature of NFS. For this reason, writes that the server acknowledges as complete (all writes, in NFS Version 2) cannot be cached on the server side. Caching them would allow the client to believe that the data was properly written while the server could still lose the cached request during a reboot.

Ensuring that writes are completed before they are acknowledged introduces a major bottleneck for NFS write operations, especially for NFS Version 2. A single Version 2 file write operation may require up to three disk writes on the server to update the file's inode, an indirect block pointer, and the data block being written. Each of these server write operations must complete before the NFS write RPC returns to the client. Some vendors eliminate most of this bottleneck by committing the data to nonvolatile, nondisk storage at memory speeds, and then moving data from the NFS write buffer memory to disk in large (64-kilobyte) buffers.
Even when using NFS Version 3, the introduction of nonvolatile, nondisk storage can improve performance, though much less dramatically than with NFS Version 2.

Using the buffer cache and allowing async threads to cluster multiple buffers introduces some problems when several machines are reading from and writing to the same file. To prevent file inconsistency with multiple readers and writers of the same file, NFS institutes a flush-on-close policy:

All partially filled NFS buffers are written to the NFS server when a file is closed.
For NFS Version 3 clients, any writes that were done with the stable flag set to off are forced onto the server's stable storage via the commit operation.

This ensures that a process on another NFS client sees all changes to a file that it is opening for reading:

Client A                      Client B
open( )
write( )
commit (NFS Version 3 only)
close( )
                              open( )
                              read( )

The read( ) system call on Client B will see all of the data in a file just written by Client A, because Client A flushed out all of its buffers for that file when the close( ) system call was made. Note that file consistency is less certain if Client B opens the file before Client A has closed it. If overlapping read and write operations will be performed on a single file, file locking must be used to prevent cache consistency problems. When a file has been locked, the use of the buffer cache is disabled for that file, making it more of a write-through than a write-back cache. Instead of bundling small NFS requests together, each NFS write request for a locked file is sent to the NFS server immediately.
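The client side of this policy can be modeled in a few lines. Small writes are collected into NFS buffer-sized WRITE requests, whatever is left over is flushed at close, Version 3 adds a COMMIT, and a locked file gets write-through. SketchClient and RecordingServer are hypothetical classes. BUFSIZE uses the 8-kilobyte NFS Version 2 buffer size mentioned earlier. A real client would do this work in async threads rather than inline, and the sketch always sends Version 3 writes unstable, which is the typical case described above.

```python
from typing import List, Tuple

BUFSIZE = 8192  # hypothetical NFS buffer size: 8 KB, as for NFS Version 2


class RecordingServer:
    """Hypothetical server stub that only records the RPCs it receives."""

    def __init__(self) -> None:
        self.writes: List[Tuple[int, int, bool]] = []  # (offset, length, stable)
        self.commits = 0

    def write(self, offset: int, data: bytes, stable: bool) -> None:
        self.writes.append((offset, len(data), stable))

    def commit(self) -> None:
        self.commits += 1


class SketchClient:
    """Hypothetical write-behind cache for one open file."""

    def __init__(self, server: RecordingServer, version: int = 3, locked: bool = False):
        self.server = server
        self.version = version
        self.locked = locked         # locked files bypass the write-behind buffer
        self.pending = bytearray()   # partially filled NFS buffer
        self.offset = 0              # file offset of the first pending byte

    def _send(self, data: bytes) -> None:
        # Version 2 writes are always stable; this sketch sends Version 3 writes unstable
        self.server.write(self.offset, bytes(data), stable=(self.version == 2))
        self.offset += len(data)

    def write(self, data: bytes) -> None:
        if self.locked:
            self._send(data)  # write-through: one WRITE RPC per write() call
            return
        self.pending += data
        while len(self.pending) >= BUFSIZE:  # ship only full buffers while open
            self._send(self.pending[:BUFSIZE])
            del self.pending[:BUFSIZE]

    def close(self) -> None:
        if self.pending:  # flush-on-close: send the partially filled buffer
            self._send(self.pending)
            self.pending.clear()
        if self.version == 3:  # then force unstable writes onto stable storage
            self.server.commit()
```

As an example, a program writes 12,800 bytes as one hundred 128-byte write( ) calls. While the file is open, that produces only one full 8192-byte WRITE. The close( ) then sends the remaining 4608 bytes, followed by a single COMMIT. The same hundred calls on a locked file would produce one hundred WRITE requests.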

Server-side caching

The client-side caching mechanisms -- file attribute and buffer caching -- reduce the number of requests that need to be sent to an NFS server. On the server, additional cache policies reduce the time required to service these requests. NFS servers have three caches:

The inode cache, containing file attributes. Inode entries read from disk are kept in-core for as long as possible. Being able to read and write these attributes in memory, instead of having to go to disk, makes the get- and set-attribute NFS requests much faster.
The directory name lookup cache, or DNLC, containing recently read directory entries. Caching directory entries means that the server does not have to open and re-read directories on every pathname resolution. Directory searching is a fairly expensive operation, since it involves going to disk and searching linearly for a particular name in the directory. The DNLC works at the VFS layer, not at the local filesystem layer, so it caches directory entries for all types of filesystems. If you have a CD-ROM drive on your NFS server and mount it on NFS clients, the DNLC becomes even more important, because reading directory entries from the CD-ROM is much slower than reading them from a local hard disk. Server configuration parameters that affect both the inode and DNLC cache systems are discussed in "Kernel configuration".
The server's buffer cache, used for data read from files. As mentioned before, file blocks that are written to NFS servers cannot be cached, and must be written to disk before the client's RPC write call can complete. However, the server's buffer or page cache acts as an efficient read cache for NFS clients. The effects of this caching are more pronounced in page-mapped systems, since nearly all of the server's memory can be used as a read cache for file blocks. For NFS Version 3 servers, the buffer cache is also used for data written to files whenever the write RPC has the stable flag set to off. Thus, NFS Version 3 servers that do not use nondisk, nonvolatile memory to store writes can perform almost as fast as NFS Version 2 servers that do.
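The DNLC's saving comes from replacing a linear directory search with a lookup keyed on the directory and the name being resolved. The following is a toy model, not any kernel's implementation. Dnlc, resolve, the inode numbers and the least-recently-used replacement policy are all assumptions. The capacity parameter merely stands in for a cache size that would be set during kernel configuration.

```python
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple

Directory = List[Tuple[str, int]]  # unsorted (name, inode number) entries


class Dnlc:
    """Hypothetical LRU name cache in front of linear directory scans."""

    def __init__(self, directories: Dict[int, Directory], capacity: int = 128):
        self.directories = directories  # directory inode -> its entries
        self.capacity = capacity
        self.entries: "OrderedDict[Tuple[int, str], int]" = OrderedDict()
        self.scans = 0  # number of slow linear directory searches performed

    def lookup(self, dir_inode: int, name: str) -> Optional[int]:
        key = (dir_inode, name)
        if key in self.entries:
            self.entries.move_to_end(key)  # hit: mark as most recently used
            return self.entries[key]
        self.scans += 1  # miss: search the directory entry by entry
        for entry_name, inode in self.directories[dir_inode]:
            if entry_name == name:
                self.entries[key] = inode
                if len(self.entries) > self.capacity:
                    self.entries.popitem(last=False)  # evict least recently used
                return inode
        return None

    def resolve(self, path: str, root_inode: int = 2) -> int:
        # Walk the path one component at a time, as pathname resolution does
        inode = root_inode
        for component in path.strip("/").split("/"):
            found = self.lookup(inode, component)
            if found is None:
                raise FileNotFoundError(path)
            inode = found
        return inode


tree = {2: [("etc", 5), ("usr", 10)], 10: [("bin", 11), ("include", 20)], 20: [("stdio.h", 30)]}
dnlc = Dnlc(tree)
dnlc.resolve("/usr/include/stdio.h")  # three directory scans
dnlc.resolve("/usr/include/stdio.h")  # answered entirely from the cache
print(dnlc.scans)  # 3
```

The second resolution of the same path does not search any directory, and that is exactly the work the DNLC saves on every repeated lookup. The slower the medium holding the directories, the more each avoided scan is worth.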

Cache mechanisms on NFS clients and servers provide acceptable NFS performance while preserving many -- but not all -- of the semantics of a local filesystem. If you need finer consistency control when multiple clients are accessing the same files, you need to use file locking.