Asynchronous NFS error messages
This final section provides an in-depth look at how an NFS client does write-behind, and what happens if one of the write operations fails on the remote server. It is intended as an introduction to the more complex issues of performance analysis and tuning, many of which revolve around similar subtleties in the implementation of NFS.

When an application calls read( ) or write( ) on a local Unix filesystem (UFS) file, the kernel uses inode and indirect block pointers to translate the offset in the file into a physical block number on the disk. A low-level physical I/O operation, such as "write this buffer of 1024 bytes to physical blocks 5678 and 5679," is then passed to the disk device driver. The actual disk operation is scheduled, and when the disk interrupts, the driver interrupt routine notes the completion of the current operation and schedules the next. The block device driver queues the requests for the disk, possibly reordering them to minimize disk head movement.

Once the disk device driver has a read or write request, only a media failure causes the operation to return an error status. Any other failure, such as a permission problem or the filesystem running out of space, is detected by the filesystem management routines before the disk driver gets the request. From the point of view of the read( ) and write( ) system calls, everything from the filesystem write routine down is a black box: the application isn't necessarily concerned with how the data makes it to or from the disk, as long as it does so reliably. The actual write operation occurs asynchronously to the application calling write( ). If a media error occurs -- for example, the disk has a bad sector brewing -- then the media-level error is reported back to the application during the next write( ) call or during the close( ) of the file containing the bad block. When the driver notices the error returned by the disk controller, it prints a media failure message on the console.

A similar mechanism is used by NFS to report errors on the "virtual media" of the remote fileserver. When write( ) is called on an NFS-mounted file, the data buffer and offset into the file are handed to the NFS write routine, just as a UFS write calls the lower-level disk driver write routine. Like the disk device driver, NFS has a driver routine for scheduling write requests: each new request is put into the page cache, and when a full page has been written, it is handed to an NFS async thread that performs the RPC call to the remote server and returns a result code. Once the request has been written into the local page cache, the write( ) system call returns to the application -- just as if the application were writing to a local disk. The actual NFS write is synchronous only to the NFS async thread, which is what allows these threads to perform write-behind. A similar process occurs for reads: the NFS async threads perform read-ahead by fetching NFS buffers in anticipation of future read( ) system calls. See "Client I/O system" for details on the operation of the NFS async threads.

Occasionally, an NFS async thread detects an error when attempting to write to a remote server. The scenario is identical to that of a failing disk: the write( ) system call has already returned successfully, so the error is reported asynchronously. The NFS async thread prints it on the client's console, and it is also returned to the application during a subsequent write( ) or close( ) of the file.

The format of these error messages is:

```
NFS write error on host mahimahi: No space left on device.
(file handle: 800006 2 a0000 3ef 12e09b14 a0000 2 4beac395)
```
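Because the error surfaces only after write( ) has returned, an application that never checks the result of fsync( ) or close( ) can lose data without noticing. The following minimal sketch (not from the original text; the NFS-mounted path is hypothetical) shows where a deferred "No space left on device" error would actually be seen:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    /* Hypothetical NFS-mounted path, used only for illustration. */
    const char *path = "/mnt/nfs/testfile";
    char buf[1024];
    int fd, i;

    memset(buf, 'x', sizeof(buf));
    if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
        perror("open");
        return 1;
    }
    for (i = 0; i < 8; i++) {
        /* Each write( ) normally succeeds immediately: the data goes
         * into the client's page cache, not out to the server. A
         * failure reported here is usually from an *earlier* buffer. */
        if (write(fd, buf, sizeof(buf)) < 0)
            perror("write");
    }
    /* fsync( ) forces the async writes out to the server, and is the
     * first point where a full remote disk is guaranteed to show up. */
    if (fsync(fd) < 0)
        perror("fsync");
    /* close( ) can also return the deferred error; ignoring its
     * return value silently discards it. */
    if (close(fd) < 0)
        perror("close");
    return 0;
}
```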
The number of potential failures when writing to an NFS-mounted disk exceeds the few media-related errors that would cause a UFS write to fail. Table 15-1 gives some examples.
Table 15-1. NFS-related errors
| Error | Typical Cause |
| --- | --- |
| Permission denied | Superuser cannot write to remote filesystem. |
| No space left on device | Remote disk is full. |
| Stale filehandle | File or directory has been removed on the server without the client's knowledge. |
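When one of these failures is eventually returned to the application, it arrives as an ordinary errno value on a later write( ), fsync( ), or close( ). The sketch below is a hypothetical helper, not anything provided by Solaris; it shows the conventional errno values corresponding to the messages in Table 15-1 (EACCES/EPERM, ENOSPC, and ESTALE):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Report an NFS write failure in terms of the errors in Table 15-1.
 * "op" names the call that returned the error (write, fsync, close). */
static void
report_nfs_error(const char *op, int err)
{
    switch (err) {
    case EACCES:
    case EPERM:
        fprintf(stderr, "%s: permission denied on the server\n", op);
        break;
    case ENOSPC:
        fprintf(stderr, "%s: remote disk is full\n", op);
        break;
    case ESTALE:
        fprintf(stderr, "%s: stale filehandle; the file was removed "
            "on the server without the client's knowledge\n", op);
        break;
    default:
        fprintf(stderr, "%s: %s\n", op, strerror(err));
    }
}

int
main(void)
{
    /* Example: pretend an fsync( ) on an NFS file just failed. */
    report_nfs_error("fsync", ENOSPC);
    return 0;
}
```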
Both the "Permission denied" and the "No space left on device" errors would have been detected on a local filesystem, but the NFS client has no way to determine if a write operation will succeed at some future time (when the NFS async thread eventually sends it to the server). For example, if a client writes out 1KB buffers, then its NFS async threads write out 8KB buffers to the server on every 8th call to write( ). Several seconds may go by between the time the first write( ) system call returns to the application and the time that the eighth call forces the NFS async thread to perform an RPC to the NFS server. In this interval, another process may have filled up the server's disk with some huge write requests, so the NFS async thread's attempt to write its 8-KB buffer will fail.If you are consistently seeing NFS writes fail due to full filesystems or permission problems, you can usually chase down the user or process that is performing the writes by identifying the file being written. Unfortunately, Solaris does not provide any utility to correlate the filehandles printed in the error messages with the pathname of the file on the remote server. Filehandles are generated by the NFS server and handed opaquely to the NFS client. The NFS client cannot make any assumptions as to the structure or contents of the filehandle, enabling servers to change the way they generate the filehandle at any time. In practice, the structure of a Solaris NFS filehandle has changed little over time. The following script takes as input the filehandle printed by the NFS client and generates the corresponding server filename:[42]
[42] Thanks to Brent Callaghan for providing the basis for this script.
```sh
#!/bin/sh
# fhfind: map an NFS filehandle (as printed in a client's
# "NFS write error" message) to a pathname on this server.

if [ $# -ne 8 ]; then
        echo "Usage: fhfind <filehandle> e.g."
        echo
        echo "fhfind 1540002 2 a0000 4d 48df4455 a0000 2 25d1121d"
        exit 1
fi

FSID=$1                                    # device id of the filesystem
INUMHEX=`echo $4 | tr '[a-z]' '[A-Z]'`     # file's inode number, in hex

# Find the mounted filesystem with this device id, skipping
# loopback (lofs) mounts that would match the same id.
ENTRY=`grep ${FSID} /etc/mnttab | grep -v lofs`
if [ "${ENTRY}" = "" ]; then
        echo "Cannot find filesystem for devid ${FSID}"
        exit 1
fi

# The second field of the mnttab entry is the mount point.
set - ${ENTRY}
MNTPNT=$2

# bc expects uppercase hex digits; convert the inode number to decimal.
INUM=`echo "ibase=16;${INUMHEX}" | bc`

echo "Searching ${MNTPNT} for inode number ${INUM} ..."
echo

find ${MNTPNT} -mount -inum ${INUM} -print 2>/dev/null
```
The script takes the expanded filehandle string from the NFS write error message and maps it to the full pathname of the file on the server. It must be run on the NFS server:
```
mahimahi# fhfind 800006 2 a0000 3ef 12e09b14 a0000 2 4beac395
Searching /spare for inode number 1007 ...

/spare/test/info/data
```
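The decoding that fhfind performs can also be sketched in C. This hypothetical fhdecode helper is not part of the original script; it simply splits out the two fields fhfind relies on, under the same unofficial assumptions about the filehandle layout:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical decoder for the eight hex words a Solaris client
 * prints in an NFS write error message. Assumes (as fhfind does)
 * that word 1 is the filesystem's device id and word 4 is the file's
 * inode number; this layout is not part of the NFS protocol and the
 * vendor may change it at any time. */
int
main(int argc, char *argv[])
{
    unsigned long inum;

    if (argc != 9) {
        fprintf(stderr, "usage: fhdecode <8 filehandle words>\n");
        return 1;
    }
    inum = strtoul(argv[4], NULL, 16);   /* hex inode number -> decimal */
    printf("devid %s, inode %lu\n", argv[1], inum);
    printf("try: grep %s /etc/mnttab; find <mntpnt> -mount -inum %lu\n",
        argv[1], inum);
    return 0;
}
```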
The eight values on the command line are the eight hex fields of the filehandle reported in the NFS error message. The script makes strict assumptions about the contents of a Solaris server filehandle. As mentioned before, the OS vendor is free to change the structure of the filehandle at any time, so there is no guarantee that this script will work on your particular configuration. The script takes advantage of the fact that the filehandle contains the inode number of the file in question, as well as the device id of the filesystem in which the file resides. It extracts the device id from the filehandle (FSID) and uses it to look up the filesystem's entry in /etc/mnttab (ENTRY). It also extracts the inode number of the file (in hex) from the filehandle, applying the tr utility to convert lowercase characters to uppercase for use with the bc calculator (INUMHEX). The set command re-parses the mnttab entry so that the mount point (MNTPNT) can serve as the starting point of the search, and bc converts the hexadecimal inode number to its decimal equivalent (INUM) for use by find. Finally, find searches the filesystem for a file with the matching inode number. Although find uses the mount point as the starting point of the search, a scan of a large filesystem may take a long time. Since there's no way to terminate the find upon finding the file, you may want to kill the process after it prints the path.

Throughout this chapter, we used tools presented in previous chapters to debug network and local problems. Once you determine the source of a problem, you should be able to take steps to correct and avoid it. For example, you can avoid delayed client write problems by having a good idea of what your clients are doing and how heavily loaded your NFS servers are. Determining your NFS workload and optimizing your clients and servers to make the best use of available resources requires tuning the network, the clients, and the servers. The next few chapters present NFS tuning and benchmarking techniques.