NFS Problem Diagnosis
Contents:
NFS server problems
NFS client problems
NFS errno values
Throughout this tutorial, we've used the output of nfsstat on both NFS clients and servers to locate performance bottlenecks or inefficient NFS architectures. The first two sections in this appendix summarize symptoms of problems identified from the output of nfsstat. The last section lists typical values for the error variable errno that may be returned by file operations on NFS-mounted filesystems.
NFS server problems
Check the output of nfsstat -s for the following problems:
- badcalls > 0
- RPC requests are being rejected out of hand by the NFS server. This could indicate authentication problems caused by having a user in too many groups, attempts to access exported filesystems as root, or an improper Secure RPC configuration.
- badlen > 0 or xdrcall > 0
- This indicates a malformed NFS request, detected by RPC or XDR protocol decoding on the server. This can be caused by bugs in the client or server, or by physical network problems.
- dupreqs > 0
- The duplicate request cache keeps a record of previously executed NFS requests. The dupchecks counter tracks the number of times this cache was consulted, or checked. The dupreqs counter tracks the number of times a check of the cache had a "hit." In other words, dupreqs counts the number of times the NFS server received a previously executed request. For connection-oriented (TCP) requests, a ratio of dupreqs to dupchecks above 0.01% is considered high. For connectionless (UDP) requests, a ratio above one percent is considered high. High ratios indicate one of three problems:
- The timeout set on one or more clients' NFS mounts is too low. Adjust the timeo option in the automounter map or the NFS mount command upward.
- The server is not responding quickly enough. Many possible causes relate to the physical capabilities of the server: processor speed; the number of processors (if it is a multiprocessor); insufficient primary memory (check whether the percentage of reads is high, say over 5%, which would indicate many reads that would be best served from cache if there were enough memory); or the number of disk drives on the system (spreading data accesses across more spindles reduces response time; if you've eliminated primary memory as a cause, check whether the percentage of writes is high, say over 5%). Other possibilities include artificial limits, such as the number of server threads set via nfsd.
- There is a routing problem impeding replies from the server to one or more clients.
- readlink > 10%
- Clients are making excessive use of symbolic links that are on filesystems exported by the server. If the link is to a directory, replace the symbolic link with a directory, and mount both the underlying filesystem and the link's target on the client. If the link is to a file, replace the symbolic link with a hard link.
- getattr > 60%
- Check for possible non-default attribute cache values on NFS clients. A very high percentage of getattr requests may indicate that the attribute cache window has been reduced or set to zero with the actimeo or noac mount option. It can also indicate that the NFS filesystem implementation is doing a poor job of attribute caching.
- null > 1%
- The automounter has been configured to mount replicated filesystems, but the timeout values for the mount are too short. The null procedure calls are made by the automounter to locate a server for the filesystem; too many null calls indicate that the automounter is retrying the mount frequently. Increase the mount timeout parameter on the automounter command line.
- fsinfo > 1%
- The fsinfo procedure is typically called only when a filesystem is mounted. A large number of fsinfo calls suggests that the automounter is frequently mounting and unmounting the same filesystems. If so, tune the automounter to hold mounts longer via the -t option to automount. This will improve the response time on clients.
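The thresholds in the checklist above can be applied mechanically once the counters have been copied out of nfsstat -s. What follows is a minimal sketch under stated assumptions, not a parser for nfsstat output, whose layout varies between systems. The function names, the dictionary keys, and the sample numbers are all hypothetical; only the threshold values come from the checklist. Operation percentages are computed as each operation's share of all calls supplied.

```python
# Thresholds taken from the server checklist; every function and variable
# name here is an illustrative assumption, not part of nfsstat itself.
DUPREQ_RATIO_LIMIT = {"tcp": 0.0001, "udp": 0.01}  # 0.01% and 1%
OP_PERCENT_LIMIT = {"readlink": 10.0, "getattr": 60.0, "null": 1.0, "fsinfo": 1.0}
NONZERO_COUNTERS = ("badcalls", "badlen", "xdrcall")


def check_rpc_counters(counters):
    """Return the names of RPC error counters that are greater than zero."""
    return [name for name in NONZERO_COUNTERS if counters.get(name, 0) > 0]


def dupreq_ratio_is_high(dupreqs, dupchecks, transport):
    """Return True if dupreqs/dupchecks exceeds the limit for 'tcp' or 'udp'."""
    if dupchecks == 0:
        return False  # the cache was never consulted, so there is no ratio
    return dupreqs / dupchecks > DUPREQ_RATIO_LIMIT[transport]


def check_op_mix(op_counts):
    """Return {operation: percent} for operations whose share exceeds its limit."""
    total = sum(op_counts.values())
    if total == 0:
        return {}
    flagged = {}
    for op, limit in OP_PERCENT_LIMIT.items():
        percent = 100.0 * op_counts.get(op, 0) / total
        if percent > limit:
            flagged[op] = round(percent, 1)
    return flagged


if __name__ == "__main__":
    # Made-up sample counters, typed in by hand for illustration only.
    sample_ops = {"getattr": 700, "lookup": 150, "read": 100,
                  "readlink": 30, "null": 20}
    print(check_rpc_counters({"badcalls": 3, "badlen": 0, "xdrcall": 0}))
    print(dupreq_ratio_is_high(dupreqs=50, dupchecks=2000, transport="udp"))
    print(check_op_mix(sample_ops))
```

A flagged counter is only a pointer to the matching checklist entry. For example, a high getattr share still requires inspecting the clients' mount options before concluding that attribute caching has been disabled.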