Identifying NFS performance bottlenecks

The stateless design of NFS makes crash recovery simple, but it also makes it impossible for a client to distinguish between a server that is slow and one that has crashed. In either case, the client does not receive an RPC reply before the RPC timeout period expires. Clients can't tell why a server appears slow, either: packets could be dropped by the network and never reach the server, or the server could simply be overloaded. Using NFS performance figures alone, it is hard to distinguish a slow server from an unreliable network. Users complain that "the system is slow," but there are several areas that contribute to system sluggishness. An overloaded server responds to all packets that it enqueues for its nfsd daemons, perhaps dropping some incoming packets due to the high load. Those requests that are received generate a response, albeit a response that arrives sometime after the client has retransmitted the request. If the network itself is to blame, then packets may not make it from the client or server onto the wire, or they may vanish in transit between the two hosts.

Problem areas

The potential bottlenecks in the client-server relationship are:

Throughput

The next two sections summarize NFS throughput issues.

NFS writes (NFS Version 2 versus NFS Version 3)

Write operations over NFS Version 2 are synchronous, forcing the server to flush data to disk[45] before it can reply to the NFS client. This severely limits the rate at which the client can issue write requests, since it must wait for the server's acknowledgment before generating the next request. NFS Version 3 overcomes this limitation by introducing a two-phase commit write operation. The NFS Version 3 client generates asynchronous write requests, allowing the server to acknowledge them without first flushing the data to disk. This shortens the time the client waits for each reply, allowing requests to be sent in quicker succession. Since the server no longer flushes the data to disk before it replies, the data may be lost if the server crashes or reboots unexpectedly. The NFS Version 3 client assumes responsibility for recovering from these conditions by caching a copy of the data. The client must issue a commit operation to the server, and receive a successful reply, before it can discard its cached copy of the data. In response to the commit request, the server either ensures the data has been written to disk and responds affirmatively, or, if it has crashed and lost the data, responds with an error that causes the client to retransmit its cached copy of the data synchronously. In short, the client is still responsible for holding on to the data until the server acknowledges that the data has been flushed to disk.
[45] The effect of NVRAM is discussed in "Disk array caching and Prestoserve" later in this chapter.
For all practical purposes, the NFS Version 3 protocol removes any limit on the size of the data block that can be transmitted, although the block size may still be limited by the underlying transport. Most NFS Version 3 implementations use a 32 KB data block size. The larger NFS writes reduce protocol overhead and disk seek time, resulting in much higher sequential file access throughput.
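To see which protocol version, transport, and transfer sizes a client actually negotiated for each mount, you can run nfsstat -m on the client. The output below is only a sketch: the server name and mount point are hypothetical, the flag list is abbreviated, and the exact fields vary between releases.

% nfsstat -m
/mnt/export from wahoo:/export
 Flags: vers=3,proto=tcp,sec=sys,hard,intr,rsize=32768,wsize=32768,retrans=5,timeo=600

A 32 KB rsize and wsize here indicate that the client and server agreed on the larger NFS Version 3 transfer size.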

NFS/TCP versus NFS/UDP

TCP handles retransmission and flow control for NFS, so only the individual packets that are lost need to be retransmitted, making NFS practical over lossy and wide area networks. In contrast, UDP requires the entire NFS operation to be retransmitted if one or more of its packets is lost, making it impractical over lossy networks. TCP also allows the read and write transfer sizes to be increased from 8 KB to 32 KB. By default, Solaris clients attempt to mount NFS filesystems using NFS Version 3 over TCP when the server supports it. Note that workloads that mainly access attributes or consist of short reads benefit less from the larger transfer size, so you may want to reduce the default read block size by using the rsize=n option of the mount command. This is explored in more detail in "Client-Side Performance Tuning".
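If you want to override these defaults, the version, transport, and transfer sizes can be given explicitly as mount options. The following is a minimal sketch, assuming a Solaris client and a hypothetical server named wahoo exporting /export; the 8 KB figures are illustrative, not a recommendation:

# mount -o vers=3,proto=tcp,rsize=8192,wsize=8192 wahoo:/export /mnt

The same options can also be placed in the automounter maps or /etc/vfstab entries that normally establish the mount.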

Locating bottlenecks

Given all of the areas in which NFS can break down, it is hard to pick a starting point for performance analysis. Inspecting server behavior, for example, may not tell you anything if the network is overly congested or dropping packets. One approach is to start with a typical NFS client, and evaluate its view of the network's services. Tools that examine the local network interface, the network load perceived by the client, and NFS timeout and retransmission statistics indicate whether the bulk of your performance problems are due to the network or the NFS servers. In this and the next two chapters, we look at performance problems from excessive server loading to network congestion, and offer suggestions for easing constraints at each of the problem areas outlined above. However, you may want to get a rough idea of whether your NFS servers or your network is the biggest contributor to performance problems before walking through all diagnostic steps. On a typical NFS client, use the nfsstat tool to compare the retransmission and duplicate reply rates:

% nfsstat -rc
Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs
1753584    1412       18         64         0          0
timers     cantconn   nomem      interrupts
0          1317       0          18
Connectionless:
calls      badcalls   retrans    badxids    timeouts   newcreds
12443      41         334        80         166        0
badverfs   timers     nomem      cantsend
0          4321       0          206

The timeouts value indicates the number of NFS RPC calls that did not complete within the RPC timeout period. Divide timeouts by calls to determine the retransmission rate for this client. We'll look at an equation for calculating the maximum allowable retransmission rate on each client in "Retransmission rate thresholds".

If the client-side RPC counts for timeouts and badxid are close in value, the network is healthy: requests are making it to the server, but the server cannot handle them and generate replies before the client's RPC calls time out. The server eventually works its way through the backlog of requests, generating duplicate replies that increment the badxid count. In this case, the emphasis should be on improving server response time.

Alternatively, nfsstat may show that timeouts is large while badxid is zero or negligible. In this case, packets are never making it to the server, and the network interfaces of client and server, as well as the network itself, should be examined. NFS does not query the lower protocol layers to determine where packets are being consumed; to NFS, the entire RPC and transport mechanism is a black box. Note that NFS is like spray in this regard -- it doesn't matter whether it was the local host's interface, network congestion, or the remote host's interface that dropped the packet -- the packets are simply lost. To eliminate all network-related effects, you must examine each of these areas.
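For example, using the connectionless figures from the sample nfsstat output shown earlier (166 timeouts out of 12,443 calls), the retransmission rate works out to roughly 1.3 percent. A quick way to do the arithmetic on the command line:

% echo "scale=4; 166 / 12443 * 100" | bc
1.3300

The connection-oriented side of the same output tells a much happier story: 64 timeouts out of 1,753,584 calls is a retransmission rate well below a hundredth of a percent.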