Replication
Solaris 2.6 introduced the concept of replication to NFS clients. This feature is known as client-side failover. Client-side failover is useful whenever you have read-only data that you need to be highly available. An example will illustrate this. Suppose your user community needs to access a collection of historical data on the last 200 national budgets of the United States. This is a lot of data, and so is a good candidate to store on a central NFS server. However, because your users' jobs depend on it, you do not want to have a single point of failure, and so you keep the data on several NFS servers. (Keeping the data on several NFS servers also gives one the opportunity to load balance.) Suppose you have three NFS servers, named hamilton, wolcott, and dexter, each exporting a copy of the data. Then each server might have an entry like this in its dfstab file:

share -o ro /export/budget_stats
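Before pointing clients at the replicas, it can be worth confirming that each server really is exporting the data. One quick check (hamilton is one of the servers from this example) is to ask the server what it is sharing with the dfshares command:

% dfshares hamilton

and likewise for wolcott and dexter; each should list /export/budget_stats among its shared resources.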
Now, without client-side failover, each NFS client might have one of the following vfstab entries:
hamilton:/export/budget_stats - /budget_stats nfs - yes ro
wolcott:/export/budget_stats - /budget_stats nfs - yes ro
dexter:/export/budget_stats - /budget_stats nfs - yes ro
Suppose an NFS client is mounting /budget_stats from NFS server hamilton, and hamilton stops responding. The user on that client will want to mount a different server. In order to do this, he'll have to do all of the following:
- Terminate any applications that are currently accessing files under the /budget_stats mount point.
- Unmount /budget_stats.
- Edit the vfstab file to point at a different server.
- Mount /budget_stats.
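Done by hand on the client, that procedure looks roughly like this once the applications have exited (wolcott is chosen arbitrarily as the replacement server):

# umount /budget_stats
# vi /etc/vfstab
# mount /budget_stats

where the vi session changes hamilton to wolcott in the /budget_stats entry.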
With client-side failover, none of this is necessary; each NFS client needs just a single vfstab entry that lists all three servers:

hamilton,wolcott,dexter:/export/budget_stats - /budget_stats nfs - yes ro
This vfstab entry defines a replicated NFS filesystem. When this vfstab entry is mounted, the NFS client will:
- Contact each server to verify that each is responding and exporting /export/budget_stats.
- Generate a list of the NFS servers that are responding and exporting /export/budget_stats and associate that list with the mount point.
- Pick one of the servers to get NFS service from. In other words, the NFS traffic for the mount point is bound to one server at a time.
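The same replicated mount can be set up by hand with the mount command; the comma-separated server list is given directly, just as in the vfstab entry:

# mount -o ro hamilton,wolcott,dexter:/export/budget_stats /budget_stats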
You can see which server a replicated mount point is currently bound to with the nfsstat command:

% nfsstat -m
...
/budget_stats from hamilton,wolcott,dexter:/export/budget_stats
 Flags: vers=3,proto=tcp,sec=sys,hard,intr,llock,link,symlink,acl,rsize=32768,wsize=32768,retrans=5
 Failover: noresponse=1, failover=1, remap=1, currserver=wolcott
The currserver value tells us that NFS traffic for the /budget_stats mount point is bound to server wolcott. Apparently hamilton stopped responding at one point, because we see non-zero values for the counters noresponse, failover, and remap. The counter noresponse counts the number of times a remote procedure call to the currently bound NFS server timed out. The counter failover counts the number of times the NFS client has "failed over" or switched to another NFS server due to a timed-out remote procedure call. The counter remap counts the number of files that were "mapped" to another NFS server after a failover. For example, if an application on the NFS client had /budget_stats/1994/deficit open, and the client then failed over to another server, the next time the application went to read data from /budget_stats/1994/deficit, the open file reference would be re-mapped to the corresponding deficit file on the newly bound NFS server. Solaris will also notify you when a failover happens. Expect a message like:
NOTICE: NFS: failing over from hamilton to wolcott
on both the NFS client's system console and in its /var/adm/messages file. By the way, it is not required that each server export the replica under the same pathname. The mount command will let you mount replica servers with different directories. For example:
#mount -o ro serverX:/q,serverY:/m /mnt
As long as the contents of serverX:/q and serverY:/m are the same, the top-level directory names do not have to match. The next section discusses the rules for the content of replicas.
Properties of replicas
Replicas on each server in the replicated filesystem have to be the same in content. For example, if on an NFS client we have done:

# mount -o ro serverX,serverY:/export /mnt
then /export on both servers needs to be an exact copy. One way to generate such a copy would be:
#rlogin serverY
serverY #cd /export
serverY #rm -rf ../export
serverY #mount serverX:/export /mnt
serverY #cd /mnt
serverY #find . -print | cpio -dmp /export
serverY #umount /mnt
serverY #exit
#
The third command invoked here, rm -rf ../export, is somewhat curious. What we want to do is remove the contents of /export in a manner that is as fast and safe as possible. We could do rm -rf /export, but that has the side effect of removing /export as well as its contents. Since /export is exported, any NFS client that is currently mounting serverY:/export would experience stale filehandles (see "Stale filehandles"). Recreating /export immediately with the mkdir command does not suffice because of the way NFS servers generate filehandles for clients: the filehandle contains, among other things, the inode number (a file's or directory's unique identification number), and the new directory's inode number is almost guaranteed to be different. So we want to remove just what is under /export. A commonly used method for doing that is:
# cd /export ; find . -print | xargs rm -rf
but the problem there is that if someone has placed a filename like foo /etc/passwd (i.e., a file with an embedded space character) in /export, then the xargs rm -rf command will remove a file called foo and a file called /etc/passwd, which on Solaris may prevent one from logging into the system. Doing rm -rf ../export sidesteps both problems: rm traverses the directory tree itself, so odd filenames are handled correctly, and /export is not removed because rm will not remove the current working directory. Note that this behavior may vary on other systems, so test it on something unimportant to be sure. (A find-based variation that also copes with embedded whitespace is sketched after the list of properties below.) At any rate, the aforementioned sequence of commands will create a replica that has the following properties:
- Each regular file, directory, named pipe, symbolic link, socket, and device node in the original has a corresponding object with the same name in the copy.
- The file type of each regular file, directory, named pipe, symbolic link, socket, and device node in the original is the same as the file type of the corresponding object with the same name in the copy.
- The contents of each regular file, directory, symbolic link, and device node in the original are equal to the contents of the corresponding object with the same name in the copy.
- The user identifier, group identifier, and file permissions of each regular file, directory, named pipe, symbolic link, socket, and device node in the original are equal to the user identifier, group identifier, and file permissions of the corresponding object with the same name in the copy. Strictly speaking, this last property is not mandatory for client-side failover to work, but if, after a failover, the user on the NFS client no longer has access to the file his application was reading, then the user's application will stop working.
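Returning to the removal step: if you prefer the find-based approach but need it to tolerate embedded whitespace, one sketch is to have find invoke rm once per top-level entry:

# cd /export
# find . ! -name . -prune -exec rm -rf {} \;

Here find hands each top-level name to rm as a single argument, so an embedded space cannot split one name into two, and nothing ever attempts to remove /export itself. It is slower than the xargs pipeline, since rm is run once per entry, but it fails safe.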
Rules for mounting replicas
In order to use client-side failover, the filesystem must be mounted with the sub-options ro (read-only) and hard. The reason it has to be mounted read-only is that if NFS clients could write to the replica filesystem, the replicas would no longer be synchronized, producing the following undesirable effects:
- If another NFS client failed over from one server to the server with the modified file, it would encounter an unexpected inconsistency.
- Likewise, if the NFS client or application that modified the file failed over to another server, it would find that its changes were no longer present.
Another requirement is that all the servers in the replica list support a common version of the NFS protocol. If, for example, some of the servers support only NFS Version 2, you can force the client to use Version 2 everywhere with the vers sub-option:

# mount -o ro,vers=2 serverA,serverB,serverC:/export /mnt
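If you are not sure which NFS protocol versions a given server supports, rpcinfo can tell you; for example, to query serverA over UDP:

% rpcinfo -u serverA nfs

The output lists each version of the NFS RPC program (100003) that the server answers for.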
Note that it is not a requirement that all the NFS servers in the replicated filesystem support the same transport protocol (TCP or UDP).
Managing replicas
In Solaris, the onus for creating, distributing, and maintaining replica filesystems is on the system administrator; there are no tools to manage replication. The techniques shown in "Properties of replicas" can be used, although the example command sequence given in that subsection for generating a replica may cause stale filehandle problems when you use it to update an existing replica; we will address this in "Stale filehandles". You will want to automate the replica distribution procedure. You would alter the aforementioned example to:
- Prevent stale filehandles.
- Use the rsh command instead of the rlogin command.
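A starting point for that automation might look like the following sketch, run from the master or an administrative host. It addresses only the second point, replacing the interactive rlogin session with rsh; the script name, server names, and path are assumptions carried over from the earlier examples, and because it copies into the live directory it does not yet prevent stale filehandles:

#!/bin/sh
# push_replica -- hypothetical sketch for pushing the master's copy to each replica.
# Assumes the usual rsh trust (.rhosts) between this host and the replica servers.
# Note: files deleted on the master are NOT removed from the replicas by this sketch.

MASTER=hamilton
FS=/export/budget_stats

for replica in wolcott dexter
do
        echo "updating $replica:$FS from $MASTER:$FS"
        rsh $replica "mount -o ro $MASTER:$FS /mnt && \
cd /mnt && find . -print | cpio -dmpu $FS; \
cd /; umount /mnt"
done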