Troubleshooting

When diskless clients refuse to boot, they do so rather emphatically. Shuffling machines and hostnames to accommodate changes in personnel increases the likelihood that a diskless machine will refuse to boot. Start debugging by verifying that hostnames, IP addresses, and Ethernet addresses are all properly registered on boot and NIS servers. The point at which the boot fails usually indicates where to look next for the problem: machines that cannot even locate a boot block may be getting the wrong boot information, while machines that boot but cannot enter single-user mode may be missing their /usr filesystems.

Missing and inconsistent client information

There are a few pieces of missing host information that are easily tracked down. If a client tries to boot but gets no RARP response, check that the NIS ethers map or the /etc/ethers files on the boot servers contain an entry for the client with the proper MAC address. A client reports RARP failures by complaining that it cannot get its IP address.

Diskless clients that boot part-way but hang after mounting their root filesystems may have /etc/hosts files that do not agree with the NIS ethers or hosts maps. It's also possible that the client booted using one name and IP address combination, but chose to use a different name while going through the single-user boot process. Check the boot scripts to be sure that the client is using the proper hostname, and also check that its local /etc/hosts file agrees with the NIS maps.

Other less obvious failures may be due to confusion with the bootparams map and the bootparamd daemon. Since the diskless client broadcasts a request for boot parameters, any host running bootparamd can answer it, and that server may have an incorrect /etc/bootparams file, or it may have bound to an NIS server with an out-of-date map.

Sometimes things still do not work after you correct the information. The culprit could be caching. Solaris has a name service cache daemon, /usr/sbin/nscd, which, if running, acts as a frontend for some databases maintained in /etc or NIS. The nscd daemon can return stale information, including stale negative information such as a failed lookup of an IP address in the hosts file or map. You can re-invoke nscd with the -i option to invalidate the cache. See the manpage for more details.
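The ethers check described at the start of this section can be sketched as a pair of shell functions. This is only an illustration: the function names normalize_mac and ethers_host_for_mac are hypothetical, and the sketch assumes an ethers-format file with one "MAC hostname" pair per line. It compares addresses after stripping zero padding, since 08:00:20:a0:65:8f and 8:0:20:a0:65:8f name the same interface.

```shell
# Hypothetical helpers (not part of Solaris): compare MAC addresses even when
# one side is zero-padded (08:00:20:...) and the other is not (8:0:20:...).
normalize_mac() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f' | awk -F: '{
        out = ""
        for (i = 1; i <= NF; i++) {
            octet = $i
            sub(/^0/, "", octet)            # "08" -> "8", "00" -> "0"
            if (octet == "") octet = "0"    # a lone "0" stays "0"
            out = out (i > 1 ? ":" : "") octet
        }
        print out
    }'
}

# Print the hostname registered for a MAC address in an ethers-format file;
# return nonzero if the file has no entry for that address.
# usage: ethers_host_for_mac MAC ETHERS_FILE
ethers_host_for_mac() {
    want=$(normalize_mac "$1")
    while read -r mac host rest; do
        case "$mac" in ''|'#'*) continue ;; esac    # skip blanks and comments
        if [ "$(normalize_mac "$mac")" = "$want" ]; then
            printf '%s\n' "$host"
            return 0
        fi
    done < "$2"
    return 1
}
```

Running ethers_host_for_mac against /etc/ethers on each boot server, and comparing the answer with what ypmatch returns for the same client from the ethers map, is one way to spot a server holding a stale entry.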

Checking boot parameters

The bootparamd daemon returns a fairly large bundle of values to a diskless client. In addition to the pathnames used for root and swap filesystems, the diskless client gets the name of its boot server and a default route. Depending on how /etc/nsswitch.conf is set up, the boot server may take these values from a local /etc/bootparams file rather than from the NIS map, so ensure that any local copies match the NIS maps. Changing the map on the NIS master server will not help a diskless client if its boot server uses only a local copy of the boot parameters file.
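To see which source a boot server consults, you can read the bootparams line of its /etc/nsswitch.conf. The function below is a minimal sketch; its name, nsswitch_sources, is hypothetical, and it assumes the usual "database: source ..." line layout.

```shell
# Hypothetical helper: print the lookup sources configured for one database
# in an nsswitch.conf-format file, for example "files nis" for bootparams.
# usage: nsswitch_sources DATABASE NSSWITCH_FILE
nsswitch_sources() {
    awk -v db="$1:" '
        /^[ \t]*#/ { next }                      # skip comment lines
        $1 == db   {
            $1 = ""                              # drop the "database:" field
            sub(/^[ \t]+/, "")                   # and the separator it leaves
            print
            exit
        }
    ' "$2"
}
```

For example, nsswitch_sources bootparams /etc/nsswitch.conf printing only "files" would mean that this server never looks at the NIS bootparams map, so its local /etc/bootparams file is the one to correct.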

Debugging rarpd and bootparamd

You can debug boot parameter problems by enabling debugging on the boot server. Both rarpd and bootparamd accept a debug option. With debugging enabled in rarpd on the server, you can see which Ethernet address each client request carries, and whether rarpd can map it to an IP address. To turn on rarpd debugging, kill the daemon on the server and start it again with the -d option:

    # ps -eaf | grep rarpd
        root   274     1  0   Apr 16 ?        0:00 /usr/sbin/in.rarpd -a
        root  5890  5825  0 01:02:18 pts/0    0:00 grep rarpd
    # kill 274
    # /usr/sbin/in.rarpd -d -a
    /usr/sbin/in.rarpd:[1] device hme0 ethernetaddress 8:0:20:a0:16:63
    /usr/sbin/in.rarpd:[1] device hme0 address 130.141.14.8
    /usr/sbin/in.rarpd:[1] device hme0 subnet mask 255.255.255.0
    /usr/sbin/in.rarpd:[5] starting rarp service on device hme0 address 8:0:20:a0:16:63
    /usr/sbin/in.rarpd:[5] RARP_REQUEST for 8:0:20:a0:65:8f
    /usr/sbin/in.rarpd:[5] trying physical netnum 130.141.14.0 mask ffffff00
    /usr/sbin/in.rarpd:[5] good lookup, maps to 130.141.14.9
    /usr/sbin/in.rarpd:[5] immediate reply sent

Keep in mind that a daemon started with the -d option usually stays in the foreground, so you won't get a shell prompt back unless you explicitly place it in the background by appending an ampersand (&) to the command invocation.

The two things to look out for when debugging rarpd are:

Does rarpd register a RARP_REQUEST? If it doesn't, this could indicate a physical network problem, or the server is not on the same physical network as the client.
Can rarpd map the client's Ethernet address back to an IP address? If not, this could indicate a bad ethers map, a bad /etc/ethers file, or an /etc/nsswitch.conf file that is not pointing at the right place.
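Both questions can be answered mechanically if you save the debug output to a file. The following is a sketch only: rarp_summary is a hypothetical name, and the patterns assume the message wording shown in the sample output above. What rarpd prints for a failed lookup is not shown here, so the function simply reports any request that is never followed by a "good lookup" line.

```shell
# Hypothetical helper: read in.rarpd debug output on standard input and print
# one line per RARP_REQUEST: the requesting MAC address followed by the IP
# address it mapped to, or UNRESOLVED if no "good lookup" line followed.
rarp_summary() {
    awk '
        /RARP_REQUEST for/ {
            if (mac != "") print mac, "UNRESOLVED"   # earlier request never mapped
            mac = $NF
        }
        /good lookup, maps to/ && mac != "" {
            print mac, $NF
            mac = ""
        }
        END { if (mac != "") print mac, "UNRESOLVED" }
    '
}
```

No output at all means that no RARP_REQUEST reached the server, which points at the first question. An UNRESOLVED line points at the second.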

By enabling debug mode in bootparamd on the server, you can see the hostname, addresses, and pathnames given to the diskless client. You can turn on bootparamd debugging by killing it on the server and starting it again with the -d option:

    # ps -eaf | grep bootparamd
        root   276     1  0   Apr 16 ?        0:00 /usr/sbin/rpc.bootparamd
        root  5878  5825  0 00:33:27 pts/0    0:00 grep bootparamd
    # kill 276
    # rpc.bootparamd -d
    in debug mode.
    msg 1: group = 260 mib_id = 0 length = 128
    msg 2: group = 261 mib_id = 0 length = 132
    msg 3: group = 1025 mib_id = 0 length = 36
    msg 4: group = 1026 mib_id = 0 length = 64
    msg 5: group = 260 mib_id = 20 length = 144
    msg 6: group = 260 mib_id = 100 length = 88
    msg 7: group = 1026 mib_id = 1 length = 0
    msg 8: group = 1026 mib_id = 2 length = 0
    msg 9: group = 260 mib_id = 21 length = 2464
    msg 10: group = 260 mib_id = 22 length = 360
    mibget getmsg( ) 11 returned EOD (level 0, name 0)
    interface_addr = 130.141.14.8. interface_mask = 255.255.255.0
    22 records for ipRouteEntryTable
    Whoami returning name = honeymoon, router address = 130.141.14.253
    getfile_1: file is "honeymoon" 130.141.14.8 "/export/root/honeymoon"
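When you compare several boot servers, the Whoami and getfile_1 lines are usually the ones worth extracting from this output. The function below is a hedged sketch: bootparam_summary is a hypothetical name, and the patterns assume the two line formats exactly as they appear in the sample above, each on its own line.

```shell
# Hypothetical helper: read rpc.bootparamd debug output on standard input and
# print the client name and router from the Whoami line, and the server
# address and pathname from each getfile_1 line.
bootparam_summary() {
    sed -n \
        -e 's/^Whoami returning name = \([^,]*\), router address = \(.*\)$/client=\1 router=\2/p' \
        -e 's/^getfile_1: file is "[^"]*" \([^ ]*\) "\([^"]*\)"$/server=\1 path=\2/p'
}
```

Saving the debug output from each candidate server and running it through the same filter makes it easier to see which server hands out a different router or root path.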

The messages that start with msg are the results of asking the IP layer for Simple Network Management Protocol (SNMP) Management Information Base (MIB) information. The bootparamd daemon makes this inquiry to find the IP address of the best router for the diskless client. Only the messages that say group = 260 matter for this purpose, and of those, only the ones with a mib_id of 0 or 20. Normally both kinds of messages appear; if they do not, the server's network configuration may be at fault. If there are no problems, the debug output shows a router address for the client.

The getfile_1 message simply reports that bootparamd knows where the client's root filesystem is. Note that the IP address is the same as the server's interface, which means that the NFS server for the client is the same as the bootparamd server.

If the server shows strange boot parameters passed to the client, check that the server's /etc/bootparams file is correct, and that the boot server's NIS server has up-to-date maps. If the boot parameters received by the client are incorrect, check that the server answering the request for them has current information. Because requests are broadcast to bootparamd, the server that can reply in the shortest time supplies the information.

If the client refuses to boot at all, complaining of:

    null domain name
    invalid domain name
    invalid boot parameters

or similar problems, verify that the host answering its broadcasts is using the same boot protocol and configuration files. See "Boot parameter confusion" for an example of invalid boot parameters. Also ensure that the boot server exports the client's root and swap filesystems with the proper root mapping and access restrictions. In /etc/dfs/dfstab, both the root and swap filesystems should have the options:

rw=client,root=client

to limit access to the diskless client and to allow the superuser to write to the filesystems. If the swap filesystem is not exported so that root can write to it, the diskless client will not be able to start the init process to begin the single-user boot.
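The export options can be checked with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the dfstab entries are assumed to be single-line share commands with the pathname as the last word, and the rw and root options are assumed to name exactly one client rather than a colon-separated list.

```shell
# Hypothetical helper: print the -o option string of the share line for PATH.
# usage: export_options PATH DFSTAB
export_options() {
    awk -v path="$1" '
        /^[ \t]*#/ { next }                          # skip comment lines
        $1 == "share" && $NF == path {
            for (i = 2; i < NF; i++)
                if ($i == "-o") { print $(i + 1); exit }
        }
    ' "$2"
}

# Succeed only if PATH is shared with both rw=CLIENT and root=CLIENT.
# usage: export_ok CLIENT PATH DFSTAB
export_ok() {
    opts=$(export_options "$2" "$3")
    case ",$opts," in *",rw=$1,"*)   ;; *) return 1 ;; esac
    case ",$opts," in *",root=$1,"*) ;; *) return 1 ;; esac
    return 0
}
```

For the client used in this chapter, export_ok honeymoon /export/swap/honeymoon /etc/dfs/dfstab failing would be consistent with a client that cannot start init.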

Missing /usr

After setting the host and domain names and configuring network interfaces in the boot process, a machine mounts its /usr filesystem. If there are problems with /usr, the boot process either hangs or fails at the first reference to the /usr filesystem. The two most common problems are not being able to locate the NFS server for /usr and attempting to mount the wrong /usr. NIS cannot be started until after /usr is mounted, since client-side daemons like ypbind live in /usr.

Generally, /usr is mounted from the boot server, so a diskless client needs its own name and its server's hostname in its /etc/hosts. If /usr is not mounted from the root/swap filesystem server, the /usr server's hostname must appear in the local hosts file as well. You may need as many as four different entries in the "runt" /etc/hosts file on a diskless client: its hostname, a localhost entry, the boot server's name, and the name of the /usr server.

Heterogeneous client/server environments create another set of problems. Clients of different architectures need their own /usr filesystems with executables built for the client's CPU, not the server's. The most obvious problem is when the client mounts the wrong /usr. If the executables on it were built for a different CPU, then the first attempt to invoke one of them produces a fairly descriptive error. However, if the /usr/platform directory is for the correct CPU architecture but doesn't contain the right kernel architecture (for example, Sun's sun4u and sun4m variants), then the client boots, but certain Unix utilities will not work. Processes that read the kernel or user address spaces, such as crash, are the most likely to break.

If you suspect that you're mounting the wrong /usr, first check the client's /etc/vfstab file to see where it gets /usr:

    wahoo:/export/root/honeymoon                  -  /          nfs  -  -  rw
    wahoo:/export/swap/honeymoon                  -  /dev/swap  nfs  -  -  -
    wahoo:/export/exec/Solaris_2.7_sparc.all/usr  -  /usr       nfs  -  -  ro

In this example, we would check /export/exec/Solaris_2.7_sparc.all/usr on the server wahoo. The directories in /export/exec have names with this format: Solaris_<release>_<architecture>. If the client and the server are of the same CPU architecture and are running the same release of the operating system, the usr subdirectory in /export/exec/Solaris_<release>_<architecture> is a symbolic link to the server's /usr directory. If the client and server do not have the same release and CPU architectures, the directories in /export/exec contain complete operating system releases. Three things can go wrong with this link-and-directory scheme:

The links /export/exec/*/usr point to the wrong place. This is possible if you changed the architecture of the server but restored /export from a backup tape. Make sure that Solaris_2.7_sparc.all/usr links point to /usr only if the server is a SPARC running Solaris 7. You'll get "exec format" errors if you mount a /usr of the wrong architecture on the client.
The /export/exec/* directories referenced by the clients don't exist. This is possible if you added a client of a new, different CPU architecture but did not install the appropriate operating system software for it. If you try to mount a directory that doesn't exist, you should see "cannot mount root" errors on the client.
The client may have the wrong mount point listed in its /etc/vfstab file. If you did not specify the architecture of the client correctly when using the AdminSuite software, the client's vfstab file is likely to contain the wrong mount information.
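For the third case, it helps to pull the /usr entry out of the client's vfstab file so that you can compare it with what exists on the server. The function below is a minimal sketch; usr_source is a hypothetical name, and it assumes the standard vfstab field order with the mount point in the third field and the filesystem type in the fourth.

```shell
# Hypothetical helper: print the NFS server and pathname that a vfstab-format
# file uses for /usr, separated by a space.
# usage: usr_source VFSTAB
usr_source() {
    awk '
        /^[ \t]*#/ { next }                  # skip comment lines
        $3 == "/usr" && $4 == "nfs" {
            split($1, part, ":")             # "server:/path" -> server, path
            print part[1], part[2]
            exit
        }
    ' "$1"
}
```

With the pathname in hand, running ls -ld on that directory on the server shows whether it is a symbolic link to the server's own /usr or a complete directory, which covers the first two cases.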

If you are unsure of how a mount and link combination will work, experiment on another diskless client having the same architecture. For example, mount /export/exec/Solaris_2.7_sparc.all/usr on /mnt, and then try a sample command to be sure you've mounted the right one:

    client# mount wahoo:/export/exec/Solaris_2.7_sparc.all/usr /mnt
    client# cd /var
    client# /mnt/bin/ls
    4lib        dict        krb5        oasys       sbin        ucblib
    5bin        dist        kvm         old         share       vmsys
    X           dt          lib         openwin     snadm       xpg4
    adm         games       lost+found  platform    spool
    aset        include     mail        preserve    src
    bin         java        man         proc        tmp
    ccs         java1.1     net         pub         ucb
    demo        kernel      news        sadm        ucbinclude

If commands are executed properly, then you should be able to mount /usr safely on the diskless client in question.
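Before retrying the boot, it may also be worth confirming that the client's "runt" /etc/hosts file carries the entries discussed earlier in this section, since /usr must be mounted before NIS is available. The function below is a sketch only; missing_hosts is a hypothetical name, and it assumes the usual hosts file layout of an address followed by one or more names.

```shell
# Hypothetical helper: print each NAME that does not appear as a hostname or
# alias in a hosts-format file. No output means every name was found.
# usage: missing_hosts HOSTS_FILE NAME...
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        awk -v n="$name" '
            { sub(/#.*/, "") }                                   # strip comments
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } # names follow the address
            END { exit !found }
        ' "$file" || printf '%s\n' "$name"
    done
}
```

For the client in this chapter, you might run missing_hosts against the hosts file under /export/root/honeymoon/etc on the boot server, passing localhost, honeymoon, wahoo, and the name of the /usr server if it differs from the boot server.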