Troubleshooting and Management

Troubleshooting does not exist in isolation from network management. How you manage your network will determine in large part how you deal with problems. A proactive approach to management can greatly simplify problem resolution. The remainder of this chapter describes several important management issues. Coming to terms with these issues should, in the long run, make your life easier.

Documentation

As a new administrator, your first step is to assess your existing resources and begin creating new ones. Software sources, including the tools discussed in this tutorial, are described and listed in Appendix A, "Software Sources". Other sources of information are described in Appendix B, "Resources and References".

The most important source of information is the local documentation created by you or your predecessor. In a properly maintained network, there should be some kind of log about the network, preferably with sections for each device. In many networks, this will be in an abysmal state. Almost no one likes documenting or thinks he has the time required to do it. The log is likely to be full of errors, out of date, and incomplete. Local documentation should always be read with a healthy degree of skepticism. But even incomplete, erroneous documentation, if treated as such, may be of value. There are probably no intentional errors, just careless mistakes and errors of omission. Even flawed documentation can give you some sense of the history of the system. Problems frequently occur due to multiple conflicting changes to a system. Software that may have been only partially removed can have lingering effects. Homegrown documentation may be the quickest way to discover what may have been on the system.

While the creation and maintenance of documentation may once have been someone else's responsibility, it is now your responsibility. If you are not happy with the current state of your documentation, it is up to you to update it and adopt policies so the next administrator will not be muttering about you the way you are muttering about your predecessors.

There are a couple of sets of standard documentation that, at a minimum, you will always want to keep. One is purchase information, the other a change log. Purchase information includes sales information, licenses, warranties, service contracts, and related information such as serial numbers.
An inventory of equipment, software, and documentation can be very helpful. When you unpack a system, you might keep a list of everything you receive and date all documentation and software. (A changeable rubber date stamp and ink pad can help with this last task.) Manufacturers can do a poor job of distinguishing one version of software and its documentation from the next. Dates can be helpful in deciding which version of the documentation applies when you have multiple systems or upgrades. Documentation has a way of ending up in someone's personal library, never to be seen again, so a list of what you should have can be very helpful at times.

Keep in mind, there are a number of ways software can enter your system other than through purchase orders. Some software comes through subscription services, some comes in over the Internet, some is bundled with the operating system, some comes on a disc in the back of a book, some is brought from home, and so forth. Ideally, you should have some mechanism to track software. For example, for downloads from the Internet, be sure to keep a log identifying filenames, dates, and sources.

You should also keep a change log for each major system. Record every significant change or problem you have with the system. Each entry should be dated. Even if some entries no longer seem relevant, you should keep them in your log. For instance, if you have installed and later removed a piece of software on a server, there may be lingering configuration changes that you are not aware of that may come to haunt you years later. This is particularly true if you try to reinstall the program, but it could be true for a new program as well.

Beyond these two basic sets of documentation, you can divide the documentation you need to keep into two general categories -- configuration documentation and process documentation. Configuration documentation statically describes a system.
It assumes that the steps involved in setting up the system are well understood and need no further comments, i.e., that configuration information is sufficient to reconfigure or reconstruct the system. This kind of information can usually be collected at any time. Ironically, for that reason, it can become so easy to put off that it is never done.

Process documentation describes the steps involved in setting up a device, installing software, or resolving a problem. As such, it is best written while you are doing the task. This creates a different set of collection problems. Here the stress from the task at hand often prevents you from documenting the process.

The first question you must ask is what you want to keep. This may depend on the circumstances and which tools you are using. Static configuration information might include lists of IP addresses and Ethernet addresses, network maps, copies of server configuration files, switch configuration settings such as VLAN partitioning by ports, and so on.

When dealing with a single device, the best approach is probably just a simple copy of the configuration. This can be either printed or saved as a disk file. This will be a personal choice based on which you think is easiest to manage. You don't need to waste time prettying this up, but be sure you label and date it.

When the information spans multiple systems, such as a list of IP addresses, management of the data becomes more difficult. Fortunately, much of this information can be collected automatically. Several tools that ease the process are described in subsequent chapters, particularly in "Device Discovery and Mapping".

For process documentation, the best approach is to log and annotate the changes as you make them and then reconstruct the process at a later time. "Miscellaneous Tools" describes some of the common Unix utilities you can use to automate documentation.
You might refer to this chapter if you aren't familiar with utilities like tee, script, and xwd.[2]
[2]Admittedly these guidelines are ideals. Does anyone actually do all of this documenting? Yes, while most administrators probably don't, some do. But just because many administrators don't succeed in meeting the ideal doesn't diminish the importance of trying.
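
As one illustration of the dated change log and download log described above, the following is a minimal sketch in Python. It is not from the original text: the CSV layout, the field names, and the function names append_entry and read_entries are all assumptions chosen for the example, and a plain text file or a paper notebook serves the same purpose.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical column layout; adjust to whatever your site needs to record.
FIELDS = ["date", "system", "entry", "source"]


def append_entry(log_path, system, entry, source="", when=None):
    """Append one dated row to a CSV log, writing a header if the file is new."""
    path = Path(log_path)
    is_new = not path.exists() or path.stat().st_size == 0
    row = {
        "date": (when or date.today()).isoformat(),  # every entry is dated
        "system": system,
        "entry": entry,
        "source": source,  # e.g. where a downloaded file came from
    }
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row


def read_entries(log_path, system=None):
    """Return all rows, optionally limited to a single system."""
    with Path(log_path).open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if system is not None:
        rows = [row for row in rows if row["system"] == system]
    return rows
```

A call such as append_entry("changes.csv", "mail-server", "Removed old backup agent") adds one row stamped with today's date. Entries are only ever appended, never edited or deleted, which matches the advice to keep old entries even when they no longer seem relevant.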

Management Practices

A fundamental assumption of this tutorial is that troubleshooting should be proactive. It is better to avoid a problem than to have to correct it. Proper management practices can help. While some of this section may, at first glance, seem unrelated to troubleshooting, there are fundamental connections. Management practices will determine what you can do and how you do it. This is true both for avoiding problems and for dealing with problems that can't be avoided. The remainder of this chapter reviews some of the more important management issues.

Professionalism

Administering a system effectively requires a high degree of professionalism. This includes personal honesty and ethical behavior. You should learn to evaluate yourself in an honest, objective manner. (See the sidebar "The Peter Principle Revisited".) It also requires that you conform to the organization's mission and culture. Your network serves some higher purpose within your organization. It does not exist strictly for your benefit. You should manage the network with this in mind. This means that everything you do should be done from the perspective of a cost-benefit trade-off. It is too easy to get caught in the trap of doing something "the right way" at a higher cost than the benefits justify. Performance analysis is the key element.

The organization's mind-set or culture will have a tremendous impact on how you approach problems in general and the use of tools in particular. It will determine which tools you can use, how you can use the tools, and, most important, what you can do with the information you obtain.

Within organizations, there is often a battle between openness and secrecy. The secrecy advocate believes that details of the network should be available only on a need-to-know basis, if then. She believes, not without justification, that this enhances security. The openness advocate believes that the details of a system should be open and available. This allows users to adapt and make optimal use of the system and provides a review process, giving users more input into the operation of the network.

Taken to an extreme, the secrecy advocate will suppress information that is needed by the user, making a system or network virtually unusable. Openness, taken to an extreme, will leave a network vulnerable to attack. Most people's views fall somewhere between these two extremes but often favor one position over the other. I advocate prudent openness. In most situations, it makes no sense to shut down a system because it might be attacked.
And it is asinine not to provide users with the information they need to protect themselves. Openness among those responsible for the different systems within an organization is absolutely essential.

Ego management

We would all like to think that we are irreplaceable, and that no one else could do our jobs as well as we do. This is human nature. Unfortunately, some people take steps to make sure this is true. The most obvious way an administrator may do this is to hide what he actually does and how his system works.

This can be done many ways. Failing to document the system is one approach -- leaving comments out of code or configuration files is common. The goal of such an administrator is to make sure he is the only one who truly understands the system. He may try to limit others' access to a system by restricting accounts or access to passwords. (This can be done to hide other types of unprofessional activities as well. If an administrator occasionally reads other users' email, he may not want anyone else to have standard accounts on the email server. If he is overspending on equipment to gain experience with new technologies, he will not want any technically literate people knowing what equipment he is buying.)

This behavior is usually well disguised, but it is extremely common. For example, a technician may insist on doing tasks that users could or should be doing. The problem is that this keeps users dependent on the technician when it isn't necessary. This can seem very helpful or friendly on the surface. But, if you repeatedly ask for details and don't get them, there may be more to it than meets the eye.

Common justifications are security and privacy. Unless you are in a management position, there is often little you can do other than accept the explanations given. But if you are in a management position, are technically competent, and still hear these excuses from your employees, beware! You have a serious problem.

No one knows everything. Whenever information is suppressed, you lose input from individuals who don't have the information. If an employee can't control her ego, she should not be turned loose on your network with the tools described in this tutorial.
She will not share what she learns. She will only use it to further entrench herself.

The problem is basically a personnel problem and must be dealt with as such. Individuals in technical areas seem particularly prone to these problems. It may stem from enlarged egos or from insecurity. Many people are drawn to technical areas as a way to seem special. Alternatively, an administrator may see information as a source of power or even a weapon. He may feel that if he shares the information, he will lose his leverage. Often individuals may not even recognize the behavior in themselves. It is just the way they have always done things and it is the way that feels right.

If you are a manager, you should deal with this problem immediately. If you can't correct the problem in short order, you should probably replace the employee. An irreplaceable employee today will be even more irreplaceable tomorrow. Sooner or later, everyone leaves -- finds a better job, retires, or runs off to Poughkeepsie with an exotic dancer. In the meantime, such a person only becomes more entrenched, making the eventual departure more painful. It will be better to deal with the problem now rather than later.

Legal and ethical considerations

The tools I describe in this tutorial can be abused, particularly in the realm of privacy, so you must ensure that you use them in a manner that conforms not just to the policies of your organization but to all applicable laws as well. Do you have the appropriate permission to use the tools? This will depend greatly on your role within the organization. Do not assume that, just because you have access to tools, you are authorized to use them. Nor should you assume that any authorization you have is unlimited.

Packet capture software is a prime example. It allows you to examine every packet that travels across a link, including application data and each and every header. Unless data is encrypted, it can be decoded. This means that passwords can be captured and email can be read. For this reason alone, you should be very circumspect in how you use such tools.

A key consideration is the legality of collecting such information. Unfortunately, there is a constantly changing legal morass with respect to privacy in particular and technology in general. Collecting some data may be legitimate in some circumstances but illegal in others.[3] This depends on factors such as the nature of your operations, what published policies you have, what assurances you have given your users, new and existing laws, and what interpretations the courts give to these laws.
[3]As an example, see the CERT Advisory CA-92.19 Topic: Keystroke Logging Banner at http://www.cert.org/advisories/CA-1992-19.html for a discussion on keystroke logging and its legal implications.
It is impossible for a tutorial like this to provide a definitive answer to the questions such considerations raise. I can, however, offer four pieces of advice:

The Peter Principle Revisited

In 1969, Laurence Peter and Raymond Hull published the satirical book The Peter Principle. The premise of the book was that people rise to their level of incompetence. For example, a talented high school teacher might be promoted to principal, a job requiring a quite different set of skills. Even if ill suited for the job, once she has this job, she will probably remain with it. She just won't earn any new promotions. However, if she is adept at the job, she may be promoted to district superintendent, a job requiring yet another set of skills. The process of promotions will continue until she reaches her level of incompetence. At that point, she will spend the remainder of her career at that level.

While hardly a rigorous sociological principle, the book was well received because it contained a strong element of truth. In my humble opinion, the Peter Principle usually fails miserably when applied to technical areas such as networking and telecommunications. The problem is the difficulty in recognizing incompetence. If incompetence is not recognized, then an individual may rise well beyond his level of incompetence. This often happens in technical areas because there is no one in management who can judge an individual's technical competence.

Arguably, unrecognized incompetence is usually overengineering. Networking, a field of engineering, is always concerned with trade-offs between costs and benefits. An underengineered network that fails will not go unnoticed. But an overengineered network will rarely be recognizable as such. Such networks may cost many times what they should, drawing resources from other needs. But to the uninitiated, they appear to be normal, functioning networks.

If a network engineer really wants the latest in new equipment when it isn't needed, who, outside of the technical personnel, will know? If this is a one-person department, or if all the members of the department can agree on what they want, no one else may ever know.
It is too easy to come up with some technical mumbo jumbo if they are ever questioned.

If this seems far-fetched, consider a meeting I once attended where a young engineer was arguing that a particular router needed to be replaced before it became a bottleneck. He had picked out the ideal replacement, a hot new box that had just hit the market. The problem with all this was that I had recently taken measurements on the router and knew the average utilization of that "bottleneck" was less than 5% with peaks that rarely hit 40%. This is an extreme example of why collecting information is the essential first step in network management and troubleshooting. Without accurate measurements, you can easily spend money fixing imaginary problems.
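
To make the measurement in this anecdote concrete, here is a minimal sketch of how average and peak utilization might be derived from periodic byte-counter readings. It is an illustration only, not the method used for the measurements described above. The function names, the sample values, and the assumption that readings come from a monotonically increasing octet counter on the interface are all hypothetical, and counter wraparound is deliberately ignored.

```python
def utilization_percent(octets_start, octets_end, interval_seconds, link_bps):
    """Percent of link capacity used between two byte-counter readings.

    Assumes the counter did not wrap or reset during the interval.
    """
    bits_sent = (octets_end - octets_start) * 8
    capacity_bits = interval_seconds * link_bps
    return 100.0 * bits_sent / capacity_bits


def summarize(samples):
    """Reduce a list of per-interval utilization figures to average and peak."""
    return {"average": sum(samples) / len(samples), "peak": max(samples)}


def utilization_series(readings, interval_seconds, link_bps):
    """Turn consecutive counter readings into per-interval utilization."""
    return [
        utilization_percent(earlier, later, interval_seconds, link_bps)
        for earlier, later in zip(readings, readings[1:])
    ]
```

Given counter readings taken every five minutes on a 10-Mbps link, summarize(utilization_series(readings, 300, 10_000_000)) yields the two figures quoted in the story: a long-run average and a peak. Short sampling intervals matter here, since long intervals smooth away the peaks.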

Economic considerations

Solutions to problems have economic consequences, so you must understand the economic implications of what you do. Knowing how to balance the cost of the time used to repair a system against the cost of replacing a system is an obvious example. Cost management is a more general issue that has important implications when dealing with failures.

One particularly difficult task for many system administrators is to come to terms with the economics of networking. As long as everything is running smoothly, the next biggest issue to upper management will be how cost effectively you are doing your job. Unless you have unlimited resources, when you overspend in one area, you take resources from another area. One definition of an engineer that I particularly like is that "an engineer is someone who can do for a dime what a fool can do for a dollar." My best guess is that overspending and buying needlessly complex systems is the single most common engineering mistake made when novice network administrators purchase network equipment.

One problem is that some traditional economic models do not apply in networking. In most engineering projects, incremental costs are less than the initial per-unit cost. For example, if a 10,000-square-foot building costs $1 million, a 15,000-square-foot building will cost somewhat less than $1.5 million. It may make sense to buy additional footage even if you don't need it right away. This is justified as "buying for the future."

This kind of reasoning, when applied to computers and networking, leads to waste. Almost no one would go ahead and buy a computer now if they won't need it until next year. You'll be able to buy a better computer for less if you wait until you need it. Unfortunately, this same reasoning isn't applied when buying network equipment.
People will often buy higher-bandwidth equipment than they need, arguing that they are preparing for the future, when it would be much more economical to buy only what is needed now and buy again in the future as needed.

Moore's Law lies at the heart of the matter. Around 1965, Gordon Moore, one of the founders of Intel, made the empirical observation that the density of integrated circuits was doubling about every 12 months, which he later revised to 24 months. Since the cost of manufacturing integrated circuits is relatively flat, this implies that, in two years, a circuit can be built with twice the functionality with no increase in cost. And, because distances are halved, the circuit runs at twice the speed -- a fourfold improvement. Since the doubling applies to previous doublings, we have exponential growth. It is generally estimated that this exponential growth with chips will go on for another 15 to 20 years.

In fact, this growth is nothing new. Raymond Kurzweil, in The Age of Spiritual Machines: When Computers Exceed Human Intelligence, collected information on computing speeds and functionality from the beginning of the twentieth century to the present. This covers mechanical, electromechanical (relay), vacuum tube, discrete transistor, and integrated circuit technologies. Kurzweil found that exponential growth has been the norm for the last hundred years. He believes that new technologies will be developed that will extend this rate of growth well beyond the next 20 years. It is certainly true that we have seen even faster growth in disk densities and fiber-optic capacity in recent years, neither of which can be attributed to semiconductor technology.

What does this mean economically? Clearly, if you wait, you can buy more for less. But usually, waiting isn't an option. The real question is how far into the future should you invest? If the price is coming down, should you repeatedly buy for the short term or should you "invest" in the long term?
The general answer is easy to see if we look at a few numbers. Suppose that $100,000 will provide you with network equipment that will meet your anticipated bandwidth needs for the next four years. A simpleminded application of Moore's Law would say that you could wait and buy similar equipment for $25,000 in two years. Of course, such a system would have a useful life of only two additional years, not the original four. So, how much would it cost to buy just enough equipment to make it through the next two years? Following the same reasoning, about $25,000. If your growth is tracking the growth of technology,[4] then two years ago it would have cost $100,000 to buy four years' worth of technology. That will have fallen to about $25,000 today. Your choice: $100,000 now or $25,000 now and $25,000 in two years. This is something of a no-brainer. It is summarized in the first two lines of Table 1-1.
[4]This is a pretty big if, but it's reasonable for most users and organizations. Most users and organizations have selected a point in the scheme of things that seems right for them -- usually the latest technology they can reasonably afford. This is why that new computer you buy always seems to cost $2500. You are buying the latest in technology, and you are trying to reach about the same distance into the future.

Table 1-1. Cost estimates

Plan                                            Year 1    Year 2    Year 3    Year 4     Total
Four-year plan                                $100,000        $0        $0        $0  $100,000
Two-year plan                                  $25,000        $0   $25,000        $0   $50,000
Four-year plan with maintenance               $112,000   $12,000   $12,000   $12,000  $148,000
Two-year plan with maintenance                 $28,000    $3,000   $28,000    $3,000   $62,000
Four-year plan with maintenance and 20% MARR  $112,000   $10,000    $8,300    $6,900  $137,200
Two-year plan with maintenance and 20% MARR    $28,000    $2,500   $19,500    $1,700   $51,700
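
The figures in Table 1-1 can be reproduced with a short calculation. The sketch below is illustrative and not part of the original text. The function names are hypothetical. It assumes the simple model used in this section: a fourfold price-performance gain every two years, maintenance at 1% of the purchase price per month, and spending discounted back to Year 1 at the MARR, with Year 1 itself left undiscounted (a convention inferred from the table's Year 2 entries).

```python
def price_after(price_today, years, gain=4.0, period_years=2.0):
    """Price of equivalent equipment after `years`, given a `gain`-fold
    improvement per `period_years` (the simple Moore's Law reading here)."""
    return price_today / gain ** (years / period_years)


def plan_cash_flows(purchase_price, purchase_years, horizon=4,
                    monthly_maintenance_rate=0.0):
    """Undiscounted spending for each year of the planning horizon.

    purchase_years holds the 1-based years in which equipment is bought.
    Maintenance is charged every year on the purchase price.
    """
    annual_maintenance = purchase_price * monthly_maintenance_rate * 12
    flows = []
    for year in range(1, horizon + 1):
        spend = annual_maintenance
        if year in purchase_years:
            spend += purchase_price
        flows.append(spend)
    return flows


def present_value(flows, marr):
    """Discount each year's spending to Year 1 dollars at the given MARR."""
    return [flow / (1 + marr) ** index for index, flow in enumerate(flows)]


if __name__ == "__main__":
    two_year_price = price_after(100_000, 2)  # 25,000 under this model
    four = plan_cash_flows(100_000, {1}, monthly_maintenance_rate=0.01)
    two = plan_cash_flows(two_year_price, {1, 3}, monthly_maintenance_rate=0.01)
    for label, flows in (("Four-year", four), ("Two-year", two)):
        discounted = present_value(flows, 0.20)
        print(label, round(sum(flows)), round(sum(discounted)))
```

The undiscounted totals match the table exactly. The discounted totals differ from the table by less than $100 because the table rounds each yearly entry before summing.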

If this argument isn't compelling enough, there is the issue of maintenance. As a general rule of thumb, service contracts on equipment cost about 1% of the purchase price per month. For $100,000, that is $12,000 a year. For $25,000, this is $3,000 per year. Moore's Law doesn't apply to maintenance for several reasons:

Thus, maintenance on a $100,000 system will cost $12,000 a year for all four years. The third and fourth lines of Table 1-1 summarize these numbers.

Yet another consideration is the time value of money. If you don't need the $25,000 until two years from now, you can invest a smaller amount now and expect to have enough to cover the costs later. So the $25,000 needed in two years is really somewhat less in terms of today's dollars. How much less depends on the rate of return you can expect on investments. For most organizations, this number is called the minimum acceptable rate of return (MARR). The last two lines of Table 1-1 use a MARR of 20%. This may seem high, but it is not an unusual number. As you can see, buying for the future is more than two and a half times as expensive as going for the quick fix.

Of course, all this is a gross simplification. There are a number of other important considerations even if you believe these numbers. First and foremost, Moore's Law doesn't always apply. The most important exception is infrastructure. It is not going to get any cheaper to pull cable. You should take the time to do infrastructure well; that's where you really should invest in the future.

Most of the other considerations seem to favor short-term investing. First, with short-term purchasing, you are less likely to invest in dead-end technology since you are buying later in the life cycle and will have a clearer picture of where the industry is going. For example, think about the difference two years might have made in choosing between Fast Ethernet and ATM for some organizations. For the same reason, the cost of training should be lower. You will be dealing with more familiar technology, and there will be more resources available. You will have to purchase and install equipment more often, but the equipment you replace can be reused in your network's periphery, providing additional savings.
On the downside, the equipment you buy won't have a lot of excess capacity or a very long useful lifetime. It can be very disconcerting to nontechnical management when you keep replacing equipment. And, if you experience sudden unexpected growth, this is exactly what you will need to do. Take the time to educate upper management. If frequent changes to your equipment are particularly disruptive or if you have funding now, you may need to consider long-term purchases even if they are more expensive. Finally, don't take the two-year time frame presented here too literally. You'll discover the appropriate time frame for your network only with experience.

Other problems come when comparing plans. You must consider the total economic picture. Don't look just at the initial costs, but consider ongoing costs such as maintenance and the cost of periodic replacement. As an example, consider the following plans. Plan A has an estimated initial cost of $400,000, all for equipment. Plan B requires $150,000 for equipment and $450,000 for infrastructure upgrades. If you consider only initial costs, Plan A seems to be $200,000 cheaper. But equipment needs to be maintained and, periodically, replaced. At 1% per month, the equipment for Plan A would cost $48,000 a year to maintain, compared to $18,000 per year with Plan B. If you replace equipment a couple of times in the next decade, that will be an additional $800,000 for Plan A but only $300,000 for Plan B. As this quick, back-of-the-envelope calculation shows, the 10-year cost for Plan A is $1.68 million, compared with only $1.08 million for Plan B. What appeared to be $200,000 cheaper is really $600,000 more expensive.

Of course, this is a very crude example, but it should convey the idea. You shouldn't take this example too literally either. Every situation is different. In particular, you may not be comfortable deciding what is adequate surplus capacity in your network.
In general, however, you are probably much better off thinking in terms of scalability than raw capacity. If you want to hedge your bets, you can make sure that high-speed interfaces are available for the router you are considering without actually buying those high-speed interfaces until needed.

How does this relate to troubleshooting? First, don't buy overly complex systems you don't really need. They will be much harder to maintain, as you can expect the complexity of troubleshooting to grow with the complexity of the systems you buy. Second, don't spend all your money on the system and forget ongoing maintenance costs. If you don't anticipate operational costs, you may not have the funds you need.
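
The Plan A versus Plan B comparison earlier in this section is simple enough to check with a few lines of code. The sketch below is only an illustration of that back-of-the-envelope arithmetic. The function name ten_year_cost is hypothetical, and the model makes the same crude assumptions as the text: only equipment incurs maintenance (at 1% of its price per month), each replacement costs as much as the original equipment, infrastructure has no ongoing cost, and nothing is discounted.

```python
def ten_year_cost(equipment, infrastructure=0, years=10, replacements=2,
                  monthly_maintenance_rate=0.01):
    """Total cost of ownership under the crude model described above."""
    initial = equipment + infrastructure
    maintenance = equipment * monthly_maintenance_rate * 12 * years
    replacement = equipment * replacements
    return initial + maintenance + replacement


if __name__ == "__main__":
    plan_a = ten_year_cost(400_000)
    plan_b = ten_year_cost(150_000, infrastructure=450_000)
    print(f"Plan A: ${plan_a:,.0f}")
    print(f"Plan B: ${plan_b:,.0f}")
    print(f"Plan A costs ${plan_a - plan_b:,.0f} more over ten years")
```

Changing the replacements or years arguments shows how sensitive the comparison is to the assumed replacement cycle, which is one reason not to take the example too literally.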