Task-Specific Troubleshooting
The guidelines just given are a general overview of troubleshooting. Of course, each problem will be different, and you will need to vary your approach as appropriate. The remainder of this chapter consists of guidelines for a number of the more common troubleshooting tasks you might face. It is hoped that these will give you further insight into the process.

Installation Testing
Ironically, one of the best ways to save time and avoid troubleshooting is to take the time to do a thorough job of testing when you install software or hardware. You will be testing the system when you are most familiar with the installation process, and you will avoid disruptions to service that can happen when a problem isn't discovered until the software or hardware is in use.

This is a somewhat broad interpretation of troubleshooting, but in my experience, there is very little difference between the testing you will do when you install software and the testing you will do when you encounter a problem. Overwhelmingly, the only difference for most people is the scope of the testing done. Most people will test until they believe that a system is working correctly and then stop. Failures, particularly multiple failures, may leave you skeptical and inclined to keep testing, while some people tend to be overly optimistic when installing new software and stop too soon.
Firewall testing
Because of the complexities involved, firewall testing is an excellent example of the problems that installation testing may present. Troubleshooting a firewall is a demanding task for several reasons. First, to avoid disruptions in service, initial firewall testing should be done in an isolated environment before moving on to a production environment.

Second, you need to be very careful to develop an appropriate set of tests so that you don't leave gaping holes in your security. You'll need to go through a firewall rule by rule. You won't be able to check every possibility, but you should be able to test each general type of traffic. For example, consider a rule that passes HTTP traffic to your web server. You will want to pass traffic to port 80 on that server. If you are taking the approach of denying all traffic that is not explicitly permitted, you will presumably want to block traffic to that host at all other ports. You will also want to block traffic to port 80 on other hosts.[42] Thus, you should develop a set of three tests for this one action. Although there will be some duplicated tests, you'll want to take the same approach for each rule. Developing an explicit set of tests is the key step in this type of testing.
[42]If you doubt the need for this last test, read RFC 3093, a slightly tongue-in-cheek description of how to use port 80 to bypass a firewall.

The first step in testing a firewall is to test the environment in which the firewall will function, without the firewall. It can be extraordinarily frustrating to try to debug anomalous firewall behavior only to discover that you had a routing problem before you began. Thus, the first thing you will want to do is turn off any filtering and test your routing. You could use tools like ripquery to retrieve routing tables and examine entries, but it is probably much simpler to use ping to check connectivity, assuming ICMP ECHO_REQUEST packets aren't being blocked. (If this is the case, you might try tools like nmap or hping.)
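For more than a few hosts, a small wrapper script can make this connectivity sweep repeatable and easy to rerun after each change. The following is a minimal sketch: the addresses are placeholders, it simply shells out to the system ping command, and, as noted above, it will mislead you if ICMP ECHO_REQUEST packets are being blocked somewhere along the path.

```python
#!/usr/bin/env python3
"""Quick connectivity sweep to run before any filtering is enabled.

The host list is hypothetical; substitute the routers, servers, and test
hosts that sit on each side of where the firewall will go.
"""
import subprocess

HOSTS = ["192.168.1.1", "192.168.2.1", "192.168.2.80"]  # placeholder addresses

def reachable(host, count=3):
    """True if the host answers ping; relies on ICMP not being blocked."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host:20} {'ok' if reachable(host) else 'UNREACHABLE'}")
```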
You'll also want to verify that all concomitant software is working. This will include all intrusion detection software, accounting and logging software, and testing software. For example, you'll probably use packet capture software like tcpdump or ethereal to verify the operation of your firewall, so you'll want to make sure that capture software is itself working properly. I hate to admit it, but I've started packet capture software on a host that I forgot was attached to a switch and banged my head wondering why I wasn't seeing anything. Clearly, if I had used this setup to make sure packets were blocked without first testing it, I could have been severely misled.
Test the firewall in isolation. If you are adding filtering to a production router, admittedly this is going to be a problem. The easiest way to test in isolation is to connect each interface to an isolated host that can both generate and capture packets. You might use hping, nemesis, or any of the other custom packet generation software discussed in "Testing Connectivity Protocols". Work through each of your tests for each rule with the rule disabled and enabled. Be sure you explicitly document all your tests, particularly the syntax.
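To make the idea of an explicit, documented test set concrete, here is a minimal sketch of the three HTTP-rule tests described earlier. The addresses and expected results are hypothetical placeholders, and a plain TCP connect() only exercises connection-oriented services; for UDP, ICMP, or deliberately malformed packets you would still reach for tools like hping or nemesis.

```python
#!/usr/bin/env python3
"""A minimal sketch of an explicit firewall test set (hypothetical addresses).

Each entry encodes one of the three checks for the HTTP rule discussed above:
traffic to port 80 on the web server should pass, while traffic to other
ports on that server and to port 80 on other hosts should be blocked.
"""
import socket

# (destination, port, should_connect) -- placeholders, not real hosts
TESTS = [
    ("192.0.2.80", 80, True),    # permitted: HTTP to the web server
    ("192.0.2.80", 23, False),   # denied: any other port on the web server
    ("192.0.2.25", 80, False),   # denied: port 80 on some other host
]

def tcp_connects(host, port, timeout=5):
    """Attempt a TCP connection; True if the three-way handshake completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port, expected in TESTS:
        actual = tcp_connects(host, port)
        verdict = "PASS" if actual == expected else "FAIL"
        state = "open" if expected else "blocked"
        print(f"{verdict}  {host}:{port}  expected {state}")
```

Run the script once with the rule disabled and once with it enabled, and keep it with the rest of your test documentation so the same checks can be repeated after every rule change.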
Once you are convinced that the firewall is working, it is time to move it online. If you can schedule downtime and test the production setup while it is offline, that is the best approach. Work through your tests again with and without the filters enabled. If such offline testing isn't possible, you can still go through your tests with the filters enabled.
Finally, don't forget to come back and go through these tests periodically. In particular, you'll want to reevaluate the firewall every time you change rules.
Performance Analysis and Monitoring
If a system simply isn't working, then you know troubleshooting is needed. But in many cases, it may not be clear that you even have a problem. Performance analysis is often the first step to getting a handle on whether your system is functioning properly. And it is often the case that careful performance analysis will identify the problem so that no further troubleshooting is needed.

Performance analysis is another management task that hinges on collecting information. It is a task that you will never complete, and it is important at every stage in the system's life cycle. The most successful network administrator will take a proactive approach, addressing issues before they become problems. "Device Monitoring with SNMP" and "Performance Measurement Tools" discussed the use of specific tools in greater detail.
For planning, performance analysis is used to compare systems, establish system requirements, and do capacity planning and forecasting. For management, it provides guidance in configuring and tuning the system. In particular, the identification of bottlenecks can be essential for management, planning, and troubleshooting.
There are three general approaches to performance analysis -- analytical modeling, simulations, and measurement. Analytical models are mathematical models usually based on queuing theory. Simulations are computer models that attempt to mimic the behavior of the system through computer programs. Measurement is, of course, the collection of data from an existing network. This tutorial has focused primarily on measurement (although simulation tools were mentioned in "Testing Connectivity Protocols").
Each approach has its role. In practice, there can be a considerable overlap in using these approaches. Analytical models can serve as the basis for simulations, or direct measurements may be needed to supply parameters used with analytical models or simulations.
Measurement has its limitations. Obviously, the system must exist before measurements can be made, so measurement may not be a viable tool for planning. Measurements tend to produce the most variable results. And many things can go wrong with measurements. On the positive side, measurement carries a great deal of authority with most people. When you say you have measured something, this is treated as irrefutable evidence by many, often unjustifiably.
General steps
Measuring performance is something of an art. It is much more difficult to decide what to measure and how to make the actual measurements than it might appear at first glance. And there are many ways to waste time collecting data that will not be useful for your purposes.

What follows is a fairly informal description of the steps involved in performance analysis. As I said before, listing the steps can be very helpful in focusing attention on some parts of the process that might otherwise be ignored.[43] Of course, every situation is different, so these steps are only an approximation. Designing performance analysis tests is an iterative process. You should go back through these steps as you proceed, refining each step as needed.
[43]If you would like a more complete discussion of the steps in performance analysis, you should get Raj Jain's exceptional tutorial, The Art of Computer Systems Performance Analysis. Jain's tutorial considers performance analysis from a broader perspective than this tutorial.
- State your goal. This is the question you want to answer. At this point, it may be fairly vague, but you will refine it as you progress. You need a sense of direction to get started. A common mistake is to allow a poorly defined goal to remain vague throughout the process, so be sure to revisit this step often. Also, try to avoid goals that bias your approach. For instance, set out to compare systems rather than show that one system is better than another.
As an example, a network administrator might ask if the network backbone is adequate to support current levels of traffic. While an extremely important question, it is quite vague at this point. But stating the goal allows you to start focusing on the problem. For example, formally stating this problem may lead you to ask what adequate really means. Or you might go on to consider what the relevant time frame is, i.e., what current means.
- Define your system. The definition of your system will vary with your goal. You will need to decide what parts of the system to include and in what detail. You may want to exclude those parts outside your control. If you are interested in server performance, you will undoubtedly want to consider the various subsystems of the server separately -- such as disks, memory, CPU, and network interfaces.
With the backbone example, what exactly is the backbone? Certainly it will include equipment such as routers and switches, but does it include servers? If you do include servers, you will want to view the server as a single entity, a source or sink for network traffic perhaps, but not component by component.
- Identify possible outcomes. This step consists of identifying possible answers to the question you want to answer. This is a refinement of Step 1 but should be addressed after the parts of the system are identified. Identifying outcomes establishes the level of your interest, how much detail you might need, and how much work you are going to have to do. You are determining the granularity of your measurements with this step.
For example, possible outcomes for the question of backbone performance might be that performance is adequate, that the system suffers minor congestion during the periods of heaviest load, or that the system is usually suffering serious congestion with heavy packet loss. For many purposes, just selecting one of these three answers might be adequate. However, in some cases, you may want a much more descriptive answer. For example, you may want some estimation of the average utilization, maximum utilization, percent of time at maximum utilization, or number of lost packets. Ultimately, the degree of detail required by the answer will determine the scope of the project. You need to make this decision early, or you may have to repeat the project to gather additional information.
- Identify and select what you will measure. Metrics are those system characteristics that can be quantitatively measured. The choice of a metric will depend on the services you are examining. Be careful in your selection. It is often tempting to go with metrics based on how easy the data is to collect rather than on how relevant the data is to the goal. For a network backbone, this might include throughput, delay, utilization, number of packets sent, number of packets discarded, or average packet size. (A short sketch of computing one such metric, utilization, from raw counter samples appears after this list.)
- If appropriate, identify test parameters and factors.[44] Parameters and factors are characteristics of the system that affect performance and that can be changed. You'll change these to see what effect they have on the system. Parameters include both system and load (or traffic) parameters. Try to be as systematic as possible in identifying and evaluating parameters to avoid arbitrary decisions. It is very easy to overlook relevant parameters or include irrelevant ones.
[44]Further distinctions between parameters and factors are sometimes made but don't seem relevant when considered solely from the perspective of measurements.
For a network backbone, system parameters may include interface speeds and link speeds or the use of load sharing. For traffic, you might use a tool like mgen to add an additional load. But for simple performance measurement, you may elect to change nothing.

- Select tools. Once you have a clear picture of what you want to do, it is time to select the tools of interest. It is all too easy to do this too soon. Don't let the tools you have determine what you are going to do. Tools for backbone performance might include using ntop on a link or SNMP-based tools.
- Establish measurement constraints. On a production network, establishing constraints usually means deciding when and where to make your measurements. You will also need to decide on the frequency and duration of your measurements. This is often more a matter of intuition than engineering. This is something that you will have to do iteratively, adjusting your approach based on the results you get. Unless you have a very compelling reason, measurements should be taken under representative conditions.
For backbone performance, for example, router interfaces are the obvious places to look. Server interfaces are another reasonable choice. You may also need to look at individual links as well, particularly in a switched network. You will also need to sample at different times, including in particular those times when the load is heaviest. (Use mrtg or cricket to determine this.) You will need to ensure that your measurements have the appropriate level of detail. If you have isochronous applications, such as video conferencing, that are extremely sensitive to delay, five-minute averages will not provide adequate information.
- Review your experimental design. Once you have decided what you want to measure and how, you should look back over the process before you begin. Are there any optimizations you can make to minimize the amount of work you will have to do? Will the measurements you make really answer your questions? It is wise to review these questions before you invest large amounts of time.
- Collect data. The single most important consideration in collecting data is that you adequately document what you are doing. It is an all too common experience to discover that you have a wonderful collection of data, but you don't fully know or remember the circumstances surrounding its collection. Consequently, you don't know how to interpret it. If this happens, the only thing you can do is discard the data and start over. Remember, collecting data is an iterative process. You must examine your results and make adjustments as needed. It is too easy to continue collecting worthless data when even a cursory examination of your data would have revealed you were on the wrong track.
- Analyze data. Once the data is collected, you must analyze, interpret, and act upon your results. This analysis will, of course, depend heavily on the context and goals of the investigation. But an essential element is to condense the data and extract the needed information, presenting it in a concise form. It is often the case that measurements will create massive amounts of data that are meaningless until carefully analyzed.
Don't get too carried away. Often the simplest analyses are of greater value than overly complex analyses. Simple analyses can often be more easily understood. But whatever you conclude, you'll need to do it all again. System performance analysis is a never-ending task.
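As a small example of the kind of data reduction involved in the metric and analysis steps above, the following sketch turns raw interface counter samples into per-interval utilization figures and a couple of summary statistics. The samples and interface speed are invented; collecting the counters (for example, with an SNMP poller) is assumed to have happened elsewhere, and counter wraparound is ignored.

```python
#!/usr/bin/env python3
"""Sketch: reduce raw interface counter samples to utilization figures.

Assumes (timestamp, ifInOctets, ifOutOctets) samples were already collected,
e.g., by an SNMP poller. All numbers are hypothetical.
"""
from statistics import mean

IF_SPEED = 100_000_000  # interface speed in bits/sec (Fast Ethernet)

# (seconds since start, ifInOctets, ifOutOctets) -- made-up samples
SAMPLES = [
    (0,    1_000_000,   2_000_000),
    (300,  40_000_000,  90_000_000),
    (600,  95_000_000, 180_000_000),
]

def utilization(samples, if_speed=IF_SPEED):
    """Percent utilization per interval from successive counter samples.

    Treats in + out traffic against a single link speed and assumes no
    counter wraparound; adjust for full-duplex links as appropriate.
    """
    results = []
    for (t0, in0, out0), (t1, in1, out1) in zip(samples, samples[1:]):
        bits = ((in1 - in0) + (out1 - out0)) * 8      # octets -> bits
        results.append(100.0 * bits / ((t1 - t0) * if_speed))
    return results

if __name__ == "__main__":
    util = utilization(SAMPLES)
    print("per-interval %:", [round(u, 1) for u in util])
    print("mean %:", round(mean(util), 1), " max %:", round(max(util), 1))
```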
Bottleneck analysis
Networks are composed of a number of pieces, and if the pieces are not well matched, overall performance may be determined by the behavior of a single component. Bottleneck analysis is the process of identifying that component.

When looking at performance, you'll need to be sure you get a complete picture. Generally, one bottleneck will dominate performance statistics. Many systems, however, will have multiple bottlenecks. It's just that one bottleneck is a little worse than the others. Correcting one bottleneck will simply shift the problem -- the bottleneck will move from one component to another. When doing performance monitoring, your goal should be to discover as many bottlenecks as possible.
Often identifying a bottleneck is easy. Once you have a clear picture of your network's architecture, topology, and uses, bottlenecks will be obvious. For example, if 90% of your network traffic is to the Internet and you have a gigabit backbone and a 56-Kbps WAN connection, you won't need a careful analysis to identify your bottleneck.
Identifying bottlenecks is process dependent. What may be a bottleneck for one process may not be a problem for another. For example, if you are moving small files, the delay in making a connection will be the primary bottleneck. If you are moving large files, the speed of the link may be more important.
Bottleneck analysis is essential in planning because it will tell you what improvements will provide the greatest benefit to your network. The only real way to escape bottlenecks is to grossly overengineer your network, not something you'll normally want to do. Thus, your goal should not be to completely eliminate bottlenecks but to minimize their impact to the point that they don't cause any real problems. Upgrading the network in a way that doesn't address bottlenecks will provide very little benefit to the network. If the bottlenecks on your network are a slow WAN connection and slow servers, upgrading from Fast Ethernet to Gigabit Ethernet will be a foolish waste of money. The key consideration here is utilization. If you are seeing 25% utilization with Fast Ethernet, don't be surprised to see utilization drop below 3% with Gigabit Ethernet. But you should be aware that even if the utilization is low, increasing the capacity of a line will shorten download times for large files. Whether this is worthwhile will depend on your organization's mission and priorities.
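A back-of-the-envelope calculation makes the point. The numbers below are hypothetical and ignore protocol overhead and slower links elsewhere on the path, but they show how the same offered load produces much lower utilization on the faster link while best-case transfer times for large files shrink proportionally.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope look at the Fast Ethernet vs. Gigabit example above."""

offered_load = 25_000_000        # bits/sec of traffic, unchanged by the upgrade
file_size = 100 * 1_000_000 * 8  # a 100 MB file, in bits

for name, speed in [("Fast Ethernet", 100_000_000), ("Gigabit Ethernet", 1_000_000_000)]:
    utilization = 100.0 * offered_load / speed
    transfer = file_size / speed  # best case, ignoring overhead and slower hops
    print(f"{name:18} utilization {utilization:4.1f}%   100 MB best case {transfer:.1f} s")
```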
Here is a rough outline of the steps you might go through to identify a bottleneck:
- Map your network. The first step is to develop a clear picture of your network's topology. To do this, you can use the tools described in "Device Discovery and Mapping". tkined might be a good choice. Often potential bottlenecks are obvious once you have a clear picture of your network. At the very least, you may be able to distinguish the parts of the network that are likely to have bottlenecks from parts that don't need to be examined, reducing the work you will have to do.
- Identify time-dependent behavior. The problems bottlenecks cause, unless they are really severe, tend to come and go. The next logical step is to locate the most heavily used devices and the times when they are in greatest use. You'll want to use a tool like mrtg or cricket to identify time-dependent behavior. (Understanding time-dependent behavior can also be helpful in identifying when you can work on the problem with the least impact on users.)
- Pinpoint the problems. At this point, you should have narrowed your focus to a few key parts of the network and a few key times. Now you will want to drill down on specific devices and links. ntop is a likely choice at this point, but any SNMP-based tool may be useful.
- Select the tool. How you will proceed from here will depend on what you have discovered. It is likely that you will be able to classify the problem as stemming from an edge device, such as a server, or from a path between devices. Doing so will simplify the decision of what to do next.
If you believe the problem lies with a path, you can use the tools described in "Path Characteristics" to drill down to a specific device or single link. You'll probably want to get an idea of the nature of the traffic over the link. ntop is one choice, or you could use a tool like tcpdump, ethereal, or one of the tools that analyzes tcpdump traffic.
For a link device like a router or switch, you'll need to look at basic performance. SNMP-based tools are the best choice here.
For end devices, you need to look at the performance of the device at each level of the communications architecture. You could use spray to examine the interface performance. For the stack, you might compare the time between SYN and ACK packets with the time between application packets. (Use ethereal or tcpdump to collect this information; a rough sketch of this comparison appears after this outline.) The setup times should be independent of the application, depending only on the stack. If the stack responds quickly and the application doesn't, you'll need to focus on the application.
- Fix the problem. Once you have an idea of the source of the problem, you can then decide how to deal with it. For poor link performance, you have several choices. You can upgrade the link bandwidth or alter the network topology to change the load on the link. Adding interfaces to a server is one very simple solution. Attaching a server to multiple subnets is a quick way to decrease traffic between those subnets. Policy-based routing is yet another approach. You can use routing priorities to ensure that important traffic is handled preferentially.
For an edge device such as an attached server, you'll want to distinguish among hardware problems, operating system problems, and application problems, then upgrade accordingly.
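Here is a rough sketch of the stack-versus-application comparison mentioned above, using scapy to read a capture made with tcpdump or ethereal. The capture file name, server address, and the assumption of a single IPv4 TCP connection in the trace are all hypothetical simplifications.

```python
#!/usr/bin/env python3
"""Rough sketch: separate stack delay from application delay in a capture.

Compares the SYN -> SYN/ACK time with the time from the client's first data
segment to the server's first data segment. Assumes one IPv4 TCP connection
in the trace; file name and address are placeholders. Requires scapy.
"""
from scapy.all import IP, TCP, rdpcap

SERVER = "192.0.2.80"             # placeholder server address
packets = rdpcap("capture.pcap")  # placeholder capture made with tcpdump -w

syn_time = synack_time = request_time = reply_time = None

for pkt in packets:
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    tcp = pkt[TCP]
    flags = int(tcp.flags)
    to_server = pkt[IP].dst == SERVER
    if (flags & 0x02) and not (flags & 0x10) and syn_time is None:   # SYN only
        syn_time = pkt.time
    elif (flags & 0x12) == 0x12 and synack_time is None:             # SYN/ACK
        synack_time = pkt.time
    elif len(tcp.payload) > 0:                                       # data segment
        if to_server and request_time is None:
            request_time = pkt.time          # first application request
        elif not to_server and reply_time is None:
            reply_time = pkt.time            # first application reply

if syn_time is not None and synack_time is not None:
    print(f"stack setup (SYN -> SYN/ACK):   {synack_time - syn_time:.4f} s")
if request_time is not None and reply_time is not None:
    print(f"application (request -> reply): {reply_time - request_time:.4f} s")
```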
Capacity planning
Capacity planning is an extremely important task. Done correctly, it is also an extremely complex and difficult task, both to learn and to do. But this shouldn't keep you from attempting it. The description here can best be described as a crude, first-order approximation of capacity planning. But it will give you a place to start while you are learning.

Capacity planning is really an umbrella that describes several closely related activities. Capacity management is the process of allocating resources in a cost-efficient way. It is concerned with the resources that you currently have. (As you might guess, this is closely related to bottleneck analysis.) Trend analysis is the process of looking at system performance over time, trying to identify how it has changed in the past with the goal of predicting future changes. Capacity planning attempts to combine capacity management and trend analysis. The goal is to predict future needs to provide for effective planning.
The basic steps are fairly straightforward to describe, just difficult to carry out. First, decide what you need to measure. That means looking at your system in much the same way you did with bottleneck analysis but augmenting your analysis with anything you know about the future growth of your system. You'll need to think about your system in context to do this.
Next, select appropriate tools to collect the information you'll need. (mrtg and cricket are the most obvious tools among those described in this tutorial, but there are a number of other viable tools if you are willing to do the work to archive the data.) With the tools in place, begin monitoring your system, recording and archiving appropriate data. Deciding what to keep and how to organize it is a tremendously difficult problem. Every situation is different, and each is largely a question of balancing the work involved in keeping the data organized and accessible against the likelihood that you will actually use it. Judging that balance comes only from experience.
Once you have the measurements, you will need to analyze them. In general, focus on areas that show the greatest change. Collecting and analyzing data will be an iterative process. If little is different from one measurement to the next, then collect data less frequently. When there is high variability, collect more often.
Finally, you'll make your predictions and adjust your system accordingly.
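As a crude illustration of trend analysis, the sketch below fits a straight line to a few months of peak-utilization figures and projects it forward. The numbers are invented stand-ins for archived mrtg or cricket data, and a linear fit obviously cannot anticipate the kinds of fundamental changes discussed next.

```python
#!/usr/bin/env python3
"""A crude, first-order trend projection of the kind described above.

The monthly peak-utilization figures are invented; real numbers would come
from archived measurement data. A straight-line fit is only a starting point.
"""

# (month number, peak utilization %) -- hypothetical archived measurements
history = [(1, 22.0), (2, 24.5), (3, 23.8), (4, 27.1), (5, 29.4), (6, 31.0)]

n = len(history)
sum_x = sum(x for x, _ in history)
sum_y = sum(y for _, y in history)
sum_xy = sum(x * y for x, y in history)
sum_xx = sum(x * x for x, _ in history)

# ordinary least-squares slope and intercept
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

for month in (9, 12):  # project a few months ahead
    print(f"month {month}: projected peak utilization {intercept + slope * month:.1f}%")
```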
There are a number of difficulties in capacity planning. Perhaps the greatest difficulty comes with unanticipated, fundamental changes in the way your network is used. If you will be offering new services, predictions based on trends that predate these services will not adequately predict new needs. For example, if you are introducing new technologies such as Internet telephony or video, trend analysis before the fact will be of limited value. There is a saying that you can't predict how many people will use a bridge by counting how many people are currently swimming across the river. If this is the case, about the best you can do is look to others who have built similar bridges over similar rivers.
Another closely related problem is differential growth. If your network, like most, provides a variety of different services, then they are probably growing at different rates. This makes it very difficult to predict aggregate performance or need if you haven't adequately collected data to analyze individual trends.
Yet another difficulty is motivation. The key to trend analysis is keeping adequate records, i.e., measuring and recording information in a way that makes it accessible and usable. This is difficult for many people since the records won't have much immediate utility. Their worth comes from being able to look back at them over time for trends. It is difficult to invest the time needed to collect and maintain this data when there will be no immediate return on the effort and when fundamental changes can destroy the utility of the data.
You should be aware of these difficulties, but you should not let them discourage you. The cost of not doing capacity planning is much greater.