Diagnosing Transient Performance Problems
Terry Gray
03 Dec 2000
BACKGROUND
It's kind of a personal hot button. For over a decade I have pleaded
with whoever would listen, including major network equipment vendors,
that what the world really needs is a better set of tools for diagnosing
transient network/system performance problems. Alas, little progress has
been made in that interval, yet our network computing world is becoming
ever more complex and difficult to diagnose.
COMPLEXITY
Examples of this growing complexity include:
- New link-layer technologies such as wireless add even more uncertainty
  about the dynamic behavior of a network.
- Understanding the overall performance implications of DiffServ and
  other QoS mechanisms is pretty much uncharted territory.
- Increasing use of UDP for multimedia reintroduces uncertainty about
  the behavior of coexisting TCP streams.
- Content-distribution network technology is all the rage nowadays,
  making it harder to figure out what path a conversation takes and
  where the relevant servers are located.
- VPNs, MPLS, VLANs, MTU size, packet fragmentation, IPv6... all of
  these have performance implications. Even mature elements of the
  Internet infrastructure such as ARP can have unintended performance
  consequences in some implementations.
- And there's always link-level flow control (802.3x, if anyone actually
  uses it anymore), which imposes "head of line blocking" on all flows
  sharing a particular link, or link-layer ARQ for wireless or other
  noisy media.
- Oh, and did someone say multicast?
- If that's not enough, the complexity and variability of drivers and
  other OS code on typical desktops can have enormous impact on network
  application performance, especially streaming media apps.
IMPORTANCE
It's never been easy to determine whether reports of "slow response"
are due to network problems or end-system problems. And if the network
is in fact the culprit, how do we figure out in real-time exactly *where*
the problem is? Such information is obviously key to rapid problem
resolution.
Understanding what kind of performance end-users experience, and --if it
is poor-- identifying the source of the problem, are sufficiently
important issues for e-commerce vendors that a new industry has been
created to characterize and analyze end-user web-browsing performance.
Keynote and Appliant are examples of companies providing these services to
e-comm vendors.
Still, sophisticated tools to help diagnose these problems are not widely
available even for common applications such as web browsing, much less for
more esoteric apps or when advanced network transport capabilities are
being used.
EXAMPLES
I was reminded of the sorry state of the art once again just a few months
ago. While in my office, I decided to listen to a webcast seminar which
was originating elsewhere on our campus. The results were terrible. My
immediate reaction was that we must be having serious network problems
somewhere on campus. However, on a whim, I happened to fire up a
wireless-connected laptop I had sitting on the desk in addition to my
primary machine. Imagine my surprise when I discovered that the seminar
"came through" just fine on the laptop (connected via an 802.11b
wireless access point on the same subnet as my desktop computer).
Without this accidental discovery, wasted effort would have been directed
toward diagnosing the wrong system elements.
More recently, I listened to streaming audio from a local radio station
during three different periods of one afternoon. The first and third
times it sounded great, but the second time large chunks of audio were
missing. A quick ping and traceroute to the station's main address (who
knows where the actual audio server was located) yielded nominal values.
Should we then assume that the packet loss was due to server overload?
With what confidence level could we say that?
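One small step toward an answer is to sample over time rather than
relying on a single ping: a transient shows up as a spike or a timeout
in a log that can later be correlated with server-side evidence. The
sketch below uses TCP connection setup as a crude application-level
stand-in for ping (function name and parameters are illustrative, not
from any particular tool):

```python
import socket
import time

def sample_rtts(host, port=80, n=5, interval=1.0, timeout=2.0):
    """Repeatedly time TCP connection setup to a server.  A spike or
    a None (timeout/refusal) in the returned list marks a transient
    worth correlating with other evidence."""
    samples = []
    for _ in range(n):
        t0 = time.monotonic()
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            samples.append(time.monotonic() - t0)
        except OSError:
            samples.append(None)  # connection lost, timed out, or refused
        time.sleep(interval)
    return samples
```

Connect timing measures the forwarding path plus the server's kernel,
not the audio application itself, so it still cannot fully separate
path loss from server overload -- but it at least timestamps the trouble.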
EXISTING TOOLS
What can be done to help isolate such transient performance problems?
- There are some existing diagnostic apps which can help a bit, e.g.
  ping and traceroute, but they lack the specificity to really isolate
  problems. Moreover, if ping and traceroute do not show any problem,
  can we say with certainty that the network is not at fault? I wish I
  knew!
- "Beacon responder" hosts, strategically located in a network, with
  nothing to do but respond to diagnostic packets, can help isolate
  problems.
- SNMP data from routers on dropped packets can be extremely valuable.
  As router capabilities become more complex, we will need to rely even
  more on internal router instrumentation.
- SNMP data from end-system network stacks is potentially useful, but
  since TCP uses packet loss as its congestion signal, one must take
  care to distinguish between "baseline" loss that reflects the normal
  operation of TCP seeking its best operating point and abnormal packet
  loss.
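A beacon responder can be very simple. The sketch below (a minimal
illustration, not a deployed tool) echoes each UDP probe back with the
beacon's receive timestamp prepended, so the prober gets both a
round-trip time and a mid-path timestamp from a host whose only job is
to answer:

```python
import socket
import struct
import threading
import time

def beacon_responder(sock, n_probes):
    """Serve n_probes UDP probes: echo each payload back with an
    8-byte receive timestamp (big-endian double, seconds since the
    epoch) prepended."""
    for _ in range(n_probes):
        data, addr = sock.recvfrom(2048)
        reply = struct.pack("!d", time.time()) + data
        sock.sendto(reply, addr)

def probe(beacon_addr, payload=b"probe", timeout=2.0):
    """Send one probe; return (rtt, beacon_timestamp, echoed_payload)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    t0 = time.time()
    sock.sendto(payload, beacon_addr)
    reply, _ = sock.recvfrom(2048)
    rtt = time.time() - t0
    sock.close()
    beacon_ts = struct.unpack("!d", reply[:8])[0]
    return rtt, beacon_ts, reply[8:]

# Demo over loopback: run the responder in a thread, send one probe.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
t = threading.Thread(target=beacon_responder, args=(server, 1))
t.start()
rtt, ts, echoed = probe(server.getsockname())
t.join()
print(rtt, echoed)
```

With beacons placed on either side of a suspect link, comparing probe
results to the two can bracket where loss or delay is being introduced.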
QUESTIONS
Given a (possibly asymmetric) path between a client and server, what
questions might we ask concerning a perceived performance problem?
First and foremost, we'd like to know whether it is a network problem or
end-system problem or both...
- If network:
  - where?
  - both directions?
  - packet loss, corruption, or delay?
  - link congestion, or switch/router?
- If end-system: client or server or both?
  - If hardware: overload-related or not?
  - If software, which component?
    - network device driver
    - network stack
    - operating system
    - drivers other than network
    - application(s)
More general questions include:
- When is user-perceived delay due to queuing or encoding delay, and
  when is it due to packet loss and retransmission?
- When a packet is lost, how often is it because the packet, while
  successfully placed on an output queue, never made it to the next hop
  "intact"? That is, because of link-layer noise or other sources of
  corruption, the packet was not valid when it arrived at the next
  switch or router interface?
- Alternatively, how often is it because congestion on an output link
  caused (one of) the router's queues to be overrun?
- We presume that most links are pretty clean, so most packet loss is
  due to queue overrun. Will pervasive use of wireless dramatically
  change the ratio?
- In interactive network applications, what role does the application
  itself play in detecting and reporting network conditions to the
  user?
- How often are user-visible delays in interactive apps due to problems
  with support services, e.g. DNS, rather than congestion in the
  forwarding path or trouble with the app server?
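The last of these questions can at least be probed from the client
side by timing name resolution separately from connection setup. A
minimal sketch (function name and return format are illustrative):

```python
import socket
import time

def delay_breakdown(hostname, port=80, timeout=5.0):
    """Time DNS resolution and TCP connection setup separately, to
    help attribute user-visible delay to support services (DNS)
    versus the forwarding path and server."""
    t0 = time.monotonic()
    addr = socket.gethostbyname(hostname)      # DNS lookup only
    t1 = time.monotonic()
    sock = socket.create_connection((addr, port), timeout=timeout)
    t2 = time.monotonic()
    sock.close()
    return {"dns_s": t1 - t0, "connect_s": t2 - t1, "addr": addr}
```

If dns_s dominates repeatedly, the resolver or the DNS infrastructure
is the suspect, not the forwarding path to the application server.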
POSSIBILITIES
A few years ago, one of our staff worked on some end-to-end tools to try
to infer whether perceived delays were due to network or server problems,
based on the timing of keystroke responses from a server application
running in "cooked" mode. A later version tried to use ntpd as a source
of reliable timestamped packets. While useful, these tools were not
available on all platforms, and I still found myself wishing for a more
definitive "packet-eye" view of what a typical flow experiences in its
path between client and server (and back). To do this, it would be good
to have the cooperation of router vendors, but some progress could be made
even without it.
For example, imagine that we could inject a uniquely identifiable packet
into the network that would trigger recognizers on both input and output
interfaces of all switches and routers in the path. When recognized, the
router would record a timestamp. These would be gathered together and
yield a timeline of the progress of the packet (or where it fell off the
radar screen). Obviously, having router clocks synchronized would be very
useful, but since latency on a link is highly predictable in comparison to
latency within a switch or router, useful information could be inferred
even if the clocks were not perfectly synchronized.
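One reason imperfect synchronization is tolerable: if each device
timestamps both ingress and egress against its own clock, a constant
clock offset cancels when computing the packet's residence time inside
that device; only the link latencies between devices depend on clock
agreement, and those are the predictable part. A sketch, assuming a
hypothetical record format of per-device (ingress, egress) pairs:

```python
def residence_times(records):
    """Given per-device (ingress_ts, egress_ts) pairs for one probe
    packet -- each pair read from that device's *own* clock -- return
    how long the packet spent inside each device.  Any constant
    per-device clock offset cancels out of the subtraction."""
    return {dev: egress - ingress
            for dev, (ingress, egress) in records.items()}

# Two routers whose clocks are wildly offset from each other:
timeline = {
    "rtr-a": (100.0021, 100.0023),   # rtr-a's clock
    "rtr-b": (987.5540, 987.5590),   # rtr-b's clock, offset unknown
}
print(residence_times(timeline))
```

A probe that appears at a device's ingress but never at its egress
(or never at the next device at all) localizes the loss to that hop.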
Note that it is important that such packets flow through the normal
forwarding path of the router, and trigger timestamps on ingress and
egress. At the same time, it would be useful to record at least the
source address of the probe packets to disambiguate timestamp data, since
more than one probe flow could exist simultaneously. To do that within
the primary forwarding path of a modern high-performance router probably
implies some specialized hardware on the interface cards. On the other
hand, one could also imagine a few strategically placed passive monitors
that sat around watching for probe packets to go by, and recording
timestamps (and at least source address) for each probe packet. This
wouldn't give as complete a picture as when router interfaces did the
monitoring, but it is perhaps a more realistic approach to deployment.
The recording mechanisms could implement some form of rate limiting to
guard against DoS attacks on this measurement service.
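A token bucket is one conventional way to implement that guard: record
probes only while tokens remain, refilling at a fixed rate. A minimal
sketch (class and parameter names are illustrative, not from any
particular router implementation):

```python
import time

class TokenBucket:
    """Allow at most `rate` probe recordings per second on average,
    with bursts up to `burst`.  `now` is injectable for testing."""
    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False             # over budget: drop, don't record
```

Probes arriving over budget would simply be forwarded without being
recorded, so the measurement service cannot be used to exhaust the
recorder.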
CONCLUSION
These are just a couple of ideas. The purpose of this note is to
stimulate discussion and thinking about what kind of tools would be useful
in responding to the all-too-typical call to Network Operations: "Things
seem to be very slow at the moment... is something wrong with the
network?"