Diagnosing Transient Performance Problems
Terry Gray
03 Dec 2000
BACKGROUND
It's kind of a personal hot button. For over a decade I have pleaded
with whoever would listen, including major network equipment vendors,
that what the world really needs is a better set of tools for diagnosing
transient network/system performance problems. Alas, little progress has
been made in that interval, yet our network computing world is becoming
ever more complex and difficult to diagnose.
COMPLEXITY
Examples of this growing complexity include:
- New link-layer technologies such as wireless add even more uncertainty
  about the dynamic behavior of a network.
- Understanding the overall performance implications of DiffServ and
  other QoS mechanisms is pretty much uncharted territory.
- Increasing use of UDP for multimedia reintroduces uncertainty about
  the behavior of coexisting TCP streams.
- Content-distribution network technology is all the rage nowadays,
  making it harder to figure out what path a conversation takes and
  where the relevant servers are located.
- VPNs, MPLS, VLANs, MTU size, packet fragmentation, IPv6... all of
  these have performance implications. Even mature elements of the
  Internet infrastructure such as ARP can have unintended performance
  consequences in some implementations.
- And there's always link-level flow control (802.3x, if anyone actually
  uses it anymore), which imposes "head of line blocking" on all flows
  sharing a particular link, or link-layer ARQ for wireless or other
  noisy media.
- Oh, and did someone say multicast?
- If that's not enough, the complexity and variability of drivers and
  other OS code on typical desktops can have enormous impact on network
  application performance, especially streaming media apps.
IMPORTANCE
It's never been easy to determine whether reports of "slow response"
are due to network problems or end-system problems. And if the network
is in fact the culprit, how do we figure out in real-time exactly *where*
the problem is? Such information is obviously key to rapid problem
resolution.
Understanding what kind of performance end-users experience, and --if it
is poor-- identifying the source of the problem, are sufficiently
important issues for e-commerce vendors that a new industry has been
created to characterize and analyze end-user web-browsing performance.
Keynote and Appliant are examples of companies providing these services to
e-comm vendors.
Still, sophisticated tools to help diagnose these problems are not widely
available even for common applications such as web browsing, much less for
more esoteric apps or when advanced network transport capabilities are
being used.
EXAMPLES
I was reminded of the sorry state of the art once again just a few months
ago. While in my office, I decided to listen to a webcast seminar which
was originating elsewhere on our campus. The results were terrible. My
immediate reaction was that we must be having serious network problems
somewhere on campus. However, on a whim, I happened to fire up a
wireless-connected laptop I had sitting on the desk in addition to my
primary machine. Imagine my surprise when I discovered that the seminar
"came through" just fine on the laptop (connected via an 802.11b
wireless access point on the same subnet as my desktop computer).
Without this accidental discovery, wasted effort would have been directed
toward diagnosing the wrong system elements.
More recently, I listened to streaming audio from a local radio station
during three different periods of one afternoon. The first and third
times it sounded great, but the second time large chunks of audio were
missing. A quick ping and traceroute to the station's main address (who
knows where the actual audio server was located) yielded nominal values.
Should we then assume that the packet loss was due to server overload?
With what confidence level could we say that?
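One small step toward an answer is to sample over time rather than
relying on a single ping: a transient shows up as a spike or a timeout
in a log that can later be correlated with server-side evidence. The
sketch below uses TCP connection setup as a crude application-level
stand-in for ping (function name and parameters are illustrative, not
from any particular tool):

```python
import socket
import time

def sample_rtts(host, port=80, n=5, interval=1.0, timeout=2.0):
    """Repeatedly time TCP connection setup to a server.  A spike or
    a None (timeout/refusal) in the returned list marks a transient
    worth correlating with other evidence."""
    samples = []
    for _ in range(n):
        t0 = time.monotonic()
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            samples.append(time.monotonic() - t0)
        except OSError:
            samples.append(None)  # connection lost, timed out, or refused
        time.sleep(interval)
    return samples
```

Connect timing measures the forwarding path plus the server's kernel,
not the audio application itself, so it still cannot fully separate
path loss from server overload -- but it at least timestamps the trouble.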
EXISTING TOOLS
What can be done to help isolate such transient performance problems?
- There are some existing diagnostic apps which can help a bit, e.g.
  ping and traceroute, but they lack the specificity to really isolate
  problems. Moreover, if ping and traceroute do not show any problem,
  can we say with certainty that the network is not at fault? I wish I
  knew!
- "Beacon responder" hosts, strategically located in a network, with
  nothing to do but respond to diagnostic packets, can help isolate
  problems.
- SNMP data from routers on dropped packets can be extremely valuable.
  As router capabilities become more complex, we will need to rely even
  more on internal router instrumentation.
- SNMP data from end-system network stacks is potentially useful, but
  since TCP uses packet loss as its congestion signal, one must take
  care to distinguish between "baseline" loss that reflects the normal
  operation of TCP seeking its best operating point and abnormal packet
  loss.
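A beacon responder can be very simple. The sketch below (a minimal
illustration, not a deployed tool) echoes each UDP probe back with the
beacon's receive timestamp prepended, so the prober gets both a
round-trip time and a mid-path timestamp from a host whose only job is
to answer:

```python
import socket
import struct
import threading
import time

def beacon_responder(sock, n_probes):
    """Serve n_probes UDP probes: echo each payload back with an
    8-byte receive timestamp (big-endian double, seconds since the
    epoch) prepended."""
    for _ in range(n_probes):
        data, addr = sock.recvfrom(2048)
        reply = struct.pack("!d", time.time()) + data
        sock.sendto(reply, addr)

def probe(beacon_addr, payload=b"probe", timeout=2.0):
    """Send one probe; return (rtt, beacon_timestamp, echoed_payload)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    t0 = time.time()
    sock.sendto(payload, beacon_addr)
    reply, _ = sock.recvfrom(2048)
    rtt = time.time() - t0
    sock.close()
    beacon_ts = struct.unpack("!d", reply[:8])[0]
    return rtt, beacon_ts, reply[8:]

# Demo over loopback: run the responder in a thread, send one probe.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
t = threading.Thread(target=beacon_responder, args=(server, 1))
t.start()
rtt, ts, echoed = probe(server.getsockname())
t.join()
print(rtt, echoed)
```

With beacons placed on either side of a suspect link, comparing probe
results to the two can bracket where loss or delay is being introduced.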
QUESTIONS
Given a (possibly asymmetric) path between a client and server, what
questions might we ask concerning a perceived performance problem?
First and foremost, we'd like to know whether it is a network problem or
end-system problem or both...
- If network:
  - where?
  - both directions?
  - packet loss, corruption, or delay?
  - link congestion, or switch/router?
- If end-system: client or server or both?
  - If hardware: overload-related or not?
  - If software, which component?
    - network device driver
    - network stack
    - operating system
    - drivers other than network
    - application(s)
More general questions include:
- When is user-perceived delay due to queuing or encoding delay, and
  when is it due to packet loss and retransmission?
- When a packet is lost, how often is it because the packet, while
  successfully placed on an output queue, never made it to the next hop
  "intact"? That is, because of link-layer noise or other sources of
  corruption, the packet was not valid when it arrived at the next
  switch or router interface?
- Alternatively, how often is it because congestion on an output link
  caused (one of) the router's queues to be overrun?
- We presume that most links are pretty clean, so most packet loss is
  due to queue overrun. Will pervasive use of wireless dramatically
  change the ratio?
- In interactive network applications, what role does the application
  itself play in detecting and reporting network conditions to the
  user?
- How often are user-visible delays in interactive apps due to problems
  with support services, e.g. DNS, rather than congestion in the
  forwarding path or trouble with the app server?
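The last of these questions can at least be probed from the client
side by timing name resolution separately from connection setup. A
minimal sketch (function name and return format are illustrative):

```python
import socket
import time

def delay_breakdown(hostname, port=80, timeout=5.0):
    """Time DNS resolution and TCP connection setup separately, to
    help attribute user-visible delay to support services (DNS)
    versus the forwarding path and server."""
    t0 = time.monotonic()
    addr = socket.gethostbyname(hostname)      # DNS lookup only
    t1 = time.monotonic()
    sock = socket.create_connection((addr, port), timeout=timeout)
    t2 = time.monotonic()
    sock.close()
    return {"dns_s": t1 - t0, "connect_s": t2 - t1, "addr": addr}
```

If dns_s dominates repeatedly, the resolver or the DNS infrastructure
is the suspect, not the forwarding path to the application server.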
POSSIBILITIES
A few years ago, one of our staff worked on some end-to-end tools to try
to infer whether perceived delays were due to network or server problems,
based on the timing of keystroke responses from a server application
running in "cooked" mode. A later version tried to use ntpd as a source
of reliable timestamped packets. While useful, these tools were not
available on all platforms, and I still found myself wishing for a more
definitive "packet-eye" view of what a typical flow experiences in its
path between client and server (and back). To do this, it would be good
to have the cooperation of router vendors, but some progress could be made
even without it.
For example, imagine that we could inject a uniquely identifiable packet
into the network that would trigger recognizers on both input and output
interfaces of all switches and routers in the path. When recognized, the
router would record a timestamp. These would be gathered together and
yield a timeline of the progress of the packet (or where it fell off the
radar screen). Obviously, having router clocks synchronized would be very
useful, but since latency on a link is highly predictable in comparison to
latency within a switch or router, useful information could be inferred
even if the clocks were not perfectly synchronized.
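One reason imperfect synchronization is tolerable: if each device
timestamps both ingress and egress against its own clock, a constant
clock offset cancels when computing the packet's residence time inside
that device; only the link latencies between devices depend on clock
agreement, and those are the predictable part. A sketch, assuming a
hypothetical record format of per-device (ingress, egress) pairs:

```python
def residence_times(records):
    """Given per-device (ingress_ts, egress_ts) pairs for one probe
    packet -- each pair read from that device's *own* clock -- return
    how long the packet spent inside each device.  Any constant
    per-device clock offset cancels out of the subtraction."""
    return {dev: egress - ingress
            for dev, (ingress, egress) in records.items()}

# Two routers whose clocks are wildly offset from each other:
timeline = {
    "rtr-a": (100.0021, 100.0023),   # rtr-a's clock
    "rtr-b": (987.5540, 987.5590),   # rtr-b's clock, offset unknown
}
print(residence_times(timeline))
```

A probe that appears at a device's ingress but never at its egress
(or never at the next device at all) localizes the loss to that hop.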
Note that it is important that such packets flow through the normal
forwarding path of the router, and trigger timestamps on ingress and
egress. At the same time, it would be useful to record at least the
source address of the probe packets to disambiguate timestamp data, since
more than one probe flow could exist simultaneously. To do that within
the primary forwarding path of a modern high-performance router probably
implies some specialized hardware on the interface cards. On the other
hand, one could also imagine a few strategically placed passive monitors
that sat around watching for probe packets to go by, and recording
timestamps (and at least source address) for each probe packet. This
wouldn't give as complete a picture as when router interfaces did the
monitoring, but it is perhaps a more realistic approach to deployment.
The recording mechanisms could implement some form of rate limiting to
guard against DoS attacks on this measurement service.
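A token bucket is one conventional way to implement that guard: record
probes only while tokens remain, refilling at a fixed rate. A minimal
sketch (class and parameter names are illustrative, not from any
particular router implementation):

```python
import time

class TokenBucket:
    """Allow at most `rate` probe recordings per second on average,
    with bursts up to `burst`.  `now` is injectable for testing."""
    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False             # over budget: drop, don't record
```

Probes arriving over budget would simply be forwarded without being
recorded, so the measurement service cannot be used to exhaust the
recorder.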
CONCLUSION
These are just a couple of ideas. The purpose of this note is to
stimulate discussion and thinking about what kind of tools would be useful
in responding to the all-too-typical call to Network Operations: "Things
seem to be very slow at the moment... is something wrong with the
network?"