"Finger Pointing Tools" for Isolating Distributed System Performance Problems

Terry Gray
Document in progress
28 Feb 2001

INTRODUCTION

It's an age-old problem. Someone calls the Network Operations Center and asks whether there is something wrong with the network, because a particular network-based application is running slowly. Where is the problem? Is it the network? Is it an overloaded server? Is it the client computer? Clearly, what the world really needs is a tool or technique to rapidly determine which element(s) of the system is/are responsible for the performance problem. In support of the Internet2 "End to End Performance Initiative", this document focuses on the quest for just such a tool or technique set. We will refer to this quest as the "Finger Pointing Tool" project. "FPT" for short.

Gray's First Law of Computing Happiness is that "the key to high performance (and world peace) is reducing contention for shared resources." However, distributed computing almost always involves a shared network and shared servers. The FPT project seeks to find ways to distinguish contention-induced performance problems from other sources, and to distinguish server-based delays from network-based delays.

IMPORTANCE

It's never been easy to determine whether reports of "slow response" are due to network problems or end-system problems. And if the network is in fact the culprit, how do we figure out in real-time exactly *where* the problem is? Such information is obviously key to rapid problem resolution.

Understanding what kind of performance end-users experience, and --if it is poor-- identifying the source of the problem, are sufficiently important issues for e-commerce vendors that a new industry has been created to characterize and analyze end-user web-browsing performance. Keynote and Appliant are examples of companies providing these services to e-comm vendors.

Still, sophisticated tools to help diagnose these problems are not widely available even for common applications such as web browsing, much less for more esoteric apps or when advanced network transport capabilities are being used.

MOTIVATING EXAMPLES

Imagine sitting in your office and listening to a webcast of a lecture being given elsewhere on your enterprise campus. The results are terrible. Your immediate reaction is that there must be serious network problems somewhere on the campus net. However, on a whim, you happen to fire up a wireless-connected laptop sitting on your desk in addition to your primary machine. Imagine your surprise when you discover that the seminar "came through" just fine when using the laptop (connected via an 802.11b wireless access point on the same subnet as the desktop computer). This actually happened, and without the accidental laptop experiment, wasted effort would have been directed toward diagnosing the wrong system elements.

Another actual example: listening to streaming audio from a local radio station during three different periods of one afternoon. The first and third times it sounded great, but the second time large chunks of audio were missing. A quick ping and traceroute to the station's main address (who knows where the actual audio server was located) yielded nominal values. Should we then assume that the packet loss was due to server overload? With what confidence level could we say that?

EXISTING TOOLS/TECHNIQUES

What can be done to help isolate the source of such performance problems?

In general, the traditional approach to performance problem isolation is a manual process of elimination. The Finger Pointing Tool project seeks to understand whether the current methods can be improved upon, either by automating the "process of elimination" technique or by identifying new tools/techniques for distinguishing between network and end-system problems.

COMPLEXITY

Although the problem of isolating distributed system performance problems really is age-old (in Internet time), little progress has been made in the past decade. Worse, the network/distributed computing environment is becoming ever more complex and difficult to diagnose. Examples of this growing complexity include:

DISTRIBUTED SYSTEM ELEMENTS

Elements of "System under study" include:

PROBLEM ISOLATION GRANULARITY

In a perfect world, in the fullness of time, etc., it would be great to have a Finger Pointing Tool that could isolate performance problems to a very specific system element. Doing so will necessarily draw upon advances in the related areas of network measurement/analysis, OS design/instrumentation, and application design/instrumentation. However, for purposes of the initial Finger Pointing Tool project, we'll declare success if we can arrive at a method to rapidly and definitively discover whether the problem is within the client, the server, or the network.

To give a flavor of the complexity of the problem, consider the following: Given a (possibly asymmetric) path between a client and server, what questions might we ask concerning a perceived performance problem? First and foremost, we'd like to know whether it is a network problem or end-system problem or both...

More general questions include:

PROBLEM CLASSES

INFRASTRUCTURE vs. TRANSIENT PROBLEMS
(FIRST vs. SUBSEQUENT USE SCENARIOS)

"Has this application ever worked for you before?"

Sometimes a distributed system performance problem surfaces when someone tries to use an application for the first time, perhaps a first attempt at an H323 conference; other times the application has worked successfully in the past (in a similar scenario), yet at the moment there is a transient performance bottleneck somewhere.

For purposes of performance problem diagnosis/isolation, there is a very important distinction to be drawn between "first use" problems, and those that occur in subsequent use. When first attempting to use a distributed system application in a particular situation (i.e. in a particular hw/sw configuration, to a particular remote host, etc), a performance failure could literally be due to "anything". However, if one has successfully used the app in that situation previously, the problem could still be "anywhere", but one would expect the most likely source of difficulty would be the "dynamic" and usage-sensitive aspects of the system, e.g. server or link congestion, routing problems, etc. These dynamic or "transient" performance problems are particularly hard to track down, because in many cases, by the time all the necessary instrumentation has been marshalled, the problem has disappeared! Until it comes back...

Some of the subsequent-use performance problems will in fact be due to changes in the system infrastructure, e.g. new software installed on a router, or a server, or a client machine. In those cases, the problem will not go away "by itself"... someone will have to undo or repair whatever is broken, just as in resolving "first use" problems.

For purposes of this document we'll use the terms "infrastructure" and "transient" to distinguish these two classes of performance problems, fully realizing that these distinctions can get pretty fuzzy. If a router software bug results in periodic route-table corruption that "heals itself", is that an "infrastructure" performance problem or a "transient" performance problem? If a link or exchange point cannot accommodate traffic peaks around, for example, lunch time or dinner time, would that be an "infrastructure" or "transient" performance problem? In both cases, the correct answer is "Yes" --i.e. "both".

The value of making the "infrastructure" vs. "transient" distinction is in trying to eliminate possible sources of difficulty, by focusing attention on "what changed" since the application was working successfully... if it ever did. In practice, the relevant diagnostic tools and techniques for first-use/infrastructure problems may be quite different from those appropriate for subsequent-use/transient problems, though there will obviously be overlap.

With this perspective, one can imagine the following classes of FPT:

Think of this categorization as a way to characterize the problem space, even if it does not prove to be the best way to characterize the solution space.

Another categorization of tools is:

POSSIBILITIES

So what kind of tools would be useful in responding to the all-too-typical call to Network Operations: "Things seem to be very slow at the moment... is something wrong with the network?"

Rather than seeking a single "all-purpose" Finger Pointing Tool, it may be much more realistic to think in terms of integrating FPT capabilities within specific applications. But some stand-alone FPTs would also be worth considering.

The Packet Tracker

What if we could get a "packet-eye" view of what a typical flow experiences in its path between client and server (and back)? This idea builds on efforts in the network measurement/analysis field, but explicitly tries to provide information on both network delays and server delays.

To do this, it would be good to have the cooperation of router vendors, but some progress could be made even without it. For example, imagine that we could inject a uniquely identifiable packet into the network that would trigger recognizers on both input and output interfaces of all switches and routers in the path. When recognized, the router would record a timestamp. These timestamps would be gathered together to yield a timeline of the packet's progress (or of where it fell off the radar screen). Obviously, having router clocks synchronized would be very useful, but since latency on a link is highly predictable in comparison to latency within a switch or router, useful information could be inferred even if the clocks were not perfectly synchronized.

Note that it is important that such packets flow through the normal forwarding path of the router, and trigger timestamps on ingress and egress. At the same time, it would be useful to record at least the source address of the probe packets to disambiguate timestamp data, since more than one probe flow could exist simultaneously. Doing that within the primary forwarding path of a modern high-performance router probably implies some specialized hardware on the interface cards. On the other hand, one could also imagine a few strategically placed passive monitors that sat around watching for probe packets to go by, and recording timestamps (and at least source address) for each probe packet. This wouldn't give as complete a picture as when router interfaces did the monitoring, but it is perhaps a more realistic approach to deployment. The recording mechanisms could implement some form of rate limiting to guard against DoS attacks against this measurement service.
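
To make the passive-monitor variant a little more concrete, the following sketch shows one way a monitor might record probe sightings. It is only a sketch: it assumes the scapy packet-capture library, an arbitrary (hypothetical) probe convention of UDP datagrams sent to port 33434 with an identifier at the start of the payload, and a crude per-second counter standing in for the rate limiting mentioned above.

    # Passive probe monitor sketch: record (timestamp, source, probe id) for
    # each probe packet seen, so a central correlator can assemble a timeline.
    # The probe convention (UDP to port 33434, ASCII id in the payload) is an
    # assumption for illustration, not a defined protocol.
    import time
    from scapy.all import sniff, IP, Raw    # packet capture; needs privileges

    PROBE_PORT = 33434            # hypothetical probe port
    MAX_RECORDS_PER_SEC = 100     # crude rate limit, per the DoS concern above
    records = []
    window_start, window_count = time.time(), 0

    def record_probe(pkt):
        global window_start, window_count
        now = time.time()
        if now - window_start >= 1.0:            # start a new one-second window
            window_start, window_count = now, 0
        if window_count >= MAX_RECORDS_PER_SEC:  # silently drop excess probes
            return
        window_count += 1
        probe_id = bytes(pkt[Raw].load)[:16] if Raw in pkt else b""
        # pkt.time is the capture timestamp; the IP source disambiguates flows
        records.append((pkt.time, pkt[IP].src, probe_id))

    # Watch one interface; a real monitor would forward 'records' to a collector.
    sniff(filter=f"udp and dst port {PROBE_PORT}", prn=record_probe, store=0)

In a fuller design the records from several monitors would be merged, sorted by probe id and timestamp, and turned into the per-hop timeline described above.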

The game grid

Imagine a million network games or game platforms that reported in real-time their current throughput, packet loss and latency statistics to "network tomography" servers, as well as providing the user, upon request, with current network conditions reports. (The end user may not want to be bothered with such telemetry info, but it can be incredibly useful to the NOC when that end user calls to complain about network performance.)
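
A minimal sketch of what the client side of that telemetry might look like follows. The collector name, port, and report format are assumptions for illustration; the statistics themselves would come from the game's own transport layer rather than the placeholder shown here.

    # Game-client telemetry sketch: periodically ship current network statistics
    # to a "network tomography" collector, and keep the latest sample for local
    # display when the user asks about current network conditions.
    import json, socket, time

    COLLECTOR = ("tomography.example.net", 9100)   # hypothetical collector
    REPORT_INTERVAL = 30                           # seconds between reports

    def current_stats():
        # Placeholder: a real game would read these from its transport layer.
        return {"host": socket.gethostname(),
                "time": time.time(),
                "rtt_ms": 48.0, "loss_pct": 0.7, "kbit_per_s": 320.0}

    latest = None
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        latest = current_stats()       # what the UI shows if the user asks
        try:
            sock.sendto(json.dumps(latest).encode(), COLLECTOR)
        except OSError:
            pass                       # telemetry must never hurt the game itself
        time.sleep(REPORT_INTERVAL)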

The web perferator

Imagine browser enhancements or plugins or special javascript pages that could report information on web server and network performance. There are already commercial web server monitoring companies doing this sort of thing... but combining these approaches with some of the other possibilities identified in this section could lead to an enormously powerful diagnostic capability. As above, the idea here is both to report relevant info to coordinating performance analysis servers and to (optionally) make performance telemetry available to the end user, in a form that can be useful when that individual calls their local NOC for assistance with a performance problem.
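
To give a flavor of what such a reporter could measure, here is a standard-library sketch that decomposes a single web fetch into name-lookup, connect, server-wait, and transfer times. A large connect time points toward the network; a long wait before the first response byte points toward the server. The target host is just an example, and a real plugin would of course measure the user's actual page fetches.

    # Decompose one HTTP fetch into phases, to help separate network delay
    # (DNS, connect, transfer) from server delay (wait for the first byte).
    import socket, time

    host, path, port = "www.example.com", "/", 80     # example target

    t0 = time.time()
    ip = socket.gethostbyname(host)                    # name lookup
    t_dns = time.time()

    s = socket.create_connection((ip, port), timeout=10)
    t_conn = time.time()

    s.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    first = s.recv(1)                                  # wait for first byte
    t_first = time.time()

    body = first
    while chunk := s.recv(65536):                      # drain the response
        body += chunk
    t_done = time.time()
    s.close()

    print(f"dns      {1000*(t_dns - t0):7.1f} ms")
    print(f"connect  {1000*(t_conn - t_dns):7.1f} ms")    # roughly one RTT
    print(f"server   {1000*(t_first - t_conn):7.1f} ms")  # server think time
    print(f"transfer {1000*(t_done - t_first):7.1f} ms  ({len(body)} bytes)")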

"perf@Home"

Imagine a network performance instrumentation tool implemented as a screen saver along the lines of "SETI@Home". Except that instead of looking for ET, this tool looks for distributed system performance problems. It periodically (at a low background rate) performs certain fundamental operations, such as DNS lookup, ping to a few strategically-located beacon hosts, etc. Here again, results can be transmitted to a network tomography/correlation server and also made available to the end user (and thence to the local NOC trying to assist the end user).
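
A sketch of the background measurement loop such a tool might run is shown below, with a couple of hypothetical beacon hosts. TCP connect time is used as a rough stand-in for ping, since ICMP echo normally requires raw-socket privileges; the choice of target port (here the TCP echo port) is likewise an assumption about what the beacons offer.

    # perf@Home-style background loop: at a deliberately low rate, time a DNS
    # lookup and a TCP connect to each beacon host; keep results locally and,
    # in a fuller version, forward them to a correlation server.
    import socket, time

    BEACONS = ["beacon1.example.edu", "beacon2.example.edu"]   # hypothetical
    PROBE_PERIOD = 300            # seconds between probe rounds
    history = []                  # (timestamp, beacon, dns_ms, connect_ms)

    def probe(beacon):
        t0 = time.time()
        ip = socket.gethostbyname(beacon)                  # DNS lookup time
        t1 = time.time()
        s = socket.create_connection((ip, 7), timeout=5)   # assumes TCP echo
        t2 = time.time()
        s.close()
        return (t0, beacon, 1000*(t1 - t0), 1000*(t2 - t1))

    while True:
        for b in BEACONS:
            try:
                history.append(probe(b))
            except OSError:
                history.append((time.time(), b, None, None))  # record failures
        time.sleep(PROBE_PERIOD)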

nocmeeting

Imagine an H323 client that could be configured to beam telemetry data to a coordinating "network tomography" server and/or offer the user detailed information on current network and server conditions.

Reference Servers

Imagine a flock of diagnostic or reference servers sprinkled around the network that could help applications or stand-alone "Finger Pointing" tools determine the source of performance trouble.
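
One very simple service such a reference server could offer is to echo probe datagrams back with its own receive timestamp, letting a client apportion round-trip time between the outbound and return directions (given roughly synchronized clocks). A minimal sketch, with the port number chosen arbitrarily:

    # Reference-server sketch: echo each probe datagram back to the sender
    # with the server's receive timestamp appended.
    import socket, time

    PORT = 9200                    # hypothetical reference-server port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    while True:
        data, client = sock.recvfrom(2048)
        reply = data + b" recv=" + f"{time.time():.6f}".encode()
        sock.sendto(reply, client)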

APPLICATION DESIGN

In the preceding examples, a recurring theme is the notion of normal end-user applications containing their own performance diagnostic capabilities. On the road toward the FPT grail, there will inevitably be suggestions for application developers, up to and including incorporating full FPT capabilities within the application.

At a minimum, applications should always give their users good feedback on what they are trying to do at the moment. For example, what phase of activity is currently underway:

Performance problems can occur in any stage or phase of operation. To diagnose performance problems, it is essential to know what an app is trying to do when it fails/slows down.
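
One lightweight way to provide that feedback is for the application to keep (and display) a running record of its current phase, so that when things slow down the user or the NOC can see exactly which stage is taking the time. A minimal sketch follows; the phase names are purely illustrative, not a prescribed set.

    # Phase-tracker sketch: wrap each stage of a network operation so the
    # application can show what it is doing and how long each stage took.
    import time
    from contextlib import contextmanager

    phase_log = []           # (phase name, seconds spent)
    current_phase = None     # what the app is doing right now (for the UI)

    @contextmanager
    def phase(name):
        global current_phase
        current_phase = name              # e.g. shown in a status bar
        start = time.time()
        try:
            yield
        finally:
            phase_log.append((name, time.time() - start))
            current_phase = None

    # Illustrative use; real stages might be name lookup, connect,
    # authenticate, transfer, render, and so on.
    with phase("name lookup"):
        time.sleep(0.05)
    with phase("connecting"):
        time.sleep(0.10)
    with phase("transferring data"):
        time.sleep(0.25)

    for name, secs in phase_log:
        print(f"{name:20s} {1000*secs:7.1f} ms")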

CFP: E2E Performance Problem Isolation (PPI) Study Team

GOAL

The purpose of this effort is to improve the tools and techniques available for diagnosing and isolating performance problems experienced by network-application users. In particular, one specific outcome of interest would be the development of "Finger-Pointing" tools and techniques to discriminate between performance problems due to the network and those due to the end-systems.

RELATIONSHIP TO OTHER I2 "E2E PI" EFFORTS

Obviously, a successful Finger Pointing Tool and/or diagnostic technique will draw heavily on companion I2 E2E Performance Initiative efforts in the network measurement/analysis, OS design/instrumentation, application design/instrumentation, and operational support areas. The exact boundaries of responsibility have yet to be drawn, but the FPT project is intended to both draw upon and integrate some of these activities as well as try to automate some of the current manual diagnostic procedures.

The reason this activity is part of the end-to-end/application part of the initiative and not part of the "operational support" part is twofold: first, the latter is focused on establishing the human infrastructure needed for getting help and resolution for performance problems, and therefore the FPT project can be viewed as trying to identify tools and techniques that would feed into that operational support infrastructure; second, the FPT project has an explicit goal of creating (or causing to be created!) some automated tools to facilitate performance problem isolation. This may or may not prove to be achievable, or might be overtaken by events in other areas, but if feasible, the development activity would fall outside the scope of the operational support area.

The key element that distinguishes this effort from those of the E2EPI Network Working Group is the focus on being able to distinguish between network-induced performance problems and other kinds. Clearly this effort will draw upon the work of the E2E-PI Network and Host/OS Working Groups.

STEPS

To achieve these goals, we propose the following steps:

PHASE I: Analysis and Design

1. Call For Participation for the PPI Study Team: a small group of highly qualified and motivated people interested in working on this problem. I2 community members and interested vendors are encouraged to apply.

2. Call For Participation for submission of Success Stories, Best Current Practice whitepapers, and proposals for Finger Pointing Tools from the I2 community.

3. Assessment of submissions by PPI Study Team, resulting in summary paper that identifies common themes, exemplary practices and tools, and provides a foundation for defining a suitable set of Finger-Pointing tools and techniques.

4. PPI Study Team defines requirements and approaches for Finger Pointing tools and techniques, coordinating closely with the Network and Host/OS Working Groups. It is expected that several classes of FPT will be identified:

PHASE II: Implementation and Evaluation

5. Since deployment of some form of "Reference Server" or "beacon host" is likely to be a key part of the solution, arrange for same and/or coordinate with other E2EPI teams who may be doing similar things.

6. Arrange for a Rapid Prototype of key ideas, if they have not already been implemented, either via resources available to PPI Study Team members or via a separate CFP.

7. Evaluate results of rapid prototyping efforts; identify next steps.

Right now performance problem isolation is usually done manually, by a process of elimination. It is not yet clear to what extent this activity can be automated, nor the granularity of resolution that may be possible. But the problem is urgent; we must try and see how far we get.

