"Finger Pointing Tools" for Isolating Distributed System Performance Problems

Terry Gray
Document in progress
28 Feb 2001

INTRODUCTION

It's an age-old problem. Someone calls the Network Operations Center and asks whether there is something wrong with the network, because a particular network-based application is running slowly. Where is the problem? Is it the network? Is it an overloaded server? Is it the client computer? Clearly, what the world really needs is a tool or technique to rapidly determine which element(s) of the system is/are responsible for the performance problem. In support of the Internet2 "End to End Performance Initiative", this document focuses on the quest for just such a tool or technique set. We will refer to this quest as the "Finger Pointing Tool" project. "FPT" for short.

Gray's First Law of Computing Happiness is that "the key to high performance (and world peace) is reducing contention for shared resources." However, distributed computing almost always involves a shared network and shared servers. The FPT project seeks to find ways to distinguish contention-induced performance problems from other sources, and to distinguish server-based delays from network-based delays.

IMPORTANCE

It's never been easy to determine whether reports of "slow response" are due to network problems or end-system problems. And if the network is in fact the culprit, how do we figure out in real-time exactly *where* the problem is? Such information is obviously key to rapid problem resolution.

Understanding what kind of performance end-users experience, and --if it is poor-- identifying the source of the problem, are sufficiently important issues for e-commerce vendors that a new industry has been created to characterize and analyze end-user web-browsing performance. Keynote and Appliant are examples of companies providing these services to e-comm vendors.

Still, sophisticated tools to help diagnose these problems are not widely available even for common applications such as web browsing, much less for more esoteric apps or when advanced network transport capabilities are being used.

MOTIVATING EXAMPLES

Imagine sitting in your office and listening to a webcast of a lecture being given elsewhere on your enterprise campus. The results are terrible. Your immediate reaction is that there must be serious network problems somewhere on the campus net. However, on a whim, you happen to fire up a wireless-connected laptop sitting on your desk in addition to your primary machine. Imagine your surprise when you discover that the seminar "came through" just fine when using the laptop (connected via an 802.11b wireless access point on the same subnet as the desktop computer). This actually happened, and without the accidental laptop experiment, wasted effort would have been directed toward diagnosing the wrong system elements.

Another actual example: listening to streaming audio from a local radio station during three different periods of one afternoon. The first and third times it sounded great, but the second time large chunks of audio were missing. A quick ping and traceroute to the station's main address (who knows where the actual audio server was located) yielded nominal values. Should we then assume that the packet loss was due to server overload? With what confidence level could we say that?

EXISTING TOOLS/TECHNIQUES

What can be done to help isolate the source of such performance problems?

In general, the traditional approach to performance problem isolation is a manual process of elimination. The Finger Pointing Tool project seeks to understand whether the current methods can be improved upon, either by automating the "process of elimination" technique or by identifying new tools/techniques for distinguishing between network and end-system problems.

COMPLEXITY

Although the problem of isolating distributed system performance problems really is age-old (in Internet time), little progress has been made in the past decade. Worse, the network/distributed computing environment is becoming ever more complex and difficult to diagnose. Examples of this growing complexity include:

DISTRIBUTED SYSTEM ELEMENTS

Elements of "System under study" include:

PROBLEM ISOLATION GRANULARITY

In a perfect world, in the fullness of time, etc., it would be great to have a Finger Pointing Tool that could isolate performance problems to a very specific system element. Doing so will necessarily draw upon advances in the related areas of network measurement/analysis, OS design/instrumentation, and application design/instrumentation. However, for purposes of the initial Finger Pointing Tool project, we'll declare success if we can arrive at a method to rapidly and definitively discover whether the problem is within the client, the server, or the network.

To give a flavor of the complexity of the problem, consider the following: Given a (possibly asymmetric) path between a client and server, what questions might we ask concerning a perceived performance problem? First and foremost, we'd like to know whether it is a network problem or end-system problem or both...

More general questions include:

PROBLEM CLASSES

INFRASTRUCTURE vs. TRANSIENT PROBLEMS
(FIRST vs. SUBSEQUENT USE SCENARIOS)

"Has this application ever worked for you before?"

Sometimes a distributed system performance problem surfaces when someone tries to use an application for the first time, perhaps a first attempt at an H323 conference; other times the application has worked successfully in the past (in a similar scenario), yet at the moment there is a transient performance bottleneck somewhere.

For purposes of performance problem diagnosis/isolation, there is a very important distinction to be drawn between "first use" problems, and those that occur in subsequent use. When first attempting to use a distributed system application in a particular situation (i.e. in a particular hw/sw configuration, to a particular remote host, etc), a performance failure could literally be due to "anything". However, if one has successfully used the app in that situation previously, the problem could still be "anywhere", but one would expect the most likely source of difficulty would be the "dynamic" and usage-sensitive aspects of the system, e.g. server or link congestion, routing problems, etc. These dynamic or "transient" performance problems are particularly hard to track down, because in many cases, by the time all the necessary instrumentation has been marshalled, the problem has disappeared! Until it comes back...

Some of the subsequent-use performance problems will in fact be due to changes in the system infrastructure, e.g. new software installed on a router, or a server, or a client machine. In those cases, the problem will not go away "by itself"... someone will have to undo or repair whatever is broken, just as in resolving "first use" problems.

For purposes of this document we'll use the terms "infrastructure" and "transient" to distinguish these two classes of performance problems, fully realizing that these distinctions can get pretty fuzzy. If a router software bug results in periodic route-table corruption that "heals itself", is that an "infrastructure" performance problem or a "transient" performance problem? If a link or exchange point cannot accommodate traffic peaks around, for example, lunch time or dinner time, would that be an "infrastructure" or "transient" performance problem? In both cases, the correct answer is "Yes" --i.e. "both".

The value of making the "infrastructure" vs. "transient" distinction is in trying to eliminate possible sources of difficulty, by focusing attention on "what changed" since the application was working successfully... if it ever did. In practice, the relevant diagnostic tools and techniques for first-use/infrastructure problems may be quite different from those appropriate for subsequent-use/transient problems, though there will obviously be overlap.

With this perspective, one can imagine the following classes of FPT:

Think of this categorization as a way to characterize the problem space, even if it does not prove to be the best way to characterize the solution space.

Another categorization of tools is:

POSSIBILITIES

So what kind of tools would be useful in responding to the all-too-typical call to Network Operations: "Things seem to be very slow at the moment... is something wrong with the network?"

Rather than seeking a single "all-purpose" Finger Pointing Tool, it may be much more realistic to think in terms of integrating FPT capabilities within specific applications. But some stand-alone FPTs would also be worth considering.

The Packet Tracker

What if we could get a "packet-eye" view of what a typical flow experiences in its path between client and server (and back)? This idea builds on efforts in the network measurement/analysis field, but explicitly tries to provide information on both network delays and server delays.

To do this, it would be good to have the cooperation of router vendors, but some progress could be made even without it. For example, imagine that we could inject a uniquely identifiable packet into the network that would trigger recognizers on both input and output interfaces of all switches and routers in the path. When recognized, the router would record a timestamp. These timestamps would be gathered together to yield a timeline of the packet's progress (or of where it fell off the radar screen). Obviously, having router clocks synchronized would be very useful, but since latency on a link is highly predictable in comparison to latency within a switch or router, useful information could be inferred even if the clocks were not perfectly synchronized.

Note that it is important that such packets flow through the normal forwarding path of the router, and trigger timestamps on ingress and egress. At the same time, it would be useful to record at least the source address of the probe packets to disambiguate timestamp data, since more than one probe flow could exist simultaneously. Doing that within the primary forwarding path of a modern high-performance router probably implies some specialized hardware on the interface cards. On the other hand, one could also imagine a few strategically placed passive monitors that sat around watching for probe packets to go by, and recording timestamps (and at least source address) for each probe packet. This wouldn't give as complete a picture as when router interfaces did the monitoring, but it is perhaps a more realistic approach to deployment. The recording mechanisms could implement some form of rate limiting to guard against DoS attacks against this measurement service.
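
To make the passive-monitor variant a little more concrete, the following sketch shows one way a monitor might record probe sightings. It is only a sketch: it assumes the scapy packet-capture library, an arbitrary (hypothetical) probe convention of UDP datagrams sent to port 33434 with an identifier at the start of the payload, and a crude per-second counter standing in for the rate limiting mentioned above.

    # Passive probe monitor sketch: record (timestamp, source, probe id) for
    # each probe packet seen, so a central correlator can assemble a timeline.
    # The probe convention (UDP to port 33434, ASCII id in the payload) is an
    # assumption for illustration, not a defined protocol.
    import time
    from scapy.all import sniff, IP, Raw    # packet capture; needs privileges

    PROBE_PORT = 33434            # hypothetical probe port
    MAX_RECORDS_PER_SEC = 100     # crude rate limit, per the DoS concern above
    records = []
    window_start, window_count = time.time(), 0

    def record_probe(pkt):
        global window_start, window_count
        now = time.time()
        if now - window_start >= 1.0:            # start a new one-second window
            window_start, window_count = now, 0
        if window_count >= MAX_RECORDS_PER_SEC:  # silently drop excess probes
            return
        window_count += 1
        probe_id = bytes(pkt[Raw].load)[:16] if Raw in pkt else b""
        # pkt.time is the capture timestamp; the IP source disambiguates flows
        records.append((pkt.time, pkt[IP].src, probe_id))

    # Watch one interface; a real monitor would forward 'records' to a collector.
    sniff(filter=f"udp and dst port {PROBE_PORT}", prn=record_probe, store=0)

In a fuller design the records from several monitors would be merged, sorted by probe id and timestamp, and turned into the per-hop timeline described above.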

The game grid

Imagine a million network games or game platforms that reported in real-time their current throughput, packet loss and latency statistics to "network tomography" servers, as well as providing the user, upon request, with current network conditions reports. (The end user may not want to be bothered with such telemetry info, but it can be incredibly useful to the NOC when that end user calls to complain about network performance.)
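
A minimal sketch of what the client side of that telemetry might look like follows. The collector name, port, and report format are assumptions for illustration; the statistics themselves would come from the game's own transport layer rather than the placeholder shown here.

    # Game-client telemetry sketch: periodically ship current network statistics
    # to a "network tomography" collector, and keep the latest sample for local
    # display when the user asks about current network conditions.
    import json, socket, time

    COLLECTOR = ("tomography.example.net", 9100)   # hypothetical collector
    REPORT_INTERVAL = 30                           # seconds between reports

    def current_stats():
        # Placeholder: a real game would read these from its transport layer.
        return {"host": socket.gethostname(),
                "time": time.time(),
                "rtt_ms": 48.0, "loss_pct": 0.7, "kbit_per_s": 320.0}

    latest = None
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        latest = current_stats()       # what the UI shows if the user asks
        try:
            sock.sendto(json.dumps(latest).encode(), COLLECTOR)
        except OSError:
            pass                       # telemetry must never hurt the game itself
        time.sleep(REPORT_INTERVAL)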

The web perferator

Imagine browser enhancements or plugins or special javascript pages that could report information on web server and network performance. There are already commercial web server monitoring companies doing this sort of thing... but combining these approaches with some of the other possibilities identified in this section could lead to an enormously powerful diagnostic capability. As above, the idea here is both to report relevant info to coordinating performance analysis servers and to (optionally) make performance telemetry available to the end user, in a form that can be useful when that individual calls their local NOC for assistance with a performance problem.
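
To give a flavor of what such a reporter could measure, here is a standard-library sketch that decomposes a single web fetch into name-lookup, connect, server-wait, and transfer times. A large connect time points toward the network; a long wait before the first response byte points toward the server. The target host is just an example, and a real plugin would of course measure the user's actual page fetches.

    # Decompose one HTTP fetch into phases, to help separate network delay
    # (DNS, connect, transfer) from server delay (wait for the first byte).
    import socket, time

    host, path, port = "www.example.com", "/", 80     # example target

    t0 = time.time()
    ip = socket.gethostbyname(host)                    # name lookup
    t_dns = time.time()

    s = socket.create_connection((ip, port), timeout=10)
    t_conn = time.time()

    s.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    first = s.recv(1)                                  # wait for first byte
    t_first = time.time()

    body = first
    while chunk := s.recv(65536):                      # drain the response
        body += chunk
    t_done = time.time()
    s.close()

    print(f"dns      {1000*(t_dns - t0):7.1f} ms")
    print(f"connect  {1000*(t_conn - t_dns):7.1f} ms")    # roughly one RTT
    print(f"server   {1000*(t_first - t_conn):7.1f} ms")  # server think time
    print(f"transfer {1000*(t_done - t_first):7.1f} ms  ({len(body)} bytes)")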

"perf@Home"

Imagine a network performance instrumentation tool implemented as a screen saver along the lines of "SETI@Home". Except that instead of looking for ET, this tool looks for distributed system performance problems. It periodically (at a low background rate) performs certain fundamental operations, such as DNS lookup, ping to a few strategically-located beacon hosts, etc. Here again, results can be transmitted to a network tomography/correlation server and also made available to the end user (and thence to the local NOC trying to assist the end user).
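
A sketch of the background measurement loop such a tool might run is shown below, with a couple of hypothetical beacon hosts. TCP connect time is used as a rough stand-in for ping, since ICMP echo normally requires raw-socket privileges; the choice of target port (here the TCP echo port) is likewise an assumption about what the beacons offer.

    # perf@Home-style background loop: at a deliberately low rate, time a DNS
    # lookup and a TCP connect to each beacon host; keep results locally and,
    # in a fuller version, forward them to a correlation server.
    import socket, time

    BEACONS = ["beacon1.example.edu", "beacon2.example.edu"]   # hypothetical
    PROBE_PERIOD = 300            # seconds between probe rounds
    history = []                  # (timestamp, beacon, dns_ms, connect_ms)

    def probe(beacon):
        t0 = time.time()
        ip = socket.gethostbyname(beacon)                  # DNS lookup time
        t1 = time.time()
        s = socket.create_connection((ip, 7), timeout=5)   # assumes TCP echo
        t2 = time.time()
        s.close()
        return (t0, beacon, 1000*(t1 - t0), 1000*(t2 - t1))

    while True:
        for b in BEACONS:
            try:
                history.append(probe(b))
            except OSError:
                history.append((time.time(), b, None, None))  # record failures
        time.sleep(PROBE_PERIOD)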

nocmeeting

Imagine an H323 client that could be configured to beam telemetry data to a coordinating "network tomography" server and/or offer the user detailed information on current network and server conditions.

Reference Servers

Imagine a flock of diagnostic or reference servers sprinkled around the network that could help applications or stand-alone "Finger Pointing" tools determine the source of performance trouble.
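
One very simple service such a reference server could offer is to echo probe datagrams back with its own receive timestamp, letting a client apportion round-trip time between the outbound and return directions (given roughly synchronized clocks). A minimal sketch, with the port number chosen arbitrarily:

    # Reference-server sketch: echo each probe datagram back to the sender
    # with the server's receive timestamp appended.
    import socket, time

    PORT = 9200                    # hypothetical reference-server port
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    while True:
        data, client = sock.recvfrom(2048)
        reply = data + b" recv=" + f"{time.time():.6f}".encode()
        sock.sendto(reply, client)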

APPLICATION DESIGN

In the preceding examples, a recurring theme is the notion of normal end-user applications containing their own performance diagnostic capabilities. On the road toward the FPT grail, there will inevitably be suggestions for application developers, up to and including incorporating full FPT capabilities within the application.

At a minimum, applications should always give their users good feedback on what they are trying to do at the moment. For example, what phase of activity is currently underway:

Performance problems can occur in any stage or phase of operation. To diagnose performance problems, it is essential to know what an app is trying to do when it fails/slows down.
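
One lightweight way to provide that feedback is for the application to keep (and display) a running record of its current phase, so that when things slow down the user or the NOC can see exactly which stage is taking the time. A minimal sketch follows; the phase names are purely illustrative, not a prescribed set.

    # Phase-tracker sketch: wrap each stage of a network operation so the
    # application can show what it is doing and how long each stage took.
    import time
    from contextlib import contextmanager

    phase_log = []           # (phase name, seconds spent)
    current_phase = None     # what the app is doing right now (for the UI)

    @contextmanager
    def phase(name):
        global current_phase
        current_phase = name              # e.g. shown in a status bar
        start = time.time()
        try:
            yield
        finally:
            phase_log.append((name, time.time() - start))
            current_phase = None

    # Illustrative use; real stages might be name lookup, connect,
    # authenticate, transfer, render, and so on.
    with phase("name lookup"):
        time.sleep(0.05)
    with phase("connecting"):
        time.sleep(0.10)
    with phase("transferring data"):
        time.sleep(0.25)

    for name, secs in phase_log:
        print(f"{name:20s} {1000*secs:7.1f} ms")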

CFP: E2E Performance Problem Isolation (PPI) Study Team

GOAL

The purpose of this effort is to improve the tools and techniques available for diagnosing and isolating performance problems experienced by network-application users. In particular, one specific outcome of interest would be the development of "Finger-Pointing" tools and techniques to discriminate between performance problems due to the network and those due to the end-systems.

RELATIONSHIP TO OTHER I2 "E2E PI" EFFORTS

Obviously, a successful Finger Pointing Tool and/or diagnostic technique will draw heavily on companion I2 E2E Performance Initiative efforts in the network measurement/analysis, OS design/instrumentation, application design/instrumentation, and operational support areas. The exact boundaries of responsibility have yet to be drawn, but the FPT project is intended to both draw upon and integrate some of these activities as well as try to automate some of the current manual diagnostic procedures.

The reason this activity is part of the end-to-end/application part of the initiative and not part of the "operational support" part is twofold: first, the latter is focused on establishing the human infrastructure needed for getting help and resolution for performance problems, and therefore the FPT project can be viewed as trying to identify tools and techniques that would feed into that operational support infrastructure; second, the FPT project has an explicit goal of creating (or causing to be created!) some automated tools to facilitate performance problem isolation. This may or may not prove to be achievable, or might be overtaken by events in other areas, but if feasible, the development activity would fall outside the scope of the operational support area.

The key element that distinguishes this effort from those of the E2EPI Network Working Group is the focus on being able to distinguish between network-induced performance problems and other kinds. Clearly this effort will draw upon the work of the E2E-PI Network and Host/OS Working Groups.

STEPS

To achieve these goals, we propose the following steps:

PHASE I: Analysis and Design

1. Call For Participation for the PPI Study Team: a small group of highly qualified and motivated people interested in working on this problem. I2 community members and interested vendors are encouraged to apply.

2. Call For Participation for submission of Success Stories, Best Current Practice whitepapers, and proposals for Finger Pointing Tools from the I2 community.

3. Assessment of submissions by PPI Study Team, resulting in summary paper that identifies common themes, exemplary practices and tools, and provides a foundation for defining a suitable set of Finger-Pointing tools and techniques.

4. PPI Study Team defines requirements and approaches for Finger Pointing tools and techniques, coordinating closely with the Network and Host/OS Working Groups. It is expected that several classes of FPT will be identified:

PHASE II: Implementation and Evaluation

5. Since deployment of some form of "Reference Server" or "beacon host" is likely to be a key part of the solution, arrange for same and/or coordinate with other E2EPI teams who may be doing similar things.

6. Arrange for a Rapid Prototype of key ideas, if they have not already been implemented, either via resources available to PPI Study Team members or via a separate CFP.

7. Evaluate results of rapid prototyping efforts; identify next steps.

Right now performance problem isolation is usually done manually, by a process of elimination. It is not yet clear to what extent this activity can be automated, nor the granularity of resolution that may be possible. But the problem is urgent; we must try and see how far we get.

