DECAF: Dataflow Event Correlation for Anomalies and Faults

aka the "Anti-Netflow" Service


T. Gray
24 April 2003
Rev 3

INTRODUCTION

This note proposes a new network monitoring capability aimed at helping operations personnel diagnose transient performance problems. Ideally, it would be implemented in all packet forwarding devices, including routers, switches, and perimeter firewalls --anything that has the potential to drop packets before they arrive at their intended destination.

A DECAF-enabled device would provide to a network management system (NMS) sufficient information about dropped packets to permit correlation of loss events with reports of performance problems from a specific user.
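
Purely for illustration (exact record formats are out of scope, as noted below), here is a minimal sketch of the kind of per-drop record a DECAF-enabled device might emit. All field names and types are assumptions, not a proposed format:

    # Illustrative sketch only: hypothetical field names, not a proposed format.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LossEvent:
        timestamp: float            # when the drop was observed (device clock)
        device: str                 # reporting router/switch/firewall
        interface: str              # interface on which the drop occurred
        reason: str                 # loss category (congestion, CRC error, ACL, ...)
        src_addr: Optional[str]     # flow source, if the packet header was recoverable
        dst_addr: Optional[str]     # flow destination, if recoverable
        protocol: Optional[int]     # IP protocol number, if recoverable
        src_port: Optional[int]     # transport ports, if recoverable
        dst_port: Optional[int]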

This is a concept piece. Details of exact record formats or exactly how the packet loss data is transmitted to the network management system are beyond the present scope. Leveraging related work (e.g. the IETF IPFIX working group) would be desirable.

NB: the phrase "Anti-Netflow" service is not a pejorative aimed at Cisco's Netflow traffic statistics facility; rather, it is intended to suggest a complementary tool that reports on the opposite case of traffic *not* flowing.

BACKGROUND

Diagnosing transient performance problems continues to be difficult for network operations personnel, and tools to assist are few and far between.

This has been the case for far too long. Over a decade ago, the idea of having routers notice when distinguished probe packets pass through, and log when that happened, was proposed to a major router vendor in various Technical Advisory Group meetings. Correlation of data from multiple routers in the path could give a "packet's eye view" of path conditions for the artificial traffic flow. Alas, nothing came of those conversations.

More recently, Van Jacobson (in a private conversation) observed that routers tend to give us lots of data about their successes, but very little about their failures (i.e. about packets that are not successfully forwarded). This suggests a more focused, and more easily implemented, strategy for equipping the NOC with tools suitable for diagnosing transient performance problems.

Some packet loss is a necessary part of TCP's congestion-avoidance (rate-adjustment) mechanism, and this makes it difficult to assess how much packet loss constitutes a user-perceptible problem. Packet loss leads to actual loss in UDP streams, but in TCP, it leads to retransmission --and thus to delayed streams, or "slow network syndrome". Causes of loss include link congestion, noise-induced errors (esp. on wireless links), partial link hardware failures, link configuration errors (e.g. duplex mismatch), and inadequate internal router resources (e.g. buffer starvation). And let us not forget packet loss due to access control lists or firewall policies, which could conceivably be blocking more traffic than intended.

When transient performance problems are reported, the NOC is called upon to either vindicate the network (and thus point the finger at end-system configuration or overload), or track down the cause of user-perceived delays or loss. They need better tools to do this, especially since the problem report may not come in until after the symptoms have abated.

It is widely agreed that packets dropped in the network, which in the case of TCP are retransmitted after some delay, represent the dominant cause of transient network performance problems. Delay of packets while traversing network devices is *not* the biggest problem to solve in this space. Nor are routing config problems, which while potentially subtle and hard to identify, are at least persistent until corrected.

Accordingly, knowing where packet loss is occurring is crucial, and obviously fundamental to eliminating it. But today, routers generally only provide aggregated, non-flow-specific loss statistics via SNMP polling, information which is poorly suited to the problem of real-time performance assessment and debugging. In order to be helpful in real-time diagnosis, the data must be very timely and it should, if possible, include source/destination information on the disrupted packet flows, so that the loss events can immediately be correlated with current problem reports.

Note that temporarily "misplaced" packets due to transient routing anomalies look a lot like lost packets and can certainly be responsible for user-visible performance problems. Thus it would be great if a tool aimed at correlating user-observed anomalies with device-observed anomalies could include this class of problem as well.

CASES

Packet loss scenarios vary both in terms of loss volume and event frequency. As a very simplistic way of thinking about the issue, consider the following four points in the problem space:

                 Case 1   Case 2   Case 3   Case 4

    VOLUME        LOW      HIGH     LOW      HIGH
    FREQUENCY     LOW      LOW      HIGH     HIGH

The easiest scenario to resolve is high volumes of lost packets continuing until fixed (Case 4). The hardest is very low volumes of loss that do not repeat (Case 1). It's not clear how often Case 1 occurs in contemporary networks, but even if that scenario is rare, there is much that could be done to improve the debugging situation for Cases 2 and 3.

CONCEPT

What if packet loss (or "misplacement") in a network device were considered an exception condition that triggered an immediate alarm and/or a log entry containing as much identifying information as possible for the dropped packets?

Hypothesis: Such information would permit network operations staff to rapidly correlate problem reports with actual evidence of packet loss at a particular time and place in the network. This would be useful both for real-time debugging and also for ex-post-facto correlation of problem reports with packet loss records.
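
To make the hypothesized correlation step concrete, here is a rough sketch of how an NMS might match a problem report against collected loss events. The record structures and the five-minute window are assumptions for illustration; real matching would also have to consider the path the user's traffic actually takes:

    # Sketch of NMS-side correlation; structures and window size are assumptions.
    from dataclasses import dataclass

    @dataclass
    class ProblemReport:
        reported_at: float    # when the user says the slowness occurred
        user_addr: str        # the user's IP address

    @dataclass
    class LossEvent:
        timestamp: float
        device: str
        reason: str
        src_addr: str
        dst_addr: str

    def correlate(report: ProblemReport, events: list[LossEvent],
                  window: float = 300.0) -> list[LossEvent]:
        """Return loss events near the reported time that involve the user's address."""
        return [e for e in events
                if abs(e.timestamp - report.reported_at) <= window
                and report.user_addr in (e.src_addr, e.dst_addr)]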

CATEGORIES

Once we have a mechanism in place to tell the NMS where packet loss is occurring, we will inevitably want to know why as well. Hence event classification is important. Moreover, the amount of detail that one might find useful will likely vary with the type of anomaly.

Some useful categories of loss that a network device could distinguish include:

  1. congestion drops (queue or buffer overflow on an outbound link),
  2. corrupted packets (noise-induced errors, especially on wireless links),
  3. drops caused by partial link hardware failures,
  4. drops caused by link configuration errors (e.g. duplex mismatch),
  5. drops due to inadequate internal device resources (e.g. buffer starvation),
  6. policy drops (access control lists or firewall rules), and
  7. "misplaced" packets due to transient routing anomalies.

For routers, consider adding to this list some form of route flap anomaly information. While this info is usually already available via existing NMS tools, it may make sense to include it in the same diagnostic stream as the explicit loss event data for easier correlation.
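
One way a device might carry the classification is as an enumerated drop-reason code in each event record. The names and granularity below are illustrative assumptions only:

    # Illustrative drop-reason codes; names and granularity are assumptions.
    from enum import Enum, auto

    class DropReason(Enum):
        CONGESTION = auto()           # queue/buffer overflow on an output link
        CORRUPTION = auto()           # checksum/CRC failure, e.g. noisy wireless link
        LINK_HARDWARE = auto()        # partial link hardware failure
        LINK_CONFIG = auto()          # configuration error such as duplex mismatch
        RESOURCE_EXHAUSTION = auto()  # internal buffer starvation, etc.
        POLICY = auto()               # ACL or firewall rule
        MISROUTED = auto()            # transient routing anomaly ("misplaced" packet)
        ROUTE_FLAP = auto()           # route flap information folded in (routers only)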

ISSUES

At least three problems must be overcome for this concept to fly:

  1. information overload,
  2. lack of information about certain classes of dropped packets, and
  3. keeping event reporting from harming packet forwarding.

Problem 1. Although we'd like to consider packet drops to be rare exceptions, sometimes massive numbers of packets are dropped in a short period of time, depending on the nature of the malfunction. Information overload can certainly occur if network management tools need to absorb and correlate huge masses of lost packet data relating to a single failure. This suggests that the concept of logging or sending individual events for each lost packet is only feasible and desirable if the rate is low.

To manage the information overload problem, two parallel diagnostic streams are proposed. One would be a continuous low-volume stream of aggregated loss event data. The other would be a per-event stream of detailed event records that would only be sent to the NMS if the event frequency was below a specified threshold.

The desired threshold for cessation of individual event records and the parameters for aggregated packet loss reports will surely be different for different classes of packet loss. Hence the need to set thresholds by event class. A case can also be made for a way to tell the device "I don't care about that particular class of error."
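
As a rough sketch of the two parallel streams and the per-class thresholds, the following assumes illustrative class names, threshold values, and a one-second rate window; a real implementation would of course live inside the device and use whatever transport is chosen for the diagnostic streams:

    # Sketch of dual-stream reporting with per-class thresholds (all values assumed).
    import time
    from collections import defaultdict

    # Events/second above which per-event records for a class are suppressed.
    PER_EVENT_THRESHOLD = {"CONGESTION": 10.0, "POLICY": 50.0, "CORRUPTION": 5.0}

    class DecafReporter:
        def __init__(self, aggregate_interval: float = 60.0):
            self.aggregate_interval = aggregate_interval
            self.counts = defaultdict(int)   # loss class -> drops this interval
            self.recent = defaultdict(list)  # loss class -> recent event timestamps

        def record_drop(self, loss_class: str, event: dict) -> None:
            now = time.time()
            self.counts[loss_class] += 1

            # Estimate this class's event rate over the last second.
            recent = [t for t in self.recent[loss_class] if now - t < 1.0]
            recent.append(now)
            self.recent[loss_class] = recent

            # Per-event stream: emitted only while the rate is below the threshold.
            if len(recent) <= PER_EVENT_THRESHOLD.get(loss_class, 1.0):
                self.send_event_record(loss_class, event)

        def flush_aggregates(self) -> None:
            # Aggregated stream: low-volume summary, sent by a timer every
            # aggregate_interval seconds regardless of the per-event stream.
            self.send_aggregate_record(dict(self.counts))
            self.counts.clear()

        def send_event_record(self, loss_class, event):   # placeholder transport
            print("EVENT", loss_class, event)

        def send_aggregate_record(self, counts):          # placeholder transport
            print("AGGREGATE", counts)

In a scheme like this, setting a class's threshold to zero would serve as the "I don't care about that particular class of error" control for the per-event stream.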

Since the whole idea of this exercise is to allow the NOC to correlate a fresh problem report with device data from the (recent, perhaps extremely recent) past, pre-configuring network devices to filter out events in order to manage the data-volume problem would only be useful for planned path tests, not for reaction to random user problems. However, controlled experiments aimed at diagnosing performance problems are a Good Thing, and thus the DECAF concept might reasonably be extended to include device-based filtering of anomaly event data. More analysis is required to determine the cost-benefit tradeoffs of device-based (rather than NMS-based) event filtering.

Recall that our goal is to provide a new diagnostic facility that tries to help diagnose transient performance problems, and therefore focuses on the location and cause of packet loss in a flow. To achieve this goal, the device event diagnostic information must be both timely and contain sufficient data to permit correlation with user problem reports. Obviously the aggregated diagnostic stream will not be as timely as the per-event streams need to be. But even the less-timely aggregated diagnostic streams should contain some flow source/destination information to permit correlation with user problem reports.

Problem 2. Information for certain types of errors may be scarce; for example, the MAC-layer interface chips that handle frame recognition may not allow any bits from a corrupted packet to be saved for logging or diagnosis. Rather, they may only signal to other system elements that a corrupt packet has been detected. While detailed info about the packet source or destination may not be available, it should be possible to generate a dummy event record with the error class and a timestamp. Perhaps the timestamp could be supplemented with current routing table info to indicate what address ranges are relevant to the interface in question.
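
A small sketch of the degraded-record idea: when no bits of the offending packet survive, the device could still emit an event carrying the error class, a timestamp, and the address ranges currently routed over the affected interface (drawn from its routing table). Field names are assumptions:

    # Sketch of a minimal event when nothing from the dropped packet is recoverable.
    import time

    def degraded_event(loss_class: str, interface: str,
                       routed_prefixes: list[str]) -> dict:
        """Build an event with no flow details: only class, time, and interface context."""
        return {
            "class": loss_class,                 # e.g. "CORRUPTION"
            "timestamp": time.time(),
            "interface": interface,
            "routed_prefixes": routed_prefixes,  # address ranges from the routing table
            "flow": None,                        # no source/destination available
        }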

Problem 3. In addition to the issues of information overload and insufficient information (problems we seek to solve in pursuit of the timely and specific packet loss info needed for real-time diagnosis of transient performance problems), we must also make sure that the primary mission of the network device is protected. Even while trying to provide info on the exception cases of lost or misplaced packets, the router/switch/whatever must continue to forward as many packets as it can. Some attention must therefore be given to safety-fuse strategies, so that the device doesn't become so busy reporting on the exceptions that it can no longer do its primary job of forwarding packets. We don't want to see a new category of packet loss called "lost because too busy reporting on the previous loss".
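
One possible safety-fuse strategy, sketched here as a simple token bucket that caps how many event records the device will emit per second; when the bucket is empty, drops are still counted in the aggregated stream but no per-event records are generated. The rate and burst numbers are assumptions:

    # Sketch of a reporting "safety fuse" (token bucket); parameters are assumptions.
    import time

    class ReportingFuse:
        def __init__(self, max_events_per_sec: float = 100.0, burst: float = 200.0):
            self.rate = max_events_per_sec
            self.capacity = burst
            self.tokens = burst
            self.last = time.time()

        def allow(self) -> bool:
            now = time.time()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False   # suppress the event record; aggregates still count the drop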

FEASIBILITY

With modified ASICs, routers/switches could do even more to help NOCs resolve these difficult transient performance problem reports. For example, hardware recognizers could notice distinguished probe packets and timestamp when they entered and exited the device.

In contrast, this proposal is intended to be something that vendors could implement via software upgrade in most designs. However, participation from the key router/switch/firewall vendors is needed to validate that assumption and refine the concept.

SUMMARY

Having network devices do a better job of reporting cases of packet loss is proposed as an important step in equipping NOCs with the info they need to diagnose transient network performance problems.

This note offers an outline of such a facility and seeks input and participation from network device vendors in refining and implementing the concept.

REFERENCES

Netflow data format, version 5:
http://www.caida.org/tools/measurement/cflowd/configuration/configuration-9.html

The IP Flow Information Export (ipfix) IETF working group:
http://www.ietf.org/html.charters/ipfix-charter.html

ACKNOWLEDGEMENTS

Thanks to Van Jacobson for stimulating some of these ideas, and to David Richardson and Steve Corbato, my principal sounding boards for wacky networking ideas, and the other members of the Internet2 End2End Performance Initiative technical advisory group for their feedback and encouragement.

