NDC Logical Firewall - Choosing Hardware

We've been developing and testing the logical firewall on a Dell Dimension 4100 with a 1GHz Pentium-III CPU, a 1.44MB 3.5" floppy drive, 256MB RAM, and a 3Com 3c905C-TX 10/100 Mbit/sec network interface card. See also The LFW with Gigabit Ethernet below.

The following sections cover aspects of choosing hardware that should be considered:

The LFW with Gigabit Ethernet

Running Gibraltar 0.99.5 on one processor of a borrowed Dell PowerEdge 2650 (a 2.2GHz Pentium 4 Xeon with built-in Broadcom BCM5701 Gigabit Ethernet on a 64-bit, 133MHz PCI-X bus), we were able to saturate a single Gigabit Ethernet interface bi-directionally at 35% CPU utilization, forwarding about 82,000 1500-byte packets/sec (the logical firewall configuration). In the physical firewall configuration, we were able to saturate both interfaces bi-directionally, forwarding about 164,000 1500-byte packets/sec at about 67% CPU utilization. Using small packets (128 bytes), we were CPU limited, forwarding about 211,000 packets/sec.

Our research indicates that a fast and wide PCI bus is necessary to achieve good gigabit ethernet performance.
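
One quick way to check this on a candidate machine is to look at what lspci reports for the NIC (a minimal sketch; the bus address 02:01.0 is only an example, and the exact output varies by lspci version and hardware):

    # list the NICs, then inspect the bus capabilities of the one of interest
    lspci | grep -i ethernet
    lspci -vv -s 02:01.0 | grep -i -e '66MHz' -e 'PCI-X'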

Note: Gibraltar 0.99.5 failed to automatically load a driver for the Broadcom 5701 NICs; however, a suitable driver (tg3) is on the CDROM. To enable it, type:

    echo tg3 >>/etc/modules; modprobe tg3; /etc/init.d/networking restart
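
To confirm that the driver loaded and the interfaces are now visible, a quick check is:

    lsmod | grep tg3        # the tg3 module should be listed
    ifconfig -a             # the gigabit interfaces should now appear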

The Logical Firewall Running on Virtual Hardware (VMware Player)

The Logical Firewall can run under VMware, including the free VMware Player. This is most easily accomplished by using one of the two pre-configured virtual machines offered here. See The Logical Firewall under VMware for details.

With VMware, the host PC can be running either Windows or Linux, and one host PC can run any number of distinct Logical Firewalls, subject only to the CPU, RAM, and disk limitations of the host. The protection offered by a Logical Firewall running under VMware (on an uncompromised host) is exactly the same as that offered by the firewall running on real hardware -- it extends even to the physical host running VMware and to other VMware guests. For example, a Windows host and other Windows guests can all be protected as clients of a Logical Firewall running under VMware on the same physical PC.

NDC Logical Firewall - Filtering Bridge Performance

To try to answer the question "what would happen if we put a large fraction of our campus behind a filtering bridge firewall?", I did experiments intended to answer these more specific questions:

  1. How does the firewall perform if there are a very large number of states in the ip_conntrack state table?

  2. What would happen in a Slammer-like attack, where most packets do not take a short-cut through the iptables rules by virtue of being part of a "connected" state?

  3. What is the maximum throughput a variation 4 filtering bridge firewall can sustain?

Since network and PCI bus bandwidth are predictable firewall bottlenecks, the real challenge is measuring CPU utilization. The most accurate way I know to measure CPU use is to measure idle CPU cycles by consuming them at low scheduling priority and noting how long it takes to get a unit of work done. I did this on the firewall with "nice -20 idleproc" (idleproc.c is listed at the end of this page) and noted that it agreed with the CPU utilization more conveniently reported by "vmstat 1".
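
For example (a minimal sketch, assuming idleproc.c from the end of this page has been copied to the firewall):

    # build the idle-cycle meter, run it at the lowest scheduling priority,
    # and watch vmstat alongside it for comparison
    gcc -o idleproc idleproc.c
    nice -20 ./idleproc &
    vmstat 1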

To get a predictable traffic load to measure, I generated a 100Mbit/sec stream of UDP traffic through the filtering bridge using:

    tcpblast -u  -d 0 -p 9999 -s 1024 dest-host 200000
On dest-host, I received it with:
    nc -l -p 9999 -u > /dev/null
The number of packets received on dest-host was obtained by running the following (on dest-host) before and after the test:
    netstat -s | sed -n '/^Udp:/ { ; N; p; }'
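
A small helper makes the before/after subtraction less error-prone (a sketch only; the name udp-count.sh is hypothetical and the awk field positions assume the usual net-tools "netstat -s" output):

    #!/bin/sh
    # udp-count.sh -- print the UDP "packets received" counter;
    # run it once before and once after the test and subtract the two values
    netstat -s | awk '/^Udp:/ {getline; print $1 " packets received"; exit}'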

Approximately 11,000 packets/sec consumed about 33% of a 1GHz Intel P3 running Gibraltar 0.99.7a with a small set of rules produced by the variation 4 rule generator. Since virtually all packets sent were received, the throughput was close to full wire-speed (11,000 packets/sec of 1024-byte payloads is roughly 90Mbit/sec before headers, against a 100Mbit/sec wire), and the test was repeatable, the measurement is considered valid.

How does the firewall perform if there are a very large number of states in the ip_conntrack state table?

To answer this question, a large number of SYN_RECEIVED states were created in the ip_conntrack state table by running this script on the firewall:

    #!/bin/sh
    # Fill the ip_conntrack table with roughly 200,000 half-open connection
    # entries by sending one forged SYN per loop iteration with nemesis-tcp.
    echo 2097152 > /proc/sys/net/ipv4/ip_conntrack_max
    COUNT=200000
    while [ $COUNT -gt 0 ] ;do
      let "HIGH=($COUNT/253)%253+1"
      let "LOW=$COUNT%253+1"
      let "PORT1=$RANDOM%64000+1"
      let "PORT2=$RANDOM%64000+1"
      nemesis-tcp -fS -S 10.1.$LOW.$HIGH -D 10.2.$HIGH.$LOW -x $PORT1 -y $PORT2
    # nemesis-udp     -S 10.1.$LOW.$HIGH -D 10.2.$HIGH.$LOW -x $PORT1 -y $PORT2
      let "COUNT=$COUNT-1"
      case $LOW in 1) echo $COUNT;; esac       # progress indicator
    done
The script generated over 190,000 state table entries, spread across about 64,000 unique source and destination IP addresses and about 64,000 different ports.

Since the script must fork and run "nemesis-tcp" once for each packet (which makes it slow), it is necessary to increase the conntrack timeout for SYN_SENT from the default of 2 minutes to 20 minutes so that earlier entries do not time out before the script finishes. This can be done (on Gibraltar 0.99.6a) with this command:

    FILE=/proc/sys/net/ipv4/ip_conntrack_tcp_timeouts; awk '{$3 = 1200; print}' < $FILE > $FILE
    # FILE=/proc/sys/net/ipv4/ip_conntrack_udp_timeouts; awk '{$1 = 1200; print}' < $FILE > $FILE
The state created by the script above can be examined in /proc/net/ip_conntrack.
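
The number and type of entries can be checked quickly (a sketch; the exact line format of /proc/net/ip_conntrack varies with kernel version):

    wc -l < /proc/net/ip_conntrack              # total entries
    grep -c UNREPLIED /proc/net/ip_conntrack    # half-open entries
    head -3 /proc/net/ip_conntrack              # eyeball a few entries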

There was no measurable difference in CPU use for the 100Mb UDP test stream with 190,000 TCP SYN_RECEIVED entries in the ip_conntrack state table. The test was repeated (using the two lines commented out above) with identical results after creating 190,000 UNREPLIED UDP entries (and modifying the appropriate UDP state timeout).

What would happen in a Slammer-like attack, where most packets do not take a short-cut through the iptables rules by virtue of being part of a "connected" state?

To answer this question, dummy firewall rules were inserted (into the "mangle" table's PREROUTING chain) which would be tested (but never matched) before the test for "connected" state. This script was used to insert the dummy rules:

    #!/bin/sh
    # Insert 500 dummy rules at the head of the mangle-table PREROUTING chain
    # so that every packet is tested against them before the state match.
    COUNT=500
    while [ $COUNT -gt 0 ] ;do
      let "HIGH=$COUNT/253+1"
      let "LOW=$COUNT%253+1"
      iptables -t mangle -I PREROUTING 1 -p tcp -s 1.1.1.$HIGH -d 1.1.1.$LOW -j ACCEPT
      let "COUNT=$COUNT-1"
    done
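
The inserted rules can be counted to confirm they are in place and flushed after the test (a sketch; note that flushing mangle PREROUTING removes any other rules there too):

    iptables -t mangle -L PREROUTING -n | wc -l    # count the rules
    iptables -t mangle -F PREROUTING               # remove them when done
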
Testing each packet against these additional 500 iptables rules caused the 100Mb UDP test stream to consume about twice as many CPU cycles as it did without the rules. This shows that a DoS attack (or a port scan) can have a somewhat greater impact on firewall CPU use than normal connected traffic flow, and that the impact depends on how large the firewall's ruleset is, or more specifically, on how early in the ruleset the unwanted traffic can be excluded.

What is the maximum throughput a variation 4 filtering bridge firewall can sustain?

To answer this question, we ran Gibraltar 0.99.7a on a Dell PowerEdge 2650 with a single 3.06GHz Pentium 4 Xeon processor and two built-in Broadcom Gigabit Ethernet adapters (on a 64-bit, 133MHz PCI-X bus). Using a "SmartBits" network load tester, we found that for maximum-size packets (1518 bytes) with a minimal ruleset, the firewall was able to keep up at 95% of full bi-directional gigabit network speed, about 80,000 packets/sec (40,000 each way). For small (128-byte) packets, the firewall was CPU limited at about 220,000 packets/sec. Enabling/disabling "hyperthreading" had no measurable effect (not surprising, since Gibraltar uses a uniprocessor Linux kernel).

Note that the bridging performance of Gibraltar appears to be somewhat less than its routing performance; put another way, the 2.2GHz processor we tested earlier as a routing firewall performed slightly better than the 3.06GHz processor we just tested as a bridging firewall.


idleproc.c

    #include <stdio.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/time.h>
    /*
     * idleproc.c - a tool to measure idle CPU cycles by consuming them.
     * Corey Satten 5/2000
     */

    volatile long ctr;
    volatile long sec;

    void catch(int s) {
        sec = ctr;                      /* alarm fired: capture the loop count */
        }

    int main() {
        long i, j, f, t;
        double use;
        struct timeval tv1,tv2;
        struct timezone tz1,tz2;

        /* calibrate: count how many loop iterations complete in one second */
        signal(SIGALRM, catch);
        alarm(1);
        gettimeofday(&tv1, &tz1);
        while (sec == 0) ++ctr;
        gettimeofday(&tv2, &tz2);
        t = (tv2.tv_sec-tv1.tv_sec)*1000000 + (tv2.tv_usec - tv1.tv_usec);
        sec = sec * 500000.0 / t;       /* iterations per idle half-second */

        /* repeatedly burn a half-second's worth of idle iterations and see how
           much wall-clock time that actually took; the excess is CPU consumed
           by other processes.  (The report format below is approximate.) */
        for (f=0; ;++f) {
            tv1 = tv2;
            for (i=0; i<sec; ++i)
                ;
            gettimeofday(&tv2, &tz2);
            t = (tv2.tv_sec-tv1.tv_sec)*1000000 + (tv2.tv_usec - tv1.tv_usec);
            use = 100.0 * (1.0 - 500000.0/t);   /* percent CPU used by others */
            printf("%3.0f%% ", use);
            for (j=use; j-- > 0) putchar('-');
            printf("|\n");
            tv1 = tv2;
            }
        }

Changes Made to tcpblast.c

16c16
< , verstr[30]="FreeBSD + rzm ";
---
> , verstr[80]="FreeBSD + rzm ";
84c84
<       fprintf(stderr, "nblocks        number of blocks (1..9999)\n");
---
>       fprintf(stderr, "nblocks        number of blocks (1..999999)\n");
163c163
<               case 'p': strncpy(port, optarg, strlen(port)-1);        break;
---
>               case 'p': strncpy(port, optarg, sizeof(port)-1);        break;
197,198c197,198
<         if (nblocks<=0 || nblocks>=10000) {
<               fprintf(stderr, "%s: 1 < nblocks <= 9999 \n", argv[0]);
---
>         if (nblocks<=0 || nblocks>=1000000) {
>               fprintf(stderr, "%s: 1 < nblocks <= 999999 \n", argv[0]);

Corey Satten
Email -- corey @ u.washington.edu
Web -- http://staff.washington.edu/corey/
Date -- Mon Jan 28 12:25:56 PST 2008