Log in

No account? Create an account
20 July 2010 @ 11:04 pm

(This is the last in a series of five posts. See part 1, part 2, part 3, part 4.)

As usual, when you spend three days hunting a bug, it turns out to be an embarrassingly simple fix. In this case, the issue was that the VLAN driver called skb_cow(skb, skb_headroom(skb)), which expands the skb to have at least the amount of headroom specified in its second parameter if necessary, but also at least NET_SKB_PAD (32 bytes). Since the result of skb_headroom(skb) was simply passed in to skb_cow(), the intended effect must be the latter, to ensure that the VLAN driver has sufficient room to perform the header manipulations it performs immediately afterward.

Thus, this causes skb_cow() to call pskb_expand_head() if and only if the skb as delivered from the ethernet driver has less than 32 bytes of headroom. In its turn, pskb_expand_head() expands buffers and adjusts the offsets recorded in the skb accordingly, including csum_start, which is in a union with csum. It does so even if the skb is intended to have csum rather than csum_start, hence the bug. This bug can only be triggered if the following conditions hold:

  • The ethernet hardware driver delivers packets with less than NET_SKB_PAD bytes of headroom (the SunHME driver, only for packets larger than RX_COPY_THRESHOLD).
  • The ethernet hardware computes checksums on received packets, so that the skbs it produces have ip_summed set to CHECKSUM_COMPLETE and a valid csum field that will be checked later.
  • The VLAN driver is involved in the receive code path; the header manipulations to make the frames received at VLAN devices look like normal ethernet frames are unusual. Most configurations will not need to modify headers in the receive path and so will never call pskb_expand_head().

The fix is to modify pskb_expand_head() to only adjust csum_start if it, rather than csum is meaningful; that is, if ip_summed is CHECKSUM_PARTIAL. Patch follows.

--- linux-2.6.34/net/core/skbuff.c      2010-05-16 14:17:36.000000000 -0700
+++ linux-2.6.34-skb-csum/net/core/skbuff.c     2010-07-20 22:02:00.000000000 -0700
@@ -854,7 +854,9 @@
        skb->network_header   += off;
        if (skb_mac_header_was_set(skb))
                skb->mac_header += off;
-       skb->csum_start       += nhead;
+       /* Only adjust this if it actually is csum_start rather than csum */
+       if ( skb->ip_summed == CHECKSUM_PARTIAL )
+               skb->csum_start += nhead;
        skb->cloned   = 0;
        skb->hdr_len  = 0;
        skb->nohdr    = 0;

20 July 2010 @ 05:50 pm

(This is the fourth in a series of five posts. See part 1, part 2, part 3, part 5.)

The checksum problem is happening in the call to vlan_check_reorder_header() on line 203 of net/8021q/vlan_dev.c. The checksum field is 0xffffffff before the call and 0x001fffff after it [1]. The only difference between the packets I can see so far is that the ones at or below RX_COPY_THRESHOLD, from the non-buggy code path of the sunhme driver, have 52 bytes of headroom and the problematic ones have only 20 bytes.

[1] Checksums in IP and related protocols are 16-bit one's-complement sums, but the kernel uses 32-bit unsigned integers to store them and perform all operations in unsigned arithmetic for efficiency, and then folds down to the real checksums by repeatedly adding to the upper 16 bits to the lower 16 bits to count all the overflows, and then complementing them. These values thus fold to 0x0000 (correct checksum for a valid packet) and 0xffe0 (incorrect checksum).

Current Music: Siouxsie and the Banshees - Slowdive

(This is the third in a series of five posts. See part 1, part 2, part 4, part 5.)

The checksum test in nf_conntrack can be disabled with sysctl -w net.netfilter.nf_conntrack_checksum=0 (see, for example, the net->ct.sysctl_checksum test on line 185 of net/ipv4/netfilter/nf_conntrack_proto_icmp.c), but this just moves the error elsewhere:

br0: hw csum failure.
Call Trace:
 [00000000006be93c] icmp_rcv+0x1a0/0x324
 [0000000000695578] ip_local_deliver_finish+0x1b8/0x2a0
 [00000000006950a4] ip_rcv_finish+0x3a4/0x3d0
 [000000000066eaa0] netif_receive_skb+0x618/0x640
 [0000000010271160] br_nf_pre_routing_finish+0x334/0x348 [bridge]
 [00000000102719e4] br_nf_pre_routing+0x870/0x894 [bridge]
 [000000000068e0cc] nf_iterate+0x34/0x90
 [000000000068e2b0] nf_hook_slow+0x4c/0xec
 [000000001026c2e8] br_handle_frame+0x24c/0x2d0 [bridge]
 [000000000066e92c] netif_receive_skb+0x4a4/0x640
 [000000000066eb38] process_backlog+0x70/0xc4
 [000000000066ed70] net_rx_action+0x98/0x188
 [000000000045ab84] __do_softirq+0x80/0x110
 [0000000000429d60] do_softirq+0x54/0x80
 [000000000045a870] irq_exit+0x38/0x90
 [0000000000429e60] handler_irq+0xd4/0xec

A checksum test similar to the one in tcp_error/udp_error/icmp_error can be found on line 1000 of net/ipv4/icmp.c. Fundamentally, this is triggered by calling __skb_checksum_complete() on packets which the SunHME driver produced which were longer than RX_COPY_THRESHOLD. Next, I think I'll see if I can move the bug earlier in processing by adding such a test to the VLAN driver after the skb_pull_rcsum() call, which is the first point in processing at which the IP packet is available sans any ethernet or 802.1q headers.

18 July 2010 @ 11:26 pm

(This is the second in a series of five posts. See part 1, part 3, part 4, part 5.)

By instrumenting the VLAN driver code with debug output, I have demonstrated that the checksum computed for the offending packets immediately after the call to skb_pull_rcsum are all 0xffff; i.e., the checksum for only the part of the packet from the IP header onward is exactly what the checksum for a valid IP packet should be. The skbs created by the SunHME driver for packets larger than RX_COPY_THRESHOLD must differ from the ones created by said driver for smaller packets in such a way that causes the checksum to become corrupted only further on in the processing chain. Alignment, perhaps? This could be either a SunHME driver bug or a bug somewhere else, it seems.

18 July 2010 @ 01:02 am

(This is the first in a series of five posts. See part 2, part 3, part 4, part 5.)

I have been stalking a very elusive kernel bug:

Using Linux kernel 2.6.34 on a Sun Ultra 10 (cronus.persephoneslair.org) with a PCI Quad-Fast Ethernet card and the sunhme driver, there seems to be a subtle problem with VLANs. In my particular config, the second port of the QFE board, eth2, is connected via a cross-over cable to a 3Com 8760 Wireless AP. This AP supports several SSIDs, all on their own VLAN, and puts itself on VLAN 1. I have several such SSIDs, offering unfiltered access to my network with 802.1X authentication, or rate-limited internet access with communication to other machines on my network firewalled as for outside machines on an open network, and so on. Finally, VLAN 1, which provides publicly routable addresses with unfiltered communication to my internal network, is bridged to a wired ethernet segment also connected to cronus.

Thus, the configuration script on cronus, with irrelevant portions elided, looks something like this:

ip link set eth2 up
# Create eth2.1 and eth2.2  for VLANs
vconfig add eth2 1
vconfig add eth2 2
ip link set eth1 up
ip link set eth2.1 up
# Create bridge device
brctl addbr br0
brctl addif br0 eth1
brctl addif br0 eth2.1
# br0 is the public IP block
ip link set br0 up
ip addr add dev br0
# eth2.2 for non-authenticated wireless clients
ip link set eth2.2 up
ip addr add dev eth2.2

Now, the bug: I recently got around to doing some unrelated kernel work needed due to recent HDLC layer changes, and upgraded this machine from kernel to kernel 2.6.34. Promptly after finishing those changes and booting the new kernel, I noticed frequent messages in the logs saying "br0: hw csum failure", followed by a stack trace:

Call Trace:
 [00000000101e5c88] tcp_error+0x178/0x280 [nf_conntrack]
 [00000000101e24c4] nf_conntrack_in+0xe0/0x674 [nf_conntrack]
 [000000000068e0cc] nf_iterate+0x34/0x90
 [000000000068e2b0] nf_hook_slow+0x4c/0xec
 [00000000102719d0] br_nf_pre_routing+0x85c/0x894 [bridge]
 [000000000068e0cc] nf_iterate+0x34/0x90
 [000000000068e2b0] nf_hook_slow+0x4c/0xec
 [000000001026c2e8] br_handle_frame+0x24c/0x2d0 [bridge]
 [000000000066e92c] netif_receive_skb+0x4a4/0x640
 [000000000066eb38] process_backlog+0x70/0xc4
 [000000000066ed70] net_rx_action+0x98/0x188
 [000000000045ab84] __do_softirq+0x80/0x110
 [0000000000429d60] do_softirq+0x54/0x80
 [000000000045a870] irq_exit+0x38/0x90
 [0000000000429e60] handler_irq+0xd4/0xec
 [00000000004208b4] tl0_irq5+0x14/0x20

Despite the association with br0 and the presence of br_handle_frame in the stack trace, I did not suspect any particular association with bridging; a few similar messages were present for eth2.2, which was not bridged. As we shall see, this suspicion was correct.

The key to this is the call to nf_conntrack_in; on line 833 of net/netfilter/nf_conntrack_core.c in this function, there is a call to l4proto->error() to check for a protocol-specific error; in this case, that's a function pointer to tcp_error. Then, if it returns a status code <= 0, it stops checking for further errors and takes the negative of that code as its own return value, indicating how to handle the packet.

The checksum itself is tested in tcp_error on line 798 of net/netfilter/nf_conntrack_proto_tcp.c; if the checksum test fails, that function returns -NF_ACCEPT, so the connection tracking code ignores the error and accepts the packet as-is. This explains one feature of the bug: despite the profusion of log messages, traffic flows unimpeded aside from the performance impact of so many syslog writes.

The actual test for the checksum flows up to nf_ip_checksum on line 168 of net/ipv4/netfilter.c. On line 193 it calls __skb_checksum_complete(), which in turn verifies the checksum and, on line 644 of net/core/datagram.c, calls netdev_rx_csum_fault() in the case that the packet was marked with CHECKSUM_COMPLETE and the existing checksum is inconsistent. CHECKSUM_COMPLETE indicates that the checksum of the received data was computed in hardware. Finally, netdev_rx_csum_fault emits the message.

So, what could be causing these bad checksums? The Sun HME is a somewhat unusual ethernet device in that it can compute IP checksums in hardware; why is it doing so incorrectly in this case? Especially, it only seems to have a problem when used with VLANs. That's a significant clue; there is code in the VLAN driver (at line 172 of net/8021q/vlan_dev.c) to call skb_pull_rcsum to trim off the VLAN header from the start of the packet and update the checksum, so the checksum must be invalid from the start somehow, but only when VLANs are active.

Fortunately, another clue turned up: when pinging across the interface in question, only packets above a certain size threshold trigger the error messages. ping -s 210 worked normally, but ping -s 211 caused the "hw csum failure" log messages, with a higher mean ping time and standard deviation (due to the delay of writing the log message). Similar commands on a different IP belonging to a device not attached through the VLAN both caused no error messages.

Now, those are very interesting numbers. 210 bytes of ping payload plus 8 bytes of ICMP header plus 20 bytes of IP header plus 4 bytes of VLAN tag plus 14 bytes of ethernet header makes exactly 256 bytes, a highly suspicious value. The error only occurs if the total length of the ethernet frame as received on the wire is larger than 256 bytes.

The SunHME driver does quite a bit of cleverness aligning its receive buffers in ways which might be prone to malfunctioning in the VLAN case, which I don't quite fully understand yet. See line 1154 of drivers/net/sunhme.c. It also has an interesting optimization when handling receive interrupts, where if the size of the incoming packet is larger than a parameter RX_COPY_THRESHOLD, it allocates a new SKB, does appropriate DMA manipulations, and gives it to the card as a new receive buffer, while keeping the packet in its original buffer to return to the network layer code. If it's smaller than that value, it copies the contents of the buffer into a new SKB and passes that to network layer, keeping the buffer the packet was originally received in as a reusable receive buffer for the device. This appears to be intended to trade off the time needed to copy the data against the time needed for those DMA manipulations; if the packet is too small the time savings of not copying aren't worth it.

And now, the smoking gun: the value of RX_COPY_THRESHOLD, defined at line 372 of drivers/net/sunhme.h, is 256 bytes. In a quick experiment, changing this to 512 bytes also changed the maximum ping size which avoided the bug to 466 bytes. Clearly, the 'clever' code path which avoids copying the packet is somehow the cause of this. I'll have to investigate further tomorrow, beginning with why it doesn't occur in kernel

23 January 2008 @ 03:19 am

It's been nine days since my last post about this. The prom console thing worked, but it exposed another problem: smp_boot_one_cpu() was hanging trying to start the other processors. Well, to be precise, it was calling prom_startcpu(), and that was returning just fine, but the next printk() after that would output one character and then hang.

I immediately suspected locking issues; the prom console acquires and releases prom_entry_lock once per character, to avoid holding it too long, so I suspected it was managing to grab it right away, get one character out (which would take a little under a millisecond on a 9600 bps serial terminal), and then release it and try to reacquire it for the next character, and have trouble there and end up spinning forever.

It turned out I was right, but I ended up investigating quite a few false starts before I found the problem. I really needed to be able to trace what the second processor startup code in trampoline.S was up to, and it wasn't as easy as just throwing some calls to printk() in, since it held prom_entry_lock for much of its execution (kernel spinlocks hang if you try to recursively acquire them), and it didn't have little things like a stack, or even a TLB, set up in some of the places I needed to trace.

Eventually, I devised a mechanism to allow trampoline.S running on cpu1 to signal to a waiting loop on cpu0, and make sure it released prom_entry_lock so cpu0 could use printk() to give me debug output. I was able to use this mechanism earlier this evening to trace the exact location of the hang down to when cpu1 calls SUNW,set-trap-table here. Notice how it does not hold prom_entry_lock when it makes that call. Yes, after nine days of debugging, the problem turned out to be as simple as something not being locked when it should. Both processors were probably trying to use the same buffer simultaneously to call the PROM, and stomping all over each other and probably both getting thrown off into hyperspace. With the appropriate locking and unlocking code added to trampoline.S, the kernel successfully starts all 11 additional processors and gets all the way up to trying to mount the root filesystem. I'll be working on getting my initramfs image ready tomorrow evening.

For the record, all three patches against stock 2.6.23 that I needed to get this kernel booting are reproduced here.

Kernel patches behind the cutCollapse )

And, with that, it's 3:41 AM and I'm going to bed.

Current Mood: accomplishedaccomplished