(This is the first in a series of five posts. See part 2, part 3, part 4, part 5.)
I have been stalking a very elusive kernel bug:
Using Linux kernel 2.6.34 on a Sun Ultra 10 (cronus.persephoneslair.org) with a PCI Quad-Fast Ethernet card and the sunhme driver, there seems to be a subtle problem with VLANs. In my particular config, the second port of the QFE board, eth2, is connected via a cross-over cable to a 3Com 8760 Wireless AP. This AP supports several SSIDs, all on their own VLAN, and puts itself on VLAN 1. I have several such SSIDs, offering unfiltered access to my network with 802.1X authentication, or rate-limited internet access with communication to other machines on my network firewalled as for outside machines on an open network, and so on. Finally, VLAN 1, which provides publicly routable addresses with unfiltered communication to my internal network, is bridged to a wired ethernet segment also connected to cronus.
Thus, the configuration script on cronus, with irrelevant portions elided, looks something like this:
ip link set eth2 up
# Create eth2.1 and eth2.2 for VLANs
vconfig add eth2 1
vconfig add eth2 2
ip link set eth1 up
ip link set eth2.1 up
# Create bridge device
brctl addbr br0
brctl addif br0 eth1
brctl addif br0 eth2.1
# br0 is the public IP block
ip link set br0 up
ip addr add 18.104.22.168/27 dev br0
# eth2.2 for non-authenticated wireless clients
ip link set eth2.2 up
ip addr add 172.27.240.1/24 dev eth2.2
Now, the bug: I recently got around to doing some unrelated kernel work needed due to recent HDLC layer changes, and upgraded this machine from kernel 22.214.171.124 to kernel 2.6.34. Promptly after finishing those changes and booting the new kernel, I noticed frequent messages in the logs saying "br0: hw csum failure", followed by a stack trace:
[00000000101e5c88] tcp_error+0x178/0x280 [nf_conntrack]
[00000000101e24c4] nf_conntrack_in+0xe0/0x674 [nf_conntrack]
[00000000102719d0] br_nf_pre_routing+0x85c/0x894 [bridge]
[000000001026c2e8] br_handle_frame+0x24c/0x2d0 [bridge]
Despite the association with br0 and the presence of br_handle_frame in the stack trace, I did not suspect any particular association with bridging; a few similar messages were present for eth2.2, which was not bridged. As we shall see, this suspicion was correct.
The key to this is the call to nf_conntrack_in; on line 833 of net/netfilter/nf_conntrack_core.c in this function, there is a call to l4proto->error() to check for a protocol-specific error; in this case, that's a function pointer to tcp_error. Then, if it returns a status code <= 0, it stops checking for further errors and takes the negative of that code as its own return value, indicating how to handle the packet.
The checksum itself is tested in tcp_error on line 798 of net/netfilter/nf_conntrack_proto_tcp.c; if the checksum test fails, that function returns -NF_ACCEPT, so the connection tracking code ignores the error and accepts the packet as-is. This explains one feature of the bug: despite the profusion of log messages, traffic flows unimpeded aside from the performance impact of so many syslog writes.
The actual test for the checksum flows up to nf_ip_checksum on line 168 of net/ipv4/netfilter.c. On line 193 it calls __skb_checksum_complete(), which in turn verifies the checksum and, on line 644 of net/core/datagram.c, calls netdev_rx_csum_fault() in the case that the packet was marked with CHECKSUM_COMPLETE and the existing checksum is inconsistent. CHECKSUM_COMPLETE indicates that the checksum of the received data was computed in hardware. Finally, netdev_rx_csum_fault emits the message.
So, what could be causing these bad checksums? The Sun HME is a somewhat unusual ethernet device in that it can compute IP checksums in hardware; why is it doing so incorrectly in this case? Especially, it only seems to have a problem when used with VLANs. That's a significant clue; there is code in the VLAN driver (at line 172 of net/8021q/vlan_dev.c) to call skb_pull_rcsum to trim off the VLAN header from the start of the packet and update the checksum, so the checksum must be invalid from the start somehow, but only when VLANs are active.
Fortunately, another clue turned up: when pinging across the interface in question, only packets above a certain size threshold trigger the error messages. ping -s 210 126.96.36.199 worked normally, but ping -s 211 188.8.131.52 caused the "hw csum failure" log messages, with a higher mean ping time and standard deviation (due to the delay of writing the log message). Similar commands on a different IP belonging to a device not attached through the VLAN both caused no error messages.
Now, those are very interesting numbers. 210 bytes of ping payload plus 8 bytes of ICMP header plus 20 bytes of IP header plus 4 bytes of VLAN tag plus 14 bytes of ethernet header makes exactly 256 bytes, a highly suspicious value. The error only occurs if the total length of the ethernet frame as received on the wire is larger than 256 bytes.
The SunHME driver does quite a bit of cleverness aligning its receive buffers in ways which might be prone to malfunctioning in the VLAN case, which I don't quite fully understand yet. See line 1154 of drivers/net/sunhme.c. It also has an interesting optimization when handling receive interrupts, where if the size of the incoming packet is larger than a parameter RX_COPY_THRESHOLD, it allocates a new SKB, does appropriate DMA manipulations, and gives it to the card as a new receive buffer, while keeping the packet in its original buffer to return to the network layer code. If it's smaller than that value, it copies the contents of the buffer into a new SKB and passes that to network layer, keeping the buffer the packet was originally received in as a reusable receive buffer for the device. This appears to be intended to trade off the time needed to copy the data against the time needed for those DMA manipulations; if the packet is too small the time savings of not copying aren't worth it.
And now, the smoking gun: the value of RX_COPY_THRESHOLD, defined at line 372 of drivers/net/sunhme.h, is 256 bytes. In a quick experiment, changing this to 512 bytes also changed the maximum ping size which avoided the bug to 466 bytes. Clearly, the 'clever' code path which avoids copying the packet is somehow the cause of this. I'll have to investigate further tomorrow, beginning with why it doesn't occur in kernel 184.108.40.206.