Home Router Puzzle Piece Two – Fun with wireless

I encourage you to try this experiment yourself; many of you open source readers can likely run it as you read this entry.

By the end of this exercise, you’ll agree with my conclusion: Your home network can’t walk and chew gum at the same time.

You can perform all but parts of experiment 3 on commercial routers; I’ve seen similar results on a Cisco E3000, a D-Link DIR-825 rev A, and others. An open source router, however, will allow us to diagnose (at least part of) the problem more definitively. I suspect more recent home routers may behave worse than old home routers, but as my old routers have been blown up by lightning, I can’t test this hypothesis.

Here’s my experiment configuration:

test computer   <-802.11->   router   <-ethernet->   server computer

Seems pretty simple, doesn’t it?  For completeness’ sake, I’ll document the configuration more carefully; I don’t think the exact gear matters much, though it may change the details of the results. Feel free to substitute your own gear:

  • The server system is connected to the router via GigE.
  • The test computer is an HP EliteBook 2540p running Linux 2.6.36-rc5, and uses an Intel Centrino Advanced-N 6200 AGN wireless NIC, REV=0x74

The test computer was sitting about 4 feet away from the router, and my local radio environment is quiet; on such a quiet network you will see the most interesting results, for reasons I cover in the next installments.  A commercial home router should suffice for experiment 2; you’ll need an open source router (or to be able to log in to your router) to perform parts of experiment 3 below.

Experiment 2a:

ping -n server & scp YourFavoriteBigFile server:

YourFavoriteBigFile needs to be large, say 100 Mbytes or more, so the copy will take more than a few seconds.  You can use nttcp instead if you installed it for Experiment 1 (though it will take a bit longer to reach full effect, I believe). Your favorite distro’s ISO image will do fine.
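
If you don’t have a big file handy, something like the following should work (a sketch; “server” and the file name are placeholders for your own host and file):

  dd if=/dev/urandom of=YourFavoriteBigFile bs=1M count=200   # make a test file of about 200 Mbytes
  ping -n server &
  scp YourFavoriteBigFile server:
  kill %1   # stop the background ping once the copy completes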

How much buffering should we expect to need to keep TCP busy? For a single flow like this over 802.11g (presuming we can actually get about 25Mbps, and a delay of 1ms), we’d expect to need no more than the bandwidth × delay product.  That is about 2 full-size packets; it makes sense that we always need a second packet available to keep the wireless link busy.  So you’d expect roughly an extra millisecond of queuing delay for the ICMP ping packet (which itself has an almost negligible size).
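
As a back-of-the-envelope check, using the figures assumed above (25Mbps, 1ms delay, 1500-byte packets):

  echo "25000000 * 0.001 / (1500 * 8)" | bc -l   # about 2 full-size packets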

What do you observe?

I observe latencies that increase the longer the TCP session goes on, reaching about 600 milliseconds or more after about 20 seconds on Linux, but with very high jitter.  Pinging the server from a second machine shows little increase in latency.

Why is this occurring?  Ah, dear Watson, that is the question….

Experiment 2b:

As in Experiment 1, reduce your txqueuelen to zero in several steps (e.g. “ifconfig wlan0 txqueuelen 0”).  What do you observe?  I observe about 100ms latency, with significant jitter.
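
Concretely, I step it down something like this while the transfer from 2a is running (or re-run the transfer after each step); wlan0 is my interface name, yours may differ:

  ifconfig wlan0 txqueuelen 100   # watch the ping output for a while after each step
  ifconfig wlan0 txqueuelen 10
  ifconfig wlan0 txqueuelen 0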

Unfortunately, my wireless NIC does not support the “-g” and “-G” ethtool options we explored in Experiment 1, so I cannot try reducing the transmit ring. If yours does, I encourage you to try twisting that knob as in the first experiments.  I hypothesize that my wireless NIC has a transmit ring on the order of the same size as the ethernet NIC we explored in Experiment 1.
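
If your driver does support them, the knobs are the same ones we used in Experiment 1 (a sketch; the interface name and the ring size are yours to choose):

  ethtool -g wlan0          # show the current ring sizes, if the driver supports it
  ethtool -G wlan0 tx 64    # try shrinking the transmit ring, if the driver supports it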

Experiment 2c:

Move the test computer further from the router until, say, you can only get 6 Mbps of bandwidth (your actual goodput will be less; remember, just because your radio is signalling at 6Mbps doesn’t mean you are able to get that much actual wireless bandwidth).
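
A quick way to see what rate your radio has negotiated as you move away (again, substitute your own interface name):

  iwconfig wlan0 | grep "Bit Rate"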

Remember to set your txqueuelen back to its original value (e.g. “ifconfig wlan0 txqueuelen 1000”, for my laptop).

Run the experiment again. What do you observe? Why?  I observe up to several seconds of latency; the lower the bitrate, the higher the latency.

Experiment 2d:

While you are still far enough away that the available bandwidth is low:

Try web browsing in another window, during the copy. What do you think of this result?  I don’t think you will like it at all.  I sure don’t.

Experiment 2e:

As in Experiment 1, reduce your txqueuelen to zero in several steps (e.g. “ifconfig wlan0 txqueuelen 0”).  What do you observe?

I observe the latency drop to only a bit over a hundred milliseconds (but with substantial jitter).

Unfortunately, my wireless Intel NIC does not support the “-g” and “-G” options, so I cannot try reducing the transmit ring in the wireless device as I could on ethernet.  I hypothesize a similarly large ring for the wireless chip.

Experiment 3a-3d:

Repeat experiment 2, but copy YourFavoriteBigFile from your server back to your system.  Make sure that the path you are copying from has more bandwidth than the wireless link.
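
That is, run the same probe with the bulk transfer now flowing toward your laptop (host and file names are placeholders, as before):

  ping -n server &
  scp server:YourFavoriteBigFile /tmp/
  kill %1   # stop the background ping once the copy completes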

On a Linksys E3000 router running commercial firmware, my latencies reach 500ms or more, with high jitter at a 54Mbps data rate, and with high ping packet loss when pinging from the transmitting direction. On a Netgear WNDR3700 running OpenWrt 10.03, changing txqueuelen seems to have no effect, and the latency is stuck at around 200ms. In a quick test at 6Mb/second, I observed 4 seconds of (highly variable) latency; at 12Mb/second, about 2 seconds of (highly variable) latency.

Note that twisting the txqueuelen knob (and/or transmit rings) on your laptop has no effect, but by logging into your router and twisting the knob there, you may (or may not) be able to eliminate most of the latency. On a Linksys WRT-54TM running Gargoyle router code version 1.3.8, I can reduce the latency (when at 54Mbps) from on the order of 1 second (with high jitter) to around 20ms by setting txqueuelen to 10 on wl0 (you can’t go to zero on this hardware, I surmise).  This is still higher than it should be from first principles, but closer to something tolerable.
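
One way to do this from your laptop is something like the following; I’m assuming your router answers ssh at 192.168.1.1 and calls its wireless interface wl0, both of which vary by firmware (ath0 and wlan0 are also common):

  ssh root@192.168.1.1 'ifconfig wl0 txqueuelen 10'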

Conclusion of Experiments 1-3

Your home network can’t walk and chew gum at the same time.

Tomorrow’s installment will give a detailed explanation of what’s happening; the name of our quarry will become clear.  Then we’ll progress to find more of the mastermind’s henchmen elsewhere in the network, where the hallmarks of the mastermind are somewhat harder to see, and the damage that is being done to us all.

22 Responses to “Home Router Puzzle Piece Two – Fun with wireless”

  1. Anders Says:

    How much did the tcp window size grow during your tests?

    • gettys Says:

      Good question; data to follow as I circle back to why I started doing experiments…

    • rick jones Says:

      If my recollection is correct, by default (no disabling of autotuning and/or perhaps explicit setting of socket buffers) “Linux” will autotune the socket buffer/window to as much as 4MB by default.

  2. Lucas Says:

    Experiment 2a:
    ping:
    61 packets transmitted, 61 received, 0% packet loss, time 60088ms
    rtt min/avg/max/mdev = 0.867/74.064/175.714/55.567 ms

    scp:
    test.data 100% 100MB 2.4MB/s 00:42

    Experiment 2b with txqueuelen 0:
    ping:
    67 packets transmitted, 67 received, 0% packet loss, time 68099ms
    rtt min/avg/max/mdev = 0.880/65.122/143.626/54.252 ms

    scp:
    test.data 100% 100MB 2.4MB/s 00:42

    The wireless NIC is a BCM4313 running with the Broadcom wl driver. The AP is also using a Broadcom NIC (BCM4318) with wl, running OpenWrt Kamikaze.

    Experiment 2a with a different client (AR2425 with ath5k driver), same AP:
    ping:
    45 packets transmitted, 45 received, 0% packet loss, time 44000ms
    rtt min/avg/max/mdev = 0.774/301.412/622.837/224.333 ms

    scp:
    test.data 100% 100MB 2.7MB/s 00:37

    All of the above reproduced with similar results 3 times.

    Speculation: something funny with broadcom AP NICs when used with non-broadcom clients?

    • gettys Says:

      I think the correct explanation is different, having to do with TCP.

      You may remember I specified I have a relatively quiet radio situation.

      Your results on wireless will typically be much less consistent than ethernet.

      The reasons for this will become clear as we go further in the explanations. Bug me about this point if you still have questions after a few more blog entries.

  3. Nathaniel Smith Says:

    Ugh. We do backups over WLAN and they like to utterly destroy the computer’s usability when they’re happening, for all the reasons you say. But experimenting just now, *nothing* seems to help… I just set txqueuelen to 0, *and*, if I’m understanding Linux traffic shaping correctly, told it to let ICMP skip to the head of the queues (‘tc qdisc add dev wlan0 root handle 1: prio && tc filter add dev wlan0 protocol ip parent 1: prio 1 u32 match ip protocol 1 0xff flowid 1:1’).

    Result: when saturating the uplink, mtr from my laptop to the router says average ping 2345 ms, std. dev. 1966, max 11792. Which is just… I mean, huh? Do they really have 10+ seconds of hardware buffer on here or… in what world does this make any sense?

    I also tried ‘iwconfig wlan0 retry 0’ in case MAC-level retransmissions were a factor; no change.

    This is with an iwl3945 (rev 02) and ubuntu’s 2.6.32.

    • gettys Says:

      For testing purposes, you really don’t want to classify ICMP differently than TCP (I am presuming that your backup program is using TCP as its transport, a good bet). You are using it as a diagnostic probe, and so want the traffic queued in the same queue.

      Trying to use traffic shaping in front of a wireless hop is problematic, as the available “goodput” on wireless is so variable: not only is it typically half of the “nominal” bit rate, but the bit rate itself is variable (from 1 megabit to 54 in 802.11g, or even higher with 802.11n), not to mention noisy RF environments. RF is just plain harder to get bits over than other media. So the details of how fast your wireless is running (and even the noise level of the environment) can make a big difference, unfortunately. I’m more optimistic that some form of AQM will be useful here; though current RED and similar algorithms are problematic, there appear to be ways forward, as I’ll discuss in the future.

      Now, we have (on Linux) (at least) two buffers in play here: the transmit queue and the NIC’s driver ring. I have strong evidence of driver ring bufferbloat on both Mac OS X and Windows. Both can be large. txqueuelen gives you control over the transmit queue. On *some* ethernet drivers I’ve played with, ethtool gives you control over the driver ring. I haven’t (on the hardware I have tested) found a wireless device driver that supports ethtool’s control over the transmit ring; old hardware may have little buffering in any case. On the ethernet device I tested (an Intel), the ring is 256 elements in size. As described in my earlier post on “fun with your switch”, I demonstrated I could control latency further by cranking that down; in my ethernet interface’s case, I could crank it down to 64 elements (packets, in all probability). That is still a lot of buffering. At some point, I’ll explain Ted Ts’o’s (very cogent) hypothesis that these buffers got put into recent hardware to help paper over x86’s SMM mode.

      Let’s do a quick computation: even 64 packets is > 768000 bits of payload (excluding TCP/IP overhead; and wireless adds yet more bits as well for its encapsulation). A ring of 256 elements is > 3 megabits. So yes, it is *really* easy to have multiple seconds of buffering in wireless.

      I’ve tried looking at the Linux sources to see what size my wireless NIC’s transmit ring is; unfortunately, I didn’t find it in my explorations of that driver, and have bucked the question to Intel, asking what it is; I haven’t heard back yet.

      I’ve also found instances where it appears the hardware has buffering, but twisting the knobs that purport to control it has no effect (an ethernet NIC I found in one of my routers). The exact details of the hardware (and its drivers) matter here. Given what I know about such devices and drivers, I’m not at all surprised by this. Some of these devices have RAM of their own or may use host RAM, and given typical testing, a naive driver engineer may not bother to implement the driver ring controls, as it will appear his driver “works fine”.

      So your mileage will vary, and whether you will have controls you can use is therefore problematic. At best, what I’ve been able to demonstrate in “fun with your switch” and “fun with wireless”, is that in some cases where the NIC’s controls actually work, they behave as you would expect. We have a lot of work ahead of us.

      Sigh.

      • gettys Says:

        Oh, and there is another point I have to make and I’ve not gotten around to posting about:

        802.11 is a shared medium. You have buffering, and may only be getting a fraction of the available bandwidth, so that buffering may be multiplied many times over in terms of the latency it causes if the medium is busy. Even 100Kbits of buffering, if you are only able to get 1/10 of the medium’s bandwidth, can add up to well over a second of latency (100Kbits draining at an effective 100Kbps already takes a full second).

        You *really* want to keep transmit buffering very low on shared media in particular.

  4. Nathaniel Smith Says:

    Right — I’m only giving ICMP special treatment because I figured that it would tell me whether the traffic control was making any difference, before taking the trouble to also classify the rest of my traffic. I’m also not using traditional traffic shaping — the big problem there, as you note, is estimating your bandwidth so you can throttle things down and make yourself the bottleneck. But I’m just trying to solve the case where my laptop’s *already* the bottleneck, so it really shouldn’t matter (AFAICT) that 802.11 has variable goodput — if the drivers would just always let my latency-sensitive traffic jump to the head of the queue, then I’d get optimal behavior. That’s what the ‘prio’ scheduler is supposed to do, I think.

    Unfortunately, in current Linux, I think the hardware buffer comes after the traffic control policy, which makes it useless…

    Some groveling in drivers/net/wireless/iwlwifi suggests that for the iwl-class hardware, Linux unconditionally uses a DMA ring buffer of TFD_QUEUE_SIZE_MAX (= 256 packets). Which kind of supports your hypothesis about the degree of care that goes into buffer sizing! 🙂 One thing to try might be to just redefine this to, like… 8 or something, and see what happens. The trade-off, of course, is that the host CPU is going to have to handle a lot of interrupts keeping such a short buffer full.

    Or, there’s another possible approach. This hardware actually has several independent transmit buffers (“multiqueue support”). I believe this is intended as another of those throughput-uber-alles features (it lets multiple CPUs feed the same network card without locking), but it seems like it could be bent to other purposes… If (big if!) we can convince the card to prioritize traffic from one buffer, or even just do some fair round-robin thing, then we can queue latency-sensitive traffic and bulk traffic into separate buffers, and get a kind of “virtual” jump-to-head-of-queue behavior for the latency-sensitive packets, without having to reorder or shorten any actual physical queues.

    Unfortunately, I’m not sure how to coax the drivers to actually do this. ‘tc -s class show dev wlan0’ on my box shows that there are 4 buffers, of which only buffer #3 is actually being used. As a first experiment, I tried using ‘tc filter add dev wlan0 protocol ip parent 0: prio 1 u32 match ip protocol 1 0xff flowid 0:1’ to tell the kernel to send all ICMP traffic through buffer #1 instead. But this just gives an “Operation not supported” error…

    • gettys Says:

      The transmit rings are grossly large on the most modern hardware. Even if you are trying to mitigate interrupts, etc., to get smart offload, 256 is insane (thanks for finding the hash define; somehow I missed it when I went digging through the driver some weeks ago); you’ve gotten most of the benefit in the first few packets. And as I’ll note when I post about 802.11 in general (and have already noted to some extent in comments here), it is particularly important to keep these buffers no larger than necessary on shared media such as wireless.

      Note that at typical performance levels on wireless, it really doesn’t matter whether we try to use all the “smarts” the hardware can do for us. The fundamental problem is the “go fast only” thinking that has dominated for over a decade. We are only performance sensitive a bit of the time, and seldom on wireless (yet the hardware has sprouted all the features anyway). I suppose on 802.11n we *might* have to move a few hundred megabits/second, but I expect the processor would still be loafing.

      Anyone care to write a patch for the iwlwifi driver to implement what ethtool needs for ring size management? I will poke the Intel friend I have in any case, but the wireless driver isn’t his particular problem….

      • Nathaniel Smith Says:

        After a bit more digging, it looks like I was slightly wrong — the effective queue size on iwl hardware seems to be (7/8) * 256 = 224 packets.

        But that’s not so interesting. What’s interesting is that I just modified drivers/net/wireless/iwlwifi/iwl-tx.c, around line 293 in the function iwl_queue_init, where it calculates the high-water mark for a queue. I added the line “q->high_mark = 246;”, to override its calculation. What this number seems to be is, if your 256-element ring buffer has fewer free spaces than this, then we refuse to enqueue any more packets to it — so setting it to 246 reduces the effective buffer size to 10 packets. (Then for anyone following along: I rebuilt just the iwlwifi drivers against my running kernel with ‘make -C /lib/modules/$(uname -r)/build M=$PWD modules’, and reloaded the modules. Note that the change is actually to iwlcore.ko.)
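
        For anyone who wants to try this, the whole procedure amounts to roughly the following (the path and function are from my 2.6.32-era tree and will likely differ on yours):

          cd drivers/net/wireless/iwlwifi      # from within your kernel source tree
          # in iwl-tx.c, just after the high-water-mark calculation in
          # iwl_queue_init(), add the override:  q->high_mark = 246;
          make -C /lib/modules/$(uname -r)/build M=$PWD modules
          # then unload and reload the iwl modules (the change lands in iwlcore.ko)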

        After this change, I set txqueuelen to 0, and now, my average ping time with a saturated uplink is <40 ms (!!!).

        I then redid the traffic control settings I described above to give ICMP higher priority, set my txqueuelen *back* to 1000, and my pings are *still* <40 ms.

        Not quite as good as the 2 ms pings I see without the saturated uplink, but just a *tiny* bit more acceptable than the 10000 ms pings I had before…

      • Nathaniel Smith Says:

        Okay, after 10 minutes of *cough* EXHAUSTIVE testing, I have to say that this is AMAZING. I’m running a full-disk rsync backup right now, it’s totally saturating my wifi uplink, and even with no traffic shaping or anything (just the hack above + txqueuelen = 0), web browsing Just Works.

        My ssh sessions are responsive. This is fabulous. I need more superlatives.

        I (or someone) need to go bang some heads together on linux-wireless…

        • gettys Says:

          Please test yet more carefully.

          For others: note your mileage will vary, depending on your device driver. Understanding the principles involved is what is important.

        • gettys Says:

          No rocks please. Banging heads together is not appropriate. The Linux kernel folks have made a mistake (or overlooked an evolutionary trend, depending on how you look at it) shared by the Mac and Windows kernel developers, as well as just about all other networking engineers everywhere.

          This is why I titled this article as I did, about glass houses and rocks. We are all living in one (or more) glass houses, as far as I can tell.

      • Nathaniel Smith Says:

        Yes, sorry about that — I was thinking more in terms of “get them to pay attention” than “call them idiots”, but I realized right after I posted that I really should have been more careful in my phrasing. :-/

        In any case, I sent this on to linux-wireless; we’ll see what happens!

        http://thread.gmane.org/gmane.linux.kernel.wireless.general/61285

  5. rick jones Says:

    If there is a concern about ping’s ICMP packets being allowed to jump the queue, one could use a netperf TCP_RR test instead. If ./configure’ed with --enable-histogram and run with a -v 2 option, it will give a reasonable histogram of the application-level RTTs at the end. (There are other stats one can get with an “omni” variant of that test, but that is more than should be gone into here; reaching me/others via netperf-talk at netperf dot org or via the feedback links on the website can be used to start that discussion offline.)
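
    For example, something along these lines should approximate the ping-under-load measurement over TCP (this assumes netserver is already running on the target, and that netperf was ./configure’ed with --enable-histogram):

      netperf -H server -t TCP_RR -l 60 -v 2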

    As for linux and the tx queue lengths used by its drivers, I wonder the extent to which things like the intra-stack flow control for UDP come into play.

    • gettys Says:

      I haven’t tried playing with any traffic classification (so far, anyway).

      For my testing, I did work with the author of httping to get some modifications to it (persistent connections) so that we’d have a TCP based ping tool that would be widely available. See: http://www.vanheusden.com/httping/ There is an android version of it available as well. Of most concern would be when pinging intermediate routers, if those routers are loaded. (since ICMP would likely take a slow path). I never saw any behavior to make me worry.

      In my testing, I always got comparable results for ICMP to httping.
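
      For example, something like the following (flag details vary between httping versions; check httping -h on yours):

        httping -c 30 -g http://server/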

    • gettys Says:

      Rick,

      You could do everyone a big service by improving netperf to do a “latency under load” test.

      Extra bonus points if you write a tool which tells you which hop in a path is the bottleneck adding lots of latency.
      – Jim

      • Simon Leinen Says:

        There’s a stand-alone tool called “thrulay” that does latency-under-load tests. Stanislav Shalunov wrote it when he worked for Internet2. http://shlang.com/thrulay/

      • rick jones Says:

        I think it is close (save for the which hop).

        If one uses the top-of-trunk, there are some “enhanced” stats that came to netperf via some googleheads (omni test, -j option) to give mean, min, max and some percentiles. Couple that with a --enable-burst on the ./configure to enable the request/response test to have multiple requests pending at one time and then I think we are almost there. Probably need a little enhancement to the individual rtt measurements.

        • rick jones Says:

          So, the latency/histogram code did indeed need a little enhancing to track more than one outstanding transaction at a time. I think I have that sorted now, so the top-of-trunk netperf2 bits with --enable-burst and perhaps --enable-histogram should be able to track RTTs for a burst-mode TCP_RR test, which one can set up with the burst parm to try to have lots of transactions outstanding at one time and see the latency increase as one does. For more, ask in the netperf-talk mailing list hosted on netperf.org.

          netperf does not, however, attempt to find the hop in the path where the latency resides. that goes a bit beyond netperf’s design center 🙂

  6. Jesper Dangaard Brouer Says:

    Let’s at least mitigate the WiFi bufferbloat:
    http://netoptimizer.blogspot.com/2011/01/bufferbloat-wireless-is-worse-than.html
