Diagnosing Bufferbloat

People (including in my family) ask how to diagnose bufferbloat.

Bufferbloat’s existence is pretty easy to figure out; identifying which hop is the current culprit is harder.  For the moment, let’s concentrate on the edge of the network.

The ICSI Netalyzr project is the easiest way for most people to identify problems: you should run it routinely on any network you visit, as it will tell you about many problems, not just bufferbloat.  For example, I often take the Amtrak Acela Express, which has WiFi service (of sorts).  Its DNS server did not randomize its ports properly, leaving you vulnerable to man-in-the-middle attacks (so it would be unwise to do anything that requires security); this has since been fixed, as today’s report shows (look at the “network buffer measurements”).  This same report shows very bad buffering in both directions: about 6 seconds upstream and 1.5 seconds downstream.  Other runs today show much worse performance, including an inability to determine the buffering at all (Netalyzr cannot always determine the buffering in the face of cross traffic or other problems; it conservatively reports buffering only when it makes sense).

[Figure: Netalyzer Uplink buffer test results]

As you’d expect, performance is terrible (you can see what even “moderate” bufferbloat does in my demo video on a fast cable connection).  The buffering on the train is similar to what my brother has on his DSL connection at home; but since the link is busy with other users, the performance is continually terrible rather than intermittently terrible.  Six seconds is commonplace; the lower right hand portion of the Netalyzr data is cut off because ICSI does not want their test to run for too long.

In this particular case, with only a bit more investigation, we can guess that most of the problems are in the train<->ISP hop: my machine reports high bandwidth on its WiFi interface (130 Mbps 802.11n), while the uplink speeds are a small fraction of that, so the bottleneck to the public Internet is usually in that link rather than in the WiFi hop (remember, buffers fill just *before* the lowest bandwidth hop, in either direction).  In your home (or elsewhere on this train), you would have to worry about the WiFi hop as well unless you are plugged directly into the router.  But further investigation shows additional problems.

If netalyzr isn’t your cup of tea, you may be able to observe what is happening with “ping”, while you (or others) load your network.

By “ping”ing the local router on the train and also somewhere further away, you can glean additional information. As usual, a dead giveaway for bufferbloat is high and variable RTTs with little packet loss (though sometimes packets are terribly delayed and arrive out of order; packets stuck in buffers for even tens of seconds are not unusual). Local pings vary much more than you might like, often by several hundred milliseconds, occasionally even by multiple seconds.  Here, I hypothesize bloat in the router on the train, just as I saw inside my house when I first understood that bufferbloat was a generic problem with many causes. Performance is terrible at times due to the train’s connection, but also a fraction of the time due to serving local content with bloat in the router.
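
To make the procedure concrete, here is a minimal sketch in Python: it simply runs the system ping against the local gateway and a remote host and prints the loss and RTT summaries. The addresses below are placeholders you will need to replace, and you should start a large upload or download yourself (in another window) before running it.

    #!/usr/bin/env python3
    # Minimal sketch of the ping-while-loaded test described above.
    # Assumes a Unix-like "ping" binary on PATH; ROUTER and REMOTE are
    # placeholder addresses you must replace with your own gateway and
    # a well-connected remote host.
    import subprocess

    ROUTER = "192.168.1.1"   # hypothetical address of the local router/gateway
    REMOTE = "example.com"   # any reliably reachable remote host
    COUNT = 30               # echo requests per target

    def ping_summary(host, count=COUNT):
        """Run the system ping and return its loss/RTT summary lines."""
        result = subprocess.run(["ping", "-c", str(count), host],
                                capture_output=True, text=True)
        # The last two lines of ping's output hold the loss and RTT statistics.
        return "\n".join(result.stdout.strip().splitlines()[-2:])

    # High and highly variable RTTs with little packet loss are the classic
    # signature of bufferbloat. If only the REMOTE numbers blow up, the bloat
    # is likely at the bottleneck toward the ISP; a bloated local router
    # shows up in the ROUTER numbers as well.
    for target in (ROUTER, REMOTE):
        print(f"--- {target} ---")
        print(ping_summary(target))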

Home router bloat

Specifically, if the router has lots of buffering (as most modern routers do; often 256-1250 packets) and is using a default FIFO queuing discipline, it is easy for the router to fill these buffers with packets all destined for the same machine, one operating at a small fraction of the speed WiFi might reach.  Ironically, modern home routers tend to have much larger buffers than old routers, due to changes in upstream operating systems that were optimized for bandwidth and never tested for latency.

Even if “correct” buffering were present (actually an oxymoron), the bandwidth can drop from the 130 Mbps I see to the local router all the way down to 1 Mbps, the minimum speed at which WiFi will operate, so your buffering can be very much too high even at the best of times.  Moving your laptop/pad/device a few centimeters can make a big difference in bandwidth. But since we have no AQM algorithm to control the amount of buffering, recent routers have been tuned (to the extent they have been tuned at all) to operate at maximum bandwidth, even though this means the available buffering can easily be 100 times too much when running slowly (all of which turns into delay).  One might also hope that a router would prevent starvation of other connections in such circumstances, but as these routers are typically running a FIFO queuing discipline, they won’t.  A local (low RTT) flow can get a much higher fraction of the bandwidth than a long distance flow.
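
To put rough numbers on this, here is a back-of-the-envelope calculation (assuming full-size 1500-byte packets, which is an assumption rather than a measurement) of how long FIFOs of the sizes mentioned above take to drain at 130 Mbps versus 1 Mbps:

    # Back-of-the-envelope queueing delay for a full FIFO of full-size packets.
    PACKET_BYTES = 1500          # assumed packet size (full Ethernet frames)

    def drain_time_s(packets, rate_mbps):
        """Seconds needed to drain a FIFO of `packets` full-size packets."""
        return packets * PACKET_BYTES * 8 / (rate_mbps * 1e6)

    for packets in (256, 1250):
        for rate_mbps in (130, 1):   # best-case vs. minimum WiFi rate
            print(f"{packets:5d} packets at {rate_mbps:3d} Mbps -> "
                  f"{drain_time_s(packets, rate_mbps):6.2f} s of queueing delay")

At 130 Mbps those queues drain in about a tenth of a second or less; at 1 Mbps the same queues hold roughly 3 to 15 seconds of traffic, which is exactly the kind of delay Netalyzr reports above.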

To do justice to the situation, it is also possible that the local latency variation is partially caused by device driver problems in the router: Dave Taht’s experience has been that 802.11n WiFi device drivers often buffer many more packets than they should (beyond what is required for good performance when aggregating packets for 802.11n), and he, Andrew McGregor, and Felix Fietkau spent a lot of time last fall reworking one of those Linux device drivers. Since the wireless on the train supports 802.11n, we know that these device drivers are in play; fixing these problems for the CeroWrt project was a prerequisite for later work on queuing and AQM algorithms.

11 Responses to “Diagnosing Bufferbloat”

  1. foo Says:

    Is there a Free Software alternative to Netalyzr? Preferably one that is not written in Java/JavaScript.

  2. Open Source Pixels » Gettys: Diagnosing Bufferbloat Says:

    […] Gettys looks into how to figure out which hop is the current culprit for bufferbloat. “In this particular case, […]

  3. Francis Hsu Says:

    While the specific ‘bufferbloat’ problem persists, it is a symptom of the wider problem that exists in what I term the ‘Dark Infrastructure.’ This is all the unused, unknown and poorly understood resources that exist in all the computers in the world and those that are connected to the Internet. Your identifying bufferbloat is an excellent start. I hope others follow your effort to find out what else is in those invisible corners of the Dark Infrastructure.

  4. Maciej Sołtysiak Says:

    Hi Jim, I appreciate your work on the topic. I’ve a question, silly perhaps, but maybe not?

    Do you think it would make sense to organize events like a “World ECN Testing Day” (as with IPv6 Day?) to encourage people to turn on ECN on their client machines to test it, and if it works, let them leave it turned on? If it doesn’t, let them report it to their CPE manufacturer or ISP?
    If it does, also let them get a badge for being a good knight in the bufferbloat battle?

    I would appreciate your thought on this, thanks.

    Maciej

    • gettys Says:

      I doubt you could get enough testing on an obscure topic like ECN; you might get traction as part of other efforts (e.g. IPv6 day).

  5. Bernd Paysan Says:

    “Buffer bloat” is actually a fundamental design flaw in TCP flow control, and pointing with fingers towards buffering is the wrong approach. TCP maximizes delay, i.e. it fills up buffers completely. A saner approach would try to minimize buffering, and use that as flow control – there is no point to transmit data faster than it can pass through the net, and increasing latency means you are filling up buffers. Don’t.

    • gettys Says:

      In the first place, you do not want buffers running constantly full, no matter what. Without AQM, they will.

      And it’s effectively impossible to predict the amount of buffering needed for TCP to work well: you need a bandwidth delay product of buffering. But you don’t know the delay *and* you don’t know the bandwidth. So you don’t know the needed buffer size even *approximately*. The traditional “engineering rule of thumb” is really badly obsolete in today’s internet, with the addition of wireless (where your bandwidth can vary by orders of magnitude even moving your device a few centimeters) and with CDNs “inside” the ISP’s networks.
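
      As a rough illustration (the bandwidths and RTTs below are merely plausible values, not measurements), the bandwidth-delay product spans orders of magnitude across conditions a single device can encounter:

          # Bandwidth-delay product for a range of plausible bandwidths and RTTs.
          def bdp_kbytes(bandwidth_mbps, rtt_ms):
              """Bandwidth-delay product in kilobytes."""
              return bandwidth_mbps * 1e6 / 8 * (rtt_ms / 1e3) / 1e3

          for mbps in (1, 20, 130):
              for rtt_ms in (10, 100, 300):   # nearby CDN vs. intercontinental path
                  print(f"{mbps:3d} Mbps x {rtt_ms:3d} ms RTT -> "
                        f"{bdp_kbytes(mbps, rtt_ms):7.1f} kB of buffering")

      The “right” buffer ranges from about a kilobyte to several megabytes; no fixed size can be correct across that range.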

      I cover this elsewhere in several blog entries.

      So fixed sized unmanaged buffers just don’t work.

  6. braempje Says:

    How did you produce the very nice Netalyzer Uplink buffer test figure? Is this somehow the aggregated result of multiple Netalyzer runs?

    • gettys Says:

      The netalyzr team (I think most likely Nick Weaver) produced those plots from their data. My great thanks to them, as it expresses the problem better than I ever could.

      Each dot is a single netalyzr test; it’s a lot of samples!

      Note that the top speed for the test is around 20Mbps; there are problems at higher bandwidths not plotted. And the white area down and to the right is caused by the termination of the test at around 5 seconds of buffering; there are problems down there too!
