Archive for the ‘Bufferbloat’ Category

Diagnosing Bufferbloat

February 20, 2012

People (including in my family) ask how to diagnose bufferbloat.

Bufferbloat’s existence is pretty easy to figure out; identifying which hop is the current culprit is harder.  For the moment, let’s concentrate on the edge of the network.

The ICSI Netalyzr project is the easiest way for most people to identify problems: you should run it routinely on any network you visit, as it will tell you of lots of problems, not just bufferbloat.  For example, I often take the Amtrak Acela express, which has WiFi service (of sorts).  Its DNS server did not randomize its ports properly, leaving you vulnerable to man-in-the-middle attacks (so it would be unwise to do anything that requires security); this has since been fixed, as today’s report shows (look at the “network buffer measurements”).  This same report shows very bad buffering in both directions: about 6 seconds upstream and 1.5 seconds downstream.  Other runs today show much worse performance, including an inability to determine the buffering at all (Netalyzr cannot always determine the buffering in the face of cross traffic or other problems; it conservatively only reports buffering if the result makes sense).

Netalyzer Uplink buffer test results

As you’d expect, performance is terrible (you can see what even “moderate” bufferbloat does in my demo video on a fast cable connection).  The train buffering is similar to what my brother has on his DSL connection at home; but as the link is busy with other users, the performance is continually terrible, rather than intermittently terrible.  6 seconds is commonplace; but the lower right hand netalyzr data is cut off since ICSI does not want their test to run for too long.

In this particular case, with only a bit more investigation, we can guess that most of the problems are in the train<->ISP hop: my machine reports high bandwidth on its WiFi interface (130Mbps 802.11n), and the uplink speeds are a small fraction of that, so the bottleneck to the public Internet is usually in that link rather than the WiFi hop (remember, it’s just *before* the lowest bandwidth hop that the buffers fill, in either direction).  In your home (or elsewhere on this train), you’d have to worry about the WiFi hop as well, unless you are plugged directly into the router. But further investigation shows additional problems.

If netalyzr isn’t your cup of tea, you may be able to observe what is happening with “ping”, while you (or others) load your network.

By “ping”ing the local router on the train and also somewhere else, you can glean additional information. As usual, a dead giveaway for bufferbloat is high and variable RTTs with little packet loss (but sometimes packets are terribly delayed and out of order; packets stuck in buffers for even tens of seconds are not unusual). Local pings vary much more than you might like, sometimes by as much as several hundred milliseconds, and occasionally even by multiple seconds.  Here, I hypothesize bloat in the router on the train, just as I saw inside my house when I first understood that bufferbloat was a generic problem with many causes. Performance is terrible at times due to the train’s connection, but also a fraction of the time due to serving local content with bloat in the router.
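
To make the ping heuristic concrete, here is a minimal Python sketch that pulls the RTTs out of saved ping output and flags the “high and variable RTT with little loss” pattern. The function names and the thresholds (a 10x spread between minimum and maximum RTT, at most 5% loss) are illustrative assumptions, not calibrated values, and the parser only assumes the common `time=12.3 ms` form of ping reply lines.

```python
import re
import statistics

# Matches the RTT in typical ping reply lines, e.g. "time=12.3 ms".
RTT_PATTERN = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtts(ping_output):
    """Return the RTTs (in milliseconds) found in saved ping output."""
    return [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]


def summarize(rtts, sent):
    """Summarize RTT samples; `sent` is the number of pings sent."""
    if not rtts:
        raise ValueError("no replies received")
    return {
        "loss": 1.0 - len(rtts) / sent,
        "min_ms": min(rtts),
        "median_ms": statistics.median(rtts),
        "max_ms": max(rtts),
    }


def looks_bloated(summary, spread_factor=10.0, max_loss=0.05):
    """Heuristic only (assumed thresholds): big RTT spread, little loss."""
    return (summary["max_ms"] >= spread_factor * summary["min_ms"]
            and summary["loss"] <= max_loss)
```

Run it on ping output captured while the link is loaded, once against the local router and once against a distant host, and compare the two summaries.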

Home router bloat

Specifically, if the router has lots of buffering (as most modern routers do; often 256-1250 packets), and is using a default FIFO queuing discipline, it is easy for a router to fill these buffers with packets all destined for the same machine that is operating at a fraction of the speed that WiFi might go.  Ironically, modern home routers tend to have much larger buffering than old routers, due to changes in upstream operating systems optimized toward bandwidth, whose systems were not tested for latency.

Even if “correct” buffering were present (actually an oxymoron), the bandwidth can drop from the 130 Mbps I see to the local router all the way down to 1Mbps, the minimum speed WiFi will operate at, so your buffering can be very much too high even at the best of times.  Moving your laptop/pad/device a few centimeters can make a big difference in bandwidth. But since we have no AQM algorithm to control the amount of buffering, recent routers have been tuned (to the extent they’ve been tuned at all) to operate at maximum bandwidth, even though this means the buffering available can easily be 100 times too much when running slowly (which all turns into delay).  One might also hope that a router would prevent starvation of other connections in such circumstances, but as these routers are typically running with a FIFO queuing discipline, they won’t.  A local (low RTT) flow can get a much higher fraction of bandwidth than a long distance flow.
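
The arithmetic behind “100 times too much” is easy to check. The sketch below computes how long a full FIFO takes to drain at the two WiFi rates mentioned above. It assumes full-size 1500-byte packets and, optimistically, that the nominal WiFi rate is the actual throughput; both are simplifying assumptions, and the function name is just for illustration.

```python
def drain_time_ms(packets, rate_mbps, packet_bytes=1500):
    """Time to empty a full FIFO of `packets` packets at `rate_mbps`.

    Assumes every packet is `packet_bytes` long (1500 by default).
    """
    bits = packets * packet_bytes * 8
    return bits / (rate_mbps * 1e6) * 1e3


# Buffer sizes from the text (256 and 1250 packets) at 130 Mbps and 1 Mbps.
for packets in (256, 1250):
    for rate_mbps in (130, 1):
        delay = drain_time_ms(packets, rate_mbps)
        print(f"{packets} packets at {rate_mbps} Mbps: {delay:.0f} ms")
```

Under these assumptions a 256-packet buffer is worth roughly 24 ms at 130 Mbps but about 3 seconds at 1 Mbps, and a 1250-packet buffer about 15 seconds at 1 Mbps.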

To do justice to the situation, it is also possible that the local latency variation is partially caused by device driver problems in the router: Dave Taht’s experience has been that 802.11n WiFi device drivers often buffer many more packets than they should (beyond that required for good performance when aggregating packets for 802.11n), and he, Andrew McGregor, and Felix Fietkau spent a lot of time last fall reworking one of those Linux device drivers. Since wireless on the train supports 802.11n, we know that these device drivers are in play; fixing these problems for the CeroWrt project was a prerequisite for later work on queuing and AQM algorithms.

Bufferbloat demonstration videos

February 1, 2012

If people have heard of bufferbloat at all, it is usually just an abstraction, despite their having personal experience with it. Bufferbloat can occur in your operating system, your home router, your broadband gear, wireless, and almost anywhere in the Internet.  People still think that poor Internet speed means they must need more bandwidth, and take vast speed variation for granted. Sometimes, adding bandwidth can actually hurt rather than help. Most people have no idea what they can do about bufferbloat.

So I’ve been working to put together several demos to help make bufferbloat concrete, and demonstrate at least partial mitigation. The mitigation shown may or may not work in your home router, and you need to be able to set both upload and download bandwidth.

Two of the four cases we all commonly suffer from at home are:

  1. Broadband bufferbloat (upstream)
  2. Home router bufferbloat (downstream)

Rather than attempt to show worst case bufferbloat, which can easily induce complete failure, I decided to demonstrate these two cases of “typical” bufferbloat as shown by the ICSI data. As the bufferbloat varies widely, as the ICSI data shows, your mileage will also vary widely.

There are two versions of the video:

  1. A short bufferbloat video, of slightly over 8 minutes, which includes both demonstrations, but elides most of the explanation. Its intent is to get people “hooked” so they will want to know more.
  2. The longer version of the video clocks in at 21 minutes, includes both demonstrations, and gives a simplified explanation of bufferbloat’s cause, to encourage people to dig yet further.

Since bufferbloat only affects the bottleneck link(s), and broadband and WiFi bandwidth are often similar and variable, it’s very hard to predict where you will have trouble. If you understand that the bloat grows just before the slowest link in a path (including in your operating system!), you may be able to improve the situation. You have to take action where the queues grow. You may be able to artificially move the bottleneck from a link that is bloated to one that is not. The first demo moves the bottleneck from the broadband equipment to the home router, for example.

To reduce bufferbloat in the home (until the operating systems and home routers are fixed), your best bet is to ensure your actual wireless bandwidth is always greater than your broadband bandwidth (e.g., by using 802.11n and possibly multiple access points) and use bandwidth shaping in the router to “hide” the broadband bufferbloat.  You’ll still see problems inside your house, but at least, if you also use the mitigation demonstrated in the demo, you can avoid problems accessing external web sites.

The most adventurous of you may come help out on the CeroWrt project, an experimental OpenWrt router where we are working on both mitigating and eventually fixing bufferbloat in home routers. Networking knowledge and the ability to reflash routers required!
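
The idea of moving the bottleneck can be sketched with a toy model. The Python below treats a path as a list of links, finds the slowest one (where the queue builds), and estimates the delay of a full buffer in front of it. All rates and buffer sizes are hypothetical numbers chosen for illustration, and the “shaper” is simply modeled as one more link with a rate set slightly below the broadband uplink and a deliberately small buffer.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    rate_mbps: float     # speed of the link
    buffer_bytes: int    # buffering in the device feeding the link


def bottleneck(path):
    """The slowest link on the path; the queue builds in front of it."""
    return min(path, key=lambda link: link.rate_mbps)


def worst_case_delay_ms(link):
    """Delay added when the buffer in front of `link` is full."""
    return link.buffer_bytes * 8 / (link.rate_mbps * 1e6) * 1e3


# Hypothetical upstream path: fast WiFi, then a bloated broadband uplink.
wifi = Link("WiFi", 130.0, 64 * 1024)
uplink = Link("broadband uplink", 2.0, 256 * 1024)
# Hypothetical shaper in the home router, set just below the uplink rate,
# with a deliberately small buffer.
shaper = Link("router shaper", 1.8, 16 * 1024)

for path in ([wifi, uplink], [wifi, shaper, uplink]):
    slowest = bottleneck(path)
    delay = worst_case_delay_ms(slowest)
    print(f"{slowest.name}: up to {delay:.0f} ms of queuing")
```

With these made-up numbers, the unshaped path can queue about a second of data in the broadband gear, while the shaped path queues well under 100 ms in the router. The cost is a little bandwidth, since the shaper must run slower than the link it is hiding.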


CACM: BufferBloat: What’s Wrong with the Internet?

December 8, 2011

Communications of the ACM: Bufferbloat: What’s Wrong with the Internet?

February issue of the Communications of the ACM.

A discussion with Vint Cerf, Van Jacobson, Nick Weaver, and Jim Gettys

This is part of an ACM Queue case study, accompanying the article by Kathie Nichols and me that appeared in the January 2012 CACM (Communications of the ACM).

CACM: Bufferbloat: Dark Buffers in the Internet

December 6, 2011

A year or so ago, Vint Cerf recommended that I start blogging about bufferbloat immediately, given the severity of the problem, to avoid the usual publication delays; that’s why things appeared here first.

But more formal publication has its merits; in particular, having articles for people less directly involved in networking, and for more managerially oriented technical people, is very important. So I’ve been working with ACM Queue to put together a case study. It has now appeared as an article in the January 2012 issue of CACM (Communications of the ACM) in dead-tree form.  There will also be a full paper posted in ACM Queue, but to make the January CACM, we put that aside to finish the (much shorter) article.

Progress on the cable front…

July 13, 2011

I know it’s not anyone’s idea of fun to monitor gigantic specs; I certainly don’t do so.  I  think the following tidbit, while public information, has not been noticed by people.

The cable data spec (DOCSIS) has an engineering change in progress, published early this year, to allow the control of buffering in cable modems.  This will allow operators to at least mitigate (reduce) bufferbloat by setting the buffering to something related to the provisioned bandwidth, rather than the current state of buffering, which is either a) whatever size RAM happened to be available in the device, or b) sized to the (invalid) BDP “rule of thumb” for the maximum possible bandwidth that the hardware might ever be used at, resulting in greatly overbuffered devices when used by most people at typical bandwidths (e.g. 10Mbps service with a DOCSIS 3 modem capable of 100Mbps). The feature is called “buffer control” for those who want to dig into the specs.

As a concrete example, this might allow the modem I happen to have to have its worst case latency reduced from the 1.2 seconds I observed at 20Mbps to on the order of 100ms.  This isn’t perfect, but it’s a whole lot better indeed.  To really solve the problem and get latencies under load to where they could be, we’ll need real AQM that can work in this environment, which is more of a challenge, for reasons noted elsewhere in this blog. I have no information as to whether any deployed cable modems will ever see new firmware to support this addition to the DOCSIS specification.
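
To put rough numbers on this, the sketch below converts between buffer size and worst case delay. It is only back-of-the-envelope arithmetic: the 100 ms RTT used for the BDP “rule of thumb” is an assumed figure, the function names are made up for illustration, and real modems have overheads this ignores.

```python
def buffer_bytes(rate_mbps, delay_ms):
    """Bytes of buffering that take `delay_ms` to drain at `rate_mbps`."""
    return rate_mbps * 1e6 / 8 * delay_ms / 1e3


def drain_delay_ms(size_bytes, rate_mbps):
    """Worst case delay of a full buffer of `size_bytes` at `rate_mbps`."""
    return size_bytes * 8 / (rate_mbps * 1e6) * 1e3


# Buffering implied by the 1.2 seconds observed at 20 Mbps.
observed = buffer_bytes(20, 1200)
# Buffering that would give roughly 100 ms at the same 20 Mbps.
target = buffer_bytes(20, 100)
print(f"observed: about {observed / 1e6:.2f} MB, target: about {target / 1e6:.2f} MB")

# BDP rule of thumb sized for a 100 Mbps modem (assuming a 100 ms RTT),
# then used on a 10 Mbps service.
bdp = buffer_bytes(100, 100)
print(f"delay at 10 Mbps: {drain_delay_ms(bdp, 10):.0f} ms")
```

Under these assumptions, 1.2 seconds at 20Mbps corresponds to about 3 MB of buffering, versus about 0.25 MB for 100ms; and a buffer sized by the rule of thumb for 100Mbps turns into about a second of delay on a 10Mbps service.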

I gather that new cable modems that support this change to the DOCSIS spec will be on the market by late this year; but note that deploying the support elsewhere in ISPs’ networks to configure them will take more time, and probably won’t happen until next year.

I have no information on whether there are similar changes underway to modify the other widespread broadband technologies (e.g. DSL and fiber).

It is a good start to getting things fixed :-).  Hopefully the market will now help to spread such mitigation steps in the industry.

Google TechTalk video is up

June 2, 2011

I gave bufferbloat talks at Microsoft Research, Apple, Google and a workshop during the week of April 24; the slides are available. A video of the talk is up as part of the Google TechTalk series.  My thanks to Greg Chesson, Mark Chow and Denton Gentry for pulling together the video despite technical difficulties, and my thanks to Vint Cerf for the kind introduction.

I’ll be giving an abbreviated version of this talk at the NANOG 52 meeting (North American Network Operators Group) in Denver at the Sheraton, on Tuesday, June 14, in the Grand Ballroom at 11:30AM. I’ll try to focus that talk a bit more on the consequences to those operating networks, as best I can given limited time.

IEEE Internet Computing “Backspace” column on bufferbloat

May 4, 2011

Vint Cerf asked me to write his usual “Backspace” column for IEEE Internet Computing magazine on bufferbloat.  It appeared in the current May/June issue. You can find an online copy of the article on the bufferbloat.net web site (with permission of the IEEE).

Presentation for the Prague IETF 80 Transport Area Open Meeting

March 28, 2011

I’m on the agenda for the Transport Area meeting of the Prague IETF meeting.  In it, I have 30 minutes to try to convey the gist and severity of the bufferbloat problem to that audience. I have had the opportunity to give this presentation three times in preparation, once at BattlemeshV4 and twice internally in Bell Labs, so it is much more polished than the original Murray Hill presentation.

Due to the preciousness of meeting time at the IETF, I had to choose what to elide from the much longer original presentation, which includes information on how to mitigate bufferbloat and much additional detail.  On the other hand, I will attempt to speak more slowly at the IETF, so it may be more understandable to people listening (or so I hope!).

If you are attending IETF 80, I urge you to attend, and not just those who are interested in transport.  Bufferbloat is terribly damaging to applications (particularly interactive and low latency applications) and general network operations. The draft of the talk itself is already available, and the audio should be available as well as part of the IETF 80 activities. It is currently scheduled (subject to change) for Wednesday morning (Prague time) in the Congress Hall III room. I’m sure hallway conversations will cause me to tweak the talk before I present it Wednesday, but it’s getting close.

CAIDA Workshop (AIMS 2011) – Bauer and Beverly ECN results

February 22, 2011

The CAIDA workshop was last week with interesting talks; unfortunately, I did not attend.

A reminder: the CAIDA workshop is  a venue for not-really-finished work; it is a home for things that are still being worked on, and talks are sometimes half baked.

With those caveats, Steve Bauer has been kind enough to give me a copy of his slides, slightly tweaked since that workshop, that go over the state of ECN. They are much better than half-baked, but they are very much a work in progress. They give a quick overview of how ECN is implemented, and then go into the current state of deployment and problems uncovered.

For those of you who are interested in bufferbloat, but don’t know the background around ECN: ECN provides a way for hosts or routers to signal congestion as an alternative to dropping packets. Many wireless people (particularly cell operators) are particularly averse to the idea of dropping packets to signal congestion, having often gone through heroic measures to move the bits; but unless we signal congestion somehow, our bloated buffers will fill. And ECN has not seen wide adoption in the Internet due to problems in some devices; the question is, can we start using it as we have hoped for a long time?
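
For readers who have not looked at ECN before, here is a small Python sketch of the mechanics. ECN uses the two low-order bits of the old IP TOS byte, with the codepoints defined in RFC 3168. The helper names are made up for illustration, and `mark_or_drop` is only a cartoon of what a congested, ECN-aware router does, not any particular implementation.

```python
# ECN codepoints from RFC 3168: the two low-order bits of the TOS byte.
ECN_CODEPOINTS = {
    0b00: "Not-ECT",  # sender is not ECN-capable
    0b01: "ECT(1)",   # ECN-capable transport
    0b10: "ECT(0)",   # ECN-capable transport
    0b11: "CE",       # congestion experienced
}


def ecn_field(tos_byte):
    """Extract the 2-bit ECN field from the TOS / traffic class byte."""
    return tos_byte & 0b11


def describe(tos_byte):
    """Name the ECN codepoint carried in `tos_byte`."""
    return ECN_CODEPOINTS[ecn_field(tos_byte)]


def mark_or_drop(tos_byte):
    """Cartoon of a congested router (hypothetical helper).

    Returns the new TOS byte with CE set if the packet is ECN-capable,
    or None to indicate that the packet would have to be dropped instead.
    """
    if ecn_field(tos_byte) == 0b00:
        return None
    return tos_byte | 0b11
```

For example, `describe(0x02)` gives `"ECT(0)"`, and `mark_or_drop(0x02)` returns `0x03`, the same packet marked “congestion experienced” rather than dropped.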

Again, this is preliminary data; Steve and Robert will have better data later this spring and summer.

I find a number of interesting points in the data, along with a few questions:

  • Slide 10 shows the current behavior in various operating systems. Server side ECN deployment is now occurring at a good clip (remember; ECN won’t be used unless both ends agree), having gone from 1% to 12% in the last two years.  Given when the key OS releases occurred, this is encouraging.
  • The University results were strongly biased by a couple of the big research networks being mis-configured; those have been/are in the process of being fixed, and a number of the other problems (such as Steve’s home broadband carrier’s misconfiguration) are being fixed, often quite quickly (slide 38).  This shows the importance of testing tools, of which, at the moment, we have few.
  • There is a methodological result here: we can modify traceroute to build a tool to help diagnose ECN problems.  Any volunteers?
  • Another result: if routers ever turned on ECN marking, we could use this method to find which routers were congested.
  • There is some brokenness that needs fixing.

Questions include:

  • How do different paths differ?  For example, much of the remaining broken kit out there is in the broadband part of the net; but this is seldom accessed from handsets. If we can’t use ECN everywhere immediately, there may be interesting intermediate positions.
  • How best to account for the big content providers (e.g. video)? They tend to use multiple CDNs, and those CDNs are also moving targets, so it is hard to keep track of how much traffic there is ECN capable; but turning ECN on at those content sources could have a very large impact quickly. How can this monitoring be made more comprehensive and systematic?

If anyone is looking for a really helpful project to undertake, modifying mtr to both better report bufferbloat and make it useful as an ECN diagnostic tool would be wonderful. Shining the light on where problems are is essential to getting bufferbloat fixed.

CAIDA workshop – Sundaresan et al.

February 17, 2011

The CAIDA workshop was last week with interesting talks; unfortunately, I did not attend.  The CAIDA workshop is a venue for not-really-finished work; it’s a home for things that are still being worked on, and maybe half baked.

There were two presentations of interest that I know about:

  • Benchmarking Broadband Internet Performance by Srikanth Sundaresan, Walter de Donato, Nick Feamster, Renata Teixeira, Antonio Pescape
  • A survey of the current state of ECN support in servers, clients, and routers, presented by Steve Bauer of MIT.

Steve needs to finish tweaking his slides, so that will have to wait a bit.  But the Georgia Tech presentation is available. Page 18 is interesting.  Since some people don’t do well with log plots, Srikanth is kind enough to provide a version of that slide in linear form.

Upload times, in seconds (linear)

So, as you could see in the ICSI data, the times are horrifyingly bad.

