Bufferbloat and congestion collapse – Back to the Future?

I hit “publish” accidentally when I meant to press “save draft”; this post was meant for publication in several weeks. A bit of the supporting evidence hasn’t been blogged or fully researched yet. Until I remove this warning, this is a draft. Sigh. But the conclusions won’t change…

For the last decade at least, we have been frogs in slowly heating water, and have not jumped out, despite at least a few pokes from different directions that all was not well in our Internet pond. Lots of people have noticed individual problems caused by bufferbloat and how to mitigate them. To some extent, we’ve been engineering around this problem without full understanding, both by throwing bandwidth at it and with workarounds such as gamer routers. But RAM costs have dropped even faster than network speeds have risen, and rising network speeds encourage yet larger (currently unmanaged) buffers; throwing bandwidth at the problem has been a losing race.

I’ve been losing sleep the last few months. Partially, I’m at an age where for the first time I’ve lost a number of older friends in a short period. And partially, it is a very serious worry about the future of the Internet, and that previous attempts to warn of the bufferbloat problem have failed. And partially, I just don’t sleep very well. I’m not quite to the point Bob Metcalfe was in 1995 when he predicted web growth would cause the Internet’s collapse (we came close). But I’m close to such a prediction. That’s seriously bad news.

And no, I’m not going to offer to eat my words, the way Bob did. He looked much too green after consuming his column when I saw him backstage afterwards. I’ve had enough stomach and weight problems recently that doing so would be unwise. If I do ever make that prediction, I might eat a small piece of cake if I’m wrong. But so far, it’s still just worry and not prediction. But my worries have grown as I discover more.

I base the worry on the following observations:

  • The data and discussion in a previous post, as analyzed and confirmed by experts in TCP’s behavior, show that bufferbloat can and does destroy TCP’s congestion avoidance abilities. And TCP is the touchstone for all congestion avoidance algorithms in the Internet: a common reaction to new protocol proposals in the IETF is that they should be at least as network friendly as TCP.
  • Windows XP is finally, mercifully, being retired (if still more slowly than anyone, including Microsoft, would like), and everything else implements TCP window scaling. As XP goes away, dominant TCP traffic will shift from partially able to saturate most Internet links to always capable of link saturation. This is a fundamental shift in traffic character, affecting not only HTTP, but all TCP based applications.
  • There are many, many more link saturating applications already deployed, and many, many more Internet services that do so.
  • Browsers, by ignoring over the last few years the strictures in RFC 2616 and RFC 2068 against the use of many TCP connections, have been diluting congestion signaling.
  • Some major players appear to be reducing/defeating slow start. This is also really bad.
  • By papering over problems, we are repeatedly closing off solving problems. A good example is ECN, whose deployment has been delayed by broken home kit, possibly by a decade.
  • There is a misguided and dangerous belief among almost everyone in the industry that dropping *any* packets is bad. In fact, dropping some packets (or at least marking them with ECN) is essential; the trick is to drop enough, but not too much.
  • Much of the consumer kit (home routers, cable modems, DSL CPE)  is never properly maintained. Often, broken firmware is never touched after hardware is shipped, and/or usually requires manual upgrade by customers incapable of it; only recently have consumer devices started to automatically upgrade themselves (sometimes to the detriment of consumers, when features are removed).  This is a serious security risk already (I know of a way to wardrive through a major city and take down wireless networks right and left, to give a simple example).  But this also means that quickly mitigating bufferbloat will be much harder.  Often, even trivial changes of one constant here or there might reduce the magnitude of the problem by a factor of ten; but that option is not available.  So scrapping out a lot of gear needs to happen, but to do so costs money.  Who pays?  Will it happen soon enough? Can/should it be rolled into IPv6 deployment, if that happens?
  • Self-synchronized systems are commonplace, and time-based congestion problems have been observed in the Internet before. Some of the common network technologies have the property of bunching packets together into periodic bursts. My traces show stable oscillations that may or may not remain stable once random loss is put into the system, and I do not know if they would synchronize. This bursty behavior caused by intermediate nodes collecting packets together is well documented (though I don’t have the references handy).
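
The window scaling item in the list above can be made concrete with a little arithmetic. Without window scaling, TCP's window field is 16 bits, so one connection can have at most 65,535 bytes in flight per round trip. The sketch below computes the resulting throughput ceiling. The round-trip times and the 16 MiB scaled window are illustrative assumptions, not measurements from this post.

```python
# Throughput ceiling for a single TCP connection: at most one window of
# data can be in flight per round trip, so rate <= window / RTT.

MAX_UNSCALED_WINDOW = 65_535      # bytes: largest value of the 16-bit window field
SCALED_WINDOW = 16 * 1024 * 1024  # bytes: an assumed window with scaling enabled


def max_throughput_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Return the window-limited throughput ceiling in Mbit/s."""
    return window_bytes * 8 / rtt_seconds / 1e6


for rtt_ms in (10, 50, 100):  # assumed round-trip times
    rtt = rtt_ms / 1000
    unscaled = max_throughput_mbps(MAX_UNSCALED_WINDOW, rtt)
    scaled = max_throughput_mbps(SCALED_WINDOW, rtt)
    print(f"RTT {rtt_ms:3d} ms: unscaled {unscaled:8.2f} Mbit/s, "
          f"scaled {scaled:10.2f} Mbit/s")
```

At an assumed 100 ms round trip the unscaled ceiling is a little over 5 Mbit/s, which is below many broadband links. With scaling enabled the window stops being the limit, and a single connection can fill whatever buffer sits at the bottleneck.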

First, some personal history: with Bob Scheifler, I started the X Window System in 1984. It was one of the very first large scale distributed open source projects. Our team was split between MIT in Cambridge, Massachusetts, and Digital’s west coast facilities. At the height of X11’s development, the congestion collapse of NSFnet occurred. The path from Digital to MIT became so unusable that we were reduced to setting alarm clocks to 3am to rdist our source back and forth, and at times, when that would not succeed, FedEx’ing magnetic tapes to get enough bandwidth (our bandwidth was fine; our goodput zero). Additionally, Nagle’s algorithm had caused us problems with X (which does its own explicit buffering), and TCP_NODELAY was added specifically to help us. I was also the editor of the HTTP specification: one concern we had there was that many TCP connections could self-congest a customer’s line that had minimal buffering (dialup gear often had only one or two packet buffers per dialup line in that era). So I’ve both been directly scarred by, and concerned with, application generated Internet congestion, and as a network application developer I have had reasons to become much more familiar than most with its details.

The browser situation is also worrying; but I’ve not seen recent web traffic statistics and so this worry may be a red herring. What it is doing to latency is not a red herring: it is doing bad things to the jitter in your home network, as I’ll explain in detail in a future post. While the first decade of browser warfare was mostly about features, we now have a healthier situation of browser warfare on both features *and* performance. By using many, many TCP connections (6-15 is now commonplace, whereas the standard asked for no more than two connections), we’ve minimized the amount of congestion signalling going on and maximized the amount of traffic that is in slow start. And I recently caught wind of some major web sites messing with the initial congestion window. I haven’t had time to dig into this yet, so I won’t say more. While the original motivations for rules against HTTP using many connections have clearly lapsed, we may now have others due to bufferbloat. When I was editor of the HTTP spec and doing that research, I had hoped that pipelining would enable both highest performance and optimal TCP behavior; but it is now clear that, due to the ugly complexity of the HTTP protocol and the lack of a sequence number in HTTP, those hopes are in vain. Something like SPDY is in order. I’d sure like to see the HTTP protocol replaced entirely for the web; personally, I’m most excited by the CCNx project as a long term path forward there, as it enables fundamentally better performance (and would save massive amounts of energy!), but events may force shorter term band-aids. More when I blog again about browsers.
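
To get a feel for why connection counts and the initial congestion window matter together, here is a back-of-the-envelope sketch. Each new connection may send its whole initial window before it has received any congestion signal, so the unsignalled burst grows with the product of the two. The MSS of 1460 bytes, the initial window sizes of 3 and 10 segments, and the function name are assumptions chosen for illustration.

```python
# Data that can arrive before any congestion feedback is possible:
# every new connection may send its full initial window immediately.

MSS_BYTES = 1460  # assumed maximum segment size on an Ethernet-like path


def initial_burst_bytes(connections: int, initcwnd_segments: int,
                        mss: int = MSS_BYTES) -> int:
    """Bytes sent by all connections in their first round trip."""
    return connections * initcwnd_segments * mss


# (connections, initial window in segments): all illustrative assumptions
scenarios = [(2, 3), (6, 3), (15, 3), (6, 10), (15, 10)]
for conns, iw in scenarios:
    burst = initial_burst_bytes(conns, iw)
    print(f"{conns:2d} connections x {iw:2d} segments = {burst:7d} bytes")
```

Going from two connections with a small initial window to fifteen connections with a ten segment window multiplies the initial burst by roughly 25 under these assumptions, and none of that traffic has yet seen a drop or an ECN mark.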

The Internet we had just learned to depend on became utterly unusable to many of us (at least on particular paths), and even then, it was scarring. In that era, IIRC, there were only about 100,000 hosts on the Internet, most of which were running Berkeley Unix. It was an era when all the systems were being managed by computer nerds. When Van Jacobson and Mike Karels published patches to the Berkeley TCP/IP stack for slow start and other algorithms after maybe 6 months of serious pain, they were, within weeks, applied to most machines on the Internet, and the situation recovered quickly. When I discussed mitigating the cable modem bufferbloat problem with my friend at Comcast, he thought mitigations were probably possible for DOCSIS 3 (which only started shipping last year), possibly possible for DOCSIS 2, and that there was no prayer of mitigating bufferbloat in DOCSIS 1 (but he hoped the buffers there may be small enough not to be defeating congestion avoidance). I surmise that this opinion is based on the realities of what firmware still has maintenance teams for those devices. I expect a similar situation in other types of broadband gear.

The home router situation is probably much grimmer, from what I’ve experienced. We have a very large amount of deployed home network kit (hundreds of millions of boxes) much of which is no longer maintained, even for security updates (which is why the home router problem is so painful, and dangerous in my opinion).  It seems that within 6 months to a year, the engineers working on that firmware have moved on to new products (and/or new companies), and that kit with serious problems (like that which has inhibited deployment of ECN) never, ever gets fixed.

There may be a way forward to replacing all this antique, unmaintained home kit as IPv6 deploys (if it really does); to deploy IPv6, almost all home routers (and much/most broadband CPE equipment) will be upgraded. These boxes aren’t all that expensive to replace; amortized over time, the ISPs can easily afford to do so if the customers do not. But I don’t think we want to be in a situation where we have to try to replace them overnight, particularly since it will take a year or two at the minimum to engineer, test and qualify bufferbloat solutions. Replacing the old gear might be a concrete step in the war against global warming as well: new gear often (but not always) consumes less power; saving five watts would pay for a new home router in maybe 5 years, at my electric rates (I’m not sure the gear is always consuming less power though…)

Courtesy of malware, many, but far from all, of users’ operating systems get security updates, so we can have some hope for updating end user operating systems (if we can distribute the updates, that is; it may be hard to auto-update systems on non-functional networks; in the NSFnet collapse, all that had to get through were short source code patches that were applied by the recipients, and email went through in the middle of the night even at the worst of the collapse). I worry much less about 3G; those devices are still pretty new and centrally managed, with maintenance teams dedicated to the software and firmware, and there is even traffic classification around network control messages in that sort of gear. The phones are new enough that they are also getting rapid updates, and they get replaced quickly.

As I will blog about more completely shortly (I had intended to blog about a number of other topics first, rather than this entry, but what is once published on the Internet cannot be unpublished), Dave Reed was correct when he attempted to draw attention to bad bufferbloat in 3G wireless networks over a year ago. That is a different aspect of large buffers in the aggregate; you can have no packet loss, but very high delays with bufferbloat scattered through a network, when that network becomes congested. The very lack of packet loss means that queue management algorithms such as RED are not enabled. Because congestion is not signalled, the end-nodes (3G smart-phones) do not slow down transmission, the buffers bloat, and the whole network operates at high latency. These networks stay congested until (possibly) late at night, when their buffers finally drain; you see a daily pattern to their latency behavior: low latency of, say, 60ms when the network is quiet and unloaded, increasing during the day, and then dropping again when load diminishes. I’ve seen up to 6 seconds myself, and Dave has observed up to 30 seconds. In 3G, telephony is also separately provisioned, and so has the same fairness issue as I documented before; so long as no QOS service is provided to general data service applications and bufferbloat is in that infrastructure, we’ll never make low latency non-carrier VOIP and teleconferencing possible. From one point of view, however, we’re already (from the user’s aspect anyway) seeing congestion collapse on those networks, if not the packet loss form of congestion collapse warned about by Nagle and observed in the NSFnet collapse, which motivated development of TCP’s congestion avoidance algorithms.
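
The relationship between those latencies and buffer occupancy is simple arithmetic: a packet joining the back of a full queue waits for everything ahead of it to be sent. The sketch below turns a standing queue into a delay and back again. The 1 Mbit/s bottleneck rate is purely an assumption for illustration; I do not know the actual rates or buffer sizes in those networks.

```python
# Delay added by a standing queue: bytes waiting divided by drain rate.

def queue_delay_seconds(queued_bytes: float, link_rate_bps: float) -> float:
    """Time for a newly arrived packet to reach the head of the queue."""
    return queued_bytes * 8 / link_rate_bps


def queued_bytes_for_delay(delay_seconds: float, link_rate_bps: float) -> float:
    """Standing queue (in bytes) that would explain an observed delay."""
    return delay_seconds * link_rate_bps / 8


ASSUMED_RATE_BPS = 1_000_000  # hypothetical 1 Mbit/s bottleneck

for observed in (0.06, 6.0, 30.0):  # seconds: the latencies mentioned above
    backlog = queued_bytes_for_delay(observed, ASSUMED_RATE_BPS)
    print(f"{observed:5.2f} s of delay ~ {backlog / 1000:8.1f} kB queued")
```

Under that assumed rate, 6 seconds of latency corresponds to about 750 kB sitting in buffers along the path, and 30 seconds to several megabytes, all of it data that arrived without any signal to the sender to slow down.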

We’re clearly suffering from steady-state congestion already; bufferbloat in broadband and 3G has injected pain. But it illustrates another facet to the issue. Mere aggregation of a problem can cause other problems to occur (e.g. diurnal 3G network congestion). Just because we’ve seen and understood one aspect to a problem does not mean we understand all of the consequences.

When I talked to Van Jacobson about why active queue management is not universally enabled (a coming post will discuss that), he pointed out that we must also be concerned about dynamic congestion effects in the network; not just the TCP oscillation you see in my traces, but also that much of the network gear being built, often in the name of bandwidth optimization, is processing bunches of packets at once and bunching them together, to be re-emitted in bursts. Van wants there to be a way to schedule the transmission of outgoing packets, so that devices could defeat this bunching rather than the bursts traveling through the network aimed at some bottleneck someplace that might not be able to deal with them. It is part of what is good about the “random” in RED. Time based behavior is more subtle to understand, but might be as troublesome as what we already see.
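
For readers who have not met RED, the role of the “random” can be sketched in a few lines. This is a deliberately simplified version of the idea: it keeps a moving average of queue length and drops (or marks) arriving packets with a probability that rises linearly between two thresholds. It omits the refinement in the published algorithm that spaces drops out more evenly, and the class name and all parameter values are assumptions for illustration, not tuned recommendations.

```python
import random


class SimpleRed:
    """Simplified Random Early Detection gate (illustrative only)."""

    def __init__(self, min_th=5.0, max_th=15.0, max_p=0.1, weight=0.002,
                 rng=None):
        self.min_th = min_th    # below this average queue: never drop
        self.max_th = max_th    # at or above this average queue: always drop
        self.max_p = max_p      # drop probability just below max_th
        self.weight = weight    # smoothing factor for the moving average
        self.avg = 0.0          # moving average of queue length (packets)
        self.rng = rng or random.Random()

    def drop_probability(self) -> float:
        """Drop probability implied by the current average queue length."""
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        span = self.max_th - self.min_th
        return self.max_p * (self.avg - self.min_th) / span

    def on_arrival(self, queue_len: int) -> bool:
        """Update the average; return True to drop or mark this packet."""
        self.avg += self.weight * (queue_len - self.avg)
        return self.rng.random() < self.drop_probability()


red = SimpleRed(rng=random.Random(1))
drops = sum(red.on_arrival(queue_len=12) for _ in range(5000))
print(f"average queue {red.avg:.1f} packets, {drops} of 5000 arrivals dropped")
```

Because each drop decision is an independent coin flip, the flows that get signalled differ from one moment to the next, which works against many senders backing off and ramping up in lockstep.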

So we have a number of different resonant frequencies in the network; some are timers used in network protocols; some are timers in various network gear, for their own internal implementation. And self-synchronizing behavior in large systems is more than a theoretical possibility; it has been observed before. Are we a bunch of soldiers marching in cadence on a bridge? Will oscillations form slowly enough that we can react? Or will the bridge rapidly fall?

What’s going to happen over the coming five years? I don’t know. I do know that by messing with slow start and destroying congestion avoidance in TCP, we’re playing with fire.

So I worry. And these worries make it hard to get back to sleep. Or am I just being an old Internet soldier suffering from post-traumatic stress disorder?

I do want to see a vigorous discussion of these fears; if you can dispel them, I can sleep better.

12 Responses to “Bufferbloat and congestion collapse – Back to the Future?”

  1. nona Says:

    Time to remind you that you were going to write about CCNX and some of your ideas for networked graphics based on it.

    I’m quite curious about that.

  2. JohnH Says:

    Much of the problem with bufferbloat seems to reflect the debate between TCP/IP and ATM. The latter had a much better set of capabilities for defining and using QOS on a per-connection basis. Unfortunately, I don’t believe any of it really made it past the standards stage, at least for dynamic SVCs. QOS with guarantees requires point by point bandwidth allocation/release, the way traditional telephone calls reserve bandwidth. For better or worse, the QOS features in TCP are not well quantified, and difficult to implement on an end-to-end basis. For QOS to work properly, the expected usage needs to be identified (throughput and latency), the endpoints need to limit to that throughput, the network needs to police that throughput, and the network needs to prioritize that traffic to meet that QOS. Sometimes, that will mean that the network has to refuse a request for transport because of resource unavailability, along the lines of “all circuits are busy”. I don’t think this would be a good approach to take.

    Without QOS support, it’s up to the endpoints to try to replicate QOS functionality. I believe that the network still needs to interpret QOS settings and prioritize packets based on QOS. But it is ultimately incumbent on having endpoints properly tag their packets, and to limit bandwidth to what the network can deliver, adjusting to changes in network congestion without becoming an unstable mess. This becomes challenging considering that congestion is asymmetrical, and changes locations over time. One factor which can help endpoints to stay honest is to either charge for high priority packets, or limit the amount one can use without being subject to high discard rates, or some combination. Network discard is also something needed to help limit denial of service attacks with high priority packets. Higher priority packets in a TCP stream would also likely be associated with smaller window sizes.

    • gettys Says:

      Actually, the issues around ATM are different, and as I keep repeating, QOS is mostly a red herring to what I’m talking about.

      The fundamental issue is how you signal congestion; we need some mechanism to signal it. We have either packet drop, or ECN, but ECN deployment isn’t real, at least yet. Without timely signalling, the endpoints cannot avoid congesting the network. Adding buffers far in excess of the RTT has destroyed the congestion avoidance of all protocols, no matter what QOS they might have requested, and, given the pervasive existing deployment of bufferbloated devices which cannot implement QOS whether we want it or not, QOS is not part of the fundamental solution to bufferbloat.

      Whether QOS is available/present is therefore a different discussion, and indeed, there are uses for QOS that can help ensure yet better behavior, which, in fact, we need more than I had realized until recently due to what the browser and web server folks have been doing without much thought, as I’ll blog sometime soon. And yes, you can use QOS for all the useful reasons you state. Whether we charge for prioritized service is again another axis of discussion, but also not really germane to the bufferbloat problem.

      As to ATM, it was never going to be deployed end-to-end, so even if reservation could possibly have been made to scale to the world, it would never have solved the entire problem. What chance ATM technology had to make a serious dent in networking was lost, in my opinion, when the ATM Forum ignored the existence proof of Chuck Thacker’s Autonet switch, which would not drop cells. The ATM Forum folks ignored the fact that dropping one cell in a much larger IP packet (which was going to happen with dismaying frequency, given what they were doing) was going to cause too high a packet loss rate for ATM to be a good medium for the Internet. I find it very unfortunate that they did not listen (Chuck had put his money where his mouth was, by building running hardware to prove his point). So it was a technological opportunity lost.

  3. Peter Korsgaard Says:


    Linux recently changed the default initial TCP receive window size to 10 MSS rather than the 2-4 MSS specified in RFC3390:


    See https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/ for the draft RFC specifying this.

  4. Arms Says:

    Another anecdotal reference:

    Recently, I had a discussion with a not-so-small enterprise-grade switch vendor, around the topic of Incast (multiple, synchronized flows colliding at a single port, leading to burst loss and excessive TCP latency due to RTO).

    The CTO of that company (remember, they deal with wire-speed, core datacenter networks) told me in full honesty that they will solve this Incast scenario by simply delivering an ultra-high end switch, which features many Gigabytes (!) of (shared) buffer memory. When I ran the numbers, the figure he mentioned came to slightly over 500 ms of buffering at 10 Gigabit/s wire-speed. (I think it was about 1GB SRAM, but all the crossbar/signalling overhead reduced this a bit).

    Also, as buffering these days no longer happens in terms of frames, but bytes, small frames (such as pure ACKs) no longer alleviate this buffer bloat as it may have done in the distant past (with fixed-size, dedicated port buffers).


    • Alex Yuriev Says:

      And that’s the right approach because every single thing that you guys are arguing about in reality is a product of a horrible network design. The CTO of the switch company is trying to deal with serialization issue. There’s a reason why Comcast has 250GBytes/transfer per month caps and there’s a reason why 250GB/month is basically what you can transfer at 1Mbit/sec for 30 days plus some overhead.

      If you want to convince network operators with properly engineered networks that buffer bloat is the problem demonstrate that it exists when both sides are connected to a single commercial grade 1Gbit/sec switch (i.e. a switch that can switch traffic between those ports at a wire speed at 1Gbit/sec regardless of the size of the packets that are being switched ). After that use 1G-10G-1G network topology and 10G-1G-10G network topology for the same tests.

  5. mjodonnell Says:

    I was delighted to discover this discussion just in time for my class on why the Internet actually works. Then, I didn’t get any students. Oh, well.

    I was astonished to find that oversized buffers are so widely deployed. I regard myself as a rather ignorant dabbler in Internet design, so I thought anything that I had learned must be general knowledge.

    If I understand correctly, bufferbloat is a very bad idea, independently of its impact on TCP flow control. TCP problems may provide the first dramatic demonstration.

    Some years ago, I learned from Michael Greenwald to think of congestion, as a protocol problem (not the same as a network topology problem), not in terms of dropping packets, but in terms of dropping the wrong packets at the wrong time. In particular, the forwarding protocol is suboptimal whenever, in a collision between packets, it drops one that could have been delivered usefully in favor of one that will never be useful.

    I like to think of this in terms analogous to cache/VM misses: there are misses due to capacity and misses due to the replacement strategy. The former can only be addressed by adding memory/cache, while the latter can be addressed in a number of ways, including the compiler and the OS. Similarly, there are capacity packet drops which can’t be avoided, and congestion packet drops due to packets which kill other packets before dying themselves.

    TCP throttling addresses the packets within a single TCP flow that kill other packets and are lost before reaching the receiving application.

    Bufferbloat should have a general bad impact on congestion drops, through preserving old packets so that they can do more damage, even after they are obsolete (e.g., have already been retransmitted). Tail dropping instead of head dropping makes the problem worse. There’s probably a good argument for stack order processing, or perhaps a stack of blocks, where each block is sized to preserve heuristic packet ordering.

    I understood that buffers should be sized based on variance, rather than on mean throughput. I think that this idea goes back to queueing theory long before it was applied to packet-switched networks. Presumably, the formula for buffer size that you quoted involves an estimate of variance based on some statistical assumptions.

    I hope I haven’t misused the comment forum too much. I haven’t grokked the structure of the blog very well yet. I’m fishing for any useful discussion that might follow up. In case someone wants it, the material on capacity vs. congestion drops exists only on slides for my class: http://people.cs.uchicago.edu/~odonnell/Teacher/Courses/Strategic_Internet/Slides/ slides number 57-68 on pages 82-103 of the PDF with notes.


    Mike O’Donnell

  6. Network Buffer Bloat – flyingpenguin Says:

    […] Very widespread. I hate to spoil the story, but here’s the conclusion: By inserting such egregiously large buffers into the network, we have destroyed TCP’s congestion avoidance algorithms. TCP is used as a “touchstone” of congestion avoiding protocols: in general, there is very strong pushback against any protocol which is less conservative than TCP. This is really serious, as future blog entries will amplify. […]

  7. Sean Stapleton Says:

    Jim, thanks for the great series. Below is a link to a nice first-order analysis of slow-start cheating.

    Google and Microsoft Cheat on Slow-Start. Should You? http://blog.benstrong.com/2010/11/google-and-microsoft-cheat-on-slow.html
