The criminal mastermind: bufferbloat!

Each of these initial experiments was designed to clearly demonstrate a now very common problem: excessive buffering in a network path. I call this “bufferbloat”. We all suffer from it end-to-end, and not just in our applications, operating systems and home networks, as you will see.

Large network buffers can be thought of as “dark buffers”, analogous to “dark matter” in the universe: they go unnoticed under most circumstances, and you can detect them only by indirect means. Buffers do not cause problems when they are empty. But when they fill, they introduce additional latency (and create other problems, possibly very severe) for other traffic sharing the link.

In the past, memory was expensive and the bandwidth of a link was fixed; in most parts of the path your bytes take through the network, the necessary buffering was easy to predict, and there were strong cost incentives to minimize extra buffering. Times have changed: memory is now really cheap, but our engineering intuition is still to avoid dropping data. This intuition turns out to be wrong, and has become counter-productive.

The network is often completely congested

All modern operating systems on modern hardware can trivially saturate any network up to a gigabit per second with even a *single* TCP connection. (This leaves out Windows XP, which does not enable TCP window scaling by default, so it cannot have more than 64KB in flight at once and, over continental U.S. delays, won’t go much faster than about 6Mbps.) This means that bulk data transfer, in either direction on broadband links (and other Internet links), will routinely saturate one hop in the path: the minimum-bandwidth hop between sender and receiver. In your home, this is most likely your broadband Internet connection, though as Experiment 2 showed, it can easily be the wireless link in your home router. Other protocols can also saturate a link: e.g. BitTorrent, UDP-based protocols, etc.
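
As a sanity check on that 6Mbps figure: a single TCP connection’s throughput is bounded by roughly one window per round trip. Here is a minimal back-of-the-envelope calculation in Python (the 85ms round-trip time is simply an assumed typical coast-to-coast delay):

    # TCP throughput is limited to roughly one window per round-trip time.
    window_bytes = 64 * 1024     # largest window without TCP window scaling
    rtt_seconds = 0.085          # assumed typical coast-to-coast round-trip time

    max_throughput = window_bytes * 8 / rtt_seconds
    print(f"{max_throughput / 1e6:.1f} Mbit/s")   # ~6.2 Mbit/s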

As I demonstrated, it’s really easy (on Linux or Mac OS X) to saturate a 100Mbps Ethernet link in your home or office (still probably the most common switch speed), and even easier to saturate any wireless link, with bad consequences.

Congestion avoiding protocols (e.g. TCP, which is our commonly accepted touchstone of congestion avoiding protocols, by which others are judged) rely on a low level of timely packet drop to detect congestion (or, in modern TCP stacks, on packets marked with ECN bits; for various reasons, ECN is not typically used today). They regulate their sending rate by noticing when packets are dropped, so as not to overfill the chokepoint on their network path. Timely congestion notification via packet drop is a fundamental design presumption of the Internet communications protocols.
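
To make that concrete, here is a deliberately oversimplified sketch (in Python; real TCP stacks are far more involved) of the additive increase/multiplicative decrease idea behind Reno-style congestion control; the numbers are illustrative only:

    def update_cwnd(cwnd, ssthresh, acked_segments, loss_detected):
        """One (greatly simplified) round trip of Reno-style congestion control."""
        if loss_detected:
            ssthresh = max(cwnd / 2.0, 2.0)    # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd += acked_segments             # slow start: cwnd roughly doubles per RTT
        else:
            cwnd += acked_segments / cwnd      # congestion avoidance: ~ +1 segment per RTT
        return cwnd, ssthresh

    # With no drop (and no ECN mark), nothing here ever slows the sender down.
    cwnd, ssthresh = 2.0, 64.0
    for _ in range(20):
        cwnd, ssthresh = update_cwnd(cwnd, ssthresh, int(cwnd), loss_detected=False)
    print(cwnd)   # keeps growing until the path finally signals congestion

The only brake on that growth is the drop (or ECN) signal; if the network delays that signal, the sender just keeps speeding up.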

Buffering is necessary

Some buffering is clearly necessary in a network: it means that short bursts of packets can be absorbed without significant loss. It is important that there is space available to hold packets waiting for transmission. We certainly don’t want a lot of packet loss every time two packets happen to arrive at once, or happen to be generated nearly simultaneously on a host. And for reasons that will become clear later, some applications (e.g. web browsers) cause bursts of packets.

Too much buffering is bad

Packets reach a choke point on the path and are queued; a queue can drain only as fast as the link can transmit packets. If more packets come in than can be transmitted, the queue gets longer. Clearly, we don’t want the queue length to grow without bound; sooner or later we had better drop some data. (Note that even with traffic classification, you still have queuing going on; just in more than one queue.) When a full queue must drop, there are three choices: you can drop from the tail of the queue, drop from the head of the queue, or pick a random packet.
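
To make the three choices concrete, here is a toy sketch in Python (nothing like a real driver queue, but it shows where each policy discards data):

    import random

    def enqueue(queue, packet, limit, policy="tail"):
        """Bounded queue sketch: returns the dropped packet, or None."""
        if len(queue) < limit:
            queue.append(packet)
            return None
        if policy == "tail":                 # drop the newly arriving packet
            return packet
        if policy == "head":                 # drop the oldest packet in the queue
            dropped = queue.pop(0)
        else:                                # "random": drop some queued packet
            dropped = queue.pop(random.randrange(len(queue)))
        queue.append(packet)
        return dropped

Head drop has the advantage that the discarded packet is the one that has waited longest, so the loss signal tends to reach the sender sooner than with tail drop.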

A queue builds on the upstream side of the slowest hop in the network path; which device holds that queue depends on which direction the packets are flowing, so there are potentially two such queues, one for each direction. The more packets queued, the higher the latency through the queue. The bottleneck, of course, may be at a different hop in one direction than in the other.

In the experiments, you are seeing (excessive) queuing in operating systems and in home routers. You see various behavior as TCP tries to find out how much bandwidth is available, and (maybe) different kinds of packet drop (e.g. head drop or tail drop; you can choose which end of the queue to drop from when it fills). Note that any packet drop, whether due to congestion or to random packet loss (e.g. from wireless interference), is interpreted as possible congestion, and TCP will then back off how fast it transmits data.

This raises a fundamental question in a packet switched network: how much buffering is enough? It is a complex question. The classic answer, going back to Kleinrock’s work on keeping a path optimally loaded, is on the order of the “bandwidth delay product”: enough data must be in flight to run as fast as the transmission path allows, plus some headroom for other traffic that might arrive (more flows). Recent research shows that this is really more of an upper bound: with many flows sharing a link, routers can often work well with far fewer packets queued, on the order of the bandwidth delay product divided by the square root of the number of flows.
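
A quick worked example (the path parameters here are made up purely for illustration):

    # Hypothetical path: 100 Mbit/s bottleneck, 100 ms round-trip time, 100 flows.
    bandwidth_bps = 100e6
    rtt_s = 0.100
    n_flows = 100

    bdp_bytes = bandwidth_bps / 8 * rtt_s      # classic rule of thumb: 1.25 MB
    reduced = bdp_bytes / n_flows ** 0.5       # ~125 KB when many flows share the link
    print(f"BDP: {bdp_bytes / 1e6:.2f} MB; divided by sqrt(N): {reduced / 1e3:.0f} KB")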

We clearly need some buffering, but not too much. Cheap memory means we can have GOBS of buffering. In fact, many Linux network drivers are telling Linux to buffer 1000 packets in transmission! At 1500 bytes/packet, this is 1.5 megabytes! (X11 was developed on VAXes with 2MB of RAM total.) That is 12 megabits. At 10 megabits/second, it’s going to take more than a second for such a queue to drain. At 1Mbps, well over 10 seconds.

These network devices cover two orders of magnitude of performance (both Ethernet and wireless), but the buffering number was probably picked (if thought about consciously at all) to enable the highest bandwidth over planetary-scale paths for a server with many simultaneous flows, without any thought about latency when the same hardware is running very slowly, on low-delay paths, with just a few flows.
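
The scaling problem is easy to see by computing how long that same 1000-packet transmit queue takes to drain at different link rates (a small sketch; the rates are just representative points across those two orders of magnitude):

    packets, bytes_per_packet = 1000, 1500        # a typical Linux txqueuelen, full-size frames
    queue_bits = packets * bytes_per_packet * 8   # 12 Mbit sitting in one queue

    for rate_mbps in (1, 10, 100, 1000):
        drain_s = queue_bits / (rate_mbps * 1e6)
        print(f"{rate_mbps:>5} Mbit/s -> {drain_s:6.2f} s to drain a full queue")

At a gigabit that queue is a harmless 12 milliseconds; at the 1-10Mbps a wireless link may actually be running at, the very same queue is 1-12 seconds of latency.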

What is more, I’ve demonstrated in these experiments two cases of bufferbloat in Linux: its transmit queue (the txqueuelen knob we twisted), and the transmit rings in the NICs (sometimes adjustable as well). Other hardware/software combinations have additional hidden buffer locations.
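
On Linux you can inspect both of these yourself: the driver transmit queue length shows up in the output of ip link (and can be changed with ifconfig or ip link set), while the NIC ring sizes are reported by ethtool -g. Here is a minimal Python sketch that reads the queue length from sysfs (it assumes a Linux system and an interface actually named eth0; adjust for your hardware):

    # Read the driver transmit queue length that Linux exposes in sysfs.
    iface = "eth0"                      # assumed interface name
    with open(f"/sys/class/net/{iface}/tx_queue_len") as f:
        print(f"{iface} txqueuelen:", int(f.read()))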

There is no (single) right answer

I hope the discussion above makes it clear that there is no single right answer possible for buffering. The right amount is a function of bandwidth (which on both Ethernet and wireless can easily vary by several orders of magnitude; further complicating wireless is the difference between “goodput” and “bandwidth”), your workload (the number of flows), and your tolerance for latency.

Next installment

Having demonstrated bufferbloat in simple experiments everyone can perform, we’ll move on to what these bloated buffers are doing to TCP in more detail, and why TCP ends up filling the buffers at the choke point, as best I can explain (I’m not a full-fledged TCP expert). I’ll include TCP traces illustrating the problem (over home broadband links). Other protocols are equally capable of causing the buffers to fill; this is not unique to TCP.

43 Responses to “The criminal mastermind: bufferbloat!”

  1. The problem with excessive buffering in networks | Kodden's Corner Says:

    […] Jim Gettys, and the problem seems to be worse than anticipated. You can read more about the problem here, and the detailed analysis here. Linux, Networkingcongestion, optimization, performance, […]

  2. Geoff H Says:

    Great analysis! Network management is one of my weaker points, and I had been wondering about some of these issues, as I’d seen strange things on systems that I didn’t think I should be seeing.

    Network buffers that are too large give me a good explanation for what is going on (though I have to investigate to find out), and an important gap in my understanding of delays and congestion in modern networking just got filled in by your info.

    Thanks! 🙂

  3. Increasing TCP’s initial congestion window | monolight Says:

    […] is a form of bufferbloat, so it does affect […]

  3. Problems with buffering in modern TCP/IP networks (in English) | Телекомблог: notes and analysis Says:

    […] «The criminal mastermind: bufferbloat!» – outlines the general problems of excessive buffering in the network; as a packet makes its way through the network it is buffered at every loaded gateway, which ultimately leads to delays; […]

  5. Analysis of buffering problems in modern TCP/IP networks | AllUNIX.ru – the all-Russian portal on UNIX systems Says:

    […] «The criminal mastermind: bufferbloat!» – outlines the general problems of excessive buffering in the network; as a packet makes its way through the network, buffering at every loaded gateway ultimately leads to delays; […]

  6. I, Cringely » Blog Archive » 2011 predictions: One word — bufferbloat. Or is that two words? - Cringely on technology Says:

    […] it will be involved in three of my predictions for the coming year.Bufferbloat is a term coined by Jim Gettys at Bell Labs for a systemic problem that’s appearing on the Internet involving the downloading of […]

  7. Neil Jerram Says:

    Thank you for researching and writing about this. For some time I have been seeing massive delays in transport from A to B in my home network, as illustrated below, and it appears that bufferbloat is indeed the culprit.

    A ——wired Ethernet—— Netgear DG834G ——-wireless——— B

    With “nttcp -i” on A and “nttcp ” on B, in 3 tests I measured 9, 8 and 8s to transfer 8Mb.

    The other way round (i.e. “nttcp -i” on B and “nttcp ” on A):

    – with the router’s default txqueuelen of 100, in 2 tests I measured 137 and 95s for an 8Mb transfer

    – with the router’s txqueuelen set to 0, in 3 tests I measured 19, 12 and 24s for the transfer.

    It seems there is still a buffering effect there (compared to the upload direction), but already that’s a massive improvement – thank you!

    Neil

    • gettys Says:

      Some hardware does little or no buffering (usually older hardware); in those cases, you can’t set the txqueuelen to zero (without making your machine catatonic).

      The fundamental issue is that some buffering is necessary, but we’re way beyond what it should be. The hard part about this is making our operating systems automatically handle the buffering properly.

  8. Stephan H. Wissel Says:

    Hi JG,

    “.. the hard part … is making our operating systems automatically handle the buffering properly” sounds like you understand what can/must/should be done.

    Would make a great post with instructions for Mac and Linux.
    🙂 stw

    • gettys Says:

      There are some immediate mitigations I’ve outlined in the blog: real solutions are harder; wireless is much more challenging than Ethernet. I’ve already posted mitigations elsewhere in the blog.

      I think the way forward is that AQM of some sort needs to be always present, whether Van’s nRED algorithm or something else. For example, there is a mad-wifi implementation of Buffer Sizing for 802.11 Based Networks, by Tianji Li, Douglas Leith, David Malone at http://www.hamilton.ie/tianji_li/buffersizing.html that you can play with immediately. It isn’t my area of expertise, so we have a design space to explore.

      Van thinks there is a simpler, more robust solution than the above: but no one has tried it yet (and can’t until he’s finished tweaking the paper that never got published). Running code beats unpublished algorithms any day.

  9. KeithMc Says:

    Is it just buffering in general (of all IP packets) or just TCP?
    You might want to try UDT:
    http://udt.sourceforge.net/

    Their Power Point presentation gets to the heart of the TCP throughput being severely throttled by latency and packet-loss problem.

    Here’s a quick overview of UDT:
    http://blog.goofy.net/post/1042563748/udt-now-in-java-flavor

    • gettys Says:

      Bufferbloat causes problems with all protocols, not just TCP.

      And it is the lack of timely notification of congestion by these buffers that is destroying congestion avoidance; the key observation is that transport protocols (not just TCP) will fill whatever buffers are in the path.

  10. Jay Moran Says:

    Excellent post gettys! Can’t wait to continue reading the rest of them. Hopefully lots of folks read this and what I can assume will be many more great posts on the subject. Always amazes me when Network “Designers” don’t understand some of the most basic pieces of network fundamentals… but I guess they know how to type “router bgp” really well. 🙂

    • gettys Says:

      Well, the consequences of bufferbloat aren’t immediately obvious. It took years to sort this out in the 1980’s and 1990’s; some of us thought the problem was “solved” and went to sleep. DARPA thought the internet was a solved problem and stopped doing network research, and NSF only cared about “going fast”, where fast is defined solely as the number of bits/second, rather than latency.

      “We’re all bozos on this bus”, I’m afraid. Don’t cast any stones: there are glass houses around everyone I’ve looked at so far.

      So here we all are together, all careening down a highway that is almost always congested somewhere.

      The best I hope for is that we all learn from this pain (as some of us did from the NSFnet collapse in 1986), and avoid a crash.

  11. johne Says:

    With all due respect, this analysis is fundamentally flawed and demonstrates a lack of understanding of the fundamentals that are behind TCP. Many of the individual statements are true, but are often combined in a way that suggests unfamiliarity with the way the various pieces of the puzzle interact with each other.

    With the assumption that the hosts using the internet are using a “bug free” TCP stack (or one sufficiently bug free for the purposes of our discussion, in particular with regard to its congestion control behavior), it is my opinion that “bufferbloat” in the routers that a flow must traverse can not, not even in principle, negatively impact the performance of that TCP flow.

    The reason for this lies in the very reason why the first “Great Internet Congestion Collapse” occurred: at the time, the emergent phenomenon of “many” hosts using TCP that were attempting to use the “maximum capacity” of a flow’s path, and the interaction between multiple hosts of the algorithm used in TCP at the time that attempted to achieve that “maximum capacity”, was not well understood. The interaction of all these hosts created a very non-linear effect, which tends to be the type that “it’s not a problem, until one day everything comes to a complete stop” due to the (usually) exponential growth of the “problem” once it goes past a certain tipping point. Sort of like the phenomenon at work when you watch the smoke coming off a burning cigarette: nice and smooth up to a certain point, and after that point it becomes extremely chaotic and turbulent.

    This resulted in what could arguably be called the only “major” change to the TCP/IP stack: Van Jacobson’s TCP congestion control algorithm (i.e., TCP Tahoe, TCP Reno), which can be summarized as “Additive Increase, Multiplicative Decrease”.

    IMHO, AIMD, coupled with the “ACK clocking phenomena”, makes it almost fundamentally impossible for “additional buffers in network routers” to negatively impact the performance of TCP traffic. Now, buggy implementations can throw a monkey wrench in to things, but IMHO, the fundamental principles of the way contemporary TCP does congestion control mean that your entire premise, and in particular conclusion, is immediately suspect. The “ACK clocking” phenomena basically ensures that if something happens to alter the bandwidth characteristics between two communicating end points, up to and including having the receiving end point “vanish” from the net, it takes “at most” 1 RTT delay for the sending side to stop adding new packets to the network. At no point does a TCP sender just keep piling on the packets, making what could be a bad situation exponentially worse. As soon as any problem is “sensed”, the sending side immediately and “instantly” stops adding packets to the network. Some observed behavior has led to some heuristics and recommendations that a sending side can use to “begin exponential back-off” of its sending rate faster than it would normally otherwise happen. For example, ECN allows end points to determine that “something is happening in the flow’s path that, if it keeps up, will force a router to resort to hard dropping of packets… and therefore, the smart thing to do is go in to some form of slow start recovery before it gets to that point.”

    […]In the past, memory was expensive, and bandwidth on a link fixed; in most parts of the path your bytes take through the network, necessary buffering was easy to predict and there were strong cost incentives to minimize extra buffering. Times have changed, memory is really cheap, but our engineering intuition is to avoid dropping data. This intuition turns out to be wrong, and has become counter-productive.[…]

    Actually, for a backbone router, the routers most likely to contribute to “bufferbloat” for most of the end points on the Internet, memory is still extremely expensive for the rates that backbone routers need to operate at. All the major vendors of backbone routers ship 100 gigabit/sec interfaces that can be used as backbone links. At 64 bytes, this means a theoretical maximum of 195,312,500 packets per second. Another way of saying this is there is a new packet every ~5.1 nanoseconds. Let’s just say that being able to write 64 bytes to a “memory” every 5.1 nanoseconds is fairly impressive even today. Being able to treat said memory as a complete 64 byte wide FIFO that can queue and dequeue a new packet every 5.1ns makes it all the more impressive. It’s safe to say that we’re well beyond the DDR3 memory you’d find in your desktop or laptop. In the unlikely case that you’re not building the FIFOs right in to your full custom VLSI packet processor silicon, a quick check for a discrete FIFO turns up the Cypress CY7C4281V-10JXC, which is rated at 512 kiloBITS of FIFO organized as 64Kb x 9 (bits!) @ 100 MHz @ 97 dollars (as listed on the site, which is likely to drop dramatically in volume orders).

    You’ll notice that you’d need a fairly wide array of these to reach your 64 bytes, or 512 bits, minimum packet size. The 100 MHz rating falls a bit short of the needed ~200MHz as well. In other words, we’d need 64 of these to meet the needed width, and we’ll just side step the speed issue, but the end result would be 64 FIFO chips for a 64K deep FIFO packet buffer that would cost us a list price of ~$6,208. And that would be for just a single buffer for a single link in just one direction.

    This translates in to a cost of $6,208 for 4megabytes of memory, compared to a cost of ~$100-$200 for 4 GIGAbytes of DDR3 DRAM memory.

    Clearly nobody builds things this way (i.e., with individual discrete FIFO chips), they bring all the memory “on die”. But we’re still talking about some form of “custom chip”, be it ASIC, full custom VLSI, or “something else”. As a good rule of thumb, the combination of any of these possibilities is mutually exclusive with “cheap”, at least relative to the way you’ve described and used “cheap” (but this is IMHO).

    Additionally, there seems to be some confusion as to what buffering requirements the BDP implies / requires. This is the amount of buffer required by the SENDING end point in order to make use of all the bandwidth available between the two end points under consideration. The amount of buffering required of the routers / hops along this BDP path is exactly zero. The “buffering” that is taking place between these two end points, which in turn drives the buffer required by the sending end point, takes place ENTIRELY “in space”. Even if you had a point to point link @ some bits/sec between two end points that communicated via lasers in a perfectly straight line in a vacuum over a “planet wide distance”… you’d still need a huge buffer on the sending side in order to be able to send packets at the link’s maximum rate using TCP. A bit of thought should be enough to tell you that there’s really not a whole lot of difference between the straight line vacuum laser case over large distances and a TCP connection from the east to west coast that has to traverse a number of router hops in the middle.

    In all honesty, I have not read all of your site regarding your thoughts on “bufferbloat”, but I do see a few problems with what I’ve seen so far. It’s safe to say that I could be considered an “expert” (whatever that means) on these matters, but don’t for a second take that to mean that I’m right or that you’re wrong. If what I say makes sense, then it should be “obvious”. If what I say is compelling, and you update your “internal model” of whatever you’re seeing and describing here and are still “coming up short”, then keep pursuing it. There is always the very real possibility that you’ve stumbled across something important, but for whatever reason have incorrectly attributed the cause to “bufferbloat”. As always, if you’re right, you’re right, and right trumps everything… especially someone who’s armchair bitching in the comments of a blog post. 🙂 I’ve been told more than once by “experts” that I didn’t know what the hell I was talking about. Not always, but more than a few times, I was right, and it’s hard to argue with right especially when you’ve managed to “turn it in to something real” (i.e., not just right in theory, but right in theory and practice, with something tangible to show for it). It’s easy to criticize, but I’ll root for anyone who believes they’re right despite the fact that “experts” say they’re wrong, (hopefully) even if one of them is me.

    • gettys Says:

      Please continue to try to poke holes at this.
      The TCP behavior is an important point. I’m not a TCP expert; as I’ve said earlier, I’m primarily expert at using TCP. I will get details wrong from time to time. Please keep poking at the arguments.

      When I saw the xplot results I posted I was worried as they looked nothing like the plots I was familiar with. (A link to the actual data is there; even better is to take more data of your own.) As I wrote there, I asked actual experts (Dave Clark, Vern Paxson, Van Jacobson, Dave Reed, Dick Sites and Greg Chesson) what was going on to cause such behavior. Quite a long mail discussion ensued.

      As best as I can try to reproduce that discussion, and I can easily be stating details wrong, what’s going on here is that TCP is trying to do its usual thing. So after the initial slow start phase (which will very quickly attempt to fill the bottleneck buffer until it gets its first loss), it starts its much more gradual probing toward the available bandwidth. So TCP sends data faster, no packet loss is detected after a RTT, so TCP speeds up a bit more, and more, until a queue starts to form in the bottleneck. But since no loss is seen, TCP continues to increase its speed a bit more (making the queue longer, so the RTT increases) and a bit more (making the queue longer, so the RTT increases); eventually TCP may even decide the path has changed and become more aggressive about upping its bandwidth. Eventually the buffers fill again and there is again loss. So for much of the time, TCP is actually sending data faster than the bottleneck can carry; you’d expect the usual TCP sawtooth pattern. And the buffers fill.

      This process of initially filling the buffer takes seconds, even on a path of 10ms. There are many RTTs expended to drive TCP to the point of insanity. But if you lie to it long enough, it gets indigestion. The algorithms work presuming timely packet drops; the buffer sizes we are seeing are multiple times the path latency. They aren’t just big, they are bloated.

      Fundamentally, the insertion of large buffers many times the actual path latency has destroyed TCP’s servo loop: it becomes unstable.
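
      A toy illustration of why (the 10Mbps rate and 10ms base RTT are made-up numbers, not taken from the traces above): every packet the sender keeps in flight beyond the bandwidth-delay product adds queueing delay but no throughput, so as long as nothing is dropped the RTT just keeps climbing.

          # Queueing delay on a 10 Mbit/s bottleneck with a 10 ms base RTT.
          rate_pps = 10e6 / (1500 * 8)          # ~833 full-size packets per second
          base_rtt = 0.010
          bdp = rate_pps * base_rtt             # ~8 packets is enough to fill the pipe

          for in_flight in (8, 16, 64, 256, 1024):
              queue = max(in_flight - bdp, 0)   # excess packets sit in the bottleneck buffer
              rtt_ms = (base_rtt + queue / rate_pps) * 1000
              print(f"{in_flight:4d} packets in flight -> RTT ~{rtt_ms:5.0f} ms")

      With a bloated buffer, the loss that would stop this growth simply arrives far too late.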

      Then, having seen the ICSI data and having whiffs of smoke in my home router, I went looking elsewhere for trouble.

      Either the buffers should be kept dramatically smaller, or (since the correct amount of buffering is very hard to predict under many circumstances), we have to use some form of AQM.

      As to AQM in the backbone, certainly most backbone vendors run it (or provision their links with excess capacity for failover so that they will never run with long queues). Unfortunately, as you go toward the “edge”, life is more problematic; there is data showing the lack of AQM in residential broadband (see, for example, Characterizing Residential Broadband Networks by Dischinger et al., section 4.3.2). Hotel networks I’ve stayed at have clearly suffered from broadband bufferbloat, but I’ve also seen the bottleneck deep in those networks, not at the edge. And in checking up on “smoke” from smokeping data in my own company’s internal network, I discovered that while it runs with quite elaborate classification, no AQM was enabled.

      But don’t take any of this on faith; dig for yourself. Everyone makes mistakes, even the experts.

  12. Jon Crowcroft Says:

    We did a whole bunch of work on performance of TCP/IP over 2G and 3G access links – one of our best PhD students for a long time, Rajiv Chakravorty, did a whole sequence of papers (there’s a cache of them at
    http://pages.cs.wisc.edu/~rajiv/
    see for example the GPRSWeb one, but others there that aggregate TCP flows or split the flow are also ways to fix things) – these were all mitigating problems
    created by mis-design of the link or mis-design of resource allocation – these were mostly on 2.5G wireless services a while back, so one’s mileage will vary applying them to later technology – it’s clear that some providers did take on board what Rajiv said – for example, Orange use proxies for train access via UMTS

  13. … zu guter Letzt » F!XMBR Says:

    […] The criminal mastermind: bufferbloat! «Large network buffers can be thought of as “dark buffers”, analogous to “dark matter” in the universe; they are undetectable under many/most circumstances, and you can detect them only by indirect means. Buffers do not cause problems when they are empty. But when they fill they introduce additional latency (and create other problems, possibly very severe) to other traffic sharing the link.» […]

  14. How big are the buffers in FreeBSD drivers? | Alexander Leidinger Says:

    […] he is telling the network congestion algorithms can not do their work good, because the network buffers which are too big come into the way of their work (not reporting packet loss timely enough respectively try to […]

  15. John Nagle Says:

    “Buffer bloat” isn’t the problem. See my RFC 970, “On Packet Switches with Infinite Storage”, from 1985. The problem is FIFO queuing in routers.

    There’s nothing inherently wrong with big in-transit buffers for TCP streams. The real question is not which packets get dropped; it’s which packets get sent next. That’s what “fair queuing” and some other quality of service algorithms are about. Unfortunately, many routers are basically FIFO devices with some packet drop algorithm. If the router is FIFO, dumb, and has big buffers, there’s trouble.

    Back in the 1980s, when I was working on this, I was applying fair queuing at choke points. My basic thinking was that the network should not drop packets for congestion unless a sender is badly behaved and isn’t obeying the congestion avoidance rules. This is well-behaved, and will work well when bandwidth is a scarce resource. But for years, the Internet had more bandwidth than was needed, and so people stopped worrying about congestion. Now that everybody is trying to stream high-definition video, it’s a big problem again.

    Fair queuing, for those who came in late, is that packets within each flow (typically a source/destination IP address pair) are handled in a FIFO fashion, with a queue for each flow. The queues are serviced in a round-robin fashion. “Quality of service” adjustments may mean that some queues get serviced more often than others. With fair queuing, if an endpoint sends more than the network can deliver, that flow loses out, but others are unaffected. It doesn’t matter how much buffer space is available. An upper limit on packet lifetime on queue helps; the Time-To-Live value for queued packets needs to be decremented periodically and expired packets discarded.

    A problem used to be that the CPU overhead for fair queuing was too high. Today we can afford enough transistors in ASICs and FPGAs to do queuing right, even in fast routers. That’s already happened. The big players have already put the necessary hardware into their newer routers. Cisco supports weighted fair queuing in their current DOCSIS cable routers. So does Motorola. But it has to be set up and configured. Motorola has a very clear management level presentation (http://www.cascaderange.org/presentations/DOCSIS_1_1_QoS.pdf) on the need for fair queuing on their DOCSIS cable routers. That short piece of PowerPoint is a must read for anybody involved in managing a cable Internet system. Read the slides starting with “If RED is not good enough, what is?”. A key point for managers: “There are no parameters to set”. There are other parts of DOCSIS routers that have way too many tuning knobs. That’s not true of fair queuing.

    So, if your cable system is showing this problem, they probably have older routers, or misconfigured routers, or routers from some clueless vendor, or need a software upgrade. Cisco only supported this fully in DOCSIS routers starting in 2008. Earlier cable routers tended to be rather dumb. If you’re in the industry, pass around that Motorola PowerPoint.

    If you’re dropping packets in the backbone, that’s a separate problem. Congestion has to be forced out to the edges, where it can be handled. That’s a provisioning issue.

    This has nothing to do with “buffer bloat”. It’s a queuing order problem.

    John Nagle

    • gettys Says:

      John, I don’t know if we’ve ever crossed paths… In any case, welcome.

      I think I’m going to disagree with you here, and we should argue this out in public.

      The routers, devices, home routers and even our operating systems are set up currently to be FIFO, dumb, and have big buffers, and yes, there is trouble. The trouble is mostly, but not entirely at the edge.

      While I agree with what you have just said as far as it goes, I’ll also argue that source/destination IP address pairs are not sufficient. And yes, I can’t possibly agree more strongly with your assertion that we need solutions that require no manually set parameters. And we all agree classic RED is not a solution, including Van. Classic RED is at best a stop gap, as tuning it is so painful that the observed reality is network operators often don’t configure it in circumstances where it is needed. Van says there is no hope of classic RED possibly working for 802.11.

      I argue that self congestion is a real headache, and one we already suffer with today frequently. I started on this quest, as documented in that smokeping graph on that page, noting that once I understood how to do myself in, it was so painful I had to manually stop what I was doing in the background to get my work done; I couldn’t spend my whole afternoon waiting on the web or dealing with mail. Many others note the same phenomenon.

      So a full “solution” needs to work for not just the network, but also for the user. I don’t want my backup to the cloud, or the staging of my next movie to my disk to screw up my interactive work.

      So a minimum requirement, in my mind, for fair queuing to actually be a viable solution is that it be feasible and implemented on all flows individually, not just per source/destination address pair. Google or Netflix to my NAT just won’t cut it as a discriminator to prevent trouble. And even then, I worry about the application programmers continuing to shoot themselves in the foot. But they will always do that, I suppose.

      Jim

      • John Nagle Says:

        I don’t have time right now to discuss this further on some random blog. However, if anyone working on solving the problem would like to get in touch with me, I’m not hard to find.

  16. ghira Says:

    No, source and destination IP isn’t sufficient.

    However:

    In

    http://www.cisco.com/en/US/partner/docs/ios/qos/configuration/guide/15_1/qos_15_1_book.html

    we find:

    “Packets with the same source IP address, destination IP address, source TCP or User Datagram Protocol (UDP) port, or destination TCP or UDP port belong to the same flow. WFQ allocates an equal share of the bandwidth to each flow. Flow-based WFQ is also called fair queueing because all flows are equally weighted. ”

    Admittedly, there’s an “or” there that seems out of place. A little later
    we see:

    “For flow-based DWFQ, packets are classified by flow. Packets with the same source IP address, destination IP address, source TCP or UDP port, destination TCP or UDP port, and protocol belong to the same flow. ”

    Elsewhere in Cisco documentation it does very much seem as though
    they mean this to be taken that source/dest IP and source/dest port
    are all part of the definition of what a single flow is, so the second
    extract above seems more apropos.

    Even with NAT, this should be enough to help distinguish one
    flow from another even if you are talking to the same place
    multiple times simultaneously.

  17. ghira Says:

    Of course, this isn’t magic. You don’t actually get one queue per flow as far as I know, you hash flows into some number of pretend queues. Something like bittorrent can still destroy you by sticking stuff in all of them.

    And even if you did get one queue per flow, having 20000 queues and trying to be nice to all of them wouldn’t achieve much.

    Still, it’s _pretty_ magic and you might find it interesting to try an experiment where you just turn on fair-queueing and RED and that’s all.

  18. Phil Karn Says:

    I’m with John; the problem isn’t buffering per se, it’s FIFO buffering.

    I still do a fair amount of pinging and tracerouting, and I’ve noticed in recent years that, if anything, the delay variance along most Internet routes has decreased, probably because the backbone links have gotten so fast. The speed-of-light delay nearly always dominates the total end-to-end latency.

    That’s proof right there that “buffer bloat” is not a big problem, at least not inside the network.

    It does exist at the endpoints, and that makes it relatively easy to fix. All we have to do is to get rid of dumb FIFO queuing. Implement DSCP. Establish priority queuing, and use fair queuing within each priority level. I’ve been doing this for years on my outbound links, and it really works. I routinely keep my link saturated with Bit Torrent traffic, and it doesn’t bother my interactive traffic or even my VoIP calls at all.

    • gettys Says:

      Depends which network; certainly some major networks are “clean” and run RED or other AQM routinely. Some are not so nice. Whether you consider these part of “the network” or not depends on your vantage point. If you are in a hotel lucky enough not to already be done in by the broadband problem, you may still have problems elsewhere inside the networks servicing them (as I have observed). Does that mean those are not part of the “network”?

      Dave Reed’s observations of 3G networks are a symptom of aggregate problems on those networks, and we have severe problems in home routers and in hosts, caused by multiple layers of buffering which may or may not do classification (or do so only very poorly); the device drivers may have hundreds of packets of buffering, which may or may not have priority queues.

      This is more than just fifo buffers, these fifo buffers now resemble the infinite buffers described in John Nagle’s RFC 970; they are frequently seconds in size, often 100 times or more the RTT of the link. Take a look at the ICSI scatterplots; the horizontal lines indicate seconds of buffering.

      Classification, nice as it is, just moves the pain point, but does not signal the end points in a timely enough fashion to keep the transport protocols from filling whatever buffers are in the path. We’ve destroyed the congestion avoidance in TCP and other protocols.

      Note that we have problems of large, unmanaged buffers even in the operating systems and our home routers. For any sort of fair queuing to start to be effective, it has to be done not just on an IP address/IP address basis, but on an IP/port pair to IP/port pair basis, end to end.

  19. Brian Says:

    Just so I make sure I am understanding all of this: any link where latency increases in response to loading (i.e. bulk transfer) has bufferbloat?

    Even a link under heavy load should have a latency comparable (how comparable? i.e. what sort of deviation from unloaded should one allow in proving non-bufferbloated paths?) to the unloaded link if there is no bufferbloat before the choke point?

    Therefore, this link: http://www.dslreports.com/r3/smokeping.cgi?target=network.2dfa9c7818a539e6969abcb99211955b (see Feb. 7, 2011, 2:20am EST) is bufferbloated?

    • gettys Says:

      Latency will go up under load, but shouldn’t go up a whole lot.

      If everything is really working properly, having two streams sharing a link (one of which is running as fast as it can), should only be inserting about 1 packet time’s latency (which isn’t much).

      That smokeping doesn’t look too bad. Contrast it with what I got.

      • Brian Says:

        Yeah, I didn’t really run that download (incoming to me to be clear) for too long so maybe the effect got lost in the summary. I just did another download (since to me, the real bufferbloat issue is the incoming buffers — those that I cannot control and have to discuss with my ISP about). The results can be seen at http://www.dslreports.com/r3/smokeping.cgi?target=network.2dfa9c7818a539e6969abcb99211955b or if they roll out of the 3 or 30 hour graphs before you get this, I have them captured here:
        http://brian.interlinx.bc.ca/bufferbloat/

        The top graph is of course the overview and the middle three are the 3 hour graphs from ca1, ny, and ks, respectively.

        The last graph is the bandwidth meter on my router.

        Notice the deep sawtooth pattern. That looks indicative of TCP’s congestion avoidance, but should those deep valleys be so deep? Or does that look like overly large buffers not providing timely notification of dropped packets and fouling up congestion avoidance?

        • gettys Says:

          I don’t know what class of service you have. But yes, the revised smokepings are something like I’d expect. What kind of service do you have?

          What does netalyzr tell you about your buffering?

      • Brian Says:

        I should add, that yeah, the bufferbloat in my graphs doesn’t look anywhere near as bad as yours. I got the impression from the post though that you were saturating “upstream” bandwidth, not downstream bandwidth (transferring … from my house to … MIT). But a smokeping from DSL reports is best at measuring the bufferbloat problem downstream to you, not your upstream buffers on the way to your ISP, yes?

  20. Network buffers are too large « IntelliThoughts Says:

    […] was reading a blog post on bufferbloat a while ago. Pretty much every device your packets go through has a buffer, and with RAM being […]

  21. Brian Says:

    [ I hope this threads properly. I had to alter the URL since there was no “Reply” comment after your most recent reply to me ]

    In terms of the service I have, it’s cable, 14Mb/s down, 1Mb/s up.

    netalyzr says:

    Network buffer measurements (?): Uplink 1000 ms, Downlink 150 ms

    We estimate your uplink as having 1000 msec of buffering. This is quite high, and you may experience substantial disruption to your network performance when performing interactive tasks such as web-surfing while simultaneously conducting large uploads. With such a buffer, real-time applications such as games or audio chat can work quite poorly when conducting large uploads at the same time.
    We estimate your downlink as having 150 msec of buffering. This level may serve well for maximizing speed while minimizing the impact of large transfers on other traffic.

    • Brian Says:

      I find netalyzr’s measurement strange though. It’s exactly the inverse of what I would expect if I interpret the terms “Uplink” and “Downlink” as from my perspective. Maybe that’s a misinterpretation. If so, I think they need to rethink how the user is going to read their chosen nomenclature.

      • Dave Täht Says:

        you have a 14×1 ratio of download to upload?? That is very close to the theoretical minimum ratio for TCP to function *at all*… (for ipv6, it’s actually over the limit)

        in response to your earlier posts, your bloat problem will show up much worse while doing an upload than a download. It’s bad either way.

        In looking at your smoke pings, the sort of spikes you are seeing do appear to be normal TCP additive increase/multiplicative decrease, but without being able to see the period of the swings I can’t tell whether the actual packet loss event that causes them is due to bufferbloat… what OS/device/gateways are you using?

        • Brian Says:

          Yeah, a 14:1 ratio of download:upload.

          Is the “theoretical minimum ratio” for TCP to function published anywhere? That way I can show my ISP and lobby them to increase the upload speed to meet the minimum.

          I am aware that the bloat problem would show up worse with uploads — if I were not already mitigating it. 🙂 It’s the downstream, where I have no mitigation other than pointing my ISP at it and saying “see — you need to manage the queues better”.

          I am using Linux all around — as endpoints as well as the router (OpenWRT).

        • gettys Says:

          Bandwidth shaping can help downstream as well; but if the broadband head ends are congested and not running AQM (e.g. RED), you can still have problems there.

    • gettys Says:

      Uplink is from you to the rest of the world. It tends to be worse than the downlink: if its buffering is of similar size, then due to the asymmetry you end up with more delay. In any case, your link is bloated; time to go mitigate…

  22. Dave Täht Says:

    ratio: MTU/ACK size

    For IPv4 that’s usually something like 1500/64, but it depends on the MTU, and the ACK size can be larger in the presence of SACK. So about 23 is the theoretical best in this case.

    IPv6 ACKs are of course larger, and encapsulated ipv6 over ipv4, much larger.

  23. Linus Lüssing Says:

    Hi gettys!

    I’ve been fighting with high latencies within my LAN during bandwidth saturation, too. And it definitely looks like the bufferbloat phenomenon you’ve been describing in this blog and at the last Wireless Battlemesh in Spain. High, and highly varying latencies. I’ve been playing a lot with queue and buffer sizes to somehow get rid of the annoying 1 to 2 second latencies, but of course, that did not really help. However the solution for me, a common DSL user, turned out to be really simple: I changed the TCP congestion control algorithm from TCP Cubic (was the default in Debian) to TCP Vegas:

    “Congestion avoiding protocols (e.g. TCP, which is our commonly accepted touch-stone of congestion avoiding protocols, by which others are judged) rely on a low level of timely packet drop to detect congestion (or, in modern TCP stacks, packets marked by ECN bits; for various reasons, ECN is not typically used today).”

    This is actually not true in general; TCP Vegas is an exception here: it does not rely on timely dropped packets. TCP Vegas also reacts to an increase in the RTT and therefore reacts a lot quicker than other TCP congestion control algorithms in the case of (sufficiently) large buffers.

    I’d be curious which TCP congestion control algorithm you’ve been using during your tests and whether switching to TCP Vegas would have a positive impact in your scenarios.

    Cheers, Linus

    • gettys Says:

      It’s Cubic on both ends of what I tested (Linux has defaulted to Cubic for some time). And yes, there are TCP CC algorithms that may be delay sensitive rather than loss sensitive. I don’t think any currently widespread OS uses them by default in production, however, and few try to play such games, even where it is possible.

      One of the “interesting” issues with suggestions to “solve the problem” by switching algorithms is whether doing so will hurt your performance when multiple flows (from other users) are on the same link. And you don’t get to tell them to change what they are doing easily…

      I’ve tried not to overdose on this topic, particularly since I’m not an expert at the large varieties of TCP versions; I’ll leave that to folks like NIST to sort through…

      I’m glad that it helped you personally, though.

  24. Linux kernel 3.3 released | PSOTUX Says:

    […] began with an alert raised in December 2010 by Jim Gettys. For those who may not know this programmer, he […]
