Mitigations and Solutions of Bufferbloat in Home Routers and Operating Systems

As discussed several days ago, we can mitigate (but not solve) broadband bufferbloat to a decent, if not ideal, degree by using the bandwidth shaping facilities found in many recent home routers. Unfortunately, life is more complicated: home routers themselves are often at fault (if you find a recently designed home router that works right, it deserves to be enshrined in a museum where its DNA and evolution can be analyzed, and its implementors both admired for their accomplishment and despised for not telling us what they discovered). Complete, robust solutions will unfortunately be difficult in the short term (wireless makes it an “interesting” problem) for reasons I’ll get to in this and future posts.

Confounding the situation further, your computer’s, smartphone’s, netbook’s, or tablet’s operating system may also be suffering from bufferbloat, and its severity may (indeed, almost certainly does) depend upon the hardware. Your mileage will vary.

You may or may not have enough access to the devices to even manipulate the bufferbloat parameters. Locked down systems come back to bite you. But again, you can probably make the situation much better for you personally, if you at a minimum understand what is causing your pain, and are willing to experiment.


As an end user, you may suffer bufferbloat in your home router or your computer any time the bandwidth (“goodput”) you get over a wireless hop is less than the provisioned (and actually delivered) broadband bandwidth.  This is why I immediately saw problems on the Verizon FIOS wireless routers (the traces show problems on both the wired and wireless sides, but the wireless side is much worse). On my in-laws’ typically symmetric FIOS 20/20 service, 802.11g usually runs more slowly than the broadband connection.  I also see bufferbloat regularly at home on my router using my Comcast service, which I recently changed to 50/10 service; there are parts of my house where it is now easy to get insufficient bandwidth over wireless. Bufferbloat is nothing if not elusive.  Hunting it on wireless was like chasing a will-o’-the-wisp, until I had a firm mental grasp on what was happening.

Remember: you see bufferbloat only on the buffers adjacent to the bottleneck in the path you are using.  Buffers elsewhere in the path you are probing remain invisible, unless and until they become the bottleneck hop.

Mitigating the inbound home router wireless bufferbloat problem

Whenever the bandwidth from your ISP exceeds your wireless “goodput”, you’ll likely see bufferbloat in your home router (since the bottleneck is the wireless hop between you and the router). Full solutions to the problem are beyond the scope of today’s posting (coming soon), and will require some research, though ways forward exist.  In short, complete solutions will require active queue management (e.g. RED or similar algorithms), since the mitigation strategy of bandwidth shaping to “hide” the buffer, which we showed in a previous post, will not lend itself well to the highly variable bandwidth and goodput of wireless. Outbound, the bloat will very likely occur in your operating system (since your router is generally connected to the broadband gear either internally, as in the FIOS router I experimented with, or via a 100Mbps or 1Gbps ethernet). You’ll most likely experience this, as one of the replies to this posting points out, when uploading large files, doing backups, or similar operations. I only realized OS bufferbloat occurs after I started investigating home routers and did not get the results I expected immediately. With some disbelief, I got confirmation with the simple experiments I reported on.

If you can “log in” to a shell prompt on your wireless router (many are running Linux, and a few have known ways to break into them), or are willing to install open source firmware on your router, you can go further, by mitigating the excessive buffering in the ways explained below for Linux.  Remember that this only affects the downstream direction (home router to your laptop). Note that the only one of these open source projects I have found that has close to turnkey classification and mitigation of broadband bufferbloat is Gargoyle. Paul Bixel has worked hard on mitigating broadband bufferbloat, but has not attempted wireless bufferbloat mitigation. As noted in a previous post, many mid to high end home routers have enough capability to mitigate broadband bufferbloat; for those running Linux-based firmware, a sketch of that shaping mitigation follows.
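On the broadband (wired) side, the shaping mitigation from the earlier post amounts to something like the following on a Linux-based router. This is a minimal sketch, not a turnkey recipe: the WAN interface name (eth1) and the rates (an uplink provisioned at roughly 1Mbps, shaped to about 85% of it) are assumptions you must replace with your own numbers.

# Shape outbound traffic to just below the provisioned uplink rate so the
# queue forms here, where we can keep it short and fair, rather than in the
# broadband gear's bloated buffers.  Interface name and rates are examples.
tc qdisc add dev eth1 root handle 1: htb default 10
tc class add dev eth1 parent 1: classid 1:10 htb rate 850kbit ceil 850kbit
tc qdisc add dev eth1 parent 1:10 handle 10: sfq perturb 10

As noted above, this does nothing for the wireless hop, whose goodput moves around far too much for any fixed shaping rate.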

OS Bufferbloat Mitigation

As explained in the fun with your switch, fun with wireless, and criminal mastermind postings, and as future postings will show, we have bad behavior all over the Internet, though I have focused on the home environment in most of the postings so far. In every common operating system there is at least one place, if not two (and maybe more undiscovered), where bloat has been demonstrated. Please go find and fix them. All OS’s therefore suffer to some extent or another.

Your most immediate mitigation may be to literally move either your laptop or your home router to where the bandwidth equation is different, shifting bufferbloat to a (possibly) less painful point.  But there are also some quick mitigations you can perform on your laptop and, as others have demonstrated in replies to previous postings, some that are more general.  The first order mitigation is to set the buffering in your operating system to something reasonable, as explained below (details of the Linux commands can be found in “fun with your switch” and “fun with wireless”).

Linux

I’ll discuss Linux first, as in my testing, it has problems that may affect you even if you don’t run Linux, as Linux is often used in home routers. But then again, as I use Linux for everything, there may be more buffers on other operating systems that I have not run into; my testing on Mac and Windows has been very small relative to Linux.  We all live in a glass house; don’t go throwing stones.  Be polite. Demonstrate real problems. But be insistent, for the health of the internet.

Note that it is the total amount of buffering that gives TCP and other congestion-avoiding protocols indigestion: in Linux’s case, it is both the device driver rings (which I believe I see in other operating systems as well) and the “transmit queue” buffering. I gather some of the BSD systems may have unlimited device driver buffering.  Some hardware may also be doing further buffering below the register level in smart devices (I suspect the Marvell wireless device we used on OLPC might, for example).

As discussed in fun with your switch, I detected two different sources of excessive buffering in Linux, both typically set by device drivers (and therefore shared in common with other operating systems). Device drivers hint to the operating system a “transmit queue length”, which is controllable on Linux via the “txqueuelen” parameter, settable using the “ifconfig” command.  By default, many/most modern ethernet and wireless NICs are telling Linux to be willing to buffer up to 1000 packets. In my experiments on (most of) my hardware, since the ethernet and wireless rings are both at a minimum quite large, I could set txqueuelen to zero without causing any immediate problems.
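For example, something like the following (a sketch only: the interface name wlan0 and the value 16 are assumptions for you to experiment with, not recommendations; run as root and re-test latency under load after each change):

# look at the current transmit queue length
ifconfig wlan0
# try a much smaller queue
ifconfig wlan0 txqueuelen 16
# the iproute2 equivalent, for systems without ifconfig
ip link set dev wlan0 txqueuelen 16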

But note that if you set the buffering to zero in both the device driver rings and the transmit queue, and there is no other buffering you don’t happen to know about, your system will just stop transmitting entirely; so some care is in order.  This depends on the exact details of the hardware. Buffering is necessary; just not the huge amounts currently common, particularly at these speeds and low latencies.

Also note that many device drivers (e.g. the Intel IWL wireless driver) do not support the controls to set the ring buffer sizes, and on at least one device I played with, setting them seemed to have no effect whatsoever (implying buffering is present, but with no control over the size of those buffers).
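Where the driver does cooperate, ethtool will show and set the ring sizes. Again a sketch: the interface name eth0 and the value 64 are assumptions, and many wireless drivers will simply refuse these requests.

# query the NIC's ring sizes; an error here means the driver offers no control
ethtool -g eth0
# if supported, try a much smaller transmit ring
ethtool -G eth0 tx 64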

A possible reason for the transmit queue (others with first-hand knowledge of the history, please chime in) is that some old hardware, e.g. old serial devices being used with modems, had essentially no buffering, and you might experience excessive packet loss on those devices.  It may have also been really necessary for performance before Linux’s socket buffer management became more sophisticated and started adjusting its socket buffer sizes based on the observed RTT (note that the lower level bufferbloat may be inducing socket bufferbloat and application latency as well, though I have no data to confirm this hypothesis).  At some point, the default value for txqueuelen was raised to 1000; I don’t know the history or discussion that may have taken place. There are also queues in the operating system required for traffic classification; I haven’t had time to figure out whether that is where Linux implements its classification algorithms or not; some hardware also supports multiple queues for that purpose. Note this means that many Linux based devices and home routers may have inherited differing settings. Extreme bufferbloat is present on a number of the common commercial home routers I have played with using modern hardware, and on the open source routers I’ve played with as well.

So even though the “right” solution is proper queue management, you can tune the txqueuelen and (possibly) the NIC device driver rings to more reasonable sizes, rather than the current defaults, which are typically set for server class systems on recent hardware.

Once tuned, Linux’s latency (and the router’s latency) can be really nice even under high load (even if I’ve not tried hard to get to the theoretical minimums). But un-tuned, I can get many seconds of latency out of both Linux home routers and my laptop, just by heading to some part of my house where my wireless signal strength is low (I have several chimneys that make this trivial).  By walking around or obstructing your wireless router, you should easily be able to reproduce bufferbloat in either your router or in your laptop, depending on which direction you saturate.
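If you want to see it for yourself, the experiment is simple (a rough sketch; the router address and file name are placeholders for your own):

# in one terminal, watch latency to your router
ping 192.168.1.1
# in another, saturate one direction of the wireless hop with a bulk transfer
scp big-file.iso somehost:/tmp/
# then move somewhere with weak signal and watch the ping times climb

Which box the queue builds in depends on which direction you saturate.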

With an open source router on appropriate hardware and a client running Linux, you can make bufferbloat very much lower  in your home environment, even when bufferbloat would otherwise cause your network to become unusable. Nathaniel Smith in a reply to “Fun with Wireless” shows what can be done when you both set the txqueuelen and  change the driver (in his case, a one line patch!)

Mac OSX

I’ve experimented on relatively recent Apple hardware: on ethernet it showed what appears to be device driver ring bufferbloat, roughly comparable to Linux.  In my simple test on ethernet through a 100Mbps switch, I observed roughly 11ms of latency, slightly more than the 8ms Linux showed on similar vintage hardware in the comparable test. On Linux, the transmit ring is set to 256 by default, and the driver allowed me to set it as small as 64.  So I hypothesize a similar size buffer in the Mac’s ethernet driver (and possibly a small buffer in the OS above the driver). As I’m not a Mac expert, I can’t tell you, as I could on Linux, how to reduce the transmit ring size.

I have not tried to pry my son’s Mac out of his hands for Mac wireless experiments: perhaps you would like to do so with your Mac, or I may get around to the wireless experiment over the holidays. If you do, make sure you arrange the bottleneck to be in the right place (the lowest bandwidth bottleneck needs to be between your laptop and your test system).

Microsoft Windows

Experimenting with Microsoft Windows several weeks ago was a really interesting experience.  Plugged into a 100Mbps switch, there was no bufferbloat in the operating system (either Windows XP or Windows 7) on recent hardware.  But neither version of Windows would saturate the 100Mbps link (you expect to see about 93Mbps on that hardware, after TCP and IP header overhead).  As soon as we set the NIC to run at 10Mbps, the expected bufferbloat behavior occurred.  When Windows does not saturate the medium, the medium is no longer the bottleneck; the bottleneck shifts to somewhere else in the path (and in my 100Mbps test setup, there was no other bottleneck for a queue to build behind).

Here’s what I believe is going on.

With some googling, I discovered on Microsoft’s web site that Microsoft has bandwidth shaped their TCP implementation to not run at full speed by default, but probably just below what a 100Mbps network can carry (I observed mid-80s of megabits).  You have to go tune registry parameters to get full performance out of Microsoft Windows’ TCP implementation. There is an explanation on their web site that this was to ensure that multimedia applications not destroy the interactive performance of the system.  I think there is a grain (or block) of truth to this explanation: as soon as you insert big buffers into the network, you’ll start to see bad latency whether using TCP, UDP or other protocols, and one of the first places you’ll notice is the UI interaction between users of a media player and the media server (I’m an old UI guy; trust me when I say that you start “feeling” latency at even 20ms).  Any time they ran Windows on hardware with big buffers, they had problems; certainly hardware has supported much larger transmit buffers than makes any sense for most users’ office or home environments for quite a few years. I suspect Microsoft observed the bufferbloat problem and, since a simple mitigation strategy was available to them, took it.

Microsoft does not have control of many/most of the drivers their customers expect Windows to run well on (unlike Mac and Linux), however.  So I suspect that Microsoft (and some of their customers) have a real headache on their hands, only soluble by updates to a large number of drivers by many vendors.

On the other hand, on 100Mbps ethernet, still the most common ethernet speed, both Windows XP and Windows 7 “just worked” as you might hope, with low latency (of order 1ms even while loaded).  And Windows XP is less likely to induce bloated buffers in broadband, though as bittorrent showed, it still can, and as I’ll explain in detail shortly, recent changes in both web browsers and certain web servers can encourage XP to fill buffers. I have not experimented with wireless.  Please do so and report back.

Alternate explanations and/or confirmations of this hypothesis are welcome.

I do not happen to know the mechanisms, if any, to control driver buffering size on Microsoft Windows, though it may be present in driver dialog boxes somewhere.

Why in the world does the hardware now have so much buffering, anyway?

On my Intel ethernet NIC, the Linux driver’s ring buffer size is 256 by default, but the hardware goes up to 4096 entries.   That’s amazingly huge.  I’ve seen similar sizes on other vendors’ NICs as well. I wondered why.  I like the explanation that Ted T’so gave me when I talked to him about bufferbloat a month ago: it stems from experience he had while working for the Linux Foundation on real time. I think Ted is likely right.

It can’t be for interrupt mitigation; most of your benefit is in the first few packets; similarly for segmentation and reassembly.  Even doing a little transmit buffering can get you into a lot of trouble on wireless, as I’ll show in a future post. I suppose that interrupt latency could also be a problem on loaded systems, though this seems extreme.

Ted’s theory is that this is a result of the x86 processor’s SMM mode. To quote Wikipedia: “System Management Mode (SMM) is an operating mode in which all normal execution (including the operating system) is suspended, and special separate software (usually firmware or a hardware-assisted debugger) is executed in high-privilege mode.”  Ted noted there are motherboards/systems out there which go catatonic for on the order of one or a few milliseconds at a time; yes, your motherboard with N processor chips of C cores each may crowbar down to a single thread on a single processor for that length of time.  The BIOS is ensuring your CPU cores don’t overheat, and handling other important (but not necessarily urgent) tasks (you might think there should be a way to do this at lower priority for things less time urgent, mightn’t you?). To paper over latencies and hiccups of that length at 1 gigabit, you indeed need hundreds, or conceivably a small number of thousands, of ring entries. And that’s the size we see in current hardware.
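A quick back-of-envelope check (my arithmetic, not Ted’s): at 1 gigabit/second a 1500-byte packet takes 12 microseconds to transmit, so roughly 83 full-size packets go out per millisecond. Papering over a stall of a few milliseconds therefore takes a few hundred descriptors, and minimum-size packets (about half a microsecond each) push that into the thousands, which is just about the 4096 entries the hardware offers.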

Unless someone has a better theory, I like Ted’s.

The General Operating System Problem

We now have commodity “smart” network devices that may perform lots of functions for us, to make the network “go fast” (forgetting that for many people, operations per second and latency trump bits per second and throughput hands down; performance has multiple metrics of import, not just one).  For example, the devices may compute the TCP checksums, segment the data, and so on; and similarly on the receive side of the stack.  To go fast, we may also want (and need) to mitigate interrupts, so the OS doesn’t necessarily get involved with every packet transfer in each direction, on server systems (but often not on edge systems at all). And, as opposed to a decade ago, we now have widespread deployment of networking technologies that span one or more orders of magnitude of performance, while still only admitting to a “one size fits all” tuning.

Here’s the rub: these same smart device designs are often/usually put into commodity hardware a generation or two later, and the same device drivers are used, set up as they were for high end servers.  But the operating environment that hardware is now in is your laptop, your handheld device or your router, running at low bandwidth, rather than a big piece of iron in a data center hooked up to a network running at maximum speed.  It is being used in devices running at a small fraction of their theoretical performance capability.  For example, my gigabit ethernet NIC is much more often than not plugged into a 100 megabit switch, with the results I noted. And, of course, I’m seldom going anything like the speed of a server on my laptop: at most, I might be copying files to a disk someplace, and going of order 100Mbps.

Even more of a problem is wireless: not only is the bit rate of the network at best 100Mbps (for 802.11n) or 20Mbps (for 802.11g), but the bit rate may drop as low as 1 megabit/second.  Remember also that those networks are shared media.  If you have a loaded wireless network, the buffering of the other nodes also comes into play; you may only get 1/10 (or less) of the available bandwidth at whatever rate that wireless network is operating at (and 802.11 likes to drop its speed to maximize distance at the drop of a hat).  I’ll discuss what happened to OLPC in a future post, though we also had other problems in our mesh network. So the effective “goodput” on wireless may easily vary by factors of 100 or more, presenting even more of a challenge than ethernet, where typically we face a switched network and a factor of 10 in its performance.
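To put numbers on what a fixed-size queue means across that range (my arithmetic): 1000 packets of 1500 bytes is 12 megabits of backlog. At 100Mbps that drains in roughly 120 milliseconds; at 20Mbps it is 600 milliseconds; at 1 megabit/second it is 12 seconds of potential queueing delay, before you even count the driver rings or the buffers in the other nodes sharing the channel.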

In general, I believe that hardware transmit buffer sizes should be kept as small as possible.  “As possible” will depend strongly upon the network media and circumstances. One of the mistakes here, I suspect, is that the operating system driver implementers, not understanding that transmit and receive are actually quite different situations, set the transmit and receive buffering to the same amount. After all, I’m never going to lose a packet I haven’t transmitted yet; it’s only on receive that I could have a problem.  And as I showed previously, some packet drop (or use of ECN) is necessary when congested for the proper functioning of Internet protocols, and indeed, for the health of the Internet overall. And this is indeed a form of congestion.  Ideally, we should always mark packets with ECN whenever/wherever congestion occurs, no matter where the excessive queues are forming.

And since network delays range from almost zero to several hundred milliseconds (for planetary paths), the delay/bandwidth product also varies enormously, as does the workload of the systems. There is no single right answer possible for buffering: our operating systems need to become much more intelligent about handling buffering in general.

I certainly do not pretend to have a clue as to the right way to solve this buffer management problem in multiple operating systems; but it seems like a tractable problem. That will be fun for the OS and networking subsystem implementers to figure out, and help keep them employed.

The general challenge for operating systems is we want a system which both can run like a bandit in the data center, and also work well in the edge devices. I believe it possible for us to “have it both ways”, and to “have our cake and eat it too”.  But it will take work and research to get there. In the short term, we can tune for different situations to mitigate the problem.

Coming Installments

Conclusions

Since any number you pick for buffering is guaranteed to be wrong for many use cases we care about, the general solution will await operating systems implementers revisiting buffering strategies to deal with the realities of the huge dynamic range of today’s networks, but we can mitigate the problem (almost) immediately by tuning without waiting for nirvana to arrive.

36 Responses to “Mitigations and Solutions of Bufferbloat in Home Routers and Operating Systems”

  1. Justin Smith Says:

    This blog series has some important stuff to say. But why does the author ramble on and on and on and on? PLEASE, state your objective, describe the context, then make your point! 90% of this article is a bird’s nest of unreadable and completely unnecessary commentary.

  2. Dan Says:

    Regarding the following:

    ==============
    On my Intel Ethernet NIC, the Linux driver’s ring buffer size is 256 by default: but the hardware goes up to 4096 in size. That’s amazingly huge. I’ve seen similar sizes on other vendor’s NIC’s as well. I wondered why. I like the explanation that Ted T’so gave me when I talked to him about bufferbloat a month ago: it stems from experience he has when he was working for the Linux Foundation on real time. I think Ted is likely right.

    It can’t be for interrupt mitigation;
    =============

    In fact it very much is for interrupt mitigation. It has nothing at all to do with SMM. At gigabit speeds you are looking at between 85k (1500 bytes) to 2M (60 byte) packets per second. You _really_ do not want anything on your machine to be interrupting at 2 million times a second (nor 85k for that matter). Nothing else would get done! In order to keep the pipe full you need enough buffering on the transmit side to cover the interrupt latency. Even if you were foolish enough to allow your device to interrupt for every transmitted packet, interrupt latencies on the order of 100s of microseconds are not at all unheard of, especially if your ISR has to grab a lock that might be held by another thread.

    The problem is not so much with the amount of buffering in the driver. For UDP and SCTP it works great! The problem is more with TCP and its byte-based congestion algorithm that just doesn’t scale well to high latency, high bandwidth links.

    thanks for the great post! Looking forward to the next one!
    dan

    • gettys Says:

      Ah ok. You’re right about the small packet case.

      But SMM mode latencies are generally much higher than interrupt latencies. I don’t know how often they occur.

      • Dan Says:

        For sure, but there is not going to be anywhere near enough buffering to handle some jacked up machine which goes dead for 1ms at a time.. Driver writers certainly would not optimize for such a thing.. They would optimize for transmit throughput however. Also you can test this pretty easily by just changing the linux e1000 driver to use a transmit descriptor queue size of 1 (or 2 or 10 or whatever, something small). I guarantee that you’ll see throughput suffer.

      • gettys Says:

        And even at 64 entries (the minimum I can set in the driver right now; that’s what’s in my laptop), I get a lot more latency than I should when operating at 100Mbps.

        There is no single right answer here. And driver writers are not testing latency under load, which shows the problem.

  3. jack Says:

    >using the “ifconfig” command.

    You should have really known better that iproute2 is how things are done in this century.

    • gettys Says:

      I’m a dinosaur from the 1980’s; old neural habits die hard…

    • Neil Cherry Says:

      > Jack says:
      >> using the “ifconfig” command.

      > You should have really known better that iproute2 is how things are
      > done in this century.

      Hmm, I don’t know Jack (couldn’t resist) but I think you should have said ip as iproute2 is the package. The command would be:

      ip link set wlan0 txqueuelen 0

      And not all Linux based routers have iproute2 installed (i.e. no ip command). The ifconfig command is more than likely installed. Still it’s good to know.

      • gettys Says:

        Again, you can’t set txqueuelen to zero if you want to classify traffic (as I understand it), and your network device may have tremendous buffering in addition to the transmit queue. Ethernet NICs often support ethtool for controlling that; unfortunately, the wireless devices I’ve tried don’t support ethtool’s ring controls.

      • briareus Says:

        The parameter that ifconfig reports as “txqueuelen” is shown and set by ip link as “qlen”, according to the test I just performed. I did not test whether calling the parameter “txqueuelen” worked also.

        Also, AFAICT route, ifconfig et al. were not broken; why was it necessary to “replace” them with the ip command?

  4. Mark D. Says:

    The device transmit queue is probably not the right place to solve this problem, because the problem is route specific. What is needed is a route / destination attribute containing the upper bound of the bandwidth available on that route.

    Then layer 3 (IP) traffic classifiers should automatically be installed and updated to restrict transmitted traffic on that route to the specified bound. By so doing, excessive device level transmit buffering both on the client and the local bottleneck routers can be eliminated, and it won’t matter what value they are set to, something which is particularly important if they are on external devices that you can’t configure at all.

    Of course the trick is to make this self configuring and self discovering as much as possible. If you have one WAN gateway / default route the maximum sustained bandwidth available on that route can either be specified by the user or estimated using ~3/4 of the maximum congestion window ever experienced on a connection on that route, and of course capped by the device bandwidth of the local device that each route / destination is forwarded through.

    A similar process could be used to maintain a destination by destination bandwidth bound for each device on the local subnet, in particular wireless devices that have a bound that goes up and down depending on signal quality and the like.

  5. Jason Lunz Says:

    Jim, thanks for this series! I’ve greatly enjoyed it.

    I did a little digging into the history of the transmit queue length on linux. I think you will find this thread to be entertaining:

    http://thread.gmane.org/gmane.linux.network/6366

    The thread itself began in late 2003 in a response to an early gigabit ethernet driver developer increasing the number of hardware tx descriptors for that driver to 1024. It quickly becomes a discussion of whether to raise the generic tx queue length for all ethernet devices.

    Here’s where it’s suggested to raise it across the board:
    http://thread.gmane.org/gmane.linux.network/6366/focus=6532

    Here’s where the decision is made to change it:
    http://thread.gmane.org/gmane.linux.network/6366/focus=6599

    And here we have the first victim returning half a year later to point out the effect on latency:
    http://thread.gmane.org/gmane.linux.network/6366/focus=11785

    There’s plenty of other good material in there as well. After the change goes into the (then under development) linux 2.6 tree, we see the change included in the release announcement for the 2.4.23 kernel as well:

    http://lwn.net/Articles/51830/

    Finally, to clarify – the txqueuelen you see with ifconfig/ip does control the tx-side qos packet schedulers. It’s their responsibility to do the job with the queue length allotted.

    • gettys Says:

      Thank you very much for the archaeology; it’s interesting that the cause is what I surmised.

      And it makes sense we’d want a short queue for QOS schedulers; but those queues do need to be managed, or they will grow too long.

  6. Adam Williamson Says:

    This stuff is interesting but also pretty long and hard to follow. =) so, the tl;dr case for a network with a router running dd-wrt and clients running Linux with drivers that respect the txqueuelen parameter is just to set it to something low (10? 50? 100? what?) on both the router and the clients? What about a router running dd-wrt and a network with mixed Linux / Windows 7 clients?

    • gettys Says:

      Certainly on Linux reducing the txqueuelen on both the router and the clients may help. If you are able, you may also reduce the wireless ring size as well; that is more likely to require hacking code, since the drivers may not support ethtool’s ring control options.

      In the downstream direction, modifying the parameters in the router will help all clients, no matter what operating system. Modifying the buffering parameters on the clients will help upstream performance.

  7. Laurence Says:

    Wouldn’t the problem disappear if you were to drop packets in the queues after 10ms or so?

  8. George Says:

    While this might be great for an end user sitting at home with what amounts to oversubscription of the WAN, I think a server in a data center might be under completely different circumstances so tuning recommended for one might not provide the same results to another.

    For example: a US 16 core server with a 2 gig uplink (bonded GigE) to a switch with 20 gig of uplink (bonded 10GigE) with multiple 10G paths to the Internet. Very little congestion. But the server has thousands of active connections at any given point in time, many are “talking” to other servers (not end users) in Europe from the West coast (LFN path).

    Increasing that buffer from the default 1000 might make sense. What might improve performance for one application profile might not improve performance for another. A server that is not sending traffic to a congested end user or a server that is sending traffic over a fast fat pipe to another device such as a load balancer that has a separate TCP/IP connection to an end user might be able to make use of that buffer space.

    A process running on one CPU might put a packet or two in that buffer and then is preempted by another process that shoves another couple of packets in to a different destination, etc. On a really busy SMP server with a lot of connections in progress, it is going to be a bit harder for one flow to monopolize the uplink.

    • gettys Says:

      No; not without serious performance consequences. “There is no single right answer”, as always. For example, browsers now put bursts of packets into the system; if you bottleneck at a slow wireless link, to give an example, these bursts are much more than 10ms, and vary.

      AQM is the only possible “right answer”, and it must adjust to the recent “goodput” of the hop.

  9. Daniel Colascione Says:

    Are hardware buffers strictly FIFO? I recall an ingenious solution used in PulseAudio for effectively the same problem. The sound card has a ring buffer for PCM data, and the operating system has to fill it every so often for sound to play uninterrupted. The larger that buffer, the less skipping the user hears. However, large buffers also increase latency quite a bit, and users want to hear sounds right away. If a sound needs to be played, PulseAudio will overwrite the portion of the buffer closest to the hardware’s current read pointer, essentially cutting in line. Using this approach, we have all the benefits of deep buffering with only a slight latency penalty.

    Could the same principle be applied to network cards? They have similar in-order DMA buffers. Using QoS, the operating system could put latency-sensitive packets ahead of others in the queue. Some network hardware actually has built-in support for multiple queues which amounts to much the same thing, and support for them has been in Linux for some time.

    • gettys Says:

      A first issue is many systems (e.g. OS’s and home routers) are not enabling/configuring any queue management at all, even though you can’t solve the congestion problem without doing so. You have to signal congestion to the endpoints, some way, or some how, when it occurs, or the buffers of whatever size will grow without limit, as the end points can end up filling them with time. Again, QOS does not address this problem; QOS mediates among different classes of traffic, but does not by itself manage the amount of buffering overall.

      Deep buffering is never a feature in networking the way it can be in audio (certainly our forgotten Audiofile audio server pioneered the same technique now used in PulseAudio); ultimately you can’t go faster than the bottleneck of each path, and all that buffering beyond what is required to keep the bottleneck links busy does is add latency (and cause trouble with congestion avoidance as a result).

      As far as the network hardware goes, I’ve never groveled through the specs of what modern network hardware does; but I’d pretty much expect the CPU can change what’s in the ring buffers until the packets are sent; but see the previous paragraph.

      The fundamental issue is that we have to be much more careful/smart with our network system buffer management than we currently are.

    • Wolfgang Beck Says:

      In routers, this ‘cutting in line’ / prioritization is done with a number of queues that are served at different frequencies. For example, voice packets go to queue 1, TCP packets into queue 2. One scheduling strategy would be: ‘process packets from queue 2 only after all packets of queue 1 have been processed’ (there are fairer strategies than this one). If more packets arrive than can be sent, voice packets will experience less delay than TCP packets.

      However, if your system has other queues in the path, the prioritization is lost. In Linux this happens when the routing and classification (‘which queue do I have to put this packet in?’) takes too long and packets queue up at the receiving side.

  10. Alex Elsayed Says:

    I’ve found a preprint that, building on some of the author’s previous work, may eventually be [part of] a full-on solution to bufferbloat by enabling flow-splitting without interfering with non-flow-control aspects of the stream. This is possible by splitting the transport layer into four sublayers, of which flow-control is the second from the bottom. The site is http://dedis.cs.yale.edu/2009/tng/ – the paper’s title is “Flow Splitting with Fate Sharing in a Next-Generation Transport Services Architecture”, and it is also available on arXiv at http://arxiv.org/pdf/0912.0921v1


  13. Matthew W. S. Bell Says:

    I’m not entirely convinced I’ve understood the problem here, but when I was fiddling with wireless device drivers, I expended quite some effort in getting hardware transmit timestamps back into the queued packets. Shouldn’t these be used for latency calculations in higher layers?

    • gettys Says:

      Yes, they should be used much more than they are currently at higher levels. For example, Van Jacobson was able to tell me definitively my traces were bufferbloat just from the traces, since both ends were Linux and the timestamp option was on.

  14. Marc Horowitz Says:

    Jim,

    It turns out that this problem has been identified in the past. I worked on a now defunct product which ran into these same issues. The solution was to use TCP Vegas congestion algorithm (see “TCP Vegas: End to End Congestion Avoidance on a Global Internet”, Brakmo and Peterson, 1995), which almost completely mitigates the problem. The best reference I can find connecting the dots between the problem of buffer overprovisioning and the solution of TCP Vegas is http://www.csm.ornl.gov/~dunigan/netperf/atou.html (Tom Dunigan, 2004). He observes “the following plots show how standard Vegas… is able to avoid queueing and reduce latency when the buffers are over provisioned at sender and receiver.”

    Because having users install a new TCP congestion algorithm is a non-starter (and for other reasons), we (primarily Nick Martin) ended up layering the userspace lwip package with a homespun vegas congestion implementation over UDP. This worked as well as we could have hoped, but of course if a Reno flow was operating over the same modem, we would still end up with performance issues.

    This does indicate one possible solution for savvy home networking users. My home router on which I’ve installed dd-wrt uses Vegas congestion control by default. Since most home networks use NAT, the long-haul TCP flow will use Vegas, which should allow for good performance, at least for flows where the bulk of the data is in one direction, from the router to the internet, I believe. From observation, large uploads on typical networks seem to increase latency more than downloads, so this helps with that, but not with any download bufferbloat-related performance problems. The good news, though, is that for downloads we don’t need to fix the whole Internet; a handful of CDNs starting to use Vegas would make a huge impact.

    Ubuntu linux users can enable Vegas congestion control using the instructions at http://ubuntuforums.org/showthread.php?t=1104107.
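    On a typical Linux kernel the linked instructions amount to something like this (a sketch; the module and sysctl names assume a stock kernel with Vegas built as a module):

    # load the Vegas module and make it the default congestion control
    modprobe tcp_vegas
    sysctl -w net.ipv4.tcp_congestion_control=vegas

    Add the sysctl to /etc/sysctl.conf if you want it to survive a reboot.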

    • gettys Says:

      Thanks very much. Your post is very interesting; I’ll have to look into Vegas and your post and pointers….

      But will it help the general case of others doing you in? Like your kids filling the buffers on other machines while you are doing your work/play? Remember, we have a game theory problem here; even if we “fix” bufferbloat for ourselves, if you end up in an unfair position, you’ll be less than happy. I don’t see how any single individual’s action can solve the general problem; somehow the queues have to be managed at each point in the network.

      And yes, bufferbloat has been identified many times in the past; I’ve mostly been assembling a puzzle from other people’s pieces (and I often don’t know the whole history of discovery of the pieces). And while I identified the shape of the “criminal mastermind” and that it’s been stalking the internet killing latency again and again, there are many more pieces yet to assemble. In a talk I’ve only given once so far, I say this up front. I’m happy to give it at MIT anytime you like, if you do… We have lots more to find and assemble before the whole picture is complete.

  15. Leo Dirac Says:

    I really appreciate your writings on this subject. Thanks for all your hard work experimenting and writing it up.

    That said, this post falls well short of the promise its title implies. The title is “Mitigations and Solutions…” but the page actually offers neither. Even for Linux, the system you have most thoroughly investigated, there isn’t a single example of what somebody can do to decrease the latency on their home network. You describe some of the commands and parameters involved but leave it as an exercise for the reader to figure out what to actually type. In comments you offer an excuse that everybody’s situation will be different and that the solution for a home user will (obviously) differ from that of an ISP operator. These two situations are probably ends of a spectrum that most of us can take a reasonable guess as to our position on. As you say, more experimentation is needed, and the more you can do to enable less-sophisticated users to experiment (i.e. tell them exactly what to type) the more we’re going to collectively learn about the subject.

    Thank you again for your hard work, but please try to sprinkle in some actionable, practical advice. Especially on pages like this which appear intended to offer exactly that.

  16. Marc Herbert Says:

    “With some googling, I discovered on Microsoft’s web site that Microsoft has bandwidth shaped their TCP implementation to not run at full speed by default,”

    It seems I can’t reproduce your googling; can you please provide references? The best I found is this:
    http://www.speedguide.net/articles/windows-7-vista-2008-tweaks-2574
    … but it’s neither that explicit nor from Microsoft.

    • gettys Says:

      This was five years ago; I have no clue if the technical note is still present, or applies to Windows 7 or 10.
