Whose house is of glasse, must not throw stones at another.


In my last post I outlined the general bufferbloat problem. This post attempts to explain what is going on, and how I started on this investigation, which resulted in (re)discovering that the Internet’s broadband connections are fundamentally broken (others have been there before me). It is very likely that your broadband connection is badly broken as well, as is your home router, and even your home computer. There are things you can do immediately to mitigate the brokenness in part, which will make applications such as VOIP, Skype and gaming work much, much better; I’ll cover them in more depth very soon. Also coming soon: how this affects the worldwide dialog around “network neutrality.”

Bufferbloat is present in all of the broadband technologies, cable, DSL and FIOS alike. And bufferbloat is present in other parts of the Internet as well.

As may be clear from old posts here, I’ve had lots of network trouble at my home, made particularly hard to diagnose by repeated lightning problems. This has caused me to buy new (and newer) equipment over the last five years (and to experience, in all its glory, the fact that bufferbloat has been getting worse). It also means that I can’t definitively answer all questions about my previous problems, as almost all of that equipment is now scrap.

Debugging my network

As covered in my first puzzle piece, last April I was investigating the performance of an old VPN device Bell Labs had built, and found that the latency and jitter when running at full speed were completely unusable, for reasons I did not understand, but had to understand for my project to succeed. The plot thickened when I discovered I had the same terrible behavior without using the Blue Box.

I had had an overnight trip to the ICU in February, so I did not immediately investigate then, as I was catching up on other work. But I knew I had to dig into it, if only to make good teleconferencing viable for me personally. In early June, lightning struck again (yes, it really does strike in the same place many times). Maybe someone was trying to get my attention on this problem. Who knows? I did not get back to chasing my network problem until sometime in late June, after partially recovering my home network, further protecting my house, fighting with Comcast to get my cable entrance relocated (the mom-and-pop cable company Comcast had bought had installed it far away from the power and phone entrance), and replacing my washer, pool pump, network gear, and irrigation system.

But the clear signature of the criminal I had seen in April had faded. Despite several weeks of periodic attempts, including using the wonderful tool smokeping to monitor my home network, and installing it at Bell Labs, I couldn’t nail down what I had seen again. I could get whiffs of smoke from the unknown criminal, but not the same obvious problems I had seen in April. This was puzzling indeed; the biggest single change in my home network had been replacing the old blown cable modem provided by Comcast with a new, faster DOCSIS 3 Motorola SB6120 I had bought myself.

In late June, my best hypothesis was that there might be something funny going on with Comcast’s PowerBoost® feature. I wondered how that worked, did some Googling, and happened across the very nice Internet draft that describes how Comcast runs and provisions its network. When going through the draft, I happened to notice that one of the authors lives in an adjacent town, and emailed him, suggesting lunch and a wide-ranging discussion around QoS, Diffserv, and the funny problems I was seeing. He’s a very senior technologist at Comcast. We got together in mid-July for a very wide-ranging lunch lasting three hours.

Lunch with Comcast

Before we go any further…

Given all the Comcast bashing currently going on, I want to make sure my readers understand that through all of this Comcast has been extremely helpful and professional, and that the problem I uncovered, as you will see before the end of this blog entry, is not limited to Comcast’s network: bufferbloat is present in all of the broadband technologies, cable, FIOS and DSL alike.

The Comcast technical people are as happy as the rest of us that they now have proof of bufferbloat and can work on fixing it, and I’m sure Comcast’s business people are happy that they are in the same boat as the other broadband technologies (much as we all might wish the mistake were only in one technology or network, it is unfortunately very commonplace, and possibly universal). And as I’ve seen the problem in all three common operating systems, in all current broadband technologies, and in many other places, there is a lot of glasse around us. Care with stones is therefore strongly advised.

The morning we had lunch, I happened to start transferring the old X Consortium archives from my house to an X.org system at MIT (only 9ms away from my house; most of the delay is in the cable modem/CMTS pair); these archives are 20GB or so in size. All of a sudden, the whiffs of smoke I had been smelling became overpowering to the point of choking and death. “The Internet is Slow Today, Daddy” echoed through my mind; but this was self-inflicted pain. Since I only had an hour before lunch, the discussion was a bit less definite than it would have been even a day later. Here is the “smoking gun” of the following day, courtesy of the DSL Reports smokeping installation. You too can easily use this wonderful tool to monitor the behavior of your home network from the outside.

Horrifying smokeping plot: terrible latency and jitter

As you can see, I had well over one second of latency, and jitter just as bad, along with high ICMP packet loss. Behavior measured from the inside out looked essentially identical. The times when my network connection returned to normal were when I would get sick of how painful it was to browse the web and suspend the rsync to MIT. As to why the smoke broke out: the upstream transfer is always limited by the local broadband connection; the server is at MIT’s colo center on a gigabit network that peers directly with Comcast. It is a gigabit (at least) from Comcast’s CMTS all the way to that server (and from my observations, Comcast runs a really clean network in the Boston area). It’s the last mile that is the killer.

As part of lunch, I was handed a bunch of puzzle pieces that I assembled over the following couple of months. These included:

  1. That what I was seeing was most likely excessive buffering in the cable system, in particular in cable modems. Comcast had been trying to get definitive proof of this problem since Dave Clark at MIT brought it to their attention several years ago.
  2. A suggestion of how to rule in/out the possibility of problems from Comcast’s Powerboost by falling back to the older DOCSIS 2 modem.
  3. A pointer to ICSI’s Netalyzr.
  4. The interesting information that some/many ISP’s do not run any queue management (e.g. RED).
Wireshark screen capture, showing part of a “burst”

I went home, and started investigating seriously.  It was clearly time to do packet traces to understand the problem. I set up to take data, and eliminated my home network entirely by plugging my laptop directly into the cable modem.

But it had been more than a decade since I had last tried taking packet captures and staring at TCP traces. Wireshark was immediately a big step up (I’d occasionally played with it over the last decade); as soon as I took my first capture, despite being very rusty at staring at traces, I immediately knew something was gravely wrong. In particular, there were periodic bursts of illness, with bursts of dup’ed ACKs, retransmissions, and reordering. I’d never seen TCP behave in such a bursty way (for long transfers). So I really wanted to see visually what was going on in more detail. After wasting my time investigating more modern tools, I settled on the old standbys of tcptrace and xplot, which I had used long before. There are certainly more modern tools, but most are closed source and require Microsoft Windows; acquiring the tools, their learning curve, and the fact that I normally run Linux militated against their use.
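
If you want to reproduce this kind of plot from your own captures, the recipe is roughly the following (the interface, host and file names here are made up, and it is worth checking the options against your tcptrace man page):

    # capture the transfer (or save a capture from wireshark)
    tcpdump -i eth0 -s 0 -w upload.pcap host a-well-provisioned-server.example.net

    # have tcptrace emit its xplot graph files (time sequence, RTT,
    # outstanding data/owin, throughput) for each connection it finds
    tcptrace -G upload.pcap

    # then view them, e.g. the round trip time graph of the first connection
    xplot a2b_rtt.xpl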

A number of plots show the results. The RTT becomes very large a short while (10-20 seconds) into the connection, just as the ICMP ping results do. The outstanding data graph and throughput graph show the bursty behavior that was so obvious even when browsing the wireshark results. Contrast these with the sample RTT, outstanding data, and throughput graphs from the tcptrace manual.

RTT (round trip time) plot

Outstanding data graph

Throughput graph

Also remember that buffering in one direction still causes problems in the other direction: TCP’s ACK packets will be delayed. So my occasional uploads (in concert with the buffering) were causing the “Daddy, the Internet is slow today” phenomenon; the opposite situation is of course also possible.

The Plot Thickens Further

Shortly after verifying my results on cable, I went to New Jersey (I work for Bell Labs from home, reporting to Murray Hill), where I stay with my in-laws in Summit, and did a further set of experiments. When I did, I was monumentally confused (for a day), as I could not reproduce the strong latency/jitter signature (approaching 1 second of latency and jitter) that I had seen my first day there, when I went to take the traces. With a bit of relief, I realized that the difference was that I had initially been running wireless, and had then plugged into the router’s ethernet switch (which has about 100ms of buffering) to take my traces. The only explanation that made sense to me was that the wireless hop had additional buffering (almost a second’s worth) above and beyond that present in the FIOS connection itself. This sparked my later investigation of routers (along with occasionally seeing terrible latency in other routers), which in turn (when the results were not as I had naively expected) sparked investigating the base operating systems.

The wireless traces from Summit are much rattier: there are occasional packet drops severe enough to cause TCP to do full restarts (rather than just fast retransmits), and I did not have the admin password on the router to shut out access by others in the family. But the general shape of both is similar to what I initially saw at home.

Ironically, I have realized that you don’t see the full glory of the TCP RTT confusion caused by buffering if you have a bad connection, since packet loss resets TCP’s timers and RTT estimation; loss is always treated as possible congestion. This is a situation where the “cleaner” the network is, the more trouble you’ll get from bufferbloat: the cleaner the network, the worse it will behave. And I’d done so much work to make my cable as clean as possible…

At this point, I realized what I had stumbled into was serious and possibly widespread; but how widespread?

Calling the consulting detectives

At this point, I worried that we (all of us) are in trouble, and asked a number of others to help me understand my results, ensure their correctness, and get some guidance on how to proceed. These included Dave Clark, Vint Cerf, Vern Paxson, Van Jacobson, Dave Reed, Dick Sites and others. They helped with the diagnosis from the traces I had taken, and confirmed the cause. Additionally, Van notes that there is timestamp data present in the packet traces I took (since both ends were running Linux) that can be used to locate where in the path the buffering is occurring (though my pings are also very easy to use, they may not be necessary for real TCP wizards, which I am not, and they raise a question of accuracy if the nodes being probed are loaded).

Dave Reed was shouted down and ignored over a year ago when he reported bufferbloat in 3G networks (I’ll describe this problem in a later blog post; it is an aggregate behavior caused by bufferbloat). With examples in broadband and suspicions of problems in home routers, I now had reason to believe I was seeing a general mistake that (nearly) everyone is making repeatedly. I was concerned to build a strong case that the problem was large and widespread, so that everyone would start to systematically search for bufferbloat. I have spent some of the intervening several months documenting and discovering additional instances of bufferbloat: in my switch, in my home router, in results from browser experiments, and in additional cases such as corporate and other networks, as future blog entries will make clear.

ICSI Netalyzr

One of the puzzle pieces handed me by Comcast was a pointer to Netalyzr.

ICSI has built the wonderful Netalyzr tool, which you can use to help diagnose many problems in your ISP’s network. I recommend it very highly. Other really useful network diagnosis tools can be found at M-Lab, and you should investigate both; some of the tests can be run immediately from a browser (e.g. Netalyzr), though some tests are very difficult to implement in Java. By using these tools, you will also be helping researchers investigate problems in the Internet, and you may be able to discover and expose misbehavior by many ISP’s. I have, for example, discovered that the network service provided on the Acela Express runs a DNS server that is vulnerable to man-in-the-middle attacks due to lack of port randomization, and I will therefore never consider doing anything on it that requires serious security.

At about the same time as I was beginning to chase my network problem, the first Netalyzr results were published at NANOG; more recent results have since been published in Netalyzr: Illuminating The Edge Network, by Christian Kreibich, Nicholas Weaver, Boris Nechaev, and Vern Paxson. This paper has a wealth of data on all sorts of problems that Netalyzr has uncovered; excessive buffering is covered in section 5.2. The scatterplot there and the discussion are worth reading. The ICSI group has kindly sent me a color version of that scatterplot, which they have used in their presentations but is not in the paper, and which makes the technology situation (along with the magnitude of the buffering) much clearer. Without this data, I would still have been wondering whether bufferbloat was widespread, and whether it was present in different technologies or not. My thanks to them for permission to post these scatter plots.

Netalyzr uplink buffer test results

Netalyzr downlink buffer test results

As outlined in section 5.2 of the Netalyzr paper, the structure you see is very useful for seeing what buffer sizes and provisioned bandwidths are common. The diagonal lines indicate the latency (in seconds!) caused by the buffering. Both wired and wireless Netalyzr data are mixed in the above plots. The structure shows common buffer sizes, which are sometimes as large as a megabyte. Note that Netalyzr may at times have been under-detecting and/or under-reporting the buffering, particularly on faster links; the Netalyzr group has been improving its buffer test.
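
To get a rough sense of scale (the numbers here are made up, but plausible): the added delay is simply the buffer size divided by the rate at which it drains. A 256KB buffer draining at 1Mbps adds roughly 256 × 8 ÷ 1000 ≈ 2 seconds of queueing delay once it fills; the same buffer draining at 10Mbps adds roughly 200 milliseconds. This is also why identical gear hurts the lower bandwidth tiers so much more.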

I do have one additional caution, however: do not regard the bufferbloat problem as limited to interference caused by uploads. Certainly more bandwidth makes the problem smaller (for the same size buffers); the wired performance of my FIOS data is much better than what I observe for Comcast cable when plugged directly into the home router’s switch. But since the problem is also present in the wireless routers often provided by those network operators, the typical latency/jitter results for the user may in fact be similar, even though the bottleneck may be in the home router’s wireless routing rather than the broadband connection. Any time the downlink bandwidth exceeds the “goodput” of the wireless link that most users are now connected by, the user will suffer from bufferbloat in the downstream direction in the home router (typically provided by Verizon) as well as upstream (in the broadband gear) on cable and DSL. I now commonly see downstream bufferbloat on my Comcast service too: since I upgraded to 50/10 service, it is much more common for my wireless bandwidth to be less than the broadband bandwidth.

Discarding various alternate hypotheses

You may remember that I started this investigation with a hypothesis that Comcast’s PowerBoost might be at fault. This hypothesis was discarded by dropping back to the older DOCSIS 2 modem: had PowerBoost been the cause, the signature would have changed when I did so.

Secondly, those who have waded through this blog will have noted that I have had many reasons not to trust the cable to my house, due to Comcast’s earlier mis-reinstallation of a failed cable when I moved in. However, the lightning events I have had meant that the cable to my house was relocated this summer, and a Comcast technician came to my house and verified the signal strength, noise and quality. Furthermore, Comcast verified my cable at the CMTS end; there Comcast saw a small amount of noise (also evident as occasional packet loss in some of the packet traces) due to the TV cable also being plugged in (the previous owner of my house loved TV, and the TV cabling wanders all over the house). For later datasets, I eliminated this source of noise; the cable then tested clean at the Comcast end, and the loss is gone from subsequent traces. This cable is therefore as good as it gets outside a lab, with very low loss. You can consider some of these traces close to lab quality. Comcast has since confirmed my results in their lab.

Another objection I’ve heard is that ICMP ping is not “reliable”. This may be true when pinging a particular node that is itself loaded, as the ping may be handled on the node’s slow path. However, it’s clear that the major packet loss here is actual packet loss (as the TCP traces show). I personally think much of the “lore” I’ve heard about ICMP is incorrect and/or a symptom of the bufferbloat problem. I’ve also worked with the author of httping to add support for persistent connections, so that there is a commonly available tool (Linux and Android) for doing RTT measurements that is indistinguishable from HTTP traffic (because it is HTTP traffic!). In all the tests I’ve made, the results for ICMP ping match those of httping. And in any case, TCP shows the same RTT problems that ICMP or httping does.
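
For the curious, a minimal httping run looks something like this (the URL is just a placeholder; check your version’s man page for the full set of options):

    httping -g http://www.example.com/ -c 20

Run it alongside a big transfer and compare its reported round trip times with plain ping to the same host; in my experience they tell the same story.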

What’s happening here?

I’m not a TCP expert; if you are a TCP expert, and if I’ve misstated or missed something, do let me know. Go grab your own data (it’s easy: just run an scp to a well-provisioned server while running ping), or you can look at my data.
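
A minimal version of the experiment looks like this (the host and file names are placeholders); run the two commands in separate terminals and watch the ping times climb once the upload fills the buffers:

    # terminal 1: a steady latency probe to a nearby, well-provisioned host
    ping a-well-provisioned-server.example.net

    # terminal 2: saturate the uplink with a large transfer
    scp some-large-file.tar you@a-well-provisioned-server.example.net:/tmp/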

The buffers are confusing TCP’s RTT estimator; the delay caused by the buffers is many times the actual RTT on the path. Remember, TCP is a servo system that is constantly trying to “fill” the pipe. So if congestion is not signalled in a timely fashion, there is *no possible way* TCP’s algorithms can determine the correct bandwidth at which to send data (it needs to compute the delay/bandwidth product, and the delay becomes hideously large). TCP sends data a bit faster (the usual slow start rules apply), re-estimates the RTT from that, and sends faster still. Of course, this means that even in slow start, TCP ends up trying to run too fast. Therefore the buffers fill (and the latency rises). Note that the actual RTT on the path of this trace is 10 milliseconds; TCP’s RTT estimator is misled by more than a factor of 100. It takes 10-20 seconds for TCP to get completely confused by the buffering in my modem; and there is no way back.

Remember, timely packet loss to signal congestion is absolutely normal; without it, TCP cannot possibly figure out the correct bandwidth.

Eventually, packet loss occurs and TCP tries to back off, so a little bit of buffer space reappears; but TCP exceeds the bottleneck bandwidth again very soon. Wash, rinse, repeat… High latency with high jitter, with the periodic behavior you see. This is a recipe for terrible interactive application performance. And it’s probable that the device is doing tail drop; head drop would be better.

There is significant packet loss as a result of “lying” to TCP. In the traces I’ve examined using the TCP STatistic and Analysis Tool (tstat), I see 1-3% packet loss. This is a much higher packet loss rate than a “normal” TCP should be generating. So, in the misguided belief that dropping data is “bad”, we’ve now managed to build a network that is both lossier and exhibiting more than 100 times the latency it should. Even more fun is that the losses come in “bursts”; I hypothesize that this accounts for the occasional DNS lookup failures I see on loaded connections.

By inserting such egregiously large buffers into the network, we have destroyed TCP’s congestion avoidance algorithms. TCP is used as a “touchstone” of congestion avoiding protocols: in general, there is very strong pushback against any protocol which is less conservative than TCP. This is really serious, as future blog entries will amplify. I personally have scars on my back (on my career, anyway), partially induced by the NSFnet congestion collapse of the 1980’s. And there is nothing unique here to TCP; any other congestion avoiding protocol will certainly suffer.

Again, by inserting big buffers into the network, we have violated the design presumption of all Internet congestion avoiding protocols: that the network will drop packets in a timely fashion.

Any time you have a large data transfer to or from a well-provisioned server, you will have trouble. This includes file copies, backup programs, video downloads, and video uploads. A generally congested link (such as at a hotel) will also suffer. So will multiple streaming video sessions going over the same link in excess of the available bandwidth, or running current BitTorrent to download your Linux ISO’s, or Google Chrome uploading a crash report to Google’s servers (as I found out one evening). I’m sure you can think of many others. Of course, to make this “interesting”, as in the Chinese curse, the problem will come and go mysteriously as you happen to change your activity (or as things you aren’t even aware of happen in the background).

If you’ve wondered why VOIP and Skype have so often been flakey, stop wondering. Even though they are UDP-based applications, it’s almost impossible to make them work reliably over links with such high latency and jitter. And since there is no traffic classification going on in broadband gear (or other generic Internet service), you just can’t win. At best, you can (greatly) improve the situation at the home router, as we’ll see in a future installment. Also note that broadband carriers may very well have provisioned their telephone service independently of their data service, so don’t jump to the conclusion that their telephone service won’t be reliable.

Why hasn’t bufferbloat been diagnosed sooner?

Well, it has been (mis)diagnosed multiple times before; but the full breadth of the problem has, I believe, been missed.

The individual cases have often been noticed, as Dave Clark did on his personal DSLAM, or as noted in the Linux Advanced Routing & Traffic Control HOWTO. (Bert Hubert attributed much more blame to the ISP’s than is justified: the blame should primarily be borne by the equipment manufacturers, and Bert et al. should have made a fuss in the IETF over what they were seeing.)

As to specific reasons why, these include (but are not limited to):

  • We’re all frogs in heating water; the water has been getting gradually hotter as the buffers grow in successive generations of hardware and memory has become cheaper. We’ve been forgetting what the Internet *should* feel like for interactive applications. Us old guys’ memories of how well the Internet worked in the days when links ran at 64Kb, fractional T1 or T1 speeds are fading. For interactive applications, it often worked much better than today’s Internet.
  • Those of us most capable of diagnosing the problems have tended to opt for the higher/highest bandwidth tiers of ISP service; this means we suffer less than the “common man” does. More about this later. Any time we try to diagnose the problem, it is most likely that we were the cause; so as soon as we stop whatever we were doing that caused “Daddy, the Internet is slow today”, the problem vanishes.
  • It takes time for the buffers to confuse TCP’s RTT computation. You won’t see problems on a very short (several second) test using TCP (you can test for excessive buffers much more quickly using UDP, as Netalyzr does).
  • The most commonly used system on the Internet today remains Windows XP, which does not implement window scaling and will never have more than 64KB in flight at once. But bufferbloat will become much more obvious and common as more users switch to other operating systems and/or later versions of Windows, any of which can saturate a broadband link with merely a single TCP connection.
  • In good engineering fashion, we usually run a single test at a time, first testing bandwidth and then latency separately. You only see the problem if you test bandwidth and latency simultaneously, and none of the common consumer bandwidth tests do so. I know that separate testing is what I did for literally years as I tried to diagnose my personal network. Unfortunately, the emphasis has been on speed; for example, Ookla’s speedtest.net and pingtest.net are really useful, but they don’t run simultaneously with each other. As soon as you test latency together with bandwidth, the problem jumps out at you. Now that you know what is happening, if you have access to a well-provisioned server on the network, you can run tests yourself that make bufferbloat jump out at you.

I understand you may be incredulous as you read this: I know I was when I first ran into bufferbloat.  Please run tests for yourself. Suspect problems everywhere, until you have evidence to the contrary.  Think hard about where the choke point is in your path; queues form only on either side of that link, and only when the link is saturated.

Coming installments

Acknowledgements

My thanks to the many who have helped crack this case, including Dave Clark, Vint Cerf, Vern Paxson, Van Jacobson, Dave Reed, Scott Bradner, Steve Bellovin, Greg Chesson, Dick Sites, Ted T’so, and quite a few others. And particularly to the ICSI Netalyzr developers, without whose work I’d still be wondering if what I saw at home and in New Jersey were a fluke.

Conclusions

All broadband technologies are suffering badly from bufferbloat, as are many other parts of the Internet.

You suffer from bufferbloat nearly everywhere: if not at home or at your office, then when you travel. Many hotels are now connected by broadband, and you often suffer grievous latency and jitter there, since they have not mitigated bufferbloat and you are sharing the connection with many others. (More about mitigation strategies soon.) How easy or difficult it is to fix those technologies clearly depends on their details; full solutions depend on active queue management, but some other mitigations are possible (just set the buffers to something sane; they are often up to a megabyte in size now, as the ICSI data show), as I’ll describe later in this sequence of blog posts.
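
You usually cannot reach the buffers inside broadband gear yourself, but your own machines have analogous knobs. As one small illustration (the interface name and value are purely illustrative, and this is in no way a complete fix), Linux lets you inspect and shrink the transmit queue of an interface:

    # show the current transmit queue length
    ip link show dev eth0

    # reduce it; the right value depends on the link, so this is only an example
    ip link set dev eth0 txqueuelen 100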

Bufferbloat is a serious, widespread problem, the full severity of which will become clearer in subsequent postings.

Let’s go fix it, together.

170 Responses to “Whose house is of glasse, must not throw stones at another.”

  1. Zack Says:

    Is this exclusively about latency, or could it also explain situations where a large file transfer initially saturates the “last hop” link, but slows down to ~10% of theoretical bandwidth after a few megabytes are transferred, and stays that way until completion?

  2. gettys Says:

    I’d have to see data to know (I’m not volunteering to go look at yours either; I have plenty of fish frying). I’ve seen high packet loss rates at times, but I haven’t caught anything like 90% loss rate in my experiments.

    Certainly PowerBoost and similar features from other broadband providers don’t make a 90% difference in bandwidth performance; they might temporarily get you a factor of 2-5 more than your provisioned bandwidth, at most.

    What is your service, by whom?

    And you may need to take some traces to really see what is going on.

  3. Dan McDonald Says:

    @Zack — packet drops (as perceived by TCP) will cause bandwidth to strangle, and bufferbloat definitely causes the perception of packet drops.

    • gettys Says:

      It’s not that simple, with a modern TCP: fast retransmit and SACK can paper over a lot of sins. But there may be circumstances where things go badly wrong. I suggest you take a packet capture, and see if you can get someone with real TCP expertise to take a look at it.

      • Zack Says:

        Unfortunately, this was happening at my previous apartment, in a different city, with a different ISP, and doesn’t seem to be happening now. If I see it again, though, I’ll get a packet trace.

  4. rwg Says:

    “The most commonly used system on the Internet today remains Windows XP, which does not implement window scaling and will never have more than 64KB in flight at once.”

    For what it’s worth, Windows XP supports TCP window scaling (and timestamps), but it’s not enabled by default. More info at: http://support.microsoft.com/kb/224829

    • gettys Says:

      Yes, you are entirely correct. Of course, editing the registry on Windows is hazardous to your machine’s health, so few enable it. It’s mostly interesting as it bears on why bufferbloat (and the problems it has caused) went so long before widespread diagnosis, as future posts will make clear. It is also why I tend to lose sleep at night: the traffic is finally shifting away from old TCP’s as XP finally retires, and I worry about the problem becoming more severe.

      • Christopher Smith Says:

        Actually, even with the default setting, Windows XP will use large window scaling if it receives a SYN packet with the window scaling option marked.

        • gettys Says:

          Most traffic is initiated by Windows XP, given its (finally dropping) dominance on the net. So correct me if I’m wrong, but that tells me that we’ll still see most XP initiated TCP sessions running without window scaling.

  5. walken Says:

    Linux has traffic shaping capabilities that can be used to work around this problem. My home setup involves a Linux router sitting in front of the DSL modem. Traffic from the router to the DSL modem is rate limited to a couple percent slower than the actual DSL link speed, so that buffering will occur in the router rather than in the modem. Then, one can configure buffering behaviors in the router:
    – limiting buffer size, to control latency;
    – fair queuing (typically SFQ in linux), so that individual high throughput connections might still have high latency but they at least won’t impact the latency of lower throughput connections;
    – or, any combination of the above strategies.

    Really, the Linux traffic shaping stuff is very powerful and underused. I wish broadband hardware manufacturers would all do something similar in their hardware (even better if they’d make it configurable, of course).
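
    A minimal sketch of that kind of setup, with a made-up interface name and rates (assuming a roughly 1Mbit/s uplink and the stock Linux HTB and SFQ qdiscs), might look like:

        # shape egress toward the modem to just under the measured uplink rate
        tc qdisc add dev eth0 root handle 1: htb default 10
        tc class add dev eth0 parent 1: classid 1:10 htb rate 950kbit ceil 950kbit

        # fair queuing within that class, so one bulk flow cannot starve the rest
        tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10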

    • gettys Says:

      Yes, it does, as I’ll explain in detail in a future post. OpenWRT variants such as Gargoyle do this. It doesn’t require new hardware at all, just use of existing facilities (though RED also has some problems, as I’ll also cover). This is the mitigation I’ve referred to in my post. But as the posts are long enough as it is, I didn’t want to try to cover that immediately.

      Note that classification is not sufficient, you also need to run some form of AQM, or you’ll still have problems.

      However, it’s clear they aren’t doing everything they should: such as running (G)RED on the local routing.

    • Christopher Smith Says:

      Interestingly, my friend Shane Tzen observed this problem about a decade ago and wrote specifically about using traffic shaping strategies with TCP to avoid the problem: http://www.knowplace.org/pages/howtos/traffic_shaping_with_linux/network_protocols_discussion_traffic_shaping_strategies.php (look at the section titled TCP).

  6. Simon Says:

    I’ve also been aware of this problem since May 2009, when I noticed that high latency was correlated with a saturated upload link. Initially I thought it was something BitTorrent specific, what with the 100+ connections, but it was just a wild guess. The key moment for me was when I realized that even a single uploading connection as with a speed test was capable of increasing the latency of the connection. At that point I understood why, because all of the QoS related information out there for systems like OpenWRT and Tomato mention the buffering issue as something that has to be worked around for the QoS to be able to provide good latencies for high priority packets. I’m amazed at how much more time you had to spend to diagnose this, but I’m happy you’re taking up the cause, and I look forward to your advice for mitigating it, given the level of detail you’ve put into this post.

    • gettys Says:

      The time has been spent mostly looking elsewhere than in the broadband link; that was clear quickly as soon as I had traces and saw the Netalyzr data.

      Since it quickly became clear the problem is much more widespread than the broadband edge network, the time has gone into building a strong enough case that I now hope everyone will stop and think deeply about whether their piece of the Internet system suffers from bufferbloat. Dave Reed tried to warn everyone over a year ago about bufferbloat in 3G network systems, and despite his deep expertise in Internet technology (he’s a co-author of the famous “end to end” design paper), ended up not “making the case” well enough to convince the jury. Some have axes to grind.

      The immediate reaction I’ve received on quite a few occasions, including in my own company, has been incredulity.

      “nothing bad can be happening”
      “but dropping any packet is horrible and wrong”
      “I don’t understand”

      Just to give a small sample of what I’ve heard over the last few months. It helps that I’ve had a bit of success with this quest; I know of at least one product we’ll be shipping which will work well rather than badly, having had a bloatectomy. And that device will therefore likely work much better than its competition; I certainly hope it does well when it reaches the market.

  7. nate Says:

    I just posted this on LWN. But I was afraid you’d miss it:

    This is freaking fantastic.

    It’s really really cool beyond belief that you (Gettys) figured it out.

    I mean seriously amazing stuff. No question about it.

    In terms of technical insight and investigative ability this was a HUGE hit out of the ballpark. Way out. You not only got a home run, you hit it over the stands, past the parking lot and it’s bouncing over the highway as we speak.

    Internet history in the making. No question about it at all.

    Beyond belief you deserve the gratitude of, well, anybody with high speed internet access to the internet.

    Words escape me. All I can do is shake my head in awe. Completely awesome.

    Kudos.

    • gettys Says:

      Thanks; but I think you are too kind.

      Many of the puzzle pieces were handed to me (unassembled) by Comcast.

      As always, we are on the shoulders of other giants: the area of congestion management was explored with a depth of understanding I admire deeply by the likes of Van Jacobson, Sally Floyd, and many, many others. If I’ve done anything important here, it has been recognizing that the problem is occurring in other parts of the end-to-end system than “conventional” internet core routers, where it was pretty fully explored in the 1980’s and 1990’s.

      And chance is very important: aiding me was knowing some of the players here, so that when I smelled smoke, they could diagnose the fire, giving me the confidence to dig deeper and look further. So in part, it’s being in a particular place at a particular time.

  8. ebiederm Says:

    I haven’t dug too deeply into this, but I wonder if the variants of TCP that use increases in round trip times as an indication of congestion would avoid this problem.

  9. Charles 'Buck' Krasic Says:

    I’ve been working in the area of video streaming over TCP for a number of years. In the course of that work, I’ve noticed some of the pathologies in last-mile broadband access too. A lot of the time, they seem to be due to shapers that appear to be applied probabilistically. It seems to me that if you are a long fat TCP flow, odds are high you will be lumped into the “smells like bittorrent” category of the “traffic management” gear of the ISP.
    When that happens, the shaper kicks in, and the buffering is horrendous.

    For a kind of crazy workaround, you might find a paper we published in the ACM Multimedia Systems 2010 conference to be entertaining:

    http://www.mmsys.org/?q=node/24

    In particular, we designed an automated failover mechanism into our protocol above TCP, called Paceline. Basically, when TCP delay goes off the chart, we kill the connection and continue on a fresh one. I usually explain this as based on the human behavior that is the ‘stop-reload’ cycle everyone does when their web browsing session stalls. Only in Paceline, we automate it. It was not designed to address the above shaper issue, but I’ve noticed that it often does so in practice. The connections will failover for a few seconds, and then one will seem to break free of the shaper and be good to go for tens of seconds or even minutes. I had a good chuckle when I first noticed it in action. 🙂

    — Buck Krasic (University of British Columbia)

    • gettys Says:

      Yuck. Engineering around brokenness. Let’s get the brokenness fixed… Or the kludge tower that is the Internet will teeter yet more, and someday we’ll fall over (something I now actually fear, as I’ve alluded to in my posting and will discuss in more detail soon).

      • Charles 'Buck' Krasic Says:

        There is one other thing I meant to mention in my earlier post.

        You may be aware of this, but it is important to remember that TCP’s window size is the minimum of 1) receive buffer size, 2) send buffer size, 3) the bandwidth-delay product determined by the congestion control algorithms. When buffer sizes get massive, it is very possible that 1) or 2) will be smaller than 3), so you could say the effect is to “turn off” congestion control in elephant flows most of the time. From their point of view, they are in a LAN-like, ACK-paced mode: they simply send data on receipt of every ACK, and leave the actual rate determination to lower network layers. I have long suspected/wondered whether whoever engineered current traffic management practices has done this by design, the goal being not to “break” congestion control, but instead to re-assign responsibility to a different entity, from end-host to ISP-managed devices–broadband modems and traffic management gear.

        You have done a lot of good measurements. I’ve found some insights through end-host instrumentation; specifically, I’ve wired up some of Linux’s TCP_INFO sockopt statistics (buffer sizes, window size, rtts, rto’s etc.) to a user-level trace tool (of my own writing). Watching the actual values used inside TCP is quite informative.

        As for the kludginess of Paceline, yea well “kludge” vs “pragmatic, balanced and elegant solution given the context” is always a subjective assessment. 😉

  10. bert hubert Says:

    Hi Jim,

    Thanks for delving into this! I’ve been wanting to get to the bottom of this ever since writing the Linux Advanced Routing & Traffic Control HOWTO.

    I indeed noticed this problem way back when in 1999 or so. You correctly note that the blame should fall on equipment manufacturers, but back then consumers did not have any choice in the matter. You got the equipment your cable company or DSL provider selected for you.

    Also, at the time, there was a huge and almost exclusive focus on ‘DOWNLOAD SPEED’, and modems were clearly optimized to generate as much of that as possible, disregarding any latency impact.

    About raising a stink in the IETF, I don’t know. At the time I did not see the (European) Internet Service Providers I was working with interact with the IETF much.

    But anyhow, let’s hope something happens now. In the meantime, http://lartc.org/wondershaper gives you control of the queues again.

    Bert

    • gettys Says:

      Certainly the ISP’s share the responsibility for the problem with equipment vendors; the monomaniacal focus on bandwidth has cost us all tremendously, and we need to change the conversation from solely bandwidth to some bandwidth/latency metric to make progress. We have to change this into a competitive situation to make quick progress. Without shining the light of day onto the problem and turning it into a competitive situation, bufferbloat won’t get eliminated in finite time.

      Van Jacobson pointed out to me that the problem goes back a long way, to when DARPA walked away from funding most network research over a decade ago: this left research into how to handle a large dynamic range of bandwidth completely in the lurch; NSF has primarily been interested in just “going fast” to connect scientists to supercomputers. So nobody has been minding the store, and doing research on many orders of magnitude of differing performance is far from a fully solved problem and needs serious research. Dynamic range of adaptive behavior is as hard as absolute performance; we’ve only been looking at absolute performance for over a decade.

      As to the IETF, even in 1997, when I was still working extensively on HTTP, there was enough representation that the word might have spread; the Nordic countries in particular were already very clueful. The IETF is both very similar to and slightly different from the FOSS community (sharing some heritage, both having been spawned out of the academic research communities decades ago). There was certainly heavy representation from all the equipment manufacturers. Somehow we need to break down the barriers that have somewhat separated the communities, as there is much that can and should be shared.

      And yes, mitigation of bufferbloat is (partially) possible via Wondershaper and techniques like that which Paul Bixel is attempting in his recent work in Gargoyle (I haven’t yet tried them out, but hope to soon). Just remember that the problem is more general, and not confined to the router/broadband hop; we also have to fix even local traffic (to your storage and other boxes at home), as my experiments show.

      And the base OS’s all have problems to some degree or another. We have a mess everywhere, and be careful with stones…

    • Jesper Dangaard Brouer Says:

      Hi Bert,

      The Wondershaper should be updated to use the TC options “linklayer” and “overhead”.
      This solves the issue of having to “reduce” the bandwidth to achieve queue control, as these options (e.g. linklayer adsl) take the ADSL overhead and framing into account.

      The options (which I implemented) are included in mainline Kernels since 2.6.24 and in tc/iproute2 in version 2.6.25.
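
      Purely as an illustration, with made-up rates, a class using those options might look like:

          tc class add dev eth0 parent 1: classid 1:10 htb rate 500kbit ceil 500kbit linklayer adsl overhead 40

      so the kernel accounts for the ATM cell framing when it computes transmit times.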

      Guess the word has not been spread about this (now old) option… sorry about that.

      –Jesper Dangaard Brouer

      • Ben Livengood Says:

        Is there documentation for the new options? Everything I can find seems to be a few years old (HOWTOs, Documentation in the kernel source, man pages, the lartc.org site, etc.).

        • gettys Says:

          Ugh. Yeah, that’s one of the real headaches.

          Much of the “tuning” information I’ve seen is out of date, often now completely broken, superseded by other problems (e.g. classification may be completely ineffective if your device driver is doing buffering underneath you), and mostly aimed at “going fast” for supercomputers, which is not what most users want or need. This is part of why I think real solutions should “just work”; expecting everyone to figure out the right “default” is a recipe for failure.

          I need to turn on a wiki I have set up to help with this problem and have a place for everyone to work together on this. Maybe next week. Getting Slashdotted today hasn’t helped.

  11. Tomasz Says:

    Among the different TCP congestion algorithms available in Linux there is one that estimates buffer sizes in devices along the path. I just cannot remember which one.

    • gettys Says:

      That may be somewhat useful in some circumstances; however, as you’ll see from a future post, I’m skeptical of the “change TCP” approach.

      • Dave Täht Says:

        TCP vegas is the one you are referring to.

        It does not compete successfully with other TCPs. Further, it appears to be confused by the number of retries in a modern wireless connection into mis-estimating the length of the path (resulting in a slowdown).

        TCP veno has some potential.

  12. mpz Says:

    This blog post explains a lot. Recently I upgraded my cable internet connection from 2 to 10 Mbps and also switched from Windows XP to 7, and I’ve experienced incredible sluggishness and outright connection resets when downloading even a single file that saturates the pipe. The one file I’m downloading with wget comes along really nicely at a steady 1.1 MB/s, but all the other connections that I have open (like ssh and irc) reset within 30 to 60 seconds of starting the download.

    My cable modem is a crap ass Motorola Surfboard fwiw.

    • gettys Says:

      Yes, you’ll see more problems having switched to Windows 7, due to its implementing window scaling by default. Exactly how bad things can get, I don’t really know; I haven’t seen connection resets in my controlled experiments, but I have seen DNS lookup failures.

      How much pain you will suffer depends on the cross product of buffering amount and bandwidth (in each direction).

      I have no data on which modems may be “good” or “bad” in terms of buffering.

      I do have a SB6120 myself; tomorrow’s post will be how I’ve mitigated most of the pain in my broadband hop. I’m quite happy now…

  13. mpz Says:

    Also, I’ve noticed that manually limiting the download speed results in oddly fluctuating speeds. I’ve seen this with both LeechFTP and wget — I’ve tried limiting the download speed to 800 kB/s to avoid having my other connections reset, and the download speed fluctuates between 100 and 1000 kB/s on my cable modem. On an ethernet connection at work (presumably a fiber optic link without bufferbloat at any point) everything downloads at a steady 800 kB/s when I try this.

    • gettys Says:

      This may not be a “real” effect you are seeing; the fluctuation may be primarily in the accounting. If you look at my traces, you’ll see bursts of dup’ed acks in them and bursts of SACKS. Most data did not get dropped, but the acks certainly end up getting piled together. TCP gurus can better explain what’s going on; I’m not such a guru.

  14. Phil Endecott Says:

    Hi Jim,

    When I upload large files over my cable connection, I always see it go in bursts with a period of about a second. I had assumed that this was something inherent in the way the cable system imposed my upstream bandwidth limit, i.e. it was setting a quota of bytes per second. Now I suspect that the data on the cable is going at a constant rate and the 1 second burstiness is a function of the buffer size in my modem.

    So the question is, what can be done about it? All of my network gear has some sort of web interface where lots of things can be tweaked, but I’ve never seen anything to change buffer sizes. I wonder if it’s possible in principle to change the buffer sizes in typical devices by changing the software, or whether the buffers are at a lower level in the hardware?

    Anyway, I look forward to your future posts.

    • gettys Says:

      As I’ll discuss in tomorrow’s post, you can avoid the broadband bufferbloat with some home routers. Of course, as I showed in a previous post, the home routers themselves may also have problems.

  15. The problem with excessive buffering in networks | Kodden's Corner Says:

    […] morning I read an interesting article on the subject by Jim Gettys, and the problem seems to be worse than anticipated. You can read more […]

  16. Tony Finch Says:

    I have been reading your posts about buffer bloat with interest.

    One of my friends identified this problem in 2004 and fixed it using traffic shaping, just like some of the other commenters. http://www.greenend.org.uk/rjk/2004/tc.html

    If you are running Linux, have a look at its pluggable congestion control algorithms. See . Some algorithms rely on RTT measurements rather than packet loss for feedback on congestion. Try changing to TCP-vegas and see if that solves the problem.

    • Tony Finch Says:

      Sorry, broke the linux link – see http://fasterdata.es.net/TCP-tuning/linux.html

      • gettys Says:

        Actually, much of what that link describes is *exactly* the kind of bandwidth maximization tuning that got us into this mess, and is often obsolete information to boot. For example, at this date Linux automatically tunes its socket buffer sizes, making a class of “optimization” obsolete (along with some of the reason for some of the buffering).

        What is more, as I said in a previous post: *there is no single right answer* for buffer sizes. The challenge is how to do buffer management in a fully automatic way. More about that to come…

    • gettys Says:

      Yes, that’s tomorrow’s fodder indeed. I can only write and edit so fast.

  17. jon crowcroft Says:

    yes indeed here’s a partial solution
    http://paravirtualization.blogspot.com/2010/11/terrible-internet-buffer-overrun.html

    • gettys Says:

      Jon, you are a man after my own heart….

      And I certainly hope it doesn’t come to your jocular “The Terrible Internet Buffer Overrun Disaster of 2012”, though I have been losing sleep over it. Destroying TCP’s congestion avoidance algorithms is a recipe for disaster.

  18. Ian Says:

    I’ve been arguing with Verizon in the UK for the last year about insane RTT times on our E1. They always pointed at over-utilisation, but it just didn’t make sense to me that RTT would be knocked to pieces by a single FTP session. As you mention in the article, for a while now the whole network has just ‘felt’ wrong in a way I struggled to explain, but knew wasn’t right. Also as you suggest, I had pretty much given up the ghost on working it out and have just been throwing bandwidth at the problem, with very limited success. But reading this, suddenly it all makes sense. Not sure where to go from here, but at least I know I’m not mad.

    Looking at your quoted comment above “but dropping any packet is horrible and wrong”, I can’t help but think how many ISP SLA documents have specific compensation clauses about levels of packet loss. Would it be fair to suggest that this builds in an inherent motivation for the ISPs to increase buffer size in order to prevent packet loss and therefore reduce compensation payouts? Even if by doing so they break the network? Note, I’m not suggesting this is a Machiavellian plot, but simply an unintended consequence of how ISP contracts are written.

    • gettys Says:

      I suspect SLA’s should have some packet level loss clause.

      What is missing, as in the public broadband tests, is a test of latency under full load. We’ve focussed on bandwidth for so long that we’ve forgotten latency.

      And yes, Petunia, if we (re)build the Internet properly, we can have our cake and eat it too. It shouldn’t have to be one or the other….

      • Ian Says:

        I just had a look at the SLA for our E1. Interestingly, we get a direct commitment on packet loss, based on a percentage per month from ingress to the network (i.e. our managed router) to the point where they hand it off to the next provider or destination, but there doesn’t appear to be any exception for over-utilisation. There is also a latency SLA, but this only applies across the core network, not the last mile. So if my local tail is buffered up to the eyeballs and causing latency, but not dropping packets, neither SLA will kick in. This structure makes sense when the average throughput is lower than the total bandwidth, as was typically the case historically. But in the modern age, when any pipe can be filled no matter how fat it is, it doesn’t make sense any more. Before any technical fix can be applied, the ISPs need to alter their SLA structure so packet loss from over-utilisation is exempted from compensation, otherwise any attempt to fix this will just trigger lots of invalid payments to customers.

        • gettys Says:

          Normal TCP loss rates for signaling congestion are much lower than what I observe; I expect there is a point in the middle that will work. And ECN is another option that needs exploration.
          Jim

      • Frank Bulk Says:

        …but jitter is often mentioned in SLAs.

    • gettys Says:

      Note that Verizon has made the same mistake as Comcast, as have AT&T, the hardware manufacturers, the operating system folks, and so on. We’re all living in glass houses, so be gentle and leave the stones behind; go forth and educate…

      Telling everyone “they are stupid”, or “they screwed up”, when it’s “we were complacent”, and “we all screwed up” won’t be at all helpful; it is why I chose the title I did for this posting. At some point late in this process of blogging, I’ll show bufferbloat in application software as well, just to complete the journey. It’s “we” who have made/are making this mistake.

      Certainly, there may have been unintended consequences of SLA contracts; but as the last SLA I ever worried about was about 15 years ago, I’m hardly the one to comment on the perverse incentives that may have entered the system.

  19. John Gilmore Says:

    See Nagle’s RFC 970 “On packet switches with infinite storage”. Even in 1985 the early roots of this problem were visible.

    Note also that Nagle suggests dropping the *last* packet in the host’s queue when one must be dropped. If we want drops to produce rapid feedback, dropping the *first* one in the queue would notify the receiving host earlier that there’s a problem.

    • gettys Says:

      and RFC 896, also by John Nagle, is worth reminding yourselves of: I’ve been alluding to congestion collapse, and we all need to remember what was said in 1984 as I move on to that topic….

      Right, of course, is in the eye of the beholder: real AQM (RED or something better; classic RED has not one, but two bugs, according to Van when I talked to him in August) is also better than head drop, as the queues never grow to such a huge size (remember, they are now often orders of magnitude bigger than they should be, and there is no “single right answer” to the question). I’ll move on to that topic soon as well.

  20. Peter Bachman Says:

    One of the strangest support calls we got at PSINet was “web pages load 1/2 way and stop”, which was tracked to a bad buffer on a router on our network. My home lab has a DOCSIS 3 modem load balanced with FIOS 20/20 using a Vyatta software router. I’d be interested in the mitigations.

    • gettys Says:

      Tomorrow. Today’s posting will cover a couple of areas where I know the issue has affected the network neutrality discussions. Since they are ongoing, I want to inject a bit of insight (and opinion) on that topic now; I feel I can’t take my time and come back to it later if the observations are to illuminate the discussion (whatever side of the debate you may be on).

  21. John Day Says:

    Isn’t this precisely what the Internet is supposed to do?

    a) Never tell the application to stop sending, i.e. no back pressure, if the application has stuff to send, let it send;
    b) treat everything equally giving no special treatment to packets. What the net neutrality types are always crying for.
    Now this may not be the behavior that is desirable but it seems to be in line with what has been advocated for many years.

    • gettys Says:

      Heh.

      My view on NN is that the network is supposed to do what I ask it to (it’s what I’m paying my ISP to provide service for, and my money may at times need to go further than my immediate ISP in the form of traffic exchange agreements), and to do it with some fairness when sharing is required. Note that I personally don’t have problems with paying extra for premium performance at busy times of day (which is why it makes me sad that the best mitigation for bufferbloat right now defeats Comcast’s Powerboost, which, in internet tradition, is trying to give me extra performance when it doesn’t cost them extra).

      It’s having others make yea or nay decisions on what I can access and/or get decent service for that makes my hackles rise immediately and will get me all worked up on the topic. I should be able to choose the service I get, and without the “bundling” disaster that has made me pull the plug on cable TV.

      Having a network which is neither fair under load nor free of operational nightmares for users and ISPs alike is a jointly losing strategy. That’s where we are today. A lose-lose situation, if ever there was one.

  22. Nicholas Weaver Says:

    One possibility that’s been proposed is RAQM (REMOTE Active Queue Management, http://www.cs.purdue.edu/homes/eblanton/publications/raqm.ps ):

    Namely, that by doing delay based estimation it becomes possible to divorce the point of control (the ‘congestion notification’, aka, where to drop the packets) from the bottleneck, allowing you to get active queue management even when the queues don’t support active queue management.

    Thus a properly equipped in-path device (like a WRT system) could kludge-fix ALL the paths going through it for buffer problems, without needing to be the bottleneck itself (unlike conventional traffic shaping).

    This will fail if there are some flows through the bottleneck that aren’t controlled by the RAQM device, but otherwise should work very well.

    And there also is a 90% solution in queue engineering: queues sized in delay rather than capacity.

    If a queue is considered ‘full’ when the oldest packet is > 200ms old, this will still allow good cross-the-world bandwidth (you need a minimum queue size of ~bandwidth*delay/sqrt(N), so with 200ms US-to-Europe ping times, 200ms is big enough for most).

    This is NOT optimal (the optimal size should be based on measured RTTs and dynamically changed), but it’s at least in the right ballpark: you still get full rate TCP throughput on reasonable cross-the-planet links, and you add a maximum delay of 250ms latency in the worst case.
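
    (A minimal sketch of what a delay-sized queue of this kind might look like; the 200ms bound follows the suggestion above, everything else, including the numbers in the comment, is illustrative only:)

    import time
    from collections import deque

    MAX_AGE = 0.200  # seconds; the suggested bound above

    class DelaySizedQueue:
        """Queue that is 'full' when its oldest packet has waited too long,
        instead of when a byte or packet count is exceeded."""
        def __init__(self, max_age=MAX_AGE):
            self.q = deque()          # entries are (arrival_time, packet)
            self.max_age = max_age

        def enqueue(self, pkt, now=None):
            now = time.monotonic() if now is None else now
            if self.q and (now - self.q[0][0]) > self.max_age:
                return False          # oldest packet too old: treat as full, drop arrival
            self.q.append((now, pkt))
            return True

    # Sanity check against the bandwidth*delay/sqrt(N) rule of thumb:
    # a 10 Mbit/s uplink, 200 ms RTT and 16 flows needs roughly
    #   10e6 * 0.200 / 8 / 16**0.5  ~= 62 kB of queue,
    # which a 200 ms delay target (250 kB at 10 Mbit/s) comfortably covers.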

    • gettys Says:

      Certainly the good is the enemy of the perfect; far be it from me to tell people to not do something less broken than they currently do. I’m often seeing latencies in seconds, getting to 200ms would be a serious improvement. And RAQM may help mitigate the problem while we fix all the broken gear properly.

      I will point out, however, that we can’t stop at 200ms (a number that networking people seem to like, as it is convenient and achievable with little thought or hard work). The reality of human interaction and the speed of light is that *any* additional unnecessary latency is often/usually too much. As a UI guy, my metrics have always been (since I learned this stuff first hand in the 1980’s):

      • No perceptible delay to all human interactions requires less than 20ms (rubber banding is hardest)
      • semi-tolerable rubber banding needs less than 50ms
      • typing needs to be less than 50ms to be literally imperceptible
      • typing echo needs to be less than 100ms to be usually not objectionable
      • echo cancellation gets harder as well (the best echo cancellation needs to be done as close to all participants as possible, even the latency over a broadband link is undesirable).
      • then there are serious gamers, where even a millisecond may be an advantage and the difference between life and death
      • don’t even get me started about the financial loonies that got us into our current economic mess

      Given that vertical retrace at 60Hz puts you statistically behind from the get-go (on average, you’ve lost 8ms right there, even with a really good OS and scheduler), and most paths are 10’s to a hundred milliseconds and we can’t repeal the speed of light, the problem is harder than most in the networking community tend to acknowledge. Even a gigabit switch when loaded may insert significant latency due to buffering.

      I see latency as one of the great challenges for the networking/OS community. It’s probably as difficult as the “go fast” problem, and we want an internet that does both simultaneously without tuning, under load.

      Lest people think this is unrealistic, I’ll point out that my rubber banding experiments were on a Microvax II on a 10Mbps ethernet in 1985; we got to 16ms over the X protocol over TCP on that local network then; it required that to get client side rubber banding to “feel” physically attached to the hand when running remotely. While client side java/javascript has relaxed the need some, it hasn’t gotten rid of all of the need, and the typing perception requirements are still necessary (as Google Instant has shown).

      Latency you never, ever get back.

      • Nicholas Weaver Says:

        Here’s the reason why so many dirty-network types like me want to say “200ms and done” (Jim knows this, it’s more just for the record in general):

        It’s because anything less than 200ms REQUIRES that every bottleneck in the path implement traffic classification and prioritization.

        It is impossible to have simple queues which satisfy the requirements of both LPBs (“Low Ping Bastards”: any application with strong realtime components like first person shooters, VoIP, etc) and sustained TCP throughput at the same time.

        Thus the choices are either a compromise that produces the maximum benefit for both types of traffic (~200ms is a very good answer, and RAQM can easily do this) or do a forklift upgrade on EVERY bottleneck in the Internet to do multiple-queue traffic prioritization.

        • Arms Says:

          @Nicholas Weaver:

          You seem to be stuck with legacy information. Achieving low (zero) loss and high (>0.98) goodput is indeed impossible without a major overhaul of deployed congestion control and without separating (corruption) loss from congestion feedback.

          You may want to learn about DCTCP (based on alpha/beta ECN TCP aka ECN-hat), virtual queues (deployed in ATM but that technology became too costly) and CONEX (re-ECN) which provides the basic signalling foundation to implement all that goodness.

          Regards

      • gettys Says:

        And I no longer think there is a good excuse not to do full AQM going forward: a dual-issue gigahertz SoC with an embedded NIC dissipates on the order of 1 watt, and costs no more than $15 (right now). So something fully capable of “working right”, implementing full AQM, should be in hardware/software/firmware designs going forward, IMHO, and we can mitigate the problem to the order of 200ms as best we can in the existing plant until it gets swapped out. That I seek the ideal while also wanting the good ASAP is not a contradiction in my view.

        This is why I’ve been talking both about mitigation and solution to the buffering problem, rather than a single “fix”.

        I *really* don’t want people to leave under the impression that 200ms is good enough. Many who haven’t worked in the UI field don’t understand the UI realities. And there is a market for equipment that doesn’t just work OK, but works well. I see it as a market opportunity for equipment vendors that solve the problem properly.

  23. j Says:

    You should be running into big-time problems anytime you saturate your upstream on a cable modem, shouldn’t you? That is the innate behavior of the two-wire topology.

    • gettys Says:

      No, if buffering is correct in a system, two (or more) TCP sessions should fairly share the link just fine even with the connection saturated. You should never be seeing latencies of the order we’re getting.

  24. James Mitchell Says:

    I don’t have the expertise to know if you’re right or not, but I’ve been noticing these sorts of issues for most of the last decade on and off, occasionally musing: TCP has features built in to avoid these problems, doesn’t it? Why don’t they work!

    Recent experiences streaming random internet video to recent windows devices on a (locally) quiet network have been equally confounding. The video is supposed to degrade seamlessly, but instead I get sputtering high quality video — maybe the players are written wrong, but I’m suspicious.

    I don’t quite have the expertise to judge, but intuitively this makes sense. Fortuitously there should be a much more hackable home router on my desk tomorrow. I shall follow along eagerly.

  25. Wolfgang Beck Says:

    A couple of observations:
    – there are a number of ‘rules of thumb’ which call for rather big buffers (google Villamizar tcp buffer size); a rough numeric sketch of these rules follows just below.
    – TCP congestion avoidance works under the assumption that bit error-induced packet loss is very rare. This is true for optical networks, but less true for copper. Wireless is extremely lossy. The link layer has to do some error correction or TCP would never get up to speed. This comes at the price of some buffering.
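
    (A quick numeric comparison of the classic Villamizar rule of thumb, one full bandwidth-delay product of buffering, against the later BDP/sqrt(N) refinement; the link speed, RTT and flow count are made up for illustration:)

    def villamizar_buffer(link_bps, rtt_s):
        """Classic rule of thumb: one bandwidth-delay product of buffering."""
        return link_bps * rtt_s / 8                       # bytes

    def sqrt_n_buffer(link_bps, rtt_s, n_flows):
        """Later refinement: BDP / sqrt(N) for N long-lived flows."""
        return link_bps * rtt_s / 8 / n_flows ** 0.5      # bytes

    # Illustrative numbers: a 10 Gbit/s core link, 100 ms RTT
    bdp = villamizar_buffer(10e9, 0.100)                  # 125 MB of buffer
    few = sqrt_n_buffer(10e9, 0.100, 10_000)              # ~1.25 MB with 10k flows
    print(f"{bdp/1e6:.0f} MB vs {few/1e6:.2f} MB")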

  26. Guilherme Salgado Says:

    Hi Jim,
    For someone who’s always been affected by the symptoms you describe here (thanks to the severely limited upstream links we have in Brazil), this was a very enlightening read. Thanks a lot!
    Recently, though, I switched providers (also switching from cable to DSL) and bought the *cheapest* modem I could find. Much to my surprise, since then I seem to be able to upload at full speed (which is still rather slow; I have a 400kbps uplink) without rendering the downlink unusable (as had always been the case). Now I’m wondering whether, to cut costs on the modem, they used a small buffer, which in turn doesn’t trick the TCP congestion avoidance algorithms. I guess that’s a possibility?

    • gettys Says:

      The problem with this theory is that you can’t even buy “small enough” DRAM chips these days for the “right size” buffers (not that there can be any “right size” in the first place, AQM is the “right” solution, as I’ll discuss later).

      Most likely it was cheap because it was an old design from when memory was a significant cost issue, or the designer’s firmware wasn’t riddled with bugs they were papering over (a common cause of bufferbloat, since latency has not been tested for properly, and if they don’t meet bandwidth goals, they don’t get certified by carriers for use). Measure the bandwidth you get, and the saturated latency, and you can easily compute the buffer size.
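
      (A sketch of that arithmetic; the uplink speed and latencies below are made up for illustration:)

      def implied_buffer_bytes(uplink_bps, idle_rtt_s, saturated_rtt_s):
          """Standing queue = extra delay under load times the drain rate."""
          queueing_delay = saturated_rtt_s - idle_rtt_s
          return uplink_bps / 8 * queueing_delay

      # e.g. a 1 Mbit/s uplink whose RTT goes from 20 ms idle to 1.2 s under load
      # is holding roughly 1e6 / 8 * 1.18 ~= 147,500 bytes (about 144 KB) of buffer.
      print(implied_buffer_bytes(1e6, 0.020, 1.200))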

  27. Matthias Says:

    You might be interested in the 2007 paper

    Marcel Dischinger et al, Characterizing Residential Broadband Networks
    (available at http://broadband.mpi-sws.org/residential/)

    which uses an interesting test methodology to separately measure up- and downlink, and reports queuing delays of several *seconds* on some cable modem uplinks.

  28. Hans Petter Jansson Says:

    Great analysis.

    I was running into this with a 128kbit/s uplink I had once. The solution I used was to put a Linux box in front of the modem, capping my bandwidth at 100kbit/s using a tiny queue, effectively taking the modem’s queue out of the equation. I couldn’t do much about the downlink, though, except try to manage TCP ACKs.

    I guess this is a bit of a tangent, but what you say about overly clean networks is often observed when tunneling TCP over TCP, as in the not uncommon “VPN over SSH” situation. Saturating such a network will tend to bring everything to a halt, since no packets are dropped in the upper layer.

    Avery Pennarun is solving the SSH VPN issue in his sshuttle project: http://apenwarr.ca/log/?m=201005#02

  29. Russell Stuart Says:

    I see you have mentioned this elsewhere, but my first reaction to reading this was “where art thou, ECN?”. This is the very problem ECN is meant to solve. Yet we don’t implement it because the short term pain may be high.

    I get IPv6 deja-vu when I think about it. But unlike IPv6, there is no d-day coming. I suspect that rather than wear the one-off pain and fix the problem with ECN, we will just let the water warm up until it becomes intolerable, then tweak a few queue lengths until it becomes just tolerable again. If there isn’t a collective effort put in by a few big players, that is where we will sit for time immemorial.

    Oh, and for those thinking implementing QOS at home will fix the problem for you – it will only do that if you are the cause of the congestion in the cloud. If someone else is filling up the queues in an upstream router nothing you can do can change the latency you see. From Jim’s description, this is the situation he finds himself in.

    • gettys Says:

      Steve Bauer at MIT (and probably others) is researching the state of ECN suppression. Until that and/or other research is complete, it’s not clear if we can. A conversation with Steve a month or two ago makes me hope it may be usable and useful in some parts of the network, but the general answer isn’t in yet.

      And yes, you can possibly help yourself with QOS locally for your VOIP in a limited environment like your home, but you really still have to manage your queues and get TCP behaving correctly. If you don’t, you run smack dab into congestion someplace.

      And as ISP’s aren’t necessarily managing queues, we have lots of messy problems. Others can help by starting to monitor their ISP’s carefully with tools like smokeping (and educating them as to the issues, if they lack clues). In the limited probing I’ve done of Comcast’s network, it’s always been smooth as a baby’s behind, until that last killer mile. Other monitoring I’ve done (particularly from hotel rooms) makes me believe, as the anecdotal and other data suggest, that there are clueless ISP’s out there. I’ll explain next week as to why there has been a reluctance to use AQM.

  30. chrismarget Says:

    Jim,

    Something else is going on here too.

    Here’s another snapshot from your ConPing/210aKnoll.pcap file:

    I’m guessing your data was captured directly on the sending machine, and that it has a NIC doing TSO.

    The effect should be small, but because of the way you’ve captured the data, none of it can be truly trusted. The TCP RTT plots are measuring latency starting with a bogus TCP segment that hasn’t actually been transmitted yet. It still needs to be sliced into MSS-sized segments which then need to be streamed onto your LAN.

    Yes, serialization delay is low on the LAN, but it would be nice to see this data more accurately.

    Do you perhaps have captures without TSO, or (better) taken by a 3rd party to the transaction?

    I’ve seen NICs (Broadcom chips w/ tg3 driver) hold onto data for hundreds of ms.

    Here’s something else… Around 16:08:50 your TCP session was showing latency continuously over 1000ms:

    …but the pings during this same window are much closer to 100ms:

    15:08:49.745852 IP 24.218.178.78 > 18.7.25.161: ICMP echo request, id 30925, seq 386, length 64
    15:08:49.789203 IP 18.7.25.161 > 24.218.178.78: ICMP echo reply, id 30925, seq 383, length 64
    15:08:50.752000 IP 24.218.178.78 > 18.7.25.161: ICMP echo request, id 30925, seq 389, length 64
    15:08:50.843154 IP 18.7.25.161 > 24.218.178.78: ICMP echo reply, id 30925, seq 386, length 64
    15:08:51.754606 IP 24.218.178.78 > 18.7.25.161: ICMP echo request, id 30925, seq 392, length 64
    15:08:51.926162 IP 18.7.25.161 > 24.218.178.78: ICMP echo reply, id 30925, seq 389, length 64

    Maybe you’ve already addressed the possibility of priority-queueing small packets… I’m just starting to follow along with your project and am certainly not up to speed 🙂

    Would you please post a link to the Comcast provisioning document written by your neighbor?

    Thanks!

    • gettys Says:

      At the time I took that data, I had no good way to take traces except on the transmitting system.

      The particular laptops I’ve used to take data have Intel NICs, not Broadcom, and I don’t think Linux distros are typically doing traffic control games (which maybe they should). But with the current very large transmit rings, from what I’ve gathered in other comments to this blog, the traffic control would be ineffective anyway (until those buffers are cut down to size, as one person posted a patch to do).

      My pings in the traces are exactly in line with what I observe from without (e.g. the DSL reports Smokeping data) in magnitude. If you are motivated, it would be interesting to look a bit further into the pings; Van Jacobson noted that since Linux happened to be used on both ends, there is already timestamp data in the traces.

      I’ve since bought one of these port mirroring switches, which are quite inexpensive ($150). At some point, it would indeed be better to collect data that way (now that I’m able), and expect to do so before I do formal publication. For the next few weeks, I need to finish writing up what I know, update an overview presentation I did several months ago, and do a few other things, before circling back to try to write more formally and rigorously.

      In any case, while I’d certainly like to retake the data before a formal publication, I encourage you to do your own experiments. The Netalyzr data shows buffering is dismayingly common (e.g. Nick Weaver at ICSI immediately reproduced similar results on his home connection, as soon as I made contact with him toward the end of the summer). If anything, the Netalyzr data has underestimated the frequency of the problem (its UDP buffering test wasn’t aggressive enough to fill higher bandwidth connections such as FIOS all the time, and can be confused by cross traffic). Broadband bufferbloat isn’t a rare phenomenon (worse luck).

    • Karel De Vogeleer Says:

      Chrismarget,

      You suggested that small packets might be priority queued. There has been research conducted on the one-way delay in 3G networks where it was shown that large packets get through faster than smaller ones, at least in Sweden (I guess it’s dependent on the ISP’s hardware). See the links below; there seems to be a threshold at around 250 bytes. The authors claim the reason for this is that the technology used changes from WCDMA to HSDPA around this point, resulting in lower latencies.

      http://www.bth.se/fou/Forskinfo.nsf/Sok/73da45afadb2e7cfc12577030030ba80!OpenDocument
      http://www.bth.se/fou/Forskinfo.nsf/Sok/6933eb641a36f5b5c12577270026654c!OpenDocument

      • gettys Says:

        And while there may be reasons to give small packets priority, those reasons don’t include actually solving bufferbloat…

  31. Jesper Dangaard Brouer Says:

    Hi Jim,

    I think you would be very interested in reading my master’s thesis [1] (http://goo.gl/sBHtg), as it contains pieces to solve your puzzle.

    I think you have actually missed what is really happening.

    The real problem is that TCP/IP is clocked by the ACK packets, and on asymmetric links (like ADSL and DOCSIS), the ACK packets are simply coming downstream too fast (on the larger downstream link), resulting in bursts and high latency on the upstream link. See page 11 in the thesis for a nice drawing.

    With the ADSL-optimizer I actually solved the problem, by having an ACK queue which is bandwidth-“sized” to the opposite link’s speed. The ADSL-optimizer also solves the issue by seizing control of the queue, which actually isn’t that easy on ADSL due to the special link-layer overhead (see chapters 5 and 6).

    My investigations show that the major issue is that the TCP/IP congestion protocol was not designed with asymmetric links in mind.
    But there is still some truth in the point that the ISPs are increasing the buffer sizes too much, which makes this effect even worse.

    I guess the real (but impractical) solution would be to implement a new TCP algorithm which handles this asymmetry, and e.g. isn’t based on the ACK feedback, and deploy it on your home machines (as the effect is largest here).

    –Jesper Dangaard Brouer

    [1] http://www.adsl-optimizer.dk/thesis/

    • gettys Says:

      Thanks for the pointer to your thesis. I’ll take a look.

      While I don’t doubt that there are issues caused by the asymmetric nature of many of the broadband connections, I also would be surprised if that was the primary issue here (above and beyond the fact that I’ve already had real TCP experts look at the data; I’m not one). In part, because I see the same effect on symmetric FIOS service….

      Remember, what’s going on here is that a huge amount of buffering has been inserted into TCP’s control loop: the paths I’m typically testing on are between 10-30ms, and the delays several orders of magnitude larger. Without queue management, TCP (and other protocols) will fill these buffers. If your algorithm is having the effect of managing the queue growth, then you are achieving what is required for good operation.

    • Alex Yuriev Says:

      *Rings bell*

      And above we have a real winner. We determined this to be the case in the 1997-1999 timeframe when doing one-way satellite links to Australia with the back channel being a GRE tunnel. TCP inherently does not deal well with links that have different directional latency. It is the primary if not the only issue affecting you.

      It is caused by under-provisioned networks between the end points. And that is caused by people buying into the crap known as Quality of Service, which should be called Quantify of Service.

      • gettys Says:

        By your definition, all networks would always have to be provisioned at the highest possible speed (note that a modern TCP can trivially go at gigabits/second). And it is never even knowable (in general) what the provisioning would need to be, nor can provisioning be changed over-night, as it requires trucks, ships, and backhoes. So at best an ISP can try to match provisioning with traffic; but can never do it 100% right.

        The whole point of TCP and congestion avoiding protocols is to adapt to whatever speed is actually available over a path. With bufferbloat, however, you destroy the hosts’ ability to react to congestion properly and the network’s ability to operate properly.

        • Alex Yuriev Says:

          Can you kindly make up your mind if you are talking about TCP or IP?

          Because you went from talking about buffers in TCP connections to line card buffers back to TCP buffers.

          “And it is never even knowable (in general) what the provisioning would need to be, nor can provisioning be changed over-night, as it requires trucks, ships, and backhoes. So at best an ISP can try to match provisioning with traffic; but can never do it 100% right. ”

          FUD. Of course you know what your network is provisioned for. I will let you in on a little secret – if you are to take the cap on transfer and divide it by the number of seconds in the time interval the cap is responsible for you will see what the network is provisioned for. Neat huh?

          When you provision your network for 250GB/month transferred per drop (which is what Comcast does) but you peddle it as a 50Mbit/sec connection, you will have all the weird problems you are experiencing. And when you add “PowerBoost”-type crap that the supposedly in-the-know population swoons over as Comcast’s gift to the users in the spirit of the free internet, you are trying to cover up an incompetent network design.

          “The whole point of TCP and congestion avoiding protocols is to adapt to whatever speed is actually available over a path.”

          No, the whole point of TCP congestion control was allowing the protocol to adapt to whatever speeds were available on the sanely designed path, because at the time all the paths were sanely designed. Congestion showed up on symmetrical links. That’s why SLIP and PPP worked well over V22bis and sucked over HST.

          And TCP always sucked on congestion – just ask anyone who had MAE-East ports at the times when ServInt and Netaxs were 30% of traffic going over that fabric.

  32. Jon Crowcroft Says:

    surely most traffic is downloads, so most 40 byte ack packets are on the slower uplink, so the ack bunching effect will have only marginal impact?

    anyhow, on my (orange) 20Mbps downlink, 384kbps uplink DSL line I am seeing
    240ms RTTs when the link is under load…so maybe my ISP has a clue?

    • gettys Says:

      The netalyzr data shows many different buffer sizes in play. So you may be lucky on the hardware chosen by your ISP (or your ISP did have a clue when selecting it).

      And don’t ever tell a gamer that 240ms is good latency (and, btw, that is almost twice the “acceptable” latency the telephony industry has used as a benchmark for many years). 240ms is much less than the disaster that set me going, but I’d still have to consider it very problematic.

      Having upgraded my home service from 20/2 Comcast service to 50/10, I had similar latencies to you, Jon. Now that I’ve mitigated my home Comcast service, I see about 20ms of jitter on a base latency of 10ms. See http://www.dslreports.com/r3/smokeping.cgi?target=network.2ea89843611d2ac85ee91c449b367f39.NY (the dsl reports data is taken from NY, so the base latency is higher than the actual latency of the service). This is something I can actually use.

      I still can’t consider 20ms base jitter “good” however: that’s at the lower level of human perception, and gamers (and stock traders) care about latency differences to even an order of magnitude beyond 20ms. It’s still much higher than it “ought” to be from first principles. Latency is also something you never get back, and to get acceptable latency for a given application, you have to add up all the latencies; e.g. vertical retrace on the display, queuing delays in all switching/routing, delays in server/peer processing, speed of light, etc. You have to attack all sources of latency/jitter in the entire system, end to end. Speed of light means we’re almost always starting off behind; we should always be minimizing latency as much as we can.
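
      (To make the “add it all up” point concrete, here is a toy latency budget; every line item and number below is made up for illustration, not a measurement:)

      # Illustrative end-to-end latency budget, in milliseconds:
      budget_ms = {
          "display vsync wait (60 Hz, average)":  8,
          "host scheduling / rendering":          3,
          "home 802.11 hop under light load":     5,
          "broadband link (mitigated)":          10,
          "ISP + backbone queues":                5,
          "speed of light, ~1500 km each way":   15,
          "server processing":                    5,
      }
      total = sum(budget_ms.values())
      print(f"total budget: {total} ms")   # ~51 ms: already well past the 20 ms
                                           # 'imperceptible' target before a single
                                           # bloated buffer has been added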

      Of course, my home router is now my biggest problem, having “fixed” my broadband connection… We’ll have to beat down some other nails before circling back to broadband. There are a lot of nails scattered around to pound on…

  33. Mike Stump Says:

    You must be new around here; I first saw this problem in the 1980 time frame with nice large buffers (I mean, what could go wrong?) that could buffer 30s worth of data, and they did. Trivial enough to spot. By reducing the buffer size, one just makes the problem harder to spot. I kinda would like someone to do the theoretic modeling where they `fix’ the problem by doing last in, first out (min latency, no matter what line condition) and updating all the software stacks to `work’ in this mode. You then sense congestion by noticing out of order and delayed packets. The only problem: kinda `late’ fixing it now. 😦 Not that I’ve done the work to know if such a solution is even possible.

    Another fix I’d like to see would be to use the existing TOS field by providing a hard meaning for it, and router support for it by defining the `cost’ associated with that packet. Imagine it is a mu-law encoded 8 bit floating point value representing the `cost’ to send the packet. Have 0 be the bulk, very low cost. Most people, most applications would use this and would be equivalent to what we have today. People doing 1-million dollar trades across the internet, could, if they wanted, tag the 20 packets to do the trade at a higher cost. Rate limiters could then enforce not bandwidth, but rather the `cost’ of the packets.

    One could run data hogs alongside VOIP merely by raising the TOS to be, say 30x the cost of the base line. Grandmas that use their connections lightly, could just tag everything with a higher cost and get a `better’ quality, suitable for VOIP, and people that pound out gigabytes, could sit back and know their ISP doesn’t have to worry about them, as they tag everything at 0 cost, meaning, it drops faster and sooner than a packet with _any_ non-zero cost to it. Links that saturate would first throw out all 0 cost traffic, bye bye bit torrent, which is fine, as the client would seek out a link where there is no saturation. If people want to bit torrent really important stuff, TOS of 0xff, and presto, like having a dedicated line.

    This would allow an ISP to offer differing levels of service, merely by having the clients select which service they want with the TOS field of packets. Hard core gamers with money would want to run at a higher TOS, and pay for it. Cheap games would run at a lower value, and bulk people, they know who they are, would want to run at 0. This gets the ISP out of monitoring and controlling and deep packet inspection and the like; it also provides a way for customers to pay more (to up the cap/rate limiters that control their line).

    Also, the ISP can use a sudden influx of high TOS packets across a link to mean, man, let’s order up another one of these circuits now; conversely, a link with only 0 TOS packets means, even though it is completely full and saturated, don’t bother wasting any money expanding it. This reduces the costs for expansion and build out and provides customers a way to better communicate the value of the links of the network, as a whole. Owners of high value links would naturally want to charge more, which would, in the free market, encourage competition, thus reducing the cost of the link.

    • Arms Says:

      @Mike Stump:

      In my book, trying to fix this by using globally accepted TOS rules already turned out to be an EPIC FAIL about 2 decades ago…

      Why else are ISPs today not running TOS and instead MPLS when real money comes into the equation?

      You should better take an interest in CONEX and re-ECN to learn how all parties (users, content and service providers) can get their incentives right to actually make this reality (and not another EPIC FAIL like TOS, AQM and pure ECN).

      Regards

      • Alex Yuriev Says:

        @Arms

        Yes! Of course TOS was an epic fail and it is exactly why we run MPLS. But please don’t tell it to people who worked at Bell Labs or are considered gods of TCP – that just does not fit into their nice world view of how the things are supposed to work.

  34. Flogs Says:

    Bufferbloat – When the internet was interactive…

    I have been fighting for years to be able to work interactively while batch downloads run in the background. That worked back in 1993, when I still had an Analog-G leased line. Since then the situation has kept getting worse, even though…

  35. Jesper Dangaard Brouer Says:

    Hi Jim,

    I have created my own blogpost on:
    http://netoptimizer.blogspot.com/2010/12/buffer-bloat-calculations.html

    Here I explain how to calculate the buffer size from the delay, and explain where the delay is coming from, and how it relates to the link bandwidth.

    I don’t go into the details about TCP and why the queue builds. Buffer-bloat is a fact, just do the math…

  36. bk Says:

    FYI. It’s quite easy to start with a TCP on the LAN, transparently switch to a UDP-like protocol at the ingress router and switch back to a TCP at the egress router. 🙂

    • gettys Says:

      I keep saying, and I’ll say this again: bufferbloat isn’t a property of TCP per se.

      The bad latencies will occur anytime a network hop is saturated with any protocol.

      The issue is that the buffers are destroying every congestion-avoiding protocol’s congestion avoidance, which guarantees that any unmanaged buffers fill and stay full (at the bottleneck). So the end-points (hosts) never slow down, and you suffer continually higher latency than you would otherwise. All buffers should be correctly sized, and when you can’t (as often occurs) predict the “right” buffer size in advance, the buffers must be managed.
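
      (A back-of-the-envelope illustration of why no single static size can be “right”, and why the buffer must be managed instead; the buffer size and link speeds are made up for illustration:)

      def standing_latency_ms(buffer_bytes, bottleneck_bps):
          """Once a saturating flow has filled an unmanaged buffer, every packet
          behind it waits for the whole buffer to drain at the bottleneck rate."""
          return buffer_bytes * 8 / bottleneck_bps * 1000

      # 256 KB of unmanaged modem buffer on a 1 Mbit/s uplink:
      print(standing_latency_ms(256 * 1024, 1e6))    # ~2097 ms of added delay
      # the same 256 KB on a 100 Mbit/s link:
      print(standing_latency_ms(256 * 1024, 100e6))  # ~21 ms: the same buffer is
                                                     # either crippling or harmless
                                                     # depending on the link speed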

  37. Bill Gianopoulos Says:

    Interestingly, this reminds me of a similar issue I talked to AT&T frame relay engineers about back in the mid 90’s: the excessive amounts of data they were buffering in their frame relay switches, and how it completely subverted the edge router’s ability to properly prioritize traffic.

    I never got anywhere with this. It was like I was speaking a completely foreign language or something. And in the frame relay case it is even lamer to buffer seconds worth of data based on the CIR at a single switch, because your choices are not limited to buffering or dropping. Frame relay has a built-in flow-control mechanism.

    • gettys Says:

      The problem was first observed very early in the Internet on satellite experiments.

      And we have the same phenomenon today in 802.11 and 3G network technologies, where they are often trying too hard to transport data (and on top of that, we’re failing to manage the queues that build up). As I don’t understand those technologies at the proper level of detail, I’ve been glossing over those problems; someone else needs to take a stab at the explanations.

      The fundamental issue is that most practicing engineers think that losing any bits is bad; whereas the Internet was designed presuming that packets could/would be lost at any time and would indicate congestion was occurring. And with memory having become so cheap (in most places; I know the really high speed networking folks still have problems), we’ve been frogs in heating water.

  38. Lionel Bouton Says:

    I’m surprised this is news for any network engineer, especially at ISPs. This problem has been known, and understood well enough to develop countermeasures, since at least 2002; see http://lartc.org/wondershaper/.

    • gettys Says:

      I was similarly puzzled; here’s some of what I think has happened.

      Many have observed this problem in individual guises over the years.

      But few have seen it as a general problem afflicting systems end-to-end.

      As far as ISPs go, RED has had enough tuning difficulty that we have a bi-modal set of ISPs/networks: those who were so burned by congestion in the 1990’s that they have RED (or other AQM) religion, and those who don’t. So some networks do run with AQM, some only bother where they have problems (and then have problems later when a bottleneck shifts), and some run entirely without.

      And equipment has never been tested for “latency under load”; until I stumbled into my simple-minded test that exposed it, even Comcast had no easy test for it. And since Windows XP doesn’t enable window scaling by default, even that test doesn’t work on the systems that most engineers were using (at least on their day jobs) until recently.

      And there have been cultural barriers: the right Linux hackers never happened to talk to the other right people to get word out.

  39. Dwayne Says:

    How do the larger buffers interact with the Nagle algorithm, which is increasingly (mis)used?

    • gettys Says:

      I don’t think Nagle and bufferbloat interact much, given a moment’s thought.

      • Dwayne Says:

        Okay, I asked mainly because it interacts badly with delayed acknowledgments, which I thought maybe in turn affected buffer bloat.

        • E.Hastings Says:

          Ahh good point. Isn’t Nagle’s essentially another buffer? It accumulates small data messages into a single large message before sending.

          For data transfer, where latency doesn’t matter it can help. However, for interactive apps, like games, that send small messages often it can hurt. In fact, games that use TCP like World of Warcraft often explicitly disable Nagle’s while they are running.

          How this would interact with other buffers I’m not sure… But I guess it illustrates the negative effects of a buffer on latency at a local level.

        • gettys Says:

          Nagle’s pretty much a different horse, and is certainly not excessive in the size of the buffering; it’s just trying to keep tiny operations close together coalesced into fewer packets (like two back-to-back keystrokes).

          Nagle is usually a good optimization; but it was also why I became aware of these issues in the first place in 1984 or 1985. TCP_NODELAY was added to Berkeley UNIX when we ran into it early in the development of the X Window System.

          I would argue the default for Nagle has happened to be wrong, since a lot of app writers get it wrong.

  40. Bink Says:

    This is news? Really? Traffic congestion on an oversubscribed link? QoS facilities have been available for years to work around traffic congestion and aid TCP. For the most part, there’s nothing new here—everyone can move along and simply read http://www.benzedrine.cx/ackpri.html, which covers 85% of the diatribe here.

    • gettys Says:

      I’ve had to add the answer to this one to the FAQ list I put together today, having answered it sooo many times now.

      Please go look at the FAQ list and/or any of the other attempts to explain why classification isn’t adequate.

      Classification by itself can’t solve the bufferbloat problem. It may be useful for many other purposes, but if you haven’t done AQM, you will still lose, and classification (ironically) becomes much more necessary than it would otherwise be.

    • Jeff Says:

      PS: There’s no reason to cop an attitude, either.

      • gettys Says:

        The link you provided is all about a tool that can improve a DSL broadband link using OpenBSD and classification. But I mostly use Linux, on cable, and, now that I’ve twisted my home router’s QOS knobs to deal with the cable hop, I still have problems in my home network on 802.11 and on my laptop that are not amenable to radical improvement by classification. So first, the problem isn’t what you say it is, nor would your approach help me much if I did the equivalent in Linux, as it doesn’t actually solve the problem. I hope this illustrates that we’re in a complex mess.

        There is a general point: while we have many tools which can immediately help the situation (including classification), this is a complex topic. When people assert that a particular tool or incantation “solves” “the” problem, they leave the impression there is a single magic fix, where in reality there are many tools that can be used to both mitigate and solve the many problems that occur in many parts of the network, often including some network hops over which they have no control and where someone else will have to do their piece. Which is most appropriate where depends on the circumstances.

        That classification could at most help, but could not solve the problem, was a point I’d covered many times already. It does get tiring to repeat oneself so many times.

        But not everyone reads every page, nor should I expect them to, and I hadn’t put the FAQ together, so your post is fully understandable.

        So my apologies. Your over-tired, slashdotted bozo moderator…

  41. Lance Berc Says:

    Jim, you and I saw some of this when using audio via the J-Video system between Palo Alto and Cambridge in the mid-90s. I was adjusting the local VoIP buffer size, trying to minimize buffering while maximizing continuity by measuring arrival jitter, eventually coming up with something very similar to Van’s heuristic in the mbone tools (8x std deviation?).
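
    (For readers who haven’t seen that style of estimator: a rough sketch of an EWMA delay/deviation playout estimate. The gains and the 8x multiplier here are assumptions for illustration, not the actual mbone-tools constants:)

    class JitterBufferEstimator:
        """Smoothed estimate of network delay and its variation; the playout
        point is set several deviations beyond the average delay."""
        def __init__(self, k=8, gain=1/16):
            self.k, self.gain = k, gain
            self.avg_delay = None
            self.avg_dev = 0.0

        def update(self, one_way_delay):
            if self.avg_delay is None:
                self.avg_delay = one_way_delay
            dev = abs(one_way_delay - self.avg_delay)
            self.avg_delay += self.gain * (one_way_delay - self.avg_delay)
            self.avg_dev += self.gain * (dev - self.avg_dev)
            # play this packet out avg + k*dev after it was sent
            return self.avg_delay + self.k * self.avg_dev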

    We found this worked great locally but to Cambridge we had these occasional crazy stalls tracked down to routers that would occasionally buffer a large fraction of a second w/o dropping anything, shooting my jitter estimations in the head.

    You may not remember this because I gave up and we reverted to the telephone for voice and J-Video for video. At the time I declared that “VoIP works as far as you can walk” and gave up on the whole idea.

    I see echoes of this today (like when people try to relay VoIP over RDP sessions) and have found that people don’t understand that more expensive service might improve bandwidth but has no effect on latency, and as you point out might even make it worse.

    • gettys Says:

      Hi Lance,

      I’d completely forgotten about this. It was actually before the mid-90’s, just about exactly when Sally and Van were doing RED. I think we were doing the J-Video work in 1992 or thereabouts. I have no clue as to whether we were in touch with them over our troubles or not; certainly Van was using AudioFile and we were in touch in that era pretty regularly.

      And yes, people somehow think more bandwidth will solve their latency problems; it seldom does, and often has made it worse (as you buy later hardware which is often even more bloated than the older hardware, and the dynamic range problem has just gotten that much larger..)

  42. Brian Knoblauch Says:

    Excellent post. Explains perfectly the issues I’ve been seeing recently. I’d been baffled as to why lately a single download stream can make an Internet connection unusable for anyone else (on big pipes) when back in the day we used to do multiples on much lower bandwidth connections without an issue!

  43. Brian Says:

    What you describe here seems like a manifestation of the Nagle algorithm. I have been dealing with issues like this for years in using TCP to move large medical imaging datasets; even on a LAN we will see several seconds of latency when the algorithm is not properly implemented. By manipulating the send and receive buffer in sockets we can usually get the latency down to something manageable. As an example, a 512KB CAT scan image might take 3 seconds over a 1Gbps network with these settings set incorrectly. When you correctly set the send and receive buffer in sockets, you can get the transmission time back to milliseconds. When you are moving over 20K of these images per day on a network, it makes a huge difference.

    • gettys Says:

      Nope; this isn’t Nagle. Note my tests are simple copies of very big files, lasting tens of seconds.

      Nagle is a useful algorithm; I just think it should have been off by default, since it bites so many application programmers the wrong way. On the other hand, that might have caused other problems.

      Best might have been to have the socket interface require you to specify, so application writers might have had to engage brain enough to think in the first place, about the nature of the traffic they were about to send.

    • Arms Says:

      @Brian: What you describe is not the per-packet latency (which is what this blog post is about), but receiver or sender side limited bandwidth and flow completion time.

      TCP window sizes are way too small these days, even for corporate LANs (and it was mentioned multiple times already that Windows XP’s defaults are especially small).

      Orthogonal to too-small window size defaults and bufferbloat, most TCP stacks (with the exception of Linux) implement only the RFC algorithms for loss recovery.

      However, especially at higher (LAN) speeds, timely recovery from losses (which, as mentioned here multiple times, are a basic design choice of the Internet / IP networks) also becomes a paramount issue; and not everything which could be done in that space is already fully explored.

      For starters, not many stacks are implementing F-RTO, Eifel (well, only Linux is allowed to), FACK, and my favorite, Lost Retransmission detection (LRD) and improved RTTM.

      Just let two Linux and two Windows boxes (with properly tuned windowsizes) run across a larger LAN. You will notice, that the goodput (flow completion time) of Linux will beat any other stack every time, hands down.

      Regards,

      • Paul M Says:

        I note that Google are attempting to “fix” the problem with their SPDY protocol; whether this is a fix or a workaround, I don’t know.

        http://xahlee.blogspot.com/2011/02/google-chrome-spdy.html

        • Arms Says:

          @Paul M: No, SPDY is an L5 protocol (an alternative to HTTP), transporting the same content as HTTP (i.e. HTML, XML).

          But Chrome is often bundled with devices/OSes (e.g. Android), where the TCP stack itself is already heavily tuned (some would say these stacks violate IETF standards), and that helps too…

          But quite a number of features of SPDY you can also get using HTTP 1.1, when server and browser are properly tuned (see this blog: http://bitsup.blogspot.com/ ). But as they are optional (instead of default with SPDY), they are not in widespread use…

          Regards,

  44. Ivan Barrera Says:

    The problem of the trade-off between delay and throughput dates back to 1983, as you said, when Jacobson and Floyd were working on RED. It seems to me that the issue is still marketing, and how “throwing bandwidth at the problem” won’t cut it anymore.

    But mainly, mainstream OSes do as they please. ECN has been proposed for a long time already, and only now has Windows Vista implemented it (not enabled by default, of course). And then come BitTorrent-type protocols and modifications to TCP such as CUBIC (now default in Linux). As my topic of research, I say that congestion should be addressed first. Since the core network is now fairly unused, due to the overdimensioned bandwidth, the problem moved toward the edges, and as you claimed, the only thing ISPs managed to do was throw the ball to the last mile (customers).

    Interesting that someone puts this up, and people actually care about it! Keep it up.

    • gettys Says:

      I much prefer driving a sports car: trying to maneuver the Queen Mary on a super highway approaching the exit to my house just isn’t fun… And that’s what we all get to do these days.

      And yes, I agree wholeheartedly that we have to change the marketing discourse. Ergo this blog: we have to shine light on the problem, encourage fixes, and have the power of the purse working for us, rather than against us.

      Note ECN has been inhibited by a certain Taiwanese vendor having shipped a lot of broken kit a long time ago that would go belly up if it saw an ECN bit. Steve Bauer (and maybe others) is investigating whether it may finally be safe to use ECN overall. So characterizing this as “as they please” is a mis-characterization. We do need easy to use tools, and ways to distribute them, to help your grandma find out if her home network is broken, however.

      Bufferbloat is a case where I know a problem has been generating many service calls (where it hurts ISP’s where it does most, directly to their bottom lines). I know, because I’ve placed them multiple times myself (before I understood what was going on).

      • Arms Says:

        I’m not convinced ECN per se is the proper (only) answer here. ECN by itself only helps reduce loss (and the redundant work subsequently necessary after a lost packet).

        Here at home, I’m running with ECN enabled (Linux, Win7) and have only found a few obscure server sites that also support TCP ECN.

        However, even though I’m constantly tracing, I have yet to see a CE-marked frame in one of those few ECN-enabled flows.

        Thus the problem is NOT only the end systems (where the default is still not to use ECN) but IMHO much more so the access routers (where congestion actually occurs) and core routers. There, ECN (and AQM) could and should be enabled, as it won’t make a difference even if those broken home routers are still operational (which I doubt there are still large numbers of around: the half-life of home gear is probably less than 2-3 years, and the debacle happened two or three of those half-lives ago, so only a small fraction of the original population of broken equipment will still be operational, and users there are completely free not to enable ECN, or to disable it, on their side).

        One problem of ECN, as I see it, is that it only signals the existence of “TCP-compatible” congestion, but NOT its extent (i.e. the depth of the cumulative network buffers across the whole path). The reaction to ECN was specified to be identical to the reaction to loss, so there never was any incentive for either end-users or network operators to move from loss-based congestion signalling to ECN-mark congestion signalling.

        A simple incentive back in those days might have been to allow a gradually less severe cwnd reduction on ECN marks. Thereby, traditional protocols not using ECN (but required to be TCP friendly) would be at a disadvantage (not only because of the worse goodput/throughput ratio, which the end users don’t really care about all that much, but which network operators should care about).
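
        (A sketch of that contrast, a graduated DCTCP-style response versus the RFC 3168 full halving; the smoothing of the marked fraction is omitted and everything here is illustrative only:)

        def proportional_cwnd(cwnd, alpha):
            """alpha is the fraction of the last window's packets that carried
            an ECN mark (0..1).  A fully marked window behaves like a loss
            (cwnd halves); a lightly marked one barely slows down."""
            return cwnd * (1 - alpha / 2)

        def rfc3168_cwnd(cwnd, congestion_signalled):
            """Standard behavior: an ECN mark is treated exactly like a loss."""
            return cwnd / 2 if congestion_signalled else cwnd

        # a window with 10% of its packets marked slows by 5%, not 50%:
        print(proportional_cwnd(100, 0.10), rfc3168_cwnd(100, True))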

        Perhaps the current CONEX WG does something right and not only builds an improved signalling framework, but also sets the incentives for end-users and network operators right, to get that deployed this time.

        Regards,

        • Ivan Barrera Says:

          Arms,

          What AQM algorithm are you using together with ECN (I assume your router at home should be doing most of the marking)? That’s an interesting thing I’ve yet to try.

          I agree with you that changes to TCP should be made for those using ECN to encourage its use, but the IETF has always been looking after fairness and how some flows shouldn’t be able to starve others that follow the standards. But there’s the new P2P on the block, which abuses TCP by opening many connections and quickly overflows queues.

          As for the inhibition of the ECN bit, Jim, I think that’s still the market. Since ECN wasn’t as important for many manufacturers, to me, selling expensive cards or focusing on other areas was more important than adding such a feature. If the ECN bug had been a TCP bug, they would have taken the Taiwanese routers down; instead they just inhibited ECN.

          As I said before, one main advantage of ECN (besides the obvious one of reducing packet losses and retransmissions) is that it allows one to differentiate packet losses due to congestion from those due to malfunctions, the medium, etc. Which is key to actually modifying TCP to behave accordingly. With no TCP modifications, well, that advantage is not used to its full potential.

          Bests,

  45. DaveK Says:

    Hah! Your wireshark picture looks almost exactly like what I noticed happening on my connection just a week or two ago! I was wondering why less than a hundred kB/s of bittorrent download was completely thrashing every other outgoing connection attempt and saw exactly the same sorts of dup’d acks and retransmissions going on.

    What I’m less certain of is whether TCP’s RTT/retry/congestion-avoidance algorithms are worth trying to save at all. In the presence of any real degree of packet loss much above the 0.1% range, TCP falls down horribly, as I discovered while working on ultra-wideband networking devices a few years ago. I guess that finally actually implementing proper ToS/QoS everywhere is going to be the only real effective solution long-term.

    • gettys Says:

      I don’t agree with your conclusions: rather, I believe that everyone needs to develop a deeper understanding of how packet networking actually works. We aren’t seeing the forest for the trees.

      Part of the issue (as I understand it from watching mail traffic, and again, this is not my area) is that many/most of these technologies have been designed such that they buffer packets for a long time in the name of trying to get them delivered reliably: but then this can have the effect of defeating SACK and fast retransmit in TCP. Note that in my traces (which show between one and three percent packet loss, BTW), the pipe’s being kept very full.

      There is no such thing as a “layer” in a network “stack”; I’ve been badly burned by this kind of thinking (and somewhat guilty of it myself) and hope to address this in a future post. One very common pervasive problem has been design by committee, where the committees have been entirely focused on the particular “layer” (in the ISO model sense) of a particular technology, and lacking in expertise in how the protocols built above them actually function. These so-called “layers” interact with each other, and “fixing” problems in one “layer” may just cause more trouble elsewhere.

      A quick example: most 802.11 access points drop the transmit bandwidth down to 1Mbps on all multicast/broadcast traffic; this means if there is even a small amount of such traffic, you can turn your 20 megabit network into a 1 megabit network.

      • Frank Bulk Says:

        Yes, the IEEE 802.11 standard requires access points to transmit at the lowest level when sending multicast traffic.

        Frank

  46. Andrew Says:

    Hang on a second. This is familiar. I used to have an old DSL modem that was really fast, but the ISP had it working at about 1/4 to 1/8th of its maximum speed. This worked fine for downloads from the internet but caused issues on uploads.

    Because the modem could do something like 1Mbps but was working at only 120kbps, its buffers were way bigger than they needed to be for the speed it was actually working at. If the outbound link got saturated the internet would, basically, stop working. The symptoms were new HTTP connections (or any TCP connections really) might start but not always complete.. or complete really, really slowly.

    Since I was writing a p2p client at the time and mostly using that program to do large transfers I implemented throttling in it and the problem went away. It would come back, though, if any family member saturated the outbound connection for more than a few seconds.

    As the ISP upgraded its systems the modem started to work closer to its designed speed and the issue went away. Dramatically so. I could still get it to happen by adding noise to the line in such a way that I closed off enough of its upstream 5KBps channels. (This was easy at the time since the apartment block’s wiring would do this for me.)

    All this was when I first got broadband back in 1999-2000. Unless I missed something I don’t see why silly buffer sizes causing latency causing TCP to fail would be controversial.

  47. Joris Says:

    Has anyone tried disabling window scaling on Windows 7?

    netsh int tcp set heuristics disabled
    netsh int tcp set global autotuninglevel=highlyrestricted

    You may want to note your default values with
    netsh int tcp show global
    first.

    I noticed the latency decreased (not by a whole lot) and varied a lot less when the upstream was congested.

  48. PlanBForOpenOffice Says:

    Hi Jim,
    are you aware of a Netalyzr version that is not browser based and doesn’t need a UI?

    I’d like to run such tests on a few machines that do not have a UI (servers).

    • gettys Says:

      The Netalyzr team hasn’t published the source to their tests so far. I gather they may have a command line version internally. You could drop a note to Nick Weaver or Christian Kreibich and see if they will give you copies.

      However, I believe some of the tests at m-lab are equally effective; but as the results of those tests haven’t yet been published, I’ve mostly ignored them in this blog as they would just further complicate an already complicated story.

      And as I showed in several of my early postings, such as fun with your switch and fun with wireless, testing for bufferbloat is as simple as one-line shell commands.

  49. How big are the buffers in FreeBSD drivers? | Alexander Leidinger Says:

    […] experience are because buffers in the network hardware or in operating systems are too big. He also proposes workarounds until this problem is attacked by OS vendors and equipment […]

  50. E.Hastings Says:

    I bet your kids have great ping in Call of Duty 4! 😉

    • gettys Says:

      My son plays Battle for Wesnoth, rather than a first person shooter; my daughter plays virtual life games, when not OD’ing on medical topics in Wikipedia.

      But yes, they would have pretty good latencies now, unless suffering from 802.11 bufferbloat (I’ve not gone beyond experimentation on home routers and hosts so far).

  51. MarkE Says:

    Brilliant posting, Gettys. I’ve experienced many similar problems, but always thought it was on the service provider’s end, not in between hops because of bufferbloat.

  52. Brian Kilgore Says:

    I worked for a startup company almost 10 years ago. We were building a cellular wireless data system. On our first simulation system I noticed slow transmissions and a high number of retransmissions. I finally tracked it down to packets being delivered out of order. The solutions I saw were either to modify the TCP packets, or to increase the TCP window size. This is a complicated issue. I always thought a good solution for wireless carriers was a proxy approach. If you break up the communications into 2 separate TCP connections the issue becomes more manageable.

    I haven’t thought about these kind of problems for years. These are interesting problems. I wish I had more time to think about them.

    • gettys Says:

      Actually, no.

      Running multiple TCP connections just dilutes the congestion avoidance further, and makes the situation worse. I’ve alluded to this in the blog already a bit; I have a major posting sometime on this topic coming, once I can breathe again.

    • Arms Says:

      As far as I can tell, mobile operators regularly put in “transparent” proxies to deliver OK download speeds (page load times) for mobile devices.

      Of course, I can not prove this when using the phone’s browser (or an Apple device, due to its closed ecosystem preventing tcpdump from being available there). But even when tethered, some non-HTTP sessions look suspicious: they negotiate a smaller MTU in the SYN/SYNACK of the TCP session; certain options I know are supported by the server (when using a wired internet connection) are missing in the SYN/ACK; etc.

      Doing this vs. not doing this has become a no-brainer, as mobile operators not “proxying” TCP sessions (independent of content; with HTTP it’s particularly easy and more cost effective) will not have many customers for long…

      Again, this is rumor (I only consulted with an operator once – running 2G at the time, and this was the one big thing which fixed their issues vs. their competition at the time) as far as I’m concerned… Perhaps someone working for a mobile operator wants to speak up and shed some light on 2G / 3G / 4G mobile data networks and operational tweaks.

      Regards,

  53. Interesting Reading #660 – Trimensional 3D scanner app, Porsche 918 Spyder hybrid, Amazing amateur astronomer, Surprising beers and much more! – The Blogs at HowStuffWorks Says:

    […] Bufferbloat – “In my last post I outlined the general bufferbloat problem. This post attempts to explain what is going on, and how I started on this investigation, which resulted in (re)discovering that the Internet’s broadband connections are fundamentally broken (others have been there before me). It is very likely that your broadband connection is badly broken as well. And there are things you can do immediately to mitigate the brokenness in part, which will cause applications such as VOIP, Skype and gaming to work much, much better…” […]

  54. Network Buffer Bloat – flyingpenguin Says:

    […] are new here, you might want to subscribe to the RSS feed for updates on this topic. Jim Gettys has a bone to pick about performance of his networks. He suspects there is a problem with TCP buffers related to network congestion and round trip time […]

  55. John Williams Says:

    This happens to be a well known problem in gaming circles, where latency AND bandwidth matter greatly. The solution is to tune the “maximum receive window size” in the operating system to the speed of the connection. Ideally that buffer should hold no more than a second’s worth of data.

    In Linux, this can be done in /proc/sys/net/ipv4/tcp_rmem, and for Windows, there’s a program called TCPOptimizer.
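
    A minimal sketch of that tuning on Linux (the 262144 cap is only a placeholder; the file holds three values – min, default, max – and the right maximum depends on your own bandwidth-delay product):

    cat /proc/sys/net/ipv4/tcp_rmem
    sysctl -w net.ipv4.tcp_rmem="4096 87380 262144"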

    Years ago I realised that my ISP had a buffer at their end large enough to hold 40 seconds’ worth of data, which ended up being a similar story to yours.

    • gettys Says:

      Yes, it’s well known in a few circles (who haven’t properly screamed about their problems, in my opinion).

      Note, however, that your wireless router or your computer may also be bufferbloated (and maybe even worse than your ISP): as soon as your broadband bandwidth is higher than your wireless bandwidth, the bottleneck moves to the 802.11 link, and you have yet another problem.

      What is worse, the bandwidth available (actual goodput) there is often widely varying, so static tuning as you suggest won’t really help that case. So we have to circle back to AQM to fully solve the problem.

  56. Links 8/1/2011: GIMP 2.8 Status Update, Ubuntu GNU/Linux Ported to Nook | Techrights Says:

    […] Whose house is of glasse, must not throw stones at another. All broadband technologies are suffering badly from bufferbloat, as are many other parts of the Internet. […]

  57. András Salamon Says:

    It’s sad that the lessons of Stuart Cheshire’s well-known rant It’s the Latency, Stupid are still being ignored. Back in the early days of the commercial Internet, Cheshire was bemoaning the excessive buffering in consumer modems: and here we are again, fifteen years later, in exactly the same place.

    Thanks for your detective work. It might be useful to distil the large amount of text you have written about the history of your investigation into a short overview document, capturing the essential message. The congestion recovery mechanisms of TCP assume that congestion leads to packet loss that can be detected. Bufferbloat removes this link between congestion and packet loss. The result is worse overall network performance than using smaller buffers with some packet loss.

  58. Steve Says:

    This is really great research – and the narrative is great reading too. Thanks. Looking forward to the rest.

    I’m wondering if this may also be incorrectly implemented by satellite providers – creating bufferbloat on these wireless links? My folks live way out in the woods and can only get sat links to the internet. Their latency is atrocious at all times, but sometimes the link is just unusable – I wonder if this is at times when the uplink (since it’s shared with lots of other senders) is overloaded and the sat company is buffering everybody’s stuff (either on the sat or at the downlink)?

    Any thoughts on how to test out this hypothesis? I’d love to be able to give some info to the sat company on how to improve their service, since it’s very painful to use these days. Very bursty in my superficial experience which makes me think bufferbloat might be the culprit (and I could see the false logic in thinking that putting a big buffer on such a slow link would help).

    Also, would implementing wondershaper on their local end possibly improve things? Thanks for any insights on the satellite implications of all this… Really great work!

    • gettys Says:

      Sure. Historically (talking with others with yet more gray hair than me), bufferbloat was first identified and understood on satellite hops, and I’ve certainly seen behavior over satellite links that was likely extreme bufferbloat.

      And yes, shaping traffic to avoid the buffers filling may (or may not) be very helpful. On links with predictable bandwidth, you can avoid filling the buffers.

      But, as usual, you have to identify which hop is actually the bottleneck. It can be hiding in the satellite technology itself, or in the routers on either side of the hop, or locally in your home network (though this is probably less likely in this case). And, IIRC, some of the satellite technologies play evil games with TCP in the background. So the first step is to identify where you are actually suffering. Time maybe to start another page on tools and troubleshooting, or turn on a wiki for everyone to play with.
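
      A minimal sketch of the shaping idea (interface name and rate are placeholders; wondershaper does roughly this, plus some prioritization):

      tc qdisc add dev eth0 root handle 1: htb default 10
      tc class add dev eth0 parent 1: classid 1:10 htb rate 400kbit ceil 400kbit

      Setting the rate a little below the measured uplink capacity keeps the queue in your own equipment, where it stays short, rather than in the modem or satellite gear, at the cost of a small amount of throughput.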

  59. Jesper Dangaard Brouer Says:

    Let’s fix the WiFi bufferbloat
    http://netoptimizer.blogspot.com/2011/01/bufferbloat-wireless-is-worse-than.html

  60. Oscar Niemi Says:

    Sounds very much like the problem you solve with the I and D in a common PID regulator.
    http://en.wikipedia.org/wiki/PID_controller
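
    For readers who don’t know the analogy, the textbook PID control law (standard control-theory background, not something specific to this blog) is

    u(t) = K_p e(t) + K_i \int_0^t e(\tau) d\tau + K_d \frac{d e(t)}{dt}

    where e(t) is the error between the measured quantity (here, queue length) and its target; the “PI” AQM controllers studied in the literature apply essentially this idea to router queues.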

  61. Arms Says:

    @Jim Gettys:

    I just got my new FTTH connection (10/10 Mbit/s). As expected, even the new provider has not configured any decent AQM scheme (despite the CPE using a Broadcom BCM5338M, which does offer advanced schemes – but only rate limiting appears to be utilized – the physical FTTH link is Eth 100 FDX).

    But the point I wanted to make is, that things might not be as bleak with the demise of WinXP as you suggest. Win7 comes with Compound TCP as the standard congestion control algorithm, which is a hybrid (latency / loss feedback) scheme – see http://tools.ietf.org/html/draft-sridharan-tcpm-ctcp-02.

    So, running with a really large TCP receive window to a decent server in my ISP’s core, the latency impact with CTCP is significantly reduced compared to running NewReno.

    The tests were performed using an FTP session, captured with Wireshark, and analysed using Ostermann’s tcptrace [cygwin] utility, version 6.6.0 4Nov2003:

    NewReno:
    ================================
    TCP connection 18:
    host ai: 213.143.112.66:20
    host aj: srichard-LW7.teletronic.at:53190
    complete conn: no (SYNs: 2) (FINs: 0)
    first packet: Wed Jan 12 11:26:26.222561 2011
    last packet: Wed Jan 12 11:40:56.934254 2011
    elapsed time: 0:14:30.711693
    total packets: 729058
    filename: wireshark_7F15C909-38F4-4965-8F89-4DFEE06BFB4A_20110112110841_a03940
    ai->aj: aj->ai:
    total packets: 370210 total packets: 358848
    ack pkts sent: 370209 ack pkts sent: 358848
    pure acks sent: 370209 pure acks sent: 8
    sack pkts sent: 20015 sack pkts sent: 0
    dsack pkts sent: 77 dsack pkts sent: 0
    max sack blks/ack: 3 max sack blks/ack: 0
    unique bytes sent: 0 unique bytes sent: 1043337805
    actual data pkts: 0 actual data pkts: 358839
    actual data bytes: 0 actual data bytes: 1044126973
    rexmt data pkts: 0 rexmt data pkts: 637
    rexmt data bytes: 0 rexmt data bytes: 789168
    zwnd probe pkts: 0 zwnd probe pkts: 0
    zwnd probe bytes: 0 zwnd probe bytes: 0
    outoforder pkts: 0 outoforder pkts: 0
    pushed data pkts: 0 pushed data pkts: 15929
    SYN/FIN pkts sent: 1/0 SYN/FIN pkts sent: 1/0
    req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
    adv wind scale: 6 adv wind scale: 8
    req sack: Y req sack: Y
    sacks sent: 20015 sacks sent: 0
    urgent data pkts: 0 pkts urgent data pkts: 0 pkts
    urgent data bytes: 0 bytes urgent data bytes: 0 bytes
    mss requested: 1460 bytes mss requested: 1460 bytes
    max segm size: 0 bytes max segm size: 52128 bytes
    min segm size: 0 bytes min segm size: 5 bytes
    avg segm size: 0 bytes avg segm size: 2909 bytes
    max win adv: 802240 bytes max win adv: 66560 bytes
    min win adv: 5888 bytes min win adv: 66560 bytes
    zero win adv: 0 times zero win adv: 0 times
    avg win adv: 786975 bytes avg win adv: 66560 bytes
    initial window: 0 bytes initial window: 2896 bytes
    initial window: 0 pkts initial window: 1 pkts
    ttl stream length: NA ttl stream length: NA
    missed data: NA missed data: NA
    truncated data: 0 bytes truncated data: 0 bytes
    truncated packets: 0 pkts truncated packets: 0 pkts
    data xmit time: 0.000 secs data xmit time: 870.698 secs
    idletime max: 299.9 ms idletime max: 369.6 ms
    throughput: 0 Bps throughput: 1198259 Bps

    RTT samples: 1 RTT samples: 172778
    RTT min: 0.2 ms RTT min: 1.9 ms
    RTT max: 0.2 ms RTT max: 168.4 ms
    RTT avg: 0.2 ms RTT avg: 92.0 ms
    RTT stdev: 0.0 ms RTT stdev: 17.3 ms

    RTT from 3WHS: 0.2 ms RTT from 3WHS: 1.9 ms

    RTT full_sz smpls: 1 RTT full_sz smpls: 1
    RTT full_sz min: 0.2 ms RTT full_sz min: 100.8 ms
    RTT full_sz max: 0.2 ms RTT full_sz max: 100.8 ms
    RTT full_sz avg: 0.2 ms RTT full_sz avg: 100.8 ms
    RTT full_sz stdev: 0.0 ms RTT full_sz stdev: 0.0 ms

    post-loss acks: 0 post-loss acks: 190
    segs cum acked: 0 segs cum acked: 185193
    duplicate acks: 8 duplicate acks: 1147
    triple dupacks: 1 triple dupacks: 14
    max # retrans: 0 max # retrans: 8
    min retr time: 0.0 ms min retr time: 0.0 ms
    max retr time: 0.0 ms max retr time: 571.5 ms
    avg retr time: 0.0 ms avg retr time: 110.2 ms
    sdv retr time: 0.0 ms sdv retr time: 142.2 ms

    CTCP:
    ================================
    TCP connection 1:
    host a: 213.143.112.66:20
    host b: srichard-LW7.teletronic.at:60372
    complete conn: yes
    first packet: Tue Jan 11 20:05:10.363129 2011
    last packet: Tue Jan 11 20:19:40.933754 2011
    elapsed time: 0:14:30.570624
    total packets: 732099
    filename: upload-100fdx-wire-win7.trc
    a->b: b->a:
    total packets: 371825 total packets: 360274
    ack pkts sent: 371824 ack pkts sent: 360274
    pure acks sent: 371823 pure acks sent: 1
    sack pkts sent: 19282 sack pkts sent: 0
    dsack pkts sent: 25 dsack pkts sent: 0
    max sack blks/ack: 3 max sack blks/ack: 0
    unique bytes sent: 0 unique bytes sent: 1048576569
    actual data pkts: 0 actual data pkts: 360272
    actual data bytes: 0 actual data bytes: 1049084734
    rexmt data pkts: 0 rexmt data pkts: 382
    rexmt data bytes: 0 rexmt data bytes: 508165
    zwnd probe pkts: 0 zwnd probe pkts: 0
    zwnd probe bytes: 0 zwnd probe bytes: 0
    outoforder pkts: 0 outoforder pkts: 0
    pushed data pkts: 0 pushed data pkts: 16007
    SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/1
    req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
    adv wind scale: 6 adv wind scale: 8
    req sack: Y req sack: Y
    sacks sent: 19282 sacks sent: 0
    urgent data pkts: 0 pkts urgent data pkts: 0 pkts
    urgent data bytes: 0 bytes urgent data bytes: 0 bytes
    mss requested: 1460 bytes mss requested: 1460 bytes
    max segm size: 0 bytes max segm size: 63712 bytes
    min segm size: 0 bytes min segm size: 24 bytes
    avg segm size: 0 bytes avg segm size: 2911 bytes
    max win adv: 1057088 bytes max win adv: 66560 bytes
    min win adv: 5888 bytes min win adv: 66560 bytes
    zero win adv: 0 times zero win adv: 0 times
    avg win adv: 926699 bytes avg win adv: 66560 bytes
    initial window: 0 bytes initial window: 2896 bytes
    initial window: 0 pkts initial window: 1 pkts
    ttl stream length: 0 bytes ttl stream length: 1048576569 bytes
    missed data: 0 bytes missed data: 0 bytes
    truncated data: 0 bytes truncated data: 1000809126 bytes
    truncated packets: 0 pkts truncated packets: 360263 pkts
    data xmit time: 0.000 secs data xmit time: 870.444 secs
    idletime max: 665.7 ms idletime max: 729.8 ms
    throughput: 0 Bps throughput: 1204470 Bps

    RTT samples: 2 RTT samples: 170802
    RTT min: 0.1 ms RTT min: 1.9 ms
    RTT max: 0.2 ms RTT max: 146.2 ms
    RTT avg: 0.1 ms RTT avg: 91.8 ms
    RTT stdev: 0.0 ms RTT stdev: 18.1 ms

    RTT from 3WHS: 0.2 ms RTT from 3WHS: 1.9 ms

    RTT full_sz smpls: 1 RTT full_sz smpls: 3
    RTT full_sz min: 0.1 ms RTT full_sz min: 95.9 ms
    RTT full_sz max: 0.1 ms RTT full_sz max: 95.9 ms
    RTT full_sz avg: 0.1 ms RTT full_sz avg: 95.9 ms
    RTT full_sz stdev: 0.0 ms RTT full_sz stdev: 0.0 ms

    post-loss acks: 0 post-loss acks: 207
    For the following 5 RTT statistics, only ACKs for
    multiply-transmitted segments (ambiguous ACKs) were
    considered. Times are taken from the last instance
    of a segment.
    ambiguous acks: 0 ambiguous acks: 4
    RTT min (last): 0.0 ms RTT min (last): 57.7 ms
    RTT max (last): 0.0 ms RTT max (last): 106.8 ms
    RTT avg (last): 0.0 ms RTT avg (last): 79.5 ms
    RTT sdv (last): 0.0 ms RTT sdv (last): 24.0 ms
    segs cum acked: 0 segs cum acked: 188879
    duplicate acks: 0 duplicate acks: 2281
    triple dupacks: 0 triple dupacks: 33
    max # retrans: 0 max # retrans: 6
    min retr time: 0.0 ms min retr time: 0.0 ms
    max retr time: 0.0 ms max retr time: 397.4 ms
    avg retr time: 0.0 ms avg retr time: 81.5 ms
    sdv retr time: 0.0 ms sdv retr time: 79.8 ms

    The b-side is the important part. In summary, the latency induced due to bufferbloat is 150 +-87 ms with NewReno (the high variance indicating frequent draining of the buffers – and frequent collapse of the throughput – averaging at 1.198 MB/sec). With CTCP in comparison, the induced latency is “only” 92 +-18 ms – I believe the latency component of CTCP has a target of 100ms – and the much better variance also indicates a much less pronounced sawtooth behavior, for an average throughput of 1.204 MB/sec.
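
    As a rough back-of-the-envelope check (not a figure from the traces themselves): sustained queuing delay times throughput approximates the standing queue at the bottleneck, so roughly 0.092 s × 1.2 MB/s ≈ 110 kB of data sits in the uplink buffer during the transfer, on a path whose unloaded RTT is under 2 ms.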

    Still looking at the download traces (Ostermann’s tcptrace doesn’t correlate round-trip times properly from a receiver-side trace…)

    Regards,

    • Jesper Dangaard Brouer Says:

      The CTCP (Compound TCP) algorithm sounds very interesting!

      There is more on the MS implementation:
      http://research.microsoft.com/apps/pubs/default.aspx?id=70189

      Quote:
      “… .This new delay-based component can rapidly increase sending rate when network path is under utilized, but gracefully retreat in a busy network when bottleneck queue is built. …”

      It sounds like what we want: it “retreats” when the queue is building…

      • Arms Says:

        Well, arriving with ~74-100ms is an improvement over latencies in the range of ~63-237ms, but the problem here is that the baseline latency (speed of light) will be different for every path in the internet. And a sender has no means to distinguish signalling delay from queuing delay.

        In my case, the unloaded latency (empty buffers at the beginning of the test, until slow-start overshoots and floods the queues) is slightly less than 3ms to the ISP’s local server. And for the record, on the download path (from the same server), the full queue latency rises to 39.6 +- 2.8 ms – a bit better than in the uplink direction. Most likely this comes from the shared memory buffering used by many common switch designs (I’m linked up to Broadcom chipset boxes doing the rate limiting). Right now, there are about 20 home users, provisioned 10/10 and a few 30/30, sharing the 1GE link to my apartment complex – thus the total buffer of the switch in the basement is shared among all these users, which results in less (kB) buffering available to each individual port – and thereby limits maximum latency indirectly…

        From the point of view of control theory, you want the feedback signal as fast as possible, but not faster than the ground frequency of the control loop (i.e. 1 RTT; ICMP source quench violated that principle, and had a number of other shortcomings, such as generating more load at times when congestion is already prevailing), and not very much slower either (i.e. >> 2-3 RTT).

        In my example, the empty-queue RTT to my server is around 3ms, but the feedback loop reacts with ~40, ~100 or even ~150 ms – a factor of 15, 35 and 50 more untimely than would be ideal…

        OTOH, with such huge buffering delays, latency-based congestion algorithms have little trouble spotting these building queues… And you won’t even need to go for a real-time-optimized OS (minimizing OS scheduler / interrupt jitter, measuring times with high precision).

        (For comparison, pathChirp needs to run its timers at microsecond / sub-microsecond resolution to yield good results. This quickly leads into a rat hole of OS stack changes throughout…)

        Regards,

      • Arms Says:

        One more paper on this topic (and non-delay based CC algorithms in TCP):

        http://books.google.com/books?id=n3nxjc0I7QsC&lpg=PA392&dq=Collateral%20Damage%3A%20The%20Impact%20of%20Optimised%20TCP%20Variants%20on%20Real-Time%20Traffic%20Latency%20in%20Consumer%20Broadband%20Environments&pg=PA393#v=onepage&q=Collateral%20Damage:%20The%20Impact%20of%20Optimised%20TCP%20Variants%20on%20Real-Time%20Traffic%20Latency%20in%20Consumer%20Broadband%20Environments&f=true

  62. Paul M Says:

    I have seen various people publish “hacks” to “improve” the performance of Firefox by increasing the number of TCP sockets it can open to web servers. Thinking about it, these go back to the days of WinXP and older? This then causes some people to go crazy and increase the values beyond any reasonable limit.

    I think a similar problem occurs when there is wifi congestion. People’s instinct is to turn the power UP on their wifi access points (or fit higher-gain antennas) – their idea being that shouting louder overcomes the interference and congestion, but of course the wireless clients’ power can’t easily be fixed that way, nor can their receivers easily be made less sensitive to suit!

    The proper solution of course is for *everyone* to *reduce* the power output on their access points to the minimum just sufficient to cover their site. The snag is that this requires people to (a) understand the cause of interference and (b) cooperate with their neighbours.

  63. Paul M Says:

    BTW, I am using an external ADSL modem/bridge which connects to my Linux router on Ethernet with PPPoE, and to the ADSL PPPoA service over PSTN. I note that txqueuelen on eth1 and dsl0 is just 10.

    the eth0 LAN interface (gigabit) has txqueuelen 1000 😦

    • gettys Says:

      Yes, Linux drivers give the system a hint of the size of the transmit queue. It’s modern devices (Ethernet and wireless) that seem to most commonly pick 1000 out of the air.

      Beware buffering elsewhere in the system, as I ran into in the drivers (and as you can run into in the broadband gear, your home router, and elsewhere in the net).
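
      For anyone experimenting, a minimal sketch (eth0 and the value 100 are placeholders, not recommendations):

      ip link show dev eth0
      ip link set dev eth0 txqueuelen 100

      Remember this only shortens the transmit queue; the driver and hardware ring buffers underneath it can still hold a great deal more.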

  64. Bufferbloat – An Invisible Menace That Is Slowing Down The Internet | Paul Jenkins' Tech Info Says:

    […] few weeks ago, Jim Gettys of Bell Labs, set off an IED of an announcement on his personal blog: the entire internet is substantially slowing down in real-time communications due to a creeping […]

  65. Greg Watson Says:

    Nick McKeown’s group at Stanford has done a lot of work on the impact of buffer sizing, also concluding that lots of buffers can make things worse.

    All papers: http://yuba.stanford.edu/~nickm/papers/

    Disclosure: I used to work in Nick’s group.

    Relevant Papers:

    Talks:
    * “Buffers: How we fell in love with them, and why we need a divorce.”
    Keynote at Hot Interconnects, Aug 2004.
    ppt, pdf, ps

    Papers:

    “Experimental Study of Router Buffer Sizing”,
    Neda Beheshti, Yashar Ganjali, Monia Ghobadi, Nick McKeown, and Geoff Salmon
    IMC’08, October 2008, Vouliagmeni, Greece.
    pdf

    “Obtaining High Throughput Networks with Tiny Buffers”,
    Neda Beheshti, Yashar Ganjali, Ashish Goel, Nick McKeown
    In Proceedings of 16th International Workshop on Quality of Service (IWQoS), Enschede, Netherlands, June 2008.
    5 pages pdf.

    “Experimenting with Buffer Sizing in Routers”,
    Neda Beheshti, Yashar Ganjali, Jad Naous, and Nick McKeown
    ANCS’07, December 2007, Orlando, Florida, USA.
    pdf

    “Packet Scheduling in Optical FIFO Buffers,”
    N. Beheshti, Y. Ganjali, and N. McKeown,
    High-Speed Networking Workshop (In Conjunction with IEEE Infocom 2007), Anchorage, AK, May 2007.

    “Update on Buffer Sizing in Internet Routers”,
    Yashar Ganjali, Nick McKeown
    Computer Communications Review (CCR), Volume 36, Number 5, October 2006.
    4 Pages pdf.

    “Buffer sizing in all-optical packet switches”,
    Neda Beheshti, Yashar Ganjali, Ramesh Rajaduray, Daniel Blumenthal, and Nick McKeown
    In Proceedings of OFC/NFOEC, Anaheim, CA, March 2006.
    3 Pages pdf.

    “Routers with very small buffers”,
    Mihaela Enachescu, Yashar Ganjali, Ashish Goel, Nick McKeown, and Tim Roughgarden
    In Proceedings of the IEEE INFOCOM’06, Barcelona, Spain, April 2006.
    11 Pages pdf.

    “Part I: Buffer Sizes for Core Routers”,
    Damon Wischik and Nick McKeown
    ACM/SIGCOMM Computer Communication Review, Vol. 35, No. 3, July 2005.
    4 Pages pdf.

    “Part III: Routers with Very Small Buffers”,
    Mihaela Enachescu, Yashar Ganjali, Ashish Goel, Tim Roughgarden, and Nick McKeown
    ACM/SIGCOMM Computer Communication Review, Vol. 35, No. 3, July 2005.
    7 Pages pdf.
    Extended version: Stanford HPNG Technical Report TR05-HPNG-060606 pdf.

    “Recent Results on Sizing Router Buffers”
    Guido Appenzeller, Nick McKeown, Joel Sommers, Paul Barford
    Proceedings of the Network Systems Design Conference, October 18-20 2004, San Jose, Ca. pdf,

    “Sizing Router Buffers”
    Guido Appenzeller, Isaac Keslassy and Nick McKeown
    ACM SIGCOMM 2004, Portland, August 2004. pdf, ps
    Extended version: Stanford HPNG Technical Report TR04-HPNG-060800 pdf, ps

  66. Dylan Hall Says:

    Thanks for your investigation and write up. I think you’ve managed to bring together some fairly well known behaviors (fill your link and it becomes useless for anything else) and make sense of it 🙂

    7-8 years ago I was trying to implement a “QoS” service on our then rather new network. The service consisted of a Frame Relay link to the customer which terminated on an ATM switch (I don’t recall what type, not my area), which re-encapsulated the packets in ATM and delivered them to my router (Unisphere, now Juniper ERX). Frame relay has the ability to interleave frames (FRF.12 I think it’s called) so you can run VoIP on low speed links, but due to the ATM layer in the middle we couldn’t take advantage of that.

    Our primary concern was jitter caused by the serialisation delay on low speed links. At the time Cisco said 768Kbps was the lower bound, and we were aiming for 1Mbps given the additional buffering/jitter introduced by the ATM layer.

    Most of my lab testing involved saturating low priority queues with UDP packets (no back off so the queues were permanently full) and testing that the higher priority queues still had a timely (low latency/jitter) service.

    Some of the interesting facts I figured out during this process were:

    Our Ethernet switches (Extreme i-series) use 256kB buffers on each port.
    The Juniper ERX line cards are equipped with 32MB of buffer which is shared dynamically among all the sub-interfaces on that card (each card supports up to 32000 sub-interfaces). Each sub-interface has an upper bound of 7MB of buffer.
    The Cisco CPE at the time (2500/2600 routers) used 64kB buffers.

    During my final testing I dropped the speed of the service (256kbps I think) and tested the impact of filling the high priority queue while a VoIP call was in progress. I set it up such that the total traffic was around 10% higher than the available bandwidth. Once the queue filled and started tail dropping the service had the expected 10% packet loss and the VoIP call continued to work. The problem was the end-to-end latency was > 40 seconds. I was able to speak into one of the phones, stand around for 30 seconds, then walk across the room and listen to my message on the other phone.

    The problem was my lab environment only had a single service configured on the line card, so the router had given it the full 7MB buffer, totally insane.
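
    For scale (a back-of-the-envelope check, not figures from the original test): queuing delay is roughly buffered bytes divided by link rate, so a 7MB buffer draining at 256kbps could in principle add 7 MB × 8 / 256 kbit/s ≈ 220 seconds of delay; the observed 40-plus seconds corresponds to about 40 s × 32 kB/s ≈ 1.3 MB actually queued.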

    I was able to find the knob in the config to lower the upper bound to 64kB, based on the reasoning that if 64kB is good enough for Cisco, it’s good enough for me.

    We then told our customers that “under no circumstances over-subscribe your high priority queue, use it for latency/jitter sensitive traffic only, e.g. RTP”. Even with the 64kB queue limit a full queue still caused more latency than was ideal for VoIP.

    Let me finally get to my points 🙂

    I was building a private service (RFC2547 based). Most of us using standard public Internet services don’t have the luxury of being able to mark important packets and expect them to get special treatment at all the choke points in the network.

    Much (most?) telco router gear has knobs to tune these sorts of settings; I wonder how many ISPs actually change these settings from the defaults? Are the defaults on this gear sensible?

    VoIP is really tolerant of packet loss, far more so than the Voice engineers are willing to admit. This means our obsessive pursuit of low/no-loss services is unnecessary (at least when VoIP is used as the excuse to justify that pursuit). From my experience working at a Telco, loss is considered to be evil because voice doesn’t like it, which I find ironic because the standard solution (larger buffers) causes far worse issues.

  67. Chet Johnson Says:

    I am quite familiar with what you have described.

    What I look forward to is the opportunity to meet again with Dave Clark and Vint Cerf. David and I met when he took a little trip up to Hillsboro, OR a while back and we discussed tcptrace and slow start a long time ago. Vint and I met while he was with MCI, in Folsom, CA. It’s been a while. I hope they are reading this and I wish them well.

  68. daniel hewitt Says:

    So how come latency increases when you don’t use a TCP application at all?

    So when using UDP to stream video or download via BitTorrent, why does playing a multiplayer game that uses UDP have ridiculously high latency?

    • gettys Says:

      Anything can fill the buffers, TCP, UDP or other protocols.

      The surprise here is that a single TCP connection fills these buffers (on anything except Windows XP). And the buffers are so large as to be causing TCP major confusion: congestion avoidance has been defeated.

      • Christopher Smith Says:

        Anything can fill the buffers, but TCP is the protocol that is negatively impacted by it. UDP by itself shouldn’t be impacted by this phenomenon beyond huge variability in latencies. The mean throughput for UDP should be good, as should the min latency (mean & median will be somewhat negatively impacted, but likely not too badly). Of course, a lot of UDP applications have their own congestion control at a higher layer, which causes tons of fun.

        I think though that what Daniel is talking about is really a different problem related to actually maxing out available bandwidth in the pipe, rather than transient max outs caused by buffer bloat.

        • gettys Says:

          I think that is incorrect.

          Given the fact of single queues and no classification in these devices, the buffers being full means that UDP traffic also suffers the delays just as much as TCP.

          Yet worse: the buffers are being kept almost precisely full, with TCP pacing its packets to keep them topped off (more or less), particularly when in bursts as TCP tries to find new safe operating points, increasing the loss rates on competing flows whether TCP or UDP.

        • Ivan Says:

          Actually, UDP traffic can be even more negatively impacted, because of the lack of reliability, and it can greatly contribute to bufferbloat due to the lack of congestion control mechanisms.

          With full buffers, dropped UDP PDUs will not be retransmitted unless reliability is implemented in the application layer. Nevertheless, the fact that many applications will attempt to retransmit some packets to achieve some ends, such as DNS, will cause further flooding of packets into the network and the risk of no data actually passing through.

  69. Charles 'Buck' Krasic Says:

    Right. If there are long lived flows that are greedy for bandwidth (elephants), they will fill the queues. Whether the elephants are TCP, or UDP (with app level congestion control), is moot.

    We see more and more long-form video (elephants) moving through the network. There are downloads (torrents etc.), and adaptive streaming for VoD, live streaming, and video conferencing. Elephants already dominate Internet traffic on a fraction-of-bytes basis, and will for the foreseeable future. The presence of the elephants assures that “bufferbloat” latencies will be ever more common. That is, until better mechanisms than the status quo are deployed.

    • Ivan Says:

      Note that I wouldn’t call torrents elephants. And actually torrents are somewhat of a mechanism to abuse TCP congestion control. While a torrent pushes networks to their limits (which I guess is fine), it opens many connections to download chunks of data, which causes unfairness to other well-behaved flows. A single download should be fair to a single streaming flow; however, if a torrent opens 99 connections, the competing flow will only have 1/100 of the bandwidth.

      In addition, in order to bypass the congestion control mechanism of TCP, some implementations use UDP, which turned out to be more detrimental to the network (but somewhat more efficient for the downloaders).

      • Arms Says:

        The BitTorrent protocol is a particularly bad example if you think the application’s use of UDP will be more detrimental to the network. You may want to read up on what the congestion control of uTP is actually all about (for starters: http://forum.utorrent.com/viewtopic.php?id=76640) and you may also want to learn about LEDBAT ( http://tools.ietf.org/wg/ledbat/ ) congestion control.

        It appears that you may have mistaken implementation bugs in an early alpha for the design goals of this particular protocol. (For some reason, particularly Eastern European ISPs had issues at that time [2008] with uTP.)

        http://www.serviceassurancedaily.com/2008/12/bittorrent_over_udp_end_of_the.html

        Obviously, there is no point in switching all transport protocols to scavenger services though…

        • Ivan Says:

          At least it’s good to see it’s going through the IETF. Now, as far as I know, uTorrent is attempting to transmit reliably over UDP, which seems to be what TCP does. While I’m all for improving TCP (and there have been several proposals for that), I think (my opinion) that using UDP to bypass the basic TCP-friendliness requirements imposed on other protocols doesn’t seem fair. As I said, several other protocols that are friendly to TCP have been discarded for various reasons, and this type of “new” protocol needs to go under special review before running in the wild. Particularly when there’s no congestion control mechanism described for UDP traffic.

        • Arms Says:

          @Ivan:

          You are right, with plain UDP there is no congestion control other than what an application designer thinks is appropriate.

          My point being that BitTorrent in particular is a bad example, because uTP does have a congestion control scheme that goes to extreme lengths to be a scavenger service (less-than-best-effort), compared with the best-effort service of TCP.

          One key signal missing in TCP to build improved congestion control (such as LEDBAT, and available in uTP) is measuring one-way delay (instead of round trip time, which is the signal measured currently by TCP). There are efforts underway to address this aspect of TCP, btw.

        • Ivan Says:

          @Arms: I know you (we all) are trying to make a point about the importance of congestion control mechanisms. My point of view is that several improvements have been proposed to maximize throughput while being TCP friendly. We can spend much time discussing the advantages of uTP.

          My point being that XCP, for example (among many others), provided an exclusive focus on congestion feedback, and was carefully studied. They followed the rules on TCP friendliness. UDP has no congestion control mechanism. Any congestion control mechanism on top of UDP is simply an application-layer feature. And that shouldn’t be the way of competing against TCP, by enforcing reliability using a protocol that wasn’t meant for it. It’s like using a hammer the other way around: it may feel lighter, it may do the job, but it’s not the way it is supposed to be used. And careful attention should be given to “wild” deployments of this type of protocol that may turn hurtful, which is why I think it is good that it is going under the review of the IETF.

        • gettys Says:

          And the different flavors of congestion avoidance algorithms in TCP and other protocols are entirely moot, so long as we fail to notify the hosts of congestion in a timely fashion (necessary for the congestion avoidance servo mechanisms to have rapid and stable response). That’s what bufferbloat has done to us; the amount of buffering would not matter if we had working AQM algorithms deployed everywhere.

          It is easy to get lost in the forest among the trees of the congestion avoidance algorithms if you lose sight of this fundamental fact.

      • Charles 'Buck' Krasic Says:

        Yes, torrent clients maintain many connections. But at any given time, a relatively small subset of them will be actively transferring data. Compared to other “mousey” internet traffic types (web pages, e-mail, chat, …), in torrent swarms, the active connections engage in relatively long lived data transfers (MB+ range comprised of several chunks). The point is that these flows transfer data over sufficiently many RTTs to open up their TCP congestion window and keep it there for a while. In my book this makes them elephants. You could even say that elephants are precisely those flows that can induce bufferbloat effects.

        • gettys Says:

          Web browsing *also* induces bufferbloat effects; you have N connections (6 or more), all with their initial window’s data flying toward the broadband edge, where they go *splat* into the queues of the home devices.

          So suffering comes in all forms: in the web browsing case, you get transient effects, just the thing to cause your VOIP traffic to have fits.

        • Ivan Says:

          “In my book this makes them elephants.”.

          Charles, I read that book. Unfortunately, I’d say it hasn’t been updated. Current webpages may easily download that amount. I think the rapid increase in bandwidth availability has shifted the meaning of elephants and mice. I know several cases already where terabyte transfers are competing against music streaming, because the bandwidth allows it.

  70. Charles 'Buck' Krasic Says:

    Sure. It would be an interesting study to classify the causes of bufferbloat in the wild. My bet is on torrents and streaming dominating the other web induced bufferbloat “events”. Hopefully with all this discussion, one or more such studies are already underway. 🙂

  71. Charles 'Buck' Krasic Says:

    I agree that deployment of working ECN+AQM is desirable.

    I would point out an interesting twist from the literature:

    http://ccr.sigcomm.org/online/?q=node/271

    They emulate AQM from the end hosts, by interposing between TCP and IP. It wouldn’t prevent cheating like AQM can, but it was a cute idea.

  72. Stephen Glover » Blog Archive » Some Internet Service Providers (ISP’s) are Redirecting Search Traffic Without Consent. Says:

    […] Jim Gettys is leading an initiative to fight “bufferbloat”, i.e., overly large buffers that cause time-sensitive traffic to be delayed significantly in the presence of high-volume background data transfers. Take a look a Jim’s introductory article and the role of Netalyzr’s findings here. […]

  73. Bufferbloat: Dark Buffers in the Internet – IETF Journal Says:

    […] bad things happen (see https://gettys.wordpress.com/2010/12/06/whose-house-is-of-glasse-must-not…), as John Nagle’s cogent explanation, RFC 970 (from 1985!), […]
