The Internet is Broken, and How to Fix It

Many real time applications, such as VOIP, gaming, teleconferencing, and performing music together, require low latency. These are increasingly unusable in today’s Internet, not because there is insufficient bandwidth, but because we’ve failed to look at the Internet as an end to end system. The edge of the Internet now often runs congested. When it does, bufferbloat causes performance to fall off a cliff.

Where once a home user’s Internet connection served a single computer, it now serves a dozen or more devices: smart phones, TVs, Apple TV/Roku devices, tablets, home security equipment, and one or more computers per household member. More Internet connected devices arrive every year, and they often perform background activities without the user’s intervention, inducing transients on the network. These devices need to share the edge connection effectively in order to keep each user happy. All can induce congestion and bufferbloat that baffle most Internet users.

The CoDel (“coddle”) AQM algorithm provides the “missing link” necessary for good TCP behavior and solving bufferbloat. But CoDel by itself is insufficient to provide reliable, predictable low latency performance in today’s Internet.

Bottlenecks are most common at the “edge” of the Internet, and there you must be very careful to avoid queuing delays of all sorts. Your share of a busy 802.11 conference network (or a marginal WiFi connection, or one in a congested location) might be 1Mb/second, at which speed a single packet represents 13 milliseconds. Your share of a DSL connection in the developing world may be similarly limited. Small businesses often support many people on limited bandwidth. Budget motels commonly share a single broadband connection among all guests.

Only a few packets can ruin your whole day! A single IW10 TCP open immediately blows any telephony jitter budget at 1Mbps (which is about 16x the bandwidth of conventional POTS telephony).
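The arithmetic behind those numbers can be checked with a rough sketch. The 1500 byte packet size and the 64 kb/s POTS voice channel are assumptions chosen for illustration (the text gives only the 1Mb/second share); link layer framing is ignored, which is why the per-packet figure comes out a little under the 13 milliseconds quoted above.

```python
def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time needed to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000.0


LINK_BPS = 1_000_000   # the 1 Mb/s share used in the text
MTU_BYTES = 1500       # assumption: full-size packet, no link layer overhead
IW10_PACKETS = 10      # IW10: an initial window of ten segments
POTS_BPS = 64_000      # assumption: one 64 kb/s voice channel

one_packet_ms = serialization_delay_ms(MTU_BYTES, LINK_BPS)
iw10_burst_ms = IW10_PACKETS * one_packet_ms
pots_ratio = LINK_BPS / POTS_BPS

print(f"one packet:  {one_packet_ms:.1f} ms")
print(f"IW10 burst:  {iw10_burst_ms:.1f} ms")
print(f"link / POTS: {pots_ratio:.1f}x")
```

Under these assumptions a single ten packet burst occupies the link for on the order of a tenth of a second, which is far more than a voice call can absorb as jitter.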

Ongoing technology changes make the problem more challenging. These include:

Changes to TCP, including the IW10 initial window changes and window scaling.
NIC Offload engines generate bursts of line rate packet streams at multi-gigabit rates. These features are now “on” by default even in cheap consumer hardware including home routers, and certainly in data centers. Whether this is advisable (it is not…) is orthogonal to the reality of deployed hardware and current device drivers and default settings.
Deployment of “abusive” applications (e.g. HTTP/1.1 using many more than two TCP connections, sharded web sites, BitTorrent). As systems designers, we need to remove the incentives for such abusive application behavior, while protecting the user’s experience. Network engineers must presume software engineers will optimize their application performance, even to the detriment of other uses of the Internet, as the abuse of HTTP by web browsers and servers demonstrates.
The rapidly increasing number of devices sharing home and small office links.

All of these factors contribute to large line rate bursts of packets crossing the Internet to arrive at a user’s edge network, whether in their broadband connection or, more commonly, in their home router.

Requirements

Not all requirements apply everywhere. For example, regulating different users’ total bandwidth isn’t necessary in an Internet core router, as it is already regulated at the edge of the network. Home and small business routers have different requirements than core routers. These requirements include:

Handle changing bandwidth quickly and robustly, since the bandwidth of both wireless and broadband systems is variable
Preserve good utilization of your bandwidth, while retaining real time performance for latency sensitive traffic, such as VOIP, gaming, etc.
Buffers should really “work”, and not be perpetually full. If they run full, bad things happen when bursts occur.
“Fair” division of bandwidth among users, and “fairness” between different applications of that user.
Solving the BitTorrent problem, redux. But BitTorrent is just an example of what other applications may want and need to do; our systems still need to protect themselves from this behavior.
Trying to deal with VPN’s, as best we can.
Good behavior for “ant” protocols such as DNS, DHCP, RA, etc, so that the network operates well even under extreme load.

To achieve these requirements, we need to simultaneously solve a number of problems, not necessarily (or even desirably) in one algorithm.

Bufferbloat itself is only soluble by a suitable adaptive AQM algorithm, which ensures buffers are kept generally empty so they (and TCP itself) can function properly. TCP’s responsiveness to sharing bandwidth between competing flows depends on the square of the delay: 10 times too much buffering induces 100 times the delay. An AQM algorithm that can adapt to wireless (or variable broadband links) successfully has not been available; existing algorithms are unsuitable. Bandwidth utilization argues for an AQM which is reasonably efficient and properly adaptive to available bandwidth. Some AQMs may manage latency well, but not necessarily allow for good utilization of available bandwidth. Even fewer adjust to variable bandwidth. To the extent possible, we’d like to have our cake and eat it too. CoDel (pronounced “coddle”) has the needed characteristics and is showing excellent results.
Good real time performance for latency sensitive traffic argues for classification, since even a single packet on a low bandwidth link (or a heavily loaded link for which your share of bandwidth is low) is significant.
“Fairness” between applications is also essential. We should reduce/eliminate the current perverse incentives for applications to abuse the network, as HTTP does today. We’ve had an arms race conspiracy for the last decade between web browsers and web sites to minimize latency that is destructive to other traffic we may care about (such as telephony, teleconferencing and gaming). Sometimes this is best addressed by fixing protocols to be both more efficient and more friendly to the network, as HTTP/1.1 pipelining and now SPDY are intended to do. But the “web site sharding” problem is impossible for clients to avoid.
“Fairness” across users. To an ISP, fairness is/should be between paying customers and not individual users: inside the house in a home router, fairness is between users (or other policies the home user wishes to enforce; e.g. guest traffic might only be allowed to use 10% of my network if my network is busy). It might also mean (since devices are often associated with users), “fairness” between devices.
“Fairness” is also between different flows of different RTT’s. TCP itself is not “fair”. TCP makes no guarantees of “fairly” dividing bandwidth between flows of different RTT’s, nor can it solve the fairness problem between users; wishing it would do so is futile and counterproductive. BitTorrent may have a hundred flows simultaneously, and it isn’t “fair” for a background protocol to compete with the traffic of others in my house, or even with interactive or real time traffic of my own on my own computer. The basic observations are:

One size does not fit all
Exactly what “fairness” means depends on location
“Fairness” may also mean that heavy users won’t compete with my traffic at busy times of day, or if they have already used too much bandwidth, or…
“Fairness” is in the eye of the beholder.

Since “fairness” cannot be guaranteed by TCP, AQM, necessary as it is, cannot be the entire solution for reliable low latency applications to flourish.
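For readers who have not seen the AQM piece of the puzzle, the core idea of CoDel can be sketched in a few lines. This is a deliberately simplified illustration, not the reference implementation: the class name and method are hypothetical, the 5 ms target and 100 ms interval are the commonly cited defaults, and details such as how the drop count is carried over between dropping episodes are omitted.

```python
import math

TARGET = 0.005    # seconds of standing queue delay considered acceptable
INTERVAL = 0.100  # seconds; roughly a worst-case RTT the queue should tolerate


class CoDelSketch:
    """Simplified CoDel-style drop decision based on packet sojourn time."""

    def __init__(self):
        self.first_above = None  # when delay must still be high for us to act
        self.dropping = False    # are we currently in the dropping state?
        self.count = 0           # drops in the current dropping episode
        self.drop_next = 0.0     # time of the next scheduled drop

    def _ok_to_drop(self, sojourn, now):
        if sojourn < TARGET:
            self.first_above = None      # queue drained: forget the episode
            return False
        if self.first_above is None:
            self.first_above = now + INTERVAL
            return False
        return now >= self.first_above   # above target for a whole interval

    def should_drop(self, sojourn, now):
        """Call at dequeue with the packet's time in queue; True means drop."""
        ok = self._ok_to_drop(sojourn, now)
        if self.dropping:
            if not ok:
                self.dropping = False
                return False
            if now >= self.drop_next:
                self.count += 1
                # Control law: drops get closer together while delay persists.
                self.drop_next += INTERVAL / math.sqrt(self.count)
                return True
            return False
        if ok:
            self.dropping = True
            self.count = 1
            self.drop_next = now + INTERVAL / math.sqrt(self.count)
            return True
        return False


# A queue stuck at 50 ms of delay, sampled every 10 ms for half a second.
codel = CoDelSketch()
drop_times = [t / 1000 for t in range(0, 500, 10)
              if codel.should_drop(0.050, t / 1000)]
print(drop_times)
```

The point of the sketch is that the decision depends only on how long packets sit in the queue, not on the queue length or link rate, which is what lets this style of AQM follow a link whose bandwidth varies.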

The “Edge” of the Internet

Systems/devices in the edge of the Internet have a fundamental advantage over an Internet core router: the ratio of CPU cycles available per packet is much, much more favorable, and computation often comes for “free”, hidden behind cache misses. Techniques that in previous decades were prohibitive, such as fair queuing, can be used even on very fast links. So, for example, experiments by Dave Taht and others in the bufferbloat project show that the fq_codel queue discipline is comparable in speed to Linux’s current pfifo_fast queue discipline, consuming only 2% of a modern CPU at 10GigE speeds. On current home router hardware with GigE Ethernet, fq_codel profiles similarly well. “Fair” queuing as a default queue discipline is therefore now feasible in all edge hosts and devices.

Today’s Internet violates the principle of “least surprise.” Inexpert users (particularly in remote locations such as New Zealand) are baffled when transfers have vastly different results because competing transfers have very different RTT’s. We can solve this “surprising” behavior, which most Internet users don’t understand (and today pester their ISP’s about), using fair queuing.

Fair queuing has many other good features: it naturally prioritizes short lived flows and flows which are not elephants (which may be DNS lookups, DHCP requests, TCP opens, etc.) without requiring explicit classification rules, nor does it require knowledge of the insides of encrypted packets. The early results for fq_codel look wonderful, even without other classification rules or diffserv support, and improve upon CoDel’s behavior in many ways beyond keeping queues short overall. Andrew McGregor reports “phenomenal” results in New Zealand using the fq_codel queue discipline and port based QoS classification rules.
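Why fair queuing favors short flows without any classification rules can be seen in a toy scheduler. The sketch below uses plain deficit round robin over per-flow queues; it is not fq_codel itself, and the flow names, packet sizes, and quantum are made up for illustration.

```python
from collections import OrderedDict, deque

QUANTUM = 1500  # assumption: bytes of credit each flow earns per round


def drr_schedule(packets):
    """Return the transmit order for (flow_id, size_bytes) packets.

    All packets are assumed to be queued already, in arrival order.
    """
    queues = OrderedDict()
    for flow, size in packets:
        queues.setdefault(flow, deque()).append(size)
    deficit = {flow: 0 for flow in queues}

    sent = []
    while queues:
        for flow in list(queues):          # copy: we delete while iterating
            queue = queues[flow]
            deficit[flow] += QUANTUM
            # Send from this flow only while it has credit left.
            while queue and queue[0] <= deficit[flow]:
                size = queue.popleft()
                deficit[flow] -= size
                sent.append((flow, size))
            if not queue:
                del queues[flow]
                deficit[flow] = 0
    return sent


# Twenty bulk packets arrive ahead of one small DNS query.
arrivals = [("bulk", 1500)] * 20 + [("dns", 80)]
order = drr_schedule(arrivals)
dns_position = order.index(("dns", 80))
print("FIFO position of DNS packet:", arrivals.index(("dns", 80)))
print("DRR position of DNS packet: ", dns_position)
```

In a single FIFO the DNS packet waits behind the whole bulk burst; with per-flow queues it leaves after at most one bulk packet, and nothing in the scheduler needed to know it was DNS.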

Since fairness is in the eye of the beholder, “fair” queuing will be different at different locations in the edge of the network. The queues may very well be keyed differently on a per application, per user, per machine/device, or per customer basis depending on exactly how close to the “edge” of the network you are located and the desired policy. Home routers have a particularly complex problem: “fair” should probably be measured by “air time” to individual stations, rather than bytes, and may also need to enforce bandwidth allocation as well (e.g. for guest networks). AQM is also “interesting”: to keep total latency low when there are multiple active stations, AQM will need to be run across multiple queues to these active stations.
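One way to picture “keyed differently” is that the same packets fall into a different number of queues depending on which fields the scheduler hashes. The packet fields, the `customer_id` tag, and the addresses below are hypothetical and serve only to illustrate the idea.

```python
from collections import namedtuple

Packet = namedtuple(
    "Packet", "src_ip dst_ip proto src_port dst_port customer_id")


def key_per_flow(p):
    """One queue per transport flow (the classic 5-tuple)."""
    return (p.src_ip, p.dst_ip, p.proto, p.src_port, p.dst_port)


def key_per_device(p):
    """One queue per local device, e.g. inside a home router."""
    return p.src_ip


def key_per_customer(p):
    """One queue per paying customer, e.g. at an ISP's edge."""
    return p.customer_id


def count_queues(packets, key_fn):
    return len({key_fn(p) for p in packets})


packets = [
    Packet("192.168.1.10", "203.0.113.5", "tcp", 50001, 443, "cust-A"),
    Packet("192.168.1.10", "203.0.113.5", "tcp", 50002, 443, "cust-A"),
    Packet("192.168.1.10", "203.0.113.9", "tcp", 50003, 443, "cust-A"),
    Packet("192.168.1.20", "198.51.100.7", "udp", 5060, 5060, "cust-A"),
]

for name, fn in [("flow", key_per_flow),
                 ("device", key_per_device),
                 ("customer", key_per_customer)]:
    print(f"queues when keyed per {name}: {count_queues(packets, fn)}")
```

Keyed per flow, the device opening three connections gets three shares of the link; keyed per device, it gets one, which is closer to what a household would call fair.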

Diffserv & Classification

Even with both AQM and “fair” queuing, Diffserv (and specific classification tricks) are still necessary. Using diffserv immediately solves the BitTorrent problem covered in another blog post: AQM, by ensuring latency is kept reasonably low, defeats Ledbat’s attempt to stay out of the way of TCP traffic.

Applications like BitTorrent (or, for that matter, sharing links with services that you may provide from home), may use many (even hundreds) of flows. For some applications (not BitTorrent!), these may even be short flows, and therefore compete strongly with interactive applications, such as web surfing. Without a “hint” that these particular services should be queued at higher, or lower priority than your interactive use, you cannot prioritize your link’s usage properly. Whether this “hint” is via Diffserv, or via port numbers is not the issue here; one way or the other we need to both have the intention of the traffic (e.g. scavenger, interactive, real-time sensitive), and properly handle the situation. Fair queuing by itself, while very helpful, does not solve this problem, particularly for real time traffic.

Some traffic is really, really time sensitive, but may not have an assigned port number: diffserv marking handles this case nicely. To meet real time application performance (e.g. VOIP) on a busy home or conference wireless network we need to be very careful, and use facilities such as 802.11e QOS queues to minimize latency. Other applications (e.g. backup) may be very happy to just scavenge bandwidth. Such applications need to be deployable easily, without expecting everyone to update their network environment or users to visit their home routers and set up explicit port rules. Again, diffserv marking handles this case very nicely.
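From the application side, marking is a one line socket option. The sketch below assumes a platform where Python exposes `socket.IP_TOS` (Linux does; other systems may ignore or restrict it), and the choice of EF for voice and CS1 for scavenger traffic is a common convention rather than anything the network is obliged to honor. The helper names are hypothetical.

```python
import socket

DSCP_EF = 46   # Expedited Forwarding: commonly used for voice
DSCP_CS1 = 8   # Class Selector 1: commonly used for background/scavenger


def dscp_to_tos(dscp: int) -> int:
    """The DSCP occupies the upper six bits of the old IPv4 TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in six bits")
    return dscp << 2


def open_marked_udp_socket(dscp: int) -> socket.socket:
    """Create a UDP socket whose outgoing packets carry the given DSCP.

    Assumption: IPv4, and an OS that lets unprivileged code set IP_TOS.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


print("TOS byte for EF: ", dscp_to_tos(DSCP_EF))
print("TOS byte for CS1:", dscp_to_tos(DSCP_CS1))
```

A VOIP client would call something like `open_marked_udp_socket(DSCP_EF)` and a backup tool `open_marked_udp_socket(DSCP_CS1)`; neither needs a port rule configured in the home router.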

Some will say diffserv is not deployed, but this belief is incorrect. The gaming industry noticed that Linux’s PFIFO_FAST queue discipline (which is the Linux default, and therefore the default in most of today’s home routers) honors diffserv marking, and is using it today to improve real time performance. Some SIP ATA adapters also implement diffserv marking.

Diffserv has an Achilles heel: if users do not have control over whether diffserv marking is honored in the home router (the “diffserv domain” in its terminology), vendors and software may “game” diffserv to the point of uselessness. Home routers MUST have facilities to detect diffserv marking, both so users can control its use and to enable push-back on software and hardware vendors who abuse diffserv’s intent: that a network owner be able to control their own network.

Flies in the Ointment

Unless bufferbloat is fixed (by deploying CoDel), to achieve even mediocre latency today you must severely bandwidth shape broadband service, which also defeats features like Comcast’s Powerboost. Good utilization of bandwidth that you’ve already paid good money for is impossible until those links are debloated. But AQM by itself can’t solve transient bufferbloat at all at a bottleneck.

The line rate bursts of packets arrive at the broadband head-end, and are typically dumped into a single queue (which today suffers badly from bufferbloat). ISP’s provision their telephony and possibly other services onto separate queues, to which you, as a customer, have no access. As discussed in another blog article, ISP’s (unintentionally though I believe it was) currently have a fundamental advantage over others in providing many new Internet services. If you care about innovation in the Internet, you must care about this problem deeply.

Diffserv was designed before current broadband systems deployed. Broadband effectively “split” the responsibility for the network between the ISP and the user; you do not have full control over your “diffserv domain”. You may have control of what order packets are provided upstream, but downstream, you don’t (since that occurs in the ISP). And since the broadband link is also often the bottleneck link, combined with the technology changes I noted in the introduction, we have a fundamental issue. How does the user regain control of incoming traffic and therefore their own network, and prioritize the traffic properly, whether by port number or otherwise?

There are (at least!) three possible solutions to this problem. Other ideas are welcome!

Build some protocol by which a home router can communicate to the broadband head end its intentions. This seems like a lot of work; it is related to the work the IETF Port Control Protocol working group has underway.
Andrew McGregor suggests that the broadband head-end, by observing how upstream traffic is marked with diffserv marking, could invert the process in the broadband head-end going downstream. Monitoring flows is no longer the issue it once was, and this idea should be explored by those expert in that area.
Wes Felter, in a comment on this post noted that a subset of OpenFlow might be another solution.
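The second idea above, mirroring upstream marks onto downstream traffic, can be sketched as a small flow table. Everything here is hypothetical: the class, the 5-tuple layout, and the absence of any state expiry or abuse protection, all of which a real head-end would need.

```python
def reverse(flow):
    """Swap the endpoints of a (src_ip, src_port, dst_ip, dst_port, proto)."""
    src_ip, src_port, dst_ip, dst_port, proto = flow
    return (dst_ip, dst_port, src_ip, src_port, proto)


class MarkMirror:
    """Remember DSCP marks seen upstream; apply them to the return path."""

    def __init__(self):
        self._downstream_marks = {}

    def observe_upstream(self, flow, dscp):
        # Index by the reversed tuple so downstream lookups are direct.
        self._downstream_marks[reverse(flow)] = dscp

    def classify_downstream(self, flow, default=0):
        return self._downstream_marks.get(flow, default)


mirror = MarkMirror()

# The customer's VOIP adapter marks its outgoing packets EF (46).
voip_up = ("192.0.2.10", 5060, "203.0.113.5", 5060, "udp")
mirror.observe_upstream(voip_up, 46)

# Replies on the same flow inherit the mark; unrelated traffic does not.
voip_down = ("203.0.113.5", 5060, "192.0.2.10", 5060, "udp")
bulk_down = ("198.51.100.7", 443, "192.0.2.10", 50001, "tcp")
print(mirror.classify_downstream(voip_down))
print(mirror.classify_downstream(bulk_down))
```

The appeal of this approach is that the customer expresses intent using marks they already control upstream, so no new signalling protocol between home router and head-end is required.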

That only a single queue is available on most broadband links is clearly broken. You would like to be able to guarantee other traffic cannot interfere with VOIP, gaming or music playing, for example. A home router can work around this problem upstream to a good extent, but downstream, as outlined above, not so much.

Whether diffserv marking should have any effect inside the ISP’s network (the ISP’s diffserv domain) is outside this discussion, which is entirely about using diffserv and multiple queues in the broadband devices to help fix the queuing problems at the broadband edge, where the queuing problems are most acute.

That broadband technology often already has support for these queues (e.g. DOCSIS flows), which ISP’s can use exclusively for their own services, is galling. That you have no way of having these queues, even for pay, without buying the ISP’s service is a fundamental network neutrality issue, in my view dangerous to the Internet’s long term innovation and health.

Summary and Conclusions

CoDel, fair queuing, and diffserv and conventional classification comprise the fundamental materials for building a reliable low latency Internet.

A huge amount of engineering and deployment work remains. It is vital to understand that there is no single “magic bullet” to drain our current swamp. To achieve the goal of a low latency, predictable, reliable, high performance Internet, however, all of the techniques above need to be brought to bear.

This is a “systems integration” problem of first magnitude.

This entry was posted on June 26, 2012 at 9:32 pm and is filed under Bufferbloat, Networking, Puzzle.

10 Responses to “The Internet is Broken, and How to Fix It”

Wes Felter (@wmf) Says:
June 26, 2012 at 10:14 pm
I like the idea of using a restricted subset of OpenFlow to allow customers to classify downstream traffic. This should require much less state than full conntracking.
Jake Says:
June 27, 2012 at 3:39 am
Take a look at delay-based congestion control. There were some early experiments that showed that Vegas is not competitive with other schemes without AQM, but FAST is.

Click to access FAST-ToN-final-060209-2007.pdf

http://netlab.caltech.edu/FAST/

Full disclosure: I work at Fastsoft, so I’m biased. (And I don’t mean to claim that our products will solve every transport-based performance problem. I’m still working on that.)

But I encourage you to read the papers and follow the math. If you have a way to have your downloads coming only from servers that use delay-based congestion control, that gives you another solution to the realtime performance problems caused by large buffers. (As far as I can tell, that’s not an idea in conflict with CoDel, btw.)
- gettys Says:
  June 27, 2012 at 10:07 am
  “If you have a way to have your downloads coming only from servers that use delay-based congestion control,”
  but you don’t in the real world.
  
  While delay based congestion control may be useful, I don’t understand how to get everyone using it, even if it works fine (which it may). All it takes is one TCP flow to a server that does not use it to ruin your whole day.
  - Jake Says:
    June 27, 2012 at 2:35 pm
    OK, but by the same token, you don’t have AQM and sane diffserv in the real world either.
    
    Your proposed solution involves “make and deploy a new protocol to communicate intent from the home router to the broadband head end”. Perhaps I’m misunderstanding, but I’m confused as to how that’s an easier problem to solve than “get most servers to use delay-based congestion control”.
    
    And of course it doesn’t ruin your whole day to download a 5-packet text article, or most of the ajax junk you get when you leave random pages open. We’re mostly talking about being able to make phone calls while your computer decides to get software updates, and your kids are watching a movie. Right?
    
    If that’s typically more or less accurate, then although it’s true that one big download from a server that hasn’t upgraded can be a problem, the more servers that are using delay-based congestion control, the fewer problems you should have. So I’m just saying you might want to include it in the list of solutions worth considering.
    - Simon Farnsworth Says:
      June 28, 2012 at 11:20 am
      The key difference comes from considering game theory (and the Prisoner’s Dilemma); if I convince my ISP to do good AQM and Diffserv just for me (no other users), the world is a better place for me, and no worse for you. As the presence of good AQM and Diffserv spreads, more of us benefit, but people who haven’t switched can’t make things worse for those networks that have. There’s no PD here, as “cooperating” (by deploying good AQM or good Diffserv or both) is always better than defecting (staying in the bufferbloat status quo), regardless of what other people do.
      
      If I switch all my servers to delay based congestion control, the world is a worse place for me, until everyone else follows my lead. This is a Prisoner’s Dilemma; if we all co-operated by deploying delay based congestion control everywhere, it would be a net improvement. However, if you defect (by sticking to current congestion control methods), the best option for me is to defect too; if I cooperate and deploy delay based CC when you haven’t, your services are going to have an advantage over mine at every bottleneck link.
      
      Further, this gives me an incentive to undo my deployment of delay based congestion control – I will get an advantage over you until you do the same. This makes delay based congestion control everywhere inherently unstable – you lose if someone else does packet loss based congestion control.
    - Jake Says:
      June 28, 2012 at 9:56 pm
      That was true for Vegas, but is mostly false for FAST. Just because it’s delay-based doesn’t mean that loss-based will inherently be faster.
      
      The reason is that cwnd can grow more rapidly under FAST, and still avoid wrecking itself by exceeding line capacity, the way loss-based algorithms would if they grew too fast. So when an AIMD algorithm like Reno or BIC hits loss and cuts its cwnd in half, FAST can take over that newly available bandwidth faster than the loss-based one does. Vegas lost to Reno mostly because it didn’t grow to where it should be fast enough, not because the idea is inherently broken.
      
      It’s true that FAST will also back off gradually as BIC grows, but BIC will still overreach and get its cwnd cut in half again before long, and it will usually happen sooner the 2nd time around when it’s sharing a bottleneck with a FAST flow (there is some oscillation, but see http://www.omikk.bme.hu/collections/phd/Villamosmernoki_es_Informatikai_Kar/2010/Sonkoly_Balazs/tezis_eng.pdf for an analysis of the fairness characteristics).
Alex Says:
June 27, 2012 at 4:16 am
Have you read Bob Briscoe’s re-ecn stuff? That takes the ‘system integration’ perspective – trying to get the system model right, rather than adding mechanism piecemeal. See http://tools.ietf.org/id/draft-briscoe-tsvwg-re-ecn-tcp-motivation-02.txt
- gettys Says:
  June 27, 2012 at 10:03 am
  I read it a while back, and need to read it again. But ECN is a big topic (particularly when it comes to the global network, and this piece primarily looks at the neglected edge of the network), and I decided not to try to make an already too long piece yet longer. I also needed to get this piece out.
Revenge of the TOE Says:
September 27, 2012 at 3:27 pm
[…] Ultimately, we resolved the issue by adding a policy map in the firewall for the web server with the sysadmin disabling TSO in the linux kernel, but I’m not even sure these were good choices. Especially because TOE/TSO is supposed to be a performance enhancing feature. Just seemed to be the most expeditious choice when a production service was intermittently unavailable with lots of unhappy users. Guess we were just collateral damage in another bufferbloat drive-by. Maybe Jim Gettys is right and the internet really is broken. […]
- gettys Says:
  September 27, 2012 at 4:07 pm
  There are a bunch of issues related to TOE/TSO.
  
  Anything that causes a single process to send a large number of line rate packets back to back onto the wire is just making burstiness in the Internet worse (and there are already phenomena that tend to coalesce packets into bursts as it is). So such bursts cross the Internet, where they enter their customer’s single, bloated, stupid broadband device.
  
  Part of why we like fq_codel so much is that its natural behavior is to break up such bursts at any bottleneck, reducing the burst problem.
  
  Needless to say, this is not good for latency, and may cause collateral damage of all sorts.
  
  Note that people are not watching latency (particularly from the customer’s end). That is as important as, or more important than, bandwidth per CPU cycle: web sites are “stickier” the faster they are *to the user of that web site*. Getting packets out of your data center with the fewest CPU cycles may not achieve that goal, and sometimes achieves the opposite.
  
  We’ve been around the mulberry bush many times on offloading TCP itself since the 1980’s. But this means you get two maintenance headaches: both your OS, and the firmware in the NIC. Not a good idea.
  
  It is also a classic example of “if a little is good, a lot must be better”. Any of these techniques may be helpful at scale 2 or 4 (they provide most of their benefit). But then people seem to think “it must be even better to turn the knob to 11″… It seldom is… But then the knob gets turned anyway.

jg's Ramblings