Bufferbloat in switches/bridges

I received the following question today from Ralph Droms.  I include an edited version of my response to Ralph.

On Thu, Jun 20, 2013 at 9:45 AM, Ralph Droms (rdroms) <rdroms@yyy.zzz> wrote:
Someone suggested to me that bufferbloat might even be worse 
in switches/bridges than in routers.  True fact?  If so, can 
you point me at any published supporting data?

It is hard to quantify whether switches or routers are “worse”, and I’ve never tried, nor seen any published systematic data. I wouldn’t believe such data if I saw it, anyway. What matters is whether you have unmanaged buffers before a bottleneck link.

[Image: Some puzzle pieces of a picture puzzle.]

I don’t have first-hand information (to just point you at particular product specs; I tend not to try to find out who is particularly guilty, as it can only get me in hot water if I compare particular vendors). I’ve generally dug into the technology to understand how and why the buffering I’ve seen is present.

You can go look at the specs of switches yourself and figure out from first principles which switches have problems.

Feel free to write a paper!

Here’s what I do know.

Ethernet Switches:

  • The simplest switch case is a 10G or 1G switch being operated at 1G or 100M; you end up 10x or 100x overbuffered. I’ve never seen a switch that cuts its internal buffering depending on line rate. God forbid you happen to have 10Mbit gear still in that network; Ethernet flow control can cause cascades between switches that reduce you to the lowest bandwidth….
  • Thankfully, enterprise switch gear does not emit Ethernet pause frames (though it honors them if received); but all the commodity switch chips I looked at, used in cheap unmanaged consumer switches, do generate pause frames. Sigh…
  • As I remember, when I described this kind of buffering problem to a high-end router expert in Prague, he started muttering “line cards” at me; it wouldn’t surprise me if the same situation is present in big routers supporting outputs at different line rates. But I’ve not dug into them.
  • We even got caught by this in CeroWrt, where the Ethernet bridge chip was misconfigured and, due to jumbo-grams, was initially accidentally 8x overbuffered (resulting in 80–100ms of latency through the local switch in a cheap router, IIRC; Dave Taht will remember the exact details).
  • I then went and looked at the data sheets of a bunch of integrated cheap switch chips (around 10 of them, as I remember): while some (maybe half) were “correctly” buffered (not that I regard any static configuration as correct!), some had 2-4x more SRAM in the switch chips than their bandwidth required. So even without the bandwidth-switching trap, sometimes the commodity switch chips have too much buffering. Without statistics on which chips are used in which products, it’s impossible to know how much equipment is affected (though all switches *should* run fq_codel or equivalent, IMHO, knowing what I know now)….
  • I hadn’t even thought about how VLANs interact with buffering until recently. Think about VLANs (particularly in combination with Ethernet flow control), and get a further Excedrin headache… About 6 months ago I talked to an engineer who had had terrible problems getting decent, reliable latency in a customer’s VOIP system. He tracked it down (miraculously) to the fact that the small business (less than 50 employees) was sharing an enterprise switch, using VLANs for isolation from the other tenants in the building. The other tenants sometimes saturated the switch, and the customer’s VLAN performance for their VOIP traffic would go to hell in a handbasket (see below about naive sysops not configuring different classes of service correctly). As the customer was a call center, you can imagine they were upset.

Ethernet actually has highly variable bandwidth: we can’t safely treat it as fixed! Yet switch designers routinely make this completely unwarranted assumption.
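To make the scaling trap concrete, here is a small sketch of how a fixed buffer translates into worst-case queuing delay as the negotiated line rate drops. The 1 MB of packet SRAM is a made-up illustrative figure, not any particular chip:

```python
# Worst-case queuing delay of a fixed-size switch buffer at different
# negotiated Ethernet line rates. The buffer size is hypothetical.
BUFFER_BYTES = 1_000_000  # 1 MB of packet SRAM (illustrative only)

for rate_bps in (10e9, 1e9, 100e6, 10e6):
    # Time to drain a completely full buffer at this line rate
    delay_ms = BUFFER_BYTES * 8 / rate_bps * 1000
    print(f"{rate_bps / 1e6:8.0f} Mbit/s -> up to {delay_ms:6.1f} ms of queuing delay")
```

A buffer that adds less than a millisecond at 10G adds 80 ms at 100 Mbit/s and nearly a second at 10 Mbit/s, which is exactly the 10x/100x trap described above.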

This is part of why I see conventional QOS as a dead end; most of the need for classic QOS goes away if we properly manage buffers in the first place. Our job as Internet engineers is to build systems that “just work”, that system operators can’t mis-configure or, even worse, that come from the factory mis-configured to fail under load (which is never properly tested at most customers’ sites).

Enterprise Ethernet Switches

Some enterprise switches sell additional buffer memory as a “feature”! And some of those switches require configuration of their buffer memory across various QOS classes; if you foolishly do nothing, some of them leave all memory configured to a single class and disaster ensues.

What do you think a naive sysop does???? Particularly one who listens to the switch vendor’s salesman or literature about the “feature” of more buffering to avoid dropping packets, and buys such additional RAM?

So the big disasters I’ve heard of involve those switches, where deluded, naive people have bought yet more buffer memory, particularly when they fail to configure the switches for QOS classes. That report came off the NANOG list, as I remember, but it was a couple of years ago and I didn’t save the message.

After reading that report I looked at the specs for two or three such enterprise switches and confirmed that this scenario was real, resulting in potentially *very* large buffering (multiple hundreds of milliseconds reaching even to seconds).  IIRC, one switch had decent defaults, but another defaulted to insane behavior.

So the NANOG report of such problems was not only plausible, but certain to happen, and I stopped digging further. Case closed. But I don’t know how common it is, nor whether it is more common than in the associated routers in the network.

Router Bufferbloat problems

I *think* the worst router problems are in home routers, where we have uncontrolled buffering (often 1280 packets’ worth) and highly variable bandwidth in front of the WiFi links. Classic AQM algorithms such as WRED are not present there, and even if they were, they would be of no use given the highly variable bandwidth. Home routers certainly sit at one of the common bottlenecks in the path and are therefore extremely common offenders. Whether they are better or worse than the broadband hop next to them is also impossible to quantify.

I’ve personally measured up to 8 seconds of latency in my own home without deliberate experiments. In deliberate experiments I can make the latency as large as you like. That’s why we like CoDel (fq_codel in particular) so much: it responds very rapidly to changes in bandwidth, which are perpetual in wireless. Fixing Linux and Linux’s WiFi stack is therefore where we’ve focused (not to mention that the code is available, so we can actually do the work rather than try to persuade clueless people of their mistakes, which is a difficult row to hoe). This is the case we seem to see the most often, along with the hosts and either side of the broadband hop.
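The 1280-packet figure makes it easy to see where multi-second latencies come from. A rough sketch, assuming full-size 1500-byte packets (real frame sizes and WiFi rates vary):

```python
# Worst-case drain time of a 1280-packet transmit queue at various
# WiFi rates. 1500-byte packets assumed; purely illustrative.
PACKETS, MTU = 1280, 1500

for rate_mbps in (1, 6, 54, 300):
    # Bits queued divided by the current link rate
    delay_s = PACKETS * MTU * 8 / (rate_mbps * 1e6)
    print(f"{rate_mbps:4d} Mbit/s -> {delay_s:6.2f} s of queue")
```

Since WiFi can fall back to its lowest rates under poor conditions, a full 1280-packet queue alone accounts for many seconds of delay at 1 Mbit/s, on the same order as the 8 seconds observed above.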

The depth and breadth of this swamp is immense. In short, there is bufferbloat everywhere: you have to be systematically paranoid…. 

But which bufferbloat problem is “worst” is, I think, unanswerable. Once we fix one problem, it’s whack-a-mole on the next, until the moral sinks home: any unmanaged buffer is one waiting to get you if it can ever be at a bottleneck link. Somehow we have to educate everyone that static buffers are landmines waiting for the next victim, and never acceptable.

7 Responses to “Bufferbloat in switches/bridges”

  1. Wes Felter Says:

    There’s a lot to touch on here. Since the inherent latency of 10G switches is very low (< 1 µs), more than one or two packets’ worth of buffering for a single flow can cause bloat. (Fortunately or unfortunately, the hosts introduce so much latency that switch bloat is hardly noticeable.) Large buffers really do help with incast patterns that occur in some data center workloads, since a few µs of buffering is less bad than an RTO timeout. Due to the length and conservatism of hardware design cycles, I suspect CoDel will not even be considered in switches for a while and we'll still be seeing RED for another 5 years or so.

    Sending pause frames is configurable in high-end switches and is mandatory in lossless (aka DCB) mode. We did some experiments with TCP over lossless 10G Ethernet and it fills up buffers in every switch along the path leading to even worse bloat. Because no packets are dropped it basically becomes "additive increase, never decrease" congestion control. There are a bunch of RED/ECN-based proposals like DC-TCP to fix this, and we found that they do help. As y'all know, changing TCP is an even slower process than changing qdiscs.

    • gettys Says:

      I don’t understand data centers well enough to know what is “right” for them. I need to see if I can get Andrew McGregor (who understands bufferbloat, switches, and data centers) to opine on the topic.

      CoDel is great for edge devices (or managing paths of unknown RTT’s expected to be Internet wide); but CoDel, in its current form, with its normal “do no harm” defaults doesn’t respond quickly enough for what you’d like inside a data center.

  2. tance Says:

    I’ve heard about the bufferbloat problem in relation to TCP (from ESR) before, and now from you. But I’ve been wondering: how do you calculate the “correct” amount of buffering for a particular device and link? What is the correct maximum latency to aim for?

    • tance Says:

      Since my last comment, I have read several more of your articles on bufferbloat, and I realised there isn’t a “quick fix” when it comes to buffer sizing.

      That said, I did forget to mention previously that I am currently more interested in buffering for a layer-2 device, and I almost certainly won’t be able to implement AQM. Are there any design guidelines for those of us designing new hardware without these algorithms? (Layer-2 or otherwise)

    • gettys Says:

      Heh. I’ve written on this topic before: there is no “right size” for buffers. The “traditional” rule of thumb has been the bandwidth delay product; but you often don’t know the bandwidth, nor the delay (nor the number of flows). I’m honestly not sure what I’d recommend; I will try to remember to talk with Andrew McGregor about what he’d recommend given what we know now.

      If you can’t implement something like fq_codel (which doesn’t react fast enough for data-center-only use), at least don’t fall into the trap of failing to adjust the buffer sizes for the bandwidth the Ethernet is operating at. So far, we don’t have an ideal algorithm that “just works” under all circumstances.
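      For reference, the bandwidth-delay-product rule of thumb mentioned above works out like this; a sketch only, using the classic textbook example of a 1 Gbit/s link and 100 ms RTT (both assumptions, not recommendations):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the bytes the path can hold 'in flight'."""
    return bandwidth_bps * rtt_s / 8

# Classic textbook sizing: 1 Gbit/s link, 100 ms RTT
print(bdp_bytes(1e9, 0.1))  # 12,500,000 bytes, i.e. 12.5 MB
```

      The catch, as the reply above notes, is that in practice you often know neither the bandwidth, nor the delay, nor the number of flows, so this single number is at best a starting point.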

      • tance Says:

        Thanks for the reply. I’ll be eager to hear what else you have to share once you’ve spoken to him.

        I suppose at minimum I should adjust based on the link speed, but then I still need to know what to adjust it to, so…

        But at least I now know not to just throw “as much memory as we can afford” at it.
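        One minimal way to do the link-speed adjustment discussed above is to derive the buffer limit from a target worst-case drain time. This is a sketch only; the 5 ms target is an arbitrary illustration, and a static target is still not “correct” in the sense discussed above, but it at least avoids the 10x/100x overbuffering trap:

```python
TARGET_DELAY_S = 0.005  # arbitrary worst-case drain-time target (5 ms)

def buffer_limit_bytes(link_rate_bps: float) -> int:
    """Largest buffer that still drains in TARGET_DELAY_S at this rate."""
    return int(round(link_rate_bps * TARGET_DELAY_S / 8))

# Recompute the limit whenever the link renegotiates its rate
for rate in (10e6, 100e6, 1e9, 10e9):
    print(f"{rate / 1e6:8.0f} Mbit/s -> {buffer_limit_bytes(rate):>10,} bytes")
```

        The key point is that the limit is recomputed whenever the negotiated link rate changes, rather than being fixed at design time for the fastest supported rate.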

  3. Frits Riep Says:

    Thanks for your analysis of the problem. I also think this is a very important topic. I work with small businesses and find there are many occasions where bufferbloat is a big problem, but it is not feasible to replace the router because of factors not really under our control (a carrier-provided router with credentials controlled by the provider – e.g. a Verizon FiOS router with MoCA providing TV video on demand or DVR control).

    It would be ideal in those circumstances to simply add a layer-2 device with bandwidth control and CoDel to control bufferbloat, leaving the IP subnet as is (no double NAT or subnet reconfiguration).

    I would think it would be possible to configure OpenWrt with SQM-Scripts to do this, and that would provide a simple solution. All of the existing network would stay as is, but the bufferbloat would be controlled. Do you have any recommendations, or could you point me to any write-up on how to configure such a setup?
