
Evaluate/Deploy TCP BBR when available (kernel 4.9+)
Closed, ResolvedPublic

Description

Basic BBR info here, lots of other analysis and blog posts to search for though: https://patchwork.ozlabs.org/patch/671069/

It should land in released upstream 4.9 kernels, at which point we can consider playing with it here. It sounds very promising, especially at the edge, but also possibly elsewhere in our infrastructure.

This task is just a reminder for when the right kernel arrives here.

Reading list:

Event Timeline

At the tech-mgmt meeting you mentioned this was underway; is there another phab task for it?

This is it. We're currently still testing/deploying the kernel that allows it to be enabled. After that we can do some testing/evaluation of BBR itself and report here. Our current thinking is that we'll enable it as the default congestion control for our public edge nodes, but probably not elsewhere in our network (app/db/etc in core DCs), since it's unclear whether it might lose to cubic in some corner cases on fast/local networks.
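
For reference, "enabling it as the default congestion control" on a Linux host generally boils down to two sysctls, with fq as the qdisc so that BBR's pacing works as intended on 4.9 kernels. This is only a minimal sketch, not the actual puppetization; the file name below is hypothetical:

# /etc/sysctl.d/70-bbr.conf  (hypothetical file, illustrative only)
# use fq so TCP pacing is handled by the qdisc
net.core.default_qdisc = fq
# switch the default TCP congestion control algorithm to BBR
net.ipv4.tcp_congestion_control = bbr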

@Gilles - FYI the kernel upgrades that were blocking this are done, and we're tentatively looking at turning on BBR on May 22, so that we have a week of post-switchback stats to compare when looking at NavTiming impact.

Change 351707 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] BBR Congestion Control

https://gerrit.wikimedia.org/r/351707

Interesting data on the topic of BBR under datacenter conditions (low latency 100GbE), possibly supporting the idea that it's not awful to enable it everywhere: https://groups.google.com/forum/#!topic/bbr-dev/U4nlHzS-RFA

Change 351707 merged by Ema:
[operations/puppet@production] BBR Congestion Control

https://gerrit.wikimedia.org/r/351707

APNIC has a good writeup here (first half is TCP history redux, second half goes into interesting details and new data on BBR): https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/

I haven't read a lot of documentation about BBR, but I'm wondering if it could help in a local LAN use case like the Hadoop cluster, where LibreNMS periodically notifies us that the switch ports are saturated due to sudden bursts of traffic (1G links IIRC). My understanding is that BBR focuses on RTT variations rather than packet loss, so I'm wondering if it could improve (or at least maintain) throughput while reducing the port/buffer saturation events.

I'd be happy to test it if what I wrote makes sense :)

There's not a lot of good data on how BBR behaves in datacenter-like networks (high bandwidth, low latency, low loss, etc). It's not really the use case it was designed for, and the reports from others have been mixed. It probably won't turn out completely awful or anything, but I don't know if it would actually fix the port saturation problem or not.

To try it out you need a few things:

  1. Nodes running 4.9 kernels (our latest jessie kernels)
  2. Enable fq qdisc on eth0: tc qdisc replace dev eth0 root fq
  3. Set the puppet hieradata key bbr_congestion_control: true and run puppet agent (this makes fq the default qdisc going forward and turns on BBR itself); a consolidated sketch of steps 2-3 follows below
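
A rough by-hand version of steps 2-3, assuming a 4.9+ kernel and a plain single-queue eth0 (the puppetization is what actually applies this on real hosts):

# step 2: fq qdisc on eth0
tc qdisc replace dev eth0 root fq
# roughly what the puppet change does: fq as the default qdisc going forward, and BBR itself
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
# verify: should print bbr
sysctl net.ipv4.tcp_congestion_control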

Also while I'm thinking about it - we should validate that the sysctl setting for fq as default qdisc "sticks" on reboot and isn't affected by some kind of ordering race...
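
A quick post-reboot sanity check for that (minimal sketch; fq/bbr here are just the values the puppetization is expected to set):

# after a reboot, confirm the sysctls stuck and the root qdisc actually picked up fq
sysctl net.core.default_qdisc              # expect: fq
sysctl net.ipv4.tcp_congestion_control     # expect: bbr
tc qdisc show dev eth0 | head -n1          # expect an fq (or mq-with-fq-children) root, not pfifo_fast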

where LibreNMS periodically notifies us that the switch ports are saturated due to sudden burst of traffic

LibreNMS doesn't report an issue, just that the port has been used at more than 80% of its capacity for a long time (30min). That's good to know, as it *could* be an issue and might require action (increasing uplinks, etc.), but it can also be normal operation (the network is there to be used).
Using BBR will not change that alert, as it will still try to use as much of the interface as possible anyway.

Thanks for the feedback. I thought it was more a problem of port capacity being completely used (100%) and buffers filling up; it now makes more sense not to try BBR for my use case.

On the reboot issue: I've tested cp4021 and the existing puppetization works fine on reboot (even given the other stuff below).

On the issue of enabling the fq stuff at runtime (to turn on BBR without a reboot...):

The simplistic tc qdisc replace dev eth0 root fq only works correctly for a simple/default case. Our cache/lvs hosts (maybe others) use our interface-rps script with bnx2x and bnx2 drivers to set up something more complicated, which ends up auto-creating multiple transmit queues per device. This is what the current picture (pre-BBR) looks like using cp4021 as an example:

root@cp4021:~# tc qdisc show dev eth0
qdisc mq 0: root 
qdisc pfifo_fast 0: parent :2d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :29 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :28 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :27 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :26 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :25 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :24 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :23 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :22 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :21 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :20 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1e bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :19 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :18 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :17 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :16 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :15 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :14 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :13 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :12 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :11 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :10 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :f bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :e bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :d bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :c bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :b bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :a bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :9 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

Executing the simplistic tc qdisc replace dev eth0 root fq wipes out all the above and leaves you with a setup that breaks the multi-queue stuff:

root@cp4021:~# tc qdisc replace dev eth0 root fq
root@cp4021:~# tc qdisc show dev eth0
qdisc fq 8001: root refcnt 47 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140

The correct way to turn on mq+fq for such hosts at runtime is to set the default qdisc to fq, and then replace the mq root with a fresh mq root (which will use the default qdisc for the multiple queues), like this:

root@cp4021:~# echo fq >/proc/sys/net/core/default_qdisc
root@cp4021:~# tc qdisc replace dev eth0 root mq
root@cp4021:~# tc qdisc show dev eth0
qdisc mq 8002: root 
qdisc fq 0: parent 8002:2d limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:2c limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:2b limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:2a limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:29 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:28 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:27 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:26 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:25 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:24 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:23 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:22 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:21 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:20 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:1f limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:1e limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:1d limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:1c limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:1b limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:1a limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:19 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:18 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:17 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:16 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:15 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:14 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:13 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:12 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:11 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:10 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:f limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:e limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:d limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:c limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:b limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:a limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:9 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:8 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:7 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:6 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:5 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:4 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:3 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:2 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
qdisc fq 0: parent 8002:1 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140

Mentioned in SAL (#wikimedia-operations) [2017-05-22T19:16:19Z] <bblack> BBR: cp1065: switching qdisc to mq+fq manually - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-22T19:25:17Z] <bblack> BBR: cp1065: switching congestion control to bbr manually - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-22T19:29:15Z] <bblack> BBR: cp1074: switching qdisc to mq+fq manually - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-22T19:30:47Z] <bblack> BBR: cp1074: switching congestion control to bbr manually - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-22T21:10:01Z] <bblack> BBR: cp1074: reverted back to cubic+pfifo_fast - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-22T21:11:01Z] <bblack> BBR: cp1065: reverted back to cubic+pfifo_fast - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-23T16:15:26Z] <ema> cp1074: enable prometheus node_exporter qdisc collector T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-23T16:43:11Z] <bblack> BBR: enabling mq+fq on cp1074 - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-23T16:49:12Z] <bblack> BBR: enabling bbr on cp1074 - T147569

Change 355276 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] caches: enable BBR tuned mq fq qdiscs

https://gerrit.wikimedia.org/r/355276

Change 355276 merged by BBlack:
[operations/puppet@production] caches: enable BBR tuned mq fq qdiscs

https://gerrit.wikimedia.org/r/355276

Mentioned in SAL (#wikimedia-operations) [2017-05-23T20:10:37Z] <bblack> enable BBR for all caches @ ulsfo - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-23T20:20:35Z] <bblack> enable BBR for all caches @ codfw - T147569

Mentioned in SAL (#wikimedia-operations) [2017-05-23T20:24:55Z] <bblack> enable BBR for all caches - T147569

Change 355356 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] r::c::perf - raise fq flow_limit to 300

https://gerrit.wikimedia.org/r/355356

Change 355356 merged by BBlack:
[operations/puppet@production] r::c::perf - raise fq flow_limit to 300

https://gerrit.wikimedia.org/r/355356
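
For context, flow_limit is a per-flow packet budget in the fq qdisc (visible as "flow_limit 100p" in the output above). On an mq root it has to be applied to each per-queue fq child; a hedged by-hand sketch, where 8002: is just the parent handle from the cp4021 example above (the real change goes through puppet):

# repeat for each per-queue fq child shown above (parent 8002:1 through 8002:2d)
tc qdisc replace dev eth0 parent 8002:1 fq flow_limit 300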

Change 355391 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: enable qdisc collector on cache hosts

https://gerrit.wikimedia.org/r/355391

Mentioned in SAL (#wikimedia-operations) [2017-05-24T09:49:39Z] <ema> upgrade prometheus-node-exporter on cache hosts to 0.14.0~git20170523-0 T147569

Change 355391 merged by Ema:
[operations/puppet@production] prometheus: enable qdisc collector on cache hosts

https://gerrit.wikimedia.org/r/355391

Change 355451 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] r::c::perf - FQ outbound flow rate cap @ 1Gbps

https://gerrit.wikimedia.org/r/355451

Change 355451 merged by BBlack:
[operations/puppet@production] r::c::perf - FQ outbound flow rate cap @ 1Gbps

https://gerrit.wikimedia.org/r/355451
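
Similarly hedged, the per-flow outbound rate cap corresponds to fq's maxrate parameter, again applied per fq child under the mq root (handle reused from the example above; the actual change is puppetized):

# cap each individual flow's pacing rate at roughly 1Gbps, per tx queue
tc qdisc replace dev eth0 parent 8002:1 fq flow_limit 300 maxrate 1gbit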

So I've stared at NavTiming graphs, and honestly it's hard to read any notable difference in the tea leaves.

I'm still investigating the off chance that for some reason our current bnx2x cards/drivers implement GSO in a way that may defeat the aims of the FQ scheduler, but I think that's unlikely at this point.
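
A quick way to inspect the offload state on a host while chasing that (hedged; this is just the standard ethtool feature listing, nothing specific to our drivers):

# show segmentation-offload related features on eth0
ethtool -k eth0 | egrep -i 'segmentation|gso|tso'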

For anyone else staring at tea leaves and such, the point in the graphs to look for as "turn-on time for BBR" was approximately 2017-05-23 20:40 UTC. In general it's hard to make out any immediate effects in the noise of the graphs unless they're fairly large, so I don't think it made a dramatic difference. We'll eventually reach a point where we have better daily/weekly average comparisons and can trawl a bit more. What we expected was mostly an improvement in the p95/p99 sorts of cases (dealing better with slow and/or lossy mobile, distant clients, etc).

There is an apparent performance improvement that coincides in timing, but on a simulated slow internet connection:

T166373: Investigate apparent performance improvement around 2017-05-24

The improvement is only experienced on the large articles + slow connection combo. Could that be it?

The last performance improvement we deployed that only affected large articles on slow internet connections (logo preloading) was also impossible to spot in NavigationTiming. Which doesn't mean that it doesn't benefit real users, just that it's lost in all the NavigationTiming noise.

@BBlack one way to verify that the performance improvement we're seeing is "real" would be to turn BBR off for a bit. That being said, it will still be a simulated slow connection and that alone doesn't tell us the effect in the real world, if any.

That said, the navigation timing metrics from the real world don't give us the best measurements for user experience, so I think both give valuable information. The problem with the test we run on mobile now is that we only use Chrome in emulated mode, so we don't use a real mobile device.

BBlack claimed this task.

There is an apparent performance improvement that coincides in timing, but on a simulated slow internet connection:

T166373: Investigate apparent performance improvement around 2017-05-24

The improvement is only experienced on the large articles + slow connection combo. Could that be it?

Yes, that looks like it (the first drop in Peter's graph in T166373#3294375).

The last performance improvement we deployed that only affected large articles on slow internet connections (logo preloading) was also impossible to spot in NavigationTiming. Which doesn't mean that it doesn't benefit real users, just that it's lost in all the NavigationTiming noise.

Right. Ideally it would be great to flip it off and on some more, but we've already stalled out a lot of other pending changes the past couple of weeks on the Traffic end. I'd like to close this up so we can move on with those other changes (which will add more performance noise). Even if we had nothing obviously-great in the graphs, I'd be willing to accept a "no visible regression" result and trust that Google's done their homework and BBR is helping in some cases we can't see. The results spotted in the other ticket are more than enough to move on and accept BBR as part of our new normal.

Also as of today, I think we're beginning to see it appear in the median p95 LoadEventEnd daily averages, even through the noise (check the right edge of the yellow line in: https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&orgId=1&var-metric=loadEventEnd&panelId=11&fullscreen ).