I took a brief look at our past data to see what impact enabling BBR congestion control and lowering notsent_lowat had.
We made these changes in May 2017 (BBR for edge caches) and June 2016 (lowering TCP notsent_lowat), and several companies in the tech industry have recently written about the same techniques (Cloudflare blog, Spotify blog, Dropbox, etc.).
I gave up on finding the exact point in the data. I may look at it again later if/when we write about this on the blog, with more data and perhaps as a result of further fine-tuning of notsent_lowat.
However, I did find this: around 30 October 2017, our time-to-first-byte seemingly regressed overnight from 320ms to 500ms (desktop p75) and from 500ms to 750ms (mobile p75).
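To put those numbers in relative terms, here is a minimal sketch that computes the percentage increase for each platform. The function and variable names are illustrative only; the figures are the p75 values quoted above.

```python
def relative_increase(before_ms, after_ms):
    """Return the increase from before_ms to after_ms as a percentage."""
    return (after_ms - before_ms) / before_ms * 100.0


# p75 time-to-first-byte before and after ~30 October 2017 (values from this report).
ttfb_p75 = {"desktop": (320, 500), "mobile": (500, 750)}

for platform, (before, after) in ttfb_p75.items():
    pct = relative_increase(before, after)
    print(f"{platform}: {before}ms -> {after}ms (+{pct:.0f}%)")
```

That works out to roughly a 56% increase on desktop and a 50% increase on mobile.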
Below is an excerpt of Server Admin Log entries from Oct 27 to Nov 1 (https://tools.wmflabs.org/sal/production?p=0&q=&d=2017-11-01) that may be relevant:
16:01 bblack: lvs1003 - puppet disabled, testing experimental ethtool ringbuffer change
14:55 bblack: strongswan experiment done, cp* back to puppet-agent-enabled
14:09 bblack: cp*: disabling puppet to test strongswan change...
21:00 XioNoX: removing old AMS-IX IPv6 - T167840
13:57 bblack: caches@eqiad - upgrade nginx to 1.13.6-2+wmf1~jessie1
13:54 bblack: caches@esams - upgrade nginx to 1.13.6-2+wmf1~jessie1
13:22 bblack: caches@codfw - upgrade nginx to 1.13.6-2+wmf1~jessie1
13:14 bblack: caches@ulsfo - upgrade nginx to 1.13.6-2+wmf1~jessie1
12:24 bblack: cp4025: restart varnish-be for mailbox lag
12:23 bblack: cp4023: restart varnish-be for mailbox lag
12:21 bblack: esams primary lvses (3001-2): disable LRO,pause on eth0 (under pybal stopped briefly)
12:19 bblack: ulsfo primary lvses (4001-2): disable LRO,pause on eth0 (under pybal stopped briefly)
12:12 bblack: esams+ulsfo backup lvses (3003-4,4003-4): disable LRO,pause on eth0
15:20 XioNoX: re-enabling Zayo transit in eqiad
08:51 ema: cp4022: restart varnish-be for mbox lag
23:49 ema: powercycle cp4024
12:54 ema: cp4026: restart varnish-be for mbox lag
21:03 bblack: cp1067 (current target cache): disabling the relatively-new VCL that sets do_stream=false if !CL on applayer fetches...
19:39 bblack: backend restart on cp1065
18:39 bblack: restarting varnish backend on cp1053 to move the lag/503 issues to another box and buy more time to debug
18:28 bblack: cp4025 - restart backend for mailbox lag (upload@ulsfo, unrelated to text-cluster issues)
18:21 bblack: cp1053 - manual VCL change, backends appservers+api_appservers, reduce connect/firstbyte/betweenbytes timeoues from 5/180/60 to 3/20/10
16:51 elukey: restart varnish backend on cp1055 - mailbox lag + T179156
16:16 bblack: cp1054 varnish backend restarted (was 503s / bad-conns target of ongoing issues)
16:10 XioNoX: deactivating BGP sessions to Zayo in eqiad (flapping)
15:46 bblack: restart varnish-backend on cp4022 (upload@ulsfo) - mailbox
14:49 bblack: turn on cp4024 port on asw-ulsfo
13:52 bblack: reboot cp4021 to clean up oom messes
13:49 bblack: restarting nginx on cp4021, without NUMA memory constraints
11:36 ema: cp4023: varnish-backend-restart for lag
03:02 bblack: cp1067, cp4026 - backend restarts, mailbox lag