
Investigate 200-300ms increase in responseStart.p75
Closed, Resolved · Public

Description

I was taking a brief look at our past data to see what the impact was of enabling BBR congestion control and lowering notsent_lowat.

This is something we did in May 2017 (BBR for edge caches) and June 2016 (lowering tcp notsent_lowat), but it was recently written about by several companies in the tech industry (Cloudflare blog, Spotify blog, Dropbox, etc.).
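
For reference, both of those are ordinary Linux sysctls, so they are easy to confirm on a host. Below is a minimal sketch (Python, purely illustrative; the expected values in the comments are assumptions on my part, not a record of what is actually deployed on the cache hosts):

```python
# Minimal sketch: read the two kernel settings mentioned above via /proc/sys.
# Both sysctls are standard Linux knobs; the values in the comments are
# illustrative assumptions, not what is necessarily configured in production.
from pathlib import Path

def read_sysctl(name: str) -> str:
    # sysctl names map to /proc/sys paths with dots replaced by slashes
    return Path("/proc/sys", name.replace(".", "/")).read_text().strip()

print(read_sysctl("net.ipv4.tcp_congestion_control"))  # "bbr" if BBR is active
print(read_sysctl("net.ipv4.tcp_notsent_lowat"))       # e.g. "16384"; "4294967295" means unset
```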

I gave up on finding the exact point, in favour of maybe revisiting it later if/when we write about this on the blog with more data, and perhaps after further fine-tuning of notsent_lowat.

But... I did find this: around 30 October 2017, our time-to-first-byte seemingly regressed overnight from 320ms to 500ms (desktop p75) and from 500ms to 750ms (mobile p75).

[Graph: mobile responseStart p75]
[Graph: desktop responseStart p75]
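
(For context: responseStart is the Navigation Timing mark for when the first byte of the main document's response arrives, reported here as a millisecond offset from navigationStart, so it includes redirect, DNS, TCP/TLS connect and server time. The p75 figures above are the 75th percentile over all beacons in a time window; a toy sketch of that aggregation, with made-up sample values:)

```python
# Toy sketch of the aggregation behind the graphs above (invented numbers):
# each beacon reports responseStart as a millisecond offset from navigationStart,
# and the dashboard value is the 75th percentile over a window of beacons.
import statistics

response_start_ms = [280, 310, 320, 340, 360, 420, 480, 510]  # invented samples

# statistics.quantiles with n=4 returns the three quartile cut points;
# index 2 is the 75th percentile.
p75 = statistics.quantiles(response_start_ms, n=4)[2]
print(f"responseStart p75: {p75:.0f} ms")
```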

Below is an excerpt of Server Admin Log entries between Oct 27 and Nov 1 (https://tools.wmflabs.org/sal/production?p=0&q=&d=2017-11-01) that may be relevant:

2017-11-01
16:01 bblack: lvs1003 - puppet disabled, testing experimental ethtool ringbuffer change
14:55 bblack: strongswan experiment done, cp* back to puppet-agent-enabled
14:09 bblack: cp*: disabling puppet to test strongswan change...
2017-10-31
21:00 XioNoX: removing old AMS-IX IPv6 - T167840
13:57 bblack: caches@eqiad - upgrade nginx to 1.13.6-2+wmf1~jessie1
13:54 bblack: caches@esams - upgrade nginx to 1.13.6-2+wmf1~jessie1
13:22 bblack: caches@codfw - upgrade nginx to 1.13.6-2+wmf1~jessie1
13:14 bblack: caches@ulsfo - upgrade nginx to 1.13.6-2+wmf1~jessie1
12:24 bblack: cp4025: restart varnish-be for mailbox lag
12:23 bblack: cp4023: restart varnish-be for mailbox lag
12:21 bblack: esams primary lvses (3001-2): disable LRO,pause on eth0 (under pybal stopped briefly)
12:19 bblack: ulsfo primary lvses (4001-2): disable LRO,pause on eth0 (under pybal stopped briefly)
12:12 bblack: esams+ulsfo backup lvses (3003-4,4003-4): disable LRO,pause on eth0
2017-10-30
15:20 XioNoX: re-enabling Zayo transit in eqiad
08:51 ema: cp4022: restart varnish-be for mbox lag
2017-10-29
23:49 ema: powercycle cp4024
12:54 ema: cp4026: restart varnish-be for mbox lag
2017-10-28
21:03 bblack: cp1067 (current target cache): disabling the relatively-new VCL that sets do_stream=false if !CL on applayer fetches...
19:39 bblack: backend restart on cp1065
18:39 bblack: restarting varnish backend on cp1053 to move the lag/503 issues to another box and buy more time to debug
18:28 bblack: cp4025 - restart backend for mailbox lag (upload@ulsfo, unrelated to text-cluster issues)
18:21 bblack: cp1053 - manual VCL change, backends appservers+api_appservers, reduce connect/firstbyte/betweenbytes timeouts from 5/180/60 to 3/20/10
16:51 elukey: restart varnish backend on cp1055 - mailbox lag + T179156
2017-10-27
16:16 bblack: cp1054 varnish backend restarted (was 503s / bad-conns target of ongoing issues)
16:10 XioNoX: deactivating BGP sessions to Zayo in eqiad (flapping)
15:46 bblack: restart varnish-backend on cp4022 (upload@ulsfo) - mailbox
14:49 bblack: turn on cp4024 port on asw-ulsfo
13:52 bblack: reboot cp4021 to clean up oom messes
13:49 bblack: restarting nginx on cp4021, without NUMA memory constraints
11:36 ema: cp4023: varnish-backend-restart for lag
03:02 bblack: cp1067, cp4026 - backend restarts, mailbox lag

Event Timeline

Krinkle created this task. Oct 17 2018, 6:26 PM
Restricted Application added a project: Operations. Oct 17 2018, 6:26 PM
Restricted Application added a subscriber: Aklapper.
ema moved this task from Triage to General on the Traffic board. Oct 22 2018, 8:34 AM
jijiki triaged this task as Medium priority. Oct 26 2018, 10:22 AM
Peter added a subscriber: Peter. Oct 29 2018, 4:21 PM
Imarlier assigned this task to Gilles. Nov 5 2018, 9:16 PM
Imarlier moved this task from Inbox to Doing on the Performance-Team board.
Gilles added a comment (edited). Nov 6 2018, 10:37 AM

I've found something very interesting. If you plot both navtiming and navtiming2, navtiming2 for mobile is in direct continuation of the old trend, without any regression (it's improving over time, even!):

For desktop, however, it's a different story (but still a different value than legacy navtiming):

Given that this is around the time navtiming2 was rolled out, I think the prime suspect is changes to NavigationTiming around that time. That might also be why we didn't notice the "regression": we were busy making big changes to the Navigation Timing code, updating the alerts, etc.

This 2017-11-01 SAL entry seems noteworthy:

22:33 Krinkle: group2(all-wikidata) wikis to wmf.5 from 24 hours ago seems to have caused a 60% drop in navigation timing metric report count (100/min => 40/min)

Let's look at the report rate, and we find this very telling event:

Zooming in:

If we were collecting only about half as many reports before, it's easy to imagine that we were missing a chunk of data prior to all the navtiming changes around that period.

1.31.0-wmf.6 was deployed on 2017-11-01 to group 1 and on 2017-11-02 to group 2

Unfortunately the git tags are gone and the wiki page that normally lists changes is empty:

https://www.mediawiki.org/wiki/MediaWiki_1.31/wmf.6

Now, it's not hard to imagine that these NavigationTiming changes made the cut:

https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/commit/e95fbc81c4f6938f47674f275728799176332047

https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/commit/1265a36a10c663296eda1d33ba3f76bc4926b56f

And that, basically, from that point forward we started collecting a whole lot of data we had been discarding before, which would explain a sudden change in the aggregate NavigationTiming metrics.
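
To make that concrete, here is a toy illustration (invented numbers, not our data) of how admitting beacons that an earlier check had been discarding can raise the aggregate p75 even though no individual page view got slower:

```python
# Toy illustration (invented numbers): the aggregate p75 shifts upward purely
# because the accepted population changes, not because pages got slower.
import statistics

def p75(samples):
    return statistics.quantiles(samples, n=4)[2]

accepted_before = [300, 320, 340, 360, 380]         # beacons the old code kept
previously_discarded = [600, 650, 700, 750]         # beacons the old check dropped

print(p75(accepted_before))                          # p75 of the filtered population
print(p75(accepted_before + previously_discarded))   # p75 once everything is counted
```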

All signs point to a change in the metrics collection, in my opinion. @Krinkle would you agree? Is it worth investigating this further?

Krinkle closed this task as Resolved. Nov 7 2018, 5:13 AM

This 2017-11-01 SAL entry seems noteworthy:

22:33 Krinkle: group2(all-wikidata) wikis to wmf.5 from 24 hours ago seems to have caused a 60% drop in navigation timing metric report count (100/min => 40/min)

Let's look at the report rate, and we find this very telling event: [..]

Aha, indeed. That was T179555, fixed by rOPUP10f9fc7b0242: navtiming: Removed old secureConnectionStart check that throw away metrics, which we deployed via Puppet the day it happened.
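
For anyone reading along later, the removed check was roughly of this shape; the snippet below is a hypothetical reconstruction for illustration, not the actual navtiming code in rOPUP10f9fc7b0242:

```python
# Hypothetical reconstruction (not the real navtiming.py): an overly strict
# sanity check on secureConnectionStart that rejects the whole beacon, so the
# affected page views never reach the aggregate metrics at all.
def is_sane(event):
    secure_start = event.get("secureConnectionStart")
    # Old behaviour (sketch): beacons without a usable secureConnectionStart
    # were discarded outright instead of just ignoring that one field.
    if not secure_start or secure_start > event.get("responseStart", 0):
        return False
    return True

# The fix removed this gate, so such beacons now count toward responseStart p75.
```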

Nice!