
Investigate backend save timing regression starting at 2019-04-08 19:15:00
Closed, Resolved · Public

Description

Backend save timing went from ~750ms to ~850ms right after a MW train deploy.

See https://grafana.wikimedia.org/d/000000085/save-timing?panelId=11&fullscreen&orgId=1&from=now-2d&to=1554960315783

See https://grafana.wikimedia.org/d/000000429/backend-save-timing-breakdown?refresh=5m&panelId=15&fullscreen&orgId=1&from=now-20d&to=1554789372029

19:41 marxarelli: dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24
19:35 marxarelli: starting promotion of 1.33.0-wmf.24 to group1

I'm also not sure why SAL and the Grafana tags are off by so many minutes (it's not like it was a slow full-scap).

I also notice an earlier bump around 2019-04-07 01:00:00. No scap tags exist around then and SAL is empty.

save-timing.png

Event Timeline

Krinkle triaged this task as High priority.

And it keeps rising, with another major increase on 2019-04-21 around 12:30 UTC. That increased save timing (p75) by half a second (from 2.0s to 2.5s).

Alert dashboard
Screenshot 2019-04-25 at 23.07.15.png

Front-end only. Backend-timing remained constant.

Overview and breakdown
Screenshot 2019-04-26 at 00.05.28.png
Breakdown and rate
Screenshot 2019-04-25 at 23.08.20.png

This isolates the issue to group2 wikis (we monitor: enwiki, frwiki, ruwiki). It also shows that the report rate has not changed, which rules out any skew due to a change in sampling size or a similar metric-collection problem.

The Server Admin Log for that day shows no entries around that time (it was a Saturday).

Based on past experience, I figured I'd check AbuseFilter logs as that is one of the few ways that an on-wiki change can significantly impact save timing for lots of users at once.

AbuseFilter admin log on Meta-Wiki shows no changes that day.

AbuseFilter admin log on en.wikipedia.org shows a change right on the spot where the regression started.

12:40, 20 April 2019 MusikAnimal (talk | contribs) modified filter 944 (details)

The filter is private though, so I'm taking this investigation further in private as well until we can confirm or rule it out.

Okay, I've ruled out this filter. With the help of MusikAnimal, I found that temporarily disabling/reverting it had no impact on save timing.

Also, zooming in on the graph, it may have started closer to 13:20, not 12:30.

Screenshot 2019-04-26 at 00.01.20.png

I've checked frwiki and ruwiki as well, and found no filter changes around that time. Re-checked enwiki and metawiki too. Nothing. So, probably something else then.

… come to think of it, if it was AbuseFilter then it'd show up in backend-timing as well, which it doesn't.

Looking at front-end perf overall (for regular page views), responseStart/TTFB looks fine. Nothing major that would explain a half-second regression.

Screenshot 2019-04-26 at 00.11.28.png

The original cause of this task was deploy-related and was probably fixed in d256b472f73956ee8e2503e0254a1107baa1f00a. The later timing increase is probably from something else on-wiki.

Change 510291 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510291
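
For anyone following along who hasn't read the patch: the gist of the restored check is to only hand values to APCu as-is when apc.serializer is PHP's native serializer, and to serialize/unserialize explicitly otherwise. A minimal sketch under that assumption (not the actual APCUBagOStuff code; SimpleApcuCache is a made-up name for illustration):

```php
<?php
// Illustrative sketch only, not the actual APCUBagOStuff implementation.
// Idea: pass values straight to APCu only when apc.serializer is the
// native 'php' serializer; otherwise serialize/unserialize explicitly.
class SimpleApcuCache {
	/** @var bool Whether APCu itself uses PHP's native serializer */
	private $nativeSerialize;

	public function __construct() {
		// May be 'php', 'igbinary', or 'default' depending on build/config
		$this->nativeSerialize = ( ini_get( 'apc.serializer' ) === 'php' );
	}

	public function set( string $key, $value, int $ttl = 0 ): bool {
		$stored = $this->nativeSerialize ? $value : serialize( $value );
		return apcu_store( $key, $stored, $ttl );
	}

	public function get( string $key ) {
		$stored = apcu_fetch( $key, $ok );
		if ( !$ok ) {
			return false;
		}
		return $this->nativeSerialize ? $stored : unserialize( $stored );
	}
}
```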

Change 510826 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.5] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510826

Change 510291 merged by jenkins-bot:
[mediawiki/core@master] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510291

Change 510826 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.5] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510826

Change 511917 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: make detectLocalServerCache() prefer apcu over apc

https://gerrit.wikimedia.org/r/511917
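
For context, the change is about detection order: prefer the maintained APCu extension over the legacy APC one when auto-detecting a local-server cache. A hedged sketch of that ordering (a standalone function, not the actual ObjectCache::detectLocalServerCache() code):

```php
<?php
// Illustrative sketch of the ordering change, not the real implementation:
// check for APCu before legacy APC, then fall back to other options.
function detectLocalServerCacheSketch(): string {
	if ( function_exists( 'apcu_fetch' ) ) {
		return 'apcu';
	} elseif ( function_exists( 'apc_fetch' ) ) {
		return 'apc';
	} elseif ( function_exists( 'wincache_ucache_get' ) ) {
		return 'wincache';
	}
	// No shared local-server cache available; use a per-request hash cache
	return 'hash';
}
```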

Change 511917 merged by jenkins-bot:
[mediawiki/core@master] objectcache: make detectLocalServerCache() prefer apcu over apc

https://gerrit.wikimedia.org/r/511917

Mentioned in SAL (#wikimedia-operations) [2019-05-24T14:55:04Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/libs/objectcache/: rMWd262078b194f / T220470 (duration: 01m 06s)

This was likely due to an APC change. Filing a separate task for the 4/20 group 2 regression (which seems out of band for deployments).