
Investigate backend save timing regression starting at 2019-04-08 19:15:00
Closed, Resolved · Public

Description

Backend save timing went from ~750ms to ~850ms right after a MW train deploy.

See https://grafana.wikimedia.org/d/000000085/save-timing?panelId=11&fullscreen&orgId=1&from=now-2d&to=1554960315783

See https://grafana.wikimedia.org/d/000000429/backend-save-timing-breakdown?refresh=5m&panelId=15&fullscreen&orgId=1&from=now-20d&to=1554789372029

19:41 marxarelli: dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.24
19:35 marxarelli: starting promotion of 1.33.0-wmf.24 to group1

I'm also not sure why SAL and the Grafana tags are off by so many minutes (it's not like it was a slow full-scap).

I also notice an earlier bump around 2019-04-07 01:00:00. No scap tags exist around then and SAL is empty.

save-timing.png

Event Timeline

Krinkle triaged this task as High priority.

And it keeps rising, with another major increase on 2019-04-21 around 12:30 UTC. That increased save timing (p75) by half a second (from 2.0s to 2.5s).

Alert dashboard
Screenshot 2019-04-25 at 23.07.15.png

Front-end only. Backend-timing remained constant.

Overview and breakdown
Screenshot 2019-04-26 at 00.05.28.png
Breakdown and rate
Screenshot 2019-04-25 at 23.08.20.png

This isolates the issue to group2 wikis (we monitor: enwiki, frwiki, ruwiki). It also shows that the report rate has not changed, which rules out any skew due to a change in sampling size or a similar metric-collection problem.

The Server Admin Log for that day shows no entries around that time (it was a Saturday).

Based on past experience, I figured I'd check AbuseFilter logs as that is one of the few ways that an on-wiki change can significantly impact save timing for lots of users at once.

AbuseFilter admin log on Meta-Wiki shows no changes that day.

AbuseFilter admin log on en.wikipedia.org shows a change right on the spot where the regression started.

12:40, 20 April 2019 MusikAnimal (talk | contribs) modified filter 944 (details)

The filter is private though, so I'm taking this investigation further in private as well until we can confirm or rule it out.

Okay, I've ruled out this filter. With the help of MusikAnimal, I found that temporarily disabling/reverting it had no impact on save timing.

Also, zooming in on the graph, it may have started closer to 13:20, not 12:30.

Screenshot 2019-04-26 at 00.01.20.png

I've checked frwiki and ruwiki as well, and found no filter changes around that time. Re-checked enwiki and metawiki too. Nothing. So, probably something else then.

… come to think of it, if it was AbuseFilter then it'd show up in backend-timing as well, which it doesn't.

Looking at front-end perf overall (for regular page views), responseStart/TTFB looks fine. Nothing major that would explain a half-second regression.

Screenshot 2019-04-26 at 00.11.28.png

The original cause of this task was deploy-related and was probably fixed in d256b472f73956ee8e2503e0254a1107baa1f00a. The later timing increase is probably from something else on-wiki.

Change 510291 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510291
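
For anyone following along who hasn't read the patch: the gist of the restored check is to only hand values to APCu as-is when apc.serializer is PHP's native serializer, and to serialize/unserialize explicitly otherwise. A minimal sketch under that assumption (not the actual APCUBagOStuff code; SimpleApcuCache is a made-up name for illustration):

```php
<?php
// Illustrative sketch only, not the actual APCUBagOStuff implementation.
// Idea: pass values straight to APCu only when apc.serializer is the
// native 'php' serializer; otherwise serialize/unserialize explicitly.
class SimpleApcuCache {
	/** @var bool Whether APCu itself uses PHP's native serializer */
	private $nativeSerialize;

	public function __construct() {
		// May be 'php', 'igbinary', or 'default' depending on build/config
		$this->nativeSerialize = ( ini_get( 'apc.serializer' ) === 'php' );
	}

	public function set( string $key, $value, int $ttl = 0 ): bool {
		$stored = $this->nativeSerialize ? $value : serialize( $value );
		return apcu_store( $key, $stored, $ttl );
	}

	public function get( string $key ) {
		$stored = apcu_fetch( $key, $ok );
		if ( !$ok ) {
			return false;
		}
		return $this->nativeSerialize ? $stored : unserialize( $stored );
	}
}
```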

Change 510826 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@wmf/1.34.0-wmf.5] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510826

Change 510291 merged by jenkins-bot:
[mediawiki/core@master] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510291

Change 510826 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.5] objectcache: restore a simple version of the apc.serializer check in APCUBagOStuff

https://gerrit.wikimedia.org/r/510826

Change 511917 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: make detectLocalServerCache() prefer apcu over apc

https://gerrit.wikimedia.org/r/511917
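
For context, the change is about detection order: prefer the maintained APCu extension over the legacy APC one when auto-detecting a local-server cache. A hedged sketch of that ordering (a standalone function, not the actual ObjectCache::detectLocalServerCache() code):

```php
<?php
// Illustrative sketch of the ordering change, not the real implementation:
// check for APCu before legacy APC, then fall back to other options.
function detectLocalServerCacheSketch(): string {
	if ( function_exists( 'apcu_fetch' ) ) {
		return 'apcu';
	} elseif ( function_exists( 'apc_fetch' ) ) {
		return 'apc';
	} elseif ( function_exists( 'wincache_ucache_get' ) ) {
		return 'wincache';
	}
	// No shared local-server cache available; use a per-request hash cache
	return 'hash';
}
```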

Change 511917 merged by jenkins-bot:
[mediawiki/core@master] objectcache: make detectLocalServerCache() prefer apcu over apc

https://gerrit.wikimedia.org/r/511917

Mentioned in SAL (#wikimedia-operations) [2019-05-24T14:55:04Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.6/includes/libs/objectcache/: rMWd262078b194f / T220470 (duration: 01m 06s)

This was likely due to an APC change. Filing a separate task for the 4/20 group 2 regression (which seems out of band for deployments).