
Save Timing increased 50% since 2018-06-28 20:53
Closed, Resolved (Public)

Description

https://grafana.wikimedia.org/dashboard/db/save-timing?orgId=1&from=1530210478408&to=1530230285624

0d 11h 3m 12s 	3/3 	CRITICAL: https://grafana.wikimedia.org/dashboard/db/save-timing-alerts is alerting: Backend time p75 [ALERT] alert, Frontend save timings [ALERT] alert, Frontend save timings p75 [ALERT] alert.

This also happens to be the time of the train deployment.

Event Timeline

Those do correlate with:

  • 19:16 "rebuilt and synchronized wikiversions files: Group1 (less commons) to 1.32.0-wmf.10"
  • 20:53 "rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.10"

But there wasn't a (noticeable?) change when only Commons was added to wmf.10:

  • 20:29 "rebuilt and synchronized wikiversions files: commonswiki to 1.32.0-wmf.10 otra vez"

There was a bump in save timing when we first rolled group1 to wmf.10, but it wasn't huge and went down even before we reverted: https://grafana.wikimedia.org/dashboard/db/save-timing?orgId=1&from=1530115200000&to=1530140400000

@Tgr @daniel could the changes made to address the issue have caused a save timing regression?

@Imarlier / @Krinkle: thoughts on next steps from your point of view?

A slight rise in save timing was to be expected, since we are now writing to five tables instead of three, but 50% is a LOT. We are talking about something like 250ms, right?

I don't see how tgr's fix could increase the time it takes to save things - in fact, it did exactly the opposite. Saving often timed out without the fix. It's curious that this fact didn't show up in the save timing stats.

@Tgr @daniel could the changes made to address the issue have caused a save timing regression?

Not in any real way, I'd guess it's an artifact of how failed saves are (not) counted.

Can we get something more specific like a flame graph? The logic of saving pages did not get vastly more complex, so I'd expect it to be due to some (probably inadvertent) change in locking strategies, or in what gets deferred and until when. Or maybe external storage changes? I believe that has changed from a join to an extra lookup in some cases (in preparation for treating it as a service and not as a DB table), though I'm not sure whether that affects WMF wikis.

Is that for network connections unexpectedly aborted by the caller? Do you think it could be something like DB connections timing out? Other than that, I don't see how it could be related to MCR.

Change 443408 had a related patch set uploaded (by Gergő Tisza; owner: Aaron Schulz):
[mediawiki/core@master] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443408

Change 443408 merged by jenkins-bot:
[mediawiki/core@master] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443408

Change 443439 had a related patch set uploaded (by Daniel Kinzler; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.10] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443439

Change 443439 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.10] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443439

Mentioned in SAL (#wikimedia-operations) [2018-07-02T18:41:15Z] <niharika29@deploy1001> Synchronized php-1.32.0-wmf.10/includes/page/WikiPage.php: Fix incorrect arguments to prepareContent() call in WikiPage T198483 (duration: 00m 51s)

A performance diff on the (somewhat erratic) debug host: diff - save times are down considerably. On Grafana, the old save time is around 480ms and the current one 570ms - up 20% from pre-MCR levels (instead of 50%, as before Aaron's fix), which still seems a bit high.

Imarlier moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.
Imarlier added a subscriber: aaron.

@aaron is going to be keeping an eye on this, but to be sure: @Tgr are you and others looking at this as well?

Yeah, the frontend save timing has recovered; it's the backend save timing that still seems slightly higher.

[Screenshot: Grafana save-timing dashboard, panel 5, last 7 days]

[Screenshot: Grafana save-timing dashboard, panel 5, last 7 days (second capture)]

I'm not sure how that is possible; maybe backend save timing includes the stashedit API? The API timings did not change:

[Screenshot: Grafana API requests dashboard, stashedit module, p99, last 7 days]

but those do not include deferred actions AFAIK.

@aaron is going to be keeping an eye on this, but to be sure: @Tgr are you and others looking at this as well?

I'm not sure where to look. Do we have some sort of function-level profiling that could be compared over time?

Judging from https://performance.wikimedia.org/xenon/svgs/daily/2018-06-26.api.svgz (pre-28th) and https://performance.wikimedia.org/xenon/svgs/daily/2018-07-01.api.svgz , perhaps the prepared parser output is not getting reused between SpamBlacklist and doEditContent() as much as before.

Judging from https://performance.wikimedia.org/xenon/svgs/daily/2018-06-26.api.svgz (pre-28th) and https://performance.wikimedia.org/xenon/svgs/daily/2018-07-01.api.svgz , perhaps the prepared parser output is not getting reused between SpamBlacklist and doEditContent() as much as before.

Looking at this more, it seems that SpamBlacklist does cause the canonical parse output to be cached via getCanonicalParserOutput(), but the later doEditContent() call reaches saveRevision(), which unconditionally calls prepareContent() which, in turn, unconditionally clears the cache and then calls getCanonicalParserOutput().
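
To make the shape of that concrete, here is a minimal self-contained sketch; the class below is an illustrative stand-in (not the actual MediaWiki code), with only the method names taken from the description above:

```php
<?php
// Illustrative stand-in for the flow described above; not the actual MediaWiki classes.
class PreparedEditSketch {
	private ?string $parserOutput = null;

	// Analogue of getCanonicalParserOutput(): parse once, then reuse the cached result.
	public function getCanonicalParserOutput( string $wikitext ): string {
		if ( $this->parserOutput === null ) {
			$this->parserOutput = $this->expensiveParse( $wikitext );
		}
		return $this->parserOutput;
	}

	// Analogue of the problematic prepareContent() path: it resets state unconditionally,
	// so the output cached by an earlier pre-save hook (e.g. SpamBlacklist) is thrown away.
	public function prepareContent(): void {
		$this->parserOutput = null;
	}

	private function expensiveParse( string $wikitext ): string {
		echo "parsing...\n"; // stands in for the costly full page parse
		return strtoupper( $wikitext );
	}
}

$edit = new PreparedEditSketch();
$edit->getCanonicalParserOutput( 'some wikitext' ); // SpamBlacklist-style check: parses
$edit->prepareContent();                            // save path clears the cache...
$edit->getCanonicalParserOutput( 'some wikitext' ); // ...so the same text is parsed again
```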

greg changed the subtype of this task from "Release" to "Task". Jul 6 2018, 8:14 PM
Krinkle renamed this task from Saving timings up 50% since 2018-06-28 20:53 to Save Timing increased 50% since 2018-06-28 20:53. Jul 6 2018, 8:19 PM

Looking at this more, it seems that SpamBlacklist does cause the canonical parse output to be cached via getCanonicalParserOutput(), but the later doEditContent() call reaches saveRevision(), which unconditionally calls prepareContent() which, in turn, unconditionally clears the cache and then calls getCanonicalParserOutput().

See also T172941, which is about automatically detecting this particular kind of regression in the future.

Change 444459 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] WIP Use PageUpdater in EditPage

https://gerrit.wikimedia.org/r/444459

Change 444462 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444462

@aaron Thanks for spotting this!

If I see this correctly, the issue is slightly different: DerivedPageDataUpdater::prepareContent() does re-use the ParserOutput if it already has it, but the entire DerivedPageDataUpdater instance is replaced by newPageUpdater() before it gets a chance.
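
A minimal sketch of that difference, with illustrative stand-in classes (not the real DerivedPageDataUpdater or WikiPage code):

```php
<?php
// The updater itself reuses its cached output correctly...
class UpdaterSketch {
	private ?string $cachedOutput = null;

	public function prepareContent( string $wikitext ): string {
		if ( $this->cachedOutput === null ) {
			echo "parsing...\n"; // stands in for the costly full page parse
			$this->cachedOutput = strtoupper( $wikitext );
		}
		return $this->cachedOutput;
	}
}

// ...but the page-level code swaps in a fresh instance before the save,
// analogous to newPageUpdater() replacing the warm DerivedPageDataUpdater.
class PageSketch {
	private UpdaterSketch $updater;

	public function __construct() {
		$this->updater = new UpdaterSketch();
	}

	public function preSaveCheck( string $wikitext ): void {
		$this->updater->prepareContent( $wikitext ); // warms the cache
	}

	public function save( string $wikitext ): void {
		$this->updater = new UpdaterSketch();        // warm cache is lost here
		$this->updater->prepareContent( $wikitext ); // parses the same text again
	}
}

$page = new PageSketch();
$page->preSaveCheck( 'some wikitext' ); // first parse
$page->save( 'some wikitext' );         // second, redundant parse
```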

I have uploaded two patches to address this, one stop-gap, and one "real solution" which is WIP.

I wonder why this (or even the previous issue) does not show up on the edit stash dashboard? This should be significantly affecting stash hit ratios.

Change 444462 merged by jenkins-bot:
[mediawiki/core@master] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444462

Change 444648 had a related patch set uploaded (by Gergő Tisza; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.32.0-wmf.10] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444648

Change 444648 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.10] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444648

Mentioned in SAL (#wikimedia-operations) [2018-07-09T18:37:05Z] <catrope@deploy1001> Synchronized php-1.32.0-wmf.10/includes/page/WikiPage.php: Avoid losing cached ParserOutput in doEditContent (T198483) (duration: 00m 51s)

I wonder why this (or even the previous issue) does not show up on the edit stash dashboard? This should be significantly affecting stash hit ratios.

The extra parse seemed to mostly happen on API requests, possibly from bots, in which case $useStashed is false and no edit stash stats are collected (to avoid logging cases where the stash could never be useful and doesn't affect humans anyway).
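
A minimal sketch of that kind of guard, assuming nothing beyond the $useStashed flag mentioned above (the function and metric names are made up for illustration):

```php
<?php
// Stats are only recorded when the stash could have been consulted, so API/bot edits
// with $useStashed === false never appear on the edit-stash dashboard at all.
function recordEditStashOutcome( bool $useStashed, bool $cacheHit, callable $increment ): void {
	if ( !$useStashed ) {
		return; // no metric emitted for this edit
	}
	$increment( $cacheHit ? 'editstash.cache_hits' : 'editstash.cache_misses' );
}

$increment = function ( string $key ) {
	echo "increment $key\n";
};
recordEditStashOutcome( true, false, $increment );  // counted as a miss
recordEditStashOutcome( false, false, $increment ); // invisible to the dashboard
```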

Actually, the cache hit counts show the effect nicely; I just missed it the first time because the panel is called "save operations per second" (and actual saves would not be affected):

[Screenshot: Grafana edit-stash dashboard, panel 9, last 15 days]

Seems to be back to pre-MCR levels (save timing in general too, as far as I could tell by eyeballing it). I think this can be marked resolved (and in future we should watch the save timing and edit stash dashboards proactively when doing major MCR changes, although I'm not sure how much they tell us on non-production wikis).

watch the save timing and edit stash dashboards proactively

There is an Icinga alert based on Prometheus; that's why I filed this ticket. I'm not sure whether everything I did was OK: I wasn't sure who to notify or how, or whether setting this as a "release blocker" was the right thing to do. Please send me feedback so I can do better.

Seems to be back to pre-MCR levels (save timing in general too, as far as I could tell by eyeballing it).

Thanks to everyone who helped with finding all the subtle ways in which I screwed this up!

in future we should watch the save timing and edit stash dashboards proactively when doing major MCR changes, although I'm not sure how much they tell us on non-production wikis.

I'm afraid it will only show up in extreme cases. Save time varies widely depending on a large number of factors, so changes to the average that are significant but not extreme will only become apparent over a very large number of operations.
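
For a rough sense of scale, here is a back-of-the-envelope sketch; the significance and power thresholds and the coefficient of variation are illustrative assumptions, not measurements from this task:

```php
<?php
// Approximate sample size for detecting a relative shift in a noisy mean
// (two-sided z-test, 95% confidence, 80% power). Purely illustrative numbers.
function samplesNeeded( float $relativeShift, float $coefficientOfVariation ): int {
	$zAlpha = 1.96; // 95% confidence
	$zBeta  = 0.84; // 80% power
	return (int)ceil( ( ( $zAlpha + $zBeta ) * $coefficientOfVariation / $relativeShift ) ** 2 );
}

// A 50% jump stands out after only a handful of saves...
echo samplesNeeded( 0.50, 1.0 ), "\n"; // roughly 30 samples
// ...but a subtle 5% regression needs thousands of saves to become visible,
// which is why small regressions are unlikely to show up on small test wikis.
echo samplesNeeded( 0.05, 1.0 ), "\n"; // a few thousand samples
```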

@daniel It may be possible to change the visualization in order to catch these cases more proactively, even when changes to the average are pretty subtle. https://phabricator.wikimedia.org/phame/live/7/post/83/measuring_wikipedia_page_load_times/ has a bunch of good info. This is something that Perf could definitely set aside time to help with, if you were interested. (Though it's best to track that separately, I think, rather than in this ticket.)