
Save Timing increased 50% since 2018-06-28 20:53
Closed, Resolved (Public)

Description

https://grafana.wikimedia.org/dashboard/db/save-timing?orgId=1&from=1530210478408&to=1530230285624

0d 11h 3m 12s 	3/3 	CRITICAL: https://grafana.wikimedia.org/dashboard/db/save-timing-alerts is alerting: Backend time p75 [ALERT] alert, Frontend save timings [ALERT] alert, Frontend save timings p75 [ALERT] alert.

This also happens to be the time of the train deployment.

Event Timeline

Those do correlate with:

  • 19:16 "rebuilt and synchronized wikiversions files: Group1 (less commons) to 1.32.0-wmf.10"
  • 20:53 "rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.10"

But there wasn't a (noticeable?) change when only Commons was added to wmf.10:

  • 20:29 "rebuilt and synchronized wikiversions files: commonswiki to 1.32.0-wmf.10 otra vez"

There was a bump in save timing when we first rolled group1 to wmf.10, but it wasn't huge and went down even before we reverted: https://grafana.wikimedia.org/dashboard/db/save-timing?orgId=1&from=1530115200000&to=1530140400000

@Tgr @daniel could the changes made to address the issue have caused a save timing regression?

@Imarlier / @Krinkle: thoughts on next steps from your point of view?

A slight rise in save timing was to be expected, since we are now writing to five tables instead of three, but 50% is a LOT. We are talking about something like 250ms, right?

I don't see how tgr's fix could increase the time it takes to save things - in fact, it did exactly the opposite. Saving often timed out without the fix. It's curious that this fact didn't show up in the save timing stats.

@Tgr @daniel could the changes made to address the issue have caused a save timing regression?

Not in any real way, I'd guess it's an artifact of how failed saves are (not) counted.

Can we get something more specific like a flame graph? The logic of saving pages did not get vastly more complex, so I'd expect it to be due to some (probably inadvertent) change in locking strategies, or in what gets deferred and until when. Or maybe external storage changes? I believe that has changed from a join to an extra lookup in some cases (in preparation for treating it as a service and not as a DB table), though I'm not sure whether that affects WMF wikis.

Is that for network connections unexpectedly aborted by the caller? Do you think it could be something like DB connections timing out? Other than that, I don't see how it could be related to MCR.

Change 443408 had a related patch set uploaded (by Gergő Tisza; owner: Aaron Schulz):
[mediawiki/core@master] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443408

Change 443408 merged by jenkins-bot:
[mediawiki/core@master] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443408

Change 443439 had a related patch set uploaded (by Daniel Kinzler; owner: Aaron Schulz):
[mediawiki/core@wmf/1.32.0-wmf.10] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443439

Change 443439 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.10] Fix incorrect arguments to prepareContent() call in WikiPage

https://gerrit.wikimedia.org/r/443439

Mentioned in SAL (#wikimedia-operations) [2018-07-02T18:41:15Z] <niharika29@deploy1001> Synchronized php-1.32.0-wmf.10/includes/page/WikiPage.php: Fix incorrect arguments to prepareContent() call in WikiPage T198483 (duration: 00m 51s)

A performance diff on the (somewhat erratic) debug host: diff - save times are down considerably. On Grafana, the old save time is around 480ms and the current one 570ms - up 20% from pre-MCR levels (instead of 50%, as before Aaron's fix), which still seems a bit high.

Imarlier moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.
Imarlier added a subscriber: aaron.

@aaron is going to be keeping an eye on this, but to be sure: @Tgr are you and others looking at this as well?

Yeah, the frontend save timing has recovered; it's the backend save timing that still seems slightly higher.

[Screenshot: Grafana save-timing dashboard, panel 5, last 7 days]

[Screenshot: Grafana save-timing dashboard, panel 5, last 7 days (second capture)]

I'm not sure how that is possible; maybe backend save timing includes the stashedit API? The API timings did not change:

[Screenshot: Grafana API requests dashboard, stashedit module, p99, last 7 days]

but those do not include deferred actions AFAIK.

@aaron is going to be keeping an eye on this, but to be sure: @Tgr are you and others looking at this as well?

I'm not sure where to look. Do we have some sort of function-level profiling that could be compared over time?

Judging from https://performance.wikimedia.org/xenon/svgs/daily/2018-06-26.api.svgz (pre-28th) and https://performance.wikimedia.org/xenon/svgs/daily/2018-07-01.api.svgz , perhaps the prepared parser output is not getting reused between SpamBlacklist and doEditContent() as much as before.

Judging from https://performance.wikimedia.org/xenon/svgs/daily/2018-06-26.api.svgz (pre-28th) and https://performance.wikimedia.org/xenon/svgs/daily/2018-07-01.api.svgz , perhaps the prepared parser output is not getting reused between SpamBlacklist and doEditContent() as much as before.

Looking at this more, it seems that SpamBlacklist does cause the canonical parse output to be cached via getCanonicalParserOutput(), but the later doEditContent() call reaches saveRevision(), which unconditionally calls prepareContent() which, in turn, unconditionally clears the cache and then calls getCanonicalParserOutput().
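
To make the shape of that concrete, here is a minimal self-contained sketch; the class below is an illustrative stand-in (not the actual MediaWiki code), with only the method names taken from the description above:

```php
<?php
// Illustrative stand-in for the flow described above; not the actual MediaWiki classes.
class PreparedEditSketch {
	private ?string $parserOutput = null;

	// Analogue of getCanonicalParserOutput(): parse once, then reuse the cached result.
	public function getCanonicalParserOutput( string $wikitext ): string {
		if ( $this->parserOutput === null ) {
			$this->parserOutput = $this->expensiveParse( $wikitext );
		}
		return $this->parserOutput;
	}

	// Analogue of the problematic prepareContent() path: it resets state unconditionally,
	// so the output cached by an earlier pre-save hook (e.g. SpamBlacklist) is thrown away.
	public function prepareContent(): void {
		$this->parserOutput = null;
	}

	private function expensiveParse( string $wikitext ): string {
		echo "parsing...\n"; // stands in for the costly full page parse
		return strtoupper( $wikitext );
	}
}

$edit = new PreparedEditSketch();
$edit->getCanonicalParserOutput( 'some wikitext' ); // SpamBlacklist-style check: parses
$edit->prepareContent();                            // save path clears the cache...
$edit->getCanonicalParserOutput( 'some wikitext' ); // ...so the same text is parsed again
```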

greg changed the subtype of this task from "Release" to "Task". Jul 6 2018, 8:14 PM
Krinkle renamed this task from Saving timings up 50% since 2018-06-28 20:53 to Save Timing increased 50% since 2018-06-28 20:53. Jul 6 2018, 8:19 PM

Looking at this more, it seems that SpamBlacklist does cause the canonical parse output to be cached via getCanonicalParserOutput(), but the later doEditContent() call reaches saveRevision(), which unconditionally calls prepareContent() which, in turn, unconditionally clears the cache and then calls getCanonicalParserOutput().

See also T172941, which is about automatically detecting this particular kind of regression in the future.

Change 444459 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] WIP Use PageUpdater in EditPage

https://gerrit.wikimedia.org/r/444459

Change 444462 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444462

@aaron Thanks for spotting this!

If I see this correctly, the issue is slightly different: DerivedPageDataUpdater::prepareContent() does re-use the ParserOutput if it already has it, but the entire DerivedPageDataUpdater instance is replaced by newPageUpdater() before it gets a chance.
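
A minimal sketch of that difference, with illustrative stand-in classes (not the real DerivedPageDataUpdater or WikiPage code):

```php
<?php
// The updater itself reuses its cached output correctly...
class UpdaterSketch {
	private ?string $cachedOutput = null;

	public function prepareContent( string $wikitext ): string {
		if ( $this->cachedOutput === null ) {
			echo "parsing...\n"; // stands in for the costly full page parse
			$this->cachedOutput = strtoupper( $wikitext );
		}
		return $this->cachedOutput;
	}
}

// ...but the page-level code swaps in a fresh instance before the save,
// analogous to newPageUpdater() replacing the warm DerivedPageDataUpdater.
class PageSketch {
	private UpdaterSketch $updater;

	public function __construct() {
		$this->updater = new UpdaterSketch();
	}

	public function preSaveCheck( string $wikitext ): void {
		$this->updater->prepareContent( $wikitext ); // warms the cache
	}

	public function save( string $wikitext ): void {
		$this->updater = new UpdaterSketch();        // warm cache is lost here
		$this->updater->prepareContent( $wikitext ); // parses the same text again
	}
}

$page = new PageSketch();
$page->preSaveCheck( 'some wikitext' ); // first parse
$page->save( 'some wikitext' );         // second, redundant parse
```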

I have uploaded two patches to address this, one stop-gap, and one "real solution" which is WIP.

I wonder why this (or even the previous issue) does not show up on the edit stash dashboard? This should be significantly affecting stash hit ratios.

Change 444462 merged by jenkins-bot:
[mediawiki/core@master] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444462

Change 444648 had a related patch set uploaded (by Gergő Tisza; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.32.0-wmf.10] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444648

Change 444648 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.10] Avoid losing cached ParserOutput in doEditContent.

https://gerrit.wikimedia.org/r/444648

Mentioned in SAL (#wikimedia-operations) [2018-07-09T18:37:05Z] <catrope@deploy1001> Synchronized php-1.32.0-wmf.10/includes/page/WikiPage.php: Avoid losing cached ParserOutput in doEditContent (T198483) (duration: 00m 51s)

I wonder why this (or even the previous issue) does not show up on the edit stash dashboard? This should be significantly affecting stash hit ratios.

The extra parse seemed to mostly happen on API requests, possibly from bots, in which case $useStashed is false and no edit stash stats are collected (to avoid logging cases where the stash could never be useful and doesn't affect humans anyway).
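
A minimal sketch of that kind of guard, assuming nothing beyond the $useStashed flag mentioned above (the function and metric names are made up for illustration):

```php
<?php
// Stats are only recorded when the stash could have been consulted, so API/bot edits
// with $useStashed === false never appear on the edit-stash dashboard at all.
function recordEditStashOutcome( bool $useStashed, bool $cacheHit, callable $increment ): void {
	if ( !$useStashed ) {
		return; // no metric emitted for this edit
	}
	$increment( $cacheHit ? 'editstash.cache_hits' : 'editstash.cache_misses' );
}

$increment = function ( string $key ) {
	echo "increment $key\n";
};
recordEditStashOutcome( true, false, $increment );  // counted as a miss
recordEditStashOutcome( false, false, $increment ); // invisible to the dashboard
```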

Actually, the cache hit counts show the effect nicely; I just missed it the first time because the panel is called "save operations per second" (and actual saves would not be affected):

[Screenshot: Grafana edit-stash dashboard, panel 9, last 15 days]

Seems to be back to pre-MCR levels (save timing in general too, as far as I could tell by eyeballing it). I think this can be marked resolved (and in future we should watch the save timing and edit stash dashboards proactively when doing major MCR changes, although I'm not sure how much they tell us on non-production wikis).

watch the save timing and edit stash dashboards proactively

There is an Icinga alert based on Prometheus; that's why I filed this ticket. I'm not sure whether everything I did was OK: I wasn't sure who to notify or how, or whether setting this as a "release blocker" was the right thing to do. Please send me feedback so I can do better.

Seems to be back to pre-MCR levels (save timing in general too, as far as I could tell by eyeballing it).

Thanks to everyone who helped with finding all the subtle ways in which I screwed this up!

in future we should watch the save timing and edit stash dashboards proactively when doing major MCR changes, although I'm not sure how much they tell us on non-production wikis.

I'm afraid it will only show up in extreme cases. Save time varies widely depending on a large number of factors, so changes to the average that are significant but not extreme will only become apparent over a very large number of operations.
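
For a rough sense of scale, here is a back-of-the-envelope sketch; the significance and power thresholds and the coefficient of variation are illustrative assumptions, not measurements from this task:

```php
<?php
// Approximate sample size for detecting a relative shift in a noisy mean
// (two-sided z-test, 95% confidence, 80% power). Purely illustrative numbers.
function samplesNeeded( float $relativeShift, float $coefficientOfVariation ): int {
	$zAlpha = 1.96; // 95% confidence
	$zBeta  = 0.84; // 80% power
	return (int)ceil( ( ( $zAlpha + $zBeta ) * $coefficientOfVariation / $relativeShift ) ** 2 );
}

// A 50% jump stands out after only a handful of saves...
echo samplesNeeded( 0.50, 1.0 ), "\n"; // roughly 30 samples
// ...but a subtle 5% regression needs thousands of saves to become visible,
// which is why small regressions are unlikely to show up on small test wikis.
echo samplesNeeded( 0.05, 1.0 ), "\n"; // a few thousand samples
```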

@daniel It may be possible to change the visualization in order to catch these cases more proactively, even when changes to the average are pretty subtle. https://phabricator.wikimedia.org/phame/live/7/post/83/measuring_wikipedia_page_load_times/ has a bunch of good info. This is something that Perf could definitely set aside time to help with, if you were interested. (Though it's best to track that separately, I think, rather than in this ticket.)