
Investigate > 40% Save Timing regression (2018-09-05)
Closed, Resolved · Public

Description

Frontend Save Timing

Median (coal)
Screen Shot 2018-09-25 at 05.56.47.png (570×1 px, 63 KB)
Median (statsd)
Screen Shot 2018-09-25 at 05.57.28.png (698×2 px, 193 KB)
p75 (statsd)
Screen Shot 2018-09-25 at 05.57.35.png (696×2 px, 196 KB)

Median:

  • June 2018 and earlier: ~850ms.
  • around 5 July 2018: Increased to ~900ms and remained raised.
  • around 8 August 2018: Increased again, to ~950ms and remained raised.
  • around 21 August 2018: Increased again, to ~1,000ms and remained raised.
  • around 5 September 2018: Increased again, to ~1,400ms and is still raised.

Backend Save Timing

Screen Shot 2018-09-25 at 06.02.34.png (700×2 px, 228 KB)

p75:

  • 27 June and earlier: ~530ms.
  • around 5 July 2018: Increased to ~630ms.
  • around 16 July 2018: Decreased back to ~400ms for a few days (?).
  • around 8 August 2018: Increased to ~650ms.
  • around 5 September 2018: Increased over 3 days, up to ~850ms.

Event Timeline

Krinkle renamed this task from Investigate > 40% Save Timing regression (2018-09- to Investigate > 40% Save Timing regression (2018-09-05). Sep 25 2018, 5:06 AM

TODO: What of our alerts?

The alerts fired, but we didn't act on them, and then 7 days later they settled back to the new "normal":

Screen Shot 2018-09-27 at 8.48.49 AM.png (536×2 px, 238 KB)

I think we can change or add another alert that triggers on a hard limit instead (as we have done on some other dashboards). In the upcoming Grafana 5.3.0, alerts will have reminders.

8 August correlates to 1.32.0-wmf.16 (log). For the other big increase, on 5 September, I'm not sure I can spot a specific change in the server admin log. @aaron is back on Monday, though.

> I think we can change or add another alert that triggers on a hard limit instead (as we have done on some other dashboards). In the upcoming Grafana 5.3.0, alerts will have reminders.

I assume the idea is that the reminders won't stop until acknowledged?

Looking at https://performance.wikimedia.org/xhgui/run/view?id=5bbfdc7c3f3dfaea44b5847c after a null edit on https://en.wikipedia.org/wiki/1857_in_Sweden, I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times, even though normal pages have only 1 slot...

> [...] I see MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached being hit 4 times, even though normal pages have only 1 slot...

@daniel, is something parsing more than once here when it's not supposed to?

@aaron I hope not, but it's possible. Needs investigation, I'll add it to the appropriate boards.

I'm pretty busy with TechConf preparation right now though, so please poke me if I lose track of this.

For reference: I see three extension methods that access the ParserOutput via WikiPage::prepareContentForEdit:

  • TemplateDataHooks::onPageContentSave
  • SimpleCaptcha::shouldCheck
  • SpamBlacklistHooks::filterMergedContent

All of these should be hitting a cached instance, but perhaps they are not for some reason. The caching logic in WikiPage is not nice. Perhaps it would be better to have an in-process cache in the RevisionRenderer service. That would be straightforward, but would not cache PST content for pre-PST content.
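
For illustration, an in-process cache along those lines might look like this. This is purely a sketch: the class name, key scheme, and wiring are assumptions, not the actual RevisionRenderer API.

    // Illustrative only: memoize rendered revisions within the process so that
    // repeated render requests during a single edit don't trigger re-parses.
    class InProcessRenderCache {
        /** @var RenderedRevision[] keyed by revision ID + options hash */
        private $cache = [];
        /** @var RevisionRenderer */
        private $renderer;

        public function __construct( RevisionRenderer $renderer ) {
            $this->renderer = $renderer;
        }

        public function getRenderedRevision( RevisionRecord $rev, ParserOptions $options ) {
            // Assumed key; a real implementation would also need to handle
            // unsaved revisions, which have no ID yet.
            $key = $rev->getId() . '|'
                . $options->optionsHash( ParserOptions::allCacheVaryingOptions() );
            if ( !isset( $this->cache[$key] ) ) {
                $this->cache[$key] = $this->renderer->getRenderedRevision( $rev, $options );
            }
            return $this->cache[$key];
        }
    }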

> For reference: I see three extension methods that access the ParserOutput via WikiPage::prepareContentForEdit. [...] All of these should be hitting a cached instance, but perhaps they are not for some reason.

I wonder if the re-parses are just due to the

$this->revision->getUser( RevisionRecord::RAW )->getName() !== $user->getName()

line for null edits. Still, it shouldn't need to re-parse for each of those callers, but it could explain one extra parse.
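
To illustrate how that guard could force a re-parse on a null edit (simplified; the variable names and surrounding code here are assumptions, not the actual implementation):

    // Simplified illustration: on a null edit, the "new" revision is the
    // existing one, so its recorded author can differ from the user performing
    // the null edit; the mismatch makes the cached output look non-reusable.
    $revisionAuthor = $this->revision->getUser( RevisionRecord::RAW )->getName();
    if ( $revisionAuthor !== $user->getName() ) {
        // Cached/stashed output is treated as stale for this user: re-parse.
        $output = $renderer->getRenderedRevision( $this->revision, $options )
            ->getRevisionParserOutput();
    }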

The first thing I just found looking into this is this bit in CategoryMembershipChangeJob:

		// This could possibly use the parser cache if it checked the revision ID,
		// but that's more complicated than it's worth.
		$output = $renderer->getRenderedRevision( $rev->getRevisionRecord(), $options )
			->getRevisionParserOutput();

So, this triggers an additional parse for every edit. This should probably be fixed to use the parser cache; there really is no good reason not to.
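
A hedged sketch of what that could look like inside the job (the service wiring and guards here are assumptions; the actual fix is change 473564, linked further down):

    // Sketch: consult the parser cache first and only re-render on a miss or a
    // revision mismatch, instead of unconditionally parsing. Assumes $page is
    // the WikiPage for this title.
    $parserCache = MediaWikiServices::getInstance()->getParserCache();
    $cached = $parserCache->get( $page, $options );

    if ( $cached && $cached->getCacheRevisionId() === $rev->getId() ) {
        $output = $cached; // reuse the already-rendered output
    } else {
        $output = $renderer->getRenderedRevision( $rev->getRevisionRecord(), $options )
            ->getRevisionParserOutput();
    }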

Another thing I found: SpamBlacklist causes a re-parse on every edit. Reason:

  • SpamBlacklist uses an EditFilterMergedContent hook, which gets an IContextSource, which is actually EditPage::context.
  • It then uses the WikiPage instance it gets from the IContextSource and calls prepareContentForEdit on it. prepareContentForEdit implements in-place caching, to allow the existing parsed version of the content to be reused while an edit is in progress. But this fails, because:
  • EditPage::context->wikipage is not EditPage::page! EditPage::page is the instance of WikiPage that has the cached state.
  • This means that SpamBlacklist's call to prepareContentForEdit triggers a re-parse.

To resolve this, we have to make sure that hook handlers get the correct WikiPage instance - the one held by EditPage::page. A first step could be to just call setWikiPage on $this->context. Another approach would be to make WikiPage::factory cache instances, so it returns the same WikiPage instance for a given title. Both will probably break something, somewhere.
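
A sketch of that first step (illustrative only; assumes the context is mutable, not the final patch):

    // Illustrative: inside EditPage, make the shared context carry the same
    // WikiPage instance that holds the cached prepared edit, so hook handlers
    // that reach the page via IContextSource hit the in-place cache.
    if ( $this->context instanceof MutableContext ) {
        $this->context->setWikiPage( $this->page );
    }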

Yet another approach would be to cache the PageUpdater globally, in some kind of PageUpdaterFactory - or to cache ParserOutputs in a new kind of ParserCache, one that is keyed not on the title or ID, but on the content hash. ApiStashEdit kind of does this already (but it's not MCR-aware yet).
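
An illustrative sketch of such a content-hash-keyed cache (the class and key scheme are hypothetical, loosely in the spirit of ApiStashEdit):

    // Hypothetical: cache ParserOutput objects by content hash plus parser
    // options, independent of page title or revision ID.
    class ContentHashParserOutputCache {
        /** @var ParserOutput[] */
        private $cache = [];

        private function key( Content $content, ParserOptions $options ) {
            return sha1( $content->serialize() ) . '|'
                . $options->optionsHash( ParserOptions::allCacheVaryingOptions() );
        }

        /** @return ParserOutput|null */
        public function get( Content $content, ParserOptions $options ) {
            return $this->cache[$this->key( $content, $options )] ?? null;
        }

        public function set( Content $content, ParserOptions $options, ParserOutput $output ) {
            $this->cache[$this->key( $content, $options )] = $output;
        }
    }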

Another note: it seems like the

$this->revision->getUser( RevisionRecord::RAW )->getName() !== $user->getName()

check works correctly for null edits.

Change 473564 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Use ParserCache in CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/473564

> Another thing I found: SpamBlacklist causes a re-parse on every edit.

Further investigation indicates that my analysis was false - SpamBlacklist does cause the content to be parsed, but that ParserOutput is then used later during the actual edit. At least, that's what's happening when I step through this on my local system.

As to CategoryMembershipChangeJob: I think a lot of parsing can be avoided by the patch I made, but that behavior was not new, so it does not explain the spike.

Since CategoryMembershipChangeJob runs via the job queue, wouldn't that have little effect on save timing itself? I guess it wouldn't hurt to optimize.

> Since CategoryMembershipChangeJob runs via the job queue, wouldn't that have little effect on save timing itself?

Yes, this is unrelated to "time the user waits".

So, the reason for the degrading save times is still unclear.

> Another note: it seems like the $this->revision->getUser( RevisionRecord::RAW )->getName() !== $user->getName() check works correctly for null edits.

Ooops, no it doesn't. I'll make a patch.

Change 473750 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Reuse DerivedPageDataUpdater during null-edits.

https://gerrit.wikimedia.org/r/473750

Change 473752 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Make the context page the edited page in EditPage

https://gerrit.wikimedia.org/r/473752

Change 473762 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add logging for redundant parsing to WikitextContent.

https://gerrit.wikimedia.org/r/473762

I think I found it. DerivedPageDataUpdater::prepareContent looks up the stashed ParserOutput, but does not use it. This line:

$this->canonicalParserOutput = $output;

is ineffective: canonicalParserOutput is undeclared and unused. Looks like I broke this when I introduced RevisionRenderer.
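
In outline, the fix (change 473780, below) has to hand the stashed output to the render pipeline instead of assigning it to a dead property. A simplified sketch; the accessor and property names here are hypothetical, though RenderedRevision does expose a setter for injecting output:

    // Simplified sketch: pass the stashed ParserOutput into the RenderedRevision
    // so subsequent render calls reuse it instead of re-parsing.
    $stashed = $this->getStashedParserOutput(); // hypothetical accessor
    if ( $stashed ) {
        $this->renderedRevision->setRevisionParserOutput( $stashed );
    }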

Change 473780 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Use stashed ParserOutput during saving.

https://gerrit.wikimedia.org/r/473780

Change 473750 merged by jenkins-bot:
[mediawiki/core@master] Reuse DerivedPageDataUpdater during null-edits.

https://gerrit.wikimedia.org/r/473750

Change 473752 merged by jenkins-bot:
[mediawiki/core@master] Make the context page the edited page in EditPage

https://gerrit.wikimedia.org/r/473752

@aaron https://gerrit.wikimedia.org/r/473780 should be good to go. You can merge it if you feel it's ready, or poke @Anomie or @Tgr for further review.

Change 473780 merged by jenkins-bot:
[mediawiki/core@master] Use stashed ParserOutput during saving.

https://gerrit.wikimedia.org/r/473780

Change 473762 merged by jenkins-bot:
[mediawiki/core@master] Add logging for redundant parsing to WikitextContent.

https://gerrit.wikimedia.org/r/473762

Change 473564 merged by jenkins-bot:
[mediawiki/core@master] Use ParserCache in CategoryMembershipChangeJob

https://gerrit.wikimedia.org/r/473564

Hey, that looks a lot better! Nice work, @daniel!

> Hey, that looks a lot better! Nice work, @daniel!

Thanks, but don't praise me too much for fixing something that I broke in a rather horrible and embarrassing way ;)