Page MenuHomePhabricator

Preemptively warm caches for Parsoid output
Closed, ResolvedPublic

Description

RESTBase actively purges and pre-generates HTML for use of VisualEditor.

With ParserCache we do not need to actively purge on page edits, entries will fall out of ParserCache automatically.
For template edits, we already refresh HTML for regular page view HTML, so we need to start doing the same for Parsoid HTML as well. For pre-generating cache entries - same, we already end up doing it for regular pages, we need to start doing it for parsoid-html.

Implementation idea: DerivedPageDataUpdater has a method called doParserCacheUpdate(). We could have a similar one called doParsoidCacheUpdate(), and call it in the same place that doParserCacheUpdate() is called. The implementation of doParsoidCacheUpdate() would be similar to the code in ParsoidHTMLHelper.

WARNING: this needs to be configurable per wiki, or even per namespace, and needs to be off by default. The reason is that this will use a lot of space in the parser cache backend, and we have to do this incrementally to ensure that we are not running out of disk space.

Event Timeline

Krinkle subscribed.

I'd like this to be discussed with Perf and ServiceOps first.

RESTBase distinguished itself from the status quo when it was introduced in many ways, some of those differences were experiments we since validated and have or are planning to apply to core. The precache infrastructure, however, is to my knowledge not one of those things.

Its existence has been questioned many times during incident reviews and bug investigations, and at the very least it seems worth taking a step back and take inventory and evaluate and document it as a new deliberate and informed choice based on current needs.

In MediaWiki core, we've actually reduced populating of ParserCache quite a bit in recent years:

  • RefreshLinks jobs which propate template edits no longer warms ParserCache, not even for slow parses (change 501444).
  • CirrusSearch jobs no longer do (via getParserOutputForIndexing, change 775390, ref T285993).
  • Wikidata edits no longer do (ref T285987).

What we have left is that for non-Wikidata edits, we (naturally) have to parse the page when saving the edit and given this is generally redirected to immediately by the acting editor and/or likely requested by people and bots at least once shortly thereafter, this is also stored in the ParserCache.

But for change propagation-like scenarios we basically no longer proactively persist intermediate results to ParserCache.

ssastry subscribed.

I am adding back VisualEditor here since I imagine they would want to be informed / consulted on any decisions / changes made here since those could impact VE load times.

In the long-term, I expect that won't be the case. One generally has the article open prior to VisualEditor loading, thus the server-side has naturally rendered and stored a ParserOutput for the current revision either just then or recently enough (noting that ParserCache has much higher retention times than Varnish, and no LRU eviction). That is to say, in the status quo one can generally assume that when reading a page, any API request will enjoy a ParserCache hit for the current revision for the current user.

But in the transition period where VE uses Parsoid without RESTBase but Parsoid isn't used for read mode, there we indeed need to make sure a Parsoid-flavour entry is (usually) available for most scenarios where a legacy ones is availalbe today. I expect absence of that to be a non-starter for VE latency indeed, so that's a given.

I interpreted the current task not so much as exploring whether to store Parsoid entries per-se, but rather about porting back the pre-emptive population ChangeProp feature that RESTBase has today. If we think there's a chance that we might not need even that for VE, we'd better confirm that with the Editing team indeed.

My rough thinking is that, if we determine that we don't need it, then as part of the parent task (or as rescoped version of this task) we'd still need at least something during that transition phase, e.g. for wikis that don't read from Parsoid, use a post-send deferred update or job to immediately generate and store a Parsoid-version for the same key as we just generated and stored one for the legacy parser, typically during edits and without propagation).

daniel triaged this task as Medium priority.Jun 2 2022, 4:57 PM

I think this should be implemented very closely to how we do cache warming for "normal" parser output, that is, in DerivedPageDataUpdater. Even if we end up not needing this for VE, we will end up needing it when we start using Parsoid for page views.

Change 803337 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] [WIP] Storage: Warm caches with Parsoid outputs

https://gerrit.wikimedia.org/r/803337

Change 803337 abandoned by D3r1ck01:

[mediawiki/core@master] [WIP] Storage: Warm caches with Parsoid outputs

Reason:

covered by: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/806443

https://gerrit.wikimedia.org/r/803337

Change 806443 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] [WIP] Storage: Warm parsoid cache with parsoid outputs

https://gerrit.wikimedia.org/r/806443

Change 809126 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Move access to the page bundle into ParsoidOutputAccess

https://gerrit.wikimedia.org/r/809126

Change 806443 merged by jenkins-bot:

[mediawiki/core@master] Storage: Warm parsoid parser cache with parsoid outputs

https://gerrit.wikimedia.org/r/806443

Change 809133 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Move knowledge about HTTP status out of ParsoidOutputAccess

https://gerrit.wikimedia.org/r/809133

Change 809126 merged by jenkins-bot:

[mediawiki/core@master] Move access to the page bundle into ParsoidOutputAccess

https://gerrit.wikimedia.org/r/809126

Change 809133 merged by jenkins-bot:

[mediawiki/core@master] Move knowledge about HTTP status out of ParsoidOutputAccess

https://gerrit.wikimedia.org/r/809133

Change 809642 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Protect against passing anything but wikitext to parsoid

https://gerrit.wikimedia.org/r/809642

Change 809642 merged by jenkins-bot:

[mediawiki/core@master] Protect against passing unsupported content models to Parsoid.

https://gerrit.wikimedia.org/r/809642

MSantos claimed this task.
MSantos subscribed.

@daniel as far as I can remember this work is complete, I'll be bold and close but please re-open it and let me know in case I'm missing something.