Page MenuHomePhabricator

Ensure ParserCacheMetadata isn't corrupted by 'useParsoid' option
Closed, ResolvedPublic

Description

The ParserCache has a separate "ParserCacheMetadata" cache (includes/parser/ParserCacheMetadata.php; the cache itself is implemented by ParserCache::getMetadata() using ParserCache::$metadataProcCache), which maps article IDs to the set of parser options "actually used" to render that article. Then when you present a certain ParserOptions to the parser cache and say "I want the html that would result from rendering this article with these options", the parser cache pulls from its own metadata cache and does an intersection on the options used, so that ParserCache will "hit" even if the parser options mismatch, as long as the mismatch was on an option that "wasn't actually used to render this page". (This metadata cache also records the parse time and cache lifetime, to implement ParserCache expiry.)

For example, if the ParserOption has user-lang set to "es" but there wasn't any content on the page that actually /used/ the user language (ie, no {{int:...}} message, uses of Extension:InputBox or Extension:Kartographer -- and ignore the ToC for the time being and T303615/T240426/T280295/etc), then it would say, "sure I've got HTML for that -- I rendered it before for user-lang=de, but that shouldn't matter (the render didn't use the user-lang), so here take this!"

Unfortunately, when the legacy parser doesn't "use" the useParsoid option, then /anytime/ we ask for something with 'useParsoid=true' the parser cache says, "oh, well, the useParsoid flag shouldn't matter, use /this cached content/ instead" even though the cached content is from the legacy parser.

And the problem goes the other way, too, because if we generate *parsoid* content via the REST API (in the future! the REST API doesn't use the main parser cache currently), we again never consulted 'useParsoid' when rendering it (because the REST API *always* uses parsoid) -- and we also never consulted the 'user-lang' part of parser options, because parsoid doesn't use user-lang by design, so then we stuff content into ParserCache with the tag "oh, useParsoid didn't matter when rendering this, and neither did user-lang", even though the latter may not be true at all for a render with the legacy parser.

In short, the exact set of parser options "used" by one or the other parser actually depends on which parser was used to generate the content, so we really do need to fork the /metadata cache/ by parser type, even though in general we want to store both types of content in the parser cache. I8d2bfd55acd816c1c49a794a0ec15a28881f689d does this by effectively making the parsermetadatacache a mapping from <article ID, useParsoid flag> -> <set of used parser options> instead of just <article ID> -> <set of used parser options> but other alternatives are possible. I think my preference is making the metadata cache a mapping from <article ID,ParserOptions "extra key"> -> <set of used parser options> which canonicalizes the use of the "extra key" property in parser options (ParserOptions::addExtraKey()) to split the metadata cache, without hard coding useParsoid. (ParserCache::makeMetadataKey() makes the key for the metadata cache, and at present it is just the name of the parser cache plus the page id.)

An additional quirk here is that the "used options" are recorded using a listener callback to decouple the ParserCache from ParserOptions -- a nice architectural idea, but maybe a useless abstraction given that ParserOption's primary purpose is to be the key for the ParserCache -- but anyway that means we can't just "initialize" useParsoid to "always used" in the constructor (say) because we'd (at least) need to wait until the listener is registered before we can fake an event to record the use of useParsoid. The decoupling makes this awkward.

I5fb94a37bdef567cc1ff891fb21183419eac65c7 presents a number of possible places where you could ensure that useParsoid is "always used". Some make me nervous, because of the possibilty for cache corruption if we ever use some path through either parser which happens not to consult useParsoid in the right way to make it 'used' on every parse. One of the reasons I'd like to use a more explicit fork of the metadata cache is to make inadvertent cache corruption of this sort less likely; we really do want to say "keep entries from parsoid and the legacy parser separate in the ParserCacheMetadatacache" and I think that implementing that explicitly is better than relying on implicit guarantees.

Other "big picture" alternatives:

  1. We could add another parser cache to ParserOutputAccess.php -- it already has "primary" and "secondary"; we could add "parsoid primary" and "parsoid secondary" as well, and do the fork in ParserOutputAccess. This would ensure we never corrupt one cache with the other parser's output. ParserOutputAccess is quite complicated, however, and doing the fork at this level would involve a lot of code duplication.
  1. ParserOutputAccess is currently a singleton; we could make a new 'ParsoidOutputAccess' (no relation to the existing ParsoidOutputAccess) which is just ParserOutputAccess called with a different set of constructor arguments so that its primary and secondary caches were distinct. The ParserOutputAccess service is only used in a few places -- Article, WikiPage, ContentHandler, LocalFile -- and maybe it wouldn't be too hard to add switches in those places to select the appropriate instance. However, there are also a lot of statistics in ParserOutputAccess that use unprefixed names; you might want to also ensure the statistics for the two different ParserOutputAccess instances are kept distinct.
  1. Just ensure that the ParserCacheMetadata cache is forked appropriately, as described above.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 800769 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Allow setting a ParserOption to generate Parsoid HTML

https://gerrit.wikimedia.org/r/800769

Change 892496 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] WIP: different options to ensure ParserCacheMetadata isn't corrupted

https://gerrit.wikimedia.org/r/892496

Change 892497 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] WIP: Split ParserCacheMetadata by useParsoid option

https://gerrit.wikimedia.org/r/892497

Change 883501 had a related patch set uploaded (by C. Scott Ananian; author: Derick Alangi):

[mediawiki/core@master] ParserOutputAccess: Fork primary and secondary caches for parsoid

https://gerrit.wikimedia.org/r/883501

Change 892496 abandoned by C. Scott Ananian:

[mediawiki/core@master] WIP: different options to ensure ParserCacheMetadata isn't corrupted

Reason:

no longer necessary, we've made a decision

https://gerrit.wikimedia.org/r/892496

Change 800769 merged by jenkins-bot:

[mediawiki/core@master] Allow setting a ParserOption to generate Parsoid HTML

https://gerrit.wikimedia.org/r/800769

Change 883501 merged by jenkins-bot:

[mediawiki/core@master] ParserOutputAccess: Fork primary and secondary caches for parsoid

https://gerrit.wikimedia.org/r/883501

Change 892497 abandoned by C. Scott Ananian:

[mediawiki/core@master] WIP: Split ParserCacheMetadata by useParsoid option

Reason:

Abandoned in favor of 2caf69797ccde49323e3cedb79f70268b8d76fdd

https://gerrit.wikimedia.org/r/892497

MSantos triaged this task as High priority.Sep 19 2023, 3:43 PM

Looks there was nothing new to do here and we have had the ParsoidOutputAccess merge with ParserOutputAccess patch in production for more than a week now.