The ParserCache has a separate "ParserCacheMetadata" cache (includes/parser/ParserCacheMetadata.php; the cache itself is implemented by ParserCache::getMetadata() using ParserCache::$metadataProcCache), which maps article IDs to the set of parser options "actually used" to render that article. Then when you present a certain ParserOptions to the parser cache and say "I want the html that would result from rendering this article with these options", the parser cache pulls from its own metadata cache and does an intersection on the options used, so that ParserCache will "hit" even if the parser options mismatch, as long as the mismatch was on an option that "wasn't actually used to render this page". (This metadata cache also records the parse time and cache lifetime, to implement ParserCache expiry.)
For example, if the ParserOption has user-lang set to "es" but there wasn't any content on the page that actually /used/ the user language (ie, no {{int:...}} message, uses of Extension:InputBox or Extension:Kartographer -- and ignore the ToC for the time being and T303615/T240426/T280295/etc), then it would say, "sure I've got HTML for that -- I rendered it before for user-lang=de, but that shouldn't matter (the render didn't use the user-lang), so here take this!"
Unfortunately, when the legacy parser doesn't "use" the useParsoid option, then /anytime/ we ask for something with 'useParsoid=true' the parser cache says, "oh, well, the useParsoid flag shouldn't matter, use /this cached content/ instead" even though the cached content is from the legacy parser.
And the problem goes the other way, too, because if we generate *parsoid* content via the REST API (in the future! the REST API doesn't use the main parser cache currently), we again never consulted 'useParsoid' when rendering it (because the REST API *always* uses parsoid) -- and we also never consulted the 'user-lang' part of parser options, because parsoid doesn't use user-lang by design, so then we stuff content into ParserCache with the tag "oh, useParsoid didn't matter when rendering this, and neither did user-lang", even though the latter may not be true at all for a render with the legacy parser.
In short, the exact set of parser options "used" by one or the other parser actually depends on which parser was used to generate the content, so we really do need to fork the /metadata cache/ by parser type, even though in general we want to store both types of content in the parser cache. I8d2bfd55acd816c1c49a794a0ec15a28881f689d does this by effectively making the parsermetadatacache a mapping from <article ID, useParsoid flag> -> <set of used parser options> instead of just <article ID> -> <set of used parser options> but other alternatives are possible. I think my preference is making the metadata cache a mapping from <article ID,ParserOptions "extra key"> -> <set of used parser options> which canonicalizes the use of the "extra key" property in parser options (ParserOptions::addExtraKey()) to split the metadata cache, without hard coding useParsoid. (ParserCache::makeMetadataKey() makes the key for the metadata cache, and at present it is just the name of the parser cache plus the page id.)
An additional quirk here is that the "used options" are recorded using a listener callback to decouple the ParserCache from ParserOptions -- a nice architectural idea, but maybe a useless abstraction given that ParserOption's primary purpose is to be the key for the ParserCache -- but anyway that means we can't just "initialize" useParsoid to "always used" in the constructor (say) because we'd (at least) need to wait until the listener is registered before we can fake an event to record the use of useParsoid. The decoupling makes this awkward.
I5fb94a37bdef567cc1ff891fb21183419eac65c7 presents a number of possible places where you could ensure that useParsoid is "always used". Some make me nervous, because of the possibilty for cache corruption if we ever use some path through either parser which happens not to consult useParsoid in the right way to make it 'used' on every parse. One of the reasons I'd like to use a more explicit fork of the metadata cache is to make inadvertent cache corruption of this sort less likely; we really do want to say "keep entries from parsoid and the legacy parser separate in the ParserCacheMetadatacache" and I think that implementing that explicitly is better than relying on implicit guarantees.
Other "big picture" alternatives:
- We could add another parser cache to ParserOutputAccess.php -- it already has "primary" and "secondary"; we could add "parsoid primary" and "parsoid secondary" as well, and do the fork in ParserOutputAccess. This would ensure we never corrupt one cache with the other parser's output. ParserOutputAccess is quite complicated, however, and doing the fork at this level would involve a lot of code duplication.
- ParserOutputAccess is currently a singleton; we could make a new 'ParsoidOutputAccess' (no relation to the existing ParsoidOutputAccess) which is just ParserOutputAccess called with a different set of constructor arguments so that its primary and secondary caches were distinct. The ParserOutputAccess service is only used in a few places -- Article, WikiPage, ContentHandler, LocalFile -- and maybe it wouldn't be too hard to add switches in those places to select the appropriate instance. However, there are also a lot of statistics in ParserOutputAccess that use unprefixed names; you might want to also ensure the statistics for the two different ParserOutputAccess instances are kept distinct.
- Just ensure that the ParserCacheMetadata cache is forked appropriately, as described above.