
Parser cache infrastructure for OutputTransform
Closed, Resolved, Public

Description

This task stands broadly for "make OutputTransform outputs independently cacheable", which would allow the various post-processing results to be cached separately from the raw parse.

For example: discussion tools output (Txxxx), mobile front end output (Txxxx), and language-converted output (T380517).

See https://www.mediawiki.org/wiki/Parsoid/OutputTransform; this is step 3 and 4 of the "initial proposal":

  3. Add ParserOptions::get/setFlavor() (not included in the cache key), and add support in ParserOutputAccess::getCachedParserOutput() to check the flavor and, if it is not core, to call MediaWiki\ParserOutputTransform\FlavorDispatcher::transform($parserOutput, $parserOptions->getFlavor()). It will also consult MediaWiki configuration and a hook to determine whether or not to cache the result.
  4. Add a hook in FlavorDispatcher to allow inserting additional passes into the chain for any flavor; this will replace the ParserOutputAfterTidy hook.
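The flow proposed above can be sketched as follows. This is an illustrative Python sketch, not the actual PHP: the `TRANSFORMS` registry, the `cache_transformed` flag, and all function names here are assumptions standing in for the `FlavorDispatcher` machinery and the config/hook consultation described in the proposal.

```python
# Sketch: after a parser cache hit, check the requested flavor; if it is
# not "core", run the cached output through the flavor's transform chain.
CORE_FLAVOR = "core"

# Hypothetical registry of transform passes per flavor; the real design
# would let a hook insert additional passes into this chain.
TRANSFORMS = {
    "mobile": [lambda html: html + "<!-- mobile passes -->"],
}

def get_cached_parser_output(cache, key, flavor, cache_transformed=False):
    output = cache.get(key)
    if output is None:
        return None          # cache miss: caller must parse from scratch
    if flavor == CORE_FLAVOR:
        return output        # core flavor is the raw cached output
    # Dispatch: apply each registered pass for the requested flavor.
    for transform in TRANSFORMS.get(flavor, []):
        output = transform(output)
    if cache_transformed:
        # Config/hook would decide whether the flavored result is cached too.
        cache[(key, flavor)] = output
    return output
```

The key property is that the flavor is not part of the primary cache key: the raw parse is fetched once and the flavored result is derived (and optionally cached) on top of it.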


Event Timeline

ssastry triaged this task as High priority. May 22 2025, 5:49 PM

Change #1191453 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] [POC] add stats about potential for secondary caching

https://gerrit.wikimedia.org/r/1191453

Change #1202697 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] ParserOptions: add some documentation about cacheability and defaults

https://gerrit.wikimedia.org/r/1202697

Change #1203008 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] WIP Co-locate cache access and pipeline run for Article path

https://gerrit.wikimedia.org/r/1203008

Change #1203008 merged by jenkins-bot:

[mediawiki/core@master] Post-process the ParserOutput in Article closer to the cache access

https://gerrit.wikimedia.org/r/1203008

Change #1203918 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] [DNM] [WIP] Add a second level of caching after post-processing

https://gerrit.wikimedia.org/r/1203918

Change #1202697 merged by jenkins-bot:

[mediawiki/core@master] ParserOptions: add some documentation about cacheability and defaults

https://gerrit.wikimedia.org/r/1202697

Change #1205194 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] [WIP] Add post-processing cache

https://gerrit.wikimedia.org/r/1205194

Change #1203918 abandoned by Isabelle Hurbain-Palatin:

[mediawiki/core@master] [DNM] [WIP] Add a second level of caching after post-processing

Reason:

in favor of Icf7172be6d718362192ca8b7288ef8ce4d4af6b9

https://gerrit.wikimedia.org/r/1203918

Change #1206925 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Clone ParserOutput in Article before post-processing

https://gerrit.wikimedia.org/r/1206925

Change #1206925 merged by jenkins-bot:

[mediawiki/core@master] Clone ParserOutput in Article before post-processing

https://gerrit.wikimedia.org/r/1206925

Revert merged by @ihurbain:

[mediawiki/core@master] Revert "Clone ParserOutput in Article before post-processing"

Reason for revert: it looks like it *does* break DT on beta. To be investigated.

https://gerrit.wikimedia.org/r/1210640

Change #1210767 had a related patch set uploaded (by C. Scott Ananian; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Clone ParserOutput in Article before post-processing (take 2)

https://gerrit.wikimedia.org/r/1210767

Change #1210767 merged by jenkins-bot:

[mediawiki/core@master] Clone ParserOutput in Article before post-processing (take 2)

https://gerrit.wikimedia.org/r/1210767

Change #1210844 had a related patch set uploaded (by C. Scott Ananian; author: Isabelle Hurbain-Palatin):

[mediawiki/core@wmf/1.46.0-wmf.4] Clone ParserOutput in Article before post-processing (take 2)

https://gerrit.wikimedia.org/r/1210844

Change #1210844 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.4] Clone ParserOutput in Article before post-processing (take 2)

https://gerrit.wikimedia.org/r/1210844

Data persistence opinion has been requested. I don't think adding the extra keys to PC could cause any problems for the infra, but it is extremely important that the extra keys end up on the same host, using the concept of sister keys (the work I did in gerrit:1078013 and gerrit:1079677). This way, if we depool a host for maintenance, only 1/8th of the pages will be affected (status quo); if they are not sister keys, it would be quite a problem. But if that's taken care of, I don't think we have any issues. Once it's rolled out, let us know so we can keep an eye on the infra.
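The sister-key idea above can be sketched in Python: pick the host by hashing only the page-scoped part of the key (everything before the hash stop `|#|`), so that a page's `idhash` and `idoptions` entries always land on the same server. The host count and the sha1-mod-n picker are illustrative assumptions, not the production implementation.

```python
import hashlib

HASH_STOP = "|#|"

def pick_host(key, num_hosts=8):
    """Map a cache key to a host, ignoring the per-entry suffix."""
    prefix = key.split(HASH_STOP, 1)[0]   # only the page-scoped prefix
    digest = hashlib.sha1(prefix.encode()).hexdigest()
    return int(digest, 16) % num_hosts
```

Because both the `idoptions` metadata entry and every `idhash:...` content entry for a page share the same prefix, depooling one host loses both halves for 1/nth of pages, rather than one half for a much larger fraction.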

@Ladsgroup when in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1079677 you're talking about "all entries for the same page end up in the same database table in the same cluster", I'm assuming that this only applies to entries for a given ParserCache instance, right?
If I read the current code correctly, entries for the "regular parser cache" and the "parsoid parser cache" already probably do not end up in the same shard, since the name of the cache is in the part of the key used to determine the shard.
And, from a conceptual point of view, I can see why "entries of a given cache for a given page want to go to the same shard", but differently named caches should normally be independent and don't have that constraint, even if they're caching content related to the same page: in that case, depooling one shard will only hit a set of entries that should be independent from the other possible sets of entries in other caches for that same page.

Assuming that this is correct (please let me know :) ), I think I need to ponder a bit harder if I am actually in that case 🤔 (thinking aloud for a minute, let's hope it's readable by someone else than me)

We've currently taken the option of putting the postprocessed output in a separate ParserCache (as was done for the Parsoid ParserCache). For "legacy vs parsoid" parser caches, it feels clear that these can be handled independently. For "postprocessed output of a given parse", it's less clear to me. If page X has its parses in #pcache (and #parsoid-pcache) and its postprocessed output in #pp-pcache (and #parsoid-pp-pcache), with all of these in different shards (without loss of generality, let's ignore the Parsoid entries), I think we're perfectly fine with, for instance, the #pp-pcache shard going down (we fall back to #pcache, recompute what was in #pp-pcache, and are happy), but the reverse is possibly more annoying:

  • #pcache is down and loses entries for page X
  • for most postprocessed results we'd hit the #pp-pcache (that's fine)
  • eventually we're reparsing and re-storing page X in #pcache (typically because of different or no post-processing options)
  • whatever was existing in #pp-pcache is not based on what's in #pcache anymore.

The question is then: do we actually care. (This is assuming that "regular" invalidation on edits and expiry hasn't happened during that time, because then the point should be moot.)

THAT SAID, we'd be in that exact same situation if/when the entries for #pcache and #pp-pcache had different expiry dates and the #pcache entry expired without the #pp-pcache entry expiring. (At this time I have not touched expiry and I do not know what happens there; it's unclear whether we reuse the existing date or recompute it on save. The latter would make sense, but I need to check.)

My sense is that it should be fine, because whatever we reparse for #pcache should normally be vastly identical to whatever #pp-pcache is based on, but I'm trying to cover my bases just to make sure there's not something that I don't see that's blatantly obvious to someone else. @cscott - opinions?

For "page" I don't mean two different parser cache types. That's correct. I should have given more context. When we had 3 pc shard and in old times, when a cluster (pc1 for example) turned empty for various reasons. 1/3rd of parser cache entries got invalidated because "idoptions" key was missing, and 1/3rd got invalidated because "idhash" entry was missing so we ended up with 2/3rd of entries getting invalidated (if you want to be more exact, it'd be 56% because 1/9th would be in the same). Adding sister keys avoided that so when we depool a cluster, only 1/nth of the page views require a reparse. That obviously means it's fine whether legacy parser or parsoid is living in separate clusters since they don't share a code path. I don't know about post-parse processing though. I assume it'd be better to make sure they still around the same page in parsoid to avoid partial loss of parsed data. Do I make sense? or I'm just confusing you more. If you want, we can hop on a call.

Ack, that was my understanding between idoptions and idhash, I was just making sure I wasn't missing something more in there, thanks for confirming.

You do make sense :D especially without having the full context! Post-parse processing does store a full ParserOutput as well, it's essentially another version of a "raw" parse (which can *technically* be seen as "a regular parse, but with more options in the cache key" for caching purposes.), hence I don't see much of a scenario in which we'd have a loss of data that would matter (at least not more than "losing a legacy parse and keeping the parsoid parse").

I think this basically boils down to "should invalidating an entry in the 'raw' cache also invalidate the postprocessing entries that derive from it". Right now, this is not the case; Scott and I have discussed it but not in a way that reached a definite conclusion, I think - or not in a way where we saw something that needed to be solved immediately.

As of now, I'm still leaning towards different caches (hence not enforcing locality of raw and postprocessed), mostly for the safety of the existing entries as we deploy. I do note, however (and that's something this conversation has shown me, because I had missed that point), that if we do change our mind and want to enforce said locality, it may be annoying (depending on the quantity of content that needs to be relocated, if any). Hence, it's probably worth at least making sure that the decision we're making right now is a conscious one and not a default one - I'll double check with Scott on Monday what we actually want to do there.

Change #1205194 merged by jenkins-bot:

[mediawiki/core@master] Add post-processing cache, disabled by default

https://gerrit.wikimedia.org/r/1205194

Change #1215115 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[operations/mediawiki-config@master] Activate postprocessing cache on testwiki, test2wiki, officewiki

https://gerrit.wikimedia.org/r/1215115

Change #1215115 merged by jenkins-bot:

[operations/mediawiki-config@master] Activate postprocessing cache on testwiki, test2wiki, officewiki

https://gerrit.wikimedia.org/r/1215115

Mentioned in SAL (#wikimedia-operations) [2025-12-04T21:52:22Z] <cscott@deploy2002> Started scap sync-world: Backport for [[gerrit:1215115|Activate postprocessing cache on testwiki, test2wiki, officewiki (T348255)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-04T21:54:27Z] <cscott@deploy2002> ihurbain, cscott: Backport for [[gerrit:1215115|Activate postprocessing cache on testwiki, test2wiki, officewiki (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-04T22:06:45Z] <cscott@deploy2002> Finished scap sync-world: Backport for [[gerrit:1215115|Activate postprocessing cache on testwiki, test2wiki, officewiki (T348255)]] (duration: 14m 23s)

Change #1217768 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[operations/mediawiki-config@master] Activate post-processing cache on idwiki

https://gerrit.wikimedia.org/r/1217768

Change #1217768 merged by jenkins-bot:

[operations/mediawiki-config@master] Activate post-processing cache on idwiki

https://gerrit.wikimedia.org/r/1217768

Mentioned in SAL (#wikimedia-operations) [2025-12-15T14:08:26Z] <ihurbain@deploy2002> Started scap sync-world: Backport for [[gerrit:1217768|Activate post-processing cache on idwiki (T348255)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-15T14:12:30Z] <ihurbain@deploy2002> ihurbain: Backport for [[gerrit:1217768|Activate post-processing cache on idwiki (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-15T14:20:33Z] <ihurbain@deploy2002> Finished scap sync-world: Backport for [[gerrit:1217768|Activate post-processing cache on idwiki (T348255)]] (duration: 12m 07s)

Change #1191453 abandoned by Isabelle Hurbain-Palatin:

[mediawiki/core@master] Add stats about potential for secondary caching

Reason:

meh

https://gerrit.wikimedia.org/r/1191453

Change #1218793 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[operations/mediawiki-config@master] Enable post-processing cache for all Parsoid-rendered wikis

https://gerrit.wikimedia.org/r/1218793

Change #1218806 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[operations/mediawiki-config@master] Activate post-processing cache on some wikis

https://gerrit.wikimedia.org/r/1218806

Change #1218806 merged by jenkins-bot:

[operations/mediawiki-config@master] Activate post-processing cache on some wikis

https://gerrit.wikimedia.org/r/1218806

Mentioned in SAL (#wikimedia-operations) [2025-12-17T14:33:04Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-17T14:35:18Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde, ihurbain: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-17T14:51:49Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1219158|Revert "Enable v2 non-emergency workflow by default" (T410512 T412715)]], [[gerrit:1218806|Activate post-processing cache on some wikis (T348255)]] (duration: 18m 45s)

Change #1218793 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable post-processing cache for all Parsoid-rendered wikis

https://gerrit.wikimedia.org/r/1218793

Mentioned in SAL (#wikimedia-operations) [2025-12-17T21:23:56Z] <cscott@deploy2002> Started scap sync-world: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-17T21:26:10Z] <cscott@deploy2002> ksarabia, ihurbain, cscott: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-17T21:36:09Z] <cscott@deploy2002> Finished scap sync-world: Backport for [[gerrit:1218793|Enable post-processing cache for all Parsoid-rendered wikis (T348255)]], [[gerrit:1217799|Decommission Article Summaries (T411558)]] (duration: 12m 13s)

The post-processing cache name is before the hash stop :( this will cause issues:

| ruwiki:pcache:2101895:|#|:idoptions                                                                  |
| thwiki:postproc-parsoid-pcache:19490:|#|:idhash:injectTOC=0!postproc=1!skin=vector-2022!useParsoid=1 |
| thwiki:postproc-parsoid-pcache:19490:|#|:idoptions                                                   |
| commonswiki:pcache:74334962:|#|:idhash:wb=3!wbMobile=0                                               |
| commonswiki:pcache:74334962:|#|:idoptions                                                            |
| enwiki:parsoid-pcache:24603480:|#|:idhash:useParsoid=1                                               |
| enwiki:parsoid-pcache:24603480:|#|:idoptions                                                         |
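Why this key shape is a problem: the shard is derived from the part of the key before the hash stop `|#|`, and that part contains the cache name. A sketch with an assumed sha1-mod-n host picker (the keys are modeled on the examples above; the picker itself is an illustrative assumption):

```python
import hashlib

def shard_prefix(key):
    """Everything before the hash stop determines the shard."""
    return key.split("|#|", 1)[0]

def pick_host(key, num_hosts=8):
    digest = hashlib.sha1(shard_prefix(key).encode()).hexdigest()
    return int(digest, 16) % num_hosts

# The cache name sits inside the shard prefix, so the main entry and the
# postproc entry for the *same page* hash from different prefixes and are
# not guaranteed to land on the same host.
main_key = "thwiki:parsoid-pcache:19490:|#|:idhash:useParsoid=1"
postproc_key = ("thwiki:postproc-parsoid-pcache:19490:|#|:"
                "idhash:injectTOC=0!postproc=1!skin=vector-2022!useParsoid=1")
```

With the cache name after the hash stop instead, both prefixes would be identical and colocation would follow for free.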
This comment was removed by cscott.

Let me document decisions a posteriori because the best time to do that was two months ago but the second best time is now.

We made the decision to separate the postproc cache and the parsing cache for multiple reasons.

The initial reason was to avoid the risk of breaking the existing cache: if we had messed up the cache key, it might have had an impact on the main parser cache, which was scary at the time. This was the primary initial reason, but we did weigh whether the risk mitigation was worth the possible annoyance compared to the opposite decision (so we tried to make an explicit decision there).

One of the considerations was "do we want to force colocation", and the answer to that was "we don't really have a reason for that". The reasoning at the time was that, while the postprocessing does depend on the parsed value during its construction, the cached versions are actually independent, and there is no correctness consideration involved in the location of the cache entries.

If anything, independence helps: being able to get a parsed version from another shard if the shard containing the postprocessed version goes out of commission helps recovery, and being able to continue serving postprocessed versions if the shard with the parsed version goes down also sounds like a good idea. Both versions would need to be invalidated if there's a page invalidation event, but that would be the case anyway; and it's already the case that the legacy parse and the Parsoid parse would then be invalidated from possibly two different shards.

Conversely, if the parsed version expires, we typically wouldn't want the postprocessed version to expire as a consequence either, adding an argument in favor of cache entry independence (not necessarily of cache independence, but at least saying "there's no strong reason to have these in the same cache").

What made us pretty confident that we did want two different caches anyway was that the postprocessed and the parsed entries would have two different sets of metadata. It seemed far less error-prone to rely on two different caches than to add another layer of metadata keys to the existing ParserCache machinery to handle raw metadata vs postprocessed metadata. It could technically *maybe* work to have a single metadata entry, but it would have to match the postprocessed entry, and this would be INTERESTING to implement correctly. Additionally, one of the prerequisites seems to be having a set of options that appear exclusively in either the raw options or the postprocessed options, and that is not the case (userlang can be unused in the raw parse and used in the postprocessed parse). So that'd be a mess, and that is probably my main argument for keeping the caches separated.

THAT SAID, the fact that the caches are separated is *kind of* orthogonal to the cache key prefix question. If we *do* want to force colocation (and as explained, I'm not convinced we do), we can still keep the caches separated and hack around the cache key generation: handwaving vaguely, marking a cache as the postprocessing cache for cache blah could generate something like pcache-blah:|#|postprocess-pcache-blah:idhash:morestuff as a cache key. And I could see reasons for actually wanting to colocate; maybe it improves performance in the happy path, when we're not recovering from a shard getting decommissioned.

Anyway - as we said - let's discuss on Monday; but at least now we've documented what we were trying to do there :D
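The handwaved colocation hack can be sketched concretely: keep the postproc cache logically separate, but build its keys so the page-scoped prefix (before the hash stop) matches the main cache, with the postproc cache name demoted to after the hash stop. All key shapes below are illustrative assumptions.

```python
def main_key(wiki, page_id, suffix):
    return f"{wiki}:pcache:{page_id}:|#|:{suffix}"

def colocated_postproc_key(wiki, page_id, suffix):
    # Same shard prefix as main_key; the cache name only appears in the
    # suffix, so sharding cannot tell the two caches apart.
    return f"{wiki}:pcache:{page_id}:|#|:postproc-pcache:{suffix}"

def shard_prefix(key):
    return key.split("|#|", 1)[0]
```

Any prefix-hashing shard picker then necessarily places both entries on the same host, while the full keys remain distinct.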

Sorry I couldn't comment earlier; SRE stuff. Anyway. In principle, if the post-processed entry is "enough" on retrieval and wouldn't trigger a reparse when the main entry doesn't exist, then a depool won't cause issues. But my main worry is sneaky, hard-to-detect bugs that could trigger such reparses en masse. So, as a defense-in-depth mechanism, I still think it's a good idea to force locality by moving the "postprocess-pcache" part of the key to after the hash stop.

Thinking about it, I'm also rethinking whether we should enable the storage of postprocess on Commons and Wikidata at all. They are usually the majority of the storage in PC hosts, given that each has roughly 150M pages. I wonder if it's okay to never enable the postprocess cache there and let them take the performance hit (as a trade-off between storage and CPU), given that Wikidata doesn't even use Parsoid for the majority of pages, and Commons with its multilingual setup can explode in number of cache entries.

To be clear, I don't think it'll break anything but I think it'll be extremely wasteful in terms of storage.

Mmmmmmmmmh.
Okay, you're right. I wasn't too worried about that kind of scenario (mass reparse triggering) up until a few hours ago, because until then those things were really independent from each other and we'd only hit the raw cache if the post-processed cache didn't have an entry.
But what I'm *currently* considering is something along the lines of "hit the postprocessing cache; if it fails, get a raw cache output, and try to see if there's another postprocessing cache entry we should hit *considering the raw cache result*". Not sure this will land yet, but this significantly increases the risk of shenanigans, and the fact that I'm at all considering this makes it a good idea to future-proof this now rather than when the cache is fuller.
Anyway. I've created T415110 to track this.
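The fallback flow being considered above (postprocessing cache first, then raw cache, then a full reparse) can be sketched as follows. This is an illustrative Python sketch; the function names and the dict-backed caches are assumptions, and it omits the "find another postprocessing entry based on the raw result" wrinkle.

```python
def get_output(pp_cache, raw_cache, key, postprocess, parse):
    """Two-level lookup: postproc cache, then raw cache, then reparse."""
    out = pp_cache.get(key)
    if out is not None:
        return out                 # fast path: postprocessed hit
    raw = raw_cache.get(key)
    if raw is None:
        raw = parse(key)           # full reparse: the expensive path
        raw_cache[key] = raw
    out = postprocess(raw)         # cheaper than a reparse
    pp_cache[key] = out
    return out
```

This is where the independence argument pays off: losing either cache alone still avoids the expensive `parse` step, since the other level can regenerate or serve the missing half.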

As for Commons and Wikidata... probably? We do have time to make that decision right now, because the postproc cache is currently aligned with Parsoid Read Views for deployment. To avoid issues, I'll put a patch on config to explicitly deactivate it there so that we don't accidentally do something we wouldn't want to do. (That's T415111).

I would say T415110: Modify post-processing cache keys so that they are colocated with their main cache should be very low priority; I feel like at the moment there is much more to gain by having the entries be independent. It helps ensure we never have a totally cold cache: even if we lose entries from the postprocessing cache, we still have the parsed copies elsewhere, which saves a lot of work; conversely, if we lose the main canonical entry, we don't immediately lose the postprocessed version that most anonymous users get, and we'll gracefully regenerate the canonical entry the next time a logged-in user visits the page or someone opens it for editing.

There might be other shenanigans involved that might make me reconsider this, but those are my vibes at the moment. I'm not opposed to T415110 but it seems like it is creating work and code complexity for negative gain in the most common case (a shard goes on- or off-line).

Change #1251610 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[operations/mediawiki-config@master] Turn on postprocessing cache for all Parsoid parses

https://gerrit.wikimedia.org/r/1251610

Change #1251783 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Remove $wgUsePostprocCache; default $wgUsePostprocCacheParsoid to true

https://gerrit.wikimedia.org/r/1251783

Change #1251610 merged by jenkins-bot:

[operations/mediawiki-config@master] Turn on postprocessing cache for all Parsoid parses

https://gerrit.wikimedia.org/r/1251610

Mentioned in SAL (#wikimedia-operations) [2026-03-17T13:20:14Z] <cscott@deploy2002> Started scap sync-world: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-17T13:22:19Z] <cscott@deploy2002> cscott: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-17T13:30:44Z] <cscott@deploy2002> Finished scap sync-world: Backport for [[gerrit:1251610|Turn on postprocessing cache for all Parsoid parses (T348255)]] (duration: 10m 31s)

Change #1251783 merged by jenkins-bot:

[mediawiki/core@master] Remove $wgUsePostprocCache; default $wgUsePostprocCacheParsoid to true

https://gerrit.wikimedia.org/r/1251783