Page MenuHomePhabricator

ParsoidCachePrewarmJob should set the useParsoid ParserOption
Closed, ResolvedPublic

Description

See https://gerrit.wikimedia.org/r/c/mediawiki/core/+/800769 -- this eventually should warm the main parser cache with the useParsoid flag.

Related Objects

StatusSubtypeAssignedTask
StalledNone
In ProgressNone
OpenNone
OpenNone
In ProgressNone
Resolveddaniel
Resolveddaniel
Resolveddaniel
Resolveddaniel
Resolveddaniel
ResolvedNone
Resolvedovasileva
Declined nray
ResolvedDAlangi_WMF
Resolveddaniel
ResolvedClement_Goubert
Resolveddaniel
ResolvedKrinkle
Resolvedssastry
Resolvedssastry

Event Timeline

daniel triaged this task as Medium priority.Feb 1 2023, 2:46 PM
daniel raised the priority of this task from Medium to High.
daniel moved this task from Unsorted to Parsoid pile on the RESTBase Sunsetting board.

Change 883501 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[mediawiki/core@master] ParserOutputAccess: Detect and set appropriate primary cache

https://gerrit.wikimedia.org/r/883501

I'm having a bit of trouble figuring out a coherent state for the useParsoid option. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/800769 "uses" the useParsoid option on all paths through the parser, which is /safe/ (in terms of avoiding corruption of a single non-forked parser cache) but (as we discussed previously) means that after that patch were to land in production, any new parse of page [[X]] would invalidate all other parses for [[X]] in the parser cache -- ie, if you look at the parsoid version of X then the legacy versions of X get invalidated; if you do a new parse for a spanish-speaking user of [[en:X]] it has the side effect of invalidating the existing english-language parse of [[X]]. (Briefly, this is because the new parse will have 'useParsoid' in the 'used options' where 'useParsoid' is true or false, and this will cause existing entries with 'useParsoid' not in the used-options set to be invalidated.)

This only happens the first time we get a /miss/ in the parser cache for a title which is already present in the cache, but with a different set of parser options; but could still have some negative effect on parser cache hit rate when the patch is deployed.

Nevertheless, I think this is still the safest way to write this patch, as may lead to extra parser cache misses, but there's no chance of corruption of the cache.

The follow up in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/883501 splits the parser cache by the 'useParsoid' option, so having 'useParsoid' in the used-options set isn't strictly needed anymore. We *could* then set things up so that useParsoid is 'never' used (on any path) -- this is actually quite easy (though obscure) since the listener which is responsible for recording used options isn't yet connected in WikitextContentHandler, so the 'default' case will be for useParsoid to be 'not used'. This is also consistent with the current pre-warm job. *However* this set up makes me quite nervous, as it relies on a lot of implicit behavior -- both that 'useParsoid' would not be set (despite apparently being used) due to the pretty-obscure timing details of the listener collection, and then further that the parser split in ParserOutputAccess was actually necessary to prevent cache corruption (which might not otherwise be obvious).

We'd talked about making the cache split in ParserOutputAccess configurable, so that it wasn't hard-coded to 'useParsoid' -- but if we went down that path I think we'd really want to ensure that the cache was 'safe' regardless of the user's cache configuration settings.

*So* I think the best approach is:

  • Always set useParsoid as 'used', and do this both in the ParsoidOutputAccess path and the parser output access.
  • Let this ride the train carefully, and deal with temporarily higher miss rates in the legacy and parsoid parser caches as a result.
  • Assuming we can get away with the above, then we can leave this 'always used" path in place even after we allow configurable splitting of the parser cache.

The alternative is, roughly:

  • Don't allow 800769 to ride the train without the followup https://gerrit.wikimedia.org/r/c/mediawiki/core/+/883501 which forks the cache
  • In 883501, also tweak the code so that useParsoid doesn't use the 'used options' framework and is effectively "never used" (and not included in the cache key). This is safe only because the parsoid cache is forked, though "safer" to deploy to production because it will never invalidate entries in the legacy parser cache by side-effect.
  • If 'useParsoid' isn't in the cache key, the ParsoidOutputAccess doesn't have to explicitly set useParsoid, although it probably should just for consistency.
  • When we eventually allow the cache splitting to be a general feature and not hard-coded to useParsoid, we'll have to add some additional protections to ensure the 'useParsoid' doesn't corrupt the legacy parser cache; that is, even if we allow configurable cache splitting, 'useParsoid' *always* has to be part of the split or we'll corrupt the cache. Perhaps there's an avenue here where we add 'useParsoid' to the cache key *only if* 'useParsoid' is not part of the split, but that implies some sort of declarative split mechanism so we can tell is 'useParsoid' is part of it.

My primary concern with this alternative is that is leaves a lot of "magic" in the codebase and makes useParsoid a weird option-but-not-an-option, never present in the "used" option list despite appearances; and it will be unsafe to mix Parsoid and non-Parsoid content in the same parser cache but this property will be hard to document properly.

Thoughts?

I am going ask a possibly stupid question. It seems like the best strategy overall would be to use a different cache instance (instead of a trying to vary by options) for 'useParsoid'. I know we talked about that in the past and I thought that was the solution that you all had landed on as well. Trying to use the same cache instance but use varying options seems unnecessarily complicated to me. Given the different output that Parsoid and legacy parser generate, different PC instances seems like the best approach to me.

Yes, that's what all the patches do: split the cache. 800769 splits the cache via the key (parser option) and 883501 amends that to additionally force a split via the parser cache name. The question is how to describe how/why the parser cache is being split, and to do it in a way that doesn't depend on "magical" properties of the code.

Ideally the cache split would be a detail of the WMF production configuration, and not strictly required for correctness, because (waving hands a bit) the "useParsoid" property would also be part of the cache key, like all the other properties of ParserOptions. That's what 800769+883501 effectively do, but that has implications for production (higher cache turnover when they are first deployed, since existing entries don't have 'useParsoid' as part of the cache key, although this is oversimplifying and in practice almost all of the existing entries would still be valid). In order to have "zero impact on existing legacy parser cache" you have to (temporarily or permanently) give up the "correctness independent of the cache split" property and add a dirty hack somewhere that lies about whether 'useParsoid' is being used. So that's the decision to make, in a nutshell: how best to handle the tradeoff between understandable correct non-magical behavior in the long run, and the short-term desire to avoid expiring some fraction of legacy cache entries unnecessarily.

The summary is I'd like to deploy a particular implementation of the split that also adds useParsoid to the cache key as a "used option" in the legacy parser path, which is safest/cleanest in terms of architecture (doesn't leave landmines if the cache split is done differently in the future; doesn't strictly require hardcoding the split into ParserOutputAccess) but has the possibility of temporarily increasing the cache miss rate (slightly?) for 21 days while things turn over. There are various clever and/or ugly hacks we can do if that doesn't work out, but I'd like to pursue the most straightforward path first.

But, I am asking about actually instantiating a new instance (vs "splitting" a cache) in code. My understanding of splitting is that the same (MediaWiki) ParserCache instance is used but the cache key ensures that different entries are used within that instance. There seems to be a some complexity in ensuring that the right set of options are set on save and fetch of the right cached entry. It seems like a different instance would avoid that.

How the instance is managed wrt the backing store is not a detail the code needs to worry about -- that is probably a SRE concern. But, having different cache instances could let different backing stores be used (mysql, cassandra, etc.).

If the different PC instances in code map to the same backing store (mysql for now), presumably, the cache name would be used for the physical "split" of the shared backing store, but that would presumably still be simpler than trying to mess with options? It is possible I am missing something in this discussion.

daniel renamed this task from ParsoidCachePrewarmJob should warm the useParsoid ParserOption to ParsoidCachePrewarmJob should set the useParsoid ParserOption.Mar 22 2023, 8:16 AM

If the different PC instances in code map to the same backing store (mysql for now), presumably, the cache name would be used for the physical "split" of the shared backing store, but that would presumably still be simpler than trying to mess with options? It is possible I am missing something in this discussion.

The idea is to set the correct options, and based on the options select the correct cache. This way, a "simple" setup doesn't need two parser cached (because useParsoid is present in the cache key), but large instances can use separate instances. We can't really avoid the complexity of ensuring the right options are set, because we need to options to be set in order to determine which ParserCache instance to use. Otherwise, we'd have to loop this information through separately, which would require us to change method signatures on all the relevant code paths.

By the way, the current production setup is to use different ParserCache instances with different names, but both these instances map to the same database tables. The cache name is just a prefix to the row key. This setup makes it easy to split the caches physically, without forcing us to do so immediately.

.We can't really avoid the complexity of ensuring the right options are set, because we need to options to be set in order to determine which ParserCache instance to use. Otherwise, we'd have to loop this information through separately, which would require us to change method signatures on all the relevant code paths.

This is the piece I wasn't aware of.

Right now the 'useParsoid' path and the ParsoidOutputAccess path use different cache names for the parsoid content. I'll switch them to use the same cache name in the next train, so that we're not trying to change too much at once and can keep a closer eye on any subtle content differences.

repeating what I said in the meeting with Daniel. If it's using the same general backend (MySQL PC), only splitting it via even fragmentation should just work out of the box. They are distributed via consistent hashing so it'll be in a random different (or the same in some cases) physical host. That makes it homogeneous and easy to maintain instead of maintaining two separate clusters.