We've merged https://gerrit.wikimedia.org/r/1206925, which clones ParserOutput in Article before post-processing; but it looks like it breaks DiscussionTools on Beta (thread and page metadata are not displayed, which was the initial symptom of the breakage in T353257).
This needs to be investigated before reverting the revert.
This is needed so that we do not pollute the local cache with the post-processing - but it may require some work on DT to actually make it work as desired.
Event Timeline
@ihurbain I don't know if this is relevant to your problem, but in case it helps with this or T348255:
ParserOutput is re-used in memory in the same request, but also written to ParserCache.
I don't just mean the synchronous write to ParserCache during retrieval on a cache miss (that one would not affect you, as the value is saved before being returned to you), but also the asynchronous writes.
Like so:
- Request 1: Initial cache miss
- [mid-request] ParserCache calls MultiWriteBag::get and misses both memc and mysql, we parse, we call MultiWriteBag::set which writes to memc and schedules deferred write to mysql, and we return to caller.
- [mid-request] Any modifications here may pollute mysql.
- [post-send] DeferredUpdate callback writes the by-ref object to mysql.
- Request 2: Cache hit on memc
- [mid-request] ParserCache calls MultiWriteBag::get and hits on memc, and we return to caller. Nothing else.
- Request 3: Cache hit on mysql
- [mid-request] ParserCache calls MultiWriteBag::get and misses on memc but hits on mysql and schedules deferred backfill, and we return to caller.
- [mid-request] Any modifications here may pollute memcached.
- [post-send] DeferredUpdate callback writes the by-ref object to memc.
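The three scenarios above boil down to one mechanism: the cache holds on to the object *by reference* until a deferred callback runs post-send, so any mid-request mutation gets persisted. Here is a minimal Python model of that mechanism (illustrative only; class and method names are made up and do not match MediaWiki's actual BagOStuff/DeferredUpdates API):

```python
class MultiWriteCacheModel:
    """Toy two-tier cache: fast 'memc' dict plus slow 'mysql' dict.

    set() writes memc synchronously and schedules a deferred mysql
    write that keeps a *reference* to the value, not a snapshot.
    """

    def __init__(self):
        self.memc = {}
        self.mysql = {}
        self.deferred = []  # callbacks run "post-send"

    def get(self, key):
        if key in self.memc:
            return self.memc[key]
        if key in self.mysql:
            value = self.mysql[key]
            # Schedule a deferred memc backfill, again by reference.
            self.deferred.append(lambda: self.memc.__setitem__(key, value))
            return value
        return None

    def set(self, key, value):
        self.memc[key] = value  # synchronous write
        self.deferred.append(lambda: self.mysql.__setitem__(key, value))

    def run_post_send(self):
        for cb in self.deferred:
            cb()
        self.deferred.clear()


# Request 1: cache miss, parse, set(), then mutate mid-request.
cache = MultiWriteCacheModel()
parser_output = {"html": "<p>hi</p>"}
cache.set("page:1", parser_output)
parser_output["html"] = "<p>POST-PROCESSED</p>"  # mid-request pollution
cache.run_post_send()
print(cache.mysql["page:1"]["html"])  # prints "<p>POST-PROCESSED</p>"
```

The deferred write persists whatever the object looks like at post-send time, not at set() time, which is exactly the pollution described above.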
Change #1210767 had a related patch set uploaded (by C. Scott Ananian; author: Isabelle Hurbain-Palatin):
[mediawiki/core@master] Clone ParserOutput in Article before post-processing (take 2)
Change #1210767 merged by jenkins-bot:
[mediawiki/core@master] Clone ParserOutput in Article before post-processing (take 2)
Take 2 controls the cloning with a configuration variable, to make it easier to tweak as the train rolls out, if needed.
Change #1210844 had a related patch set uploaded (by C. Scott Ananian; author: Isabelle Hurbain-Palatin):
[mediawiki/core@wmf/1.46.0-wmf.4] Clone ParserOutput in Article before post-processing (take 2)
Change #1210844 merged by jenkins-bot:
[mediawiki/core@wmf/1.46.0-wmf.4] Clone ParserOutput in Article before post-processing (take 2)
That is very relevant and very helpful, thank you!
I had seen case 3; I had missed case 1. It does confirm that cloning stuff before we modify it with abandon is indeed the right move (and that it's worth double-checking that the clone doesn't leave shared state anywhere). Good, since this is what we're aiming to do (and it was probably not obvious that it was necessary before we started poking at this code).
It's also my understanding that case 2 has a chance of polluting the memc object if we access it twice (which would *probably* not happen in a "regular" request, but might bite in tests). Is that correct?
I think I was wrong. We've encountered this before at T168040, and BagOStuff accounts for both cases already:
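For illustration, one way a cache layer can "account for" this class of bug is to snapshot the value at set() time, so later mutation of the caller's object cannot reach the deferred backend write. This is a hedged sketch of that idea only; it is not the actual BagOStuff implementation, and the class names are made up:

```python
import copy


class SnapshottingCacheModel:
    """Toy cache whose set() snapshots the value immediately, so a
    deferred backend write is immune to later caller mutation."""

    def __init__(self):
        self.memc = {}
        self.mysql = {}
        self.deferred = []  # callbacks run "post-send"

    def set(self, key, value):
        snapshot = copy.deepcopy(value)  # freeze the value now
        self.memc[key] = snapshot
        self.deferred.append(lambda: self.mysql.__setitem__(key, snapshot))

    def run_post_send(self):
        for cb in self.deferred:
            cb()
        self.deferred.clear()


cache = SnapshottingCacheModel()
parser_output = {"html": "<p>hi</p>"}
cache.set("page:1", parser_output)
parser_output["html"] = "<p>POST-PROCESSED</p>"  # no longer reaches the cache
cache.run_post_send()
print(cache.mysql["page:1"]["html"])  # prints "<p>hi</p>"
```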
Regarding whether the simple "request 2" scenarios involve repeat access, I don't know off-hand when exactly that happens, but I would assume it does. ParserOutputAccess has a fair bit of logic specifically for in-process caching, and that's there for production, not tests.
If I recall correctly we've removed logic elsewhere over the years in favour of leaning on ParserOutputAccess. This presumably means repeat calls have gotten more common, including in code paths that you might think of as relating to one user action or one parse. Adding or enabling a few logger->debug calls might reveal some of those cases. My guess would be that it's things like this:
- On save, something somewhere first demands the new ParserOutput and thus the parse, but this isn't explicit. Various code paths during an edit require the result. Afaik we don't pass the PO to all of them; they often get it via the POA service and its process cache.
- On page view, various hooks in the skin might have extensions that add or change information on the page. A common and useful pattern here is to not do uncached DB queries and not even ad-hoc caching through a separate memc key (because the key isn't likely to stay popular/together, and has non-trivial purge requirements). Instead, they'll compute what they need in the ParserOutput and, on page view, retrieve that for the current page. As long as the subject is the current page you can pretty much assume it's free to access, since even on a miss it will have been needed by the pageview itself already.
- Batching in other contexts. Some APIs may be iterating through titles to compute stuff, and rely on ParserOutput for one or more pieces, which tend to be decoupled and unaware of each other. They reuse the PO via the POA service. In practice this is an anti-pattern and a local maximum. You never want to risk having to parse multiple articles in a single GET request. I believe in most cases we've transitioned this kind of caller to page props, which we can retrieve efficiently from the DB, and in batches, and which don't expire. (We don't do this for pageview skin hooks, because you want a pageview to be internally consistent, especially if it varies by user preferences, A/B tests, old revision IDs, race conditions around edit activity, and wikitext with pseudo-random output.)
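The common thread in the patterns above is that an in-process cache hands every caller the *same* object, so whichever caller mutates it first changes what everyone else sees. A minimal Python model of that sharing (illustrative names only, not the real ParserOutputAccess API):

```python
class ProcessCacheModel:
    """Toy in-process cache along the lines described above:
    repeat callers within one request share the same object
    unless someone clones it first."""

    def __init__(self, parse_fn):
        self._parse = parse_fn
        self._cache = {}
        self.parses = 0  # count real parses for demonstration

    def get(self, title):
        if title not in self._cache:
            self.parses += 1
            self._cache[title] = self._parse(title)
        return self._cache[title]  # shared reference!


poa = ProcessCacheModel(lambda t: {"title": t, "html": f"<p>{t}</p>"})

# Caller A (think: the page view) post-processes "in place"...
out_a = poa.get("Main_Page")
out_a["html"] += "<!-- injected by skin hook -->"

# ...and caller B (think: a skin hook or API batch) sees the
# mutated object, even though it only asked for the cached parse.
out_b = poa.get("Main_Page")
assert out_b is out_a
assert poa.parses == 1  # parsed once; the mutated view is shared
```

This is why cloning before post-processing protects later callers: they get the pristine parse instead of whatever the first caller did to it.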
Aha, that does explain why it didn't seem to be a problem in practice (which felt either very lucky or of less consequence than I believed): it used to be, but Before My Time™ :D
Thinking about all this (and that probably requires more thinking still, but we'll get there), I'm reaching the opinion that:
- cloning ParserOutput before postprocessing is probably a good idea right now anyway, for safety reasons
- BUT it may not be as much of a *mandatory step* as I thought it was, so it'll be worth profiling to estimate the perf impact
- apart from the in-process POA cache, the cache SHOULD be safe from these operations even if we do not clone (assuming we don't mess up by storing different stuff under the same key, that is)
- whether modifying what's in the in-process POA cache is actually an issue is an interesting question (currently solved by "let's clone that thing" for postproc) and may require some pondering
- we'll probably want to instrument a couple of things around these paths to check whether we're right to worry about double postprocessing and the like
- IF the perf impact is nontrivial AND we answer the in-process POA cache question AND we guarantee that we're not double-postprocessing stuff, THEN we may consider dropping the clone before postproc.
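The "clone before postproc, controlled by a config variable" approach from the patches above can be sketched like this. This is a hypothetical Python illustration: the function name, the flag, and the `DictPOA` helper are all made up for the example and do not match MediaWiki's actual code:

```python
import copy


def get_output_for_post_processing(poa_cache, title,
                                   clone_before_postprocess=True):
    """Fetch a cached parser output; optionally clone it so that
    post-processing cannot pollute the in-process cache.

    The flag mirrors the idea of gating the clone behind a
    configuration variable so it can be tweaked during rollout.
    """
    output = poa_cache.get(title)
    if clone_before_postprocess:
        output = copy.deepcopy(output)  # cache keeps the pristine copy
    return output


class DictPOA:
    """Trivial stand-in for an in-process parser-output cache."""

    def __init__(self, store):
        self._store = store

    def get(self, title):
        return self._store[title]


cache = {"Main_Page": {"html": "<p>hi</p>"}}
poa = DictPOA(cache)

out = get_output_for_post_processing(poa, "Main_Page")
out["html"] = "<p>mutated</p>"       # post-processing mutates the clone
print(cache["Main_Page"]["html"])    # prints "<p>hi</p>": cache stays clean
```

If profiling later shows the deep clone is too costly and the double-postprocessing question is answered, flipping the flag off is a one-line change rather than a revert.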
That's for future-us, probably, but that's interesting stuff. And it did make me dig a bit more into the Correct Way of writing the tests that I'm currently playing with to understand a few more things around that. Cool!