Page MenuHomePhabricator

Make ParsoidHandler::wt2html write to parser cache
Open, Needs TriagePublic

Description

In order to prime the ParserCache with Parsoid output for use by VE, we need to trigger a parse on every edit. The intended mechanism for this is by triggering a job from DerivedPageDataUpdater, see T322427.

However, we are already performing a parsoid parse on every edit: it's triggered vis RESTbase, by calling the page/html REST endpoint exposed by Parsoid (not the one in MW core). That endpoints relies on ParsoidHandler::wt2html.

If ParsoidHandler::wt2html was to write to (and read from) the ParserCache, we could avoid running the parse twice.

NOTE: If we have both mechanisms for cache priming enabled at the same time, we'll probably want to use PoolCounter to avoid two parses happening in parallel. The chances for this are high since both would be triggered on edit via the job queue, and the parse may take several seconds.
NOTE: if CacheThresholdTime is > 0, we may still end up parsing twice.

IDEA: Make the parser cache write probabilistic, with a configurable ratio. So we can have e.g. 5% of all edits write to the parser cache via wt2html. This way, we can sollect statistics across all wikis and page types, so we can predict how much capacity we will need to support 100%.

Event Timeline

daniel updated the task description. (Show Details)

Isn't this better done in TransformHandler::execute behind some RB-enabled-or-not config flag. ParsoidHandler doesn't need to know about ParserCache / RB etc. It is just setting things up, calling Parsoid to generate HTML. Once RB is no longer used for Parsoid, the code in TransformHandler can go away in favour of the integration in ContentHandler / Pars(oid|er)OutputAcceess.

daniel updated the task description. (Show Details)

Change 858687 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] Set parser cache write propability for /page/html endpoint.

https://gerrit.wikimedia.org/r/858687

I will note that, while I think enabling random writes to the PC to understand volume is a good idea, longer-term we want probably to switch on this function the same moment that we switch off pre-generation to restbase, so that we indeed don't try to parse the data twice.

However, I think we need to meet to iron out a transition plan.

I will note that, while I think enabling random writes to the PC to understand volume is a good idea, longer-term we want probably to switch on this function the same moment that we switch off pre-generation to restbase, so that we indeed don't try to parse the data twice.

Yes, absoluely. Although Amir suggested that we may want to disable caching ofr pages that parse quickly, in particular on commons.

However, I think we need to meet to iron out a transition plan.

I was planning to turn this on with a low factor - say, 1% or 5%, to get some numbers. Then meet and discuss them. Does that sound good?

if you ask me, we need to rethink the whole concept of pre-generation specially in large wikis with low number of human visitors/editors: Commons, wikidata, botpedias, etc.

if you ask me, we need to rethink the whole concept of pre-generation specially in large wikis with low number of human visitors/editors: Commons, wikidata, botpedias, etc.

These wikis don't benefit much from VisualEditor anyway. So I'd be fine with just disabling it there. But I guess it's up to the Web an/or Editing team to decide that.

Let's get numbers and then meet and decide.

Isn't this better done in TransformHandler::execute behind some RB-enabled-or-not config flag. ParsoidHandler doesn't need to know about ParserCache / RB etc. It is just setting things up, calling Parsoid to generate HTML. Once RB is no longer used for Parsoid, the code in TransformHandler can go away in favour of the integration in ContentHandler / Pars(oid|er)OutputAcceess.

PageHandler and TransformHandler are only thin wrappers around methods implemented in ParsoidHandler. ParsoidHandler implements the glue between "how the parois API used to work" and "how stuff works in core now". ParsoidHandler is being transformed into a stub that sets up the approproate REST helper class (HtmlInputTransformHelper or HtmlOutputRendererHelper). These classes are responsible for caching and stashing.

Change 858687 merged by jenkins-bot:

[operations/mediawiki-config@master] Set parser cache write propability for /page/html endpoint.

https://gerrit.wikimedia.org/r/858687

Side note: DiscussionTools just started to write to the parsoid parser cache for edits to talk pages we DT is enabled. This wasn't quite intentional, it's a side effect of a fix for T323357. If it causes a problem, let me know.

Change 860642 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] HtmlOutputRendererHelper: allow parser cache to be disabled.

https://gerrit.wikimedia.org/r/860642

Change 854537 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] ParsoidHandler: use HtmlOutputRendererHelper in wt2html

https://gerrit.wikimedia.org/r/854537