
Provide cached access to Parsoid PHP within core
Closed, Resolved, Public

Description

In order to serve Parsoid HTML from core REST endpoints and ultimately move RESTBase functionality into core, the first step is to give Parsoid-PHP the ability to serve page/revision HTML from a cache, similar to the ParserCache.

To kick off the discussion, I have a bunch of questions:

  1. @ssastry What's the future of Parsoid being a MW extension and in general what're the plans for Parsoid delivery? Should we implement this within core or within Parsoid repo for now?
  2. @ssastry I can see that a bunch of things within Parsoid PHP are built into ParsoidServices. Do you think we should add 'Parsoid' to that list, together with 'ParsoidCache'?
  3. In general, I think we would benefit from a generic service that allows access to cached Parsoid content and manages it, which can be reused. But we can generalize later on.
  4. @ssastry What does the Parsoid parse vary on right now? Do we still ignore the logged-in user etc?

Event Timeline

After digging a bit into the Parsoid PHP code, I have a couple more questions:

  • We rely on the default parser for a bunch of things, and we always give it canonical parser options. Do we intend to change that, soon or in the far future, or will Parsoid eventually subsume what the default parser does for it? Do you think it's worth varying the Parsoid cache by the default parser's ParserOptions?
  • As a test case, I'm interested in caching the page bundle on the read path only. We pass a bunch of options into wikitext2html, and all of them change the output. The naive approach to caching this is to dump all the options into a key and invalidate on page_touched, but that's way too naive. Some of the parse options have an efficient post-processing step, for example language variants. So we can cache the default variant, then convert the cached default variant to the requested one and cache that as well. This should be far more efficient than doing a full reparse for every variant. Do you think this two-step approach will generally be applicable to more options? Ideally, I would like to understand whether we could categorize all the Parsoid options by whether they can be applied on the fly (body_only), transformed from another cached entry (language variants), or require a full reparse. Are we even interested in a feature like that?
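The two-step idea from the last bullet could be sketched roughly as follows. This is only an illustration in Python, not Parsoid code: `VariantCache`, `extract_body`, the `(rev_id, variant)` key shape and the `parse`/`convert` callables are all hypothetical names, and invalidation (for example on page_touched) is left out entirely.

```python
from typing import Callable, Dict, Tuple

DEFAULT_VARIANT = "default"  # hypothetical marker for the unconverted output

Key = Tuple[int, str]  # hypothetical key shape: (revision id, variant)


def extract_body(html: str) -> str:
    """On-the-fly option: return only what sits between <body> and </body>."""
    start = html.find("<body>")
    end = html.rfind("</body>")
    if start == -1 or end == -1:
        return html
    return html[start + len("<body>"):end]


class VariantCache:
    """Caches the default parse and derives other variants from it."""

    def __init__(
        self,
        parse: Callable[[int], str],          # expensive full parse
        convert: Callable[[str, str], str],   # cheap (html, variant) -> html
    ) -> None:
        self._parse = parse
        self._convert = convert
        self._cache: Dict[Key, str] = {}  # stand-in for a real cache backend

    def _get_default(self, rev_id: int) -> str:
        key = (rev_id, DEFAULT_VARIANT)
        if key not in self._cache:
            self._cache[key] = self._parse(rev_id)  # category: full reparse
        return self._cache[key]

    def get(self, rev_id: int, variant: str = DEFAULT_VARIANT,
            body_only: bool = False) -> str:
        if variant == DEFAULT_VARIANT:
            html = self._get_default(rev_id)
        else:
            key = (rev_id, variant)
            if key not in self._cache:
                # Category: transformed from another cached entry.
                self._cache[key] = self._convert(self._get_default(rev_id), variant)
            html = self._cache[key]
        # Category: applied on the fly, so body_only never enters the key.
        return extract_body(html) if body_only else html
```

With this shape, requesting several variants of the same revision triggers one full parse plus one conversion per variant, and body_only never multiplies the number of cache entries.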

Hey, @Pchelolo . Why does the cache have to be implemented within Parsoid/PHP? Couldn't it be implemented like...

if (data in cache) {
   get it from cache;
   return it;
} else {
   pass wikitext to Parsoid/PHP;
   get the results;
   put those results into cache;
   return results;
}

That would keep the Parsoid/PHP code simpler, and let us use simpler caching mechanics.
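For illustration, the pseudocode above might look like this as a small wrapper that keeps the parser itself unaware of caching. It is a sketch in Python rather than actual Parsoid/PHP or MediaWiki code; `CachingParser`, the `(page id, revision id)` key and the in-memory dict are assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (page id, revision id). A real key would probably
# also need to cover whatever options the parse varies on.
CacheKey = Tuple[int, int]


class CachingParser:
    """Wraps a parse function with a cache; the parser stays cache-unaware."""

    def __init__(self, parse: Callable[[str], str]) -> None:
        self._parse = parse
        self._cache: Dict[CacheKey, str] = {}  # stand-in for a real cache backend

    def get_html(self, key: CacheKey, wikitext: str) -> str:
        cached = self._cache.get(key)
        if cached is not None:
            # Data is in the cache: return it without parsing.
            return cached
        # Cache miss: pass the wikitext to the parser, store and return the result.
        html = self._parse(wikitext)
        self._cache[key] = html
        return html
```

The caller owns the cache and the key, so the parsing library only ever sees wikitext in and HTML out.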

From within core, we can't rely on Parsoid while it's an extension, so this logic will go into the extension for now and will be moved over once Parsoid is no longer an extension.

Also, I think we only implemented the Parsoid service API as an extension because Parsoid wasn't yet finished so we couldn't merge it to core. @Tgr and @Anomie probably have more details on that.

@Pchelolo but we can rely on a library in core, right? Could we just depend on the Parsoid/PHP library, not the Parsoid/PHP MW extension?

@Pchelolo but we can rely on a library in core, right? Could we just depend on the Parsoid/PHP library, not the Parsoid/PHP MW extension?

That is what will happen once Parsoid is a library and is integrated. According to @ssastry, that will happen in January. But we want to start poking at it before then, so the code will go into the extension for now.

Hey, @Pchelolo . Why does the cache have to be implemented within Parsoid/PHP? Couldn't it be implemented like...
[...]
That would keep the Parsoid/PHP code simpler, and let us use simpler caching mechanics.

That makes more sense to me too.

Also, I think we only implemented the Parsoid service API as an extension because Parsoid wasn't yet finished so we couldn't merge it to core. @Tgr and @Anomie probably have more details on that.

That's basically correct.

  1. @ssastry What's the future of Parsoid being a MW extension and in general what're the plans for Parsoid delivery? Should we implement this within core or within Parsoid repo for now?

The extension was always intended (at least in my mind) as being temporary. As far as I know the long-term plan is still to replace Parser.php with Parsoid/PHP.

One way to get to that point would be to create abstractions in MW for the concept of "parsing", switch everything to use those abstractions, and then roll out a switch of implementation from one backed by Parser.php to one backed by Parsoid/PHP. Another would be to pull in Parsoid/PHP and maintain parallel implementations of everything, one for Parser.php and one for Parsoid/PHP. Chances are we'll wind up at some middle point, having an abstraction for MediaWikiServices::getParser() but parallel implementations for at least some of the ways the parsers call out for things like handling extension tags.
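The first option, an abstraction with a switchable implementation, could be sketched along these lines. This is a Python illustration only; `WikitextParser`, `LegacyBackend`, `ParsoidBackend` and `make_parser` are hypothetical names, and the two backends are stubs rather than real calls into Parser.php or Parsoid/PHP.

```python
from abc import ABC, abstractmethod


class WikitextParser(ABC):
    """Hypothetical abstraction for the concept of "parsing"."""

    @abstractmethod
    def parse(self, wikitext: str) -> str:
        """Return HTML for the given wikitext."""


class LegacyBackend(WikitextParser):
    """Stub standing in for an implementation backed by Parser.php."""

    def parse(self, wikitext: str) -> str:
        return f"<div data-backend='legacy'>{wikitext}</div>"


class ParsoidBackend(WikitextParser):
    """Stub standing in for an implementation backed by Parsoid/PHP."""

    def parse(self, wikitext: str) -> str:
        return f"<div data-backend='parsoid'>{wikitext}</div>"


def make_parser(use_parsoid: bool) -> WikitextParser:
    # The rollout switch: callers depend only on WikitextParser, so changing
    # the flag changes the implementation without touching them.
    return ParsoidBackend() if use_parsoid else LegacyBackend()
```

Callers written against `WikitextParser` would not change when the flag flips, which is what would make a gradual rollout possible.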

  2. @ssastry I can see that a bunch of things within Parsoid PHP are built into ParsoidServices. Do you think we should add 'Parsoid' to that list, together with 'ParsoidCache'?

T229083: Clean up services in Parsoid extension. TL;DR is that "ParsoidServices" should probably go away, once we figure out how we actually want the public services structured. But until we get to that point, ParsoidServices works as a bit of a kitchen sink for holding different bits.

  3. In general, I think we would benefit from a generic service that allows access to cached Parsoid content and manages it, which can be reused. But we can generalize later on.

I think that gets into the "figure out how we actually want the public services structured" thing. I think Parsoid/PHP as a library should focus on parsing, leaving caching to the caller. But a public boundary between MW and Parsoid/PHP could validly have caching built in.

  4. @ssastry What does the Parsoid parse vary on right now? Do we still ignore the logged-in user etc?

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/parsoid/+/f015e3543389ef50aade9cba5de62949620cdfb4/extension/src/Config/PageConfigFactory.php#105 seems relevant here.

I just want to raise this so it's on folks' radar: it would be nice if whatever caching mechanism is introduced could easily have the HTML for current page revisions dumped in bulk, preferably on a per-wiki basis. If that turns out not to be feasible because of the design, that's understandable, but if it turns out not to be a big deal, it would be handy for providing HTML dumps of content, particularly for the large wikis.

If there's a better task for me to bring this up, please let me know.

Pchelolo closed this task as Resolved. (Edited Jun 17 2021, 11:28 PM)

We can now cache Parsoid output in ParserCache and are doing so for the MW REST API.

Did you mean "now"?

Yes. Edited. hehe, I can see how my comment was very very confusing :)