
Implement class (name to be decided) to provide access to preprocessor, extensions, and other metadata needed by Parsoid
Closed, Resolved, Public

Description

Parsoid/JS relies on MediaWiki to:
(a) expand templates -- returns wikitext + categories|properties|modules|jsconfigvars
(b) process tag extensions (that don't have Parsoid native implementations) -- returns HTML + modules|jsconfigvars|categories
(c) get metadata for media -- returns mediatype|mime|size|url|badfile (images, audio) and mediatype|mime|size|url|badfile|derivatives|imagetext (video)
(d) link metadata for links -- returns missing|known|redirect|disambiguation for every link
(e) get siteinfo and other config -- this is being handled separately by T212982: Create a ParsingEnvironment class for use with Parsoid/PHP and so is not a concern for this task; it is listed here only for completeness' sake.
(f) post lint information when linting is enabled.
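
A very rough sketch of what such an interface might look like, based only on the responsibilities listed above (all names and signatures here are hypothetical placeholders, not the interface that was eventually merged):

  <?php
  // Hypothetical sketch only -- illustrative names and signatures.
  interface ParsoidDataAccess {
      /** (a) Expand templates: wikitext in, wikitext plus categories/properties/modules/jsconfigvars out. */
      public function expandTemplates( $pageConfig, string $wikitext ): array;

      /** (b) Process a tag extension without a Parsoid-native implementation: HTML plus metadata out. */
      public function parseExtensionTag( $pageConfig, string $extSrc ): array;

      /** (c) Bulk media metadata: mediatype, mime, size, url, badfile, derivatives, ... keyed by title. */
      public function getFileInfo( $pageConfig, array $titles ): array;

      /** (d) Bulk link metadata: missing, known, redirect, disambiguation, keyed by title. */
      public function getPageInfo( $pageConfig, array $titles ): array;

      /** (f) Record lint data when linting is enabled. */
      public function logLinterData( $pageConfig, array $lints ): void;
  }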

On the Parsoid/JS end, lib/mw/ApiRequest.js and lib/mw/Batcher.js implement the functionality to talk with MediaWiki. On the MediaWiki end, the action API and the ParsoidBatchAPI extension deal with Parsoid's requests.

This task requires the implementation of a new class in MediaWiki core to subsume the functionality of code on both the Parsoid and the MediaWiki side. Parsoid/PHP (composer lib) will talk with this new class, which in turn could implement the required functionality in one of several ways: (a) natively, by copying the requisite code from ParsoidBatchAPI and core; (b) by proxying the requests over to the Action API or ParsoidBatchAPI, whichever makes sense; (c) by reimplementing code or redefining interfaces; or (d) some combination of the three.
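
For illustration, option (b) could be done with an internal Action API call rather than an HTTP round trip, roughly as below (a sketch only, with error handling and most parameter plumbing omitted):

  <?php
  // Sketch: proxy an expandtemplates request through the Action API internally.
  function expandTemplatesViaActionApi( string $title, string $wikitext, ?int $revId = null ): array {
      $params = [
          'action' => 'expandtemplates',
          'title' => $title,
          'text' => $wikitext,
          'prop' => 'wikitext|categories|properties|modules|jsconfigvars',
      ];
      if ( $revId !== null ) {
          $params['revid'] = $revId;
      }
      // FauxRequest lets us invoke the API in-process with these parameters.
      $api = new ApiMain( new FauxRequest( $params ) );
      $api->execute();
      return $api->getResult()->getResultData( [ 'expandtemplates' ] );
  }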

The overriding requirement here is to provide this information to Parsoid/PHP as efficiently as possible. At some point in the future, we could conceivably move some of this functionality out of MediaWiki core into the Parsoid/PHP composer lib, but that is far in the future, once we are ready to pull out the legacy PHP parser. I am mentioning it here only in case it influences the design now.

Pointers to the Parsoid JS code

  • lib/mw/ApiRequest.js
  • lib/mw/Batcher.js

Event Timeline

ssastry triaged this task as Medium priority. Jan 8 2019, 11:22 PM
ssastry moved this task from Backlog to Porting on the Parsoid-PHP board.

I note that Parsoid probably currently ignores things such as ParserOptions::getCurrentRevisionCallback() which are used in various places (e.g. previews, MediaWiki-extensions-TemplateSandbox) to affect parses for previews and such. To replace the existing PHP Parser, implementations of the interface described here will need to take those sorts of things into account when fetching template contents.
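
For example, a template fetch that respects that callback might look roughly like the sketch below (illustrative only, using the Revision-based callback API as it exists at the time of writing):

  <?php
  // Sketch: fetch template wikitext via the current-revision callback so that
  // preview / TemplateSandbox overrides are respected.
  function fetchTemplateSource( ParserOptions $popts, Parser $parser, Title $title ): ?string {
      $callback = $popts->getCurrentRevisionCallback();
      $rev = call_user_func( $callback, $title, $parser );
      if ( !$rev ) {
          return null; // page (or overridden revision) does not exist
      }
      $content = $rev->getContent();
      return $content instanceof WikitextContent ? $content->getNativeData() : null;
  }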

I'm not sure that separating this functionality from the functionality described in T212982 will be the best design. To some extent that will come down to how we balance separation of functionality (e.g. configuration settings versus content fetching) with implementation complexity (e.g. spreading implementation over multiple classes and pass multiple objects around).

I note that Parsoid probably currently ignores things such as ParserOptions::getCurrentRevisionCallback() which are used in various places (e.g. previews, MediaWiki-extensions-TemplateSandbox) to affect parses for previews and such. To replace the existing PHP Parser, implementations of the interface described here will need to take those sorts of things into account when fetching template contents.

Sounds good.

I'm not sure that separating this functionality from the functionality described in T212982 will be the best design. To some extent that will come down to how we balance separation of functionality (e.g. configuration settings versus content fetching) with implementation complexity (e.g. spreading implementation over multiple classes and pass multiple objects around).

Conceptually, I think of the two as different things. I assume the API module in T205339 would construct these as two objects (env and, say, mw) and pass them to the composer lib along with the request params. But, sure, if it turns out that it is best implemented as a single interface, I am not fundamentally opposed to it. That said, if there is a clean and natural separation in the design and implementation that isn't cumbersome, two objects make sense, since the two are subject to different performance and caching considerations as well.

But, I think you are talking about the fact that if you split env into 3 objects, it would be 4 separate objects (pConf, wikiConf, pageInfo, mw). That said, presumably at least 2 of the 4 can come preconstructed from a cache, and 3 of them can be loosely wrapped into a single env object to reduce the number of args passed in/out of the Parsoid/PHP composer lib.
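
Something like the following, purely as an illustration of the shape being discussed (Env, wikitextToHtml, and the variable names are placeholders from this thread, not real code):

  // pConf and wikiConf could come preconstructed from a cache; pageInfo is
  // per-request; $mw is the data-access object from this task.
  $env = new Env( $pConf, $wikiConf, $pageInfo ); // loose wrapper, placeholder name
  $html = $parsoid->wikitextToHtml( $env, $mw, $requestParams );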

Anyway, these tickets are suggestions; see what kind of impedance mismatches you run into as you write code and do what seems pragmatic.

List of places where code in lib/mw/ is called, ignoring lib/config since that should have been handled for T212982 already:

  • Batcher.getPageProps - More or less action=query&prop=pageprops&ppprop=disambiguation, plus $title->isRedirect().
    • lib/utils/ContentUtils.js - Used for setting classes on links: "new", "mw-redirect", "mw-disambig".
  • ImageInfoRequest / Batcher.imageinfo - action=query&prop=imageinfo or action=query&prop=videoinfo
    • lib/wt2html/pp/processors/AddMediaInfo.js
  • LintRequest - action=record-lint
    • lib/logger/LintLogger.js - probably replaced by SiteConfig::logLinterData()?
  • PHPParseRequest / Batcher.parse - action=parse&prop=text|modules|jsconfigvars|categories
    • lib/api/apiUtils.js - with onlypst to handle subst: (also ~~~~ and the like?).
    • lib/wt2html/tt/ExtensionHandler.js - for parsing extension tags, I think?
  • PreprocessorRequest / Batcher.preprocess - action=expandtemplates&prop=wikitext|categories|properties|modules|jsconfigvars
    • lib/wt2html/tt/TemplateHandler.js - expands templates when usePHPPreProcessor is true
  • TemplateRequest - prop=info|revisions&rvprop=content|ids|timestamp|size|sha1|contentmodel for a title or revid.
    • Bunch of stuff calling setPageSrcInfo(), which should all be in T212982's PageConfig now.
    • lib/wt2html/tt/TemplateHandler.js - seems to be getting the main-slot wikitext only. Memoizes. For expanding templates and parser functions when usePHPPreProcessor is false.
  • TemplateDataRequest - action=templatedata for a title
    • lib/html2wt/WikitextSerializer.js

Batcher does two main things: it memoizes requests for the same target, and it can do multiple requests in one API call using action=parsoid-batch (this is configurable).
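
Memoization is straightforward to keep on the interface side; a minimal sketch of a per-parse memoizing wrapper (the class name, the wrapped $dataAccess, and getPageInfo() are placeholders for whatever this task ends up defining, and results are assumed to be keyed by title):

  <?php
  // Sketch of a per-parse memoizing wrapper around the new interface.
  class MemoizingDataAccess {
      private $dataAccess;
      private $pageInfoCache = [];

      public function __construct( $dataAccess ) {
          $this->dataAccess = $dataAccess;
      }

      public function getPageInfo( array $titles ): array {
          // Only ask the backend about titles not already seen in this parse.
          $missing = array_diff( $titles, array_keys( $this->pageInfoCache ) );
          if ( $missing ) {
              $this->pageInfoCache += $this->dataAccess->getPageInfo( array_values( $missing ) );
          }
          return array_intersect_key( $this->pageInfoCache, array_flip( $titles ) );
      }
  }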

All of these use promises, BTW. Do we want to keep using promises, e.g. using guzzlehttp/promises or react/promise, even though PHP-wise they'll probably always be resolved on creation?

We skipped usePHPPreProcessor in gerrit:486821, but I don't know whether we planned on it being always-true or always-false. Are we keeping the existing PHP Preprocessor, or do we need to implement substing and such in Parsoid?

A few comments:

  • No need to use promises since all requests will be synchronously satisfied. Is there a reason you see for using promises for potential future async modes in PHP?
  • Currently, usePHPPreProcessor is false for parser test runs in Parsoid, which use the Parsoid-native template expansion and parser-function code. That code has existed since the beginning, but had too many edge-case bugs and its performance wasn't considered good enough to develop it further. Using the MediaWiki API for parser test runs would have been excessive given how frequently they run, and would have slowed them down greatly. But the question of what to do now for the core integration remains. I don't see a reason not to use the same core preprocessing code for everything (including parser tests).
  • We need memoization support for sure for perf reasons. But, we probably do not need batching of template and extension requests any more.
  • But, we do need batch query support to avoid hitting the db for every single image/link. The "redlinks" and "mediainfo" dom passes make these bulk metadata requests. So, we need support for these bulk/batch requests (see the sketch after this list).
  • As for linting, yes, you could use any existing interfaces provided by SiteConfig.
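
On the core side, the bulk link-metadata ("redlinks") query could plausibly be answered with existing helpers, roughly like this (a sketch only; it assumes LinkBatch and PageProps are acceptable for the purpose):

  <?php
  // Sketch: answer a bulk "redlinks"-style query with existing core helpers.
  // Returns [ prefixed title => [ 'missing' => bool, 'redirect' => bool, 'disambiguation' => bool ] ].
  function getBulkLinkInfo( array $titleStrings ): array {
      $titles = [];
      foreach ( $titleStrings as $str ) {
          $title = Title::newFromText( $str );
          if ( $title ) {
              $titles[] = $title;
          }
      }

      // Prime the link cache for all titles with one query instead of one per link.
      ( new LinkBatch( $titles ) )->execute();

      // Disambiguation pages carry the 'disambiguation' page prop; result is keyed by page ID.
      $props = PageProps::getInstance()->getProperties( $titles, 'disambiguation' );

      $result = [];
      foreach ( $titles as $title ) {
          $result[$title->getPrefixedText()] = [
              'missing' => !$title->exists(),
              'redirect' => $title->isRedirect(),
              'disambiguation' => isset( $props[$title->getArticleID()] ),
          ];
      }
      return $result;
  }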

If we were to use Promises, we should use PHP's generators (yield) to implement an async/await coding style. That does allow coroutine-style concurrency, and could in theory overlap execution using async DB or curl_multi_*.
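
A minimal sketch of that style, assuming guzzlehttp/promises and a hypothetical async data-access method, just to show the shape (a real scheduler would interleave multiple coroutines and drive curl_multi_* / async DB inside the wait step):

  <?php
  use GuzzleHttp\Promise\PromiseInterface;

  // Drive a coroutine: the generator yields promises and receives their
  // resolved values back, giving an async/await-like coding style.
  function runCoroutine( Generator $gen ) {
      while ( $gen->valid() ) {
          $yielded = $gen->current();
          // A real scheduler would let other coroutines / curl_multi handles
          // make progress here instead of blocking on wait().
          $value = $yielded instanceof PromiseInterface ? $yielded->wait() : $yielded;
          $gen->send( $value );
      }
      return $gen->getReturn();
  }

  // A step written in "await" style; getPageInfoAsync() is hypothetical.
  $coroutine = ( function () use ( $dataAccess ) {
      $info = yield $dataAccess->getPageInfoAsync( [ 'Foo', 'Bar' ] );
      return count( $info );
  } )();
  $result = runCoroutine( $coroutine );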

I suspect we will want to port the existing fairly-limited Parsoid preprocessor implementation in order to get parser tests running quickly. True integration with the core PHP preprocessor might be more difficult than simply porting what we have, to get things going initially. I could be wrong about this.

Oh, and I wanted to mention that @Legoktm was looking at this from the other side; that is, at extracting a proper "pluggable Parser interface" from MediaWiki core that would (in theory) abstract all of the parser integration points. It would be worth keeping both perspectives in mind as we try to meet in the middle here.

  • No need to use promises since all requests will be synchronously satisfied. Is there a reason you see for using promises for potential future async modes in PHP?

The only reason I asked about promises was because not using promises will result in structural changes to the calling code, hopefully just directly calling the result-processing code instead of passing it as a callback to .then(). I wasn't sure of the extent to which we wanted to do that during the port.

  • But, we do need batch query support to avoid hitting the db for every single image/link. The "redlinks" and "mediainfo" dom passes make these bulk metadata requests. So, we need support for these bulk/batch requests.

Unless I'm looking at the wrong thing, the "redlinks" pass isn't doing interesting batching (from the perspective of code I'll be writing for this task). It's just passing a list of titles to Batcher.getPageProps as one call, and getting back an array of data for all the titles from the one promise.

The "mediainfo" pass does do interesting batching as things are currently written. I wonder, though, whether it could be rewritten to be uninteresting like the "redlinks" pass: pass an array of titles and get an array of results back.

If we were to use Promises, we should use PHP's generators (yield) to implement an async/await coding style. That does allow coroutine-style concurrency, and could in theory overlap execution using async DB or curl_multi_*.

I have little idea how that might actually work; I guess internally Parsoid would have to have an event loop somewhere to call curl_multi_*.

As for "async DB", recall that it still only allows one request in flight at a time. If code tries to send a second request when one is already in flight, it'll raise an error. So you'd need to wrap the DB connection in a queue and everything using that DB connection would have to go via the queuing wrapper and use promises.

  • Currently, usePHPPreProcessor is false for parser test runs in Parsoid, which use the Parsoid-native template expansion and parser-function code. That code has existed since the beginning, but had too many edge-case bugs and its performance wasn't considered good enough to develop it further. Using the MediaWiki API for parser test runs would have been excessive given how frequently they run, and would have slowed them down greatly. But the question of what to do now for the core integration remains. I don't see a reason not to use the same core preprocessing code for everything (including parser tests).

I suspect we will want to port the existing fairly-limited Parsoid preprocessor implementation in order to get parser tests running quickly. True integration with the core PHP preprocessor might be more difficult than simply porting what we have, to get things going initially. I could be wrong about this.

My question for the moment is whether I need to implement an equivalent of PreprocessorRequest or not, i.e. if we want to support "usePHPPreProcessor = true".

Eventually we'll likely have to get rid of both PreprocessorRequest and PHPParseRequest, but before we can do that we'll need that "pluggable Parser interface".

Oh, and I wanted to mention that @Legoktm was looking at this from the other side; that is, at extracting a proper "pluggable Parser interface" from MediaWiki core that would (in theory) abstract all of the parser integration points. It would be worth keeping both perspectives in mind as we try to meet in the middle here.

There's a tension between wanting to do a straight port of the JS to PHP and making Parsoid-PHP implement a proper "pluggable Parser interface". Chances are we'll wind up taking it in steps:

  1. We implement more or less direct copies of PreprocessorRequest and PHPParseRequest that call Parser.php code, to close this task.
  2. We finish the JS→PHP porting.
  3. We'll implement the pluggable interface:
    • Implement pre-save transform natively in Parsoid-PHP.
    • Implement the pluggable interface's versions of Parser::recursive*(), PPFrame, and so on too.
  4. Then we can actually replace Parser.php as the implementation of the pluggable interface.
    • Call tag hooks directly instead of using PHPParseRequest.
    • Start calling parser function hooks directly instead of reimplementing all of them internally.

  • No need to use promises since all requests will be synchronously satisfied. Is there a reason you see for using promises for potential future async modes in PHP?

The only reason I asked about promises was because not using promises will result in structural changes to the calling code, hopefully just directly calling the result-processing code instead of passing it as a callback to .then(). I wasn't sure of the extent to which we wanted to do that during the port.

Ah, ok. I think it is okay to change the structure of the calling code. I don't expect results to be dependent on ordering.

  • But, we do need batch query support to avoid hitting the db for every single image/link. The "redlinks" and "mediainfo" dom passes make these bulk metadata requests. So, we need support for these bulk/batch requests.

Unless I'm looking at the wrong thing, the "redlinks" pass isn't doing interesting batching (from the perspective of code I'll be writing for this task). It's just passing a list of titles to Batcher.getPageProps as one call, and getting back an array of data for all the titles from the one promise.

The "mediainfo" pass does do interesting batching as things are currently written. I wonder, though, whether it could be rewritten to be uninteresting like the "redlinks" pass: pass an array of titles and get an array of results back.

I meant: "batching support" in general. As you indicate, in the case of redlinks, all the interesting work happens on the api side. In the case of mediainfo, some postprocessing work does happen in the DOM pass. MediaInfo pass retargets the stubs for media into fully expanded markup. I think that work is Parsoid-specific and should remain on the Parsoid side.

  • Currently, usePHPPreProcessor is false for parser test runs in Parsoid, which use the Parsoid-native template expansion and parser-function code. That code has existed since the beginning, but had too many edge-case bugs and its performance wasn't considered good enough to develop it further. Using the MediaWiki API for parser test runs would have been excessive given how frequently they run, and would have slowed them down greatly. But the question of what to do now for the core integration remains. I don't see a reason not to use the same core preprocessing code for everything (including parser tests).

I suspect we will want to port the existing fairly-limited Parsoid preprocessor implementation in order to get parser tests running quickly. True integration with the core PHP preprocessor might be more difficult than simply porting what we have, to get things going initially. I could be wrong about this.

My question for the moment is whether I need to implement an equivalent of PreprocessorRequest or not, i.e. if we want to support "usePHPPreProcessor = true".

Eventually we'll likely have to get rid of both PreprocessorRequest and PHPParseRequest, but before we can do that we'll need that "pluggable Parser interface".

Yes, we do want to use expandtemplates code for handling templates, so whether we go through PreprocessorRequest or directly call your new class's interface method is something to clarify as well. I think you are saying implement direct copies of these classes, which is fine. We can remove these classes and call the interface directly later.

But, yes, usePHPPreProcessor=true is the default and needs to be supported.

Oh, and I wanted to mention that @Legoktm was looking at this from the other side; that is, at extracting a proper "pluggable Parser interface" from MediaWiki core that would (in theory) abstract all of the parser integration points. It would be worth keeping both perspectives in mind as we try to meet in the middle here.

There's a tension between wanting to do a straight port of the JS to PHP and making Parsoid-PHP implement a proper "pluggable Parser interface". Chances are we'll wind up taking it in steps:

  1. We implement more or less direct copies of PreprocessorRequest and PHPParseRequest that call Parser.php code, to close this task.
  2. We finish the JS→PHP porting.
  3. We'll implement the pluggable interface:
    • Implement pre-save transform natively in Parsoid-PHP.
    • Implement the pluggable interface's versions of Parser::recursive*(), PPFrame, and so on too.
  4. Then we can actually replace Parser.php as the implementation of the pluggable interface.
    • Call tag hooks directly instead of using PHPParseRequest.
    • Start calling parser function hooks directly instead of reimplementing all of them internally.

Sounds reasonable to me. I am keen to avoid scope creep where possible, unless the extra scope demonstrably improves overall effectiveness. In this case, I think the incremental approach is better.

I meant: "batching support" in general. As you indicate, in the case of redlinks, all the interesting work happens on the api side. In the case of mediainfo, some postprocessing work does happen in the DOM pass. MediaInfo pass retargets the stubs for media into fully expanded markup. I think that work is Parsoid-specific and should remain on the Parsoid side.

I think you misunderstood my concern.

For the redlinks, the calling code to the interface would look like $results = $t213228->getRedLinkInfo( $titles );. No problem.

But for mediainfo, the calling code is currently structured like $t213228->getMediaInfo( $title )->then( $callback );. It'd be more straightforward at the interface level to rearrange the calling code to work like the redlinks one (i.e. $results = $t213228->getMediaInfo( $titles );).

Yes, we do want to use expandtemplates code for handling templates, so whether we go through PreprocessorRequest or directly call your new class's interface method is something to clarify as well. I think you are saying implement direct copies of these classes, which is fine. We can remove these classes and call the interface directly later.

I was never considering implementing copies of PreprocessorRequest and the other request methods. The question was whether we need a method that behaves like action=expandtemplates (which breaks on some edge cases, e.g. this), which you answered.

But, yes, usePHPPreProcessor=true is the default and needs to be supported.

We should probably put the flag back into SiteConfig then, at least for now. Done in gerrit:486821 PS9.

Change 494817 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/services/parsoid@master] Config: Add DataAccess interface

https://gerrit.wikimedia.org/r/494817

Change 494817 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Config: Add DataAccess interface

https://gerrit.wikimedia.org/r/494817