Page MenuHomePhabricator

Create a ParsingEnvironment class for use with Parsoid/PHP
Closed, ResolvedPublic

Description

There are 3 separate configs that influence wt->html and html->wt in Parsoid: (a) Parsoid Config (b) Wiki Config (c) Page Config. All three are tied together in a single object and usually accessed as env.conf.parsoid, env.conf.wiki, and env.page. This object is instantiated during initialization (either in parse.js, parserTests.js, or during API request handling).

This spreadsheet lists the specific properties for each of these configs. This is partially annotated at this time and will gradually improve, but is probably has sufficient information at this time to start thinking about and designing the PHP interface.

In the Parsoid/JS code,

  • Parsoid config comes from a config file (either on disk or via scap or puppet) and is reused across requests
  • Wiki config comes from from site API calls, and some properties are initialized directly from the response and other properties (usually regexps) are computed and is reused across requests.
  • Page config is pretty much initialized from the wt->html or html->wt request (API, or command-line or other scripts) and is only valid for a single request.

In the Parsoid/PHP code, we need an equivalent interface that supports this, but which doesn't incur a lot of performance overhead. Pre-computing lots of config properties eagerly doesn't make sense in this context. So, the ParsingEnvironment interface needs to be designed appropriately. If necessary, Parsoid code that relies on precomputed properties can be fixed to use different properties. This decision will require taking a closer look at how these computed properties are used and what might be a good replacement instead.

Potential performance opt for later: Once the shape of the env.parsoid and env.wiki config objects settles down during design, they could even be "cached" somewhere and cleared out on every deploy. Or, if this is going to be an easy win, we could rely on expensive-to-initialize techniques and rely on these cached configs to minimize per-request initialization overheads.

Pointers to the JS code

  • lib/config/MWParserEnvironment.js
  • lib/config/WikiConfig.js
  • lib/config/ParsoidConfig.js

Event Timeline

ssastry triaged this task as Medium priority.Jan 4 2019, 11:28 PM
ssastry moved this task from Backlog to Porting on the Parsoid-PHP board.

We'll probably want to have more than one class here.

  • One that holds the "Parsoid config" and the "Wiki config".
  • One that holds the "Page config". Which would include things Parsoid probably currently has no direct concept of like ParserOptions::getCurrentRevisionCallback().

The former would be passed to the Parsoid service's constructor (e.g. in ServiceWiring.php), while the latter would be passed into the equivalent of Parser::parse(). The former is more likely to benefit from caching than the latter.

We could go as far as having three objects by separating the "Parsoid config" from the "Wiki config", but my guess at this point is that in MediaWiki the line between the two will quickly be blurred. as they'll both in the end come from the Config object.

Chances are that details of the caching will be outside of the scope of the interface provided by Parsoid-PHP itself. Instead it would be up to the caller (e.g. MediaWiki) to implement the interface with whatever in-process (e.g. lazy initialization of private fields) and out-of-process caching (e.g. memcached) turns out to be relevant.

We'll probably want to have more than one class here.

  • One that holds the "Parsoid config" and the "Wiki config".
  • One that holds the "Page config". Which would include things Parsoid probably currently has no direct concept of like ParserOptions::getCurrentRevisionCallback().

The former would be passed to the Parsoid service's constructor (e.g. in ServiceWiring.php), while the latter would be passed into the equivalent of Parser::parse(). The former is more likely to benefit from caching than the latter.

We could go as far as having three objects by separating the "Parsoid config" from the "Wiki config", but my guess at this point is that in MediaWiki the line between the two will quickly be blurred. as they'll both in the end come from the Config object.

Multiple objects are fine. I Currently, there is a WikiConfig and a ParsoidConfig and env is a wrapper class around them. And agreed, parsoid & wiki config are the primary targets for caching.

Chances are that details of the caching will be outside of the scope of the interface provided by Parsoid-PHP itself. Instead it would be up to the caller (e.g. MediaWiki) to implement the interface with whatever in-process (e.g. lazy initialization of private fields) and out-of-process caching (e.g. memcached) turns out to be relevant.

Great. That matches my expectations as well. And, by design, it insulates the parser code from those details.

Feel free to tweak the task description to reflect evolving details and for improved clarity.

Change 486821 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/services/parsoid@master] Config: Add PHP configuration interface

https://gerrit.wikimedia.org/r/486821

Change 490696 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/services/parsoid@master] Config: Add PHP page configuration interface

https://gerrit.wikimedia.org/r/490696

Change 486821 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Config: Add PHP site configuration interface

https://gerrit.wikimedia.org/r/486821

Is there anything else we want to do for this class (e.g. make an equivalent of JS's "env" object), or should we close this?

A wrapper 'env' class around the different configs (site / wiki, parsoid, page) might still be useful so we can pass one object everywhere where needed.

Change 490696 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Config: Add PHP page configuration interface

https://gerrit.wikimedia.org/r/490696

Change 494530 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/services/parsoid@master] Config: Add Env wrapper class

https://gerrit.wikimedia.org/r/494530

Change 494530 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Config: Add Env wrapper class

https://gerrit.wikimedia.org/r/494530

Ok, let's call this resolved now.