Page MenuHomePhabricator

Pull out ParsoidOptions from ParserOptions
Open, Needs TriagePublic

Description

(BIKESHED WARNING: ParsoidOptions is very similar to ParserOptions and easily visually confused. Bikeshed a better name. T287184 proposes ParserContext or ContentContext for a similar class.)

ParserOptions contains *most* of the options affecting a wikitext parse. Unfortunately, it has some annoying issues:

  • ParserOption's dual function is "cache key for ParserCache" -- although not all of its fields are part of the cache key! So there are fields which "affect the output" and other fields which are expected to be effectively constant for a given wiki (like templateCallback or currentRevisionRecordCallback) and so aren't included in the cache key even though they affect the parse. It would be better to split the "site configuration" aspects from the "parser configuration" aspects.
  • Some important properties affecting the parse are not included in ParserOptions but instead are part of the Parser object or even passed as arguments to Parser::parse. Most obvious among these are Title, Revision, and User. (User is even stored twice! There's Parser::$mUser which is used unless it is null, in which case ParserOptions::getUser() is used!)
  • Many of the more obscure properties of ParserOptions are actually "pass through" data. The parser doesn't actually need to know these, instead they are used by specific parser function implementations or extension. For example, {{CURRENTDAY}} uses ParserOptions::getTimestamp(), {{PAGEID}} uses ParserOptions::getSpeculativePageId(), etc. These aren't part of the cache key *or* used by the parser itself, instead they are just "pass through" to a specific extension or parser function implementation.
  • As alluded above, currentRevisionRecordCallback and friends are a somewhat orthogonal interface which is about configuring how page titles are mapped into revision records at a site level. It has a little bit to do with Parser configuration, but is mostly out of place in ParserOptions.
  • Title, Revision, and User objects are very complicated and can't be effectively extracted (at this time) from the mediawiki codebase. Parsoid must access only primitive values associated with these objects, for example the title as a prefixed string.

Parsoid contains a somewhat orthogonal organization of parser-affecting state, which for historical reasons was mostly influenced by the way in which various bits of the state were exported via the action API. In parsoid we have SiteConfig (site properties, do not vary per-parse), PageConfig (title, user, and revision information; the most direct analog to the legacy ParserOptions), and DataConfig (maps titles and revisions to article text, most similar to the currentRevisionRecordCallback functions from ParserOptions).

It is probably futile to completely replace ParserOptions with PageConfig/SiteConfig/DataConfig. In particular, a lot of legacy code is depending on the pass-through nature of ParserOptions to get access to obscure wiki details, and the fact that ParserOptions is tightly coupled as the key to ParserCache complicates a potential replacement further.

This task proposes to expose a subset of ParserOptions to Parsoid using a similar approach to that suggested in T287216 for ParserOutput:

  1. Create a new ParsoidOptions interface in Parsoid. Initially it can be completely empty. We will add only those methods to ParsoidOptions which allow access to information directly used by Parsoid. Methods which are only pass-through for parser functions/extensions will remain in ParserOptions and not be copied to ParsoidOptions.
  2. Core's ParserOptions will implement Parsoid's ParsoidOptions. Parsoid will have access to a ParsoidOptions object and can pass it to hooks, to parser functions, etc. When running in integrated mode this will be a full ParserOptions object. When running standalone this will initially be a small wrapper around PageConfig and friends. (Core ParserOptions could actually *extend* ParsoidOptions instead of *implement* it; that is, ParsoidOptions could be an abstract class, not an interface, since ParserOptions does not currently have a superclass in core. But I think it's safer to make ParsoidOptions an interface for now.)
  3. Future work: move Title and Revision and User from Parser to ParserOptions. These wouldn't be directly exposed to ParsoidOptions, however. Instead we'd move Parsoid's PageConfig methods (which return primitive types, not Title/Revision/User objects) to ParsoidOptions. See below.
interface ParsoidOptions {
  // currently PageConfig::getTitle()
  function getTitlePrefixedText(): string;
  // currently PageConfig::getNs()
  function getTitleNamespace(): string;
  // currently PageConfig::getPageId()
  function getPageId(): int;
  function getRevisionId(): ?int;
}
class ParserOptions implements ParsoidOptions {
  // Parsoid doesn’t have access to a full Title object
  // That’s a core-only concept.
  function getTitle(): Title { return $this->title; }
  // These function aren’t needed by core, since it has
  // direct access to the Title object, but they
  // implements the ParsoidOptions interface to allow
  // Parsoid to get at the Title details.
  function getTitlePrefixedText(): string {
     return $this->title->getPrefixedText();
  }
  function getTitleNamespace(): string {
    return $this->title->getNamespace();
  }
  …
  // similarly, Parsoid doesn’t have access to a full
  // Revision object, only core does in ParserOptions
  function getRevision(): Revision {
    return $this->revision;
  }
  // But these functions from ParsoidOptions are what
  // Parsoid uses (currrently PageConfig::getRevisionId())
  function getRevisionId(): ?int {
   return $this->revision ? $this->revision->getId() : null;
  }
}

Event Timeline

I am in the bikeshedding mode, and in that vein, here goes:

Within Parsoid, the trio of objects (SiteConfig, PageConfig, DataAccess) currently capture most of the options necessary for converting wiktiext to html. Parsoid has tried to keep the content transformations user-independent but, in reality, the user object leaks in via the injected services and objects.

  • ConfigOptions is one possibility as a wrapper over the various config and other obejcts above.
  • In keeping with the suggestion in T287184, ContentContext could be a possibility.
  • ContentOptions is combo of the above two, but feels too generic and possibly confusing.
  • OutputOptions is probably not a bad option since it feels a bit more specific and could be considered options for how output is generated

There's another 'type' of options in core that we should think about - the $options array passed to ParserOutput::getText method. These are post-processing steps done directly in getText method, but nothing prevents us from incorporating these into the interfaces somehow. We'd just need to be able to 'define a post-parse transformation' that needs to be applied. This would probably be mostly core responsibility, since core will probably want to cache raw parsoid output and then apply transformations to the cached values.

This is the domain of the html2html transformations piece (see these preliminary slides -- sorry non-WMF folks, I'll work to make these accessible soonish by updating them and uploading them somewhere accessible). The question of where html2html transforms ought to sit and what is cached or not and whether they are extensions, services, or in core or in Parsoid is all up for discussion. Whether there ought to be a framework / libraries to support ease of spinning these up is also up for discussion.

But, ideally, most / all of these html2html bits will live outside core.

ConfigOptions feels even more generic than ContentOptions, so I'm not a fan. To me, "Options" evokes "set of key-value pairs", and, if there's anything more complex than that, I'd prefer Context instead. OutputContext?

There's another 'type' of options in core that we should think about - the $options array passed to ParserOutput::getText method. These are post-processing steps done directly in getText method, but nothing prevents us from incorporating these into the interfaces somehow. We'd just need to be able to 'define a post-parse transformation' that needs to be applied. This would probably be mostly core responsibility, since core will probably want to cache raw parsoid output and then apply transformations to the cached values.

This is a good point. I think we will probably eventually want to cache the results of some of the postprocessing transformations (see discussion in comments of T317070: MobileFormatter has quadratic performance), so eventually we'll need a "cache key for the post processing cache" object, which is roughly the same as the array passed as $options to ParserOutput::getText(). This interacts with the more general refactoring of ParserOutput::getText() described in T293512: ParserOutput::getText() should be removed from ParserOutput.