Page MenuHomePhabricator

Add setFunctionHook equivalent support to Parsoid Extension API
Open, MediumPublic

Description

It would be desirable to have a single implementation of the core parser functions (in includes/parser/CoreParserFunctions.php) that can be used by both the legacy parser and Parsoid. This shouldn't be *too* hard: most are simple text-in/text-out functions, so a wrapper which massaged the input arguments (converted from Text node or Tokens to a flat string), called the implementation function, and then converted the output from a string into a Text Node should probably be sufficient to provide for Parsoid compatibility.

The bigger issue is registration: moving from "call this hook every time you create a Parser object and it will register all the core parser functions one by one" to "maintain a global registry of registered parser functions" is T299528. We probably need to explicitly include a "skip this registration and look further in the registry for another" return value in the parser function API to allow migration of extensions which might (for some reason) deliberately choose to register a particular parser function *only* in a specific context -- in the new world order they would *always* register the parser function, but it would return the "skip this registration" value in all cases that are not the specific context in which it should be active.

Event Timeline

ssastry triaged this task as Medium priority.Nov 18 2020, 3:53 PM

Pretty sure I deprecated and removed setFunctionHook from the core parser...

EDIT: no, that was a different hook...

Recording my thoughts from a conversation @ssastry and I had a week or so ago:

  1. first off, setFunctionHook is a registration method called from the ParserFirstCallInit hook, which is not great: T299528: Deprecate and remove `ParserFirstCallInit` hook (move hook/tag registration out of Parser constructor). We want to move to a global declarative registration mechanism, probably in extension.json, not a per-object-constructed initialization which wastes a lot of cycles setting up the same parser functions over and over again.
  2. There are some weirdnesses with parser functions in particular I'd like to fix: T204371: Replace initial colon in (hash-prefixed) parser function invocation with vertical bar, T204307: Parser Functions should support named parameters
  3. Parser function registration already takes a "flags" argument, some of the difficult stuff (like exposure of the preprocessor frame) is only done when SFH_OBJECT_ARGS is passed. So:
    1. There's probably some low-hanging fruit w/r/t supporting the vast majority of these hooks which don't pass a flag and use the 'simpler' interface.
    2. We could in theory add a new SFH_FIX_ALL_THE_BUGS flag to opt-in to a new interface, ideally one which supports named arguments and vertical bar separators (see above) as well as a way to get at argument "raw text" without delving into the preprocessor frame and strip state (which seems to be mostly what SFH_OBJECT_ARGS is doing in practice).
  4. Worth noting that the *return type* of the parser function hook is somewhat complicated as well: you can *either* return wikitext, *or* you can return an array, and the array lets you specify the "type" of the returned value, and one of those types is effectively "raw html" (it's stuffed into the strip state in the legacy parser). This is very similar (but not quite identical) to how extensions are handled, and it may be worth using the existing parsoid interfaces for expanding extensions to handle parser functions as well, just setting some metadata to indicate that the {{#xxxx}} syntax should be matched not the <xxx> syntax. (Note that the opposite-direction bridge (from parser functions to extension) is already present: {{#tag:xxx}} is effectively a parser function which will expand the *extension* registered under xxx, so if/when parsoid natively implements {{#tag}} we're already going to have to convert between the parser function and the extension api. Might as well make those interfaces are close as possible from the start.)
  5. Many of the "core parser functions" don't do anything too fancy, they just expose stuff from ParserOutput (T287216).

Given all of the above, there's a couple of different ways to arrange the path from here to the promised land (assuming we all agree on the ultimate destination).

  1. One way is to try to support the existing Parser setFunctionHook call directly, at least in those cases where SFH_OBJECT_ARGS isn't used. I'd personally argue this is probably a waste of time, since our existing back-compat hook (where we give the string {{#legacy-parser-function}} to the legacy parser and ask it to expand it) works just fine for these extensions.
  2. If we prioritize T299528: Deprecate and remove `ParserFirstCallInit` hook (move hook/tag registration out of Parser constructor), then it makes sense to have the new registration mechanism provide some "new" semantics as well.
  3. Alternatively, we could keep registration in ParserFirstCallInit for the moment, and add a new argument like SFH_OBJECT_ARGS that opts in to new semantics.

Since I believe in the essential unity of parser functions, extension tags, magic words, and behavior switches (T204370, T90914, T204283, T114432), I think I'd argue for an implementation mechanism that would first extend our existing ParsoidModules configuration to allow "parsoid extensions" to register a hook on a parser function name, with as much commonality with Parsoid's ExtensionTag class as possible. Ideally there's be (at least) a common base class that ExtensionTagHandler and ParserFunctionHandler would implement, and the sourceToDom etc methods would be identical. (See T250530 in particular, since the "extension tag" managed to creep into the domToWikitext method in a way which is unfortunate.) This would add vertical bar support and named-argument support (T204371, T204307), and ideally the named-arguments would be handled by the Parsoid API almost identically to how attribute arguments are handled for extension tags.

This would be used for new extensions and in extensions we're rewriting to be parsoid compatible. A parallel path would try to bridge the implementation in core's CoreParserFunctions in such a way that a single implementation could be used by both Parsoid and the legacy parser. This might result in an alternative parser function API which is a little bit more like the existing setFunctionHook, at least for "simple" parser functions. Extensions might use this if they are "simple" and want the lowest-overhead transition path, but it should probably be discouraged for new code.

Change 761494 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Add parser function support to Parsoid

https://gerrit.wikimedia.org/r/761494

Some additional thoughts: it would be desirable to have a single implementation of the core parser functions (in includes/parser/CoreParserFunctions.php) that can be used by both the legacy parser and Parsoid. This shouldn't be *too* hard: most are simple text-in/text-out functions, so a wrapper which massaged the input arguments (converted from Text node or Tokens to a flat string), called the implementation function, and then converted the output from a string into a Text Node should probably be sufficient to provide for Parsoid compatibility.

The bigger issue is registration: moving from "call this hook every time you create a Parser object and it will register all the core parser functions one by one" to "maintain a global registry of registered parser functions" is T299528. We probably need to explicitly include a "skip this registration and look further in the registry for another" return value in the parser function API to allow migration of extensions which might (for some reason) deliberately choose to register a particular parser function *only* in a specific context -- in the new world order they would *always* register the parser function, but it would return the "skip this registration" value in all cases that are not the specific context in which it should be active.