Page MenuHomePhabricator

Make Graph extension compatible with Parsoid
Open, MediumPublic

Description

Parsoid has its own extension API - see https://www.mediawiki.org/wiki/Parsoid/Extension_API.
In this first phase, we are targeting tag-hook extensions for migration.
The Graph extension needs an update to work directly with Parsoid.

Related Objects

Event Timeline

Arlolra triaged this task as Medium priority.Feb 25 2021, 5:24 PM
Arlolra moved this task from Backlog to Missing Functionality on the Parsoid board.

Change 736544 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Allow multiple definitions in jsconfigvars

https://gerrit.wikimedia.org/r/736544

Change 736544 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/services/parsoid@master] Allow multiple definitions in jsconfigvars

https://gerrit.wikimedia.org/r/736544

Assumption:

I am not sure how setting of jsconfigvars connects with the Graph extension since afaik this is never set directly by anything other than ParserOutput internally, and never more than once during a given parser invocation. I am guessing that the reason this has come up as an issue is likely related to Parsoid contacting the legacy Parser for some templates or extension tags, which then leads to the issue of having ostensibly finalised ParserOutput data for a page, except it was only for one template call or extension tag, and thus this now needs to be merged into the current parser run. Or maybe it's different today but modelled after this past state of being.

Concern:

I suspect htis model may be incompatible with expectations from MW core and more generally how extensions use this particular feature via core's Parser today.

Examples:

  • Config vars that hold primitives such as bool, int or string; which would have no obvious merge strategy.
  • Config vars that hold key-value pairs; which would similarly have no obvious merge strategy for conflicting keys.
  • Anything else that isn't a simple list of values or an associative array with unique IDs as keys; both of which are rare in my experience. The vast majority of config vars have no obvious merge strategy.

As I understand it today, once a page view has been formatted into an OutputPage or ApiParse response, there is no valid way to merge or combine these without taking on risks and responsibilties that are beyond the separation of concerns. That is, without some interface back to the core feature or extension that "owns" the jsconfigvar property, one simply cannot in a stable manner merge these. The necessary business logic is not known to Parsoid and there is no way it can do this correctly today.

I don't oppose such concept coming into existence, but I suspect that it would become very complicated very quickly, and I believe we might not actually need this concept, given existing interfaces that are already place that might be able to accomodate it.

I'll describe what I know, for your team to take into consideration, which I hope will help you make an informed decision:

  • Today, the Parser doesn't expose addJsConfigVars as a progressively built up value (unlike modules and modulestyles, which are naturally mergable lists of individually standalone values). As such, there is no agreed-upon interface to express some kind of combination logic. This is by design, because config vars are considered are having standalone and fully formed invididual free-form values. Even if we consider breaking changes, the feature would cease to be able to accomodate business requirements if it had a restriction such as only supporting lists or otherwise trivially merged values.
  • addJsConfigVars is first and foremost a frontend-facing OutputPage feature. While it was added to ParserOutput at some point (in 2014: change 108333), it had no use case described at the time, and I believe this was most likely a mistake. The method did not end up adopted around that time afaics, certainly not with the expectation that it could be used during the parser run, or be called multiple times with the same keu, or support MCR, or otherwise support combining data internally with that of other similarly shaped ParserOutput objects.
  • In deployed code (Codesearch), there are exactly 2 callers to the ParserOutput version of this method: Wikibase and Graph.

Wikibase caller:

In Wikibase's content handlers, a new ParserOutput is constructed from already parsed/saved content, with aggregated values set under various config keys in one go. If my original assumption above is correct, then this won't affect Parsoid.

As for why Wikibase uses this instead of OutputPage, I suspect it helps with state management but this could be trivially overcome by storing the information via setExtensionData instead and then leaving it for an OutputPage hook to carry over to a JS config key. This is what Graph and most other extensions do, in part this method didn't exist prior to 2014 .

Graph caller:
Graph uses an onParserAfterParse hook, to take its progesively built information under setExtensionData, and turn it into an aggregate value for JS code to consume. This would similarly be trivial to do via OutputPage instead from an onBeforePageDisplay hook.

Graph is an odd extension, in that it was rushed and developed with little to no input or review from other MW developers. It might be worth it to change it a little bit to avoid this problem. I don't think it's indicate of a genuine need or use case, and may not be worth extendind support for in MW core or Parsoid.

Conclusion:

I think the high-level mismatch is that it's hard to go from a seemingly final parser output served from ApiParse, back into internal state that can be merged into another on-going parse. This because things like onParserAfterParse would then be called multiple times from inner to outer invocation.

I suspect that with Parsoid in PHP, this isn't actually the case anymore for templates and parser extensions. In which case, my original assumption would be wrong, and we wouldn't have a similar (or larger) problem with features like setExtensionData which are similarly free-form in nature and are expected to be owned entirely by individual extensions, and allow them to add/integrate state during a parser invocation as they go.

Since I set Isabelle onto this, I should clarify where I was coming from.

Context: Right now, Parsoid creates a new ParserOuptut object per template/extension expansion which requires metadata to be merged. But, while that sounds like a hack, there are reasons why that may not be a bad thing as below. But, for performance reasons, we may still change this model at some point, but, semantically, it may still be useful to think of them as independent.

If an extension tag is used N times on a page, we should be able to process each use independently and not rely on global state. And, that scenario requires that config vars should not conflict. If they conflict, then, the extension is effectively creating a page-level global state structure that it should manage on its own and that also defeats a number of desirable properties about processing it and composing a page. Even without this Parsoid push for independent parsing of extension uses, extensions that combine JS config state in a way that conflict should be rare and should only be an optimization. So, that indicates to me that combining it is safe since conflicts shouldn't occur. And, where config combining happens in a manner that can conflict, the extension can manage it internally and rely on page-level hooks to set that.

But, I think you are arguing that we shouldn't even enable this possibility and that JS config vars shouldn't even be part of ParserOutput since allowing this kind of combination introduces space for subtle errors in the future if some extension abuses this and we don't catch that in code review. One option is to fail fast in the combining method where combining isn't possible so extension uses like that would fail right away. That said, we could go down the route of eliminating this from ParserOutput altogether and have it be tackled by OutputPage via the onBeforePageDisplay hook.

That said, I still prefer solutions that minimize scenarios where an extension needs to know about other uses of that tag being used on a page. That opens the door for content composition, incremental updates on edits, etc. without needing full page reparses. My take is that extensions might be maintaining global state because page-level thinking is what the legacy parser encourages. With Parsoid, we want extension authors to think of individual extension-use instances as decoupled and independent and produce output for it without needing to know anything about other uses and relying on Parsoid to combine that output for the page.

And, so, an extension use should generate output + metadata for that use and have the content engine deal with combining those for full rendering. The current solution in Parsoid takes that approach. But, as above, we could make it fail-fast and enforce the no-conflict expectation rather than do something 'common-sense' (as I said loosely) that has the potential for future edge cases.

I think I'd vote for/endorse the solution suggested by @Krinkle: have the extension use ParserOutput::setExtensionData (also now ContentMetadataCollector::setExtensionData) and use an OutputPage hook (I'll leave it to @Krinkle to suggest which one precisely) to propagate these values to OutputPage::addJsConfigVars. That avoids requiring us to define semantics "in general" for appending/accumulating vaues in "jsConfigVars" and makes it clear that the extension manages its "extension data" and can define whatever accumulation semantics they want, but that "jsconfigvars" is expected to be "set once and forget".

That also would allow me to simplify ContentMetadataCollector (the abstract Parsoid view of ParserOutput and omit the addJsConfigVars method -- if only Wikibase and Graph are using it, and they can use OutputPage hooks and extensiondata instead, then the abstract interface gets that much simpler. So that would be a win.

@ssastry I agree those benefits are desirable. For now, I remain somewhat skeptical on how feasible that is. Or more specifically, I think it is very likely to require some kind of additional API for controlling exactly how information is combined since a generic approach is unlikely to be scalable for WMF, and thus that kind of combining is likely to break deployed extensions unless such API is first offered and widely adopted where needed.

I'm glad we can keep jsconfigvars out of this, I think that's a significant win for simplicity and intuition, and also works nicely to encourage the "right" behaviour from a frontend perspective to keep the concerns separated at the right level. It basically means that for things that need frontend coordination, an extension will have to hook into a later stage where these concerns are more obvious and at front of mind, and can be naturally dealt with there.

Having said that, I do note that an almost identical challenge might still be unsolved for getExtensionData/setExtensionData. Perhaps it is solved already, but I was not able to find it. In order for an extension to reliably transfer/aggregate data from ParserOutput to the frontend, it needs to know where it is stored. As such, it seems natural that a given extension will store a certain kind of data always under a deterministic top key. E.g. every <foo name="x">...</foo> would lead to something like setExtensionData( 'foo', [ 'x' => ... ] ), which in practice would take the form of getExtensionData ?: [] first, followed by ['x'] = .., and then a set. If Parsoid gives each of those its own ParserOutput, then we do still have to decide how and whether to combine that, which seems to be essentially an identical challenge, where a generic approach might work sometimes and fail in other cases. I agree that treating anything non-obvious as an error seems like a good way to surface those problems, to then think about some more on a case-by-case basis (where the only obviously-safe case is when both hold an array with non-overlapping keys).

I hinted at it in my previous comment, but I'll say it again. I think it's quite likely that we can address these all in non-overlapping ways. The problem however is in the aggregation logic. In my experience, extensions tend to use a hook like ParserAfterTidy to round up their extension data and set anything derived from it (e.g. language links, categories, page props, and perhaps jsconfigvars?). For example, to signal that one of the tags is missing or unused, or to signal that the page is using too many tags. That works fine if it's run once, after combining the ephemeral ParserOutput objects. But it seems right now that Parsoid is (indirectly?) causing this to be run on every individual fragment as well, due to it calling out to the legacy Parser. If we could avoid that somehow, I think we'd likely solve a large chunk of long tail issues.

In general we would like extensions *not* to attempt to combine data before the final OutputPage hook, since incremental and wikifunctions-style asynchronous parsing will likely result in different parts of the page being rerendered, or rendered out of "order". So the ContentMetadataCollector interfaces are deliberately "write-only". I think it's fine to define simple "accumulate as list" semantics for them, but there shouldn't be any other processing done until the 'post-processing' step, where all of the "extension data" will be made available to the hook which can then union them in some fashion and call OutputPage:: methods as necessary.

So, yes, we have the same issue more-or-less with setExtensionData and I should probably (re)define that method to have to alternate semantics:

  1. setExtensionData -- throws an exception if the key used was previously set.
  2. appendExtensionData -- accumulates the data provided in a list

There is no getExtensionData method provided outside core (ie, its @internal) but that unnamed OutputPage hook can/should provide a getter. If setExtensionData was used the data will be returned as provided, but if appendExtensionData was called the return value will always be a list.

EDIT: Renamed addExtensionData to appendExtensionData for consistency with js config vars.

I hinted at it in my previous comment, but I'll say it again. I think it's quite likely that we can address these all in non-overlapping ways. The problem however is in the aggregation logic. In my experience, extensions tend to use a hook like ParserAfterTidy to round up their extension data and set anything derived from it (e.g. language links, categories, page props, and perhaps jsconfigvars?). For example, to signal that one of the tags is missing or unused, or to signal that the page is using too many tags. That works fine if it's run once, after combining the ephemeral ParserOutput objects. But it seems right now that Parsoid is (indirectly?) causing this to be run on every individual fragment as well, due to it calling out to the legacy Parser. If we could avoid that somehow, I think we'd likely solve a large chunk of long tail issues.

Ya, makes sense reg. the long-tail issues. I think we can fix that by introducing an @internal interface for Parsoid that doesn't call the global hooks for every extension use. For now, for ease of migration, we could support the global hook by calling it on the finalized document. But, we eventually do want to deprecate the remove existing hooks in favour of Parsoid's post-processors.

[…] I should probably (re)define that method to have to alternate semantics:

  1. setExtensionData -- throws an exception if the key used was previously set.
  2. addExtensionData -- accumulates the data provided in a list

I like the idea of using "set" and "add" for this. I would suggest that the "add" variant allow both list-style array addition and map-like array addition (the latter we'd similarly make throw if there's a conflicting key already).

I'm not sure that all post-processing can be done from OutputPage, however. It depends on whether some of these adjustments are relevant to services that use content without a skin. Eg. when parser output is served from ApiParse/Parsoid, we'd probably want some of these things to be similarly finalised and usable without requiring downstream consumers to know intimitate details of individual extensions.

However, we could probably standardise some kind of post-processing step that we consistently keep fast and lightweight, out of the cached ParserOutput data, and apply both for OutputPage and other external-facing APIs whenever whole page are served. This remains similarly incompatible with any partial/internal loopback use of ApiParse, so that would still need to be avoided somehow or be flagged in some way as wanting the raw/unfinalised version.

Another piece of this puzzle is RevisionRenderer. Since MediaWiki supports MCR (multi-content revisions), there is now also a place within MediaWiki core that combines multiple ParserOutput objects. There is code in RevisionRenderer where we also see extension data and config vars etc merged in a fragile way that is (afaik) improvised and unofficial/undocumented with the hopes that they simply never conflict. The upside is that this code is true in practice (for now), because wikis that use MCR don't (afaik) have any meaningful overlap in their metadata. E.g. we don't have extensions that set metadata when processing wikitext content and also set similar metadata when processing Wikibase content (e.g. Commons file pages have both types of content slots and are merged at display time).

Two rough ideas come to mind:

  1. Simple. A new hook that is run exactly once on a singe ParserOutput object to "finalize" it.

This hook would be both independent of OutputPage/Skin, and independent of a Parser run. We'd make sure this is consistently run (exactly once) whenever we retreive a ParserOutput prior to serving or using it (for some definition or "using"). This would be the logical successor to ParserAfterTidy and other random Parser-hooks (ab)used today for this purpose. I'm not sure if it's best to run this new hook behind ParserCache (like we do today) or in front of it at runtime. Running it behind would continue to tolerate expensive operations and is least-disruptive compared to today (e.g. it would continue to run during an edit request with all the relevant context and state for the common case of parsing during an edit). Running it in front has the benefit of being able to skip it and expose the same ParserCache entity to then merge with another prior to finalization. It also has the benefit of enforcing that it is statless/atomic and can't rely any global state from during the edit.

The downside of this simple approach is that for RevisionRenderer to work, we'd have to signal all the way down the stack for it to get an unfinalized ParserOutput object, then do a strict Parsoid-like combining of the two, and then run the hook. This is hard because right now the various components in-between ParserOutput and RevisionRenderer can treat their content is ready for use standalone as well, which we'd have to abandon.

  1. Delegate merging. A new hook that can accept multiple ParserOutput objects and merges later ones into the former and finalize them in the process.

This would embrace the way RevisionRenderer works today, e.g. start with an empty ParserOutput object and "merge" the main slot into it. Then also merge any others into it. This means that if an extension wants to finalize a piece of data into a config var, they have to also account for the same code having done this for another piece of data already. This is more generic and scalable, but requires more effort from extension developers. It also takes away the ability to be "final" about anything. This means if an extension Foo wants to set flag "foo-X" in the config if there is no "foo-Y" data, it would need to check for and remove any existing foo-X flag whenever it does a merge and sees non-empty data. This is probably fine in easy cases where it's not ambigious. There could be ambigious cases, e.g. something like {{DISPLAYTITLE:}} or other extension-provided mechanisms that "set" a single piece of data. It wouldn't surprise me if wikis currently rely on the last one winning for these, as a way of overriding something that a template does as side-effect perhaps. In that case, MCR/RevisionRenderer would be fine since it has a reliable merge order. Some objects/fragments can be cached, some re-parsed as needed, but always merged in the same order. I don't know if such guruantee would be workable for Parsoid.

I think I'm going with a sort of hybrid of your approach #1 and #2 -- that is, I'm defining set and append APIs for extension data and js config vars (and, as much as possible, every other parser output flag or key) that ensures that they have reasonable semantics for merge that can be implemented in core w/o delegation.

But we will certainly have a 'after parse/finalize' hook available to take the merged data and munge it one last time. For citations and graphs, for example, there's some global numbering that wants to happen. In Cite's case, that can even affect the generated HTML. So I think it's entirely consistent to (say) accumulate data using a primitive flag/set/map in extension data, and then have one final pass that can take the merged extension data and do some sort of summary operation on it.

For example, if you want to maintain "minimum requested cache lifetime", you maintain it as a set of requests, then at 'finalize' time you take the min of the set. If you want to emit 'number of graphs on this page' to JS, you maintain a set of unique graph ids, then compute the cardinality of the set at 'finalize' time.

This is compatible with asynchronous or incremental parsing, since if a portion of the page is updated we just re-run the 'finalize' step as needed.

EDIT 2022-05: we already have an "at the end of everything, really" hook in core, it's OutputPageParserOutput. Unfortunately, a lot of code (incorrectly) uses ParserAfterParse as if it were "after everything", when in fact there are a number of situations with the legacy parser (and even more with Parsoid, and even more more with the future async work) where there are multiple "parses" on a page, so "after parse" ends up being called multiple times, with all but the last not really "at the end of everything".

Change 736544 abandoned by Isabelle Hurbain-Palatin:

[mediawiki/services/parsoid@master] Allow multiple definitions in jsconfigvars

Reason:

Superseded by 0f5dc718cee864800249a9ca29dda494e4d8d613 and e075250ecdc92598547a9754a4dac50879a81d02.

https://gerrit.wikimedia.org/r/736544

Dump from chat transcript outlining the issues:

The problematic code is in Graph:includes/ParserTag.php

In particular the stuff around "live mode" -- I don't know exactly what live mode is! Hopefully you can figure out what the UX side of this is, and whether it is widely used?

But the pattern which is problematic is in ParserTag::finalizeParserOutput():

$liveSpecs = $parserOutput->getExtensionData( 'graph_live_specs' );

[...]
if ( !$liveSpecs ) {
	// Not in live mode
	$parserOutput->addModules( [ 'ext.graph.loader', 'ext.graph.vega2' ] );
	return;
}
[...]

$parserOutput->setJsConfigVar( 'wgGraphSpecs', $liveSpecs );

so if $liveSpecs is false, we never call setJsConfigVar so that's fine. But if $liveSpecs is true, we setJsConfigVar overwriting any previous graph data (since despite its name, ParserTag::finalizeParserOutput can be called more than once).

ParserTag::finalizeParserOutput is called once per fragment, so in the legacy parser this only crops up if you do certain types of transclusions, and more than one of them has a <graph>. In Parsoid (and in the future async work) each "fallback to the legacy parser" is a separate fragment, and so any page with more than one <graph> tag on it (not just cases involving transclusions) will have the data from the last one overwrite all the others.

Two possible solutions:

  1. Instead of calling this function from the ParserAfterParse hook (again, because there can be more than one "parse" per page), transition the code which sets the js config vars to the OutputPageParserOutput hook (see T300979 for a tiny bit of context here). OutputPageParserOutput really is run after all parsing is done on the page.
  2. Keep calling finalizeParserOutput from the ParserAfterParse hook but use the ParserOutput::appendJsConfigVar method to append to (rather than overwrite, as setJsConfigVar does) the previous js config vars (see T300307 for a little context).

In option (1) you'll probably discover that you still have the same fundamental problem, just shifted from js config vars to the ParserOutput::setExtensionData() call, since Graph is still overwriting ParserOutput's extension data on every invocation. But at least the js config vars are fine. Rinse, repeat, and try to figure out what's being written to the extension data so you can use ParserOutput::appendExtensionData() instead of ::setExtensionData(), as in option 2. (But see below.)

Option (2) requires a deeper understanding of what exactly is being stored under the graph_live_specs key in the js config vars, in order to tease out the various uses. Good news is that I suspect graph_live_specs is pretty much exactly the same structure as graph_specs which is what you'd need to fix in the second part of option (1). (Yay?)

tl;dr: I think the bottom-line issue is that Graph is using a single extension data key to collect data for all graph uses on the page, and then similarly exporting these as a single js config vars key. Probably what you want to do fundamentally is have a separate write-once key for every instance of the parser tag on the page, and then use a single append-only tag to index them.

For example, assuming that $id holds a unique identifier for each graph, instead of:

$graph_specs = $parserOutput->getExtensionData('graph_specs');
$graph_specs[$id] = ...new graph data...;
$parserOutput->setExtensionData('graph_specs');

(which I suspect is what's effectively happening under the covers in many places in Graph), do something like:

$parserOutput->appendExtensionData('graph_specs_index', $id);
$parserOutput->setExtensionData("graph_specs[$id]", ...new graph data...);

which has no objectionable read-modify-write conflict -- and now graph tags could be processed in parallel or asynchronously (yay brave new world).

Note that you are allowed to multiply-write an extension data/js config var if the value is always the same, so patterns like this one from ParserTag::formatError():

$parserOutput->setExtensionData('graph_specs_broken', true);

are perfectly safe. Effectively you're doing an "or" on that key for all graph instances, either writing nothing or writing true, and then checking for the presence of the key on the back end.

But stuff like this is a problem (from ParserTag::buildHtml):

		$specs = $this->parserOutput->getExtensionData( 'graph_specs' ) ?: [];
		// Make sure that multiple json blobs that only differ in spacing hash the same
		$hash = sha1( FormatJson::encode( $data, false, FormatJson::ALL_OK ) );
		$specs[$hash] = $data;
		$this->parserOutput->setExtensionData( 'graph_specs', $specs );

You could either do:

$this->parserOutput->setExtensionData("graph_specs[$id].hash", $hash);

(ie a new key for the hash, instead of read-modify-writing the old one) or else just move the computation of the hash into the same place that computed graph_specs in the first place.

(Note also that $hash and $id are both doing the same sort of thing, and you'll probably have to sort this out. I suspect the $id being used in most places is a sequential numbering of graph instances from top to bottom of the page. That's not great, because it introduces ordering constraints between fragments, and it means I need to re-render the second graph on the page if anything changes to the first graph on the page, just in case the change affects the numering. Using a UUID is better, and then do the global numbering in the OtutputPageParserOutput hook which is at the *real* end of all fragment parsing. Or avoid global numbering at all if you can.)

I suspect the best approach is option (1) above -- aka, first move the jsconfigvars stuff to a OutputPageParserOutput hook, which should be relatively straightforward. Then you can focus on fixing the read-modify-write patterns in extension data, instead of trying to juggle both extension data and js config vars at the same time.

Change 797104 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/core@master] Move modules/styles/jsconfigvars after onOutputPageParserOutput

https://gerrit.wikimedia.org/r/797104

Change 797137 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Graph@master] Fix data collection of multiple graphs in Parsoid

https://gerrit.wikimedia.org/r/797137

Change 797137 abandoned by Isabelle Hurbain-Palatin:

[mediawiki/extensions/Graph@master] Fix data collection of multiple graphs in Parsoid

Reason:

made a branch mess, re-submitting soon

https://gerrit.wikimedia.org/r/797137

Change 797139 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Graph@master] Fix data collection of multiple graphs in Parsoid

https://gerrit.wikimedia.org/r/797139

Change 797104 abandoned by Isabelle Hurbain-Palatin:

[mediawiki/core@master] Move modules/styles/jsconfigvars after onOutputPageParserOutput

Reason:

Talked with Scott - he suggested to add these directly in Graph and to not rely on this hook, which I did. So this patch is not needed anymore.

https://gerrit.wikimedia.org/r/797104

Change 797139 merged by jenkins-bot:

[mediawiki/extensions/Graph@master] Fix data collection of multiple graphs in Parsoid

https://gerrit.wikimedia.org/r/797139

Change 810930 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/extensions/Graph@master] Fix data collection of multiple graphs in Parsoid

https://gerrit.wikimedia.org/r/810930

Change 810930 merged by jenkins-bot:

[mediawiki/extensions/Graph@master] Fix data collection of multiple graphs in Parsoid, take 2

https://gerrit.wikimedia.org/r/810930

ihurbain subscribed.