
LinksUpdate hook review
Closed, Resolved · Public

Description

Review

There are four LinksUpdate hooks, all of which pass a LinksUpdate object: a kitchen-sink object with a lot of public properties and methods. I reviewed the extension handlers of these hooks. The aim is to make it easier for extensions to do common things, and to narrow the interfaces so as to make it easier to improve LinksUpdate.

Some extensions are only using LinksUpdate hooks as a convenient place to do an unrelated cache purge. GlobalUserPage, PageTriage, UploadWizard (and fork MediaUploader), PhpTagsWiki, SharedHelpPages and MarkImages only need the title and possibly the current revision. They could use the RevisionDataUpdates hook instead, available since 1.35. That hook is not called on delete, so some of them would have to also hook PageDeletionDataUpdates.
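As a rough illustration of that migration (handler shape follows the hook interfaces; the purge helper is hypothetical, not taken from any of these extensions), a cache purge could be queued from RevisionDataUpdates like this:

```php
// Hypothetical handler sketch: queue an unrelated cache purge from the
// RevisionDataUpdates hook (available since 1.35) instead of a LinksUpdate
// hook. purgeCacheFor() is an illustrative helper, not a real method.
public function onRevisionDataUpdates( $title, $renderedRevision, &$updates ) {
	$updates[] = new MWCallableUpdate( function () use ( $title ) {
		$this->purgeCacheFor( $title );
	} );
}
```

A matching PageDeletionDataUpdates handler would be needed to keep the purge-on-delete behaviour, since RevisionDataUpdates is not called on delete.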

Some extensions are adding a link-like table, usually with their own implementation of incremental updates, specifically Babel, GeoData, GlobalUsage and PageAssessments. They all have some quirk which makes them different to core link tables, but they could probably still benefit from the link table abstraction that I'm developing.

Some extensions use the incremental data from LinksUpdate as input to their own data updates, specifically CirrusSearch, Disambiguator, EventBus and Echo. They could receive a narrower, read-only view into the generated data. They all use LinksUpdateComplete except for Echo, which uses LinksUpdateAfterInsert.

LinksUpdateAfterInsert is only used by the Echo extension. It needs a list of added links, but it could apparently use getAddedLinks() in LinksUpdateComplete instead, like CirrusSearch and EventBus.
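A minimal sketch of what that migration might look like (the handler body is illustrative, not Echo's actual logic):

```php
// Hypothetical sketch: reading added links from LinksUpdateComplete via
// getAddedLinks(), as CirrusSearch and EventBus do, instead of relying on
// LinksUpdateAfterInsert.
public function onLinksUpdateComplete( $linksUpdate, $ticket ) {
	foreach ( $linksUpdate->getAddedLinks() as $title ) {
		// Generate link notifications for each newly added link
		// (illustrative placeholder for Echo's processing).
	}
}
```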

LinksUpdateConstructed is only used by SemanticMediaWiki. It updates secondary data based on the ParserOutput. It could use the RevisionDataUpdates hook instead, except that it uses the recursive flag, which is not passed down to the hook despite being an argument to DerivedPageDataUpdater::getSecondaryDataUpdates(). It could apparently also use LinksUpdate or LinksUpdateComplete. It is unclear to me why it is using this unusual hook.

PageImages uses the LinksUpdate hook to modify the page properties before save. It loads the revision text from the database, parses the lead section, extracts extension data from the resulting ParserOutput, and then saves data derived from this operation back to $linksUpdate->mProperties as if it came from the main page parse. This seems inelegant. I think this could use a late parser hook instead, like ParserAfterTidy. It could discover the section of an image by looking for the image tags in the HTML and matching them up by name with previously stored candidates. In Parsoid this would be a DOM pass.

Translate is similarly using the LinksUpdate hook for last-minute modification of the data extracted from ParserOutput. It wipes $linksUpdate->mCategories depending on the page title. I think this could also use a late parser hook. There is ParserOutput::setCategories().
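A sketch of that alternative, assuming ParserAfterTidy as the late hook (the title predicate is an illustrative stand-in for Translate's own check, not its real code):

```php
// Hypothetical sketch: suppress categories from a late parser hook rather
// than wiping $linksUpdate->mCategories after the fact.
public function onParserAfterTidy( $parser, &$text ) {
	if ( $this->isTranslationPage( $parser->getTitle() ) ) { // hypothetical predicate
		// Replace the extracted categories with an empty set before
		// they ever reach LinksUpdate.
		$parser->getOutput()->setCategories( [] );
	}
}
```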

Proposal

  • Deprecate LinksUpdateAfterInsert, instead use LinksUpdateComplete
  • Deprecate LinksUpdateConstructed, instead use LinksUpdate, LinksUpdateComplete or RevisionDataUpdates
  • For unrelated secondary data updates, use RevisionDataUpdates and PageDeletionDataUpdates
  • Migrate PageImages and Translate to use a late parser hook instead.
  • Make all LinksUpdate properties be private or protected.
  • Wrap up incremental table changes derived from LinksUpdate into a narrower object and provide it to a new hook, for CirrusSearch etc.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 743061 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add a ParserModifyImageHTML hook for PageImages

https://gerrit.wikimedia.org/r/743061

Change 743062 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/PageImages@master] Identify lead images using a new parser hook instead of during LinksUpdate

https://gerrit.wikimedia.org/r/743062

kostajh subscribed.

Moving to triaged; Growth-Team was tagged because of Notifications, please let us know if you need anything from us.

@kostajh I am just asking that Growth-Team consider migrating Echo from LinksUpdateAfterInsert to LinksUpdateComplete, as a small tech debt project. Then I can deprecate LinksUpdateAfterInsert.

Change 743283 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/Translate@master] Use Parser hooks instead of LinksUpdate to suppress categories

https://gerrit.wikimedia.org/r/743283

@kostajh I am just asking that Growth-Team consider migrating Echo from LinksUpdateAfterInsert to LinksUpdateComplete, as a small tech debt project. Then I can deprecate LinksUpdateAfterInsert.

Got it, filed as T297011: Migrate LinksUpdateAfterInsert to LinksUpdateComplete. We'll discuss when we can do it.

Change 744903 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/EventBus@master] Use the new LinksUpdate::getPageId()

https://gerrit.wikimedia.org/r/744903

Change 744908 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/Wikibase@master] Use the new LinksUpdate::getPageId()

https://gerrit.wikimedia.org/r/744908

Change 744910 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/GeoData@master] Use the new LinksUpdate::getPageId()

https://gerrit.wikimedia.org/r/744910

Change 744912 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/Echo@master] Avoid LinksUpdate public properties

https://gerrit.wikimedia.org/r/744912

Change 744912 merged by jenkins-bot:

[mediawiki/extensions/Echo@master] Avoid LinksUpdate public properties

https://gerrit.wikimedia.org/r/744912

Change 744903 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Use the new LinksUpdate::getPageId()

https://gerrit.wikimedia.org/r/744903

Change 744760 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] LinksUpdate deprecations

https://gerrit.wikimedia.org/r/744760

Change 744908 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use the new LinksUpdate::getPageId()

https://gerrit.wikimedia.org/r/744908

Change 743283 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Use Parser hooks instead of LinksUpdate to suppress categories

https://gerrit.wikimedia.org/r/743283

Change 743062 had a related patch set uploaded (author: Tim Starling):

[mediawiki/extensions/PageImages@master] Identify lead images using a new parser hook instead of during LinksUpdate

https://gerrit.wikimedia.org/r/743062

@Jdlrobson Per Maintainers this extension is stewarded by Reading Web, though it is described simultaneously as "active" and as having no SLA for non-security issues. As such, I'm offering to help review and to take on, with Tim, the responsibility of verifying and following up as needed while seeing the change through to production. However, it's important to me that we don't make changes without steward awareness: the code should remain familiar to your team and stay consistent/aligned with your other practices.

I've described how the code worked and how it would work after Tim's changes in this comment. Let me know whether that direction sounds good; if not, let me know whether giving us some direction or involvement could be prioritised in the short term. Otherwise, we may have to delay this until next quarter.

Change 743061 merged by jenkins-bot:

[mediawiki/core@master] Add a ParserModifyImageHTML hook for PageImages

https://gerrit.wikimedia.org/r/743061

@Krinkle sounds good. Thanks for the heads up. See also T252249. The creator is no longer at WMF.

Change 744910 merged by jenkins-bot:

[mediawiki/extensions/GeoData@master] Use the new LinksUpdate::getPageId()

https://gerrit.wikimedia.org/r/744910

Change 743062 merged by jenkins-bot:

[mediawiki/extensions/PageImages@master] Identify lead images using a new parser hook instead of during LinksUpdate

https://gerrit.wikimedia.org/r/743062

As I commented on T299149#7620494, I'm pretty unhappy we added a ParserModifyImageHTML hook here:

  1. Arbitrary modification of HTML will certainly cause havoc with VisualEditor. I don't think we should be adding new hooks which don't play well with the near-future use of Parsoid for read view HTML. Mutating the HTML to add a comment is a change which is certainly visible to Visual Editor as well as Parsoid during html2wt, as comments are DOM nodes and not stripped in the Parsoid world.
  2. To the extent that modification of HTML is reasonable, the API should be tightly scoped so that the modifications are limited to those which are necessary and can be efficiently identified and ignored by Parsoid/Visual Editor during round tripping. There's an "extra link classes" hook in the core Parser, for example, which allows the addition of additional "class" attributes to generated links. That's harmless & acceptable since it is tightly scoped to the use case and there's no native wikitext syntax for adding class attributes to links, so it's clear that any attributes on <a> tags can always be ignored by Parsoid and VisualEditor. The situation is not quite as clear for images, since specific classes assigned to images *are* controlled by wikitext, but something could probably be worked out to allow "extra" image classes to be distinguished if the hook were written that way. Adding additional data- attributes to the root tag is another option that could be tightly scoped, with appropriate namespacing rules to ensure that different users of the hook don't step on each other.
  3. Ignoring all of the above, we should 100% *not* be adding new hooks which take string $html as an argument. Parsoid manipulates the DOM natively. To invoke that hook Parsoid would have to serialize and then reparse the resulting HTML at every hook invocation. Far better, in 2022, to write a method signature which takes DOMNode $node and force the legacy parser to parse and then reserialize. That way performance will *increase* when we shift to Parsoid and we're not adding additional technical debt to the platform.
  4. In the summary for this task it was mentioned that "In Parsoid this would be a DOM pass." -- the idea there was a pass which just examined and summarized the HTML (adding page properties) and not a mutation pass. DOM passes which mutate the HTML are a whole 'nuther thing (and @subbu and I are still debating what limits to place on them). But it seems like mutating the HTML isn't actually necessary here -- all that's needed is some way to identify certain images and communicate that information to a post processing pass which would record it in a page property. Can't we do that *without* adding a hook which allows arbitrary modification of the HTML?
  5. Finally, @arlola's recent work has mostly removed differences in HTML structure between the legacy parser and parsoid. If the "post-Parsoid" goal was to walk the DOM tree looking for images, why not do the same from the start? We already have a very efficient parser in Remex which parses the entire tree during the tidy phase. It doesn't seem too far-fetched to implement this from the start as a tree walker. (And if that's not actually a viable implementation strategy, then my objections to the hook only grow: we shouldn't be adding new features that we know aren't viable post-Parsoid. Whatever the solution here is, it's going to have to be rewritten to work with Parsoid within the next year, by a team which is already trying to boil an ocean of their own.)

I apologize for coming in so late with objections here. I know it's disruptive to come in after the Reading team's feature is "done" and ask for a review of the basic architecture. But somehow this task crept from a "summarize information about links" to a "modify the generated HTML (in string form)" implementation -- it probably should have triggered review from the parsing content transform team when that line was crossed, but I certainly understand how that line was hard to see in the process. I don't want it to seem like I'm throwing stones or casting blame. Again, I do apologize for the disruption.

Change 744760 merged by jenkins-bot:

[mediawiki/core@master] LinksUpdate deprecations

https://gerrit.wikimedia.org/r/744760

As I commented on T299149#7620494, I'm pretty unhappy we added a ParserModifyImageHTML hook here:

  1. Arbitrary modification of HTML will certainly cause havoc with VisualEditor. I don't think we should be adding new hooks which don't play well with the near-future use of Parsoid for read view HTML. Mutating the HTML to add a comment is a change which is certainly visible to Visual Editor as well as Parsoid during html2wt, as comments are DOM nodes and not stripped in the Parsoid world.

Plenty of hooks already allow arbitrary modification of HTML. It seems like you want to block all maintenance of the old parser until it becomes more like Parsoid, but that's not going to happen. It is what it is.

  3. Ignoring all of the above, we should 100% *not* be adding new hooks which take string $html as an argument. Parsoid manipulates the DOM natively. To invoke that hook Parsoid would have to serialize and then reparse the resulting HTML at every hook invocation. Far better, in 2022, to write a method signature which takes DOMNode $node and force the legacy parser to parse and then reserialize. That way performance will *increase* when we shift to Parsoid and we're not adding additional technical debt to the platform.

As I said at gerrit 743061, Parsoid should not call this hook, or ParserAfterTidy. The point of the ParserModifyImageHTML hook is to provide structured access to the width, height and name of an image, but in Parsoid you can get that from DOM attributes. In general, I don't think it's possible to provide extension interfaces that would be efficient in both Parsoid and the old parser. I thought you had already come to the same conclusion, considering the state of Parsoid's src/Ext directory.

I'm not going to sabotage the performance of the old parser just so that performance can increase when we switch to Parsoid.

  4. In the summary for this task it was mentioned that "In Parsoid this would be a DOM pass." -- the idea there was a pass which just examined and summarized the HTML (adding page properties) and not a mutation pass. DOM passes which mutate the HTML are a whole 'nuther thing (and @subbu and I are still debating what limits to place on them). But it seems like mutating the HTML isn't actually necessary here -- all that's needed is some way to identify certain images and communicate that information to a post processing pass which would record it in a page property. Can't we do that *without* adding a hook which allows arbitrary modification of the HTML?

I don't see what problem that would solve. The whole PageImages thing was a couple of hours' work, necessitated by the fact that it blocked the LinksUpdate refactor. It's easier to just write it twice than to introduce a new narrow interface called by both parsers.

  5. Finally, @arlola's recent work has mostly removed differences in HTML structure between the legacy parser and parsoid. If the "post-Parsoid" goal was to walk the DOM tree looking for images, why not do the same from the start? We already have a very efficient parser in Remex which parses the entire tree during the tidy phase. It doesn't seem too far-fetched to implement this from the start as a tree walker.

Remex can definitely not find images as fast as a single preg_replace_callback() on the whole HTML.
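To illustrate the comparison being made, here is a minimal, self-contained sketch (not the actual PageImages code, and a much simpler regex than it would need) of harvesting image names from rendered HTML in a single preg_replace_callback() pass:

```php
<?php
// Simplified sketch: collect image sources from an HTML string in one
// preg_replace_callback() pass, without mutating the HTML. The regex is
// deliberately naive and for illustration only.
function collectImageNames( string $html ): array {
	$names = [];
	preg_replace_callback(
		'/<img\b[^>]*\bsrc="([^"]+)"[^>]*>/i',
		static function ( array $m ) use ( &$names ) {
			$names[] = $m[1];
			return $m[0]; // return the match unchanged; we only harvest
		},
		$html
	);
	return $names;
}
```

The point of the comparison is that this touches the string once, with no tree construction, whereas a Remex-based walker would build and traverse the whole DOM for the same answer.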

  (And if that's not actually a viable implementation strategy, then my objections to the hook only grow: we shouldn't be adding new features that we know aren't viable post-Parsoid. Whatever the solution here is, it's going to have to be rewritten to work with Parsoid within the next year, by a team which is already trying to boil an ocean of their own.)

Parsoid only parses HTML once, it doesn't have to run Remex over the entire document for each little feature. Maybe with a class or tag name index, Parsoid could beat the regex for performance, for documents with few images and a lot of other stuff.

Switching the intermediate document representation from strings to DOM trees is an essential design difference between the old parser and Parsoid. I don't think there's any way you can paper over it.

Change 755049 had a related patch set uploaded (by Zabe; author: Zabe):

[mediawiki/extensions/GlobalUsage@master] Correctly access ParserOutput::$mImages

https://gerrit.wikimedia.org/r/755049

Change 755049 abandoned by Zabe:

[mediawiki/extensions/GlobalUsage@master] Correctly access ParserOutput::$mImages

Reason:

no longer necessary due to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/755062

https://gerrit.wikimedia.org/r/755049

I can't get PageImages to work locally or on the beta cluster any more. Has something changed in how it should be configured?
No page image is showing up on https://en.wikipedia.beta.wmflabs.org/wiki/Testing_page_images?action=info when one should be.

Worked out the bug - T299798 - but I'm not sure whether it's a recent regression or related to the recent changes.