
Add a link in VE: define exclusion rules for finding text in the DOM
Open, MediumPublic

Description

When we locate text in the DOM to apply link recommendations to (in AddLinkArticleTarget.prototype.annotateSuggestions), certain nodes need to be excluded, like template nodes for example. We need to come up with a full rule set describing which nodes to exclude, and apply those rules both when loading suggestions in VE and when producing suggestions in the backend.

Event Timeline

Restricted Application added a subscriber: Aklapper.

One thing that is clearly deficient about the current rules is that they don't handle about groups: in HTML like <span typeof="mw:Transclusion" about="#mwt1">...</span><div about="#mwt1">...</div> both nodes should be skipped, but the current implementation skips only the first one.

I think everything with an mw: typeof can be skipped. The relevant code is this, which just iterates through the top-level text nodes in the mwparserfromhell syntax tree.
The Parsoid HTML spec says "Clients are also expected to preserve any DOM subtrees marked up with typeof, rel, property", so probably the same applies to the other two attributes as well. Plus, anything with an about attribute, as @Catrope said.
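The rules above (skip any subtree whose typeof, rel or property starts with mw:, and skip any sibling sharing an about id with an already-skipped node) could be sketched as follows. This is a hypothetical standard-library illustration, not the actual VE or mwaddlink code; the class name and the text-collection interface are assumptions.

```python
# Hypothetical sketch of the exclusion rules discussed above, using only
# the standard library. Not the actual AddLinkArticleTarget code.
from html.parser import HTMLParser


class ExclusionScanner(HTMLParser):
    """Collect text that is safe to annotate, skipping Parsoid-generated
    subtrees (typeof/rel/property starting with "mw:") and every node that
    shares an `about` id with an already-skipped transclusion."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0          # > 0 while inside an excluded subtree
        self.skipped_abouts = set()  # about ids belonging to excluded nodes
        self.safe_text = []

    def _excluded(self, attrs):
        a = dict(attrs)
        for key in ("typeof", "rel", "property"):
            if (a.get(key) or "").startswith("mw:"):
                return True
        # about groups: skip every sibling carrying an already-seen about id
        return a.get("about") in self.skipped_abouts

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            self.skip_depth += 1
            return
        if self._excluded(attrs):
            about = dict(attrs).get("about")
            if about:
                self.skipped_abouts.add(about)
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.safe_text.append(data)
```

Feeding the about-group example from the earlier comment, both the span and the div are skipped, and only text outside them survives.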
The more problematic part is wikitext constructs or wikitext HTML that Parsoid does not explicitly mark up. E.g. '''foo''' will be turned into <b>foo</b>, and mwaddlink will skip it, so the frontend logic should too (this is important for counting occurrences).

I think a first approximation is that absolutely everything should be skipped other than text nodes within body > section > p tags; but that does not account for the conversion of special wikitext characters (entities, French spaces; I'm not sure if there's anything else). Those should be reverted to a plain character for the purposes of this matching, which seems pretty complicated to do. It might be way less effort to just make mwaddlink skip anchor text candidates with entities or French spaces.
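As a rough illustration of that reverting step, here is a hypothetical normalization helper. The substitution table is an assumption covering only non-breaking and narrow no-break spaces; a real rule set would have to mirror exactly what the parser emits.

```python
# Hypothetical normalization for matching Parsoid-rendered text back to
# plain wikitext. The substitution table is illustrative only.
import html

SUBSTITUTIONS = {
    "\u202f": " ",  # narrow no-break space ("French space" before ; : ! ?)
    "\u00a0": " ",  # no-break space, e.g. produced from &nbsp;
}


def normalize_for_matching(text: str) -> str:
    """Revert HTML entities and special whitespace to plain characters so
    rendered text can be compared against the original wikitext."""
    text = html.unescape(text)
    for special, plain in SUBSTITUTIONS.items():
        text = text.replace(special, plain)
    return text
```

The alternative mentioned above (making mwaddlink skip candidates containing such characters) would avoid having to keep a table like this complete.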

Also, I have absolutely no clue how language converter constructs are handled, either by mwaddlink (does mwparserfromhell even recognize them?) or by Parsoid. I don't think we have any affected wiki right now - Serbian has language conversion, but it's character-by-character and I don't think it uses the wikitext constructs. As for character conversion, hopefully we just need to apply it to the anchor and context text and we'll be okay? I don't know much about LanguageConverter.

To answer my own question:

  • mwparserfromhell doesn't seem to handle language conversion blocks: it treats the block markers as plain text. This will be a problem down the road once we need to support a wiki that uses those blocks (although a somewhat fringe one), since it will suggest links inside -{}- blocks, and there is no sane way to match those to the Parsoid output. Improving mwparserfromhell seems like a manageable effort, though.
  • Parsoid support for language conversion is mostly missing (T43716 is the tracking task), which results in VE not being enabled on wikis with complex conversion rules. But srwiki (our only LanguageConverter wiki to date) does have VE enabled. Its language conversion markup usage seems to be limited to -{text}-, which just disables language conversion. Parsoid converts that to a span tag with no text content, which is a problem for matching text inside it but not otherwise.

So a 99% solution for srwiki would be to make mwaddlink skip suggestions where the context prefix includes -{ or the context suffix includes }-. If we call Parsoid on the server side to convert wikitext-based anchor specifications to Parsoid HTML based ones, that should work fine for LC wikis as long as the link suggestion does not break language conversion markup (we just need to translate the anchor specification separately for each language variant).
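The 99% filter could be as simple as the following sketch; the function name and the context-based interface are illustrative, not mwaddlink's actual API.

```python
# Hypothetical sketch of the srwiki filter described above: drop any
# suggestion whose surrounding context touches LanguageConverter markup.
def skip_for_language_converter(context_prefix: str, context_suffix: str) -> bool:
    """Return True if the candidate sits near -{ ... }- markup and should
    be discarded rather than risk breaking the conversion block."""
    return "-{" in context_prefix or "}-" in context_suffix
```

This deliberately over-rejects (a suggestion merely preceded by a closed block is also dropped), which seems acceptable for a stopgap.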

If we call Parsoid on the server side to convert wikitext-based anchor specifications to Parsoid HTML based ones...

I guess I only talked about that on Slack so I'll copy my comment here for the record:

I've been thinking about how to match node traversal between mwparserfromhell and Parsoid (in the context of T267694), and I'm wondering if we should just preprocess on the server side: define an extension tag like <link-recommendation>, and in refreshLinkRecommendations.php when warming the cache, fetch the wikitext, apply the tags (which is trivial at that stage), transform with Parsoid, locate the outputs of the extension tags, calculate some kind of template-agnostic pseudo-xpath, and give that to the frontend code instead of occurrence count and whatnot.
(Did someone suggest this already? It seems vaguely familiar but I can't recall the details.)
It would make the job quite a bit slower, but as far as I can see it would give perfectly faithful data to the frontend, and we wouldn't have to worry about French spaces, LanguageConverter and other exotic wikitext features at all. And if mwaddlink would change its node traversal logic in the future (e.g. propose link candidates where there's text formatting inside the anchor, or where the anchor is inside a table) that would work as well.
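To make the "template-agnostic pseudo-xpath" idea concrete, here is a hypothetical sketch that addresses a node by child indices while not counting mw:-typed siblings, so the path stays stable even if transclusion output changes. The addressing scheme and names are assumptions, not an existing implementation.

```python
# Hypothetical illustration of a template-agnostic pseudo-xpath: a list of
# child indices from the root to a target node, where siblings that are
# Parsoid-generated (typeof starting with "mw:") are skipped when counting.
import xml.etree.ElementTree as ET


def pseudo_xpath(root, target):
    """Return a list of child indices from root to target, ignoring
    mw:-typed siblings when counting positions, or None if not found."""
    def walk(node, path):
        index = 0
        for child in node:
            if (child.get("typeof") or "").startswith("mw:"):
                continue  # transclusion output does not affect indices
            if child is target:
                return path + [index]
            found = walk(child, path + [index])
            if found:
                return found
            index += 1
        return None
    return walk(root, [])
```

A refreshLinkRecommendations.php-style job could compute such paths server-side and hand them to the frontend in place of occurrence counts.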

I have a feeling this might need to happen in a future phase of work, unless beta testing and QA reveal substantial problems in our current setup. Thoughts on how to prioritize this and T267695: Add a link in VE: handle span-wrapped entities and annotations when finding link recommendation phrases?

If we can get away with not doing it, we certainly should.

kostajh triaged this task as Medium priority. · Mon, Apr 26, 7:54 PM

Change 684810 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] Exclude mw:WikiLink nodes when annotating suggestions

https://gerrit.wikimedia.org/r/684810

Change 684810 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Exclude mw:WikiLink nodes when annotating suggestions

https://gerrit.wikimedia.org/r/684810

kostajh moved this task from Post-release backlog to May 3 - May 7 on the Add-Link board.

As discussed in the patch for T281777: Add a link: link suggested inside red link, we're going to work on implementing the necessary changes here to accommodate bold, italic, and the other node attributes discussed in T267694#6764121.