
Add a link in VE: define exclusion rules for finding text in the DOM
Open, MediumPublic

Description

When we locate text in the DOM to apply link recommendations to (in AddLinkArticleTarget.prototype.annotateSuggestions), certain nodes need to be excluded, like template nodes for example. We need to come up with a full rule set describing which nodes to exclude, and apply those rules both when loading suggestions in VE and when producing suggestions in the backend.

Event Timeline

Restricted Application added a subscriber: Aklapper.

One thing that is clearly deficient about the current rules is that they don't handle about groups: in HTML like <span typeof="mw:Transclusion" about="#mwt1">...</span><div about="#mwt1">...</div> both nodes should be skipped, but the current implementation skips only the first one.

I think everything with an mw: typeof can be skipped. The relevant code is this, which just iterates through the top-level text nodes in the mwparserfromhell syntax tree.
The Parsoid HTML spec says "Clients are also expected to preserve any DOM subtrees marked up with typeof, rel, property", so probably the same applies to the other two attributes as well. Plus, anything with an about attribute, as @Catrope said.
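The rules above (skip any subtree whose typeof, rel or property starts with mw:, and skip any sibling sharing an about id with an already-skipped node) could be sketched as follows. This is a hypothetical standard-library illustration, not the actual VE or mwaddlink code; the class name and the text-collection interface are assumptions.

```python
# Hypothetical sketch of the exclusion rules discussed above, using only
# the standard library. Not the actual AddLinkArticleTarget code.
from html.parser import HTMLParser


class ExclusionScanner(HTMLParser):
    """Collect text that is safe to annotate, skipping Parsoid-generated
    subtrees (typeof/rel/property starting with "mw:") and every node that
    shares an `about` id with an already-skipped transclusion."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0          # > 0 while inside an excluded subtree
        self.skipped_abouts = set()  # about ids belonging to excluded nodes
        self.safe_text = []

    def _excluded(self, attrs):
        a = dict(attrs)
        for key in ("typeof", "rel", "property"):
            if (a.get(key) or "").startswith("mw:"):
                return True
        # about groups: skip every sibling carrying an already-seen about id
        return a.get("about") in self.skipped_abouts

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            self.skip_depth += 1
            return
        if self._excluded(attrs):
            about = dict(attrs).get("about")
            if about:
                self.skipped_abouts.add(about)
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.safe_text.append(data)
```

Feeding the about-group example from the earlier comment, both the span and the div are skipped, and only text outside them survives.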
The more problematic part is wikitext constructs or wikitext HTML that Parsoid does not explicitly mark up. E.g. '''foo''' will be turned into <b>foo</b>, and mwaddlink will skip it, so the frontend logic should too (this is important for counting occurrences).

I think a first approximation is that absolutely everything should be skipped other than text nodes within body > section > p tags; but that does not account for the conversion of special wikitext characters (entities, French spaces; I'm not sure if there's anything else). Those should be reverted to a plain character for the purposes of this matching, which seems pretty complicated to do. It might be way less effort to just make mwaddlink skip anchor text candidates with entities or French spaces.
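As a rough illustration of that reverting step, here is a hypothetical normalization helper. The substitution table is an assumption covering only non-breaking and narrow no-break spaces; a real rule set would have to mirror exactly what the parser emits.

```python
# Hypothetical normalization for matching Parsoid-rendered text back to
# plain wikitext. The substitution table is illustrative only.
import html

SUBSTITUTIONS = {
    "\u202f": " ",  # narrow no-break space ("French space" before ; : ! ?)
    "\u00a0": " ",  # no-break space, e.g. produced from &nbsp;
}


def normalize_for_matching(text: str) -> str:
    """Revert HTML entities and special whitespace to plain characters so
    rendered text can be compared against the original wikitext."""
    text = html.unescape(text)
    for special, plain in SUBSTITUTIONS.items():
        text = text.replace(special, plain)
    return text
```

The alternative mentioned above (making mwaddlink skip candidates containing such characters) would avoid having to keep a table like this complete.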

Also, I have absolutely no clue how language converter constructs are handled, either by mwaddlink (does mwparserfromhell even recognize them?) or by Parsoid. I don't think we have any affected wiki right now - Serbian has language conversion, but it's character-by-character and I don't think it uses the wikitext constructs. As for character conversion, hopefully we just need to apply it to the anchor and context text and we'll be okay? I don't know much about LanguageConverter.

To answer my own question:

  • mwparserfromhell doesn't seem to handle language conversion blocks: it treats the block markers as plain text. This will be a problem down the road once we need to support a wiki that uses those blocks (although a somewhat fringe one), since it will suggest links inside -{}- blocks, and there is no sane way to match those to the Parsoid output. Improving mwparserfromhell seems like a manageable effort, though.
  • Parsoid support for language conversion is mostly missing (T43716 is the tracking task), which results in VE not being enabled on wikis with complex conversion rules. But srwiki (our only LanguageConverter wiki to date) does have VE enabled. Its language conversion markup usage seems to be limited to -{text}-, which just disables language conversion. Parsoid converts that to a span tag with no text content, which is a problem for matching text inside it but not otherwise.

So a 99% solution for srwiki would be to make mwaddlink skip suggestions where the context prefix includes -{ or the context suffix includes }-. If we call Parsoid on the server side to convert wikitext-based anchor specifications to Parsoid HTML based ones, that should work fine for LC wikis as long as the link suggestion does not break language conversion markup (we just need to translate the anchor specification separately for each language variant).
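The 99% filter could be as simple as the following sketch; the function name and the context-based interface are illustrative, not mwaddlink's actual API.

```python
# Hypothetical sketch of the srwiki filter described above: drop any
# suggestion whose surrounding context touches LanguageConverter markup.
def skip_for_language_converter(context_prefix: str, context_suffix: str) -> bool:
    """Return True if the candidate sits near -{ ... }- markup and should
    be discarded rather than risk breaking the conversion block."""
    return "-{" in context_prefix or "}-" in context_suffix
```

This deliberately over-rejects (a suggestion merely preceded by a closed block is also dropped), which seems acceptable for a stopgap.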

If we call Parsoid on the server side to convert wikitext-based anchor specifications to Parsoid HTML based ones...

I guess I only talked about that on Slack so I'll copy my comment here for the record:

I've been thinking about how to match node traversal between mwparserfromhell and Parsoid (in the context of T267694), and I'm wondering if we should just preprocess on the server side: define an extension tag like <link-recommendation>, and in refreshLinkRecommendations.php when warming the cache, fetch the wikitext, apply the tags (which is trivial at that stage), transform with Parsoid, locate the outputs of the extension tags, calculate some kind of template-agnostic pseudo-xpath, and give that to the frontend code instead of occurrence count and whatnot.
(Did someone suggest this already? It seems vaguely familiar but I can't recall the details.)
It would make the job quite a bit slower, but as far as I can see it would give perfectly faithful data to the frontend, and we wouldn't have to worry about French spaces, LanguageConverter and other exotic wikitext features at all. And if mwaddlink would change its node traversal logic in the future (e.g. propose link candidates where there's text formatting inside the anchor, or where the anchor is inside a table) that would work as well.
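To make the "template-agnostic pseudo-xpath" idea concrete, here is a hypothetical sketch that addresses a node by child indices while not counting mw:-typed siblings, so the path stays stable even if transclusion output changes. The addressing scheme and names are assumptions, not an existing implementation.

```python
# Hypothetical illustration of a template-agnostic pseudo-xpath: a list of
# child indices from the root to a target node, where siblings that are
# Parsoid-generated (typeof starting with "mw:") are skipped when counting.
import xml.etree.ElementTree as ET


def pseudo_xpath(root, target):
    """Return a list of child indices from root to target, ignoring
    mw:-typed siblings when counting positions, or None if not found."""
    def walk(node, path):
        index = 0
        for child in node:
            if (child.get("typeof") or "").startswith("mw:"):
                continue  # transclusion output does not affect indices
            if child is target:
                return path + [index]
            found = walk(child, path + [index])
            if found:
                return found
            index += 1
        return None
    return walk(root, [])
```

A refreshLinkRecommendations.php-style job could compute such paths server-side and hand them to the frontend in place of occurrence counts.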

I have a feeling this might need to happen in a future phase of work, unless beta testing and QA reveal substantial problems in our current setup. Thoughts on how to prioritize this and T267695: Add a link in VE: handle span-wrapped entities and annotations when finding link recommendation phrases?

If we can get away with not doing it, we certainly should.

kostajh triaged this task as Medium priority. · Mon, Apr 26, 7:54 PM

Change 684810 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] Exclude mw:WikiLink nodes when annotating suggestions

https://gerrit.wikimedia.org/r/684810

Change 684810 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Exclude mw:WikiLink nodes when annotating suggestions

https://gerrit.wikimedia.org/r/684810

kostajh moved this task from Post-release backlog to May 3 - May 7 on the Add-Link board.

As discussed in the patch for T281777: Add a link: link suggested inside red link, we're going to work on implementing the necessary changes here to accommodate bold, italic, and the other node attributes discussed in T267694#6764121.