
Add Link: word substrings are turned into links
Closed, ResolvedPublic

Description

Sometimes, Add Link turns part of a word into a link (example, example, example). This 1) looks ugly (although that might be language-specific - see T128060: VisualEditor makes it easy to create partially linked words, when the user expects a fully linked one for a (lengthy) related discussion), and 2) results in a confusing diff (because Parsoid needs to use a <nowiki> hack to get it to look like that - see T35091: Parsoid: Linking on a part of a word triggers linktrail).

image (1).png (436×2 px, 574 KB)

This could be an issue with mwaddlink's word tokenizer (unlikely) or phrase matching (more likely).

Event Timeline

Similar situation on bnwiki, where, in this link suggestion, the algorithm correctly suggested the article but added an undesirable <nowiki> tag after the suggestion was accepted. The word in the article is "কোরআনের", and after accepting the suggestion it should become "[[কুরআন|কোরআনের]]" (instead of adding the nowiki tag). A mentor notified me and said that he had seen a similar scenario before too.

Impact: Links are recommended where they shouldn't be
What happens if we don’t do this task: Potentially more links added to word substrings, lowering link quality
Level of effort: ? @Tgr @kostajh is this something on us or on the research team?
Decision maker: ?

Both are possible, depending on whether this can be fixed in the phrase matching logic (on us, probably easy), the recommendation generation logic (preferably on the research team, probably easy) or by changing how the two are aligned, that is, replacing simple text search based phrase matching with something more sophisticated (hard). My money would be on the first: that this is a bug in phrase matching and we can fix it relatively easily.

Should we merge this task with T286100: [arwiki-wmf.12] Unable to find 4 link recommendation phrase item(s) in document.? That one is already in current sprint. T285651: Add Qunit tests for AddLinkArticleTarget.prototype.annotateSuggestions would probably be a good idea to do along with this work as well.

This seems to be a bug in the recommendation generation logic.
For the first example mentioned above (article Keypad on cswiki, with suggested anchor "kurzor" matched as a substring of the word "kurzoru", leading to "[[kurzor]]<nowiki/>u"): from what I understand, the recommendation to add a link [[kurzor]] is correct, but the identified string context is wrong. The relevant text where we try to insert the link (i.e. the text node obtained from mwparserfromhell) contains both the word "kurzoru" (rychlejší přesun kurzoru na následující) and "kurzor" (kterém je kurzor, smazání). Once we commit to anchor+link, we do a simple string match for the anchor to identify the surrounding text to get the context (link to relevant code). In this case, the first match comes from "kurzoru" (since "kurzor" is a substring of it); thus we record the wrong context for the link. Instead, we would like to match only the standalone instance "kurzor" so that we get the right context. This will avoid the substring matches.

Thus, it seems to me we can probably fix this easily in the recommendation generation in order to record the correct string context. However, I am not sure anymore how the string matching is done in the visual editor -- do we actually use the context from the wikitext or the plaintext to get the correct match? (If not, then we have to find a different solution.)

It would be helpful to know how often this occurs in practice. Not sure how we could do that, but it could help with prioritization.

@MGerlach for phrase matching, we iterate over text nodes and ignore content in italic/bold nodes, tables, blockquotes and headings (see https://github.com/wikimedia/mediawiki-extensions-GrowthExperiments/blob/master/modules/ext.growthExperiments.StructuredTask/addlink/AddLinkArticleTarget.js#L312). So no, we are not considering wikitext in the phrase matching logic; we're iterating over Parsoid HTML.

Do you have time to work on a fix in mwaddlink?

@kostajh when iterating over the Parsoid HTML for the phrase-matching, are you considering the context (i.e. surrounding text and/or surrounding wikitext) of the anchor-text provided by the link recommendation API?

  • if yes: then, I need to add a fix to mwaddlink to identify the correct context.
  • if no: then, I think this needs to be solved on the Parsoid HTML side to identify the correct substring where the anchor-text is not a substring of a different word.

From what I understood (T283985#7365621), the problem is that the anchor text of the suggested link appears multiple times in the text of the article, both as a substring of a longer word (wrong) and as a separate word (correct). At the moment, we are inserting the link at the wrong position in the text (most likely because that is the first match). At least, this was the case in the example mentioned above.

kostajh edited projects, added Growth-Team (Current Sprint); removed Growth-Team.

Ah, I see.

No, we are not. I see the problem now.

Ack. I'll push a patch.

Change 724723 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] [WIP] AddLink: Fix substring match in phrase-matching approach

https://gerrit.wikimedia.org/r/724723

@MGerlach I added you to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/724723 if you could review the overall approach. At a high level, it changes the phrase matching logic to search for the phrase plus before/after context, then locate the specific text we want to link. I believe this should work since the context is based on the plain text from mwparserfromhell.parse(). One thing I noticed, though, is that the API returns \n in some context strings; I've worked around that by stripping it from the before/after context, but we should probably remove it on the mwaddlink side. And it made me wonder if there are other characters like that which may show up unexpectedly and break the phrase matching.

@kostajh I added a comment in gerrit, but I am not sure if that is the right place to discuss the strategy to fix, so I am copying here again.

At the moment, the context_before/context_after provided by the API suffers from the same problem. Once we have identified a link/anchor, mwaddlink uses simple string matching in the wikitext (specifically in the corresponding text node obtained from mwparserfromhell). Thus, if we want to use the context for phrase matching here, we would also need to fix phrase matching in mwaddlink.

Is there a possibility to use a simpler heuristic to make sure that the match is not a substring? Could we iterate through all matches from the regex and check whether each one is the correct "word" (we know that one of the matches should be the correct one)? Or check whether the <nowiki> tag (T283985#7316890) would be introduced?

We don't use the context, but mwaddlink doesn't use it either, so I don't think it would help. Matching is done by calculating an index within all matches in the page (that aren't inside some wikitext construct, as far as mwparserfromhell is concerned), and finding the match with the same index in GrowthExperiments (for matches which aren't inside some wikitext construct, as far as Parsoid is concerned). So for matching to work, mwparserfromhell and Parsoid need to agree on what is "top-level wikitext" (which seems to be working), and they need to calculate matches the same way.

The mwaddlink logic is: for each mwparserfromhell node,

  1. generate a list of potential anchors (that is, words/n-grams that occur in that node and that the ML logic thinks are good link anchors)
  2. grep for the first match of the regex (?<!\[\[)(?<!-->)\b<potential anchor>\b(?![\w\s]*[\]\]]) in the node
  3. count the number of times the link anchor appears in preceding nodes, and in the current node before the actual anchor position (this is a plain substring match, ignoring word boundaries)
  4. pass the link anchor text + the match index (the count above plus one) to the client

The client-side logic is: for each Parsoid node,

  1. match for any of the link anchors (plain text match, ignoring word boundaries)
  2. if a match is found, increase the match counter for that link anchor by one
  3. if the counter reaches the match index we got from mwaddlink, we have found the anchor.

In theory, as far as I can see, this is correct and substring matches don't cause any problem since we are substring-matching on both sides.
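The two-sided counting described above can be sketched as follows (a simplified illustration with invented function names, not the actual mwaddlink/GrowthExperiments code; it uses a 0-based match index for brevity, whereas the comment above describes count-plus-one):

```python
import re

def server_match_index(nodes, anchor):
    """Sketch of the mwaddlink side: find the first word-boundary match
    of the anchor in a node, then count the plain substring occurrences
    that precede it (in earlier nodes and within the matching node)."""
    count = 0
    for node in nodes:
        m = re.search(r'\b' + re.escape(anchor) + r'\b', node)
        if m:
            count += node[:m.start()].count(anchor)
            return count  # this is the match index sent to the client
        count += node.count(anchor)
    return None

def client_locate(nodes, anchor, match_index):
    """Sketch of the client side: walk the plain-text nodes counting
    substring matches until the counter reaches the match index."""
    seen = 0
    for i, node in enumerate(nodes):
        start = node.find(anchor)
        while start != -1:
            if seen == match_index:
                return i, start  # (node index, offset of anchor)
            seen += 1
            start = node.find(anchor, start + 1)
    return None
```

Because both sides count every substring occurrence the same way, a partial-word occurrence like "kurzoru" does not desynchronize the two counters; the bug discussed in this task lies elsewhere.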

Is there a possibility to use a simpler heuristic to make sure that the match is not a substring? Could we iterate through all matches from the regex and check whether each one is the correct "word" (we know that one of the matches should be the correct one)? Or check whether the <nowiki> tag (T283985#7316890) would be introduced?

Hmm, that is probably possible by making a request to generate a preview, and checking if the resulting wikitext has <nowiki>. But it seems complicated to do.

The approach I am taking in the patch seems to be working fine, as far as I can tell:

  • Search for the link_text but include one character before/after the link text from the context provided by mwaddlink, e.g. instead of searching for "kurzor", search for "kurzor " or "kurzor." etc.
    • Once the link text with context is found, pinpoint the actual link_text, e.g. after finding the text node that contains "kurzor ", search for "kurzor" in that string and do the link_text/link_target insertion
    • Strip newline characters from the context search, because those won't appear in the textNode data
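The contextual search could look roughly like this (a hypothetical helper sketching the patch's idea; the real code operates on Parsoid DOM text nodes and is structured differently):

```python
def find_with_context(node_text, link_text, context_before, context_after):
    """Search for link_text together with one character of surrounding
    context, then return the offset of link_text itself, or -1.
    Newlines are stripped from the context first, because they do not
    appear in the text-node data."""
    before = context_before.replace('\n', '')[-1:]
    after = context_after.replace('\n', '')[:1]
    pos = node_text.find(before + link_text + after)
    if pos == -1:
        return -1
    # pinpoint the actual link_text inside the contextual match
    return pos + len(before)
```

A plain find() for "kurzor" would land inside "kurzoru"; anchoring the search with one character of context on each side skips that partial-word occurrence.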

A known issue is that suggested links at the beginning of a list item won't be suggested anymore, e.g.:

* kurzor won't get linked
* Some text ahead, means that kurzor does get linked

That's because the context provided by mwparserfromhell for text that begins directly after a list item marker is an empty space, while the textNode on the client side has no leading empty space, so our regex searches for " kurzor" (with the leading space) and does not find it for the first list item. Given that we are considering removing list item suggestions entirely (T279519), that is maybe not such a big problem.

kostajh moved this task from Incoming to Code Review on the Growth-Team (Current Sprint) board.

The recommendations I get (both from the production API and locally) for the pre-edit version of the page:

{
  "context_after": "u na násle",
  "context_before": "ší přesun ",
  "link_index": 2,
  "link_target": "Kurzor",
  "link_text": "kurzor",
  "match_index": 0,
  "score": 0.708010733127594,
  "wikitext_offset": 424
}
{
  "context_after": "akban.\n\nA ",
  "context_before": "Vitesse”) ",
  "link_index": 8,
  "link_target": "Forgóváz",
  "link_text": "forgóváz",
  "match_index": 2,
  "score": 0.820939838886261,
  "wikitext_offset": 6373
}

(I used this helper script, FWIW: P18742.) I couldn't reproduce the arwiki recommendation; presumably the arwiki dataset has been regenerated since then.

In both cases the mwaddlink logic recommended linking part of a word. The match_index value was correct, and phrase matching on the MediaWiki side also worked correctly. The problem must be with the mwaddlink candidate selection logic.

Change 754075 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[research/mwaddlink@main] Fix link target location bug and add test

https://gerrit.wikimedia.org/r/754075

Change 724723 abandoned by Kosta Harlan:

[mediawiki/extensions/GrowthExperiments@master] AddLink: Fix substring match in phrase-matching approach

Reason:

https://gerrit.wikimedia.org/r/724723

So this is the article text:

Keypad umožňuje rychlejší editaci programu v režimu 128 Basic, neboť umožňuje rychlejší přesun kurzoru na následující a předcházející slovo, přesun kurzoru o 10 řádek výše a níže, přesun kurzoru na začátek a na konec řádky, smazání písmene na kterém je kurzor, smazání celého slova před kurzorem a za kurzorem, smazání části řádku od začátku řádku do pozice kurzoru a smazání části řádky od pozice kurzoru do konce řádky.

This is the current behavior:

Keypad umožňuje rychlejší editaci programu v režimu 128 Basic, neboť umožňuje rychlejší přesun [[kurzor]]u na následující a předcházející slovo, přesun kurzoru o 10 řádek výše a níže, přesun kurzoru na začátek a na konec řádky, smazání písmene na kterém je kurzor, smazání celého slova před kurzorem a za kurzorem, smazání části řádku od začátku řádku do pozice kurzoru a smazání části řádky od pozice kurzoru do konce řádky.

This is happening for all the wrong reasons: mwaddlink thinks it's linking the second bolded word, but it messes up and links the first. But it actually happens to be kind of correct: in general the first occurrence of a word should be linked (so "kurzoru", not "kurzor" - it's the same word). The problem is it is only partially linked ([[kurzor]]u instead of [[kurzor|kurzoru]]), which results in a disruptive Parsoid construct when translated to wikitext. After fixing the mwaddlink word selection bug, the behavior will be

Keypad umožňuje rychlejší editaci programu v režimu 128 Basic, neboť umožňuje rychlejší přesun kurzoru na následující a předcházející slovo, přesun kurzoru o 10 řádek výše a níže, přesun kurzoru na začátek a na konec řádky, smazání písmene na kterém je [[kurzor]], smazání celého slova před kurzorem a za kurzorem, smazání části řádku od začátku řádku do pozice kurzoru a smazání části řádky od pozice kurzoru do konce řádky.

which, per above, is still wrong.

We could change the logic easily to link the first occurrence but include the suffix. But deciding whether two words, where one is a suffixed version of the other, are the same word, is a hard problem that requires grammatical analysis, stemming, a dictionary - nontrivial NLP tools which mwaddlink currently does not have. (To give an English example: when we want to link to the Star article, then "stars" should be linked but "starling" shouldn't. It's easy for the script to tell that both of those words start with "star". It's really hard to tell that "stars" and "star" refer to the same concept and "starling" doesn't.)
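To make the English example concrete, a naive prefix check (a toy illustration, not a proposed fix) accepts both words and cannot tell an inflected form apart from an unrelated term:

```python
anchor = "star"

# Both words begin with the anchor, so a prefix test accepts both,
# even though only "stars" refers to the same concept as "star".
words = ["stars", "starling"]
prefix_matches = [w for w in words if w.startswith(anchor)]
# Distinguishing the two would need stemming/lemmatization,
# which mwaddlink does not currently have.
```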

So, not quite sure what would be a good solution here.

@MGerlach would you mind reviewing and letting us know your thoughts on this, please?

@kostajh @Tgr Thank you for looking into this. Sorry, I had meant to leave a comment here earlier but it somehow got lost.

The intended logic of mwaddlink is to find links for full words, not for subwords, via exact string matching. For the example "kurzor":

We tokenize the text, so we get a match for the full word "kurzor" as an anchor to link to "Kurzor":

, smazání písmene na kterém je kurzor, smazání celého slova před kurzorem a za kurzorem,

The problem arises when we try to find the context of the anchor: we use Python's find() method to find the index in the text and then take the 10 surrounding characters (code). However, find() yields the first direct string match of "kurzor", which occurs a bit earlier and is only a substring of a full word (and thus was skipped in the previous step by the tokenization) and, in turn, also yields the wrong context:

, neboť umožňuje rychlejší přesun kurzoru na následující a předcházející slovo,

When choosing find() to get the index, I did not anticipate this error, because we only consider a single text node from mwparserfromhell (consecutive plain text between other elements such as links, references, etc.), which I thought would not contain several instances of the same substring.

Therefore, following the intended logic, I would link the "second" occurrence ("kurzor" instead of "kurzoru"). The wrong string matching used to find the context could probably be fixed by using better string matching via a regex instead of the simple find(). I do not recommend trying to identify that "kurzoru" is actually the same word as "kurzor"; this would open a can of worms. While we might not get exactly the first occurrence of the word ("kurzoru"), the next occurrence ("kurzor") can still be expected to be relatively close, since we are in the same text node as identified via mwparserfromhell.
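The difference can be demonstrated on the sentence from the example (a minimal sketch; the actual fix in the mwaddlink patch may differ in detail):

```python
import re

text = ("neboť umožňuje rychlejší přesun kurzoru na následující slovo, "
        "smazání písmene na kterém je kurzor, smazání")
anchor = "kurzor"

# str.find() returns the first raw substring match, which lands inside
# the longer word "kurzoru" -- the occurrence the tokenizer had skipped.
naive_pos = text.find(anchor)

# A word-boundary regex skips partial-word occurrences and lands on the
# standalone word "kurzor", giving the correct context.
m = re.search(r'\b' + re.escape(anchor) + r'\b', text)
```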

The patch will fix the find() issue and make mwaddlink's actual behavior match its intended behavior. But that doesn't change the fact that the intended behavior does not match wiki policy (and usability common sense, since readers will want to learn about new terms when they encounter them for the first time).

I can think of three approaches:

  1. Just apply the patch as is and hope communities won't be annoyed by the imperfect link placement. (It is very likely less annoying than the nowiki tags, and even those didn't result in too many complaints.)
  2. Just link the first word which has the candidate term as its prefix. (I.e. keep the current logic but extend the link anchor to the end of the word.) In some cases that will be an unrelated word; but then, incorrect link suggestions are already part of the system, so we can rely on users to say no to those.
  3. Whenever the first match for the candidate term is not a full-word match, discard that candidate entirely. This will result in recommendations which honor the "link the first instance of the word" rule, but there will be fewer recommendations overall.

As a mentor, I would recommend that you first try 3, as the number of links added per article seems a bit high at the moment, so removing some of them wouldn't be damaging. The patch should be applied as the second option, but only together with a documentation update to make the issues crystal clear. For now, we're far from that.

Thanks for that perspective. I think we should distinguish between these two things, though:

  1. theoretical availability of links
  2. maximum number of links shown to a user to choose from

In many articles, we have a problem with item 1, finding enough phrases to link to. With item 2, we plan to introduce configuration per-wiki (T274325) to allow a user to see a maximum of e.g. 4 links per article.

It's not immediately obvious to me why 1 is a problem if 2 isn't - they seem linked. What does "enough phrases" mean exactly? Intuitively, I would say that 2 or 3 links per article should be enough for the user to practice. You could then potentially reuse the same article such as this one for other users.

The reason why I believe formal correctness should be prioritised is social rather than technical: add link is marked as an easy task, so potentially used by absolute newbies. They are the most exposed to on-sight reverts for the smallest mistake, effectively defeating the purpose of the Growth tools.

Having not enough links would lead to a lot of effort: selecting an article, editing, etc. The other day, someone told me that they found an article they felt legitimate to edit only after browsing 50 of them!

As a consequence, we both need to:

  • provide enough articles that have a high potential for linking, so that newcomers will have the feeling that they "edit a lot" (theoretical availability of links),
  • warn communities about the fact that reducing the number of links shown to a user could impact retention (newcomers getting bored).

Could you please elaborate on this? I don't understand why the user felt the article was not legitimate to edit. The person thought that an article with 1 or 2 suggested links was not worth working on?

It is not about the number of links per se. Some people only feel comfortable editing topics close to their interests.

Let's say that you select "Sport" and "Asia", wishing to edit about Indonesian football (the thing you are very good at): you will get anything about Asia plus anything about Sport (separately), without finding the right article aligned with your area of expertise.

When you find the perfect match, if you only have 2 links to add, it can be a bit disappointing.

Urbanecm_WMF changed the task status from Open to In Progress.Jan 31 2022, 11:34 AM

Maybe it will be clearer to discuss the first-instance issue in a separate task - I filed T300823: Add Link: Non-first instance of word linked when first instance is declensed. Re T283985#7627078, I think the first option is the best short term - we already have the code, and while imperfect, it will almost certainly be seen as an improvement; the issue of linking a non-first instance already exists (the bug discussed in this task only applies when the declined form and the root word appear in the same text node; if there is an earlier text node containing only the declined form, that one will be skipped).

Change 754075 merged by jenkins-bot:

[research/mwaddlink@main] Fix link target location bug and add test

https://gerrit.wikimedia.org/r/754075

kostajh changed the task status from In Progress to Open.Feb 3 2022, 12:54 PM
kostajh moved this task from Code Review to QA on the Growth-Team (Current Sprint) board.
Etonkovidova closed this task as Resolved.EditedFeb 22 2022, 1:22 AM
Etonkovidova added a subscriber: Etonkovidova.

Checked wmf.22 - tested the UI and used P18742 - on testwiki and (mostly) ruwiki; also did a light check on cswiki, huwiki, arwiki. No instances of substrings marked as links were found.