Add Link: word substrings are turned into links
Open, Needs Triage, Public

Description

Sometimes, Add Link turns part of a word into a link (example, example, example). This 1) looks ugly (although that might be language-specific; see T128060: VisualEditor makes it easy to create partially linked words, when the user expects a fully linked one, for a lengthy related discussion), and 2) results in a confusing diff, because Parsoid needs to use a <nowiki> hack to make it render that way (see T35091: Parsoid: Linking on a part of a word triggers linktrail).

image (1).png (436×2 px, 574 KB)

This could be an issue with mwaddlink's word tokenizer (unlikely) or phrase matching (more likely).

Event Timeline

There is a similar situation in bnwiki: in this link suggestion, the algorithm correctly suggested the article, but an undesirable <nowiki> tag was added after the suggestion was accepted. The word in the article is "কোরআনের", and after accepting the suggestion the result should be "[[কুরআন|কোরআনের]]" (instead of a link followed by the nowiki tag). A mentor notified me and said that he had seen a similar scenario before too.
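To illustrate the output described above, here is a minimal sketch of expanding a partial-word match to the whole word and emitting a piped link. It is not the mwaddlink or VisualEditor implementation. `is_word_char` and `link_whole_word` are hypothetical helper names, and treating Unicode letters, combining marks and digits as word characters is an assumption (Bengali vowel signs are combining marks, so a plain alphanumeric check could split the word).

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Assumption: letters (L*), combining marks (M*) and digits (N*) form words."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def link_whole_word(text: str, start: int, end: int, target: str) -> str:
    """Wrap the whole word containing text[start:end] in a wikitext link.

    `start` and `end` delimit the matched anchor; both are widened until they
    sit on word boundaries, so the link never covers only part of a word.
    """
    while start > 0 and is_word_char(text[start - 1]):
        start -= 1
    while end < len(text) and is_word_char(text[end]):
        end += 1
    word = text[start:end]
    # Use a piped link only when the visible word differs from the target title.
    link = f"[[{target}]]" if word == target else f"[[{target}|{word}]]"
    return text[:start] + link + text[end:]
```

With the word from this comment and a match that covers everything except the final two characters, `link_whole_word(word, 0, len(word) - 2, "কুরআন")` yields the piped form `[[কুরআন|কোরআনের]]`, with no `<nowiki/>` needed.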

Impact: Links are recommended where they shouldn't be
What happens if we don’t do this task: Potentially more links added to word substrings, lowering link quality
Level of effort: ? @Tgr @kostajh is this something on us or on the research team?
Decision maker: ?

Both are possible, depending on whether this can be fixed in the phrase matching logic (on us, probably easy), the recommendation generation logic (preferably on the research team, probably easy) or by changing how the two are aligned, that is, replacing simple text search based phrase matching with something more sophisticated (hard). My money would be on the first: that this is a bug in phrase matching and we can fix it relatively easily.

Should we merge this task with T286100: [arwiki-wmf.12] Unable to find 4 link recommendation phrase item(s) in document.? That one is already in current sprint. T285651: Add Qunit tests for AddLinkArticleTarget.prototype.annotateSuggestions would probably be a good idea to do along with this work as well.

This seems to be a bug in the recommendation generation logic.
Take the first example mentioned above: the article Keypad in cswiki, where the suggested anchor "kurzor" is a substring of the word "kurzoru", leading to "[[kurzor]]<nowiki/>u". From what I understand, the recommendation to add a link [[kurzor]] is correct but the identified string context is wrong. The relevant text where we try to insert the link (i.e. the text node obtained from mwparserfromhell) contains both the word "kurzoru" (rychlejší přesun kurzoru na následující) and "kurzor" (kterém je kurzor, smazání). Once we commit to anchor+link, we do a simple string match for the anchor to identify the surrounding text and get the context (link to relevant code). In this case, the first match comes from "kurzoru" (since "kurzor" is a substring of it); thus we record the wrong context for the link. Instead, we would like to match only the whole-word instance " kurzor " so that we get the right context. This would avoid the substring matches.
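A rough sketch of the whole-word match described above, not the actual mwaddlink code. `is_word_char` and `find_whole_word` are hypothetical names, and the definition of a word character (Unicode letters, combining marks, digits) is an assumption. Boundaries are checked by character class rather than by padding the anchor with spaces, because in the example the standalone "kurzor" is followed by a comma.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    """Assumption: letters (L*), combining marks (M*) and digits (N*) form words."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def find_whole_word(text: str, anchor: str) -> int:
    """Return the index of the first occurrence of `anchor` that is not part
    of a longer word, or -1 if every occurrence is embedded in another word."""
    start = text.find(anchor)
    while start != -1:
        end = start + len(anchor)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            return start
        # This occurrence is a substring of a longer word; keep searching.
        start = text.find(anchor, start + 1)
    return -1


# Illustrative text built from the two fragments quoted in this comment,
# joined with "..." (not the full article text).
text = "rychlejší přesun kurzoru na následující ... kterém je kurzor, smazání"
naive = text.find("kurzor")             # lands inside "kurzoru"
strict = find_whole_word(text, "kurzor")  # lands on the standalone "kurzor"
```

Here `naive` points into "kurzoru" while `strict` skips it, so the context recorded around `strict` would be the one where the link actually belongs.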

Thus, it seems to me we can probably fix this easily in the recommendation generation by recording the correct string context. However, I am no longer sure how the string matching is done in VisualEditor: do we actually use the context from the wikitext or plaintext to get the correct match? (If not, then we have to find a different solution.)

It would be helpful to know how often this occurs in practice. Not sure how we could do that, but it could help with prioritization.