
Add Link: Non-first instance of word linked when first instance is declensed
Closed, Declined · Public

Description

Wiki policy usually requires the first appearance of a given term to be linked, so that the reader encounters the link at the same time as they first encounter the term. The mwaddlink algorithm, however, is based on exact (case-insensitive) full-word matching, so it misses the first appearance when that word is inflected (e.g. a plural or a declined form).

E.g. in the article text `Dogs are four-legged animals. (...snip...) The dog was the first species to be domesticated.` the correct way to add a link would be `[[Dog|Dogs]] are four-legged animals. (...snip...) The dog was the first species to be domesticated.` but mwaddlink will recommend `Dogs are four-legged animals. (...snip...) The [[dog]] was the first species to be domesticated.` instead.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think there are three ways to handle this (short of doing some sort of complicated word stemming NLP thing):

  1. Status quo: just link the first exact full-word match. `Dogs are four-legged animals. (...snip...) The [[dog]] was the first species to be domesticated.` This goes against wiki policy and information usability, but is unlikely to be seen by reviewers as a serious error.
  2. Link the first word which is a prefix match. `[[Dog|Dogs]] are four-legged animals. (...snip...) The dog was the first species to be domesticated.` (This is close to the behavior before T283985: Add Link: word substrings are turned into links, except that back then we linked only the prefix and not the full word, which caused annoying nowiki tags.) Some of the time the link will be wrong (e.g. it might link Doge to dog). On the one hand, it should be easy for users to recognize and reject this kind of mistake. On the other hand, if they fail to do so, reviewers will see it as a serious error (high chance of reverts etc).
  3. Whenever the first prefix match is not a full match (i.e. the first word of the article that starts with "dog..." is not exactly "dog"), discard that recommendation entirely and never include it in tasks. This is the most correct option, but it will reduce the number of recommendations (by how much depends on the language).
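
To make the three options easier to compare, here is a minimal Python sketch. It is not the mwaddlink implementation: the tokenizer (`\w+`), the function names, and the `strategy` values are all assumptions made for illustration, and the real code works on n-grams from an anchor dictionary rather than a single anchor.

```python
import re

WORD_RE = re.compile(r"\w+")  # assumed tokenizer, not mwaddlink's


def first_match(text, anchor, prefix=False):
    """Return the first word equal to (or, if prefix=True, starting with) anchor."""
    target = anchor.casefold()
    for m in WORD_RE.finditer(text):
        word = m.group().casefold()
        if word == target or (prefix and word.startswith(target)):
            return m
    return None


def recommend(text, anchor, title, strategy):
    """Hypothetical helper: return text with one link added, or None.

    strategy is "exact" (option 1), "prefix" (option 2) or "discard" (option 3).
    """
    is_exact = lambda m: m.group().casefold() == anchor.casefold()
    if strategy == "exact":
        m = first_match(text, anchor)
    else:
        m = first_match(text, anchor, prefix=True)
        # Option 3: give up when the first prefix match is not a full match.
        if strategy == "discard" and m is not None and not is_exact(m):
            return None
    if m is None:
        return None
    word = m.group()
    link = f"[[{word}]]" if is_exact(m) else f"[[{title}|{word}]]"
    return text[: m.start()] + link + text[m.end():]


text = "Dogs are four-legged animals. The dog was the first species to be domesticated."
print(recommend(text, "dog", "Dog", "exact"))
print(recommend(text, "dog", "Dog", "prefix"))
print(recommend(text, "dog", "Dog", "discard"))
```

On the example text, "exact" links the second-sentence `dog`, "prefix" produces `[[Dog|Dogs]]`, and "discard" returns `None`. The prefix strategy would also turn `Doge` into `[[Dog|Doge]]`, which is the false positive described in option 2.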
kostajh triaged this task as Medium priority. Feb 16 2022, 12:49 PM

The proper way to deal with this issue would be to use a stemmer when parsing the text (both when creating the anchor dictionary and when identifying candidates for new articles), so that "dogs" is mapped to the stem "dog". This would increase our chances of catching the first occurrence of the anchor. Including a stemmer (such as nltk's snowball stemmer) would not be much of a problem in itself. In fact, we would just add one line when parsing the text to get the individual ngrams (here) and here. The problem is that a stemmer is very language-specific. The snowball stemmer supports only Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish. We could include stemming as an optional step when the wiki's language is one of these, and potentially add one-off solutions for other languages later. However, I am afraid this will be a can of worms.
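
A rough sketch of what "stemming as an optional step" could look like. Everything here is hypothetical: `STEMMERS`, `normalize` and `first_stem_match` are made-up names, and `toy_english_stem` is a deliberately crude stand-in that only strips a plural "s". It is not the Snowball algorithm. In a real setup one would presumably register something like nltk's `SnowballStemmer("english").stem` for the supported languages instead.

```python
import re
from typing import Callable, Dict, Optional


def toy_english_stem(word: str) -> str:
    """Crude stand-in for a real stemmer: strips one plural 's'. NOT Snowball."""
    w = word.casefold()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


# Hypothetical registry: language code -> stemming function.
# Languages without an entry fall back to plain case-insensitive matching.
STEMMERS: Dict[str, Callable[[str], str]] = {"en": toy_english_stem}


def normalize(word: str, lang: str) -> str:
    stemmer = STEMMERS.get(lang)
    return stemmer(word) if stemmer else word.casefold()


def first_stem_match(text: str, anchor: str, lang: str) -> Optional[str]:
    """Return the first word whose normalized form equals the normalized anchor."""
    target = normalize(anchor, lang)
    for m in re.finditer(r"\w+", text):
        if normalize(m.group(), lang) == target:
            return m.group()
    return None


text = "Dogs are four-legged animals. The dog was the first species to be domesticated."
print(first_stem_match(text, "dog", "en"))  # stemming available
print(first_stem_match(text, "dog", "xx"))  # no stemmer registered
```

With the toy stemmer registered for "en", the first match is `Dogs`; for a language without a stemmer the function falls back to today's behavior and returns `dog`. The fallback is the part that makes the step optional, and the per-language registry is where the can of worms would live.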

As a pragmatic solution, I would vote for option 3 (if we don't want to keep the status quo):

  3. Whenever the first prefix match is not a full match (i.e. the first word of the article that starts with "dog..." is not exactly "dog"), discard that recommendation entirely and never include it in tasks. This is the most correct option, but it will reduce the number of recommendations (by how much depends on the language).

This assumes we can make sure that we can still generate enough recommendations.

@Tgr -- given that you filed this task, do you have any sense of how often this issue occurs in practice? Did you hear about it from communities or see it frequently?

I don't have evidence that this is a serious problem - I filed it because it occurred to me during the T283985: Add Link: word substrings are turned into links work that our fix replaces one kind of incorrect behavior with another (probably less problematic) one. We could go through the current recommendations and calculate how often this is happening, but I have no idea off-hand.
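
For what it's worth, the measurement could be as simple as the sketch below. It assumes the recommendations are available as `(article_text, anchor)` pairs, which is a made-up input format, and it uses a plain `\w+` tokenizer rather than whatever mwaddlink does internally. A recommendation counts as affected when an earlier word starts with the anchor but is not an exact match.

```python
import re


def is_affected(text, anchor):
    """True if a prefix-only match precedes the first exact match.

    Returns None when the text has no exact match at all, i.e. when there
    would be no recommendation for this anchor in the first place.
    """
    target = anchor.casefold()
    saw_prefix_only = False
    for m in re.finditer(r"\w+", text):
        word = m.group().casefold()
        if word == target:
            return saw_prefix_only
        if word.startswith(target):
            saw_prefix_only = True
    return None


def affected_share(recommendations):
    """Fraction of recommendations whose link lands on a non-first instance."""
    results = [is_affected(text, anchor) for text, anchor in recommendations]
    counted = [r for r in results if r is not None]
    return sum(counted) / len(counted) if counted else 0.0


sample = [
    ("Dogs are four-legged animals. The dog was domesticated early.", "dog"),
    ("The cat sat on the mat. Cats purr.", "cat"),
    ("Doge is a meme.", "dog"),  # no exact match, so not counted
]
print(affected_share(sample))
```

Note that this overcounts in the same way option 2 overlinks: an earlier `Doge` would be counted as an earlier instance of `dog`, so the result is an upper bound rather than an exact figure.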

Thanks, @Tgr. I am declining this task because we haven't heard about it from communities, and because I feel like the various solutions may be worse than the original problem (they may cause other kinds of errors, confusion, and edge cases depending on the language). Please let me know if you disagree.