
Add a link: algorithm improvements
Open, Needs Triage, Public

Description

In T245330, we tested the link recommendation algorithm and decided that it has enough potential to continue with. This task covers the issues discovered during that testing and the improvements we could make. For any of the following issues, we may decide not to attempt a fix if it would cause problems with precision, recall, or scalability:

  • Not linking in the middle of a multi-word phrase, e.g. only linking “London” in “London Underground” (examples)
  • Linking to the correct disambiguation of a word, e.g. from an article about biology, linking to “Cell (biology)” instead of to “Cell (geometry)”.
  • Not suggesting a link if that word or phrase has already been linked in the article.
  • Not linking inside the titles of external links (example article)
  • Not linking inside the names of organizations, e.g. “Access & Publishing Group”. (example article)
  • Not linking common names of people (example article)
  • Not linking to common phrases, e.g. “Yet another” (example article)
  • Not linking to articles about dates, like the article for 1979.
  • @Dyolf77_WMF points out that in the Arabic language, the word "the" is a prefix attached to the beginning of the noun, as in the word الدعاء, which means "dua", a type of prayer. It should link to دعاء, which is the singular form, without the prefix. We still want words like this to link to the correct article. Does our current algorithm account for these sorts of prefixes? And for plural forms?
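
A few of the issues above (linking inside a multi-word phrase, repeating a link, linking to date articles, and the Arabic prefix) can be illustrated with a minimal sketch. This is not the actual algorithm from T245330. The `ANCHORS` dictionary, the `lookup` and `suggest_links` functions, and the prefix-stripping fallback are all hypothetical names and rules introduced only for illustration.

```python
import re

# Hypothetical anchor dictionary: lowercased surface phrase -> article title.
ANCHORS = {
    "london": "London",
    "london underground": "London Underground",
    "1979": "1979",
    "دعاء": "دعاء",
}

# Assumed rule: a title made only of digits is treated as a date article.
YEAR_TITLE = re.compile(r"^\d{1,4}$")
ARABIC_DEFINITE_ARTICLE = "ال"


def lookup(phrase, lang):
    """Return the article title for a phrase, or None if there is none."""
    key = phrase.lower()
    if key in ANCHORS:
        return ANCHORS[key]
    # Assumed fallback: retry Arabic words without the definite-article prefix.
    if lang == "ar" and key.startswith(ARABIC_DEFINITE_ARTICLE):
        return ANCHORS.get(key[len(ARABIC_DEFINITE_ARTICLE):])
    return None


def suggest_links(tokens, already_linked, lang="en", max_len=3):
    """Suggest (phrase, title) pairs, preferring the longest matching phrase."""
    linked = set(already_linked)
    suggestions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest phrase first so "London Underground" wins over "London".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            title = lookup(phrase, lang)
            if title is not None:
                match = (phrase, title, n)
                break
        if match is None:
            i += 1
            continue
        phrase, title, n = match
        # Skip titles that are already linked and titles that look like dates.
        if title not in linked and not YEAR_TITLE.match(title):
            suggestions.append((phrase, title))
            linked.add(title)
        i += n  # never start a new match inside the matched phrase
    return suggestions
```

For example, `suggest_links(["الدعاء"], set(), lang="ar")` would suggest a link to دعاء. Blind prefix stripping would also affect words that merely begin with the same letters, which is why the sketch only applies it when the exact form has no match; whether that trade-off is acceptable for precision is an open question.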

In terms of wikis to prioritize, we are planning on working in these Wikipedias first:

  • kowiki (Korean)
  • cswiki (Czech)
  • arwiki (Arabic)
  • viwiki (Vietnamese)
  • frwiki (French)
  • ukwiki (Ukrainian)
  • srwiki (Serbian)
  • hywiki (Armenian)
  • huwiki (Hungarian)
  • euwiki (Basque)
  • plwiki (Polish)
  • fawiki (Persian)
  • itwiki (Italian)
  • ptwiki (Portuguese)
  • hewiki (Hebrew)
  • svwiki (Swedish)
  • dawiki (Danish)

Event Timeline

MMiller_WMF edited projects, added Growth-Team (Current Sprint); removed Growth-Team.

@DED -- this task is ready for you to work on. I think for each of the bullet points in the task description, it would be good if you leave comments covering:

  • Whether you think it is wise to make the change, and what it might mean for precision and recall. Perhaps a fix would cause more problems than it solves.
  • Whether it is a fix that would need to be language-specific or could scale with the algorithm across languages.
  • Whether you ended up making changes based on the issue.

If you have questions about the issues, or need clearer examples, please let me know. As of now, I've linked to articles on Test Wiki that contain examples of the issue, which you can find by looking at the links marked with the "X", but I can point them out more specifically if you need.

How does this sound?

@DED -- as we continue to work on the algorithm, a community member from Ukrainian Wikipedia (@NickK) requested that we try it in his language and see how it performs. Maybe that can be the next language we try.

MMiller_WMF updated the task description. Jun 30 2020, 3:29 PM
MMiller_WMF updated the task description. Jul 6 2020, 8:58 PM
Restricted Application added subscribers: Petar.petkovic, Base. Jul 6 2020, 8:58 PM

Update: the work has not started here yet, but it can be picked up at any point. The general approach will be to go through lists like these false positives and introduce blacklists or hardcoded rules to deal with them language by language.
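
A minimal sketch of what such a per-language list could look like, assuming suggestions arrive as (phrase, title) pairs. The `BLOCKLISTS` table and the `filter_suggestions` function are hypothetical names for illustration, not part of the existing code, and the only entry is the "Yet another" example from the task description.

```python
# Hypothetical per-language lists of phrases that should never be linked.
BLOCKLISTS = {
    "en": {"yet another"},
}


def filter_suggestions(suggestions, lang):
    """Drop suggestions whose phrase is on the list for the given language."""
    blocked = BLOCKLISTS.get(lang, set())
    return [
        (phrase, title)
        for phrase, title in suggestions
        if phrase.casefold() not in blocked
    ]
```

Keeping the lists keyed by language means a wiki with no entries passes through unchanged, so new languages can be added without touching the filter itself.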

This is not going to be a high priority in the next month or so because we are transitioning and productizing more fundamental aspects of this algorithm. We can plan to address bugs and issues like these at any point. Perhaps a good point to address them would be when we are settling on a "productionized version" of the algorithm.

Urbanecm edited subscribers, added: Urbanecm_WMF; removed: Urbanecm. Aug 26 2020, 2:08 PM

Update: we have not started to systematically address the issues mentioned above because we are still fixing and improving some more fundamental issues of the algorithm.

Restricted Application added a subscriber: Huji. Sep 9 2020, 6:21 PM

@MGerlach is anything else happening in this task or shall I close it?

No ongoing work.
Some of the issues above should already have been addressed by previous improvements:

  • Not suggesting a link if that word or phrase has already been linked in the article.
  • Not linking inside the titles of external links
  • Not linking to common phrases