
Support languages whose add-a-link models were not published
Open, Needs Triage, Public


In T336927, 18 rounds of add-a-link models were trained; for the pipelines that succeeded, models were published here:

Below is a list of wikis whose models were not published:

jawiki, aswiki: T304548#7937512
diqwiki, dvwiki: T304551#8417373
mnwwiki, mywiki: T308137#8690680
wuuwiki, zh_classicalwiki, zh_yuewiki: T308139#8728522
snwiki, szywiki: T308142#8804657
tiwiki, urwiki: T308143#8827377

The goal is to improve the link-recommendation algorithm in order to support the languages listed above.
*lrcwiki was closed (T330616), so there is no need to train an add-a-link model for that language.


In T304548#7937512, jawiki did not pass the backtesting evaluation. The suggested next step is to manually inspect the model, either with users who have experience with the language or with the help of Google Translate, as the link-recommendation algorithm is iteratively improved until the model passes the backtesting evaluation.

Some thoughts on potential starting points for manually inspecting the model. To figure out where the text processing is failing for these languages, we could start by checking the following steps:

  • Generation of candidate anchors (code). This uses simple tokenization to split the text into substrings, which then serve as potential anchor words that can be linked. If we can't identify suitable words, e.g. due to the absence of whitespace, we won't be able to generate good link recommendations.
  • Generation of link candidates (code). For each candidate anchor, we look up link candidates in the anchor dictionary. The anchor dictionary contains all the already existing links (anchor + title of the linked page) and is created in this script. We should make sure the anchor dictionary is populated with a sufficient number of links; otherwise we won't be able to generate link candidates for any anchor.
  • Disambiguation of link candidates (code). Once we have one or more link candidates for a candidate anchor, the model selects the most probable link from the candidates. We can inspect the probabilities assigned to each link candidate to understand where a potential error might come from.
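The three steps above can be sketched roughly as follows. This is an illustrative toy version with hypothetical names and data, not the actual mwaddlink code; in particular, the real disambiguation step uses a trained model rather than raw link counts, and the real anchor dictionary is built from existing wiki links.

```python
import re

# Step 1: generate candidate anchors by simple whitespace/word tokenization
# into n-grams (toy version; the real pipeline uses NLTK tokenization).
def candidate_anchors(text, max_ngram=2):
    tokens = re.findall(r"\w+", text, re.UNICODE)
    anchors = []
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            anchors.append(" ".join(tokens[i:i + n]))
    return anchors

# Step 2: look up link candidates in a (toy) anchor dictionary mapping
# anchor text -> {linked page title: count of existing links}.
ANCHOR_DICT = {
    "new york": {"New York City": 90, "New York (state)": 40},
    "city": {"City": 5},
}

def link_candidates(anchor):
    return ANCHOR_DICT.get(anchor.lower(), {})

# Step 3: disambiguate. Here a stand-in that turns link counts into
# probabilities and picks the most probable target; inspecting these
# probabilities is what the third bullet point suggests.
def best_link(anchor):
    candidates = link_candidates(anchor)
    if not candidates:
        return None
    total = sum(candidates.values())
    probs = {title: count / total for title, count in candidates.items()}
    return max(probs, key=probs.get), probs

anchors = candidate_anchors("She moved to New York last year")
print(best_link("New York"))  # -> ('New York City', {...})
```

If step 1 produces no usable tokens (as can happen for languages written without whitespace), the later steps never get a chance to run, which is why the order of inspection above matters.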

Models to be inspected and their datasets can be found on the stat1008 machine:

cd /home/kevinbazira/mwaddlink/data/$WIKI_ID

Related Objects

Event Timeline

Aklapper renamed this task from Inpect jawiki and aswiki "add a link" models to improve their performance to Inspect jawiki and aswiki "add a link" models to improve their performance. May 26 2022, 6:47 AM

I'm keeping the two wikis in my freezer.

@MGerlach, when could we start checking on the models?

Anytime. The models should be available for inspection locally on one of the stat machines. @kevinbazira ran the training pipeline for these models and could probably share where the model files are located.
Maybe it would be worth opening a separate task for this work so it can be discussed in more detail? I think this could involve a substantial amount of effort to find the error and try to fix it, since we don't yet know what exactly we are looking for.

Thank you for opening the new task.
How should we proceed now (considering that I have no idea of what's going on)?

Hi @MGerlach thanks for the potential starting points on how to manually inspect the model.
Which person (or team) should take on this task?

Hi @KStoller-WMF thanks for this!

I have a question here.

This type of manual inspection can be time-consuming, and the ownership of model improvement tasks such as this one is under discussion. For this specific case, I wanted to ask: is this high priority on your end? Would you prefer to have these model issues fixed, or is it a possibility to leave the jawiki model (and others with the same issues) undeployed for now, and focus our efforts on deploying to other wikis where the current pipeline works?


@Miriam I'll check with others on the team to be sure they agree, but if this will be a large effort, then it likely makes sense to move forward with the wikis where the current pipeline works first. Thanks for the reply!

Comment on jawiki: I believe the problem comes from the tokenization used to generate the n-grams that serve as anchor candidates. Our current approach (here) uses NLTK's standard tokenizer. Among other things, it relies on whitespace to identify word boundaries. However, as I understand it, this does not work in Japanese, which is written (mostly) without whitespace (see here or here). Thus, we need to improve the tokenizer for Japanese (and potentially other languages with similar properties). A starting point might be this blog post, which gives an overview of different tokenization techniques.
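The failure mode can be reproduced with nothing more than the standard library. A whitespace-reliant tokenizer returns the entire Japanese sentence as a single "word", so no usable anchor candidates can be generated (NLTK's tokenizer has the same whitespace reliance for such scripts):

```python
# Same sentence in English and Japanese; the Japanese version has no spaces.
english = "I bought a pen"
japanese = "私はペンを買った"

# Whitespace splitting works for English but collapses Japanese into
# one unsegmentable token, so no candidate anchors can be produced.
print(english.split())   # ['I', 'bought', 'a', 'pen']
print(japanese.split())  # ['私はペンを買った']
```

A language-aware segmenter (e.g. a dictionary-based morphological analyzer) would be needed to recover word boundaries before the n-gram step can produce anchor candidates.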

Let's focus on all the wikis at once. It is much more reasonable than doing it as they come.

Trizek-WMF renamed this task from Inspect jawiki and aswiki "add a link" models to improve their performance to Inspect "add a link" models to improve their performance. Oct 12 2022, 12:02 PM
Trizek-WMF updated the task description. (Show Details)

@Miriam We don't think we have the resources to do this. Let's chat.

kevinbazira renamed this task from Inspect "add a link" models to improve their performance to Support languages whose add-a-link models were not published. Jul 4 2023, 9:44 AM
kevinbazira updated the task description. (Show Details)