Support languages whose add-a-link models were not published
Open, Needs Triage, Public

Authored By

	kevinbazira
	May 26 2022, 6:28 AM

Description

In T336927, 18 rounds of add-a-link models were trained, and for the pipelines that succeeded, models were published here:
https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/

Below is a list of wikis whose models were not published:

Wikis	Reason
jawiki, aswiki	T304548#7937512
bowiki	T304549#8060880
dzwiki	T304551#8412493
diqwiki, dvwiki	T304551#8417373
fywiki	T308133#8459395
ganwiki	T308133#8469595
hywwiki	T308134#8548734
krcwiki	T308135#8632750
~~lrcwiki~~*	T308136#8648765
mnwwiki, mywiki	T308137#8690680
piwiki	T308138#8708597
zhwiki	T308139#8720236
wuuwiki, zh_classicalwiki, zh_yuewiki	T308139#8728522
shnwiki	T308141#8778455
snwiki, szywiki	T308142#8804657
tiwiki, urwiki	T308143#8827377

The goal is to improve the link-recommendation algorithm in order to support the languages listed above.
*lrcwiki was closed (T330616). Thus, there is no need to train the add-a-link model for that language.

Notes:

In T304548#7937512, jawiki did not pass the backtesting evaluation. The suggested next step is to manually inspect the model, either with users who have experience with this language or with the help of Google Translate, while the link-recommendation algorithm is iteratively improved until the model passes the backtesting evaluation.

In T304548#7952043, @MGerlach wrote:

Some thoughts on potential starting points for manually inspecting the model. To figure out where the text processing is failing for these languages, we could start by checking the following steps:

Generation of candidate anchors (code). This uses simple tokenization to split the text into substrings, which then serve as potential anchor words that can be linked. If we can't identify suitable words, e.g. due to the absence of whitespace, we won't be able to generate good link recommendations.

Generation of link candidates (code). For each candidate anchor, we look up link candidates in the anchor dictionary. The anchor dictionary contains all the already existing links (anchor + title of the linked page) and is created in this script. We should make sure the anchor dictionary is populated with a sufficient number of links; otherwise we won't be able to generate link candidates for any anchor.

Disambiguation of link candidates (code). Once we have one or more link candidates for a candidate anchor, the model selects the most probable link from the candidates. We can inspect the probabilities assigned to each link candidate to understand where a potential error might come from.
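The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the mwaddlink implementation: the toy link data, the function names (`build_anchor_dictionary`, `candidate_anchors`, `recommend_links`) and the use of raw link frequencies as "probabilities" are all hypothetical stand-ins. The real pipeline uses a trained model for the disambiguation step.

```python
from collections import Counter

# Hypothetical toy data: existing links as (anchor text, linked page title).
EXISTING_LINKS = [
    ("paris", "Paris"),
    ("paris", "Paris"),
    ("paris", "Paris, Texas"),
    ("eiffel tower", "Eiffel Tower"),
]

def build_anchor_dictionary(links):
    """Map each anchor to a Counter of the page titles it links to."""
    anchors = {}
    for anchor, title in links:
        anchors.setdefault(anchor, Counter())[title] += 1
    return anchors

def candidate_anchors(text, max_words=3):
    """Step 1: word n-grams from whitespace tokenization, longest first."""
    words = text.lower().split()
    for n in range(max_words, 0, -1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def recommend_links(text, anchors):
    """Steps 2 and 3: look up link candidates, keep the most frequent title.

    Relative frequency is an assumed stand-in for the model's probability.
    """
    recommendations = {}
    for anchor in candidate_anchors(text):
        counts = anchors.get(anchor)
        if not counts:
            continue  # no link candidates exist for this anchor
        total = sum(counts.values())
        title, count = counts.most_common(1)[0]
        recommendations[anchor] = (title, count / total)
    return recommendations

anchors = build_anchor_dictionary(EXISTING_LINKS)
print(recommend_links("The Eiffel Tower is in Paris", anchors))
```

Inspecting the intermediate values (the output of `candidate_anchors`, the size of the dictionary, and the per-candidate scores) mirrors the three checks suggested above: a text without whitespace yields a single oversized candidate in step 1, so steps 2 and 3 never find a match.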

Models to be inspected and their datasets can be found on the stat1008 machine:

WIKI_ID=jawiki
cd /home/kevinbazira/mwaddlink/data/$WIKI_ID

Related Objects

Status	Assigned	Task
Open	lbowmaker	T307881 Scaling of link suggestions service
In Progress	Trizek-WMF	T304110 [EPIC] Deploy "add a link" to all Wikipedias
Open	calbon	T309263 Support languages whose add-a-link models were not published
Resolved	kevinbazira	T344319 Remove models with poor evaluation metrics from the published datasets repo
Resolved	kevinbazira	T344799 Automate unpublishing of add-a-link datasets
Resolved	kevinbazira	T344832 Investigate why the add-a-link training pipeline concludes with missing datasets
Resolved	AKhatun_WMF	T347696 Improving language-dependent models for add-a-link

Event Timeline

kevinbazira created this task. May 26 2022, 6:28 AM

kevinbazira mentioned this in T304548: Deploy "add a link" to 4th round of wikis.

Aklapper renamed this task from Inpect jawiki and aswiki "add a link" models to improve their performance to Inspect jawiki and aswiki "add a link" models to improve their performance. May 26 2022, 6:47 AM

In T304548#7956082, @MGerlach wrote:

In T304548#7953084, @Trizek-WMF wrote:

I'm keeping the two wikis in my freezer.

@MGerlach, when could we start checking on the models?

Anytime. The models should be available for inspection locally on one of the stat-machines. @kevinbazira ran the training pipeline for these models and could probably share where the files for the models are located.
Maybe it would be worth opening a separate task so this work can be discussed in more detail? I think this could involve a substantial amount of effort to find the error and try to fix it, since we don't yet know exactly what we are looking for.

Thank you for opening the new task.
How should we proceed now (consider that I have no idea of what's going on)?

Hi @MGerlach thanks for the potential starting points on how to manually inspect the model.
Which person (or team) should take on this task?

Miriam subscribed. May 30 2022, 3:43 PM

Hi @KStoller-WMF thanks for this!

I have a question here.

This type of manual inspection can be time-consuming, and the ownership of model improvement tasks such as this one is under discussion. For this specific case, I wanted to ask: is this high priority on your end? Would you prefer to have these model issues fixed, or is it a possibility to leave the jawiki model (and others with same issues) undeployed for now, and focus our efforts on deploying to other wikis where the current pipeline works?

Thanks!

@Miriam I'll check with others on the team to be sure they agree, but if this will be a large effort, then it likely makes sense to move forward with the wikis where the current pipeline works first. Thanks for the reply!

Thanks @KStoller-WMF !

Comment on jawiki: I believe the problem comes from the tokenization used to generate the n-grams that serve as anchor candidates. Our current approach (here) uses NLTK's standard tokenizer. Among other things, it relies on whitespace to identify word boundaries. However, as I understand it, this does not work for Japanese, which is written (mostly) without whitespace (see here or here). Thus, we need to improve the tokenizer for Japanese (and potentially for other languages with similar properties). A starting point might be this blog post, which gives an overview of different tokenization techniques.
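A small sketch of the failure mode described in this comment. It uses plain `str.split()` as a stand-in for a whitespace-reliant tokenizer (NLTK's tokenizer does more than this, so the comparison is only approximate), and the helper names `whitespace_tokens` and `char_ngrams` are hypothetical. The character n-gram function shows just one conceivable way to produce candidates without whitespace; it is not the approach adopted in the task.

```python
def whitespace_tokens(text):
    """Split on whitespace only; a stand-in for a whitespace-reliant tokenizer."""
    return text.split()

def char_ngrams(text, max_n=3):
    """All character substrings of length 1..max_n, usable without whitespace."""
    return [
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    ]

english = "Tokyo is the capital of Japan"
japanese = "東京は日本の首都です"  # the same sentence, written without spaces

print(whitespace_tokens(english))       # six separate words
print(whitespace_tokens(japanese))      # one token: the whole sentence
print("東京" in char_ngrams(japanese))  # the word for Tokyo appears as a 2-gram
```

With whitespace splitting, the only anchor candidate for the Japanese sentence is the sentence itself, which is unlikely to match anything in the anchor dictionary. Character n-grams recover words such as 東京, but at the cost of many meaningless candidates, which is why a language-aware tokenizer is the more promising direction.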

Let's focus on all the wikis at once. It is much more reasonable than doing it as they come.

KStoller-WMF edited projects, added Growth-Team; removed Growth-Team (Sprint 0 (Growth Team)). Jun 16 2022, 9:56 PM

mewoph moved this task from Inbox to Needs Discussion on the Growth-Team board. Jun 17 2022, 7:28 PM

RZamora-WMF edited projects, added MoveComms-Support (Jul-Sep-2022); removed MoveComms-Support (Apr-Jun-2022). Jul 18 2022, 4:47 PM

MShilova_WMF moved this task from Needs Discussion to Triaged on the Growth-Team board. Aug 9 2022, 7:01 PM

Trizek-WMF renamed this task from Inspect jawiki and aswiki "add a link" models to improve their performance to Inspect "add a link" models to improve their performance. Oct 12 2022, 12:02 PM

Trizek-WMF edited projects, added MoveComms-Support (Oct-Dec-2022); removed MoveComms-Support (Jul-Sep-2022).

Trizek-WMF updated the task description.

Trizek-WMF updated the task description. Oct 17 2022, 4:32 PM

calbon moved this task from Active Tasks to Unsorted on the Machine-Learning-Team board. Oct 25 2022, 6:17 PM

calbon edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks).

Trizek-WMF updated the task description. Oct 26 2022, 3:53 PM

Trizek-WMF mentioned this in T304549: Deploy "add a link" to 5th round of wikis. Oct 26 2022, 5:11 PM

calbon moved this task from Unsorted to In Progress on the Machine-Learning-Team board. Nov 15 2022, 3:25 PM

calbon assigned this task to kevinbazira. Nov 22 2022, 3:20 PM

@Miriam We don't think we have the resources to do this. Let's chat.

calbon claimed this task. Nov 22 2022, 3:23 PM

Trizek-WMF updated the task description. Nov 23 2022, 4:19 PM

Trizek-WMF updated the task description.

Trizek-WMF mentioned this in T304551: Deploy "add a link" to 7th round of wikis.

calbon moved this task from In Progress to Backlog/Other on the Machine-Learning-Team board. Nov 29 2022, 3:40 PM

Elitre edited projects, added MoveComms-Support; removed MoveComms-Support (Oct-Dec-2022). Jan 11 2023, 8:58 AM

kevinbazira mentioned this in T336927: Completion report on training 18 rounds of add-a-link models. May 19 2023, 12:36 PM

kevinbazira renamed this task from Inspect "add a link" models to improve their performance to Support languages whose add-a-link models were not published. Jul 4 2023, 9:44 AM

kevinbazira updated the task description.

Restricted Application added a subscriber: Stang. Jul 4 2023, 9:44 AM

kevinbazira updated the task description. Jul 4 2023, 9:45 AM

MGerlach mentioned this in T341851: Scope work for improving multilingual support for link recommendation model for add-a-link task. Jul 14 2023, 9:58 AM

FriedrickMILBarbarossa added a project: Chinese-Sites. Aug 8 2023, 2:54 AM

kevinbazira updated the task description. Aug 16 2023, 6:26 AM

kevinbazira mentioned this in T344319: Remove models with poor evaluation metrics from the published datasets repo. Aug 16 2023, 8:23 AM

MGerlach updated the task description. Aug 18 2023, 8:44 AM

Winston_Sung moved this task from Backlog to Research on the Chinese-Sites board. Aug 22 2023, 11:56 AM

kevinbazira closed subtask T344319: Remove models with poor evaluation metrics from the published datasets repo as Resolved. Aug 31 2023, 3:18 PM

MGerlach mentioned this in T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model. Sep 19 2023, 3:58 PM

MGerlach mentioned this in T347696: Improving language-dependent models for add-a-link. Sep 29 2023, 1:08 PM