Improving language-dependent models for add-a-link
Closed, Resolved, Public

Description

The ML Team attempted to train the link recommendation model for all wikis (301 languages were selected); see T336927. The results of the backtesting evaluation for all wikis (precision and recall at the default linking threshold of 0.5) can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task/Results_round-1
For 278 languages the model passed the backtesting evaluation. The criteria, at the default threshold of 0.5, are a precision of around 75% or more and a recall of at least 20%, so that there are still enough links to generate. The Growth Team is deploying the models for those languages one after the other (T304110).
For the remaining 23 languages the model was not published (and thus cannot be deployed), either because the training pipeline did not complete or because the performance was too low to pass the backtesting (T309263).
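
The pass criteria described above can be sketched as a small check. This is only an illustration, not the evaluation code used by the ML Team: the names `BacktestResult` and `passes_backtesting` are hypothetical, and "around 75%" is treated here as a hard cut-off of 0.75, which is an assumption.

```python
from dataclasses import dataclass

# Assumed cut-offs, read off the task description: precision of roughly 75%
# or more and recall of at least 20%, both at the default link threshold 0.5.
MIN_PRECISION = 0.75
MIN_RECALL = 0.20


@dataclass
class BacktestResult:
    """Hypothetical container for one wiki's backtesting numbers."""
    wiki: str
    precision: float
    recall: float


def passes_backtesting(result: BacktestResult,
                       min_precision: float = MIN_PRECISION,
                       min_recall: float = MIN_RECALL) -> bool:
    """Return True if both precision and recall meet the assumed cut-offs."""
    return result.precision >= min_precision and result.recall >= min_recall


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    for r in (BacktestResult("examplewiki_a", 0.81, 0.35),
              BacktestResult("examplewiki_b", 0.82, 0.12)):
        status = "pass" if passes_backtesting(r) else "fail"
        print(f"{r.wiki}: precision={r.precision:.2f} recall={r.recall:.2f} -> {status}")
```

A model with high precision but recall under the cut-off still fails, which matches the concern above about having enough links to generate.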

In this task, the aim is to implement specific improvements to the model such that as many languages as possible pass the backtesting. Specifically, we will attempt the following approaches:

  • Implement the mwtokenizer package for better sentence and word tokenization across languages.
  • Fix the UnicodeDecodeError when using wikipedia2vec to generate the embedding features.
  • Fix the regex to improve recall in most non-WS languages.

Results: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task#Results

Event Timeline

Currently this task is blocked by T346798. Once the latter is resolved, work on this task can begin.

Update week 27/11/2023 - 3/12/2023:

  • mwtokenizer issues resolved. MR merged.
  • Read through link recommendation docs
  • Pull code, set up dev env, run code for test wikis.
  • Some errors reported and fixed (T352525)

Update week 4/12/2023 - 10/12/2023:

  • Set up repo clone in gitlab
  • Go through and understand code
  • Make and test initial changes to use word and sentence tokenizers from mwtokenizer
    • Try fixing wikitext parsing with edittypes library's utils.

Update week 11/12/2023 - 17/12/2023:

  • Fix sentence tokenization errors in link-recommendation. Sent MR. Improves bowiki, but no improvement in mywiki. WS languages remain the same.
  • Some analysis into the cause of the issues above.

Update week 19/12/2023 - 24/12/2023:

  • Make changes in mwtokenizer
    • replace ▁ with " " in the tokenizer
    • separate punctuation from tokens.
    • Sent MR
  • In progress:
    • Use the updated mwtokenizer to improve link-recommendation.
    • Refactor and consolidate ngram functions in link-recommendation code.
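
The two tokenizer changes listed for this week (replacing "▁" with a space and separating punctuation from tokens) can be illustrated with a short post-processing sketch. This is not the mwtokenizer implementation: `normalize_tokens` is a hypothetical helper, and the assumptions are that "▁" (U+2581) is the SentencePiece-style whitespace marker and that spaces are kept as their own tokens.

```python
import unicodedata

# "▁" (U+2581) is the marker SentencePiece-style models use for whitespace.
SP_SPACE = "\u2581"


def is_punctuation(ch: str) -> bool:
    """True for any character in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def normalize_tokens(tokens):
    """Replace the whitespace marker with " " and split off punctuation.

    Spaces and punctuation characters become separate tokens, so joining the
    output reproduces the text with plain spaces instead of the marker.
    """
    out = []
    for token in tokens:
        token = token.replace(SP_SPACE, " ")
        buf = ""
        for ch in token:
            if ch == " " or is_punctuation(ch):
                if buf:  # flush the word collected so far
                    out.append(buf)
                    buf = ""
                out.append(ch)
            else:
                buf += ch
        if buf:
            out.append(buf)
    return out


if __name__ == "__main__":
    print(normalize_tokens(["\u2581Hello,", "\u2581world!"]))
```

Because nothing is dropped, `"".join(normalize_tokens(tokens))` equals the original token string with each marker replaced by a space, which keeps character offsets usable for link placement.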

Update week 1/1/2024 - 7/1/2024:

  • mwtokenizer MR merged, new version released
  • link-recommendation MR updated and refactored to integrate new mwtokenizer
  • Ran non-WS languages and some previously failed languages. There were some improvements. More debugging required.
  • MR merged

Update week 8/1/2024 - 14/1/2024:

  • Test and fix jawiki error by adding required dependencies.
  • Attempt to fix Unicode errors in zhwiki and fywiki (using different version of Wikipedia2Vec)

Update week 15/1/2024 - 21/1/2024:

  • MR sent to fix unicode errors. Multiple languages tested.
  • Tested all previously failed languages. wikipedia2vec==2.0.0 introduces a new IndexError that occurs in several languages.
    • Reverted to 2 venvs. This time conda has w2v==2.0.0 for jawiki and fywiki. venv has w2v==1.0.5 for rest of the languages.
    • Sent MR4
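
The per-wiki split between the two environments described above could be recorded as a simple lookup. This is only a sketch of the idea, not the project's configuration: the names `W2V_VERSION_OVERRIDES` and `wikipedia2vec_requirement` are hypothetical, and the versions are the ones mentioned in this update.

```python
# Versions taken from the update above; the data structure is an assumption.
DEFAULT_W2V_VERSION = "1.0.5"
W2V_VERSION_OVERRIDES = {
    "jawiki": "2.0.0",
    "fywiki": "2.0.0",
}


def wikipedia2vec_requirement(wiki_id: str) -> str:
    """Return the pip-style requirement string to use for a given wiki."""
    version = W2V_VERSION_OVERRIDES.get(wiki_id, DEFAULT_W2V_VERSION)
    return f"wikipedia2vec=={version}"


if __name__ == "__main__":
    for wiki in ("jawiki", "fywiki", "bowiki"):
        print(wiki, "->", wikipedia2vec_requirement(wiki))
```

Keeping the exceptions in one place makes it explicit which wikis need the newer version and leaves every other wiki on the default.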

Update week 22/1/2024 - 28/1/2024:

  • Fixed the regex that was causing many of the models to have low recall
  • MR sent
AKhatun_WMF updated the task description.