The ML Team attempted to train the link recommendation model for all wikis, selecting 301 different languages (T336927). The results of the backtesting evaluation for all wikis (precision and recall at the default linking threshold of 0.5) can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task/Results_round-1
For 278 languages, the model passed the backtesting evaluation. The criteria at the default threshold of 0.5 are that precision should be around 75% or more and that recall should not drop below 20%, so that there are still enough links to generate. The Growth Team is deploying the models for those languages one after the other (T304110).
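As a rough illustration of the pass criteria described above, the check can be sketched in Python as follows. This is a minimal sketch, not the evaluation code used by the ML Team: the function names, the `BacktestResult` type, and the example wikis and numbers are all hypothetical, and the hard cutoffs of 0.75 and 0.20 are an assumption, since the precision criterion is stated only as "around 75%".

```python
from dataclasses import dataclass


@dataclass
class BacktestResult:
    """Backtesting metrics for one wiki at the default link threshold of 0.5."""
    wiki: str
    precision: float
    recall: float


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def passes_backtesting(
    result: BacktestResult,
    min_precision: float = 0.75,  # assumed hard cutoff for "around 75% (or more)"
    min_recall: float = 0.20,     # recall should not drop below 20%
) -> bool:
    """Return True if the model for this wiki meets both criteria."""
    return result.precision >= min_precision and result.recall >= min_recall


# Hypothetical numbers, for illustration only (not real evaluation results).
examples = [
    BacktestResult("wiki_a", precision=0.80, recall=0.45),  # meets both criteria
    BacktestResult("wiki_b", precision=0.82, recall=0.12),  # recall too low
    BacktestResult("wiki_c", precision=0.55, recall=0.40),  # precision too low
]

passed = [r.wiki for r in examples if passes_backtesting(r)]
failed = [r.wiki for r in examples if not passes_backtesting(r)]
```

The recall floor matters because a model that is precise but rarely suggests anything would leave too few links to generate for the task.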
For 23 languages, the model was not published and thus cannot be deployed (T309263). Either the training pipeline did not complete, or the performance was too low to pass the backtesting.
In this task, the aim is to implement specific improvements to the model so that as many languages as possible pass the backtesting. Specifically, we will attempt the following approaches:
[x] implement the mwtokenizer package for better sentence and word tokenization across languages.
[x] fix the UnicodeDecodeError when using wikipedia2vec to generate the embedding features.