
Exploratory work on language-agnostic model for link recommendation for add-a-link
Closed, Resolved · Public

Description

One of the main limitations of the current model architecture is that we need to train a separate model for each language. This makes deploying the model for all languages challenging, because we would need to train and run 300 or more different models.
To simplify the maintenance work, we would ideally like to develop a single language-agnostic model. We take the language-agnostic revert-risk model as inspiration: there, such an approach has been implemented and deployed with success. The main goal is to develop a single model that supports all (or as many as possible) languages in order to decrease the maintenance cost.
In this task, we will explore different approaches to test the feasibility of developing such a model while ensuring the accuracy of the recommendations.

Event Timeline

Update 29/01/2024 - 04/02/2024:

  • Understanding the feature generation and model training components of the link-recommendation model
  • Tested the language performance of several models by changing the model's language; tested multilingual models two languages at a time (sketch below).
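
One way to read "changing the model's language" is evaluating a classifier trained on one wiki against another wiki's data. A minimal sketch of that kind of cross-wiki check, assuming per-wiki feature tables and previously trained single-language classifiers on disk; the file layout, feature names, and helper functions are hypothetical, not the actual pipeline code:

```
# Hypothetical sketch: evaluate a model trained on one wiki against the
# feature table of another wiki to gauge cross-language transfer.
import pandas as pd
import xgboost as xgb
from sklearn.metrics import precision_score, recall_score

# Assumed feature set; the real pipeline's columns may differ.
FEATURE_COLS = ["ngram", "frequency", "ambiguity", "kurtosis", "w2v_similarity"]

def load_features(wiki_db: str) -> pd.DataFrame:
    # Assumed layout: one parquet file of link-candidate features per wiki.
    return pd.read_parquet(f"features/{wiki_db}.parquet")

def cross_wiki_eval(model_wiki: str, eval_wiki: str) -> dict:
    model = xgb.XGBClassifier()
    model.load_model(f"models/{model_wiki}.json")  # assumed model path
    df = load_features(eval_wiki)
    preds = model.predict(df[FEATURE_COLS])
    return {
        "model": model_wiki,
        "eval": eval_wiki,
        "precision": precision_score(df["label"], preds),
        "recall": recall_score(df["label"], preds),
    }

if __name__ == "__main__":
    # e.g. how well does a model trained on simplewiki do on cswiki data?
    print(cross_wiki_eval("simplewiki", "cswiki"))
```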

Update 05/02/2024 - 11/02/2024:

  • Tested replacing wikipedia2vec with the outlink-model embedding. Performance remains very close to the wikipedia2vec version.
  • Combined 2 languages and used outlink embeddings
  • Tested normalization of features for single-language models as well as the 2-language setting.

Conclusion: Outlink embeddings perform as well as Wikipedia2vec embeddings, so we can easily replace the w2v embeddings with our in-house embeddings. The proof of concept for combining features across multiple languages into a single model (or a few models) passed, so we can safely start combining multiple languages. A sketch of both changes is below.
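
To make the two changes above concrete, here is a minimal sketch of swapping the wikipedia2vec similarity feature for a cosine similarity over outlink embeddings, and of min-max normalizing the numeric features; the embedding store, column names, and normalization choice are assumptions rather than the actual pipeline code:

```
# Hypothetical sketch: replace the wikipedia2vec similarity feature with a
# cosine similarity computed from in-house outlink embeddings, then
# normalize the numeric features so different wikis have comparable ranges.
import numpy as np
import pandas as pd

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def add_outlink_similarity(df: pd.DataFrame, emb: dict) -> pd.DataFrame:
    # emb maps a page title (or QID) to its outlink embedding vector.
    zero = np.zeros(len(next(iter(emb.values()))))
    df = df.copy()
    df["outlink_similarity"] = [
        cosine(emb.get(src, zero), emb.get(tgt, zero))
        for src, tgt in zip(df["source_page"], df["target_page"])
    ]
    return df

def normalize_features(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    # Min-max normalization per feature; one of several reasonable choices.
    df = df.copy()
    for c in cols:
        lo, hi = df[c].min(), df[c].max()
        df[c] = (df[c] - lo) / (hi - lo) if hi > lo else 0.0
    return df
```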

Update 12/2/2024 - 18/2/2024:

  • Planned how to create a language-agnostic test model: consider the first round of wikis tested (T343374). The sets of wikis after the 4th round appear to be alphabetical, with some adjustments possibly based on overall size.
  • Created a branch for the language-agnostic model and created a new baseline using the outlink embeddings and normalized features.

Update 19/2/2024 - 25/2/2024:

  • Replaced w2v with outlink embeddings, created a baseline, ran the 11 test wikis, sent an MR
  • Trained and evaluated a combined model with all test-language wikis
    • with and without the wiki_db feature (sketch below)
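
A sketch of what the combined training with an optional wiki_db feature might look like; the file layout, feature columns, and hyperparameters are illustrative assumptions:

```
# Hypothetical sketch: concatenate per-wiki feature tables, optionally keep
# wiki_db as an integer-encoded categorical feature, and train one model.
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

def build_training_table(wikis: list) -> pd.DataFrame:
    frames = []
    for wiki in wikis:
        df = pd.read_parquet(f"features/{wiki}.parquet")  # assumed layout
        df["wiki_db"] = wiki
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

def train_combined(df: pd.DataFrame, feature_cols: list, use_wiki_db: bool):
    X = df[feature_cols].copy()
    if use_wiki_db:
        # Encode the wiki identifier so the model can learn per-wiki offsets.
        X["wiki_db"] = LabelEncoder().fit_transform(df["wiki_db"])
    model = xgb.XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
    model.fit(X, df["label"])
    return model
```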

Update 26/2/2024 - 3/3/2024:

  • Trained a combined model on all the data with a stratified split (sketch below)
  • Hyperparameter tuning in progress
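
A sketch of the stratified split and the hyperparameter search, assuming an XGBoost classifier and scikit-learn utilities; the parameter grid and scoring choice are illustrative assumptions, not the settings actually used:

```
# Hypothetical sketch: hold out a test set stratified by wiki so every wiki
# is represented in both splits, then grid-search XGBoost hyperparameters.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

def tune(X, y, wiki_db):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=wiki_db
    )
    grid = GridSearchCV(
        estimator=xgb.XGBClassifier(eval_metric="logloss"),
        param_grid={
            "max_depth": [4, 6, 8],
            "n_estimators": [200, 400],
            "learning_rate": [0.05, 0.1],
        },
        scoring="average_precision",
        cv=3,
        n_jobs=-1,
    )
    grid.fit(X_train, y_train)
    print("best params:", grid.best_params_)
    print("held-out accuracy:", grid.best_estimator_.score(X_test, y_test))
    return grid.best_estimator_
```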

Update 4/3/2024 - 10/3/2024:

  • Finished the grid search and stored the best-fit model
  • Added mwtokenizer in one more place, fixed the code to accommodate the wiki_db feature, fixed the label encoder, pushed a draft MR

Update 11/3/2024 - 17/3/2024:

  • Scaled the model to 50 languages: ran the pipeline for 50 languages and trained a model with at most 100k samples per language. Used fallback chains to select the languages at the center of the chains.
  • Tested on a different set of 50 languages (randomly chosen) in a zero-shot manner and compared performance (sketch below).
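
A sketch of the per-language sample cap and the zero-shot evaluation on wikis that contributed no training data; the helper names and file paths are hypothetical:

```
# Hypothetical sketch: cap the training data at N samples per wiki so large
# wikis do not dominate, then evaluate "zero-shot" on unseen wikis.
import pandas as pd
import xgboost as xgb
from sklearn.metrics import precision_score

def cap_per_wiki(df: pd.DataFrame, max_samples: int = 100_000) -> pd.DataFrame:
    # Keep at most max_samples rows per wiki.
    return df.groupby("wiki_db", group_keys=False).apply(
        lambda g: g.sample(min(len(g), max_samples), random_state=42)
    )

def zero_shot_eval(model: xgb.XGBClassifier, unseen_wikis: list, feature_cols: list):
    # Evaluate on wikis the model never saw during training.
    for wiki in unseen_wikis:
        df = pd.read_parquet(f"features/{wiki}.parquet")  # assumed layout
        preds = model.predict(df[feature_cols])
        print(wiki, "precision:", precision_score(df["label"], preds))
```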

Update 18/3/2024 - 24/3/2024:

  • Cleaned and uploaded the main and secondary wiki results; added single-language results for comparison.
  • Prepared and ran a script to gather data for all languages (anchor dictionaries, etc.)
  • Cleaned the code to push as an intermediate stage of language-agnostic modeling

Update 25/3/2024 - 31/3/2024:

  • Modified and pushed the code. Collected data for all wikis.
  • Train a model on all wikis
  • Discuss possible pipeline solutions

Update 1/4/2024 - 7/4/2024:

  • Updated the MR according to comments
  • Updated the Meta page for the project
  • Ran an experiment with a 100k sample cap on ALL wikis.
  • Started running another experiment with a 1M sample cap on ALL wikis.
  • Discussed Airflow with Fabian.

The exploratory part of link-recommendation for add-a-link is done.

We started out by combining 2 languages and found that performance does not change significantly. We then scaled the experiment to 11 languages, two sets of ~50 languages, and finally all 300+ languages.

  1. As an experiment, we select a set of 52 central languages from the fallback chains and another set of 44 randomly selected wikis. We train a model on all the languages in each set and evaluate it on each individual language wiki. The performance comparison of the language-agnostic model and the single-language models can be found here: main_v2 and sec_v2. In short: the performance of the language-agnostic model for both sets of languages is comparable to the single-language versions. This shows we can, in principle, select any set of wikis, perform combined training, and expect very good results.
  2. As another experiment, we train a model with all (317) language wikis, with a cap of 100k samples per language. The evaluations can be found here: all_wikis_baseline. As before, some languages show a drop in performance, but most languages perform almost on par with the single-language models.
  3. Same as experiment 2, but with a cap of 1 million samples per language. Evaluation here: all_wikis_baseline_1M. Performance remains extremely close to the 100k-sample experiment, with a slight decrease in precision in 4 languages and a slight increase in 4 other languages.

Here is the Meta link for the updates so far: Exploratory work for language-agnostic model