
Scope work for improving multilingual support for link recommendation model for add-a-link task
Closed, Resolved · Public

Description

The link recommendation model for the add-a-link task continues to be deployed to more and more wikis. To this end, @kevinbazira has trained the model for 297 wikis; however, only 278 of those passed the backtesting evaluation and were published (T336927). We would like to increase the number of languages for which this model can be deployed.

There are two lines of work towards this goal:

  • Fix the language-specific models so that they pass the backtesting evaluation in additional languages.
  • Develop a single language-agnostic model; this would facilitate easier deployment and maintenance across languages.

The goal of this task is to scope the work needed for these two lines in more detail, in order to assess their effort and feasibility.

Event Timeline

weekly update:

  • started collecting ideas for fixing the models that did not pass the backtesting evaluation
  • went through the report and enumerated the languages and potential reasons for failure (T309263). This suggests two potential improvements that could fix the model for several languages:
    • integrating the mwtokenizer package into the processing pipeline. This will likely fix word-tokenization issues when parsing text in languages that do not use whitespace between words (Japanese, Chinese, etc.), which is crucial for identifying unlinked text that should be linked.
    • fixing a Unicode error in wikipedia2vec. The latter is crucial for generating embeddings of articles, which are used as features for the model.
  • next step: identify the relevant code changes.
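To illustrate why word tokenization breaks for non-whitespace languages (a minimal sketch; this is not the mwtokenizer implementation, which uses language-aware rules): naive whitespace splitting yields word tokens for English but returns the whole sentence as a single token for Japanese, leaving no candidate spans to link.

```python
# Illustrative only: the implicit whitespace assumption of many NLP pipelines.
# mwtokenizer exists precisely because this assumption fails for languages
# such as Japanese and Chinese.

def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace -- works for English, not for non-whitespace scripts."""
    return text.split()

english = "Tokyo is the capital of Japan"
japanese = "東京は日本の首都です"  # roughly the same sentence, no spaces between words

print(whitespace_tokenize(english))   # six word tokens
print(whitespace_tokenize(japanese))  # a single token: no anchor candidates at all
```

With only one "token" for the whole Japanese sentence, the pipeline cannot match anchor-dictionary entries against individual words, which is why a language-aware tokenizer is a prerequisite for these wikis.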

weekly update:

  • no update as I was busy with other projects this week

weekly update

(1) Implement the mwtokenizer package: https://pypi.org/project/mwtokenizer/

(2) Fix the Unicode error in wikipedia2vec

weekly update:

  • @kevinbazira put together a table with the full results for all trained wikis (T343374#9064194). This gives a good baseline for future improvements and a point of comparison with any future language-agnostic model.
  • Started thinking through the challenges of translating the current model into a language-agnostic one. Some components will remain language-dependent (such as the anchor dictionary used to identify words in the text that could be linked); my current thinking is that these will simply be separate look-up tables. For generating predictions, we should aim to train a single model on a balanced training dataset of links from many languages, with the language code as an explicit feature variable. As a next step I will sketch a training pipeline for such a single model in more detail. With the detailed table of baseline results above, we can check whether such a model would still perform well across languages.
  • Put together an on-wiki table with the evaluation results for 301 wikis: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task/Results_round-1. This will help us check whether any changes we implement actually improve the model.
  • Updated the list of wikis for which the model needs improvement to capture all 23 languages for which the model did not pass the backtesting and was thus not published: https://phabricator.wikimedia.org/T309263 (22 models that need improvement).
  • Sketched a pipeline for a language-agnostic model using existing features/datasets from already-trained models. This allows us to test the general feasibility of a single model; if successful, the aim would be to adapt the features and address scaling/infrastructure issues.
  • next step is to sketch a detailed work plan for the individual tasks that we will work on in Q2.
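The split described above — language-dependent look-up tables feeding a single shared model that takes the language code as a feature — can be sketched roughly as follows. All names, data, and the feature set here are illustrative assumptions, not the actual pipeline or model.

```python
# Hypothetical sketch: per-language anchor dictionaries stay separate, while
# one shared scoring model receives the language code as an explicit feature.
# Dictionary contents and feature names are made-up examples.

ANCHOR_DICTS = {
    # language code -> {surface text -> candidate target articles}
    "en": {"solar system": ["Solar System"]},
    "de": {"sonnensystem": ["Sonnensystem"]},
}

def candidate_features(lang: str, anchor: str, target: str) -> dict:
    """Build a feature dict for one link candidate.

    The language code is an ordinary (categorical) feature, so a single
    model can be trained on candidates pooled from many languages.
    """
    return {
        "lang": lang,                  # explicit feature for the shared model
        "anchor_length": len(anchor),  # placeholder for real model features
        "target": target,
    }

def recommend(lang: str, text: str) -> list[dict]:
    """Look up anchors in the language-specific table, featurize each match."""
    lowered = text.lower()
    candidates = []
    for surface, targets in ANCHOR_DICTS.get(lang, {}).items():
        if surface in lowered:
            for target in targets:
                candidates.append(candidate_features(lang, surface, target))
    return candidates

print(recommend("en", "Planets in the Solar System orbit the Sun."))
```

The design point is that only `ANCHOR_DICTS` differs per wiki; the featurization and (in the real system) the trained scoring model would be shared, which is what makes deployment and maintenance easier across languages.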

weekly update:

  • sketched a plan structuring the work to be picked up in Q2 (Google Doc, internal)
  • The plan specifies several tasks ranging from easy to hard, with relevant documentation; for each, it describes the issue, proposes one or more potential solutions, and explains how to test whether they work. Ideally, we will try to fix the current link recommendation models that have not yet passed the backtesting (such as Japanese), as well as work towards a single language-agnostic model that would be easier to maintain and host.
  • closing this task; the specific work will be captured in follow-up tasks