
Scope work for improving multilingual support for link recommendation model for add-a-link task
Closed, Resolved · Public

Description

The link recommendation model for the add-a-link task continues to be deployed to more and more wikis. To this end, @kevinbazira has trained the model for 297 wikis; however, only 278 of those passed the backtesting evaluation and were published (T336927). We would like to increase the number of languages for which this model can be deployed.

There are two lines of work towards this goal:

  • Fix the language-specific models so that they pass the backtesting evaluation in additional languages.
  • Develop a single language-agnostic model; this would facilitate easier deployment and maintenance across languages.

The goal of this task is to scope the work needed for these two lines in more detail, in order to assess their effort and feasibility.

Event Timeline

weekly update:

  • started collecting ideas for fixing the models that did not pass the backtesting evaluation
  • went through the report and enumerated the languages and potential reasons for failure (T309263). This suggests two potential improvements that could fix the model for several languages:
    • integrating the mwtokenizer package into the processing pipeline. This will likely fix word-tokenization issues when parsing text in languages that do not use whitespace between words (Japanese, Chinese, etc.), which is crucial for identifying unlinked text that should be linked.
    • fixing a Unicode error in wikipedia2vec. The latter is crucial for generating embeddings of articles, which are used as features for the model.
  • next step: identify the relevant code changes.
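To illustrate why word tokenization breaks for non-whitespace languages (a minimal sketch; this is not the mwtokenizer implementation, which uses language-aware rules): naive whitespace splitting yields word tokens for English but returns the whole sentence as a single token for Japanese, leaving no candidate spans to link.

```python
# Illustrative only: the implicit whitespace assumption of many NLP pipelines.
# mwtokenizer exists precisely because this assumption fails for languages
# such as Japanese and Chinese.

def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace -- works for English, not for non-whitespace scripts."""
    return text.split()

english = "Tokyo is the capital of Japan"
japanese = "東京は日本の首都です"  # roughly the same sentence, no spaces between words

print(whitespace_tokenize(english))   # six word tokens
print(whitespace_tokenize(japanese))  # a single token: no anchor candidates at all
```

With only one "token" for the whole Japanese sentence, the pipeline cannot match anchor-dictionary entries against individual words, which is why a language-aware tokenizer is a prerequisite for these wikis.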

weekly update:

  • no update as I was busy with other projects this week

weekly update

(1) Implement the mwtokenizer package: https://pypi.org/project/mwtokenizer/

(2) Fix the Unicode error in wikipedia2vec

weekly update:

  • @kevinbazira put together a table with the full results for all trained wikis (T343374#9064194). This gives a good baseline for future improvements and a point of comparison with any future language-agnostic model.
  • Started thinking through the challenges of translating the current model into a language-agnostic one. Some components will remain language-dependent (such as the anchor dictionary used to identify words in the text that could be linked); my current thinking is that these will simply be separate look-up tables. For generating predictions, we should aim to train a single model on a balanced training dataset of links from many languages, with the language code as an explicit feature variable. As a next step I will sketch a training pipeline for such a single model in more detail. With the detailed table of baseline results above, we can check whether such a model would still perform well across languages.
  • Put together an on-wiki table with the evaluation results for 301 wikis: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task/Results_round-1. This will help us check whether any changes we implement actually improve the model.
  • Updated the list of wikis for which the model needs improvement to capture all 23 languages for which the model did not pass the backtesting and was thus not published: https://phabricator.wikimedia.org/T309263 (22 models that need improvement).
  • Sketched a pipeline for a language-agnostic model using existing features/datasets from already-trained models. This allows us to test the general feasibility of a single model; if successful, the aim would be to adapt the features and address scaling/infrastructure issues.
  • next step is to sketch a detailed work plan for the individual tasks that we will work on in Q2.
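The split described above — language-dependent look-up tables feeding a single shared model that takes the language code as a feature — can be sketched roughly as follows. All names, data, and the feature set here are illustrative assumptions, not the actual pipeline or model.

```python
# Hypothetical sketch: per-language anchor dictionaries stay separate, while
# one shared scoring model receives the language code as an explicit feature.
# Dictionary contents and feature names are made-up examples.

ANCHOR_DICTS = {
    # language code -> {surface text -> candidate target articles}
    "en": {"solar system": ["Solar System"]},
    "de": {"sonnensystem": ["Sonnensystem"]},
}

def candidate_features(lang: str, anchor: str, target: str) -> dict:
    """Build a feature dict for one link candidate.

    The language code is an ordinary (categorical) feature, so a single
    model can be trained on candidates pooled from many languages.
    """
    return {
        "lang": lang,                  # explicit feature for the shared model
        "anchor_length": len(anchor),  # placeholder for real model features
        "target": target,
    }

def recommend(lang: str, text: str) -> list[dict]:
    """Look up anchors in the language-specific table, featurize each match."""
    lowered = text.lower()
    candidates = []
    for surface, targets in ANCHOR_DICTS.get(lang, {}).items():
        if surface in lowered:
            for target in targets:
                candidates.append(candidate_features(lang, surface, target))
    return candidates

print(recommend("en", "Planets in the Solar System orbit the Sun."))
```

The design point is that only `ANCHOR_DICTS` differs per wiki; the featurization and (in the real system) the trained scoring model would be shared, which is what makes deployment and maintenance easier across languages.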

weekly update:

  • sketched a plan structuring the work to be picked up in Q2 (Google Doc, internal)
  • The plan specifies several tasks ranging from easy to hard, with relevant documentation; for each, it describes the issue, proposes one or more potential solutions, and explains how to test whether they work. Ideally, we will try to fix the current link recommendation models that have not yet passed the backtesting (such as Japanese), as well as work towards a single language-agnostic model that would be easier to maintain and host.
  • closing this task; the specific work will be captured in follow-up tasks