
Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines
Open, Needs Triage, Public

Assigned To
None
Authored By
isarantopoulos
Jul 8 2025, 11:57 AM
Referenced Files
F65984925: addalink_eval_bests_v2_v1.csv
Sep 8 2025, 8:40 AM
F65984922: image.png
Sep 8 2025, 8:40 AM
F65920987: image.png
Aug 27 2025, 12:50 PM
F65822305: image.png
Aug 21 2025, 8:01 PM
F65822278: image.png
Aug 21 2025, 8:01 PM
F63902958: addalink_pipeline.excalidraw
Jul 11 2025, 12:01 PM

Description

As a machine learning engineer,
I want to update the add-a-link models by using the learnings from T393474 and options considered in this document.

We focus on the following goals:

  • Scale Add-a-Link model across more languages FY2024 WE1.2
  • Retrain Add-a-Link models FY2024 WE1.2

The following steps will be implemented to achieve the goals:

  • Move the pipeline from research datasets to ml-pipelines.
  • Add a training Airflow DAG in ml airflow.
  • Add per-wiki QID filter logic (e.g. exclude country names for enwiki). Report new results on enwiki.
  • Train all languages on the new pipeline.
  • Add staging release pipeline steps:
    • Check if wikis are above the threshold.
    • Export HDFS data to pkl.
    • generate_sqlite_data
    • create_tables
    • copy-sqlite-to-mysql
  • Staging-release all languages above the threshold.
  • Airflow DAG for the staging release (has a pre-defined list of wikis to release).
  • Add a prod release DAG:
    • Export from the staging DB.
    • Copy exports to a shared place.
  • Airflow DAG for the prod release (has a pre-defined list of wikis to release).
  • Update the inference service to use the new models.
  • Release jawiki to staging to test end-to-end. (It's OK to release jawiki as it's not used by the Growth team yet.)
  • Release jawiki to prod to test end-to-end. (It's OK to release jawiki as it's not used by the Growth team yet.)
  • Manually calculate the current accuracy of major wikis on prod before the release.
  • Release models. (Already automated every 30 minutes.)
    • TBD: how to release iteratively.
  • Manually calculate the accuracy of major wikis on prod after the release (in X months).
  • Removed from scope:
    • (Optional) Experiment with fewer models.
      • Cluster based on derived features of the training sets.

Reporting format

Progress update on the hypothesis for the week, including if something has shipped:

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

Any unresolved dependencies:

New lessons from the hypothesis:

Changes to the hypothesis scope or timeline:

Related Objects

Event Timeline

isarantopoulos renamed this task from Q1 25-26 Goal: Scaling Add-a-link to more wikis via production pipelines to Q1 25-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines.Jul 8 2025, 1:41 PM
isarantopoulos renamed this task from Q1 25-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines to Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines.Jul 8 2025, 1:55 PM

Sharing excalidraw for the add-a-link presentation.

The anchor-generation steps worked well 🎉 with ml-pipelines and Airflow. I'll continue with the next steps.

Note that environment artifact caching is a work in progress, so for now you'll need to copy the GitLab artifact to HDFS manually:

$ ssh stat1008.eqiad.wmnet
# download the packaged conda environment from the GitLab package registry
$ wget -O add_a_link-0.1.1.conda.tgz "https://gitlab.wikimedia.org/api/v4/projects/3455/packages/generic/add_a_link/0.1.1/add_a_link-0.1.1.conda.tgz"
# put it in the Airflow ML artifact cache on HDFS and verify it is there
$ hdfs dfs -put add_a_link-0.1.1.conda.tgz /wmf/cache/artifacts/airflow/ml/
$ hdfs dfs -ls /wmf/cache/artifacts/airflow/ml/

ml-pipelines and Airflow DAG MRs:

  • Initial set-up for add-a-link in ml-pipelines.
  • Anchor-generation step (other steps will be in a new MR): bug fixes from the previous goal and refactoring.
  • Unit tests.
  • CI for unit tests.
  • CD for the Airflow environment.
  • Airflow DAG.
  • Running the DAG for languages in parallel.

https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/5/diffs
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1573/

We get this error in the filter_dict_anchor step.

Tracing back the issue, I see some languages have duplicate anchor records.
We use a refreshed table every month; this is from nlwiki and it was fine last month. I'll check further.

Caused by: java.lang.RuntimeException: Duplicate map key Jimmy Snuka was found, please check the input data. If you want to remove the duplicated keys, you can set spark.sql.mapKeyDedupPolicy to LAST_WIN so that the key inserted at last takes precedence.

image.png (1×3 px, 480 KB)

Note that this is for the three wikis below; a possible workaround is sketched after the snippet:

wiki_dbs = {
    "shard_nl": ["nlwiki"],
    "shard_tr": ["trwiki"],
    "shard_ja": ["jawiki"],
}
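
For reference, a minimal PySpark sketch of two possible workarounds, assuming an anchors table with (anchor, target) rows; the column names and toy data below are illustrative, not the real ml-pipelines code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Option A: follow the error's hint and let the last value win on duplicate keys.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

# Option B: deduplicate explicitly before building the anchor map, keeping the
# most frequent target per anchor text (toy data stands in for the real table).
anchors = spark.createDataFrame(
    [("Jimmy Snuka", "Jimmy_Snuka"), ("Jimmy Snuka", "Jimmy_Snuka_(wrestler)")],
    ["anchor", "target"],
)
best_target = (
    anchors.groupBy("anchor", "target")
    .count()
    .withColumn(
        "rank",
        F.row_number().over(Window.partitionBy("anchor").orderBy(F.desc("count"))),
    )
    .filter(F.col("rank") == 1)
    .drop("rank", "count")
)
best_target.show()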

Release plan

  • We create two new DAGs, namely release-to-staging and release-to-prod. We can discuss whether release-to-prod should still be manual.

Release to staging

  • We add a new step to export the HDFS files to pkl before the other steps.
  • We run the rest of the steps in run-pipeline.sh up to export-files, which updates staging-mysql (a DAG sketch follows below).

Looking into the other jobs, I think we can connect to mariadb from airflow.
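
For illustration, a minimal sketch of what such a release-to-staging DAG could look like, assuming the steps above map to entry points in run-pipeline.sh; the dag_id, commands, and scheduling below are assumptions, not the actual MR:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="add_a_link_release_to_staging",
    start_date=datetime(2025, 7, 1),
    schedule=None,  # triggered manually per release
    catchup=False,
) as dag:
    # Each task wraps one step; the task names follow the list above.
    export_pkl = BashOperator(
        task_id="export_hdfs_to_pkl",
        bash_command="./run-pipeline.sh export-hdfs-to-pkl",
    )
    generate_sqlite = BashOperator(
        task_id="generate_sqlite_data",
        bash_command="./run-pipeline.sh generate_sqlite_data",
    )
    create_tables = BashOperator(
        task_id="create_tables",
        bash_command="./run-pipeline.sh create_tables",
    )
    copy_to_mysql = BashOperator(
        task_id="copy_sqlite_to_mysql",
        bash_command="./run-pipeline.sh copy-sqlite-to-mysql",
    )

    export_pkl >> generate_sqlite >> create_tables >> copy_to_mysql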

Release to prod

We need to figure out how to save files to this location. On stat hosts, we do it by saving them into the /srv/ folder, which syncs with analytics.wikimedia.org periodically.

This should be enough, as an hourly job already updates prod with new files.
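
A rough sketch of that copy step, assuming exports land in a local directory and that the /srv/ destination syncs to analytics.wikimedia.org; both paths below are assumptions:

import shutil
from pathlib import Path

EXPORT_DIR = Path("/tmp/addalink-exports")              # assumed export location
PUBLISH_DIR = Path("/srv/published/datasets/addalink")  # assumed synced path

PUBLISH_DIR.mkdir(parents=True, exist_ok=True)
for dump in EXPORT_DIR.glob("*.sql.gz"):
    # copy2 preserves timestamps so the periodic sync can detect changes
    shutil.copy2(dump, PUBLISH_DIR / dump.name)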

Thank you for working on this release plan @OKarakaya-WMF. It LGTM. Could we add a step to ensure the model is only released when it passes an evaluation threshold, and log evaluation metrics for visibility?

Since this pipeline will be releasing models more regularly, it's also a great time for the team to start planning for a model registry.

Thank you @kevinbazira, I've updated the description.

Change #1181653 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] provision the mysql analytics research password in the analytics-ml HDFS home

https://gerrit.wikimedia.org/r/1181653

Change #1181653 merged by Brouberol:

[operations/puppet@production] provision the mysql analytics research password in the analytics-ml HDFS home

https://gerrit.wikimedia.org/r/1181653

Based on the discussions here:

  • MariaDB connection from Airflow: we can connect to MariaDB from Airflow. The password is here and the airflow-ml user can access it (a connection sketch follows the path):

/user/analytics-ml/mysql-analytics-research-client-pw.txt
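
For illustration, a task could read the password from that HDFS path and open a connection, assuming pymysql is available in the environment; the host and user below are placeholders, not the real analytics settings:

import subprocess

import pymysql

# Read the client password that the airflow-ml user can access on HDFS.
password = subprocess.check_output(
    ["hdfs", "dfs", "-cat",
     "/user/analytics-ml/mysql-analytics-research-client-pw.txt"],
    text=True,
).strip()

conn = pymysql.connect(
    host="staging-db.example.wmnet",  # placeholder, not the real host
    user="research",                  # placeholder client user
    password=password,
    charset="utf8mb4",                # encoding matters for anchor text
)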

Based on discussions here:

  • Staging environment: we don't have a staging API. We can use the DB and test the API locally.

Airflow DAG MR for the staging release:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1638

ml-pipelines fix for mysqldump:
https://gitlab.wikimedia.org/repos/machine-learning/ml-pipelines/-/merge_requests/26

We have mysqldump in the export-tables step. mysqldump didn't work in Airflow because the binary does not exist there. We had two options (a sketch of the second follows the list):

  • Install mysqldump in the environment, e.g. apt install within a Docker image; but the new environment would need other utilities as well, e.g. an HDFS connection.
  • Create the dump without mysqldump; we need to be careful with encoding. I've updated the step with this option.
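
A minimal sketch of that second option, writing INSERT statements straight from a query instead of shelling out to mysqldump; the connection settings and table name are placeholders, and the actual fix in the MR above may differ:

import pymysql

def dump_table(conn, table: str, out_path: str) -> None:
    # Stream the rows and emit one INSERT per row; conn.escape() handles
    # quoting, so UTF-8 anchor text survives the round trip.
    with conn.cursor() as cur, open(out_path, "w", encoding="utf-8") as f:
        cur.execute(f"SELECT * FROM `{table}`")
        for row in cur:
            values = ", ".join(conn.escape(v) for v in row)
            f.write(f"INSERT INTO `{table}` VALUES ({values});\n")

conn = pymysql.connect(host="staging-db.example.wmnet", user="research",
                       password="***", charset="utf8mb4")
dump_table(conn, "addalink_jawiki", "addalink_jawiki.sql")  # placeholder table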

The staging-release Airflow DAG was tested on dev with three wikis and works well.

image.png (255×1 px, 33 KB)

Change #1183685 had a related patch set uploaded (by Ozge; author: Ozge):

[research/mwaddlink@main] feat: updates add-a-link to use new models.

https://gerrit.wikimedia.org/r/1183685

Benchmark completed (except for enwiki):

Taking micro_precision >= 0.75 and micro_recall >= 0.2 as thresholds:

  • 296 wikis are above the threshold; 18 wikis are below it.
  • Prod results: 247 wikis are above the threshold and 54 wikis are below it.
  • Compared to the prod results, we have 20% more languages above the threshold 🎉

On prod, I see we have chosen to release some wikis that are below (but close to) the threshold.
We can discuss whether we want to follow a similar strategy.

Great results! 🎉

@OKarakaya-WMF From the models/wikis already in production, are there any that are currently falling short of the threshold with the new training runs?

I've picked the best scores and compared v1 (results from the current prod) vs v2 (results from the new pipeline).

I've attached the comparison CSV.

We have 61 wikis above the threshold in v2 that are below the threshold, or not evaluated, in v1.

There are 4 wikis that are above the release threshold in v1 and below it in v2.
They are mostly close to the release threshold in v2, though.

image.png (237×1 px, 45 KB)

The CSV in the previous comment is also available here:

enwiki results:

v1: precision 0.81, recall 0.45

v2: precision 0.80, recall 0.40

v2 after removing countries and continents and sampling the training set from 84M down to 50M: precision 0.805727, recall 0.365026

I've calculated online scores for add-a-link here and share the main highlights below (a toy sketch of the acceptance-rate calculation follows the highlights).
We can re-use the notebook to recalculate scores some time after the model releases.

Main highlights

  • The statistics below cover 2025-06-01 until today (2025-09-08).
  • enwiki's online acceptance rate (0.87) is higher than its offline precision (0.81).
  • itwiki's online acceptance rate (0.67) is lower than its offline precision (0.84).
  • Actions from enwiki doubled after it was made available to all users (2025-09-03).
  • On enwiki, 5% of the recommendations are not an exact match between the link text and the link target page.
  • On enwiki, the acceptance rate is lower (0.80) when the recommendation is not an exact match.
  • On enwiki, 40% of the actions come from mobile, and there is no big difference in acceptance rates.
  • On enwiki, most users are from the United States, India, and the United Kingdom.
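
For context, a toy sketch of the acceptance-rate calculation behind these highlights, assuming one event row per recommendation with an action column; the real notebook's schema is likely different:

import pandas as pd

events = pd.DataFrame({
    "wiki":   ["enwiki", "enwiki", "enwiki", "itwiki"],
    "action": ["accepted", "accepted", "rejected", "accepted"],
})

# Acceptance rate per wiki: accepted events over all events.
accepted = events["action"].eq("accepted").groupby(events["wiki"]).sum()
total = events.groupby("wiki").size()
print((accepted / total).round(2))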

@SSalgaonkar-WMF: There are no big surprises, but I'm tagging you as it could be interesting.

I'm just copying Ozge's comment from the task description:

Progress update on the hypothesis for the week, including if something has shipped:

  • Training completed for all models; the evaluation comparison was shared in a spreadsheet.
  • Staging pipeline implemented and all models uploaded to staging. They are all ready for production.
  • Production pipeline implemented; jawiki is uploaded to https://analytics.wikimedia.org/ via the pipeline. Serving will start using it once it's updated.
  • Added wiki-specific filters to the training set. We exclude continents and countries from enwiki.
  • Current performance of the models on production (online) was calculated and insights were shared. Although this is a basic implementation, we can use it again some time after the release for a comparison.
  • The serving patch is in review by the Growth Team.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • We keep track of both training and production scores.
  • Comparing the training evaluation of the current models (v1) to that of the new models (v2):
    • 296 wikis are above the threshold; 18 wikis are below it.
    • Compared to the prod results, we have 20% more languages above the threshold.
    • We have 61 wikis above the threshold in v2 that are below the threshold, or not evaluated, in v1.
    • There are 4 wikis that are above the release threshold in v1 and below it in v2.
    • We aim to report a more detailed comparison, including the improvements on the current models above the threshold.
  • Comparing the training evaluation of the current models to their actual performance on production:
    • Some wikis, e.g. enwiki, perform better on production than in training. One reason could be that only the top picks are used on production.
    • Some wikis, e.g. itwiki, perform better in training than on production. There could be several reasons, e.g. the size of the embeddings, some issues in encodings, etc. We can investigate this more if needed.

Any emerging blockers or risks:

  • The serving patch has been in review since 03/09/2025. I believe it should be deployed next week, but let's discuss based on how it goes.

Any unresolved dependencies:

  • We will decide on a release plan in collaboration with the Growth Team.

New lessons from the hypothesis:

  • Removing countries and continents from enwiki has reduced recall but keeps precision similar. We don't use the filters in evaluation; I think using the filters in evaluation should help us get more consistent results.
  • enwiki is enabled for all users. This helps a lot in gaining insights into how the models perform on production, as usage has already doubled. It would be great to know whether we have a schedule for enabling other wikis for all users; it's also good to know in terms of managing the load on the servers.

Changes to the hypothesis scope or timeline:

  • N/A

Reporting (19/09/2025)

Progress update on the hypothesis for the week, including if something has shipped:

  • We proposed a release plan in collaboration with the Growth Team. I understand they also want to add the wikis to the tasks, so we will update the plan.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • The serving patch needs to be reviewed/merged/deployed.

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • We are collaborating with the Growth Team on the release plan within the scope of this task.

Reporting (26/09/2025)

Progress update on the hypothesis for the week, including if something has shipped:

  • We have agreed with the Growth Team to collaborate in October 2025.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • We have shared an analysis about case-sensitive recommendations.
  • Deployments will start in October, as agreed with the Growth Team.

All models are deployed to the new location via the Airflow DAG.

All the released models are above the following thresholds:

micro_precision_threshold = "0.75"
micro_recall_threshold = "0.2"

We have 297 models in total.
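
For reference, a small sketch of the release gate these thresholds imply, assuming per-wiki evaluations are available as a dict; the function and variable names are illustrative:

MICRO_PRECISION_THRESHOLD = 0.75
MICRO_RECALL_THRESHOLD = 0.2

def releasable(evaluations: dict[str, tuple[float, float]]) -> list[str]:
    # Keep only the wikis whose models pass both thresholds.
    return [
        wiki
        for wiki, (precision, recall) in evaluations.items()
        if precision >= MICRO_PRECISION_THRESHOLD
        and recall >= MICRO_RECALL_THRESHOLD
    ]

print(releasable({"jawiki": (0.81, 0.31), "xxwiki": (0.70, 0.25)}))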

Reporting (03/10/2025)

Progress update on the hypothesis for the week, including if something has shipped:

  • We have deployed the new models to the new location.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • As discussed, based on the availability of the Growth Team, we can become the owner of the API. We can also split the goals into two:
    • Inference service deployments (ML Team):
      • The current project has a model per wiki. We have previously discussed how to reduce the number of models.
      • The project has a MariaDB database where we store the data needed for inference.
    • MediaWiki deployments (Growth Team)

Reporting (10/10/2025)

Progress update on the hypothesis for the week, including if something has shipped:

  • Last time, we discussed closing this goal, as the new models have been moved to the new location.
  • We will suggest to the Growth Team that we deploy the inference service, if both teams agree.

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • N/A

Change #1183685 merged by jenkins-bot:

[research/mwaddlink@main] feat: updates add-a-link to use new models.

https://gerrit.wikimedia.org/r/1183685

As discussed, I'm creating a new goal for deployments, and I'm closing this goal.