
FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment
Open, Needs Triage · Public

Assigned To
Authored By
isarantopoulos
May 6 2025, 2:37 PM
Referenced Files
F63902183: addalink_pipeline.excalidraw
Jul 11 2025, 11:55 AM
F62821730: image.png
Jul 3 2025, 12:20 PM
Restricted File
Jun 2 2025, 9:55 AM
F60288983: top (1).csv
May 20 2025, 8:57 AM
F60288961: image.png
May 20 2025, 8:57 AM

Description

In scope of this goal, we want to answer the following questions:

  • How does the current process work for model training?
  • How does the current process work for deployment/serving?
  • What are the ongoing/expected/desired improvements in both parts?
  • What are the biggest problems on both sides?

By answering the questions above, we will plan improvements for the next steps.

Event Timeline

isarantopoulos renamed this task from Q4 24-25 Goal: Investigate Add a link model training and deployment to Q4 24-25 Goal: Investigate Add-a-link model training and deployment.May 6 2025, 2:40 PM

We have the context below in the ML team Slack channel, but I'm sharing it again here for posterity:

The add-a-link project predates our current ML team and LiftWing infrastructure. Our team was previously known as the Scoring Platform team, so you might see this name in older docs like the add-a-link project architecture.

Initially, 2 teams drove the implementation of this project: the Growth team built the model-serving infrastructure, while the Research team developed the model training and evaluation pipeline for language-specific models.

When the ML team was formed, the Research team asked us to run the model training and evaluation pipeline, and publish / unpublish models in the model repo. The Growth team would then deploy these models in production to be used on their specific wikis.

As shown in this training completion report, the ML team trained, evaluated, and published 16 of the 18 batches (batches 3 through 18) across ~300 wikis. A few of the models were not published on their wikis because they did not pass the backtesting evaluation.

Recently, the Research team has been driving the next steps (see details here). Their goals were to improve underperforming models and to train a single language-agnostic model to replace the many language-specific ones, which should make things easier to maintain.

The Research team has also built a new training pipeline based on Airflow (T388258, T361929). Since this new pipeline replaces the old pipeline's bash script with an Airflow DAG that aligns with the direction we are taking with the peacock model training pipeline, it is probably worth understanding this new add-a-link Airflow training pipeline.

You can also find more info in the Wikitech docs, project repo, and phab epic task.

Hey, @fkaelin. We have recently started an investigation into the add-a-link model. Please feel free to add more folks from your team if relevant.

I'd like to validate my understanding. I'm sharing some information and questions below. Could you help with the questions or suggest a contact?

  • prod
    • model on prod. model per language.
  • research-datasets
    • supports single model for multiple languages. Not used on prod yet.
    • uses the research.article_topics table for embeddings rather than wikipedia2vec. However, this table seems to be empty. (I think this table was populated recently; we have data now. Do you know how data is populated into this table? Do we have any pipeline that we can trigger later?)
      • Can we summarize the improvements in this repo compared to the prod one? It seems it's this list and maybe some more.
  • airflow-dags
    • uses research datasets with a language group config. Not used on prod yet.
      • I've just triggered this pipeline. Curious about the evaluation scores. I hope it's not a problem and does not affect anything.
  • research-mwaddlink-gitlab: Should we ignore this repo as it's in a user's personal space?

Looking into the results here, can you help me find the implementation/code/commit/repo behind the previous (e.g. previous precision, recall) and the new (precision, recall) results?

Do we have similar evaluation scores for the implementation in airflow where we have SHARDS of languages (one model used by multiple languages)? How did we decide to have some languages in the same SHARD/model? I think we can proceed with this implementation if the evaluation scores are better.

Is this the service we serve and used by the daily job on production?

Do we have any documentation about the experiments done before and their evaluation scores? I'd like to understand what we tried before that didn't work, so we don't go down the same route twice.

Thank you!

Hey, @Michael. We have recently started an investigation into the add-a-link model. Please feel free to add more folks from your team if relevant.

I'd like to validate my understanding. I'm sharing some information and questions below. Could you help with the questions or suggest a contact?

I'd like to map the architecture diagram we have in Miro to actual implementation/code/repo and understand how the link recommendation service is used.
https://miro.com/app/board/uXjVI-Wgq44=/

As I understand, we have a daily job which generates predictions for a given set of articles and stores them to a database.

Is this the implementation for the job?

What is our strategy for selecting the articles we want to make predictions for? To clarify my question: I suppose every day we pick a set of articles and generate recommendations for them. How do we decide which articles to pick?

How can I request access to the elastic search instance?

Thank you!

Sharing some notes and multiple ideas:

Model/Training

  • What are common and un-common things among the languages?
  • Problem: Training is manual.
    • Solution: move to airflow.
  • Problem: many models:
    • Solution: the efforts so far haven't managed to simplify the process without sacrificing performance, so I think we need some experimentation here.
      • Keep the same model structure but use multilingual BERT (the sentence-embeddings version).
        • Serving a neural network would be more expensive than gradient boosting, but we would save a lot by having a single model.
      • Alternatively, we can keep the GB model and only replace wikipedia2vec with the BERT sentence model.
    • The current models might be strong because they use page embeddings: the full article is represented by an embedding. However, the embeddings are generated only during training and never updated afterwards, which might hurt performance at inference time.
  • Wikipedia2Vec:
    • It uses multi-task learning:
      • Textual context: standard skip-gram for word-to-word prediction.
      • Entity context: learns the relatedness of entities.
    • The output of the model is a dict where keys are articles and values are the embeddings (size=50). It learns articles as entities.
In [4]: model["İngiltere"]
Out[4]:
array([-1.4799727 , -1.9188014 ,  1.255978  , -0.9291187 , -1.621248  ,
       -1.1431726 , -0.85117525,  0.41257238,  0.13056344,  0.6387092 ,
       -1.5251698 ,  0.3221179 ,  0.5230236 , -1.3298826 , -2.8342779 ,
       -0.5459568 , -0.10679709, -1.8341618 ,  0.01000451,  1.7433337 ,
        1.026936  ,  1.2728058 , -1.3444848 , -0.08137286,  0.26413748,
        2.1955068 ,  2.3960302 , -0.72420645,  1.0613892 , -0.6018021 ,
        0.37485346, -0.5537357 ,  0.67882067,  1.5518492 , -0.64002323,
       -0.866975  ,  1.2935635 , -0.6776365 ,  1.4768347 , -1.1798778 ,
       -0.7108869 , -1.1821009 , -1.7069952 , -0.57637167,  0.48973945,
        2.2490513 , -0.20400155, -0.70167893, -0.63959754, -1.9178141 ],
      dtype=float32)
  • Feature Importance
# enwiki: ‘weight’: the number of times a feature is used to split the data across all trees.
[('f1', 1113), # freq: How many times was the link used with this text
 ('f4', 1098), # w2v: W2V Distance between the source and target page
 ('f2', 1001), # ambig: How many different links were used with this text
 ('f3', 830), # kur: Skew of usage text/link distribution
 ('f5', 775), # leven: levenshtein distance between the text and the link. 
 ('f0', 422)] # ngram: number of words in the text.
 
 # trwiki: ‘weight’: the number of times a feature is used to split the data across all trees.
 [('f1', 1220),
 ('f4', 1152),
 ('f2', 1017),
 ('f3', 945),
 ('f5', 798),
 ('f0', 330)]
 
 # nlwiki: weight
 [('f1', 1308),
 ('f4', 1195),
 ('f2', 994),
 ('f5', 911),
 ('f3', 758),
 ('f0', 249)]
 
 # enwiki: ‘total_gain’: the total improvement in accuracy brought by a feature 
[('f5', 731322.6267700639),
 ('f1', 562421.2296013556),
 ('f2', 313809.47264602705),
 ('f4', 154585.59138801633),
 ('f3', 115051.49178370048),
 ('f0', 13801.914218655007)]

 # trwiki: ‘total_gain’: the total improvement in accuracy brought by a feature 
 [('f5', 600563.4022971413),
 ('f1', 431900.2928599014),
 ('f2', 188087.28608366448),
 ('f4', 118918.14733512145),
 ('f3', 91983.85046138855),
 ('f0', 9690.182899598794)]
 
 # nlwiki total_gain
 [('f1', 686270.0808737405),
 ('f5', 583314.1633530738),
 ('f2', 323852.1216276612),
 ('f4', 175693.73246569172),
 ('f3', 118137.45635772852),
 ('f0', 11242.340368773006)]
  • All features (except ngram) are somewhat important.
  • The order of importance is largely the same across languages. So what if we train wikipedia2vec for each language but a single xgboost model, with language as an additional feature?
  • Calculate the feature importance of the xgboost models in each language and cluster them. How many different groups will we have?
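The clustering idea in the last bullet could be sketched roughly as follows. This is a hedged sketch, not actual pipeline code: `Booster.get_score(importance_type=...)` is xgboost's public API, but the feature list and comparison logic are illustrative assumptions.

```python
# Hedged sketch: compare per-language xgboost feature importances to decide
# which languages could plausibly share a model. Any object exposing
# get_score(importance_type=...) (like an xgboost Booster) works here.

FEATURES = ["f0", "f1", "f2", "f3", "f4", "f5"]  # ngram, freq, ambig, kur, w2v, leven

def importance_vector(booster, importance_type="total_gain"):
    """Normalized importance vector in a fixed feature order, so that
    languages with different absolute scales remain comparable."""
    scores = booster.get_score(importance_type=importance_type)
    raw = [scores.get(f, 0.0) for f in FEATURES]
    total = sum(raw)
    return [v / total for v in raw]

def same_ranking(vec_a, vec_b):
    """True if two languages order the features identically."""
    rank = lambda v: sorted(range(len(v)), key=v.__getitem__)
    return rank(vec_a) == rank(vec_b)
```

Vectors like these could then be fed to any clustering routine (k-means, hierarchical) to see how many natural language groups emerge.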

Sharing some options that we're considering (not sorted by priority):

Options

  • Option1: Reduce the number of models and complexity of the pipeline
    • Near term:
      • Continue on the model experiments on both xgboost and wikipedia2vec (Collaboration with the Research team. We want to learn what they have tried so far in this regard, what has worked, what has not worked, which implementation should be our baseline.)
        • Can we get similar results with fewer xgboost models?
        • Can we get similar results with an embedding model other than wikipedia2vec?
      • Move pipelines into Airflow (Collaboration with the Data Engineering team. We want to learn best practices in WMF e.g. operators to use, access to sources etc.)
    • We don’t want to reduce quality. Disadvantage: the Research team mentions that they have experimented with different approaches and all of them reduced quality. We need to find out what these experiments were.
  • Option2: Make the training automated as it is.
    • Near term:
      • Move the current pipeline into airflow as it is. (Collaboration with the Data Engineering team. We want to learn best practices in WMF e.g. operators to use, access to sources etc.)
      • Figure out with the Research team which round should be the baseline: the first or the second. (Collaboration with the Research team. We want to learn which implementation should be our baseline.)
      • Update package versions etc., so that it's easier to improve later.
    • Long term:
      • Continue working on the improvements.
    • Disadvantage: we'd rather not take over many models, as they won't be easy to maintain.
  • Option3: Improve inference and how the predictions are used.
    • Near Term:
      • Replace refreshLinkRecommendations.php with a streaming job that predicts the topic of the article for each change (or with a filter) and sends the results to Elastic, so that we have a single source of truth. (Collaboration with the Growth Team)
      • This needs more investigation on how the current system works after inference. (Collaboration with the Growth Team)
  • Option4: Separate wikipedia2vec and xgboost training pipelines (Collaboration with the Research team. Check if Research team already found a way to replace wikipedia2vec. If so, this option is obsolete.)
    • Move wikipedia2vec training into a new pipeline (Collaboration with the Data Engineering team. We want to learn best practices in WMF on Airflow. Operators to use, access to sources etc.)
    • Long term (Collaboration with the Search team and the Growth team. We want to learn best practices to add embeddings to the articles in ES and if it’s the best place):
      • Move wikipedia2vec out of the project and have the embeddings ready in Elasticsearch, so that we can also calculate similarity there.

Hey, @Michael. We have recently started an investigation into the add-a-link model. Please feel free to add more folks from your team if relevant.

I'd like to validate my understanding. I'm sharing some information and questions below. Could you help with the questions or suggest a contact?

I'd like to map the architecture diagram we have in Miro to actual implementation/code/repo and understand how the link recommendation service is used.
https://miro.com/app/board/uXjVI-Wgq44=/

As I understand, we have a daily job which generates predictions for a given set of articles and stores them to a database.

Is this the implementation for the job?

No, the implementation is maintenance/refreshLinkRecommendations.php. The implementation is complex; the place that makes the actual HTTP request to the service is ServiceLinkRecommendationProvider::getDetailed.

What is our strategy for selecting the articles we want to make predictions for? To clarify my question: I suppose every day we pick a set of articles and generate recommendations for them. How do we decide which articles to pick?

The script is triggered hourly. There are actually two different approaches that we use in different wikis:

  1. the classic topic-based approach: for each topic with an insufficient number of recommendations, use a search query to elastic/CirrusSearch to get a random set of articles without link recommendations, and try to get new recommendations for them.
  2. the new iterative approach: Just iterate over all articles in the main namespace based on their page id. We check a batch of articles for recommendations and then cache the page-id of the last article so we can pick up there in the next invocation of the script.

We introduced the iterative approach very recently in order to generate suggestions for ~all articles that can have them. This is especially needed for surfacing these suggestions when reading an article.
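The iterative approach above could be sketched like this. This is illustrative only: the real logic lives in GrowthExperiments' refreshLinkRecommendations.php, and the function and parameter names here are hypothetical.

```python
# Hypothetical sketch of the "iterative" article-selection approach:
# walk all main-namespace page ids in batches, caching a cursor so the
# next hourly invocation resumes where this one left off.
def next_batch(main_ns_page_ids, cached_last_id, batch_size=100):
    """Return the next batch of page ids after the cached cursor, plus
    the new cursor value to cache for the next invocation."""
    batch = sorted(pid for pid in main_ns_page_ids if pid > cached_last_id)[:batch_size]
    # When we run past the highest page id, reset the cursor so the next
    # run starts over from the beginning of the id space.
    new_cursor = batch[-1] if batch else 0
    return batch, new_cursor
```

Each batch would then be checked for link recommendations before caching the new cursor.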

How can I request access to the elastic search instance?

I'm not sure what you mean by that. We use the normal, public, elastic search instance that is also used for on-wiki search. You're accessing the instance when searching for an article on wiki. If you need different access, then you probably need to talk to the search team.

Sharing some options that we're considering:

Options

  • Option1: Reduce the number of models
    • Continue on the model experiments on both xgboost and wikipedia2vec.
      • Can we get similar results with fewer xgboost models?
      • Can we get similar results with an embedding model other than wikipedia2vec?
    • Move wikipedia2vec out of the project and have the embeddings ready in Elasticsearch, so that we can also calculate similarity there.

As I understand it based on what @MGerlach explained, reducing the number of models is likely to reduce the quality of the predictions. We would really like to avoid a reduction in quality if we have better alternatives.

  • Option2: Make the training automated.
    • Move the current pipeline into airflow as it is.
    • Update package versions etc., so that it's easier to improve later.

Sounds good to me.

  • Option3: Improve inference and how the predictions are used.
    • Replace refreshLinkRecommendations.php with a streaming job that predicts the topic of the article for each change (or with a filter) and sends the results to Elastic, so that we have a single source of truth.

I don't really understand what you are proposing here. The topics are already in elastic, independently from everything here. What is not there (yet) is the actual data of a prediction, that is: which words should link to which articles. I'm not sure how much data can be stored there, that would be a question for the search team to answer.

Thank you @Michael ,

I've updated the options with your comments.
About Elasticsearch: I want to take a look at the data and run some queries if we have a UI like Kibana. I'll check with the search team.

Current tasks/plan for investigation (will be updated as we figure out more or have results)

  • Pick a subset of experimentation languages.
  • Reproduce the first round results here on the selected subset of languages (on a notebook).
  • Reproduce the second round results here on the selected subset of languages (on a notebook).
  • Train a single xgboost model by using either the first round implementation or the second round. Compare with the current performance.
  • Calculate feature importance of xgboost models in each language and cluster them. How many different groups/models will we have?
  • Evaluate combinations from the results of the previous task.
  • How many models should we expect if we go with a hybrid approach?

Current Limitations (will be updated as we figure out more.)

Priority by product:

  • The current recommender cannot recommend an article that has not been linked before (confirm with @fkaelin).
    • Priority due to the following requirement (Link to an orphaned article [TENTATIVE]): the Suggested Edits module contains tasks that show an anchor article, anchor text, and an article recommendation. The article recommendation is an orphan article (an article that doesn’t already have any links to it).

No priority by product:

  • As I understand it, we can’t recommend links to articles created after the training date (2022-05-24), because those articles are not in the wikipedia2vec model. This can be solved by training wikipedia2vec more frequently or by changing the architecture (e.g. using another embedding model).

single model

Looking into this PR, a single-xgboost approach was previously tested with good results.

Would you advise going in this direction, @fkaelin?

Based on the results here:

"""
current number of models at https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/ : 275
"""

df_passed_old = df[(df["precision"] >= 0.75) & (df["recall"] >= 0.20)]
len(df_passed_old) # 289

df_passed_new = df[(df["precision_combined"] >= 0.75) & (df["recall_combined"] >= 0.20)]
len(df_passed_new) # 225

len(df_passed_old[df_passed_old["wiki"].isin(df_passed_new["wiki"])]) # 225

len(df_passed_old[~df_passed_old["wiki"].isin(df_passed_new["wiki"])]) # 64

Based on the results above, we lose 64 languages by combining all languages into a single model. Could this be improved:

  • by finding optimal groups of languages. We can analyze the feature importances in models and the training data.
  • by improving the overall performance with feature engineering.

embeddings

It seems switching from wikipedia2vec to article topic embeddings didn't have a negative impact overall.
https://gitlab.wikimedia.org/akhatun/research-mwaddlink/-/merge_requests/6

@kevinbazira may help with the backtesting threshold. What are our precision and recall thresholds for releasing a model to production and enabling add-a-link for a language?

The precision and recall thresholds we used for add-a-link models were 75% and 20% respectively (T308146#7951993)

I've updated the single-model comment based on the thresholds you shared, @kevinbazira. Thank you!

tried mwaddlink pipeline

https://airflow-research.wikimedia.org/dags/mwaddlink/grid?tab=logs&task_id=model_model_0&dag_run_id=manual__2025-05-16T12%3A59%3A23.745190%2B00%3A00

got this error:

KeyError: 'wiki_db'
Number of Wikis: 1
ERROR:  FileNotFoundError /tmp/research/mwaddlink/training/link_train.parquet/wiki_db=enwiki
Total training data size: 0

I've managed to set up the environment in a notebook. I'll try to reproduce this in a notebook with smaller data next week.
@fkaelin

  • prod
    • model on prod. model per language.
  • research-datasets
    • supports single model for multiple languages. Not used on prod yet.

correct

  • uses the research.article_topics table for embeddings rather than wikipedia2vec. However, this table seems to be empty. (I think this table was populated recently; we have data now. Do you know how data is populated into this table? Do we have any pipeline that we can trigger later?)

The table was not configured for the prod instance - the table is available now.

    • Can we summarize the improvements in this repo compared to the prod one? It seems it's this list and maybe some more.
  • airflow-dags
    • uses research datasets with a language group config. Not used on prod yet.
      • I've just triggered this pipeline. Curious about the evaluation scores. I hope it's not a problem and does not affect anything.
  • research-mwaddlink-gitlab: Should we ignore this repo as it's in a user's personal space?

The linked meta page is a good summary by Martin. The described effort contains both changes to the modelling approach (joint models, link embeddings instead of w2v, ...) and changes to how the pipeline is executed (airflow migration with research engineering support). The latter was essentially required by the former (i.e. the pain points for retraining apply to the researchers too), but it also means that the two are connected. If (theoretically) the ML team had started this investigation without the model improvement work done by Aisha/Martin, migrating the current training pipeline to an airflow dag with the same modelling approach (i.e. the output of the model should be equivalent, no joint model, etc.) would have been a reasonable first step, and only then iterating on the modelling and serving components. For the linked repos this means:

Looking into the results here, can you help me find the implementation/code/commit/repo behind the previous (e.g. previous precision, recall) and the new (precision, recall) results?

This gets a little tricky. I don't have full knowledge of which improvements made it into the research-datasets/airflow dag; glancing at the sections of the meta page, the research-datasets code implements all sections in the Results ("Mwtokenizer improvement", "Improving the model for individual languages", and "Exploratory work for language-agnostic model"). So the pipeline can't run the current production model (e.g. with word2vec) that led to these results. The research-datasets pipeline ended up being quite an effort to implement, and after the initial validation runs Research hasn't had the resources to work on it again.

Do we have similar evaluation scores for the implementation in airflow where we have SHARDS of languages (one model used by multiple languages)? How did we decide to have some languages in the same SHARD/model? I think we can proceed with this implementation if the evaluation scores are better.

There aren't any joint-model evaluation metrics linked on the meta page, and I am not aware of another place. For the research-datasets pipeline, also see this notebook, which can execute all the steps individually (not orchestrated like the dag, of course).

Is this the service we serve and used by the daily job on production?

I defer to Growth on this.

Do we have any documentation about the experiments done before and their evaluation scores? I'd like to understand what we tried before that didn't work, so we don't go down the same route twice.

The results linked on the meta page are the most comprehensive (also see the linked spreadsheet). Keep in mind that historically all models were trained ad hoc in notebooks, mostly hand-tuned until a researcher was happy with the results. Only with the training pipelines in research-datasets (revert risk model, add-a-link) can we do things like a grid search across various hyper-parameters to find the best model, and as far as I know we haven't done this for add-a-link yet. Extending the airflow dag to collect evaluation metrics for a suite of experiments (e.g. different joint-model configurations, xgboost params, etc.) would be very interesting, and does not exist yet.

tried mwaddlink pipeline

https://airflow-research.wikimedia.org/dags/mwaddlink/grid?tab=logs&task_id=model_model_0&dag_run_id=manual__2025-05-16T12%3A59%3A23.745190%2B00%3A00

got this error:

KeyError: 'wiki_db'
Number of Wikis: 1
ERROR:  FileNotFoundError /tmp/research/mwaddlink/training/link_train.parquet/wiki_db=enwiki
Total training data size: 0

I've managed to set up the environment in a notebook. I'll try to reproduce this in a notebook with smaller data next week.
@fkaelin

Setting up a notebook and/or an airflow development instance is the way to go for now; running on the production airflow instance is not ideal, as you can't access the logs and intermediate data directories easily (and failures send alerts to the mailing lists). See this notebook that can run all individual steps of the airflow dag, and this notebook for a dev setup to run/modify research-datasets code on the spark cluster.

Hello @Michael , @kostajh , @Tgr and @kevinbazira

I'd like to confirm my understanding about the current implementation with a question:

  • GrowthExperiments repo: Calls /v1/linkrecommendations/ to get recommendations.
  • v1/linkrecommendations service is deployed by deployment-charts by using the image wikimedia/research-mwaddlink
  • Dockerfile for the wikimedia/research-mwaddlink service is in mwaddlink repo
  • The repo serves a flask api that is used by GrowthExperiments repo (ServiceLinkRecommendationProvider).
  • We also have an hourly job, load-datasets, implemented in the mwaddlink repo. However, we only use it to refresh the datasets used by the Flask API when they change in analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/, as we pass --download in the job.
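The refresh-on-change behaviour of the hourly job could be sketched as below. This is an assumption-laden sketch: the checksum comparison is illustrative, not the actual mechanism behind mwaddlink's --download flag.

```python
# Illustrative "refresh only if changed" check for a cached dataset:
# compare the local copy's digest against a published checksum and only
# re-download on mismatch. Names here are hypothetical.
import hashlib

def needs_refresh(local_bytes, published_sha256):
    """True if the locally cached dataset differs from the published one."""
    return hashlib.sha256(local_bytes).hexdigest() != published_sha256
```

A job structured this way stays idempotent: running it hourly is cheap when nothing has been republished.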

My question (@fkaelin may help here as well):
We intend to use the model for a new use case, Link to an orphaned article. The Suggested Edits module contains tasks that show an anchor article, anchor text, and an article recommendation; the article recommendation is an orphan article (an article that doesn’t already have any links to it). As I understand it, we currently can't support this feature because we only recommend from existing anchors. We could support it by enhancing anchors_with_mentions with pre-filtered potential orphans. This requires extending the dataset we use and some experimentation, as we need to find a good pre-filtering method and it may have a negative impact on overall evaluation scores.
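Purely to explore the idea (not an implementation plan), augmenting the anchor lookup with pre-filtered orphans might look like this. `anchors_with_mentions` is modeled as a plain dict stand-in, not the real mwaddlink data structure, and the matcher is a hypothetical pre-filter.

```python
# Exploratory sketch: attach orphan articles as extra candidates under
# anchors selected by some pre-filtering function. All names hypothetical.
def augment_with_orphans(anchors_with_mentions, orphan_titles, matcher):
    """Return a new anchor -> candidates mapping with orphans attached
    under every anchor the matcher associates them with."""
    augmented = {a: list(c) for a, c in anchors_with_mentions.items()}
    for title in orphan_titles:
        for anchor in matcher(title, augmented):
            augmented[anchor].append(title)
    return augmented
```

The whole question then reduces to choosing a good `matcher` (title overlap, embedding similarity, ...) without dragging down the overall evaluation scores.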

Sorry in advance, I've tagged many people as it touches different repos.

Hello @Michael , @kostajh , @Tgr and @kevinbazira

I'd like to confirm my understanding about the current implementation with a question:

  • [...]
  • [...]
  • Dockerfile for the wikimedia/research-mwaddlink service is in mwaddlink repo

As I understand it, this file is, at least in part, the specification for the CI pipeline of that repository. You can see elements of it in https://integration.wikimedia.org/ci/job/research-mwaddlink-pipeline-test/1078/console. I'm not sure if the code here is also executed when actually building for production, or if the accordingly named sections are only used for testing the building-for-production step in CI. But @Urbanecm_WMF knows this service better than me.

  • [...]
  • [...]

My question (@fkaelin may help here as well):
We intend to use the model for a new use case, Link to an orphaned article. The Suggested Edits module contains tasks that show an anchor article, anchor text, and an article recommendation; the article recommendation is an orphan article (an article that doesn’t already have any links to it). As I understand it, we currently can't support this feature because we only recommend from existing anchors. We could support it by enhancing anchors_with_mentions with pre-filtered potential orphans. This requires extending the dataset we use and some experimentation, as we need to find a good pre-filtering method and it may have a negative impact on overall evaluation scores.

Could we improve the existing pipeline first, before we add new features? While it is surely good to keep that use case in mind as we design the new system, I'm concerned that we might be doing too much at once if we also want to introduce new functionality while migrating to a new pipeline at the same time.

I've reproduced the implementation in research datasets for one wiki (simplewiki) on a notebook 🎉

snapshot = "2025-04"
wikidata_snapshot = "2025-05-05"
wiki_dbs = ["simplewiki"]

threshold,N,micro_precision,micro_recall,wiki_db
0.0,1000,0.5050038491147036,0.47953216374269003,simplewiki
0.1,1000,0.6755888650963597,0.4612573099415205,simplewiki
0.2,1000,0.7285546415981199,0.45321637426900585,simplewiki
0.3,1000,0.7687253613666228,0.4276315789473684,simplewiki
0.4,1000,0.8023088023088023,0.4064327485380117,simplewiki
0.5,1000,0.8449111470113085,0.3823099415204678,simplewiki
0.6,1000,0.8812949640287769,0.358187134502924,simplewiki
0.7,1000,0.8904382470119522,0.3267543859649123,simplewiki
0.8,1000,0.9144893111638955,0.2814327485380117,simplewiki
0.9,1000,0.9353448275862069,0.15862573099415206,simplewiki

I had to use a newer version of the data, though, as I ran into some issues with the old version.

enwiki is in progress with increased memory.
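The threshold sweep in the table above amounts to computing micro precision/recall over pooled link predictions at each threshold. Here is a hedged sketch of that computation; it is illustrative, not the actual research-datasets evaluation code.

```python
# Micro-averaged precision/recall at a score threshold: pool every
# (score, is_true_link) prediction across all pages, then count.
def micro_pr(scored_links, threshold):
    """scored_links: iterable of (model_score, is_true_link) pairs."""
    tp = fp = fn = 0
    for score, is_true_link in scored_links:
        predicted = score >= threshold
        if predicted and is_true_link:
            tp += 1
        elif predicted:
            fp += 1
        elif is_true_link:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping `threshold` from 0.0 to 0.9 over the same pooled predictions would reproduce the shape of the table: precision rises and recall falls as the threshold increases.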

Could we improve the existing pipeline first, before we add new features? While it is surely good to keep that use case in mind as we design the new system, I'm concerned that we might be doing too much at once if we also want to introduce new functionality while migrating to a new pipeline at the same time.

Yes! Our goal is to better support the known and established use cases for the model, while keeping this tentative future use case in mind. Ideally, this initial investigation (and the immediate decisions that result from this investigation) will help inform future decisions re: whether we are able to use the same model for the Orphan Article structured task or if we will need to build a new model once that structured task is prioritized. To that end, I believe Özge's question is more about exploring this possibility than about what our implementation path will look like (please correct me if you disagree with any of this, Özge!)

Prod vs New Pipeline

As we are likely to proceed with either the prod pipeline or the new pipeline, here are the pros and cons of each.

Requirements/criteria for the prod implementation (mwaddlink) vs. the new implementation (research-datasets):

  • Works for top 20 (Importance: High)
    • Prod: Based on previous results, no (see zhwiki, jawiki). Should be confirmed with a new dataset, as those results are from an old one.
    • New: Based on a new dataset, yes for the top 10 except fawiki (fawiki creates an empty training set for some reason; needs debugging). Need to check the rest. Gets results above the release threshold for zhwiki and jawiki.
  • Training works for all languages (Importance: High)
    • Prod: Based on previous results, yes. We can experiment with the current data.
    • New: We already have problems with fawiki and haven’t tried all languages. Should be fixable.
  • Dependencies (Importance: Medium)
    • Prod: Trains wikipedia2vec within the same pipeline.
    • New: Uses other pipelines (article topics), which means performance depends on other projects.
  • Pre-work to migrate to Airflow (Importance: Medium)
    • Prod: Old dependencies should be updated (in case we run into dependency conflicts). TODO: add more here.
    • New: Questions to answer for each language above the threshold: How many fail to train due to data issues? How many fail due to a bigger problem (e.g. tokenizer, embeddings)? How many are below the release threshold? Can we leave the old models as they are if we hit a time-consuming problem for a language (depends on how many there are)? Steps: make sure we can train for all languages; make the output the same as current prod (investigate this more). Might be riskier, as this pipeline has not been used on prod before and we may run into issues we don’t know about yet.
  • Fewer models (Importance: High)
    • Prod: Needs investigation.
    • New: Current results show that performance decreases with a single model. Needs experimentation on groups of models.
  • Stability (Importance: Medium)
    • Prod: No; randomly fails and works on the second try.
    • New: No; randomly fails and works on the second try.
  • Works for the "Link to an orphaned article" use case (3) (Importance: Medium)
    • Prod: No.
    • New: No.

@isarantopoulos @SSalgaonkar-WMF please feel free to add more requirements or update the current ones.

@fkaelin what would be the major challenges to productionize the new pipeline in research datasets?

Thank you all!

Got results for top 10 languages with the new pipeline.
New pipeline: implementation in research-datasets with some improvements, similar to here. Note that we still use one model per language.
Old pipeline: I didn't reproduce the results but used the ones here.

Note that the results are from different datasets: the new pipeline results are from last month, while the old pipeline results are from several years ago.

Highlights:

  • New pipeline is above the release threshold for zhwiki and jawiki.
  • Pipeline fails for fawiki because it creates an empty training dataset. Will investigate further.
  • The other results are similar.

image.png (326×666 px, 48 KB)

Added more detailed results attached.

Hello @Michael , @kostajh , @Tgr and @kevinbazira

I'd like to confirm my understanding about the current implementation with a question:

  • [...]
  • [...]
  • Dockerfile for the wikimedia/research-mwaddlink service is in mwaddlink repo

As I understand it, this file is, at least in part, the specification for the CI pipeline of that repository. You can see elements of it in https://integration.wikimedia.org/ci/job/research-mwaddlink-pipeline-test/1078/console. I'm not sure if the code here is also executed when actually building for production, or if the accordingly named sections are only used for testing the building-for-production step in CI. But @Urbanecm_WMF knows this service better than I do.

It is used for both purposes. All the blubber.yaml file does is specify a bunch of images. Those images can be (and are!) used in a variety of different ways: as a base to build other images, in the CI pipeline, or to deploy the service to production. CI builds some of the images on every patchset (the test pipeline) and some images after a patch is successfully merged (after gate), which is the publish pipeline. This is configured in config.yaml within the repository.

We make direct use of the following images specified there:

  • codehealth: to run code health checks in CI,
  • test: within the CI pipeline to run tox, which runs linting, unit and integration tests using its own configuration,
  • production: to run the service in production

Only test and production images are then published within Wikimedia's Docker registry in a versioned manner. The production deployment then uses a specified version of the production image, see values.yaml in the deployment-charts repo (note this is a slightly different link than what @OKarakaya-WMF previously shared).

I'm not sure if the code here is also executed when actually building for production, or if the accordingly named sections are only used for testing the building-for-production step in CI

@Michael I'm not sure I understand this correctly; hopefully what I wrote above clarifies it. The production image is built only in the publish pipeline in CI (after a patch is merged); it should not be built at any other time.

Prod vs New Pipeline

As we are likely to proceed with either prod pipeline or new pipeline, we discuss here the pros and cons of each pipeline.

[...]

Cross-referencing some entries in this table to existing tasks:

Got results for top 10 languages with the new pipeline. [...]

Those numbers look promising! Though I wonder if you could share a bit more about the testing process and how the data for it is selected?

The background to my question is that we have issues with the model consistently proposing links that are not in line with the Manual of Style, and the community rightly highlights this. It tends, for example, to suggest linking country and continent names, which is often not desirable. For example: a reverted Add-a-Link linking of "Central America" and "South America". The revert referenced https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking#What_generally_should_not_be_linked

We have a task to adjust the relevant section of the existing pipeline to stop linking to articles like these (T386867). But ideally, shouldn't the model learn on its own that some words should usually not be linked? How would the model treat a sentence like "Crunchyroll began streaming the film in North America, Central America, South America, Europe, Africa, Oceania, the [[Commonwealth of Independent States]], and India on April 20, 2023." (the original from the reverted diff above)?

Hey @Michael ,

Thank you very much for the comments and for sharing the link to the issues. It's great to see you had similar issues; that helps to identify the common ones.

About the testing process: I've used the implementation here, with some small fixes, for the new pipeline. We filter out the entities in this list. The production version also has a similar filter here. Do we have code for the entities you've shared? We can add it if they're not covered by the existing filters. Otherwise, we can look into negative sampling, adding entities as features to the model, hard-coded rules, etc.

For a broader analysis, do we have data we can query to analyse the accepted and rejected recommendations in production? This would help a lot in understanding the most common issues with predictions in production.

Maybe something like this, but I need to figure out where we define the events and what their attributes mean:

select * from event.mediawiki_structured_task_article_link_suggestion_interaction 
-- where active_interface like '%recommendedlinktoolbar_dialog%'
-- and event.task_type = 'link-recommendation'
limit 500

[...]

Awesome summary! Thank you very much @Urbanecm_WMF 💟

Hello @Michael and @Urbanecm_WMF ,

I've created some questions/investigation items for myself, but it'd be great to get some help from one of you.
I'll start answering them today, but would you be available for a short call tomorrow or early next week to go through them?

  • Community filters: We use community filters to get articles that we want to produce recommendations for. How to download all current filters to analyse and maybe compare with this list?
  • Enhancement with topics: Check out implementation for enhancing recommendations with topics. Also, we can take a look into the most common topics, if some topics are oversaturated or undersaturated. In Miro diagram, refreshLinkRecommendations.php saves to both MariaDB and ElasticSearch and if I remember we have recommendations without topics in MariaDB and with topics in Elastic Search. Looking into the implementation we save them to the growthexperiments_link_submissions table. Would this be the mariadb table or the elasticsearch table? Sorry, this is my first time reading php :) . Therefore, I’d like to go through it together with one of you.
  • Order to show (sorting) recommendations: Check out implementation for deciding the order of the recommendations to show. Is it purely based on prediction probability of the add-a-link model after filtering by the community filters. How topics are used (filter vs/and sort)
  • Do we save accepted/rejected recommendations in our databases or should we look for it in events? Seems possible with events, but we can look into the other databases if we have an easier way.
  • Elastic: Already asked to the Search Team but we can check together the connection to the Elasticsearch instance we use to push recommendations and the index name.

Hello @Michael and @Urbanecm_WMF ,

I've created some questions/investigation items for myself, but it'd be great to get some help from one of you.
I'll start answering them today, but would you be available for a short call tomorrow or early next week to go through them?

Happy to find some time early next week. Though I'll also leave some notes inline as well.

  • Community filters: We use community filters to get articles that we want to produce recommendations for. How to download all current filters to analyse and maybe compare with this list?

What are "community filters"? We'd like to get recommendations for approximately all articles that can have them, so that we can also surface recommendations when reading an article. The list you linked is unrelated; it is about articles not to link to.

  • Enhancement with topics: Check out implementation for enhancing recommendations with topics. Also, we can take a look into the most common topics, if some topics are oversaturated or undersaturated. In Miro diagram, refreshLinkRecommendations.php saves to both MariaDB and ElasticSearch and if I remember we have recommendations without topics in MariaDB and with topics in Elastic Search. Looking into the implementation we save them to the growthexperiments_link_submissions table. Would this be the mariadb table or the elasticsearch table? Sorry, this is my first time reading php :) . Therefore, I’d like to go through it together with one of you.

Recommendations are not enhanced with topics by the current production implementation. Topics are preexisting in CirrusSearch (they are being added by an unrelated process). What are you trying to accomplish by looking into topic saturation?

The available recommendations are saved in the growthexperiments_link_recommendations mariadb table. growthexperiments_link_submissions is a mariadb table as well; however, it does not contain recommendations but submissions, as the name says. It is used to skip link recommendations that were previously explicitly declined by a user.
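As a minimal illustration of that mechanism (all names and record shapes here are hypothetical, not the actual GrowthExperiments code), previously declined submissions can be used to filter new recommendations:

```python
# Hypothetical record shapes: recommendations and submissions both identify a
# link by (anchor text, target title); submissions carry the user's decision.
def skip_declined(recommendations, submissions):
    """Drop recommendations a user has already explicitly rejected."""
    declined = {
        (s["anchor"], s["target"])
        for s in submissions
        if s["decision"] == "rejected"
    }
    return [
        r for r in recommendations
        if (r["anchor"], r["target"]) not in declined
    ]
```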

  • Order to show (sorting) recommendations: Check out implementation for deciding the order of the recommendations to show. Is it purely based on prediction probability of the add-a-link model after filtering by the community filters. How topics are used (filter vs/and sort)

There are several different things mixed together here:
On many wikis, if the user selects multiple topics, then they are presented with articles that have at least one of the topics (OR/union), however on some wikis we have enabled the functionality for the users to see articles that match all of their selected topics (AND/intersection).

Then on Special:Homepage, if one is getting articles recommended for a list of topics, those are boosted by a custom boost function that prioritizes articles with fewer links, see UnderlinkedFunctionScoreBuilder for the specific implementation.

Once on the article, the (typically) 3 recommendations with the highest link-score are shown by order of appearance.
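The behaviour described above can be sketched roughly as follows (a simplified Python sketch, not the actual PHP implementation; all names are illustrative):

```python
def matches_topics(article_topics, selected_topics, mode="OR"):
    """Union (OR) vs intersection (AND) topic matching, configured per wiki."""
    selected = set(selected_topics)
    if not selected:
        return True
    if mode == "AND":
        return selected.issubset(set(article_topics))
    return bool(selected & set(article_topics))

def visible_recommendations(recommendations, max_shown=3):
    """Take the suggestions with the highest link score, then show them
    in order of appearance in the article (position offset)."""
    top = sorted(recommendations, key=lambda r: r["score"], reverse=True)[:max_shown]
    return sorted(top, key=lambda r: r["offset"])
```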

  • Do we save accepted/rejected recommendations in our databases or should we look for it in events? Seems possible with events, but we can look into the other databases if we have an easier way.

They are saved in the database as mentioned above and also in the in-wiki logs (as shown on Special:Log). See \GrowthExperiments\NewcomerTasks\AddLink\LinkSubmissionRecorder::record.

  • Elastic: Already asked to the Search Team but we can check together the connection to the Elasticsearch instance we use to push recommendations and the index name.

Adding the weighted tag to CirrusSearch is happening here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/GrowthExperiments/+/refs/heads/master/includes/NewcomerTasks/AddLink/LinkRecommendationUpdater.php#177 where the WeightedTagsUpdater is a service coming from the CirrusSearch extension.

Hello @fkaelin ,

I've compiled the previous discussions here. We can use it as agenda items for our meeting tomorrow.

  • One of our goals is to scale add-a-link models across more languages. We can do this incrementally by using the production implementation (e.g. add a new tokenizer, add new embeddings), or we can use the research-datasets implementation where the improvements are already implemented. I've shared pros and cons for each option in the task. The final option is to come up with a new single model where multi-language embeddings are fine-tuned within the model. Let's go through the options together in the meeting. I'd mostly love to hear what we should consider if we go with the research-datasets option, and what potential consequences we are not yet aware of.
  • Another goal is to retrain the models. I believe the best option we have is to go with Airflow. Thank you for sharing this document. I'll go through it and try to run Airflow locally with internal operators.
  • Orphan articles: I understand we can't use the current solution as-is to support orphan article recommendations. I have shared some reasons in this task and checked previous communication. Let's discuss what the best next steps could be.
  • Fewer models: How can we reduce the number of models? I've read about the previous work. As I understand it, we need to find the optimal groups of languages to train together so that we can achieve good scores between one model per language and one model for all languages. My suggestion was to cluster the languages based on their feature importances and the correlations in the training datasets. Maybe the SHARDS in Airflow are already based on something similar. Let's also check this excelsheet. I think the precision column is for the research-datasets implementation.
  • Production performance: Looking into it with the Data and Search teams. I can share my findings in the meeting and in this task.
  • How the predictions are used with community configurations and topics: Looking into it with the Growth team. I can share my findings in the meeting and in this task.

Meeting notes

  • The idea from the research team, before they stopped working on the project, was to generate recommendations within the pipeline. The main advantages of this approach are avoiding the serving challenges (having pickles and many models) and having more control over targeting (picking the right articles for prediction by using community configuration, which has a direct impact on production performance).
  • We are comfortable proceeding with the research-datasets project after testing it on all languages individually. The article-topics embeddings are used in other projects as well.
  • We can run Airflow on a stat machine as a personal dev instance. I'll give it a try later.
  • Fewer models: the idea from the research team was to run a kind of grid search to find the best groups of languages. I think we can try both this and a hypothesis-based approach (finding similar models) within the scope of the investigation. We don't know how the current groups were created (SHARDS in the airflow repo). Pickles will get bigger with fewer models.
  • Orphan articles: We can experiment with adding orphan articles to anchors, but we need to read more about the most recent experiments.
  • More tips around yarn, code sharing, sshuttle, and datafusion-wmf.

looking into the growthexperiments_link_recommendations with dfwmf, do we have recommendations for 18 wikis?

> select wiki_db, count(*) from mariadb.extensions.growthexperiments_link_recommendations group by wiki_db order by count(*) desc;
+---------+----------+
| wiki_db | count(*) |
+---------+----------+
| frwiki  | 2577180  |
| eswiki  | 1954821  |
| arzwiki | 1484057  |
| fawiki  | 1142973  |
| ptwiki  | 1138060  |
| idwiki  | 733392   |
| warwiki | 393453   |
| viwiki  | 273253   |
| itwiki  | 238757   |
| nlwiki  | 179242   |
| ruwiki  | 160059   |
| arwiki  | 158689   |
| plwiki  | 156938   |
| enwiki  | 137448   |
| dewiki  | 88689    |
| ukwiki  | 88120    |
| svwiki  | 49016    |
| cawiki  | 47591    |
+---------+----------+
18 row(s) fetched.
Elapsed 4.522 seconds.

looking into the growthexperiments_link_recommendations with dfwmf, do we have recommendations for 18 wikis? [...]

I do not know that tool, but that list is very incomplete. For example: hywiki has Add a Link enabled as well and it is not in the list: https://hy.wikipedia.org/w/index.php?search=hasrecommendation%3Alink&title=Սպասարկող%3AՈրոնել&ns0=1&uselang=en

Thank you for the answer, @Michael; I agree. I think there is something wrong with the dfwmf tool.
I've created some topics for our meeting tomorrow. They're not very well structured, and there's no need to answer yet; I'll explain them in the meeting.

  • Updated with the meeting notes.

Growth Team Topics

  • What are the main improvements in mind?
    • manual training.
    • recommendations from the pipeline.
      • how can we measure the performance.
      • best performing article types.
      • probability score vs reverted/rejected recommendations.
    • a new way: how would we do this if we didn't have the current system?
      • one model with an embedding layer where the embeddings work for multiple languages. (check with research)
      • serving: how to separate data used for both training and inference (anchors, embeddings etc.)

Model Serving

why do we have internal and external pods?

ozge@deploy1003:~$ kubectl get pods
NAME                                                       READY   STATUS      RESTARTS   AGE
linkrecommendation-external-69d46fbc66-42x94               3/3     Running     0          88d
linkrecommendation-internal-58657886f4-ct46b               3/3     Running     0          88d
linkrecommendation-internal-58657886f4-jx5dw               3/3     Running     0          88d
linkrecommendation-internal-58657886f4-qkg85               3/3     Running     0          88d


DB_BACKEND in the service is mysql. How do we populate data into this database? Is it this one? Is it safe to assume we don't use pickle files?
Do we always use this as an API? I see click. Just want to make sure we don't run it as a script from somewhere else.
What should we consider when we deploy new models or A/B test them? We may need a release plan here, e.g. for x% of the users. Any previous best practices for A/B testing would help. Check also the hourly script to get updates.

PHP side

The script has an option to run for a given page.

# Do we run this job with more than one option currently? all vs ores
# fixLinkRecommendationData.php: a monitoring job (elastic vs mariadb).
# listTaskCounts.php: number of recommendations per language.
# iterateThroughAllPages depends on the language.
# Do we have other jobs related to add-a-link?
def refresh_link_recommendations():
    if iterateThroughAllPages:
        refreshByIteratingThroughAllPages()
    else:
        refreshViaOresTopics()

def refreshByIteratingThroughAllPages():
    pageRowsProcessed = 0
    limit = 5000  # configurable
    batch_size = "some_number"
    lastPageId = "where we left off"
    while pageRowsProcessed < limit:
        pageRecordsIterator = get_pages_by_page_id(lastPageId)
        for page in pageRecordsIterator:
            processCandidate(page)
            pageRowsProcessed += 1
            lastPageId = page.id  # resume point for the next batch

def refreshViaOresTopics():
    oresTopics = getOresTopics()  # based on the topics in weighted_tag
    for oresTopic in oresTopics:
        suggestions = taskSuggester.suggest()  # which taskSuggester do we use here?
        totalExistingSuggestionsCount = suggestions.getTotalCount()
        # What is the value of minimum_tasks_per_topic? We want 500 per topic.
        recommendationsNeeded = getMinimumTasksPerTopic() - totalExistingSuggestionsCount
        # LinkRecommendationTaskTypeHandler::getSearchTerm
        # \GrowthExperiments\Maintenance\RefreshLinkRecommendations::initServices
        # $linkRecommendationCandidateTaskType = NullTaskTypeHandler::getNullTaskType(
        #     '_nolinkrecommendations', '-hasrecommendation:link' );
        titleBatch = findArticlesInTopic(oresTopic)  # uses taskSuggester.suggest()
        for title in titleBatch:
            processCandidate(title)
            recommendationsNeeded -= 1
            if recommendationsNeeded == 0:
                break

def processCandidate(page):
    # check if it's a good candidate: recently edited etc.
    evaluateTitle()
    # call to /v1/linkrecommendations/
    recommendationStatus = getDetailed()
    # save to growthexperiments_link_recommendations
    insertExistingLinkRecommendation(recommendationStatus)
    # update CirrusSearch (recommendation.link)
    updateWeightedTags(page)

  • Where is the community configuration involved?
  • How is the data in growthexperiments_link_recommendations used later, given that we don't have recommendations in CirrusSearch (MW client code)?
  • RevalidateLinkRecommendations: how/why do we use this? It seems this is where we call the service?
  • Check GrowthExperimentsLinkRecommendationProviderUncached together.
  • Is this the only job we run?
  • Which languages have this feature now?
  • Which task suggesters do we use now? (implements TaskSuggester)
  • Where are the mariadb analytics replicas?

@OKarakaya-WMF I look forward to tomorrow's meeting! I should be able to help you with several of the questions you listed here, though I foresee that for some others I will have to defer to @Urbanecm_WMF because I'm not well versed yet in how the linkrecommendation service is operated in detail. But let's figure that out in detail tomorrow.

  • How to use research dataset environment on notebook.

Following the readme after cloning the repo on a stat machine:

  • Create the environment:
conda-analytics-clone env
source conda-analytics-activate env
conda env create -f conda-environment.yaml
  • To use the environment and create a kernel for the notebook:
source conda-analytics-activate env
conda activate research-datasets
pip list | grep boost
python -m ipykernel install --user --name research-datasets
  • Now research-datasets kernel should be available to pick in the notebook.

Hi @SSalgaonkar-WMF and @isarantopoulos ,

I'm sharing some topics below that we can have in the decision brief document.

Content for Decision Brief

  • proof that it's safe to use the improved pipeline for training.
  • airflow: ml-airflow design decisions; environments/repos/operators to use etc.
  • decision on whether we want to generate recommendations in a pipeline rather than via the API. Explain why/why not.
  • fewer models: results from the hypothesis-based and grid-search approaches and the decision on whether we want to go with fewer models.
  • explain how the orphan articles use case can be implemented in the future: how we can support it with some modifications to the improved pipeline, or why we need a new implementation.
  • explain how the planned implementation can be used for add-an-image later. If not relevant, explain why it would not impact the current decisions.
  • plan to make community config more aligned with training. Should excluded entities be part of community config?
  • explain why/why not we need a new model/architecture.
  • release plan: a/b testing, feature flags etc.
  • how we monitor production performance of the models.
  • plan with the decisions.

We can use this in the sync and added some todo items above to clarify together.

Research Datasets Pipeline Results

I've trained/evaluated a model per language for the top 50 wikis here, ordered by number of articles, using the following inputs:

snapshot = "2025-04"
wikidata_snapshot = "2025-05-05"
grid_search=False
langs = ["enwiki", "cebwiki", "dewiki", "frwiki", "svwiki", 
         "nlwiki", "ruwiki", "eswiki", "itwiki", "plwiki", 
         "arzwiki", "zhwiki", "jawiki", "ukwiki", "viwiki", 
         "warwiki", "arwiki", "ptwiki", "fawiki", "cawiki", 
          "idwiki", "srwiki", "kowiki", "nowiki", "trwiki",
          "cewiki", "fiwiki", "cswiki", "huwiki", "rowiki", 
          "ttwiki", "euwiki", "shwiki", "zh-min-nanwiki", "mswiki", 
          "hewiki", "eowiki", "hywiki", "dawiki", "bgwiki", 
           "uzwiki", "cywiki", "simplewiki", "bewiki", "skwiki", 
          "elwiki", "etwiki", "azbwiki", "kkwiki", "minwiki"]

I've tried this on a notebook rather than airflow as it's easier to debug.

  • passed_prod: production pipeline results
  • passed_any: research datasets pipeline results if the model has passed the release threshold for any of the classification thresholds.
passed_prod  passed_any
False        True           3
True         False          9
             True          35
  • 3 wikis failed. I'll check separately.
  • Based on the results, 9 wikis that were above the release threshold with the prod implementation are now below with the research datasets implementation. We need to figure out/ fix this.
['cywiki', 'dawiki', 'etwiki', 'euwiki', 'hywiki', 'mswiki', 'srwiki', 'ttwiki', 'uzwiki']
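For reference, a summary like the passed_prod/passed_any counts above can be produced with a pandas groupby (toy data below; the real frame has one row per wiki, and the column names are the ones used in this comment):

```python
import pandas as pd

# Toy stand-in for the per-wiki evaluation results.
df = pd.DataFrame({
    "wiki_db": ["aawiki", "bbwiki", "ccwiki"],
    "passed_prod": [True, True, False],
    "passed_any": [True, False, True],
})
# Count wikis per (passed_prod, passed_any) combination.
counts = df.groupby(["passed_prod", "passed_any"]).size()
```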

I see these wikis were above the threshold in these results. There might be something wrong with the embeddings or the snapshot I use. @fkaelin do you get similar results and which snapshot do you use? The results are not deterministic but similar.
Also, what is wikidata_properties = ["P31"]?
Please feel free to let me know if you need to take a look to the full script.

  • @fkaelin for a quick experiment you can train/evaluate uzwiki with the snapshot you prefer and we can take a look if it's similar to the results in the excelsheet or mine.

  • tried uzwiki twice. the results are not deterministic but similar. We can make it deterministic later.
(research-datasets) ozge@stat1010:~/repos/wiki/gitlab/research-datasets$ cat  data/model_uzwiki/model_uzwiki.backtest.eval.csv
threshold,N,micro_precision,micro_recall,wiki_db
0.0,1000,0.22180451127819548,0.10720775287704422,uzwiki
0.1,1000,0.756198347107438,0.1108419139915203,uzwiki
0.2,1000,0.782051282051282,0.1108419139915203,uzwiki
0.3,1000,0.7972350230414746,0.10478497880072683,uzwiki
0.4,1000,0.8159203980099502,0.09933373712901272,uzwiki
0.5,1000,0.8397790055248618,0.09206541490006057,uzwiki
0.6,1000,0.8323353293413174,0.08419139915202907,uzwiki
0.7,1000,0.85,0.07207752877044216,uzwiki
0.8,1000,0.8761904761904762,0.05572380375529982,uzwiki
0.9,1000,0.9242424242424242,0.036947304663840094,uzwiki
(research-datasets) ozge@stat1010:~/repos/wiki/gitlab/research-datasets$ cat  data/model_uzwiki/model_uzwiki.backtest.eval_v1.csv
threshold,N,micro_precision,micro_recall,wiki_db
0.0,1000,0.20026007802340703,0.09373097991479001,uzwiki
0.1,1000,0.7522522522522522,0.10164333536214243,uzwiki
0.2,1000,0.7892156862745098,0.09799147900182593,uzwiki
0.3,1000,0.7927461139896373,0.0931223371880706,uzwiki
0.4,1000,0.8313953488372093,0.08703590992087644,uzwiki
0.5,1000,0.8627450980392157,0.08034083992696288,uzwiki
0.6,1000,0.8846153846153846,0.06999391357273281,uzwiki
0.7,1000,0.8990825688073395,0.05964698721850274,uzwiki
0.8,1000,0.9390243902439024,0.04686548995739501,uzwiki
0.9,1000,0.9824561403508771,0.03408399269628728,uzwiki
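The backtest files above list micro-averaged precision and recall per probability threshold. A minimal sketch of how such a table can be computed (illustrative only, not the pipeline's actual evaluation code):

```python
def micro_pr_at_thresholds(examples, thresholds):
    """examples: (probability, is_true_link) pairs pooled over all pages.

    Micro-averaging pools every prediction: precision = TP / #predicted,
    recall = TP / #actual links present in the held-out text.
    """
    positives = sum(1 for _, y in examples if y)
    rows = []
    for t in thresholds:
        predicted = [(p, y) for p, y in examples if p >= t]
        tp = sum(1 for _, y in predicted if y)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / positives if positives else 0.0
        rows.append((t, precision, recall))
    return rows
```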

@fkaelin
Do we delete the article topic data older than the beginning of this year?

select * 
from research.article_topics 
where wiki_db = 'uzwiki'
and snapshot = '2024-12'
-- order by snapshot DESC
limit 10

returned result:

Presto Error
presto error: Partition location does not exist: hdfs://analytics-hadoop/wmf/data/research/article_topics/snapshot=2024-12/wiki_db=uzwiki

{F61228891}

I feel there is a lot here that I do not understand and cannot respond to, but there is one detail I can contribute:

[...]
Also, what is wikidata_properties = ["P31"]?
[...]

P31 is the "instance of" Property on Wikidata. So for example the enwiki article Cat is associated with the Wikidata Item house cat (Q146), and that Item has the "instance of (P31)"-values "species" and "taxon".


@fkaelin
Do we delete the article topic data older than the beginning of this year?

{F61228891}

Note that attaching this image may not have worked as intended. I'm seeing it as {F61228891}.

thank you @Michael,

I've added the query from the image and I'm looking into the properties.

Meeting notes with Martin:

  • How do we query databases? What are the important tables to check? (e.g. growthexperiments_link_recommendations, growthexperiments_link_submissions etc.)
curl -s -XGET https://cloudelastic.wikimedia.org:8243/enwiki_content/_search?q=title.keyword:Tourism | jq . | less
  • Notable problems:
    • architecture complexity
      • The current architecture for add-an-image is leaner. Can we get some inspiration? Need to check if we use a model there.
      • how to connect to mariadb x1 from a job (Amir Sarabadani)
      • save to topic (needs investigation if it’s feasible)
    • inference time is long.
    • data used in inference is old.

I see these wikis were above the threshold in these results. There might be something wrong with the embeddings or the snapshot I use. @fkaelin do you get similar results and which snapshot do you use? The results are not deterministic but similar.

These results should be deterministic; at first glance it could be that there are multiple test/evaluation files, and when loading the data the order in which they are read is not guaranteed. Note that, as mentioned in our discussion, this evaluation code has not been migrated to a spark/pipeline-oriented design. I will have a closer look.

Also, what is wikidata_properties = ["P31"]?

As Martin has pointed out, this is the "instance of" property, used to filter the list of potential anchors for a set of items we want to exclude from the training (and evaluation...) dataset. Likely it is this set of qids that needs to be configurable, i.e. moving this list to be an argument to the run method in that file would make it configurable via an airflow variable (during the initial implementation we aimed not to make everything configurable, pushing the choice of what to make configurable to the production push in a second phase). The change to experiment with a per-wiki filter for country could also be applied here.
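For illustration, such a P31-based anchor filter could look like this (the QID and the function names are examples only, not the actual research-datasets code):

```python
# Exclude anchors whose Wikidata item is an instance of an excluded class.
# Q577 ("year") is an example of a class one might want to exclude.
EXCLUDED_P31 = {"Q577"}

def keep_anchor(item_p31_values, excluded=EXCLUDED_P31):
    """Return True if none of the item's P31 values is in the exclusion set."""
    return not (set(item_p31_values) & excluded)
```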

  • @fkaelin for a quick experiment you can train/evaluate uzwiki with the snapshot you prefer and we can take a look if it's similar to the results in the excelsheet or mine.

It seems to fail because of low recall? I will run this experiment to verify your results, with a focus on seeing how we could add support for the w2v embeddings as well (to allow for more gradual changes/improvements). In my understanding, the role of the embeddings (in terms of feature importance) is more of a fallback when the other features are not "meaningful enough" for the xgboost model.

@fkaelin
Do we delete the article topic data older than the beginning of this year?

Yes, the dag removes the oldest snapshot when a new one is generated. Note that the underlying outlink model that the embeddings and topics are based on is not retrained for every dag run (it uses this one). I don't know if/how the topics/embeddings change over time; my expectation is very little.

$ hdfs dfs -ls /wmf/data/research/article_topics/
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 4 items
drwxr-xr-x   - analytics-research analytics-privatedata-users          0 2025-02-03 12:01 /wmf/data/research/article_topics/snapshot=2025-01
drwxr-xr-x   - analytics-research analytics-privatedata-users          0 2025-03-27 17:33 /wmf/data/research/article_topics/snapshot=2025-02
drwxr-xr-x   - analytics-research analytics-privatedata-users          0 2025-04-03 14:17 /wmf/data/research/article_topics/snapshot=2025-03
drwxr-xr-x   - analytics-research analytics-privatedata-users          0 2025-05-03 10:48 /wmf/data/research/article_topics/snapshot=2025-04

I've reproduced the results for uzwiki by using akhatun/research-mwaddlink and the results are similar to the high scores in the spreadsheet.

So the low scores in the research-datasets repo should be due to differences between the two repos (embeddings, dataset generation, etc.) or a data issue.
I'll dive further into the steps and try to figure out what causes the low scores.

,threshold,N,micro_precision,micro_recall
0,0.0,10000,0.4155429897056691,0.41640836884928323
1,0.1,10000,0.5779311260035831,0.42183262301433555
2,0.2,10000,0.6345598578514844,0.41510073614877957
3,0.3,10000,0.689478545637753,0.39766563347539713
4,0.4,10000,0.7360376814380467,0.370834947694692
5,0.5,10000,0.7917523417866118,0.3356741573033708
6,0.6,10000,0.8538266741699494,0.2939267725687718
7,0.7,10000,0.8979732050841636,0.2531964354901201
8,0.8,10000,0.9233412067854843,0.208252615265401
9,0.9,10000,0.9696699375557538,0.157932971716389
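Tables like the one above come from sweeping a score threshold over the model's link predictions and computing micro precision/recall on the predictions that pass the threshold. A minimal, self-contained sketch with invented toy predictions, not the real evaluation code:

```python
# Hypothetical sketch of a threshold sweep. predictions is a list of
# (model_score, is_true_link) pairs; data below is made up for illustration.

def sweep(predictions, thresholds):
    """Return (threshold, micro_precision, micro_recall) rows."""
    total_true = sum(1 for _, y in predictions if y)
    rows = []
    for t in thresholds:
        kept = [(s, y) for s, y in predictions if s >= t]
        tp = sum(1 for _, y in kept if y)
        precision = tp / len(kept) if kept else 0.0
        recall = tp / total_true if total_true else 0.0
        rows.append((t, precision, recall))
    return rows

preds = [(0.95, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
rows = sweep(preds, [0.0, 0.5, 0.9])
```

Raising the threshold trades recall for precision, which is the pattern visible in the pasted tables.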

akhatun/research-mwaddlink worked after the following changes:

  • Problem: python3.7 is not available on stat1010
    • Solution: used python3.9
  • Problem: pip install wikipedia2vec==1.0.5 fails
    • Solution: pip install wikipedia2vec==2.0.0
  • Problem:
[2025-06-03 10:18:08,350] [INFO] Terminating pool workers... (train_embedding@cli.py:326)
GENERATING BACKTESTING DATA
Processing the following Wikipedia dump files:
/mnt/data/xmldatadumps/public/uzwiki/latest/uzwiki-latest-pages-articles.xml.bz2
  0%|                                                                                                                                                                                      | 0/200000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/src/scripts/generate_backtesting_data.py", line 207, in <module>
    for title, sentence in mwxml.map(process_dump, paths, threads=10):
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/venv/lib/python3.9/site-packages/mwxml/map/map.py", line 49, in map
    yield from para.map(process_path, paths, mappers=threads)
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/venv/lib/python3.9/site-packages/para/map.py", line 71, in _map_single_item
    yield from process(item)
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/venv/lib/python3.9/site-packages/mwxml/map/map.py", line 47, in process_path
    yield from process(dump, path)
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/src/scripts/generate_backtesting_data.py", line 185, in process_dump
    code = next(page).text
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/venv/lib/python3.9/site-packages/mwxml/iteration/page.py", line 37, in __next__
    revision = next(self.__revisions)
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/venv/lib/python3.9/site-packages/mwxml/iteration/page.py", line 44, in load_revisions
    yield Revision.from_element(first_revision)
  File "/srv/home/ozge/repos/wiki/gitlab/akhatun/research-mwaddlink/venv/lib/python3.9/site-packages/mwxml/iteration/revision.py", line 61, in from_element
    raise MalformedXML("Unexpected tag found when processing " +
mwxml.errors.MalformedXML: Unexpected tag found when processing a <revision>: 'origin'
  0%|
  • Solution: use mwxml==0.3.4
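Collected as a pinned requirements fragment (only these two pins come from the workarounds above; the move to python3.9 is a property of the stat host, not a pin):

```
# pins per the workarounds above, running on python3.9 instead of 3.7
wikipedia2vec==2.0.0
mwxml==0.3.4
```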

We get different scores for enwiki between the research-datasets and akhatun repos.

research datasets:

threshold,N,micro_precision,micro_recall,wiki_db
0.5,99525,0.8243500764615928,0.13146214540038195,enwiki

akhatun:

5,0.5,10000,0.8105395232120451,0.44120660216277746

Also for the other wikis, we see similar precisions and lower recalls.

I created 200k sentences for both (100k train, 100k test) and I see a similar number of links in both:

number of links:
akhatun:
263545

research_datasets:
270435

However, the sizes of the training sets are different. The positive/negative ratio is similar.

training set:
akhatun:
True       234913
False    84834401

research_datasets:
1,57070
0,19367187

Then I tried generating the training set using generate_training_data.py from research-datasets, but with the sentences generated from akhatun.
The size of the training set becomes 74M, which is closer to the akhatun training set size (85M) than to the research-datasets one (19M).

@fkaelin , Do you have a guess what could cause this?

I see we pick the sentences from different sources; maybe we have a different sampling strategy, or something else differs in the other sources, e.g. redirects or anchors.
The sentences common to both repos are mostly the same: both pick the first sentence of the article, and the text is identical.

Some extra notes:
I've checked:

  • basic statistics of the features in the model input.
  • the positive ratio in the training set.

I don't see big differences.

research-datasets:
I've updated it here and tried it on a previously low-scoring wiki (hewiki). It's much better now. 🚀 🎉

threshold:  0.5
finished: 10000 sentences
micro_precision:	 0.8045914922349764
micro_recall:	 0.24476716718361646

I've started a new benchmark.

I'll check how we create multi-language datasets for a single model, and try a group.

So far, 33 out of 34 wikis are above the threshold.
Only ttwiki is below:

threshold,N,micro_precision,micro_recall,wiki_db
0.0,10000,0.6684278783803257,0.473542722972743,ttwiki
0.1,10000,0.968329596412556,0.11747305429941178,ttwiki
0.2,10000,0.9786743515850144,0.11544737557791677,ttwiki
0.3,10000,0.9848528983396446,0.11491010434014207,ttwiki
0.4,10000,0.9891431924882629,0.11457023417054685,ttwiki
0.5,10000,0.9911452184179457,0.11412840295007307,ttwiki
0.6,10000,0.992886781268524,0.11385650681439691,ttwiki
0.7,10000,0.9940422996723265,0.11341467559392313,ttwiki
0.8,10000,0.9955143540669856,0.1131389342033714,ttwiki
0.9,10000,0.9966986794717887,0.11286704730831974,ttwiki

This was above the threshold in akhatun.

,threshold,N,micro_precision,micro_recall
0,0.0,10000,0.4862865596904518,0.41120146273396524
1,0.1,10000,0.6756885467100564,0.4096136265216764
2,0.2,10000,0.7057737495803961,0.40465765288938077
3,0.3,10000,0.883611810482948,0.371505557426743
4,0.4,10000,0.9013251301467108,0.3665495837944474
5,0.5,10000,0.9155101046992938,0.36183419140643797
6,0.6,10000,0.9574147610570998,0.3364773820981713
7,0.7,10000,0.9671071127354512,0.3310557212972765
8,0.8,10000,0.9772058823529411,0.31971709006928406
9,0.9,10000,0.985691823899371,0.3016262509622787

Therefore, I think we have fixed the most important bugs, but there might still be some small issues if it's not due to the embeddings. I'll compare the datasets.

We found that we lose redirects.

This item is removed from the dataset because Novi-Sad does not exist in pageidsdf:

"{novi-sad, Novi-Sad}",ttwiki,Yosip Runänin

But Novi-Sad exists in redirects:

ttwiki,Novi-Sad,Нови-Сад

and Нови-Сад exists in pageidsdf.
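A minimal sketch of the redirect handling this implies: before checking a link target against the page-id table, resolve it through the redirect map. The toy tables mirror the ttwiki example; function and variable names are assumptions, not the pipeline's:

```python
# Hypothetical sketch of the redirect fix. In the real pipeline these would
# be dataframes (redirects, pageidsdf); plain dicts stand in here.
redirects = {"Novi-Sad": "Нови-Сад"}   # redirect title -> canonical title
page_ids = {"Нови-Сад": 12345}         # canonical title -> page id (made up)

def resolve_page_id(title, redirects, page_ids):
    """Follow a (single-hop) redirect, then look up the page id."""
    canonical = redirects.get(title, title)
    return page_ids.get(canonical)

# Without the redirect step, "Novi-Sad" misses pageidsdf and the positive
# sample is dropped; with it, the sample is kept.
```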

After the fix here:

original training set:
rd:
1,126716
0,832209

ak:
true,186017
false,1266810

number of links:
rd:
233188
ak:
251449

training set after fixes:
rd:
1,157868
0,855380

ak:
1,170464
0,1185083

We get back ~30K positive samples (25% of the positive samples) that we were losing due to redirects.
But we still lose ~9% somewhere in the process.

I get high scores after the fix. Interestingly, these are even higher than the akhatun results. I'll check further.

threshold,N,micro_precision,micro_recall,wiki_db
0.0,10000,0.7786637206398002,0.551638479599823,ttwiki
0.1,10000,0.9887619861967873,0.550889849253071,ttwiki
0.2,10000,0.9913507545086493,0.5498434948285248,ttwiki
0.3,10000,0.9929265592323779,0.5492310832879695,ttwiki
0.4,10000,0.9948759106062477,0.5482784431137725,ttwiki
0.5,10000,0.9962852897473997,0.5472724799347027,ttwiki
0.6,10000,0.9977011494252873,0.5459861956410866,ttwiki
0.7,10000,0.9990029910269193,0.5450095186293173,ttwiki
0.8,10000,0.9993125,0.5434184141657886,ttwiki
0.9,10000,0.9998121360135263,0.5425629523906617,ttwiki

I'm sharing an inconsistency between the repos here which I think leads to the 5% difference in training positives.

In both repos, wdproperties contains items which have at least one P31 (instance of) value.
The akhatun repo (also the production repo) uses anchor links that do not appear in wdproperties or pageids.
However, rd removes these anchors.

I think a page is still valid if it does not have any P31 properties, because we just want to remove items which are instances of a pre-defined set, e.g. names. I should check if there is something special about items without P31.

I'll continue with updating rd to keep these items. @fkaelin , let's discuss whether this approach could cause problems, e.g. unwanted properties leaking into training via redirects.
We also have some more diff in pageids.
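A minimal sketch of the proposed rd change, under the assumption that an anchor is only dropped when it has P31 values and one of them is excluded, while pages absent from wdproperties are kept (matching the akhatun/production behaviour). Names and QIDs are illustrative:

```python
# Hypothetical sketch, not the actual rd code.
EXCLUDED_QIDS = {"Q101352"}  # e.g. the QID for "family name" (illustrative)

def is_valid_anchor(title, wdproperties, excluded_qids=EXCLUDED_QIDS):
    """Drop an anchor only when it HAS P31 values and one is excluded."""
    p31_values = wdproperties.get(title)
    if p31_values is None:
        # No P31 at all (absent from wdproperties): keep the anchor instead
        # of filtering it out, as akhatun/production does.
        return True
    return not any(q in excluded_qids for q in p31_values)

wdproperties = {"Smith": ["Q101352"], "Novi Sad": ["Q515"]}
```

The key difference from the current rd behaviour is the `None` branch: absence of P31 data no longer implies removal.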

Aklapper renamed this task from Q4 24-25 Goal: Investigate Add-a-link model training and deployment to FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment.Jun 30 2025, 9:09 AM

Update for the issue above is here.

@fkaelin , let's discuss if we want to use this change, as I see there is also a diff in the pageidsdfs.

Proposed Next Steps

Focusing on the following goals:

  • Scale Add-a-Link model across more languages FY2024 WE1.2
  • Retrain Add-a-Link models FY2024 WE1.2

Proposed TODO items as follow-up:

  • Move pipeline to mlpipelines.
  • Add Airflow DAG.
  • Benchmark on all languages on research datasets after the recent fixes.
  • Add release steps:
    • to export data/model for inference.
    • to copy exports to a shared place.
  • Update inference service to use new models.
    • Consistent package versions.
    • Add new input wiki_db.
  • (Optional) Experiment with fewer models.
    • Cluster based on features derived from the training sets.
  • Manually calculate current accuracy performance of major wikis on prod before release.
  • Release models (already automated every 30 minutes; we can select an initial subset if there are no conflicts in package versions).
  • Manually calculate current accuracy performance of major wikis on prod after release (in X months).

single model for multiple languages experiment

I've clustered (kmeans) languages by generating features from the training sets:

  • positive ratio in the dataset.
  • statistics (mean, std) for each feature grouped by label.

After finding the ideal number of clusters, I picked 6 languages from one cluster which are similar to each other.

langs = ["ukwiki",
"cawiki",
"trwiki",
"hewiki",
"nowiki",
"huwiki"]

On average, precision drops by 1%.
On average, recall increases by 3%.

The max precision drop is 4%.
The max recall drop is 3%.

The max precision increase is 2%.
The max recall increase is 9%.

The results are promising, and I'll continue with a larger set.
I'm curious to see whether we get good results because of the clustering, or whether a larger set would work just as well.

This time I've tried 44 languages in a single model.
I see that some languages drop significantly, although the average accuracy change is fine.

The average precision decreases by 2%.
The average recall increases by 3%.
The average F1 increases by 2%.

The max precision drop is 26%.
The max recall drop is 20%.
The max F1 drop is 14%.

I checked the correlations among the extracted features, the original accuracy, and the joint-model accuracy.

There was a high correlation (0.40) between the positive ratio and F1 in the original scores; this correlation is lower in the joint-model scores (0.27).

The number of items per language in the training set varies among languages; however, this does not seem to have an impact on the joint models. We could still try to balance them, though.
There is a negative correlation between training set size and accuracy. Although this seems counter-intuitive, I see that small languages generally have lower ambiguity (how many different links were used with this text), which I think makes learning easier.

We may try to group languages by ratio, ngram, and freq in the given order.
I think it's best to run it when we have training sets from all languages.

One final check could be to try 6 languages again but this time we can pick languages that are not similar to each other.

image.png (432×1 px, 140 KB)

I've checked serving:

  • We can create a new step to export the hdfs tables (and the model) to pkl and then continue with the sqlite step. Alternatively, we can change that step to read from hdfs directly. I'll continue by trying this.
  • I've upgraded xgboost to 2.1.4 (the version in RD) in the inference service. The current prod models still work and return the same results (tested on one wiki), so I think we can already release iteratively.
  • It would be great to upgrade the other dependencies too. Docker works fine, but I had to make some more updates to run it on a Mac.
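A minimal sketch of the export-to-pkl idea, with a plain dict standing in for an HDFS/Hive table and an in-memory sqlite database standing in for the existing sqlite step. All names and data are illustrative, not the pipeline's:

```python
import os
import pickle
import sqlite3
import tempfile

# Toy anchor dictionary; in the real pipeline this would be read from an
# HDFS/Hive table (e.g. via Spark) before being pickled.
anchors = {"tashkent": {"Toshkent": 0.9}}

# 1. New export step: materialise the table as a pickle file.
pkl_path = os.path.join(tempfile.mkdtemp(), "anchors.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(anchors, f)

# 2. Existing sqlite step, consuming the pickle unchanged: one row per
#    anchor text, with the target dict pickled as a blob.
with open(pkl_path, "rb") as f:
    table = pickle.load(f)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE anchors (text TEXT PRIMARY KEY, value BLOB)")
for text, targets in table.items():
    db.execute("INSERT INTO anchors VALUES (?, ?)", (text, pickle.dumps(targets)))

# Round-trip check: read the row back and unpickle the value.
row = db.execute("SELECT value FROM anchors WHERE text = 'tashkent'").fetchone()
restored = pickle.loads(row[0])
```

Keeping the pickle format as the interface means the downstream sqlite conversion would not need to know whether the data came from MySQL (as in production) or from hdfs.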

I've converted the hdfs tables into pkl and used the rest of the pipeline as-is.
I've deployed one of the wikis from the rd pipeline locally, and it worked fine after some small changes.
One difference is the encoder, as the new models need wiki_id as input.
I'll check further the differences in the process_page functions between the two repos.

Sharing notebooks here.

Sharing excalidraw for the add-a-link presentation.