User Details
- User Since
- Apr 1 2025, 7:13 AM (31 w, 5 d)
- Availability
- Available
- LDAP User
- Ozge
- MediaWiki User
- OKarakaya-WMF
Thu, Nov 6
I've collected current performance rates and counts of the candidate wikis:
Wed, Nov 5
Tue, Nov 4
Fri, Oct 31
I'm sharing final evaluation results for this phase:
Thu, Oct 30
As discussed, I'm creating a new goal for deployments.
and I'm closing this goal.
Wed, Oct 29
I've started a patch to deploy new models here: https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/1199815
It's WIP but I think it should be ready to review tomorrow. I'll let you know.
Tue, Oct 21
I think it should not be complicated to get offline scores for enwiki.
I'll get back to this and share the results soon after finishing some other tasks.
Fri, Oct 17
Thu, Oct 16
articlequality deployed to staging successfully:
Wed, Oct 15
I've added questions from two large models (gpt-oss:120b and aya:35b) to the prototype UI.
Overall evaluation is in progress.
Mon, Oct 13
I've split the models below into two groups:
Oct 10 2025
- I've checked several benchmarks related to QA generation:
Reporting (10/10/2025)
Progress update on the hypothesis for the week, including if something has shipped:
- Last time, we discussed closing this goal, now that the new models have been moved to the new location.
- We will suggest to the Growth team that we deploy the inference service, if both teams agree.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- N/A
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- N/A
Oct 9 2025
Thank you for the comments.
We can run the experiments on larger LLMs. I've checked that we can use some larger models (tested with gpt-oss:120b, llama4:maverick) on mllab.
I'll look further into some public benchmarks and see if we can re-run the experiments on a different set of LLMs.
I'll revisit the evaluation part to see if we can do it better with minimal human effort.
Oct 8 2025
Sharing the results for the larger dataset below.
I used the same model for evaluation and for querying, due to the limits on the cloud models.
Oct 7 2025
Oct 6 2025
I've started a toolforge app
This is a Streamlit app where we keep the data in the GitLab registry.
Can you share the implementation? (dataset generation and application)
I'm curious to know how it works in more detail, and it should also help with getting answers for the QA part.
Oct 3 2025
Reporting (03/10/2025)
Progress update on the hypothesis for the week, including if something has shipped:
- We have deployed the new models to the new location.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- N/A
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- As discussed, depending on the availability of the Growth team, we can become the owner of the API. We can also split the goals into two:
- Inference service deployments (MLTeam)
- The current project has a model per wiki. We have previously discussed how to reduce the number of models.
- The project has a MariaDB database where we store the data needed for inference.
- MediaWiki deployments (Growth Team)
I've updated the prompt based on the previous scores.
Oct 2 2025
All models are deployed to the new location via the airflow dag.
Looking at the question-related scores, we generally get low scores in question_relevance_to_title and curiosity.
Oct 1 2025
I've updated the checks to a rubric-based approach in order to:
- Get better insights from generated QA
- Compare models from multiple perspectives.
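A minimal sketch of how rubric scores could be aggregated to compare models per criterion. The criterion names question_relevance_to_title and curiosity come from this thread; the model names, the 1 to 5 scale, and all score values are hypothetical placeholders, not actual results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rubric scores: one record per (model, generated question).
# Criterion names are taken from the thread; the scale and values are
# invented for illustration only.
RUBRIC_SCORES = [
    {"model": "model_a", "question_relevance_to_title": 2, "curiosity": 3},
    {"model": "model_a", "question_relevance_to_title": 4, "curiosity": 1},
    {"model": "model_b", "question_relevance_to_title": 5, "curiosity": 2},
]


def mean_score_per_criterion(records):
    """Return {model: {criterion: mean score}} for side-by-side comparison."""
    collected = defaultdict(lambda: defaultdict(list))
    for record in records:
        for criterion, score in record.items():
            if criterion == "model":
                continue
            collected[record["model"]][criterion].append(score)
    return {
        model: {name: mean(scores) for name, scores in by_criterion.items()}
        for model, by_criterion in collected.items()
    }


summary = mean_score_per_criterion(RUBRIC_SCORES)
for model, scores in summary.items():
    print(model, scores)
```

Keeping one mean per criterion, rather than a single overall score, makes it easier to see which perspective a model is weak on.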
Sep 30 2025
Results for both gpt-oss:20b and aya-expanse:32b are available in the spreadsheet.
Sep 26 2025
Reporting (26/09/2025)
Progress update on the hypothesis for the week, including if something has shipped:
- We have agreed with the Growth Team to collaborate in October 2025.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- N/A
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- We have shared an analysis about case-sensitive recommendations.
- Deployments will start in October as agreed with the Growth Team.
My short-term suggestion is to make anchors case-sensitive and train/evaluate the models, so that we can analyse the cases where the performance increases or decreases.
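To illustrate what case-sensitivity changes, here is a small sketch of an anchor lookup with and without case folding. The anchor dictionary, its entries, and the `candidates` helper are hypothetical and only stand in for the real anchor data used by the pipeline.

```python
# Hypothetical anchor dictionary: anchor text -> candidate link targets.
# Real anchor data would come from the training pipeline; these entries
# are invented to show the effect of case folding.
ANCHORS = {
    "Apple": ["Apple Inc."],
    "apple": ["Apple"],
    "Turkey": ["Turkey"],
    "turkey": ["Turkey (bird)"],
}


def candidates(mention, case_sensitive):
    """Look up link candidates for a mention, with or without case folding."""
    if case_sensitive:
        return list(ANCHORS.get(mention, []))
    folded = mention.lower()
    merged = []
    for anchor, targets in ANCHORS.items():
        if anchor.lower() == folded:
            # Merge targets of all anchors that differ only by case.
            merged.extend(t for t in targets if t not in merged)
    return merged


print(candidates("Apple", case_sensitive=True))
print(candidates("Apple", case_sensitive=False))
```

With case folding, differently cased anchors share one merged candidate list, so the model has more targets to disambiguate; the case-sensitive lookup keeps them apart.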
My long-term suggestion would be to add the similarity between lower-level embeddings (e.g. paragraph level) as an additional feature.
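As a rough sketch of such a feature, cosine similarity between two paragraph embeddings could look like the following. The toy 3-dimensional vectors are placeholders for real embeddings, and the choice of cosine similarity is an assumption; the thread does not specify a similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the embedding of the paragraph that contains
# the anchor and the embedding of a paragraph from the candidate target.
source_paragraph = [1.0, 0.0, 1.0]
target_paragraph = [1.0, 0.0, 0.0]

feature = cosine_similarity(source_paragraph, target_paragraph)
print(round(feature, 4))
```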
Sep 25 2025
I'm sharing an analysis on case-insensitivity on enwiki and simplewiki.
Sep 24 2025
Alternative ranking strategy from Fabian:
https://huggingface.co/BAAI/bge-reranker-v2-gemma
Sep 23 2025
hello @KStoller-WMF ,
I totally agree 💯 . All clear, thank you!
Sep 22 2025
About the release of new wikis that are above the release threshold in v2 and do not have add-a-link onboarding tasks:
Below I share the list of wikis filtered by the criteria above (47 in total).
The wikis are sorted approximately by size.
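The filtering described above can be sketched as follows. The `WikiStats` fields, the threshold value, the wiki names, and the descending sort order are all assumptions made for illustration; the actual metric and threshold are not stated in this thread.

```python
from dataclasses import dataclass


@dataclass
class WikiStats:
    wiki: str                   # hypothetical wiki identifier
    score: float                # hypothetical v2 evaluation score
    article_count: int          # used here as a proxy for wiki size
    has_add_a_link_task: bool   # whether the onboarding task is already enabled


RELEASE_THRESHOLD = 0.75  # placeholder value, not the real threshold


def release_candidates(stats, threshold=RELEASE_THRESHOLD):
    """Keep wikis at or above the threshold without the task, largest first."""
    eligible = [
        s for s in stats
        if s.score >= threshold and not s.has_add_a_link_task
    ]
    return sorted(eligible, key=lambda s: s.article_count, reverse=True)


# Invented example data.
example = [
    WikiStats("wiki_a", 0.80, 50_000, False),
    WikiStats("wiki_b", 0.70, 900_000, False),   # below threshold
    WikiStats("wiki_c", 0.90, 200_000, True),    # task already enabled
    WikiStats("wiki_d", 0.78, 120_000, False),
]

for stats in release_candidates(example):
    print(stats.wiki, stats.article_count)
```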
Hello good morning,
Sep 19 2025
Progress update on the hypothesis for the week, including if something has shipped:
- We proposed a release plan in collaboration with the Growth Team. I understand they also want to add the wikis to the tasks. Therefore, we will update the plan.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- The serving patch needs to be reviewed/merged/deployed.
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- We collaborate with the Growth Team on the release plan in scope of this task.
- I know the inference API currently supports the wikis here.
- I also have the list of wikis that are below/above the release threshold.
- However, I'm missing the information about which wikis are currently enabled in tasks. Can you share this information? Have we already enabled tasks for all the wikis here? I can look into usage if this is not easy to find.
- As we want to enable tasks for wikis, I think we should depend on the list of wikis currently enabled in tasks, rather than the list of wikis that are currently being served. They might be the same though. I just want to make sure.
Sep 12 2025
Sep 11 2025
I've calculated online scores for add-a-link here
I share the main highlights below:
We can re-use the notebook to calculate scores some time after the model releases.
enwiki results:
Sep 9 2025
csv in the previous comment is also available here:
Sep 8 2025
I've picked the best scores and compared v1 (results from current prod) vs v2 (results from the new pipeline).
Benchmark completed (except for enwiki):
Sep 2 2025
Thank you @brouberol ,
This will be fixed in scope of https://phabricator.wikimedia.org/T398950
Aug 28 2025
Use_the_yarn_CLI page works like a charm!
Aug 27 2025
oh thanks :)
I don't have access to yarn logs. Is it expected?
staging release airflow dag tested on dev with three wikis and it works well.
airflow dag mr for staging release:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1638
Aug 26 2025
cool, no problem! it's back to normal 😍
I'm getting the following errors. Could it be related to the patches above?