Hey @kevinbazira do we already have a dockerfile to try this on staging?
Yesterday
Mon, Dec 15
Nice implementation!
Does last token pooling come from qwenlm?
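For context, last-token pooling takes the hidden state of the final non-padding token as the embedding of the whole text. Below is a minimal sketch in plain Python, assuming right-padded inputs; the function name and the toy data are illustrative and are not taken from the implementation under review.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the vector of the last real (non-padding) token per sequence.

    hidden_states: list of sequences, each a list of token vectors.
    attention_mask: list of 0/1 lists, 1 marking real tokens.
    Assumes right padding and at least one real token per sequence.
    """
    pooled = []
    for token_vectors, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1  # position of the last real token
        pooled.append(token_vectors[last_index])
    return pooled


# Toy batch: two sequences of three token vectors each (dimension 2).
hidden = [
    [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
    [[6.0, 7.0], [8.0, 9.0], [0.0, 0.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]
pooled = last_token_pool(hidden, mask)
```

With left-padded batches the last position is always a real token, so the index computation would not be needed.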
Fri, Dec 12
- Reporting 05/11/2025
Thu, Dec 11
Below is how long it would take to generate embeddings with different setups, comparing two models. Each line is progress-bar output in the form done/total [elapsed<remaining, rate]:
model_name = "Qwen/Qwen3-Embedding-0.6B"
- float16, all chars
205/207038 [00:55<15:41:24, 3.66it/s]
- float16, first 300 chars
206/207038 [00:27<7:37:03, 7.54it/s]
- float32, first 300 chars
264/207038 [01:06<14:31:14, 3.96it/s]
- float32, all chars
124/207038 [01:12<33:27:43, 1.72it/s]
model_name = "sentence-transformers/all-mpnet-base-v2"
- float32, all chars
209/207038 [00:21<5:51:39, 9.80it/s]
- float32, first 300 chars
228/207038 [00:10<2:32:48, 22.56it/s]
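The remaining-time figures above follow from the item count and the measured rate. A small sketch of that arithmetic, using only the numbers reported above (the helper name is illustrative, and the results differ slightly from the progress bar because the displayed rate is rounded):

```python
TOTAL_ITEMS = 207038  # total number of items, from the progress output


def remaining_hours(done, rate_per_second, total=TOTAL_ITEMS):
    """Estimate the remaining wall-clock hours at a constant rate."""
    return (total - done) / rate_per_second / 3600


# (label, items done, items per second), copied from the runs above.
runs = [
    ("Qwen3-Embedding-0.6B, float16, all chars", 205, 3.66),
    ("Qwen3-Embedding-0.6B, float16, first 300 chars", 206, 7.54),
    ("Qwen3-Embedding-0.6B, float32, first 300 chars", 264, 3.96),
    ("Qwen3-Embedding-0.6B, float32, all chars", 124, 1.72),
    ("all-mpnet-base-v2, float32, all chars", 209, 9.80),
    ("all-mpnet-base-v2, float32, first 300 chars", 228, 22.56),
]

estimates = {label: remaining_hours(done, rate) for label, done, rate in runs}
for label, hours in estimates.items():
    print(f"{label}: ~{hours:.1f} h remaining")
```

These early rates come from the first few hundred items only, so they are rough projections rather than measured totals.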
Tue, Dec 9
I agree.
Currently, the only indicator of the model version is the model hash (c4796c3c193d983980a445bb2a76f65def9f2459599fa6df055984bd851d3ca3 is the v2 zhwiki model).
I think we can switch to semantic versioning.
Mon, Dec 8
- Reporting 05/11/2025
Fri, Nov 28
Looking into 17-day periods:
I've created a list of currently in use models.
The models below have received at least one suggestion accept or suggestion reject since 2025-06-01.
The wikis are sorted by accept count in ascending order, so the wikis at the top of the list are used less.
I'll split the remaining deployments into three batches:
- Deployment 1: Deploy wikis between 1-50.
- Deployment 2: Deploy wikis between 51-113.
- Deployment 3: Deploy enwiki.
Please feel free to suggest another order.
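The split above can be sketched as follows. This is a minimal illustration that assumes the ranked list holds 113 wikis plus enwiki; the placeholder wiki names and the function name are hypothetical.

```python
def split_deployments(ranked_wikis, first_batch_size=50, last_wiki="enwiki"):
    """Split a ranked wiki list into two batches plus a final single-wiki batch."""
    remaining = [wiki for wiki in ranked_wikis if wiki != last_wiki]
    batches = [remaining[:first_batch_size], remaining[first_batch_size:]]
    if last_wiki in ranked_wikis:
        batches.append([last_wiki])  # deployed on its own, last
    return batches


# Placeholder names standing in for the real ranked list.
ranked = [f"wiki{i}" for i in range(1, 114)] + ["enwiki"]
batches = split_deployments(ranked)
print([len(batch) for batch in batches])
```

Changing the order only requires reordering the input list or adjusting `first_batch_size`.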
Wed, Nov 26
Mon, Nov 24
Started updating the following wikis:
Cool, thank you @Pablo,
We got results for itwiki:
Fri, Nov 21
The service works fine:
curl https://api.wikimedia.org/service/lw/inference/v1/models/reference-risk:predict -X POST -d '{"rev_id": 1322686680, "lang": "en"}'
{"model_name":"reference-risk","model_version":"2024-11","wiki_db":"enwiki","revision_id":1322686680,"reference_count":37,"survival_ratio":{"min":0.16666666666666666,"mean":0.6632285937319566,"median":0.6505386708644346},"reference_risk_score":0.08108108108108109}
https://en.wikipedia.org/w/index.php?title=MarketStar&oldid=1322686680
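The same request can be issued from Python. The sketch below builds the POST request with the standard library and parses the sample response shown above. The `Content-Type` header is an assumption (the curl call above does not set it), and the helper names are illustrative.

```python
import json
import urllib.request

URL = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-risk:predict"


def build_request(rev_id, lang):
    """Build (but do not send) the POST request for one revision."""
    payload = json.dumps({"rev_id": rev_id, "lang": lang}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
    )


def summarize(response_text):
    """Extract the fields of interest from a response body."""
    body = json.loads(response_text)
    return {
        "revision_id": body["revision_id"],
        "reference_count": body["reference_count"],
        "reference_risk_score": body["reference_risk_score"],
    }


# Sample response copied from the curl output above.
SAMPLE = (
    '{"model_name":"reference-risk","model_version":"2024-11","wiki_db":"enwiki",'
    '"revision_id":1322686680,"reference_count":37,'
    '"survival_ratio":{"min":0.16666666666666666,"mean":0.6632285937319566,'
    '"median":0.6505386708644346},"reference_risk_score":0.08108108108108109}'
)

request = build_request(1322686680, "en")
summary = summarize(SAMPLE)

# To call the live service instead of using the sample:
# with urllib.request.urlopen(request) as response:
#     summary = summarize(response.read().decode("utf-8"))
```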
The issue is that Deprecated or Blacklisted domains are quite rare (~120).
Please let me know if you get 0 for a URL that is Deprecated or Blacklisted, and we can look into it further.
Thu, Nov 20
thank you both @Sdkb and @Chipmunkdavis for reporting this issue,
Nov 14 2025
- Reporting 14/11/2025
Nov 6 2025
I've collected current performance rates and counts of the candidate wikis:
Nov 5 2025
Nov 4 2025
Oct 31 2025
I'm sharing final evaluation results for this phase:
Oct 30 2025
As discussed, I'm creating a new goal for deployments and closing this goal.
Oct 29 2025
I've started a patch to deploy new models here: https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/1199815
It's WIP but I think it should be ready to review tomorrow. I'll let you know.
Oct 21 2025
I think it should not be complicated to get offline scores for enwiki.
I'll get back to this and share the results soon after finishing some other tasks.
Please feel free to let me know if we should increase the priority.
Oct 17 2025
Oct 16 2025
articlequality deployed to staging successfully:
Oct 15 2025
I've added questions from two large models to the prototype UI:
gpt-oss:120b, aya:35b
Overall evaluation is in progress.
Oct 13 2025
I've split the models below into two groups:
Oct 10 2025
- I've checked several benchmarks related to QA generation:
Reporting (10/10/2025)
Progress update on the hypothesis for the week, including if something has shipped:
- Last time, we discussed closing this goal, as the new models have been moved to the new location.
- We will suggest to the Growth team that we deploy the inference service, if both teams agree.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- N/A
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- N/A
Oct 9 2025
Thank you for the comments.
We can run the experiments on larger LLMs. I've checked that we can use some larger models (tested with gpt-oss:120b, llama4:maverick) on mllab.
I'll look further into some public benchmarks and see whether we can re-run the experiments on a different set of LLMs.
I'll also revisit the evaluation part to see whether we can improve it with minimal human effort.
Oct 8 2025
Sharing the results for the larger dataset below.
I used the same model for evaluation and for querying because of the limits on the cloud models.
Oct 7 2025
Oct 6 2025
I've started a Toolforge app.
It is a Streamlit app, and the data is kept in the GitLab registry.
Can you share the implementation (dataset generation and application)?
I'm curious to know how it works in more detail, and it should also help with getting answers for the QA part.
Oct 3 2025
Reporting (03/10/2025)
Progress update on the hypothesis for the week, including if something has shipped:
- We have deployed the new models to the new location.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- N/A
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- As discussed, based on the availability of the Growth team, we can become the owner of the API. We can also split the goals into two:
  - Inference service deployments (MLTeam)
    - The current project has one model per wiki. We have previously discussed how to reduce the number of models.
    - The project has a MariaDB database where we store the data needed for inference.
  - Mediawiki deployments (Growth Team)
I've updated the prompt based on the previous scores.
Oct 2 2025
All models are deployed to the new location via the airflow dag.
Looking at the question-related scores, we generally get low scores for question_relevance_to_title and curiosity.
Oct 1 2025
I've switched the checks to a rubric-based approach in order to:
- Get better insights from generated QA
- Compare models from multiple perspectives.
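A rubric-based comparison could look roughly like the sketch below: each generated question is scored per criterion, and the scores are averaged per model and criterion. The two criterion names come from this thread; the model names, the 1-5 scale, and all scores are placeholders made up for illustration, not real results.

```python
from statistics import mean


def mean_scores(ratings):
    """Average rubric scores per (model, criterion) pair."""
    grouped = {}
    for rating in ratings:
        key = (rating["model"], rating["criterion"])
        grouped.setdefault(key, []).append(rating["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


# Placeholder ratings on an assumed 1-5 scale (NOT real evaluation data).
ratings = [
    {"model": "model_a", "criterion": "question_relevance_to_title", "score": 2},
    {"model": "model_a", "criterion": "question_relevance_to_title", "score": 4},
    {"model": "model_a", "criterion": "curiosity", "score": 1},
    {"model": "model_a", "criterion": "curiosity", "score": 2},
    {"model": "model_b", "criterion": "curiosity", "score": 3},
]

averages = mean_scores(ratings)
for (model, criterion), score in sorted(averages.items()):
    print(f"{model:8s} {criterion:28s} {score:.2f}")
```

Keeping one row per (model, criterion, question) makes it easy to add criteria later without changing the aggregation.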
Sep 30 2025
Results for both gpt-oss:20b and aya-expanse:32b are available in the spreadsheet.
Sep 26 2025
Reporting (26/09/2025)
Progress update on the hypothesis for the week, including if something has shipped:
- We have agreed with Growth Team to collaborate in October 2025.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- N/A
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- We have shared an analysis about case-sensitive recommendations.
- Deployments will start in October as agreed with the Growth Team.
My short-term suggestion is to make anchors case-sensitive and then train and evaluate the models, so that we can analyse the cases where performance increases or decreases.
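To illustrate what case sensitivity changes, here is a small sketch of an anchor lookup in both modes. It is a simplified stand-in, not the project's actual anchor dictionary; the function names and the example anchors are hypothetical.

```python
def build_anchor_index(anchor_targets, case_sensitive):
    """Map anchor text to the set of candidate link targets."""
    index = {}
    for anchor, target in anchor_targets:
        key = anchor if case_sensitive else anchor.lower()
        index.setdefault(key, set()).add(target)
    return index


def lookup(index, mention, case_sensitive):
    """Return candidate targets for a mention found in the article text."""
    key = mention if case_sensitive else mention.lower()
    return index.get(key, set())


# Hypothetical anchors: the same word with different casing and targets.
pairs = [("Apple", "Apple Inc."), ("apple", "Apple")]

insensitive = build_anchor_index(pairs, case_sensitive=False)
sensitive = build_anchor_index(pairs, case_sensitive=True)

candidates_insensitive = lookup(insensitive, "apple", case_sensitive=False)
candidates_sensitive = lookup(sensitive, "apple", case_sensitive=True)
```

In the case-insensitive index both targets compete for the mention "apple", while the case-sensitive index keeps them apart. The trade-off is that a case-sensitive index returns no candidates for casings that never appeared as anchors.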
My long-term suggestion would be to add the similarity between lower-level embeddings (e.g. paragraph-level) as an additional feature.
Sep 25 2025
I'm sharing an analysis on case-insensitivity on enwiki and simplewiki.
Sep 24 2025
Alternative ranking strategy from Fabian:
https://huggingface.co/BAAI/bge-reranker-v2-gemma
Sep 23 2025
hello @KStoller-WMF ,
I totally agree 💯 . All clear, thank you!
Sep 22 2025
Regarding the release of new wikis that are above the release threshold in v2 and do not have add-a-link onboarding tasks:
Below is the list of wikis that meet these criteria (47 in total).
The wikis are sorted approximately by size.
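The filtering described above can be sketched as follows. The field names, the threshold value, and the sample rows are all hypothetical placeholders, and sorting largest-first is an assumption.

```python
def release_candidates(wikis, threshold):
    """Keep wikis above the threshold that have no onboarding tasks yet,
    sorted by size (largest first; the direction is an assumption)."""
    eligible = [
        wiki
        for wiki in wikis
        if wiki["v2_score"] >= threshold and not wiki["has_onboarding_tasks"]
    ]
    return sorted(eligible, key=lambda wiki: wiki["size"], reverse=True)


# Placeholder rows (NOT real wiki data).
wikis = [
    {"name": "wiki_a", "v2_score": 0.80, "has_onboarding_tasks": False, "size": 1000},
    {"name": "wiki_b", "v2_score": 0.60, "has_onboarding_tasks": False, "size": 5000},
    {"name": "wiki_c", "v2_score": 0.90, "has_onboarding_tasks": True, "size": 9000},
    {"name": "wiki_d", "v2_score": 0.85, "has_onboarding_tasks": False, "size": 3000},
]

candidates = [wiki["name"] for wiki in release_candidates(wikis, threshold=0.75)]
```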
Hello good morning,
Sep 19 2025
Progress update on the hypothesis for the week, including if something has shipped:
- We propose a release plan in collaboration with the Growth Team. I understand they also want to add the wikis to the tasks, so we will update the plan.
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
- N/A
Any emerging blockers or risks:
- The serving patch needs to be reviewed/merged/deployed.
Any unresolved dependencies:
- N/A
New lessons from the hypothesis:
- N/A
Changes to the hypothesis scope or timeline:
- We are collaborating with the Growth Team on the release plan within the scope of this task.
- I know the inference API currently supports the wikis here.
- I also have the list of wikis that are below/above the release threshold.
- However, I'm missing the information about which wikis are currently enabled in tasks. Can you share this information? Have we already enabled tasks for all the wikis here? I can look into usage if this is not easy to find.
- As we want to enable tasks for wikis, I think we should rely on the list of wikis currently enabled in tasks, rather than the list of wikis that are currently being served. They might be the same, though; I just want to make sure.
Sep 12 2025
Sep 11 2025
I've calculated online scores for add-a-link here
I share the main highlights below:
We can re-use the notebook to calculate scores some time after the model releases.