Mon, Feb 9
Dataset for hackathon is created.
Please see the following notebook for a sample usage:
https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/edit_suggestions_dataset/edit_suggestions/generate_suggestions.ipynb?ref_type=heads
- enwiki revert rates are around 1%. We need to take this into account when drawing conclusions from revert data.
- There is no meaningful correlation between the accept/reject actions and how many links there are to the target article. (pearson: -0.01)
- There is no meaningful correlation between the accept/reject actions and the location of the recommendation.
- section level: -0.01
- paragraph level: -0.01
- section level percentage: 0.01
- paragraph level percentage: -0.005
- There is no meaningful correlation between the accept/reject actions and probability scores. (pearson: 0.11, although higher than the scores above.)
- There is no meaningful correlation between the accept/reject actions and the similarity between the source and the target page. (pearson: 0.2, although this is the highest correlation we have found so far.)
Thu, Jan 29
enwiki revert rates and counts:
Sharing some initial results:
Deployment completed. I've checked some of the wikis and they work fine.
Tue, Jan 27
The following wikis are deployed to prod and the others are in the queue.
I see that large wikis (e.g. dewiki) take a long time (~4 hours) and small wikis (e.g. hiwiki) get deployed quickly (~10 minutes).
Overall, I think it still makes sense to deploy them sequentially, so as not to add too much load to MariaDB.
Mon, Jan 26
Final list of wikis to update, with the release date today (26/01/2026):
training and staging deployments are completed.
The following wikis are below the release threshold. I'll remove them from the deployment and update the rest of the wikis.
Thu, Jan 22
dinwiki has failed: it does not have enough data for training, and it's one of the smallest wikis.
Tue, Jan 20
Looking into the models that we need to update based on:
Frontend enabled models: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/ext-GrowthExperiments.php
Wikis supported by v2: https://analytics.wikimedia.org/published/wmf-ml-models/addalink/v2/
zhwiki new model checksum.
c4950228598e64c08ae817df316f2f3127d93df27dfbcddfadd5f2550586bdff zhwiki.linkmodel.json
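A checksum like the one above can be verified before deployment with a small sketch (the file name is the one published above; how the production pipeline verifies it may differ):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the published checksum before deploying, e.g.:
# assert sha256_of("zhwiki.linkmodel.json") == "c4950228...86bdff"
```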
Mon, Jan 19
zhwiki v2 model checksum:
I've trained a model without countries and continents for zhwiki.
We get similar f1 scores. I'll proceed with deploying it.
Fri, Jan 16
Actually, I had an idea to stop recommending popular links: if there are already too many links to a page (e.g. a page in the 99th percentile), we can stop recommending it.
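The idea amounts to something like the sketch below. `popular_targets` is a hypothetical helper; the production pipeline may track inlink counts differently.

```python
from statistics import quantiles

def popular_targets(inlink_counts: dict[str, int], pct: float = 0.99) -> set[str]:
    """Return target pages whose inlink count is at or above the given
    percentile; these could then be excluded from recommendations.

    Hypothetical helper: inlink_counts maps target page title -> number
    of existing links pointing at it.
    """
    counts = sorted(inlink_counts.values())
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    threshold = quantiles(counts, n=100)[int(pct * 100) - 1]
    return {t for t, c in inlink_counts.items() if c >= threshold}
```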
hi @KStoller-WMF ,
crystal clear, thank you!
Wed, Jan 14
Do we need this change for all wikis or only for zhwiki?
In other words, do we want to change it for all wikis but more urgently for zhwiki?
Tue, Jan 13
models to pick
We are currently discussing it here:
https://wikimedia.slack.com/archives/G01A0FNPLG4/p1768310436209719
Jan 9 2026
Reporting 09/01/2026
Progress update on the hypothesis for the week, including if something has shipped:
Thank you again @dcausse ,
Dataset for hackathon is created.
Please see the following notebook for a sample usage:
https://gitlab.wikimedia.org/repos/machine-learning/exploratory-notebook/-/blob/edit_suggestions_dataset/edit_suggestions/generate_suggestions.ipynb?ref_type=heads
Closing this task as we can follow it up with the previous one:
Jan 8 2026
I agree! thank you @dcausse
With the median query length (77 chars) plus the prompt (108 chars):
- max latency: 290ms.
- 99.9 percentile latency: 280ms.
- median latency: 34ms
@dcausse , cool.
I'll update the service and the performance tests accordingly.
We get better results on prod.
Do we plan to query the api on prod with the following prompt?
We set the max length to 300 chars, so if the query text is longer than 300 chars, only the first 300 chars will be used.
We can increase it if we expect longer text.
The following prompt is ~90 chars.
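The truncation rule amounts to a simple character cut (a sketch; the service may apply it at a different layer):

```python
MAX_QUERY_CHARS = 300  # configurable; raise it if we expect longer text

def truncate_query(text: str, max_chars: int = MAX_QUERY_CHARS) -> str:
    """Keep only the first max_chars characters of the query text."""
    return text[:max_chars]
```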
Jan 6 2026
I think we have similar behavior in good faith as well.
We see successful calls with the same rev_id right before the errors.
Indeed, caching should reduce the load on the mwapi.
thank you @kevinbazira
Great! I was looking for this task
Hey @gkyziridis @kevinbazira
I remember we were discussing timeout errors in mwapi but I could not find the related task.
Do you know if we got a similar timeout error from the MediaWiki API before in another task?
Thank you!
Looking into the errors, they are mostly due to timeouts during the mwapi calls.
The timeout is 5 seconds.
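One mitigation is retrying on timeout with backoff; a generic sketch (the `fn` callable stands in for the actual mwapi request with its 5-second timeout; the production client may already handle retries itself):

```python
import time

def call_with_retry(fn, retries: int = 3, backoff: float = 1.0):
    """Call fn(), retrying on TimeoutError with linear backoff.

    fn is assumed to raise TimeoutError when the underlying
    request exceeds its timeout (e.g. the 5s mwapi limit).
    """
    for attempt in range(retries):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))
```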
Dec 22 2025
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 250, Max length: 350
question_length
count 65.000000
mean 301.353846
std 28.316532
min 250.000000
25% 283.000000
50% 303.000000
75% 324.000000
max 348.000000
[2025-12-22 13:49:40,609] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-12-22 13:49:40,609] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-12-22 13:49:40,610] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-12-22 13:49:40,610] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2025-12-22 13:51:40,147] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-12-22 13:51:40,224] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict 1902 0(0.00%) | 74 67 292 70 | 15.90 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1902 0(0.00%) | 74 67 292 70 | 15.90 0.00
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ export MAX_LENGTH=350
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ export MIN_LENGTH=100
(venv) ozge@stat1010:~/repos/wiki/gerrit/inference-services/test/locust$ MODEL=embeddings locust
Min length: 100, Max length: 350
question_length
count 732.000000
mean 135.498634
std 58.858208
min 100.000000
25% 105.000000
50% 112.000000
75% 128.000000
max 348.000000
[2025-12-22 13:44:47,723] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-12-22 13:44:47,723] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-12-22 13:44:47,724] stat1010/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-12-22 13:44:47,724] stat1010/INFO/locust.runners: All users spawned: {"Embeddings": 2} (2 total users)
[2025-12-22 13:46:47,260] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-12-22 13:46:47,339] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST https://inference-staging.svc.codfw.wmnet:30443/v1/models/qwen3-embedding:predict 1890 0(0.00%) | 74 66 304 70 | 15.80 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1890 0(0.00%) | 74 66 304 70 | 15.80 0.00
Staging results.
Results with a new setup, run locally.
Performance test results run locally on CPU:
Dec 18 2025
Dec 17 2025
Great idea! Let's turn this into a goal. I think it's fine not to create child tickets for now.
I have added checkboxes to the description indicating each step/task.
I'll update them as we progress and I can add weekly updates here.
Thank you!
Dec 16 2025
Hey @kevinbazira, do we already have a Dockerfile to try this on staging?
Dec 15 2025
Nice implementation!
Does last-token pooling come from qwenlm?
Dec 12 2025
- Reporting 12/11/2025
Dec 11 2025
Below I share how long it will take to generate embeddings with different setups, comparing two models:
model_name = "Qwen/Qwen3-Embedding-0.6B"
- float16, all chars
205/207038 [00:55<15:41:24, 3.66it/s]
- float16 , first 300 chars.
206/207038 [00:27<7:37:03, 7.54it/s]
- float32 , first 300 chars.
264/207038 [01:06<14:31:14, 3.96it/s]
- float32 , all chars.
124/207038 [01:12<33:27:43, 1.72it/s]
model_name = "sentence-transformers/all-mpnet-base-v2"
- float32 , all chars.
209/207038 [00:21<5:51:39, 9.80it/s]
- float32 , first 300 chars.
228/207038 [00:10<2:32:48, 22.56it/s]
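The comparison above varies dtype and character truncation and reads off items/s; the timing itself can be sketched generically (`encode_fn` stands in for the model's encode call, e.g. a SentenceTransformer; nothing model-specific is assumed here):

```python
import time

def items_per_second(encode_fn, texts, max_chars=None):
    """Measure encode throughput, optionally truncating each text first
    (e.g. max_chars=300 to reproduce the 'first 300 chars' runs)."""
    if max_chars is not None:
        texts = [t[:max_chars] for t in texts]
    start = time.perf_counter()
    for t in texts:
        encode_fn(t)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed
```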
Dec 9 2025
I agree.
Currently, the only indicator of the model version is the model hash (c4796c3c193d983980a445bb2a76f65def9f2459599fa6df055984bd851d3ca3 is the v2 zhwiki model).
I think we can introduce semantic versioning.
Dec 8 2025
- Reporting 05/11/2025
Nov 28 2025
Looking into 17-day periods:
I've created a list of currently in use models.
These models below got at least one suggestion accept or suggestion reject since 2025-06-01.
The wikis are sorted by accept count; therefore, the wikis higher up the list are used less.
I'll split the remaining deployments into the following batches.
- Deployment 1: Deploy wikis between 1-50. (28/11/2025)
- Deployment 2: Deploy wikis between 51-80. (01/12/2025)
- Deployment 3: Deploy wikis between 81-113. (09/01/2026)
- Deployment 4: Deploy enwiki. (12/01/2026)
Please feel free to suggest another order.
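The batching above can be sketched as a simple consecutive split of the ordered wiki list (a hypothetical helper mirroring the plan: 50 / 30 / 33 wikis, then enwiki on its own):

```python
def batches(wikis: list[str], sizes: list[int]) -> list[list[str]]:
    """Split an ordered wiki list into consecutive deployment batches;
    whatever is left over becomes the final batch (here, enwiki)."""
    out, start = [], 0
    for size in sizes:
        out.append(wikis[start:start + size])
        start += size
    out.append(wikis[start:])  # remainder
    return out
```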
Nov 26 2025
Nov 24 2025
Started updating the following wikis:
cool, thank you @Pablo ,
We got results for itwiki:
Nov 21 2025
The service works fine:
curl https://api.wikimedia.org/service/lw/inference/v1/models/reference-risk:predict -X POST -d '{"rev_id": 1322686680, "lang": "en"}'
{"model_name":"reference-risk","model_version":"2024-11","wiki_db":"enwiki","revision_id":1322686680,"reference_count":37,"survival_ratio":{"min":0.16666666666666666,"mean":0.6632285937319566,"median":0.6505386708644346},"reference_risk_score":0.08108108108108109}
https://en.wikipedia.org/w/index.php?title=MarketStar&oldid=1322686680
The issue is that Deprecated or Blacklisted domains are quite rare (~120).
Please feel free to let me know if you get 0 for a URL which is Deprecated or Blacklisted, and we can take a look further.
Nov 20 2025
thank you both @Sdkb and @Chipmunkdavis for reporting this issue,
Nov 14 2025
- Reporting 14/11/2025
Nov 6 2025
I've collected current performance rates and counts of the candidate wikis:
Nov 5 2025
Nov 4 2025
Oct 31 2025
I'm sharing final evaluation results for this phase: