User Details
- User Since
- Jan 6 2025, 12:21 PM (74 w, 1 d)
- Availability
- Available
- IRC Nick
- georgekyz
- LDAP User
- Gkyziridis
- MediaWiki User
- GKyziridis-WMF
Fri, Jun 5
@Clement_Goubert thank you for your comments, they were super informative and helpful.
All the necessary actions are already taken and the corresponding patches are already under review.
The plan is to start deployments Monday next week following this order:
- Merge and publish the image of liftwing-openapi-server in registry: patch-inference-services
- Kubeconfig files (+mesh that could be done later but is not a problem to do right there): patch-puppet
- Deploy both the admin_ng part and the actual service: patch-deployment-charts
- Ingress configuration and DNS: patch-operations/DNS
- LVS in service_setup, configuring the liftwing-openapi-server in "hieradata/common/service.yaml" in a separate puppet patch
Wed, Jun 3
Thank you both @isarantopoulos and @Clement_Goubert for your help.
So just to make sure that I've understood what you suggest:
- Create a repo in gitlab under: repos/sre/miscweb/liftwing-openapi-specs
- Add the /docs files in there and configure the pipeline similar to static-codereview repo.
- Create a new helmfile in helmfile.d/services/miscweb for the deployment.
We decided to go with option 1: a dedicated endpoint, liftwing-openapi-server, which serves the umbrella yaml for all endpoints. I've tested it locally and it works fine with the RestSandbox.
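To illustrate what an "umbrella" spec could look like, here is a minimal sketch that merges per-model OpenAPI path definitions into a single document. The function name `build_umbrella_spec`, the `MODEL_SPECS` fragments, and the title/version values are all assumptions made for illustration; the real specs live in the /docs yaml files and the actual server may assemble them differently.

```python
import json


def build_umbrella_spec(model_specs, title="Lift Wing API (sketch)"):
    """Merge per-model OpenAPI 'paths' sections into one umbrella document."""
    umbrella = {
        "openapi": "3.0.3",
        "info": {"title": title, "version": "0.0.1"},  # placeholder metadata
        "paths": {},
    }
    for model_name, spec in sorted(model_specs.items()):
        for path, item in spec.get("paths", {}).items():
            # Two models must not claim the same path in the merged document.
            if path in umbrella["paths"]:
                raise ValueError(f"duplicate path {path!r} (from {model_name})")
            umbrella["paths"][path] = item
    return umbrella


# Hypothetical fragments; real ones would be loaded from the per-model yaml files.
MODEL_SPECS = {
    "revertrisk-language-agnostic": {
        "paths": {
            "/v1/models/revertrisk-language-agnostic:predict": {
                "post": {"responses": {"200": {"description": "OK"}}}
            }
        }
    },
    "revertrisk-multilingual": {
        "paths": {
            "/v1/models/revertrisk-multilingual:predict": {
                "post": {"responses": {"200": {"description": "OK"}}}
            }
        }
    },
}

umbrella = build_umbrella_spec(MODEL_SPECS)
print(json.dumps(umbrella, indent=2))
```

Failing loudly on a duplicate path is a design choice of this sketch: it avoids one model's definition silently overwriting another's in the merged document.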
Tue, Jun 2
Thnx for your comments @apaskulin!
I was investigating @Clement_Goubert's option for exposing the openapi specs on the Kserve level by overriding the standard specs with the custom ones.
This works fine; I tested it locally by configuring it in the RestSandbox.
Mon, Jun 1
Thank you all for your advice and comments. It is much appreciated!
I would like to add some clarification to some specific points:
- We would like to deliver something that works and ship it fast; we have a hard deadline on June 10.
- We are not making changes on the services and on the model's schemas, which means that we will not need multiple deployments.
Thu, May 28
# Clone mediawiki-config and fetch the patch.
git clone "https://gerrit.wikimedia.org/r/operations/mediawiki-config"
cd mediawiki-config
git fetch origin refs/changes/88/1294988/1
git checkout FETCH_HEAD
cd ..
Hi @apaskulin thnx for your work in the openapi-specs, it is much appreciated.
Based on the discussion we had in this slack-thread and after investigating multiple options, we ended up adding these /docs yaml files to the mediawiki-config repository under the /static/liftwing-openapi-specs/ directory. I filed this patch to do that.
When this patch is merged and deployed, then we can add the rest specs for the models over there.
Fri, May 22
Hi @hashar thank you very much for your comment.
Ideally, we would like to go with the easiest way, using something that already exists, such as fetching the specs from the main repo, e.g. github.raw (tested on this comment). This way we avoid implementing extra services.
Thu, May 21
Hey @apaskulin, thnx for your comments.
We are still figuring out how we will expose the openapi-specs docs.
We decided to go first with the current models that are configured in /docs and to avoid the revscoring/ORES models (with the multi-wiki pattern) for now. We will discuss them in a second iteration.
Wed, May 20
Classic Kserve v1 API call:
curl -s -i https://inference.svc.eqiad.wmnet:30443/v1/models/qwen3-14b:predict \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Host: qwen3-14b.experimental.wikimedia.org" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 50}'
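For scripting, the same call can be built in Python. This is a minimal sketch using only the standard library; the helper name `build_predict_request` is an assumption, and the request is only constructed here, not sent, since sending it needs access to the internal endpoint.

```python
import json
import urllib.request


def build_predict_request(model, host_header, payload,
                          base_url="https://inference.svc.eqiad.wmnet:30443"):
    """Build (but do not send) a KServe v1 ':predict' POST request."""
    url = f"{base_url}/v1/models/{model}:predict"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The Host header selects the model's virtual host, as in the curl call.
            "Host": host_header,
        },
    )


req = build_predict_request(
    "qwen3-14b",
    "qwen3-14b.experimental.wikimedia.org",
    {"prompt": "What is the capital of France?", "max_tokens": 50},
)
print(req.get_method(), req.full_url)

# Sending it requires network access to the internal endpoint:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```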
Tue, May 19
I can query event_sanitized.mediawiki_page_revert_risk_multilingual_prediction_change_v1 and see results.
Thank you very much for your help @Ottomata !
Mon, May 18
I cannot see results in `event_sanitized.mediawiki_page_revert_risk_multilingual_prediction_change_v1`:
Wed, May 13
This is an initial report for qwen model deployment using the optimize-model skill.
I found it pretty detailed, so I am pasting it here.
- Qwen36-27B Optimization Report
You can pull the mediawiki-config patch and follow these steps to reproduce and test it locally.
Configuration of 46 wikis + testwiki on changeprop is merged and deployed.
We can see streams coming from multiple wikis in https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.page_revert_risk_multilingual_prediction_change.v1.
Tue, May 12
I created a subtask for configuring the openapi specs in the mediawiki-config: https://phabricator.wikimedia.org/T426081
I will paste there my findings.
Mon, May 11
Thank you very much @Ottomata !
I went ahead and tested a public, api.wikimedia.org Lift Wing POST endpoint in Swagger UI using the anonymous example from the docs, and the API request successfully returned the expected response. Unless I'm missing something, this means that a complete OpenAPI spec for the public Lift Wing endpoints will work in Swagger UI even though the endpoints are v1 and POST.
This is tested by @apaskulin and by me (in the comment above) using a Swagger-UI, so we can move forward and test how we can use these yaml files in the Rest Sandbox.
Update
Since KServe does not allow the v1 POST endpoints to be tested in the default Swagger-UI, we decided to retrieve the openapi specs from the LiftWing model server and make them available for the Rest Sandbox.
We can close this task.
More information can be found in T419455.
I +1'd the event-sanitisation patch: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1283749
I think DE needs to deploy? I do not have +2 permissions on the patch.
Hey @apaskulin thnx for your update.
Would this be one API spec in the repository or multiple specs, such as one under each model directory?
I think it would be better to have a /docs directory at the top level of the repo where we can add/configure examples and openapi specs for all of the models.
May 6 2026
- I configured the rest of the wikis for revertrisk-multilingual on the changeprop repo: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1283758
- Add the mediawiki_page_revert_risk_multilingual_prediction_change_v1 to event_sanitized: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1283749
May 4 2026
The new stream is deployed on Eventstreams.
I can see streams at: stream.wikimedia.org.
I am still getting some errors:
Apr 30 2026
I am having errors on the current deployment, please check these logs in the paste:
Apr 29 2026
I think that those errors are due to missing revisions which are randomly selected from the "data/revisions_lang_and_id.tsv" during the locust test.
Basically, it is a specific revision_id that is missing; check the logs:
INFO:root:Model Server: RevertRiskMultilingualGPU
INFO:root:Successfully loaded 342 canonical wiki languages.
WARNING:root:CUDA is not available or PyTorch is CPU-only; using CPU instead.
2026-04-29 14:47:07.078 1 kserve INFO [model_server.py:register_model():402] Registering model: revertrisk-multilingual
2026-04-29 14:47:07.079 1 kserve INFO [model_server.py:setup_event_loop():282] Setting max asyncio worker threads as 32
2026-04-29 14:47:07.130 1 kserve INFO [server.py:_register_endpoints():110] OpenAI endpoints not registered
2026-04-29 14:47:07.130 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2026-04-29 14:47:07.144 1 uvicorn.error INFO: Started server process [1]
2026-04-29 14:47:07.144 1 uvicorn.error INFO: Waiting for application startup.
2026-04-29 14:47:07.148 1 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2026-04-29 14:47:07.149 1 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2026-04-29 14:47:07.149 1 uvicorn.error INFO: Application startup complete.
2026-04-29 14:47:07.149 1 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO:root:Opening a new Asyncio session for mwapi.
INFO:root:revision 1096365864 (en): revision_missing
INFO:root:revision 1096365864 (en): revision_missing
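To find which revisions should be dropped from the test data, the log can be scanned for `revision_missing` lines. The following is a minimal sketch; the regular expression is derived only from the two log lines above, and the helper name `missing_revisions` is an assumption.

```python
import re
from collections import Counter

# Matches lines such as: "INFO:root:revision 1096365864 (en): revision_missing"
MISSING_RE = re.compile(r"revision (\d+) \(([\w-]+)\): revision_missing")


def missing_revisions(log_lines):
    """Count how often each (lang, rev_id) pair was reported as missing."""
    counts = Counter()
    for line in log_lines:
        match = MISSING_RE.search(line)
        if match:
            rev_id, lang = int(match.group(1)), match.group(2)
            counts[(lang, rev_id)] += 1
    return counts


LOG = """INFO:root:Opening a new Asyncio session for mwapi.
INFO:root:revision 1096365864 (en): revision_missing
INFO:root:revision 1096365864 (en): revision_missing"""

counts = missing_revisions(LOG.splitlines())
for (lang, rev_id), n in counts.items():
    print(f"{lang}\t{rev_id}\tmissing x{n}")
```

The resulting (lang, rev_id) pairs could then be used to filter "data/revisions_lang_and_id.tsv" before the next locust run.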
Hey @achou, I think that this change is adding some extra latency as well.
Apr 28 2026
Response time percentiles (approximated)
Type     Name                                                50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|------------------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/revertrisk-multilingual:predict          820   1300   1800   2200   3200   4200   5400   5900   8200  10000  10000   1743
--------|------------------------------------------------|------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                          820   1300   1800   2200   3200   4200   5400   5900   8200  10000  10000   1743
Apr 21 2026
I am pasting here some results from loading tests.
Locust Test results:
locust RevertriskMultilingual --headless --users 35 --spawn-rate 5 --run-time 120s --only-summary
[2026-04-21 15:17:11,442] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2026-04-21 15:17:11,442] stat1010/INFO/locust.runners: Ramping to 35 users at a rate of 5.00 per second
[2026-04-21 15:17:17,447] stat1010/INFO/locust.runners: All users spawned: {"RevertriskMultilingual": 35} (35 total users)
[2026-04-21 15:19:10,981] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type     Name                                              # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/revertrisk-multilingual:predict           931     7(0.75%) |   1351      65   24093    430 |    7.81        0.06
--------|------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                           931     7(0.75%) |   1351      65   24093    430 |    7.81        0.06
Apr 2 2026
Testing again on staging
I used this event for testing:
The latest version of revertrisk-multilingual model handling the stream is deployed on production.
$ kube_env revertrisk ml-serve-eqiad
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-language-agnostic-predictor-00004-deployment-b477w7x   3/3     Running   0          11m
revertrisk-language-agnostic-predictor-00004-deployment-b4b6mnj   3/3     Running   0          9m36s
revertrisk-language-agnostic-predictor-00004-deployment-b4t4jlt   3/3     Running   0          11m
revertrisk-language-agnostic-predictor-00004-deployment-b4tfdlw   3/3     Running   0          9m36s
revertrisk-language-agnostic-predictor-00004-deployment-b4vz7cp   3/3     Running   0          11m
revertrisk-language-f284bff08aba54bd309680bad6316c0a-deplos7tj6   3/3     Running   0          11m
revertrisk-multilingual-pre-save-predictor-00003-deploymen8xsr4   3/3     Running   0          11m
revertrisk-multilingual-predictor-00004-deployment-5c575f98bbms   3/3     Running   0          11m
revertrisk-multilingual-predictor-00004-deployment-5c575f9flqt8   3/3     Running   0          11m
revertrisk-multilingual-predictor-00004-deployment-5c575f9n78wg   3/3     Running   0          11m
revertrisk-wikidata-predictor-00003-deployment-6558b4b65-k8wvg    3/3     Running   0          11m
revertrisk-wikidata-predictor-00003-deployment-6558b4b65-q4lhh    3/3     Running   0          11m
Mar 31 2026
I recall we verified it works on staging. Is there anything left to do before we move it to production?
Hey @achou, yes indeed, I will work on this tomorrow. I do not think we are missing anything else in order to go to production.
Mar 17 2026
Mar 12 2026
I do not think that there are any other different metrics to measure for this change. We will keep monitoring the latency and throughput as we already do, and the error rates as well.
If we see that we are still having issues with big batches, we can raise that number.
I think 300 seems ok for now.
Mar 11 2026
gkyziridis@deploy2002:$ kube_env edit-check ml-serve-codfw
gkyziridis@deploy2002:$ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
edit-check-predictor-00003-deployment-ff659c867-nhqm2   4/4     Running   0          7m58s
Mar 10 2026
Feb 24 2026
Build Image:
docker build -f .pipeline/revertrisk/multilingual.yaml --target production --platform=linux/amd64 -t multilingual:events .
Feb 18 2026
I was experimenting with the option:
Another option: make a .v2 stream with a different/new or just new major version 2.0.0 schema that supports multiple model predictions per event, either via a array of them, or a map of them. The downside would be that evolving the items in the array or map would not be easily supported (it's complicated).
And I found it kinda complicated. I think we can go with the option of creating a different (dedicated) stream for the rr-multilingual predictions, something like: EVENTGATE_STREAM=mediawiki.page_revert_risk_multilingual_prediction_change.v1. This will separate the streams, right?
This way we have two different streams pointing to the same schema, and in the deployment charts we set the corresponding EVENT_STREAM value for each of the rr models. We also set the correct values under the changeprop so we maintain two different streams.
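As a rough sketch of this idea: each model server reads its stream name from the environment while both emit events against the same schema. The schema URI, the second stream name, and the function names below are placeholders I made up for illustration; only the multilingual stream name comes from the comment above.

```python
import os

# Placeholder schema URI: both streams share one schema, but the real URI is not given here.
SHARED_SCHEMA = "/hypothetical/prediction_change/1.0.0"
DEFAULT_STREAM = "mediawiki.page_revert_risk_multilingual_prediction_change.v1"


def stream_name(environ=os.environ):
    """Pick the target stream from EVENTGATE_STREAM, as set per model in the charts."""
    return environ.get("EVENTGATE_STREAM", DEFAULT_STREAM)


def build_event(prediction, environ=os.environ):
    """Wrap a prediction in an event whose meta.stream names the dedicated stream."""
    return {
        "$schema": SHARED_SCHEMA,
        "meta": {"stream": stream_name(environ)},
        "prediction": prediction,
    }


# Two model servers, each with its own environment (second stream name is hypothetical).
multilingual_env = {"EVENTGATE_STREAM": DEFAULT_STREAM}
agnostic_env = {"EVENTGATE_STREAM": "hypothetical.language_agnostic_stream.v1"}

ev_ml = build_event({"score": 0.9}, multilingual_env)
ev_la = build_event({"score": 0.1}, agnostic_env)
print(ev_ml["meta"]["stream"])
print(ev_la["meta"]["stream"])
```

Passing the environment in as an argument keeps the sketch easy to test without touching the real process environment.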
Feb 11 2026
@Ottomata thank you for the comments.
We are not yet in a state to deploy this to production. I just built it like this in order to understand the flow and test it on staging as well.
Currently many people from our team are absent, so we will make the final decisions when they are back.
For now I just implemented this and we can test things on staging.
I will experiment with the alternatives as well:
Finished the implementation of the event mechanism in inference-services for the rr-multilingual model.
This is the local testing on my machine:
Feb 6 2026
Update
Since task T406217 is finished, we have a first version of the end-to-end pipeline, including all the basic steps of an ML lifecycle: Data Generation -> Model Training -> Export model to S3 bucket.
More info can be found here: https://phabricator.wikimedia.org/T398970
Generate Data (SparkSubmitOperator) -> Train/Validation/Test split (SparkSubmitOperator) -> Copy from HDFS to a PVC (WMFKubernetesPodOperator) -> Train model on GPU pod (WMFKubernetesPodOperator) -> Copy retrained model to S3 (PythonOperator)
Hey, I am working on this, I think that I have finished the implementation for publishing the predictions in events. I am now testing it locally.
Based on this: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Streams I think there are these steps:
- Implementation on inference-services side (this is what I am testing).
- Test it and deploy the new model server versions.
- Configure Changeprop.
- Configure the new changes in the mediawiki-config repo.
Feb 3 2026
Jan 30 2026
Jan 29 2026
Hey @Isaac, this ticket is assigned to @klausman but he is currently on his sabbatical. He will start working on this when he is back, I think around next month (???).
I am tagging @DPogorzelski-WMF here for visibility, maybe he has something more to add.
Update
The end-to-end tone-check retraining pipeline succeeded; we solved the Multi-Attach PVC issues.
The new version of the retrained tone-check model is successfully copied to the dedicated S3 bucket under: s3://wmf-ml-models/retrained-models/tone-check/, here are the logs of the export step:
Here is the content of the S3 bucket:
$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/
2026-01-28 22:24    865  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/config.json
2026-01-28 22:24   678M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/model.safetensors
2026-01-28 22:24  1357M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/optimizer.pt
2026-01-28 22:24    13K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/rng_state.pth
2026-01-28 22:24   1064  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/scheduler.pt
2026-01-28 22:24    695  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/special_tokens_map.json
2026-01-28 22:24     2M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/tokenizer.json
2026-01-28 22:24   1330  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/tokenizer_config.json
2026-01-28 22:24     9K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/trainer_state.json
2026-01-28 22:24     5K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/training_args.bin
2026-01-28 22:24   972K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-21530/vocab.txt
Jan 28 2026
We currently do not store the predictions from the rr-multilingual model anywhere, so we cannot export them in the same way that we do for the rr-language-agnostic one.
If this is needed, I can open a new Phabricator task to start developing the first step, saving the slice of the rr-multilingual predictions into the event stream. Then we can add them to the refinery and export them into event_sanitized, as we do for rr-language-agnostic.
Jan 27 2026
I also checked the PVC using kubectl and I see that the PVC is "RWO" ("ReadWriteOnce"). I am not sure if this causes the problem:
$ kube_env airflow-ml-deploy dse-k8s-eqiad
$ kubectl get pvc airflow-ml-model-training -n airflow-dev
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
airflow-ml-model-training   Bound    pvc-8a6a2920-8d7e-4616-8ab6-a6a70b26d116   20Gi       RWO            ceph-rbd-ssd   151d
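For context on why RWO could matter: a ReadWriteOnce volume can be mounted read-write by a single node only, so two pods scheduled on different nodes cannot both attach it. The sketch below only models that rule; the function name and the pod/node names are hypothetical, and it does not query the cluster.

```python
def multi_attach_conflict(access_modes, pod_nodes):
    """Return True if pods on more than one node would share a single-node volume.

    access_modes: short mode names as shown by kubectl, e.g. ["RWO"].
    pod_nodes: mapping of pod name -> node name for pods mounting the PVC.
    """
    multi_node_modes = {"ROX", "RWX"}  # ReadOnlyMany, ReadWriteMany
    if multi_node_modes & set(access_modes):
        return False
    # RWO allows several pods only as long as they all run on the same node.
    return len(set(pod_nodes.values())) > 1


# Hypothetical placement: the copy step and the training step land on different nodes.
pods = {"copy-hdfs-to-pvc": "node-a", "train-model-gpu": "node-b"}
print(multi_attach_conflict(["RWO"], pods))
print(multi_attach_conflict(["RWX"], pods))
```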
Jan 21 2026
$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H --recursive s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/
2026-01-20 13:33    865  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/config.json
2026-01-20 13:33   678M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/model.safetensors
2026-01-20 13:33  1357M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/optimizer.pt
2026-01-20 13:33    13K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/rng_state.pth
2026-01-20 13:33   1064  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/scheduler.pt
2026-01-20 13:33    695  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/special_tokens_map.json
2026-01-20 13:33     2M  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/tokenizer.json
2026-01-20 13:33   1330  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/tokenizer_config.json
2026-01-20 13:33    24K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/trainer_state.json
2026-01-20 13:33     5K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/training_args.bin
2026-01-20 13:33   972K  s3://wmf-ml-models/retrained-models/tone-check/checkpoint-63618/vocab.txt
Jan 20 2026
Jan 15 2026
Jan 12 2026
Jan 9 2026
curl -s -X POST "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-language-agnostic:predict" \
  -d '{"rev_id": 2, "lang": "test"}' \
  -H "Host: revertrisk-language-agnostic.revertrisk.wikimedia.org"
Jan 6 2026
Things we need to keep in mind:
- Testwiki is not a canonical/normal wiki so it is excluded from canonical_wikis list
- Testwiki is not a supported wiki for the revertrisk model, so predictions will be completely inaccurate.
- We treat testwiki as enwiki on the fly in order for the revert-risk model server to accept such API hits posting {"lang": "test"}
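The "treat testwiki as enwiki on the fly" point can be sketched as a small alias lookup applied before the canonical-wiki check. This is a minimal illustration, not the model server's actual code: `CANONICAL_WIKIS` is a tiny made-up subset and the names `LANG_ALIASES`, `resolve_lang`, and `normalize_request` are assumptions.

```python
# Hypothetical subset; the real canonical list is much larger and excludes "test".
CANONICAL_WIKIS = {"en", "de", "fr"}

# testwiki is not canonical, so it is mapped to enwiki before validation.
LANG_ALIASES = {"test": "en"}


def resolve_lang(lang):
    """Map aliased wikis to a supported language, then validate the result."""
    resolved = LANG_ALIASES.get(lang, lang)
    if resolved not in CANONICAL_WIKIS:
        raise ValueError(f"unsupported lang: {lang!r}")
    return resolved


def normalize_request(payload):
    """Return a copy of the request payload with 'lang' resolved."""
    return {**payload, "lang": resolve_lang(payload["lang"])}


print(normalize_request({"rev_id": 2, "lang": "test"}))
```

Since testwiki content is scored as if it were enwiki, the request is accepted, but as noted above the predictions themselves are not meaningful.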
