
Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing
Open, Needs Triage · Public

Description

Request details

  • What use case is the model going to support/resolve?

Detecting bad edits on Wikidata. Similar to Wikipedia Revert Risk, the Wikidata version is an improvement on the previous models (ORES) currently running on LiftWing.

Not yet, but we have a peer-reviewed paper published that explains the model.

  • What team created/trained/etc. the model? What tools and frameworks have you used?

Team: research
Main Technology: BERT

  • What kind of data was the model trained with, and what kind of data will the model need in production (for example, calls to internal/external services, special data sources for features, etc.)?

The model was trained using the Research team's content-diff data. At inference time, the model only needs to make calls to Wikibase. The approach is similar to the one used for Wikipedia Revert Risk Multilingual.

  • If you have a minimal codebase that you used to run the first tests with the model, could you please share it?

Repo

  • State what team will own the model and please share the main points of contact (see more info in '''Ownership of a model''').

Model was developed by the Research team. The productization is requested by Wikimedia Enterprise.

  • What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballpark numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc. to respond to queries? How does it react when 1/10/20/etc. requests are made in parallel? If you don't have these numbers don't worry, open the task and we'll figure something out while we discuss next steps!

The architecture is the same as Revert Risk Multilingual, so similar serving times should be expected.

A "wme" tier rate limit of 200K requests per hour applies.

https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Request_a_bearer_token

  1. Response time for 90% of the requests should be <= 500 ms
  • Is there an expected frequency in which the model will have to be retrained with new data? What are the resources required to train the model, and what was the dataset size?

Recommended: monthly.
Critical: yearly.

  • Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive to somebody? Even if you have any slight worry or corner case, please tell us!

Model has been evaluated using

  • Everything else that is relevant in your opinion.

Timing

Target delivery by end of December. WME has a contractual latest possible date to release our v1 of Wikidata product in January, and it would be ideal to launch with RR.

Weekly reporting structure

Progress update on the hypothesis for the week, including if something has shipped:
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
Any emerging blockers or risks:
Any unresolved dependencies:
New lessons from the hypothesis:
Changes to the hypothesis scope or timeline:

Details

Other Assignee
kevinbazira
Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +0 -20
machinelearning/liftwing/inference-services | main | +1 -1
operations/puppet | production | +14 -0
operations/deployment-charts | master | +25 -0
machinelearning/liftwing/inference-services | main | +2 -2
operations/deployment-charts | master | +2 -2
machinelearning/liftwing/inference-services | main | +32 -20
machinelearning/liftwing/inference-services | main | +64 -0
operations/deployment-charts | master | +20 -20
operations/deployment-charts | master | +20 -0
machinelearning/liftwing/inference-services | main | +287 -1
machinelearning/liftwing/inference-services | main | +14 -16
operations/deployment-charts | master | +4 -4
integration/config | master | +2 -2
machinelearning/liftwing/inference-services | main | +41 -15
machinelearning/liftwing/inference-services | main | +1 -1
machinelearning/liftwing/inference-services | main | +968 -0

Related Objects

Event Timeline


What would be the extra work required to deploy the full model?

Hi @Miriam, the plan is to deploy the full model. When the Research Team provides the full model, we shall build a model-server and host it on LiftWing.

Oh wonderful @kevinbazira sorry I misunderstood from your msg that you wanted to deploy only the metadata one! @Trokhymovych what do you need to provide the full model to ML?
Thanks so much both for all the work!

Hi @Miriam

I am working on collecting a binary of all components of this model (BERT + classifier). I just need a bit more time to thoroughly test everything and ensure nothing is missing, as the project has been inactive for some time. The plan is to have it ready by Monday.

Thank you for your input and support!

Hi @kevinbazira,

I have prepared the model binary for the full model and instructions/tests that might help you with productization. I was unsure where to commit these files, so I have prepared a Google Drive folder containing all the necessary files and a README that explains their contents.

In particular, I include the following content:

  • binaries: Contains the serialized model binary wikidata_revertrisk_graph2text_v2.pkl of the full model.
  • data: Contains the expert_sample.csv testing dataset labeled by expert annotators (the one we used in the paper). expert_sample.csv includes:
    1. All the final features required for inference of the model.
    2. Artifacts from intermediate steps that might help to debug the feature preparation process (e.g., "texts_add", "texts_remove", "texts_change").
    3. Expert labels as ground truth for evaluation (e.g., "expert_label").
    4. Pre-computed BERT scores for consistency checking (e.g., "add_score_mean", "remove_score_mean", "change_score_mean", "add_score_max", "remove_score_max", "change_score_max").
    5. Model inference scores for consistency checking (e.g., "ores_pred", "metadata_baseline_pred", "content_baseline_pred", "graph2text_pred").
  • modules: Contains the code that might be helpful.
    • model_inference.py: Code to load the model binary and run inference on the expert-labeled dataset. Also includes output consistency checks against pre-computed model inference scores in the dataset.
    • bert_consistency_check.py: Code to verify that the BERT scores computed during feature preparation match the pre-computed scores in the expert-labeled dataset, ensuring consistency and correctness of the feature extraction process. It can be used to check if the feature preparation logic you implement matches the expected outputs.
    • pack_binary.py: Code used to create the serialized model binary wikidata_revertrisk_graph2text_v2.pkl.

Please let me know if you need any further assistance or modifications.
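Not part of the shared materials, but a rough sketch of how the consistency checks described above could work (toy data stands in for expert_sample.csv; the column names are taken from the README, while `check_consistency` and its tolerance are assumptions):

```python
import pandas as pd

# Toy stand-in for expert_sample.csv (real columns per the README above).
df = pd.DataFrame({
    "rev_id": [1945516043, 1959512688],
    "graph2text_pred": [0.2718899239377954, 0.8776396214543237],
    "expert_label": [0, 1],
})

def check_consistency(local_scores: pd.Series, reference: pd.Series,
                      atol: float = 1e-9) -> bool:
    """True if locally recomputed scores match the pre-computed reference scores."""
    return bool(((local_scores - reference).abs() <= atol).all())

# A feature pipeline that reproduces the pre-computed scores should pass this check.
print(check_consistency(df["graph2text_pred"], df["graph2text_pred"]))  # → True
```

The same pattern applies to the pre-computed BERT score columns ("add_score_mean", "remove_score_mean", etc.) when debugging feature preparation.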

Change #1201558 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: add model-server for Graph2Text model

  • revertrisk_wikidata dir has the added model-server
  • utils.py has helper functions for feature processing previously in the knowledge_integrity library
  • README.md has instructions on how to run the model-server locally
  • Makefile has configurations to enable building and running the model-server locally
  • blubber.yaml has the model-server docker image configuration

https://gerrit.wikimedia.org/r/1201558

@Trokhymovych thank you for sharing the full model and detailed instructions. The model has been uploaded to swift:

$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/revertrisk/wikidata/20251104121312/
                    DIR  s3://wmf-ml-models/revertrisk/wikidata/20251104121312/data/
2025-11-04 06:43   631M  s3://wmf-ml-models/revertrisk/wikidata/20251104121312/wikidata_revertrisk_graph2text_v2.pkl

and is also publicly accessible via the analytics portal: https://analytics.wikimedia.org/published/wmf-ml-models/revertrisk/wikidata/20251104121312/

I have pushed a patch that adds the revertrisk-wikidata model-server to the LiftWing inference-services repo. This can be tested locally by running:

# first terminal: build and start model-server
make revertrisk-wikidata

# second terminal: query the isvc
curl -s localhost:8080/v1/models/revertrisk-wikidata:predict -X POST -d '{"rev_id": 1892513445}' -i -H "Content-type: application/json"

The model-server now returns prediction results that match expert sample predictions, as shown below:

# | rev_id | expert sample prediction | model-server prediction
1 | 1945516043 | 0.2718899239377954 | 0.2718899239377954
2 | 1959512688 | 0.8776396214543237 | 0.8776396214543237
3 | 1927211784 | 0.6691119410018558 | 0.6691119410018558
4 | 1923262352 | 0.402866045119543 | 0.402866045119543
5 | 1954689641 | 0.7250100494576713 | 0.7250100494576713
6 | 1918913238 | 0.045814084320458456 | 0.045814084320458456

We shall also share a staging endpoint once this model-server has been deployed in LiftWing for further testing.

Hi @kevinbazira,

Thank you for the update!

I have quickly reviewed the code you shared and noted that you are using a predefined dictionary of labels for feature processing (self.id2label from full_labels_2024-04_text_en.csv) for text preparation. This is not correct; we should use an API call to extract those labels (please check here). Do I understand it correctly?

The problem with this approach is that a binary containing all the labels would be too big, and the dictionary I collected during model training covers only the labels needed then (so revisions related to entities not processed during model training/validation will be processed incorrectly).

Thanks for the review @Trokhymovych. I have updated the model-server to remove the reliance on the static full_labels_2024-04_text_en.csv file.

The model-server now fetches labels dynamically from the Wikidata API, ensuring that it always uses the most up-to-date labels for any given entity.

The model-server now returns the prediction results below:

# | rev_id | expert sample prediction | model-server prediction
1 | 1945516043 | 0.2718899239377954 | 0.2718899239377954
2 | 1959512688 | 0.8776396214543237 | 0.8776396214543237
3 | 1927211784 | 0.6691119410018558 | 0.6691119410018558
4 | 1923262352 | 0.402866045119543 | 0.4380243142560278
5 | 1954689641 | 0.7250100494576713 | 0.5375441528969437
6 | 1918913238 | 0.045814084320458456 | 0.44463073207370857
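As an aside, dynamic label fetching via the Wikidata `wbgetentities` Action API might be sketched as below. The actual implementation is in the Gerrit patch; `fetch_labels`, `parse_labels`, and the 50-ID batching are illustrative names, though the 50-ID-per-request cap is the API's documented limit for regular users:

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def parse_labels(response: dict) -> dict:
    """Extract {entity_id: English label} from a wbgetentities response."""
    labels = {}
    for entity_id, entity in response.get("entities", {}).items():
        label = entity.get("labels", {}).get("en", {}).get("value")
        if label is not None:
            labels[entity_id] = label
    return labels

def fetch_labels(entity_ids: list) -> dict:
    """Fetch English labels for Q/P IDs, batched 50 per request (API limit)."""
    labels = {}
    for i in range(0, len(entity_ids), 50):
        params = urllib.parse.urlencode({
            "action": "wbgetentities",
            "ids": "|".join(entity_ids[i:i + 50]),
            "props": "labels",
            "languages": "en",
            "format": "json",
        })
        with urllib.request.urlopen(f"{WIKIDATA_API}?{params}") as resp:
            labels.update(parse_labels(json.load(resp)))
    return labels
```

This always returns the current labels for any entity, avoiding the stale-dictionary problem described above.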

Change #1201558 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: add model-server for Graph2Text model

https://gerrit.wikimedia.org/r/1201558

As we prepare to publish the revertrisk-wikidata model-server image to the Wikimedia Docker registry, here is a summary of the image layers:

$ docker history b601e2d84c63
IMAGE          CREATED         CREATED BY                                      SIZE      COMMENT
b601e2d84c63   2 minutes ago   [production] 📂 [common_settings.sh] -> comm…   1.36kB    buildkit.exporter.image.v0
<missing>      2 minutes ago   [production] 📂 [model_server_entrypoint.sh]…   303B      buildkit.exporter.image.v0
<missing>      2 minutes ago   [production] 📦 {build}[/opt/lib/venv/lib/py…   1.81GB    buildkit.exporter.image.v0
<missing>      4 minutes ago   [production] 📂 [python] -> python/             33.1kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   [production] 📂 [src/models/revertrisk_wikid…   30.7kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c (getent group "…   9.07kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c (getent group "…   8.88kB    buildkit.exporter.image.v0
<missing>      4 minutes ago   mount / from exec /bin/sh -c apt-get update …   59.2MB    buildkit.exporter.image.v0
<missing>      11 days ago     /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      11 days ago     /bin/sh -c #(nop)  ENV LC_ALL=C.UTF-8           0B        
<missing>      11 days ago     /bin/sh -c #(nop) ADD file:abeaf73dbbde23882…   74.8MB

The largest layer is ~1.81GB, which is well within the Wikimedia Docker registry's 4GB compressed layer size limit.

Change #1202378 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: update CI config to publish the model-server image

https://gerrit.wikimedia.org/r/1202378

Change #1202379 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[integration/config@master] inference-services: update CI pipeline jobs for revertrisk-wikidata model-server

https://gerrit.wikimedia.org/r/1202379

Change #1202379 merged by jenkins-bot:

[integration/config@master] inference-services: update CI pipeline jobs for revertrisk-wikidata model-server

https://gerrit.wikimedia.org/r/1202379

Change #1202378 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: update CI config to publish the model-server image

https://gerrit.wikimedia.org/r/1202378

Change #1202710 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: improve error handling and logging

https://gerrit.wikimedia.org/r/1202710

Change #1202710 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: improve error handling and logging

https://gerrit.wikimedia.org/r/1202710

The revertrisk-wikidata model-server has been containerized and integrated into the CI/CD pipeline, which published it successfully to the Wikimedia Docker registry:

docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-revertrisk-wikidata:2025-11-07-042629-publish

Change #1202908 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update revertrisk-wikidata isvc in experimental namespace

https://gerrit.wikimedia.org/r/1202908

Change #1202908 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revertrisk-wikidata isvc in experimental namespace

https://gerrit.wikimedia.org/r/1202908

The revertrisk-wikidata model-server has been deployed in the LiftWing experimental namespace. It is currently available through an internal endpoint that can only be accessed by tools that run within the WMF infrastructure (e.g. deploy2002, stat1008):

# pod running in experimental ns
$ kube_env experimental ml-staging-codfw
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-wikidata-predictor-default-00019-deployment-557bkfdk   3/3     Running   0          96s


# query revertrisk-wikidata isvc
$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H  "Host: revertrisk-wikidata.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
  "model_name": "revertrisk-wikidata",
  "model_version": "2",
  "revision_id": 1945516043,
  "output": {
    "prediction": false,
    "probabilities": {
      "true": 0.2718899239377954,
      "false": 0.7281100760622046
    }
  }
}

real	0m0.674s
user	0m0.019s
sys	0m0.000s

@Trokhymovych please test it and let us know of any edge cases you may come across. Once you have confirmed that there are none, we shall prepare to move it to production and provide an external endpoint.
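For anyone testing, the response shape above can be consumed in Python as follows (a minimal sketch; the JSON payload is copied verbatim from the curl output above):

```python
import json

# Example response from the revertrisk-wikidata isvc (copied from above).
raw = '''
{
  "model_name": "revertrisk-wikidata",
  "model_version": "2",
  "revision_id": 1945516043,
  "output": {
    "prediction": false,
    "probabilities": {
      "true": 0.2718899239377954,
      "false": 0.7281100760622046
    }
  }
}
'''

resp = json.loads(raw)
# "true" is the probability that the revision will be reverted.
revert_risk = resp["output"]["probabilities"]["true"]
print(f"rev {resp['revision_id']}: revert risk {revert_risk:.3f}, "
      f"predicted revert: {resp['output']['prediction']}")
# → rev 1945516043: revert risk 0.272, predicted revert: False
```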

Change #1203376 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] docker-compose: add revertrisk-wikidata config

https://gerrit.wikimedia.org/r/1203376

Change #1203376 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] docker-compose: add revertrisk-wikidata config

https://gerrit.wikimedia.org/r/1203376

Change #1203781 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] test: add unit tests for revertrisk-wikidata

https://gerrit.wikimedia.org/r/1203781

I have added unit tests for critical components of the model-server to make sure future changes do not break functionality. Here is the output when I build the test image and run the tests:

$ docker buildx build --target test -f .pipeline/revertrisk_wikidata/blubber.yaml --platform=linux/amd64 . -t rrw_unit_test
$ docker run --rm rrw_unit_test
...
Initialized empty Git repository in /srv/revertrisk_wikidata/.git/
ci-lint: install_deps> python -I -m pip install pre-commit
ci-lint: commands[0]> pre-commit run --all-files --show-diff-on-failure
[INFO] Initializing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Initializing environment for https://github.com/astral-sh/ruff-pre-commit.
[INFO] Installing environment for https://github.com/pre-commit/pre-commit-hooks.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
[INFO] Installing environment for https://github.com/astral-sh/ruff-pre-commit.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
ruff (legacy alias)......................................................Passed
ruff format..............................................................Passed
ci-lint: OK ✔ in 11.8 seconds
ci-unit: install_deps> python -I -m pip install -r /srv/revertrisk_wikidata/requirements-test.txt
ci-unit: commands[0]> pytest test/unit
============================= test session starts ==============================
platform linux -- Python 3.11.2, pytest-9.0.0, pluggy-1.6.0
cachedir: .tox/ci-unit/.pytest_cache
rootdir: /srv/revertrisk_wikidata
configfile: tox.ini
plugins: anyio-4.11.0, asyncio-1.3.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collected 5 items

test/unit/revertrisk_wikidata/test_model.py .....                        [100%]

============================== 5 passed in 2.84s ===============================
  ci-lint: OK (11.80=setup[3.06]+cmd[8.74] seconds)
  ci-unit: OK (10.94=setup[6.60]+cmd[4.34] seconds)
  congratulations :) (22.76 seconds)

Change #1203781 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] test: add unit tests for revertrisk-wikidata

https://gerrit.wikimedia.org/r/1203781

@Trokhymovych, following up on T406179#11353371, the revertrisk-wikidata model-server is now live in LiftWing's experimental namespace. Please test it by adjusting the rev_id in the curl command below and let us know whether it's returning correct predictions:

# ssh into WMF stat machine
$ ssh stat1008.eqiad.wmnet

# send request to revertrisk-wikidata inference service hosted on LiftWing
$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H  "Host: revertrisk-wikidata.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1

This will enable us to proceed with the deployment to production and provide an external endpoint for wider use.

Change #1204108 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: deploy revertrisk-wikidata to the revision-models ns

https://gerrit.wikimedia.org/r/1204108

Change #1204108 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy revertrisk-wikidata to the revision-models ns

https://gerrit.wikimedia.org/r/1204108

Change #1204250 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: deploy revertrisk-wikidata to the revision-models ns staging

https://gerrit.wikimedia.org/r/1204250

Change #1204250 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy revertrisk-wikidata to the revision-models ns staging

https://gerrit.wikimedia.org/r/1204250

As we prepare to run load tests, the revertrisk-wikidata isvc has been deployed in LiftWing staging:

# pod running in revision-models ns staging
$ kube_env revision-models ml-staging-codfw
$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
revertrisk-wikidata-predictor-00001-deployment-6fff6dbcbf-mxgmg   3/3     Running   0          77s

# query revertrisk-wikidata isvc
$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H  "Host: revertrisk-wikidata.revision-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
  "model_name": "revertrisk-wikidata",
  "model_version": "2",
  "revision_id": 1945516043,
  "output": {
    "prediction": false,
    "probabilities": {
      "true": 0.2718899239377954,
      "false": 0.7281100760622046
    }
  }
}

real	0m0.457s
user	0m0.010s
sys	0m0.010s

Change #1204730 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] locust: add revertrisk-wikidata load test

https://gerrit.wikimedia.org/r/1204730

I have run locust load tests on the revertrisk-wikidata staging isvc for 120s with 2 users, each waiting between 1s and 5s between requests, using sample Wikidata revision IDs that were shared in the expert_sample.csv in T406179#11333762. Results show an average response time of 568ms with a 0% failure rate over 66 requests.

$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
...
MODEL=revertrisk_wikidata my_locust_venv/bin/locust --headless --csv results/revertrisk_wikidata
[2025-11-13 04:53:43,557] stat1008/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-13 04:53:43,557] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-11-13 04:53:43,558] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-11-13 04:53:43,559] stat1008/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 2} (2 total users)
[2025-11-13 04:55:42,893] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-13 04:55:43,001] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/revertrisk-wikidata:predict                                            66     0(0.00%) |    568     375     886    550 |    0.56        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                        66     0(0.00%) |    568     375     886    550 |    0.56        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/revertrisk-wikidata:predict                                                550    580    600    610    750    790    850    890    890    890    890     66
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                            550    580    600    610    750    790    850    890    890    890    890     66

Change #1204730 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] locust: add revertrisk-wikidata load test

https://gerrit.wikimedia.org/r/1204730

Hi @kevinbazira

I have reviewed the latest version of the model, and the differences in scores appear to be unacceptable, so changes are required.

Firstly, my general recommendation is that we should initially perform a performance check using metrics (rather than comparing exact matches in scores). I think it can be done on your side. I have evaluated the API version of the model and compared the metrics with what we reported in the paper, and I found the following:

Paper score:  0.9343786982248521
API score:  0.8986094674556213

Therefore, I conducted a deeper analysis/debugging to understand the reasons for this discrepancy. The logic for metric evaluation and analysis is presented in this notebook.

After implementing all the changes (mentioned below), I was able to achieve results very close to those reported in the paper:

Paper score:  0.9343786982248521
API score:  0.9341863905325444

I am also providing the updated code for the module you have implemented, required to achieve the reported score. I hope it will help you to make the changes more efficiently.

I previously added all intermediate features to the expert-labeled dataset (shared earlier), which can help with manual debugging if the scores differ significantly, as in our case. This can help us understand whether the feature processing delivers the expected results or not.

I have made the following changes to the code you provided to achieve acceptable results:

  1. The logic for replacing Wikidata IDs with English labels is not implemented completely correctly, which causes the most severe mistakes:
    • When extracting labels for Wikidata IDs, the label for the page was missing, leading to incorrect text processing. Change to be done: entity_ids = extract_entity_ids(diffs) -> entity_ids = extract_entity_ids(diffs) + [inputs["page_title"]]
    • Some changes appear in the numeric_id, so matches = re.findall(r"Q\d+|P\d+", diff_str) is insufficient. Possible approaches: either extract text first and then extract IDs for label extraction, or improve the extract_entity_ids function. It could be something like:
def extract_entity_ids(diffs):
    ids = set()
    for diff_str in diffs:
        # extract Qxxx / Pxxx
        matches = re.findall(r"Q\d+|P\d+", diff_str)
        ids.update(matches)
        # extract numeric values
        numeric_values = re.findall(
            r"numeric-id.*?\{[^}]*?new_value':\s*(\d+).*?old_value':\s*(\d+)",
            diff_str
        )
        for a, b in numeric_values:
            if a.isdigit():
                ids.add(f"Q{a}")
            if b.isdigit():
                ids.add(f"Q{b}")
    return list(ids)
  2. Handling of anonymous users
    • In the dataset, user_age is set to the default numeric value (NUMERIC_NaN = -999), but in feature processing, it is set to 0. It should be: features["user_age"] = -999.0 in the code
    • event_user_groups should be CATEGORICAL_NaN = "nan":
features[group] = str(float(group.split("-")[1] in user_groups))

change to:

if features["user_is_anonymous"] != "True":
    features[group] = str(float(group.split("-")[1] in user_groups))
  • user_is_bot should be -1 for anon users:
if features["user_is_anonymous"] == "True":
    features["user_is_bot"] = "-1"
else:
    features["user_is_bot"] = str(int("bot" in user.get("groups", [])))
  3. I found a difference in the handling of categorical NaN features:

X.append("NaN" if i in cat_feature_indices else -999) should be changed to X.append("nan" if i in cat_feature_indices else -999)

  4. Numeric feature rounding: features such as user_age, page_seconds_since_previous_revision, and page_age should be rounded during feature preparation (as in the training data). This also helps handle cases of new users or new pages, where values are close to 0 (but not 0).
  5. Use constants instead of hardcoded values (optional):
NUMERIC_NaN = -999
CATEGORICAL_NaN = "nan"
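Pulling the anonymous-user fixes above together, the feature-preparation logic might look like the sketch below. The `features`/`user` dicts and the `prepare_user_features`/`group_features` names are assumptions about the surrounding model-server code, not the actual implementation:

```python
NUMERIC_NaN = -999
CATEGORICAL_NaN = "nan"

def prepare_user_features(features: dict, user: dict, group_features: list) -> dict:
    """Apply the anonymous-user handling described in the points above."""
    is_anon = features["user_is_anonymous"] == "True"
    user_groups = user.get("groups", [])

    if is_anon:
        # Anonymous users get the NaN sentinels used in the training data.
        features["user_age"] = float(NUMERIC_NaN)
        features["user_is_bot"] = "-1"
        for group in group_features:
            features[group] = CATEGORICAL_NaN
    else:
        features["user_is_bot"] = str(int("bot" in user_groups))
        for group in group_features:
            # e.g. "group-sysop" checks "sysop" membership -> "1.0" / "0.0"
            features[group] = str(float(group.split("-")[1] in user_groups))
    return features
```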

I also noted that some differences are unavoidable (but do not significantly influence metrics):

  • In some cases, user_age is NaN in our dataset but correctly collected in the API.
  • Users may be added or removed from groups, which can influence predictions.

Please let me know if any other clarification is needed.

Additionally, tagging @Miriam and @fkaelin for visibility or if they would like to add something.

Change #1206197 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: update feature processing based on research team feedback

https://gerrit.wikimedia.org/r/1206197

Change #1206197 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: update feature processing based on research team feedback

https://gerrit.wikimedia.org/r/1206197

Change #1206344 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update revertrisk-wikidata image in both experimental and revision-models ns

https://gerrit.wikimedia.org/r/1206344

Change #1206344 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revertrisk-wikidata image in both experimental and revision-models ns

https://gerrit.wikimedia.org/r/1206344

@Trokhymovych, thank you for reviewing the revertrisk-wikidata model-server and sharing detailed feedback (that's super useful).

The recommendations you made have been added to the model-server, which now returns the prediction results below:

# | rev_id | expert sample prediction | model-server prediction
1 | 1945516043 | 0.2718899239377954 | 0.2718899239377954
2 | 1959512688 | 0.8776396214543237 | 0.8776396214543237
3 | 1927211784 | 0.6691119410018558 | 0.6691119410018558
4 | 1923262352 | 0.402866045119543 | 0.4380243142560278
5 | 1954689641 | 0.7250100494576713 | 0.39802166623571744
6 | 1918913238 | 0.045814084320458456 | 0.045814084320458456

The ROC AUC comparison between the expert_sample.csv predictions and this model-server's predictions, run in P85340, also shows:

>>> from sklearn.metrics import roc_auc_score
>>> data["api_prediction"] = collected_predictions
>>> data_eval = data[data.api_prediction != 0]
>>> print("Number of error predictions", (np.array(collected_predictions) == 0).sum())
Number of error predictions 1
>>> print("Paper score: ", roc_auc_score(data_eval["expert_label"], data_eval["graph2text_pred"]))
Paper score:  0.9343786982248521
>>> print("API score: ", roc_auc_score(data_eval["expert_label"], data_eval["api_prediction"]))
API score:  0.9341863905325444

These results match the ones you shared in: T406179#11375298

In case there are no further edge cases, we shall proceed to deploy the model-server in production.

NOTE: The revertrisk-wikidata model-server version referenced in T406179#11375298 (via this Gerrit change) was not the latest. The latest version is always available in the Gerrit repo or the GitHub mirror.

Change #1207023 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] locust: update revertrisk-wikidata load test results

https://gerrit.wikimedia.org/r/1207023

Change #1207027 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: deploy revertrisk-wikidata to the revision-models ns prod

https://gerrit.wikimedia.org/r/1207027

Change #1207023 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] locust: update revertrisk-wikidata load test results

https://gerrit.wikimedia.org/r/1207023

Change #1207027 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy revertrisk-wikidata to the revertrisk ns prod

https://gerrit.wikimedia.org/r/1207027

@Trokhymovych, here are resources to help you create a comprehensive model card for the revertrisk-wikidata model:

  1. Where to create the model card: https://meta.wikimedia.org/wiki/Machine_learning_models#Create_a_model_card
  2. FAQs to answer in the model card: https://docs.google.com/document/d/1Q5aJGGBJB4LN3dXS8_-IjZYi0a3T1MIWtDXDwZeEins/edit
  3. Model card template: https://meta.wikimedia.org/wiki/Machine_learning_models/Model_card_template

For reference, you can look at these examples of existing model cards created by your colleagues from the Research team:

  1. RevertRisk Multilingual: https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Multilingual_revert_risk
  2. RevertRisk Language-agnostic: https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language-agnostic_revert_risk
  3. Article descriptions: https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Article_descriptions
  4. Language-agnostic article quality: https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language-agnostic_Wikipedia_article_quality

In case of any challenges, please feel free to reach out. We'll be happy to clarify.

The revertrisk-wikidata inference service is now live in LiftWing production. It can be accessed through:
1. External endpoint:

$ curl "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H "Content-Type: application/json" --http1.1

2. Internal endpoint:

$ curl "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H  "Host: revertrisk-wikidata.revertrisk.wikimedia.org" -H "Content-Type: application/json" --http1.1

3. Documentation:

Change #1208189 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/puppet@production] httpbb: add post deployment tests for the revertrisk-wikidata endpoint

https://gerrit.wikimedia.org/r/1208189

Change #1208189 merged by Klausman:

[operations/puppet@production] httpbb: add post deployment tests for the revertrisk-wikidata endpoint

https://gerrit.wikimedia.org/r/1208189

Sucheta-Salgaonkar-WMF renamed this task from Request to host Wikidata Revert Risk on Lift Wing to Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing.Nov 24 2025, 2:28 PM

Change #1218730 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: add model card link

https://gerrit.wikimedia.org/r/1218730

Change #1218730 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revertrisk-wikidata: add model card link

https://gerrit.wikimedia.org/r/1218730

Weekly Update:

  • The Wikimedia Enterprise team conducted load tests to simulate their traffic and shared results in T409388#11483570
  • We are working on optimizing the revertrisk-wikidata inference service to achieve the Enterprise team's latency target in T414060

Change #1225553 had a related patch set uploaded (by Gkyziridis; author: Gkyziridis):

[operations/deployment-charts@master] ml-services: Remove revertrisk-wikidata from revision-models ns.

https://gerrit.wikimedia.org/r/1225553

Change #1225553 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Remove revertrisk-wikidata from revision-models ns.

https://gerrit.wikimedia.org/r/1225553

Update


We are using the following strategy, based on the idea of small, stable, and parsimonious steps:

  1. We have deployed to experimental staging an rr-wikidata model version that uses concurrency and multiple workers.
  2. We are experimenting with different configurations of this version in order to reach the latency target.
  3. If that does not work, we will fall back to a GPU as a last resort.

Weekly Update:

  • We implemented multi-worker processing:
    • Tested multi-worker processing with non-concurrency-safe vs. concurrency-safe lazy model loading in a local container. This resulted in a performance regression (T414060#11511531).
    • Tested multi-worker processing with concurrency-safe lazy model loading in a pod on LiftWing. This resulted in a performance improvement, achieving ~6.4 RPS with 8 workers and 8 CPUs in a single optimized pod while meeting the ~500ms latency target (T414060#11515216).
  • Decided to move this optimized pod into horizontal scaling:
    • To meet the throughput target (~42 RPS), we calculated that 7 replicas of the optimized pod are required. This translates to 7 pods * 8 CPUs per pod = 56 CPUs, which are currently unavailable in the experimental namespace (T414060#11515839).
    • Checked grafana and confirmed that LiftWing staging does not have sufficient CPU resources, but production environments do (T414060#11527771).
    • The revertrisk namespace quota needs to be increased to allocate 56 CPUs to the rr-wikidata model-server. (T414060#11528208)
      • Dawid (ML SRE) has confirmed that he will deploy this resource adjustment next week.
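
The concurrency-safe lazy model loading mentioned above can be illustrated with a classic double-checked-locking wrapper. This is a minimal sketch of the technique, not the actual model-server code:

```python
import threading

class LazyModel:
    """Load an expensive model at most once across concurrent workers."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        # Fast path: once the model is loaded, no locking is needed.
        if self._model is None:
            with self._lock:
                # Re-check under the lock so only one thread runs the loader,
                # even if several threads hit a cold server simultaneously.
                if self._model is None:
                    self._model = self._loader()
        return self._model
```

The lock serializes only the first load; subsequent calls take the lock-free fast path, so steady-state request handling is unaffected.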