
Test liftwing wikidata revert risk API for scale and latency
Open, Needs Triage, Public

Description

Context
All LiftWing endpoints support a WME-tier rate limit of 200K requests per hour.
We call ML APIs with a configurable deadline. Run the test from an AWS instance.

To do

  • Test the wikidata revertrisk endpoint for scale. Achieve at least 150K requests per hour. (Tip: you can reuse the existing liftwing scale test under the experiments repos.)
  • Capture response times for calls that return HTTP 200 (passing calls).

Acceptance criteria

  • Share a report on the scale achieved and the latency distribution.
  • Connect with Francisco and the ML team if the latency is not <= 500 ms for at least 90% of calls that return HTTP 200.
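The deadline-based measurement and the 90% <= 500 ms check could be sketched as below. Hedged: the endpoint URL and payload shape are assumptions modeled on other LiftWing revertrisk models, not taken from this task; adjust them before use.

```python
import json
import time
import urllib.request
import urllib.error

# Assumed public LiftWing endpoint and payload shape (verify before running).
ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-wikidata:predict"

def timed_predict(rev_id, deadline=0.5):
    """POST one prediction with a hard per-request deadline (seconds).

    Returns (status, latency_ms); status is None on timeout/connection error.
    """
    body = json.dumps({"rev_id": rev_id, "lang": "wikidatawiki"}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=deadline) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        status = e.code
    except Exception:
        status = None  # deadline exceeded or connection failure
    return status, (time.monotonic() - start) * 1000.0

def meets_slo(latencies_ms, threshold_ms=500.0, quantile=0.90):
    """True if at least `quantile` of passing-call latencies are <= threshold."""
    if not latencies_ms:
        return False
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return within / len(latencies_ms) >= quantile
```

Only latencies of HTTP 200 responses should be fed into `meets_slo`, matching the acceptance criteria above.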

Event Timeline

As we worked on T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing, we conducted locust load tests on the revertrisk-wikidata inference service staging endpoint. These tests ran for 120 seconds with 2 users, each sending requests at intervals between 1 and 5 seconds, using sample Wikidata revision IDs obtained from the Research team's expert_sample.csv.

The results showed an average response time of 583ms, with a 0% failure rate over 65 requests:

$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
...
MODEL=revertrisk_wikidata my_locust_venv/bin/locust --headless --csv results/revertrisk_wikidata
[2025-11-18 13:38:16,933] stat1008/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-18 13:38:16,933] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-11-18 13:38:16,934] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-11-18 13:38:16,934] stat1008/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 2} (2 total users)
[2025-11-18 13:40:16,227] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-18 13:40:16,344] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 65 0(0.00%) | 583 397 903 580 | 0.56 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 65 0(0.00%) | 583 397 903 580 | 0.56 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 580 610 660 680 750 820 840 900 900 900 900 65
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 580 610 660 680 750 820 840 900 900 900 900 65
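The reported 0.56 req/s is consistent with a closed-loop load model, where throughput is roughly users divided by (mean think time + mean response time). A quick sanity-check sketch (the formula is a standard approximation, not something Locust itself reports):

```python
def expected_rps(users, mean_wait_s, mean_latency_s):
    """Closed-loop throughput estimate: each user alternates waiting and
    requesting, so requests/s ~= users / (mean wait + mean latency)."""
    return users / (mean_wait_s + mean_latency_s)

# 2 users, wait uniform in [1, 5] s (mean 3 s), mean latency 0.583 s:
rps = expected_rps(users=2, mean_wait_s=3.0, mean_latency_s=0.583)
# ~0.56 req/s, matching the 0.56 req/s Locust reported over 120 s (65 requests).
```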

Based on the tip in this task's description, our understanding is that WME will further evaluate this service's scale and latency:

In T409388, @prabhat wrote:

Tip: you can reuse existing liftwing scale test under experiments repos

Once you have completed your load tests, please share the results with us so we can optimize where needed.

Let us know if you need any additional information or support from our side.

Hey @kevinbazira, thanks very much for running the load tests for Revert-Risk Wikidata.
I think we should adjust the configuration a bit to simulate a more realistic scenario.
We also need to run heavier tests that spawn more users, to check the API's capacity and the maximum RPS it can handle.
I ran three different locust tests with heavier configurations; you can see the results in the following phab paste:

# 500 users | 5 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-24 13:19:16,836] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-24 13:19:16,837] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-24 13:19:16,837] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 5.00 per second
[2025-11-24 13:20:55,994] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 500} (500 total users)
[2025-11-24 13:21:16,348] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-24 13:21:16,556] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 1202 33(2.75%) | 18076 472 46826 12000 | 10.05 0.28
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1202 33(2.75%) | 18076 472 46826 12000 | 10.05 0.28

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 12000 26000 32000 35000 41000 42000 44000 45000 47000 47000 47000 1202
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 12000 26000 32000 35000 41000 42000 44000 45000 47000 47000 47000 1202

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
33 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 500 users | 2 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-24 13:13:03,964] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-24 13:13:03,964] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-24 13:13:03,965] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 2.00 per second
[2025-11-24 13:15:03,496] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-24 13:15:03,651] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 879 9(1.02%) | 10939 474 25179 11000 | 7.35 0.08
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 879 9(1.02%) | 10939 474 25179 11000 | 7.35 0.08

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 11000 14000 16000 17000 20000 21000 23000 24000 25000 25000 25000 879
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 11000 14000 16000 17000 20000 21000 23000 24000 25000 25000 25000 879

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
9 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 100 users | 5 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-24 13:26:48,568] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-24 13:26:48,568] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-24 13:26:48,569] stat1010/INFO/locust.runners: Ramping to 100 users at a rate of 5.00 per second
[2025-11-24 13:27:07,640] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 100} (100 total users)
[2025-11-24 13:28:48,102] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-24 13:28:48,215] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 1742 4(0.23%) | 3314 81 6776 3400 | 14.58 0.03
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1742 4(0.23%) | 3314 81 6776 3400 | 14.58 0.03

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 3400 3800 4000 4200 4600 4900 5400 5700 6500 6800 6800 1742
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 3400 3800 4000 4200 4600 4900 5400 5700 6500 6800 6800 1742

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
4 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

Ideas

  • Experiment with more resources on the isvc.
  • Configure autoscaling.
  • We could probably use the KServe batcher.
  • Dig into the model server's logic and make it faster.
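For the batcher idea, KServe supports request batching per predictor. An illustrative, untested fragment using KServe's documented batcher fields (note this is the raw InferenceService spec shape, not the deployment-charts values.yaml format, and the model server would also need to accept batched inputs for it to help):

```yaml
# Hypothetical sketch: KServe InferenceService predictor batcher.
predictor:
  batcher:
    maxBatchSize: 32   # group up to 32 concurrent predict calls into one batch
    maxLatency: 500    # flush a partial batch after 500 ms
```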

The revertrisk-wikidata inference service production endpoint uses similar scaling configs that other revertrisk inference-services use: https://github.com/wikimedia/operations-deployment-charts/blob/8412fc655d3b1e10b38cf0c954d910b820e93a05/helmfile.d/ml-services/revertrisk/values.yaml#L145-L150

IMO the prod endpoint should scale well unless results from the WME folks say otherwise.


@Kevin, you are right: the locust tests above in https://phabricator.wikimedia.org/T409388#11400499 were targeting staging, where autoscaling is not activated:

revertrisk-wikidata:
  predictor:
    image: "machinelearning-liftwing-inference-services-revertrisk-wikidata"
    image_version: "2025-11-17-105041-publish"
    custom_env:
      - name: MODEL_NAME
        value: "revertrisk-wikidata"
      - name: STORAGE_URI
        value: "s3://wmf-ml-models/revertrisk/wikidata/20251104121312/"
      - name: FORCE_HTTP
        value: "True"
    container:
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi

I reran the tests using the configuration we use on prod, where autoscaling is activated and more resources are available:

revertrisk-wikidata:
  annotations:
    autoscaling.knative.dev/target: "3"
  predictor:
    config:
      minReplicas: 5
      maxReplicas: 15
    image: "machinelearning-liftwing-inference-services-revertrisk-wikidata"
    image_version: "2025-11-17-105041-publish"
    custom_env:
      - name: MODEL_NAME
        value: "revertrisk-wikidata"
      - name: STORAGE_URI
        value: "s3://wmf-ml-models/revertrisk/wikidata/20251104121312/"
      - name: FORCE_HTTP
        value: "True"
    container:
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
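The scaling knobs above interact: Knative's autoscaler (KPA) roughly sizes the deployment as in-flight concurrency divided by the `autoscaling.knative.dev/target` value, clamped to the replica bounds. A hedged sketch of that arithmetic (an approximation of KPA behavior, not code from the chart):

```python
import math

def desired_replicas(concurrency, target=3, min_replicas=5, max_replicas=15):
    """Approximate Knative KPA sizing: pods ~= in-flight concurrency / target,
    clamped to [minReplicas, maxReplicas] from the config above."""
    return max(min_replicas, min(max_replicas, math.ceil(concurrency / target)))

# e.g. ~45 concurrent in-flight requests already hits the maxReplicas cap of 15.
```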

Here are the results from the final load tests using the prod configuration with autoscaling activated (the config above).
Results:

# 500 users | 5 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-25 13:54:07,783] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-25 13:54:07,783] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-25 13:54:07,784] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 5.00 per second
[2025-11-25 13:55:46,924] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 500} (500 total users)
[2025-11-25 13:56:07,316] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-25 13:56:07,494] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 6126 12(0.20%) | 2644 61 9392 2600 | 51.21 0.10
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 6126 12(0.20%) | 2644 61 9392 2600 | 51.21 0.10

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 2600 3300 3700 3900 4600 5400 6500 7300 8500 9400 9400 6126
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 2600 3300 3700 3900 4600 5400 6500 7300 8500 9400 9400 6126

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
12 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 500 users | 2 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-25 13:57:54,409] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-25 13:57:54,409] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-25 13:57:54,410] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 2.00 per second
[2025-11-25 13:59:53,905] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-25 13:59:54,026] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 3715 0(0.00%) | 943 364 11883 860 | 31.07 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 3715 0(0.00%) | 943 364 11883 860 | 31.07 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 860 1000 1100 1200 1400 1600 1900 2200 3300 12000 12000 3715
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 860 1000 1100 1200 1400 1600 1900 2200 3300 12000 12000 3715

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 100 users | 5 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-25 14:01:11,144] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-25 14:01:11,145] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-25 14:01:11,145] stat1010/INFO/locust.runners: Ramping to 100 users at a rate of 5.00 per second
[2025-11-25 14:01:30,186] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 100} (100 total users)
[2025-11-25 14:03:10,665] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-25 14:03:10,772] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 3000 0(0.00%) | 705 352 11618 670 | 25.10 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 3000 0(0.00%) | 705 352 11618 670 | 25.10 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 670 740 810 850 960 1100 1200 1300 1800 12000 12000 3000
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 670 740 810 850 960 1100 1200 1300 1800 12000 12000 3000

The difference in average latency and RPS is easy to see: the prod configuration achieves much higher RPS and much lower average latency than the staging configuration, thanks to autoscaling and the additional resources.
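Converting the best run above to the hourly scale used in the task goal is simple arithmetic. A small sketch (the 150K/hour target and 200K/hour WME rate limit are from the task description; note that extrapolating a 120-second locust run to a full hour is optimistic):

```python
def hourly_rate(rps):
    """Convert Locust's req/s figure to the requests-per-hour scale of the task goal."""
    return rps * 3600

# Best prod-config run above: 500 users at 51.21 req/s.
best = hourly_rate(51.21)
# ~184,356 requests/hour, above the 150K/hour target and below the 200K WME limit.
```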

@prabhat, has the WME team had a chance to run scale and latency tests on the revertrisk-wikidata inference service? Does this service meet your performance requirements?

If you run into issues or if the service does not meet your performance requirements, please let us know so we can further optimize it. Thanks!