
Test liftwing wikidata revert risk API for scale and latency
Open, Needs TriagePublic

Description

Context
All LiftWing endpoints support a WME-tier rate limit of 200K requests per hour.
We call ML APIs with a configurable deadline. Run the test from an AWS instance.

To do

  • Test the wikidata revertrisk endpoint for scale. Achieve at least 150K requests per hour. (Tip: you can reuse the existing LiftWing scale test under the experiments repo.)
  • Capture response times for calls that return HTTP 200 (successful calls).

Acceptance criteria

  • Share a report on the scale achieved and the latency distribution.
  • Connect with Francisco and the ML team if the latency for 90% of HTTP 200 responses is not <= 500 ms.

Event Timeline

While working on T406179: Q2 FY2025-26 Goal: Host Wikidata Revert Risk model on LiftWing, we conducted locust load tests against the revertrisk-wikidata inference service staging endpoint. These tests ran for 120 seconds with 2 users, each sending requests at intervals between 1 and 5 seconds, using sample Wikidata revision IDs obtained from the Research team's expert_sample.csv.

The results showed an average response time of 583ms, with a 0% failure rate over 65 requests:

$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
...
MODEL=revertrisk_wikidata my_locust_venv/bin/locust --headless --csv results/revertrisk_wikidata
[2025-11-18 13:38:16,933] stat1008/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-18 13:38:16,933] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-11-18 13:38:16,934] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-11-18 13:38:16,934] stat1008/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 2} (2 total users)
[2025-11-18 13:40:16,227] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-18 13:40:16,344] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 65 0(0.00%) | 583 397 903 580 | 0.56 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 65 0(0.00%) | 583 397 903 580 | 0.56 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 580 610 660 680 750 820 840 900 900 900 900 65
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 580 610 660 680 750 820 840 900 900 900 900 65
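The request count above is consistent with the described configuration. Assuming Locust's wait_time=between(1, 5) averages 3 s and each user iteration takes roughly one wait plus one response, a back-of-the-envelope estimate (my arithmetic, not from the logs) is:

```python
# Rough sanity check on the ~65 requests observed in the 2-user run.
# Assumptions: average wait of 3 s (midpoint of between(1, 5)) and the
# 583 ms average response time reported above.
users, run_time_s = 2, 120
avg_wait_s, avg_resp_s = 3.0, 0.583

expected = users * run_time_s / (avg_wait_s + avg_resp_s)
print(round(expected))  # ≈ 67, close to the 65 requests actually observed
```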

Based on the tip in this task's description, our understanding is that WME is going to further evaluate this service's scale and latency:

In T409388, @prabhat wrote:

Tip: you can reuse existing liftwing scale test under experiments repos

Once you have completed your load tests, please share the results with us so we can optimize where needed.

Let us know if you need any additional information or support from our side.

Hey @kevinbazira, thank you very much for running the load tests for Revert-Risk Wikidata.
I think we should adjust the configuration a bit to simulate a more realistic scenario.
We also need to run heavier tests, spawning more users, to check our API's capacity and its ability to handle maximum RPS.
I ran three different locust tests with heavier configurations; you can see the results in the following phab paste:

# 500 users | 5 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-24 13:19:16,836] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-24 13:19:16,837] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-24 13:19:16,837] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 5.00 per second
[2025-11-24 13:20:55,994] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 500} (500 total users)
[2025-11-24 13:21:16,348] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-24 13:21:16,556] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 1202 33(2.75%) | 18076 472 46826 12000 | 10.05 0.28
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1202 33(2.75%) | 18076 472 46826 12000 | 10.05 0.28

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 12000 26000 32000 35000 41000 42000 44000 45000 47000 47000 47000 1202
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 12000 26000 32000 35000 41000 42000 44000 45000 47000 47000 47000 1202

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
33 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 500 users | 2 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-24 13:13:03,964] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-24 13:13:03,964] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-24 13:13:03,965] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 2.00 per second
[2025-11-24 13:15:03,496] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-24 13:15:03,651] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 879 9(1.02%) | 10939 474 25179 11000 | 7.35 0.08
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 879 9(1.02%) | 10939 474 25179 11000 | 7.35 0.08

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 11000 14000 16000 17000 20000 21000 23000 24000 25000 25000 25000 879
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 11000 14000 16000 17000 20000 21000 23000 24000 25000 25000 25000 879

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
9 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 100 users | 5 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-24 13:26:48,568] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-24 13:26:48,568] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-24 13:26:48,569] stat1010/INFO/locust.runners: Ramping to 100 users at a rate of 5.00 per second
[2025-11-24 13:27:07,640] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 100} (100 total users)
[2025-11-24 13:28:48,102] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-24 13:28:48,215] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 1742 4(0.23%) | 3314 81 6776 3400 | 14.58 0.03
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1742 4(0.23%) | 3314 81 6776 3400 | 14.58 0.03

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 3400 3800 4000 4200 4600 4900 5400 5700 6500 6800 6800 1742
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 3400 3800 4000 4200 4600 4900 5400 5700 6500 6800 6800 1742

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
4 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

Ideas

  • Experiment with more resources on the isvc
  • Configure autoscaling
  • We could probably use the KServe Batcher
  • Dig into the model server's logic and make it faster

The revertrisk-wikidata inference service production endpoint uses scaling configs similar to those of the other revertrisk inference services: https://github.com/wikimedia/operations-deployment-charts/blob/8412fc655d3b1e10b38cf0c954d910b820e93a05/helmfile.d/ml-services/revertrisk/values.yaml#L145-L150

IMO the prod endpoint should scale well unless results from the WME folks say otherwise.


@Kevin you are right, the locust tests above (https://phabricator.wikimedia.org/T409388#11400499) were targeting staging, where autoscaling is not activated:

revertrisk-wikidata:
  predictor:
    image: "machinelearning-liftwing-inference-services-revertrisk-wikidata"
    image_version: "2025-11-17-105041-publish"
    custom_env:
      - name: MODEL_NAME
        value: "revertrisk-wikidata"
      - name: STORAGE_URI
        value: "s3://wmf-ml-models/revertrisk/wikidata/20251104121312/"
      - name: FORCE_HTTP
        value: "True"
    container:
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi

I reran the tests using the production configuration, where autoscaling is activated and more resources are allocated:

revertrisk-wikidata:
  annotations:
    autoscaling.knative.dev/target: "3"
  predictor:
    config:
      minReplicas: 5
      maxReplicas: 15
    image: "machinelearning-liftwing-inference-services-revertrisk-wikidata"
    image_version: "2025-11-17-105041-publish"
    custom_env:
      - name: MODEL_NAME
        value: "revertrisk-wikidata"
      - name: STORAGE_URI
        value: "s3://wmf-ml-models/revertrisk/wikidata/20251104121312/"
      - name: FORCE_HTTP
        value: "True"
    container:
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
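As a rough capacity check (my arithmetic, not from the task), Little's law relates the config above to a throughput ceiling: with a Knative concurrency target of 3 per replica and up to 15 replicas, sustainable RPS is roughly total in-flight concurrency divided by per-request latency. The latency figure below is an assumed illustrative value:

```python
# Rough capacity estimate via Little's law: RPS ≈ concurrency / latency.
# target_concurrency and max_replicas come from the config above;
# avg_latency_s is an assumed steady-state service time, for illustration only.
target_concurrency = 3
max_replicas = 15
avg_latency_s = 0.6

max_in_flight = target_concurrency * max_replicas   # 45 concurrent requests
max_rps = max_in_flight / avg_latency_s
print(max_rps)  # 75.0 requests/second at full scale-out
```

At ~75 RPS this configuration would comfortably exceed the 150K/hour (~42 RPS) target, assuming latency stays near the illustrative value.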

Here are the results from the final load tests using the prod configuration with autoscaling activated (above config):

# 500 users | 5 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-25 13:54:07,783] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-25 13:54:07,783] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-25 13:54:07,784] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 5.00 per second
[2025-11-25 13:55:46,924] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 500} (500 total users)
[2025-11-25 13:56:07,316] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-25 13:56:07,494] stat1010/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 6126 12(0.20%) | 2644 61 9392 2600 | 51.21 0.10
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 6126 12(0.20%) | 2644 61 9392 2600 | 51.21 0.10

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 2600 3300 3700 3900 4600 5400 6500 7300 8500 9400 9400 6126
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 2600 3300 3700 3900 4600 5400 6500 7300 8500 9400 9400 6126

Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
12 POST /v1/models/revertrisk-wikidata:predict: BadStatusCode('https://inference-staging.svc.codfw.wmnet:30443/v1/models/revertrisk-wikidata:predict', code=502)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 500 users | 2 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-25 13:57:54,409] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-25 13:57:54,409] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-25 13:57:54,410] stat1010/INFO/locust.runners: Ramping to 500 users at a rate of 2.00 per second
[2025-11-25 13:59:53,905] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-25 13:59:54,026] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 3715 0(0.00%) | 943 364 11883 860 | 31.07 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 3715 0(0.00%) | 943 364 11883 860 | 31.07 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 860 1000 1100 1200 1400 1600 1900 2200 3300 12000 12000 3715
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 860 1000 1100 1200 1400 1600 1900 2200 3300 12000 12000 3715

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 100 users | 2 per second
$ MODEL_LOCUST_DIR="revertrisk_wikidata" make run-locust-test
[2025-11-25 14:01:11,144] stat1010/INFO/locust.main: Run time limit set to 120 seconds
[2025-11-25 14:01:11,145] stat1010/INFO/locust.main: Starting Locust 2.31.5
[2025-11-25 14:01:11,145] stat1010/INFO/locust.runners: Ramping to 100 users at a rate of 5.00 per second
[2025-11-25 14:01:30,186] stat1010/INFO/locust.runners: All users spawned: {"RevertriskWikidata": 100} (100 total users)
[2025-11-25 14:03:10,665] stat1010/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-11-25 14:03:10,772] stat1010/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/revertrisk-wikidata:predict 3000 0(0.00%) | 705 352 11618 670 | 25.10 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 3000 0(0.00%) | 705 352 11618 670 | 25.10 0.00

Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/revertrisk-wikidata:predict 670 740 810 850 960 1100 1200 1300 1800 12000 12000 3000
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 670 740 810 850 960 1100 1200 1300 1800 12000 12000 3000

We can clearly see the difference in average latency and RPS: prod achieves a much higher RPS thanks to autoscaling, with a much lower average latency than staging.

@prabhat, has the WME team had a chance to run scale and latency tests on the revertrisk-wikidata inference service? Does this service meet your performance requirements?

If you run into issues or if the service does not meet your performance requirements, please let us know so we can further optimize it. Thanks!

@kevinbazira We’re planning to run the scale test in the next few days.
@HShaikh


Hi @SGupta-WMF, thank you for letting us know. We look forward to hearing from you regarding the results.

Results from @SGupta-WMF 's test --

In short - success rate is slightly under what we'd want but can live with. Latency is far above the 0.5ms we'd need.


Run 1
• Duration: ~67.3 mins
• Total Requests: 87,595
• Success: 77,205 (88.14%)
• Failures: 10,390 (11.86%)
• Actual RPS: 21.7
• Requests/hour: 78,109
• Target achievement: 52.07% of 150K/hour
P90 latency (first 200 successes): ~5.7s

Run 2
• Duration: ~67.1 mins
• Total Requests: 75,885
• Success: 64,292 (84.72%)
• Failures: 11,593 (15.28%)
• Actual RPS: 18.85
• Requests/hour: 67,866
• Target achievement: 45.24% of 150K/hour


Thank you for sharing the results from your load tests.

Could you please clarify if the target latency of 0.5ms is a typo? We have been working towards a target of ~500ms as mentioned in this task's description.

If it is not a typo, achieving sub-millisecond latency is not feasible for this service due to the multiple network requests involved in its operation.

If it is indeed a typo, could you also confirm whether the other reported numbers, such as the ~5.7s latency for the first 200 successes, are accurate?


Looking forward to your confirmation.

So sorry! Yes, sub-millisecond is incorrect. Half a second, 500ms. The P90s is NOT a typo.

Thank you for the confirmation. We are working on optimizing the revertrisk-wikidata inference service to achieve the ~500ms latency target in T414060.

@SGupta-WMF and @FNavas-foundation, as shown in T414060#11536942, we have optimized the revertrisk-wikidata inference service. Please run the same load tests whose results you shared in T409388#11483570 and confirm whether this service now meets your latency requirements. Thanks in advance.

JArguello-WMF subscribed.

Reopening because @SGupta-WMF will run the tests again, thanks!

Run 3 results -

  • Total Requests 146,231
  • Successful Requests 136,287
  • Failed Requests 9,944
  • Success Rate 93.20%
  • Test Duration 3,880.94 s (~64.7 min)
  • Actual RPS 37.68
  • Requests per Hour 135,645
  • Target Achievement (150K/hour) 90.43%
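The throughput figures above follow directly from the raw counts; a quick reproduction of the arithmetic:

```python
# Reproducing the Run 3 throughput figures from the raw counts reported above.
total_requests = 146_231
duration_s = 3880.94
target_per_hour = 150_000

rps = total_requests / duration_s
per_hour = rps * 3600
target_pct = per_hour / target_per_hour * 100

print(round(rps, 2), round(per_hour), round(target_pct, 2))
# → 37.68 135645 90.43, matching the reported RPS, requests/hour, and target achievement
```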

Latency Distribution

  • Successful Requests (n = 136,287)
  • Min Latency 0.2102 s
  • Median Latency 0.5617 s
  • Mean Latency 0.9661 s
  • P90 Latency 1.8293 s
  • P95 Latency 3.2628 s
  • P99 Latency 7.0478 s

Latency Buckets

  • <500 ms: 33.86%
  • 500 ms–1 s: 46.72%
  • 1–2 s: 10.38%
  • 2–5 s: 6.53%
  • >5 s: 2.52%
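The bucket breakdown above can be produced from raw per-request latencies with a simple histogram. A minimal sketch using the same bins (the sample values in the usage note are illustrative, not the actual test data):

```python
def latency_buckets(latencies_s):
    """Percentage of requests per latency bucket, using the report's bins."""
    bounds = [(0, 0.5, "<500 ms"), (0.5, 1, "500 ms-1 s"),
              (1, 2, "1-2 s"), (2, 5, "2-5 s"), (5, float("inf"), ">5 s")]
    n = len(latencies_s)
    return {label: 100 * sum(lo <= x < hi for x in latencies_s) / n
            for lo, hi, label in bounds}
```

For example, latency_buckets([0.3, 0.7, 1.5, 6.0]) puts 25% of requests in each of four bins and 0% in the 2-5 s bin.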

Observed Errors
The inference service returned:
{"error":"TypeError : unhashable type: 'dict'"} indicating a server-side exception in the model service.
Several requests failed with:
"https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-wikidata:predict": context deadline exceeded, suggesting slow failures leading to client-side timeouts.

@kevinbazira @FNavas-foundation

Key Takeaways from latest test

  • Throughput improved significantly, reaching ~136K requests/hour (≈90% of the 150K/hour target), up from ~68–78K/hour in previous runs.
  • Reliability increased, with the success rate improving to 93.2% (up from ~85–88%), though the error rate remains non-negligible.
  • Warm-up latency improved substantially, with first-200 P90 dropping from ~5.7s to ~0.69s.
  • Steady-state latency remains high, with overall P90 at ~1.8s and only ~34% of requests completing under 500ms.
  • The inference service returned TypeError: unhashable type: 'dict' and context deadline exceeded errors, contributing to request failures and increased tail latency.

Thank you for sharing the load test results!


We noticed that you are using the external endpoint for this service. This means that each of your requests takes a longer route through the public internet and several layers of infrastructure (WMF APIGW, proxies, etc.) before reaching LiftWing. This path introduces significant network latency and is also subject to APIGW rate limits, which further impact performance.

For example, using the external endpoint results in the following performance:

$ time curl "https://api.wikimedia.org/service/lw/inference/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H "Content-Type: application/json" --http1.1
{"model_name":"revertrisk-wikidata","model_version":"2","revision_id":1945516043,"output":{"prediction":false,"probabilities":{"true":0.2718899239377954,"false":0.7281100760622046}}}
real	0m5.608s
user	0m0.020s
sys	0m0.000s

To achieve the best performance from LiftWing services, we usually recommend using the internal endpoint. The internal endpoint eliminates external network hops and proxies, as all traffic remains within the WMF infrastructure.

For example, using the internal endpoint results in significantly better performance:

$ time curl "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-wikidata:predict" -X POST -d '{"rev_id": 1945516043}' -H  "Host: revertrisk-wikidata.revertrisk.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"model_name":"revertrisk-wikidata","model_version":"2","revision_id":1945516043,"output":{"prediction":false,"probabilities":{"true":0.2718899239377954,"false":0.7281100760622046}}}
real	0m0.575s
user	0m0.012s
sys	0m0.008s

Switching to the internal endpoint should significantly reduce latency and improve the overall performance of this service.
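For reference, the internal-endpoint curl call above translates to something like the following stdlib sketch. This is an illustrative reconstruction, not code from the thread; it is runnable only inside the WMF network, and the timeout value is an assumed client deadline:

```python
import json
from urllib import request

def build_internal_request(rev_id):
    """Build the internal-endpoint request; the Host header routes to the isvc."""
    return request.Request(
        "https://inference.svc.eqiad.wmnet:30443/v1/models/revertrisk-wikidata:predict",
        data=json.dumps({"rev_id": rev_id}).encode(),
        headers={"Host": "revertrisk-wikidata.revertrisk.wikimedia.org",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_internal_request(1945516043)
# From within WMF infra one would then send it with an explicit deadline:
# resp = request.urlopen(req, timeout=5)
```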

@kevinbazira Thank you for the suggestion! However, I wanted to clarify a few important points about WME's infrastructure:

WME uses the external endpoint for all LiftWing requests by design. WME is not part of WMF infrastructure; it operates as a separate service outside the WMF network, so we don't have access to internal WMF endpoints like https://inference.svc.eqiad.wmnet:30443/.

The external endpoint (https://api.wikimedia.org/service/lw/inference/v1/models/) is the appropriate path for WME's architecture, and we're aware of the additional latency this introduces compared to internal-only services.

Regarding API limits, there are specific rate limit specifications for WME documented in the link you referenced earlier. These limits are different from general public API limits since WME has dedicated capacity allocations.

The performance issues we're seeing (timeouts, unhashable type: 'dict' errors) appear to be related to service-side processing or the model itself rather than network latency, since:

  • Some requests succeed while others fail with server-side exceptions
  • The timeout errors suggest the service is taking too long to respond, not just network delay

@FNavas-foundation @HShaikh

JArguello-WMF moved this task from Doing to Done on the Wikimedia-Enterprise-Kanban-On-Call board.
JArguello-WMF added a subscriber: SGupta-WMF.

@SGupta-WMF, thank you for the clarification regarding WME's infrastructure and the use of the external endpoint. We investigated and fixed the server-side errors you reported, as detailed in: T414060#11584141

We hope these improvements will address the issues you observed. Please let us know if you encounter any further challenges or have additional feedback.

@kevinbazira @FNavas-foundation Wikidata Revert Risk Scale Test - February 5, 2026

Test Summary:
  • Total Requests: 120,443
  • Successful: 113,230 (94.01%)
  • Failed: 7,213 (5.99%)
  • Duration: 65.4 minutes (3,923 seconds)
  • Actual RPS: 30.70
  • Requests/Hour: 110,508
  • Target Achievement: 73.67% of 150K/hour goal

Latency - Successful Calls (n=113,230):
  • Min: 0.17s
  • Median: 0.50s
  • Mean: 1.10s
  • P90: 2.58s
  • P95: 4.69s
  • P99: 8.47s

Distribution:
  • <500ms: 49.56%
  • 500ms-1s: 27.32%
  • 1s-2s: 10.45%
  • 2s-5s: 8.13%
  • >5s: 4.54%

Comparison vs Jan 27:
  • Throughput: 135,645/hr → 110,508/hr (-19%)
  • Success Rate: 93.20% → 94.01% (+0.81%)
  • Median Latency: 0.56s → 0.50s (-11%)
  • P90 Latency: 1.83s → 2.58s (+41%)
  • P95 Latency: 3.26s → 4.69s (+44%)
  • <500ms %: 33.86% → 49.56% (+46%)
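The percent changes in the comparison follow directly from the two runs' numbers; a sketch reproducing two of them:

```python
def pct_change(old, new):
    """Signed percent change from old to new, rounded to the nearest whole percent."""
    return round((new - old) / old * 100)

print(pct_change(135_645, 110_508))  # throughput change: -19
print(pct_change(1.83, 2.58))        # P90 latency change: 41
```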

JArguello-WMF claimed this task.

Feb 18 Tests

Throughput
  • Total Requests: 126,831
  • Success Rate: 99.84% (126,633 successful, 198 failed)
  • Actual RPS: 31.99
  • Requests/Hour: 115,152
  • Target Achievement (150K/hr): 76.77%

Latency (Successful Calls, n=126,633)
  • Median: 1.41s
  • Mean: 1.51s
  • P90: 2.67s
  • P95: 3.11s
  • P99: 4.06s

Distribution
  • <500ms: 13.23%
  • 500ms–1s: 19.90%
  • 1s–2s: 41.00%
  • 2s–5s: 25.60%
  • >5s: 0.27%

Wikidata Revert Risk — Scale Test Results 19 Feb
• Total Requests: 161,081
• Successful: 160,743
• Failed: 338
• Actual RPS: 41.94
• Test Duration: 60 minutes

Latency Distribution (successful calls):
• Min: 0.18s | Median: 0.43s | Mean: 0.54s
• p90: 0.81s | p95: 1.12s | p99: 2.02s
• < 500ms: 66.83%
• 500ms–1s: 26.82%
• 1s–2s: 5.32%
• 2s–5s: 0.97%
• > 5s: 0.07%

Decision: given the better results, we're going to move forward with productionizing at the current standard. We expect our reusers will want higher speed to depend on, but we will, as they say, "cross that bridge when we get there".