As an engineer, I want to perform some thorough load testing on the language agnostic articlequality model and report on p50, p90, p95, p99 latencies.
The model is intended to be used by WME so we are aiming for 500ms latency.
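For reference, the percentiles named above can be computed from raw response times and compared against the 500 ms target roughly as follows. This is a minimal sketch that assumes the nearest-rank method; Locust reports approximated percentiles, so its numbers may differ slightly. The helper names `percentile` and `latency_report` are illustrative only and are not part of the load test code.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def latency_report(samples_ms, target_ms=500):
    """Return p50/p90/p95/p99 and whether each one meets the target (500 ms by default)."""
    percentiles = {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 95, 99)}
    within_target = {name: value <= target_ms for name, value in percentiles.items()}
    return {"percentiles": percentiles, "within_target": within_target}


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, not measurements from this task.
    fake_samples = list(range(10, 1010, 10))
    print(latency_report(fake_samples))
```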
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| articlequality: add async requests | machinelearning/liftwing/inference-services | main | +29 -14 |
Change #1135721 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] articlequality: add async requests
I ran 2 types of load tests for the existing service:
I ran these 2 scenarios on ml-staging and got these results:
[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 09:22:20,515] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-04-10 09:22:20,516] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 09:08:38,878] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 515 0(0.00%) | 956 72 16302 220 | 1.72 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 515 0(0.00%) | 956 72 16302 220 | 1.72 0.00
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 220 410 680 1000 2100 4100 12000 12000 16000 16000 16000 515
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated
MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 09:22:20,515] stat1008/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-10 09:22:20,516] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 09:27:19,691] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 09:27:19,823] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 1228 0(0.00%) | 2172 396 17709 1100 | 4.21 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1228 0(0.00%) | 2172 396 17709 1100 | 4.21 0.00
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 1100 1400 1800 2000 2700 13000 16000 16000 17000 18000 18000 1228
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 1100 1400 1800 2000 2700 13000 16000 16000 17000 18000 18000 1228
The service clearly struggles to serve 10 rps, and the Grafana dashboard during the load test shows that this is because of the preprocessing step. The main issue is that the requests to obtain the data are not asynchronous, resulting in blocking code.
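To illustrate why blocking calls in preprocessing hurt throughput, here is a minimal sketch that simulates the upstream fetch with sleeps instead of real HTTP calls. Everything in it (`FETCH_DELAY_S`, `handle_blocking`, `handle_async`, `serve`, `compare`) is hypothetical and is not the code from the service or the patch; it only shows the general effect of blocking the event loop versus awaiting.

```python
import asyncio
import time

FETCH_DELAY_S = 0.1  # assumed stand-in for one upstream HTTP round trip


async def handle_blocking(revid):
    # A synchronous call inside a coroutine blocks the whole event loop,
    # so concurrent requests end up being processed one after another.
    time.sleep(FETCH_DELAY_S)
    return revid


async def handle_async(revid):
    # Awaiting yields control, so other requests can progress while this one waits.
    await asyncio.sleep(FETCH_DELAY_S)
    return revid


async def serve(handler, n_requests=5):
    """Run n_requests concurrently and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(handler(i) for i in range(n_requests)))
    return results, time.perf_counter() - start


def compare():
    """Return total wall-clock time for the blocking and the async handler."""
    _, blocking_s = asyncio.run(serve(handle_blocking))
    _, async_s = asyncio.run(serve(handle_async))
    return blocking_s, async_s


if __name__ == "__main__":
    blocking_s, async_s = compare()
    print(f"blocking: {blocking_s:.2f}s, async: {async_s:.2f}s")
```

With five concurrent requests, the blocking variant takes roughly five times the fetch delay in total, while the async variant takes roughly one fetch delay.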
I added this functionality in the patch attached to this task and then ran some more load tests locally, for both A and B scenarios, before and after the change.
MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-10 15:43:41,268] wmf3251/INFO/locust.main: Run time limit set to 120 seconds
[2025-04-10 15:43:41,268] wmf3251/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 15:43:41,269] wmf3251/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-04-10 15:43:41,269] wmf3251/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 2} (2 total users)
[2025-04-10 15:45:40,938] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 15:45:41,007] wmf3251/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 44 0(0.00%) | 1784 265 8221 1200 | 0.44 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 44 0(0.00%) | 1784 265 8221 1200 | 0.44 0.00
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 1500 1900 2200 2800 3200 5600 8200 8200 8200 8200 8200 44
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 1500 1900 2200 2800 3200 5600 8200 8200 8200 8200 8200 44
[2025-04-10 15:40:17,233] wmf3251/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-10 15:40:17,233] wmf3251/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 15:42:16,895] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 15:42:16,971] wmf3251/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 43 0(0.00%) | 22984 1261 55498 17000 | 0.37 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 43 0(0.00%) | 22984 1261 55498 17000 | 0.37 0.00
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 17000 26000 32000 38000 49000 52000 55000 55000 55000 55000 55000 43
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated
[2025-04-10 16:05:24,818] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 16:05:24,894] wmf3251/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 361 2(0.55%) | 458 244 5450 410 | 3.02 0.02
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 361 2(0.55%) | 458 244 5450 410 | 3.02 0.02
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 410 420 440 470 550 610 1400 1600 5500 5500 5500 361
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 410 420 440 470 550 610 1400 1600 5500 5500 5500 361
Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
2 POST /v1/models/articlequality:predict: BadStatusCode('http://localhost:8080/v1/models/articlequality:predict', code=500)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
[2025-04-10 15:32:16,334] wmf3251/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 15:32:16,334] wmf3251/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-10 15:32:16,335] wmf3251/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 15:34:15,494] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 15:34:15,561] wmf3251/INFO/locust.main: Shutting down (exit code 1)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 1339 8(0.60%) | 683 226 5896 580 | 11.24 0.07
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1339 8(0.60%) | 683 226 5896 580 | 11.24 0.07
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 580 700 760 790 890 1100 3400 3800 5900 5900 5900 1339
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 580 700 760 790 890 1100 3400 3800 5900 5900 5900 1339
Error report
# occurrences Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
8 POST /v1/models/articlequality:predict: BadStatusCode('http://localhost:8080/v1/models/articlequality:predict', code=500)
------------------|--
The above results clearly show that the service scales much better now. I'll run another load test after we deploy to staging to verify the results.
p90 latency is < 600 ms, while before this it was < 3 seconds.
Change #1135721 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] articlequality: add async requests
Glad to see the latency dropping! One thought: I suspect if we further instrumented the preprocess step, much of the latency is from how long it takes to get the HTML for the revision. @cscott gave that great DPE Deep Dive talk a month ago or so about Parser cache and how it works so tagging him to hopefully help clarify my guesses. The LiftWing model calls the "https://{lang}.wikipedia.org/w/rest.php/v1/revision/{revid}/html" endpoint (code) when assessing the quality for a given revision. I think that means:
After deploying to ml-staging I reran the previous tests (the same test on ml-staging as the first 2 results in the previous comment)
MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-14 12:57:54,992] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-14 12:57:54,992] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-14 12:57:54,993] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-04-14 12:57:54,994] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 2} (2 total users)
[2025-04-14 13:02:54,398] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-14 13:02:54,497] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 1977 0(0.00%) | 99 83 820 95 | 6.60 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 1977 0(0.00%) | 99 83 820 95 | 6.60 0.00
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 95 97 99 100 110 110 130 250 790 820 820 1977
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated 95 97 99 100 110 110 130 250 790 820 820 1977
MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-14 12:47:08,657] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-14 12:47:08,657] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-14 12:47:08,658] stat1008/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-14 12:47:08,659] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-14 12:52:08,141] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-14 12:52:08,251] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type Name # reqs # fails | Avg Min Max Med | req/s failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST /v1/models/articlequality:predict 9054 0(0.00%) | 128 81 5420 110 | 30.22 0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
Aggregated 9054 0(0.00%) | 128 81 5420 110 | 30.22 0.00
Response time percentiles (approximated)
Type Name 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST /v1/models/articlequality:predict 110 120 130 140 170 240 310 350 810 5400 5400 9054
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
Aggregated
It is clear that the service performs much better, so we'll proceed to deploy this to production.
@Isaac you're totally right on this. Although adding async requests is an improvement, the tests reported above are not exactly comparable.
I'll rerun a couple of tests using the same revision ids, warming up the cache beforehand.
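A minimal sketch of what such a comparable rerun could look like: one discarded warm-up pass over a fixed list of revision ids, followed by a measured pass over the same ids. The `predict` callable and the function names are placeholders, not the actual Locust setup used in this task; in a real run `predict` would send the request to the service.

```python
import time


def timed_pass(predict, rev_ids):
    """Call predict once per revision id and return the latencies in milliseconds."""
    latencies_ms = []
    for rev_id in rev_ids:
        start = time.perf_counter()
        predict(rev_id)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def warm_then_measure(predict, rev_ids):
    """Warm the caches with one discarded pass, then measure the same ids again."""
    timed_pass(predict, rev_ids)  # warm-up pass, results intentionally discarded
    return timed_pass(predict, rev_ids)


if __name__ == "__main__":
    # Hypothetical stand-in for a request to the model server.
    def fake_predict(rev_id):
        time.sleep(0.01)

    print(warm_then_measure(fake_predict, [1, 2, 3]))
```

Using the same ids before and after a change keeps cache effects roughly equal between runs, so the difference in latency is more likely to come from the change itself.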