Load test the language agnostic article-quality model
Closed, Resolved | Public | 1 Estimated Story Points

Description

As an engineer, I want to perform some thorough load testing on the language agnostic articlequality model and report on p50, p90, p95, p99 latencies.

The model is intended to be used by WME (Wikimedia Enterprise), so we are aiming for a latency of 500 ms.
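
As a minimal sketch of the reporting step, the percentiles named above can be computed from raw response times and compared with the 500 ms target. The nearest-rank method, the function names, and the sample values are assumptions for illustration; Locust reports its own approximated percentiles, and the task does not say which percentile the 500 ms target applies to.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_report(samples_ms, target_ms=500):
    """Return {label: (latency_ms, within_target)} for p50, p90, p95 and p99."""
    report = {}
    for pct in (50, 90, 95, 99):
        value = percentile(samples_ms, pct)
        report[f"p{pct}"] = (value, value <= target_ms)
    return report


# Hypothetical response times in milliseconds, for illustration only.
samples = [95, 97, 99, 100, 110, 110, 130, 250, 790, 820]
for label, (value, ok) in latency_report(samples).items():
    print(f"{label}: {value} ms, within target: {ok}")
```

With so few samples the upper percentiles collapse onto the slowest requests, which is one reason to run enough requests before reading p99.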

Event Timeline

isarantopoulos set the point value for this task to 1.

Change #1135721 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] articlequality: add async requests

https://gerrit.wikimedia.org/r/1135721

I ran 2 types of load tests for the existing service:

  1. Scenario A. simulating 2 concurrent users - equivalent of making 2rps
  2. Scenario B. simulating 10 concurrent users - equivalent of making 10rps

I ran these 2 scenarios on ml-staging and got these results:

[ml-staging] Scenario A - 2rps

[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 09:22:20,515] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-04-10 09:22:20,516] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 09:08:38,878] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                                515     0(0.00%) |    956      72   16302    220 |    1.72        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                       515     0(0.00%) |    956      72   16302    220 |    1.72        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                     220    410    680   1000   2100   4100  12000  12000  16000  16000  16000    515
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                            220    410    680   1000   2100   4100  12000  12000  16000  16000  16000    515

[ml-staging] Scenario B - 10rps

MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-10 09:22:20,514] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 09:22:20,515] stat1008/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-10 09:22:20,516] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 09:27:19,691] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 09:27:19,823] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                               1228     0(0.00%) |   2172     396   17709   1100 |    4.21        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1228     0(0.00%) |   2172     396   17709   1100 |    4.21        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                    1100   1400   1800   2000   2700  13000  16000  16000  17000  18000  18000   1228
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                           1100   1400   1800   2000   2700  13000  16000  16000  17000  18000  18000   1228

The service clearly struggles to serve 10 rps: with 10 users it sustained only 4.21 req/s, with a median latency of 1100 ms and a p95 of 13000 ms. Looking at the Grafana dashboard during the load test, the cause is the preprocessing step. The main issue is that the requests that fetch the input data are not asynchronous, so they block the rest of the code.
I added this functionality in the patch attached to this task and then ran more load tests locally, for both scenarios A and B, before and after the change.
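
The effect of a blocking call inside an async service can be illustrated with a small standard-library sketch. This is not the code from the patch: the handler names are hypothetical, and `time.sleep` and `asyncio.sleep` merely stand in for a synchronous and an asynchronous HTTP request.

```python
import asyncio
import time

FETCH_SECONDS = 0.05  # stand-in for one upstream HTTP round trip (hypothetical value)


async def preprocess_blocking(rev_id):
    # A synchronous call inside a coroutine holds the event loop,
    # so no other request can make progress until it returns.
    time.sleep(FETCH_SECONDS)
    return rev_id


async def preprocess_async(rev_id):
    # Awaiting yields control to the event loop while the I/O is pending.
    await asyncio.sleep(FETCH_SECONDS)
    return rev_id


async def serve(handler, n_requests):
    """Run n_requests concurrently and return the wall-clock time in seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(i) for i in range(n_requests)))
    return time.perf_counter() - start


blocking_s = asyncio.run(serve(preprocess_blocking, 10))
async_s = asyncio.run(serve(preprocess_async, 10))
print(f"blocking: {blocking_s:.2f}s, async: {async_s:.2f}s")
```

In the blocking variant the ten simulated fetches run one after another, so total time grows with the number of concurrent requests; in the async variant they overlap.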

[local load test] Scenario A - synchronous requests

MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-10 15:43:41,268] wmf3251/INFO/locust.main: Run time limit set to 120 seconds
[2025-04-10 15:43:41,268] wmf3251/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 15:43:41,269] wmf3251/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-04-10 15:43:41,269] wmf3251/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 2} (2 total users)
[2025-04-10 15:45:40,938] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 15:45:41,007] wmf3251/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                                 44     0(0.00%) |   1784     265    8221   1200 |    0.44        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                        44     0(0.00%) |   1784     265    8221   1200 |    0.44        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                    1500   1900   2200   2800   3200   5600   8200   8200   8200   8200   8200     44
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                           1500   1900   2200   2800   3200   5600   8200   8200   8200   8200   8200     44

[local load test] Scenario B - synchronous requests

[2025-04-10 15:40:17,233] wmf3251/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-10 15:40:17,233] wmf3251/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 15:42:16,895] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 15:42:16,971] wmf3251/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                                 43     0(0.00%) |  22984    1261   55498  17000 |    0.37        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                        43     0(0.00%) |  22984    1261   55498  17000 |    0.37        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                   17000  26000  32000  38000  49000  52000  55000  55000  55000  55000  55000     43
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                          17000  26000  32000  38000  49000  52000  55000  55000  55000  55000  55000     43

[local load test] Scenario A - async requests

[2025-04-10 16:05:24,818] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 16:05:24,894] wmf3251/INFO/locust.main: Shutting down (exit code 1)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                                361     2(0.55%) |    458     244    5450    410 |    3.02        0.02
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                       361     2(0.55%) |    458     244    5450    410 |    3.02        0.02

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                     410    420    440    470    550    610   1400   1600   5500   5500   5500    361
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                            410    420    440    470    550    610   1400   1600   5500   5500   5500    361

Error report
# occurrences      Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
2                  POST /v1/models/articlequality:predict: BadStatusCode('http://localhost:8080/v1/models/articlequality:predict', code=500)
------------------|---------------------------------------------------------------------------------------------------------------------------------------------

[local load test] Scenario B - async requests

[2025-04-10 15:32:16,334] wmf3251/INFO/locust.main: Starting Locust 2.31.5
[2025-04-10 15:32:16,334] wmf3251/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-10 15:32:16,335] wmf3251/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-10 15:34:15,494] wmf3251/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-10 15:34:15,561] wmf3251/INFO/locust.main: Shutting down (exit code 1)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                               1339     8(0.60%) |    683     226    5896    580 |   11.24        0.07
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1339     8(0.60%) |    683     226    5896    580 |   11.24        0.07

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                     580    700    760    790    890   1100   3400   3800   5900   5900   5900   1339
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                            580    700    760    790    890   1100   3400   3800   5900   5900   5900   1339

Error report
# occurrences      Error
------------------|---------------------------------------------------------------------------------------------------------------------------------------------
8                  POST /v1/models/articlequality:predict: BadStatusCode('http://localhost:8080/v1/models/articlequality:predict', code=500)
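------------------|---------------------------------------------------------------------------------------------------------------------------------------------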

The results above show that the service scales much better now. I'll run another load test after we deploy to staging to verify the results.
In the local runs, p90 latency dropped from 3200 ms to 550 ms in Scenario A and from 49000 ms to 890 ms in Scenario B.

Change #1135721 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articlequality: add async requests

https://gerrit.wikimedia.org/r/1135721

Glad to see the latency dropping! One thought: I suspect that if we instrumented the preprocess step further, we'd find that much of the latency comes from fetching the HTML for the revision. @cscott gave that great DPE Deep Dive talk about a month ago on the Parser cache and how it works, so I'm tagging him to hopefully help clarify my guesses. The LiftWing model calls the "https://{lang}.wikipedia.org/w/rest.php/v1/revision/{revid}/html" endpoint (code) when assessing the quality of a given revision. I think that means:

  • Presuming the load tests use old revids, my understanding is that their HTML probably wasn't cached for the initial load test. It may or may not have been cached by the time of the follow-up requests, and a cache hit would greatly speed up responses.
  • My guess is that Enterprise intends to hit the LiftWing API for new revisions, so presumably the HTML will already be cached. A more accurate assessment might come from first running a "warm-up" pass over the revision IDs used in testing, to force them into the Parser cache, and then running the actual load tests. If you also want to know the worst case, you might switch to choosing random (old) revision IDs or something like that instead.
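
A warm-up pass along these lines could be sketched as follows. The URL template is the one quoted above; the function names, the User-Agent string, and the revision IDs in the usage comment are hypothetical, and whether a single request is enough to populate the Parser cache is itself one of the guesses made here.

```python
from urllib.request import Request, urlopen

HTML_URL = "https://{lang}.wikipedia.org/w/rest.php/v1/revision/{revid}/html"


def html_url(lang, revid):
    """Build the revision HTML URL for one (language, revision ID) pair."""
    return HTML_URL.format(lang=lang, revid=revid)


def http_get(url, timeout=30):
    """Fetch a URL and return the HTTP status code."""
    # Hypothetical User-Agent; use one that identifies your own tool.
    request = Request(url, headers={"User-Agent": "articlequality-loadtest-warmup"})
    with urlopen(request, timeout=timeout) as response:
        return response.status


def warm_up(revisions, fetch=http_get):
    """Request each revision's HTML once and return the URLs that failed."""
    failed = []
    for lang, revid in revisions:
        url = html_url(lang, revid)
        try:
            status = fetch(url)
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            status = None
        if status != 200:
            failed.append(url)
    return failed


# Usage (hypothetical revision IDs), run before starting the load test:
# failed = warm_up([("en", 12345), ("de", 67890)])
```

Revisions that fail to warm up are worth dropping from the test set, since they would otherwise mix error handling into the latency figures.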

After deploying to ml-staging, I reran the previous tests (the same ml-staging tests as the first 2 results in my previous comment).

[ml-staging] Scenario A - 2rps

MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-14 12:57:54,992] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-14 12:57:54,992] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-14 12:57:54,993] stat1008/INFO/locust.runners: Ramping to 2 users at a rate of 10.00 per second
[2025-04-14 12:57:54,994] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 2} (2 total users)
[2025-04-14 13:02:54,398] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-14 13:02:54,497] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                               1977     0(0.00%) |     99      83     820     95 |    6.60        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      1977     0(0.00%) |     99      83     820     95 |    6.60        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                      95     97     99    100    110    110    130    250    790    820    820   1977
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                             95     97     99    100    110    110    130    250    790    820    820   1977

[ml-staging] Scenario B - 10rps

MODEL=articlequality my_locust_venv/bin/locust --headless --csv results/articlequality
[2025-04-14 12:47:08,657] stat1008/INFO/locust.main: Run time limit set to 300 seconds
[2025-04-14 12:47:08,657] stat1008/INFO/locust.main: Starting Locust 2.31.5
[2025-04-14 12:47:08,658] stat1008/INFO/locust.runners: Ramping to 10 users at a rate of 10.00 per second
[2025-04-14 12:47:08,659] stat1008/INFO/locust.runners: All users spawned: {"ArticlequalityLanguageAgnostic": 10} (10 total users)
[2025-04-14 12:52:08,141] stat1008/INFO/locust.main: --run-time limit reached, shutting down
Load test results are within the threshold
[2025-04-14 12:52:08,251] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/articlequality:predict                                               9054     0(0.00%) |    128      81    5420    110 |   30.22        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                      9054     0(0.00%) |    128      81    5420    110 |   30.22        0.00

Response time percentiles (approximated)
Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/articlequality:predict                                                     110    120    130    140    170    240    310    350    810   5400   5400   9054
--------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                                            110    120    130    140    170    240    310    350    810   5400   5400   9054

The service clearly performs much better: p99 latency is now 250 ms in Scenario A and 350 ms in Scenario B, both under the 500 ms target, so we'll proceed to deploy this to production.
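
The request rates also changed between runs because, in a closed-loop test, a fixed number of users only produces as many requests per second as the response time allows. A rough check using Little's law (users = throughput × time per request cycle) on the ml-staging summary rows is sketched below. The function names are illustrative, and the "idle" figure is an inference, since the Locust wait-time setting is not shown in this task.

```python
def cycle_time_s(users, req_per_s):
    """Little's law for a closed loop: users = throughput * time per request cycle."""
    return users / req_per_s


def idle_time_s(users, req_per_s, avg_response_ms):
    """Part of each cycle that a user spends not waiting on a response."""
    return cycle_time_s(users, req_per_s) - avg_response_ms / 1000


# (users, req/s, average response time in ms) copied from the ml-staging summary rows.
runs = {
    "A before": (2, 1.72, 956),
    "B before": (10, 4.21, 2172),
    "A after": (2, 6.60, 99),
    "B after": (10, 30.22, 128),
}
for name, (users, rps, avg_ms) in runs.items():
    cycle = cycle_time_s(users, rps)
    idle = idle_time_s(users, rps, avg_ms)
    print(f"{name}: cycle {cycle:.2f}s, idle {idle:.2f}s")
```

The idle time comes out at about 0.2 s in all four runs, which suggests the higher request rates after the change come from faster responses rather than from a different test configuration.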

@Isaac you're totally right on this. Although adding async requests is an improvement, the tests reported above are not exactly comparable.
I'll rerun a couple of tests using the same revision IDs, warming up the cache beforehand.