
Run load tests for the article-descriptions isvc
Open, Needs Triage, Public

Description

In T343123 we created the article-descriptions model-server, which is currently hosted in the experimental namespace on LiftWing.

We worked on optimizing response time for a single request in T353127.

Now we would like to run load tests and measure how many parallel requests the article-descriptions isvc can handle effectively.

Event Timeline

I ran load tests using most of the languages supported by the model, with the number of beams set to 3 based on T343123#9380779. All the inputs used for the request payloads can be found in P54507. Below are the load test results:

  1. requests < 50 in 30s:
kevinbazira@deploy2002:~$ wrk -t 2 -c 6 -d 30s -s article-descriptions.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --latency -- article-descriptions.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 30s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict
  2 threads and 6 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   996.93ms  681.67ms   1.89s    56.25%
    Req/Sec     2.67      4.16    20.00     87.18%
  Latency Distribution
     50%    1.29s 
     75%    1.73s 
     90%    1.85s 
     99%    1.89s 
  43 requests in 30.05s, 11.16KB read
  Socket errors: connect 0, read 0, write 0, timeout 27
  Non-2xx or 3xx responses: 40
Requests/sec:      1.43
Transfer/sec:     380.22B
thread 1 made 28 requests and got 24 responses
thread 2 made 22 requests and got 19 responses
  2. requests > 50 in 30s:
kevinbazira@deploy2002:~$ wrk -t 8 -c 24 -d 30s -s article-descriptions.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --latency -- article-descriptions.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
thread 3 created logfile wrk_3.log created
thread 4 created logfile wrk_4.log created
thread 5 created logfile wrk_5.log created
thread 6 created logfile wrk_6.log created
thread 7 created logfile wrk_7.log created
thread 8 created logfile wrk_8.log created
Running 30s test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict
  8 threads and 24 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    26.92ms    7.43ms  34.51ms   66.67%
    Req/Sec     1.68      2.83    10.00     84.62%
  Latency Distribution
     50%   28.40ms
     75%   32.67ms
     90%   34.51ms
     99%   34.51ms
  82 requests in 30.05s, 20.01KB read
  Socket errors: connect 0, read 0, write 0, timeout 76
  Non-2xx or 3xx responses: 77
Requests/sec:      2.73
Transfer/sec:     682.07B
thread 1 made 13 requests and got 9 responses
thread 2 made 13 requests and got 10 responses
thread 3 made 13 requests and got 10 responses
thread 4 made 13 requests and got 10 responses
thread 5 made 14 requests and got 11 responses
thread 6 made 13 requests and got 10 responses
thread 7 made 14 requests and got 11 responses
thread 8 made 14 requests and got 11 responses
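
For context, a single request to this endpoint looks roughly like the following Python sketch. The payload fields (lang, title, num_beams) are assumed from the model-server work in T343123, and the example title is illustrative only; this is not an exact payload from P54507.

import requests

# Illustrative single request to the article-descriptions isvc. The payload
# schema (lang/title/num_beams) is assumed from T343123; num_beams=3 matches
# the load tests above. The Host header routes to the experimental namespace.
url = "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict"
headers = {
    "Host": "article-descriptions.experimental.wikimedia.org",
    "Content-Type": "application/json",
}
payload = {"lang": "en", "title": "Clandonald", "num_beams": 3}  # example title
resp = requests.post(url, headers=headers, json=payload, timeout=60)
print(resp.status_code, resp.text)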

Based on the above reports, the isvc currently running on 1 pod in the experimental namespace peaked at 20 requests per second on a single thread when fewer than 50 requests were made within a 30-second window. However, when the total exceeded 50 requests within the same duration, the peak per-thread rate dropped to a maximum of around 10 requests per second.
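
As a quick sanity check, wrk's aggregate Requests/sec is simply completed requests divided by test duration, which is far below the per-thread peaks quoted above:

# Reproduce wrk's aggregate Requests/sec figures from the two runs above.
duration_s = 30.05
print(round(43 / duration_s, 2))  # run 1: ~1.43 req/s
print(round(82 / duration_s, 2))  # run 2: ~2.73 req/s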

We shall compare these numbers with the anticipated load that the Android team will share in response to T343123#9420718.

Change 985127 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] test: add load test script and input for article-descriptions

https://gerrit.wikimedia.org/r/985127

Change 985127 merged by Kevin Bazira:

[machinelearning/liftwing/inference-services@main] test: add load test script and input for article-descriptions

https://gerrit.wikimedia.org/r/985127

As discussed, we need to rerun the above tests: most of the requests failed, so the statistics are not really useful at the moment (in the first run it seems that 27 out of 43 requests timed out, and in the second 76 out of 82).
I suggest we set aside the wrk/Lua tests and write a test that we can run with Locust, as sketched below.
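
For reference, here is a minimal sketch of what such a Locust test might look like. The class name, wait times, and payload fields (lang, title, num_beams) are assumptions for illustration; the actual test is the one uploaded in the change below.

from locust import HttpUser, task, between

class ArticleDescriptionsUser(HttpUser):
    # Assumed pacing between requests; tune to match the target load profile.
    wait_time = between(1, 3)

    @task
    def predict(self):
        # Payload schema assumed from T343123; the Host header routes to the
        # experimental namespace on LiftWing. The title is an example only.
        self.client.post(
            "/v1/models/article-descriptions:predict",
            json={"lang": "en", "title": "Clandonald", "num_beams": 3},
            headers={"Host": "article-descriptions.experimental.wikimedia.org"},
        )

Such a file could be run with, e.g., locust -f article_descriptions.py --host https://inference-staging.svc.codfw.wmnet:30443 --headless -u 10 -r 2 -t 30s, which drives 10 concurrent users for 30 seconds.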

Change 995039 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] locust: add article_descriptions load tests

https://gerrit.wikimedia.org/r/995039

Change 995039 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] locust: add article_descriptions load tests

https://gerrit.wikimedia.org/r/995039