
Upgrade model servers to kserve 0.11.2
Closed, Resolved · Public · 3 Estimated Story Points

Description

Upgrade kserve on the following model servers from 0.11.1 to 0.11.2 (the version bump itself is sketched below the list):

  • revscoring
  • langid
  • llm
  • revertrisk-language-agnostic
  • revertrisk-wikidata
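
A minimal sketch of the change, assuming each service image pins kserve in a pip requirements file (the exact file layout in inference-services may differ):

# hypothetical requirements.txt diff for one of the services listed above
-kserve==0.11.1
+kserve==0.11.2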

Event Timeline

Change 975814 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] Upgrade model servers to kserve 0.11.2

https://gerrit.wikimedia.org/r/975814

Just wanted to provide a reference for the revertrisk-wikidata model, which is currently under evaluation and improvement: T343419

Change 975814 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] Upgrade model servers to kserve 0.11.2

https://gerrit.wikimedia.org/r/975814

isarantopoulos lowered the priority of this task from High to Medium.
isarantopoulos set the point value for this task to 3.
isarantopoulos moved this task from Unsorted to In Progress on the Machine-Learning-Team board.

Change 976748 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update docker images to latest versions

https://gerrit.wikimedia.org/r/976748

Change 976748 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update docker images to latest versions

https://gerrit.wikimedia.org/r/976748

All revscoring models have been deployed on ml-staging.
While running some load tests on ml-staging for enwiki-goodfaith I noticed increased latencies compared to older load tests. This happens when the number of connections is >1.
I will investigate further to check whether the issue is with the inputs by running the same load test on production; until then I'm holding the production deployment.
Load tests with 1 thread and 1 connection gave the same results as before. The example below is a load test with increased latencies:

 wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency  --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   729.71ms  330.75ms   1.91s    68.14%
    Req/Sec     3.46      3.34    10.00     64.79%
  Latency Distribution
     50%  664.28ms
     75%  913.57ms
     90%    1.22s
     99%    1.55s
  145 requests in 1.00m, 50.98KB read
  Socket errors: connect 0, read 0, write 0, timeout 32
Requests/sec:      2.41
Transfer/sec:     868.76B

And an older run that was significantly better:

wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency -d 60 --header enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org -- enwiki.input
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   433.57ms   77.57ms 797.58ms   71.92%
    Req/Sec     5.44      2.80    10.00     67.64%
  Latency Distribution
     50%  410.98ms
     75%  474.18ms
     90%  546.55ms
     99%  697.93ms
  552 requests in 1.00m, 207.54KB read
Requests/sec:      9.19
Transfer/sec:      3.45KB

Running the same test on production gave the following results, so the difference doesn't seem to be that big (although there is some):

wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency  --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   524.99ms  486.69ms   2.00s    81.21%
    Req/Sec     3.97      3.49    10.00     59.57%
  Latency Distribution
     50%  294.92ms
     75%  730.76ms
     90%    1.27s
     99%    1.96s
  191 requests in 1.00m, 67.15KB read
  Socket errors: connect 0, read 0, write 0, timeout 42
Requests/sec:      3.18
Transfer/sec:      1.12KB
thread 1 made 95 requests and got 92 responses
thread 2 made 101 requests and got 99 responses

I see a drop in performance for articlequality as well (not as big as the one we saw with the other servers). At the top is the new version on staging and at the bottom the current production one:

wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency -d 60 -- articlequality.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.40s    40.59ms   1.51s    76.19%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%    1.39s
     75%    1.41s
     90%    1.47s
     99%    1.51s
  42 requests in 1.00m, 19.38KB read
Requests/sec:      0.70
Transfer/sec:     330.22B
thread 1 made 44 requests and got 42 responses
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency -d 60 -- articlequality.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.09s    89.55ms   1.30s    69.81%
    Req/Sec     0.08      0.27     1.00     92.45%
  Latency Distribution
     50%    1.09s
     75%    1.17s
     90%    1.19s
     99%    1.30s
  53 requests in 1.00m, 24.45KB read
Requests/sec:      0.88
Transfer/sec:     416.61B
thread 1 made 55 requests and got 53 responses

There is a difference in the deployments, as production is using multiprocessing, but the results are also not in line with previous [[ https://phabricator.wikimedia.org/T348265#9249921 | load tests run for articlequality ]] (same input).
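
For context on the multiprocessing difference: a kserve model server can run multiple worker processes via the ModelServer workers setting. A minimal sketch, assuming the standard kserve Python API (the class and model names below are hypothetical, not the actual revscoring entrypoint):

from kserve import Model, ModelServer

class ArticlequalityModel(Model):  # hypothetical name, for illustration only
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    def predict(self, payload, headers=None):
        # placeholder: the real service runs revscoring feature extraction
        # and model scoring here
        return {"predictions": []}

if __name__ == "__main__":
    model = ArticlequalityModel("enwiki-articlequality")
    # workers > 1 starts multiple server processes; per the comment above,
    # production articlequality runs with multiprocessing enabled while the
    # staging deployment under test did not
    ModelServer(workers=2).start([model])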

Results for drafttopic with kserve 0.11.2 on staging:

wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --header "Host: enwiki-drafttopic.revscoring-drafttopic.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   201.34ms  197.60ms 996.26ms   87.99%
    Req/Sec     7.03      3.13    10.00     86.13%
  Latency Distribution
     50%  120.68ms
     75%  195.33ms
     90%  410.24ms
     99%  979.44ms
  310 requests in 1.00m, 1.10MB read
Requests/sec:      5.16
Transfer/sec:     18.68KB
thread 1 made 312 requests and got 310 responses

And kserve 0.11.1 on production:

wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --header "Host: enwiki-drafttopic.revscoring-drafttopic.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   244.10ms  240.95ms   1.20s    86.51%
    Req/Sec     6.53      3.31    10.00     35.50%
  Latency Distribution
     50%  133.80ms
     75%  287.33ms
     90%  592.72ms
     99%    1.06s
  262 requests in 1.00m, 0.93MB read
Requests/sec:      4.36
Transfer/sec:     15.79KB
thread 1 made 264 requests and got 262 responses

Results are similar (new version slightly better), so we can proceed with upgrading this one.

Change 977605 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update articlequality and articletopic to kserve 0.11.2

https://gerrit.wikimedia.org/r/977605

Change 977605 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update articlequality and articletopic to kserve 0.11.2

https://gerrit.wikimedia.org/r/977605

I ran load tests for all revscoring model servers, comparing staging (kserve 0.11.2) with production (0.11.1).
All servers showed similar results, with the exception of articlequality, which was worse, as recorded in a previous comment. My assumption was that this is because we have multiprocessing enabled on that server, and this was validated: after deploying kserve 0.11.2 to prod, the load testing results are the same.
Something weird is going on with damaging and goodfaith though: they seem to perform worse in production than in staging, which could be related to the alerts we are getting every once in a while (https://phabricator.wikimedia.org/T351735).
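
The raw wrk output for damaging and goodfaith follows below. For reference, the side-by-side figures in these comparisons come straight out of the wrk summaries; a throwaway helper sketch (illustrative only, not part of any repo; filenames are hypothetical) that pulls Requests/sec and the latency percentiles out of two saved wrk outputs:

import re
import sys

def parse_wrk(path):
    """Extract Requests/sec and latency percentiles (in ms) from a saved wrk summary."""
    text = open(path).read()
    stats = {}
    m = re.search(r"Requests/sec:\s+([\d.]+)", text)
    if m:
        stats["rps"] = float(m.group(1))
    # lines like "     50%  664.28ms" or "     90%    1.22s"
    for pct, value, unit in re.findall(r"(\d+)%\s+([\d.]+)(ms|s)\b", text):
        stats["p" + pct] = float(value) * (1000 if unit == "s" else 1)
    return stats

if __name__ == "__main__":
    # usage: python compare_wrk.py staging.out prod.out
    staging, prod = parse_wrk(sys.argv[1]), parse_wrk(sys.argv[2])
    for key in sorted(set(staging) | set(prod)):
        print(f"{key}: staging={staging.get(key)} prod={prod.get(key)}")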

wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.53s     3.21s   12.67s    82.09%
    Req/Sec     2.82      2.43    10.00     72.73%
  Latency Distribution
     50%  768.89ms
     75%    3.79s
     90%    7.88s
     99%   12.67s
  44 requests in 1.00m, 15.41KB read
Requests/sec:      0.73
Transfer/sec:     263.03B
thread 1 made 46 requests and got 44 responses
isaranto@deploy2002:~/load_testing$ wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   884.27ms    1.05s    3.99s    82.99%
    Req/Sec     3.47      2.29    10.00     77.57%
  Latency Distribution
     50%  317.54ms
     75%    1.19s
     90%    2.84s
     99%    3.98s
  107 requests in 1.00m, 37.46KB read
Requests/sec:      1.78
Transfer/sec:     639.24B
thread 1 made 109 requests and got 107 responses

----------------------------------------------------------------------------------------------------------------------
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.12s     1.33s    6.58s    84.82%
    Req/Sec     3.01      2.40    10.00     77.78%
  Latency Distribution
     50%  427.58ms
     75%    1.60s
     90%    3.05s
     99%    5.84s
  81 requests in 1.00m, 28.46KB read
Requests/sec:      1.35
Transfer/sec:     485.71B
thread 1 made 83 requests and got 81 responses
isaranto@deploy2002:~/load_testing$ wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   886.40ms    1.06s    4.04s    82.99%
    Req/Sec     3.56      2.22    10.00     79.44%
  Latency Distribution
     50%  323.90ms
     75%    1.22s
     90%    2.85s
     99%    4.00s
  107 requests in 1.00m, 37.59KB read
Requests/sec:      1.78
Transfer/sec:     640.51B
thread 1 made 109 requests and got 107 responses

The above makes sense (answering myself 😛), as codfw in production is not idle but constantly gets traffic from enwiki.
Running a load test on eqiad verified this:

wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   922.10ms    1.07s    4.44s    82.44%
    Req/Sec     2.73      1.71    10.00     52.53%
  Latency Distribution
     50%  379.49ms
     75%    1.09s
     90%    2.83s
     99%    4.13s
  99 requests in 1.00m, 34.66KB read
Requests/sec:      1.65
Transfer/sec:     590.62B
thread 1 made 101 requests and got 99 responses
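
To make that concrete, here are the p50 latencies from the three enwiki-damaging runs above side by side (a quick throwaway calculation, not part of the load-testing setup):

# p50 latencies (ms) taken from the wrk outputs above
p50 = {"codfw prod": 768.89, "staging": 317.54, "eqiad prod": 379.49}
baseline = p50["staging"]
for name, value in p50.items():
    print(f"{name}: {value:.0f} ms ({value / baseline:.2f}x staging)")
# codfw prod: 769 ms (2.42x staging)
# staging: 318 ms (1.00x staging)
# eqiad prod: 379 ms (1.20x staging)

eqiad is roughly in line with staging, while codfw prod is well above both, which matches the live-traffic explanation.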

The following model servers have been upgraded to kserve 0.11.2:

  • revscoring
  • langid
  • llm
  • revertrisk-language-agnostic
  • revertrisk-wikidata

Also:

  • Ran load tests and verified that performance remains the same or improves
  • httpbb tests ran successfully for all clusters