Upgrade kserve on the following model servers from 0.11.1 to 0.11.2
- revscoring
- langid
- llm
- revertrisk-language-agnostic
- revertrisk-wikidata
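As a sanity check after rebuilding each image, the installed kserve version can be read from package metadata. A minimal sketch (illustrative only, run inside the container):

from importlib.metadata import version

# Reads installed package metadata, so no kserve import is needed.
assert version("kserve") == "0.11.2", version("kserve")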
Change 975814 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] Upgrade model servers to kserve 0.11.2
Just wanted to provide a reference for the revertrisk-wikidata model, which is currently under evaluation and improvement. T343419
Change 975814 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] Upgrade model servers to kserve 0.11.2
Change 976748 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update docker images to latest versions
Change 976748 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update docker images to latest versions
All revscoring models have been deployed on ml-staging.
While running some load tests on ml-staging for enwiki-goodfaith I noticed increased latencies compared to older load tests. This happens when the number of connections is >1.
I will investigate further to check whether the issue is with the inputs by running the same load test on production, but until then I'm holding the production deployment.
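For context, each wrk iteration issues a single predict call. A minimal Python equivalent of one request (a sketch only: the {"rev_id": N} payload shape is assumed from the Lift Wing API, and the real inputs come from enwiki.input via revscoring.lua):

import requests

# One load-test request; 123456 is a placeholder revision id, and the
# Host header selects the model server behind the shared endpoint.
url = "https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict"
headers = {"Host": "enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org"}
resp = requests.post(url, json={"rev_id": 123456}, headers=headers, timeout=2)
print(resp.status_code, resp.elapsed.total_seconds())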
Load tests with 1 thread and 1 connection had the same results. The example below is a load test with increased latencies:
wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 729.71ms 330.75ms 1.91s 68.14%
Req/Sec 3.46 3.34 10.00 64.79%
Latency Distribution
50% 664.28ms
75% 913.57ms
90% 1.22s
99% 1.55s
145 requests in 1.00m, 50.98KB read
Socket errors: connect 0, read 0, write 0, timeout 32
Requests/sec: 2.41
Transfer/sec: 868.76B
And an old one that was significantly better:
wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency -d 60 --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -- enwiki.input
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 433.57ms 77.57ms 797.58ms 71.92%
Req/Sec 5.44 2.80 10.00 67.64%
Latency Distribution
50% 410.98ms
75% 474.18ms
90% 546.55ms
99% 697.93ms
552 requests in 1.00m, 207.54KB read
Requests/sec: 9.19
Transfer/sec: 3.45KB
Running the same test on production gave the following results, so it doesn't seem to be such a big difference (although there is some):
wrk -c 4 -t 2 --timeout 2s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --latency --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
2 threads and 4 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 524.99ms 486.69ms 2.00s 81.21%
Req/Sec 3.97 3.49 10.00 59.57%
Latency Distribution
50% 294.92ms
75% 730.76ms
90% 1.27s
99% 1.96s
191 requests in 1.00m, 67.15KB read
Socket errors: connect 0, read 0, write 0, timeout 42
Requests/sec: 3.18
Transfer/sec: 1.12KB
thread 1 made 95 requests and got 92 responses
thread 2 made 101 requests and got 99 responses
I see a drop in performance (not as big as the one we experienced with other servers). This is for articlequality: at the top is the new version in staging, and at the bottom the current production one.
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency -d 60 -- articlequality.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.40s 40.59ms 1.51s 76.19%
Req/Sec 0.00 0.00 0.00 100.00%
Latency Distribution
50% 1.39s
75% 1.41s
90% 1.47s
99% 1.51s
42 requests in 1.00m, 19.38KB read
Requests/sec: 0.70
Transfer/sec: 330.22B
thread 1 made 44 requests and got 42 responses
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict --latency -d 60 -- articlequality.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.09s 89.55ms 1.30s 69.81%
Req/Sec 0.08 0.27 1.00 92.45%
Latency Distribution
50% 1.09s
75% 1.17s
90% 1.19s
99% 1.30s
53 requests in 1.00m, 24.45KB read
Requests/sec: 0.88
Transfer/sec: 416.61B
thread 1 made 55 requests and got 53 responses
There is a difference in the deployments, as production is using multiprocessing, but the results are also not in line with previous [[ https://phabricator.wikimedia.org/T348265#9249921 | load tests run for articlequality ]] (same input).
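For reference, the multiprocessing mentioned above refers to kserve worker processes. A minimal sketch of how a server enables them (hypothetical model class; only the workers argument matters here):

from kserve import Model, ModelServer

class ArticlequalityModel(Model):  # hypothetical stand-in for the real server
    def __init__(self, name: str):
        super().__init__(name)
        self.ready = True

    def predict(self, payload, headers=None):
        return {"predictions": []}  # placeholder

if __name__ == "__main__":
    # workers > 1 forks several model-server processes; this is the
    # deployment difference between production and staging noted above.
    ModelServer(workers=2).start([ArticlequalityModel("enwiki-articlequality")])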
Results for drafttopic, kserve 0.11.2 on staging:
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --header "Host: enwiki-drafttopic.revscoring-drafttopic.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 201.34ms 197.60ms 996.26ms 87.99%
Req/Sec 7.03 3.13 10.00 86.13%
Latency Distribution
50% 120.68ms
75% 195.33ms
90% 410.24ms
99% 979.44ms
310 requests in 1.00m, 1.10MB read
Requests/sec: 5.16
Transfer/sec: 18.68KB
thread 1 made 312 requests and got 310 responses
kserve 0.11.1 on production:
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict --header "Host: enwiki-drafttopic.revscoring-drafttopic.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-drafttopic:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 244.10ms 240.95ms 1.20s 86.51%
Req/Sec 6.53 3.31 10.00 35.50%
Latency Distribution
50% 133.80ms
75% 287.33ms
90% 592.72ms
99% 1.06s
262 requests in 1.00m, 0.93MB read
Requests/sec: 4.36
Transfer/sec: 15.79KB
thread 1 made 264 requests and got 262 responses
Results are similar (the new version is slightly better), so we can proceed with upgrading this one.
Change 977605 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update articlequality and articletopic to kserve 0.11.2
Change 977605 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update articlequality and articletopic to kserve 0.11.2
I ran load tests for all revscoring model servers, comparing staging (kserve 0.11.2) with production (0.11.1).
All servers showed similar results, with the exception of articlequality, which was worse, as recorded in a previous comment. My assumption was that this is because we have enabled multiprocessing on that server, and this assumption was validated, as load test results after deploying kserve 0.11.2 to production are the same.
There is something weird going on with damaging and goodfaith, though: they seem to perform worse in production than in staging, which could be related to the alerts we get every once in a while (https://phabricator.wikimedia.org/T351735).
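To make the staging/production comparison less eyeball-driven, something like the following can pull the latency percentiles out of saved wrk outputs (hypothetical helper and filenames, not part of the deployment):

import re

# Extract the "Latency Distribution" lines (e.g. "50%  664.28ms")
# from wrk output captured to text files.
def percentiles(path):
    with open(path) as f:
        return dict(re.findall(r"(\d+)%\s+([\d.]+m?s)", f.read()))

for name in ("staging.txt", "prod.txt"):  # assumed capture files
    print(name, percentiles(name))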
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 2.53s 3.21s 12.67s 82.09%
Req/Sec 2.82 2.43 10.00 72.73%
Latency Distribution
50% 768.89ms
75% 3.79s
90% 7.88s
99% 12.67s
44 requests in 1.00m, 15.41KB read
Requests/sec: 0.73
Transfer/sec: 263.03B
thread 1 made 46 requests and got 44 responses
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-damaging:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 884.27ms 1.05s 3.99s 82.99%
Req/Sec 3.47 2.29 10.00 77.57%
Latency Distribution
50% 317.54ms
75% 1.19s
90% 2.84s
99% 3.98s
107 requests in 1.00m, 37.46KB read
Requests/sec: 1.78
Transfer/sec: 639.24B
thread 1 made 109 requests and got 107 responses
----------------------------------------------------------------------------------------------------------------------
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.12s 1.33s 6.58s 84.82%
Req/Sec 3.01 2.40 10.00 77.78%
Latency Distribution
50% 427.58ms
75% 1.60s
90% 3.05s
99% 5.84s
81 requests in 1.00m, 28.46KB read
Requests/sec: 1.35
Transfer/sec: 485.71B
thread 1 made 83 requests and got 81 responses
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict --header "Host: enwiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference-staging.svc.codfw.wmnet:30443/v1/models/enwiki-goodfaith:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 886.40ms 1.06s 4.04s 82.99%
Req/Sec 3.56 2.22 10.00 79.44%
Latency Distribution
50% 323.90ms
75% 1.22s
90% 2.85s
99% 4.00s
107 requests in 1.00m, 37.59KB read
Requests/sec: 1.78
Transfer/sec: 640.51B
thread 1 made 109 requests and got 107 responses
The above makes sense (answering myself 😛), as codfw in production is not idle but constantly receives traffic for enwiki.
Running a load test on eqiad verified this:
wrk -c 1 -t 1 --timeout 50s -s revscoring.lua https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict --header "Host: enwiki-damaging.revscoring-editquality-damaging.wikimedia.org" --latency -d 60 -- enwiki.input
thread 1 created logfile wrk_1.log created
Running 1m test @ https://inference.svc.eqiad.wmnet:30443/v1/models/enwiki-damaging:predict
1 threads and 1 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 922.10ms 1.07s 4.44s 82.44%
Req/Sec 2.73 1.71 10.00 52.53%
Latency Distribution
50% 379.49ms
75% 1.09s
90% 2.83s
99% 4.13s
99 requests in 1.00m, 34.66KB read
Requests/sec: 1.65
Transfer/sec: 590.62B
thread 1 made 101 requests and got 99 responses
The following model servers have been upgraded to kserve 0.11.2
Also: