
Upgrade Revert Risk Language-agnostic docker images to KServe 0.11
Closed, ResolvedPublic3 Estimated Story Points

Description

We're going to upgrade the Revert Risk Language-agnostic model server docker images to KServe 0.11.

Event Timeline

Change 964559 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to KServe 0.11.1

https://gerrit.wikimedia.org/r/964559

Change 964559 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to KServe 0.11.1

https://gerrit.wikimedia.org/r/964559

Change 965057 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: upgrade kserve to 0.11.1 for revertrisk

https://gerrit.wikimedia.org/r/965057

Change 965057 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: upgrade kserve to 0.11.1 for revertrisk

https://gerrit.wikimedia.org/r/965057

When testing the model in ml-staging, the following error was encountered, resulting in the pod being in a CrashLoopBackOff state:

Message:     Traceback (most recent call last):
  File "/srv/revert-risk-model/model-server/model.py", line 6, in <module>
    import kserve
  File "/opt/lib/python/site-packages/kserve/__init__.py", line 18, in <module>
    from .model_server import ModelServer
  File "/opt/lib/python/site-packages/kserve/model_server.py", line 25, in <module>
    from ray import serve as rayserve
  File "/opt/lib/python/site-packages/ray/__init__.py", line 136, in <module>
    from ray._private.worker import (  # noqa: E402,F401
  File "/opt/lib/python/site-packages/ray/_private/worker.py", line 50, in <module>
    import ray._private.parameter
  File "/opt/lib/python/site-packages/ray/_private/parameter.py", line 4, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'

The error No module named 'pkg_resources' typically indicates that the setuptools package is missing or misconfigured. I added python3-setuptools to the Blubber file for the production image, and that resolved the problem.
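For context, the traceback above boils down to ray importing pkg_resources, which is shipped by setuptools; if setuptools is absent from the image, the whole kserve import chain fails. A minimal sketch of a check for this (module_available is an illustrative helper, not part of the model server):

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# In the broken image, module_available("pkg_resources") would be False,
# and `import kserve` fails transitively via ray -> pkg_resources.
```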

Change 965094 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revert-risk: add python3-setuptools to revertrisk-la blubber file

https://gerrit.wikimedia.org/r/965094

Change 965094 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: add python3-setuptools to revertrisk-la blubber file

https://gerrit.wikimedia.org/r/965094

Change 965146 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revertrisk-la docker image

https://gerrit.wikimedia.org/r/965146

Change 965146 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revertrisk-la docker image

https://gerrit.wikimedia.org/r/965146

Change 965532 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging

https://gerrit.wikimedia.org/r/965532

Change 965532 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging

https://gerrit.wikimedia.org/r/965532

@achou found a latency regression when load testing RR-LA with KServe 0.11 on ml-staging. After some digging, we found that the Python process running KServe 0.11 spawns far more threads than before (~200 vs ~10) and uses far more CPU time, ending up severely throttled by Kubernetes. Most of the CPU time in the new threads seems to be spent in libgomp (the OpenMP runtime), used by XGBoost (brought in by the Knowledge Integrity package).

From a quick check of RR-LA with KServe 0.10 we didn't see a change in XGBoost's or libgomp's version, but the current theory is that some change (likely a dependency) triggered more parallelism, which in turn increased CPU usage and caused the throttling.
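One way to observe the thread-count difference described above (a Linux-only sketch; thread_count is an illustrative helper, this is not how we instrumented the pod) is to count the entries under /proc/<pid>/task:

```python
import os

def thread_count(pid: str = "self") -> int:
    """Count the threads of a process by listing /proc/<pid>/task (Linux only).

    Each kernel thread of a process gets its own directory entry there,
    so the length of the listing is the live thread count.
    """
    return len(os.listdir(f"/proc/{pid}/task"))

# e.g. thread_count() for the current process, or thread_count("1234")
# for the model-server pid inside the pod.
```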

Afaics from the KI code, we use XGBoost's DMatrix:
https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/revertrisk.py#L103

From the Python docs I see the following:

nthread (integer, optional) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.

My impression is that for some reason XGBoost now gets the number of CPU cores available on the bare-metal k8s node (not the container) and creates that many threads.
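This matches how Python itself reports CPUs: a quick sketch showing why a "use all cores" default oversubscribes inside a container (default_nthread is a hypothetical helper mirroring the nthread=-1 behavior, not XGBoost's actual code):

```python
import os

def default_nthread() -> int:
    """Approximate what 'maximum threads available on the system' resolves to.

    os.cpu_count() reports the host's logical cores, NOT the container's
    cgroup CPU quota -- so on a large k8s node a thread-pool sized this way
    vastly exceeds the pod's CPU limit and gets throttled.
    """
    return os.cpu_count() or 1
```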

My theory may not be correct: I see https://github.com/dmlc/xgboost/pull/7654, which should be included in XGBoost 1.6+, and afaics from KI we have 1.7.6.

After re-reading https://github.com/dmlc/xgboost/issues/7653 I am wondering whether setting nthread=-1 makes a difference in our use case.

Answer: It seems that -1 is used when we don't specify any value.

I found https://github.com/dmlc/xgboost/pull/9651, merged 2 days ago, which is what would fix our use case: the code that gets the max number of CPUs a container offers (represented by the cgroup) is not compatible with what we use now (cgroups v2).
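For reference, under cgroups v2 the container's CPU limit lives in a single cpu.max file with the format "<quota> <period>" (or "max <period>" when unlimited); code that still looks for the cgroups v1 cpu.cfs_quota_us/cpu.cfs_period_us files sees nothing and falls back to the host core count. A minimal sketch of parsing the v2 format (parse_cpu_max is an illustrative helper, not the actual XGBoost fix):

```python
from typing import Optional

def parse_cpu_max(text: str) -> Optional[float]:
    """Parse a cgroups v2 cpu.max line into an effective CPU count.

    The file contains "<quota> <period>" in microseconds, e.g.
    "200000 100000" means 2 CPUs; "max <period>" means no limit (None).
    """
    quota, period = text.split()
    if quota == "max":
        return None  # unlimited: fall back to the host core count
    return int(quota) / int(period)

# In a pod this would be read from /sys/fs/cgroup/cpu.max.
```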

The fix should be included in XGBoost 2.0.1 (not yet released), which is probably a big jump for KI :(

Remaining to understand: why did we see this change in behavior from XGBoost?

Change 965666 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS for revertrisk-la

https://gerrit.wikimedia.org/r/965666

Change 965666 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS for revertrisk-la

https://gerrit.wikimedia.org/r/965666
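As an interim workaround, capping OMP_NUM_THREADS keeps libgomp from sizing its pool to the host's cores. The variable has to be in the environment before libgomp is initialized (i.e. before xgboost is imported), which is why we set it at the deployment-chart level rather than in Python. A minimal sketch of the same idea in-process (pin_omp_threads is a hypothetical helper, not code from the chart):

```python
import os
from typing import MutableMapping, Optional

def pin_omp_threads(n: int = 1,
                    env: Optional[MutableMapping[str, str]] = None) -> str:
    """Cap the OpenMP thread pool, respecting any value already set.

    Must run before libgomp loads (e.g. before `import xgboost`),
    since the pool size is read once at initialization.
    """
    env = os.environ if env is None else env
    env.setdefault("OMP_NUM_THREADS", str(n))
    return env["OMP_NUM_THREADS"]
```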

Change 967442 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS in all revertrisk isvc

https://gerrit.wikimedia.org/r/967442

Change 967442 merged by Elukey:

[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS in all revertrisk isvc

https://gerrit.wikimedia.org/r/967442

cgroup v2 support is here in the latest xgboost patch release https://github.com/dmlc/xgboost/releases/tag/v2.0.1 !

There will need to be a change in Knowledge Integrity's dependencies, as the current dependency specification for xgboost would not allow v2.0.1 to be installed.

calbon triaged this task as Medium priority.Nov 2 2023, 6:35 PM
calbon set the point value for this task to 3.
calbon changed the point value for this task from 3 to 3.5.Nov 2 2023, 6:38 PM
calbon changed the point value for this task from 3.5 to 3.Nov 2 2023, 6:57 PM

Change 975008 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] revert kserve upgrades

https://gerrit.wikimedia.org/r/975008

Change 975008 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert kserve upgrades

https://gerrit.wikimedia.org/r/975008

Change 975205 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: rollback xgboost/catboost models to kserve 0.10

https://gerrit.wikimedia.org/r/975205

Change 975274 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade Kserve 0.11.1 and knowledge integrity 0.5.0

https://gerrit.wikimedia.org/r/975274

Change 975274 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade Kserve 0.11.1 and knowledge integrity 0.5.0

https://gerrit.wikimedia.org/r/975274

Change 975304 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update revertrisk-la image and model binary

https://gerrit.wikimedia.org/r/975304

Change 975304 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update revertrisk-la image and model binary

https://gerrit.wikimedia.org/r/975304

Update:

The revertrisk-la image (kserve 0.11.1 & knowledge integrity v0.5.0) with model binary v3 has been deployed to staging. I ran some load tests and can confirm the latency issue has been fixed with xgboost 2.0.1. Therefore, there is no need to set the env var OMP_NUM_THREADS manually.

Latency for the model servers on the Grafana dashboard (green is the old model server, yellow is the new one):

latency.png (1×2 px, 349 KB)

Change 976748 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update docker images to latest versions

https://gerrit.wikimedia.org/r/976748

Change 975205 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: rollback xgboost/catboost models to kserve 0.10

Reason:

abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/976748 which has the latest updates

https://gerrit.wikimedia.org/r/975205

Change 976748 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update docker images to latest versions

https://gerrit.wikimedia.org/r/976748

Upgraded the server and ran some load tests. Results are in line with past values:

wrk -c 4 -t 2 --timeout 3s -s revertrisk.lua https://inference.svc.codfw.wmnet:30443/v1/models/revertrisk-language-agnostic:predict --header "Host: revertrisk-language-agnostic.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 10s test @ https://inference.svc.codfw.wmnet:30443/v1/models/revertrisk-language-agnostic:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   102.54ms   26.00ms 224.48ms   88.29%
    Req/Sec    15.38      5.19    30.00     53.51%
  Latency Distribution
     50%   95.04ms
     75%  104.36ms
     90%  132.62ms
     99%  212.92ms
  299 requests in 10.01s, 109.63KB read
  Non-2xx or 3xx responses: 4
Requests/sec:     29.86
Transfer/sec:     10.95KB
thread 1 made 151 requests and got 149 responses
thread 2 made 152 requests and got 150 responses