We're going to upgrade the Revert Risk Language-agnostic model server docker images to KServe 0.11
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | elukey | T337213 Update to KServe 0.11
Resolved | | achou | T347550 Upgrade Revert Risk Language-agnostic docker images to KServe 0.11
Open | BUG REPORT | None | T349844 Increased latencies with Kserve 0.11.1 (cgroups v2)
Resolved | | MunizaA | T350389 Upgrade xgboost in knowledge_integrity
Open | | isarantopoulos | T353461 Allow to set Catboost's threads in readability-liftwing
Event Timeline
Change 964559 had a related patch set uploaded (by Elukey; author: Elukey):
[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to KServe 0.11.1
Change 964559 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to KServe 0.11.1
Change 965057 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: upgrade kserve to 0.11.1 for revertrisk
Change 965057 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: upgrade kserve to 0.11.1 for revertrisk
When testing the model in ml-staging, the following error was encountered, resulting in the pod being in a CrashLoopBackOff state:
```
Message: Traceback (most recent call last):
  File "/srv/revert-risk-model/model-server/model.py", line 6, in <module>
    import kserve
  File "/opt/lib/python/site-packages/kserve/__init__.py", line 18, in <module>
    from .model_server import ModelServer
  File "/opt/lib/python/site-packages/kserve/model_server.py", line 25, in <module>
    from ray import serve as rayserve
  File "/opt/lib/python/site-packages/ray/__init__.py", line 136, in <module>
    from ray._private.worker import (  # noqa: E402,F401
  File "/opt/lib/python/site-packages/ray/_private/worker.py", line 50, in <module>
    import ray._private.parameter
  File "/opt/lib/python/site-packages/ray/_private/parameter.py", line 4, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'
```
The error No module named 'pkg_resources' typically indicates an issue with the installation or configuration of the setuptools package. I added python3-setuptools to the Blubber file for the production image, which resolved the problem.
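For context, pkg_resources is shipped by setuptools rather than the Python standard library, which is why a slim base image without python3-setuptools breaks ray's import chain. A quick way to check for it without triggering the failing import itself:

```python
import importlib.util

# pkg_resources comes from setuptools, not the stdlib; if setuptools is
# absent, "import pkg_resources" raises ModuleNotFoundError, which is
# what took down the ray import chain inside kserve.
def has_pkg_resources() -> bool:
    return importlib.util.find_spec("pkg_resources") is not None

print(has_pkg_resources())
```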
Change 965094 had a related patch set uploaded (by AikoChou; author: AikoChou):
[machinelearning/liftwing/inference-services@main] revert-risk: add python3-setuptools to revertrisk-la blubber file
Change 965094 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revert-risk: add python3-setuptools to revertrisk-la blubber file
Change 965146 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: update revertrisk-la docker image
Change 965146 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update revertrisk-la docker image
Change 965532 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging
Change 965532 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: deploy a revertrisk-la that uses kserve 0.10 in staging
@achou found a latency regression when load testing RR-LA with KServe 0.11 on ml-staging. After some digging, we found that the Python process running KServe 0.11 spawns far more threads than before (~200 vs ~10) and uses far more CPU time, ending up severely throttled by Kubernetes. Most of the CPU time in the new threads is spent in libgomp (the OpenMP runtime), used by XGBoost (brought in by the Knowledge Integrity package).
From a quick check of RR-LA with KServe 0.10 we didn't see a change in XGBoost's or libgomp's versions, so the current theory is that some change (likely in a dependency) triggered more parallelism, which in turn caused the extra CPU usage and the throttling.
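A minimal sketch of the kind of check behind the ~200 vs ~10 thread comparison: on Linux, the kernel reports a process's thread count in /proc/&lt;pid&gt;/status.

```python
import os
from typing import Optional

# Read the kernel's thread count for a process from /proc/<pid>/status
# (the "Threads:" field) -- a simple way to compare the KServe 0.10 and
# 0.11 pods. Linux-only, since it relies on procfs.
def thread_count(pid: Optional[int] = None) -> int:
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("Threads: field not found")

print(thread_count())
```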
As far as I can see from the KI code, we use XGBoost's DMatrix:
https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/blob/main/knowledge_integrity/models/revertrisk.py#L103
From the XGBoost Python docs I see the following:
nthread (integer, optional) – Number of threads to use for loading data when parallelization is applicable. If -1, uses maximum threads available on the system.
My impression is that, for some reason, XGBoost now picks up the number of CPU cores available on the bare-metal Kubernetes node (not the container) and creates that many threads.
My theory may not be correct, though: https://github.com/dmlc/xgboost/pull/7654 should be included in XGBoost 1.6+, and KI pulls in 1.7.6 afaics.
After re-reading https://github.com/dmlc/xgboost/issues/7653 I am wondering if setting nthread=-1 makes any difference in our use case.
Answer: it seems that -1 is the value used when we don't specify anything.
I found https://github.com/dmlc/xgboost/pull/9651, released 2 days ago, which is exactly what we need: the code that gets the maximum number of CPUs that a container offers (represented by the cgroup) is not compatible with what we use now (cgroups v2).
The fix should be included in XGBoost 2.0.1 (not yet released), which is probably a big jump for KI :(
Remaining to understand: why did we see this change in behavior from XGBoost?
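To illustrate the mismatch: native probes like os.cpu_count() report the host's cores, while the container's actual CPU quota lives in the cgroup files. A hedged sketch, assuming cgroups v2 at its standard mount point:

```python
import os

# cgroups v2 exposes the CPU quota as "<quota> <period>" in cpu.max
# ("max" means unlimited). Code that only knows the cgroups v1 files
# (cpu.cfs_quota_us / cpu.cfs_period_us) finds nothing there and falls
# back to the host's full core count -- hence the thread explosion.
def cgroup_v2_cpu_limit(path: str = "/sys/fs/cgroup/cpu.max"):
    try:
        quota, period = open(path).read().split()
    except (OSError, ValueError):
        return None  # file missing or unexpected format
    if quota == "max":
        return None  # no CPU limit configured
    return int(quota) / int(period)

print(os.cpu_count(), cgroup_v2_cpu_limit())
```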
Change 965666 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS for revertrisk-la
Change 965666 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS for revertrisk-la
Change 967442 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS in all revertrisk isvc
Change 967442 merged by Elukey:
[operations/deployment-charts@master] ml-services: set OMP_NUM_THREADS in all revertrisk isvc
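The patches above cap libgomp's thread pool via the environment. OMP_NUM_THREADS is read when the OpenMP runtime initializes, so it must be present before the OpenMP-linked library (xgboost) is loaded, which is why it is set in the pod spec rather than in Python after import. A sketch of the equivalent in-process guard:

```python
import os

# OMP_NUM_THREADS must be in the environment before the OpenMP runtime
# initializes (i.e. before importing xgboost); setting it afterwards has
# no effect. The real deployment sets it in the isvc container env.
os.environ["OMP_NUM_THREADS"] = "1"
print(os.environ["OMP_NUM_THREADS"])
```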
cgroups v2 support is here in the latest XGBoost patch release: https://github.com/dmlc/xgboost/releases/tag/v2.0.1 !
Knowledge Integrity's dependencies will need a change, as the current xgboost version constraint does not allow 2.0.1 to be installed.
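A tiny illustration of why the pin blocks the fix; the "&lt;2.0.0" upper bound here is an assumption standing in for KI's real xgboost constraint:

```python
# Compare dotted version strings as integer tuples. Under a hypothetical
# upper bound of "<2.0.0" (KI's actual constraint may differ), 2.0.1 is
# rejected, so the constraint must be relaxed before the fix can land.
def parse(v: str):
    return tuple(int(x) for x in v.split("."))

allowed = parse("2.0.1") < parse("2.0.0")
print(allowed)  # 2.0.1 does not satisfy the "<2.0.0" bound
```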
Change 975008 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] revert kserve upgrades
Change 975008 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revert kserve upgrades
Change 975205 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: rollback xgboost/catboost models to kserve 0.10
Change 975274 had a related patch set uploaded (by AikoChou; author: AikoChou):
[machinelearning/liftwing/inference-services@main] revert-risk: upgrade Kserve 0.11.1 and knowledge integrity 0.5.0
Change 975274 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] revert-risk: upgrade Kserve 0.11.1 and knowledge integrity 0.5.0
Change 975304 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: update revertrisk-la image and model binary
Change 975304 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update revertrisk-la image and model binary
Update:
The revertrisk-la image (KServe 0.11.1 and Knowledge Integrity v0.5.0) with model binary v3 has been deployed to staging. I ran some load tests and can confirm the latency issue is fixed with xgboost 2.0.1, so there is no longer any need to set the OMP_NUM_THREADS env var manually.
Latency for the model servers on the Grafana dashboard (green is the old model server, yellow is the new one).
Change 976748 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update docker images to latest versions
Change 975205 abandoned by Ilias Sarantopoulos:
[operations/deployment-charts@master] ml-services: rollback xgboost/catboost models to kserve 0.10
Reason:
abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/976748 which has the latest updates
Change 976748 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update docker images to latest versions
Upgraded the server and ran some load tests. Results are in line with past values:
```
wrk -c 4 -t 2 --timeout 3s -s revertrisk.lua https://inference.svc.codfw.wmnet:30443/v1/models/revertrisk-language-agnostic:predict --header "Host: revertrisk-language-agnostic.revertrisk.wikimedia.org" --latency -- revertrisk.input
thread 1 created logfile wrk_1.log created
thread 2 created logfile wrk_2.log created
Running 10s test @ https://inference.svc.codfw.wmnet:30443/v1/models/revertrisk-language-agnostic:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   102.54ms   26.00ms  224.48ms   88.29%
    Req/Sec    15.38      5.19     30.00     53.51%
  Latency Distribution
     50%   95.04ms
     75%  104.36ms
     90%  132.62ms
     99%  212.92ms
  299 requests in 10.01s, 109.63KB read
  Non-2xx or 3xx responses: 4
Requests/sec:     29.86
Transfer/sec:     10.95KB
thread 1 made 151 requests and got 149 responses
thread 2 made 152 requests and got 150 responses
```