Steps to replicate the issue (include links if applicable):
While upgrading model servers to KServe 0.11.1, we noticed during load testing that some model servers show increased latencies. The affected servers host gradient boosting models (GBMs): XGBoost (revertrisk-language-agnostic) and CatBoost (revertrisk-multilingual and readability).
The problem is that too many threads are created (100+), which throttles model inference under increased load. This is most likely because cgroups v2 is not supported by the underlying libraries, so they size their thread pools from the host's CPU count rather than the container's CPU limit.
For now we have limited the number of threads by setting the OMP_NUM_THREADS and OMP_THREAD_LIMIT environment variables on the pods, but we still see additional threads being created. Even if we succeed in limiting each server to a single thread, this is only a temporary solution, as it takes a toll on model performance.
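For reference, a minimal sketch of the current workaround. OpenMP reads these variables when the runtime is loaded, so they must take effect before the serving process imports xgboost or catboost (the values shown are illustrative, not a recommendation):

```python
import os

# OpenMP reads these at library load time, so they must be set before
# xgboost/catboost are imported; on KServe this is done via pod env vars.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OMP_THREAD_LIMIT"] = "1"
```

Note that any thread pools created outside OpenMP (e.g. the library's own workers) are not covered by these variables, which may explain the extra threads we still observe.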
The issue in XGBoost appears to be resolved upstream, with support available in release 2.0.1.
For CatBoost we have opened an issue on GitHub, but on our side we can start by specifying the number of threads after loading the model, if possible.