
Optimize resource utilization for InferenceServices on LiftWing cluster
Open, Needs Triage, Public

Description

Currently, the resource requests and limits of our InferenceServices are set based on intuition and on the needs observed while each service was being developed. We can now tune this configuration further by checking actual resource utilization in the production clusters on the Grafana container details dashboard.

This will free up CPU and memory in the LiftWing cluster, so that we can deploy more services or scale the existing ones further.
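As a rough illustration of the right-sizing idea described above, here is a minimal Python sketch. It is not the method used in this task: the function name `suggest_cpu_request`, the 30% headroom factor, and the sample peak values are all hypothetical, and real numbers would have to be read off the Grafana dashboard by hand.

```python
import math


def suggest_cpu_request(peak_cores: float, headroom: float = 0.3) -> int:
    """Suggest a whole-core CPU request from an observed peak.

    peak_cores: highest CPU usage seen for the container (hypothetical input,
                e.g. read manually from a dashboard).
    headroom:   fractional safety margin added on top of the peak
                (0.3 is an assumed default, not a LiftWing policy).
    """
    if peak_cores < 0 or headroom < 0:
        raise ValueError("peak_cores and headroom must be non-negative")
    # Round up to whole cores and never go below 1 core.
    return max(1, math.ceil(peak_cores * (1 + headroom)))


def reclaimable_cores(current_request: int, peak_cores: float) -> int:
    """Cores that could be freed if the request were lowered to the suggestion."""
    return max(0, current_request - suggest_cpu_request(peak_cores))


if __name__ == "__main__":
    # Hypothetical example: a service requesting 6 cores that peaks at 0.6 cores.
    print(suggest_cpu_request(0.6))
    print(reclaimable_cores(6, 0.6))
```

Rounding up to whole cores with a floor of 1 is only one possible policy; fractional requests (millicores) would work equally well with the same headroom logic.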

I've created a table of InferenceServices that we could optimize, omitting services that are already optimized or that use few resources in general. It is ordered by largest possible saving, so let's tackle them from top to bottom:

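| Service | Current CPU | Current Mem | New CPU | New Mem | Savings | Status |
| --- | --- | --- | --- | --- | --- | --- |
| article-descriptions | 16 | 5Gi | 2 | 5Gi | 14 CPU | |
| llm/aya-llm | 6 | 8Gi | 1 | 8Gi | 5 CPU | TODO |
| llm/embeddings | 6 | 8Gi | 1 | 8Gi | 5 CPU | TODO |
| edit-check | 4 | 8Gi | 1 | 4Gi | 3 CPU, 4Gi | TODO |
| revertrisk-multilingual | 4 | 6Gi | 2 | 6Gi | 2 CPU | TODO |
| revertrisk-multilingual-pre-save | 4 | 6Gi | 2 | 6Gi | 2 CPU | TODO |
| revertrisk-wikidata | 2 | 4Gi | 1 | 4Gi | 1 CPU | TODO |
| reference-risk | 2 | 2Gi | 1 | 2Gi | 1 CPU | TODO |
| article-country | 2 | 2Gi | 1 | 1Gi | 1 CPU, 1Gi | TODO |

reference-need 226Gi/8Gi 227Gi 1Gi TODO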
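The per-service savings in the table above can be cross-checked and totalled with a short Python sketch. The rows are transcribed from the table (memory in Gi); the reference-need row is left out because its CPU and memory values cannot be separated unambiguously here. The figures are taken as listed in the table, and replica counts are not given in this task, so the totals do not account for them. The function names are illustrative only.

```python
# (service, current CPU, current Mem in Gi, new CPU, new Mem in Gi),
# transcribed from the table; reference-need is omitted (ambiguous values).
ROWS = [
    ("article-descriptions", 16, 5, 2, 5),
    ("llm/aya-llm", 6, 8, 1, 8),
    ("llm/embeddings", 6, 8, 1, 8),
    ("edit-check", 4, 8, 1, 4),
    ("revertrisk-multilingual", 4, 6, 2, 6),
    ("revertrisk-multilingual-pre-save", 4, 6, 2, 6),
    ("revertrisk-wikidata", 2, 4, 1, 4),
    ("reference-risk", 2, 2, 1, 2),
    ("article-country", 2, 2, 1, 1),
]


def per_service_savings(rows):
    """Map each service to (CPU cores saved, Gi of memory saved)."""
    return {
        name: (cur_cpu - new_cpu, cur_mem - new_mem)
        for name, cur_cpu, cur_mem, new_cpu, new_mem in rows
    }


def total_savings(rows):
    """Sum the CPU and memory savings over all listed services."""
    savings = per_service_savings(rows).values()
    return sum(cpu for cpu, _ in savings), sum(mem for _, mem in savings)


if __name__ == "__main__":
    for name, (cpu, mem) in per_service_savings(ROWS).items():
        print(f"{name}: {cpu} CPU, {mem}Gi")
    cpu_total, mem_total = total_savings(ROWS)
    print(f"total: {cpu_total} CPU, {mem_total}Gi")
```

For the nine rows included, this gives a total of 34 CPU and 5Gi of memory.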

Event Timeline

I went through the utilization graphs of our InferenceServices, and it seems there are substantial CPU savings to be made, whereas memory is usually set quite reasonably, with no major overcommitment.

I've created a table of InferenceServices that we could optimize, omitting services that are already optimized or that use few resources in general. It is ordered by largest possible saving, so let's tackle them from top to bottom:

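| Service | Current CPU | Current Mem | New CPU | New Mem | Savings | Status |
| --- | --- | --- | --- | --- | --- | --- |
| article-descriptions | 16 | 5Gi | 2 | 5Gi | 14 CPU | TODO |
| llm/aya-llm | 6 | 8Gi | 1 | 8Gi | 5 CPU | TODO |
| llm/embeddings | 6 | 8Gi | 1 | 8Gi | 5 CPU | TODO |
| edit-check | 4 | 8Gi | 1 | 4Gi | 3 CPU, 4Gi | TODO |
| revertrisk-multilingual | 4 | 6Gi | 2 | 6Gi | 2 CPU | TODO |
| revertrisk-multilingual-pre-save | 4 | 6Gi | 2 | 6Gi | 2 CPU | TODO |
| revertrisk-wikidata | 2 | 4Gi | 1 | 4Gi | 1 CPU | TODO |
| reference-risk | 2 | 2Gi | 1 | 2Gi | 1 CPU | TODO |
| article-country | 2 | 2Gi | 1 | 1Gi | 1 CPU, 1Gi | TODO |

reference-need 226Gi/8Gi 227Gi 1Gi TODO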

Change #1227736 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Lower resource usage for article-descriptions on staging.

https://gerrit.wikimedia.org/r/1227736

Change #1227736 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Lower resource usage for article-descriptions on staging.

https://gerrit.wikimedia.org/r/1227736

Change #1229477 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Lower resource limits for article descriptions.

https://gerrit.wikimedia.org/r/1229477

Change #1229477 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Lower resource limits for article descriptions.

https://gerrit.wikimedia.org/r/1229477