
Increase visibility of container/pod resource exhaustion
Closed, Resolved (Public)

Description

Currently we lack high-level visibility into containers/pods that reach or exceed their configured CPU/memory requests/limits, as well as into CPU throttling.

We should be able to detect that in a generic fashion to have issues like T266194 addressed more proactively.

I'll add this as a meta-task to track ideas/possible work regarding this.
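
As an illustration of the kind of generic check described above, here is a minimal sketch in Python. It is not the alerting that was eventually deployed. The `ContainerSample` type, its field names, and both thresholds are hypothetical. The fields loosely mirror what cAdvisor and kube-state-metrics expose (memory usage, configured limit, CFS periods and throttled periods), but how those values are collected is left open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContainerSample:
    """One observation of a container over a time window (hypothetical shape)."""
    name: str
    memory_usage_bytes: int
    memory_limit_bytes: Optional[int]  # None when no limit is configured
    cpu_periods: int                   # CFS periods elapsed in the window
    cpu_throttled_periods: int         # periods in which the container was throttled


def memory_utilisation(sample: ContainerSample) -> Optional[float]:
    """Return usage as a fraction of the limit, or None without a limit."""
    if not sample.memory_limit_bytes:
        return None
    return sample.memory_usage_bytes / sample.memory_limit_bytes


def throttled_ratio(sample: ContainerSample) -> float:
    """Return the fraction of CFS periods in which the container was throttled."""
    if sample.cpu_periods <= 0:
        return 0.0
    return sample.cpu_throttled_periods / sample.cpu_periods


def find_exhausted(
    samples: List[ContainerSample],
    memory_threshold: float = 0.9,    # assumed threshold, tune per environment
    throttle_threshold: float = 0.25, # assumed threshold, tune per environment
) -> List[str]:
    """List human-readable findings for containers near memory limits or throttled."""
    findings = []
    for s in samples:
        mem = memory_utilisation(s)
        if mem is not None and mem >= memory_threshold:
            findings.append(f"{s.name}: memory at {mem:.0%} of limit")
        thr = throttled_ratio(s)
        if thr >= throttle_threshold:
            findings.append(f"{s.name}: CPU throttled in {thr:.0%} of periods")
    return findings
```

The check is deliberately independent of any particular service, so it can run against every container in a cluster. Containers without a memory limit are skipped for the memory check rather than reported.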

Event Timeline

JMeybohm triaged this task as Medium priority. (Oct 22 2020, 8:42 AM)
JMeybohm created this task.
kamila changed the status of subtask T264625: Deploy kube-state-metrics from In Progress to Stalled. (Sep 4 2023, 10:35 AM)
kamila changed the status of subtask T264625: Deploy kube-state-metrics from Stalled to In Progress. (Oct 31 2023, 5:02 PM)

Change 984219 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Alert for containers with memory issues

https://gerrit.wikimedia.org/r/984219

Change 984219 merged by jenkins-bot:

[operations/alerts@master] Alert for containers with memory issues

https://gerrit.wikimedia.org/r/984219

Change 994196 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] linkrecommendation: Bump memory limit by 200Mi

https://gerrit.wikimedia.org/r/994196

Change 994197 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] rdf-streaming-updated: Bump taskmanager memory limit by ~33%

https://gerrit.wikimedia.org/r/994197

Change 994196 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump memory limit by 200Mi

https://gerrit.wikimedia.org/r/994196

akosiaris claimed this task.
akosiaris subscribed.

The 2 linked patches worked just fine. The taskmanager one is still waiting for a +1 from the team, but I am willing to say that we are in a better place than we used to be. I am going to resolve this.

Change 994197 merged by jenkins-bot:

[operations/deployment-charts@master] rdf-streaming-updated: Bump taskmanager memory limit by ~33%

https://gerrit.wikimedia.org/r/994197