
Raise an alarm on container restarts/OOMs in kubernetes
Open, MediumPublic

Description

Per T255975 we've been having issues with a specific pod in changeprop being restarted often due to OOM issues. A raised alarm could have caught this, allowing more people to look into it and speeding up the investigation done in T255975. The exact implementation and level of the alarm (WARNING, CRITICAL) should be discussed in this task.

Event Timeline

akosiaris triaged this task as Medium priority. Jun 24 2020, 1:41 PM
akosiaris created this task.

An interesting thing to note here is that some services see quite frequent pod restarts, e.g.

kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
citoid-production-cd946dd66-5d68v   3/3     Running   92         20d
citoid-production-cd946dd66-6wmtw   3/3     Running   73         20d
citoid-production-cd946dd66-725mv   3/3     Running   93         20d
citoid-production-cd946dd66-75vtq   3/3     Running   87         20d
citoid-production-cd946dd66-f8np9   3/3     Running   77         20d
citoid-production-cd946dd66-jld5v   3/3     Running   82         20d
citoid-production-cd946dd66-tz62j   3/3     Running   100        20d
citoid-production-cd946dd66-zqk4f   3/3     Running   91         20d

without that necessarily being an issue that (at least for now) requires any action to be taken.

So perhaps this alert/alarm should be per service and not global.
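As a sketch of what a per-service (rather than global) rule could look like, assuming Prometheus scrapes kube-state-metrics and that each service lives in its own namespace; the rule name, namespace matcher, and thresholds below are illustrative placeholders, not agreed values:

```yaml
groups:
  - name: kubernetes-container-restarts
    rules:
      - alert: ContainerRestartingOften
        # Only match namespaces whose owners have opted in, so
        # known-noisy services like citoid stay excluded for now.
        # The namespace list and the "> 3 in 1h" threshold are examples only.
        expr: increase(kube_pod_container_status_restarts_total{namespace=~"changeprop"}[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.container }} in {{ $labels.namespace }} restarted > 3 times in 1h"
```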

JMeybohm added a comment (edited). Jun 24 2020, 1:50 PM

With kube-state-metrics (sorry for repeating this over and over 😂) there are kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason, which can be used to detect OOMs on containers.
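For reference, a hedged sketch of how those two metrics could be combined into an OOM-specific alert; all names and thresholds are placeholders, and the exact values are what this task should decide:

```yaml
- alert: ContainerOOMKilled
  # kube_pod_container_status_last_terminated_reason exposes the reason for a
  # container's most recent termination (value 1 for the active reason label).
  # Restricting to reason="OOMKilled" and intersecting with recent restart
  # activity flags only OOM-driven restarts, not ordinary ones.
  expr: |
    (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1)
    and on (namespace, pod, container)
    (increase(kube_pod_container_status_restarts_total[30m]) > 0)
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.container }} in pod {{ $labels.pod }} was OOM-killed recently"
```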

> So perhaps this alert/alarm should be per service and not global.

Or maybe based on time, like: container got killed > N times in t.
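That "killed > N times in t" condition maps directly onto PromQL's increase() over a range vector; a minimal sketch with placeholder values N=3, t=1h:

```yaml
- alert: ContainerRestartBurst
  # Fires when a container restarts more than N=3 times in t=1h
  # (both values are examples, not agreed thresholds).
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 5m
  labels:
    severity: warning
```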

> With kube-state-metrics (sorry for me repeating this over and over 😂) there is kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason which can be used to detect OOM on containers.

Awesome, that's exactly the kind of information I was missing :-)

> > So perhaps this alert/alarm should be per service and not global.
>
> Or maybe based on time. Like container got killed > N times in t

That could help, but the alert should always be actionable. For that to happen, the owner needs to acknowledge the need for it, which might not happen at the same time for all services.

> That could help but the alert should always be actionable. For that to happen the owner needs to acknowledge the need for it, which might not happen at the same time for all services.

That's true. I was thinking of something more like a "smells" category: not necessarily bad, but "does not look safe". Containers constantly hitting their CPU limits or being throttled a lot (this needs a definition, of course) would fall into that category as well. Maybe this would be better suited to some kind of dashboard...

jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board. Aug 17 2020, 11:45 PM