
Alert on regularly restarting containers
Open, Needs Triage, Public

Description

Occasional container restarts are expected and are not an issue. However, regular restarts, or hitting a crash loop backoff, probably indicate a situation that we should actually investigate.

It could make sense to alert on either an increase in the rate of container restarts or on the absolute rate.

However, we want to maintain the assumption that no workload will run forever without error, and that accepting and tolerating some failures is important if we are to build systems that are resilient in spite of them.

For example, looking at sum by (container_name)(rate(kubernetes_io:container_restart_count{monitored_resource="k8s_container"}[${__interval}]))

image.png (670×1 px, 40 KB)

you can see that there is a sudden increase in restarts of the api-backend and mediawiki containers resulting from T354248.
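
As a starting point for discussion, the two options (absolute rate versus increase) could be sketched as alert expressions built on the same metric as the query above. This is only a rough sketch: the 1h window, the threshold of 5 restarts, the factor of 3 and the 1d comparison offset are all placeholder assumptions that would need tuning so that occasional, expected restarts do not fire the alert.

```
# Option 1 (absolute): more than 5 restarts per container name in the last hour.
# The 1h window and the threshold of 5 are assumed placeholder values.
sum by (container_name) (
  increase(kubernetes_io:container_restart_count{monitored_resource="k8s_container"}[1h])
) > 5

# Option 2 (relative): restart rate is more than 3x what it was at the same time yesterday.
# The factor of 3 and the 1d offset are assumed placeholder values.
sum by (container_name) (
  rate(kubernetes_io:container_restart_count{monitored_resource="k8s_container"}[1h])
)
  > 3 *
sum by (container_name) (
  rate(kubernetes_io:container_restart_count{monitored_resource="k8s_container"}[1h] offset 1d)
)
```

Note that the relative form returns nothing for containers that had no series a day earlier, and fires on any restart at all when the earlier rate was zero, so in practice it would probably need to be combined with an absolute floor.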