Maniphest T354282

Alert on regularly restarting containers
Open, Needs Triage, Public

Assigned To

None

Authored By

	Tarrow
	Jan 3 2024, 2:42 PM

Referenced Files

	F41649654: image.png
	Jan 3 2024, 2:43 PM

	F41649652: image.png
	Jan 3 2024, 2:42 PM

Description

Occasional container restarts are expected and are not a problem. However, regular restarts, or a container hitting crash loop backoff, probably indicate a situation that we should actually investigate.

It could make sense to alert on either an increase in the rate of container restarts or on the absolute rate itself.

However, we want to maintain the assumption that no workload will run forever without error, and that accepting and tolerating some failures is important for building systems that are resilient in spite of that.

For example, looking at `sum by (container_name)(rate(kubernetes_io:container_restart_count{monitored_resource="k8s_container"}[${__interval}]))`

image.png (670×1 px, 40 KB)

you can see that there is a sudden increase in restarts of the `api-backend` and `mediawiki` containers resulting from T354248.
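
To make the two alerting options above concrete, here is a rough Python sketch of how "absolute rate" and "increase" checks could be evaluated while still tolerating a low background level of restarts. It is only an illustration, not a proposed implementation: the thresholds (`MAX_RESTARTS_PER_HOUR`, `MAX_INCREASE_FACTOR`), the function names and the sample format are all hypothetical, and it assumes the restart count is a cumulative counter that may reset to zero.

```python
from typing import List, Sequence, Tuple

# (unix timestamp in seconds, cumulative restart count) -- assumed sample shape
Sample = Tuple[float, float]

# Hypothetical thresholds; real values would need tuning against our workloads.
MAX_RESTARTS_PER_HOUR = 4.0   # absolute rate we are willing to tolerate
MAX_INCREASE_FACTOR = 5.0     # how far above the baseline rate is "sudden"


def restarts_per_hour(samples: Sequence[Sample]) -> float:
    """Average restarts per hour across the samples, tolerating counter resets."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in the counter is treated as a reset: count the new value.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase * 3600.0 / elapsed if elapsed > 0 else 0.0


def alert_reasons(recent: Sequence[Sample], baseline: Sequence[Sample]) -> List[str]:
    """Return which of the two checks fire; an empty list means tolerated."""
    reasons: List[str] = []
    recent_rate = restarts_per_hour(recent)
    baseline_rate = restarts_per_hour(baseline)
    if recent_rate > MAX_RESTARTS_PER_HOUR:
        reasons.append("absolute")
    if baseline_rate > 0 and recent_rate > MAX_INCREASE_FACTOR * baseline_rate:
        reasons.append("increase")
    return reasons
```

With these made-up thresholds, a container that restarts about once an hour, in line with its baseline, produces no alert, while three restarts within ten minutes trips both checks.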

Related Objects

Mentioned Here: T354248: `api-backend` regularly crashing due to OOM

Event Timeline

Tarrow created this task. · Jan 3 2024, 2:42 PM

Restricted Application added a subscriber: Aklapper. · Jan 3 2024, 2:42 PM

Tarrow updated the task description. · Jan 3 2024, 2:43 PM

Charlie_WMDE moved this task from Tech prioritized backlog to Kanban board Q1 2024 on the Wikibase Cloud board. · Mar 14 2024, 2:22 PM

Charlie_WMDE edited projects, added Wikibase Cloud (Kanban board Q1 2024); removed Wikibase Cloud.

conny-kawohl_WMDE edited projects, added Wikibase Cloud (Kanban Board Q2 2024); removed Wikibase Cloud (Kanban board Q1 2024). · Wed, Apr 17, 1:40 PM