releases1002 /srv/docker DISK SPACE alert
Closed, Resolved · Public

Assigned To

Authored By

	Marostegui
	Aug 4 2021, 4:40 AM

Description

From IRC:

[06:20:22]  <+icinga-wm> PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - free space: /srv/docker 1023 MB (0% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops

And it is indeed full:

root@releases1002:~# df -hT /srv/docker
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vdb1      ext4  147G  139G  1.0G 100% /srv/docker

The /srv/docker partition routinely fills up. The cause is some Jenkins job that builds MediaWiki images but does not prune intermediate containers / images:

releases1002_docker_partition.png (339×902 px, 31 KB)

That got manually cleaned up today 8/4 and had to be cleaned up last week as well:

2021-07-30 21:27 	<dduvall> 	"Total reclaimed space: 141.4GB" on releases1002 following docker prune

Related Objects

Mentioned Here: T286511: Failed PipelineLib based jobs resulting in pileup of old images/containers

Event Timeline

Marostegui created this task. Aug 4 2021, 4:40 AM

Restricted Application added a subscriber: Aklapper. Aug 4 2021, 4:40 AM

Mentioned in SAL (#wikimedia-operations) [2021-08-04T05:35:16Z] <joe> docker image prune on releases1002, T288024

Antoine, could you please have a look whether we can free something?

I've done a docker image prune -a on that server, but I think we will need to give it a larger docker partition given the amount of images we're building there.

That is routinely filling up due to some Jenkins job creating images/containers but not reclaiming them at the end of the build. @dancy / @dduvall will know the details, I will bring it up at our team meeting tonight.

If there's no immediate fix on the Jenkins side, we should add a systemd timer to trigger a cleanup before this escalates to alerts.
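
A minimal sketch of what such a timer could run, assuming a usage threshold of 85% and a 7-day retention window (both values, the script layout and the `--run` switch are illustrative and do not come from this task). It reads the usage of /srv/docker from `df -P` and only prunes once the threshold is reached:

```shell
#!/bin/sh
# Sketch only: MOUNT, THRESHOLD and RETENTION are assumed values,
# not settings taken from releases1002.
MOUNT="${MOUNT:-/srv/docker}"
THRESHOLD="${THRESHOLD:-85}"      # percent used at which cleanup starts
RETENTION="${RETENTION:-168h}"    # keep anything newer than 7 days

# Read `df -P` output on stdin and print the usage percentage (digits only).
# POSIX format: Filesystem 1024-blocks Used Available Capacity Mounted on
usage_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage ($1) is at or above the threshold ($2).
needs_prune() {
    [ "$1" -ge "$2" ]
}

main() {
    used=$(df -P "$MOUNT" | usage_percent)
    if needs_prune "$used" "$THRESHOLD"; then
        # Remove stopped containers first so their images become unused.
        docker container prune --force --filter "until=$RETENTION"
        # --all also removes unused tagged images, not just dangling ones.
        docker image prune --all --force --filter "until=$RETENTION"
    fi
}

# Only act when explicitly asked to, e.g. `cleanup.sh --run`.
if [ "${1:-}" = "--run" ]; then
    main
fi
```

A systemd timer with `OnCalendar=daily` could start a service that calls this script with `--run`. Pruning by age rather than pruning everything keeps recently built images available to jobs that are still running.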

In T288024#7258016, @hashar wrote:

That is routinely filling up due to some Jenkins job creating images/containers but not reclaiming them at the end of the build. @dancy / @dduvall will know the details, I will bring it up at our team meeting tonight.

tracked at T286511: Failed PipelineLib based jobs resulting in pileup of old images/containers

Great, and this task can be marked as resolved since immediate action has been taken earlier today to resolve the alarm.

	F34575487: releases1002_docker_partition.png
	Aug 4 2021, 7:27 AM
