
releases1002 /srv/docker DISK SPACE alert
Closed, ResolvedPublic

Description

From IRC:

[06:20:22]  <+icinga-wm> PROBLEM - Disk space on releases1002 is CRITICAL: DISK CRITICAL - free space: /srv/docker 1023 MB (0% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=releases1002&var-datasource=eqiad+prometheus/ops

And it is indeed full:

root@releases1002:~# df -hT /srv/docker
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/vdb1      ext4  147G  139G  1.0G 100% /srv/docker

The /srv/docker partition routinely fills up. The cause is a Jenkins job that builds MediaWiki images and does not prune its intermediate containers / images:

releases1002_docker_partition.png (339×902 px, 31 KB)

It was manually cleaned up today (2021-08-04) and had to be cleaned up last week as well:

2021-07-30 21:27 	<dduvall> 	"Total reclaimed space: 141.4GB" on releases1002 following docker prune

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-08-04T05:35:16Z] <joe> docker image prune on releases1002, T288024

MoritzMuehlenhoff triaged this task as Medium priority.

Antoine, could you please have a look at whether we can free something up?

I've done a docker image prune -a on that server, but I think we will need to give it a larger docker partition given the number of images we're building there.
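For reference, a minimal sketch of an age-limited prune, so that recently built images and recently stopped containers are kept. The `prune_docker` function name, the `DOCKER` override variable and the 24h default are assumptions made for illustration, not what was actually run on releases1002; `--filter until=...` is a standard option of both `docker container prune` and `docker image prune`.

```shell
#!/bin/sh
# Hypothetical helper, not the command that was run on the host.
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the commands.
DOCKER="${DOCKER:-docker}"

# prune_docker [MAX_AGE]
# Remove stopped containers and unused images older than MAX_AGE.
prune_docker() {
    max_age="${1:-24h}"   # assumed default, tune to the build cadence

    # Stopped containers first, so the images they reference become unused.
    "$DOCKER" container prune --force --filter "until=${max_age}"

    # --all also removes tagged images that no container uses,
    # not only dangling (untagged) layers.
    "$DOCKER" image prune --all --force --filter "until=${max_age}"
}
```

Running `DOCKER=echo prune_docker 48h` prints the two commands instead of executing them, which is a cheap way to check the arguments before running this as root.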

hashar removed hashar as the assignee of this task. Aug 4 2021, 7:27 AM
hashar updated the task description.
hashar added subscribers: dancy, dduvall, hashar.

That is routinely filling up due to some Jenkins job creating images/containers but not reclaiming them at the end of the build. @dancy / @dduvall will know the details, I will bring it up at our team meeting tonight.

If there's no immediate fix on the Jenkins side, we should add a systemd timer to trigger a cleanup before this escalates to alerts.
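A rough sketch of what such a periodic cleanup could run, assuming a threshold-based approach: check how full the partition is and prune only when it crosses a limit. The function names, the `DOCKER` override and the threshold are hypothetical; nothing like this is confirmed to be deployed on releases1002. Note that `docker system prune --all` also removes unused networks and build cache, which may be more aggressive than wanted.

```shell
#!/bin/sh
# Hypothetical cleanup script, intended to be run periodically
# (for example from a systemd timer). Names and values are assumptions.
DOCKER="${DOCKER:-docker}"

# usage_percent MOUNT_POINT
# Print the Use% of the filesystem as a bare integer (e.g. 100).
usage_percent() {
    # -P forces the POSIX format: one header line, one data line per filesystem.
    df -P "$1" 2>/dev/null | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# cleanup_if_above MOUNT_POINT THRESHOLD
# Prune Docker data only when usage is at or above THRESHOLD percent.
cleanup_if_above() {
    mount_point="$1"
    threshold="$2"
    used="$(usage_percent "$mount_point")"

    # Bail out if df failed or did not report a number.
    case "$used" in
        ''|*[!0-9]*)
            echo "cannot read usage of ${mount_point}" >&2
            return 1
            ;;
    esac

    if [ "$used" -ge "$threshold" ]; then
        echo "${mount_point} at ${used}%, pruning"
        "$DOCKER" system prune --all --force
    else
        echo "${mount_point} at ${used}%, nothing to do"
    fi
}
```

A timer could then call something like `cleanup_if_above /srv/docker 80`, where 80 is an arbitrary example chosen to sit below the alerting threshold.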

The Jenkins image/container pileup is tracked at T286511: Failed PipelineLib based jobs resulting in pileup of old images/containers

hashar claimed this task.

Great, and this task can be marked as resolved since immediate action has been taken earlier today to resolve the alarm.