Description

```
Running with gitlab-runner 14.2.0 (58ba2b95)
  on runner-1002.gitlab-runners.eqiad1.wikimedia.cloud zVdTsoHy
Preparing the "docker" executor                                   00:21
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob628799366: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob628799366: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob994982896: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob994982896: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
ERROR: Job failed (system failure): failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)
```
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T291221 runner-1002 is out of space |
| Resolved | | dancy | T294034 docker-gc: A tool for partially pruning docker resources |
| Resolved | | dancy | T295707 Run docker-gc resource monitor on gitlab runners |
| Resolved | | dancy | T295709 Periodically run docker-gc on gitlab runners |
Event Timeline
cc: @dduvall
```
brennen@runner-1002:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             18G     0   18G   0% /dev
tmpfs           3.6G  374M  3.2G  11% /run
/dev/sda1        20G   19G  566M  98% /
tmpfs            18G     0   18G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            18G     0   18G   0% /sys/fs/cgroup
tmpfs           3.6G     0  3.6G   0% /run/user/0
tmpfs           3.6G     0  3.6G   0% /run/user/20958
```
Looks, probably unsurprisingly, like stuff in /var/lib/docker has filled up the available 20 gigs.
Based on the docs under clearing Docker cache, I ran:
```
brennen@runner-1002:/usr/share/gitlab-runner$ sudo ./clear-docker-cache
Check and remove all unused containers (both dangling and unreferenced) including volumes.
------------------------------------------------------------------------------------------

Deleted Volumes:
runner-lycmpb8q-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-42-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-42-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-45-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-42-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-42-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-39-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-wnljwfzy-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-bffzzv-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-45-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-39-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-39-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-bffzzv-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-39-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8

Total reclaimed space: 5.563GB
```
...so that's 5.5 gigs back. But I wonder if we need to:
- schedule that to run ~daily?
- add a bunch of space to the runners? 20 gigs does feel like it'll fill up pretty fast (a quick `docker system df` check, sketched below, would show where it's going)
- tweak some settings?
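For the sizing question, the standard Docker CLI can show what is actually consuming the disk on a runner; roughly:

```
# Summary of space used by images, containers, local volumes, and build cache.
sudo docker system df
# -v lists individual items, so the biggest offenders are visible.
sudo docker system df -v
```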
Notes from IRC discussion in #wikimedia-releng:
- a docker system prune -af would probably also work
- we should add a daily timer to the profile to run clear-docker-cache (see the sketch after this list)
- should give runners more space - 20 gigs ain't what it used to be
- https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances#Cinder:_Attachable_Block_Storage_for_Cloud_VPS
- looks like we can refactor the profile to use cinderutils::ensure e.g. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/wmcs/kubeadm/worker.pp#7
- probably need to request a quota increase for storage; should do some quick math on what we think we'll need per-runner
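To make the daily-cleanup idea concrete, here is a minimal sketch (hypothetical placement as a cron.daily script; the real change would presumably be a Puppet-managed systemd timer in the runner profile):

```
#!/bin/sh
# /etc/cron.daily/clear-docker-cache -- hypothetical path for illustration only.
# Reuse the upstream script that already reclaimed ~5.5G above.
/usr/share/gitlab-runner/clear-docker-cache >/dev/null 2>&1
# Optionally also drop stopped containers, dangling images, and unused networks;
# adding -a (as mentioned in the IRC notes) would be more aggressive and delete
# every image not referenced by a container.
docker system prune -f >/dev/null 2>&1
```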
We should be able to build a small program which monitors the output of docker events and uses the information to target least recently used volumes and images for removal.
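A minimal bash sketch of that idea, just to show the shape (the state path, the 7-day cutoff, and limiting it to volumes are assumptions, not the eventual design; images would need slightly different handling since their names contain slashes):

```
#!/bin/bash
# Sketch: track last-use times of cache volumes via `docker events`,
# then remove the least recently used ones.
set -u
STATE=/var/lib/docker-volume-lru   # assumed state directory
mkdir -p "$STATE"

watch_events() {
  # Touch a marker file whenever a volume shows up in the event stream,
  # so the file's mtime records the last time the volume was used.
  docker events --filter type=volume --format '{{.Actor.ID}}' |
    while read -r vol; do
      touch "$STATE/$vol"
    done
}

prune_lru() {
  # Delete volumes not seen for 7 days, then drop their marker files.
  find "$STATE" -type f -mtime +7 -printf '%f\n' |
    while read -r vol; do
      docker volume rm "$vol" && rm -f "$STATE/$vol"
    done
}

case "${1:-}" in
  watch) watch_events ;;   # run continuously, e.g. as a systemd service
  prune) prune_lru ;;      # run periodically, e.g. from a daily timer
  *) echo "usage: $0 watch|prune" >&2; exit 1 ;;
esac
```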
Two quick things that may already be known, but just in case they're of use:
- Are we bypassing a built-in GitLab feature here, or does GitLab simply not offer a way to "just" run CI jobs that use Docker internally without the cache growing unchecked? (I see the clear-docker-cache cron is provided by upstream, but I'm not sure whether that's an additional GC mechanism or the primary way upstream limits the cache size. The recommended weekly cycle seems too infrequent when users can run anything they want.)
- I believe we may have already solved the size and auto-pruning on the existing Jenkins agents (I believe we settled on 40G there, and presumably something somewhere keeps the size in check, which may be reusable here?).
GitLab runners using the Docker executor create runner-local cache volumes, but these are never automatically deleted. Instead, administrators are instructed to run https://gitlab.com/gitlab-org/gitlab-runner/blob/main/packaging/root/usr/share/gitlab-runner/clear-docker-cache, which deletes all GitLab-created volumes.
> I believe we may have already solved the size and auto-pruning on the existing Jenkins agents (I believe we settled on 40G there, and presumably something somewhere keeps the size in check, which may be reusable here?).
size: We should definitely use a 40GB disk.
auto-pruning: The maintenance-disconnect-full-disks job runs every 5 minutes, scanning all Jenkins agent nodes. If free space on a node drops below a configured threshold, all images are deleted (excluding those tagged latest). Docker volumes are not used for CI jobs, so they are not handled. We can use this approach as a first step, but my goal is to build something that takes recency into account.
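The threshold-then-keep-latest behaviour described above could look roughly like this on a runner (the mount point, threshold, and exact filtering are assumptions, not the actual maintenance-disconnect-full-disks implementation):

```
#!/bin/bash
# Sketch: if the root filesystem is over the threshold, delete all images
# except those tagged :latest.
THRESHOLD_PCT=80   # assumed threshold

used_pct=$(df --output=pcent / | tail -1 | tr -dc '0-9')

if [ "$used_pct" -ge "$THRESHOLD_PCT" ]; then
  # Print "repo:tag id" pairs, keep only IDs whose tag is not :latest, remove them.
  docker images --format '{{.Repository}}:{{.Tag}} {{.ID}}' |
    awk '$1 !~ /:latest$/ {print $2}' |
    sort -u |
    xargs -r docker rmi || true
fi
```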
For the CI agents running on WMCS I have added a daily docker system prune, and on Sundays it deletes everything (all images, all volumes, etc.). Maybe similar logic could be used here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/731840
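Sketched as a daily script, that pattern would be roughly the following (not the actual change in the linked Gerrit patch; the flags and schedule are assumptions based on the description above):

```
#!/bin/sh
# Daily prune, with a full wipe on Sundays.
if [ "$(date +%u)" -eq 7 ]; then
  # Sunday: remove all unused images, volumes, networks, and build cache.
  docker system prune --all --volumes --force
else
  # Other days: only remove stopped containers, dangling images, unused networks.
  docker system prune --force
fi
```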