
runner-1002 is out of space
Open, Medium, Public · 3 Estimated Story Points

Description

Running with gitlab-runner 14.2.0 (58ba2b95)
  on runner-1002.gitlab-runners.eqiad1.wikimedia.cloud zVdTsoHy
Preparing the "docker" executor
00:21
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob628799366: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob628799366: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob994982896: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob994982896: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
ERROR: Job failed (system failure): failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)

Event Timeline

brennen added subscribers: dduvall, brennen.

cc: @dduvall

brennen@runner-1002:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             18G     0   18G   0% /dev
tmpfs           3.6G  374M  3.2G  11% /run
/dev/sda1        20G   19G  566M  98% /
tmpfs            18G     0   18G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            18G     0   18G   0% /sys/fs/cgroup
tmpfs           3.6G     0  3.6G   0% /run/user/0
tmpfs           3.6G     0  3.6G   0% /run/user/20958

Looks, probably unsurprisingly, like stuff in /var/lib/docker has filled up the available 20 gigs.
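For reference, a quick way to confirm what under /var/lib/docker is eating the space (the top_usage helper name is just for this sketch; du needs root to descend into /var/lib/docker):

```shell
# top_usage: list the largest entries under a directory, biggest first.
top_usage() {
  du -sh "$1"/* 2>/dev/null | sort -rh | head
}

top_usage /var/lib/docker
# Docker's own accounting of images/containers/volumes, if the daemon is up:
command -v docker >/dev/null && docker system df || true
```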

Based on the docs under clearing Docker cache, I ran:

brennen@runner-1002:/usr/share/gitlab-runner$ sudo ./clear-docker-cache 

Check and remove all unused containers (both dangling and unreferenced) including volumes.
------------------------------------------------------------------------------------------


Deleted Volumes:
runner-lycmpb8q-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-42-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-42-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-45-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-42-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-42-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-39-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-wnljwfzy-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-bffzzv-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-45-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-39-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-39-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-bffzzv-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-39-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8

Total reclaimed space: 5.563GB

...so that's 5.5 gigs back. But I wonder if we need to:

  • schedule that to run ~daily?
  • add a bunch of space to the runners? 20 gigs does feel like it'll tend to fill up pretty fast.
  • tweak some settings?
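On the first bullet: a minimal sketch of what scheduling that daily could look like. The script path matches the run above; the 04:00 time and the /etc/cron.d placement are assumptions, not an existing config.

```
# Hypothetical /etc/cron.d/clear-docker-cache entry (daily instead of
# upstream's recommended weekly cycle):
0 4 * * * root /usr/share/gitlab-runner/clear-docker-cache >/dev/null 2>&1
```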

Notes from IRC discussion in #wikimedia-releng:

thcipriani triaged this task as Medium priority.
thcipriani set the point value for this task to 3.

We should be able to build a small program that monitors the output of docker events and uses that information to target the least recently used volumes and images for removal.
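A rough shape of that idea, as a sketch only (STATE_DIR and record_use are illustrative names, not real tools): record a last-use timestamp per volume from the event stream, so a pruner can later delete the oldest-used volumes first.

```shell
# State directory for per-volume last-use timestamps (assumed location).
state=${STATE_DIR:-/tmp/docker-lastused}

record_use() {
  # Reads "timestamp volume-name" lines and keeps the newest timestamp
  # seen for each volume as a file in $state.
  mkdir -p "$state"
  while read -r ts name; do
    echo "$ts" > "$state/$name"
  done
}

# Wire-up to the live event stream (runs forever, so shown as a comment):
#   docker events --filter type=volume --filter event=mount \
#     --format '{{.TimeNano}} {{.Actor.Attributes.name}}' | record_use
```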

Two quick things that may be already known, but just in case it's of use:

  • Are we doing something to bypass a built-in GitLab feature, or does GitLab by default have no way to "just" run CI jobs that use Docker internally without an unbounded cache? (I see the clear-docker-cache cron is provided by upstream, but I'm not sure whether that's an additional GC mechanism or the primary way upstream limits the cache size. The recommended weekly cycle seems too infrequent when users can do anything they want.)
  • I believe we may have already solved sizing and auto-pruning on the existing Jenkins agents (we settled on 40G there, I think, and presumably something keeps the size in check, which may be reusable here?).

Two quick things that may be already known, but just in case it's of use:

  • Are we doing something to bypass a built-in GitLab feature, or does GitLab by default have no way to "just" run CI jobs that use Docker internally without an unbounded cache?

GitLab runners using the Docker executor create runner-local cache volumes, but those volumes are never automatically deleted. Instead, administrators are instructed to run https://gitlab.com/gitlab-org/gitlab-runner/blob/main/packaging/root/usr/share/gitlab-runner/clear-docker-cache, which deletes all GitLab-created volumes.

  • I believe we may have already solved sizing and auto-pruning on the existing Jenkins agents (we settled on 40G there, I think, and presumably something keeps the size in check, which may be reusable here?).

size: We should definitely use a 40GB disk.

auto-pruning: The maintenance-disconnect-full-disks job runs every 5 minutes, scanning all Jenkins agent nodes. If free space on a node drops below a configured threshold, all images are deleted (excluding those tagged latest). Docker volumes are not used for CI jobs there, so they are not handled. We can use this approach as a first step, but my goal is to build something that considers recency.
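The threshold check in that first step could be as simple as the following sketch (the 90% threshold and helper name are assumptions; the real Jenkins job also spares images tagged latest, which a plain prune cannot express, so the docker command is only echoed here):

```shell
# should_prune: decide whether disk usage warrants a cleanup.
should_prune() {
  # $1 = current usage percent, $2 = threshold percent
  [ "$1" -ge "$2" ]
}

# Current Use% of the root filesystem, e.g. "98" in the df output above.
pct=$(df --output=pcent / | tail -n 1 | tr -d ' %')

if should_prune "$pct" 90; then
  # A sketch, not the actual job: it would delete unused images here.
  echo "would run: docker image prune --all --force"
fi
```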

For the CI agents running on WMCS I have added a daily docker system prune, and on Sunday it deletes everything (all images, all volumes, etc.). Maybe similar logic could be used here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/731840
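An illustrative crontab for that same pattern (the 03:30 time is an assumption; see the Gerrit change above for the actual Puppet version):

```
# Mon-Sat: drop stopped containers, dangling images, unused networks.
30 3 * * 1-6 root docker system prune --force >/dev/null 2>&1
# Sunday: delete everything, including all images and volumes.
30 3 * * 0   root docker system prune --force --all --volumes >/dev/null 2>&1
```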

dancy removed dancy as the assignee of this task. Nov 10 2021, 5:41 PM
dancy added a subscriber: dancy.
dancy changed the status of subtask T295707: Run docker-gc resource monitor on gitlab runners from Open to In Progress.