
runner-1002 is out of space
Open, Medium, Public · 3 Estimated Story Points

Description

Running with gitlab-runner 14.2.0 (58ba2b95)
  on runner-1002.gitlab-runners.eqiad1.wikimedia.cloud zVdTsoHy
Preparing the "docker" executor
00:21
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob628799366: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob628799366: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob994982896: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob994982896: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
Using Docker executor with image docker-registry.wikimedia.org/dev/buster-php74:latest ...
Pulling docker image docker-registry.wikimedia.org/dev/buster-php74:latest ...
WARNING: Failed to pull image with policy "always": write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)
ERROR: Preparation failed: failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)
Will be retried in 3s ...
ERROR: Job failed (system failure): failed to pull image "docker-registry.wikimedia.org/dev/buster-php74:latest" with specified policies [always]: write /var/lib/docker/tmp/GetImageBlob557884970: no space left on device (manager.go:205:1s)

Event Timeline

brennen added subscribers: dduvall, brennen.

cc: @dduvall

brennen@runner-1002:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             18G     0   18G   0% /dev
tmpfs           3.6G  374M  3.2G  11% /run
/dev/sda1        20G   19G  566M  98% /
tmpfs            18G     0   18G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            18G     0   18G   0% /sys/fs/cgroup
tmpfs           3.6G     0  3.6G   0% /run/user/0
tmpfs           3.6G     0  3.6G   0% /run/user/20958

Looks, probably unsurprisingly, like stuff in /var/lib/docker has filled up the available 20 gigs.
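For reference, a quick way to confirm what under /var/lib/docker is eating the space (the top_usage helper name is just for this sketch; du needs root to descend into /var/lib/docker):

```shell
# top_usage: list the largest entries under a directory, biggest first.
top_usage() {
  du -sh "$1"/* 2>/dev/null | sort -rh | head
}

top_usage /var/lib/docker
# Docker's own accounting of images/containers/volumes, if the daemon is up:
command -v docker >/dev/null && docker system df || true
```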

Based on the docs under clearing Docker cache, I ran:

brennen@runner-1002:/usr/share/gitlab-runner$ sudo ./clear-docker-cache 

Check and remove all unused containers (both dangling and unreferenced) including volumes.
------------------------------------------------------------------------------------------


Deleted Volumes:
runner-lycmpb8q-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-42-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-42-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-45-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-42-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-42-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-39-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-wnljwfzy-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-uh1sdm1s-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-zvdtsohy-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-7hp85zrl-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-uh1sdm1s-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-bffzzv-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-lycmpb8q-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-45-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-31-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-39-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-kvusweg-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-46-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-43-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-7hp85zrl-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-lycmpb8q-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-39-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-38-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-wnljwfzy-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-40-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-gmkgqulx-project-44-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-bffzzv-project-43-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-sxbd4p8s-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-46-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-44-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-zvdtsohy-project-31-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-kvusweg-project-39-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
runner-gmkgqulx-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-sxbd4p8s-project-38-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
runner-bffzzv-project-40-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8

Total reclaimed space: 5.563GB

...so that's 5.5 gigs back. But I wonder if we need to:

  • schedule that to run ~daily?
  • add a bunch of space to the runners? 20 gigs does feel like it'll tend to fill up pretty fast.
  • tweak some settings?
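On the first bullet: a minimal sketch of what scheduling that daily could look like. The script path matches the run above; the 04:00 time and the /etc/cron.d placement are assumptions, not an existing config.

```
# Hypothetical /etc/cron.d/clear-docker-cache entry (daily instead of
# upstream's recommended weekly cycle):
0 4 * * * root /usr/share/gitlab-runner/clear-docker-cache >/dev/null 2>&1
```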

Notes from IRC discussion in #wikimedia-releng:

thcipriani triaged this task as Medium priority.
thcipriani set the point value for this task to 3.

We should be able to build a small program that monitors the output of docker events and uses that information to target the least recently used volumes and images for removal.
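A rough shape of that idea, as a sketch only (STATE_DIR and record_use are illustrative names, not real tools): record a last-use timestamp per volume from the event stream, so a pruner can later delete the oldest-used volumes first.

```shell
# State directory for per-volume last-use timestamps (assumed location).
state=${STATE_DIR:-/tmp/docker-lastused}

record_use() {
  # Reads "timestamp volume-name" lines and keeps the newest timestamp
  # seen for each volume as a file in $state.
  mkdir -p "$state"
  while read -r ts name; do
    echo "$ts" > "$state/$name"
  done
}

# Wire-up to the live event stream (runs forever, so shown as a comment):
#   docker events --filter type=volume --filter event=mount \
#     --format '{{.TimeNano}} {{.Actor.Attributes.name}}' | record_use
```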

Two quick things that may be already known, but just in case it's of use:

  • Are we doing something to bypass a built-in GitLab feature, or does GitLab by default have no way to "just" run CI jobs that use Docker internally without an unbounded cache? (I see the clear-docker-cache cron is provided by upstream, but I'm not sure whether that's an additional GC mechanism or the primary way upstream limits the cache size. The recommended weekly cycle seems too infrequent when users can do anything they want.)
  • I believe we may have already solved sizing and auto-pruning on the existing Jenkins agents (we settled on 40G there, I think, and presumably something keeps the size in check, which may be reusable here?).

Two quick things that may be already known, but just in case it's of use:

  • Are we doing something to bypass a built-in GitLab feature, or does GitLab by default have no way to "just" run CI jobs that use Docker internally without an unbounded cache?

GitLab runners using the Docker executor create runner-local cache volumes, but those volumes are never automatically deleted. Instead, administrators are instructed to run https://gitlab.com/gitlab-org/gitlab-runner/blob/main/packaging/root/usr/share/gitlab-runner/clear-docker-cache, which deletes all GitLab-created volumes.

  • I believe we may have already solved sizing and auto-pruning on the existing Jenkins agents (we settled on 40G there, I think, and presumably something keeps the size in check, which may be reusable here?).

size: We should definitely use a 40GB disk.

auto-pruning: The maintenance-disconnect-full-disks job runs every 5 minutes, scanning all Jenkins agent nodes. If free space on a node drops below a configured threshold, all images are deleted (excluding those tagged latest). Docker volumes are not used for CI jobs there, so they are not handled. We can use this approach as a first step, but my goal is to build something that considers recency.
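The threshold check in that first step could be as simple as the following sketch (the 90% threshold and helper name are assumptions; the real Jenkins job also spares images tagged latest, which a plain prune cannot express, so the docker command is only echoed here):

```shell
# should_prune: decide whether disk usage warrants a cleanup.
should_prune() {
  # $1 = current usage percent, $2 = threshold percent
  [ "$1" -ge "$2" ]
}

# Current Use% of the root filesystem, e.g. "98" in the df output above.
pct=$(df --output=pcent / | tail -n 1 | tr -d ' %')

if should_prune "$pct" 90; then
  # A sketch, not the actual job: it would delete unused images here.
  echo "would run: docker image prune --all --force"
fi
```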

For the CI agents running on WMCS I have added a daily docker system prune, and on Sunday it deletes everything (all images, all volumes, etc.). Maybe similar logic could be used here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/731840
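An illustrative crontab for that same pattern (the 03:30 time is an assumption; see the Gerrit change above for the actual Puppet version):

```
# Mon-Sat: drop stopped containers, dangling images, unused networks.
30 3 * * 1-6 root docker system prune --force >/dev/null 2>&1
# Sunday: delete everything, including all images and volumes.
30 3 * * 0   root docker system prune --force --all --volumes >/dev/null 2>&1
```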

dancy removed dancy as the assignee of this task. Nov 10 2021, 5:41 PM
dancy added a subscriber: dancy.
dancy changed the status of subtask T295707: Run docker-gc resource monitor on gitlab runners from Open to In Progress.