
runner-1026.gitlab-runners.eqiad1.wikimedia.cloud ran out of disk space
Closed, Resolved, Public

Description

In https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/jobs/44688, https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/jobs/44690 and https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/jobs/44689 (three presumably concurrent jobs on the same host), the runner ran out of disk space.

Not sure (1) whether it needs manual cleanup to free disk space, and (2) whether something can be done/adjusted to prevent this from happening again.

Event Timeline

Change 876184 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs

https://gerrit.wikimedia.org/r/876184

Jelto added subscribers: dancy, Jelto.

Thanks for opening the task. runner-1026 should be unblocked because the daily cleanup happened some hours after you opened the task.

I created a patch (see above) to lower the watermarks/thresholds for image and volume cleanup. @dancy can you take a look? Currently the high watermarks are set to 20GB each for volumes and images. That's quite high given that /var/lib/docker has 40GB in total.
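To make the arithmetic concrete, here is a minimal sketch of the disk budget. It assumes the worst case is simply the sum of the high watermarks, which ignores growth between GC runs and anything else stored under /var/lib/docker. The helper name `worst_case_usage_gb` is hypothetical, and the 10GB values are the ones from the ExecStart line quoted later in this task.

```python
def worst_case_usage_gb(high_watermarks_gb):
    """Sum of the high watermarks: usage may grow this far before GC prunes anything."""
    return sum(high_watermarks_gb.values())


DISK_GB = 40  # total size of /var/lib/docker mentioned in this task

old = {"images": 20, "volumes": 20}  # watermarks before the patch
new = {"images": 10, "volumes": 10}  # watermarks after the patch

for label, marks in (("old", old), ("new", new)):
    worst = worst_case_usage_gb(marks)
    print(f"{label}: worst case {worst} GB, headroom {DISK_GB - worst} GB")
```

With the old values the worst case equals the whole disk, so the GC would only start pruning once the disk is already full.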

We also have two cleanup jobs for GitLab Runners currently. One lives in class { 'docker::gc'} (which isn't doing anything because the watermarks are too high) and one in systemd::timer::job { 'clear-docker-cache'} (which does two daily cleanups). We should try to merge both into the same logic. But that's a bit separate from this task.

Change 876184 merged by Dzahn:

[operations/puppet@production] gitlab_runner: lower docker_gc watermarks in wmcs

https://gerrit.wikimedia.org/r/876184

Change 876240 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc

https://gerrit.wikimedia.org/r/876240

on runner-1026:

+ExecStart=/usr/bin/docker run --rm         -v /var/run/docker.sock:/var/run/docker.sock         -v docker-resource-monitor:/state         docker-registry.wikimedia.org/docker-gc:1.0.0         gc         --state-file /state/state.json         --image-filter 'id=~.*'         --images 10g:5g         --volumes 10g:5g

Disk usage there is also only around 40% right now.

Change 876240 merged by Dzahn:

[operations/puppet@production] Make gitlab-runner cache volumes eligible for docker-gc

https://gerrit.wikimedia.org/r/876240

Jelto triaged this task as Medium priority. Jan 9 2023, 9:32 AM

Adjusting the watermarks and volume filter labels (thanks @Dzahn and @dancy) fixed the cleanup of volumes and images. From runner-1026 after merging both changes:

Jan 07 18:07:11 runner-1026 systemd[1]: Started Perform a round of docker image/volume garbage collection.
Jan 07 18:07:12 runner-1026 docker[4052523]: [2023-01-07 18:07:12,354] Running df()
Jan 07 18:07:13 runner-1026 docker[4052523]: [2023-01-07 18:07:13,934] df() returned 17 images, 5 volumes
Jan 07 18:07:13 runner-1026 docker[4052523]: [2023-01-07 18:07:13,935] volumes usage is 6.02 GB, below high water mark of 10 GB.
Jan 07 18:07:15 runner-1026 docker[4052523]: [2023-01-07 18:07:15,367] images usage is 10.9 GB, above high water mark of 10 GB.  Need to prune.
Jan 07 18:07:15 runner-1026 docker[4052523]: [2023-01-07 18:07:15,367] Attempting to prune images down to 5 GB
[...]
Jan 07 18:07:18 runner-1026 docker[4052523]: [2023-01-07 18:07:18,114] Pruning sha256:446440c0188655f77e20d4c8df36c514c3e00da47dcdcf92428938c2ca9025a2
Jan 07 18:07:18 runner-1026 docker[4052523]: [2023-01-07 18:07:18,142] Pruned sha256:446440c0188655f77e20d4c8df36c514c3e00da47dcdcf92428938c2ca9025a2
Jan 07 18:07:19 runner-1026 docker[4052523]: [2023-01-07 18:07:19,513] Pruning sha256:9625767011d2fdb65dc79b882cdf8a727b9f2081660949e5e6eee0043ab0b8de
Jan 07 18:07:19 runner-1026 docker[4052523]: [2023-01-07 18:07:19,775] Pruned sha256:9625767011d2fdb65dc79b882cdf8a727b9f2081660949e5e6eee0043ab0b8de
[...]
Jan 07 18:07:36 runner-1026 docker[4052523]: [2023-01-07 18:07:36,287] Reached low water mark of 5 GB
Jan 07 18:07:36 runner-1026 systemd[1]: docker-gc.service: Succeeded.
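The log above follows a high/low watermark scheme: nothing happens while usage is below the high mark, and once it is exceeded, pruning continues until usage falls to the low mark. A rough sketch of that logic is below. This is not the actual docker-gc code. The `Resource` type, the `HIGH:LOW` parsing, the decimal GB unit, and the oldest-first pruning order are all assumptions made for illustration.

```python
from dataclasses import dataclass

GB = 10**9  # assumption: decimal gigabytes


@dataclass
class Resource:
    name: str
    size: int         # bytes
    last_used: float  # timestamp; oldest is pruned first (assumed order)


def parse_watermarks(spec):
    """Parse a 'HIGH:LOW' spec such as '10g:5g' into byte counts (assumed format)."""
    units = {"k": 10**3, "m": 10**6, "g": 10**9}

    def to_bytes(text):
        text = text.strip().lower()
        if text[-1] in units:
            return int(float(text[:-1]) * units[text[-1]])
        return int(text)

    high, low = spec.split(":")
    return to_bytes(high), to_bytes(low)


def select_for_pruning(resources, high, low):
    """Return the resources to prune, or an empty list if usage is within the high mark."""
    total = sum(r.size for r in resources)
    if total <= high:
        return []
    victims = []
    for res in sorted(resources, key=lambda r: r.last_used):
        if total <= low:
            break  # reached the low water mark
        victims.append(res)
        total -= res.size
    return victims


high, low = parse_watermarks("10g:5g")
# Made-up image sizes that add up to the 10.9 GB seen in the log.
images = [
    Resource("a", 3 * GB, 1.0),
    Resource("b", 3 * GB, 2.0),
    Resource("c", 3 * GB, 3.0),
    Resource("d", 1_900_000_000, 4.0),
]
print([r.name for r in select_for_pruning(images, high, low)])
```

The gap between the two marks means each run frees a meaningful amount of space instead of pruning a single image every time usage crosses the threshold.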

I guess the task can be closed now? At least Docker volumes and images should not exceed 20GB in total anymore.

@Jelto There's still an outstanding matter of tuning the size of buildkitd's cache (which exists inside of its container). For example, on runner-1026.gitlab-runners.eqiad1.wikimedia.cloud:

CONTAINER ID   IMAGE                                                             COMMAND                  LOCAL VOLUMES   SIZE      CREATED        STATUS        NAMES
cbb799682700   docker-registry.wikimedia.org/repos/releng/buildkit:wmf-v0.10-1   "/usr/local/bin/entr…"   0               3.34GB    2 months ago   Up 2 months   buildkitd

That 3.34GB is the buildkit cache.

Buildkit GC options are sorta mentioned in https://github.com/moby/buildkit/blob/master/docs/buildkitd.toml.md

I opened T327060 as a follow-up task. This task should be resolved.