
WMCS GitLab runners frequently running out of disk space
Closed, Resolved, Public

Event Timeline

dduvall changed the task status from Open to In Progress. Jun 30 2023, 8:25 PM
dduvall claimed this task.
dduvall triaged this task as Medium priority.

After poking around a bit on runner-1029.gitlab-runners.eqiad1.wikimedia.cloud, it seems a great deal of space is being taken up by the buildkitd container:

dduvall@runner-1029:~$ sudo docker system df -v
[...]
Containers space usage:

CONTAINER ID   IMAGE                                                                         COMMAND                  LOCAL VOLUMES   SIZE      CREATED        STATUS        NAMES
1cec0076d53d   docker-registry.wikimedia.org/repos/releng/buildkit:wmf-v0.11-6               "/usr/local/bin/entr…"   0               22.4GB    2 months ago   Up 2 months   buildkitd
6bba7453f0d5   docker-registry.wikimedia.org/repos/releng/docker-gc/resource-monitor:1.1.2   "./docker-resource-a…"   1               0B        3 months ago   Up 3 months   docker-resource-monitor
c56583379804   registry:2                                                                    "/entrypoint.sh /etc…"   0               0B        3 months ago   Up 3 months   registry

This is strange, given that buildkitd is configured to garbage collect anything over 6G of local cache usage.

dduvall@runner-1029:~$ grep -B 1 gc /etc/buildkitd.toml 
enabled = true
gc = true
--
# MB units
gckeepstorage = 6000

Actually, there is already a problem here. According to the config parsing implementation, this value is supposed to be in bytes, not MB, so we need to fix that. However, I would expect that mistake to result in ultra-aggressive GC, not a lack of GC...
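For reference, a minimal sketch of what the corrected stanza could look like, assuming the budget stays at roughly 6G but is expressed in bytes; the [worker.oci] section header and the exact byte count are assumptions, not copied from the live config:

[worker.oci]
enabled = true
gc = true
# value in bytes (roughly 6G); the current config mistakenly uses an MB-style figure
gckeepstorage = 6000000000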

Following the discussion in this seemingly relevant upstream bug, buildctl du shows that there are many entries in the cache, but none of them are reclaimable:

dduvall@runner-1029:~$ sudo docker run -e BUILDKIT_HOST=tcp://buildkitd:1234 --network gitlab-runner --rm -it docker-registry.wikimedia.org/repos/releng/kokkuri:v1.6.0 buildctl du
ID                                                                      RECLAIMABLE     SIZE            LAST ACCESSED
d7dunt2mx26dvixspeyku25mq                                               false           1.41GB
qct3lx4tdlytihvywryu3umlv                                               false           1.40GB
ven7jq12xi8kdx06rl28gesju                                               false           1.40GB
vm039zrlsep3hf331uweqm6xx                                               false           1.40GB
wwma5cx8kttfeyktpz43icq7v                                               false           1.40GB
uu65r9xgpz8tanq0e23av89vc                                               false           1.30GB
lzvg4t8grdw7ioa24r4a8scpv                                               false           1.30GB
gq7dert94vqacar7sz8i164uu                                               false           1.30GB
485bk0wcqv2uyfkakcdu80hal                                               false           1.30GB
b35eq5bqx57b2ycu8awf44pns                                               false           1.30GB
tihqp88axmx3a35ryon4keowj                                               false           1.21GB
xl4g7lyv3ui6gd9nujmubvn1r                                               false           1.21GB
sc3xboiy5vg6grox7n49ggzn3                                               false           1.21GB
ihcdbn3s6c0nuothjecuj34tk                                               false           1.21GB
iff6dvooulfhkm9j34l17xblu                                               false           1.21GB
5q6nxzzgfdniu415ulg23upu0                                               false           1.17GB
ke1oay02m5tpa0t8x176ljtmg                                               false           1.11GB
nkz6h7gnbor5b67cpqvzkvnni                                               false           1.08GB
uifwl35gmw948n0kug6jrwoho                                               false           372.59MB
m03wl5ck44h6uco6con81g17h                                               false           123.37MB
s1lc9tnxb2pk21kdmgx91k3nx                                               false           31.32MB
zdstsftjo3bnnslri8g2ld4ln                                               false           21.46MB
reyzgdtb2to3rkt1lsiaw4bdq                                               false           3.49MB
otl8g6q2c64xzmgdln0fqpkzj                                               false           2.64MB
p3txkrzdhh1csbj603k11cced                                               false           352.26kB
c5qsrnbm6p50cnq6cqpqm7gjk                                               false           352.26kB
w0v5waheygxkoifdng0e6g27e                                               false           12.29kB
j3rmfqfr3lo4xh3jxgsm5ty8v                                               false           12.29kB
xl5g281sxit4x5354mc3k9c4s                                               false           4.10kB
vigsovhz7qxlq6exp4uksa072                                               false           4.10kB
Reclaimable:    0B
Total:          23.43GB

So the GC process itself does not appear to be the problem (though we should still fix that GC configuration issue); for some reason these cache entries are marked as not reclaimable. I'll dig deeper.

Just adding that buildctl prune does not remove the cache entries either, which confirms they are indeed not functionally reclaimable by the GC process.
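For completeness, the prune attempt was presumably issued the same way as the du command above; a sketch of that invocation (mirroring the earlier kokkuri-based command, not a verbatim transcript):

sudo docker run -e BUILDKIT_HOST=tcp://buildkitd:1234 --network gitlab-runner --rm -it docker-registry.wikimedia.org/repos/releng/kokkuri:v1.6.0 buildctl prune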

I'm very curious to learn what you discover.

There is a similar issue on the integration Jenkins agents (T338317). I tracked it down to https://gerrit.wikimedia.org/g/machinelearning/liftwing/inference-services installing the torch Python module, which results in a 15GB layer stored in the buildkit cache. That overflows the 24G /var/lib/docker partition.

The workaround is to prune it from time to time (docker buildx prune --force).
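A minimal sketch of how that periodic prune could be scheduled on an agent; the cron cadence and the /etc/cron.d placement are assumptions, not the actual Jenkins setup:

# /etc/cron.d/docker-buildx-prune (hypothetical): weekly prune of the buildx cache
0 3 * * 0 root /usr/bin/docker buildx prune --force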

The fix is to grow the Docker partition so that it can hold the huge layer (T340070, which has a few unrelated blockers).

Mentioned in SAL (#wikimedia-releng) [2023-07-10T22:56:13Z] <dduvall> stopping buildkitd on runner-1029.gitlab-runners.eqiad1.wikimedia.org to debug buildkitd cache issues (T340887)

Closed T340586 as a duplicate. The latest comment there is probably relevant here, too:

The Jenkins agents now have a 90G disk via the flavor g3.cores8.ram24.disk20.ephemeral90.4xiops. I rebuilt them all last week and they no longer suffer from disk space issues.

The 90G disk space is partitioned as:

sdb                        8:16   0   90G  0 disk 
├─vd-docker              254:0    0   45G  0 lvm  /var/lib/docker
└─vd-second--local--disk 254:1    0   45G  0 lvm  /srv

It looks like the GitLab runners are using 40G instances and could benefit from the same larger ephemeral disk.
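If the GitLab runners did move to a similar flavor, the same split could presumably be reproduced with something like the following sketch; the device name, VG/LV names, and sizes are inferred from the lsblk output above and are assumptions for the runners:

# assumes the 90G ephemeral disk appears as /dev/sdb, as on the Jenkins agents
sudo pvcreate /dev/sdb
sudo vgcreate vd /dev/sdb
sudo lvcreate -L 45G -n docker vd
sudo lvcreate -l 100%FREE -n second-local-disk vd
sudo mkfs.ext4 /dev/vd/docker
sudo mkfs.ext4 /dev/vd/second-local-disk
sudo mount /dev/vd/docker /var/lib/docker
sudo mount /dev/vd/second-local-disk /srv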

I've been unable to reproduce this issue locally, but I'm in conversation with upstream about it. I will submit the /var/lib/buildkit/cache.db and /var/lib/buildkit/runc-native/metadata_v2.db from one of our runners to them if they need it.

In the meantime, there are a couple of things we should do.

  1. Weirdly, it seems we have gckeepstorage configured incorrectly. The value should be in byte units, not MB. It's poorly documented, but see https://github.com/moby/buildkit/issues/2922
  2. Configure buildkitd on WMCS with debug = true to get more information about when/if snapshots are being considered for GC (a sketch of the setting follows below).
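For the second item, a minimal sketch of the setting, assuming debug is set at the top level of /etc/buildkitd.toml with the rest of the file unchanged:

# top of /etc/buildkitd.toml; turns on debug-level logging for the daemon
debug = true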

Change 938016 had a related patch set uploaded (by Dduvall; author: Dduvall):

[operations/puppet@production] buildkitd: Fix gckeepstorage units

https://gerrit.wikimedia.org/r/938016

I'm still unable to reproduce this issue locally, and according to upstream the ref counting lives solely in memory, so the only way to get more visibility into it would be to enable tracing. Furthermore, tracing for the cache ref locking/unlocking seems to be present only in the v0.12 release candidates, and our fork is still based on v0.11.

For now, I've restarted buildkitd on all the WMCS runners to clear out the cache. We'll have to monitor the situation to see whether the unreclaimable cache entries reappear and go from there. Tracing will be possible once v0.12 is released and we rebase our fork on it, but it'll be time-intensive.
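For the record, a sketch of what clearing a runner amounts to, using the container name from the docker system df output above; whether a follow-up prune is needed afterwards is an assumption:

# restart the buildkitd container to drop its in-memory cache refs
sudo docker restart buildkitd
# afterwards a prune should be able to reclaim the previously locked entries
sudo docker run -e BUILDKIT_HOST=tcp://buildkitd:1234 --network gitlab-runner --rm -it docker-registry.wikimedia.org/repos/releng/kokkuri:v1.6.0 buildctl prune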

Just to clarify, this issue is not the same as T338317: Python torch fills disk of CI Jenkins instances or T340586: GitLab CI: "ENOSPC: no space left on device, mkdir", since in those cases issuing a prune command was sufficient. In this case, pruning has no effect because the cache entries are locked (they have active references in memory even though there are no containers running).

Change 938016 merged by Jelto:

[operations/puppet@production] buildkitd: Fix gckeepstorage units

https://gerrit.wikimedia.org/r/938016

So far, so good with the new GC configuration. All runners are reporting cache usage at or below gckeepstorage, and all of it is reclaimable.

I'm going to close this out for now. If it crops up again, we can reopen or follow up with a new task and additional debugging, but I think the time investment in tracing is not worth it at the moment.