(Originally reported by @gmodena on Slack.)
See https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/jobs/113474#L531
Subject | Repo | Branch | Lines +/-
---|---|---|---
buildkitd: Fix gckeepstorage units | operations/puppet | production | +11 -9
After poking around a bit on runner-1029.gitlab-runners.eqiad1.wikimedia.cloud it seems a great deal of space is being taken up by the buildkitd container:
```
dduvall@runner-1029:~$ sudo docker system df -v
[...]
Containers space usage:

CONTAINER ID   IMAGE                                                                         COMMAND                  LOCAL VOLUMES   SIZE     CREATED        STATUS        NAMES
1cec0076d53d   docker-registry.wikimedia.org/repos/releng/buildkit:wmf-v0.11-6               "/usr/local/bin/entr…"   0               22.4GB   2 months ago   Up 2 months   buildkitd
6bba7453f0d5   docker-registry.wikimedia.org/repos/releng/docker-gc/resource-monitor:1.1.2   "./docker-resource-a…"   1               0B       3 months ago   Up 3 months   docker-resource-monitor
c56583379804   registry:2                                                                    "/entrypoint.sh /etc…"   0               0B       3 months ago   Up 3 months   registry
```
Which is strange, given that buildkitd is configured to garbage collect once local cache usage exceeds 6G:
```
dduvall@runner-1029:~$ grep -B 1 gc /etc/buildkitd.toml
enabled = true
gc = true
--
# MB units
gckeepstorage = 6000
```
Actually, there is already a problem here. According to the config parsing implementation, this value is supposed to be in bytes, not MB, so 6000 is being interpreted as roughly 6 kB rather than 6 GB. We need to fix that. However, I would expect that mistake to result in ultra-aggressive GC, not a lack of GC...
Following along with the discussion on this seemingly relevant upstream bug, buildctl du shows that there are many entries in the cache, but none are reclaimable:
```
dduvall@runner-1029:~$ sudo docker run -e BUILDKIT_HOST=tcp://buildkitd:1234 --network gitlab-runner --rm -it docker-registry.wikimedia.org/repos/releng/kokkuri:v1.6.0 buildctl du
ID                          RECLAIMABLE   SIZE       LAST ACCESSED
d7dunt2mx26dvixspeyku25mq   false         1.41GB
qct3lx4tdlytihvywryu3umlv   false         1.40GB
ven7jq12xi8kdx06rl28gesju   false         1.40GB
vm039zrlsep3hf331uweqm6xx   false         1.40GB
wwma5cx8kttfeyktpz43icq7v   false         1.40GB
uu65r9xgpz8tanq0e23av89vc   false         1.30GB
lzvg4t8grdw7ioa24r4a8scpv   false         1.30GB
gq7dert94vqacar7sz8i164uu   false         1.30GB
485bk0wcqv2uyfkakcdu80hal   false         1.30GB
b35eq5bqx57b2ycu8awf44pns   false         1.30GB
tihqp88axmx3a35ryon4keowj   false         1.21GB
xl4g7lyv3ui6gd9nujmubvn1r   false         1.21GB
sc3xboiy5vg6grox7n49ggzn3   false         1.21GB
ihcdbn3s6c0nuothjecuj34tk   false         1.21GB
iff6dvooulfhkm9j34l17xblu   false         1.21GB
5q6nxzzgfdniu415ulg23upu0   false         1.17GB
ke1oay02m5tpa0t8x176ljtmg   false         1.11GB
nkz6h7gnbor5b67cpqvzkvnni   false         1.08GB
uifwl35gmw948n0kug6jrwoho   false         372.59MB
m03wl5ck44h6uco6con81g17h   false         123.37MB
s1lc9tnxb2pk21kdmgx91k3nx   false         31.32MB
zdstsftjo3bnnslri8g2ld4ln   false         21.46MB
reyzgdtb2to3rkt1lsiaw4bdq   false         3.49MB
otl8g6q2c64xzmgdln0fqpkzj   false         2.64MB
p3txkrzdhh1csbj603k11cced   false         352.26kB
c5qsrnbm6p50cnq6cqpqm7gjk   false         352.26kB
w0v5waheygxkoifdng0e6g27e   false         12.29kB
j3rmfqfr3lo4xh3jxgsm5ty8v   false         12.29kB
xl5g281sxit4x5354mc3k9c4s   false         4.10kB
vigsovhz7qxlq6exp4uksa072   false         4.10kB
Reclaimable:  0B
Total:        23.43GB
```
So the GC process does not appear to be the problem exactly (though we should still fix that GC configuration issue), but there's some reason these cache entries are marked as not reclaimable. I'll dig deeper.
Just adding that buildctl prune does not remove the cache entries either, which confirms they are indeed not functionally reclaimable by the GC process.
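For reference, the prune was run against the same daemon via the kokkuri image, mirroring the du invocation above; roughly like this (a sketch, since the exact invocation isn't recorded here):

```
$ sudo docker run -e BUILDKIT_HOST=tcp://buildkitd:1234 --network gitlab-runner --rm -it \
    docker-registry.wikimedia.org/repos/releng/kokkuri:v1.6.0 buildctl prune
```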
There is a similar issue on the integration Jenkins agents (T338317). I tracked it down to https://gerrit.wikimedia.org/g/machinelearning/liftwing/inference-services installing the torch Python module, which results in a 15GB layer stored in the buildkit cache. That overflows the 24G /var/lib/docker partition.
The workaround is to prune it from time to time (`docker buildx prune --force`).
The proper fix is to grow the Docker partition so it can hold the huge layer (T340070, which has a few unrelated blockers).
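In the meantime, a quick way to check and relieve the disk pressure on an affected agent might look like this (a sketch; the path and 24G partition size are from the description above):

```
# check how full the Docker partition is
$ df -h /var/lib/docker
# drop the buildx build cache, including the huge torch layer
$ docker buildx prune --force
```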
Mentioned in SAL (#wikimedia-releng) [2023-07-10T22:56:13Z] <dduvall> stopping buildkitd on runner-1029.gitlab-runners.eqiad1.wikimedia.org to debug buildkitd cache issues (T340887)
I've been unable to reproduce this issue locally, but I'm in conversation with upstream about it. I will submit the /var/lib/buildkit/cache.db and /var/lib/buildkit/runc-native/metadata_v2.db from one of our runners to them if they need it.
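If upstream does want those files, one way to grab consistent copies would be something like the following (a sketch; stopping the daemon first so the databases aren't copied mid-write is my own assumption):

```
$ sudo docker stop buildkitd
$ sudo tar czf buildkit-state-$(hostname).tar.gz \
    /var/lib/buildkit/cache.db \
    /var/lib/buildkit/runc-native/metadata_v2.db
$ sudo docker start buildkitd
```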
In the meantime, there are a couple of things we should do.
Change 938016 had a related patch set uploaded (by Dduvall; author: Dduvall):
[operations/puppet@production] buildkitd: Fix gckeepstorage units
I'm still unable to repro this issue locally, and according to upstream the ref counting is held solely in memory, so the only way to get more visibility into it would be to enable tracing. Furthermore, tracing for cache ref locking/unlocking appears to be available only in the v0.12 release candidates, and our fork is still based on v0.11.
For now, I've restarted buildkitd on all the WMCS runners to clear out the cache. We'll have to monitor the situation to see whether un-reclaimable cache entries reappear and go from there. Tracing is possible once v0.12 is released and we base our fork on it, but it'll be time-intensive.
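Since the stuck refs are held only in memory (per the upstream comment above), restarting the buildkitd container releases them, after which the entries become reclaimable and can be pruned. Roughly what that looked like on each runner (a sketch, not the exact commands run; the container name comes from the `docker system df` output earlier):

```
$ sudo docker restart buildkitd
# once the daemon is back up, the previously locked entries can be reclaimed
$ sudo docker run -e BUILDKIT_HOST=tcp://buildkitd:1234 --network gitlab-runner --rm -it \
    docker-registry.wikimedia.org/repos/releng/kokkuri:v1.6.0 buildctl prune
```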
Just to clarify, this issue is not the same as T338317: Python torch fills disk of CI Jenkins instances or T340586: GitLab CI: "ENOSPC: no space left on device, mkdir", since in those cases issuing a prune command was sufficient. In this case, pruning has no effect because the cache entries are locked (they have active references in memory even though no containers are running).
Change 938016 merged by Jelto:
[operations/puppet@production] buildkitd: Fix gckeepstorage units
So far, so good with the new GC configuration. All runners are reporting cache storage at or below gckeepstorage, and all of it is reclaimable.
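For posterity, a quick way to spot-check a runner is to re-run the du command from earlier and look at the summary; the last two lines report the reclaimable and total cache sizes (sketch below):

```
$ sudo docker run -e BUILDKIT_HOST=tcp://buildkitd:1234 --network gitlab-runner --rm \
    docker-registry.wikimedia.org/repos/releng/kokkuri:v1.6.0 buildctl du | tail -n 2
```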
I'm going to close this out for now. If it crops up again, we can reopen or follow up with a new task and additional debugging, but I think the time investment in tracing is not worth it at the moment.