Recent incidents of buildkitd's storage volume filling up
Closed, Resolved (Public)

Description

Several jobs have failed recently due to a buildkitd volume being full. This isn't something that we've seen regularly.

https://gitlab.wikimedia.org/repos/sre/libvmod-wmfuniq/-/jobs/517608
https://gitlab.wikimedia.org/repos/releng/scap/-/jobs/515939
https://gitlab.wikimedia.org/repos/releng/scap/-/jobs/515937

A quick way to prune all buildkitd volumes:

for n in $(seq 0 2); do kubectl -n gitlab-runner exec buildkitd-$n -- buildctl --addr localhost:1234 prune; done

This assumes that your kubectl config is pointing to the gitlab-cloud-runner cluster.
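A slightly more cautious variant of the loop above, as a rough sketch only: it runs `buildctl du` against each pod before pruning so the reclaimed space is visible, and it can print the commands instead of running them. The function names `run` and `prune_buildkitd` and the `DRY_RUN` variable are made up for illustration; the namespace, pod names, and address are taken from the one-liner above, and the default of three pods mirrors its `seq 0 2`.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative assumptions,
# not part of any existing tooling.

# Run a command, or just print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show disk usage, then prune, on buildkitd-0 .. buildkitd-(count-1).
prune_buildkitd() {
    count="${1:-3}"   # assumed default: 3 pods, matching `seq 0 2`
    n=0
    while [ "$n" -lt "$count" ]; do
        for action in du prune; do
            run kubectl -n gitlab-runner exec "buildkitd-$n" -- \
                buildctl --addr localhost:1234 "$action" || return 1
        done
        n=$((n + 1))
    done
}
```

After sourcing the file, `DRY_RUN=1 prune_buildkitd` prints the kubectl commands without touching the cluster; plain `prune_buildkitd` executes them, under the same kubectl config assumption noted above.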

I set up this dashboard for monitoring:
https://grafana.cloud.releng.team/d/demojg23eidq8c/buildkitd

Details

Related Changes in GitLab:
Title: .gitlab-ci.yml: Add wmcs tag to publish-build job
Reference: repos/sre/wikitech-static-docker!3
Author: dancy
Source Branch: main-I5d769213f9b107da1513eaac703ea90102b250d4
Dest Branch: main

Event Timeline

On a 7-day view, https://grafana.cloud.releng.team/d/demojg23eidq8c/buildkitd shows regular daily spikes where a buildkitd volume maxes out, starting 2025-05-19. There must be a new CI job doing something interesting/immense.

thcipriani subscribed.

Tagging in cloud-services-team folks for awareness/comment. It makes sense that that job is eating space if it's dumping the entirety of wikitech (though, honestly, I'm unsure what size we should expect there).

Would isolating that job (possibly on a WMCS runner) help/make sense?

Yep. I had a discussion offline with @Andrew and this is what we're going to try first.

dancy changed the task status from Open to In Progress. May 27 2025, 6:13 PM
dancy claimed this task.
dancy triaged this task as Low priority.

@Andrew has since changed where he performs the build of the wikitech-static container image, so the main cause of this ticket has been resolved.