
Improve garbage collection of unused MediaWiki images on deployment host
Closed, Resolved (Public)

Description

With the introduction of PHP 8.1-based images, we've doubled the baseline storage requirements for MediaWiki container images on the deployment host.

There is a time-based cleanup timer set up on the deployment host to mitigate these kinds of issues. After T387796: deployment server - low disk space on /srv, it now removes images older than 7 days.

However, the risk of running out of space on /srv remains and is hard to predict, since some deployments may trigger full rebuilds if changes greatly affect the size of subsequent layers (e.g. when changes touch l10n cache generation). If there were a week with many such deployments, we could easily run out of space within the 7-day span.

Proposal

Implement a more precise and aggressive garbage collection routine specific to images that scap builds by:

  1. Labeling images built by scap (the build-images.py script in the release project).
  2. Implementing a scap clean-images command that would:
    1. Iterate over images built by scap (filtering by label).
    2. Untag images (docker image rm) not referenced by last_image in any of the scap/image-build/*-state.json files under the staging directory.
    3. Run docker image prune to remove dangling (untagged/unreferenced) images. Note that this command shouldn't touch images that are a part of the ancestry of a last_image.
  3. Run scap clean-images post scap sync-world if any full build occurs? Or as a systemd timer?
  4. (Optional) Have build-images.py pull the last_image if it doesn't exist locally. This would prevent any full builds from happening if a last_image is for some reason removed.
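
The selection logic in step 2 can be sketched in Python as below. This is a minimal illustration, not the merged scap code. It assumes each *-state.json file is a JSON object with a top-level last_image key, and the function names (referenced_images, plan_cleanup, docker_commands, list_command) are hypothetical. The label passed to list_command is a placeholder, since the exact label key is not given here.

```python
import json
from pathlib import Path


def referenced_images(state_dir):
    """Collect every last_image value found in *-state.json files.

    Assumption: each state file is a JSON object with a top-level
    "last_image" key. The real file layout may differ.
    """
    refs = set()
    for state_file in sorted(Path(state_dir).glob("*-state.json")):
        state = json.loads(state_file.read_text())
        last_image = state.get("last_image")
        if last_image:
            refs.add(last_image)
    return refs


def plan_cleanup(labelled_refs, state_dir):
    """Split labelled image refs into (keep, untag) lists, preserving order."""
    in_use = referenced_images(state_dir)
    keep = [ref for ref in labelled_refs if ref in in_use]
    untag = [ref for ref in labelled_refs if ref not in in_use]
    return keep, untag


def list_command(label):
    """docker invocation that lists refs of images carrying the given label."""
    return [
        "docker", "image", "ls",
        "--filter", "label=" + label,
        "--format", "{{.Repository}}:{{.Tag}}",
    ]


def docker_commands(untag):
    """Return the docker invocations for the plan without running them."""
    commands = [["docker", "image", "rm", ref] for ref in untag]
    # Prune dangling images once all unused refs have been untagged.
    commands.append(["docker", "image", "prune", "--force"])
    return commands
```

Keeping the planning step free of side effects makes a dry-run mode straightforward: print the plan and skip executing the commands.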

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Allow deployment group to sudo -u mwbuilder scap clean-images | repos/releng/train-dev!161 | dancy | main-Ie6a33eb0a29453e0ba97e6314bef853fd76f114b | main
scap clean-images: sudo to the docker_user if needed | repos/releng/scap!1008 | dancy | master-I096f95d715e7c41f5b4293741285876e67bf0cfe | master
build_image_incr.py: Store the build type in a label | repos/releng/release!151 | dancy | main-Iad3b42ee1649aaf586eb5607d76b2a7433f3ed63 | main
kubernetes: Add clean-images subcommand | repos/releng/scap!675 | dduvall | review/clean-images-18cb | master
make-container-image: Pull non-local last_image | repos/releng/release!149 | dduvall | review/label-images-b0a7 | main
kubernetes: Add vnd.wikimedia.builder labels to MW images | repos/releng/scap!673 | dduvall | review/label-images-aa74 | master
make-container-image: Support additional image labels | repos/releng/release!147 | dduvall | review/label-images-585c | main

Event Timeline

@dancy thoughts on the implementation?

dduvall updated the task description.
dduvall changed the task status from Open to In Progress. (Mar 4 2025, 11:08 PM)
dduvall claimed this task.
dduvall triaged this task as Medium priority.

(Optional) Have build-images.py pull the last_image if it doesn't exist locally. This would prevent any full builds from happening if a last_image is for some reason removed.

I really like this idea since it provides a means for automatic recovery if a useful image is accidentally removed.
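
A rough sketch of that recovery step, assuming the docker CLI is on the path: check for the image with docker image inspect and pull it only when the check fails. The function names and the injectable run parameter are hypothetical and are not taken from build-images.py.

```python
import subprocess


def image_exists_locally(ref, run=subprocess.run):
    """Return True if `docker image inspect` succeeds for ref."""
    result = run(
        ["docker", "image", "inspect", ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ensure_last_image(ref, run=subprocess.run):
    """Pull ref only when it is missing locally.

    Returns True if a pull was performed, False if the image was
    already present. `run` is injectable so the logic can be tested
    without a Docker daemon.
    """
    if image_exists_locally(ref, run=run):
        return False
    run(["docker", "pull", ref], check=True)
    return True
```

With the last_image restored this way, a later build can reuse it as its base instead of falling back to a full build.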

The scap clean-images implementation has been merged. I plan on doing a release early next week. Sample behavior from train-dev:

debian@deploy:/srv/mediawiki-staging$ scap clean-images --dry-run
19:04:15 Skipped traindev:5000/restricted/mediawiki-multiversion-debug:2025-03-07-182446-publish-81 due to references in /srv/mediawiki-staging/scap/image-build
19:04:15 Skipped traindev:5000/restricted/mediawiki-multiversion:2025-03-07-182446-publish due to references in /srv/mediawiki-staging/scap/image-build
19:04:15 Marked traindev:5000/restricted/mediawiki-webserver:2025-03-07-173643-webserver for deletion
19:04:15 Marked traindev:5000/restricted/mediawiki-multiversion-debug:2025-03-07-173643-publish-81 for deletion
19:04:15 Marked traindev:5000/restricted/mediawiki-multiversion:2025-03-07-173643-publish-81 for deletion
19:04:15 Marked traindev:5000/restricted/mediawiki-multiversion:latest for deletion
19:04:15 Skipped traindev:5000/restricted/mediawiki-webserver:2025-03-07-182446-webserver due to references in /srv/mediawiki-staging/scap/image-build
19:04:15 Marked traindev:5000/restricted/mediawiki-webserver:latest for deletion
19:04:15 Skipped traindev:5000/restricted/mediawiki-multiversion-debug:2025-03-07-182446-publish due to references in /srv/mediawiki-staging/scap/image-build
19:04:15 Marked traindev:5000/restricted/mediawiki-multiversion-debug:latest for deletion
19:04:15 Skipped traindev:5000/restricted/mediawiki-multiversion:2025-03-07-182446-publish-81 due to references in /srv/mediawiki-staging/scap/image-build
19:04:15 Marked traindev:5000/restricted/mediawiki-multiversion:2025-03-07-173643-publish for deletion
19:04:15 Marked traindev:5000/restricted/mediawiki-multiversion-debug:2025-03-07-173643-publish for deletion
19:04:15 Untagged 0 unused refs
19:04:15 Deleted 0 image layers
19:04:15 Skipped 5 used refs
debian@deploy:/srv/mediawiki-staging$ scap clean-images
19:04:19 Untagged 13 unused refs
19:04:19 Deleted 24 image layers
19:04:19 Skipped 5 used refs

One open/incomplete item:

Run scap clean-images post scap sync-world if any full build occurs? Or as a systemd timer?

Either solution will involve Puppet changes: either a systemd timer setup, or sudo entries that allow scap clean-images to be invoked post sync-world. Personally I lean toward a daily systemd timer because it's simpler, and if something does go wrong with clean-images it won't hold up deployers.

@dancy thoughts?

I agree with running a daily timer and spreading the word that scap clean-images is available to quickly recover space in unusual circumstances.

I'll go ahead with that on Monday. Thanks for all the reviews!

Noting that, as currently written, scap clean-images can only be executed by members of the docker group. We should change it to run via sudo -u mwbuilder (along with a suitable new sudoers rule).
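
One way this could look, as a sketch only: run the command directly when the caller is already in the docker group, and otherwise prefix it with sudo -u mwbuilder. The helper names are hypothetical, the group name "docker" and user "mwbuilder" come from the note above, and the actual scap change may decide differently.

```python
import grp
import os


def current_group_names():
    """Names of the groups the current process belongs to (Unix only)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # gid has no entry in the group database; skip it.
            continue
    return names


def with_docker_user(argv, user_groups, docker_user="mwbuilder", docker_group="docker"):
    """Return argv unchanged for docker group members, else wrap it in sudo."""
    if docker_group in user_groups:
        return list(argv)
    return ["sudo", "-u", docker_user, *argv]
```

A caller would pass current_group_names() as user_groups. Taking the groups as a parameter keeps the decision testable without changing the real user.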

Change #1192567 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Allow deployment group to sudo -u mwbuilder scap clean-images

https://gerrit.wikimedia.org/r/1192567

Change #1192573 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Add optional scap-clean-images systemd timer

https://gerrit.wikimedia.org/r/1192573

Change #1192573 merged by Clément Goubert:

[operations/puppet@production] deployment_server: Add optional scap-clean-images systemd timer

https://gerrit.wikimedia.org/r/1192573

Change #1192567 merged by Slyngshede:

[operations/puppet@production] Allow deployment group to sudo -u mwbuilder scap clean-images

https://gerrit.wikimedia.org/r/1192567

We now have a systemd timer that runs scap clean-images weekly, and users in the deployment group can now run scap clean-images successfully.