With the introduction of PHP 8.1-based images, we've doubled the baseline storage requirements for MediaWiki container images on the deployment host.
There is a time-based cleanup timer on the deployment host to mitigate these kinds of issues. Since T387796: deployment server - low disk space on /srv, it removes images older than 7 days.
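As a rough illustration of what such a time-based cleanup might look like, here is a hypothetical systemd timer/service pair. The unit names and the exact prune invocation are assumptions, not the actual configuration on the deployment host; 168h corresponds to the 7-day retention described above.

```ini
# docker-image-cleanup.timer (hypothetical name)
[Unit]
Description=Periodic cleanup of old Docker images

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

# docker-image-cleanup.service (hypothetical name)
[Unit]
Description=Remove Docker images older than 7 days

[Service]
Type=oneshot
# 7 days = 168 hours
ExecStart=/usr/bin/docker image prune --all --force --filter until=168h
```

Note that an age-based prune like this is exactly what makes the scheme non-deterministic: it bounds image age, not total disk usage.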
However, the risk of running out of space on /srv remains, and disk usage is not entirely deterministic: some deployments trigger full rebuilds when changes greatly affect subsequent layer sizes (e.g. when changes touch l10n cache generation). In a week with many such deployments, we could easily run out of space within the 7-day window.
Proposal
Implement a more precise and aggressive garbage collection routine specific to images that scap builds by:
- Labeling images built by scap (via the build-images.py script in the release project).
- Implementing a scap clean-images command that would:
- Iterate over images built by scap (filtering by label).
- Untag images (docker image rm) not referenced by last_image in any of the scap/image-build/*-state.json files under the staging directory.
- Run docker image prune to remove dangling (untagged/unreferenced) images. Note that this command should not remove images that are part of the ancestry of a last_image.
- Run scap clean-images after scap sync-world whenever a full build occurs? Or run it on a systemd timer?
- (Optional) Have build-images.py pull the last_image if it doesn't exist locally. This would prevent full rebuilds from being triggered when a last_image has, for some reason, been removed.
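The clean-images steps above can be sketched roughly as follows. This is not the actual scap implementation; the label name, state-file schema (a top-level "last_image" key), and function names are all assumptions made for illustration. The core set arithmetic (labeled images minus referenced images) is separated out so it can be reasoned about independently of Docker.

```python
#!/usr/bin/env python3
"""Sketch of a hypothetical `scap clean-images` routine."""
import glob
import json
import os
import subprocess

# Hypothetical label that build-images.py would attach at build time,
# e.g.: docker build --label vnd.wikimedia.builder=scap ...
SCAP_LABEL = "vnd.wikimedia.builder=scap"


def referenced_images(staging_dir):
    """Collect every last_image referenced by a *-state.json file."""
    refs = set()
    pattern = os.path.join(staging_dir, "scap", "image-build", "*-state.json")
    for path in glob.glob(pattern):
        with open(path) as f:
            state = json.load(f)
        last = state.get("last_image")
        if last:
            refs.add(last)
    return refs


def images_to_untag(labeled_images, referenced):
    """Return the scap-built tags that no state file still points at."""
    return sorted(set(labeled_images) - set(referenced))


def clean_images(staging_dir):
    # List only images carrying the scap label.
    out = subprocess.run(
        ["docker", "images", "--filter", f"label={SCAP_LABEL}",
         "--format", "{{.Repository}}:{{.Tag}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    labeled = [line for line in out.splitlines() if line]
    for image in images_to_untag(labeled, referenced_images(staging_dir)):
        subprocess.run(["docker", "image", "rm", image], check=True)
    # Remove now-dangling images; without --all, prune only touches
    # untagged/unreferenced images.
    subprocess.run(["docker", "image", "prune", "--force"], check=True)
```

Keeping images_to_untag pure makes the untagging decision easy to unit-test without a Docker daemon, which matters for a command that deletes things.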