The WMCS instances of the integration project end up with a full /srv partition more and more often. The maintenance-disconnect-full-disks job unpools the affected instances, which eventually recover once the builds have completed, but those builds surely fail.
Analysis
MatmaRex went through the #wikimedia-releng IRC channel logs and counted the messages originating from Jenkins (logged as wmf-insecte) in which maintenance-disconnect-full-disks complains. That count indicates how frequent the error is. The raw data is in P49465, and a rendering of it accompanied this analysis.
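For reference, a count of this kind can be sketched in a few lines of Python. This is not the script that produced P49465. The log line shape (`[YYYY-MM-DD HH:MM:SS] <nick> message`) is an assumption made for illustration, and the sample lines are made up.

```python
import re
from collections import Counter

# Assumed line shape: "[2023-06-08 10:00:00] <wmf-insecte> message text".
# The real channel logs may be formatted differently.
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}) [^\]]*\] <([^>]+)> (.*)$")


def count_full_disk_alerts(lines, nick="wmf-insecte",
                           job="maintenance-disconnect-full-disks"):
    """Count, per day, the messages sent by `nick` that mention `job`."""
    per_day = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines that do not fit the assumed shape
        day, sender, message = match.groups()
        if sender == nick and job in message:
            per_day[day] += 1
    return per_day


# Hypothetical sample lines, not taken from the real logs.
sample = [
    "[2023-06-08 10:00:00] <wmf-insecte> maintenance-disconnect-full-disks: example alert",
    "[2023-06-08 11:30:00] <someone> unrelated chatter",
    "[2023-06-08 15:45:00] <wmf-insecte> maintenance-disconnect-full-disks: example alert",
    "[2023-06-09 09:10:00] <wmf-insecte> maintenance-disconnect-full-disks: example alert",
    "[2023-06-09 09:20:00] <wmf-insecte> some other job: example message",
]

alerts = count_full_disk_alerts(sample)
for day in sorted(alerts):
    print(day, alerts[day])
```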
Previous analyses of the issue are at T338627#8922715 and T338317#8909563; they are completed below:
The instances have a 36GB /srv. 1.8GB is consumed by the git mirrors and roughly 200MB by the Jenkins agent, for a total of 2GB, which leaves 34GB.
The Jenkins agent allows up to 3 concurrent builds. In all the cases I have investigated they were running wmf-quibble* jobs (most probably the selenium variants) and the workspace for a build was ~11GB. With 3 concurrent builds that is 33GB, which together with any other consumption fills the remaining 34GB.
Moreover I have sometimes seen the 24G /var/lib/docker partition being full. That comes from T338317#8909563: machinelearning/liftwing/inference-services introduces a 13G layer in the Docker buildkit cache, which overflows the 24G partition.
Needless to say we are short on disk.
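The arithmetic behind that conclusion can be laid out as a small Python sketch. All figures are the approximate ones quoted above; the variable names are only illustrative.

```python
# Figures quoted in the analysis above, in GB (approximate).
SRV_SIZE = 36
GIT_MIRRORS = 1.8
JENKINS_AGENT = 0.2
CONCURRENT_BUILDS = 3
WORKSPACE_PER_BUILD = 11  # observed size of one wmf-quibble* workspace

DOCKER_SIZE = 24
LARGE_BUILDKIT_LAYER = 13  # the layer reported in T338317#8909563

srv_available = SRV_SIZE - (GIT_MIRRORS + JENKINS_AGENT)
srv_builds = CONCURRENT_BUILDS * WORKSPACE_PER_BUILD
srv_headroom = srv_available - srv_builds

# Space left on /var/lib/docker for everything else once that layer is cached.
docker_remaining = DOCKER_SIZE - LARGE_BUILDKIT_LAYER

print(f"/srv available for builds: {srv_available:.0f} GB")
print(f"/srv used by {CONCURRENT_BUILDS} builds:      {srv_builds} GB")
print(f"/srv headroom:             {srv_headroom:.0f} GB")
print(f"/var/lib/docker remaining: {docker_remaining} GB")
```

With three builds running at once, /srv is left with about one gigabyte of slack, so any extra consumption is enough to fill it.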
The list of instances is publicly accessible via https://openstack-browser.toolforge.org/project/integration
Debian version
Instances are based on Debian Bullseye and I don't think we should upgrade to Bookworm right now (different Java, different Docker, different kernel, different versions of every library, etc).
Disk space flavor
The instances use the g3.cores8.ram24.disk20.ephemeral60.4xiops flavor, with the following partitioning:
NAME                      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                         8:0    0   20G  0 disk
├─sda1                      8:1    0 19.9G  0 part /
├─sda14                     8:14   0    3M  0 part
└─sda15                     8:15   0  124M  0 part /boot/efi
sdb                         8:16   0   60G  0 disk
├─vd-docker               254:0    0   24G  0 lvm  /var/lib/docker
└─vd-second--local--disk  254:1    0   36G  0 lvm  /srv
Thus:
- disk20 is the 20G for the system on /
- ephemeral60 is 60G split between:
  - 36G on /srv for Jenkins and git mirror
  - 24G on /var/lib/docker for Docker and its build cache
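The mapping between the flavor name and the partitioning can be checked with a short Python sketch. The `parse_flavor` helper is hypothetical, and reading `cores8` and `ram24` as a CPU count and an amount of RAM is an assumption drawn from the naming pattern; only `disk20` and `ephemeral60` are explained above.

```python
import re


def parse_flavor(name):
    """Extract the numeric fields of a flavor name (hypothetical helper)."""
    fields = {}
    for part in name.split("."):
        match = re.fullmatch(r"(cores|ram|disk|ephemeral)(\d+)", part)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


flavor = parse_flavor("g3.cores8.ram24.disk20.ephemeral60.4xiops")

# LVM volumes carved out of the ephemeral disk, sizes in GB (from lsblk).
lvm_volumes = {"/var/lib/docker": 24, "/srv": 36}

# The two volumes together use the whole ephemeral disk.
assert sum(lvm_volumes.values()) == flavor["ephemeral"]
print(flavor)
```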
Goal
We need much more disk space. My guess is:
- the Jenkins area at /srv could be bumped to 50G (5G for the git mirrors and 3 builds * 15G = 45G)
- roughly double the Docker cache to 45G
I am aiming at requesting 20G for the system and 90G of ephemeral disk space.
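As a quick sanity check, the guesses above can be added up in Python. The figures come straight from the list; the variable names are illustrative. Taken at face value the two targets total slightly more than 90G, and which of them would be trimmed is not stated, so the sketch only reports the difference.

```python
# Proposed sizes from the list above, in GB.
SRV_GIT_MIRRORS = 5
SRV_BUILDS = 3 * 15          # 3 concurrent builds at 15 GB each
DOCKER_CACHE = 45
EPHEMERAL_REQUESTED = 90

srv_total = SRV_GIT_MIRRORS + SRV_BUILDS
ephemeral_needed = srv_total + DOCKER_CACHE
difference = ephemeral_needed - EPHEMERAL_REQUESTED

print(f"/srv target:            {srv_total} GB")
print(f"/var/lib/docker target: {DOCKER_CACHE} GB")
print(f"sum of both targets:    {ephemeral_needed} GB")
print(f"over the 90G request:   {difference} GB")
```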