
releases1003 file system over 90% full
Closed, ResolvedPublic

Description

File system usage on releases1003 has been growing since March 2024; it triggered an alert yesterday and warrants a review and possibly a cleanup.

File system usage since the beginning of the year: https://w.wiki/AUFK

releases1003_root_partition.png (764×918 px, 56 KB)

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2024-06-24T08:39:19Z] <hashar> releases1003: deleting left over temporary files from the MediaWiki branching (rm -fR /tmp/mw-branching-*) | T368239

hashar added subscribers: jnuche, dancy, hashar.

That is the / partition filling up, though the applications/services should write to a standalone partition mounted somewhere under /srv. Also /tmp is on the root partition as well. I guess a bit of repartitioning is needed?

There are three leftover directories of ~3.2 GB each in /tmp (du sizes in MB):

3198	/tmp/mw-branching-01_qo309
3220	/tmp/mw-branching-x7h64ixp
3218	/tmp/mw-branching-iah3hkx7

That comes from https://gitlab.wikimedia.org/repos/releng/release.git, which creates a temp directory, but the cleanup might not occur in case of failure/interrupt.
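For what it's worth, the usual guard against such leftovers is an EXIT trap registered right after the temp directory is created. A minimal sketch (not the actual release.git code; the function name and echo are illustrative):

```shell
#!/bin/sh
# Sketch only, not the real release.git script: the EXIT trap also fires
# on failure or interrupt (INT/TERM), so the scratch directory is always
# removed when the subshell exits.
branch_cut() (
    tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/mw-branching-XXXXXX")
    trap 'rm -rf "$tmpdir"' EXIT INT TERM
    # ... the real script would clone and branch MediaWiki here ...
    echo "$tmpdir"
)
dir=$(branch_cut)   # the subshell has exited here, so the trap has fired
echo "removed: $dir"
```

With this pattern an aborted branch cut cannot leave a /tmp/mw-branching-* directory behind.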


There is 3 GB+ in /home/dancy; I haven't touched it though.


/srv/org/wikimedia/releases has 32G, which is to be expected: those are the tarballs we have released over time.


/srv/jenkins-agent/workspace has ~53G, which is entirely due to the Branch cut test patches Jenkins job. It should remove some material on build completion. Notably work/mediawiki kept copies of wmf branches since March 26th (php-1.42.0-wmf.24). I have deleted a bunch of them, which resolves the root cause.


$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       125G   67G   52G  57% /

Thanks for cleaning up, hashar!

I've created a patch to make branch-cut-test-patches clean up the MW checkouts. The stuff in /tmp/mw-branching-* is already being cleared regularly here: https://gitlab.wikimedia.org/repos/releng/release/-/blob/main/make-release/automatic-branch-cut?ref_type=heads#L24
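The regular clearing of /tmp/mw-branching-* amounts to an age-based sweep; a sketch of that kind of step (the path pattern matches the real directories, but the one-day threshold is illustrative, not copied from automatic-branch-cut):

```shell
#!/bin/sh
# Illustrative age-based sweep, not the actual automatic-branch-cut code:
# delete mw-branching scratch directories in /tmp whose mtime is more
# than one day old. -maxdepth 1 keeps the search at the top of /tmp.
find /tmp -maxdepth 1 -type d -name 'mw-branching-*' -mtime +1 -exec rm -rf {} +
```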

As discussed in IRC, the release VMs should probably have a separate disk mounted at /srv (similar to the 150GB disk mounted at /srv/docker). However, I'm not sure if /srv/docker needs all that space. The Docker partition is using only 5%, with no significant change over the past month. Therefore, we could use this larger disk at /srv and mount a smaller one for Docker, or just use the 150GB disk for /srv (including /srv/docker).

I don't think there's an expected increase in disk usage any time soon; mounting the 150G disk directly at /srv makes sense to me.

There is 3 GB+ in /home/dancy; I haven't touched it though.

Cleaned up.

Changing the partition layout in the correct way would mean reimaging the hosts and if we're doing that we should also upgrade them to Bookworm.

LSobanski triaged this task as Medium priority.Jun 24 2024, 3:18 PM
LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

jnuche closed https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/365

scap clean: perform l10n cleanup only when l10n files can be found

The immediate issue has been fixed by cleaning up old files and leftover temporary files. The Jenkins job now cleans up the old MediaWiki checkouts, so I guess we are set.

A potential follow-up is to change the partitioning of releases1003 to use /srv for data instead of using the root partition. Though I am not sure it is worth the effort.

Cool! I would say that gets us back to "not super urgent but when we do the next distro version upgrade we just need to remember this and do it while at it".

We could start to ask the question "what keeps us from upgrading those machines to Bookworm?" (any missing packages, adding support in Puppet).

releases1003 has the following partitions:

# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    254:0    0  128G  0 disk 
├─vda1 254:1    0  127G  0 part /
└─vda2 254:2    0  976M  0 part [SWAP]
vdb    254:16   0  150G  0 disk 
└─vdb1 254:17   0  150G  0 part /srv/docker

Can a new partition be added to the Ganeti VM for /srv or does that require recreating the VM from scratch?

There is a 150G /srv/docker which appears unused. Looking at https://gitlab.wikimedia.org/repos/releng/jenkins-deploy.git/ there is no job or code making use of Docker besides Pipelinelib. It looks like Pipelinelib and Docker were added in order to build container images (9f11af61021b72d1d4e9d0226d29c242d29a11d1), which are no longer needed. @jnuche mentioned that docpub uses Docker though, so we should keep it.


Maybe /srv/docker can be shrunk to say 30G and the remaining 120G could be used for /srv?

That saves us from having to conduct a full reimage and/or an OS upgrade (which is not trivial).

It's possible to add a new virtual disk to the ganeti VM. It does not require recreating the VM from scratch.

It does require a reboot though, and maybe a short scheduled downtime. It's possible that the disk names are shuffled when doing this and the machine doesn't come back until we fix /etc/fstab via the console.

What is also possible is to create a new VM with the same OS as before, with the right amount of disk space in a single disk, and copy the data over.

Do you already know what you see as the main blocker for going to Bookworm, though?

It's possible to add a new virtual disk to the ganeti VM. It does not require recreating the VM from scratch.

It does require a reboot though, and maybe a short scheduled downtime.

Revisiting: given Docker is barely used here, it is unlikely to fill the partition. The partition used for /srv/docker could be remounted at /srv, and with 125G I think it is large enough. That would prevent the application from filling the root partition.

The data from /srv in the root partition would have to be copied to the new partition though, and the /srv/docker content can be deleted.
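A rough outline of that migration (device names taken from the lsblk output in this task; the helper name and step ordering are mine, not an agreed plan, and this would run in a maintenance window):

```shell
#!/bin/sh
# Hypothetical migration outline, not a tested runbook.
# copy_tree preserves permissions, ownership and timestamps via cp -a;
# the trailing /. copies the directory contents, including dotfiles.
copy_tree() {
    cp -a "$1"/. "$2"/
}
# 1. stop the services that write under /srv
# 2. rm -rf /srv/docker/*          # discard the unused Docker data
# 3. umount /srv/docker
# 4. mount /dev/vdb1 /mnt && copy_tree /srv /mnt && umount /mnt
# 5. update /etc/fstab to mount /dev/vdb1 (by UUID) at /srv, then mount /srv
```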

It's possible that the disk names are shuffled when doing this and the machine doesn't come back until we fix /etc/fstab via the console.

The partitions have a unique ID (UUID) which can be used instead of the device name. On releases1003 that is already the case for the root and swap partitions:

$ egrep -v '^#' /etc/fstab
UUID=f2218b51-03e3-46e9-a199-0efb07a71740 /               ext4    errors=remount-ro 0       1
UUID=e07f4247-eb4d-4659-9cfd-712822e005c2 none            swap    sw              0       0
/dev/sr0        /media/cdrom0   udf,iso9660 user,noauto     0       0
/dev/vdb1       /srv/docker ext4 errors=remount-ro 0 2

The UUID can be obtained via lsblk:

$ lsblk --fs
NAME   FSTYPE FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINT
vda                                                                           
├─vda1 ext4   1.0         f2218b51-03e3-46e9-a199-0efb07a71740   67.6G    41% /
└─vda2 swap   1           e07f4247-eb4d-4659-9cfd-712822e005c2                [SWAP]
vdb                                                                           
└─vdb1 ext4   1.0         65903bdc-cc67-49a9-95a8-0929c79124d7  139.1G     0% /srv/docker

That would ensure they don't get mixed up. Then again, the boot device is hopefully always vda, so it might not be a concern.
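Concretely, the /srv/docker line could be switched to the UUID lsblk reported above (same filesystem, same options; only the first field changes — the mount point would of course change if the partition is repurposed for /srv):

```
# /etc/fstab: reference vdb1 by UUID instead of device name
UUID=65903bdc-cc67-49a9-95a8-0929c79124d7 /srv/docker ext4 errors=remount-ro 0 2
```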

What is also possible is to create a new VM with the same OS as before, with the right amount of disk space in a single disk, and copy the data over.

Do you already know what you see as the main blocker for going to Bookworm, though?

Either recreating a VM or upgrading the OS is a lot more work. As per my previous comment, I'd like the application to not write to the root partition (T368239#9980371).

I will soon respond to the other points, but for now let me just say: currently we only use 46% of the disk space so we have some time to do this.

I will be out for a week but will get back to this afterwards.

There are three leftover directories of ~3.2 GB each in /tmp (du sizes in MB):

3198	/tmp/mw-branching-01_qo309
3220	/tmp/mw-branching-x7h64ixp
3218	/tmp/mw-branching-iah3hkx7

That comes from https://gitlab.wikimedia.org/repos/releng/release.git, which creates a temp directory, but the cleanup might not occur in case of failure/interrupt.

These are gone: there are no more /tmp/mw-branching* directories on either of the releases servers as of today. So something or someone already fixed that.

There is 3 GB+ in /home/dancy; I haven't touched it though.

This is also already fixed: /home/dancy is just a few megabytes and no user home is larger than 1 GB. If that becomes an issue again, we can use the same notification method tested on people hosts, which warns us about very large user homes.
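A minimal version of such a check (a sketch in the spirit of the people-hosts notification, not its actual code; the function name, paths and threshold are illustrative):

```shell
#!/bin/sh
# Hypothetical large-home check: print every directory under $1 whose
# disk usage exceeds the threshold in KiB given as $2.
large_homes() {
    base=${1:-/home}
    threshold_kb=${2:-1048576}   # default threshold: 1 GiB
    for d in "$base"/*/; do
        [ -d "$d" ] || continue
        kb=$(du -sk "$d" | cut -f1)
        [ "$kb" -gt "$threshold_kb" ] && echo "${d%/}: ${kb} KiB"
    done
    true
}
```

Running `large_homes /home` from cron and mailing any output would give the same kind of early warning.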

/srv/org/wikimedia/releases has 32G, which is to be expected: those are the tarballs we have released over time.

ACK, that's at 35G now. Nothing to do here.

/srv/jenkins-agent/workspace has ~53G, which is entirely due to the Branch cut test patches Jenkins job. It should remove some material on build completion. Notably work/mediawiki kept copies of wmf branches since March 26th (php-1.42.0-wmf.24). I have deleted a bunch of them, which resolves the root cause.

This is down to 8.4G and is empty on releases2003. You already fixed it too.

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       125G   67G   52G  57% /

now even lower:

/dev/vda1       125G   53G   66G  45% /

So.. not sure there is much left to do here.

Except to say "when we create the next releases machine, maybe pick a different partman recipe". The only actionable thing, it seems to me, is to somehow have a reminder for that.

Dzahn lowered the priority of this task from Medium to Low.Oct 2 2024, 7:00 PM

To me this comes back to questions like:

"how realistic are releases servers on Bookworm?", "when do we want to try upgrading?", "what are the blockers?"

If the answer is anything under "a couple of years", I would think the best path forward is to create new VMs and decide about the partition sizes while doing that.

Then we can close this ticket, create a new one for the upgrade, leave a comment there to remember the disk size discussion, and be done for now.

If the answer is more like "that is a major problem and not planned for a while" AND we are really concerned that all the fixes above were only temporary and the issue will repeat itself (which I'm not, really), then we can still create new virtual disks and mount them on the existing VMs.

I wouldn't rate that very high effort, though not zero.

Regardless, I would say creating a ticket for "releases servers to Bookworm" is correct, even if stalled at first.

hashar reassigned this task from Dzahn to jnuche.

The root partition filled up because the services hosted on the hosts filled it. My previous messages suggested making /srv a standalone partition, which could be done by reusing the barely used /srv/docker partition. That saves one from having to recreate the VM (which is a lot more work) and addresses the issue of the root partition filling up due to one of the hosted services. I guess my messages weren't too clear.

Anyway, that was a one-off error which I cleaned up back in June, and @jnuche fixed the job to have it clean leftover files (https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/83). Feel free to file a new task to upgrade the VMs to Bookworm with a proper partition scheme.

re: @LSobanski

File system usage since the beginning of the year: https://w.wiki/AUFK

This is how that looks now, fwiw:

Screenshot from 2024-10-02 12-36-25.png (789×1 px, 102 KB)

re: @hashar ACK, thanks for closing it, agreed. If you ever feel like you still want additional disks, we can do it.