archiva1002 is running low on space left in the root partition
Closed, Resolved, Public

Description

Hello folks,

Archiva1002 is running low on space left in the root partition:

root@archiva1002:/var/lib/archiva# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           395M   40M  355M  11% /run
/dev/vda1        94G   84G  5.6G  94% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           1.0G     0  1.0G   0% /var/lib/nginx
tmpfs           395M     0  395M   0% /run/user/13926

Most of the space used is in /var/lib/archiva/repositories:

elukey@archiva1002:/var/lib/archiva/repositories$ sudo du -hs * | sort -h
68K	internal
1.9M	mirror-spark
2.6M	analytics-old-uploads
1.6G	python
2.6G	mirror-cloudera
3.7G	snapshots
12G	mirror-maven-central
47G	releases

More details:

elukey@archiva1002:/var/lib/archiva/repositories/releases/org$ sudo du -hs * | sort -h
24K	jboss
72K	slf4j
1.1M	codehaus
5.6M	xbib
5.7M	apache
13M	elasticsearch
186M	linkeddatafragments
17G	wikimedia
23G	wikidata

elukey@archiva1002:/var/lib/archiva/repositories/releases/org/wikimedia/analytics$ sudo du -hs * | sort -h
508K	mediawiki-tables-sqoop-orm
31M	hdfs-tools
401M	camus-wmf
16G	refinery

elukey@archiva1002:/var/lib/archiva/repositories/releases/org/wikidata/query/rdf$ sudo du -hs * | sort -h
112K	parent
7.2M	query-service-parent
8.7M	streaming-updater-common
14M	testTools
24M	jetty-logging
31M	common
38M	blazegraph
153M	flink-fs-swift
327M	streaming-updater
625M	mw-oauth-proxy
678M	rdf-spark-tools
698M	streaming-updater-consumer
728M	tools
1.1G	streaming-updater-producer
6.7G	blazegraph-service
12G	service

We should probably drop older revisions to free some space :)

Event Timeline

We might be able to drop all wdqs artifacts prior to 0.3.40, which is the oldest reference I found here: https://github.com/wmde/wikibase-release-pipeline/search?q=WDQS_VERSION.
The newest reference is 0.3.97, so I wonder whether the versions before that one could be dropped too. I'll try to get an answer to this.

I think from our (people keeping an eye on Wikibase releases) side it would be helpful to keep both 0.3.40 and 0.3.97. Other than these we won't be impacted by them going missing.
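
A possible way to preview such a cleanup before deleting anything is sketched here. It assumes the standard Maven repository layout, where each artifact directory (for example releases/org/wikidata/query/rdf/service) contains one subdirectory per version. The function name `list_prunable` and its cutoff and keep arguments are hypothetical, and the sketch only prints candidates without removing them.

```shell
#!/bin/sh
# Dry-run sketch: print version directories older than a cutoff version,
# skipping any versions named in a keep list. Nothing is deleted.
# Usage: list_prunable ARTIFACT_DIR CUTOFF [KEEP_VERSION...]
list_prunable() {
    local artifact_dir="$1" cutoff="$2"
    local keep dir version oldest
    shift 2
    keep=" $* "
    for dir in "$artifact_dir"/*/; do
        [ -d "$dir" ] || continue          # no version directories at all
        version=$(basename "$dir")
        # Skip versions that were explicitly asked to be kept
        case "$keep" in *" $version "*) continue ;; esac
        # sort -V orders version strings; a version is "older" when it
        # sorts before the cutoff and is not the cutoff itself
        oldest=$(printf '%s\n%s\n' "$version" "$cutoff" | sort -V | head -n 1)
        if [ "$oldest" = "$version" ] && [ "$version" != "$cutoff" ]; then
            printf '%s\n' "${dir%/}"
        fi
    done
}
```

For example, `list_prunable /var/lib/archiva/repositories/releases/org/wikidata/query/rdf/service 0.3.97 0.3.40` would list every version directory older than 0.3.97 except 0.3.40. The output would still need to be reviewed before anything is removed.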

However I think when we started depending on these artefacts we (wrongly) assumed that they would be archived here for eternity. Is there any specific commitment for how long they are to remain available? Will there be archive copies kept elsewhere?

Thanks so much for noticing this and flagging it up :).

Previous occurrence: https://phabricator.wikimedia.org/T304224

@Tarrow Hi! We do have some backups of archiva's releases, but in general we'd probably need to be more explicit about what can and cannot be dropped, to avoid any kind of surprise. For example, we may decide to prune older backups in the future, so in my opinion let's define clearly what is needed :)

I think I'd probably consider growing the disk in ganeti, which should be able to increase the headroom for us.

https://wikitech.wikimedia.org/wiki/Ganeti#Adding_disk_space

I know it's a bit fiddly and not exactly a recommended approach, but I think it'll work here.

I'm away for a few more days, but I'm happy to do it when I get back, if we can schedule a bit of downtime for Archiva.

@BTullis another way could be to add a new disk of say 200G, format it and then mount /var/lib/archiva on it.

Should we also think about releasing more projects to Maven Central and using Archiva mostly as a local cache? This would externalise the disk space issue.

It is erroring out now:

PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space:
/ 1696 MB (1% inode=84%)
/tmp 1696 MB (1% inode=84%)
/var/tmp 1696 MB (1% inode=84%)
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops

That will surely cause CI builds to fail once the disk is filled and new dependencies are added somewhere.

/var/lib/archiva should probably be made a standalone partition to avoid more or less killing the box by filling /.
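
To make the concern concrete, here is a small sketch for checking how full a filesystem is and whether two paths share the same one (as / and /var/lib/archiva do on this host at this point). The helper names are hypothetical, and the parsing assumes device names and mount points without spaces.

```shell
#!/bin/sh
# Print the used percentage (integer, without the % sign) of the
# filesystem that holds the given path, based on POSIX-format df output.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the mount point of the filesystem that holds the given path.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Succeed when both paths live on the same mounted filesystem.
same_filesystem() {
    [ "$(mount_point_of "$1")" = "$(mount_point_of "$2")" ]
}
```

Running `same_filesystem / /var/lib/archiva && echo shared` on the host would print `shared` for as long as the repositories still sit on the root partition.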

I'm looking into this issue now, since I'm on leave next week and I would rather not leave it any longer.
I will take the advice from @hashar and @elukey which is to use a separate disk for /var/lib/archiva and move the existing contents there.

I checked that the primary (ganeti1008) and secondary (ganeti1025) nodes both have plenty of spare disk space:

Then I added a 200 GB disk using the command:

btullis@ganeti1027:~$ sudo gnt-instance modify --disk add:size=200g archiva1002.wikimedia.org

This is pre-allocating now.

I will return in an hour or so to complete the format and data move operations.

VM archiva1002.wikimedia.org rebooted by btullis@cumin1001 with reason: Adding disk

I had to rename the network interface from ens5 to ens14 in /etc/network/interfaces as described here: https://wikitech.wikimedia.org/wiki/Ganeti#Adding_a_disk

Once I did that I could SSH into the machine and create a partition table with a single 200 GB partition.

btullis@archiva1002:~$ ls -l /dev/vd*
brw-rw---- 1 root disk 254,  0 Jul 24 20:45 /dev/vda
brw-rw---- 1 root disk 254,  1 Jul 24 20:45 /dev/vda1
brw-rw---- 1 root disk 254,  2 Jul 24 20:45 /dev/vda2
brw-rw---- 1 root disk 254,  5 Jul 24 20:45 /dev/vda5
brw-rw---- 1 root disk 254, 16 Jul 24 20:45 /dev/vdb
btullis@archiva1002:~$ sudo fdisk /dev/vdb

Welcome to fdisk (util-linux 2.33.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x85fcb6c1.

Command (m for help): p
Disk /dev/vdb: 200 GiB, 214748364800 bytes, 419430400 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x85fcb6c1

Command (m for help): n
Partition type
   p   primary (0 primary, 0 extended, 4 free)
   e   extended (container for logical partitions)
Select (default p):

Using default response p.
Partition number (1-4, default 1):
First sector (2048-419430399, default 2048):
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-419430399, default 419430399):

Created a new partition 1 of type 'Linux' and of size 200 GiB.

Command (m for help): p
Disk /dev/vdb: 200 GiB, 214748364800 bytes, 419430400 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x85fcb6c1

Device     Boot Start       End   Sectors  Size Id Type
/dev/vdb1        2048 419430399 419428352  200G 83 Linux

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

btullis@archiva1002:~$

I could then make an ext4 filesystem on the new partition.

btullis@archiva1002:~$ sudo mke2fs -m 0.1 -t ext4 /dev/vdb1
mke2fs 1.44.5 (15-Dec-2018)
Creating filesystem with 52428544 4k blocks and 13107200 inodes
Filesystem UUID: 49b6e4e9-0cf4-459f-851b-d2ecafc151f0
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
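
As a quick sanity check on the numbers above: 419428352 sectors of 512 bytes correspond to the 52428544 4k blocks that mke2fs reports, and `-m 0.1` sets the percentage of blocks reserved for the super-user to 0.1% instead of the 5% default. The sketch here works through that arithmetic. The helper names are hypothetical and the results are rounded down, so they only approximate what mke2fs actually reserves.

```shell
#!/bin/sh
# Convert a count of 512-byte sectors into 4 KiB filesystem blocks.
sectors_to_blocks() {
    echo $(( $1 * 512 / 4096 ))
}

# Approximate reserved space in MiB for a block count ($1) and a
# reservation given in tenths of a percent ($2), e.g. 1 = 0.1%, 50 = 5%.
reserved_mib() {
    echo $(( $1 * $2 / 1000 * 4096 / 1048576 ))
}

blocks=$(sectors_to_blocks 419428352)
echo "blocks:            $blocks"
echo "reserved at 5%:    $(reserved_mib "$blocks" 50) MiB"
echo "reserved at 0.1%:  $(reserved_mib "$blocks" 1) MiB"
```

On this 200 GiB volume that is roughly 204 MiB reserved rather than about 10 GiB, which seems a reasonable trade-off for a data-only partition.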

I've mounted /dev/vdb1 to /mnt temporarily and started an rsync operation with:

sudo rsync -av /var/lib/archiva/ /mnt

Once this is complete I will:

  • add /dev/vdb1 to /etc/fstab as /var/lib/archiva
  • stop the archiva service
  • run the rsync command once again to make sure no changes have occurred
  • rename /var/lib/archiva to /var/lib/archiva-bak
  • mkdir /var/lib/archiva and chown it to the correct user (currently this ownership is bacula:ulog which I'm not sure about, but I can come back to double check this)
  • sudo umount /mnt to remove the temporary mount
  • sudo mount -a to mount the new volume
  • start the archiva service
  • If all is well, remove /var/lib/archiva-bak
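
The steps above could be scripted roughly as follows. This is only a sketch under a few assumptions: the systemd unit is called `archiva`, the /etc/fstab entry for /var/lib/archiva has already been added, and `chown --reference` is used to copy whatever ownership the old directory had. The `run` and `cutover` helpers are hypothetical. By default every command is only printed; nothing runs unless DRY_RUN=0 is set, and the final removal of the backup is deliberately left as a manual step.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Cutover sequence for moving /var/lib/archiva onto the new volume,
# which is assumed to be temporarily mounted on /mnt.
cutover() {
    local src=/var/lib/archiva
    run systemctl stop archiva               # unit name is an assumption
    run rsync -av "$src/" /mnt               # final sync while the service is stopped
    run mv "$src" "$src-bak"                 # keep the old data as a backup
    run mkdir "$src"                         # new, empty mount point
    run chown --reference="$src-bak" "$src"  # copy ownership from the old directory
    run umount /mnt                          # drop the temporary mount
    run mount -a                             # mount the new volume via /etc/fstab
    run systemctl start archiva
}
```

Calling `cutover` as-is prints the eight commands in order, which makes it easy to review the sequence before running it with DRY_RUN=0 as root.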

All steps above have now been followed, except the final removal of the backup in /var/lib/archiva-bak.

The service starts and appears to be OK. I will check the status of the archiva-gitfat-link timer and service, to make sure that it looks OK.

The git-fat link service appears to work without errors:

Jul 24 21:15:01 archiva1002 systemd[1]: Started Archiva tool to create jar symlinks using their sha1 checksum as filename..
Jul 24 21:16:28 archiva1002 systemd[1]: archiva-gitfat-link.service: Succeeded.

I will now remove the backup directory: /var/lib/archiva-bak

BTullis claimed this task.
BTullis triaged this task as High priority.
BTullis moved this task from Ready to Done on the Data-Engineering-Planning (Sprint 01) board.

The disk space is now looking much more healthy.

btullis@archiva1002:~$ df -h -t ext4
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        94G  9.5G   80G  11% /
/dev/vdb1       196G   75G  121G  39% /var/lib/archiva

I haven't updated the partman recipe yet, but at least with a second disk we can choose not to format this when we reimage the host, which will save on the rebuild time.

@BTullis very well done, thank you very much! :)