
Expand thanos-swift sd[ab]3 SSDs
Closed, Resolved, Public

Description

thanos-be2004's filesystem on sdb3 is reported as basically full:

/dev/sdb3        94G   92G  1.3G  99% /srv/swift-storage/sdb3

We have big containers at the moment due to tegola's usage (e.g. T307184: Followups for Tegola and Swift interactions), and in this case the lack of space is exacerbated by quarantined databases (left behind by container-replicator failures):

# du -hcs /srv/swift-storage/sdb3/*
1.0M	/srv/swift-storage/sdb3/accounts
56G	/srv/swift-storage/sdb3/containers
25G	/srv/swift-storage/sdb3/quarantined
13G	/srv/swift-storage/sdb3/tmp
92G	total
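
To spot this condition before a filesystem fills up completely, a small filter over df output can help. This is only a sketch: the function name flag_full_filesystems and the default threshold of 90 are assumptions made for illustration, not part of any existing tooling.

```shell
# Hypothetical helper: reads df-style output (header line first) on stdin
# and prints the mountpoint and usage of every filesystem at or above the
# threshold given as first argument (default: 90).
# Assumes mountpoints contain no whitespace.
flag_full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # strip the trailing percent sign
            if (use + 0 >= limit)
                print $6, use "%"
        }'
}

# Example invocation on a swift backend:
#   df -P /srv/swift-storage/* | flag_full_filesystems 95
```

Fed the sdb3 line shown above (preceded by the usual df header), this would report /srv/swift-storage/sdb3 at 99%.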

Growing SSD container partition

The solution I have devised is to grow partition 3 on sda and sdb: the swift-grow-ssd-part script will delete partition 4, append space at the end of partition 3 and recreate partition 4. On the filesystem side it will run xfs_growfs on partition 3 online and mkfs.xfs on partition 4. We lose the filesystem on partition 4, but that's okay: the data will be reconstructed by swift (and partition 4 is largely unused anyway).
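
The partition arithmetic involved can be sketched as follows. This is not the logic of swift-grow-ssd-part itself: the function plan_grow, its argument order, the reading of the amount as GiB and the 512-byte default sector size are all assumptions for illustration.

```shell
# Hypothetical planner, NOT the real swift-grow-ssd-part implementation.
# Arguments: end sector of partition 3, end sector of partition 4,
#            amount to add in GiB (assumed unit), sector size in bytes
#            (assumed default: 512).
plan_grow() {
    end3="$1"; end4="$2"; amount_gib="$3"; sector_size="${4:-512}"

    # Convert the requested amount into a number of sectors.
    extra=$(( amount_gib * 1024 * 1024 * 1024 / sector_size ))

    # Partition 3 keeps its start and moves its end forward;
    # partition 4 is recreated right after it and keeps its old end.
    new_end3=$(( end3 + extra ))
    new_start4=$(( new_end3 + 1 ))

    if [ "$new_start4" -gt "$end4" ]; then
        echo "not enough room: partition 4 would have no sectors left" >&2
        return 1
    fi

    echo "partition 3: end $end3 -> $new_end3 (+$extra sectors)"
    echo "partition 4: start $new_start4, end $end4"
}
```

With 512-byte sectors, growing by 100 GiB moves the end of partition 3 forward by 209715200 sectors, and partition 4 shrinks by the same amount.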

The procedure is as follows:

### The script will need the --doit flag to act on the disk!
swift-grow-ssd-part --amount 100G --dev /dev/sda
swift-grow-ssd-part --amount 100G --dev /dev/sdb
mount -a
run-puppet-agent
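
Since the same command is run for both SSDs, the two invocations can be wrapped in a loop. The sketch below assumes only the --amount, --dev and --doit flags shown above; the function name grow_ssd_pair and the GROW_CMD override are hypothetical and exist only so the loop can be tried without touching any disk.

```shell
# GROW_CMD is a hypothetical override, e.g. GROW_CMD=echo to print the
# arguments instead of running the real script.
GROW_CMD="${GROW_CMD:-swift-grow-ssd-part}"

# Usage: grow_ssd_pair AMOUNT [extra flags, e.g. --doit]
grow_ssd_pair() {
    amount="$1"; shift
    for dev in /dev/sda /dev/sdb; do
        # Extra arguments are passed through unchanged; stop at the first failure.
        "$GROW_CMD" --amount "$amount" --dev "$dev" "$@" || return 1
    done
}
```

Going by the comment in the procedure, running grow_ssd_pair 100G without --doit should not act on the disks, while grow_ssd_pair 100G --doit applies the change; mount -a and run-puppet-agent are still run afterwards.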

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-08-01T08:48:30Z] <godog> thanos-be2004: copy quarantined and tmp off sdb3 and into sdb4 for analysis and to free space - T314275

I have freed some space on thanos-be2004 sdb3, though depending on how data is shuffled around the ring the free space might not last long. AFAICT this is because the tegola containers are quite big and thus hard to replicate around the cluster when the ring changes (as in this case, where the replication factor of the thanos cluster is changing).

Change 819095 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: add script to grow the SSD partition for container databases

https://gerrit.wikimedia.org/r/819095

Change 819095 merged by Filippo Giunchedi:

[operations/puppet@production] swift: add script to grow the SSD partition for container databases

https://gerrit.wikimedia.org/r/819095

Change 819503 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: account for 'bootable' and missing sector size in grow_ssd_part

https://gerrit.wikimedia.org/r/819503

Change 819503 merged by Filippo Giunchedi:

[operations/puppet@production] swift: account for 'bootable' and missing sector size in grow_ssd_part

https://gerrit.wikimedia.org/r/819503

Mentioned in SAL (#wikimedia-operations) [2022-08-02T09:44:19Z] <godog> grow sdb3 by 100G on thanos-be2004 - T314275

Change 819522 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: set fs label and run xfs_growfs on mountpoint

https://gerrit.wikimedia.org/r/819522

Mentioned in SAL (#wikimedia-operations) [2022-08-02T10:49:12Z] <godog> grow sda3 by 100G on thanos-be2004 - T314275

Change 819522 merged by Filippo Giunchedi:

[operations/puppet@production] swift: set fs label and run xfs_growfs on mountpoint

https://gerrit.wikimedia.org/r/819522

Mentioned in SAL (#wikimedia-operations) [2022-08-02T14:04:04Z] <godog> grow sda/sdb 3 by 100G on thanos-be1001 - T314275

Mentioned in SAL (#wikimedia-operations) [2022-08-03T06:56:46Z] <godog> grow sda/sdb 3 by 100G on thanos-be1002 - T314275

Mentioned in SAL (#wikimedia-operations) [2022-08-03T06:56:55Z] <godog> grow sda/sdb 3 by 100G on thanos-be2003 - T314275

Mentioned in SAL (#wikimedia-operations) [2022-08-04T07:46:57Z] <godog> grow sda/sdb 3 by 100G on thanos-be1003 - T314275

Mentioned in SAL (#wikimedia-operations) [2022-08-04T07:47:05Z] <godog> grow sda/sdb 3 by 100G on thanos-be2002 - T314275

Mentioned in SAL (#wikimedia-operations) [2022-08-08T07:50:26Z] <godog> grow sda/sdb 3 by 100G on thanos-be1004 - T314275

Mentioned in SAL (#wikimedia-operations) [2022-08-08T07:53:46Z] <godog> grow sda/sdb 3 by 100G on thanos-be2001 - T314275

Change 821174 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: set minimum 200G for swift sd[ab]3

https://gerrit.wikimedia.org/r/821174

fgiunchedi renamed this task from "thanos-be2004 sdb3 fully used" to "Expand thanos-swift sd[ab]3 SSDs". Aug 9 2022, 6:18 AM

Change 821174 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: set minimum 200G for swift sd[ab]3

https://gerrit.wikimedia.org/r/821174

fgiunchedi claimed this task.

This is complete: the thanos-swift cluster has been expanded online. The ms cluster has not been expanded, but we have had no trouble with SSD space there since that cluster has many more hosts. I have changed the partman recipe accordingly, so both clusters will get the new standard-size SSD partitions at the next reimage.