Expand meitnerium's root partition to 100G
Closed, Resolved · Public

Description

We are seeing alarms for meitnerium's root partition reaching 94% usage. We (analytics) have a long-term plan to clean up old jars etc. from archiva, but it would be super great if we could expand the root partition from 40G to 100G as a short/medium-term resolution.

Archiva can go down for maintenance with a bit of a heads up to the Analytics team, and it is fine if meitnerium is rebooted anytime.

I have never done this work before, so I'd be happy to do it with some assistance to avoid fireworks :)

Event Timeline

elukey created this task. Jan 30 2018, 5:34 PM
Restricted Application added a subscriber: Aklapper. Jan 30 2018, 5:34 PM

Looking at https://wikitech.wikimedia.org/w/index.php?title=Ganeti#Resize_a_VM, it might be less painful to create a new disk, format it, and then use it as archiva's partition?

Yeah, it's probably easiest to add a new disk and move /var/lib/archiva to it

So meitnerium seems to be on ganeti1005, which has a ton of free disk space, so in theory the only thing needed to create the new disk would be the following:

gnt-instance modify --disk add:size=100g meitnerium.wikimedia.org

Then from inside the VM, format/mount/copy/etc. Does it make sense?
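
For reference, the in-VM steps (partition, format, mount, copy) could be sketched roughly as below. This is only a sketch under assumptions: the new disk is assumed to show up as /dev/vdb, /mnt/archiva is an assumed temporary mount point, parted is assumed to be installed, and `run` and `prepare_new_disk` are hypothetical helpers. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: device name and mount point are assumptions.
NEW_DISK="${NEW_DISK:-/dev/vdb}"            # assumed name of the new virtual disk
MOUNT_POINT="${MOUNT_POINT:-/mnt/archiva}"  # assumed temporary mount point
SRC_DIR="${SRC_DIR:-/var/lib/archiva}"      # directory to move off the root partition

# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

prepare_new_disk() {
    # One partition spanning the whole disk, formatted as ext4.
    run parted -s "$NEW_DISK" mklabel gpt
    run parted -s "$NEW_DISK" mkpart primary ext4 0% 100%
    run mkfs.ext4 "${NEW_DISK}1"
    # Mount it and copy the data, preserving ownership and permissions.
    run mkdir -p "$MOUNT_POINT"
    run mount "${NEW_DISK}1" "$MOUNT_POINT"
    run cp -a "$SRC_DIR/." "$MOUNT_POINT/"
}
```

Calling `prepare_new_disk` with the defaults prints the six commands for review; nothing touches the disk until DRY_RUN=0 is exported.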

Yea, that makes sense. I also think the easiest way is to create a new disk in ganeti and then mount it.

Dzahn triaged this task as High priority. Feb 1 2018, 11:35 PM

Mentioned in SAL (#wikimedia-operations) [2018-02-01T23:37:15Z] <mutante> creating new 100GB virtual disk for ganeti VM meitnerium (T186020)

..
Fri Feb  2 00:23:13 2018  - INFO: - device disk/1: 99.30% done, 21s remaining (estimated)
Fri Feb  2 00:23:34 2018  - INFO: - device disk/1: 100.00% done, 1s remaining (estimated)
Fri Feb  2 00:23:35 2018  - INFO: - device disk/1: 100.00% done, 0s remaining (estimated)
Fri Feb  2 00:23:36 2018  - INFO: Instance meitnerium.wikimedia.org's disks are in sync
Modified instance meitnerium.wikimedia.org
 - disk/1 -> add:size=102400,mode=rw
Please don't forget that most parameters take effect only at the next (re)start of the instance initiated by ganeti; restarting from within the instance will not be enough.
[ganeti1004:~] $

@elukey so yea, now we'd have to restart the instance from ganeti; as the message above says, rebooting from within the instance won't do it. You said above that archiva downtime needs some heads-up, so I didn't do that yet, but the new hardware is there if you want to continue.

elukey added a comment. Feb 2 2018, 7:11 AM

Thanks a lot! We just need a bit of heads up in advance to avoid doing builds while archiva is down, but it can be taken down any time with a ping to the analytics chan first. Will try to check how to do it today!

elukey added a comment. Feb 2 2018, 1:44 PM

It's a simple gnt-instance reboot meitnerium.wikimedia.org, right?

Yea, or

gnt-instance shutdown <fqdn>
gnt-instance startup <fqdn>

elukey added a comment. Feb 2 2018, 2:28 PM

New disk in place, added ext4 and everything looks good. I mounted /dev/vdb1 to /mnt/archiva and started a cp -a from /var/lib/archiva to that dir, but ganeti1005 crashed :(

How nice :(. But it does look like disk IO is a possible reproduction scenario for T181121. I'll empty ganeti1005 to avoid having any worse problems during the weekend.

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Feb 2 2018, 3:13 PM
elukey moved this task from Backlog to In Progress on the User-Elukey board. Feb 5 2018, 5:37 PM
elukey added a comment. Feb 6 2018, 8:49 AM

@akosiaris do you think we can re-attempt the copy (maybe using ionice, or rsync with a bandwidth limit, or something similar)?

@elukey. Feel free to try. If anything it will provide us with some more insight into T181121. FWIW I had refilled ganeti1005 with the VMs assigned to it after the reboot.
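
A throttled copy along the lines suggested above might look like the sketch below. Assumptions: `throttled_copy` is a hypothetical helper, the default limit of 20000 (rsync's `--bwlimit` unit is KiB/s) is an arbitrary placeholder, and the `ionice -c 3` idle class only has an effect with I/O schedulers that honor it (e.g. CFQ). It prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper: low-priority, bandwidth-limited copy of a directory tree.
# Usage: throttled_copy SRC DST [LIMIT_KIB_PER_SEC]
throttled_copy() {
    src="$1"
    dst="$2"
    limit="${3:-20000}"   # placeholder value, tune for the host
    # Idle I/O class + lowest CPU priority + rsync's own bandwidth cap.
    set -- ionice -c 3 nice -n 19 rsync -a --bwlimit="$limit" "$src/" "$dst/"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `throttled_copy /var/lib/archiva /mnt/archiva 10000` prints the full rsync command line for review. Since rsync can be re-run to pick up where it left off, an interrupted copy does not have to start from scratch.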

Mentioned in SAL (#wikimedia-operations) [2018-02-08T16:23:42Z] <elukey> stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one - T186020
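
The swap itself could be sketched roughly as follows. This is not the exact procedure that was run: the systemd unit name `archiva`, the `.old` suffix, and the helpers `swap_plan` and `fstab_line` are assumptions for illustration; /dev/vdb1 and /mnt/archiva are the device and temporary mount point mentioned earlier in this task. The functions only print text, they do not change anything.

```shell
#!/bin/sh
# Print the planned steps for moving /var/lib/archiva onto the new disk.
# Assumed: archiva runs as a systemd unit named "archiva".
swap_plan() {
    cat <<'EOF'
systemctl stop archiva
rsync -a --delete /var/lib/archiva/ /mnt/archiva/
umount /mnt/archiva
mv /var/lib/archiva /var/lib/archiva.old
mkdir /var/lib/archiva
mount /dev/vdb1 /var/lib/archiva
systemctl start archiva
EOF
}

# Build an /etc/fstab entry so the mount survives reboots.
# $1 = filesystem UUID (e.g. from: blkid -s UUID -o value /dev/vdb1)
# $2 = mount point
fstab_line() {
    printf 'UUID=%s %s ext4 defaults 0 2\n' "$1" "$2"
}
```

Keeping the old directory around as `/var/lib/archiva.old` until builds have been verified leaves an easy rollback path; it can be deleted afterwards to actually free space on the root partition.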

elukey added a subscriber: Gehel. Feb 8 2018, 5:07 PM
elukey@meitnerium:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs           792M  8.4M  783M   2% /run
/dev/vda1        49G   11G   36G  23% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
tmpfs           1.0G     0  1.0G   0% /var/lib/nginx
/dev/vdb1        99G   33G   66G  34% /var/lib/archiva

Tested the new config with different builds (@joal and @Gehel helped thanks!), everything looks good!
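
As a quick sanity check after a change like this, the usage percentage can be pulled out of `df` output with a small helper. This is a sketch, not the Icinga check used in production; `usage_pct` and `over_threshold` are hypothetical names.

```shell
#!/bin/sh
# Read df output on stdin and print the Use% (without the % sign)
# of the line whose last column equals the given mount point.
usage_pct() {
    awk -v mp="$1" '$NF == mp { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# Succeed if the mount point's usage is at or above the threshold (percent).
# Usage: over_threshold MOUNT_POINT THRESHOLD
over_threshold() {
    [ "$(df -P "$1" | usage_pct "$1")" -ge "$2" ]
}
```

For example, `over_threshold / 90 && echo "root partition is getting full"` would have fired at the 94% usage reported in the description.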

elukey moved this task from In Progress to Done on the Analytics-Kanban board. Feb 8 2018, 5:07 PM
Dzahn closed this task as Resolved.
Dzahn claimed this task.
Dzahn reassigned this task from Dzahn to elukey. Feb 8 2018, 9:42 PM

Icinga for meitnerium looks fine, no disk space warnings.

Though one thing: puppet is still disabled there... should it be?

Dzahn reopened this task as Open. Feb 8 2018, 9:42 PM
elukey moved this task from In Progress to Done on the User-Elukey board. Feb 9 2018, 2:21 PM
Dzahn closed this task as Resolved. Feb 9 2018, 8:16 PM