
upgrade ping offload servers to bullseye (was: ping servers running out of disk)
Closed, Resolved, Public

Description

All three ping* servers (ping1001, ping2001 and ping3001) ran out of disk space today, shortly after one another.

Icinga alerted about them within minutes of each other, and all three had 0 bytes left.

We fixed them for now by running "apt-get clean" and "apt-get autoremove" (which finished the kernel upgrades and removed old kernels and packages); that brought them all back down to 77% usage.
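The manual cleanup described above could be scripted roughly as follows. This is only a sketch: the two commands are the ones named in this task, while the `run_cleanup` helper and its `dry_run` flag are made up for illustration. It needs root to actually run, and apt-get may still prompt for confirmation.

```python
import subprocess
from typing import List

# The two commands named in the task description, in the order they were run.
CLEANUP_COMMANDS = [
    ["apt-get", "clean"],
    ["apt-get", "autoremove"],
]


def run_cleanup(dry_run: bool = True) -> List[str]:
    """Run the cleanup commands in order, or only list them if dry_run is set."""
    executed = []
    for cmd in CLEANUP_COMMANDS:
        executed.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the command fails.
            subprocess.run(cmd, check=True)
    return executed


if __name__ == "__main__":
    # Default is a dry run: print what would be executed without touching the system.
    for line in run_cleanup(dry_run=True):
        print(line)
```

Defaulting to a dry run keeps the sketch harmless when pasted onto a host; pass `dry_run=False` to actually execute the commands.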

This was not a case of logs filling up /var/log. Instead, /usr/lib is the largest directory here, and / in general is tiny on these hosts: just 3 GB on /dev/vda1.
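To see how close a root filesystem is to full, a small check like the one below can help. This is a minimal sketch and not the Icinga check that alerted here; the function names and the 90% default threshold are assumptions picked for illustration. The percentage may also differ slightly from what `df` reports, since `df` accounts for reserved blocks differently.

```python
import shutil


def usage_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def root_is_nearly_full(path: str = "/", threshold: float = 90.0) -> bool:
    """Return True if the filesystem holding `path` is at or above `threshold` percent used."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage_percent(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    percent = usage_percent(usage.total, usage.used)
    print(f"/ is {percent:.0f}% used, {usage.free} bytes free")
```

On a 3 GB root disk, each percentage point is only about 30 MB, so a threshold that leaves comfortable headroom on larger disks leaves very little here.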

We assume it was caused by ongoing kernel upgrades pushing them over the edge.

This ticket is to decide whether nothing needs to be done, whether the disks are too small, or whether some other automated action is needed to prevent this in the future.

Event Timeline

We certainly should automate the removal of obsolete kernels in a better way, but with only 3G on the root partition this would happen again anyway (there will always be two kernels installed in any case). 3G is simply too small these days. The VMs should be bumped to 5 or 10G for the disks, but given that resizing the root disk is a little fiddly in Ganeti, we can wait until bullseye is out and recreate these as ping[123]002 on bullseye with bigger disks.

Thank you. This solution seems good to me. Should we just close this again then? Or we can recycle/rename it to "upgrade ping servers to bullseye" :p

+1 to rename and stall on bullseye being ready.

Dzahn renamed this task from ping servers running out of disk to upgrade ping offload servers to bullseye (was: ping servers running out of disk). Feb 6 2021, 12:06 AM
Dzahn changed the task status from Open to Stalled.
Dzahn triaged this task as Low priority.
taavi changed the task status from Stalled to Open. Jan 9 2022, 2:53 PM
LSobanski subscribed.

Looks like the disk space problem was addressed in T295767 but the upgrade to Bullseye is still pending.

Change 879284 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ping[123]003

https://gerrit.wikimedia.org/r/879284

Change 879284 merged by Muehlenhoff:

[operations/puppet@production] Add ping[123]003

https://gerrit.wikimedia.org/r/879284

Change 881829 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Fix MAC

https://gerrit.wikimedia.org/r/881829

Change 881829 merged by Muehlenhoff:

[operations/puppet@production] Fix MAC

https://gerrit.wikimedia.org/r/881829

Change 881837 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Move ping offload from ping2002 to ping2003 in codfw

https://gerrit.wikimedia.org/r/881837

Change 881869 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Move ping offload from ping1002 to ping1003 in eqiad

https://gerrit.wikimedia.org/r/881869

Change 881837 merged by Muehlenhoff:

[operations/homer/public@master] Move ping offload from ping2002 to ping2003 in codfw

https://gerrit.wikimedia.org/r/881837

Change 881869 merged by Muehlenhoff:

[operations/homer/public@master] Move ping offload from ping1002 to ping1003 in eqiad

https://gerrit.wikimedia.org/r/881869

Change 883137 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove old ping hosts

https://gerrit.wikimedia.org/r/883137

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping3002.esams.wmnet

  • ping3002.esams.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster esams to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster esams to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping2002.codfw.wmnet

  • ping2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping1002.eqiad.wmnet

  • ping1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

Change 883137 merged by Muehlenhoff:

[operations/puppet@production] Remove old ping hosts

https://gerrit.wikimedia.org/r/883137

New ping1003/ping2003/ping3003 Bullseye VMs with 10G disk space have been created and swapped into service. The old hosts have been decommissioned.