Rebuild Routinator (rpki) VMs with larger disk
Closed, Resolved · Public

Description

As discussed in T291543, the current VMs for rpki1001 and rpki2001 are a little undersized, operating close to their capacity and sometimes exhausting inodes when the service restarts.

The current disk is 10GB; the recommendation is to rebuild the VMs with a 20GB disk to avoid any future niggles.

Creating this task to track progress.
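
For reference, inode headroom on a host can be checked with a short script. This is a minimal sketch using Python's `os.statvfs`; the function name `inode_usage` and the choice of `/` as the path are assumptions for illustration, not something taken from the hosts' tooling.

```python
import os


def inode_usage(path="/"):
    """Return (used, total, percent_used) inodes for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files          # total inodes on the filesystem
    free = st.f_ffree           # free inodes
    used = total - free
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    percent = 100.0 * used / total if total else 0.0
    return used, total, percent


if __name__ == "__main__":
    used, total, percent = inode_usage("/")
    print(f"inodes used: {used}/{total} ({percent:.1f}%)")
```

Running it against the filesystem that holds /var/lib/routinator would show how close the VM is to running out of inodes.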

Event Timeline

Restricted Application added a subscriber: Aklapper.

https://packages.nlnetlabs.nl/ also provides the routinator debs for bullseye (plus it's a static Rust binary anyway), so if we're recreating the VMs anyway, let's also switch to Bullseye?

@ayounsi Riccardo suggested maybe using a separate disk/partition for the routinator data? That was partly to just do a quick dirty job and not rebuild, but we've reason to rebuild anyway so let's do that.

Do you think it would still make sense to have a separate disk/partition for the Routinator data?

The files in /var/lib/routinator are all fairly small (we have >400k files with 2.8G in total on rpki1001). In theory we could create a custom partman config with fstype=small or news, but OTOH we easily have the extra disk space available and have spent quite some time reducing our maze of Partman configs (https://phabricator.wikimedia.org/T156955).

So unless we expect RPKI usage (and the data used by routinator) to grow massively over time, my suggestion would be to simply go with the extra disk space.
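
A rough back-of-the-envelope check of the numbers above, as a sketch. It assumes the stock `mke2fs.conf` bytes-per-inode ratios (16384 by default, 4096 for the `small` and `news` usage types), treats GB as GiB, and ignores space taken by other partitions, so the results are estimates only.

```python
GIB = 1024 ** 3

# Assumed stock mke2fs.conf values (bytes of disk per inode).
DEFAULT_INODE_RATIO = 16384
SMALL_OR_NEWS_INODE_RATIO = 4096


def estimated_inodes(fs_bytes, bytes_per_inode):
    """Estimate how many inodes mke2fs would create for a filesystem."""
    return fs_bytes // bytes_per_inode


# Figures quoted in the comment above for rpki1001.
file_count = 400_000
data_bytes = 2.8 * GIB
avg_file_bytes = data_bytes / file_count

if __name__ == "__main__":
    print(f"average file size: {avg_file_bytes / 1024:.1f} KiB")
    for size_gib in (10, 20):
        default = estimated_inodes(size_gib * GIB, DEFAULT_INODE_RATIO)
        dense = estimated_inodes(size_gib * GIB, SMALL_OR_NEWS_INODE_RATIO)
        print(f"{size_gib} GiB: ~{default} inodes (default), ~{dense} (small/news)")
```

Under these assumptions a 10GB filesystem has roughly 655k inodes, so more than 400k files already uses well over half of them, while 20GB roughly doubles the headroom without needing a custom Partman recipe.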

Change 726610 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add repo sync definition and repo component for Routinator

https://gerrit.wikimedia.org/r/726610

Change 726610 merged by Muehlenhoff:

[operations/puppet@production] Add repo sync definition and repo component for Routinator

https://gerrit.wikimedia.org/r/726610

Mentioned in SAL (#wikimedia-operations) [2021-10-05T15:10:22Z] <moritzm> imported routinator 0.10.1-1bullseye to thirdparty/routinator for bullseye-wikimedia T292503

I've added routinator to apt.wikimedia.org at "thirdparty/routinator" for bullseye-wikimedia and adapted the Puppet code, so that when these hosts get reinstalled with Bullseye, the thirdparty component is picked.

A security update is now available which means we need to upgrade again:

https://www.nlnetlabs.nl/news/2021/Nov/09/routinator-0.10.2-released/

I'll dig into this and try to rebuild the VMs as part of the process.

Mentioned in SAL (#wikimedia-operations) [2021-11-11T10:37:12Z] <moritzm> updated routinator in thirdparty/routinator for bullseye-wikimedia to 0.10.2 T292503

I've imported 0.10.2 into the repository component, let me know if you need any assistance in rebuilding the VMs.

cookbooks.sre.hosts.decommission executed by cmooney@cumin1001 for hosts: rpki1001.eqiad.wmnet

  • rpki1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

Change 739237 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Update IP address for RPKI Validator session to rpki2001

https://gerrit.wikimedia.org/r/739237

cookbooks.sre.hosts.decommission executed by cmooney@cumin2002 for hosts: rpki2001.codfw.wmnet

  • rpki2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

Change 739237 abandoned by Cathal Mooney:

[operations/homer/public@master] Update IP address for RPKI Validator session to rpki2001

Reason:

Had to rebuild VM again and it has re-allocated original IP.

https://gerrit.wikimedia.org/r/739237

Change 739242 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Updating MAC address in DHCP config for rpki2001

https://gerrit.wikimedia.org/r/739242

Change 739242 merged by Cathal Mooney:

[operations/puppet@production] Updating MAC address in DHCP config for rpki2001

https://gerrit.wikimedia.org/r/739242

cookbooks.sre.hosts.decommission executed by cmooney@cumin2002 for hosts: rpki2001.codfw.wmnet

  • rpki2001.codfw.wmnet (FAIL)
    • Host steps raised exception:

ERROR: some step on some host failed, check the bolded items above

Change 739580 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add DHCP entry for install of rpki2002.codfw.wmnet

https://gerrit.wikimedia.org/r/739580

Change 739580 merged by Cathal Mooney:

[operations/puppet@production] Add DHCP entry for install of rpki2002.codfw.wmnet

https://gerrit.wikimedia.org/r/739580

Change 739609 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add role in site.pp for new rpki2002 VM

https://gerrit.wikimedia.org/r/739609

Change 739609 merged by Cathal Mooney:

[operations/puppet@production] Add role in site.pp for new rpki2002 VM

https://gerrit.wikimedia.org/r/739609

Change 739611 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Changing glob pattern for partman recipe for rpki VMs

https://gerrit.wikimedia.org/r/739611

OK, both VMs have been rebuilt with 20GB disks and updated to version 0.10.2.

rpki1001 remains with the same name, rpki2001 has been replaced with rpki2002.

Hit an annoying number of glitches, plus some stupid mistakes of my own, while doing this, but got there in the end. Only way to learn :)

Change 739611 merged by Cathal Mooney:

[operations/puppet@production] Changing glob pattern for partman recipe for rpki VMs

https://gerrit.wikimedia.org/r/739611
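
To illustrate why renaming rpki2001 to rpki2002 can require a glob change, here is a minimal sketch using Python's `fnmatch`. Both patterns are hypothetical (the actual patterns in operations/puppet are not shown in this task), and it assumes the recipe is selected by shell-style glob matching on the hostname.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, for illustration only.
OLD_PATTERN = "rpki[12]001"     # matches only hosts ending in 001
NEW_PATTERN = "rpki[12]00[12]"  # also matches hosts ending in 002


def matches(hostname, pattern):
    """Return True if hostname matches the shell-style glob pattern."""
    return fnmatchcase(hostname, pattern)


if __name__ == "__main__":
    for host in ("rpki1001", "rpki2002"):
        print(host, matches(host, OLD_PATTERN), matches(host, NEW_PATTERN))
```

With patterns like these, the renamed host would silently miss its partman recipe under the old glob, which is the kind of mismatch the change above addresses.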