
Migrating magru to routed Ganeti
Closed, Resolved (Public)

Description

An introduction to routed Ganeti can be found here: https://phabricator.wikimedia.org/phame/post/view/312/ganeti_on_modern_network_design/

We have piloted routed Ganeti with two test servers that are running some initial workloads. As the next step of the rollout, we've decided to migrate one of the PoPs (namely magru) to it, to give it more exposure to real-world issues. This will also allow us to install future PoPs directly with routed Ganeti.

Magru currently consists of two separate Ganeti clusters in two different rows with two servers each.

row B4: ganeti7002 and ganeti7004

  • bast7001
  • doh7002
  • durum7002
  • ncredir7002
  • prometheus7001

row B3: ganeti7001 and ganeti7003

  • atlas7001
  • doh7001
  • durum7001
  • install7001
  • ncredir7001
  • netflow7001

When the migration is completed, we'll have a common four-node Ganeti cluster spanning the two rows (which would also give us flexibility in case of potential row changes at the DC).

There is some pending upstream work we have commissioned for Bird which will unblock the use of BGP in VMs (T362392). Until this work is completed, we will keep one node using the old setup (ganeti7004), which will continue to run doh7002 and durum7002.

The migration path will look like the following:

  • Move all VMs in ganeti7002 to ganeti7004
  • Switch B4 VMs to plain disk storage, i.e. disable DRBD for them.

During this initial period the B4 VMs are no longer redundant, so if ganeti7004 were to fail, we'd lose the Prometheus metrics, but would still have ncredir/wikidough/durum operational.
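The redundancy reasoning above can be checked with a small sketch. It is only a simplified model: the VM lists come from the row inventory in this task, the two B3 nodes are lumped together under the hypothetical key "row-B3" (the task does not say which VM sits on which B3 node), and a service counts as surviving if any VM with the same name prefix runs elsewhere.

```python
# Placement after the first two steps: all B4 VMs sit on ganeti7004 with plain disks.
# "row-B3" is a made-up label standing in for the still-redundant ganeti7001/ganeti7003 pair.
PLACEMENT = {
    "ganeti7004": ["bast7001", "doh7002", "durum7002", "ncredir7002", "prometheus7001"],
    "row-B3": ["atlas7001", "doh7001", "durum7001", "install7001", "ncredir7001", "netflow7001"],
}


def service_of(vm):
    """Strip the numeric suffix, e.g. 'ncredir7002' -> 'ncredir'."""
    return vm.rstrip("0123456789")


def lost_services(placement, failed_node):
    """Return services that only ran on the failed node."""
    survivors = {
        service_of(vm)
        for node, vms in placement.items()
        if node != failed_node
        for vm in vms
    }
    on_failed = {service_of(vm) for vm in placement.get(failed_node, [])}
    return sorted(on_failed - survivors)


print(lost_services(PLACEMENT, "ganeti7004"))
```

Under this model the bastion also has no counterpart in row B3, in addition to Prometheus; the doh, durum and ncredir services stay available through their B3 instances.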

  • Allocate IPs for magru routed Ganeti - https://netbox.wikimedia.org/ipam/prefixes/?role_id=41&site_id=11
  • Add allocated IPs to modules/network/data/data.yaml in Puppet
  • Reimage ganeti7002 with routed Ganeti
  • Update ganeti7002 switch port to remove the trunked public VLAN
  • Set up routing between ganeti7002 and its ToR switch
  • Create bast7002, ncredir7003 and prometheus7002 on routed Ganeti and fail over services
  • Decom bast7001, ncredir7002, prometheus7001
  • Move all VMs on ganeti7001 to ganeti7003
  • Switch B3 VMs to plain disk storage, i.e. disable DRBD for them.
  • Decom atlas7001 (We'll re-add a probe in magru at a later point)
  • Decom doh7001, durum7001 (they will no longer be redundant, but with the current request rate in magru that's acceptable)
  • Create install7002, ncredir7004, netflow7002 on the routed Ganeti cluster and fail over services
  • Decom install7001, ncredir7001, netflow7001
  • Reimage ganeti7001 with routed Ganeti and add it to the cluster
  • Set up routing between ganeti7001 and its ToR switch
  • Switch VMs back to DRBD
  • Reimage ganeti7003 with routed Ganeti and add it to the cluster
  • Set up routing between ganeti7003 and its ToR switch
  • Create doh7003, doh7004, durum7003, durum7004 on the routed Ganeti cluster
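The "switch to plain disk storage" and "switch back to DRBD" steps correspond to a disk template conversion in Ganeti. As a rough sketch, assuming the stock Ganeti CLI is used directly (the timeline entries suggest the actual switch was automated, and that tooling is not shown in this task), the helper below only builds the command strings and does not run anything. The function name and the choice of secondary node are illustrative.

```python
from typing import List, Optional


def disk_template_commands(instance: str, template: str,
                           secondary: Optional[str] = None) -> List[str]:
    """Build the gnt-instance commands for a disk template conversion.

    The instance has to be stopped for the conversion, hence shutdown/startup.
    """
    if template not in ("plain", "drbd"):
        raise ValueError(f"unsupported template: {template}")
    if template == "drbd" and not secondary:
        raise ValueError("converting to drbd needs a secondary node")

    modify = f"gnt-instance modify -t {template}"
    if template == "drbd":
        # -n names the node that will hold the second DRBD replica
        modify += f" -n {secondary}"

    return [
        f"gnt-instance shutdown {instance}",
        f"{modify} {instance}",
        f"gnt-instance startup {instance}",
    ]


# Dry run: print the commands instead of executing them.
for cmd in disk_template_commands("prometheus7002.magru.wmnet", "drbd",
                                  secondary="ganeti7001.magru.wmnet"):
    print(cmd)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere and makes the required downtime per VM (shutdown, convert, start) explicit.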

Once support in Bird is available (T362392):

  • Move doh7003, doh7004, durum7003, durum7004 to production
  • Decommission doh7002, durum7002
  • Reimage ganeti7004 with routed Ganeti and add it to the cluster
  • Set up routing between ganeti7004 and its ToR switch
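The ordering constraint behind this plan is that a node can only be reimaged once it no longer hosts any VMs. The following sketch replays the row B4 part of the plan against a toy placement model and refuses a premature reimage. The step names (`move_all`, `create`, `decom`, `reimage`) and the initial split of VMs between ganeti7002 and ganeti7004 are assumptions made for illustration; placing all new VMs on ganeti7002 is likewise a simplification.

```python
def run_plan(placement, steps):
    """Apply steps in order; refuse to reimage a node that still hosts VMs."""
    state = {node: set(vms) for node, vms in placement.items()}
    for action, *args in steps:
        if action == "move_all":
            src, dst = args
            state[dst] |= state[src]
            state[src] = set()
        elif action == "create":
            vm, node = args
            state[node].add(vm)
        elif action == "decom":
            (vm,) = args
            for vms in state.values():
                vms.discard(vm)
        elif action == "reimage":
            (node,) = args
            if state[node]:
                raise RuntimeError(f"{node} still hosts {sorted(state[node])}")
        else:
            raise ValueError(f"unknown action: {action}")
    return state


# Assumed initial split of the B4 VMs across the two nodes (not stated in the task).
B4 = {
    "ganeti7002": ["bast7001", "ncredir7002"],
    "ganeti7004": ["doh7002", "durum7002", "prometheus7001"],
}

B4_STEPS = [
    ("move_all", "ganeti7002", "ganeti7004"),
    ("reimage", "ganeti7002"),
    ("create", "bast7002", "ganeti7002"),
    ("create", "ncredir7003", "ganeti7002"),
    ("create", "prometheus7002", "ganeti7002"),
    ("decom", "bast7001"),
    ("decom", "ncredir7002"),
    ("decom", "prometheus7001"),
    # Only possible once the Bird work (T362392) has landed:
    ("decom", "doh7002"),
    ("decom", "durum7002"),
    ("reimage", "ganeti7004"),
]

final = run_plan(B4, B4_STEPS)
print({node: sorted(vms) for node, vms in final.items()})
```

Moving the last `reimage` step ahead of the doh7002/durum7002 decommissions makes `run_plan` raise, which mirrors why ganeti7004 has to stay on the old setup until those two VMs are replaced.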

Details

Related Changes in Gerrit:
Repo                           Branch      Lines +/-
operations/puppet              production  +0 -18
operations/puppet              production  +2 -5
operations/puppet              production  +7 -3
operations/puppet              production  +0 -3
operations/puppet              production  +1 -5
operations/puppet              production  +0 -1
operations/puppet              production  +2 -2
operations/puppet              production  +1 -0
operations/puppet              production  +0 -4
operations/puppet              production  +2 -5
operations/puppet              production  +2 -0
operations/puppet              production  +5 -11
operations/puppet              production  +0 -1
operations/dns                 master      +1 -1
operations/puppet              production  +1 -0
operations/puppet              production  +1 -1
operations/homer/public        master      +2 -2
operations/puppet              production  +1 -1
operations/puppet              production  +24 -40
operations/puppet              production  +1 -5
operations/puppet              production  +2 -6
operations/puppet              production  +1 -3
operations/puppet              production  +1 -0
operations/homer/public        master      +0 -1
operations/puppet              production  +0 -1
operations/puppet              production  +0 -4
operations/homer/public        master      +1 -0
operations/puppet              production  +1 -0
operations/puppet              production  +12 -4
operations/puppet              production  +0 -9
operations/puppet              production  +3 -8
operations/puppet              production  +6 -1
operations/puppet              production  +4 -4
operations/debs/wmf-laptop     master      +4 -4
operations/puppet              production  +2 -0
operations/puppet              production  +1 -1
operations/puppet              production  +6 -0
operations/cookbooks           master      +6 -2
operations/software/spicerack  master      +62 -14
operations/puppet              production  +3 -0
operations/puppet              production  +20 -0
operations/dns                 master      +12 -0
operations/homer/public        master      +14 -0
operations/dns                 master      +3 -0
operations/puppet              production  +6 -0
operations/alerts              master      +34 -2
operations/puppet              production  +1 -0
operations/puppet              production  +1 -1
operations/puppet              production  +3 -6
operations/puppet              production  +6 -1
operations/puppet              production  +0 -6

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1153947 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Assign ncredir role to ncredir7003

https://gerrit.wikimedia.org/r/1153947

Change #1153948 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ncredir7003 to conftool

https://gerrit.wikimedia.org/r/1153948

Change #1153958 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Extend kafka firewall config for netflow7002

https://gerrit.wikimedia.org/r/1153958

Change #1153959 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Assign netinsights role to netflow7002

https://gerrit.wikimedia.org/r/1153959

Change #1153958 merged by Muehlenhoff:

[operations/puppet@production] Extend kafka firewall config for netflow7002

https://gerrit.wikimedia.org/r/1153958

Change #1153993 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Add netflow7002

https://gerrit.wikimedia.org/r/1153993

Change #1153993 merged by Muehlenhoff:

[operations/homer/public@master] Add netflow7002

https://gerrit.wikimedia.org/r/1153993

Change #1153959 merged by Muehlenhoff:

[operations/puppet@production] Assign netinsights role to netflow7002

https://gerrit.wikimedia.org/r/1153959

Change #1154022 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove netflow7001 from Kafka Jumbo ACLs

https://gerrit.wikimedia.org/r/1154022

Change #1154022 merged by Muehlenhoff:

[operations/puppet@production] Remove netflow7001 from Kafka Jumbo ACLs

https://gerrit.wikimedia.org/r/1154022

cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: netflow7001.magru.wmnet

  • netflow7001.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox

Change #1154161 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Remove netflow7001

https://gerrit.wikimedia.org/r/1154161

Change #1154161 merged by Muehlenhoff:

[operations/homer/public@master] Remove netflow7001

https://gerrit.wikimedia.org/r/1154161

Change #1153126 abandoned by Muehlenhoff:

[operations/puppet@production] Also add replica label for the new upcoming prometheus7002 node

Reason:

Got merged as https://gerrit.wikimedia.org/r/c/operations/puppet/+/1154046

https://gerrit.wikimedia.org/r/1153126

Change #1154923 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove bastion role from bast7001

https://gerrit.wikimedia.org/r/1154923

Change #1154923 merged by Muehlenhoff:

[operations/puppet@production] Remove bastion role from bast7001

https://gerrit.wikimedia.org/r/1154923

cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: bast7001.wikimedia.org

  • bast7001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox

Change #1155151 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Apply installserver role to install7002

https://gerrit.wikimedia.org/r/1155151

Change #1155151 merged by Muehlenhoff:

[operations/puppet@production] Apply installserver role to install7002

https://gerrit.wikimedia.org/r/1155151

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye executed with errors:

  • install7002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console install7002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye executed with errors:

  • install7002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console install7002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bullseye executed with errors:

  • install7002 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console install7002.wikimedia.org" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host install7002.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host install7002.wikimedia.org with OS bookworm completed:

  • install7002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506101547_jmm_1002669_install7002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1155503 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Apply installserver role on install7002

https://gerrit.wikimedia.org/r/1155503

Change #1155503 merged by Muehlenhoff:

[operations/puppet@production] Apply installserver role on install7002

https://gerrit.wikimedia.org/r/1155503

Change #1155585 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] atftpd: Add support for Bookworm

https://gerrit.wikimedia.org/r/1155585

Change #1155585 merged by Muehlenhoff:

[operations/puppet@production] atftpd: Add support for Bookworm

https://gerrit.wikimedia.org/r/1155585

Change #1155616 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Revert "Revert back to install7001"

https://gerrit.wikimedia.org/r/1155616

Change #1155622 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] DHCP: install7001->7002

https://gerrit.wikimedia.org/r/1155622

Change #1155616 merged by Muehlenhoff:

[operations/puppet@production] Revert "Revert back to install7001"

https://gerrit.wikimedia.org/r/1155616

Change #1155622 merged by jenkins-bot:

[operations/homer/public@master] DHCP: install7001->7002

https://gerrit.wikimedia.org/r/1155622

Change #1155624 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Failover webproxy to install7002

https://gerrit.wikimedia.org/r/1155624

Change #1153947 merged by Muehlenhoff:

[operations/puppet@production] Assign ncredir role to ncredir7003

https://gerrit.wikimedia.org/r/1153947

Change #1153948 merged by Muehlenhoff:

[operations/puppet@production] Add ncredir7003 to conftool

https://gerrit.wikimedia.org/r/1153948

Change #1155624 merged by Muehlenhoff:

[operations/dns@master] Failover webproxy to install7002

https://gerrit.wikimedia.org/r/1155624

Change #1155649 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ncredir7001 from conftool

https://gerrit.wikimedia.org/r/1155649

Change #1155649 merged by Muehlenhoff:

[operations/puppet@production] Remove ncredir7001 from conftool

https://gerrit.wikimedia.org/r/1155649

cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: ncredir7001.magru.wmnet

  • ncredir7001.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox

Change #1156814 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Apply ncredir role to ncredir7004

https://gerrit.wikimedia.org/r/1156814

Change #1156815 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ncredir7004 to conftool

https://gerrit.wikimedia.org/r/1156815

cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: install7001.wikimedia.org

  • install7001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru01 to Netbox

Change #1159354 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Reimage ganeti7003 with insetup role

https://gerrit.wikimedia.org/r/1159354

Change #1159354 merged by Muehlenhoff:

[operations/puppet@production] Reimage ganeti7003 with insetup role

https://gerrit.wikimedia.org/r/1159354

Change #1159390 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Routed Ganeti: disable rp_filter

https://gerrit.wikimedia.org/r/1159390

Mentioned in SAL (#wikimedia-operations) [2025-06-16T09:44:09Z] <moritzm> remove magru01 in Netbox (all Ganeti nodes have been removed from it) T394263

Change #1159390 merged by Ayounsi:

[operations/puppet@production] Routed Ganeti: disable rp_filter

https://gerrit.wikimedia.org/r/1159390

Change #1159398 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti7003 to the routed Ganeti cluster

https://gerrit.wikimedia.org/r/1159398

Change #1159398 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti7003 to the routed Ganeti cluster

https://gerrit.wikimedia.org/r/1159398

Change #1156814 merged by Muehlenhoff:

[operations/puppet@production] Apply ncredir role to ncredir7004

https://gerrit.wikimedia.org/r/1156814

VM durum7003.magru.wmnet switching disk type to drbd

VM doh7003.wikimedia.org switching disk type to drbd

VM bast7002.wikimedia.org switching disk type to drbd

VM ncredir7003.magru.wmnet switching disk type to drbd

VM prometheus7002.magru.wmnet switching disk type to drbd

Change #1156815 merged by Muehlenhoff:

[operations/puppet@production] Add ncredir7004 to conftool

https://gerrit.wikimedia.org/r/1156815

Change #1159916 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add doh7004/durum7004

https://gerrit.wikimedia.org/r/1159916

Change #1159916 merged by Muehlenhoff:

[operations/puppet@production] Add doh7004/durum7004

https://gerrit.wikimedia.org/r/1159916

Change #1159983 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove ncredir7002

https://gerrit.wikimedia.org/r/1159983

Change #1159983 merged by Muehlenhoff:

[operations/puppet@production] Remove ncredir7002

https://gerrit.wikimedia.org/r/1159983

cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: ncredir7002.magru.wmnet

  • ncredir7002.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox

Change #1178839 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove Ganeti role from ganeti7004

https://gerrit.wikimedia.org/r/1178839

Change #1178840 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] netbox: Remove ganeti02/magru cluster

https://gerrit.wikimedia.org/r/1178840

cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: doh7002.wikimedia.org

  • doh7002.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox

cookbooks.sre.hosts.decommission executed by sukhe@cumin1003 for hosts: durum7002.magru.wmnet

  • durum7002.magru.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster magru02 to Netbox

Change #1178839 merged by Muehlenhoff:

[operations/puppet@production] Remove Ganeti role from ganeti7004

https://gerrit.wikimedia.org/r/1178839

Change #1178840 merged by Muehlenhoff:

[operations/puppet@production] netbox: Remove ganeti02/magru cluster

https://gerrit.wikimedia.org/r/1178840

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm

Change #1178887 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti7004 to the routed Ganeti cluster in magru

https://gerrit.wikimedia.org/r/1178887

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm completed:

  • ganeti7004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508141434_jmm_3904929_ganeti7004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1179111 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Update Cumin aliases to handle the transition to routed Ganeti

https://gerrit.wikimedia.org/r/1179111

Change #1179111 merged by Muehlenhoff:

[operations/puppet@production] Update Cumin aliases to handle the transition to routed Ganeti

https://gerrit.wikimedia.org/r/1179111

Change #1178887 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti7004 to the routed Ganeti cluster in magru

https://gerrit.wikimedia.org/r/1178887

MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff updated the task description.

Magru is fully running on routed Ganeti \o/