
Data persistence: Re-IP eqiad private baremetal hosts to new per-rack VLANs/subnets
Open, Needs Triage, Public

Description

eqiad rows C and D have been migrated to new Nokia switches, and more importantly to the new network design.

You can find all the information on https://wikitech.wikimedia.org/wiki/Vlan_migration

All new servers land in these new VLANs by default, but rather than waiting for a full 5+ year server refresh cycle, we're now asking service owners to re-image their existing baremetal servers using the --move-vlan cookbook parameter, at their own pace/convenience. This will change the server's IP.
There will of course be some special cases (like Ganeti, or DBs), and it's OK not to convert 100% of the servers, but the more we can migrate, the better.
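For illustration, a reimage that also moves a host onto its new per-rack VLAN might look like the sketch below, run from a cumin host. The host name and --os value are examples; --move-vlan is the parameter mentioned above.

```
sudo cookbook sre.hosts.reimage --os bullseye --move-vlan ms-fe1011
```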

Please contact netops for any help.

cumin1003:~$ sudo cumin 'A:owner-data-persistence and P{P:netbox::host%location ~ "[C|D].*eqiad"} and P{F:fqdn ~ ".wmnet$"} and not A:vms and not P{F:netmask = "255.255.255.0"}'
91 hosts will be targeted:
apus-fe1003.eqiad.wmnet,aqs[1014-1015,1018-1019,1022,1025,1027].eqiad.wmnet,backup[1006-1007,1014].eqiad.wmnet,db[1150,1153,1166-1175,1180-1182,1184,1189,1217-1225,1230-1233,1242-1245,1247-1249,1252-1253,1258-1259,1262-1263].eqiad.wmnet,dbprov1004.eqiad.wmnet,dbproxy[1024-1025,1029].eqiad.wmnet,es[1033,1045-1046,1051-1053,1057].eqiad.wmnet,moss-be1002.eqiad.wmnet,ms-backup1002.eqiad.wmnet,ms-be[1066-1067,1082,1086,1091-1093].eqiad.wmnet,ms-fe[1011,1013,1019-1020].eqiad.wmnet,pc[1013-1014,1016-1018].eqiad.wmnet,restbase[1033,1040-1042,1045].eqiad.wmnet,sessionstore[1005-1006].eqiad.wmnet,thanos-fe1007.eqiad.wmnet
  • apus-fe1003.eqiad.wmnet
  • aqs[1014-1015,1018-1019,1022,1025,1027].eqiad.wmnet
  • backup[1006-1007,1014].eqiad.wmnet
  • db[1150,1153,1166-1175,1180-1182,1184,1189,1217-1225,1230-1233,1242-1245,1247-1249,1252-1253,1258-1259,1262-1263].eqiad.wmnet
  • dbprov1004.eqiad.wmnet
  • dbproxy[1024-1025,1029].eqiad.wmnet
  • es[1033,1045-1046,1051-1053,1057].eqiad.wmnet
  • moss-be1002.eqiad.wmnet
  • ms-backup1002.eqiad.wmnet
  • ms-be[1066-1067,1082,1086,1091-1093].eqiad.wmnet
  • ms-fe[1011,1013,1019-1020].eqiad.wmnet
  • pc[1013-1014,1016-1018].eqiad.wmnet
  • restbase[1033,1040-1042,1045].eqiad.wmnet
  • sessionstore[1005-1006].eqiad.wmnet
  • thanos-fe1007.eqiad.wmnet

Event Timeline

we're now asking service owners to re-image their existing baremetal servers

We don't reimage backup hosts. Let us know the alternative method (we can take them out of service for an extended time if necessary). This was brought up in advance @ the IF-DP meeting at the offsite.

Change #1266138 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: drain 3 eqiad backends for reimage to per-rack VLAN

https://gerrit.wikimedia.org/r/1266138

Change #1266138 merged by MVernon:

[operations/puppet@production] swift: drain 3 eqiad backends for reimage to per-rack VLAN

https://gerrit.wikimedia.org/r/1266138

Marostegui subscribed.

For db*-related hosts (including pc*, es*, and dbproxy*) this will be tricky, as it also requires changing their IP on the load balancer (and even grants in the case of dbproxy).
Luckily we have a big batch of hardware refreshes coming soon, so we'll be able to get some of these done automatically.
But otherwise this is going to take quite a while.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1011.eqiad.wmnet with OS bullseye completed:

  • ms-fe1011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604081029_mvernon_114260_ms-fe1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
MatthewVernon subscribed.

A wrinkle here is that ferm doesn't get reloaded on the other swift nodes (presumably because the ferm config hasn't actually changed, since the reimaged node's hostname is unchanged), so you have to reload it by hand via cumin before the reimaged node works again.
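That manual reload can be sketched as a one-off cumin run; the host pattern below is illustrative (the real selection would target the other swift nodes):

```
sudo cumin 'ms-be1*.eqiad.wmnet,ms-fe1*.eqiad.wmnet' 'systemctl reload ferm'
```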


Note that firewall::service (and ferm::service, if you have other blockers to migrating to firewall) host parameters now also accept a list of host FQDNs which, unlike ferm's @resolve, are (re-)resolved during the Puppet compilation stage, making this problem go away.

Eevans updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye

@jcrespo
We don't reimage backup hosts. Let us know the alternative method (we can take them out of service for an extended time if necessary). This was brought up in advance @ the IF-DP meeting at the offsite.

We can work together on that; the process is a bit more manual, and requires editing the host's /etc/network/interfaces file, updating Netbox, and updating the switch port config.

It should be a few minutes' downtime max, and we can plan a 1h maintenance window for the first host to make sure we have the process right. Let's find a time next week, or at your earliest convenience.
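The /etc/network/interfaces part of that manual process can be sketched as below. All addresses are made-up examples (the real old and new subnets come from Netbox), and the edit is shown on a scratch copy rather than the live file:

```shell
# Write a scratch copy of an interfaces stanza using a made-up old
# row-wide address (10.64.16.20/22, gateway 10.64.16.1).
cat > /tmp/interfaces.example <<'EOF'
auto eno1
iface eno1 inet static
    address 10.64.16.20/22
    gateway 10.64.16.1
EOF

# Rewrite it for a made-up per-rack subnet (10.64.130.0/24): new address,
# new prefix length, and matching gateway.
sed -i \
    -e 's#10\.64\.16\.20/22#10.64.130.20/24#' \
    -e 's#gateway 10\.64\.16\.1#gateway 10.64.130.1#' \
    /tmp/interfaces.example

# Both the address and gateway lines should now carry the new subnet.
grep '10.64.130' /tmp/interfaces.example
```

After the host-side change, the Netbox IP assignment and the switch port's VLAN would be updated to match before bringing the interface back up.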

We can work together on that; the process is a bit more manual, and requires editing the host's /etc/network/interfaces file, updating Netbox, and updating the switch port config.

I 100% prefer that, as the alternative is to do a manual reimage, partitioning things by hand. It wouldn't be the first time I've had to do a manual IP change.

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1013.eqiad.wmnet with OS bullseye completed:

  • ms-fe1013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604090904_mvernon_990280_ms-fe1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1019.eqiad.wmnet with OS bullseye completed:

  • ms-fe1019 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604091034_mvernon_1062581_ms-fe1019.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-fe1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-fe1020.eqiad.wmnet with OS bullseye completed:

  • ms-fe1020 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604091204_mvernon_1117708_ms-fe1020.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-fe1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-fe1007.eqiad.wmnet with OS bullseye completed:

  • thanos-fe1007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604091308_mvernon_1171571_thanos-fe1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host apus-fe1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host apus-fe1003.eqiad.wmnet with OS bookworm completed:

  • apus-fe1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604091409_mvernon_1220583_apus-fe1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2026-04-09T14:37:36Z] <Emperor> ceph orch host drain moss-be1002 --zap-osd-devices T421719

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1002.eqiad.wmnet with OS bookworm completed:

  • moss-be1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202604091547_mvernon_1285899_moss-be1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=b67b68ad-79cc-40ba-b2d3-11ce2438694e) set by jynus@cumin1003 for 4:00:00 on 3 host(s) and their services with reason: maintenance

backup[1006-1007,1014].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=acda330c-af7e-43eb-ab9e-f17a3dfaee68) set by jynus@cumin1003 for 4:00:00 on 2 host(s) and their services with reason: maintenance

backup[1007,1014].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2026-04-16T09:29:22Z] <jynus> setting backup1014 in maintenance, no backup or recovery will run while it T421719

Mentioned in SAL (#wikimedia-operations) [2026-04-16T10:09:20Z] <jynus> backup1014 returns from maintenance, backups and recovery can flow as usual T421719

Marostegui added a subtask: Unknown Object (Task).
Marostegui unsubscribed.
Marostegui subscribed.
VRiley-WMF closed subtask Unknown Object (Task) as Resolved. Fri, Apr 17, 9:01 PM