
Re-IP Swift hosts to per-rack subnets in codfw rows A-D
Open, Medium, Public

Description

As part of the move from a per-row to a per-rack redundancy model, hosts in codfw rows A-D need to be configured/moved to new per-rack VLANs/subnets. This work can be tackled once we have completed the physical move of all hosts in those rows from the old 'asw' switch devices to the new 'lsw' ones.

In discussion on IRC we touched on some of the challenges for these hosts, which as I understand it may use IP addresses as identifiers. We also need to consider how clusters function with hosts on different subnets that were previously layer-2 adjacent.

Having tested the migration process on ms-be2075, we know that it works thus (a sketch of the reimage step follows the list):

  1. Drain node
  2. Remove node from rings
  3. Reimage node (with --move-vlan)
  4. Make sure the swift ring manager knows about the relevant per-rack subnet
  5. Add node back to rings
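
For concreteness, here is a minimal sketch of step 3 as a one-off invocation. The sre.hosts.reimage cookbook and its --move-vlan flag are the ones named in this task; the exact arguments and the wrapper below are illustrative, not copied from the production runbook.

  # Hedged sketch of step 3: reimage a host onto its per-rack VLAN.
  import subprocess

  host = "ms-be2080"  # example: the next host with old-style networking

  subprocess.run(
      ["sudo", "cookbook", "sre.hosts.reimage",
       "--os", "bullseye", "--move-vlan", host],
      check=True,  # raise if the cookbook exits non-zero
  )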

Newer nodes (ms-be2081 and later) automatically get added on new-style subnets; so start with the newest node with old-style networking and work backwards, meaning that the oldest nodes get done last (and might have been aged out in the meantime).

  • ms-be2080
  • ms-be2079
  • ms-be2078
  • ms-be2077 (draining)
  • ms-be2076 (draining)
  • ms-be2075
  • ms-be2074 (draining)
  • ms-be2073
  • ms-be2072
  • ms-be2071
  • ms-be2070

Below this point, nodes are using old-style storage, so we might want to fix that at the same time:

  • ms-be2069
  • ms-be2068
  • ms-be2067
  • ms-be2066
  • ms-be2065
  • ms-be2064
  • ms-be2063
  • ms-be2062
  • ms-be2061
  • ms-be2060
  • ms-be2059
  • ms-be2058
  • ms-be2057

Event Timeline

cmooney triaged this task as Medium priority.

Swift uses IP(v4) address (and then device name) as the identifier for entries in its rings.

Additionally, when adding nodes to the ring, we use the IP address to tell where the node is located, and thus which "zone" it should be in (the zones are used to make sure each of the three replicas is in a different row) - see the find_ip_zone function, sketched below.
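
For illustration, here is a minimal sketch of what a find_ip_zone-style lookup does; the subnets and zone numbers below are made up, and the real function and its prefix map live in the ring-manager tooling.

  # Illustrative only: map a host's IPv4 address to a Swift zone via the
  # subnet it sits in. With per-rack subnets, every rack subnet in a row
  # must map to that row's zone for the replica-placement logic to hold.
  import ipaddress

  ZONE_BY_SUBNET = {  # hypothetical prefixes, one zone per row
      ipaddress.ip_network("10.192.0.0/24"): 1,   # row A racks
      ipaddress.ip_network("10.192.1.0/24"): 1,
      ipaddress.ip_network("10.192.16.0/24"): 2,  # row B racks
      ipaddress.ip_network("10.192.17.0/24"): 2,
  }

  def find_ip_zone(ip: str) -> int:
      """Return the Swift zone for a host IP, or raise if it is unknown."""
      addr = ipaddress.ip_address(ip)
      for subnet, zone in ZONE_BY_SUBNET.items():
          if addr in subnet:
              return zone
      raise ValueError(f"no zone configured for {ip}")

The practical upshot is that the ring manager has to know about each new per-rack subnet before a renumbered node can go back into the rings.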

The safest approach would be to drain a node and remove it from the rings, then renumber it and add it back. But a drain takes 2-3 weeks (we do it gradually to avoid overload), and a reload takes about the same time again.

In theory swift-ring-builder has a set_info command with a --change-ip argument, so one could change every device on a node in the rings, renumber the host, and push out the new rings. We'd need to write some tooling to do this, and I've no idea how safe such an operation is.
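
I've not validated any of this against production, but roughly, such tooling could look like the sketch below, using Swift's Python RingBuilder API (the equivalent of running set_info with --ip/--change-ip per device); filenames and addresses are illustrative.

  # Hedged sketch, not production tooling: change the IP recorded for every
  # device on one node, in all three rings, without rebalancing.
  from swift.common.ring import RingBuilder

  OLD_IP, NEW_IP = "10.192.16.50", "10.192.20.50"  # example addresses

  for builder_file in ("account.builder", "container.builder", "object.builder"):
      builder = RingBuilder.load(builder_file)
      for dev in builder.devs:
          if dev and dev["ip"] == OLD_IP:  # devs list may contain None entries
              dev["ip"] = NEW_IP
              if dev.get("replication_ip") == OLD_IP:  # usually the same address
                  dev["replication_ip"] = NEW_IP
      builder.save(builder_file)
  # An info-only change needs no rebalance, but the .ring.gz files still have
  # to be regenerated (write_ring) and distributed as usual.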

In either approach, there are extra constraints: we wouldn't want too many nodes "in flight" at once, because Swift will try to backfill to make up for missing/down devices and we need to avoid overloading (in terms of load or capacity) the rest of the cluster; and you have to wait 12 hours between changes to the rings.

Sorry, I think object stores are often not really written with renumbering in mind...

Can I very tentatively ask if you have thoughts about timescales for this, please? It seems likely to be a non-trivial bit of work from the data persistence side, so we'd like to include it in our quarterly planning as appropriate.

> Sorry, I think object stores are often not really written with renumbering in mind...

Yeah, and that's fine. I think as time goes on as a pattern we need to try to not use IPs as identifiers but it is what it is.

> Can I very tentatively ask if you have thoughts about timescales for this, please? It seems likely to be a non-trivial bit of work from the data persistence side, so we'd like to include it in our quarterly planning as appropriate.

So for us, I guess the fear is that this never gets done and a host causes some row-wide outage which would have been contained to a rack had we done the work to move things.

That said, there is no particular rush: the setup now is stable, and there is no huge performance or other benefit from moving, so we can work with whatever constraints you have. How many hosts are we talking about overall, do you know? If there is a 3-week delay to depool each, maybe we could just do one a month until we are done with them all? Or be more aggressive if that's possible, but we shouldn't take any risks to try and get it done quicker.

So, looking at netbox, hosts are distributed in codfw A/B thus:
A2 - ms-be2051, ms-be2074
A4 - ms-be20{60,62,66,70,75}
A7 - ms-be2052

B2 - ms-be2076
B4 - ms-be20{53,57,63,67,71}

ms-be205{1,2} were purchased 2019-09-18 and so will be aging out of rotation this FY (I think! They'll be 5 years old then).
So, if I'm right that this can be done rack-wise, we might plausibly be able to do A2 and A7 once those old machines are gone (it'll just be one node to drain in A2), and similarly B2 could be done straightforwardly.

A4 and B4 are more challenging; could they be done host-wise, or does there need to be a rack-level flag day?

Would scheduling the node in B2 in Q1/Q2 as a test-case be a good idea?

Now this applies to rows C and D as well, as the switches there got upgraded too.

It makes sense to ignore all the 2019 hosts; the renumbering is on a per-host basis, so there's no need to tackle a full rack at once.

I realize we're already in Q2, but anytime that works for you works for us :)

The first one is usually the one that takes the longest, to figure out the proper process.

I spoke to @cmooney about this in Atlanta, and I think my understanding is:

  1. This can be done host-by-host
  2. Only codfw needs doing right now
  3. Migration-wise, running the sre.hosts.reimage cookbook with the --move-vlan argument specified is sufficient

Is that correct? If so, I'll try and schedule the first node to drain-reimage-reload so we can see if it goes OK in practice :)

Change #1128907 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove ms-be2075 from rings

https://gerrit.wikimedia.org/r/1128907

Change #1128908 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: re-add ms-be2075 to the rings

https://gerrit.wikimedia.org/r/1128908

Change #1128907 merged by MVernon:

[operations/puppet@production] swift: remove ms-be2075 from rings

https://gerrit.wikimedia.org/r/1128907

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye completed:

  • ms-be2075 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202503191533_mvernon_3779816_ms-be2075.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1128908 merged by MVernon:

[operations/puppet@production] swift: re-add ms-be2075 to the rings

https://gerrit.wikimedia.org/r/1128908

MatthewVernon renamed this task from Re-IP Swift hosts to per-rack subnets in codfw row A and B. to Re-IP Swift hosts to per-rack subnets in codfw rows A-D. Apr 24 2025, 3:46 PM
MatthewVernon updated the task description.
MatthewVernon updated the task description.

Change #1138830 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] Swift: drain ms-be2080 (prep for VLAN move)

https://gerrit.wikimedia.org/r/1138830

Change #1138831 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove ms-be2080 entirely from rings prior to reimage

https://gerrit.wikimedia.org/r/1138831

Change #1138832 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: restore ms-be2080 to the rings post-reimage

https://gerrit.wikimedia.org/r/1138832

Change #1138830 merged by MVernon:

[operations/puppet@production] Swift: drain ms-be2080 (prep for VLAN move)

https://gerrit.wikimedia.org/r/1138830

Change #1138831 merged by MVernon:

[operations/puppet@production] swift: remove ms-be2080 entirely from rings prior to reimage

https://gerrit.wikimedia.org/r/1138831

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2080.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2080.codfw.wmnet with OS bullseye completed:

  • ms-be2080 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202506161246_mvernon_260954_ms-be2080.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1138832 merged by MVernon:

[operations/puppet@production] swift: restore ms-be2080 to the rings post-reimage

https://gerrit.wikimedia.org/r/1138832

Change #1176432 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: add 1 new codfw host, drain 3

https://gerrit.wikimedia.org/r/1176432

Change #1176432 merged by MVernon:

[operations/puppet@production] swift: add 1 new codfw host, drain 3

https://gerrit.wikimedia.org/r/1176432

Change #1180901 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove 3 drained codfw hosts

https://gerrit.wikimedia.org/r/1180901

Change #1180901 merged by MVernon:

[operations/puppet@production] swift: remove 3 drained codfw hosts

https://gerrit.wikimedia.org/r/1180901

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2079.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2079.codfw.wmnet with OS bullseye completed:

  • ms-be2079 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202508220923_mvernon_1048854_ms-be2079.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1182174 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: re-add 3 codfw hosts, drain the next 3

https://gerrit.wikimedia.org/r/1182174

Change #1182174 merged by MVernon:

[operations/puppet@production] swift: re-add 3 codfw hosts, drain the next 3

https://gerrit.wikimedia.org/r/1182174

Change #1194566 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: remove 3 drained codfw hosts

https://gerrit.wikimedia.org/r/1194566

Change #1194566 merged by MVernon:

[operations/puppet@production] swift: remove 3 drained codfw hosts

https://gerrit.wikimedia.org/r/1194566

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host ms-be2078.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host ms-be2078.codfw.wmnet with OS bullseye completed:

  • ms-be2078 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202510081204_mvernon_2293117_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2078.codfw.wmnet with OS bullseye completed:

  • ms-be2078 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202511031619_mvernon_3336736_ms-be2078.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1202192 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] re-add hosts to ring, drain 3 more

https://gerrit.wikimedia.org/r/1202192

Change #1202192 merged by MVernon:

[operations/puppet@production] re-add hosts to ring, drain 3 more

https://gerrit.wikimedia.org/r/1202192