[ceph] Move cloudcephosd1001 (b7) and cloudcephosd1002 (b4) to rack e4
Closed, Resolved, Public

Description

Both can be turned off at the same time.

They will require new IPs:

  • cloudcephosd1001.eqiad.wmnet:
public:
  addr: "10.64.148.14"
  iface: "ens2f0np0"
cluster:
  addr: "192.168.5.6"
  prefix: "24"
  iface: "ens2f1np1"
  • cloudcephosd1002.eqiad.wmnet:
public:
  addr: "10.64.148.15"
  iface: "ens2f0np0"
cluster:
  addr: "192.168.5.7"
  prefix: "24"
  iface: "ens2f1np1"
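
A quick sanity check of the values above can be sketched in Python. This is only an illustration: the addresses and interface names are copied from this description, while the dictionary layout and the `check_hosts` helper are assumptions made for the example and do not reflect the actual Puppet/hiera structure.

```python
import ipaddress

# Values copied from the task description. The dict layout is hypothetical
# and only serves this example.
HOSTS = {
    "cloudcephosd1001.eqiad.wmnet": {
        "public": {"addr": "10.64.148.14", "iface": "ens2f0np0"},
        "cluster": {"addr": "192.168.5.6", "prefix": "24", "iface": "ens2f1np1"},
    },
    "cloudcephosd1002.eqiad.wmnet": {
        "public": {"addr": "10.64.148.15", "iface": "ens2f0np0"},
        "cluster": {"addr": "192.168.5.7", "prefix": "24", "iface": "ens2f1np1"},
    },
}


def check_hosts(hosts):
    """Return a list of problems; an empty list means the data looks consistent."""
    problems = []
    seen = {}  # address -> first host that claimed it
    cluster_networks = set()
    for fqdn, nets in hosts.items():
        for role, cfg in nets.items():
            # ip_address() raises ValueError on a malformed address.
            addr = ipaddress.ip_address(cfg["addr"])
            if addr in seen:
                problems.append(f"{addr} used by both {seen[addr]} and {fqdn}")
            seen[addr] = fqdn
            if role == "cluster":
                iface = ipaddress.ip_interface(f"{cfg['addr']}/{cfg['prefix']}")
                cluster_networks.add(iface.network)
        if nets["public"]["iface"] == nets["cluster"]["iface"]:
            problems.append(f"{fqdn}: public and cluster share an interface")
    if len(cluster_networks) > 1:
        problems.append("cluster addresses are not all in the same network")
    return problems


if __name__ == "__main__":
    for problem in check_hosts(HOSTS) or ["no problems found"]:
        print(problem)
```

The check only covers internal consistency (no duplicate addresses, distinct interfaces per host, cluster addresses in one network). It does not verify that the addresses are free in Netbox.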

Event Timeline

Change 888659 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs ceph:Move cloudcephosd1001/1002 to e4

https://gerrit.wikimedia.org/r/888659

@Jclark-ctr We can start with this one. It will also need changes in Netbox (for the 10.64.148.* IPs). I'm mostly available in European time zones, but I can accommodate others if needed.

Note that I selected the IPs in those ranges manually. I don't think anything new is coming in those ranges, so they should still be valid when we do the actual move.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-13T14:06:50Z] <wm-bot2> Set the ceph cluster for eqiad1 in maintenance, alert silence ids: 8fbf6bfd-eec1-4d81-8e0d-ea431d8411ee (T329498) - cookbook ran by dcaro@vulcanus

Icinga downtime and Alertmanager silence (ID=34f24a3a-279b-41cf-89ec-66102b211bda) set by dcaro@cumin1001 for 3:00:00 on 1 host(s) and their services with reason: moving racks

cloudcephosd1001.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=dcc24c74-6aaa-4607-8c93-ab2699307f18) set by dcaro@cumin1001 for 3:00:00 on 1 host(s) and their services with reason: moving racks

cloudcephosd1002.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host cloudcephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Relocated servers to rack E4 and updated Netbox.
Switch: cloudsw1-e4-eqiad
cloudcephosd1001: ports 18, 19; cloudcephosd1002: ports 16, 17

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: cloudcephosd1001.eqiad.wmnet

  • cloudcephosd1001.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

cookbooks.sre.hosts.decommission executed by cmooney@cumin1001 for hosts: cloudcephosd1002.eqiad.wmnet

  • cloudcephosd1002.eqiad.wmnet (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.2.178
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

Change 888659 merged by Andrew Bogott:

[operations/puppet@production] wmcs ceph:Move cloudcephosd1001/1002 to e4

https://gerrit.wikimedia.org/r/888659

Change 889142 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Adjust interface names for cloudcephosd1001 and cloudcephosd1002

https://gerrit.wikimedia.org/r/889142

Change 889142 merged by David Caro:

[operations/puppet@production] Adjust interface names for cloudcephosd1001 and cloudcephosd1002

https://gerrit.wikimedia.org/r/889142

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1001.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1001 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302141308_andrew_2411049_cloudcephosd1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302141457_andrew_2433467_cloudcephosd1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin2002 for host cloudcephosd1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302141521_andrew_2437022_cloudcephosd1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1002.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302141543_andrew_2450266_cloudcephosd1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcephosd1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudcephosd1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcephosd1002.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302141632_andrew_2061134_cloudcephosd1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudcephosd1002.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302141656_andrew_2066894_cloudcephosd1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Just want to document the process we should be following here, for any other moves that we wish to do.

I've added a section to the server lifecycle page on Wikitech based on our experience with these ones:

https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs

To expand on the generic language used:

  1. We will need to update the NIC firmware as we are going from buster to bullseye
    1. That requires us to first update the iDRAC itself:
      • sudo cookbook sre.hardware.upgrade-firmware -n -c idrac <host fqdn>
    2. When complete do the NIC:
      • sudo cookbook sre.hardware.upgrade-firmware -n -c nic <host fqdn>
      • "Network_Firmware_RXP80_WN64_21.85.21.92.EXE" is the known-working release with bullseye
  2. Puppet changes are needed. As far as I know, these are the two we made for the existing hosts (they can be done in one patch):

Any questions just ask!
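
The firmware ordering described above (iDRAC first, then NIC) can be sketched as a small Python helper that only prints the commands and runs nothing. The cookbook name and the `-n -c <component>` flags are copied verbatim from the comment above. The helper names `firmware_commands` and `render_plan` are hypothetical and not part of any existing tooling.

```python
import shlex

# Cookbook name as quoted in the comment above.
COOKBOOK = "sre.hardware.upgrade-firmware"


def firmware_commands(fqdn):
    """Return the two upgrade commands for one host: iDRAC first, then NIC."""
    return [
        ["sudo", "cookbook", COOKBOOK, "-n", "-c", component, fqdn]
        for component in ("idrac", "nic")
    ]


def render_plan(hosts):
    """Return shell-quoted command lines, finishing one host before the next."""
    return [
        shlex.join(cmd)
        for fqdn in hosts
        for cmd in firmware_commands(fqdn)
    ]


if __name__ == "__main__":
    plan = render_plan(
        ["cloudcephosd1001.eqiad.wmnet", "cloudcephosd1002.eqiad.wmnet"]
    )
    for line in plan:
        print(line)
```

Printing the plan instead of running it keeps the sketch safe to try anywhere. Each command would still need to be run by hand on a cumin host, waiting for the iDRAC upgrade to complete before starting the NIC one.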

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T10:59:40Z] <wm-bot2> Depooling OSDs with ids in [55, 54, 53, 52, 51, 50] on cloudcephosd1001 from eqiad1 (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T11:01:49Z] <wm-bot2> Depooling OSDs with ids in [55, 54, 53, 52, 51, 50] on cloudcephosd1001 from eqiad1 (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T11:03:40Z] <wm-bot2> Destroying OSDs with ids in [55, 54, 53, 52, 51, 50] on cloudcephosd1001 from eqiad1 (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T11:19:18Z] <wm-bot2> Depooling OSDs with ids in [53, 52, 51, 50] on cloudcephosd1001 from eqiad1 (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T11:20:41Z] <wm-bot2> Destroying OSDs with ids in [53, 52, 51, 50] on cloudcephosd1001 from eqiad1 (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:14:46Z] <wm-bot2> Adding new OSDs ['cloudcephosd1001.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:14:50Z] <wm-bot2> Adding OSD cloudcephosd1001.eqiad.wmnet... (1/1) (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:21:45Z] <wm-bot2> Depooling OSDs with ids in [63, 62, 61, 60, 59, 58, 57, 56] on cloudcephosd1002 from eqiad1 (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:23:25Z] <wm-bot2> Destroying OSDs with ids in [63, 62, 61, 60, 59, 58, 57, 56] on cloudcephosd1002 from eqiad1 (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:24:43Z] <wm-bot2> Adding new OSDs ['cloudcephosd1001.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:24:47Z] <wm-bot2> Adding OSD cloudcephosd1001.eqiad.wmnet... (1/1) (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:29:18Z] <wm-bot2> Adding new OSDs ['cloudcephosd1001.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T13:29:47Z] <wm-bot2> Adding new OSDs ['cloudcephosd1001.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T14:05:43Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1001.eqiad.wmnet'] (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T16:00:32Z] <wm-bot2> Adding new OSDs ['cloudcephosd1002.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T16:00:38Z] <wm-bot2> Adding OSD cloudcephosd1002.eqiad.wmnet... (1/1) (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T16:01:05Z] <wm-bot2> Adding new OSDs ['cloudcephosd1002.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T16:01:09Z] <wm-bot2> Adding OSD cloudcephosd1002.eqiad.wmnet... (1/1) (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T16:50:45Z] <wm-bot2> Adding new OSDs ['cloudcephosd1002.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T16:50:49Z] <wm-bot2> Adding OSD cloudcephosd1002.eqiad.wmnet... (1/1) (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T17:41:58Z] <wm-bot2> Adding new OSDs ['cloudcephosd1002.eqiad.wmnet'] to the cluster (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T17:42:03Z] <wm-bot2> Adding OSD cloudcephosd1002.eqiad.wmnet... (1/1) (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T17:47:17Z] <wm-bot2> Added OSD cloudcephosd1002.eqiad.wmnet... (1/1) (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T17:47:21Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1002.eqiad.wmnet'] (T329498) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud) [2023-02-16T17:55:21Z] <dcaro> Manually zapped /dev/sdc on cloudcephosd1002, probably a leftover drive since the beginning (or during the reimage the drives changed names, and this one had leftovers from the previous OS) (T329498)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-16T19:13:59Z] <wm-bot2> The cluster is now rebalanced after adding the new OSDs ['cloudcephosd1002.eqiad.wmnet'] (T329498) - cookbook ran by dcaro@vulcanus

The new hosts are up and running. They have joined the cluster, and all the data has been rebalanced.