
[ceph] Move cloudcephosd1003 (b2) to rack e4 and cloudcephosd1004 (c8) to rack f4
Closed, Resolved (Public)

Description

Both can be turned off at the same time.

They will require new IPs:

  • cloudcephosd1003.eqiad.wmnet:
      public:
        addr: "10.64.148.16"
        iface: "ens2f0np0"
      cluster:
        addr: "192.168.5.8"
        prefix: "24"
        iface: "ens2f1np1"
  • cloudcephosd1004.eqiad.wmnet:
      public:
        addr: "10.64.149.14"
        iface: "ens3f0np0"
      cluster:
        addr: "192.168.6.6"
        prefix: "24"
        iface: "ens3f1np1"
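
As a quick sanity check on the addresses above, here is a minimal sketch using only the Python standard library. The dictionary layout and the helper names (`cluster_network`, `check_addresses`) are assumptions made for illustration; they are not the real hieradata schema or any existing tooling.

```python
import ipaddress

# Addresses copied from the task description. The dict layout is only an
# illustration and is NOT the real hieradata schema.
NEW_ADDRESSES = {
    "cloudcephosd1003.eqiad.wmnet": {
        "public": {"addr": "10.64.148.16", "iface": "ens2f0np0"},
        "cluster": {"addr": "192.168.5.8", "prefix": "24", "iface": "ens2f1np1"},
    },
    "cloudcephosd1004.eqiad.wmnet": {
        "public": {"addr": "10.64.149.14", "iface": "ens3f0np0"},
        "cluster": {"addr": "192.168.6.6", "prefix": "24", "iface": "ens3f1np1"},
    },
}


def cluster_network(entry):
    """Return the cluster subnet implied by an entry's addr and prefix."""
    cluster = entry["cluster"]
    iface = ipaddress.ip_interface(f"{cluster['addr']}/{cluster['prefix']}")
    return iface.network


def check_addresses(hosts):
    """Parse every address and make sure none is assigned twice."""
    seen = set()
    for fqdn, entry in hosts.items():
        for kind in ("public", "cluster"):
            addr = ipaddress.ip_address(entry[kind]["addr"])
            if addr in seen:
                raise ValueError(f"{addr} is assigned twice (seen again on {fqdn})")
            seen.add(addr)
    return seen


if __name__ == "__main__":
    check_addresses(NEW_ADDRESSES)
    for fqdn, entry in NEW_ADDRESSES.items():
        print(fqdn, cluster_network(entry))
```

This only catches malformed or duplicated addresses within the snippet itself; whether an IP is actually free still has to be checked in Netbox.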

For the record, the final process used was:

  • wmcs.ceph.osd.depool_and_destroy cookbook (removes all the OSDs from the host and removes the CRUSH entries for the host)
  • sre.hosts.decommission
  • Move the hosts to the new racks
  • In puppet, edit hieradata/eqiad/profile/cloudceph/osd.yaml with new IPs from the new ranges (public and cluster networks) if needed; look up each range in Netbox to find its next free IP
  • Half-follow https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
    • Move from DECOMMISSIONING to PLANNED
    • Add only the public IP to the main interface
    • Flag that interface as primary
    • Also add an FQDN to that new IP
    • Also add an FQDN to the mgmt IP (if not there)
    • Run the sre.dns.netbox cookbook
  • Merge the patch with the new IPs
  • Upgrade the idrac firmware (cookbook sre.hardware.upgrade-firmware -n -c idrac cloudcephosd1004)
  • Upgrade the nic firmware (cookbook sre.hardware.upgrade-firmware -n -c nic cloudcephosd1004)
  • Reimage the host (cookbook sre.hosts.reimage --os bullseye --new -t T329502 cloudcephosd1004)
    • Repeat the reimage until it works (Puppet might time out, etc.); you can check the console by SSHing to root@<hostname>.mgmt.eqiad.wmnet with the mgmt password
  • Put the host back in Ceph (wmcs.ceph.osd.bootstrap_and_add); the rebalancing might take a while to finish (roughly two hours per host here, going by the SAL entries below)
  • Profit!
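
The per-host firmware and reimage steps above can be sketched as a small helper that only builds and prints the command lines. The flags are exactly the ones quoted in the list; the function names are hypothetical, nothing is executed, and the wmcs.ceph cookbooks are left out because this task does not record their arguments.

```python
import shlex


def reimage_commands(host, task, os_release="bullseye"):
    """Return the cookbook commands quoted in the list above, in order.

    Only flags that appear verbatim in the task description are used.
    """
    return [
        ["cookbook", "sre.hardware.upgrade-firmware", "-n", "-c", "idrac", host],
        ["cookbook", "sre.hardware.upgrade-firmware", "-n", "-c", "nic", host],
        ["cookbook", "sre.hosts.reimage", "--os", os_release, "--new", "-t", task, host],
    ]


def render(commands):
    """Turn argument lists into shell-safe strings for copy and paste."""
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Print the sequence for one host; the operator still runs each step by hand.
    for line in render(reimage_commands("cloudcephosd1004", "T329502")):
        print(line)
```

Keeping this as a print-only helper is deliberate: the reimage usually had to be repeated several times in this task, so each run is better started and watched manually.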

EDIT (cmooney): For the record, I believe the best process to follow for this kind of move is the one outlined below:

https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows%2Fracks%2C_changing_IPs

Event Timeline

Change 888660 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.ceph: move cloudcephosd1003/1004 to e4/f4

https://gerrit.wikimedia.org/r/888660

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-21T14:00:32Z] <wm-bot2> Depooling OSDs with ids in [71, 70, 69, 68, 67, 66, 65, 64] on cloudcephosd1003 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-21T14:21:19Z] <wm-bot2> Destroying OSDs with ids in [71, 70, 69, 68, 67, 66, 65, 64] on cloudcephosd1003 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-21T15:48:10Z] <wm-bot2> Depooling OSDs with ids in [71, 70, 69, 68, 67, 66, 65, 64] on cloudcephosd1003 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-21T15:50:26Z] <wm-bot2> Destroying OSDs with ids in [71, 70, 69, 68, 67, 66, 65, 64] on cloudcephosd1003 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-22T07:02:46Z] <wm-bot2> Depooling OSDs with ids in [31, 30, 29, 28, 27, 26, 25, 24] on cloudcephosd1004 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-22T07:23:19Z] <wm-bot2> Destroying OSDs with ids in [31, 30, 29, 28, 27, 26, 25, 24] on cloudcephosd1004 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-22T08:13:09Z] <wm-bot2> Depooling OSDs with ids in [31, 30, 29, 28, 27, 26, 25, 24] on cloudcephosd1004 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-22T08:15:20Z] <wm-bot2> Destroying OSDs with ids in [31, 30, 29, 28, 27, 26, 25, 24] on cloudcephosd1004 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-22T08:22:49Z] <wm-bot2> Depooling OSDs with ids in [31, 30, 29, 28, 27, 26, 25, 24] on cloudcephosd1004 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-22T08:25:03Z] <wm-bot2> Destroying OSDs with ids in [31, 30, 29, 28, 27, 26, 25, 24] on cloudcephosd1004 from eqiad1 (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-22T08:25:49Z] <wm-bot2> Depooled and destroyed OSD daemons [31, 30, 29, 28, 27, 26, 25, 24] and removed the OSD host cloudcephosd1004 from the CRUSH map. (T329502) - cookbook ran by dcaro@vulcanus

cookbooks.sre.hosts.decommission executed by dcaro@cumin1001 for hosts: cloudcephosd1003.eqiad.wmnet

  • cloudcephosd1003.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by dcaro@cumin1001 for hosts: cloudcephosd1004.eqiad.wmnet

  • cloudcephosd1004.eqiad.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

@Jclark-ctr @ayounsi the two hosts are ready to be moved :), note that they go to different racks.

cloudcephosd1003. Port 14,15
cloudcephosd1004. Port 18,19

Change 888660 merged by David Caro:

[operations/puppet@production] wmcs.ceph: move cloudcephosd1003/1004 to e4/f4

https://gerrit.wikimedia.org/r/888660

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudcephosd1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudcephosd1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin2002 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin2002 for host cloudcephosd1003.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271505_dcaro_252239_cloudcephosd1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T15:58:34Z] <wm-bot2> Adding new OSDs ['coludcephosd1003'] to the cluster (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T15:58:40Z] <wm-bot2> Adding OSD coludcephosd1003... (1/1) (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T15:59:09Z] <wm-bot2> Adding new OSDs ['coludcephosd1003.eqiad.wmnet'] to the cluster (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T15:59:14Z] <wm-bot2> Adding OSD coludcephosd1003.eqiad.wmnet... (1/1) (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T16:01:50Z] <wm-bot2> Adding new OSDs ['cloudcephosd1003.eqiad.wmnet'] to the cluster (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T16:01:54Z] <wm-bot2> Adding OSD cloudcephosd1003.eqiad.wmnet... (1/1) (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T16:02:29Z] <wm-bot2> Rebooting node cloudcephosd1003.eqiad.wmnet (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T16:05:39Z] <wm-bot2> Finished rebooting node cloudcephosd1003.eqiad.wmnet (T329502) - cookbook ran by dcaro@vulcanus

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1004.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T16:13:12Z] <wm-bot2> Added OSD cloudcephosd1003.eqiad.wmnet... (1/1) (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T16:13:16Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1003.eqiad.wmnet'] (T329502) - cookbook ran by dcaro@vulcanus

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1004.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1004 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1004.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:14:51Z] <wm-bot2> The cluster is now rebalanced after adding the new OSDs ['cloudcephosd1003.eqiad.wmnet'] (T329502) - cookbook ran by dcaro@vulcanus

Change 892526 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloudcephosd1004: use the right interface names

https://gerrit.wikimedia.org/r/892526

Change 892526 merged by David Caro:

[operations/puppet@production] cloudcephosd1004: use the right interface names

https://gerrit.wikimedia.org/r/892526

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1004.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1004 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302271654_dcaro_280466_cloudcephosd1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:38:04Z] <wm-bot2> Adding new OSDs ['cloudcephosd1004.eqiad.wmnet'] to the cluster (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:38:08Z] <wm-bot2> Adding OSD cloudcephosd1004.eqiad.wmnet... (1/1) (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:38:43Z] <wm-bot2> Rebooting node cloudcephosd1004.eqiad.wmnet (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:41:29Z] <wm-bot2> Adding new OSDs ['cloudcephosd1004.eqiad.wmnet'] to the cluster (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:41:34Z] <wm-bot2> Adding OSD cloudcephosd1004.eqiad.wmnet... (1/1) (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:42:10Z] <wm-bot2> Rebooting node cloudcephosd1004.eqiad.wmnet (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:45:50Z] <wm-bot2> Finished rebooting node cloudcephosd1004.eqiad.wmnet (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:53:00Z] <wm-bot2> Added OSD cloudcephosd1004.eqiad.wmnet... (1/1) (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T18:53:04Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1004.eqiad.wmnet'] (T329502) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-27T21:01:01Z] <wm-bot2> The cluster is now rebalanced after adding the new OSDs ['cloudcephosd1004.eqiad.wmnet'] (T329502) - cookbook ran by dcaro@vulcanus

dcaro updated the task description.

Done!

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudcephosd1003.eqiad.wmnet with OS buster executed with errors:

  • cloudcephosd1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
cmooney updated the task description.