
[ceph] Move cloudcephosd1005 (c8) and cloudcephosd1010 (d5) to rack f4
Closed, Resolved · Public

Description

They will require new IPs:

  • cloudcephosd1005.eqiad.wmnet:
    public:
      addr: "10.64.149.15"
      iface: "ens3f0np0"
    cluster:
      addr: "192.168.6.7"
      prefix: "24"
      iface: "ens3f1np1"
  • cloudcephosd1010.eqiad.wmnet:
    public:
      addr: "10.64.149.16"
      iface: "ens3f0np0"
    cluster:
      addr: "192.168.6.8"
      prefix: "24"
      iface: "ens3f1np1"
  • Run the wmcs.ceph.osd.depool_and_destroy cookbook (removes all the OSDs from the host and removes the CRUSH entries for the host)
  • Run the sre.hosts.decommission cookbook
  • Move the hosts to the new racks
  • In Puppet, edit hieradata/eqiad/profile/cloudceph/osd.yaml with the new IPs on the new ranges (public and cluster networks) if needed (search the range in Netbox for the next free IP)
  • Follow https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs
    • Note that for the new interfaces to come up, Puppet has to run once (this happens after reimage); then make sure the new interface is set up in the right VLAN (the cloud-storage one)
    • BEFORE REIMAGE Upgrade the idrac firmware (cookbook sre.hardware.upgrade-firmware -n -c idrac cloudcephosd1004)
    • BEFORE REIMAGE Upgrade the nic firmware (cookbook sre.hardware.upgrade-firmware -n -c nic cloudcephosd1004)
    • IF REIMAGE FAILS Repeat the reimage until it works (Puppet might time out, etc.; you can check the console by SSHing to root@<hostname>.mgmt.eqiad.wmnet, using the mgmt password)
  • Merge the patch with the new IPs
  • Put the host back in Ceph (wmcs.ceph.osd.bootstrap_and_add); it might take a while to finish the rebalancing
  • Profit!
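The overall sequence above can be sketched as a dry-run shell script. The cookbook names come from this task; the exact arguments vary by cookbook version, so this hypothetical `run` wrapper only echoes what would be executed rather than invoking the real cookbook runner:

```shell
# Dry-run sketch of the move procedure for one host (cloudcephosd1005).
# Cookbook names are taken from this task; arguments are illustrative.
HOST=cloudcephosd1005

run() { echo "would run: $*"; }  # swap for "$@" when running on a cumin host

# 1. Drain the host: remove its OSDs and its CRUSH map entries.
run cookbook wmcs.ceph.osd.depool_and_destroy "$HOST"
# 2. Decommission the host before the physical move.
run cookbook sre.hosts.decommission "$HOST".eqiad.wmnet
# 3. (Physical rack move happens here; update hieradata and Netbox IPs.)
# 4. BEFORE REIMAGE: upgrade idrac and nic firmware.
run cookbook sre.hardware.upgrade-firmware -n -c idrac "$HOST"
run cookbook sre.hardware.upgrade-firmware -n -c nic "$HOST"
# 5. Reimage (retry on failure).
run cookbook sre.hosts.reimage --os bullseye "$HOST"
# 6. Re-add the host to Ceph and wait for the rebalance to finish.
run cookbook wmcs.ceph.osd.bootstrap_and_add "$HOST"
```

Each numbered step maps onto one bullet above; the rebalance in step 6 is the long-running part (a couple of hours per host in the timeline below is typical).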

Event Timeline

Change 888663 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.ceph: move cloudcephosd1005/1010 to f4

https://gerrit.wikimedia.org/r/888663

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T08:55:09Z] <wm-bot2> Depooling OSDs with ids in [39, 38, 37, 36, 35, 34, 33, 32] on cloudcephosd1005 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:11:20Z] <wm-bot2> Destroying OSDs with ids in [39, 38, 37, 36, 35, 34, 33, 32] on cloudcephosd1005 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:12:02Z] <wm-bot2> Depooled and destroyed OSD daemons [39, 38, 37, 36, 35, 34, 33, 32] and removed the OSD host cloudcephosd1005 from the CRUSH map. (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:13:57Z] <wm-bot2> Depooling OSDs with ids in [79, 78, 77, 76, 75, 74, 73, 72] on cloudcephosd1010 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:28:28Z] <wm-bot2> Depooling OSDs with ids in [79, 78, 77, 76, 75, 74, 73, 72] on cloudcephosd1010 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:46:51Z] <wm-bot2> Destroying OSDs with ids in [79, 78, 77, 76, 75, 74, 73, 72] on cloudcephosd1010 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:47:33Z] <wm-bot2> Depooled and destroyed OSD daemons [79, 78, 77, 76, 75, 74, 73, 72] and removed the OSD host cloudcephosd1010 from the CRUSH map. (T329504) - cookbook ran by dcaro@vulcanus

@Jclark-ctr These hosts are ready to be moved to the new rack whenever you have a moment (and opportunity) :)

Hosts have been relocated and are cabled, @dcaro:
cloudcephosd1005: ports 14, 15
cloudcephosd1010: ports 16, 17

cookbooks.sre.hosts.decommission executed by dcaro@cumin1001 for hosts: cloudcephosd1005.eqiad.wmnet

  • cloudcephosd1005.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dcaro@cumin1001 for hosts: cloudcephosd1010.eqiad.wmnet

  • cloudcephosd1010.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 888663 merged by David Caro:

[operations/puppet@production] wmcs.ceph: move cloudcephosd1005/1010 to f4

https://gerrit.wikimedia.org/r/888663

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303011724_cmooney_889891_cloudcephosd1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:32:29Z] <wm-bot2> Adding new OSDs ['cloudcephosd1005.eqiad.wmnet'] to the cluster (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:32:37Z] <wm-bot2> Adding OSD cloudcephosd1005.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:33:10Z] <wm-bot2> Rebooting node cloudcephosd1005.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:36:54Z] <wm-bot2> Finished rebooting node cloudcephosd1005.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:44:38Z] <wm-bot2> Added OSD cloudcephosd1005.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:44:46Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1005.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1010.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T10:43:27Z] <wm-bot2> The cluster is now rebalanced after adding the new OSDs ['cloudcephosd1005.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus

@Jclark-ctr Given this was a somewhat non-standard move, we need to get the cable labels for the links to these hosts updated in Netbox now that the move is done.

Can you take a look?

cloudcephosd1005:
https://netbox.wikimedia.org/dcim/cables/6110/
https://netbox.wikimedia.org/dcim/cables/6111/

cloudcephosd1010:
https://netbox.wikimedia.org/dcim/cables/6115/
https://netbox.wikimedia.org/dcim/cables/6116/

Thanks.

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1010.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303020925_dcaro_1068188_cloudcephosd1010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:09:36Z] <wm-bot2> Adding new OSDs ['cloudcephosd1010.eqiad.wmnet'] to the cluster (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:09:42Z] <wm-bot2> Adding OSD cloudcephosd1010.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:10:16Z] <wm-bot2> Rebooting node cloudcephosd1010.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:13:52Z] <wm-bot2> Finished rebooting node cloudcephosd1010.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:21:14Z] <wm-bot2> Added OSD cloudcephosd1010.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:21:21Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1010.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T16:17:14Z] <wm-bot2> The cluster is now rebalanced after adding the new OSDs ['cloudcephosd1010.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus