
[ceph] Move cloudcephosd1005 (c8) and cloudcephosd1010 (d5) to rack f4
Closed, Resolved · Public

Description

They will require new IPs:

  • cloudcephosd1005.eqiad.wmnet:
    public:
      addr: "10.64.149.15"
      iface: "ens3f0np0"
    cluster:
      addr: "192.168.6.7"
      prefix: "24"
      iface: "ens3f1np1"
  • cloudcephosd1010.eqiad.wmnet:
    public:
      addr: "10.64.149.16"
      iface: "ens3f0np0"
    cluster:
      addr: "192.168.6.8"
      prefix: "24"
      iface: "ens3f1np1"
  • Run the wmcs.ceph.osd.depool_and_destroy cookbook (removes all the OSDs from the host and removes the CRUSH entries for the host)
  • Run the sre.hosts.decommission cookbook
  • Move the hosts to the new racks
  • In Puppet, edit hieradata/eqiad/profile/cloudceph/osd.yaml with the new IPs on the new ranges (public and cluster networks) if needed (search the range in Netbox for the next free IP)
  • Follow https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs
    • Note that for the new interfaces to come up, Puppet has to run once (this happens after reimage); then make sure the new interface is set up in the right VLAN (the cloud-storage one)
    • BEFORE REIMAGE Upgrade the idrac firmware (cookbook sre.hardware.upgrade-firmware -n -c idrac cloudcephosd1004)
    • BEFORE REIMAGE Upgrade the nic firmware (cookbook sre.hardware.upgrade-firmware -n -c nic cloudcephosd1004)
    • IF REIMAGE FAILS Repeat the reimage until it works (Puppet might time out, etc.; you can check the console by SSHing to root@<hostname>.mgmt.eqiad.wmnet, using the mgmt password)
  • Merge the patch with the new IPs
  • Put the host back in Ceph (wmcs.ceph.osd.bootstrap_and_add); it might take a while to finish the rebalancing
  • Profit!
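The overall sequence above can be sketched as a dry-run shell script. The cookbook names come from this task; the exact arguments vary by cookbook version, so this hypothetical `run` wrapper only echoes what would be executed rather than invoking the real cookbook runner:

```shell
# Dry-run sketch of the move procedure for one host (cloudcephosd1005).
# Cookbook names are taken from this task; arguments are illustrative.
HOST=cloudcephosd1005

run() { echo "would run: $*"; }  # swap for "$@" when running on a cumin host

# 1. Drain the host: remove its OSDs and its CRUSH map entries.
run cookbook wmcs.ceph.osd.depool_and_destroy "$HOST"
# 2. Decommission the host before the physical move.
run cookbook sre.hosts.decommission "$HOST".eqiad.wmnet
# 3. (Physical rack move happens here; update hieradata and Netbox IPs.)
# 4. BEFORE REIMAGE: upgrade idrac and nic firmware.
run cookbook sre.hardware.upgrade-firmware -n -c idrac "$HOST"
run cookbook sre.hardware.upgrade-firmware -n -c nic "$HOST"
# 5. Reimage (retry on failure).
run cookbook sre.hosts.reimage --os bullseye "$HOST"
# 6. Re-add the host to Ceph and wait for the rebalance to finish.
run cookbook wmcs.ceph.osd.bootstrap_and_add "$HOST"
```

Each numbered step maps onto one bullet above; the rebalance in step 6 is the long-running part (a couple of hours per host in the timeline below is typical).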

Event Timeline

Change 888663 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.ceph: move cloudcephosd1005/1010 to f4

https://gerrit.wikimedia.org/r/888663

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T08:55:09Z] <wm-bot2> Depooling OSDs with ids in [39, 38, 37, 36, 35, 34, 33, 32] on cloudcephosd1005 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:11:20Z] <wm-bot2> Destroying OSDs with ids in [39, 38, 37, 36, 35, 34, 33, 32] on cloudcephosd1005 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:12:02Z] <wm-bot2> Depooled and destroyed OSD daemons [39, 38, 37, 36, 35, 34, 33, 32] and removed the OSD host cloudcephosd1005 from the CRUSH map. (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:13:57Z] <wm-bot2> Depooling OSDs with ids in [79, 78, 77, 76, 75, 74, 73, 72] on cloudcephosd1010 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:28:28Z] <wm-bot2> Depooling OSDs with ids in [79, 78, 77, 76, 75, 74, 73, 72] on cloudcephosd1010 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:46:51Z] <wm-bot2> Destroying OSDs with ids in [79, 78, 77, 76, 75, 74, 73, 72] on cloudcephosd1010 from eqiad1 (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-28T09:47:33Z] <wm-bot2> Depooled and destroyed OSD daemons [79, 78, 77, 76, 75, 74, 73, 72] and removed the OSD host cloudcephosd1010 from the CRUSH map. (T329504) - cookbook ran by dcaro@vulcanus

@Jclark-ctr These hosts are ready to be moved to the new rack whenever you have a moment (and opportunity) :)

Hosts have been relocated and are cabled, @dcaro:
cloudcephosd1005: ports 14, 15
cloudcephosd1010: ports 16, 17

cookbooks.sre.hosts.decommission executed by dcaro@cumin1001 for hosts: cloudcephosd1005.eqiad.wmnet

  • cloudcephosd1005.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dcaro@cumin1001 for hosts: cloudcephosd1010.eqiad.wmnet

  • cloudcephosd1010.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 888663 merged by David Caro:

[operations/puppet@production] wmcs.ceph: move cloudcephosd1005/1010 to f4

https://gerrit.wikimedia.org/r/888663

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudcephosd1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudcephosd1005.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303011724_cmooney_889891_cloudcephosd1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:32:29Z] <wm-bot2> Adding new OSDs ['cloudcephosd1005.eqiad.wmnet'] to the cluster (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:32:37Z] <wm-bot2> Adding OSD cloudcephosd1005.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:33:10Z] <wm-bot2> Rebooting node cloudcephosd1005.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:36:54Z] <wm-bot2> Finished rebooting node cloudcephosd1005.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:44:38Z] <wm-bot2> Added OSD cloudcephosd1005.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T08:44:46Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1005.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus

Cookbook cookbooks.sre.hosts.reimage was started by dcaro@cumin1001 for host cloudcephosd1010.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T10:43:27Z] <wm-bot2> The cluster is now rebalanced after adding the new OSDs ['cloudcephosd1005.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus

@Jclark-ctr Given this was a somewhat non-standard move, we need to get the cable labels for the links to these hosts updated in Netbox now that the move is done.

Can you take a look?

cloudcephosd1005:
https://netbox.wikimedia.org/dcim/cables/6110/
https://netbox.wikimedia.org/dcim/cables/6111/

cloudcephosd1010:
https://netbox.wikimedia.org/dcim/cables/6115/
https://netbox.wikimedia.org/dcim/cables/6116/

Thanks.

Cookbook cookbooks.sre.hosts.reimage started by dcaro@cumin1001 for host cloudcephosd1010.eqiad.wmnet with OS bullseye completed:

  • cloudcephosd1010 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202303020925_dcaro_1068188_cloudcephosd1010.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:09:36Z] <wm-bot2> Adding new OSDs ['cloudcephosd1010.eqiad.wmnet'] to the cluster (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:09:42Z] <wm-bot2> Adding OSD cloudcephosd1010.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:10:16Z] <wm-bot2> Rebooting node cloudcephosd1010.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:13:52Z] <wm-bot2> Finished rebooting node cloudcephosd1010.eqiad.wmnet (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:21:14Z] <wm-bot2> Added OSD cloudcephosd1010.eqiad.wmnet... (1/1) (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T14:21:21Z] <wm-bot2> Added 1 new OSDs ['cloudcephosd1010.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus

Mentioned in SAL (#wikimedia-cloud-feed) [2023-03-02T16:17:14Z] <wm-bot2> The cluster is now rebalanced after adding the new OSDs ['cloudcephosd1010.eqiad.wmnet'] (T329504) - cookbook ran by dcaro@vulcanus