[ceph] Move cloudcephosd1001 (b7) and cloudcephosd1002 (b4) to rack e4
Closed, Resolved, Public

Assigned To

Authored By

	dcaro
	Feb 13 2023, 11:39 AM

Description

Both can be turned off at the same time.

They will require new IPs:

cloudcephosd1001.eqiad.wmnet:

public:
  addr: "10.64.148.14"
  iface: "ens2f0np0"
cluster:
  addr: "192.168.5.6"
  prefix: "24"
  iface: "ens2f1np1"

cloudcephosd1002.eqiad.wmnet:

public:
  addr: "10.64.148.15"
  iface: "ens2f0np0"
cluster:
  addr: "192.168.5.7"
  prefix: "24"
  iface: "ens2f1np1"
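A quick sanity check on the addresses above can be sketched as follows. This is a minimal, hypothetical helper and not part of any existing cookbook: the `NEW_ADDRS` layout and the function names are assumptions made for illustration, and it assumes the first address listed for each host is the public one. It only verifies that no address is listed twice and that both cluster addresses fall in the same network for the given prefix.

```python
import ipaddress

# Values copied from the task description; the dict layout is illustrative only.
NEW_ADDRS = {
    "cloudcephosd1001.eqiad.wmnet": {
        "public": "10.64.148.14",
        "cluster": ("192.168.5.6", "24"),
    },
    "cloudcephosd1002.eqiad.wmnet": {
        "public": "10.64.148.15",
        "cluster": ("192.168.5.7", "24"),
    },
}


def cluster_network(addr, prefix):
    """Return the network that contains addr for the given prefix length."""
    return ipaddress.ip_interface(f"{addr}/{prefix}").network


def check_addresses(hosts):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    all_addrs = []
    networks = set()
    for cfg in hosts.values():
        all_addrs.append(cfg["public"])
        addr, prefix = cfg["cluster"]
        all_addrs.append(addr)
        networks.add(cluster_network(addr, prefix))
    if len(set(all_addrs)) != len(all_addrs):
        problems.append("duplicate address")
    if len(networks) != 1:
        problems.append("cluster addresses are in different networks")
    return problems


if __name__ == "__main__":
    print(check_addresses(NEW_ADDRS))
```

This does not replace checking the allocations in Netbox; it only catches copy-paste mistakes in the values listed here.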

Details

	Subject	Repo	Branch	Lines +/-
	Adjust interface names for cloudcephosd1001 and cloudcephosd1002	operations/puppet	production	+4 -4
	wmcs ceph:Move cloudcephosd1001/1002 to e4	operations/puppet	production	+16 -16


Related Objects

Status	Subtype	Assigned	Task
Open		None	T253824 planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1)
Resolved		ayounsi	T254013 all network devices must run OpenSSH >= 7.2p1 but != 7.4p1
Resolved		ayounsi	T317175 Junos: resolve DNS through mgmt_junos
Resolved		ayounsi	T327862 Use mgmt_junos on all network devices
			Restricted Task
Open		None	T316539 Upgrade network devices to Junos 20+
Open		cmooney	T316544 Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+
Open		dcaro	T297083 [ceph] Getting rack level HA
Resolved		nskaggs	T329498 [ceph] Move cloudcephosd1001 (b7) and cloudcephosd1002 (b4) to rack e4
Resolved	BUG REPORT	dcaro	T329535 Cloud Ceph outage 2023-02-13
In Progress		dcaro	T329709 [cookbooks.ceph] Add a cookbook to drain a ceph osd in a safe manner
Resolved		dcaro	T329711 [ceph] Add monitoring for inter-osd/mon/cloudvirt connectivity
Open		dcaro	T329778 [ceph] Investigate if there's a way to degrade instead of failing when jumbo frames are being dropped in the network
Resolved	Request	Papaul	T330754 hw troubleshooting: Link hard down (probably cable) for cloudcephosd2002-dev.codfw.wmnet
Resolved		cmooney	T329799 Add network-layer protections to avoid inadvertently lowering IRB MTU

Event Timeline

dcaro created this task. Feb 13 2023, 11:39 AM

Restricted Application added a subscriber: Aklapper. Feb 13 2023, 11:39 AM

Change 888659 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs ceph:Move cloudcephosd1001/1002 to e4

https://gerrit.wikimedia.org/r/888659

gerritbot added a project: Patch-For-Review. Feb 13 2023, 11:40 AM

dcaro mentioned this in T297083: [ceph] Getting rack level HA. Feb 13 2023, 11:53 AM

dcaro updated the task description.

@Jclark-ctr We can start with this one. It will need changes in Netbox too (for the 10.64.148.* IPs). I'm mostly available in European timezones, but I can accommodate others if needed.

Note that I selected the IPs in those ranges manually. I don't think anything new is coming in those ranges, so they should still be valid when we do the actual move.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-02-13T14:06:50Z] <wm-bot2> Set the ceph cluster for eqiad1 in maintenance, alert silence ids: 8fbf6bfd-eec1-4d81-8e0d-ea431d8411ee (T329498) - cookbook ran by dcaro@vulcanus

Icinga downtime and Alertmanager silence (ID=34f24a3a-279b-41cf-89ec-66102b211bda) set by dcaro@cumin1001 for 3:00:00 on 1 host(s) and their services with reason: moving racks

cloudcephosd1001.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=dcc24c74-6aaa-4607-8c93-ab2699307f18) set by dcaro@cumin1001 for 3:00:00 on 1 host(s) and their services with reason: moving racks

cloudcephosd1002.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host cloudcephosd1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host cloudcephosd1001.eqiad.wmnet with OS bullseye executed with errors:

cloudcephosd1001 (FAIL)
- Downtimed on Icinga/Alertmanager
- Unable to disable Puppet, the host may have been unreachable
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details

Relocated the servers to rack E4 and updated Netbox.
Switch: cloudsw1-e4-eqiad
cloudcephosd1001: ports 18, 19
cloudcephosd1002: ports 16, 17

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: cloudcephosd1001.eqiad.wmnet

cloudcephosd1001.eqiad.wmnet (FAIL)
- Host not found on Icinga, unable to downtime it
- Found physical host
- Management interface not found on Icinga, unable to downtime it
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Host is already powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above