
Migrate servers in codfw racks C2 & C3 from asw to lsw
Closed, ResolvedPublic

Description

Currently scheduled for Thurs Sept 5th 16:30 UTC

As part of the scheduled refresh of switch equipment in codfw rows C and D, we need to move the network connections for servers in racks C2 and C3 from the old switches to the new ones.

Hosts in this rack are managed by the following teams:

Collaboration Services
Core Platform
Data Persistence
Data Platform
Infrastructure Foundations
Machine Learning
Observability
Search Platform
ServiceOps
Traffic

A full list of the specific hosts can be found below. We will use the sheet to plan the moves and co-ordinate with other SRE teams on actions required to ensure things go smoothly:

https://docs.google.com/spreadsheets/d/16xoZuDeC_-o6s70uEMnvdgn4BlT1f8__WPYprRuduIA#gid=462798674

Server links will be moved one by one from the old switch to the new one, so no two hosts will be offline at the same time.

Based on previous experience, each host is likely to lose connectivity for only ~10 seconds. However, it is inevitable that a small number of the new cables will not work, or that some minor glitch will occur during the move, so in an edge case a host may be offline for 2-3 minutes. On previous occasions this happened to roughly 1 in 20 hosts.
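The figures above can be sanity-checked with a quick back-of-envelope estimate (a sketch; 150 s is just the midpoint of the 2-3 minute range):

```python
# Rough expected-downtime estimate per host, using the figures above:
# ~10 s typical loss, ~1 in 20 hosts hitting a 2-3 minute glitch.
typical_s = 10
glitch_s = (2 * 60 + 3 * 60) / 2  # midpoint of the 2-3 minute range: 150 s
p_glitch = 1 / 20

expected_s = (1 - p_glitch) * typical_s + p_glitch * glitch_s
print(f"Expected interruption per host: ~{expected_s:.0f} s")  # ~17 s
```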

Event Timeline

cmooney triaged this task as Medium priority. Aug 22 2024, 12:12 PM
cmooney created this task.
Restricted Application added a subscriber: Aklapper.
cmooney updated the task description.

The server phab2002 mentioned here for Collaboration Services is standby and therefore ok to do anytime.

Data Persistence hosts are listed below:

Rack_C2  backup2009  n/a
Rack_C2  backup2006  n/a
Rack_C2  db2232      notifications_enabled: false
Rack_C2  db2231      notifications_enabled: false
Rack_C3  db2141      dbstore s1
Rack_C3  db2169      s6
Rack_C3  db2150      s7
Rack_C3  db2186      sanitarium multi
Rack_C3  db2191      x1
Rack_C3  db2144      x2

> The server phab2002 mentioned here for Collaboration Services is standby and therefore ok to do anytime.

Thanks for confirming!

Further Data Persistence nodes (Ceph / Swift) in C2:

C2  moss-be2003  needs maintenance mode setting (and unsetting afterwards)
C2  moss-fe2001  needs depooling (and repool afterwards)
C2  ms-be2055    just needs checking afterwards
C2  ms-be2068    just needs checking afterwards
C2  ms-fe2011    needs depooling (and repool afterwards)

There's no Swift/Ceph in C3. I can do these at the end of the staff meeting today.
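The pre/post actions listed above lend themselves to a simple per-host checklist. A minimal sketch (the action strings are descriptive labels taken from the table, not real commands):

```python
# Pre/post-move checklist for the Swift/Ceph hosts above.
# (Host names from the comment; actions are labels only, not commands.)
actions = {
    "moss-be2003": ("set maintenance mode", "unset maintenance mode"),
    "moss-fe2001": ("depool", "repool"),
    "ms-be2055": (None, "check health"),
    "ms-be2068": (None, "check health"),
    "ms-fe2011": ("depool", "repool"),
}

# Hosts that need something done before their link is moved.
pre = [host for host, (before, _) in actions.items() if before]
print("hosts needing action before the move:", sorted(pre))
```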

Mentioned in SAL (#wikimedia-operations) [2024-09-05T14:55:31Z] <claime> depooling kubernetes nodes for T373096 - kubernetes2017 kubernetes2021 kubernetes2038 kubernetes2039 mw2335 mw2336 mw2337 mw2338 mw2412 mw2413 mw2414 mw2415 mw2416 mw2417 mw2418 mw2419 wikikube-worker2019

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:07:25Z] <fabfur@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp2035.codfw.wmnet with reason: T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:07:37Z] <fabfur@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2035.codfw.wmnet with reason: T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:08:06Z] <fabfur@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp2036.codfw.wmnet with reason: T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:08:19Z] <fabfur@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2036.codfw.wmnet with reason: T373096

Hosts cp203[5-6] downtimed and depooled

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:21:07Z] <topranks> prep lsw1-c2-codfw for server migration from asw-c2-codfw T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:30:40Z] <topranks> prep lsw1-c3-codfw for server migration from asw-c3-codfw T373096

Icinga downtime and Alertmanager silence (ID=8726666c-096a-491c-b6d3-edc93e2996f1) set by cmooney@cumin1002 for 0:30:00 on 1 host(s) and their services with reason: Move backup2006 uplink to lsw1-c2-codfw

backup2006.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:30:08Z] <Emperor> moss-be2003 to maintenance mode T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:30:44Z] <Emperor> depool moss-fe2001 ms-fe2011 T373096

@cmooney all good to go from a Swift/Ceph perspective, thanks for your patience

> @cmooney all good to go from a Swift/Ceph perspective, thanks for your patience

Much obliged!

Icinga downtime and Alertmanager silence (ID=cde90074-86b4-49ac-9878-436a5d041f2b) set by cmooney@cumin2002 for 0:30:00 on 23 host(s) and their services with reason: Move server uplinks codfw racks C2

backup[2006,2009].codfw.wmnet,cephosd2002.codfw.wmnet,cp[2035-2036].codfw.wmnet,db[2231-2232].codfw.wmnet,elastic[2098-2099].codfw.wmnet,ganeti[2035-2036].codfw.wmnet,kafka-logging2003.codfw.wmnet,logging-hd2002.codfw.wmnet,logging-sd2003.codfw.wmnet,lvs2013.codfw.wmnet,ml-cache2003.codfw.wmnet,ml-serve2010.codfw.wmnet,moss-be2003.codfw.wmnet,moss-fe2001.codfw.wmnet,ms-be[2055,2068].codfw.wmnet,ms-fe2011.codfw.wmnet,wdqs2017.codfw.wmnet
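The host lists in these downtime entries use cumin/ClusterShell bracket syntax. A simplified sketch of expanding such an expression into individual hostnames (the real parser is ClusterShell's NodeSet; this version assumes at most one bracket group per name):

```python
import re

def expand(nodeset: str) -> list[str]:
    """Expand a cumin-style list like 'cp[2035-2036].codfw.wmnet'
    into individual hostnames (simplified sketch)."""
    # Split on commas that are not inside square brackets.
    parts = re.split(r",(?![^\[]*\])", nodeset)
    hosts = []
    for part in parts:
        m = re.match(r"^(.*)\[([^\]]+)\](.*)$", part)
        if not m:
            hosts.append(part)  # plain hostname, no bracket group
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            if "-" in item:  # numeric range, zero-padded to original width
                lo, hi = item.split("-")
                width = len(lo)
                hosts.extend(f"{prefix}{n:0{width}d}{suffix}"
                             for n in range(int(lo), int(hi) + 1))
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts

print(expand("backup[2006,2009].codfw.wmnet,cp[2035-2036].codfw.wmnet"))
```

Running it over the full C2 list above yields the 23 hosts the downtime entry reports.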

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:42:46Z] <topranks> move server uplinks codfw rack c2 from asw-c2-codfw to lsw1-c2-codfw T373096

Icinga downtime and Alertmanager silence (ID=07e91a47-4c42-404a-bc7d-ad277bbf3e2b) set by cmooney@cumin2002 for 0:30:00 on 34 host(s) and their services with reason: Move server uplinks codfw racks C3

cassandra-dev2002.codfw.wmnet,conf2005.codfw.wmnet,db[2141,2144,2150,2169,2186,2191].codfw.wmnet,deploy2002.codfw.wmnet,kubernetes[2017,2021,2038-2039].codfw.wmnet,maps2007.codfw.wmnet,ml-serve2007.codfw.wmnet,mw[2335-2338,2412-2419].codfw.wmnet,mwlog2002.codfw.wmnet,mwmaint2002.codfw.wmnet,phab2002.codfw.wmnet,prometheus2006.codfw.wmnet,puppetserver2001.codfw.wmnet,rdb2009.codfw.wmnet,wikikube-worker2019.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:59:06Z] <topranks> move server uplinks codfw rack c3 from asw-c3-codfw to lsw1-c3-codfw T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T17:04:49Z] <Emperor> moss-be2003 exit maintenance mode T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T17:05:13Z] <Emperor> pool moss-fe2001 ms-fe2011 T373096

All links moved and all hosts are now responding to ping again. The average interruption was in the region of a few seconds, thanks to @Jhancock.wm :)

[2024-09-05T17:00:37.559309] 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default
[2024-09-05T17:00:42.004240] 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default
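The two link-state lines above bracket the carrier loss on eno1; parsing their timestamps gives the actual interruption for that host:

```python
from datetime import datetime

# Timestamps copied from the two log lines above.
down = datetime.fromisoformat("2024-09-05T17:00:37.559309")
up = datetime.fromisoformat("2024-09-05T17:00:42.004240")
print(f"eno1 carrier lost for {(up - down).total_seconds():.1f} s")  # 4.4 s
```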

Swift / Ceph back to normal, thanks!

Mentioned in SAL (#wikimedia-operations) [2024-09-05T17:08:30Z] <claime> Repooling kubernetes nodes after T373096 - kubernetes2017 kubernetes2021 kubernetes2038 kubernetes2039 mw2335 mw2336 mw2337 mw2338 mw2412 mw2413 mw2414 mw2415 mw2416 mw2417 mw2418 mw2419 wikikube-worker2019

cmooney claimed this task.