
Migrate servers in codfw racks C2 & C3 from asw to lsw
Closed, ResolvedPublic

Description

Currently scheduled for Thurs Sept 5th 16:30 UTC

As part of the scheduled refresh of switch equipment in codfw rows C and D, we need to move the network connections for servers in racks C2 and C3 from the old switches to the new ones.

Hosts in this rack are managed by the following teams:

Collaboration Services
Core Platform
Data Persistence
Data Platform
Infrastructure Foundations
Machine Learning
Observability
Search Platform
ServiceOps
Traffic

A full list of the specific hosts can be found below. We will use the sheet to plan the moves and co-ordinate with other SRE teams on actions required to ensure things go smoothly:

https://docs.google.com/spreadsheets/d/16xoZuDeC_-o6s70uEMnvdgn4BlT1f8__WPYprRuduIA#gid=462798674

Server links will be moved one by one from the old switch to the new one, so no two hosts will be offline at the same time.

Based on previous experience, each host is likely to lose connectivity for only ~10 seconds. However, it is inevitable that a small number of the new cables will not work, or that some minor glitch will occur during the move, so in an edge case a host may be offline for 2-3 minutes. On previous occasions this happened to roughly 1 in 20 hosts.
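The figures above can be sanity-checked with a quick back-of-envelope estimate (a sketch; 150 s is just the midpoint of the 2-3 minute range):

```python
# Rough expected-downtime estimate per host, using the figures above:
# ~10 s typical loss, ~1 in 20 hosts hitting a 2-3 minute glitch.
typical_s = 10
glitch_s = (2 * 60 + 3 * 60) / 2  # midpoint of the 2-3 minute range: 150 s
p_glitch = 1 / 20

expected_s = (1 - p_glitch) * typical_s + p_glitch * glitch_s
print(f"Expected interruption per host: ~{expected_s:.0f} s")  # ~17 s
```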

Event Timeline

cmooney triaged this task as Medium priority. Aug 22 2024, 12:12 PM
cmooney created this task.
Restricted Application added a subscriber: Aklapper.
cmooney updated the task description.

The server phab2002 mentioned here for Collaboration Services is standby and therefore ok to do anytime.

Data Persistence hosts are listed below:

Rack_C2  backup2009  n/a
Rack_C2  backup2006  n/a
Rack_C2  db2232      notifications_enabled: false
Rack_C2  db2231      notifications_enabled: false
Rack_C3  db2141      dbstore s1
Rack_C3  db2169      s6
Rack_C3  db2150      s7
Rack_C3  db2186      sanitarium multi
Rack_C3  db2191      x1
Rack_C3  db2144      x2

> The server phab2002 mentioned here for Collaboration Services is standby and therefore ok to do anytime.

Thanks for confirming!

Further Data Persistence nodes (Ceph / Swift) in C2:

C2  moss-be2003  needs maintenance mode setting (and unsetting afterwards)
C2  moss-fe2001  needs depooling (and repool afterwards)
C2  ms-be2055    just needs checking afterwards
C2  ms-be2068    just needs checking afterwards
C2  ms-fe2011    needs depooling (and repool afterwards)

There's no Swift/Ceph in C3. I can do these at the end of the staff meeting today.
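The pre/post actions listed above lend themselves to a simple per-host checklist. A minimal sketch (the action strings are descriptive labels taken from the table, not real commands):

```python
# Pre/post-move checklist for the Swift/Ceph hosts above.
# (Host names from the comment; actions are labels only, not commands.)
actions = {
    "moss-be2003": ("set maintenance mode", "unset maintenance mode"),
    "moss-fe2001": ("depool", "repool"),
    "ms-be2055": (None, "check health"),
    "ms-be2068": (None, "check health"),
    "ms-fe2011": ("depool", "repool"),
}

# Hosts that need something done before their link is moved.
pre = [host for host, (before, _) in actions.items() if before]
print("hosts needing action before the move:", sorted(pre))
```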

Mentioned in SAL (#wikimedia-operations) [2024-09-05T14:55:31Z] <claime> depooling kubernetes nodes for T373096 - kubernetes2017 kubernetes2021 kubernetes2038 kubernetes2039 mw2335 mw2336 mw2337 mw2338 mw2412 mw2413 mw2414 mw2415 mw2416 mw2417 mw2418 mw2419 wikikube-worker2019

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:07:25Z] <fabfur@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp2035.codfw.wmnet with reason: T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:07:37Z] <fabfur@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2035.codfw.wmnet with reason: T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:08:06Z] <fabfur@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp2036.codfw.wmnet with reason: T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:08:19Z] <fabfur@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp2036.codfw.wmnet with reason: T373096

Hosts cp203[5-6] downtimed and depooled

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:21:07Z] <topranks> prep lsw1-c2-codfw for server migration from asw-c2-codfw T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T15:30:40Z] <topranks> prep lsw1-c3-codfw for server migration from asw-c3-codfw T373096

Icinga downtime and Alertmanager silence (ID=8726666c-096a-491c-b6d3-edc93e2996f1) set by cmooney@cumin1002 for 0:30:00 on 1 host(s) and their services with reason: Move backup2006 uplink to lsw1-c2-codfw

backup2006.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:30:08Z] <Emperor> moss-be2003 to maintenance mode T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:30:44Z] <Emperor> depool moss-fe2001 ms-fe2011 T373096

@cmooney all good to go from a Swift/Ceph perspective, thanks for your patience

> @cmooney all good to go from a Swift/Ceph perspective, thanks for your patience

Much obliged!

Icinga downtime and Alertmanager silence (ID=cde90074-86b4-49ac-9878-436a5d041f2b) set by cmooney@cumin2002 for 0:30:00 on 23 host(s) and their services with reason: Move server uplinks codfw racks C2

backup[2006,2009].codfw.wmnet,cephosd2002.codfw.wmnet,cp[2035-2036].codfw.wmnet,db[2231-2232].codfw.wmnet,elastic[2098-2099].codfw.wmnet,ganeti[2035-2036].codfw.wmnet,kafka-logging2003.codfw.wmnet,logging-hd2002.codfw.wmnet,logging-sd2003.codfw.wmnet,lvs2013.codfw.wmnet,ml-cache2003.codfw.wmnet,ml-serve2010.codfw.wmnet,moss-be2003.codfw.wmnet,moss-fe2001.codfw.wmnet,ms-be[2055,2068].codfw.wmnet,ms-fe2011.codfw.wmnet,wdqs2017.codfw.wmnet
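The host lists in these downtime entries use cumin/ClusterShell bracket syntax. A simplified sketch of expanding such an expression into individual hostnames (the real parser is ClusterShell's NodeSet; this version assumes at most one bracket group per name):

```python
import re

def expand(nodeset: str) -> list[str]:
    """Expand a cumin-style list like 'cp[2035-2036].codfw.wmnet'
    into individual hostnames (simplified sketch)."""
    # Split on commas that are not inside square brackets.
    parts = re.split(r",(?![^\[]*\])", nodeset)
    hosts = []
    for part in parts:
        m = re.match(r"^(.*)\[([^\]]+)\](.*)$", part)
        if not m:
            hosts.append(part)  # plain hostname, no bracket group
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            if "-" in item:  # numeric range, zero-padded to original width
                lo, hi = item.split("-")
                width = len(lo)
                hosts.extend(f"{prefix}{n:0{width}d}{suffix}"
                             for n in range(int(lo), int(hi) + 1))
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts

print(expand("backup[2006,2009].codfw.wmnet,cp[2035-2036].codfw.wmnet"))
```

Running it over the full C2 list above yields the 23 hosts the downtime entry reports.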

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:42:46Z] <topranks> move server uplinks codfw rack c2 from asw-c2-codfw to lsw1-c2-codfw T373096

Icinga downtime and Alertmanager silence (ID=07e91a47-4c42-404a-bc7d-ad277bbf3e2b) set by cmooney@cumin2002 for 0:30:00 on 34 host(s) and their services with reason: Move server uplinks codfw racks C3

cassandra-dev2002.codfw.wmnet,conf2005.codfw.wmnet,db[2141,2144,2150,2169,2186,2191].codfw.wmnet,deploy2002.codfw.wmnet,kubernetes[2017,2021,2038-2039].codfw.wmnet,maps2007.codfw.wmnet,ml-serve2007.codfw.wmnet,mw[2335-2338,2412-2419].codfw.wmnet,mwlog2002.codfw.wmnet,mwmaint2002.codfw.wmnet,phab2002.codfw.wmnet,prometheus2006.codfw.wmnet,puppetserver2001.codfw.wmnet,rdb2009.codfw.wmnet,wikikube-worker2019.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-09-05T16:59:06Z] <topranks> move server uplinks codfw rack c3 from asw-c3-codfw to lsw1-c3-codfw T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T17:04:49Z] <Emperor> moss-be2003 exit maintenance mode T373096

Mentioned in SAL (#wikimedia-operations) [2024-09-05T17:05:13Z] <Emperor> pool moss-fe2001 ms-fe2011 T373096

All links moved and all hosts are now responding to ping again. The average interruption was in the region of a few seconds, thanks to @Jhancock.wm :)

[2024-09-05T17:00:37.559309] 2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default
[2024-09-05T17:00:42.004240] 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default
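The two link-state lines above bracket the carrier loss on eno1; parsing their timestamps gives the actual interruption for that host:

```python
from datetime import datetime

# Timestamps copied from the two log lines above.
down = datetime.fromisoformat("2024-09-05T17:00:37.559309")
up = datetime.fromisoformat("2024-09-05T17:00:42.004240")
print(f"eno1 carrier lost for {(up - down).total_seconds():.1f} s")  # 4.4 s
```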

Swift / Ceph back to normal, thanks!

Mentioned in SAL (#wikimedia-operations) [2024-09-05T17:08:30Z] <claime> Repooling kubernetes nodes after T373096 - kubernetes2017 kubernetes2021 kubernetes2038 kubernetes2039 mw2335 mw2336 mw2337 mw2338 mw2412 mw2413 mw2414 mw2415 mw2416 mw2417 mw2418 mw2419 wikikube-worker2019

cmooney claimed this task.