
Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw
Closed, Resolved, Public

Description

Currently scheduled for Feb 22 16:00 UTC

The following server uplink moves need to be completed as part of the wider migration from our old top-of-rack switches in codfw to their new replacements. The work is just to move the cable, so we expect an interruption of 60 seconds or less per host. Moves will be sequential, so only 1 host will be disconnected at any given moment.

| Team | Host type | asw-b2-codfw int | lsw1-b2-codfw int |
| --- | --- | --- | --- |
| Data Persistence | ms-be2046 | xe-2/0/4 | ms-be2046 |
| Data Persistence | ms-be2076 | xe-2/0/9 | ms-be2076 |
| Data Persistence | ms-fe2010 | xe-2/0/14 | ms-fe2010 |
| Data Persistence | ms-fe2014 | xe-2/0/32 | ms-fe2014 |
| Data Persistence | thanos-fe2002 | xe-2/0/3 | thanos-fe2002 |
| Machine Learning | ml-cache2002 | xe-2/0/8 | ml-cache2002 |
| Observability | kafka-logging2002 | xe-2/0/5 | kafka-logging2002 |
| Observability | kafka-logging2004 | xe-2/0/31 | kafka-logging2004 |
| Search Platform | elastic2041 | xe-2/0/1 | elastic2041 |
| Search Platform | elastic2042 | xe-2/0/2 | elastic2042 |
| Search Platform | elastic2057 | xe-2/0/11 | elastic2057 |
| Search Platform | elastic2063 | xe-2/0/19 | elastic2063 |
| Search Platform | elastic2064 | xe-2/0/20 | elastic2064 |
| Search Platform | elastic2077 | xe-2/0/29 | elastic2077 |
| Search Platform | elastic2078 | xe-2/0/30 | elastic2078 |
| Search Platform | elastic2092 | xe-2/0/33 | elastic2092 |
| Search Platform | elastic2093 | xe-2/0/34 | elastic2093 |
| Search Platform | elastic2094 | xe-2/0/35 | elastic2094 |
| Search Platform | wdqs2024 | xe-2/0/0 | wdqs2024 |
| Search Platform | wdqs2014 | xe-2/0/10 | wdqs2014 |
| Search Platform | wdqs2010 | xe-2/0/26 | wdqs2010 |
| Service Ops | mc2042 | xe-2/0/27 | mc2042 |
| Service Ops | mc2043 | xe-2/0/28 | mc2043 |
| Traffic | cp2031 | xe-2/0/12 | cp2031 |
| Traffic | cp2032 | xe-2/0/13 | cp2032 |
| | moss-be2002 | xe-2/0/6 | moss-be2002 |
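
The one-host-at-a-time approach described above could be checked with something like the following. This is a rough sketch only, not the tooling used for this maintenance: the function names (`ping_once`, `wait_until_up`, `check_in_order`) are hypothetical, and it assumes a Linux `ping` that accepts `-c` and `-W`. The 60-second budget comes from the expectation stated in the description, and is approximate because each failed ping also waits out its own timeout.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to `host` succeeds.

    Assumption: Linux iputils `ping` (-c count, -W reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_up(host, is_up=ping_once, budget_s=60, interval_s=1,
                  sleep=time.sleep):
    """Poll `host` until it answers or the attempt budget is used up."""
    attempts = max(1, budget_s // interval_s)
    for _ in range(attempts):
        if is_up(host):
            return True
        sleep(interval_s)
    return False


def check_in_order(hosts, **kwargs):
    """Check hosts one at a time, stopping at the first that stays down.

    Returns (recovered_hosts, failed_host_or_None), so that no further
    cable is moved while a previous host is still unreachable.
    """
    recovered = []
    for host in hosts:
        if not wait_until_up(host, **kwargs):
            return recovered, host
        recovered.append(host)
    return recovered, None
```

In practice each call to `wait_until_up` would be made right after the cable for that host has been moved, for example `wait_until_up("elastic2041.codfw.wmnet")`. The `is_up` and `sleep` parameters exist only so the logic can be exercised without touching the network.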

We can track the details of the moves, and what needs to be done to prepare, in the Google sheet here. If no specific action is needed for a given type of host, just state that on the first tab.

https://docs.google.com/spreadsheets/d/1PlGGLclKFYR9XaqjOLibhiwwny0fOD8gLMwsNhIzGRo

Event Timeline

cmooney triaged this task as Medium priority. Jan 25 2024, 11:46 AM
cmooney created this task.
MatthewVernon subscribed.

The affected thanos frontend will need depooling.
Similarly, swift in codfw will need depooling.

Mentioned in SAL (#wikimedia-operations) [2024-02-21T22:10:40Z] <ryankemper> [WDQS] T355868 Depooling wdqs2024, wdqs2014, wdqs2010 in anticipation of row maintenance

Mentioned in SAL (#wikimedia-operations) [2024-02-22T15:46:19Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2031-2032].codfw.wmnet with reason: T355868

Mentioned in SAL (#wikimedia-operations) [2024-02-22T15:46:49Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2031-2032].codfw.wmnet with reason: T355868

Icinga downtime and Alertmanager silence (ID=93a3c441-2097-4840-a202-5694f260c1b5) set by cmooney@cumin1002 for 1:00:00 on 4 host(s) and their services with reason: prepping for server uplink migration codfw rack b2

asw-b-codfw,cr[1-2]-codfw,lsw1-b2-codfw.mgmt

Icinga downtime and Alertmanager silence (ID=90864fe1-6d91-45db-a2a5-2bb22463c114) set by cmooney@cumin1002 for 0:30:00 on 25 host(s) and their services with reason: Migrating servers in codfw rack B2 to lsw1-b2-codfw

cp[2031-2032].codfw.wmnet,elastic[2041-2042,2057,2063-2064,2077-2078,2092-2094].codfw.wmnet,kafka-logging[2002,2004].codfw.wmnet,lvs2012.codfw.wmnet,mc[2042-2043].codfw.wmnet,ml-cache2002.codfw.wmnet,ms-be2076.codfw.wmnet,ms-fe[2010,2014].codfw.wmnet,thanos-fe2002.codfw.wmnet,wdqs[2010,2014,2024].codfw.wmnet
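
The host lists in the downtime entries above use a compact bracket notation (`cp[2031-2032]`, `ms-fe[2010,2014]`). As a minimal sketch of how such an expression could be expanded into individual hostnames, the following hand-rolled parser handles only the forms seen in this task (one bracket group per host pattern, numeric ranges and comma lists). The names `split_top_level` and `expand_hosts` are hypothetical; this is not the parser the cookbooks themselves use.

```python
import re

# Host expression copied from the 25-host downtime entry above.
DOWNTIME_HOSTS = (
    "cp[2031-2032].codfw.wmnet,"
    "elastic[2041-2042,2057,2063-2064,2077-2078,2092-2094].codfw.wmnet,"
    "kafka-logging[2002,2004].codfw.wmnet,lvs2012.codfw.wmnet,"
    "mc[2042-2043].codfw.wmnet,ml-cache2002.codfw.wmnet,"
    "ms-be2076.codfw.wmnet,ms-fe[2010,2014].codfw.wmnet,"
    "thanos-fe2002.codfw.wmnet,wdqs[2010,2014,2024].codfw.wmnet"
)


def split_top_level(expr):
    """Split on commas that are not inside [...] brackets."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_hosts(expr):
    """Expand patterns like 'cp[2031-2032].codfw.wmnet' into hostnames."""
    hosts = []
    for part in split_top_level(expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if not match:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, ranges, suffix = match.groups()
        for item in ranges.split(","):
            start, _, end = item.partition("-")
            end = end or start  # a single number is a range of one
            width = len(start)  # preserve any zero padding
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts
```

Calling `len(expand_hosts(DOWNTIME_HOSTS))` gives 25, which agrees with the "25 host(s)" reported in the downtime message, and `expand_hosts("asw-b-codfw,cr[1-2]-codfw,lsw1-b2-codfw.mgmt")` yields the 4 network devices from the earlier entry.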

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:00:21Z] <topranks> Commencing network maintenance migrating servers to new switch codfw rack B2 T355868

All hosts were moved successfully and are responding to pings again.

cp2031 and cp2032 are ok and repooled

cmooney claimed this task.

closing, thanks for the help!