
Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw
Closed, Resolved (Public)

Description

Currently scheduled for Tue Feb 13 16:00 UTC

The following server uplink moves need to be completed as part of the wider migration from our old top-of-rack switches in codfw to their new replacements. The work consists only of moving the cable, so we expect an interruption of 60 seconds or less per host. Moves will be sequential, so only one host will be disconnected at any given moment.

| Team | Host type | asw-a4-codfw int | lsw1-a4-codfw int |
| --- | --- | --- | --- |
| Data Persistence | backup2004 | xe-4/0/0 | backup2004 |
| Data Persistence | backup2002 | xe-4/0/22 | backup2002 |
| Traffic | cp2027 | xe-4/0/23 | cp2027 |
| Traffic | cp2028 | xe-4/0/24 | cp2028 |
| Data persistence | db2183 | xe-4/0/14 | move |
| Data persistence | dbprov2001 | xe-4/0/18 | dbprov2001 |
| Search Platform | elastic2061 | xe-4/0/15 | elastic2061 |
| Search Platform | elastic2062 | xe-4/0/16 | elastic2062 |
| Search Platform | elastic2089 | xe-4/0/17 | elastic2089 |
| Infra Foundations | ganeti2027 | xe-4/0/33 | hosts |
| Service Ops | kafka-main2001 | xe-4/0/19 | kafka-main2001 |
| Observability | logstash2026 | xe-4/0/20 | logstash2026 |
| Observability | logstash2033 | xe-4/0/26 | logstash2033 |
| Service Ops | mc2055 | xe-4/0/34 | mc2055 |
| Service Ops | mc-gp2001 | xe-4/0/21 | mc-gp2001 |
| Data Persistence | ms-be2062 | xe-4/0/2 | ms-be2062 |
| Data Persistence | ms-be2060 | xe-4/0/4 | ms-be2060 |
| Data Persistence | ms-be2066 | xe-4/0/6 | ms-be2066 |
| Data Persistence | ms-be2070 | xe-4/0/8 | ms-be2070 |
| Data Persistence | ms-be2075 | xe-4/0/10 | ms-be2075 |

We can track the details of the moves, and what needs to be done to prepare, in the Google sheet linked below. If no specific action is needed for a given type of host, just state that on the first tab.

https://docs.google.com/spreadsheets/d/1PlGGLclKFYR9XaqjOLibhiwwny0fOD8gLMwsNhIzGRo

Event Timeline

cmooney triaged this task as Medium priority. (Jan 25 2024, 11:38 AM)
cmooney created this task.

Thank you. I will shut down media backups every time one of the hosts is affected, not just this one, to minimize failures.

MatthewVernon subscribed.

Once complete, I'll want to check the ms-be nodes are all happy (shouldn't be an issue).

Mentioned in SAL (#wikimedia-operations) [2024-02-13T14:26:40Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2061*,elastic2062*,elastic2089* for switch maintenance - bking@cumin2002 - T355863

Mentioned in SAL (#wikimedia-operations) [2024-02-13T14:26:45Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2061*,elastic2062*,elastic2089* for switch maintenance - bking@cumin2002 - T355863

Mentioned in SAL (#wikimedia-operations) [2024-02-13T14:47:58Z] <brett@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on cp[2027-2028].codfw.wmnet with reason: T355863

Mentioned in SAL (#wikimedia-operations) [2024-02-13T14:48:16Z] <brett@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cp[2027-2028].codfw.wmnet with reason: T355863

Mentioned in SAL (#wikimedia-operations) [2024-02-13T15:30:33Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[2061-2062,2089].codfw.wmnet with reason: T355863

Mentioned in SAL (#wikimedia-operations) [2024-02-13T15:30:51Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[2061-2062,2089].codfw.wmnet with reason: T355863

Mentioned in SAL (#wikimedia-operations) [2024-02-13T15:44:10Z] <topranks> moving netbox links and pre-configuring lsw1-a4-codfw for servers before network move T355863

Forgot to update earlier. Rack is physically ready

Icinga downtime and Alertmanager silence (ID=349240a0-30c3-4371-9418-7f1f46072237) set by cmooney@cumin1002 for 0:30:00 on 23 host(s) and their services with reason: Migrating servers in codfw rack A4 to lsw1-a4-codfw

backup[2002,2004].codfw.wmnet,cp[2027-2028].codfw.wmnet,db2183.codfw.wmnet,dbprov2001.codfw.wmnet,elastic[2061-2062,2089].codfw.wmnet,es2026.codfw.wmnet,ganeti[2027,2034].codfw.wmnet,kafka-main2001.codfw.wmnet,logstash[2026,2033].codfw.wmnet,mc2055.codfw.wmnet,mc-gp2001.codfw.wmnet,ml-serve2005.codfw.wmnet,ms-be[2060,2062,2066,2070,2075].codfw.wmnet
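
The host list above uses a compact bracket notation (comma-separated values and `start-end` ranges inside `[...]`). As a quick way to check it against the "23 host(s)" in the downtime message, here is a minimal Python sketch that expands it. The function name `expand_hosts` is hypothetical and the parser is hand-rolled for the patterns seen in this task only; it is not the tooling that produced the message.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand an expression like 'cp[2027-2028].codfw.wmnet,db2183.codfw.wmnet'
    into individual hostnames. Handles one bracket group per entry."""
    hosts = []
    # Split on commas that are not inside a [...] group.
    for part in re.split(r",(?![^\[]*\])", expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                start, end = item.split("-")
                width = len(start)  # keep any zero padding
                for n in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{n:0{width}d}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


DOWNTIMED = (
    "backup[2002,2004].codfw.wmnet,cp[2027-2028].codfw.wmnet,"
    "db2183.codfw.wmnet,dbprov2001.codfw.wmnet,"
    "elastic[2061-2062,2089].codfw.wmnet,es2026.codfw.wmnet,"
    "ganeti[2027,2034].codfw.wmnet,kafka-main2001.codfw.wmnet,"
    "logstash[2026,2033].codfw.wmnet,mc2055.codfw.wmnet,"
    "mc-gp2001.codfw.wmnet,ml-serve2005.codfw.wmnet,"
    "ms-be[2060,2062,2066,2070,2075].codfw.wmnet"
)

hosts = expand_hosts(DOWNTIMED)
print(len(hosts))  # 23, matching the downtime message
```

Hyphens in hostnames such as `mc-gp2001` or `ms-be` are left alone because range detection only looks inside the brackets.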

Mentioned in SAL (#wikimedia-operations) [2024-02-13T16:08:30Z] <topranks> moving codfw rack a4 server links T355863

All work completed, no issues to report :)

Swift looks happy, thanks :)

great, thanks for the update!

cmooney claimed this task.

Closing - thanks all for the help!

Mentioned in SAL (#wikimedia-operations) [2024-02-15T08:50:31Z] <moritzm> rebalance Ganeti codfw/A now that the switch maintenance for A5 and A6 are completed T355864 T355863