
Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw
Closed, Resolved (Public)

Description

Currently scheduled for Wed Feb 7 16:00 UTC

The following server uplink moves need to be completed as part of the wider migration from our old top-of-rack switches in codfw to their new replacements. The work is just to move the cable, so we expect an interruption of 60 seconds or less per host. Moves will be sequential, so only one host will be disconnected at any given moment.

| Team | Host type | asw-a2-codfw int | lsw1-a2-codfw int |
| --- | --- | --- | --- |
| Data Persistence | ms-be2044 | xe-2/0/4 | ms-be2044 |
| Data Persistence | ms-be2074 | xe-2/0/9 | ms-be2074 |
| Data Persistence | ms-be2051 | xe-2/0/11 | ms-be2051 |
| Data Persistence | ms-fe2009 | xe-2/0/14 | ms-fe2009 |
| Data Persistence | ms-fe2013 | xe-2/0/31 | ms-fe2013 |
| Data Persistence | thanos-fe2001 | xe-2/0/3 | thanos-fe2001 |
| Infra Foundations | ganeti2029 | xe-2/0/27 | hosts |
| Infra Foundations | ganeti2030 | xe-2/0/28 | hosts |
| Machine Learning | ml-cache2001 | xe-2/0/8 | ml-cache2001 |
| Observability | kafka-logging2001 | xe-2/0/5 | kafka-logging2001 |
| Observability | logging-hd2001 | xe-2/0/32 | logging-hd2001 |
| Search Platform | elastic2087 | xe-2/0/0 | elastic2087 |
| Search Platform | elastic2037 | xe-2/0/1 | elastic2037 |
| Search Platform | elastic2038 | xe-2/0/2 | elastic2038 |
| Search Platform | elastic2055 | xe-2/0/12 | elastic2055 |
| Search Platform | elastic2088 | xe-2/0/13 | elastic2088 |
| Search Platform | elastic2073 | xe-2/0/29 | elastic2073 |
| Search Platform | elastic2074 | xe-2/0/30 | elastic2074 |
| Search Platform | wdqs2013 | xe-2/0/10 | wdqs2013 |
| Search Platform | wdqs2023 | xe-2/0/19 | wdqs2023 |
| Service Ops | mc2039 | xe-2/0/7 | mc2039 |
| Service Ops | mc2038 | xe-2/0/26 | mc2038 |
| | moss-be2001 | xe-2/0/6 | moss-be2001 |

We can track the details of the moves, and what needs to be done to prepare, in the Google sheet below. If no specific action is needed for a given type of host, just state that on the first tab.

https://docs.google.com/spreadsheets/d/1PlGGLclKFYR9XaqjOLibhiwwny0fOD8gLMwsNhIzGRo

Event Timeline

cmooney triaged this task as Medium priority. Jan 25 2024, 11:36 AM
cmooney created this task.
MatthewVernon subscribed.

swift will need depooling in codfw before this work.
Likewise the affected thanos-fe node.

I'm oncall that week, so if there's an outage at the time I may have to fling a spanner in the works. Hopefully not necessary!

This rack is physically ready for tomorrow.

Mentioned in SAL (#wikimedia-operations) [2024-02-07T14:44:25Z] <klausman@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on ml-cache2001.codfw.wmnet with reason: Machine network link move (T355861)

Mentioned in SAL (#wikimedia-operations) [2024-02-07T14:44:42Z] <klausman@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ml-cache2001.codfw.wmnet with reason: Machine network link move (T355861)

Icinga downtime and Alertmanager silence (ID=bab7a949-b7c4-40b3-b9f5-e00978a8ce0f) set by cmooney@cumin1002 for 1:00:00 on 4 host(s) and their services with reason: prepping for server uplink migration codfw rack a2

asw-a-codfw,cr[1-2]-codfw,lsw1-a2-codfw.mgmt

You're good to go re swift and thanos now.

> You're good to go re swift and thanos now.

Thanks!

Icinga downtime and Alertmanager silence (ID=075fbfb6-a879-438b-b065-55d67628e920) set by cmooney@cumin1002 for 0:30:00 on 22 host(s) and their services with reason: Migrating servers in codfw rack A2 to lsw1-a2-codfw

elastic[2037-2038,2055,2073-2074,2087-2088].codfw.wmnet,ganeti[2029-2030].codfw.wmnet,kafka-logging2001.codfw.wmnet,logging-hd2001.codfw.wmnet,lvs2011.codfw.wmnet,mc[2038-2039].codfw.wmnet,ml-cache2001.codfw.wmnet,ms-be[2051,2074].codfw.wmnet,ms-fe[2009,2013].codfw.wmnet,thanos-fe2001.codfw.wmnet,wdqs[2013,2023].codfw.wmnet
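The host lists in the downtime entries above are written in a folded bracket notation (ranges and comma-separated values inside `[...]`), similar to the node-set syntax used by tools such as ClusterShell. As a minimal sketch, assuming only the patterns that appear in this task (one numeric group per name, with optional text after the bracket), the expression can be expanded into individual hostnames as follows. The function names `split_top_level`, `expand_item` and `expand_hosts` are hypothetical helpers written for illustration, not part of any existing library.

```python
import re

# Matches the first bracketed group in a host pattern, e.g. "[2037-2038,2055]".
_GROUP = re.compile(r"\[([^\]]+)\]")


def split_top_level(expr):
    """Split an expression on commas that are not inside square brackets."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_item(item):
    """Expand one pattern such as 'mc[2038-2039].codfw.wmnet' into hostnames."""
    match = _GROUP.search(item)
    if match is None:
        return [item]  # plain hostname, nothing to expand
    prefix, suffix = item[:match.start()], item[match.end():]
    hosts = []
    for piece in match.group(1).split(","):
        if "-" in piece:
            low, high = piece.split("-")
            width = len(low)  # preserve zero padding, if any
            numbers = range(int(low), int(high) + 1)
        else:
            width = len(piece)
            numbers = [int(piece)]
        for number in numbers:
            # Recurse so that a second group in the suffix would also expand.
            hosts.extend(expand_item(f"{prefix}{number:0{width}d}{suffix}"))
    return hosts


def expand_hosts(expr):
    """Expand a full comma-separated expression into a flat list of hostnames."""
    return [host for item in split_top_level(expr) for host in expand_item(item)]


if __name__ == "__main__":
    silenced = (
        "elastic[2037-2038,2055,2073-2074,2087-2088].codfw.wmnet,"
        "ganeti[2029-2030].codfw.wmnet,kafka-logging2001.codfw.wmnet,"
        "logging-hd2001.codfw.wmnet,lvs2011.codfw.wmnet,"
        "mc[2038-2039].codfw.wmnet,ml-cache2001.codfw.wmnet,"
        "ms-be[2051,2074].codfw.wmnet,ms-fe[2009,2013].codfw.wmnet,"
        "thanos-fe2001.codfw.wmnet,wdqs[2013,2023].codfw.wmnet"
    )
    for host in expand_hosts(silenced):
        print(host)
```

Expanding the expression this way is a quick check that the number of names matches the "22 host(s)" reported by the downtime entry. For production use, a maintained node-set library would likely be a better choice than this hand-rolled parser.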

Mentioned in SAL (#wikimedia-operations) [2024-02-07T16:04:51Z] <vgutierrez> <topranks> Commencing server uplink moves from old switch to new in codfw rack A2 T355861

All servers have been moved and are pinging fine again. The interruption was very short for all of them apart from elastic2038: a badly seated NIC shifted when the cable was moved and had to be properly re-inserted into the motherboard, after which all was OK.

Any systems that were depooled/drained can now be brought back live. Many thanks to everyone for their help!
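The task description sets an expectation of 60 seconds or less of interruption per host, and the result above was confirmed by pinging. As a rough sketch of how that expectation could be checked, assuming reachability samples are collected per host as `(timestamp_seconds, reachable)` pairs, the longest outage can be computed like this. The sample format and the names `longest_outage` and `within_budget` are assumptions made for illustration; no timing data was recorded in this task.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds seen in time-ordered samples.

    An outage runs from the first failed probe to the next successful one.
    If the host is still down at the last sample, the outage is measured
    up to that last timestamp.
    """
    longest = 0.0
    down_since = None
    last_timestamp = None
    for timestamp, reachable in samples:
        last_timestamp = timestamp
        if not reachable and down_since is None:
            down_since = timestamp  # outage starts
        elif reachable and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None  # outage ends
    if down_since is not None and last_timestamp is not None:
        longest = max(longest, last_timestamp - down_since)
    return longest


def within_budget(samples: Iterable[Tuple[float, bool]], budget_s: float = 60.0) -> bool:
    """True if the longest observed outage does not exceed the budget."""
    return longest_outage(samples) <= budget_s


if __name__ == "__main__":
    # Hypothetical samples: one probe per second, two failed probes.
    quick_move = [(0, True), (1, False), (2, False), (3, True), (4, True)]
    print(longest_outage(quick_move), within_budget(quick_move))
```

The measured value depends on the probe interval, so it is only an upper-bound estimate of the real interruption.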

I've kicked off a rebalance of ganeti/A now that the maintenance is over.

cmooney claimed this task.

> I've kicked off a rebalance of ganeti/A now that the maintenance is over.

Thanks!