
Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw
Closed, Resolved (Public)

Description

Currently scheduled for Feb 28 16:00 UTC

The following server uplink moves need to be completed as part of the wider migration from our old top-of-rack switches in codfw to their new replacements. The work is just to move the cable, so we expect an interruption of 60 seconds or less per host. Moves will be sequential, so only one host will be disconnected at any given moment.

| Team | Host type | asw-b6-codfw int | lsw1-b6-codfw int |
| --- | --- | --- | --- |
| Core Platform / Data Persistence | restbase2024 | ge-6/0/26 | disruption |
| Core Platform? | maps2009 | ge-6/0/16 | advance |
| Data Engineering | aqs2005 | ge-6/0/34 | aqs2005 |
| Data Engineering | aqs2006 | ge-6/0/35 | aqs2006 |
| Data Engineering | aqs2007 | ge-6/0/36 | aqs2007 |
| Data Engineering | aqs2008 | ge-6/0/37 | aqs2008 |
| Data persistence | db2098 | ge-6/0/0 | move |
| Data persistence | db2110 | ge-6/0/1 | move |
| Data persistence | db2111 | ge-6/0/2 | move |
| Data persistence | db2124 | ge-6/0/4 | move |
| Data persistence | db2134 | ge-6/0/5 | move |
| Data persistence | db2096 | ge-6/0/29 | move |
| Data persistence | db2161 | ge-6/0/38 | move |
| Data persistence | db2162 | ge-6/0/39 | move |
| Data persistence | dbproxy2002 | ge-6/0/3 | dbproxy2002 |
| Machine Learning | ml-serve2006 | ge-6/0/27 | deployment-server |
| Search Platform | wcqs2001 | ge-6/0/32 | wcqs2001 |
| Service Ops | kubernetes2034 | ge-6/0/9 | drain |
| Service Ops | kubernetes2009 | ge-6/0/18 | drain |
| Service Ops | kubernetes2010 | ge-6/0/21 | drain |
| Service Ops | kubernetes2020 | ge-6/0/30 | drain |
| Service Ops | kubernetes2033 | ge-6/0/42 | drain |
| Service Ops | mw2325 | ge-6/0/6 | depool |
| Service Ops | mw2326 | ge-6/0/7 | depool |
| Service Ops | mw2327 | ge-6/0/8 | depool |
| Service Ops | mw2328 | ge-6/0/10 | depool |
| Service Ops | mw2329 | ge-6/0/11 | depool |
| Service Ops | mw2330 | ge-6/0/12 | depool |
| Service Ops | mw2331 | ge-6/0/13 | depool |
| Service Ops | mw2332 | ge-6/0/14 | depool |
| Service Ops | mw2333 | ge-6/0/15 | depool |
| Service Ops | mw2334 | ge-6/0/17 | depool |
| Service Ops | mw2428 | ge-6/0/19 | drain (is k8s) |
| Service Ops | mw2429 | ge-6/0/33 | drain (is k8s) |
| Service Ops | mw2430 | ge-6/0/40 | drain (is k8s) |
| Service Ops | mw2431 | ge-6/0/41 | drain (is k8s) |
| Service Ops | rdb2008 | ge-6/0/31 | rdb2008 |
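
As a rough bound on the plan above: the table lists 37 hosts, and with sequential moves of at most 60 seconds each, the cumulative link-down time is at most 37 minutes, with never more than one host disconnected at once. A minimal Python sketch of that arithmetic follows. The function name is hypothetical, and time spent between cable moves is not counted.

```python
from datetime import timedelta


def worst_case_link_down(host_count: int, per_host_seconds: int = 60) -> timedelta:
    """Upper bound on cumulative link-down time for sequential cable moves.

    Assumes each host is disconnected for at most `per_host_seconds` and that
    moves never overlap, as stated in the task description.
    """
    return timedelta(seconds=host_count * per_host_seconds)


# 37 hosts are listed for rack B6.
print(worst_case_link_down(37))
```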

We can track the details of the moves, and what needs to be done to prepare, in the Google sheet here. If no specific action is needed for a given type of host, just state that on the first tab.

https://docs.google.com/spreadsheets/d/1PlGGLclKFYR9XaqjOLibhiwwny0fOD8gLMwsNhIzGRo

Event Timeline

cmooney triaged this task as Medium priority. Jan 25 2024, 11:54 AM
cmooney created this task.

db2098 - backup slave @jcrespo
db2110 - slave
db2111 - slave
db2124 - slave
db2134 - m3 master (not used)
db2096 - slave
db2161 - slave
db2162 - slave
dbproxy2002 - not used

Mentioned in SAL (#wikimedia-operations) [2024-02-28T15:40:11Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 0:40:00 on 7 hosts with reason: Silence for maintenance T355871

Mentioned in SAL (#wikimedia-operations) [2024-02-28T15:40:29Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 7 hosts with reason: Silence for maintenance T355871

Mentioned in SAL (#wikimedia-operations) [2024-02-28T15:40:59Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T355871 - depooling db2110 db2111 db2124 db2134 db2096 db2161 db2162', diff saved to https://phabricator.wikimedia.org/P58085 and previous config saved to /var/cache/conftool/dbconfig/20240228-154043-arnaudb.json

Icinga downtime and Alertmanager silence (ID=1f99f40e-0648-48d6-a40a-a3ebae9e7b2b) set by cmooney@cumin1002 for 1:00:00 on 4 host(s) and their services with reason: prepping for server uplink migration codfw rack b6

asw-b-codfw,cr[1-2]-codfw,lsw1-b6-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-02-28T15:51:32Z] <topranks> configuring lsw1-b6-codfw in advance of server migration T355871

Icinga downtime and Alertmanager silence (ID=691919af-8b8a-4f2d-b390-eea3c6a54f5c) set by cmooney@cumin1002 for 0:30:00 on 37 host(s) and their services with reason: Migrating servers in codfw rack B6 to lsw1-b6-codfw

aqs[2005-2008].codfw.wmnet,db[2096,2098,2110-2111,2124,2134,2161-2162].codfw.wmnet,dbproxy2002.codfw.wmnet,kubernetes[2009-2010,2020,2033-2034].codfw.wmnet,maps2009.codfw.wmnet,ml-serve2006.codfw.wmnet,mw[2325-2334,2428-2431].codfw.wmnet,rdb2008.codfw.wmnet,restbase2024.codfw.wmnet,wcqs2001.codfw.wmnet
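
The host lists in the two downtime entries above use a compact bracket-range notation (it resembles ClusterShell NodeSet syntax). To cross-check such an expression against the stated host count, it can be expanded by hand. The following is a minimal sketch, assuming only a single bracket group per term with comma-separated numbers and ranges; `expand_hosts` is a hypothetical helper, not part of any tool mentioned in this task.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand a bracket-range host expression into individual hostnames.

    Handles terms such as "db[2096,2110-2111].codfw.wmnet" and plain names.
    Assumes at most one [...] group per term (an assumption of this sketch).
    """
    hosts = []
    # Split on commas that are not inside square brackets.
    for term in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", term)
        if m is None:
            hosts.append(term)  # plain hostname, nothing to expand
            continue
        prefix, ranges, suffix = m.groups()
        for part in ranges.split(","):
            start, _, end = part.partition("-")
            end = end or start  # a single number is a range of one
            width = len(start)  # preserve any zero padding
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


print(expand_hosts("asw-b-codfw,cr[1-2]-codfw,lsw1-b6-codfw.mgmt"))
```

Applied to the server list above, the expansion yields 37 hostnames, matching the "37 host(s)" in the downtime message; the network device list expands to the 4 hosts named in the earlier entry.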

Work completed: all servers have been moved to the new switch and are responding to ping again. No issues.

cmooney claimed this task.

Closing task - thanks all for the co-operation!