
Migrate servers in codfw rack B8 from asw-b8-codfw to lsw1-b8-codfw
Closed, Resolved (Public)

Description

Currently scheduled for Mar 5 16:00 UTC

The following server uplink moves need to be completed as part of the wider migration from our old top-of-rack switches in codfw to their new replacements. The work is just to move the cable, so we expect an interruption of 60 seconds or less per host. Moves will be sequential, so only one host will be disconnected at any given moment.

| Team | Host type | asw-b8-codfw int | lsw1-b8-codfw int |
| --- | --- | --- | --- |
| Collaboration Services | gitlab-runner2002 | ge-8/0/34 | gitlab-runner2002 |
| Core Platform / Data Persistence | restbase2029 | ge-8/0/10 | disruption |
| Core Platform / Data Persistence | restbase2030 | ge-8/0/11 | disruption |
| Core Platform / Data Persistence | restbase2014 | ge-8/0/18 | disruption |
| Core Platform / Data Persistence | sessionstore2001 | ge-8/0/5 | disruption |
| Data persistence | db2148 | ge-8/0/12 | move |
| Data persistence | db2163 | ge-8/0/17 | move |
| Data persistence | db2185 | ge-8/0/28 | move |
| Data persistence | db2164 | ge-8/0/35 | move |
| Data persistence | db2189 | ge-8/0/36 | move |
| Data Persistence | es2025 | ge-8/0/3 | es2025 |
| Data Persistence | es2029 | ge-8/0/30 | es2029 |
| Data Persistence | es2030 | ge-8/0/32 | es2030 |
| Infra Foundations | ganeti2019 | ge-8/0/19 | hosts |
| Infra Foundations | ganeti2020 | ge-8/0/20 | hosts |
| Search Platform | wdqs2007 | ge-8/0/7 | wdqs2007 |
| Service Ops | kubernetes2035 | ge-8/0/29 | drain |
| Service Ops | kubernetes2054 | ge-8/0/44 | drain |
| Service Ops | mw2433 | ge-8/0/6 | depool |
| Service Ops | mw2434 | ge-8/0/15 | drain |
| Service Ops | mw2435 | ge-8/0/16 | drain |
| Service Ops | mw2432 | ge-8/0/27 | depool |
| Service Ops | parse2008 | ge-8/0/1 | depool |
| Service Ops | parse2009 | ge-8/0/8 | depool |
| Service Ops | parse2010 | ge-8/0/9 | depool |
| Traffic | dns2004 | ge-8/0/2 | dns2004 |
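
As a rough sanity check on the timing stated above, the sketch below multiplies the 26 table rows by the 60-second per-host ceiling. It only bounds the summed link-down time; the time spent between cable moves is not covered by the description and is not modelled here. The function and constant names are illustrative only.

```python
# Upper bound on cumulative link-down time for a sequential cable move.
# Assumptions: 26 hosts (rows in the table above) and at most 60 s of
# interruption per host, as stated in the task description.
MAX_INTERRUPTION_S = 60
HOSTS_IN_TABLE = 26


def worst_case_total_seconds(hosts: int, per_host_s: int = MAX_INTERRUPTION_S) -> int:
    """Sum of per-host interruptions when hosts are moved one at a time."""
    return hosts * per_host_s


total = worst_case_total_seconds(HOSTS_IN_TABLE)
print(f"{total} s = {total // 60} min of summed link-down time at most")
```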

We can track the details of the moves, and what needs to be done to prepare, in the Google sheet here. If no specific action is needed for a given type of host, just state that on the first tab.

https://docs.google.com/spreadsheets/d/1PlGGLclKFYR9XaqjOLibhiwwny0fOD8gLMwsNhIzGRo

Event Timeline

cmooney triaged this task as Medium priority. Jan 25 2024, 11:56 AM
cmooney created this task.

db2148 - slave
db2163 - slave
db2185 zarcillo dc master (nothing required)
db2164 - slave
db2189 - slave
es2029 - slave
es2030 - slave
es2030 - standalone

Mentioned in SAL (#wikimedia-operations) [2024-03-04T21:14:15Z] <inflatador> bking@cumin2002 depool wdqs2007 for T355873

Draining ganeti2019.codfw.wmnet of running VMs

Draining ganeti2020.codfw.wmnet of running VMs

Mentioned in SAL (#wikimedia-operations) [2024-03-05T14:35:10Z] <fabfur@cumin2002> START - Cookbook sre.hosts.downtime for 4:00:00 on dns2004.wikimedia.org with reason: T355873

Mentioned in SAL (#wikimedia-operations) [2024-03-05T14:35:25Z] <fabfur@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns2004.wikimedia.org with reason: T355873

Mentioned in SAL (#wikimedia-operations) [2024-03-05T15:43:32Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 0:40:00 on 8 hosts with reason: Silence for maintenance T355873

Mentioned in SAL (#wikimedia-operations) [2024-03-05T15:43:48Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 8 hosts with reason: Silence for maintenance T355873

Mentioned in SAL (#wikimedia-operations) [2024-03-05T15:44:01Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'T355873 - depooling db2148 db2163 db2185 db2164 db2189 es2025 es2029 es2030', diff saved to https://phabricator.wikimedia.org/P58489 and previous config saved to /var/cache/conftool/dbconfig/20240305-154400-arnaudb.json

Icinga downtime and Alertmanager silence (ID=19e5ce18-f2ba-4d9e-a80a-2c957c2eecad) set by cmooney@cumin1002 for 1:00:00 on 4 host(s) and their services with reason: prepping for server uplink migration codfw rack b8

asw-b-codfw,cr[1-2]-codfw,lsw1-b8-codfw.mgmt

Icinga downtime and Alertmanager silence (ID=f241631d-4830-4ac7-b5c1-29790ccbb916) set by cmooney@cumin1002 for 0:30:00 on 25 host(s) and their services with reason: Migrating servers in codfw rack B8 to lsw1-b8-codfw

db[2148,2163-2164,2185,2189].codfw.wmnet,dns2004.wikimedia.org,es[2025,2029-2030].codfw.wmnet,ganeti[2019-2020].codfw.wmnet,gitlab-runner2002.codfw.wmnet,kubernetes[2035,2054].codfw.wmnet,kubestage2002.codfw.wmnet,mw[2432-2435].codfw.wmnet,parse[2008-2010].codfw.wmnet,restbase[2029-2030].codfw.wmnet,wdqs2007.codfw.wmnet
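
The host lists in these downtime entries use a folded notation (`db[2148,2163-2164,...]`) that appears to follow the ClusterShell NodeSet style. To cross-check a folded list against the stated host count, a minimal expander can be sketched as below. This is a standalone illustration, not the tooling actually used on the cumin hosts; `split_top_level` and `expand_hosts` are hypothetical helper names, and only one bracket group per entry is handled.

```python
import re


def split_top_level(expr: str) -> list[str]:
    """Split on commas that are not inside square brackets."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_hosts(expr: str) -> list[str]:
    """Expand entries such as 'es[2025,2029-2030].codfw.wmnet' into hostnames."""
    hosts = []
    for part in split_top_level(expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if match is None:
            # Plain hostname with no bracket group, e.g. dns2004.wikimedia.org
            hosts.append(part)
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            if "-" in item:
                low, high = item.split("-", 1)
                width = len(low)  # preserve any zero padding
                for n in range(int(low), int(high) + 1):
                    hosts.append(f"{prefix}{str(n).zfill(width)}{suffix}")
            else:
                hosts.append(f"{prefix}{item}{suffix}")
    return hosts


print(expand_hosts("asw-b-codfw,cr[1-2]-codfw,lsw1-b8-codfw.mgmt"))
# ['asw-b-codfw', 'cr1-codfw', 'cr2-codfw', 'lsw1-b8-codfw.mgmt']
```

Applied to the folded list above, the same function yields 25 hostnames, matching the "25 host(s)" in the silence message.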

Mentioned in SAL (#wikimedia-operations) [2024-03-05T16:04:52Z] <topranks> commencing migration of servers in codfw rack b8 to lsw1-b8-codfw T355873

All links moved without problem, servers back online and responding to ping now.

cmooney claimed this task.