Migrate servers in codfw racks C4 & C5 from asw to lsw
Closed, Resolved · Public

Description

Currently scheduled for Tue Sept 10th 2024 16:00 UTC

As part of the scheduled refresh of switch equipment in codfw rows C and D, we need to move the network connections for servers in racks C4 and C5 from the old switches (asw) to the new ones (lsw).

Hosts in these racks are managed by the following teams:

Collaboration Services
Core Platform
Data Persistence
Data Platform
Infrastructure Foundations
Machine Learning
Observability
Search Platform
ServiceOps
Traffic

A full list of the specific hosts can be found below. We will use the sheet to plan the moves and co-ordinate with other SRE teams on actions required to ensure things go smoothly:

https://docs.google.com/spreadsheets/d/16xoZuDeC_-o6s70uEMnvdgn4BlT1f8__WPYprRuduIA#gid=1455278591

Server links will be moved one by one from the old switch to the new one, so no two hosts will be offline at once.

Based on previous experience, each host is likely to lose connectivity for only ~10 seconds. However, it is inevitable that a small number of the new cables will not work, or that there will be some minor glitch in the move, so in an edge case a host may be offline for 2-3 minutes. On previous occasions this happened with about 1 in 20 hosts.

Event Timeline

cmooney triaged this task as Medium priority. Aug 22 2024, 12:16 PM
cmooney created this task.
Restricted Application added a subscriber: Aklapper.

There are 4 swift servers in C4 (ms-be2058, ms-be2064, ms-be2072, ms-be2077); they'll need checking afterwards.

I'm on leave 9-13 September. @Eevans, are you OK to check that dispersion/replication is fine afterwards, please?

Yup; Can do!

I depooled gitlab-runner2003 for tomorrow's maintenance.

Thanks!

db2114 is decommed (see T362948)

OK, duly noted. I'll mention this to dc-ops and they can decide how best to proceed.

Mentioned in SAL (#wikimedia-operations) [2024-09-10T14:06:36Z] <claime> Depooling kubernetes2040.codfw.wmnet kubernetes2041.codfw.wmnet kubernetes2058.codfw.wmnet mw2440.codfw.wmnet mw2442.codfw.wmnet mw2443.codfw.wmnet parse2011.codfw.wmnet parse2012.codfw.wmnet parse2013.codfw.wmnet wikikube-worker2039.codfw.wmnet - T373097

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:08:54Z] <sukhe@puppetmaster1001> conftool action : set/pooled=no; selector: name=dns2005.wikimedia.org [reason: T373097 codfw maintenance]

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:45:59Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: network maintenance T373097

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:46:05Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: network maintenance T373097

db/es hosts have been depooled

thanks for confirming!

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:56:30Z] <topranks> move server uplinks in Netbox from asw-c5-codfw to lsw1-c5-codfw to prep physical moves T373097

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:57:08Z] <topranks> move server uplinks in Netbox from asw-c4-codfw to lsw1-c4-codfw to prep physical moves T373097

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:57:59Z] <topranks> push server and vlan configuration to lsw1-c4-codfw with Homer to prep physical moves T373097

Mentioned in SAL (#wikimedia-operations) [2024-09-10T15:59:44Z] <topranks> push server and vlan configuration to lsw1-c5-codfw with Homer to prep physical moves T373097

Icinga downtime and Alertmanager silence (ID=c5ef5c49-317c-49af-b11b-61e58fe45620) set by cmooney@cumin1002 for 0:30:00 on 23 host(s) and their services with reason: Move server uplinks codfw racks C4

backup2003.codfw.wmnet,db[2184,2233].codfw.wmnet,dbprov2004.codfw.wmnet,elastic[2065-2066,2081-2082,2100-2101].codfw.wmnet,ganeti[2037-2038].codfw.wmnet,kafka-stretch2001.codfw.wmnet,logstash2035.codfw.wmnet,mc[2047-2048].codfw.wmnet,ms-backup2001.codfw.wmnet,ms-be[2058,2064,2072,2077].codfw.wmnet,wdqs[2011,2018].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:02:58Z] <topranks> commence maintenance - move server uplinks from old to new switch codfw rack C4 T373097

Icinga downtime and Alertmanager silence (ID=a5d7ae66-6b48-4bdb-8951-87b0e41404de) set by cmooney@cumin1002 for 0:20:00 on 26 host(s) and their services with reason: Move server uplinks codfw racks C5

db[2126,2165-2166,2192,2208-2209].codfw.wmnet,dns2005.wikimedia.org,es2037.codfw.wmnet,ganeti[2011-2012].codfw.wmnet,gitlab-runner2003.codfw.wmnet,kafka-main2008.codfw.wmnet,kubernetes[2040-2041,2058].codfw.wmnet,ml-serve2003.codfw.wmnet,mw[2440,2442-2443].codfw.wmnet,parse[2011-2013].codfw.wmnet,pc2015.codfw.wmnet,restbase[2025,2033].codfw.wmnet,wikikube-worker2039.codfw.wmnet
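The two downtime entries above give their host lists in a folded bracket notation (for example `db[2184,2233].codfw.wmnet`). As a rough aid for cross-checking those lists against the planning sheet, the following is a minimal standalone sketch that expands the notation and confirms the stated counts of 23 and 26 hosts. The function names `split_top_level` and `expand_hosts` are hypothetical and written for this illustration; this is not the tooling that produced the entries, and it assumes only the simple `prefix[a,b-c]suffix` form seen in this task.

```python
import re

# Folded host lists copied from the two downtime entries in this task.
C4_HOSTS = (
    "backup2003.codfw.wmnet,db[2184,2233].codfw.wmnet,dbprov2004.codfw.wmnet,"
    "elastic[2065-2066,2081-2082,2100-2101].codfw.wmnet,ganeti[2037-2038].codfw.wmnet,"
    "kafka-stretch2001.codfw.wmnet,logstash2035.codfw.wmnet,mc[2047-2048].codfw.wmnet,"
    "ms-backup2001.codfw.wmnet,ms-be[2058,2064,2072,2077].codfw.wmnet,"
    "wdqs[2011,2018].codfw.wmnet"
)
C5_HOSTS = (
    "db[2126,2165-2166,2192,2208-2209].codfw.wmnet,dns2005.wikimedia.org,"
    "es2037.codfw.wmnet,ganeti[2011-2012].codfw.wmnet,gitlab-runner2003.codfw.wmnet,"
    "kafka-main2008.codfw.wmnet,kubernetes[2040-2041,2058].codfw.wmnet,"
    "ml-serve2003.codfw.wmnet,mw[2440,2442-2443].codfw.wmnet,parse[2011-2013].codfw.wmnet,"
    "pc2015.codfw.wmnet,restbase[2025,2033].codfw.wmnet,wikikube-worker2039.codfw.wmnet"
)


def split_top_level(expr):
    """Split on commas that are not inside a [...] group."""
    parts, depth, current = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand_hosts(expr):
    """Expand 'name[a,b-c].domain' groups into individual hostnames."""
    hosts = []
    for part in split_top_level(expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", part)
        if not m:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, body, suffix = m.groups()
        for item in body.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo  # a single number is a range of one
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{len(lo)}d}{suffix}")
    return hosts


if __name__ == "__main__":
    print(len(expand_hosts(C4_HOSTS)))  # 23, matching the C4 downtime entry
    print(len(expand_hosts(C5_HOSTS)))  # 26, matching the C5 downtime entry
```

The expanded C4 list includes the four swift backends (ms-be2058, ms-be2064, ms-be2072, ms-be2077) flagged earlier for post-move checks.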

Move done. All migrated hosts are pinging again; no issues to report.

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:18:38Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: T373097', diff saved to https://phabricator.wikimedia.org/P68819 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:18:40Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es2037 (re)pooling @ 25%: T373097', diff saved to https://phabricator.wikimedia.org/P68820 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:18:43Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: T373097', diff saved to https://phabricator.wikimedia.org/P68821 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:18:46Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2208 (re)pooling @ 25%: T373097', diff saved to https://phabricator.wikimedia.org/P68822 and previous config saved to /var/cache/conftool/dbconfig/20240910-161832-arnaudb.json

db/es hosts are repooling in a tmux session of mine on cumin1002.

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:33:37Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68825 and previous config saved to /var/cache/conftool/dbconfig/20240910-163337-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:33:47Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68826 and previous config saved to /var/cache/conftool/dbconfig/20240910-163337-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:33:54Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es2037 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68827 and previous config saved to /var/cache/conftool/dbconfig/20240910-163338-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:34:00Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2208 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68828 and previous config saved to /var/cache/conftool/dbconfig/20240910-163338-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:34:06Z] <claime> Repooled kubernetes2040.codfw.wmnet kubernetes2041.codfw.wmnet kubernetes2058.codfw.wmnet mw2440.codfw.wmnet mw2442.codfw.wmnet mw2443.codfw.wmnet parse2011.codfw.wmnet parse2012.codfw.wmnet parse2013.codfw.wmnet wikikube-worker2039.codfw.wmnet - T373097

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:37:23Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2165 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68829 and previous config saved to /var/cache/conftool/dbconfig/20240910-163722-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:37:42Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: T373097', diff saved to https://phabricator.wikimedia.org/P68830 and previous config saved to /var/cache/conftool/dbconfig/20240910-163742-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:42:15Z] <sukhe@puppetmaster1001> conftool action : set/pooled=yes; selector: name=dns2005.wikimedia.org [reason: end: T373097 codfw maintenance]

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:48:43Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68833 and previous config saved to /var/cache/conftool/dbconfig/20240910-164842-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:48:46Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68834 and previous config saved to /var/cache/conftool/dbconfig/20240910-164842-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:48:48Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68835 and previous config saved to /var/cache/conftool/dbconfig/20240910-164842-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:48:51Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2208 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68836 and previous config saved to /var/cache/conftool/dbconfig/20240910-164843-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:52:29Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2165 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68837 and previous config saved to /var/cache/conftool/dbconfig/20240910-165228-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T16:52:48Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2192 (re)pooling @ 75%: T373097', diff saved to https://phabricator.wikimedia.org/P68838 and previous config saved to /var/cache/conftool/dbconfig/20240910-165248-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T17:03:48Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68839 and previous config saved to /var/cache/conftool/dbconfig/20240910-170347-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T17:03:51Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68840 and previous config saved to /var/cache/conftool/dbconfig/20240910-170347-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T17:03:54Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'es2037 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68841 and previous config saved to /var/cache/conftool/dbconfig/20240910-170348-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T17:03:56Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2208 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68842 and previous config saved to /var/cache/conftool/dbconfig/20240910-170348-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T17:07:34Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68843 and previous config saved to /var/cache/conftool/dbconfig/20240910-170734-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-10T17:07:54Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: T373097', diff saved to https://phabricator.wikimedia.org/P68844 and previous config saved to /var/cache/conftool/dbconfig/20240910-170753-arnaudb.json

cmooney claimed this task.

Icinga downtime and Alertmanager silence (ID=5ff7d01a-40d8-4196-9008-7bf9b79ea4e8) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: Move ganeti2012 server uplink

ganeti2012.codfw.wmnet