Migrate hosts from codfw row A/B ASW to new LSW devices
Closed, Resolved (Public)

Description

Master task to track physical link move of existing hosts to new switches in codfw rows A and B.

Interruption

The move for each host involves disconnecting its switch uplink and re-connecting the cable to the new switch. This cable move takes approximately 10-20 seconds per host. Hosts will be moved sequentially, meaning no two hosts will be down at the same time.
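As a rough illustration of what the sequential approach means for a whole rack, the sketch below multiplies the 10-20 second per-host figure by a host count. The function name and the example count of 40 hosts are hypothetical, and the estimate covers cable-handling time only (not time spent moving between hosts), so treat it as a lower bound on the work window rather than a commitment.

```python
def move_window_seconds(host_count, per_host=(10, 20)):
    """Return (best, worst) total seconds to re-cable host_count hosts one at a time.

    per_host is the (min, max) interruption per host, taken from the
    10-20 second estimate in the task description.
    """
    low, high = per_host
    return host_count * low, host_count * high


def fmt(seconds):
    """Format a number of seconds as e.g. '6m40s'."""
    return f"{seconds // 60}m{seconds % 60:02d}s"


if __name__ == "__main__":
    # 40 is a hypothetical host count, not a figure for any specific rack.
    best, worst = move_window_seconds(40)
    print(f"Rack work window: {fmt(best)} to {fmt(worst)}")
```

Each individual host still only sees its own 10-20 second interruption; the total above is how long the rack as a whole has some host being moved.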

Schedule

A provisional schedule is laid out below. The first move happens before @Papaul goes on leave, so that he and @Jhancock.wm can finalize the process. After that we will do 3 per week (Tue-Thu) until they are all done (while Papaul is on leave).

The tasks for each rack, or the Google Sheet below, can be used to co-ordinate any actions (e.g. depooling or draining) that should be taken for a given set of hosts in advance of the disruption.

https://docs.google.com/spreadsheets/d/1PlGGLclKFYR9XaqjOLibhiwwny0fOD8gLMwsNhIzGRo

SREs should note that there is some flexibility here. If certain hosts cannot easily accommodate the interruption, or the timeframe is not suitable for them, we can skip those and come back to them in a second phase. Just note any such hosts on the task for that rack or in the Google Sheet.

| Device | Task | Date |
| --- | --- | --- |
| asw-b5-codfw | T355549 | Thursday Jan 25th 16:00 UTC |
| asw-b4-codfw | T355860 | Tuesday Feb 6th 16:00 UTC |
| asw-a2-codfw | T355861 | Wednesday Feb 7th 16:00 UTC |
| asw-a3-codfw | T355862 | Thursday Feb 8th 16:00 UTC |
| asw-a4-codfw | T355863 | Tuesday Feb 13th 16:00 UTC |
| asw-a5-codfw | T355864 | Wednesday Feb 14th 16:00 UTC |
| asw-a6-codfw | T355866 | Thursday Feb 15th 16:00 UTC |
| asw-a7-codfw | T355867 | Tuesday Feb 20th 16:00 UTC |
| asw-a8-codfw | T355874 | Wednesday Feb 21st 16:00 UTC |
| asw-b2-codfw | T355868 | Thursday Feb 22nd 16:00 UTC |
| asw-b3-codfw | T355870 | Tuesday Feb 27th 16:00 UTC |
| asw-b6-codfw | T355871 | Wednesday Feb 28th 16:00 UTC |
| asw-b7-codfw | T355872 | Thursday Feb 29th 16:00 UTC |
| asw-b8-codfw | T355873 | Tuesday March 5th 16:00 UTC |
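The "3 per week (Tue-Thu)" pattern in the schedule above can be sketched in a few lines. This is only an illustration of the cadence: the function name is hypothetical, and the start date (6 Feb 2024) and count (13 slots after the initial asw-b5-codfw move) are read off the schedule rather than taken from any tooling used for the migration.

```python
from datetime import date, datetime, timedelta, timezone


def tue_to_thu_slots(start, count, hour=16):
    """Return `count` UTC datetimes falling on Tue/Wed/Thu at `hour`:00,
    beginning on or after the `start` date."""
    slots = []
    day = start
    while len(slots) < count:
        # date.weekday(): Monday is 0, so Tue/Wed/Thu are 1, 2, 3.
        if day.weekday() in (1, 2, 3):
            slots.append(
                datetime(day.year, day.month, day.day, hour, tzinfo=timezone.utc)
            )
        day += timedelta(days=1)
    return slots


if __name__ == "__main__":
    for slot in tue_to_thu_slots(date(2024, 2, 6), 13):
        print(slot.strftime("%A %b %d %H:%M UTC"))
```

Run with those inputs, the first slot is Tuesday 6 Feb and the last is Tuesday 5 Mar, which lines up with the dates listed for asw-b4-codfw through asw-b8-codfw.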

Below is a list of the hosts affected grouped by team and date to help the planning:

Collaboration Services

Tue 05 Mar - Rack B8 - T355873

gitlab-runner2002

Core Platform / Data Persistence

Wed 14 Feb - Rack A5 - T355864

maps2005

Thu 22 Feb - Rack B2 - T355868

moss-be2002

Tue 27 Feb - Rack B3 - T355870

restbase2021
restbase2028

Wed 28 Feb - Rack B6 - T355871

maps2009
restbase2024

Tue 05 Mar - Rack B8 - T355873

restbase2014
restbase2029
restbase2030
sessionstore2001

Data Engineering

Thu 15 Feb - Rack A6 - T355866

aqs2001
aqs2002
aqs2003
aqs2004

Tue 20 Feb - Rack A7 - T355867

cephosd2001

Wed 28 Feb - Rack B6 - T355871

aqs2005
aqs2006
aqs2007
aqs2008

Data Persistence

Tue 06 Feb - Rack B4 - T355860

backup2005
backup2008
dbprov2002
ms-be2053
ms-be2057
ms-be2063
ms-be2067
ms-be2071

Wed 07 Feb - Rack A2 - T355861

ms-be2044
ms-be2051
ms-be2074
ms-fe2009
ms-fe2013
thanos-fe2001

Thu 08 Feb - Rack A3 - T355862

db2103
db2142
es2020

Tue 13 Feb - Rack A4 - T355863

backup2002
backup2004
db2183
dbprov2001
ms-be2060
ms-be2062
ms-be2066
ms-be2070
ms-be2075

Wed 14 Feb - Rack A5 - T355864

db2104
db2121
db2132
db2145
db2153
db2154
db2175
db2176
pc2011

Thu 15 Feb - Rack A6 - T355866

db2097
db2105
db2122
db2133
db2155
db2156
dbproxy2001
es2024
es2027
es2028

Tue 20 Feb - Rack A7 - T355867

ms-be2045
ms-be2052
thanos-be2001

Wed 21 Feb - Rack A8 - T355874

db2106
db2146

Thu 22 Feb - Rack B2 - T355868

ms-be2046
ms-be2076
ms-fe2010
ms-fe2014
thanos-fe2002

Tue 27 Feb - Rack B3 - T355870

db2108
db2123
es2021

Wed 28 Feb - Rack B6 - T355871

db2096
db2098
db2110
db2111
db2124
db2134
db2161
db2162
dbproxy2002

Thu 29 Feb - Rack B7 - T355872

ms-be2047
thanos-be2002

Tue 05 Mar - Rack B8 - T355873

db2148
db2163
db2164
db2185
db2189
es2025
es2029
es2030

Infra Foundations

Tue 06 Feb - Rack B4 - T355860

ganeti2031

Wed 07 Feb - Rack A2 - T355861

ganeti2029
ganeti2030

Tue 13 Feb - Rack A4 - T355863

ganeti2027

Wed 14 Feb - Rack A5 - T355864

ganeti2023
ganeti2024
puppetmaster2001
puppetserver2002

Tue 20 Feb - Rack A7 - T355867

ganeti-test2001
ganeti-test2002
ganeti-test2003
ganeti2028

Thu 29 Feb - Rack B7 - T355872

ganeti2032

Tue 05 Mar - Rack B8 - T355873

ganeti2019
ganeti2020

Machine Learning

Wed 07 Feb - Rack A2 - T355861

ml-cache2001

Wed 14 Feb - Rack A5 - T355864

ml-serve2001

Thu 15 Feb - Rack A6 - T355866

ml-staging2001

Thu 22 Feb - Rack B2 - T355868

ml-cache2002

Wed 28 Feb - Rack B6 - T355871

ml-serve2006

Observability

Tue 06 Feb - Rack B4 - T355860

logstash2027
logstash2034

Wed 07 Feb - Rack A2 - T355861

kafka-logging2001
logging-hd2001

Thu 08 Feb - Rack A3 - T355862

netmon2002

Tue 13 Feb - Rack A4 - T355863

logstash2026
logstash2033

Wed 14 Feb - Rack A5 - T355864

logstash2001

Thu 22 Feb - Rack B2 - T355868

kafka-logging2002
kafka-logging2004

Thu 29 Feb - Rack B7 - T355872

logstash2036

Search Platform

Tue 06 Feb - Rack B4 - T355860

elastic2058
elastic2070
elastic2095
elastic2096
elastic2097
wdqs2016

Wed 07 Feb - Rack A2 - T355861

elastic2037
elastic2038
elastic2055
elastic2073
elastic2074
elastic2087
elastic2088
wdqs2013
wdqs2023

Tue 13 Feb - Rack A4 - T355863

elastic2061
elastic2062
elastic2089

Tue 20 Feb - Rack A7 - T355867

elastic2039
elastic2040
elastic2056
elastic2069
elastic2075
elastic2076
elastic2090
elastic2091
wdqs2009
wdqs2020

Thu 22 Feb - Rack B2 - T355868

elastic2041
elastic2042
elastic2057
elastic2063
elastic2064
elastic2077
elastic2078
elastic2092
elastic2093
elastic2094
wdqs2010
wdqs2014
wdqs2024

Wed 28 Feb - Rack B6 - T355871

wcqs2001

Thu 29 Feb - Rack B7 - T355872

elastic2043
elastic2044
elastic2079
elastic2080

Tue 05 Mar - Rack B8 - T355873

wdqs2007

Service Ops

Tue 06 Feb - Rack B4 - T355860

kafka-main2002
mc-gp2002
mc2044
mc2045

Wed 07 Feb - Rack A2 - T355861

mc2038
mc2039

Thu 08 Feb - Rack A3 - T355862

mw2291
mw2292
mw2293
mw2294
mw2295
mw2296
mw2297
mw2298
mw2299
mw2300
mw2377
mw2378
mw2379
mw2380
mw2381
mw2382
mw2383
mw2384
mw2385
mw2386
mw2387
mw2388
mw2389
mw2390
mw2391
mw2392
mw2393
mw2394
mw2395
mw2396
mw2397
mw2398
mw2399
mw2400

Tue 13 Feb - Rack A4 - T355863

kafka-main2001
mc-gp2001
mc2055

Wed 14 Feb - Rack A5 - T355864

kubernetes2018
kubernetes2019
mw2401
mw2402
mw2403
mw2404
mw2405
mw2406
mw2407
mw2408
mw2409
mw2410
mw2411
mw2420
mw2421
mw2422
mw2423
parse2001
parse2002
parse2003
rdb2007

Thu 15 Feb - Rack A6 - T355866

kubernetes2007
kubernetes2008
kubernetes2027
kubernetes2028
kubernetes2055
kubernetes2059
kubernetes2060
mw2301
mw2302
mw2303
mw2304
mw2305
mw2306
mw2307
mw2308
mw2309
mw2424
mw2425
mw2426
mw2427

Tue 20 Feb - Rack A7 - T355867

mc2040
mc2041

Wed 21 Feb - Rack A8 - T355874

kubernetes2025
kubernetes2026
parse2004
parse2005

Thu 22 Feb - Rack B2 - T355868

mc2042
mc2043

Tue 27 Feb - Rack B3 - T355870

kubernetes2029
kubernetes2030
kubernetes2057
mw2259
mw2260
mw2261
mw2262
mw2263
mw2264
mw2265
mw2266
mw2267
mw2268
mw2269
mw2270
mw2310
mw2311
mw2312
mw2313
mw2314
mw2315
mw2316
mw2317
mw2318
mw2319
mw2320
mw2321
mw2322
mw2323
mw2324

Wed 28 Feb - Rack B6 - T355871

kubernetes2009
kubernetes2010
kubernetes2020
kubernetes2033
kubernetes2034
mw2325
mw2326
mw2327
mw2328
mw2329
mw2330
mw2331
mw2332
mw2333
mw2334
mw2428
mw2429
mw2430
mw2431
rdb2008

Thu 29 Feb - Rack B7 - T355872

mc2046

Tue 05 Mar - Rack B8 - T355873

kubernetes2035
kubernetes2054
mw2432
mw2433
mw2434
mw2435
parse2008
parse2009
parse2010

Traffic

Tue 06 Feb - Rack B4 - T355860

cp2033
cp2034

Tue 13 Feb - Rack A4 - T355863

cp2027
cp2028

Tue 20 Feb - Rack A7 - T355867

cp2029
cp2030

Thu 22 Feb - Rack B2 - T355868

cp2031
cp2032

Tue 27 Feb - Rack B3 - T355870

conf2004

Tue 05 Mar - Rack B8 - T355873

dns2004

WMCS

Tue 20 Feb - Rack A7 - T355867

cloudbackup2001
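For anyone wanting to rebuild a per-team view like the one above from a flat inventory, a minimal sketch follows. The `(team, rack, host)` tuple format and the function name are assumptions made for illustration; the task does not say how the list was actually generated. The sample rows are taken from the lists above.

```python
from collections import defaultdict


def group_by_team_and_rack(records):
    """Group (team, rack, host) tuples into {team: {rack: [hosts...]}}.

    Hosts within each rack are sorted by name.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for team, rack, host in records:
        grouped[team][rack].append(host)
    # Convert to plain dicts with sorted host lists.
    return {
        team: {rack: sorted(hosts) for rack, hosts in racks.items()}
        for team, racks in grouped.items()
    }


if __name__ == "__main__":
    sample = [
        ("Traffic", "B4", "cp2034"),
        ("Traffic", "B4", "cp2033"),
        ("Traffic", "A7", "cp2029"),
        ("WMCS", "A7", "cloudbackup2001"),
    ]
    for team, racks in group_by_team_and_rack(sample).items():
        print(team)
        for rack, hosts in racks.items():
            print(f"  Rack {rack}: {', '.join(hosts)}")
```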

Related Objects

| Status | Assigned |
| --- | --- |
| Resolved | cmooney |
| Resolved | cmooney |
| Resolved | Marostegui |
| Resolved | Marostegui |
| Resolved | klausman |
| Resolved | cmooney |
| Resolved | cmooney |
| Resolved | cmooney |
| Resolved | klausman |
| Resolved | cmooney |
| Resolved | Marostegui |
| Resolved | Marostegui |
| Resolved | Marostegui |
| Resolved | cmooney |
| Resolved | Marostegui |
| Resolved | Marostegui |
| Resolved | cmooney |
| Resolved | Marostegui |
| Resolved | Marostegui |
| Resolved | cmooney |
| Resolved | cmooney |
| Resolved | cmooney |
| Duplicate | None |
| Resolved | Marostegui |
| Resolved | cmooney |
| Resolved | cmooney |
| Resolved | cmooney |
| Resolved | cmooney |
| Resolved | cmooney |

Event Timeline

cmooney triaged this task as Medium priority. Jan 22 2024, 2:22 PM
cmooney created this task.

@cmooney I think that's doable. I'll block out my schedule for it.

moss-be* hosts should be @MatthewVernon unless I am mistaken, in which case, please accept my apologies in advance :)

Thanks yep!

Guilty as charged, but they're not in service at the moment, so no action needed around the move.

ssingh added a subscriber: Clement_Goubert.

As discussed in 998431, Traffic will be taking care of conf2004, so I have moved that host there. Thanks to @Clement_Goubert for the patch.

Mentioned in SAL (#wikimedia-operations) [2024-02-12T10:06:00Z] <cmooney@cumin1002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544"

Mentioned in SAL (#wikimedia-operations) [2024-02-12T10:07:58Z] <cmooney@cumin1002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544"

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:56:04Z] <topranks> disabling link from asw-a-codfw vc to ssw1-a1-codfw and ssw1-a8-codfw T355544

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:57:48Z] <cmooney@cumin1002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544"

Mentioned in SAL (#wikimedia-operations) [2024-02-22T16:58:39Z] <cmooney@cumin1002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Remove legacy codfw vc switches from synced hiera data after netbox status change - cmooney@cumin1002 - T355544"

Mentioned in SAL (#wikimedia-operations) [2024-02-22T17:05:33Z] <topranks> disabling IPv6 RAs for private1-a-codfw vlan on codfw core routers T355544

Mentioned in SAL (#wikimedia-operations) [2024-02-27T14:31:59Z] <fabfur> restarting pybal on lvs2014,lvs2011,lvs2012 and lvs2013 for T355544

Mentioned in SAL (#wikimedia-operations) [2024-02-27T14:39:07Z] <claime> depooling mw2325.codfw.wmnet,mw2326.codfw.wmnet,mw2327.codfw.wmnet,mw2328.codfw.wmnet,mw2329.codfw.wmnet,mw2330.codfw.wmnet,mw2331.codfw.wmnet,mw2332.codfw.wmnet,mw2333.codfw.wmnet,mw2334.codfw.wmnet for T355544

Mentioned in SAL (#wikimedia-operations) [2024-02-27T14:45:50Z] <claime> disregard previous depooling message for T355544

cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox

Mentioned in SAL (#wikimedia-operations) [2024-02-27T16:23:39Z] <fabfur> restarting pybal on lvs2014,lvs2011,lvs2012 and lvs2013 for T355544

Change 1007393 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Disable CR DHCP relay and IPv6 RA generation private1-b-codfw vlan

https://gerrit.wikimedia.org/r/1007393

Mentioned in SAL (#wikimedia-operations) [2024-02-28T16:25:08Z] <topranks> Disabling IPv6 RAs for private1-b-codfw vlan on codfw CR routers, moving GW to lsw/ssw T355544

Change 1007393 merged by jenkins-bot:

[operations/homer/public@master] Disable CR DHCP relay and IPv6 RA generation private1-b-codfw vlan

https://gerrit.wikimedia.org/r/1007393

cmooney claimed this task.

Closing task. Big thanks to all the SRE teams for the help and co-operation getting this one over the line :)