Migrate servers in codfw racks D5 & D6 from asw to lsw
Closed, Resolved (Public)

Description

Currently scheduled for Wed Sept 18th 2024 16:00 UTC

As part of the scheduled refresh of switch equipment in codfw rows C and D, we need to move the network connections for servers in racks D5 and D6 from the old switches (asw) to the new ones (lsw).

Hosts in these racks are managed by the following teams:

Collaboration Services
Core Platform
Data Persistence
Data Platform
Infrastructure Foundations
Machine Learning
ServiceOps

A full list of the specific hosts can be found below. We will use the sheet to plan the moves and co-ordinate with other SRE teams on actions required to ensure things go smoothly:

https://docs.google.com/spreadsheets/d/16xoZuDeC_-o6s70uEMnvdgn4BlT1f8__WPYprRuduIA#gid=517372520

Server links will be moved one by one from the old switch to the new one, so no two hosts will be offline at once.

Based on previous experience, each host is likely to lose connectivity for only ~10 seconds. Inevitably, however, a small number of the new cables will not work, or there will be some minor glitch in the move, so in an edge case a host may be offline for 2-3 minutes. On previous occasions this happened with about 1 in 20 hosts.
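
As a rough sanity check on those figures, the sketch below estimates how many hosts might hit the longer outage. The host counts are taken from the two downtime entries further down this task (23 hosts in D5, 24 in D6). Treating each move as an independent event with a 1-in-20 chance is an assumption made purely for illustration, and all names in the code are hypothetical.

```python
# Back-of-the-envelope estimate; the independence assumption is illustrative only.
HOSTS_D5 = 23          # from the rack D5 downtime entry
HOSTS_D6 = 24          # from the rack D6 downtime entry
P_SLOW_MOVE = 1 / 20   # "about 1 out of 20 hosts" on previous occasions


def expected_slow_moves(hosts: int, p: float) -> float:
    """Expected number of hosts that hit the 2-3 minute case."""
    return hosts * p


def prob_at_least_one(hosts: int, p: float) -> float:
    """Chance that at least one host hits the slow case, assuming independence."""
    return 1 - (1 - p) ** hosts


total_hosts = HOSTS_D5 + HOSTS_D6
print(f"hosts moved: {total_hosts}")
print(f"expected slow moves: {expected_slow_moves(total_hosts, P_SLOW_MOVE):.2f}")
print(f"P(at least one slow move): {prob_at_least_one(total_hosts, P_SLOW_MOVE):.2f}")
```

Under these assumptions, roughly two hosts across both racks would be expected to need the longer recovery, so it is worth planning for at least one.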

Event Timeline

gitlab-runner2004 is a special-purpose runner, so if we depool it, some builds will not be possible during that time. I think it's better to leave it pooled and running; if a job is scheduled during the maintenance, users might have to retry. So no action is needed for gitlab-runner2004, imho.

All needed switchovers prior to tonight have been done. I'll run T375050 as soon as this is done, because circular replication will be enabled right after by @Ladsgroup and we are committed to keeping maintenance to a minimum.

> gitlab-runner2004 is a special-purpose runner, so if we depool it, some builds will not be possible during that time. I think it's better to leave it pooled and running; if a job is scheduled during the maintenance, users might have to retry. So no action is needed for gitlab-runner2004, imho.

Thanks for the confirmation.

> All needed switchovers prior to tonight have been done. I'll run T375050 as soon as this is done, because circular replication will be enabled right after by @Ladsgroup and we are committed to keeping maintenance to a minimum.

Thank you!

Draining ganeti2017.codfw.wmnet of running VMs

Mentioned in SAL (#wikimedia-operations) [2024-09-18T15:39:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'depool db2129 db2130 db2140 db2172 db2187 db2193 db2194 db2215 db2216 db2217 db2218 - T373104', diff saved to https://phabricator.wikimedia.org/P69261 and previous config saved to /var/cache/conftool/dbconfig/20240918-153922-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T15:39:51Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 0:30:00 on 12 hosts with reason: network maintenance T373104

Mentioned in SAL (#wikimedia-operations) [2024-09-18T15:40:15Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 12 hosts with reason: network maintenance T373104

Icinga downtime and Alertmanager silence (ID=7e878ed4-7126-4f45-87aa-d1087aacf81a) set by cmooney@cumin1002 for 0:30:00 on 23 host(s) and their services with reason: Move servers in codfw rack D5

db[2129,2172,2193,2215-2216].codfw.wmnet,ganeti2017.codfw.wmnet,gitlab-runner2004.codfw.wmnet,krb2002.codfw.wmnet,kubernetes[2024,2048-2049].codfw.wmnet,mc-wf2002.codfw.wmnet,ml-serve2008.codfw.wmnet,mw[2444-2447].codfw.wmnet,parse[2016-2017].codfw.wmnet,puppetmaster2002.codfw.wmnet,rdb2010.codfw.wmnet,restbase[2026-2027].codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:00:27Z] <topranks> moving servers in codfw rack D5 from asw-d5-codfw to lsw1-d5-codfw T373104

Icinga downtime and Alertmanager silence (ID=9cef1cb8-6d99-4d39-b2db-e242da2fe3f6) set by cmooney@cumin1002 for 0:25:00 on 24 host(s) and their services with reason: Move servers in codfw rack D6

aqs[2009-2012].codfw.wmnet,db[2130,2140,2187,2194,2217-2218].codfw.wmnet,dbproxy2004.codfw.wmnet,ganeti2026.codfw.wmnet,kafka-main2010.codfw.wmnet,kubernetes[2013-2014,2050-2051].codfw.wmnet,maps2010.codfw.wmnet,ml-serve2004.codfw.wmnet,mw[2448-2451].codfw.wmnet,restbase2034.codfw.wmnet
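
The host lists in the two downtime entries use a compact bracket notation, in which `db[2129,2172,2193,2215-2216].codfw.wmnet` stands for five hosts. The sketch below is a hypothetical, minimal expander for that notation. It is not the tooling that produced these lists, it handles only one bracket group per name, and the function and constant names are invented for illustration. It can be used to check that the lists match the stated counts of 23 and 24 hosts.

```python
import re
from typing import List

# Host lists copied from the two downtime entries in this task.
RACK_D5 = (
    "db[2129,2172,2193,2215-2216].codfw.wmnet,ganeti2017.codfw.wmnet,"
    "gitlab-runner2004.codfw.wmnet,krb2002.codfw.wmnet,"
    "kubernetes[2024,2048-2049].codfw.wmnet,mc-wf2002.codfw.wmnet,"
    "ml-serve2008.codfw.wmnet,mw[2444-2447].codfw.wmnet,"
    "parse[2016-2017].codfw.wmnet,puppetmaster2002.codfw.wmnet,"
    "rdb2010.codfw.wmnet,restbase[2026-2027].codfw.wmnet"
)
RACK_D6 = (
    "aqs[2009-2012].codfw.wmnet,db[2130,2140,2187,2194,2217-2218].codfw.wmnet,"
    "dbproxy2004.codfw.wmnet,ganeti2026.codfw.wmnet,kafka-main2010.codfw.wmnet,"
    "kubernetes[2013-2014,2050-2051].codfw.wmnet,maps2010.codfw.wmnet,"
    "ml-serve2004.codfw.wmnet,mw[2448-2451].codfw.wmnet,restbase2034.codfw.wmnet"
)

# prefix[body]suffix, with exactly one bracket group (a simplifying assumption).
_BRACKET = re.compile(r"^(?P<prefix>[^\[]*)\[(?P<body>[^\]]+)\](?P<suffix>.*)$")


def split_top_level(expr: str) -> List[str]:
    """Split on commas that are not inside square brackets."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand(expr: str) -> List[str]:
    """Expand a bracketed host expression into individual host names."""
    hosts = []
    for part in split_top_level(expr):
        match = _BRACKET.match(part)
        if match is None:
            hosts.append(part)  # plain host name, nothing to expand
            continue
        for item in match.group("body").split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number is a range of one
            for n in range(int(low), int(high) + 1):
                # Preserve the zero-padded width of the original number.
                hosts.append(
                    f"{match.group('prefix')}{n:0{len(low)}d}{match.group('suffix')}"
                )
    return hosts


print(len(expand(RACK_D5)), len(expand(RACK_D6)))
```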

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:07:10Z] <topranks> moving servers in codfw rack D6 from asw-d6-codfw to lsw1-d6-codfw T373104

All hosts moved and responding to ping again. Thanks all for the help!

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:23:17Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2129 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69262 and previous config saved to /var/cache/conftool/dbconfig/20240918-162316-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:23:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69263 and previous config saved to /var/cache/conftool/dbconfig/20240918-162321-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:23:27Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69264 and previous config saved to /var/cache/conftool/dbconfig/20240918-162326-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:23:32Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2172 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69265 and previous config saved to /var/cache/conftool/dbconfig/20240918-162331-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:23:42Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2193 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69266 and previous config saved to /var/cache/conftool/dbconfig/20240918-162341-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:23:47Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69267 and previous config saved to /var/cache/conftool/dbconfig/20240918-162346-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:23:53Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2215 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69268 and previous config saved to /var/cache/conftool/dbconfig/20240918-162351-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:24:00Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69269 and previous config saved to /var/cache/conftool/dbconfig/20240918-162357-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:24:03Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2217 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69270 and previous config saved to /var/cache/conftool/dbconfig/20240918-162401-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:24:07Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: T373104', diff saved to https://phabricator.wikimedia.org/P69271 and previous config saved to /var/cache/conftool/dbconfig/20240918-162406-arnaudb.json

nodes repooling, haproxy reloaded, thanks for the update @cmooney

@Ladsgroup I'll get to T375050

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:38:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2129 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69276 and previous config saved to /var/cache/conftool/dbconfig/20240918-163822-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:38:27Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69277 and previous config saved to /var/cache/conftool/dbconfig/20240918-163827-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:38:35Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2140 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69278 and previous config saved to /var/cache/conftool/dbconfig/20240918-163832-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:38:43Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2172 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69279 and previous config saved to /var/cache/conftool/dbconfig/20240918-163837-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:38:48Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69280 and previous config saved to /var/cache/conftool/dbconfig/20240918-163847-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:38:52Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2194 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69281 and previous config saved to /var/cache/conftool/dbconfig/20240918-163852-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:38:58Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2215 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69282 and previous config saved to /var/cache/conftool/dbconfig/20240918-163857-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:39:03Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69283 and previous config saved to /var/cache/conftool/dbconfig/20240918-163902-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:39:07Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2217 (re)pooling @ 50%: T373104', diff saved to https://phabricator.wikimedia.org/P69284 and previous config saved to /var/cache/conftool/dbconfig/20240918-163907-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:53:28Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2129 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69286 and previous config saved to /var/cache/conftool/dbconfig/20240918-165327-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:53:33Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69287 and previous config saved to /var/cache/conftool/dbconfig/20240918-165332-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:53:38Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69288 and previous config saved to /var/cache/conftool/dbconfig/20240918-165337-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:53:44Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2172 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69289 and previous config saved to /var/cache/conftool/dbconfig/20240918-165344-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:53:53Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69290 and previous config saved to /var/cache/conftool/dbconfig/20240918-165352-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:53:58Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69291 and previous config saved to /var/cache/conftool/dbconfig/20240918-165357-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:54:04Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2215 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69292 and previous config saved to /var/cache/conftool/dbconfig/20240918-165403-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:54:08Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69293 and previous config saved to /var/cache/conftool/dbconfig/20240918-165407-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T16:54:13Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: T373104', diff saved to https://phabricator.wikimedia.org/P69294 and previous config saved to /var/cache/conftool/dbconfig/20240918-165412-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:08:33Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2129 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69297 and previous config saved to /var/cache/conftool/dbconfig/20240918-170833-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:08:38Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69298 and previous config saved to /var/cache/conftool/dbconfig/20240918-170838-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:08:43Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69299 and previous config saved to /var/cache/conftool/dbconfig/20240918-170843-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:08:50Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2172 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69300 and previous config saved to /var/cache/conftool/dbconfig/20240918-170849-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:08:59Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69301 and previous config saved to /var/cache/conftool/dbconfig/20240918-170858-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:09:03Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69302 and previous config saved to /var/cache/conftool/dbconfig/20240918-170903-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:09:09Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2215 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69303 and previous config saved to /var/cache/conftool/dbconfig/20240918-170909-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:09:14Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69304 and previous config saved to /var/cache/conftool/dbconfig/20240918-170913-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-09-18T17:09:19Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: T373104', diff saved to https://phabricator.wikimedia.org/P69305 and previous config saved to /var/cache/conftool/dbconfig/20240918-170918-arnaudb.json
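
The repooling entries above step each database through 25%, 50%, 75% and 100%, with the steps roughly 15 minutes apart. The sketch below only reproduces that timing pattern. It does not call dbctl, the function name and the exact interval are assumptions read off the log timestamps, and the start time is the first 25% entry rounded down to the minute.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Sequence, Tuple

# Start of the first repool step in the log, rounded down to the minute.
START = datetime(2024, 9, 18, 16, 23, tzinfo=timezone.utc)


def repool_schedule(
    start: datetime,
    percentages: Sequence[int] = (25, 50, 75, 100),
    interval: timedelta = timedelta(minutes=15),
) -> Iterator[Tuple[int, datetime]]:
    """Yield (percentage, time) pairs for a stepped repool."""
    for step, pct in enumerate(percentages):
        yield pct, start + step * interval


for pct, when in repool_schedule(START):
    print(f"{when:%H:%M} UTC: repool at {pct}%")
```

Stepping the pooled percentage up gradually gives each host time to take on traffic in stages before it is returned to full weight.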

cmooney claimed this task.