codfw: rack A4 maintenance
Closed, Resolved, Public

Description

See parent and grand-parent tasks T426197: codfw: pod AB switches upgrade (2026)

This task tracks the software upgrade of the rack A4 top-of-rack switch, scheduled for Tuesday 2026-06-09 at 12:00 UTC, with an expected network connectivity loss of ~30 min.

https://wikitech.wikimedia.org/wiki/Network_leaf_maintenance

No information available about depool
mc2055: Couldn't get or parse depool Hiera key (planned move to another rack, T427373; now moved to A3)
mc-gp2004: Couldn't get or parse depool Hiera key
dbprov2005: Couldn't get or parse depool Hiera key
backup2004: Couldn't get or parse depool Hiera key

Depool needed
cirrussearch2061: depool using local_command depool
cirrussearch2062: depool using local_command depool
cirrussearch2089: depool using local_command depool
cp2044: depool using local_command depool
db2183: skipping host (manual depool needed)
db2198: skipping host (manual depool needed)
db2199: skipping host (manual depool needed)
db2241: depool using cookbook sre.mysql.depool -r "rack depool" {name}
ganeti2027: depooled (ping Moritz when it's good to re-add)
ganeti2034: depooled (ping Moritz when it's good to re-add)
ganeti2045: depooled (ping Moritz when it's good to re-add)
ml-serve2005: depool using cookbook sre.k8s.pool-depool-node
aux-k8s-worker2006: depool using cookbook sre.k8s.pool-depool-node
wikikube-worker2251: depool using cookbook sre.k8s.pool-depool-node
wikikube-worker2252: depool using cookbook sre.k8s.pool-depool-node
wikikube-worker2253: depool using cookbook sre.k8s.pool-depool-node

No depool needed
ms-be2062: skipping host (Can't be depooled, need to go down one at a time with special care)
ms-be2066: skipping host (Can't be depooled, need to go down one at a time with special care)
ms-be2070: skipping host (Can't be depooled, need to go down one at a time with special care)
ms-be2075: skipping host (Can't be depooled, need to go down one at a time with special care)
logstash2033: skipping host (No cookbook, no depool needed but there's a switch we can flip to mitigate the churn caused when the cluster detects a down node)
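The per-host notes above follow a loose `host: note` pattern, so they can be grouped by depool method before the maintenance window. The following is a minimal sketch only: the sample lines are copied from the lists above, while the category labels (`cookbook:…`, `local_command`, `skipped`, `other`) and the function names are hypothetical and not part of any existing tooling.

```python
import re
from collections import defaultdict

# A handful of entries copied verbatim from the lists above.
STATUS_LINES = """\
cirrussearch2061: depool using local_command depool
cp2044: depool using local_command depool
db2183: skipping host (manual depool needed)
db2241: depool using cookbook sre.mysql.depool -r "rack depool" {name}
ml-serve2005: depool using cookbook sre.k8s.pool-depool-node
wikikube-worker2251: depool using cookbook sre.k8s.pool-depool-node
ms-be2062: skipping host (Can't be depooled, need to go down one at a time with special care)
""".splitlines()


def classify(note: str) -> str:
    """Map a free-text note to a coarse category (labels are hypothetical)."""
    match = re.match(r"depool using cookbook (\S+)", note)
    if match:
        return f"cookbook:{match.group(1)}"
    if note.startswith("depool using local_command"):
        return "local_command"
    if note.startswith("skipping host"):
        return "skipped"
    return "other"


def group_hosts(lines):
    """Return {category: [hosts]} preserving the order of the input lines."""
    groups = defaultdict(list)
    for line in lines:
        # Split on the first ": " only; notes may contain further colons.
        host, _, note = line.partition(": ")
        groups[classify(note)].append(host)
    return dict(groups)


if __name__ == "__main__":
    for category, hosts in group_hosts(STATUS_LINES).items():
        print(f"{category}: {', '.join(hosts)}")
```

Grouping this way makes it easy to see which hosts can be handled by a cookbook run and which ones (the `skipped` group) need a human to act.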

Per team grouping
Infrastructure Foundations: aux-k8s-worker2006, ganeti2027, ganeti2034, ganeti2045 @MoritzMuehlenhoff
Data Persistence: backup2004, db2183, db2198, db2199, db2241, dbprov2005, ms-be2062, ms-be2066, ms-be2070, ms-be2075 @jcrespo @FCeratto-WMF @MatthewVernon
Search Platform: cirrussearch2061, cirrussearch2062, cirrussearch2089 Discovery-Search
Traffic: cp2044 Traffic
Observability: logstash2033 Observability-Logging
ServiceOps: mc2055, mc-gp2004, wikikube-worker2251, wikikube-worker2252, wikikube-worker2253 ServiceOps new
Machine Learning: ml-serve2005 Machine-Learning-Team
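To find out which team to ping for a given host, the grouping above can be inverted into a host-to-team lookup. This is a rough sketch: the lines are copied from the grouping above, and it assumes that hostnames are lowercase and end in four digits, which holds for every host in this task but is not a documented rule. The function names are hypothetical.

```python
import re

# Copied verbatim from the per-team grouping above.
TEAM_LINES = [
    "Infrastructure Foundations: aux-k8s-worker2006, ganeti2027, ganeti2034, ganeti2045 @MoritzMuehlenhoff",
    "Data Persistence: backup2004, db2183, db2198, db2199, db2241, dbprov2005, ms-be2062, ms-be2066, ms-be2070, ms-be2075 @jcrespo @FCeratto-WMF @MatthewVernon",
    "Search Platform: cirrussearch2061, cirrussearch2062, cirrussearch2089 Discovery-Search",
    "Traffic: cp2044 Traffic",
    "Observability: logstash2033 Observability-Logging",
    "ServiceOps: mc2055, mc-gp2004, wikikube-worker2251, wikikube-worker2252, wikikube-worker2253 ServiceOps new",
    "Machine Learning: ml-serve2005 Machine-Learning-Team",
]

# Assumption: a hostname is lowercase and ends in four digits (e.g. ms-be2062).
# User mentions and project tags do not match this pattern and are ignored.
HOST_RE = re.compile(r"\b[a-z][a-z0-9-]*\d{4}\b")


def hosts_by_team(lines):
    """Return {team: [hosts]} for lines of the form 'Team: host, host tag'."""
    result = {}
    for line in lines:
        team, _, rest = line.partition(": ")
        result[team] = HOST_RE.findall(rest)
    return result


def team_for_host(lines):
    """Invert the grouping into {host: team}."""
    return {
        host: team
        for team, hosts in hosts_by_team(lines).items()
        for host in hosts
    }


if __name__ == "__main__":
    lookup = team_for_host(TEAM_LINES)
    print(lookup["cp2044"])
```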

Event Timeline

ayounsi triaged this task as Medium priority.
Restricted Application added a subscriber: Aklapper.

db2183 will require stopping mediabackups in advance, to prevent losing metadata. I will take care of that. Edit: I won't be around that day, so I may choose to fail over the db instead; in any case, I will take care of it myself.

For db2198, db2199, dbprov2005, backup2004 no action will be needed other than downtime, assuming this happens during the day. Hopefully backup2004 will even be decommissioned by then.

@ayounsi mc2055 and mc-gp2004 are on A4, and that is by accident. mc-gp2004 is working as a backup in case mc2055 or any other mc2XXX server is down.

Would it be ok if I ask mc2055 to be moved to either A3 or A5? If that is a yes, I can ask DC-Ops to do so. If this is too tight for their schedule, I could see if we can remove it from the pool around that time.

Depool for cp2044 looks good; please ping Traffic if you want us to take care of it.

Draining ganeti2027.codfw.wmnet of running VMs

VM kubestagemaster2005.codfw.wmnet switching disk type to drbd

Draining ganeti2027.codfw.wmnet of running VMs

VM kubestagemaster2005.codfw.wmnet switching disk type to plain

Draining ganeti2027.codfw.wmnet of running VMs

Mentioned in SAL (#wikimedia-operations) [2026-06-01T16:02:30Z] <moritzm> temporarily remove ganeti2027 from the codfw cluster T427357

Draining ganeti2045.codfw.wmnet of running VMs

VM aux-k8s-etcd2003.codfw.wmnet switching disk type to drbd

VM dse-k8s-etcd2001.codfw.wmnet switching disk type to drbd

Draining ganeti2045.codfw.wmnet of running VMs

VM aux-k8s-etcd2003.codfw.wmnet switching disk type to plain

VM dse-k8s-etcd2001.codfw.wmnet switching disk type to plain

Draining ganeti2045.codfw.wmnet of running VMs

Draining ganeti2045.codfw.wmnet of running VMs

Completed depooling of db2241 by fceratto@cumin1003: Depool for rack maintenance

Icinga downtime and Alertmanager silence (ID=55ec911b-df34-41d6-a145-ff36a55ba765) set by fceratto@cumin1003 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Depool for rack maintenance

db2241.codfw.wmnet

Starting pool of db2241 by fceratto@cumin1003: Depool for rack maintenance

Completed pooling of db2241 by fceratto@cumin1003: Depool for rack maintenance

Mentioned in SAL (#wikimedia-operations) [2026-06-02T09:32:55Z] <moritzm> temporarily remove ganeti2045 from the codfw cluster T427357

VM rpki2003.codfw.wmnet switching disk type to plain

VM netflow2004.codfw.wmnet switching disk type to plain

mc2055 has been moved to A3 (thanks @Jhancock.wm!), and mc-gp2004 is a memcached standby server. Wikikube workers have their cookbooks in place; we are good to go from ServiceOps new.

Change #1298816 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Switchover backup1-codfw primary db2183->db2184

https://gerrit.wikimedia.org/r/1298816

Icinga downtime and Alertmanager silence (ID=6c545cca-39bb-4652-9cd8-5da4fc40f265) set by jynus@cumin2002 for 4:00:00 on 2 host(s) and their services with reason: Switchover db

db[2183-2184].codfw.wmnet

Change #1298816 merged by Jcrespo:

[operations/puppet@production] mariadb: Switchover backup1-codfw primary db2183->db2184

https://gerrit.wikimedia.org/r/1298816

Change #1298824 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Point media backups to the new replica, db2183

https://gerrit.wikimedia.org/r/1298824

Change #1298824 merged by Jcrespo:

[operations/puppet@production] dbbackups: Point media backups to the new replica, db2183

https://gerrit.wikimedia.org/r/1298824

I have moved the primary function (except the backups) of db2183 to db2184 (T428467), so db2183 can lose connectivity for an extended period of time for this task. CC @FCeratto-WMF: no further action is needed beyond setting downtime before maintenance, and removing it afterwards, for db2183, db2198, db2199, backup2004, dbprov2005.

Mentioned in SAL (#wikimedia-operations) [2026-06-09T12:15:23Z] <topranks> drain traffic on ssw1-a1-codfw - add gshut community in evpn underlay - T427357

Mentioned in SAL (#wikimedia-operations) [2026-06-09T12:42:43Z] <topranks> increase OSPF cost on ssw1-a1-codfw link to lsw1-a4-codfw to force traffic via alternate spine T427357

Mentioned in SAL (#wikimedia-operations) [2026-06-09T12:45:18Z] <topranks> shut sub-interfaces for row A/B legacy vlans on cr1-codfw T427357

Mentioned in SAL (#wikimedia-operations) [2026-06-09T12:56:46Z] <XioNoX> lsw1-a4-codfw> request system reboot - T427357
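For a post-maintenance review, the SAL timestamps above can be turned into offsets from the scheduled 12:00 UTC start. The sketch below is illustrative only: the timestamps and messages are copied from the four SAL entries above, and the helper names are hypothetical.

```python
from datetime import datetime, timezone

# Scheduled start taken from the task description (2026-06-09 12:00 UTC).
SCHEDULED_START = datetime(2026, 6, 9, 12, 0, tzinfo=timezone.utc)

# Timestamps and messages copied from the SAL entries above.
SAL_ENTRIES = [
    ("2026-06-09T12:15:23Z", "drain traffic on ssw1-a1-codfw"),
    ("2026-06-09T12:42:43Z", "increase OSPF cost on ssw1-a1-codfw link to lsw1-a4-codfw"),
    ("2026-06-09T12:45:18Z", "shut sub-interfaces for row A/B legacy vlans on cr1-codfw"),
    ("2026-06-09T12:56:46Z", "lsw1-a4-codfw> request system reboot"),
]


def parse_sal_timestamp(stamp: str) -> datetime:
    """Parse a SAL-style UTC timestamp such as 2026-06-09T12:15:23Z."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def offsets_from_start(entries, start=SCHEDULED_START):
    """Return [(timedelta since start, message)] for each SAL entry."""
    return [(parse_sal_timestamp(stamp) - start, message) for stamp, message in entries]


if __name__ == "__main__":
    for offset, message in offsets_from_start(SAL_ENTRIES):
        print(f"+{offset}  {message}")
```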

ayounsi claimed this task.

Maintenance done, all servers except Ganeti and the ones mentioned by @jcrespo have been repooled. Please repool them as you see fit.

ms swift in codfw looks OK after this work, thanks.

VM netflow2004.codfw.wmnet switching disk type to drbd

VM rpki2003.codfw.wmnet switching disk type to drbd

All Ganeti nodes are back in service

Change #1300041 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reenable regular es backups and update RO job ids

https://gerrit.wikimedia.org/r/1300041

Change #1300041 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reenable regular es backups and update RO job ids

https://gerrit.wikimedia.org/r/1300041