codfw: rack A3 maintenance
Closed, Resolved · Public

Description

See parent and grand-parent tasks T426197: codfw: pod AB switches upgrade (2026)

This task tracks the software upgrade of the rack A3 top-of-rack switch, scheduled for Tuesday 2026-06-02 at 12:00 UTC, with an expected network connectivity loss of ~30 min.

The checkmark indicates hosts ready for the rack maintenance (alarms silenced + host depooled if needed)

Depool needed

  • db2158: depool using cookbook sre.mysql.depool -r "rack depool" {name}
  • db2250: skipping host (depool not needed)
  • pc2021: skipping host (manual depool needed)

  • wikikube-worker2011: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2033: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2034: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2050: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2055: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2056: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2057: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2058: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2059: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2060: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2061: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2062: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2068: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2069: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2070: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2071: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2107: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2108: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2109: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2110: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2111: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2112: depool using cookbook sre.k8s.pool-depool-node
  • wikikube-worker2113: depool using cookbook sre.k8s.pool-depool-node

No depool needed

  • kafka-main2006: skipping host (no depool needed)
  • netmon2002: skipping host (no depool needed)
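
The per-host entries above follow a regular "host: action" pattern, so they can be turned into a checklist mechanically. The following is a minimal sketch, not the tooling that generated this task: parse_entry is a hypothetical helper, and treating {name} as a placeholder for the short host name is an assumption. Cookbook arguments that the entries do not show are left out rather than guessed.

```python
def parse_entry(line: str) -> tuple[str, str, str]:
    """Split one host entry into (host, kind, detail).

    kind is "depool" (detail = cookbook text from the entry) or
    "skip" (detail = the reason given in parentheses).
    """
    # Drop leading bullet characters and surrounding whitespace.
    host, _, action = line.strip(" •\t").partition(": ")
    prefix = "depool using cookbook "
    if action.startswith(prefix):
        # Assumption: {name} stands for the short host name.
        command = action[len(prefix):].replace("{name}", host)
        return host, "depool", command
    if action.startswith("skipping host"):
        reason = action[action.index("(") + 1 : action.rindex(")")]
        return host, "skip", reason
    raise ValueError(f"unrecognised entry: {line!r}")


entries = [
    '  • db2158: depool using cookbook sre.mysql.depool -r "rack depool" {name}',
    "  • pc2021: skipping host (manual depool needed)",
    "wikikube-worker2011: depool using cookbook sre.k8s.pool-depool-node",
]
for entry in entries:
    print(parse_entry(entry))
```

Hosts marked "skip" still need a human decision (pc2021 requires a manual depool), so the sketch only classifies entries and does not run anything.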

Data Persistence: db2158, db2250, pc2021 @Marostegui @FCeratto-WMF
ServiceOps: kafka-main2006, wikikube-worker2011, wikikube-worker2033, wikikube-worker2034, wikikube-worker2050, wikikube-worker2055, wikikube-worker2056, wikikube-worker2057, wikikube-worker2058, wikikube-worker2059, wikikube-worker2060, wikikube-worker2061, wikikube-worker2062, wikikube-worker2068, wikikube-worker2069, wikikube-worker2070, wikikube-worker2071, wikikube-worker2107, wikikube-worker2108, wikikube-worker2109, wikikube-worker2110, wikikube-worker2111, wikikube-worker2112, wikikube-worker2113 @JMeybohm
Infrastructure Foundations: netmon2002
Observability: netmon2002

Event Timeline

ayounsi triaged this task as Medium priority.
Restricted Application added a subscriber: Aklapper.
Marostegui added subscribers: CWilliams-WMF, jcrespo.

@jcrespo FYI db2250
@FCeratto-WMF can you take care of depooling pc2021 and coordinating db2158? cc @CWilliams-WMF

Thanks for the heads up, @Marostegui

db2250 needs no special handling or depooling (other than downtiming), assuming maintenance happens during the day.

Depooled pc1021.eqiad.wmnet and pc2021.codfw.wmnet rack A3 maintenance - fceratto@cumin1003 - T427301

Icinga downtime and Alertmanager silence (ID=45ec6ca5-e7dc-4aac-829a-479be0c8c095) set by fceratto@cumin1003 for 2 days, 0:00:00 on 1 host(s) and their services with reason: rack A3 maintenance

pc2021.codfw.wmnet

Completed depooling of db2158 by fceratto@cumin1003: rack A3 maintenance

Icinga downtime and Alertmanager silence (ID=dd2e4787-ea9c-4ac3-a947-eed9b2dfef8b) set by fceratto@cumin1003 for 2 days, 0:00:00 on 1 host(s) and their services with reason: rack A3 maintenance

db2158.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=79ff0806-933d-49e7-8c67-71ba7d45bc8b) set by fceratto@cumin1003 for 2 days, 0:00:00 on 1 host(s) and their services with reason: rack A3 maintenance

db2250.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=7ff06b6f-10ee-45fc-a0d8-f9e445c2a726) set by ayounsi@cumin1003 for 1:00:00 on 27 host(s) and their services with reason: Switch maintenance

db[2158,2250].codfw.wmnet,netmon2002.wikimedia.org,pc2021.codfw.wmnet,wikikube-worker[2011,2033-2034,2050,2055-2062,2068-2071,2107-2113].codfw.wmnet
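
The compact host expression above can be expanded mechanically to confirm that it covers the 27 hosts reported by the downtime entry. This is a standalone sketch: expand_hosts is a hypothetical helper, not part of Cumin or ClusterShell, and it handles only the single-bracket comma and range syntax that appears in this task.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand e.g. 'db[2158,2250].codfw.wmnet,pc2021.codfw.wmnet' into host names."""
    hosts = []
    # Split on commas that are outside square brackets.
    for part in re.split(r",(?![^\[\]]*\])", expr):
        match = re.fullmatch(r"([^\[\]]*)\[([^\]]+)\]([^\[\]]*)", part)
        if match is None:
            hosts.append(part)  # plain host name, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number is a one-element range
            for number in range(int(low), int(high) + 1):
                # Keep the zero padding of the lower bound, if any.
                hosts.append(f"{prefix}{str(number).zfill(len(low))}{suffix}")
    return hosts


expression = (
    "db[2158,2250].codfw.wmnet,netmon2002.wikimedia.org,pc2021.codfw.wmnet,"
    "wikikube-worker[2011,2033-2034,2050,2055-2062,2068-2071,2107-2113].codfw.wmnet"
)
hosts = expand_hosts(expression)
print(len(hosts))  # 27
```

The 27 hosts break down as 2 db hosts, netmon2002, pc2021, and 23 wikikube-worker hosts. kafka-main2006 from the description is not part of this expression.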

Icinga downtime and Alertmanager silence (ID=2cc85cc5-a5e3-45c9-b39d-9775610ef6e4) set by ayounsi@cumin1003 for 1:00:00 on 3 host(s) and their services with reason: Switch maintenance

lsw1-a3-codfw,lsw1-a3-codfw IPv6,lsw1-a3-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2026-06-02T12:21:28Z] <XioNoX> reboot lsw1-a3-codfw for software upgrade - T427301

Mentioned in SAL (#wikimedia-operations) [2026-06-02T12:41:13Z] <topranks> enable bgp graceful-shutdown in underlay on ssw1-a1-codfw T427301

Mentioned in SAL (#wikimedia-operations) [2026-06-02T12:50:08Z] <topranks> enable bgp graceful-shutdown in overlay on ssw1-a1-codfw T427301

Mentioned in SAL (#wikimedia-operations) [2026-06-02T12:54:55Z] <topranks> shutdown sub-interfaces on cr1-codfw et-1/1/5 for row A/B vlans T427301

Mentioned in SAL (#wikimedia-operations) [2026-06-02T13:03:13Z] <topranks> increase OSPF cost on ssw1-a1-codfw et-0/0/2 towards lsw1-a3-codfw T427301

Mentioned in SAL (#wikimedia-operations) [2026-06-02T13:24:40Z] <topranks> increase OSPF cost on ssw1-a1-codfw et-0/0/4 towards lsw1-a5-codfw T427301
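
To reconstruct the timing of the maintenance from the SAL entries above, the timestamps can be parsed and compared. This is a minimal sketch that assumes the "[timestamp] <user> message" layout shown in this task; parse_sal is a hypothetical helper, not an existing SAL API.

```python
import re
from datetime import datetime, timezone

# Matches "[2026-06-02T12:21:28Z] <XioNoX> reboot ..." anywhere in the line.
SAL_ENTRY = re.compile(r"\[(?P<ts>[^\]]+)\] <(?P<user>[^>]+)> (?P<msg>.*)")


def parse_sal(line: str) -> tuple[datetime, str, str]:
    """Return (UTC timestamp, user, message) for one SAL mention."""
    match = SAL_ENTRY.search(line)
    if match is None:
        raise ValueError(f"not a SAL entry: {line!r}")
    timestamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    return timestamp.replace(tzinfo=timezone.utc), match["user"], match["msg"]


first = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2026-06-02T12:21:28Z] <XioNoX> "
    "reboot lsw1-a3-codfw for software upgrade - T427301"
)
last = parse_sal(
    "Mentioned in SAL (#wikimedia-operations) [2026-06-02T13:24:40Z] <topranks> "
    "increase OSPF cost on ssw1-a1-codfw et-0/0/4 towards lsw1-a5-codfw T427301"
)
print(last[0] - first[0])  # 1:03:12
```

The difference covers the span between the first and last logged actions only; it does not measure the switch downtime itself.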

ayounsi claimed this task.

The A3 switch upgrade went fine: ~11 min of switch downtime, plus a few more minutes for the interfaces to come back up.

We had some issues with the spines upgrade, so we're going to push it to the next window.

wikikube-worker hosts repooled.