Even though services are supposed to be designed to handle a row failure, those rows have become close to "too big to fail".
A planned row downtime is extremely costly in terms of engineering time (preparation, communication, sync-ups, depools, monitoring, repools), even more so in eqiad (primary DC for many services and data-engineering presence). See for example T172459: eqiad row D switch upgrade.
The same goes for outages, as we've seen with the codfw row B issues.
As a reminder, eqiad (except rows E and F) as well as codfw use Juniper's Virtual Chassis technology, which creates one large logical switch out of all the top-of-rack switches of a given row. This was convenient before we had automation and before routing between top-of-rack switches was common. The downside is that all the switches in a row share a failure domain and rely on proprietary protocols.
Even though they have been quite stable when left alone, a small change or failure (T327001: asw-b2-codfw down) can have dramatic consequences.
We've also seen bugs creeping up, such as the one documented in T320566: Cr1-eqiad comms problem when moving to 40G row D handoff.
All those downsides are addressed by the "new network design", which is currently live in eqiad rows E/F and will be rolled out to codfw rows A and B before the end of 2023 (we have received part of the hardware). Other rows will follow progressively (the exact timeline depends on hardware availability and budget).
Because of this apparent stability, combined with the impact of an upgrade and the lack of significant reasons to upgrade (features, security, etc.), those switches are now running a quite old and now unsupported software version.
However, upgrading is becoming more urgent now: because of the issues listed above, because of new features needed for network automation, for security, and because the switch replacements won't happen soon enough.
So better bite the bullet and do it. It will also help service owners to make sure their servers are row redundant.
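As an illustration of what "row redundant" means in practice, here is a minimal sketch of a check a service owner could run before a row upgrade. It assumes a hand-written inventory mapping each service to its hosts and rows; the service names, hostnames, and the `services_at_risk` helper are hypothetical and not part of any existing tooling.

```python
from collections import defaultdict

# Hypothetical inventory: service name -> list of (hostname, row) pairs.
# These names are illustrative only, not real production hosts.
EXAMPLE_INVENTORY = {
    "service-a": [("host1001", "A"), ("host1002", "B"), ("host1003", "A")],
    "service-b": [("host1004", "C"), ("host1005", "C")],
}


def rows_per_service(inventory):
    """Return the set of rows each service has hosts in."""
    rows = defaultdict(set)
    for service, hosts in inventory.items():
        for _hostname, row in hosts:
            rows[service].add(row)
    return dict(rows)


def services_at_risk(inventory, row):
    """List services whose hosts all sit in the given row."""
    return sorted(
        service
        for service, rows in rows_per_service(inventory).items()
        if rows == {row}
    )


if __name__ == "__main__":
    # Report, for each row, the services that would be fully down
    # if that row went offline.
    for row in "ABCD":
        print(row, services_at_risk(EXAMPLE_INVENTORY, row))
```

Any service reported for a row would lose all of its hosts during that row's upgrade window, so it would need either extra hosts in another row or an explicit depool plan.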
We should also look at whether we can leverage the recent automation work to help with the depools, repools, communication, etc.
Schedule-wise, it makes sense to do codfw first, then eqiad, ideally during a DC switchover if the timelines are compatible.
Tentative schedule (every 2 weeks on Thursday):
| Row | Window | Task |
| --- | --- | --- |
| codfw A | Feb 7th - 14:00-16:00 UTC | T327925 |
| codfw B | Feb 21st - 14:00-16:00 UTC | T327991 |
| eqiad A | March 7th - 14:00-16:00 UTC* | T329073 |
| eqiad B | March 28th - 14:00-16:00 UTC* | T330165 |
| eqiad C | April 4th - 13:00-15:00 UTC* | T331882 |
| eqiad D | April 18th - 13:00-15:00 UTC* | |
| codfw C | May 2nd - 13:00-15:00 UTC | |
| codfw D | May 16th - 13:00-15:00 UTC | |
(*) During DC switchover