Similar to T167274 and T168462
All the CODFW switches will need to be upgraded to fix T133387.
This task covers row B.
I'm planning on doing the upgrade on Wednesday July 12th at 0800 UTC.
That date is only a starting proposal and I can easily reschedule, so let me know if it's an issue for anyone.
2h total maintenance time.
Because the previous NSSU upgrade (one rack after the other) ran into issues (see T168462#3390519), this upgrade is going to be "standard", which means the whole row will go down for between 10 and 20 minutes.
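For anyone converting the window to their own timezone, here is a minimal sketch of the schedule arithmetic. The year is not stated in this task; 2017 is an assumption, chosen because July 12th falls on a Wednesday in that year. The constant and function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the year is not stated in the task; 2017 is used here because
# July 12th falls on a Wednesday in that year.
WINDOW_START = datetime(2017, 7, 12, 8, 0, tzinfo=timezone.utc)
WINDOW_LENGTH = timedelta(hours=2)   # "2h total maintenance time"
PREP_LEAD = timedelta(hours=1)       # prep steps start 1h before the window
OUTAGE_MIN = timedelta(minutes=10)   # expected row outage, lower bound
OUTAGE_MAX = timedelta(minutes=20)   # expected row outage, upper bound


def schedule(start=WINDOW_START):
    """Return the key timestamps of the maintenance as a dict."""
    return {
        "prep_start": start - PREP_LEAD,
        "window_start": start,
        "window_end": start + WINDOW_LENGTH,
    }


if __name__ == "__main__":
    for name, when in schedule().items():
        print(f"{name:13} {when:%A %Y-%m-%d %H%M} UTC")
    low = OUTAGE_MIN.seconds // 60
    high = OUTAGE_MAX.seconds // 60
    print(f"expected row outage: {low}-{high} min")
```

If the date moves, passing a different `start` to `schedule()` shifts the prep and end times with it.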
The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2199
To summarize, here are the types of hosts in that row:
- achernar <-- recursive DNS, see below
- cp*, db*, elastic*, es*, eventlog*, ganeti*, graphite*, kafka*, kubernetes*, labstore*, labtestcontrol*, labtestnet*, labtestneutron*, labtestvirt*, lvs*, maps*, maps-test*, mc*, ms-be*, ms-fe*, mw*, ores*, osm-db*, osm-web*, pc*, prometheus*, rdb*, restbase*, restbase-test*, scb*, subra*, tmh*, wtp*
As achernar is the second recursive DNS server in the LVS resolv.conf, we should not hit T154759, but we might want to remove it from the LVS resolv.conf before the maintenance to be on the safe side.
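To illustrate what removing achernar from resolv.conf amounts to, here is a minimal sketch that drops one `nameserver` entry from the file contents. It only shows the intended result: how resolv.conf is actually managed on the LVS hosts is not covered in this task, and the addresses and search domain below are documentation placeholders, not the real values.

```python
def drop_nameserver(resolv_conf: str, address: str) -> str:
    """Return resolv.conf text without the `nameserver <address>` line."""
    kept = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        is_target = (
            len(fields) >= 2
            and fields[0] == "nameserver"
            and fields[1] == address
        )
        if is_target:
            continue  # skip the resolver that is about to go down
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical file contents: the addresses are documentation-range
# placeholders (RFC 5737), not the real resolver IPs.
EXAMPLE = """\
search example.org
nameserver 192.0.2.1
nameserver 192.0.2.2
options timeout:1 attempts:3
"""
ACHERNAR = "192.0.2.2"  # placeholder for achernar's address (assumption)

if __name__ == "__main__":
    print(drop_nameserver(EXAMPLE, ACHERNAR), end="")
```

The resolver tries `nameserver` entries in the order listed, so with achernar in second position it is only queried when the first server does not answer; removing the line takes it out of the fallback path entirely for the duration of the maintenance.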
I subscribed users to this task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.
Timeline (please edit it to add anyone who needs to be notified or any extra step that needs to be done):
1h before the window (0700 UTC):
- Depool codfw from DNS
- Warn people of the upcoming maintenance
- Ping @elukey to disable kafka
- Route ulsfo cache traffic around codfw
- Fail over LVS* to the passive node of each pair
- Downtime switch in Icinga/LibreNMS
After the upgrade:
- Confirm switches are in a healthy state
- Re-enable igmp-snooping
- Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
- Run a LibreNMS discovery/poll
- Ask the users listed above to confirm that all is good
- Remove monitoring downtime
- Route ulsfo cache traffic through codfw
- Re-failover LVS*
- Repool codfw in DNS