Similar to T167274
All the CODFW switches will need to be upgraded to fix T133387.
This task covers row A.
I'm planning on doing the upgrade on Thursday 29th at 0800 UTC.
That's just for the sake of picking a date; I can reschedule freely, so let me know if it's an issue for anyone.
2h total maintenance time.
Best-case scenario: 10min downtime per RACK (one rack after the other) with a 3min pause in between. This went fine for row D, so no complications are expected.
Possible alternate scenario: 1h downtime for the whole ROW.
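For a rough idea of how the best-case scenario fits into the 2h window, here is a minimal sketch of the per-rack timing. The rack count and the rack names (A1..A8) are assumptions for illustration only; adjust them to the actual layout of row A.

```python
def rack_schedule(start_minutes, racks, downtime=10, pause=3):
    """Return (rack, start, end) tuples, in minutes since midnight UTC."""
    schedule = []
    t = start_minutes
    for rack in racks:
        schedule.append((rack, t, t + downtime))
        # Next rack starts after this rack's downtime plus the pause.
        t += downtime + pause
    return schedule


def hhmm(minutes):
    """Format minutes since midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


# Hypothetical: 8 racks named A1..A8, starting at 0800 UTC.
racks = [f"A{i}" for i in range(1, 9)]
schedule = rack_schedule(8 * 60, racks)

for rack, start, end in schedule:
    print(f"{rack}: {hhmm(start)}-{hhmm(end)} UTC")

# Elapsed time from the first rack going down to the last one coming back.
total = schedule[-1][2] - schedule[0][1]
print(f"total: {total} min")
```

With n racks the elapsed time is 10n + 3(n-1) minutes, so under the 8-rack assumption the run takes 101 minutes.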
the full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2198
To summarize, here are the types of hosts in that row:
acamar auth* baham <- authoritative nameservers, needs special handling?
bast2001 <- impactful for staff, maybe dual-homing in the future?
conf* contint* cp* db* dbstore* elastic* es* ganeti* heze-array jeze kafka* kubernetes* labtestweb* lvs* maps* maps-test* mc* ms-be* ms-fe* mw* ores* osm-web* procyon prometheus* puppetmaster* rbf* rdb* sarin scb* stat* suhail wdqs*
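If it helps to check who needs to be notified, the patterns above can be used to group a list of hostnames (for instance one copied from racktables) by host type. This is only a sketch: the sample hostnames other than bast2001 and acamar are made up, and matching a pattern does not by itself mean a host is in row A.

```python
from fnmatch import fnmatchcase

# Host type patterns as listed in this task.
PATTERNS = (
    "acamar auth* baham bast2001 conf* contint* cp* db* dbstore* elastic* "
    "es* ganeti* heze-array jeze kafka* kubernetes* labtestweb* lvs* maps* "
    "maps-test* mc* ms-be* ms-fe* mw* ores* osm-web* procyon prometheus* "
    "puppetmaster* rbf* rdb* sarin scb* stat* suhail wdqs*"
).split()


def host_type(hostname):
    """Return the most specific (longest) matching pattern, or None."""
    matches = [p for p in PATTERNS if fnmatchcase(hostname, p)]
    # Longest pattern wins, so dbstore2001 maps to dbstore* rather than db*.
    return max(matches, key=len) if matches else None


# Hypothetical sample hostnames, for illustration only.
for host in ["bast2001", "acamar", "dbstore2001", "maps-test2001", "cp2001"]:
    print(host, "->", host_type(host))
```

Picking the longest match avoids the overlap between db* and dbstore*, and between maps* and maps-test*.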
I subscribed users on that task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.
Because connectivity to bast2001 will be lost during the process, the upgrade will have to be run through mr1-codfw.
Timeline (please edit to add anyone who needs to be notified or any extra step that needs to be done):
1h before the window (0700 UTC):
- Depool codfw from DNS
- Warn people of the upcoming maintenance
- Route ulsfo cache traffic around codfw
- Switch services served from codfw (e.g. restbase-async, citoid) to be served from eqiad
- Downtime some eqiad DBs (T168462#3366437)
- Depool acamar
- Ping @elukey to disable kafka (T168462#3379576)
- Fail over LVS* to the passive node of each pair
- Downtime switch in Icinga/LibreNMS (note that the switch will always show up as UP except when the master member restarts, at the end)
After the upgrade:
- Confirm switches are in a healthy state
- Re-enable igmp-snooping
- Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
- Run a LibreNMS discovery/poll
- Ask the users listed above to confirm "all good"
- Remove monitoring downtime
- Route ulsfo cache traffic through codfw
- Repool acamar
- Fail LVS* back to the original nodes
- Repool CODFW in DNS