All the CODFW switches will need to be upgraded to fix T133387.
This task covers row C.
I'm planning on doing the upgrade on Wednesday July 19th at 0800 UTC.
I picked that date for the sake of having one and can reschedule at will; let me know if it's an issue for anyone.
2h total maintenance time.
The previous standard upgrade went smoothly, so we're following the same procedure, which means the whole row will go down for between 10 and 20 minutes (row B took ~10 minutes).
The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2200
To summarize, here are the types of hosts in that row:
conf* cp* db* dbstore* elastic* es* graphite* kafka* kubernetes* labtestcontrol* labtestservices* maps* mc* ms-be* mw* mwlog* naos oresdb* pc* phab* rdb* restbase* scb* tegmen wdqs* wtp*
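For anyone wanting to check quickly whether a host type is on the list above, here is a minimal Python sketch that matches a hostname against those patterns. The helper name `matching_patterns` and the sample hostnames are hypothetical. A match only means the host type is present in the row, not that the specific host is; the racktables list remains the authoritative source.

```python
from fnmatch import fnmatchcase

# Host-type patterns copied verbatim from the summary above.
ROW_C_PATTERNS = (
    "conf* cp* db* dbstore* elastic* es* graphite* kafka* kubernetes* "
    "labtestcontrol* labtestservices* maps* mc* ms-be* mw* mwlog* naos "
    "oresdb* pc* phab* rdb* restbase* scb* tegmen wdqs* wtp*"
).split()


def matching_patterns(hostname):
    """Return every listed pattern that the given hostname matches."""
    return [p for p in ROW_C_PATTERNS if fnmatchcase(hostname, p)]


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    for host in ("mw2150", "dbstore2001", "naos", "lvs2001"):
        print(host, matching_patterns(host))
```

An empty result means the host type is not in the summary, so the host is unlikely to be affected.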
I subscribed users on that task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.
Timeline (please edit to add anyone who needs to be notified or any extra step that needs to be done):
1h before the window (0700 UTC):
- Depool codfw from DNS
- Warn people of the upcoming maintenance
- Ping @elukey to disable kafka
- Route ulsfo cache traffic around codfw
- Downtime switch in Icinga/LibreNMS
- Set restbase-async and citoid to active-active
After the upgrade:
- Confirm switches are in a healthy state
- Re-enable igmp-snooping
- Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
- Run a LibreNMS discovery/poll
- Ask the users listed above to confirm that everything is "all good"
- Remove monitoring downtime
- Route ulsfo cache traffic through codfw
- Repool codfw in DNS
- Revert restbase-async and citoid
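The timings in this task can be sanity-checked with a short Python sketch. It derives the prep start and the end of the window from the 0800 UTC start, the 1h lead time and the 2h window given above. The year is an assumption (the task does not state it; 2017 is used because July 19th falls on a Wednesday that year), and the `schedule` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the year is not given in the task; July 19th is a Wednesday in 2017.
WINDOW_START = datetime(2017, 7, 19, 8, 0, tzinfo=timezone.utc)
WINDOW_LENGTH = timedelta(hours=2)   # "2h total maintenance time"
PREP_LEAD = timedelta(hours=1)       # "1h before the window"
OUTAGE_MAX = timedelta(minutes=20)   # upper bound of the expected row outage


def schedule(start=WINDOW_START):
    """Return the key timestamps of the maintenance, all in UTC."""
    return {
        "prep_start": start - PREP_LEAD,
        "window_start": start,
        "window_end": start + WINDOW_LENGTH,
    }


if __name__ == "__main__":
    for name, when in schedule().items():
        print(f"{name}: {when:%A %Y-%m-%d %H%M} UTC")
    # Time left in the window for post-upgrade checks if the outage hits its upper bound.
    print("slack after worst-case outage:", WINDOW_LENGTH - OUTAGE_MAX)
```

If the maintenance is rescheduled, passing a different `start` to `schedule` shifts the prep and end times accordingly.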