All the CODFW switches will need to be upgraded to fix T133387.
ROW D is the least busy row, which is why I'd like to start with it.
I'm planning on doing the upgrade on Tuesday 20th at 1500 UTC (0800 PDT).
That date is only a starting point and I can reschedule freely, so let me know if it's an issue for anyone.
1h30min total maintenance time.
Best scenario: 10min downtime per RACK (one rack after the other)
Possible alternate scenario: 1h downtime for the whole ROW
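For anyone double-checking the times, here is a minimal Python sketch of the window arithmetic. It only uses figures from this task (1500 UTC start, 1h30min window, prep steps 1h before, 10min per rack); the names `hhmm` and `racks_that_fit` are made up for illustration, and the number of racks in ROW D is not stated here, so the last function only gives an upper bound.

```python
from datetime import timedelta

# Times are offsets from midnight UTC, so no calendar date is needed.
WINDOW_START_UTC = timedelta(hours=15)          # 1500 UTC, from this task
WINDOW_LENGTH = timedelta(hours=1, minutes=30)  # 1h30min total maintenance time
PREP_LEAD = timedelta(hours=1)                  # prep steps start 1h before the window
PDT_OFFSET = timedelta(hours=-7)                # PDT is UTC-7


def hhmm(offset: timedelta) -> str:
    """Format an offset from midnight as HHMM."""
    minutes = int(offset.total_seconds()) // 60
    return f"{minutes // 60:02d}{minutes % 60:02d}"


def racks_that_fit(per_rack: timedelta = timedelta(minutes=10)) -> int:
    """Upper bound on racks that fit in the window, done one after the other."""
    return WINDOW_LENGTH // per_rack


prep_start = WINDOW_START_UTC - PREP_LEAD
window_end = WINDOW_START_UTC + WINDOW_LENGTH

print("prep starts :", hhmm(prep_start), "UTC")
print("window      :", hhmm(WINDOW_START_UTC), "to", hhmm(window_end), "UTC")
print("start in PDT:", hhmm(WINDOW_START_UTC + PDT_OFFSET))
print("max racks at 10min each:", racks_that_fit())
```

In the best scenario the per-rack downtime is sequential, so the window comfortably covers the row only if it has no more racks than the bound printed above.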
The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2201
To summarize, here are the types of hosts in that row:
db* mc* elastic* restbase* es* ms-fe* ms-be* cp* puppetmaster* wdqs* maps* conf* gerrit* rdb* pc* scb* labtestpuppetmaster* wezen wasat
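If you want a quick first check of whether one of your hosts belongs to a class listed above, the patterns can be matched as shell-style globs. This is only a rough sketch: the function name and the example hostnames are hypothetical, and a match only means that the host class is present in ROW D, not that the specific host is racked there. The racktables page remains the authoritative list.

```python
from fnmatch import fnmatchcase

# Host classes copied from the summary line above.
ROW_D_HOST_PATTERNS = (
    "db* mc* elastic* restbase* es* ms-fe* ms-be* cp* puppetmaster* wdqs* "
    "maps* conf* gerrit* rdb* pc* scb* labtestpuppetmaster* wezen wasat"
).split()


def matching_patterns(hostname: str) -> list[str]:
    """Return the patterns that match the short (unqualified) hostname."""
    short = hostname.split(".", 1)[0]
    return [p for p in ROW_D_HOST_PATTERNS if fnmatchcase(short, p)]


# Hypothetical hostnames, for illustration only.
for host in ("db2001.codfw.wmnet", "wezen", "mw2001"):
    print(host, "->", matching_patterns(host) or "no matching class")
```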
I subscribed users to this task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or to remove yourself from the task if you're not involved.
@Papaul Could you be available to go onsite that day in case something goes terribly wrong? (No need to be there during the window, just available/reachable.)
Timeline (please edit to add anyone who needs to be notified or any extra step that needs to be done):
1h before the window (1400 UTC):
- Notify @Marostegui about the upcoming work.
- Depool CODFW from DNS.
- Downtime the switches in Icinga/LibreNMS
After the upgrade:
- Confirm switches are in a healthy state
- Re-enable igmp-snooping
- Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
- Ask the users listed above to confirm "all good"
- Remove monitoring downtime
- Repool CODFW in DNS
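
The post-upgrade steps include one branch (re-disable igmp-snooping if T133387 is not fixed), which can be sketched as below. This is an illustration of the ordering only, not an actual runbook script: the three callables are hypothetical stand-ins for manual checks and switch configuration, and stopping early when the switches are unhealthy is my assumption, since the list above does not say what to do in that case.

```python
from typing import Callable, List


def post_upgrade(
    switches_healthy: Callable[[], bool],       # hypothetical manual/automated check
    set_igmp_snooping: Callable[[bool], None],  # hypothetical: True enables, False disables
    bug_fixed: Callable[[], bool],              # hypothetical check for T133387
) -> List[str]:
    """Walk the post-upgrade checklist in order and return a log of steps taken."""
    log: List[str] = []
    if not switches_healthy():
        # Assumption: nothing else proceeds until the switches are healthy.
        log.append("switches unhealthy: stop and investigate")
        return log
    log.append("switches healthy")

    set_igmp_snooping(True)
    log.append("igmp-snooping re-enabled")
    if bug_fixed():
        log.append("T133387 confirmed fixed")
    else:
        # Roll back only the snooping change, as the checklist says.
        set_igmp_snooping(False)
        log.append("T133387 still present: igmp-snooping re-disabled")

    log.append("ask users to confirm all good")
    log.append("remove monitoring downtime")
    log.append("repool CODFW in DNS")
    return log


# Example run with stubbed checks: healthy switches, bug not fixed.
for step in post_upgrade(lambda: True, lambda enabled: None, lambda: False):
    print("-", step)
```

Repooling comes last on purpose, so traffic only returns once monitoring is live again and the service owners have confirmed.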