codfw row C switch upgrade
Closed, Resolved, Public

Description

Similar to T167274, T168462 and T169345

All the CODFW switches will need to be upgraded to fix T133387.

This task covers row C.

I'm planning on doing the upgrade on Wednesday, July 19th at 0800 UTC.
That date is only a starting point and I can reschedule if needed, so let me know if it's an issue for anyone.

2h total maintenance time.

As the previous standard upgrade went smoothly, we're doing the same thing, which means the whole row will go down for between 10 and 20 minutes (row B was ~10 minutes).
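To make the timing above easier to check against a calendar, here is a minimal Python sketch of the schedule. It assumes the 2h of total maintenance time is counted from the 0800 UTC start, which the description does not state explicitly; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Announced start of the upgrade (Wednesday, July 19th 2017, 0800 UTC).
window_start = datetime(2017, 7, 19, 8, 0, tzinfo=timezone.utc)

# Preparation steps start one hour before the window.
prep_start = window_start - timedelta(hours=1)

# ASSUMPTION: the "2h total maintenance time" is counted from window_start.
window_end = window_start + timedelta(hours=2)

# Expected row-wide outage, per the description: between 10 and 20 minutes.
outage_min = timedelta(minutes=10)
outage_max = timedelta(minutes=20)


def fmt(ts: datetime) -> str:
    """Render a timestamp the way the task writes them."""
    return ts.strftime("%A %Y-%m-%d %H%M UTC")


print("Prep starts:    ", fmt(prep_start))
print("Window starts:  ", fmt(window_start))
print("Window ends:    ", fmt(window_end))
print("Expected outage:", outage_min, "to", outage_max)
```

The expected outage is a small fraction of the window, which leaves room for the pre- and post-upgrade steps.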

The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2200

To summarize, here are the types of hosts in that row:

conf*
cp*
db*
dbstore*
elastic*
es*
graphite*
kafka*
kubernetes*
labtestcontrol*
labtestservices*
maps*
mc*
ms-be*
mw*
mwlog*
naos
oresdb*
pc*
phab*
rdb*
restbase*
scb*
tegmen
wdqs*
wtp*
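If you want a quick first check of whether one of your hosts belongs to an affected host type, the list above can be treated as shell-style patterns. The following is only a sketch: the helper name `matching_patterns` is made up for illustration, and hostnames other than db2033 (which is mentioned further down in this task) are hypothetical. A match only means the host type is present in row C, not that the specific host is racked there; the racktables link above remains the authoritative source.

```python
from fnmatch import fnmatchcase

# Host types listed in the task description for codfw row C.
ROW_C_PATTERNS = [
    "conf*", "cp*", "db*", "dbstore*", "elastic*", "es*", "graphite*",
    "kafka*", "kubernetes*", "labtestcontrol*", "labtestservices*",
    "maps*", "mc*", "ms-be*", "mw*", "mwlog*", "naos", "oresdb*", "pc*",
    "phab*", "rdb*", "restbase*", "scb*", "tegmen", "wdqs*", "wtp*",
]


def matching_patterns(hostname: str) -> list[str]:
    """Return every listed pattern that the short hostname matches."""
    return [p for p in ROW_C_PATTERNS if fnmatchcase(hostname, p)]


# db2033 appears in this task; the other names are hypothetical examples.
for host in ["db2033", "dbstore2001", "naos", "ms-fe2001"]:
    hits = matching_patterns(host)
    status = ", ".join(hits) if hits else "no listed host type"
    print(f"{host}: {status}")
```

Note that some patterns overlap (for example `db*` also matches `dbstore*` hosts), so a host can match more than one entry.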

I subscribed users to this task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.

Timeline (please edit it to add anyone who needs to be notified or any extra step that needs to be done):
1h before the window (0700 UTC):

  • Depool codfw from DNS
  • Warn people of the upcoming maintenance
  • Ping @elukey to disable kafka
  • Route ulsfo cache traffic around codfw
  • Downtime switch in Icinga/LibreNMS
  • Set restbase-async and citoid to active-active

After the upgrade:

  • Confirm switches are in a healthy state
  • Re-enable igmp-snooping
  • Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
  • Run a LibreNMS discovery/poll
  • Ask the users listed above to confirm "all good"
  • Remove monitoring downtime
  • Route ulsfo cache traffic through codfw
  • Repool codfw in DNS
  • Revert restbase-async and citoid

Event Timeline

From the db side:

  • db1031 needs to be downtimed as it is db2033's x1 master and will page with replication broken once db2033 becomes unreachable
  • We could downtime all the affected hosts - they will not page, but will generate noise on IRC with PING DOWN.

re: graphite machines, we'll take the 10 min hit, ditto mwlog

Change 366198 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for asw-c-codfw upgrade

https://gerrit.wikimedia.org/r/366198

Change 366199 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Route traffic around codfw for asw-c-codfw upgrade

https://gerrit.wikimedia.org/r/366199

Change 366198 merged by Ayounsi:
[operations/dns@master] Depool codfw for asw-c-codfw upgrade

https://gerrit.wikimedia.org/r/366198

Change 366199 merged by Ayounsi:
[operations/puppet@production] Route traffic around codfw for asw-c-codfw upgrade

https://gerrit.wikimedia.org/r/366199

Mentioned in SAL (#wikimedia-operations) [2017-07-19T08:28:49Z] <XioNoX> asw-c-codfw restarted 8min ago for switch upgrade - T170380

The switch upgrade took ~9 minutes and went as expected.
Icinga paged about some svc.codfw services being unreachable; a follow-up task will be opened.