codfw row D switch upgrade
Closed, Resolved (Public)

Description

All the CODFW switches will need to be upgraded to fix T133387.

Row D is the least busy row, so I'd like to start with it.

I'm planning to do the upgrade on Tuesday, June 20th at 1500 UTC (0800 PDT).
I chose that date just to have something concrete. I can reschedule freely, so let me know if it's an issue for anyone.

1h30min total maintenance time.

Best scenario: 10min downtime per RACK (one rack after the other)
Possible alternate scenario: 1h downtime for the whole ROW
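
The best-case plan above (racks upgraded one after the other, 10 minutes each, inside a 1h30min window starting at 1500 UTC) can be sketched as a small schedule calculation. This is only an illustration: the rack names are hypothetical placeholders (the real layout is in racktables), and the function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the task description.
WINDOW_START = datetime(2017, 6, 20, 15, 0, tzinfo=timezone.utc)
WINDOW_LENGTH = timedelta(minutes=90)  # "1h30min total maintenance time"
PER_RACK = timedelta(minutes=10)       # best scenario: 10min downtime per rack


def rack_schedule(racks, start=WINDOW_START, per_rack=PER_RACK):
    """Return (rack, begin, end) tuples for racks upgraded one after the other."""
    schedule = []
    cursor = start
    for rack in racks:
        schedule.append((rack, cursor, cursor + per_rack))
        cursor += per_rack
    return schedule


def fits_window(schedule, start=WINDOW_START, length=WINDOW_LENGTH):
    """True if the last rack finishes no later than the end of the window."""
    return not schedule or schedule[-1][2] <= start + length


# Hypothetical rack names, used only to exercise the functions.
racks = ["D1", "D2", "D3", "D4"]
plan = rack_schedule(racks)
for rack, begin, end in plan:
    print(f"{rack}: {begin:%H:%M} to {end:%H:%M} UTC")
print("fits window:", fits_window(plan))
```

With these numbers, at most nine sequential 10-minute rack slots fit in the 90-minute window, and that leaves no margin for pre-checks or post-checks.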

The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2201

To summarize, here are the types of hosts in that row:

db*
mc*
elastic*
restbase*
es*
ms-fe*
ms-be*
cp*
puppetmaster*
wdqs*
maps*
conf*
gerrit*
rdb*
pc*
scb*
labtestpuppetmaster*
wezen
wasat

I subscribed users to this task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I may have missed, or to remove yourself from the task if you're not involved.

@Papaul Could you be available to go onsite that day in case something goes terribly wrong? (No need to be there in advance; you only need to be available/reachable.)

Timeline (please edit to add anyone who needs to be notified or any extra step that needs to be done):
1h before the window (1400 UTC):

  • Notify @Marostegui about the upcoming work.
  • Depool CODFW from DNS.
  • Downtime the switch in Icinga/LibreNMS

After the upgrade:

  • Confirm switches are in a healthy state
  • Re-enable igmp-snooping
  • Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
  • Ask the users listed above to confirm that all is good
  • Remove monitoring downtime
  • Repool CODFW in DNS

Event Timeline

ayounsi created this task. Jun 7 2017, 10:47 AM
Restricted Application added a project: Operations. Jun 7 2017, 10:47 AM
Restricted Application added a subscriber: Aklapper.
Papaul added a comment. Jun 7 2017, 4:20 PM

@ayounsi Yes I can be available

Is this happening tomorrow then?

Correct, nobody voiced any concern for that date.

Change 360352 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for asw-d-codfw upgrade

https://gerrit.wikimedia.org/r/360352

Change 360357 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Route around codfw for asw-d-codfw switch upgrade

https://gerrit.wikimedia.org/r/360357

Change 360352 merged by Ayounsi:
[operations/dns@master] Depool codfw for asw-d-codfw upgrade

https://gerrit.wikimedia.org/r/360352

Change 360357 merged by Ayounsi:
[operations/puppet@production] Route around codfw for asw-d-codfw switch upgrade

https://gerrit.wikimedia.org/r/360357

Mentioned in SAL (#wikimedia-operations) [2017-06-20T14:32:05Z] <XioNoX> depooled codfw - T167274

Mentioned in SAL (#wikimedia-operations) [2017-06-20T15:08:56Z] <XioNoX> starting asw-d-codfw switch upgrade - T167274

Upgrade done. It took a bit longer than expected (~1h45min), but the process was smooth.

Full logs on P5597

Mentioned in SAL (#wikimedia-operations) [2017-06-20T17:47:27Z] <XioNoX> repool codfw - T167274

ayounsi closed this task as Resolved. Jun 20 2017, 5:48 PM

Change 360381 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "Route around codfw for asw-d-codfw switch upgrade"

https://gerrit.wikimedia.org/r/360381

Change 360381 merged by Ema:
[operations/puppet@production] Revert "Route around codfw for asw-d-codfw switch upgrade"

https://gerrit.wikimedia.org/r/360381

Mentioned in SAL (#wikimedia-operations) [2017-06-20T18:06:33Z] <ema> route ulsfo back to codfw T167274
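
For reference, the gaps between the SAL entries in this task can be computed directly from their timestamps. The sketch below is illustrative only: the timestamps and labels are copied from the entries above, while the helper names (`parse_sal`, `gap`) are made up for this example.

```python
from datetime import datetime

# Timestamps copied from the SAL entries in this task (all UTC).
SAL_EVENTS = [
    ("depooled codfw", "2017-06-20T14:32:05Z"),
    ("starting asw-d-codfw switch upgrade", "2017-06-20T15:08:56Z"),
    ("repool codfw", "2017-06-20T17:47:27Z"),
    ("route ulsfo back to codfw", "2017-06-20T18:06:33Z"),
]


def parse_sal(ts):
    """Parse a SAL timestamp such as 2017-06-20T14:32:05Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def gap(events, first, second):
    """Elapsed time between two events, looked up by their label."""
    times = {label: parse_sal(ts) for label, ts in events}
    return times[second] - times[first]


print("codfw depooled for:",
      gap(SAL_EVENTS, "depooled codfw", "repool codfw"))  # 3:15:22
print("upgrade start to repool:",
      gap(SAL_EVENTS, "starting asw-d-codfw switch upgrade", "repool codfw"))  # 2:38:31
```

The start-to-repool gap is longer than the ~1h45min reported for the upgrade itself. The SAL entries do not record when the upgrade finished, so the difference presumably covers the post-upgrade checks listed in the description.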