
codfw row A switch upgrade
Closed, Resolved · Public

Description

Similar to T167274

All the CODFW switches will need to be upgraded to fix T133387.

This time it's row A's turn.

I'm planning to do the upgrade on Thursday, June 29th at 0800 UTC.
That date is mostly there for the sake of picking one; I can reschedule at will, so let me know if it's an issue for anyone.

2h total maintenance time.

Best-case scenario: 10min of downtime per RACK (one rack after the other) with a 3min pause in between. This went fine for row D, so no complications are expected.
Possible alternate scenario: 1h of downtime for the whole ROW.

The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2198

To summarize, here are the types of hosts in that row:

acamar
auth*
baham <- authoritative nameservers, needs special handling?
bast2001 <- impactful for staff, maybe dual-homing in the future? 
conf*
contint*
cp*
db*
dbstore*
elastic*
es*
ganeti*
heze-array
jeze
kafka*
kubernetes*
labtestweb*
lvs*
maps*
maps-test*
mc*
ms-be*
ms-fe*
mw*
ores*
osm-web*
procyon
prometheus*
puppetmaster*
rbf*
rdb*
sarin
scb*
stat*
suhail
wdqs*

I subscribed users on that task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I could have missed, or remove yourself from the task if you're not involved.

Because connectivity to bast2001 will be lost during the process, the upgrade will have to be run through mr1-codfw.

Timeline (please edit to add anyone who needs to be notified, or any extra step that needs to be done):
1h before the window (0700 UTC):

  • Depool codfw from DNS
  • Warn people of the upcoming maintenance
  • Route ulsfo cache traffic around codfw
  • Switch services served from codfw (e.g. restbase-async, citoid) to be served from eqiad.
  • Downtime some eqiad DBs (T168462#3366437)
  • Depool acamar
  • Ping @elukey to disable kafka ( T168462#3379576 )
  • Fail over LVS* to the passive node of each pair
  • Downtime the switch in Icinga/LibreNMS; note that the switch will always show up as UP except at the end, when the master member restarts (a rough downtime-scheduling sketch follows this list)
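For that last item, here is a minimal sketch of scheduling the downtime programmatically, assuming the classic Icinga 1 external command file interface; the command-file path, the Icinga host name and the author are placeholders, not values taken from this task:

```
#!/usr/bin/env python3
"""Sketch: schedule Icinga downtime for the row A switch stack.

Assumptions (not from this task): Icinga 1's external command file
lives at /var/lib/icinga/rw/icinga.cmd on the monitoring host, and
the switch is monitored under the host name 'asw-a-codfw'.
"""
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed path, adjust
HOST = "asw-a-codfw"                        # assumed Icinga host name
DURATION = 3 * 3600                         # 2h window plus margin


def schedule_downtime(host, duration, author="ayounsi",
                      comment="asw-a-codfw upgrade T168462"):
    """Submit host + all-services downtime via the external command file."""
    now = int(time.time())
    end = now + duration
    # Standard Icinga 1 / Nagios external command format:
    # [ts] CMD;host;start;end;fixed;trigger_id;duration;author;comment
    cmds = [
        "[{t}] SCHEDULE_HOST_DOWNTIME;{h};{t};{e};1;0;{d};{a};{c}",
        "[{t}] SCHEDULE_HOST_SVC_DOWNTIME;{h};{t};{e};1;0;{d};{a};{c}",
    ]
    with open(CMD_FILE, "w") as fifo:
        for cmd in cmds:
            fifo.write(cmd.format(t=now, h=host, e=end, d=duration,
                                  a=author, c=comment) + "\n")


if __name__ == "__main__":
    schedule_downtime(HOST, DURATION)
```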

After the upgrade:

  • Confirm switches are in a healthy state
  • Re-enable igmp-snooping
  • Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
  • Run a LibreNMS discovery/poll
  • Ask the users listed above to confirm that everything looks good (see the reachability sketch after this list)
  • Remove monitoring downtime
  • Route ulsfo cache traffic through codfw
  • Repool acamar
  • Re-failover LVS*
  • Repool CODFW in DNS
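Before the last few items, a quick reachability sweep over the affected hosts can help confirm the row is really back. A minimal sketch, where the host names are just an illustrative subset of the row A list (the authoritative list is the racktables export above) and port 22 is used as a cheap "the host is back on the network" signal:

```
#!/usr/bin/env python3
"""Sketch: TCP reachability sweep before lifting monitoring downtime."""
import socket

# Illustrative subset only; pull the real list from racktables.
HOSTS = [
    "acamar.wikimedia.org",
    "baham.wikimedia.org",
    "bast2001.wikimedia.org",
]


def reachable(host, port=22, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    down = [h for h in HOSTS if not reachable(h)]
    if down:
        print("Still unreachable: " + ", ".join(down))
    else:
        print("All sampled hosts answer on :22.")
```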

Event Timeline

Restricted Application added a subscriber: Aklapper.

We need to make sure we downtime the following DBs in eqiad, as they have cross-DC replication with some of the DBs affected here; that way we can avoid pages like the ones we had yesterday for cross-DC replication (a replication-check sketch follows the list):

db1068 (s4 master, which has cross-DC replication with db2019, the s4 codfw master)
db1020 (m2 master, which has cross-DC replication with db2011, the m2 codfw master)
db1016 (m1 master, which has cross-DC replication with db2010, the m1 codfw master)
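Something like the following could be used to sanity-check replication on those masters before and after the window. A minimal sketch, assuming pymysql is available, with placeholder credentials (the account needs REPLICATION CLIENT):

```
#!/usr/bin/env python3
"""Sketch: check replication state on the eqiad masters listed above."""
import pymysql

MASTERS = ["db1068.eqiad.wmnet", "db1020.eqiad.wmnet", "db1016.eqiad.wmnet"]


def slave_status(host, user="repl_check", password="CHANGEME"):
    """Return SHOW SLAVE STATUS as a dict, or None if not replicating."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    for host in MASTERS:
        status = slave_status(host)
        if status is None:
            print("{}: no replication configured".format(host))
        else:
            print("{}: IO={} SQL={} lag={}s".format(
                host, status["Slave_IO_Running"],
                status["Slave_SQL_Running"],
                status["Seconds_Behind_Master"]))
```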

Joe added a project: User-Joe.

Replication on misc services is non-paging because they use a single-server operation mode, so to some extent it can be ignored. db1068 would page, because that is how we detect a split-brain scenario between datacenters, in a much more sensitive environment.

Just to be sure, I'll shut down Kafka on kafka2001 before https://racktables.wikimedia.org/index.php?page=rack&rack_id=2207 is upgraded; please ping me 5-10 mins before that rack :)
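A rough sketch of that broker shutdown with a basic safety check first; the systemd unit name, the CLI tool on $PATH and the ZooKeeper endpoint are assumptions, not details from this task:

```
#!/usr/bin/env python3
"""Sketch: stop the kafka2001 broker only if nothing is under-replicated."""
import subprocess

ZOOKEEPER = "conf2001.codfw.wmnet:2181"  # assumed ZooKeeper endpoint


def under_replicated_partitions():
    """List partitions currently reported as under-replicated."""
    out = subprocess.check_output(
        ["kafka-topics.sh", "--describe", "--under-replicated-partitions",
         "--zookeeper", ZOOKEEPER],
        universal_newlines=True)
    return [line for line in out.splitlines() if line.strip()]


def stop_broker():
    """Stop the broker; the unit name 'kafka' is an assumption."""
    subprocess.check_call(["systemctl", "stop", "kafka"])


if __name__ == "__main__":
    pending = under_replicated_partitions()
    if pending:
        print("Under-replicated partitions, not stopping:")
        print("\n".join(pending))
    else:
        stop_broker()
        print("Broker stopped.")
```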

Change 362141 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for asw-a-codfw switch upgrade

https://gerrit.wikimedia.org/r/362141

Mentioned in SAL (#wikimedia-operations) [2017-06-29T07:36:21Z] <elukey> depooled kafka2001.codfw.wmnet for T168462

Change 362145 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Route cache traffic around codfw for asw-a-codfw switch upgrade

https://gerrit.wikimedia.org/r/362145

Change 362141 merged by Ayounsi:
[operations/dns@master] Depool codfw for asw-a-codfw switch upgrade

https://gerrit.wikimedia.org/r/362141

Change 362145 merged by Ayounsi:
[operations/puppet@production] Route cache traffic around codfw for asw-a-codfw switch upgrade

https://gerrit.wikimedia.org/r/362145

Mentioned in SAL (#wikimedia-operations) [2017-06-29T07:47:11Z] <XioNoX> Route cache traffic around codfw - T168462

Mentioned in SAL (#wikimedia-operations) [2017-06-29T07:57:37Z] <volans> switching citoid and restbase-async temporarily to eqiad for T168462

Mentioned in SAL (#wikimedia-operations) [2017-06-29T08:25:05Z] <ema> failover codfw LVSs to secondaries T168462

Mentioned in SAL (#wikimedia-operations) [2017-06-29T08:29:19Z] <XioNoX> asw-a-codfw upgrade started - T168462

Mentioned in SAL (#wikimedia-operations) [2017-06-29T09:29:04Z] <godog> silence paging alerts for *.svc.codfw.wmnet for two hours - T168462

Mentioned in SAL (#wikimedia-operations) [2017-06-29T10:34:47Z] <ema> re-enable puppet and start pybal on lvs2001-2003 T168462

Mentioned in SAL (#wikimedia-operations) [2017-06-29T10:45:44Z] <ema> switching citoid and restbase-async back to codfw after T168462

The upgrade was completed in ~1h45min.
Notable events:

  • NSSU bug, where members a4 and a5 were not passing traffic after being upgraded/rebooted. They started passing traffic again after the virtual-chassis mastership switched to the backup node (so that the master could be upgraded).

We might consider not doing NSSU in the future. That would mean all racks in the row being down at once (instead of one after the other), but with a more predictable result.

  • *.svc.codfw Icinga alerts

Possibly related to T154759