
codfw row B switch upgrade
Closed, Resolved · Public

Description

Similar to T167274 and T168462

All the CODFW switches will need to be upgraded to fix T133387.

This time it's row B's turn.

I'm planning on doing the upgrade on Wednesday, July 12th at 0800 UTC.
That's just for the sake of picking a date; I can reschedule at will, so let me know if that's an issue for anyone.

2h total maintenance time.

As the previous NSSU upgrade (one rack after the other) showed issues (see T168462#3390519), this time the upgrade is going to be "standard", which means the whole row will go down for between 10 and 20 minutes.

The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2199

To summarize, here are the types of hosts in that row:

achernar   <-- recursive DNS, see below 
cp*
db*
elastic*
es*
eventlog*
ganeti*
graphite*
kafka*
kubernetes*
labstore*
labtestcontrol*
labtestnet*
labtestneutron*
labtestvirt*
lvs*
maps*
maps-test*
mc*
ms-be*
ms-fe*
mw*
ores*
osm-db*
osm-web*
pc*
prometheus*
rdb*
restbase*
restbase-test*
scb*
subra*
tmh*
wtp*

As achernar is the 2nd recursive DNS in the LVS resolv.conf, we should not hit T154759, but we might want to remove it from the LVS resolv.conf before the maintenance to be on the safe side.
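
For illustration, this precaution amounts to dropping achernar's nameserver line from the LVS hosts' /etc/resolv.conf (done through Puppet in practice, as the changes further down show). A minimal sketch of the idea in Python; the resolver address below is a placeholder, not achernar's real IP:

    # Sketch only: remove a given recursive resolver from resolv.conf.
    # In production this is managed by Puppet, not edited by hand.
    RESOLVER_IP = "198.51.100.42"  # placeholder, not achernar's real address

    def strip_resolver(path="/etc/resolv.conf", resolver=RESOLVER_IP):
        with open(path) as f:
            lines = f.readlines()
        kept = [l for l in lines if not (l.startswith("nameserver") and resolver in l)]
        with open(path, "w") as f:
            f.writelines(kept)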

I subscribed users to this task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I may have missed, or to remove yourself from the task if you're not involved.

Timeline, please edit and add anyone who needs to be notified or any extra step that needs to be done.
1h before the window (0700 UTC):

  • Depool codfw from DNS
  • Warn people of the upcoming maintenance
  • Ping @elukey to disable kafka
  • Route ulsfo cache traffic around codfw
  • Fail over LVS* to the passive node of each pair
  • Downtime switch in Icinga/LibreNMS
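
For the Icinga side of the downtime step above, this essentially means writing a SCHEDULE_HOST_DOWNTIME external command for the switch to Icinga's command file. A rough sketch; the command-file path and the switch's Icinga host name are assumptions and may differ:

    # Sketch: schedule a fixed 2h downtime for the switch via Icinga's external command file.
    # CMD_FILE path and the switch host name are assumptions.
    import time

    CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"
    HOST = "asw-b-codfw"
    DURATION = 2 * 3600

    def downtime_switch():
        now = int(time.time())
        cmd = (f"[{now}] SCHEDULE_HOST_DOWNTIME;{HOST};{now};{now + DURATION};"
               f"1;0;{DURATION};ayounsi;row B switch upgrade\n")
        with open(CMD_FILE, "w") as f:
            f.write(cmd)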

After the upgrade:

  • Confirm switches are in a healthy state (see the sketch after this list)
  • Re-enable igmp-snooping
  • Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
  • Run a LibreNMS discovery/poll
  • Ask the users listed above to confirm that everything is good
  • Remove monitoring downtime
  • Route ulsfo cache traffic through codfw
  • Re-failover LVS*
  • Repool codfw in DNS
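
For the "confirm switches are in a healthy state" step, something like the following could be used from a host with access to the switch; it relies on the Junos PyEZ library (junos-eznc), and the management hostname and user are assumptions (in practice this check is likely done by hand):

    # Sketch: connect to the upgraded switch and print basic facts (model, JunOS version).
    # Hostname and user are placeholders.
    from jnpr.junos import Device

    def check_switch(host="asw-b-codfw.mgmt.codfw.wmnet", user="netops"):
        with Device(host=host, user=user) as dev:
            # facts are gathered when the NETCONF session is opened
            print(dev.facts["hostname"], dev.facts["model"], dev.facts["version"])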

Event Timeline

Restricted Application added a subscriber: Aklapper.

We can absorb whatever outage is convenient for you, @ayounsi, for our things:

labstore*
labtestcontrol*
labtestnet*
labtestneutron*
labtestvirt*

@madhuvishy we may need to keep tabs and restart backups. Otherwise it's just administrative stuff users don't care about.

Edit: moving the maintenance to Wednesday July 12 for availability reasons.

Important hosts in this row that we have to downtime:

es2018 -> es1014 needs to be downtimed, as it will page when replication breaks
db2029 is s7 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2028 is s6 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2023 is s5 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2019 is s4 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2018 is s3 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2017 is s2 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2016 is s1 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.

Thinking about it, we should spread those masters out a bit; they are all in B6 (T169501).
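
As a convenience, replication state on those hosts can be checked in bulk around the window; a minimal sketch assuming pymysql and suitable read-only credentials in a local .my.cnf (the account and exact FQDNs are illustrative):

    # Sketch: print SHOW SLAVE STATUS for the affected codfw masters, to confirm
    # replication is in the expected (stopped or healthy) state around the window.
    import os
    import pymysql

    HOSTS = ["db2016", "db2017", "db2018", "db2019", "db2023", "db2028", "db2029", "es2018"]

    def slave_status(host):
        conn = pymysql.connect(host=host + ".codfw.wmnet",
                               read_default_file=os.path.expanduser("~/.my.cnf"),
                               cursorclass=pymysql.cursors.DictCursor)
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone()  # None if the host replicates from nothing

    for h in HOSTS:
        status = slave_status(h)
        print(h, status["Slave_IO_Running"] if status else "no replication")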

For kafka2002 it is sufficient to depool it from eventbus via pybal/conftool, and then re-balance the cluster when the work is done (I will take care of both steps).
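
For reference, the depool/repool is roughly a conftool one-liner; sketched below via subprocess, with the caveat that the exact confctl selector for eventbus is an assumption on my part:

    # Sketch: depool kafka2002 from eventbus before the window and repool afterwards.
    # The confctl selector fields are assumptions; check conftool-data for the real ones.
    import subprocess

    def set_pooled(fqdn, service, state):
        subprocess.run(
            ["confctl", "select", "name={},service={}".format(fqdn, service),
             "set/pooled={}".format(state)],
            check=True,
        )

    set_pooled("kafka2002.codfw.wmnet", "eventbus", "no")   # before the maintenance
    # ... maintenance and cluster re-balance ...
    set_pooled("kafka2002.codfw.wmnet", "eventbus", "yes")  # when done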

During the last upgrade we generated a lot of pages and alerts on IRC; that is expected, since the maintenance needed reboots of all the row's switches. What I am wondering is whether we could silence Icinga in a way that prevents unrelated outages occurring at the same time from being "masked" by the background noise of alarms firing. Maybe something like creating a script that grabs the list of "affected" hosts and silences them in Icinga for two hours via icinga-downtime on einsteinium?
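
Something along these lines could work; a rough sketch, assuming the icinga-downtime helper on einsteinium takes a hostname, a duration in seconds and a reason (its real flags may differ):

    # Rough sketch of the proposed script: read the list of affected hosts and
    # schedule a 2h Icinga downtime for each of them via the icinga-downtime helper.
    # The helper's flags are an assumption; adjust to its real interface.
    import subprocess

    DURATION = 2 * 3600
    REASON = "codfw row B switch upgrade"

    def downtime_hosts(hostfile="affected_hosts.txt"):
        with open(hostfile) as f:
            hosts = [line.strip() for line in f if line.strip()]
        for host in hosts:
            subprocess.run(
                ["icinga-downtime", "-h", host, "-d", str(DURATION), "-r", REASON],
                check=True,
            )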

During the first upgrade, despite the switch being downtimed in Icinga, the upgrade process didn't take it down (it kept replying to pings), so all the hosts depending on that stack alerted as being down.

For the second one, we disabled the checks and manually set the switch as down in Icinga; that way the parent/child relationship in Icinga should have marked the hosts as "unreachable" and only the switch as "down", and thus not paged for the hosts. Not sure whether that worked or not.
It's also possible that the services took longer to recover than the hosts, and thus alerted as soon as the hosts came back up. Maybe there is a way to have the service checks wait a bit longer after a host comes back up?

For this upgrade, all the switches should go down at the same time, so it should behave the same as the second one.

Change 364661 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364661

Change 364663 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Route traffic around codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364663

Change 364661 merged by Ayounsi:
[operations/dns@master] Depool codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364661

Change 364663 merged by Ayounsi:
[operations/puppet@production] Route traffic around codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364663

Change 364667 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Temporarily remove achernar from lvs2* resolv.conf

https://gerrit.wikimedia.org/r/364667

Change 364667 merged by Ema:
[operations/puppet@production] Temporarily remove achernar from lvs2* resolv.conf

https://gerrit.wikimedia.org/r/364667

Change 364669 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Temporarily remove achernar from lvs4* resolv.conf

https://gerrit.wikimedia.org/r/364669

Change 364669 merged by Ema:
[operations/puppet@production] Temporarily remove achernar from lvs4* resolv.conf

https://gerrit.wikimedia.org/r/364669

ayounsi closed this task as Resolved. Edited Jul 12 2017, 9:10 AM

Switch went down for about 10min and came back up properly.

Some notes:

  • The upgrade was smoother than using NSSU
  • If a ganeti* host is going to be impacted, list the impacted VMs as well
  • The parent/child relationship worked fine: out of the 157 icinga hosts that were impacted, 132 were unreachable (as expected) and 25 down (all ganeti VMs, expected)
  • The only page was "search.svc.codfw.wmnet/LVS HTTP IPv4 is CRITICAL" - T170378
  • restbase-async and citoid have been set to active-active for the maintenance

That page was preceded by "search.svc.codfw.wmnet/ElasticSearch health check for shards is CRITICAL".

The number of shards never reached the critical threshold; on IRC I saw:
10:24 <icinga-wm> PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9200): Read timed out. (read timeout=4)
Looks like a problem with the check itself.
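
For reference, the check is essentially an HTTP GET of the cluster health endpoint with a 4-second timeout; a minimal reproduction with the requests library (the real check script and its thresholds may differ):

    # Minimal reproduction of what the shard-health check appears to do: fetch
    # /_cluster/health with a short timeout. A slow answer raises ReadTimeout,
    # which matches the "Read timed out. (read timeout=4)" seen on IRC.
    import requests

    def cluster_health(url="http://10.2.1.30:9200/_cluster/health", timeout=4):
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        health = r.json()
        return health["status"], health["unassigned_shards"]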