
codfw row B switch upgrade
Closed, Resolved · Public

Description

Similar to T167274 and T168462

All the CODFW switches will need to be upgraded to fix T133387.

This time it's row B's turn.

I'm planning on doing the upgrade on Wednesday, July 12th at 0800 UTC.
That's just for the sake of picking a date; I can reschedule at will, so let me know if that's an issue for anyone.

2h total maintenance time.

As the previous NSSU upgrade (one rack after the other) showed issues (see T168462#3390519), this time the upgrade is going to be "standard", which means the whole row will go down for between 10 and 20 minutes.

The full list of servers is available at: https://racktables.wikimedia.org/index.php?page=row&row_id=2199

To summarize, here are the types of hosts in that row:

achernar   <-- recursive DNS, see below 
cp*
db*
elastic*
es*
eventlog*
ganeti*
graphite*
kafka*
kubernetes*
labstore*
labtestcontrol*
labtestnet*
labtestneutron*
labtestvirt*
lvs*
maps*
maps-test*
mc*
ms-be*
ms-fe*
mw*
ores*
osm-db*
osm-web*
pc*
prometheus*
rdb*
restbase*
restbase-test*
scb*
subra*
tmh*
wtp*

As achernar is the 2nd recursive DNS in the LVS resolv.conf, we should not hit T154759, but we might want to remove it from the LVS resolv.conf before the maintenance to be on the safe side.
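
For illustration, this precaution amounts to dropping achernar's nameserver line from the LVS hosts' /etc/resolv.conf (done through Puppet in practice, as the changes further down show). A minimal sketch of the idea in Python; the resolver address below is a placeholder, not achernar's real IP:

    # Sketch only: remove a given recursive resolver from resolv.conf.
    # In production this is managed by Puppet, not edited by hand.
    RESOLVER_IP = "198.51.100.42"  # placeholder, not achernar's real address

    def strip_resolver(path="/etc/resolv.conf", resolver=RESOLVER_IP):
        with open(path) as f:
            lines = f.readlines()
        kept = [l for l in lines if not (l.startswith("nameserver") and resolver in l)]
        with open(path, "w") as f:
            f.writelines(kept)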

I subscribed users to this task based on https://etherpad.wikimedia.org/p/p0Iq39YWZg . Don't hesitate to add people I may have missed, or to remove yourself from the task if you're not involved.

Timeline, please edit and add anyone who needs to be notified or any extra step that needs to be done.
1h before the window (0700 UTC):

  • Depool codfw from DNS
  • Warn people of the upcoming maintenance
  • Ping @elukey to disable kafka
  • Route ulsfo cache traffic around codfw
  • Fail over LVS* to the passive node of each pair
  • Downtime switch in Icinga/LibreNMS
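
For the Icinga side of the downtime step above, this essentially means writing a SCHEDULE_HOST_DOWNTIME external command for the switch to Icinga's command file. A rough sketch; the command-file path and the switch's Icinga host name are assumptions and may differ:

    # Sketch: schedule a fixed 2h downtime for the switch via Icinga's external command file.
    # CMD_FILE path and the switch host name are assumptions.
    import time

    CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"
    HOST = "asw-b-codfw"
    DURATION = 2 * 3600

    def downtime_switch():
        now = int(time.time())
        cmd = (f"[{now}] SCHEDULE_HOST_DOWNTIME;{HOST};{now};{now + DURATION};"
               f"1;0;{DURATION};ayounsi;row B switch upgrade\n")
        with open(CMD_FILE, "w") as f:
            f.write(cmd)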

After the upgrade:

  • Confirm switches are in a healthy state (see the sketch after this list)
  • Re-enable igmp-snooping
  • Confirm T133387 is fixed (otherwise re-disable igmp-snooping)
  • Run a LibreNMS discovery/poll
  • Ask the users listed above to confirm that everything is good
  • Remove monitoring downtime
  • Route ulsfo cache traffic through codfw
  • Re-failover LVS*
  • Repool codfw in DNS
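
For the "confirm switches are in a healthy state" step, something like the following could be used from a host with access to the switch; it relies on the Junos PyEZ library (junos-eznc), and the management hostname and user are assumptions (in practice this check is likely done by hand):

    # Sketch: connect to the upgraded switch and print basic facts (model, JunOS version).
    # Hostname and user are placeholders.
    from jnpr.junos import Device

    def check_switch(host="asw-b-codfw.mgmt.codfw.wmnet", user="netops"):
        with Device(host=host, user=user) as dev:
            # facts are gathered when the NETCONF session is opened
            print(dev.facts["hostname"], dev.facts["model"], dev.facts["version"])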

Event Timeline

Restricted Application added a subscriber: Aklapper.

We can absorb whatever outage is convenient for you, @ayounsi, for our things:

labstore*
labtestcontrol*
labtestnet*
labtestneutron*
labtestvirt*

@madhuvishy we may need to keep tabs and restart backups. Otherwise it's just administrative stuff users don't care about.

Edit: moving the maintenance to Wednesday July 12 for availability reasons.

Important hosts in this row that we have to downtime:

es2018 -> es1014 needs to be downtimed, as it will page when replication breaks
db2029 is s7 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2028 is s6 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2023 is s5 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2019 is s4 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2018 is s3 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2017 is s2 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.
db2016 is s1 codfw master (We might want to downtime its slaves to avoid IRC noise, they won't page) - replication codfw -> eqiad is disconnected.

Thinking about it, we should spread those masters out a bit; they are all in B6 (T169501).
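
As a convenience, replication state on those hosts can be checked in bulk around the window; a minimal sketch assuming pymysql and suitable read-only credentials in a local .my.cnf (the account and exact FQDNs are illustrative):

    # Sketch: print SHOW SLAVE STATUS for the affected codfw masters, to confirm
    # replication is in the expected (stopped or healthy) state around the window.
    import os
    import pymysql

    HOSTS = ["db2016", "db2017", "db2018", "db2019", "db2023", "db2028", "db2029", "es2018"]

    def slave_status(host):
        conn = pymysql.connect(host=host + ".codfw.wmnet",
                               read_default_file=os.path.expanduser("~/.my.cnf"),
                               cursorclass=pymysql.cursors.DictCursor)
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            return cur.fetchone()  # None if the host replicates from nothing

    for h in HOSTS:
        status = slave_status(h)
        print(h, status["Slave_IO_Running"] if status else "no replication")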

For kafka2002 it is sufficient to depool it from eventbus via pybal/conftool, and then re-balance the cluster when the work is done (I will take care of both steps).
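
For reference, the depool/repool is roughly a conftool one-liner; sketched below via subprocess, with the caveat that the exact confctl selector for eventbus is an assumption on my part:

    # Sketch: depool kafka2002 from eventbus before the window and repool afterwards.
    # The confctl selector fields are assumptions; check conftool-data for the real ones.
    import subprocess

    def set_pooled(fqdn, service, state):
        subprocess.run(
            ["confctl", "select", "name={},service={}".format(fqdn, service),
             "set/pooled={}".format(state)],
            check=True,
        )

    set_pooled("kafka2002.codfw.wmnet", "eventbus", "no")   # before the maintenance
    # ... maintenance and cluster re-balance ...
    set_pooled("kafka2002.codfw.wmnet", "eventbus", "yes")  # when done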

During the last upgrade we generated a lot of pages and alerts on IRC; that is expected, since the maintenance needed reboots of all the row's switches. What I am wondering is whether we could silence Icinga in a way that prevents unrelated outages occurring at the same time from being "masked" by the background noise of alarms firing. Maybe something like creating a script that grabs the list of "affected" hosts and silences them in Icinga for two hours via icinga-downtime on einsteinium?
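
Something along these lines could work; a rough sketch, assuming the icinga-downtime helper on einsteinium takes a hostname, a duration in seconds and a reason (its real flags may differ):

    # Rough sketch of the proposed script: read the list of affected hosts and
    # schedule a 2h Icinga downtime for each of them via the icinga-downtime helper.
    # The helper's flags are an assumption; adjust to its real interface.
    import subprocess

    DURATION = 2 * 3600
    REASON = "codfw row B switch upgrade"

    def downtime_hosts(hostfile="affected_hosts.txt"):
        with open(hostfile) as f:
            hosts = [line.strip() for line in f if line.strip()]
        for host in hosts:
            subprocess.run(
                ["icinga-downtime", "-h", host, "-d", str(DURATION), "-r", REASON],
                check=True,
            )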

During the first upgrade, despite the switch being downtimed in Icinga, the upgrade process didn't take it down (it kept replying to pings), so all the hosts depending on that stack alerted as being down.

For the second one, we disabled the checks and manually set the switch as down in Icinga; that way the parent/child relationship in Icinga should have marked the hosts as "unreachable" and only the switch as "down", and thus not paged for the hosts. Not sure whether that worked or not.
It's also possible that the services took longer to recover than the hosts, and thus alerted as soon as the hosts came back up. Maybe there is a way to have the service checks wait a bit longer after a host comes back up?

For this upgrade, all the switches should go down at the same time, so it should behave the same as the second one.

Change 364661 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364661

Change 364663 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Route traffic around codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364663

Change 364661 merged by Ayounsi:
[operations/dns@master] Depool codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364661

Change 364663 merged by Ayounsi:
[operations/puppet@production] Route traffic around codfw for asw-b-codfw upgrade.

https://gerrit.wikimedia.org/r/364663

Change 364667 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Temporarily remove achernar from lvs2* resolv.conf

https://gerrit.wikimedia.org/r/364667

Change 364667 merged by Ema:
[operations/puppet@production] Temporarily remove achernar from lvs2* resolv.conf

https://gerrit.wikimedia.org/r/364667

Change 364669 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Temporarily remove achernar from lvs4* resolv.conf

https://gerrit.wikimedia.org/r/364669

Change 364669 merged by Ema:
[operations/puppet@production] Temporarily remove achernar from lvs4* resolv.conf

https://gerrit.wikimedia.org/r/364669

ayounsi closed this task as Resolved. Edited Jul 12 2017, 9:10 AM

Switch went down for about 10min and came back up properly.

Some notes:

  • The upgrade was smoother than using NSSU
  • If a ganeti* host is going to be impacted, list the impacted VMs as well
  • The parent/child relationship worked fine: out of the 157 icinga hosts that were impacted, 132 were unreachable (as expected) and 25 down (all ganeti VMs, expected)
  • The only page was "search.svc.codfw.wmnet/LVS HTTP IPv4 is CRITICAL" - T170378
  • restbase-async and citoid have been set to active-active for the maintenance

That page was preceded by "search.svc.codfw.wmnet/ElasticSearch health check for shards is CRITICAL".

The number of shards never reached the critical threshold; on IRC I saw:
10:24 <icinga-wm> PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9200): Read timed out. (read timeout=4)
Looks like a problem with the check itself.
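
For reference, the check is essentially an HTTP GET of the cluster health endpoint with a 4-second timeout; a minimal reproduction with the requests library (the real check script and its thresholds may differ):

    # Minimal reproduction of what the shard-health check appears to do: fetch
    # /_cluster/health with a short timeout. A slow answer raises ReadTimeout,
    # which matches the "Read timed out. (read timeout=4)" seen on IRC.
    import requests

    def cluster_health(url="http://10.2.1.30:9200/_cluster/health", timeout=4):
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        health = r.json()
        return health["status"], health["unassigned_shards"]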