asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues
Closed, Resolved · Public

Description

We've had a string of asw-c2-eqiad issues lately that may or may not be connected with each other. Causes are still largely unknown, but JunOS bugs are definitely among them, and for those we can't do much other than upgrade JunOS or replace the switches as part of an extended maintenance.

The timeline of what we know so far is:

  • Jan 12 04:56:46 (during All Hands): asw-c2 reboots out of the blue. Downtime was short (~5 mins) but impactful due to the S1 master being on C2. Multiple opsens respond and page others; eventually both @faidon and @mark investigate and find nothing but log lines indicating a switch reboot:
Jan 12 04:56:46  asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 2 offline: Removal
  • Jan 18 17:13-22:46 Cloud-Services team switches Labs storage from labstore1001 to labstore1004, a server connected to asw-c2.
  • Jan 19 20:41:40: an alert is triggered for labstore1004 for high load avg. Multiple people from Cloud-Services & Operations investigate, find multiple nfsd processes in D state and/or crashed. Labs team fails over to labstore1005 as a response. Service is restored.
  • Jan 19 21:35:29: @chasemp reboots labstore1004, as the suspicions were around kernel bugs. Unknown to everyone at the time, asw-c-eqiad logs these error messages:
Jan 19 21:35:29  asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2)
Jan 19 21:35:29  asw-c-eqiad fpc3 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2)
Jan 19 21:35:29  asw-c-eqiad fpc3 MRVL-L2:mrvl_fdb_mac_entry_rebake(),482:fdb_mac_entry_mc_set() failed(-1)
Jan 19 21:35:29  asw-c-eqiad fpc3 RT-HAL,rt_entry_topo_handler,4121: l2_halp_vectors->l2_entry_rebake failed
[...]
  • Jan 19 23:50: labstore1004 logs resolution failures ("Name or service not known") for the first time, indicative of connectivity issues. These continue until 16:28 or so the next day (see below).
  • Jan 20 ~07:00: @faidon investigates those network connectivity reports based on the above and @Joe's/@madhuvishy's reports of packet loss; notices that labstore1004 & labstore1005 are misconfigured: they both have IP addresses in the 10.64.37.0/24 space on both eth0 and eth1, and all four (2x2) ports are connected to the same VLAN.
  • Jan 20 09:31:23 and 10:11:20: asw-c-eqiad logs more Sanity Checks Failed errors, not currently linked to any other event.
  • Jan 20 16:18: @Cmjohnson/@chasemp swap labstore1004's eth0 cable; asw-c-eqiad logs more Sanity Checks Failed.
  • Jan 20 17:21: @chasemp moves the IP to eth0 and shuts down labstore1004's eth1; asw-c-eqiad logs more Sanity Checks Failed.
  • Jan 21 01:04: asw-c2-eqiad reboots in similar ways to the Jan 12th reboot. Upon reboot, prints "Sanity Check Failed". Multiple opsens respond.
  • Jan 21 01:22: @ema pages @faidon, who subsequently investigates and eventually notices that labstore1004 is on C2 and makes the (mental) link. @Volans points out the SAL entry about labstore1004's eth1 being down. @faidon notices that while labstore1004's eth1 is down, the port still appears as "physical link up" on the switch(!?)
  • Jan 21 01:36: @chasemp gets paged and confirms that no labstore-related work happened on Jan 12th or thereabouts, and that the last labstore-related action was at 17:21.
  • Jan 21 01:44, 01:47, 01:50: @faidon tries setting the port to disable and back to enable, as well as confirming that the port is labelled correctly (it was). Switch logs Sanity Checks Failed on every one of those actions, refuses to see the port's physical link down. Port was left at "disable" just in case.
  • Jan 21 01:52: @faidon remembers the 20150607-LabsNFS-Outage, which had very similar log lines and effects to all of the above.
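The switch log lines quoted above follow a fixed syslog layout, which makes it possible to correlate them mechanically with server-side actions. A minimal parsing sketch (assumption: the year is not present in syslog timestamps, so it is supplied from context; only FPC-tagged lines are matched):

```python
import re
from datetime import datetime

# Matches FPC-tagged error lines like the MRVL-L2 / RT-HAL ones quoted
# above; lines from other daemons (e.g. chassisd) will not match.
LOG_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<fpc>fpc\d+)\s+(?P<msg>.*)$"
)

def parse_line(line, year=2017):
    """Return (datetime, fpc, message), or None if the line doesn't match."""
    m = LOG_RE.match(line)
    if not m:
        return None
    # Syslog omits the year; assume it from the incident's context.
    ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
    return ts, m.group("fpc"), m.group("msg")

sample = ("Jan 19 21:35:29  asw-c-eqiad fpc3 MRVL-L2:"
          "mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2)")
print(parse_line(sample))

# Cross-check the gap between the last error burst (Jan 20 17:21)
# and the second reboot (Jan 21 01:04):
gap = datetime(2017, 1, 21, 1, 4) - datetime(2017, 1, 20, 17, 21)
print(gap)  # 7:43:00, i.e. roughly 8 hours
```

This is only an investigation aid, not part of the incident response itself; the timestamps used are the ones recorded in the timeline above.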

So far, based on the above I think we can extract these conclusions:

  • There is a definitive link between labstore1004's link state and weird errors logged in the switch.
  • labstore1004's connectivity issues may be a cause of the load spike *or* an effect of the reboot (in turn caused by reshuffling the two ports). Logs point to the latter, but that leaves the load spike without a known cause.
  • There is no evidence of the Sanity Checks Failed errors being linked with either of the switch reboots. One reboot was days before any of those messages appeared and the other one was ~8 hours after the last of them did.
  • That said, the chance that a certain switch rebooting twice *and* a server plugged into it causing error messages and misreporting of physical link state are a mere coincidence is probably remote. Even though we don't have sufficient evidence, I would guess that the two events are likely related either to each other or to a third cause (hardware failure or a software bug causing state mismatches across the stack).
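The dual-subnet misconfiguration noted in the timeline (both labstore hosts carrying 10.64.37.0/24 addresses on both eth0 and eth1, all on one VLAN) is the kind of thing that can be detected mechanically. A minimal sketch using Python's ipaddress module; the host addresses shown are hypothetical, since the task only records the subnet:

```python
from ipaddress import ip_interface
from itertools import combinations

def overlapping_ifaces(ifaces):
    """Given {iface_name: "addr/prefix"}, return interface pairs whose
    networks overlap. Two interfaces with addresses in the same subnet
    on the same broadcast domain lead to ambiguous ARP and source-address
    selection behaviour on Linux."""
    nets = {name: ip_interface(addr).network for name, addr in ifaces.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]

# Hypothetical addressing illustrating the misconfiguration described
# above; the actual per-port addresses were not recorded in this task.
print(overlapping_ifaces({"eth0": "10.64.37.10/24", "eth1": "10.64.37.11/24"}))
# → [('eth0', 'eth1')]
```

A check like this could run from a config audit rather than on the hosts themselves; it flags the condition regardless of whether the kernel or the switch ends up misbehaving as a result.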
faidon created this task. Jan 21 2017, 3:54 AM
Restricted Application added a subscriber: Aklapper. Jan 21 2017, 3:54 AM
elukey added a subscriber: elukey. Jan 21 2017, 12:05 PM

Added a secondary switch, asw2-c2-eqiad, accessible via scs port 48.

faidon moved this task from Backlog to In Progress on the netops board. Jan 25 2017, 11:06 AM
faidon raised the priority of this task from High to Unbreak Now!.
Restricted Application added subscribers: Jay8g, TerraCodes. Jan 25 2017, 11:07 AM

The switch rebooted again overnight (Jan 25 01:16 UTC). We are going to proceed with a replacement as soon as the DBA work (T155999) is done. Setting priority to UBN due to the fact that this has caused 3 outages so far (albeit small, ~5 minutes each).

@Cmjohnson: the one you installed seems to be booting directly into the boot loader as it can't find a kernel, probably corrupted or faulty flash. I tried booting it off the network and reinstalling JunOS but a) that failed (is it connected to the mgmt network?) b) I'd rather replace this with a switch that is generally in a good shape. Do we have any other spares on site?

Marostegui added a comment (edited). Jan 25 2017, 11:13 AM

> The switch rebooted again overnight (Jan 25 01:16 UTC). We are going to proceed with a replacement as soon as the DBA work (T155999) is done. Setting priority to UBN due to the fact that this has caused 3 outages so far (albeit small, ~5 minutes each).

Just FYI we will be doing the enwiki switchover tomorrow, Thursday 07:00UTC.

@faidon: the new switch has been installed. Also added an uplink module. The switch is accessible via mgmt.

Hi,

The pending work of T156008 shouldn't be a blocker to replacing the switch.
The switchover was done, and the only pending step is to move dbstore1001 to replicate from the new master; it currently still replicates from db1057. It is not an easy job, but given it is our 24h-delayed replica, it is not a big deal if it gets a bit more delay because of the switch replacement.

asw-c2-eqiad was replaced yesterday (Jan 26 17:50 UTC) with one of our spares. Total downtime was approximately 30 minutes, mostly due to the recabling effort but also due to an extra reboot that was needed to enable mixed mode. For approximately 1½ minutes within that window, right after the new switch booted and joined the stack, the whole stack got desynced, with multiple FPCs leaving and joining the stack, causing widespread issues across the C row (cp* alerts, leader elections for etcd, master switchover in Hadoop etc.). This may or may not be related to the row's state in general (Sanity Checks Failed errors etc.) -- unknown and very difficult to know.

During the whole 30 minute window there was also an increased response time from the MediaWiki API, that cascaded into MobileApps issues and RESTBase alerts. This was likely an effect of the (purposefully not depooled) database servers in that rack (db1055-56-57-59) and possibly caused by MediaWiki not responding well to unresponsive hosts. Investigation for this is still on-going.

As we only had suspicions and not evidence that asw-c2-eqiad was faulty, I'm going to keep this open for now to monitor the situation before we call this issue resolved.

fgiunchedi added a subscriber: fgiunchedi.

> During the whole 30 minute window there was also an increased response time from the MediaWiki API, that cascaded into MobileApps issues and RESTBase alerts. This was likely an effect of the (purposefully not depooled) database servers in that rack (db1055-56-57-59) and possibly caused by MediaWiki not responding well to unresponsive hosts. Investigation for this is still on-going.

Followup investigation is T156475: Investigate spike in 500s during asw-c2-eqiad replacement

faidon lowered the priority of this task from Unbreak Now! to High. Jan 30 2017, 4:54 PM

Looks stable for now, lowering priority.

faidon closed this task as Resolved. Feb 17 2017, 6:23 PM
faidon claimed this task.

The "Sanity Checks Failed" log messages continue to happen sporadically but we haven't had a switch failure in over 3 weeks now. Considering this fixed and resolving.