We've had a string of asw-c2-eqiad issues lately that may or may not be connected with each other. The causes are still largely unknown, but they definitely include JunOS bugs, for which we can't do much other than upgrade JunOS or replace the switches as part of an extended maintenance.
The timeline of what we know so far is:
- Jan 12 04:56:46 (during All Hands): asw-c2 reboots out of the blue. Downtime was short (~5 mins) but impactful due to the S1 master being on C2. Multiple opsens respond and page others; eventually both @faidon and @mark investigated and found nothing but log lines indicating a switch reboot:
Jan 12 04:56:46 asw-c-eqiad chassisd[1512]: CHASSISD_FRU_OFFLINE_NOTICE: Taking FPC 2 offline: Removal
- Jan 18 17:13-22:46 Cloud-Services team switches Labs storage from labstore1001 to labstore1004, a server connected to asw-c2.
- Jan 19 20:41:40: an alert is triggered for labstore1004 for a high load average. Multiple people from Cloud-Services & SRE investigate and find multiple nfsd processes in D state and/or crashed. The Labs team fails over to labstore1005 in response. Service is restored.
- Jan 19 21:35:29: @chasemp reboots labstore1004, as suspicion centered on kernel bugs. Unknown to everyone at the time, asw-c-eqiad logs these error messages:
Jan 19 21:35:29 asw-c-eqiad fpc1 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2)
Jan 19 21:35:29 asw-c-eqiad fpc3 MRVL-L2:mrvl_fdb_mac_entry_mc_set(),1089:Sanity Checks Failed(Invalid Params:-2)
Jan 19 21:35:29 asw-c-eqiad fpc3 MRVL-L2:mrvl_fdb_mac_entry_rebake(),482:fdb_mac_entry_mc_set() failed(-1)
Jan 19 21:35:29 asw-c-eqiad fpc3 RT-HAL,rt_entry_topo_handler,4121: l2_halp_vectors->l2_entry_rebake failed
[...]
- Jan 19 23:50: labstore1004 logs resolution failures ("Name or service not known") for the first time, indicative of connectivity issues. These continue until 16:28 or so (see below).
- Jan 20 ~07:00: @faidon investigates those network connectivity reports based on the above and @Joe/@madhuvishy's reports of packet loss; notices that labstore1004 & labstore1005 are misconfigured: they both have IP addresses in the 10.64.37.0/24 space on both eth0 and eth1, and all four (2x2) ports are connected to the same VLAN.
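Worth noting for context: with two interfaces addressed in the same subnet on the same VLAN, Linux's default ARP behavior will answer for either address on either port, so the switch can see the same IP/MAC flap between ports ("ARP flux"). A minimal sketch of how to spot and mitigate this, assuming the eth0/eth1 layout described above (the exact addresses are placeholders):

```shell
# Spot the misconfiguration: multiple interfaces with addresses in the same /24
ip -4 addr show | grep 'inet 10.64.37.'

# With the kernel defaults (arp_ignore=0, arp_announce=0), any interface may
# answer ARP for any local address. Restrict replies to the interface that
# actually owns the address:
sysctl -w net.ipv4.conf.all.arp_ignore=1    # reply only if the target IP is on the receiving interface
sysctl -w net.ipv4.conf.all.arp_announce=2  # always use the best local source address in ARP requests
```

This doesn't excuse the switch's behavior, but it removes one source of conflicting MAC-learning updates.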
- Jan 20 09:31:23 and 10:11:20: asw-c-eqiad logs more Sanity Checks Failed errors, not currently linked to any other event.
- Jan 20 16:18: @Cmjohnson/@chasemp swap labstore1004's eth0 cable; asw-c-eqiad logs more Sanity Checks Failed.
- Jan 20 17:21: @chasemp moves the IP to eth0 and shuts down labstore1004's eth1; asw-c-eqiad logs more Sanity Checks Failed.
- Jan 21 01:04: asw-c2-eqiad reboots in a similar way to the Jan 12 reboot. Upon reboot, it prints "Sanity Check Failed". Multiple opsens respond.
- Jan 21 01:22: @ema pages @faidon, who subsequently investigates, eventually notices that labstore1004 is on C2, and makes the (mental) link. @Volans points to the SAL entry about labstore1004's eth1 being down. @faidon notices that while labstore1004's eth1 is down, the port still appears as "physical link up" on the switch(!?)
- Jan 21 01:36: @chasemp gets paged and confirms that no labstore-related work happened on Jan 12th or thereabouts, and that the latest labstore-related action was at 17:21.
- Jan 21 01:44, 01:47, 01:50: @faidon tries setting the port to disable and back to enable, as well as confirming that the port is labelled correctly (it was). The switch logs Sanity Checks Failed on every one of those actions and still refuses to see the port's physical link as down. The port was left at "disable" just in case.
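For reference, the disable/enable cycle above would look roughly like this on JunOS; the interface name below is a placeholder, as the actual port isn't named in the log excerpts:

```shell
# On asw-c-eqiad (JunOS CLI); ge-2/0/18 is a hypothetical port name.
configure
set interfaces ge-2/0/18 disable       # administratively disable the port
commit
delete interfaces ge-2/0/18 disable    # re-enable it
commit
exit

# Compare the switch's view of the link with the server's:
show interfaces ge-2/0/18 terse        # switch side: Admin and Link columns
# (on labstore1004, `ip link show eth1` or `ethtool eth1` should agree;
# here the switch kept reporting link up while the server's NIC was down)
```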
- Jan 21 01:52: @faidon remembers the 20150607-LabsNFS-Outage, which had very similar log lines and effects to all of the above.
Based on the above, I think we can draw the following conclusions so far:
- There is a definitive link between labstore1004's link state and weird errors logged in the switch.
- labstore1004's connectivity issues may be a cause of the load spike *or* an effect of the reboot (in turn caused by the reshuffling of the two ports). Logs point to the latter, but that leaves the load spike without a known cause.
- There is no evidence of the Sanity Checks Failed errors being linked with either of the switch reboots. One reboot was days before any of those messages appeared and the other one was ~8 hours after the last of them did.
- That said, the odds of a certain switch rebooting twice *and* a server plugged into it causing error messages and misreporting of physical link state being a mere coincidence are probably remote. Even though we don't have sufficient evidence, I would guess that the two events are likely related, either to each other or to a third cause (a hardware failure or a software bug causing state mismatches across the stack).