lvs1016 enp5s0f0 interface errors
Closed, Resolved (Public)

Description

lvs1016 has interface errors on just enp5s0f0. They started at a very small rate ~10 days ago and have been growing slowly ever since. This in turn has caused a slowly growing rate of random ProxyFetch healthcheck failures (all on vlan 1018, which is attached to the physical interface experiencing the errors). As of today, the rate is high enough that it's causing sporadic icinga alerts for pybal pooling issues from all the ProxyFetch failures.

Attached is a graph of the network errors over 14 days from the grafana Host Overview, edited to cap at 300pps so that the rare large spikes don't drown out the interesting pattern:

2020-09-30-182414_1912x843_scrot.png (843×1 px, 108 KB)

Does this look like a signature from anything we recognize, like a slowly failing transceiver?

It's also possible the drops here are just because we're reaching some kind of effective pps limit due to new monitoring checks being added. It seems unlikely we'd hit a pps limit with healthcheck traffic, though, given that the interface byte rates seem totally sane and reasonable, and we're not having any similar symptoms on lvs1015, even though it has the bulk of the same healthchecks plus all the live traffic, while lvs1016 has no live traffic to compete with.

Maybe we should try replacing transceivers on either end?

Event Timeline

BBlack triaged this task as High priority. Sep 30 2020, 6:32 PM
BBlack created this task.

Mentioned in SAL (#wikimedia-operations) [2020-09-30T18:36:31Z] <bblack> lvs1016 pybal diff alerts downtimed in icinga for ~48h to reduce annoying flappy alert spam, with reference to https://phabricator.wikimedia.org/T264227

No errors on the switch side.

lvs1016:~$ sudo ethtool -S enp5s0f0 | grep crc
     rx_crc_errors: 27387518
lvs1016:~$ sudo ethtool -S enp5s0f0 | grep crc
     rx_crc_errors: 27387851

So yeah, the usual replace optics and/or patch, starting with the lvs1016 side.
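
For reference, the difference between the two samples above can be worked out mechanically. This is a minimal sketch in Python, not something that was run on the host: `parse_ethtool_counter` is a hypothetical helper, and the two input strings are simply the output lines quoted above. The time between the two samples is not recorded in this task, so only the raw delta is computed, not a rate.

```python
def parse_ethtool_counter(output: str, name: str) -> int:
    """Return the value of counter `name` from `ethtool -S` style output.

    Hypothetical helper: expects lines of the form "  <name>: <integer>".
    """
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return int(value.strip())
    raise KeyError(name)


# The two samples quoted in this task (interval between them is unknown).
first = "     rx_crc_errors: 27387518\n"
second = "     rx_crc_errors: 27387851\n"

delta = (parse_ethtool_counter(second, "rx_crc_errors")
         - parse_ethtool_counter(first, "rx_crc_errors"))
print(delta)  # 333
```

So the counter advanced by 333 CRC errors between the two invocations, which confirms the errors are still accumulating on the host side.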

@Cmjohnson please sync up with @BBlack or Traffic.

@BBlack it might be worth graphing and alerting on rx_crc_errors as well.
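
As a rough illustration of what alerting on rx_crc_errors could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Sample` type, the function names, and the 1 error/sec threshold are hypothetical and are not taken from any existing monitoring config. It turns two readings of the cumulative counter into a per-second rate and tolerates the counter going backwards, as it would after a reboot resets the NIC statistics.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float      # seconds since some fixed epoch
    rx_crc_errors: int    # cumulative counter as reported by `ethtool -S`


def crc_error_rate(prev: Sample, curr: Sample) -> float:
    """Errors per second between two samples of a cumulative counter."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.rx_crc_errors - prev.rx_crc_errors
    if delta < 0:
        # Counter went backwards: assume it was reset (e.g. by a reboot)
        # and count only the errors seen since the reset.
        delta = curr.rx_crc_errors
    return delta / elapsed


def should_alert(rate: float, threshold: float = 1.0) -> bool:
    """Hypothetical threshold check; 1 error/sec is an arbitrary example."""
    return rate > threshold
```

For example, if the two counter values quoted earlier had been taken 10 seconds apart (an assumed interval, not a measured one), `crc_error_rate` would report about 33.3 errors per second and `should_alert` would fire.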

Mentioned in SAL (#wikimedia-operations) [2020-10-01T15:53:51Z] <bblack> lvs1016: re-disabled puppet with ticket ref in comment, downed interface enp5s0f0 since it's flapping furiously - T264227

The link has gotten worse and begun flapping up and down rapidly since the last update, causing a loss of routing to the row. I've downtimed the whole host now in icinga, disabled puppet on the host, and manually downed the interface to stop the flapping.

Mentioned in SAL (#wikimedia-operations) [2020-10-01T16:19:41Z] <bblack> rebooting lvs1016 to a fresh state for interface config and error counters, etc - T264227

BBlack assigned this task to Cmjohnson.

@Cmjohnson replaced the SFPs on both ends of this link before my reboot above. Since the reboot, we don't seem to have any abnormal rate of interface errors, nor do we yet observe any ProxyFetch failures in pybal's logs.

I'm tentatively resolving this, will re-open if things go sour again.