[eqiad] faulty VC optics
Closed, Resolved · Public

Description

Looking at the logs, some VC links have been flapping more or less regularly.

VC link error statistics are not in LibreNMS. Looking more closely at the VCs' other interfaces (using show virtual-chassis vc-port statistics extensive | match "FPC|Port|CRC alignment errors") shows CRC errors:
asw2-a-eqiad:
fpc1:1/0 - CRC alignment errors: 31
fpc1:1/1 - CRC alignment errors: 401 <---- matches interface flap logs - increased since last counter clear (17)
fpc8:1/1 - CRC alignment errors: 374063 - increased since last counter clear (54730)

asw2-b-eqiad:
fpc1:1/1 - CRC alignment errors: 55 <---- matches interface flap logs - increased since last counter clear (2)
fpc3:1/1 - CRC alignment errors: 937
fpc6:1/0 - CRC alignment errors: 36

asw2-c-eqiad:
fpc3:1/0 - CRC alignment errors: 398877076 - increased since last counter clear (38926979)
fpc5:1/3 - CRC alignment errors: 610 <---- matches interface flap logs - increased since last counter clear (268)
fpc6:1/2 - CRC alignment errors: 3
fpc8:1/1 - CRC alignment errors: 998

asw2-d-eqiad:
fpc8:1/0 - CRC alignment errors: 71
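
To pick out the worst ports from a listing like the one above, the summary can be parsed and ranked by error count. This is a minimal sketch that assumes the hand-written layout used in this task (a switch name followed by a colon, then one "fpcN:x/y - CRC alignment errors: N" line per port), not raw Junos output; `parse_crc_summary` and `SAMPLE` are hypothetical names used for illustration only.

```python
import re

# Matches port lines as written in this task, ignoring any trailing annotation.
PORT_RE = re.compile(r"^(fpc\d+):(\d+/\d+) - CRC alignment errors: (\d+)")

def parse_crc_summary(text):
    """Return (switch, port, crc_errors) tuples sorted by error count, highest first."""
    results = []
    switch = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = PORT_RE.match(line)
        if match:
            port = f"{match.group(1)}:{match.group(2)}"
            results.append((switch, port, int(match.group(3))))
        elif line.endswith(":"):
            # A line such as "asw2-c-eqiad:" starts a new switch section.
            switch = line[:-1]
    return sorted(results, key=lambda entry: entry[2], reverse=True)

# Excerpt copied from the listing above.
SAMPLE = """\
asw2-c-eqiad:
fpc3:1/0 - CRC alignment errors: 398877076 - increased since last counter clear (38926979)
fpc5:1/3 - CRC alignment errors: 610 <---- matches interface flap logs - increased since last counter clear (268)
fpc6:1/2 - CRC alignment errors: 3

asw2-d-eqiad:
fpc8:1/0 - CRC alignment errors: 71
"""

ranked = parse_crc_summary(SAMPLE)
for switch, port, errors in ranked:
    print(f"{switch} {port}: {errors}")
```

Ranking by raw count only shows the cumulative total; whether a port is still erroring needs a second reading after the counters are cleared.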

As it's not possible to know if all those errors are still relevant, I cleared the counters and will check them after the break.

In the meantime we should replace the faulty optics that match the interface flap logs.

All of those are MMFs with QSFP+-40G-SR4 on each side.

The main offender is this one: https://netbox.wikimedia.org/dcim/cables/3391/ so we should start with it.

@Jclark-ctr do we have spares on site? When would be the earliest you could do it in January?

Event Timeline

ayounsi triaged this task as High priority.

For the record I had a quick look at the codfw / ulsfo / eqsin / esams virtual-chassis port stats and none of them are showing historical CRC errors.

I had a closer look, as LibreNMS has supported this kind of graphing and alerting for a while: https://github.com/librenms/librenms/blame/258505ed4429050344f99cbbcb71b0c14bca50d6/includes/polling/ports/os/junos.inc.php#L55

And we can see the graphs there for example: https://librenms.wikimedia.org/device/device=162/tab=port/port=18678/

Fetching the collected data directly from MySQL, though, doesn't match the real-life device data:
SELECT device_id, ifDescr, ifInErrors, ifInErrors_prev, ifInErrors_delta FROM librenms.ports where ifDescr like "fpc%" and ifInErrors != 0

Which is probably why the alerting rule devices.type != "power" AND ports.ignore = 0 AND ports.ifInErrors_delta != 0 doesn't trigger.
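
To make the rule's condition concrete, here is a minimal Python sketch of the same predicate applied to rows shaped like the query above. The row values and the `rule_matches` helper are hypothetical and are not LibreNMS code. The point is only that the rule keys on the per-poll delta, so a port with a large cumulative ifInErrors but no increase between the last two polls does not match.

```python
def rule_matches(device_type, port_ignore, if_in_errors_delta):
    """Mirror: devices.type != "power" AND ports.ignore = 0 AND ports.ifInErrors_delta != 0."""
    # In SQL a comparison against NULL is never true, so a missing delta cannot match.
    if if_in_errors_delta is None:
        return False
    return device_type != "power" and port_ignore == 0 and if_in_errors_delta != 0

# Hypothetical rows: (device type, ports.ignore, ifInErrors, ifInErrors_delta)
rows = [
    ("network", 0, 398877076, 0),  # large total, but nothing new since the last poll
    ("network", 0, 610, 12),       # errors increased between polls
    ("network", 1, 55, 3),         # increased, but the port is ignored
]

matching = [row for row in rows if rule_matches(row[0], row[1], row[3])]
print(matching)
```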

For example jnxVirtualChassisPortInCRCAlignErrors.5."vcp-255/1/3" = 42 while it has been cleared and should now be at 0.
It also surprisingly matches this value: jnxVirtualChassisPortUndersizePkts.5."vcp-255/1/3" = 42

My guess so far is that it's a bug on the Juniper devices, and there is not much more we can do other than upgrade.

Not directly related, the devices also expose jnxVirtualChassisPortCarrierTrans.5."vcp-255/1/3" = 3868, which could be a good indicator of issues.
Last, alerting based on syslog messages might be an option as well.

@ayounsi I do have spare optics for the connection. 1/5/23 is a good day to perform this maintenance.

For example jnxVirtualChassisPortInCRCAlignErrors.5."vcp-255/1/3" = 42 while it has been cleared and should now be at 0.

jnxVirtualChassisPortInCRCAlignErrors is a COUNTER64, so I'm not sure that clearing the device counters should reset what SNMP reports. If it did then LibreNMS would interpret a value of '0' as indicating the counter had reached 2^64 and wrapped around, which would totally skew the results. So I expect the SNMP counters are not reset when you issue the reset command for what the CLI shows.
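
The wrap-around concern can be illustrated with a naive delta calculation. This is a sketch of the general technique for monotonic counters, assuming any decrease between two samples is read as a wrap; `counter_delta` is a hypothetical helper and not LibreNMS's actual polling code.

```python
COUNTER64_MODULUS = 2**64  # a Counter64 wraps back to 0 after 2^64 - 1

def counter_delta(previous, current, modulus=COUNTER64_MODULUS):
    """Increase between two samples of a counter that is assumed to only go up."""
    if current >= previous:
        return current - previous
    # A smaller value is interpreted as the counter having wrapped past the modulus.
    return modulus - previous + current

# Normal case: the counter simply grew between polls.
print(counter_delta(17, 401))

# If a counter at 42 were reset to 0, a wrap-assuming poller would see a huge jump.
print(counter_delta(42, 0))
```

Under that assumption, a reset from 42 to 0 is indistinguishable from roughly 2^64 new errors, which is why a counter reset visible over SNMP would skew the graphs.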

Fetching the collected data directly from MySQL, though, doesn't match the real-life device data:
SELECT device_id, ifDescr, ifInErrors, ifInErrors_prev, ifInErrors_delta FROM librenms.ports where ifDescr like "fpc%" and ifInErrors != 0

I'm not sure I'm totally following what you observed here. I gather the vcp interfaces didn't appear on the back of that query? If you leave out the final and ifInErrors != 0, does it return anything? I should take a look, I guess. I haven't done it before: is there anything to watch out for when opening a MySQL shell? I guess be careful not to run queries that are going to return insane amounts of data or otherwise stress the CPU?

jnxVirtualChassisPortInCRCAlignErrors is a COUNTER64, so I'm not sure that clearing the device counters should reset what SNMP reports. If it did then LibreNMS would interpret a value of '0' as indicating the counter had reached 2^64 and wrapped around, which would totally skew the results. So I expect the SNMP counters are not reset when you issue the reset command for what the CLI shows.

Good to know! I assumed clearing the counters would clear that as well and tools would be smart enough to know that it's only increasing.

I'm not sure I'm totally following what you observed here. I gather the vcp interfaces didn't appear on the back of that query?

They did, but based on the previous assumption the results seemed incorrect to me as they didn't match the CLI output.

So on one hand we're graphing data that somehow correlates with faulty optics, on the other hand it might not be enough to trigger alerting. The mystery still stands :)

For MySQL happy to walk you through how I do it over IRC.

I had a look at the eqiad counters and updated the task description with what has increased. We should replace the optic on the 5 interfaces that stand out.

Mentioned in SAL (#wikimedia-operations) [2023-01-05T13:38:00Z] <XioNoX> start [eqiad] faulty VC optics maintenance - T325803

Everything has been replaced, thanks @Jclark-ctr!

I'll check it on Monday to see if there are any ongoing errors and close it if it's good!

Thanks Ayounsi! If any issues come back, I still have more spare optics. Would Monday also be a good day to schedule the line card moves if no errors return?

Unfortunately that didn't solve it for all switches:

asw2-c-eqiad is all good, but A and B are still showing errors.

asw2-a-eqiad:
fpc1:1/1 - CRC alignment errors: 23
fpc8:1/1 - CRC alignment errors: 11578

asw2-b-eqiad:
fpc1:1/1 - CRC alignment errors: 12

So the next step is to replace the optic on the remote side. Let's sync on IRC for a time to do that work.

Mentioned in SAL (#wikimedia-operations) [2023-01-09T16:04:36Z] <XioNoX> start VC link maintenance in eqiad - T325803

Replaced and counters cleared. Let's check in a couple days.

asw2-a-eqiad is now all good but...

...asw2-b-eqiad:fpc1:1/1 is still showing errors...

Next step will be to replace the fiber between the two (already replaced) optics.

@Jclark-ctr let me know when would be a good time to do so.

Mentioned in SAL (#wikimedia-operations) [2023-01-10T14:56:48Z] <XioNoX> start VC link maintenance in eqiad - T325803

asw2-b-eqiad: fpc1:1/1 Cleaned fiber and replaced optic

All good, thanks a lot!