Link errors from lvs1017 to ssw1-e1-eqiad
Closed, Resolved, Public

Assigned To
Authored By
cmooney
Sep 6 2024, 2:35 PM
Referenced Files
F57469270: image.png
Sep 6 2024, 3:42 PM
F57469267: image.png
Sep 6 2024, 3:42 PM
F57469046: image.png
Sep 6 2024, 2:35 PM
F57469041: image.png
Sep 6 2024, 2:35 PM

Description

In almost a direct mirror of T374155, it seems our link from lvs1017 port enp94s0f0np0 to ssw1-e1-eqiad (see netbox) is exhibiting errors.

There is usually almost zero usage on this link, so the errors were only occasional, but fearing something similar I sent a small 10Mb stream out from the LVS, which resulted in a constant level of input errors being reported on the switch side:

image.png (588×972 px, 54 KB)

image.png (540×972 px, 55 KB)
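
For reference, the "constant level of input errors" read off the graphs above can also be expressed as a rate computed from two readings of the switch port's input error counter. The following is only a minimal sketch: the `CounterSample` type and the `input_error_rate` function are hypothetical names, how the counter is read from the switch is not shown, and the numbers in the usage line are made up for illustration rather than taken from this link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One reading of an interface's input error counter (hypothetical type)."""
    timestamp_s: float   # when the reading was taken, in seconds
    input_errors: int    # cumulative input errors reported by the port


def input_error_rate(first: CounterSample, second: CounterSample) -> float:
    """Return input errors per second between two counter readings."""
    elapsed = second.timestamp_s - first.timestamp_s
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    delta = second.input_errors - first.input_errors
    if delta < 0:
        # Counters were cleared or wrapped between readings; the rate is unknown.
        raise ValueError("counter went backwards between samples")
    return delta / elapsed


# Illustrative values only, not measurements from lvs1017.
rate = input_error_rate(CounterSample(0.0, 100), CounterSample(60.0, 160))
print(f"{rate:.2f} input errors/s")
```

A rate that stays above zero while the test stream runs points at the link itself; a rate of zero under the same load is what a clean link would look like.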

Given the lack of data on this link normally, I don't think we have to treat this as super-urgent, but we should try to investigate as soon as possible. It does make me wonder if the SFP has also gone bad, similar to the issue with lvs1019 (maybe a bad batch, or they both just aged out at the same time?). As on the previous occasion, light levels on either side are good, so I suspect the fiber run / patches less, though a problem there is not impossible.

Either way, the steps would be to do a basic check on the fiber path and, if that looks ok, swap one optic, then the other if that still doesn't fix it. I'd say we're best tackling the lvs side first, as that was what was wrong with the other one. We'll need to fail over the host before doing any of this; I will work with dc-ops on irc to line up a time.
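
The order of operations described above could be sketched roughly as follows. This is an illustration of the plan, not a tool that exists: the function name `troubleshoot`, the step labels, and the `link_clean_after` callback (standing in for re-running the traffic test and checking the error counters) are all assumptions, as are the "repair fiber path" and "escalate" outcomes, which the plan does not spell out.

```python
from typing import Callable, List


def troubleshoot(fiber_path_ok: bool,
                 link_clean_after: Callable[[str], bool]) -> List[str]:
    """Return the actions taken, in order, until the link tests clean."""
    # The host is failed over before any physical work is done.
    actions = ["fail over host", "check fiber path"]

    if not fiber_path_ok:
        # Assumed branch: fix whatever the basic path check turned up.
        actions.append("repair fiber path")
        if link_clean_after("repair fiber path"):
            return actions

    # LVS side first, since that was the faulty side on the earlier task.
    for step in ("swap LVS-side optic", "swap switch-side optic"):
        actions.append(step)
        if link_clean_after(step):
            return actions

    # Placeholder outcome if neither optic swap clears the errors.
    actions.append("escalate: link still erroring")
    return actions


# Example: the path looks fine and the LVS-side swap clears the errors.
print(troubleshoot(True, lambda step: step == "swap LVS-side optic"))
```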

Event Timeline

cmooney triaged this task as Medium priority.Sep 6 2024, 2:35 PM
cmooney created this task.

Icinga downtime and Alertmanager silence (ID=c63ff66a-28d3-4567-b7cc-a03c0da01345) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: Move traffic off lvs1017 to lvs1020 to troubleshoot faulty link

lvs1017.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-09-06T15:14:32Z] <topranks> disabling PyBal on lvs1017 to shift traffic to lvs1020 and allow work to fix faulty fibre link T374247

cmooney added a subscriber: Jclark-ctr.

Ok, we have replaced the optic in lvs1017 (same model as the one taken from lvs1019, for the record), and the link now looks to be clean. Same test as before (with more bw, as the server is now depooled) and no errors logged.

image.png (598×924 px, 43 KB)

image.png (499×924 px, 45 KB)

Thanks @Jclark-ctr for the help!

Mentioned in SAL (#wikimedia-operations) [2024-09-06T15:42:43Z] <topranks> enabling PyBal on lvs1017 to make primary again after repairing faulty fiber link T374247