
Inbound interface errors
Closed, Resolved
Public

Description

Common information

  • description: Rule: Inbound interface errors Faults: #1: xe-3/2/2 - Transport: cr2-codfw:xe-1/1/1:1 (Lumen, 442550293) {#5249}

https://wikitech.wikimedia.org/wiki/Network_monitoring#LibreNMS_alerts

  • summary: Alert for device cr2-eqiad.wikimedia.org - Inbound interface errors
  • timestamp: 2023-12-08 01:32:33
  • alertname: Inbound interface errors
  • instance: cr2-eqiad.wikimedia.org
  • scope: global
  • severity: task
  • source: librenms
  • team: dcops


Event Timeline


@ayounsi What day would work best for you to assist with troubleshooting?

@ayounsi I will be in early tomorrow unboxing 15 pallets of servers that have arrived. Will you be available tomorrow to assist? I would like to try another optic.

We are monitoring this error; it has been 12 days with no faults.

Unfortunately the errors are back. Even though there are not many, it's still better to fix the issue.

Yeah, I think we still need to look at this; there were further errors on the link today. They seem somewhat related to throughput, but we are miles away from capacity (peaks under 2Gb/sec).

I'd say it's worth trying an optic swap on one end, then the other, to see if that fixes it. It may be the cable, but light levels look ok.

Icinga downtime and Alertmanager silence (ID=43516157-a1a8-45c7-82a5-d013fe5b4dda) set by cmooney@cumin1001 for 2:00:00 on 2 host(s) and their services with reason: replacing optics to troubleshoot errors on core switch link

lsw1-f1-eqiad.mgmt,ssw1-e1-eqiad.mgmt

Optic in ssw1-e1-eqiad et-0/0/8 was replaced and the new one is now working. We should keep an eye on it and see if the alert fires again.

@cmooney I have not seen any new faults on this ticket. Are you ok with closing this ticket?

Yep, looks good, thanks!

Actually, the ssw interface is fixed, but the cr2-eqiad one isn't: https://librenms.wikimedia.org/graphs/to=1699536900/id=11592/type=port_errors/from=1694180100/
phaultfinder updated the task description. The cr2-eqiad errors are probably under the threshold to trigger the alert, but they still need to be fixed.

@Jclark-ctr can you let us know when is a good time to try another optic? (cf. T342502#9073771)

@cmooney I have not seen any new faults on this ticket. Are you ok with closing this ticket?

Thanks, yeah, as Arzhel said the link looks clean now. We should mark the optic that was removed from ssw1-e1-eqiad as faulty. Maybe hold on to it for some emergency just in case, as they're not cheap. Should we also speak to Rob about ordering another spare to keep our spare stocks up? Or what's the normal process there?

@ayounsi if you wouldn't mind messaging me a time that works best for you so we can fix this.

Mentioned in SAL (#wikimedia-operations) [2023-12-08T14:44:48Z] <XioNoX> drain eqiad-codfw lumen transport for maintenance - T342502

Cleaned both sides of the cable and replaced the optic. If errors continue we will replace the cable.

Thx, closing the task, automation will re-open it if needed.