
Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091})
Closed, Resolved (Public)

Assigned To
Authored By
phaultfinder
Jan 28 2026, 4:49 AM
Referenced Files
F73035082: 748bb66f-e29b-4dcf-8be1-98881cf13ecd.jpg
Mar 18 2026, 3:32 PM
F73035081: 8a6d7f60-fd78-4b55-b4ee-db027c6c6baf.jpg
Mar 18 2026, 3:32 PM
F73035080: 728a9fed-0f12-45d5-80e1-c31c44a1295a.jpg
Mar 18 2026, 3:32 PM
F72909069: IMG_20260316_105300732.jpg
Mar 16 2026, 5:27 PM
F72909068: IMG_20260316_105328379.jpg
Mar 16 2026, 5:27 PM
F72909067: IMG_20260316_105356985.jpg
Mar 16 2026, 5:27 PM
F72909066: IMG_20260316_105509576.jpg
Mar 16 2026, 5:27 PM
Restricted File
Mar 16 2026, 5:25 PM

Description

Common information

  • alertname: InboundInterfaceErrors
  • instance: cr2-magru:9804
  • interface_description: Transit: EdgeUno (E1-SER-7853-IP) {#70091}
  • interface_name: xe-0/1/0
  • job: gnmi
  • prometheus: ops
  • scope: network
  • severity: task
  • site: magru
  • source: prometheus
  • team: dcops
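
For readers unfamiliar with how an alert like InboundInterfaceErrors is typically derived, here is a minimal sketch of turning two cumulative error-counter samples into a per-second rate. The `CounterSample` type, the `errors_per_second` function, and the reset handling are illustrative assumptions; the actual alert rule and threshold are not shown in this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One scrape of a cumulative inbound-error counter (hypothetical type)."""
    timestamp: float  # seconds since the Unix epoch
    value: int        # cumulative error count reported by the device


def errors_per_second(older: CounterSample, newer: CounterSample) -> float:
    """Return the average error rate between two samples.

    Assumption: a decrease in the counter means the counter was reset
    (for example after a reboot), so only the new value is counted.
    """
    elapsed = newer.timestamp - older.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = newer.value - older.value
    if delta < 0:
        delta = newer.value  # counter reset: count from zero
    return delta / elapsed


# 60 new errors over a 60-second scrape interval
rate = errors_per_second(CounterSample(0.0, 100), CounterSample(60.0, 160))
print(rate)  # 1.0
```

A sustained non-zero rate on a drained or lightly loaded link is what usually points at a physical-layer problem rather than congestion.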

Firing alerts


Event Timeline

ayounsi subscribed.

Can you follow up with magru remote hands for the usual steps (cable cleaning, optic replacement, etc.)? Just let netops know when to drain that link.

Apologies, this was neglected. Since we likely need to give 24 hours' notice for smart hands to avoid expedite fees, I suggest we schedule this for Monday, 2026-03-16.

@ayounsi can drain the link when he starts his workday a few hours in advance of Brazil, then we can have remote hands clean the optic when I start my workday at 7AM Pacific / 11AM Brazil.

I'd rather they not be poking in the racks without someone within DC-Ops on standby.

Remote hands ticket CS1254900 filed:

Support,

Please schedule this work for Monday 2026-03-16 @ 11AM Brazil time, as we'll drain the link a few hours in advance of the work.

We're receiving a number of interface errors on one of our transit links. We would like to have remote hands clean the fiber optic patch ends with a fiber optic patch cleaner and re-seat the optic in use as the first steps in troubleshooting this link.

The device is located in B4, U43 (rear facing): router MX204 named cr2-magru, port xe-0/1/0. Please unplug fiber patch '70091' on port xe-0/1/0, and re-seat the SFP+-10G-LR module with serial GT3Y6245, then clean the fiber patch 70091 (both ends) and re-seat it into the router port xe-0/1/0 and the patch panel port.

Once this is completed, our network administrators will investigate to see if the link error rate has diminished.

RobH triaged this task as High priority.Mar 11 2026, 4:57 PM
RobH updated the task description. (Show Details)
RobH moved this task from Backlog to Hardware Failure / Repair on the ops-magru board.
RobH added subscribers: cmooney, Papaul, RobH.

@ayounsi / @cmooney / @Papaul,

Not sure who wants to take point on this, but since I chatted briefly with Arzhel in IRC I'll default to him and y'all can reassign as needed!

I've scheduled this work for 2026-03-16 @ 11AM Brazil (7AM Pacific) to allow for either Arzhel or Cathal to start their workday normally on the Monday and drain the transit link in advance of the work. I set it to 11AM Brazil so I'll be awake and around working in Pacific timezone.
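
As a quick sanity check of the scheduled time across the zones involved, here is a small sketch. It assumes fixed offsets of UTC-3 for Brazil and UTC-7 for Pacific (daylight time is in effect on this date); the `maintenance_window` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for mid-March 2026; not looked up from a tz database.
BRT = timezone(timedelta(hours=-3), "BRT")
PDT = timezone(timedelta(hours=-7), "PDT")


def maintenance_window(year, month, day, hour, minute=0):
    """Return the same instant expressed in Brazil, UTC, and Pacific time."""
    local = datetime(year, month, day, hour, minute, tzinfo=BRT)
    return {
        "brazil": local,
        "utc": local.astimezone(timezone.utc),
        "pacific": local.astimezone(PDT),
    }


window = maintenance_window(2026, 3, 16, 11)
for name, moment in window.items():
    print(f"{name:8} {moment:%Y-%m-%d %H:%M %Z}")
# brazil   2026-03-16 11:00 BRT
# utc      2026-03-16 14:00 UTC
# pacific  2026-03-16 07:00 PDT
```

With the work at 14:00 UTC, a drain during the European morning leaves a few hours of margin before remote hands touch the link.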

I just want to ensure netops is aware of this scheduled time and has no issues, as I've entered the remote hands for it. If we need to adjust, just let me know!

I'm guessing we want to resolve this issue before I do the same on T413409, since we don't want to drain both at the same time.

Ok, that is annoying: these auto-created tasks cannot have things appended to the task description, or phaultfinder removes it...

Troubleshooting Checklist

  • Work Scheduled for Monday 2026-03-16 @ 11AM Brazil (7AM Pacific)
  • Remote hands ticket filed CS1254900 T415743#11698762
  • netops drains link in advance of work during EU AM.
  • Remote hands unseats the fiber optic patch and cleans both ends at patch panel and at router, reseats the optic in the router, reports back to us via ticket.
  • Link error checking by netops, further steps to be defined if the cleaning and re-seating fails.

Mentioned in SAL (#wikimedia-operations) [2026-03-16T13:21:02Z] <XioNoX> drain edgeuno transit for optic replacement - T415743

netops drains link in advance of work during EU AM.

Done.

They had an issue where they couldn't locate the fiber listed and instead skipped the work entirely! I need to review the photos and find out what the patch is actually labeled and confirm they need to remove it.

Arzhel: Return this to service and we'll reschedule it for Wednesday 8AM Pacific / Noon Brazil so I'll be online! I was a few minutes late to the keyboard today!

I want to ensure I'm reading the photos correctly, but the update from remote hands is the fiber ID 70091 wasn't found, and it appears to me that the fiber ID for the patch in xe-0/1/0 is 70152.

Can I get a confirm from someone else that I'm reading the ports correctly?

Confirmed with @cmooney via IRC that 70152 is indeed xe-0/1/0 in these photos and updated the remote hands ticket for Wednesday.

We had the wrong cable ID in our records. Thank you for the photos!

The cable ID to use for this request is 70152. Please re-schedule this work for Wednesday, 2026-03-18 @ 11:30AM Brazil Time for the following:

Wednesday, 2026-03-18 @ 11:30AM:
B4, U43 (rear facing) router MX204 named cr2-magru port xe-0/1/0. Please unplug fiber patch '70152' on port xe-0/1/0, and re-seat the SFP+-10G-LR module with serial GT3Y6245, then clean the fiber patch 70152 (both ends) and re-seat it into the router port xe-0/1/0 and the patch panel port.

Thank you!

@ayounsi / @cmooney:

Cathal returned the link to service, as remote hands didn't perform the work due to a mismatch between the cable ID in our records and reality. Photos sent by remote hands and reviewed by myself and Cathal show we had the wrong cable ID listed, so I've fixed the remote hands directions and resubmitted the work for Wednesday, 2026-03-18 @ 11:30AM Brazil (7:30AM Pacific) so I can be around to see results and provide immediate feedback (like replacing the patch and/or the optic).

Please re-drain this link Wednesday in advance of this work, thank you!

Please re-drain this link Wednesday in advance of this work, thank you!

Cool thanks Rob will do.

I created a meeting to not forget, and invited you both just in case.

Remote hands cleaned the patch cable and reseated the optic, and provided photos to show the work.

This is now returned to netops purview for monitoring. If more errors occur please update the task and I'll move forward with either swapping the patch cable (cheapest, first option) or the sfp module.
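
The order of physical-layer steps described above can be sketched as a simple ordered list, cheapest first. The step names and the `next_step` helper are illustrative, and the final provider-escalation step is an assumption about what follows once local options are exhausted.

```python
from typing import Optional

# Ordered from cheapest and least intrusive to most involved.
ESCALATION_STEPS = [
    "clean patch cable ends and re-seat the optic",
    "swap the patch cable",
    "swap the SFP module",
    "open a case with the transit provider",  # assumed last resort
]


def next_step(completed: int) -> Optional[str]:
    """Return the next step given how many have already been tried."""
    if completed < 0:
        raise ValueError("completed must be non-negative")
    if completed < len(ESCALATION_STEPS):
        return ESCALATION_STEPS[completed]
    return None  # nothing left to try locally


# Cleaning and re-seating is done, so one step is complete.
print(next_step(1))  # swap the patch cable
```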

Support,

The link came back up after your cleaning and re-seating the optic and patch cable, but the errors have resumed after the circuit came back online.

As the next step, we would like to have you swap patch cable 70152 with a spare patch cable in our rack (should be on top of the servers) for this link at your earliest possible convenience. Please let us know the cable ID of this new patch; if it doesn't have one, please apply ID 260301.

The link is currently depooled (will show a link light but it is not serving traffic). When you replace the fiber patch cable, it should resume its link light.

Please also check in our racks and report back an inventory of how many spare fiber optic patch cables and lengths we have, along with the spare optics. These should be in our racks on top of the servers.

This work can take place at any time. Thank you in advance!

Comment generated in Smart Hands: Good afternoon,

We carried out the replacement of the fiber optic patch cable. A 10‑meter patch cable available in Rack B03 was used.
Evidence of the activity performed is attached.

Inventory of materials available in Rack B03:

07 units of fiber optic patch cables – 2 meters
02 units of MPO patch cables – 1 meter
02 units QFX‑SFP‑10GE‑LR
01 unit JNP‑SFP‑25G‑LR
01 unit QFX‑SFP‑1GE‑T

So the fiber was swapped, please re-enable the link to see if errors resume or stay gone. If they resume, we'll next swap the optic.

https://librenms.wikimedia.org/graphs/to=1773862200/id=31633/type=port_errors/from=1773775800/
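
The graph link above encodes its time window as Unix timestamps in the URL path. A minimal sketch for decoding it follows; the `graph_window` helper is hypothetical, and the `key=value` path layout is inferred only from this one link.

```python
from datetime import datetime, timezone


def graph_window(url: str):
    """Extract the (start, end) UTC datetimes from key=value path segments."""
    fields = dict(
        part.split("=", 1)
        for part in url.strip("/").split("/")
        if "=" in part
    )
    start = datetime.fromtimestamp(int(fields["from"]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(fields["to"]), tz=timezone.utc)
    return start, end


url = (
    "https://librenms.wikimedia.org/graphs/to=1773862200/id=31633/"
    "type=port_errors/from=1773775800/"
)
start, end = graph_window(url)
print(start.isoformat(), "to", end.isoformat())
# 2026-03-17T19:30:00+00:00 to 2026-03-18T19:30:00+00:00
```

The window covers the 24 hours ending shortly after the patch cable swap, so it shows the error counters on both sides of the work.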

Errors returned, Arzhel redrained the link, update sent to ticket:

Support,

Thank you for swapping out fiber 70152 with 260301, but it turns out that wasn't the issue. The errors have returned. Please now remove the SFP+-10G-LR module with serial GT3Y6245 from cr2-magru port xe-0/1/0 and replace it with the spare QFX‑SFP‑10GE‑LR. Please put the optic GT3Y6245 in an envelope marked T415743 so we can later determine if it was at fault.

Please put fiber patch 70152 back into our spares pile of patches in the rack since it seems to be fine.

Once the new optic is in place, let us know. You should see the link light on both before the swap and after the swap, but traffic has been drained from this link.

This work can take place at your earliest convenience.

The optic was swapped, but the errors resumed.

Arzhel got me set up with an EdgeUno portal account so I can view the two circuits, and I opened case 441863 for the transit circuit errors.

Previous tickets were resolved through troubleshooting that included fiber cleaning: https://edgeuno.cloud/tickets.php/view/484278000194772025#

Please note the ticket was opened, but their portal doesn't seem to email me, Arzhel, or Cathal even though I listed all three of us on the ticket. The only way to see ticket updates is to log in to the actual ticket view: https://edgeuno.cloud/tickets.php/view/484278000266995093

CURRENT TICKET https://edgeuno.cloud/tickets.php/view/484278000266995093

Summary:

  • EdgeUno says they see no errors on their side, only our flap.
  • Arzhel replied that we are still seeing errors, stressed that we've already swapped optics and fibers on our end, and asked again that they do the same on their side, as I did in the original request.
  • They replied about 45 minutes ago stating 'We will check our side as requested and let you know.'
  • The portal does NOT email myself, Arzhel, or Cathal even though I listed all three of us on the ticket.
  • The ticket has to be checked manually for updates: CURRENT TICKET https://edgeuno.cloud/tickets.php/view/484278000266995093
    • Manually checked for updates by Rob: 2026-03-19@16:43

Fixed on Friday; synced up in a meeting today and no more errors. Cathal is closing the ticket on the Lumen portal.