This has been flapping and erroring for a while, causing random network errors for various other services. I manually committed "set disable" on the interfaces on both sides to stabilize the situation for now. Netops should investigate and clean up when they're online (should be non-urgent given link redundancy).
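For the record, the manual "set disable" above corresponds roughly to the following Junos change on each router. The interface name is taken from the eqiad-side logs below; the codfw-side interface and the commit comment are illustrative, so treat this as a sketch rather than the exact session:

```
configure
set interfaces et-1/1/2 disable
commit comment "disable flapping Arelion transport link - T407578"
```

Re-enabling is the same with `delete interfaces et-1/1/2 disable`.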
Mentioned in SAL (#wikimedia-operations) [2025-10-16T20:56:55Z] <bblack> see also https://phabricator.wikimedia.org/T407578 for above port disables
Thanks Brandon, you did the right thing.
For now, for troubleshooting, I have set the Arelion circuit to 'drained' status in Netbox and re-run Homer against cr1-codfw and cr1-eqiad. That has brought the port back up, but means OSPF will not select the link for any traffic, so it won't affect anything else.
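For anyone repeating this, the drain roughly amounts to flipping the circuit status to 'drained' in Netbox and then pushing the regenerated config with Homer. The device selectors and commit message below are illustrative, not the exact invocations used:

```
# After setting the Arelion circuit status to 'drained' in Netbox:
homer "cr1-eqiad*" commit "Drain Arelion eqiad-codfw transport - T407578"
homer "cr1-codfw*" commit "Drain Arelion eqiad-codfw transport - T407578"
```

Undraining later is the reverse: set the circuit back to active in Netbox and re-run Homer against the same routers.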
Likely either a bad optic on our side or on the provider side in eqiad. I'll dig a little deeper and likely work with dc-ops to see if we can replace our optic (and/or raise ticket with the carrier).
Seems this started fairly suddenly yesterday afternoon:
The link is flapping hard, up/down constantly, which is probably what the error stats reflect, as opposed to frame CRC errors or similar.
Oct 16 15:15:07 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:15:10 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:15:10 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:15:11 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:15:36 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:15:37 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:15:37 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:15:51 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:15:51 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:15:55 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:15:55 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:15:56 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:15:56 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:16:40 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:17:06 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:17:08 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
Oct 16 15:17:25 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_DOWN: ifIndex 910, ifAdminStatus up(1), ifOperStatus down(2), ifName et-1/1/2
Oct 16 15:17:27 re0.cr1-eqiad mib2d[39234]: SNMP_TRAP_LINK_UP: ifIndex 910, ifAdminStatus up(1), ifOperStatus up(1), ifName et-1/1/2
This plays havoc with actual traffic as it constantly starts to be used, then traffic is disrupted, then it comes back etc.
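To distinguish a physical-layer problem (bad optic, dirty or damaged fibre) from framing issues, the usual checks on the Junos side are the interface error counters and the optic's light levels, along the lines of:

```
show interfaces et-1/1/2 extensive | match "error|alarm"
show interfaces diagnostics optics et-1/1/2
```

The second command shows the DOM readings (rx/tx power, bias current); severely low or absent receive power points at the fibre or the far-end optic rather than our own.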
I'll open a ticket with Arelion to get their input. It could still be the optic, but I'm thinking maybe it's their side, and at least we can try to get some info on that before anyone will be on site in Ashburn.
The link has been mostly stable since re-enabling it at 08:15 UTC; it flapped a few times immediately after but has been ok since.
Will wait and see what the carrier says of course, but perhaps it was an issue on their side that has been resolved.
So there was a known fault on the Arelion side and they had raised a ticket internally about it. See emails to noc@wikimedia.org with reference #INC2223086.
Issue was a fibre break that has been fixed now:
10/17/2025 9:00:08 AM Cause and Location of Outage: The service disruption was caused by a fiber cut between GREENVILLE and HARTWELL. Action Taken: The damaged fiber was successfully repaired, and all services have since been restored. Responsible Party: Third-party fiber provider Incident Timeline: Start Time: 17 October 2025, 04:05 UTC End Time: 17 October 2025, 07:53 UTC
10/17/2025 7:30:22 AM Splicing efforts are progressing toward completion of the permanent fiber repair. Our providers are still doing the splicing activity and is approximately 70% complete at the northern end of the span and 30% complete at the southern end.
10/17/2025 7:07:27 AM The fiber has been cut, and splice preparations are underway. Splicing is expected to commence within the hour, with a preliminary estimated time to restoral (ETR) of 09:00 GMT.
10/17/2025 4:49:37 AM Our fiber provider started conducting fiber splicing activities for unidirectional span loss degradation that caused multiple superchannels down in CHARLOTTE UNITED STATES
10/17/2025 4:40:24 AM We are observing an outage between GREENVILLE and HARTWELL UNITED STATES due to our fiber provider conducting fiber splicing activities
As the circuit has now been stable for a few hours and the carrier has confirmed root cause and fix I will undrain the circuit and observe how things look.
Mentioned in SAL (#wikimedia-operations) [2025-10-17T10:03:33Z] <topranks> un-draining Arelion 100G transport eqiad <-> codfw following carrier fibre fix and return to stability T407578
Gonna leave this a few days before closing; we've had a few flaps since it was restored. If it gets worse or doesn't settle down I'll query again with Arelion.
So this has bounced a few times since; however, it is relatively stable.
Given it's not causing us noticeable issues right now I'm gonna close this task, and revisit with Arelion if it degrades further.