
cr2-magru <-> asw1-b3-magru link down March 2026
Closed, ResolvedPublic

Description

The link between cr2-magru and asw1-b3-magru failed today out of the blue.

Logs on the CR are full of these going back:

cmooney@cr2-magru> show log messages | match "et-0/0/1"
Mar  4 07:30:01  cr2-magru fpc0 SMIC_PHY_DFE_TUNING: smic_phy_dfe_tuning_state: et-0/0/1 - DFE coarse/fine tuning completed (took 1996 ms); enabling DFE adaptive tuning.
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Cleared Ethernet MAC Local Fault Delta Event (et-0/0/1)
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Ethernet PCS Multilane Alignment Done Delta Event (et-0/0/1)
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Detected Ethernet MAC Local Fault Delta Event (et-0/0/1)
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Ethernet PCS Multilane Alignment Not Done Delta Event (et-0/0/1)
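To gauge how often the port is flapping, the MAC/PCS events in these messages can be tallied per type. The following is a minimal sketch, assuming the log text has been saved locally. The regular expression is inferred only from the sample lines above, and `count_events` is a hypothetical helper, not part of any Juniper tooling.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above.
LOG = """\
Mar  4 07:30:01  cr2-magru fpc0 SMIC_PHY_DFE_TUNING: smic_phy_dfe_tuning_state: et-0/0/1 - DFE coarse/fine tuning completed (took 1996 ms); enabling DFE adaptive tuning.
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Cleared Ethernet MAC Local Fault Delta Event (et-0/0/1)
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Ethernet PCS Multilane Alignment Done Delta Event (et-0/0/1)
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Detected Ethernet MAC Local Fault Delta Event (et-0/0/1)
Mar  4 07:30:01  cr2-magru fpc0 MQSS(0): CMACPCS1: Ethernet PCS Multilane Alignment Not Done Delta Event (et-0/0/1)
"""

# Pattern inferred from the sample lines only (an assumption, not a documented format).
EVENT_RE = re.compile(r"CMACPCS\d+: (?P<event>.+?) Delta Event \((?P<ifname>[\w/-]+)\)")


def count_events(text, ifname):
    """Count MAC/PCS delta events per event type for one interface."""
    counts = Counter()
    for line in text.splitlines():
        match = EVENT_RE.search(line)
        if match and match.group("ifname") == ifname:
            counts[match.group("event")] += 1
    return counts


counts = count_events(LOG, "et-0/0/1")
for event, n in sorted(counts.items()):
    print(f"{n:3d}  {event}")
```

A "Detected" count that keeps growing alongside an equal "Cleared" count would suggest the link is repeatedly coming up and dropping rather than staying hard down.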

Light levels look reasonable-ish, though a little low on lane 2 on the CR:

cmooney@cr2-magru> show interfaces diagnostics optics et-0/0/1 | except "warn|alarm"
Mar 04 08:31:02
Physical interface: et-0/0/1
    Module temperature                        :  33 degrees C / 91 degrees F
    Module voltage                            :  3.2290 V
  Lane 0
    Laser bias current                        :  7.599 mA
    Laser output power                        :  0.955 mW / -0.20 dBm
    Laser receiver power                      :  0.205 mW / -6.88 dBm
  Lane 1
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.815 mW / -0.89 dBm
    Laser receiver power                      :  0.442 mW / -3.55 dBm
  Lane 2
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.786 mW / -1.05 dBm
    Laser receiver power                      :  0.070 mW / -11.52 dBm
  Lane 3
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.910 mW / -0.41 dBm
    Laser receiver power                      :  0.296 mW / -5.28 dBm

The lane 2 receive level has been decreasing steadily for some time:

https://librenms.wikimedia.org/graphs/id=60583/type=sensor_dbm/from=1767256500/to=1772613300/
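To make the lane comparison explicit, the receive power per lane can be pulled out of the optics output and the weakest lane identified. This is a rough sketch that uses only the receiver lines from the cr2-magru output above. The helper names are hypothetical, and no pass/fail threshold is assumed because the ticket does not state one.

```python
import math
import re

# Receiver power lines copied from the cr2-magru et-0/0/1 output above.
OPTICS = """\
  Lane 0
    Laser receiver power                      :  0.205 mW / -6.88 dBm
  Lane 1
    Laser receiver power                      :  0.442 mW / -3.55 dBm
  Lane 2
    Laser receiver power                      :  0.070 mW / -11.52 dBm
  Lane 3
    Laser receiver power                      :  0.296 mW / -5.28 dBm
"""


def rx_dbm_by_lane(text):
    """Return {lane number: receive power in dBm} from optics output."""
    lanes = {}
    lane = None
    for line in text.splitlines():
        m = re.match(r"\s*Lane (\d+)", line)
        if m:
            lane = int(m.group(1))
            continue
        m = re.search(r"Laser receiver power\s*:\s*[\d.]+ mW / (-?[\d.]+) dBm", line)
        if m and lane is not None:
            lanes[lane] = float(m.group(1))
    return lanes


def mw_to_dbm(mw):
    """Convert milliwatts to dBm: dBm = 10 * log10(mW)."""
    return 10 * math.log10(mw)


rx = rx_dbm_by_lane(OPTICS)
weakest = min(rx, key=rx.get)
spread = max(rx.values()) - min(rx.values())
print(f"weakest lane: {weakest} at {rx[weakest]} dBm, spread across lanes: {spread:.2f} dB")
```

Converting the displayed 0.070 mW with `mw_to_dbm` gives a value within a few hundredths of a dB of the displayed -11.52 dBm; the small difference comes from the mW figure being rounded to three decimals.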

The asw side looks healthy:

cmooney@asw1-b3-magru> show interfaces diagnostics optics et-0/0/50 | except "warn|alarm"
Physical interface: et-0/0/50
    Module temperature                        :  34 degrees C / 93 degrees F
    Module voltage                            :  3.2560 V
  Lane 0
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.785 mW / -1.05 dBm
    Laser receiver power                      :  0.926 mW / -0.33 dBm
  Lane 1
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.780 mW / -1.08 dBm
    Laser receiver power                      :  0.871 mW / -0.60 dBm
  Lane 2
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.806 mW / -0.94 dBm
    Laser receiver power                      :  0.766 mW / -1.16 dBm
  Lane 3
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.844 mW / -0.73 dBm
    Laser receiver power                      :  0.868 mW / -0.61 dBm

cr2 is reporting the fault is local:

cmooney@cr2-magru> show interfaces et-0/0/1
Mar 04 08:32:09
Physical interface: et-0/0/1, Enabled, Physical link is Down
  Interface index: 151, SNMP ifIndex: 579
  Description: Core: asw1-b3-magru:et-0/0/50 {#70130}
  Link-level type: Ethernet, MTU: 9192, MRU: 9200, Speed: 100Gbps, BPDU Error: None, Loop Detect PDU Error: None, Ethernet-Switching Error: None, Loopback: Disabled,
  Source filtering: Disabled, Flow control: Enabled
  Pad to minimum frame size: Disabled
  Device flags   : Present Running Down
  Interface Specific flags: Internal: 0x100200
  Interface flags: Hardware-Down SNMP-Traps Internal: 0x4000
  Link flags     : None
  CoS queues     : 8 supported, 8 maximum usable queues
  Schedulers     : 0
  Current address: b4:f9:5d:30:e1:43, Hardware address: b4:f9:5d:30:e1:43
  Last flapped   : 2026-03-04 08:26:23 UTC (00:05:46 ago)
  Input rate     : 0 bps (0 pps)
  Output rate    : 0 bps (0 pps)
  Active alarms  : LINK
  Active defects : LINK, LOCAL-FAULT

I think we should try replacing the optic here, probably starting with the cr2 side. A reseat and checking/cleaning the fibres is another option, and the issue could equally be the module in the asw.
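One way to narrow down which direction is affected is to compare what each side reports transmitting with what the far end reports receiving. The sketch below uses the dBm readings from the two optics outputs above. It assumes lane N on one module is received as lane N on the other, which depends on the cabling polarity and is not confirmed in this ticket, and it treats the modules' self-reported values as accurate, which they may not be.

```python
# TX and RX readings in dBm for lanes 0-3, copied from the optics outputs above.
CR2_TX = [-0.20, -0.89, -1.05, -0.41]
CR2_RX = [-6.88, -3.55, -11.52, -5.28]
ASW_TX = [-1.05, -1.08, -0.94, -0.73]
ASW_RX = [-0.33, -0.60, -1.16, -0.61]


def lane_loss(tx, rx):
    """Apparent loss in dB per lane: reported TX minus far-end reported RX.

    Assumes same-numbered lanes are connected (an assumption, see above).
    """
    return [round(t - r, 2) for t, r in zip(tx, rx)]


asw_to_cr2 = lane_loss(ASW_TX, CR2_RX)
cr2_to_asw = lane_loss(CR2_TX, ASW_RX)

print("asw -> cr2 apparent loss (dB):", asw_to_cr2)
print("cr2 -> asw apparent loss (dB):", cr2_to_asw)
```

With these readings the cr2 to asw direction shows almost no apparent loss on any lane, while the asw to cr2 direction shows several dB on every lane and over 10 dB on lane 2. That pattern points at something in the asw transmit, fibre, or cr2 receive path, but it cannot by itself say which of those three is at fault.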

Related Objects

Status      Subtype   Assigned   Task
Resolved              RobH

Event Timeline

cmooney triaged this task as High priority.

Rob, could you prioritize this? Thanks.

CS1253254 filed, listed myself, Arzhel, Cathal, and Papaul on the CC list.

Account: WIKIMEDIA
Contact: Robert McMahon Halsell
Data Center: São Paulo 3
Rack: B3/B4
Category: Cabling
Short description: Troubleshooting a down fiber link between our router and switch
Description: Support,

We've recently had one of our router-to-switch links go down, and would like to have some troubleshooting steps completed by remote hands. This link was online and has degraded to offline with errors, so work can begin immediately (no need to schedule it with us).

The link in question is cable ID 70130, which runs between our racks B3:U37::asw1-b3-magru:et-0/0/50 and B4:U43:cr2-magru:et-0/0/1

Please locate a spare 40G QSFP in our racks and swap that optic with the optic located in RackB4:U43 cr2-magru:et-0/0/1, optic serial GT3AAG00321. Once swapped, please put the optic serial GT3AAG00321 into an envelope and label it T418978. Once this is done, please let us know so we can check the link.

Please do not close this ticket until we confirm the link is back online. If this optic swap doesn't solve it, we'll follow up with more steps.

Thanks in advance,
Watch List: Arzhel Younsi, Cathal Mooney, Papaul Tshibamba

OK, they swapped the optic in cr2-magru, but the link still shows down:

et-0/0/1 up down Core: asw1-b3-magru:et-0/0/50 {#70130}

The optic did change, the old serial and new serial are different:

Old output:

FPC 0                     BUILTIN      BUILTIN           MPC
  PIC 0                   BUILTIN      BUILTIN           4XQSFP28 PIC
    Xcvr 0       REV 01   740-061405   GT3AAG00319       QSFP-100GBASE-SR4
    Xcvr 1       REV 01   740-061405   GT3AAG00321       QSFP-100GBASE-SR4
    Xcvr 2       REV 01   740-061405   GT3AAG00320       QSFP-100GBASE-SR4

New output:

FPC 0                     BUILTIN      BUILTIN           MPC
  PIC 0                   BUILTIN      BUILTIN           4XQSFP28 PIC
    Xcvr 0       REV 01   740-061405   GT3AAG00319       QSFP-100GBASE-SR4
    Xcvr 1       REV 01   740-061405   GT3AAG00315       QSFP-100GBASE-SR4
    Xcvr 2       REV 01   740-061405   GT3AAG00320       QSFP-100GBASE-SR4

As this still shows down, the next step is likely the swap of the fiber patch.
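The before/after serial check can also be done mechanically rather than by eye. A small sketch follows, using the two inventory excerpts above. The parsing pattern is inferred from those lines only, and `xcvr_serials` and `changed_slots` are hypothetical helper names.

```python
import re

# Inventory excerpts copied from the old and new outputs above.
OLD = """\
    Xcvr 0       REV 01   740-061405   GT3AAG00319       QSFP-100GBASE-SR4
    Xcvr 1       REV 01   740-061405   GT3AAG00321       QSFP-100GBASE-SR4
    Xcvr 2       REV 01   740-061405   GT3AAG00320       QSFP-100GBASE-SR4
"""
NEW = """\
    Xcvr 0       REV 01   740-061405   GT3AAG00319       QSFP-100GBASE-SR4
    Xcvr 1       REV 01   740-061405   GT3AAG00315       QSFP-100GBASE-SR4
    Xcvr 2       REV 01   740-061405   GT3AAG00320       QSFP-100GBASE-SR4
"""


def xcvr_serials(text):
    """Return {transceiver slot: serial number} from inventory lines."""
    serials = {}
    for line in text.splitlines():
        # Columns: item, version, part number, serial number, description.
        m = re.match(r"\s*Xcvr (\d+)\s+REV \d+\s+\S+\s+(\S+)", line)
        if m:
            serials[int(m.group(1))] = m.group(2)
    return serials


def changed_slots(old, new):
    """Return {slot: (old serial, new serial)} for slots whose serial differs."""
    slots = sorted(old.keys() | new.keys())
    return {s: (old.get(s), new.get(s)) for s in slots if old.get(s) != new.get(s)}


changes = changed_slots(xcvr_serials(OLD), xcvr_serials(NEW))
for slot, (before, after) in changes.items():
    print(f"Xcvr {slot}: {before} -> {after}")
```

The same comparison against a saved inventory is a quick way to confirm whether remote hands actually swapped a given module.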

Support,

Thank you, we can see the old module QSFP-100GBASE-SR4 SN GT3AAG00321 was removed and replaced with QSFP-100GBASE-SR4 module GT3AAG00315. However, the link is still showing down for us.

Acknowledged that all optics are actually QSFP-100GBASE-SR4, my initial listing of 40G was incorrect.

Thank you for the photos, as they show the new optic did not resolve the issue and the link is still offline (red) not online (green).
Please re-seat both sides of the patch cable and check for link light. If the link light doesn't go green, please source a new fiber patch from our rack spares, note its label (and report it back to us) and then replace patch cable ID 70130 with a new patch.

They've now replaced the patch cable but we're still seeing the link down:

Comment generated in Smart Hands: Dear all, good evening.

As requested, the MPO panel for ID 70130 was replaced.

However, the ports are intermittent at both ends;

The ticket is awaiting further instructions;

Photos are attached as evidence.

Best Regards
Anderson Teixeira, Smart Hands Analyst

At this point I'm not sure swapping the optic on the switch side is advised, since the link is intermittent. I'd like someone from netops to take a look and advise whether we should proceed to swapping the optic in the switch, given we've now swapped both the optic on the router and the patch cable. Or am I missing something?

Looks like changing the module on the switch side fixed the issue.

sw1-b3-magru> show interfaces et-0/0/50 descriptions 
Interface       Admin Link Description
et-0/0/50       up    up   Core: cr2-magru:et-0/0/1 {#70130}

Showing as down right now on both sides; lane 2 RX is still poor on cr2-magru:

cmooney@cr2-magru> show interfaces diagnostics optics et-0/0/1 | except "warn|alarm"
Physical interface: et-0/0/1
    Module temperature                        :  35 degrees C / 94 degrees F
    Module voltage                            :  3.2330 V
  Lane 0
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.662 mW / -1.79 dBm
    Laser receiver power                      :  0.196 mW / -7.07 dBm
  Lane 1
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.720 mW / -1.42 dBm
    Laser receiver power                      :  0.361 mW / -4.42 dBm
  Lane 2
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.768 mW / -1.15 dBm
    Laser receiver power                      :  0.055 mW / -12.56 dBm
  Lane 3
    Laser bias current                        :  7.199 mA
    Laser output power                        :  0.735 mW / -1.33 dBm
    Laser receiver power                      :  0.236 mW / -6.26 dBm

Was the module definitely swapped on the switch side? I see the same serial number as the latest saved config in Rancid, which was from last week:

cmooney@asw1-b3-magru> show chassis hardware | match "Xcvr 50|^Item"
Item             Version  Part number  Serial number     Description
    Xcvr 50      REV 01   740-061405   GT3AAG00314       QSFP-100GBASE-SR4

At this point I'm not sure swapping the optic on the switch side is advised since it is intermittent.

If we have not swapped the optic on the switch side, that is 100% the next step to take. Given the RX level has been decreasing steadily for some time, we are probably now in the "grey area" between operational and completely down, which would explain why the link sometimes shows up. However, these things only go in one direction. The router module clearly wasn't the problem, so the switch module is the one to look at.

Link is down again: https://alerts.wikimedia.org/?q=scope%3Dnetwork&q=site%3Dmagru&q=%40state%3Dsuppressed

Thanks Rob for leading on this. It's no big deal if it's taking a bit longer, as it's also a great learning opportunity! Ascenty staff seem to be quite good and responsive too.

Support,

You have swapped the optic on the router side, and the MPO patch cable. The link is still down, so we'd like you to swap the optic on the switch side. The switch is located in B3:U37::asw1-b3-magru port et-0/0/50. Please remove the optic serial GT3AAG00314 in asw1-b3-magru port et-0/0/50 and swap it with another 100G optic spare from our rack.

Please place the optic serial GT3AAG00314 in an envelope marked T119524-switch and set aside in our racks until we determine what caused the link failure.

They swapped the optic GT3AAG00314 out of the switch for optic GT3AAG00316 and now the link shows up:

router:

et-0/0/1        up    up   Core: asw1-b3-magru:et-0/0/50 {#70130}

switch:

et-0/0/50       up    up   Core: cr2-magru:et-0/0/1 {#70130}

Let's leave this open for 24h to ensure it is actually resolved; if so, I'll look into the return/replacement of the bad optics.

Sent an email to investigate the return/repair of the two optics.