
Link down between cr3-ulsfo and cr4-ulsfo
Closed, Resolved · Public

Assigned To
Authored By
cmooney
Apr 1 2025, 1:54 PM

Description

Bit of a strange one: earlier today we upgraded JunOS on cr4-ulsfo, which went smoothly. Or almost smoothly.

On reboot, the 100G link we recently installed between it and cr3-ulsfo wouldn't re-establish (see T384288). I'm kind of surprised tbh, as the optics on both sides were newly installed in January. It's even more unusual that a reboot/software upgrade would result in this.

We tried bouncing the port and checking the config, but can't see any reason why it would be down. FEC errors are incrementing rapidly on the cr4 side for some reason. The only thing I can think of is perhaps a defective QSFP28 module that cooled down a bit while the work was being done and somehow broke? The interface was not showing errors prior to the change, so it's really a head-scratcher.
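
For reference, the counters in question can be polled from the Junos CLI with something like the command below (a sketch only; real output from this box appears further down the task):

cmooney@cr4-ulsfo> show interfaces et-0/0/0 media | match FEC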

Anyway, I think the only step at this stage is to replace the CWDM4 optic in cr4-ulsfo port 0. I think we have a spare available from the recent order.

@RobH can you arrange for this when you get a moment? Thanks.

Event Timeline

cmooney triaged this task as High priority.
Restricted Application added a subscriber: Aklapper.

Chatted with @cmooney via IRC. Since there are 2 cross-router links, this has been entered as non-expedited planned remote hands work (with the less expensive hourly rate).

Case 01043199

Support,

We recently rolled some OS upgrades to our routers, and during that one of the optics on our cross-router links failed. There were no physical changes made; we're assuming the bouncing of all the ports via the software upgrade caused the failing optic to finally die.

We would like support to swap out the CWDM4 optic in cr4-ulsfo port 0. It has a yellow singlemode fiber plugged into it labeled '1073'. There should be a spare QSFP-100G-CWDM4 optic in the storage bins on the bottom front of our racks (there are 4 storage bins, all unlocked).

Please put the old optic back in the storage bin with a label of this ticket #. We're not certain if it is the optic or the port, so we want to keep it.

The routers are still linked with their redundant connection on port 1, so this work on port 0 can take place immediately.

Mentioned in SAL (#wikimedia-operations) [2025-04-01T15:45:06Z] <topranks> removing et-0/0/0 from ae0 bundle on cr3-ulsfo and cr4-ulsfo T390731
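
For anyone following along, pulling a member out of the LAG on these routers is roughly the following (a sketch of the kind of change involved, not the exact commands run):

[edit]
cmooney@cr4-ulsfo# delete interfaces et-0/0/0 gigether-options 802.3ad
cmooney@cr4-ulsfo# commit

Re-adding it later is the reverse, e.g. set interfaces et-0/0/0 gigether-options 802.3ad ae0.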

cmooney lowered the priority of this task from High to Low. Apr 1 2025, 5:27 PM

Looks like remote hands replaced the module.

cmooney@cr4-ulsfo> show log messages | match qsfp 
Apr  1 17:10:19  cr4-ulsfo fpc0 qsfp-0/0/0 set to not present
Apr  1 17:10:38  cr4-ulsfo fpc0 qsfp-0/0/0 plugged in

Since then everything looks good, the link is up and the FEC errors have stopped:

cmooney@cr4-ulsfo> show interfaces et-0/0/0 | match "^Phy|FEC" 
Physical interface: et-0/0/0, Enabled, Physical link is Up
  Active defects : None
  Ethernet FEC Mode  :                  FEC91
    FEC Codeword size                     528
    FEC Codeword rate                   0.973
  Ethernet FEC statistics              Errors
    FEC Corrected Errors                    0
    FEC Uncorrected Errors                  0
    FEC Corrected Errors Rate               0
    FEC Uncorrected Errors Rate             0

I re-added it to the ae0 bundle and traffic is flowing again:

image.png (468×957 px, 36 KB)

Thanks for the quick action here @RobH, I think we can close this one unless there is anything else.

ayounsi raised the priority of this task from Low to High. Apr 7 2025, 1:33 PM

Bumping the priority back up on this one as the interface keeps flapping.

Screenshot From 2025-04-07 15-32-05.png (694×1 px, 74 KB)

Screenshot From 2025-04-07 15-31-42.png (694×1 px, 69 KB)

@RobH can you have remote hands replace the fiber and optic on the cr3 side asap ?

Case 01045114 opened; just swapped the info about a bit:

Support, We recently rolled some OS upgrades to our routers, and during that one of the optics on our cross-router links failed.
There were no physical changes made; we're assuming the bouncing of all the ports via the software upgrade caused the failing optic to finally die.
You've already swapped the optic on the cr4 side of this link.
We would like support to swap out the CWDM4 optic in cr3-ulsfo port 0, located in rack 22U42, and its fiber patch. It has a yellow singlemode fiber plugged into it labeled '1073'. There should be a spare QSFP-100G-CWDM4 optic in the storage bins on the bottom front of our racks (there are 4 storage bins, all unlocked), and spare fibers in the same bins. Please swap out the optic AND the fiber patch with a new fiber patch from spares. Please label the new fiber patch 1073b.

Please put the old optic back in the storage bin with a label of this ticket #. We're not certain if it is the optic or the port, so we want to keep it.

The routers are still linked with their redundant connection, so this work on port 0 can take place immediately. Once swapped, please update the ticket and text me that it has been done, <# redacted from phab task>. Thanks!

It seems the work yesterday has not stopped the carrier transitions reported, although the number has decreased:

image.png (376×1 px, 65 KB)

image.png (376×1 px, 60 KB)

At this point I'm somewhat unsure what to do. Is there something about the port here that's making the optics fail? So far the timeline is:

  • We started seeing small numbers of errors on the link around Nov/December time
  • The link was operational though and we kept it in service
  • We replaced those 40GBase-LX4 optics with 100GBase-CWDM4 in January
  • This cleared all the errors
  • In March we noticed a small number of errors again
  • Since then we've replaced both optics and the fibre patch
    • These have cleared the errors for a short time (days) or reduced them, but not made them disappear

I'm really scratching my head as to what the problem is and what we can do now. The light levels on either side are good; we are not over the high warning level that might be burning out the RX side.
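
Those light levels come from the transceiver diagnostics, which Junos exposes with the standard command below; it reports per-lane TX/RX power alongside the module's alarm and warning thresholds:

cmooney@cr4-ulsfo> show interfaces diagnostics optics et-0/0/0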

@ayounsi I'm also wondering if we should keep traffic on the link? In fairness, we did have traffic on it for a long time after the first errors occurred and it didn't seem to cause operational impact. But we may be dropping packets, causing retransmits etc. without knowing.

One thing that strikes me: it would be simpler to manage here if we had two independent routed links in OSPF. We could raise the cost on this one so it carried no traffic, but we could still ping across it for tests, and it would remain available as a backup in case the other link went down.
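
A minimal sketch of what that would look like on Junos, assuming area 0.0.0.0 (the real change was later rolled out via homer, matching the metric-100 patch further down):

[edit]
cmooney@cr3-ulsfo# set protocols ospf area 0.0.0.0 interface et-0/0/0.0 metric 100
cmooney@cr3-ulsfo# commit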

For sure that's an odd one... Maybe we could try with a different port.

For OSPF, +1 to doing it for the troubleshooting window, depending on how long it takes us to fix it 100%. Or, I don't know if we can have them as active/passive in LACP.

RobH mentioned this in Unknown Object (Task). Apr 10 2025, 4:53 PM

The two new optics arrived for this, one spare and one to swap in.

In T390766#10730347, @RobH wrote:

@cmooney: So I've figured out what happened here:

  • The first swap event moved optic G2430452535 out of cr4-ulsfo and into spares.
  • I opened an RMA for the now-spare optic G2430452535.
  • The second swap event replaced the optic in cr3-ulsfo with the previously swapped-out G2430452535.

So now we have a different optic ready for sending back, since G2430452535 has been returned to service. Overall I think this is NOT an optic issue, and the optic I opened a return for is now back in use.

I don't think we need to return anything at this point, as we're getting odd results and things we thought were broken are now working.

Thoughts?

So we ordered a new optic and spare, and the defective optic is currently in cr3.

So G2430452535 was in cr4; it was swapped out with the spare, and then the cr3-cr4 link still had errors, so remote hands swapped G2430452535 into cr3 (non-ideal). Now, since we know G2430452535 originally had issues in cr4, at minimum I think we want to swap it out of cr3.

What's the best way to proceed? Since this is a redundant link, can I just enter a remote hands request for the optic swap on cr3 to remove G2430452535 from use? My understanding is we still have some errors on this link, and since G2430452535 is back in rotation, it could very well be the cause.

How should we proceed?

Change #1135998 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add new include statement for netbox-generated dns snippet

https://gerrit.wikimedia.org/r/1135998

Change #1135999 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] ulsfo: enable OSPF on separate link between CRs

https://gerrit.wikimedia.org/r/1135999

Change #1135998 merged by Cathal Mooney:

[operations/dns@master] Add new include statement for netbox-generated dns snippet

https://gerrit.wikimedia.org/r/1135998

Change #1135999 merged by jenkins-bot:

[operations/homer/public@master] ulsfo: enable OSPF on separate link between CRs

https://gerrit.wikimedia.org/r/1135999

Mentioned in SAL (#wikimedia-operations) [2025-04-11T18:45:03Z] <topranks> remove et-0/0/0 from ae0 LAG bundle on cr3-ulsfo and cr4-ulsfo T390731

What's the best way to proceed? Since this is a redundant link, can I just enter a remote hands request for the optic swap on cr3 to remove G2430452535 from use? My understanding is we still have some errors on this link, and since G2430452535 is back in rotation, it could very well be the cause.

Yeah, I've made it the backup link, so let's get G2430452535 removed from cr3 and replaced and see how it goes. Remote hands can go ahead and do it any time the traffic is on the other link.

Change #1140153 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] cr3/4-ulsfo: Set et-0/0/0.0 OSPF metric to 100

https://gerrit.wikimedia.org/r/1140153

Change #1140153 merged by jenkins-bot:

[operations/homer/public@master] cr3/4-ulsfo: Set et-0/0/0.0 OSPF metric to 100

https://gerrit.wikimedia.org/r/1140153

New remote hands entered to get this fixed: Case Order #01053614

Directions for remote hands to repair our link between cr3 and cr4:

Support,

We've previously been troubleshooting this link, but did not have enough spare optics to successfully swap both sides out without re-using a potentially defective optic.

The new order (01047768) has arrived and been placed in our rack; it includes 2 new QSFP-100G-CWDM4 optics. We would like the following completed at your earliest convenience. The link is currently up with a link light but has no traffic flowing across it, so this single link can be swapped.

  • Open shipment 01047768, which was placed in our cage, and remove one of the two new QSFP-100G-CWDM4 optics.
  • Remove optic G2430452535 from cr3-ulsfo:et-0/0/0 and replace it with a new optic. The patch cable for this should be labeled 1073, 1073A, or 1073b (most likely); please advise which cable ID is going into cr3-ulsfo:et-0/0/0, as it has been updated by previous remote hands.
  • If the link comes back up, update this ticket so we can check things on our side.

@cmooney:

"Created by: mmariscalmata The following has been completed:

Retrieve package #1047768
Remove optic G2430452535 out of cr3-ulsfo:et-0/0/0
Replace with a new optic.
The patch cable is labeled 1073A"

You can also see this message via the portal:

was

Xcvr 0       REV 01   740-061408   G2430452535       QSFP-100G-CWDM4

now

Xcvr 0       REV 01   740-061408   S2410424688       QSFP-100G-CWDM4

interfaces output:

et-0/0/0        up    up   Core: cr4-ulsfo:et-0/0/0 {#1073}

I've also updated the cable ID in netbox.

Since this showed online before, we can only really tell if it's OK after it's been returned to service without errors, so assigning back to you for that.

Thanks @RobH. It looks good so far; this is the graph we need to keep an eye on:

https://grafana.wikimedia.org/goto/SVEEkIbHR

If it seems good in a few days I'll revert the config to what we had before; for now it's still operating as a backup link.

Change #1140500 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adjust OSPF metric on cr3-ulsfo -> cr4-ulsfo 100G link

https://gerrit.wikimedia.org/r/1140500

Change #1140500 merged by jenkins-bot:

[operations/homer/public@master] Adjust OSPF metric on cr3-ulsfo -> cr4-ulsfo 100G link

https://gerrit.wikimedia.org/r/1140500

cmooney lowered the priority of this task from High to Medium. May 1 2025, 1:53 PM

Happy to say this is still looking clean.

image.png (499×1 px, 42 KB)

So whatever happened previously, one of the elements here was obviously causing a problem. Maybe we had a few faulty optics from a bad batch or something? Or cabling done incorrectly? Anyway, it seems to be operating OK now, so thankfully I don't believe we have faulty ports or anything wrong with the actual routers.

ayounsi reopened this task as Open. Edited May 7 2025, 6:59 AM

Unfortunately we're not out of the woods yet...

cr3-ulsfo> show interfaces et-0/0/0 media still shows lots of ongoing FEC errors:

PCS statistics                      Seconds
  Bit errors                             0
  Errored blocks                         0
Ethernet FEC Mode  :                  FEC91
  FEC Codeword size                     528
  FEC Codeword rate                   0.973
Ethernet FEC statistics              Errors
  FEC Corrected Errors               123176
  FEC Uncorrected Errors                  0
  FEC Corrected Errors Rate             146
  FEC Uncorrected Errors Rate             0

Next step is probably to use a different port on the router.

@RobH could you draft remote-hands instructions to connect et-0/0/2 on both cr3/4, similar to what's currently on et-0/0/0?
Then we can do the re-config remotely (a sketch of the kind of change is below).
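
The remote re-config could then be roughly the following (a sketch, assuming the et-0/0/0 stanza carries over to the new port unchanged):

[edit interfaces]
cmooney@cr3-ulsfo# copy et-0/0/0 to et-0/0/2
cmooney@cr3-ulsfo# deactivate et-0/0/0
cmooney@cr3-ulsfo# commit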


After chatting with Cathal, we decided to leave it as it is, since moving ports requires intrusive changes (a PIC reload) and there is currently no operational impact. If the situation degrades, we can revisit.

An alternative to trying a different port is to use a 100G DAC cable instead of optics+fibers: https://apps.juniper.net/home/mx204/hardware-compatibility, if that's OK for DCops of course.

FWIW, I saw an interesting talk from the latest NANOG conference about "return loss" on shorter and faster links, which can cause problems. It might well be what we were/are seeing on this link:

https://www.youtube.com/watch?v=-UaVhy_5dLQ