
Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4
Closed, Resolved (Public)

Description

Creating this task for visibility on the netops site. We have had errors on the et-0/0/0 40G interface on cr3-ulsfo, which connects to cr4-ulsfo et-0/0/0, for the past while:

image.png (771×1 px, 276 KB)

DC-Ops are working on ordering replacement optics to swap in. After discussion, and based on price, we opted to order 3x 100G-CWDM4 modules, which cost around the same as replacement 40G-LX4 optics but increase the bandwidth between the routers to 100G.

When they arrive I'll work with DC-Ops / remote hands to swap out the modules and hopefully clear up the issue. Thankfully it hasn't degraded since it was first detected, so it's probably not as urgent as initially feared.

Event Timeline

cmooney triaged this task as Medium priority.
cmooney added a subtask: Unknown Object (Task). Jan 21 2025, 11:15 AM
cmooney updated the task description.

@cmooney,

I'm updating the order task, but this was delivered in December so I can open a remote hands ticket to get it fixed. Do we need to schedule the window for this work, or should they just pull and swap at any time?

I'm assuming we need to schedule it, and we should give them a couple of days' notice if we want a set schedule/maint window. Can we set it for next Monday?

RobH closed subtask Unknown Object (Task) as Resolved. Jan 21 2025, 2:46 PM

Thanks Rob! Yeah we should probably schedule it and I can gracefully shift traffic off the link in advance. I can work with the remote hands (hopefully something interactive like a call or chat session, but email will work if there is nothing else).

Mondays are usually fairly busy for me with meetings, so maybe let's say next Tuesday the 28th? Morning-time locally, if possible, so it's not too late for me. Thanks.

Picked this back up. It had been neglected because it wasn't assigned to me and didn't have the ops-ulsfo tag; I should have noticed that sooner (back when I ordered the optics in December!). I've fixed that now and will file a support ticket with the schedule of work: Tuesday, 2025-02-04 @ 0800 Pacific (1600 GMT).

I've linked a draft of the scope of work to Cathal to review later today, and will file the task directly after review.

Remote hands 01020815 scheduled for 2025-02-04 @ 0800 Pacific (1600 GMT).
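As a quick sanity check on the window time, the Pacific-to-GMT conversion can be reproduced with a few lines of Python. This is only an illustrative sketch: it assumes a fixed UTC-8 offset (Pacific Standard Time, which applies in early February), and the names `PST`, `to_utc` and `window_start` are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Standard Time (UTC-8) is in effect on 2025-02-04,
# so a fixed offset is used instead of a full timezone database lookup.
PST = timezone(timedelta(hours=-8), name="PST")


def to_utc(local: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    return local.astimezone(timezone.utc)


# Maintenance window start as stated in the task: 0800 Pacific.
window_start = datetime(2025, 2, 4, 8, 0, tzinfo=PST)
print(to_utc(window_start).isoformat())  # 2025-02-04T16:00:00+00:00
```

For dates where daylight saving time might apply, a real timezone database (for example `zoneinfo.ZoneInfo("America/Los_Angeles")`) would be the safer choice.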

Icinga downtime and Alertmanager silence (ID=a50b2671-d855-40a0-8790-c502280b9115) set by cmooney@cumin1002 for 1:00:00 on 6 host(s) and their services with reason: replace faulty optic et-0/0/0

cr[3-4]-ulsfo,cr[3-4]-ulsfo IPv6,cr[3-4]-ulsfo.mgmt
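To see how the target list above corresponds to the "6 host(s)" in the downtime message, the bracket ranges can be expanded by hand. The sketch below is purely illustrative and is not how the downtime tooling itself expands host patterns; the helper `expand` is a hypothetical name and only handles a single numeric range per pattern.

```python
import re


def expand(pattern: str) -> list[str]:
    """Expand one numeric range such as 'cr[3-4]-ulsfo' into individual names."""
    m = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if m is None:
        return [pattern]  # no range present, nothing to expand
    lo, hi = int(m.group(1)), int(m.group(2))
    return [pattern[:m.start()] + str(n) + pattern[m.end():] for n in range(lo, hi + 1)]


# Target string copied from the downtime message.
targets = "cr[3-4]-ulsfo,cr[3-4]-ulsfo IPv6,cr[3-4]-ulsfo.mgmt"
hosts = [name for pattern in targets.split(",") for name in expand(pattern)]
print(len(hosts))  # 6
print(hosts)
```

Each of the three patterns yields two names (cr3 and cr4), which gives the six entries covered by the downtime.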

Mentioned in SAL (#wikimedia-operations) [2025-02-04T16:17:14Z] <topranks> disable et-0/0/0 on cr3-ulsfo to prep for optic replacement T384288

Happy to say the optics on either side were replaced without problem today. The link is now up at 100G and not showing any errors; I'll keep an eye on it.

cmooney@cr4-ulsfo> show interfaces et-0/0/0    
Physical interface: et-0/0/0, Enabled, Physical link is Up
  Interface index: 170, SNMP ifIndex: 542
  Description: Core: cr3-ulsfo:et-0/0/0 {#1073}
  Link-level type: Flexible-Ethernet, MTU: 9192, MRU: 9200, Speed: 100Gbps, BPDU Error: None, Loop Detect PDU Error: None, Loopback: Disabled, Source filtering: Disabled,
  Flow control: Disabled
  Pad to minimum frame size: Disabled
  Device flags   : Present Running
  Interface flags: SNMP-Traps Internal: 0x4000
  CoS queues     : 8 supported, 8 maximum usable queues
  Schedulers     : 0
  Current address: ec:38:73:75:3c:ac, Hardware address: ec:38:73:75:34:cb
  Last flapped   : 2025-02-04 16:30:09 UTC (00:11:05 ago)
  Input rate     : 1066955256 bps (126077 pps)
  Output rate    : 20680 bps (40 pps)
  Active alarms  : None
  Active defects : None
  PCS statistics                      Seconds
    Bit errors                             0
    Errored blocks                         0
  Ethernet FEC Mode  :                  FEC91
  Ethernet FEC statistics              Errors
    FEC Corrected Errors                    0
    FEC Uncorrected Errors                  0
    FEC Corrected Errors Rate               0
    FEC Uncorrected Errors Rate             0
cmooney@cr3-ulsfo> show interfaces et-0/0/0 
Physical interface: et-0/0/0, Enabled, Physical link is Up
  Interface index: 170, SNMP ifIndex: 593
  Description: Core: cr4-ulsfo:et-0/0/0 {#1073}
  Link-level type: Flexible-Ethernet, MTU: 9192, MRU: 9200, Speed: 100Gbps, BPDU Error: None, Loop Detect PDU Error: None, Ethernet-Switching Error: None, Loopback: Disabled,
  Source filtering: Disabled, Flow control: Disabled
  Pad to minimum frame size: Disabled
  Device flags   : Present Running
  Interface Specific flags: Internal: 0x100200
  Interface flags: SNMP-Traps Internal: 0x4000
  Link flags     : 0x800
  CoS queues     : 8 supported, 8 maximum usable queues
  Schedulers     : 0
  Current address: b0:eb:7f:82:f9:2a, Hardware address: b0:eb:7f:82:f1:49
  Last flapped   : 2025-02-04 16:30:09 UTC (00:16:57 ago)
  Input rate     : 13480 bps (16 pps)
  Output rate    : 986838776 bps (118290 pps)
  Active alarms  : None
  Active defects : None
  PCS statistics                      Seconds
    Bit errors                             0
    Errored blocks                         0
  Ethernet FEC Mode  :                  FEC91
    FEC Codeword size                     528
    FEC Codeword rate                   0.973
  Ethernet FEC statistics              Errors
    FEC Corrected Errors                    0
    FEC Uncorrected Errors                  0
    FEC Corrected Errors Rate               0
    FEC Uncorrected Errors Rate             0

Will check the error stats in a day or two and close if it looks clean.
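A follow-up check like this could be scripted against saved `show interfaces` output. The sketch below is an assumption-laden illustration, not the procedure actually used here: it parses a text excerpt copied from the cr3-ulsfo output above, and the names `SAMPLE`, `fec_counters` and `utilisation` are hypothetical. It only reads pasted text and does not talk to a router.

```python
import re

# Excerpt of the cr3-ulsfo output pasted above.
SAMPLE = """\
  Input rate     : 13480 bps (16 pps)
  Output rate    : 986838776 bps (118290 pps)
  Ethernet FEC statistics              Errors
    FEC Corrected Errors                    0
    FEC Uncorrected Errors                  0
    FEC Corrected Errors Rate               0
    FEC Uncorrected Errors Rate             0
"""

FEC_LINE = re.compile(r"\s*(FEC (?:Corrected|Uncorrected) Errors(?: Rate)?)\s+(\d+)\s*$")


def fec_counters(text: str) -> dict[str, int]:
    """Collect the FEC error counters from 'show interfaces' text."""
    counters = {}
    for line in text.splitlines():
        m = FEC_LINE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def utilisation(text: str, direction: str, speed_bps: float = 100e9) -> float:
    """Fraction of link speed in use; direction is 'Input' or 'Output'."""
    m = re.search(rf"{direction} rate\s*:\s*(\d+) bps", text)
    if m is None:
        raise ValueError(f"no {direction} rate line found")
    return int(m.group(1)) / speed_bps


counters = fec_counters(SAMPLE)
print("clean" if all(v == 0 for v in counters.values()) else "errors present")
print(f"output utilisation: {utilisation(SAMPLE, 'Output'):.2%}")
```

On the excerpt above this reports a clean link carrying roughly 1% of the 100G capacity in the output direction.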

This link has carried a reasonable amount of traffic since the move and is still error-free, so I am resolving this task.

image.png (552×1 px, 65 KB)