
Interface errors on cr1-codfw: xe-5/3/1
Closed, Resolved · Public · 0 Estimated Story Points

Description

See https://librenms.wikimedia.org/graphs/id=8288/type=port_errors

Most likely the optic needs to be replaced.

Please coordinate with me before doing any work on the interface so I can drain traffic away (disable BGP, etc.).

Event Timeline

ayounsi triaged this task as Medium priority. May 10 2019, 4:40 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-05-14T15:33:29Z] <XioNoX> deactivate bgp to telia on cr1-codfw - T222967

Mentioned in SAL (#wikimedia-operations) [2019-05-14T15:38:35Z] <XioNoX> re-activate bgp to telia on cr1-codfw - T222967

  • Email sent to Telia
  • Telia opened a case
  • Case information: Telia Carrier Case Reference 00980021

Follow-up email sent to Telia.

Dear Customer,

This is an IP TRANSIT Service.

As per below you will see that everything is working as it should and indeed there was some flapping.

@dls-b22> show interfaces descriptions | match IC-308846
xe-0/1/1 up up To Wikimedia Foundation, Inc;IP_TRANSIT;IC-308846;;
xe-0/1/1.0 up up To Wikimedia Foundation, Inc;IP_TRANSIT;IC-308846;;

{master}
@dls-b22> show interfaces xe-0/1/1 brief
Physical interface: xe-0/1/1, Enabled, Physical link is Up
Link-level type: Ethernet, MTU: 4484, MRU: 4492, LAN-PHY mode, Speed: 10Gbps, Loopback: None, Source filtering: Disabled,
Flow control: Enabled
Device flags : Present Running
Interface flags: SNMP-Traps Internal: 0x4000
Link flags : None
Link Degrade :
Link Monitoring : Disable

Logical interface xe-0/1/1.0
Flags: Up SNMP-Traps 0x4004000 Encapsulation: ENET2
inet 80.239.192.101/30
inet6 2001:2000:3080:af4::1/64
fe80::3e8a:b0ff:fe30:582a/64
multiservice

{master}
@dls-b22> show interfaces diagnostics optics xe-0/1/1 | match dbm
Laser output power : 0.6710 mW / -1.73 dBm
Laser receiver power : 0.5937 mW / -2.26 dBm
Laser output power high alarm threshold : 1.5840 mW / 2.00 dBm
Laser output power low alarm threshold : 0.1580 mW / -8.01 dBm
Laser output power high warning threshold : 1.2580 mW / 1.00 dBm
Laser output power low warning threshold : 0.1990 mW / -7.01 dBm
Laser rx power high alarm threshold : 1.7783 mW / 2.50 dBm
Laser rx power low alarm threshold : 0.0100 mW / -20.00 dBm
Laser rx power high warning threshold : 1.5849 mW / 2.00 dBm
Laser rx power low warning threshold : 0.0158 mW / -18.01 dBm

@dls-b22> show bgp summary | match 80.239.192.102
80.239.192.102 14907 326 144330 0 90 2:29:02 Establ

{master}
@dls-b22> show interfaces xe-0/1/1 | match flap
Last flapped : 2019-05-14 17:37:20 CEST (02:30:57 ago)

{master}
@dls-b22> show interfaces xe-0/1/1 | match error
Link-level type: Ethernet, MTU: 4484, MRU: 4492, LAN-PHY mode, Speed: 10Gbps, BPDU Error: None, Loop Detect PDU Error: None,
MAC-REWRITE Error: None, Loopback: None, Source filtering: Disabled, Flow control: Enabled
Bit errors 0
Errored blocks 23

Our estimate is that there was a Field Engineer on the site conducting Maintenance on nearby services.

Errors are not increasing at this moment.
We shall keep this ticket open for 24 hours and if you notice any discrepancies please let us know.
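
As a sanity check on the optics readings quoted above, the mW and dBm columns are related by dBm = 10 * log10(mW). The following is a minimal sketch, not anything taken from Telia's tooling: the function names `mw_to_dbm` and `within_warning` are hypothetical, and the numbers are copied from the `show interfaces diagnostics optics` output in the email.

```python
import math


def mw_to_dbm(milliwatts: float) -> float:
    """Convert optical power in milliwatts to dBm (decibels relative to 1 mW)."""
    return 10 * math.log10(milliwatts)


def within_warning(dbm: float, low_warning: float, high_warning: float) -> bool:
    """Return True when a reading sits between the low and high warning thresholds."""
    return low_warning <= dbm <= high_warning


# Values copied from the Telia-side output for xe-0/1/1.
tx_dbm = mw_to_dbm(0.6710)  # "Laser output power"
rx_dbm = mw_to_dbm(0.5937)  # "Laser receiver power"

# Warning thresholds (dBm) copied from the same output.
tx_ok = within_warning(tx_dbm, -7.01, 1.00)
rx_ok = within_warning(rx_dbm, -18.01, 2.00)

print(f"tx {tx_dbm:.2f} dBm, within warning thresholds: {tx_ok}")
print(f"rx {rx_dbm:.2f} dBm, within warning thresholds: {rx_ok}")
```

Both readings fall inside the warning thresholds, which is consistent with Telia's statement that light levels look normal on their side. It says nothing about the receive level on cr1-codfw, which is not shown in this task.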

Errors increased over the weekend and LibreNMS alerted on IRC; see https://librenms.wikimedia.org/graphs/to=NaN/id=8288/type=port_errors/from=1557831000/?

New reported stats are:

volans@re0.cr1-codfw> show interfaces xe-5/3/1 extensive | match error
  Link-level type: Ethernet, MTU: 4470, MRU: 4478, LAN-PHY mode, Speed: 10Gbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: None,
  Input errors:
    Errors: 873337, Drops: 0, Framing errors: 873337, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
    FIFO errors: 0, Resource errors: 0
  Output errors:
    Carrier transitions: 0, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
    Resource errors: 0
    Bit errors                           412
    Errored blocks                    498047
    CRC/Align errors                    873337                0
    FIFO errors                              0                0
    Total errors                        855005                0
    Output packet error count                                 0
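
For anyone comparing these counters between samples, the "Name: value" pairs on the input-errors line can be pulled out with a small regular expression. This is a rough sketch only: `parse_counters` and `nonzero` are hypothetical helper names, the regex is an assumption that fits the comma-separated line above and not the column-aligned MAC statistics further down, and the sample line is an abbreviated copy of the output in this task.

```python
import re

# Matches "Some name: 123" pairs; the name must start with a letter.
COUNTER_RE = re.compile(r"([A-Za-z][A-Za-z0-9 ]*?):\s*(\d+)")


def parse_counters(line: str) -> dict:
    """Extract 'name: integer' pairs from one line of interface output."""
    return {name: int(value) for name, value in COUNTER_RE.findall(line)}


def nonzero(counters: dict) -> dict:
    """Keep only the counters that are above zero."""
    return {name: value for name, value in counters.items() if value}


# Abbreviated copy of the input-errors line for xe-5/3/1.
sample = (
    "Errors: 873337, Drops: 0, Framing errors: 873337, Runts: 0, "
    "Policed discards: 0, L3 incompletes: 0"
)

counters = parse_counters(sample)
print(nonzero(counters))
```

On this sample the only non-zero input counters are the total errors and the framing errors, and they are equal, so all input errors at that point were framing errors.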
Volans raised the priority of this task from Medium to High. May 20 2019, 11:03 AM

Follow-up email sent to Telia.

Mentioned in SAL (#wikimedia-operations) [2019-05-21T15:10:43Z] <XioNoX> disable BGP to telia on cr1-codfw - T222967

Mentioned in SAL (#wikimedia-operations) [2019-05-21T15:32:17Z] <XioNoX> enable BGP to telia on cr1-codfw - T222967

Papaul cleaned the fiber, but the errors are still increasing. Followed up with Telia.

Mentioned in SAL (#wikimedia-operations) [2019-05-24T15:30:30Z] <XioNoX> disable bgp to telia on cr1-codfw for X-connect investigation - T222967

CY1 tested the X-connect and didn't find any problem; see https://phabricator.wikimedia.org/T224196
Sending another follow-up email to Telia.

Dear Customer,

Thank you for your email. We have started a case for your query. Telia Carrier case: 00984411.

We will investigate this case and get back to you.

We appreciate your patience and apologies for any inconvenience caused.

Emailed Telia again with another follow-up.

Dear Customer,

Our transmission 2nd line team are still investigating this. We will inform you as soon as the issue solved and sorry for any inconvenience caused.

Best Regards,

Mentioned in SAL (#wikimedia-operations) [2019-06-18T12:23:04Z] <XioNoX> activate bgp to telia on cr1-codfw - T222967

The issue should be resolved as per our second line, and any further optimization on the circuit will be done through maintenance window and of course you will be notified, so please check and let us know if you still see any issue.

I'll reopen the task if the errors show up again.