
ulsfo <-> codfw transit link flapping causing nginx availability alerts
Closed, Resolved (Public)

Description

From https://librenms.wikimedia.org/eventlog:

2019-03-29 07:46:53	cr1-codfw	xe-5/0/2.0	ifOperStatus: lowerLayerDown -> up	 
2019-03-29 07:46:53	cr1-codfw	xe-5/0/2	ifOperStatus: down -> up	 

2019-03-29 07:42:01	cr1-codfw	xe-5/0/2.0	ifOperStatus: up -> lowerLayerDown	 
2019-03-29 07:42:01	cr1-codfw	xe-5/0/2	ifOperStatus: up -> down	 

2019-03-29 07:31:53	cr1-codfw	xe-5/0/2.0	ifOperStatus: lowerLayerDown -> up	 
2019-03-29 07:31:53	cr1-codfw	xe-5/0/2	ifOperStatus: down -> up

2019-03-29 07:27:00	cr1-codfw	xe-5/0/2.0	ifOperStatus: up -> lowerLayerDown	 
2019-03-29 07:26:59	cr1-codfw	xe-5/0/2	ifOperStatus: up -> down

[..]
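
To gauge how long each flap lasted, the entries for the physical interface can be paired up. This is a minimal sketch, not an existing tool: it assumes the event log is copied as tab-separated text, and the `outages` helper and `EVENTLOG` sample are hypothetical names used only for illustration.

```python
from datetime import datetime

# Sample copied from the event log excerpt above (newest first, tab-separated).
EVENTLOG = """\
2019-03-29 07:46:53\tcr1-codfw\txe-5/0/2\tifOperStatus: down -> up
2019-03-29 07:42:01\tcr1-codfw\txe-5/0/2\tifOperStatus: up -> down
2019-03-29 07:31:53\tcr1-codfw\txe-5/0/2\tifOperStatus: down -> up
2019-03-29 07:26:59\tcr1-codfw\txe-5/0/2\tifOperStatus: up -> down
"""


def outages(log, interface):
    """Return (down_time, up_time, seconds) tuples for one interface."""
    events = []
    for line in log.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) < 4 or fields[2] != interface:
            continue
        when = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        events.append((when, fields[3]))
    events.sort()  # oldest first, so each "down" precedes its "up"

    result, went_down = [], None
    for when, message in events:
        if message.endswith("-> down"):
            went_down = when
        elif message.endswith("-> up") and went_down is not None:
            seconds = int((when - went_down).total_seconds())
            result.append((went_down, when, seconds))
            went_down = None
    return result


for down, up, seconds in outages(EVENTLOG, "xe-5/0/2"):
    print(f"{down:%H:%M:%S} -> {up:%H:%M:%S}: down for {seconds}s")
```

For the two flaps listed above this reports outages of 294 and 292 seconds, i.e. just under five minutes each.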

Nginx availability alerts for ulsfo triggered at the same time. xe-5/0/2 on cr1-codfw shows interface errors. Following https://wikitech.wikimedia.org/wiki/Network_monitoring#Inbound/outbound_interface_errors:

elukey@re0.cr1-codfw> show interfaces xe-5/0/2 extensive | match error
  Link-level type: Ethernet, MTU: 1514, MRU: 1522, LAN-PHY mode, Speed: 10Gbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: None, Source filtering: Disabled, Flow control: Enabled
  Input errors:
    Errors: 216, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 216, L2 channel errors: 0, L2 mismatch timeouts: 0, FIFO errors: 0, Resource errors: 0
  Output errors:
    Carrier transitions: 1379, Errors: 0, Drops: 400386124, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 1, Resource errors: 0
    Bit errors                             0
    Errored blocks                       683
    CRC/Align errors                         0                0
    FIFO errors                              0                0
    Total errors                             0                0
    Output packet error count                                 0
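
The non-zero counters are easy to miss in that output. As a rough sketch, the "Output errors" line can be split into name/value pairs and filtered. `parse_counters` is a hypothetical helper and the regular expression only assumes the `Name: value` comma-separated layout shown above. Note that these counters are cumulative since they were last cleared, so a single snapshot does not show when the events happened.

```python
import re

# The "Output errors" line copied from the CLI output above.
OUTPUT_ERRORS = (
    "Carrier transitions: 1379, Errors: 0, Drops: 400386124, Collisions: 0, "
    "Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 1, "
    "Resource errors: 0"
)


def parse_counters(line):
    """Turn 'Name: 123, Other name: 0' into {'Name': 123, 'Other name': 0}."""
    pairs = re.findall(r"([^,:]+):\s*(\d+)", line)
    return {name.strip(): int(value) for name, value in pairs}


counters = parse_counters(OUTPUT_ERRORS)
nonzero = {name: value for name, value in counters.items() if value}
for name, value in nonzero.items():
    print(f"{name}: {value}")
```

Here the non-zero entries are Carrier transitions, Drops and MTU errors. The high carrier-transition count is consistent with the up/down events in the event log.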

Zayo sent an email to noc@ saying that maintenance is currently in progress for the link: ZAYO TTN-0003155831 Emergency MAINTENANCE NOTIFICATION

Event Timeline

elukey triaged this task as High priority. Mar 29 2019, 8:27 AM
elukey created this task.

Change 499987 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Depool ulsfo

https://gerrit.wikimedia.org/r/499987

Change 499987 merged by Filippo Giunchedi:
[operations/dns@master] Depool ulsfo

https://gerrit.wikimedia.org/r/499987

Change 500031 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Revert "Depool ulsfo"

https://gerrit.wikimedia.org/r/500031

Mentioned in SAL (#wikimedia-operations) [2019-03-29T15:48:43Z] <XioNoX> bump ulsfo-codfw ospf link cost to 1000 - T219591

Change 500031 merged by Ayounsi:
[operations/dns@master] Revert "Depool ulsfo"

https://gerrit.wikimedia.org/r/500031

Mentioned in SAL (#wikimedia-operations) [2019-04-01T18:58:51Z] <XioNoX> re-set ulsfo-codfw ospf cost to previous default - T219591

Link has been up for 1+ day. Got a notification saying the emergency maintenance was done.