Instability of the Level3 link between cr2-eqiad and cr2-esams
Closed, Resolved · Public · 0 Story Points

Description

The Level3 link between cr2-eqiad and cr2-esams has flapped a lot recently, as @BBlack noticed in P8785.

As https://librenms.wikimedia.org/device/device=66/tab=port/port=16577/ shows, traffic is currently going through knams, and the link seems to have recovered and then failed again multiple times.

@faidon sent an email to CenturyLink but has received no response so far.

Event Timeline

elukey triaged this task as High priority. Jul 24 2019, 6:25 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Jul 24 2019, 6:25 AM
Peachey88 updated the task description. Jul 24 2019, 8:21 AM
CDanis added a subscriber: CDanis. Jul 24 2019, 1:22 PM

Level3/CenturyLink opened a ticket for that circuit and completed an emergency maintenance.
I also see some planned maintenance in the last few days.
There is at least one more upcoming, on 2019-07-31 04:00 GMT.
If it's service-impacting, we can de-pref the link by increasing its OSPF metric.

We usually see an impact on 50x errors and/or nginx availability when the link goes down, so if that can be avoided I'm +1.

Mentioned in SAL (#wikimedia-operations) [2019-07-29T18:17:58Z] <XioNoX> switch traffic to the GTT link between Ashburn and Amsterdam (set GTT metric to 820 vs. 1820 before) - T228827

We can see that link flapping in https://librenms.wikimedia.org/device/device=2/tab=port/port=6835/view=events/ as well. I think only one of those was a planned maintenance.
I called CenturyLink; they opened ticket 16814863 to investigate it, and I authorized them to do intrusive testing.
I drained that Level3 link as well by tuning OSPF metrics so their testing doesn't impact users.
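
For readers less familiar with the de-pref mechanism mentioned above: OSPF prefers the lowest-cost path, so lowering the GTT metric below the Level3 one moves traffic off Level3. A minimal Python sketch of that comparison follows. It simplifies the two circuits to parallel links and ignores the extra hops on the real path. Only the GTT values (1820 and 820) come from the SAL entry; the Level3 metric of 1000 is a hypothetical placeholder, and `preferred_link` is an illustrative helper, not part of any router tooling.

```python
def preferred_link(metrics):
    """Return the name of the link with the lowest OSPF cost."""
    return min(metrics, key=metrics.get)


# Assumption: the real Level3 metric is not stated in this task.
LEVEL3_METRIC = 1000

# GTT values are taken from the SAL entry (1820 before, 820 after).
before = {"Level3": LEVEL3_METRIC, "GTT": 1820}
after = {"Level3": LEVEL3_METRIC, "GTT": 820}

print(preferred_link(before))  # Level3
print(preferred_link(after))   # GTT
```

Making Level3 primary again would be the reverse change: any combination of metrics where Level3 ends up with the lower cost.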

ayounsi claimed this task. Jul 29 2019, 6:22 PM

This circuit has been impacted by multiple planned maintenances and higher-level network events. They have all been different troubles that have been restored so there are no chronic issues impacting this service. In the future, please report all troubles so we can fully investigate while the logs and performance monitoring data is still available.

I'm going to keep the circuit as secondary for a couple of weeks and re-set it as primary if there are no further changes.

elukey added a comment. Aug 1 2019, 7:06 AM

The link went down again:

elukey@re0.cr2-eqiad> show interfaces descriptions | match down
xe-4/1/3        up    down Transport: cr2-esams:xe-0/1/3 (Level3, BDFS2448, 84ms) {#2013} [10Gbps wave]

That was scheduled maintenance under CenturyLink's ticket 16820717; it should have been resolved about two hours ago.

Mentioned in SAL (#wikimedia-operations) [2019-08-02T23:16:32Z] <XioNoX> Make the Level3 link between eqiad-knams primary - T228827

Talked to Faidon: using the backup link for a long time is costing us money (see the overusage on https://librenms.wikimedia.org/bill/bill_id=17/). I made the Level3 link primary again.

BBlack added a comment. Aug 5 2019, 7:49 PM

Again today, causing a small spike of esams-specific 503s and icinga alerts:

2019-08-05 19:46:27	bgpPeer	cr2-esams.wikimedia	BGP Session Up: 91.198.174.248 (AS65001)	System
2019-08-05 19:45:34	pe-0/1/0.32769	cr2-esams.wikimedia	ifMtu: 9142 -> 8958	System
2019-08-05 19:45:34	xe-0/1/3.0	cr2-esams.wikimedia	ifOperStatus: lowerLayerDown -> up	System
2019-08-05 19:45:34	xe-0/1/3	cr2-esams.wikimedia	ifOperStatus: down -> up	System
2019-08-05 19:42:25	bgpPeer	cr2-eqiad.wikimedia	BGP Session Flap: 10.64.17.14 (AS64600)	System
2019-08-05 19:42:24	bgpPeer	cr2-eqiad.wikimedia	BGP Session Flap: 91.198.174.249 (AS65003)	System
2019-08-05 19:41:37	bgpPeer	cr2-esams.wikimedia	BGP Session Down: 91.198.174.248 (AS65001)	System
2019-08-05 19:40:37	pe-0/1/0.32769	cr2-esams.wikimedia	ifMtu: 8958 -> 9142	System
2019-08-05 19:40:37	xe-0/1/3.0	cr2-esams.wikimedia	ifOperStatus: up -> lowerLayerDown	System
2019-08-05 19:40:37	xe-0/1/3	cr2-esams.wikimedia	ifOperStatus: up -> down	System
2019-08-05 19:37:25	bgpPeer	cr1-eqiad.wikimedia	BGP Session Flap: 10.64.49.16 (AS64600)	System
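
The outage length can be read off event lines like the ones above by pairing each `up -> down` with the next `down -> up` on the same port. The following is a small Python sketch under stated assumptions: the events are tab-separated as pasted here, listed newest first, and `outages` is a hypothetical helper rather than a LibreNMS API.

```python
from datetime import datetime

# Four of the event lines from this comment, tab-separated as pasted.
EVENTS = (
    "2019-08-05 19:45:34\txe-0/1/3.0\tcr2-esams.wikimedia\tifOperStatus: lowerLayerDown -> up\tSystem\n"
    "2019-08-05 19:45:34\txe-0/1/3\tcr2-esams.wikimedia\tifOperStatus: down -> up\tSystem\n"
    "2019-08-05 19:40:37\txe-0/1/3.0\tcr2-esams.wikimedia\tifOperStatus: up -> lowerLayerDown\tSystem\n"
    "2019-08-05 19:40:37\txe-0/1/3\tcr2-esams.wikimedia\tifOperStatus: up -> down\tSystem\n"
)


def outages(text, interface):
    """Yield (went_down, came_up) datetime pairs for one interface."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    down_at = None
    # ISO-style timestamps sort chronologically as plain strings.
    for stamp, port, _host, message, *_rest in sorted(rows):
        if port != interface or not message.startswith("ifOperStatus: "):
            continue
        old, new = message.split(": ", 1)[1].split(" -> ")
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if (old, new) == ("up", "down"):
            down_at = when
        elif (old, new) == ("down", "up") and down_at is not None:
            yield down_at, when
            down_at = None


for went_down, came_up in outages(EVENTS, "xe-0/1/3"):
    print(went_down, came_up, (came_up - went_down).total_seconds())
# 2019-08-05 19:40:37 2019-08-05 19:45:34 297.0
```

For this flap the physical interface was down for 297 seconds, a little under five minutes.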

BBlack added a comment. Aug 5 2019, 7:50 PM

May as well link in an earlier related ticket from late last year for more backstory, too: https://phabricator.wikimedia.org/T205609

Email sent to our account rep to ask what they can do.

Circuit is down again, opened ticket 16915334.
Account rep replied to the thread and put their client support manager in the loop as well.

ayounsi closed this task as Resolved. Wed, Sep 18, 8:34 PM

From Level3:

I appreciate your patience while we worked on gathering the data on these repair tickets. I’ve attached the repair ticket log above for you.
Unfortunately, there was no chronic issue when researching this circuit. The circuit has been impacted by multiple planned maintenances and higher-level network events. They have all been different troubles that have been restored when the problem was brought to our attention.
We understand the negative impact this has on your business and the need to mitigate the downtime. With the circuit being down so many times we would expect to see some type of chronic issue, but this was really an outlier with so many issues that impacted the service. Let me know if you have any questions or if I can provide further assistance.

The circuit seems to have been stable since then, and I enabled damping on one side.

Please reopen if the issue happens again and I'll follow up.
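
As background on the damping mentioned above, here is a hedged Python sketch of the general idea behind flap damping: each flap adds a penalty, the penalty decays exponentially, and the link is held down while the penalty stays above a threshold. This is a generic model, not the configuration that was applied on the router; the class name `FlapDamper` and every parameter value are hypothetical.

```python
class FlapDamper:
    """Generic penalty-based flap damping with exponential decay."""

    def __init__(self, penalty_per_flap=1000.0, half_life_s=5.0,
                 suppress=2000.0, reuse=1000.0):
        # All defaults are illustrative assumptions, not router settings.
        self.penalty_per_flap = penalty_per_flap
        self.half_life_s = half_life_s
        self.suppress = suppress
        self.reuse = reuse
        self.penalty = 0.0
        self.last_update = None
        self.suppressed = False

    def _decay(self, now):
        # Halve the accumulated penalty once per elapsed half-life.
        if self.last_update is not None:
            elapsed = now - self.last_update
            self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now):
        """Record a flap at time `now` (seconds); return suppression state."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress:
            self.suppressed = True
        return self.suppressed

    def is_suppressed(self, now):
        """Return whether the link is still held down at time `now`."""
        self._decay(now)
        if self.suppressed and self.penalty <= self.reuse:
            self.suppressed = False
        return self.suppressed


damper = FlapDamper()
print(damper.flap(0))            # False: one flap stays below the threshold
print(damper.flap(1))            # False
print(damper.flap(2))            # True: three quick flaps trigger suppression
print(damper.is_suppressed(60))  # False: the penalty has decayed below reuse
```

The point of the two thresholds is hysteresis: a link that flaps in quick succession is kept out of service until it has been quiet for a while, so routing protocols above it do not reconverge on every transition.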

Another one (scheduled as 17144179):
2019-09-26 23:32:28 xe-4/1/3 ifOperStatus: down -> up
2019-09-26 22:12:28 xe-4/1/3 ifOperStatus: up -> down