Improve resiliency of the eqsin transport link
Closed, Resolved · Public

Description

Follow up from incident https://wikitech.wikimedia.org/wiki/Incident_documentation/20191016-network_eqsin

One of the issues was caused by the main transport link, which terminates on cr1-eqsin, flapping and causing 500 errors because the caches couldn't reach the main DCs.

A few options:

  1. Terminating it on cr2-eqsin (and the tunnel on cr1 for redundancy)
  2. Adding a 2nd link
  3. Configuring link damping (cf. T196432)

As the tunnel has proven quite reliable, I'd suggest doing 3 first, then ideally 2 at some point.
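
For reference, here is a rough sketch of how link damping (option 3) is generally expected to behave: each flap adds a penalty, the penalty decays exponentially, and the link is held down while the penalty sits above a suppress threshold until it decays below a reuse threshold. The class name and every numeric value below are hypothetical and chosen only for illustration; they are not the settings applied on the routers, and the actual router implementation may differ in its details.

```python
from dataclasses import dataclass


@dataclass
class LinkDamper:
    """Toy model of interface damping with an exponentially decaying penalty.

    All defaults are illustrative assumptions, not real router settings.
    """
    penalty_per_flap: float = 1000.0    # added on every flap (assumed)
    suppress_threshold: float = 2000.0  # start suppressing at or above this
    reuse_threshold: float = 1000.0     # stop suppressing below this
    half_life_s: float = 5.0            # penalty halves every 5 s (assumed)
    penalty: float = 0.0
    suppressed: bool = False
    last_update_s: float = 0.0

    def _decay(self, now_s: float) -> None:
        # Apply exponential decay for the time elapsed since the last update.
        elapsed = now_s - self.last_update_s
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update_s = now_s
        # Release the link once the penalty has decayed below the reuse level.
        if self.suppressed and self.penalty < self.reuse_threshold:
            self.suppressed = False

    def flap(self, now_s: float) -> None:
        # Record one up/down transition at time now_s (seconds).
        self._decay(now_s)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_threshold:
            self.suppressed = True

    def is_suppressed(self, now_s: float) -> bool:
        # True while the link would be reported as down to upper layers.
        self._decay(now_s)
        return self.suppressed
```

With these assumed values, a single flap is not enough to suppress the link, but three flaps one second apart are. The link then stays suppressed until the penalty decays below the reuse threshold, so traffic stays on the tunnel instead of following every transition of the primary link.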

Event Timeline

ayounsi triaged this task as Normal priority. (Wed, Oct 30, 8:21 AM)
ayounsi created this task.
Restricted Application added a project: Operations. (Wed, Oct 30, 8:21 AM)
Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-11-07T00:21:19Z] <XioNoX> enable interface damping on primary eqsin-codfw link - T236878

ayounsi closed this task as Resolved. (Thu, Nov 7, 12:26 AM)
ayounsi claimed this task.

Damping configured.