Follow up from incident https://wikitech.wikimedia.org/wiki/Incident_documentation/20191016-network_eqsin
One of the issues was caused by the main transport link, which terminates on cr1-eqsin, flapping and causing 500 errors as the caches couldn't reach the main DCs.
A few options:
- Terminating it on cr2-eqsin (and the tunnel on cr1 for redundancy)
- Adding a 2nd link
- Configuring link damping (cf. T196432)
As the tunnel has proven quite reliable, I'd suggest doing option 3 (link damping) first, then ideally option 2 (a 2nd link) at some point.
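
For anyone unfamiliar with how link damping would help here, below is a rough Python sketch of the general penalty/decay idea behind it. It is not the router implementation or the config proposed in T196432: the class name `DampedLink` and all the numbers (half-life, per-flap penalty, suppress and reuse thresholds) are illustrative assumptions. Each flap adds a penalty, the penalty decays exponentially, and the link is held down while the penalty stays above the reuse threshold after having crossed the suppress threshold.

```python
from dataclasses import dataclass


@dataclass
class DampedLink:
    """Toy model of interface damping; every default value is hypothetical."""
    half_life: float = 5.0        # seconds for the penalty to halve
    flap_penalty: float = 1000.0  # penalty added per flap
    suppress: float = 2000.0      # above this, the link is held down
    reuse: float = 1000.0         # below this, a suppressed link is released
    penalty: float = 0.0
    last_update: float = 0.0
    suppressed: bool = False

    def _decay(self, now: float) -> None:
        # Exponentially decay the accumulated penalty since the last update.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False

    def flap(self, now: float) -> None:
        # Record one up/down transition at time `now` (seconds).
        self._decay(now)
        self.penalty += self.flap_penalty
        if self.penalty > self.suppress:
            self.suppressed = True

    def is_advertised_up(self, now: float) -> bool:
        # True if the link would be presented as usable to routing protocols.
        self._decay(now)
        return not self.suppressed


link = DampedLink()
for t in (0.0, 1.0, 2.0):  # three flaps within two seconds
    link.flap(t)
print(link.is_advertised_up(3.0))   # still suppressed shortly after the flaps
print(link.is_advertised_up(10.0))  # released once the penalty has decayed
```

The point of the sketch is that a rapidly flapping link stays down from the routing point of view, so traffic sticks to the tunnel instead of bouncing between paths on every flap.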