
eqiad: patch 2nd Equinix IXP
Closed, Resolved (Public)

Description

Could you patch this cable: https://netbox.wikimedia.org/dcim/cables/4333/ ?
This is for the 2nd IXP port, from patch panel DC6:01:061130:PP:0000:103234:3,4 to cr1-eqiad:xe-3/0/6
As well as update Netbox.

Let me know once it's done so I can follow up with Equinix for the turn up call.

Event Timeline

ayounsi triaged this task as Medium priority. Oct 19 2021, 7:27 AM
ayounsi created this task.
ayounsi added a parent task: Unknown Object (Task).

Cable has been run and shows link. Netbox has not been updated yet.
#2009, 15m: pp219588361 <-> cr1-eqiad:xe-3/0/6.

The interface has been disabled because this started a partial outage, which has been resolved now.

Legoktm added a subscriber: Legoktm.

I'm tagging this with Sustainability (Incident Followup) because somehow the above cable addition caused https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-22_eqiad_return_path_timeouts which has the following actionable:

  • Investigate & document why adding a new patch cable to the Equinix IX ports caused return path issues

If it's better to discuss that in a separate task feel free to fork it.

Giving more details on our current process and what happened here.

When configuring a new circuit, we enable the interface on our side ahead of time, so DCops can check whether there is light/link and, if not, either roll it or follow up with the datacenter to check the X-connect.
Additionally, we pre-configure the IPs on our side once the provider communicates them, to save time and make the turn-up smoother (even more so when provisioning all the drmrs circuits at the same time).
Doing so has no impact for the usual circuit turn-ups (transit, transport, new peering), as they require BGP to be configured (in a second step) before they carry prod traffic.

Then comes this "special" case.

It is special because it's the first time we configure a new circuit this way: adding a 2nd L3 port to the same IXP vlan on cr1, while cr2-eqiad already has an interface exchanging prod traffic.
Additionally, Equinix requires a turn-up call and puts us in a temporary "quarantine" vlan until then.

What happened is that as the port got patched through, the interface went up, and so did the configured IP.
From then on, cr1-eqiad assumed it had L2 adjacency with the IX peers in the 206.126.236.0/22 range and could reach them directly.

In parallel, cr2 is advertising prefixes learned from IX peers to cr1 with that IX peer IP as next hop.

This means that when cr1 received outbound traffic (e.g. from half the access switches) destined for an IX peer, it sent it into that "quarantine" vlan, where the traffic was discarded.
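
To make the failure mode concrete, here is a minimal Python sketch of next-hop resolution on cr1 before and after the port came up. It is an illustration only, not taken from the router configuration. It assumes cr1 previously reached the IX LAN through a route via cr2 (the thread does not say how that route was learned), uses an invented peer address, and uses illustrative preference values in which a directly connected route beats the learned one.

```python
import ipaddress

IX_LAN = ipaddress.ip_network("206.126.236.0/22")  # range quoted in this task


def resolve_next_hop(next_hop, rib):
    """Pick the route used to reach a BGP next hop.

    Longest prefix wins; on a tie, the lowest preference value wins.
    """
    addr = ipaddress.ip_address(next_hop)
    matches = [route for route in rib if addr in route["prefix"]]
    if not matches:
        return None
    return min(matches, key=lambda r: (-r["prefix"].prefixlen, r["preference"]))


# Assumption: before the patch, cr1 reaches the IX LAN through cr2.
via_cr2 = {"prefix": IX_LAN, "preference": 10, "via": "cr2-eqiad"}
# After the patch, the pre-configured IP brings up a connected route.
connected = {"prefix": IX_LAN, "preference": 0, "via": "xe-3/0/6 (quarantine vlan)"}

peer = "206.126.236.10"  # hypothetical IX peer address, for illustration only

before = resolve_next_hop(peer, [via_cr2])
after = resolve_next_hop(peer, [via_cr2, connected])

print("before:", before["via"])
print("after: ", after["via"])
```

In this sketch the same BGP next hop resolves via cr2 before the port is patched and via the new local interface afterwards, which is where the traffic gets dropped.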

A process change I could think of to prevent future occurrence is to not pre-configure IPs on links ahead of time.
The same issue would probably have been triggered during the turn-up call, but at least it would have been rolled back faster.
Another option could be to test circuits in a dedicated VRF, but it seems overkill to change the process for all turn-ups when only one use case can be problematic.
Another possibility to reduce recovery time is to !log all circuit connections (and removals).
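
For what it's worth, the one problematic use case can be recognized mechanically: it is when the subnet of the IP about to be configured already contains next hops the router is using. A rough Python sketch, assuming a hypothetical helper name and invented addresses (nothing here reflects existing tooling):

```python
import ipaddress


def captured_next_hops(new_interface_cidr, active_next_hops):
    """Return the active next hops that would become 'directly connected'
    once new_interface_cidr is configured on an interface that is up."""
    subnet = ipaddress.ip_interface(new_interface_cidr).network
    return sorted(
        (nh for nh in active_next_hops if ipaddress.ip_address(nh) in subnet),
        key=ipaddress.ip_address,
    )


# Hypothetical inputs: the IP planned for the new port, and next hops
# currently seen in routes learned from the other router.
planned_ip = "206.126.236.200/22"
next_hops_in_use = ["206.126.236.10", "198.51.100.1"]

at_risk = captured_next_hops(planned_ip, next_hops_in_use)
print(at_risk)
```

An empty result corresponds to the usual transit, transport, or new peering turn-up; a non-empty one is the case described above, where the IP should stay off the interface until the turn-up call.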

Thanks for the rundown, Arzhel. Unfortunate incident, easily understood in hindsight but a quirky edge case - I can understand how we overlooked the potential for it to happen beforehand.

A process change I could think of to prevent future occurrence is to not pre-configure IPs on links ahead of time.

+1 on that. It's rare that all the circumstances for this problem line up, but keeping IPs off interfaces in Netbox should prevent it, while still allowing us to validate interface status / light levels for new circuits.

The VRF option is also there, but I agree it is probably overkill. DC-Ops !logging new WAN circuit connections is probably not a bad idea either.

If there is a need to keep this open in DC-OPS please re-open