eqiad: patch 2nd Equinix IXP
Closed, Resolved (Public)

Assigned To

Authored By

	• ayounsi
	Oct 19 2021, 7:27 AM

Description

Could you patch this cable: https://netbox.wikimedia.org/dcim/cables/4333/ ?
This is for the 2nd IXP port, from patch panel DC6:01:061130:PP:0000:103234:3,4 to cr1-eqiad:xe-3/0/6
Please also update Netbox.

Let me know once it's done so I can follow up with Equinix for the turn up call.

Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	LSobanski	T295152 Incident: 2021-10-22 eqiad return path timeouts
Resolved	• Cmjohnson	T293726 eqiad: patch 2nd Equinix IXP

Event Timeline

• ayounsi triaged this task as Medium priority. Oct 19 2021, 7:27 AM

• ayounsi created this task.

• ayounsi added a parent task: Unknown Object (Task).

Maintenance_bot added a project: SRE. Oct 19 2021, 7:45 AM

wiki_willy assigned this task to • Cmjohnson. Oct 19 2021, 5:37 PM

Cable has been run and shows link. Netbox has not been updated yet.
#2009 15m. pp219588361 <-> cr1-eqiad:xe-3/0/6.

The interface has been disabled because this started a partial outage, which has been resolved now.

I'm tagging this with Sustainability (Incident Followup) because somehow the above cable addition caused https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-10-22_eqiad_return_path_timeouts which has the following actionable:

Investigate & document why adding a new patch cable to the Equinix IX ports caused return path issues

If it's better to discuss that in a separate task feel free to fork it.

Giving more details on our current process and what happened here.

When configuring a new circuit, we enable the interface on our side ahead of time, so DCops can check if there is light/link and if not either roll it, or follow up with the datacenter to check the X-connect.
Additionally, we pre-configure the IPs on our side once the provider communicates them, to save time and make the turn up smoother (this is even more true when provisioning all the drmrs circuits at the same time).
Doing so has no impact for the usual circuit turn ups (transit, transport, new peering), as they require BGP to be configured (as a second step) before they carry prod traffic.

Then comes this "special" case.

Special as it's the first time we configure a new circuit in that way:
adding a 2nd L3 port to the same IXP vlan, on cr1, while cr2-eqiad already has an interface in that vlan exchanging prod traffic.
Additionally, Equinix requires a turn up call, and puts us in a temporary "quarantine" vlan until then.

What happened is that as the port got patched through, the interface went up, so did the configured IP.
From then on, cr1-eqiad believed it had L2 adjacency to the IX and could therefore reach IX peers in the 206.126.236.0/22 range directly.

In parallel, cr2 is advertising prefixes learned from IX peers to cr1 with that IX peer IP as next hop.

This means that when cr1 received outbound traffic (e.g. from half the access switches) toward an IX peer, it sent it to that "quarantine" vlan, where the traffic was discarded.
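To illustrate the next-hop resolution described above, here is a minimal sketch in Python. It is a toy model, not the routers' actual behaviour or config: the peer address 206.126.236.10 is made up, and it assumes that before the patch cr1 reached the IX range through a route learned from cr2 over the IGP (the task does not say how cr1 resolved those next hops). The preference values are the Junos defaults for direct (0) and OSPF internal (10) routes, where lower wins.

```python
import ipaddress

# Junos default route preferences (lower wins).
DIRECT, OSPF = 0, 10

def resolve(next_hop, routes):
    """Pick the route used to reach next_hop.

    routes is a list of (prefix, preference, path) tuples. The longest
    matching prefix wins; ties are broken by the lowest preference.
    """
    addr = ipaddress.ip_address(next_hop)
    candidates = [r for r in routes if addr in ipaddress.ip_network(r[0])]
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r[0]).prefixlen, r[1]),
    )
    return best[2]

# Hypothetical IX peer address, used as BGP next hop in routes cr2 sends to cr1.
IX_PEER = "206.126.236.10"

# Assumption: before the patch, cr1 only knows the IX range via cr2.
before = [("206.126.236.0/22", OSPF, "via cr2-eqiad")]

# After the patch, the pre-configured IP comes up and adds a direct route.
after = before + [("206.126.236.0/22", DIRECT, "xe-3/0/6 (quarantine vlan)")]

print(resolve(IX_PEER, before))
print(resolve(IX_PEER, after))
```

In this model the direct route wins as soon as the interface comes up, so the same BGP next hop suddenly resolves out of xe-3/0/6 rather than through cr2.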

A process change I could think of to prevent future occurrences is to not pre-configure IPs on links ahead of time.
The same issue would probably have triggered during the turn up call, but at least it would have been rolled back faster.
Another option could be to test circuits in a dedicated VRF, but it seems overkill to change the process for all turn ups when only one use case can be problematic.
Another possibility to reduce recovery time is to !log all circuit connections (and removals).
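As a rough illustration of what makes this case different from the usual turn ups, the check below flags BGP next hops that would fall inside the subnet of an address about to be configured. This is only a sketch: the function name, the interface address and the next hops are all hypothetical, and nothing like this exists in our tooling as far as this task is concerned.

```python
import ipaddress

def shadowed_next_hops(new_interface_address, bgp_next_hops):
    """Return the next hops that fall inside the new interface's subnet.

    new_interface_address is an address with prefix length, e.g. "192.0.2.5/24".
    A non-empty result means bringing the interface up would make the router
    treat those next hops as directly connected.
    """
    network = ipaddress.ip_interface(new_interface_address).network
    return sorted(
        nh for nh in bgp_next_hops if ipaddress.ip_address(nh) in network
    )

# Hypothetical values: an address in the IX range, and next hops in use.
conflicts = shadowed_next_hops(
    "206.126.236.50/22",
    {"206.126.236.10", "206.126.237.20", "10.0.0.1"},
)
print(conflicts)
```

For a transit or transport link the result would normally be empty, since no traffic points at the new subnet until BGP is configured on it.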

cmooney subscribed. Oct 28 2021, 9:45 AM

Thanks for the rundown Arzhel. Unfortunate incident, easily understood in hindsight but a quirky edge case - I can understand how we overlooked the potential for it to happen beforehand.

A process change I could think of to prevent future occurrence is to not pre-configure IPs on links ahead of time.

+1 on that. It's rare that all the circumstances for this problem line up, but keeping IPs off interfaces in Netbox should prevent it, and still allow us to validate interface status / light levels for new circuits.

The VRF option is also there, but I agree it is probably overkill. Having DC-Ops !log new WAN circuit connections is probably not a bad idea either.

If there is a need to keep this open in DC-OPS please re-open

herron added a parent task: T295152: Incident: 2021-10-22 eqiad return path timeouts. Nov 5 2021, 2:40 PM

• ayounsi mentioned this in T295672: Use next-hop-self for iBGP sessions. Nov 15 2021, 10:08 AM
