Enable OSPF Icinga check for EVPN based switches
Closed, Resolved, Public

Description

A link between two of the EVPN-enabled switches in Eqiad went down recently, but we did not receive an alert.

I had wrongly assumed that BGP would go down if a link failed. However, because we use iBGP for the EVPN SAFI, multihop is enabled, which keeps the session up even when the direct link is down. We may wish to revisit this and change the multihop attribute so that iBGP goes down if devices aren't directly connected.

More generally, it would help to also alert when the number of OSPF-enabled interfaces does not match the number of OSPF adjacencies. This check already exists for core routers, so we should be able to repurpose it.
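
As a rough illustration of that comparison, here is a minimal Python sketch. The function name, its inputs, and the use of standard Nagios/Icinga exit codes are assumptions for illustration only; this is not the existing core-router check, which would have to gather these counts from the device itself. It also assumes point-to-point links, so one full adjacency is expected per OSPF-enabled interface.

```python
from typing import Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def compare_ospf(enabled_interfaces: int, full_adjacencies: int) -> Tuple[int, str]:
    """Hypothetical helper: return (exit_code, message) for one device.

    Assumes point-to-point links, i.e. one full adjacency is expected
    per non-passive OSPF-enabled interface.
    """
    if full_adjacencies < enabled_interfaces:
        missing = enabled_interfaces - full_adjacencies
        return CRITICAL, (
            f"CRITICAL: {missing} of {enabled_interfaces} "
            "OSPF interfaces have no full adjacency"
        )
    return OK, (
        f"OK: {full_adjacencies} adjacencies on "
        f"{enabled_interfaces} OSPF interfaces"
    )
```

With this shape, a failed leaf-to-spine link would surface as one missing adjacency even while the multihop iBGP session stays established.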

Event Timeline

cmooney triaged this task as Medium priority.

Having thought about it in more detail, I think it's best to keep multihop for the iBGP EVPN sessions.

The reason is that even if a leaf loses a spine link, it will still have two logical RR sessions. This means that in case of an issue or config change, each can be cleared separately without removing all routes or disrupting traffic.

The OSPF adjacency check should catch failing ports; if not, we can build another check for that.

@ayounsi I'd be interested if you have any thoughts on that.

Yeah, I agree. +1 on having stable iBGP that is capable of handling a link failure.

The OSPF adjacency check should be used, but IIRC it assumes there are as many v4 sessions as v6 ones, since that's what we have on the transport links. So a flag like --nov6 might be required.

Change 899609 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured

https://gerrit.wikimedia.org/r/899609

Change 899609 merged by Cathal Mooney:

[operations/puppet@production] Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured

https://gerrit.wikimedia.org/r/899609
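
The behaviour named in the patch title, ignoring OSPFv3 when zero OSPFv3 interfaces are configured, could look roughly like the sketch below. This is not the merged code: the function name, the input dictionary, and the message format are all hypothetical, and the counts are assumed to have been collected elsewhere.

```python
from typing import Dict, List, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_families(counts: Dict[str, Tuple[int, int]]) -> Tuple[int, List[str]]:
    """Hypothetical helper.

    counts maps an address-family label (e.g. "ospfv2", "ospfv3") to a
    (configured_interfaces, full_adjacencies) pair.
    """
    status = OK
    messages = []
    for family, (interfaces, adjacencies) in sorted(counts.items()):
        if interfaces == 0:
            # Nothing configured for this family, so there is nothing to compare.
            messages.append(f"{family}: no interfaces configured, skipped")
            continue
        if adjacencies < interfaces:
            status = CRITICAL
        messages.append(f"{family}: {adjacencies}/{interfaces} adjacencies up")
    return status, messages
```

Skipping a family only when its interface count is zero avoids the need for a separate flag on devices that run OSPFv2 alone, while devices with both families configured are still checked for each.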

Change 900431 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Enable OSPF check by default for l3 switch mgmt interfaces

https://gerrit.wikimedia.org/r/900431

Change 900431 merged by Cathal Mooney:

[operations/puppet@production] Enable OSPF check by default for l3 switch mgmt interfaces

https://gerrit.wikimedia.org/r/900431

I've merged the patch and the EVPN switches are now being checked by Icinga. All looks healthy.