It's a little painful - especially of late with all the mw/wikikube-worker server renames going on - that we get alerts for BGP sessions down to Kubernetes hosts that are being reimaged, for example:
PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv
@JMeybohm mentioned it on IRC and we both agreed the effect has been that probably all of us now pay less heed to these messages than we should. The open question is how we can prevent these alerts firing, or silence them during maintenance, without placing too much strain on people or systems.
Current Icinga Check Setup
Currently we use the following Icinga check to connect to routers and check BGP sessions. If there is a problem a WARNING level is returned, however we have certain AS numbers configured as "critical ASNs", which the Kubernetes ones are in, along with our external Transits and high-profile peers:
The basic options are:
- Make the existing Icinga check "downtime aware", ignoring peer status if hosts are downtimed
- Add some new alerting path for K8s BGP status, which can more simply be made aware of the downtime status
- Disable and re-activate the BGP sessions on the routers before and after maintenance
The first might be an option if there is a good way to get the current list of downtimed hosts; I'm not sure on that. There are various ways to do the second and third, which I mention below.
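To make the "downtime aware" idea concrete, here is a minimal sketch of the filtering step only. It assumes we already have the set of downtimed hostnames (which is exactly the open question) and some way to map each peer to a hostname. The `Peer` class, `alertable_peers` function and the hostnames are hypothetical and are not part of the existing check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    hostname: str  # host owning the session (hypothetical: mapped from the peer IP)
    asn: int
    family: str    # "IPv4" or "IPv6"
    state: str     # BGP state as reported by the router, e.g. "Active"


def alertable_peers(peers, downtimed_hosts):
    """Return peers that are not established and whose host is not downtimed."""
    return [
        p for p in peers
        if p.state.lower() != "established" and p.hostname not in downtimed_hosts
    ]


if __name__ == "__main__":
    # Example data only; the hostnames are made up for illustration.
    peers = [
        Peer("kube-example1001", 64601, "IPv4", "Active"),
        Peer("kube-example1002", 64601, "IPv4", "Active"),
        Peer("kube-example1003", 64601, "IPv6", "Established"),
    ]
    # kube-example1001 is being reimaged, so only kube-example1002 should alert.
    for peer in alertable_peers(peers, {"kube-example1001"}):
        print(f"AS{peer.asn}/{peer.family}: {peer.state} - {peer.hostname}")
```

The same filter would apply whether it lives in the existing check or in a new alerting path; the hard part remains sourcing `downtimed_hosts`.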
New alerting options
Export the BGP session state from K8s hosts and monitor that
This would be the obvious simple way to go. If the hosts exported the BGP state machine status (idle/active/established etc.) we could create alerts on that and not monitor from the router side. The advantage here is that the regular host downtime would then silence the alerts. Unfortunately, from what I can tell this data is not exported; there is a related issue on the Calico GitHub:
https://github.com/projectcalico/calico/issues/2369
Create a new Icinga check which is fine-tuned for the K8s hosts
We're trying to move away from Icinga so this probably isn't great. The advantage is we can write the check in any scripting language so we should be able to implement complex logic. Unsure if this would be better than just modifying the existing check.
Alert with a custom LibreNMS rule
We have some custom alerts in LibreNMS for some internal BGP stuff, like the number of prefixes received from Anycast hosts. These can be built using custom SQL queries to the LibreNMS backend database, for instance the Anycast one is:
SELECT * FROM devices,bgpPeers,bgpPeers_cbgp WHERE (devices.device_id = ? AND devices.device_id = bgpPeers.device_id AND devices.device_id = bgpPeers_cbgp.device_id AND bgpPeers.bgpPeerIdentifier = bgpPeers_cbgp.bgpPeerIdentifier) AND bgpPeers_cbgp.AcceptedPrefixes = 0 AND bgpPeers.astext = "Anycast" AND bgpPeers_cbgp.safi = "unicast" AND bgpPeers.bgpPeerState = "established"
Again the part that is not clear to me here is how to get the list of downtime hosts to ignore.
Export BGP stats using gNMI and alert on the status from Alertmanager
Longer term this is probably better. I got pretty close to a working setup for exporting BGP stats with gNMI to Prometheus, however some performance problems prevented it going live (see T369384#10488927).
Unfortunately right now we can't say if this will be an option soon. We could get lucky and find our issues are easily fixed, but it is not a priority.
Disable the BGP sessions before/after
Another way to approach this is to deactivate the BGP session on the router as part of our cookbook. Alerts won't fire for a down host if its session is administratively disabled.
Change netbox BGP status and push updated config with Homer
This is the "full fat" option. It's not terribly tricky to change the Netbox "bgp" flag for a given host from a cookbook, but I'm not sure running Homer from a cookbook would be easy. It also has the disadvantage that a full Homer run, especially against the core routers, takes time.
Create a cookbook to do an ad-hoc deactivation of a given peer
We could fairly easily create a cookbook that would deactivate the BGP session for a given peer on the router side. It would need to:
- Connect to netbox, to get the host primary IPs
- Disable BGP for the host in Netbox (to prevent a Homer run re-activating the down peer accidentally)
- Push the cli command to "deactivate" the specific peerings
  - The API would be more robust but we don't have spicerack support for it right now
- Wait for the maintenance to complete
- Push the cli command to "reactivate" the specific peerings
- Re-enable the BGP flag for the host in netbox
We could then adjust any disruptive cookbook so it triggered this one if the bgp flag was enabled for the given host in Netbox.
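As a rough sketch of the steps above, the following builds the router commands and the before/after ordering without touching Netbox or the routers. It is not a working cookbook. It assumes Junos-style `deactivate`/`activate` configuration statements, and the function names and BGP group names are hypothetical placeholders.

```python
import ipaddress

VALID_ACTIONS = ("deactivate", "activate")


def peer_commands(action, peer_ips, groups):
    """Build one configuration command per peer IP.

    `groups` maps IP version (4 or 6) to a BGP group name. The group names
    are hypothetical and the syntax assumes Junos-style configuration.
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    commands = []
    for ip in peer_ips:
        addr = ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        commands.append(
            f"{action} protocols bgp group {groups[addr.version]} neighbor {addr}"
        )
    return commands


def maintenance_plan(host, peer_ips, groups):
    """Return (before, after) step lists in the order listed above."""
    # Netbox flag goes off first so a Homer run cannot re-activate the peer.
    before = [f"netbox: set bgp=False on {host}"]
    before += [f"router: {c}" for c in peer_commands("deactivate", peer_ips, groups)]
    # After maintenance, re-activate the peerings, then restore the flag.
    after = [f"router: {c}" for c in peer_commands("activate", peer_ips, groups)]
    after.append(f"netbox: set bgp=True on {host}")
    return before, after


if __name__ == "__main__":
    # Documentation-range addresses and made-up names, for illustration only.
    groups = {4: "ExampleGroup4", 6: "ExampleGroup6"}
    before, after = maintenance_plan(
        "kube-example1001", ["192.0.2.10", "2001:db8::10"], groups
    )
    print("\n".join(before + ["(wait for maintenance to complete)"] + after))
```

Keeping the command generation separate from the code that pushes it would make it easier to swap the cli push for the API later, if spicerack gains support for it.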
Interested to hear other thoughts or if there are other ways to approach this also. Personally I'd probably lean towards the last option but none are clear winners to me.