Page MenuHomePhabricator

Prevent BGP alerts triggering when K8s host maintenance is being done
Closed, ResolvedPublic

Description

It's a little painful - especially of late with all the mw/wikikube-worker server renames going on - that we get alerts for BGP sessions down to Kubernetes hosts that are being reimaged, for eample:

PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv

@JMeybohm mentioned it on irc and we both agreed the effect has been that probably all of us pay less heed to these messages, when we should. The open question is how we can prevent these firing or silence them during maintenance without placing too much strain on people or systems.

Current Icinga Check Setup

Currently we use the following Icinga check to connect to routers and check BGP sessions. If there is a problem a WARNING level is returned, however we have a certain AS numbers configured as "critical ASNs", which the Kubernetes ones are in, along with our external Transits and high-profile peers:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/nagios_common/files/check_commands/check_bgp.cfg#3

The basic options are:

  1. Make the existing Icigna check "downtime aware", ignoring peer status if hosts are downtimed
  2. Add some new alerting path for K8s BGP status, which we can more simply be aware of the downtime status
  3. Disable and re-activate the BGP sessions on the routers before and after maintenance

1 might be an option if there is a good way to get the current list of downtimed hosts, I'm not sure on that. There are various options on 2 and 3 I mention below.

New alerting options

Export the BGP session state from K8s hosts and monitor that

This would be the obvious simple way to go. If the hosts exported the BGP state machine status (idle/active/established etc.) we could create alerts on that and not monitor from the router side. The advantage here is the regular host downtime would then silence the alerts. Unfortunately from what I can tell this data is not exported, there is an related issue on the Calico github:

https://github.com/projectcalico/calico/issues/2369

Create a new Icinga check which is fine-tuned for the K8s hosts

We're trying to move away from Icinga so this probably isn't great. The advantage is we can write the check in any scripting language so we should be able to implement complex logic. Unsure if this would be better than just modifying the existing check.

Alert with a custom LibreNMS rule

We have some custom alerts in LibreNMS for some internal BGP stuff, like the number of prefixes received from Anycast hosts. These can be built using custom SQL queries to the LibreNMS backend database, for instance the Anycast one is:

SELECT * FROM devices,bgpPeers,bgpPeers_cbgp WHERE (devices.device_id = ? AND devices.device_id = bgpPeers.device_id AND devices.device_id = bgpPeers_cbgp.device_id AND bgpPeers.bgpPeerIdentifier = bgpPeers_cbgp.bgpPeerIdentifier) AND bgpPeers_cbgp.AcceptedPrefixes = 0 AND bgpPeers.astext = "Anycast" AND bgpPeers_cbgp.safi = "unicast" AND bgpPeers.bgpPeerState = "established"

Again the part that is not clear to me here is how to get the list of downtime hosts to ignore.

Export BGP stats using gnmi and alert on the status from alertmanager

Longer term this is probably better. I got pretty close to a working setup for exporting BGP stats with GNMI to Prometheus, however some performance problems prevented it going live (see T369384#10488927).

Unfortunately right now we can't say if this will be an option soon. We could get lucky and find our issues are easily fixed, but it is not a priority.

Disable the BGP sessions before/after

Another way to approach this is to deactivate the BGP session on the router as part of our cookbook. Alerts won't fire for a down host if it is admin disabled.

Change netbox BGP status and push updated config with Homer

This is the "full fat" option. It's not terribly tricky to change the Netbox "bgp" flag for a given host from a cookbook, but I'm not sure running Homer from a cookbook would be easy. It also has the disadvantage that a full Homer run, especially against the core routers, takes time.

Create a cookbook to do an ad-hoc deactivation of a given peer

We could fairly easily create a cookbook that would deactivate the BGP session for a given peer on the router side. It would need to:

  • Connect to netbox, to get the host primary IPs
  • Disable BGP for the host in Netbox (to prevent a Homer run re-activating the down peer accidentially)
  • Push the cli command to "deactivate" the specific peerings
    • The API would be more robust but we don't have spicerack support for it right now
  • Wait for the maintenance to complete
  • Push the cli command to "reactivate" the specific peerings
  • Re-enable the BGP flag for the host in netbox

We could then adjust any disruptive cookbook so it triggered this one if the bgp flag was enabled for the given host in Netbox.

Interested to hear other thoughts or if there are other ways to approach this also. Personally I'd probably lean towards the last option but none are clear winners to me.

Event Timeline

cmooney triaged this task as Low priority.

@fgiunchedi perhaps you might know a way to do this.

We now have stats like this in Prometheus, with values that tell us if BGP is up or down:

gnmi_bgp_neighbor_session_state{address="10.64.0.105", instance="cr1-eqiad", job="gnmi", network_instance_name="DEFAULT", peer_as="64601", peer_descr="wikikube-worker1242", peer_group="Kubernetes4", peer_type="EXTERNAL", prometheus="ops", protocol_identifier="BGP", protocol_name="DEFAULT", site="eqiad"}  6

If we replaced the existing LibreNMS alert rule with an alertmanager one, would there be any way to supress alerts if the "peer_descr" matched a hostname that was currently downtimed?

Yes we would be able to inhibit (i.e. send no notifications) say BGPPeerDown alert based on the fact that another alert is firing (e.g. HostDown alert, to be created).

Note this is slightly different from what you mentioned, namely not the fact that an host is downtimed but rather say to alertmanager "if the host is down, then don't notify for the same host being down as a bgp peer"

Hope that helps!

Yes we would be able to inhibit (i.e. send no notifications) say BGPPeerDown alert based on the fact that another alert is firing (e.g. HostDown alert, to be created).

Note this is slightly different from what you mentioned, namely not the fact that an host is downtimed but rather say to alertmanager "if the host is down, then don't notify for the same host being down as a bgp peer"

Hope that helps!

Thanks yes it does! But I'm actually thinking I'm looking at it the wrong way. The previous check ran against the router, checking for all the BGP peers, and alerting if any were down.

I think what we should do instead is create an alertmanager check at the host level for all K8s servers, with a query like:

gnmi_bgp_neighbor_session_state{peer_descr="$hostname"} == 6

Then this check, because it's at the host level, would automatically not alert if the host is downtimed? Or would that only work if instance=$hostname?

An alternative (or short term solution until the above/cleaner one is live) is to not alert for sessions on the router side towards k8s nodes, and only monitor BGP from the k8s side.

The only thing that we would lose is if there is a bug where the calico/bird instance thinks BGP is up, while it's actually down. This can also be a complementary monitoring.

From a quick look, https://docs.tigera.io/calico-cloud/operations/monitor/metrics/bgp-metrics is only for the paid version of Calico, but it might be possible to install prometheus-bird-exporter like we do for the Anycast service : https://grafana.wikimedia.org/d/dxbfeGDZk/anycast

Yes we would be able to inhibit (i.e. send no notifications) say BGPPeerDown alert based on the fact that another alert is firing (e.g. HostDown alert, to be created).

Note this is slightly different from what you mentioned, namely not the fact that an host is downtimed but rather say to alertmanager "if the host is down, then don't notify for the same host being down as a bgp peer"

Hope that helps!

Thanks yes it does! But I'm actually thinking I'm looking at it the wrong way. The previous check ran against the router, checking for all the BGP peers, and alerting if any were down.

I think what we should do instead is create an alertmanager check at the host level for all K8s servers, with a query like:

gnmi_bgp_neighbor_session_state{peer_descr="$hostname"} == 6

Then this check, because it's at the host level, would automatically not alert if the host is downtimed? Or would that only work if instance=$hostname?

Technically we could, though I'd rather not create individual alerts for each host but rather rely on queries without explicit hostnames in them. At any rate let's brainstorm before/in/after Atlanta !

An alternative (or short term solution until the above/cleaner one is live) is to not alert for sessions on the router side towards k8s nodes, and only monitor BGP from the k8s side.

Yeah this is one of the options in the task description, and the cleanest as far as I am concerned. I don't believe it is trivial to do, however, the data is not exposed in Prometheus.

The only thing that we would lose is if there is a bug where the calico/bird instance thinks BGP is up, while it's actually down. This can also be a complementary monitoring.

Yeah that's unlikely.

From a quick look, https://docs.tigera.io/calico-cloud/operations/monitor/metrics/bgp-metrics is only for the paid version of Calico, but it might be possible to install prometheus-bird-exporter like we do for the Anycast service : https://grafana.wikimedia.org/d/dxbfeGDZk/anycast

I didn't get further than finding the issue and that it's in the paid version. But a quick check of prometheus-bird-exporter shows it uses Unix sockets to talk to Bird, so in theory it would be possible. There is also this I found but it depends on 'birdcl' / 'birdc' which is not on our k8s hosts:

https://github.com/kumina/birdwatcher

Technically we could, though I'd rather not create individual alerts for each host but rather rely on queries without explicit hostnames in them. At any rate let's brainstorm before/in/after Atlanta !

Yeah we don't want to do that. We can check for the whole group, the problem is suppressing the alerts for downtimed hosts. Let's talk at the summit!

From lunch discussion in Atlanta:
It would be ideal if we could create a recording rule that copies the gnmi_bgp_neighbor_session_state metric, rewriting the peer_descr label value to instance. That would auto downtime/silence alerts based on that metric when a kubernetes node (peer_descr) is silenced.
This has the benefit of still relying on router data for alerting, not having to introduce another data source and not having to patch the calico chart with the bird exporter.

If, for whatever reason, the above does not work, we could still add the prometheus-bird-exporter as sidecar to calico-node (bird v4 and v6 sockets are accessible there) and alerts on metrics reported by it rather than gnmi.

Since we have to overwrite instance with the host instead of the router, that information is effectively lost, unless we move instance elsewhere. At any rate I think for a limited (i.e. internal bgp peers) use case it is fine.

The base query looks like this: label_replace(sum without (instance) (gnmi_bgp_neighbor_session_state), "instance", "$1:0", "peer_descr", "(.*)"), we would record that query into a new metric such as instance:gnmi_bgp_neighbor_session_state in modules/profile/files/prometheus/rules_ops.yml and we can alert if said metric == 6.

I also noticed that for some of the results there's more information/duplication? That could be filtered too, e.g. of duplication

{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="DEFAULT", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", protocol_identifier="BGP", protocol_name="DEFAULT", site="ulsfo"}	6
{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="master", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", site="ulsfo"}	6

Thanks !

Since we have to overwrite instance with the host instead of the router, that information is effectively lost, unless we move instance elsewhere. At any rate I think for a limited (i.e. internal bgp peers) use case it is fine.

The base query looks like this: label_replace(sum without (instance) (gnmi_bgp_neighbor_session_state), "instance", "$1:0", "peer_descr", "(.*)"), we would record that query into a new metric such as instance:gnmi_bgp_neighbor_session_state in modules/profile/files/prometheus/rules_ops.yml and we can alert if said metric == 6.

Is it possible to duplicate the metric, before the transform, so it still shows up as a router metric ?

And what happens if peer_descr is missing or empty ?

I also noticed that for some of the results there's more information/duplication? That could be filtered too, e.g. of duplication

{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="DEFAULT", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", protocol_identifier="BGP", protocol_name="DEFAULT", site="ulsfo"}	6
{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="master", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", site="ulsfo"}	6

That's because in "legacy" subnets, the servers peer with the two core routers :

netflow4002:~$ curl localhost:9804/metrics | grep gnmi_bgp_neighbor_session_state | grep lvs4010 see the different source later renamed to instance

gnmi_bgp_neighbor_session_state{address="10.128.0.11",network_instance_name="master",peer_as="64600",peer_descr="lvs4010",peer_group="PyBal",peer_type="EXTERNAL",source="cr4-ulsfo.wikimedia.org:32767"} 6 1739965191643
gnmi_bgp_neighbor_session_state{address="10.128.0.11",network_instance_name="DEFAULT",peer_as="64600",peer_descr="lvs4010",peer_group="PyBal",peer_type="EXTERNAL",protocol_identifier="BGP",protocol_name="DEFAULT",source="cr3-ulsfo.wikimedia.org:32767"} 6 1739965192053

(the other fields difference are because each router runs a different junos version).

So if we re-write that state, it will lead to duplicates, and possibly wrong info if only one of the session is up and the other down. If we go that way we might want to keep the worse state. Or do more transformation to add a router_name (or peer_name or something else) field and add the current source/instance in there.

Overall for that usecase I think the cleanest would be to go with the prometheus-bird-exporter (especially if one day we go with dynamic BGP neighbors) but I lack knowledge on how complex it is to setup.

On the other hand I can imagine other usecases where such transform would be useful. For example if a switch port goes down, or saturates, to have it alert as the matching host/team.

! In T384731#10556225, @fgiunchedi wrote:
I also noticed that for some of the results there's more information/duplication? That could be filtered too, e.g. of duplication

{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="DEFAULT", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", protocol_identifier="BGP", protocol_name="DEFAULT", site="ulsfo"}	6
{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="master", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", site="ulsfo"}	6

That's because in "legacy" subnets, the servers peer with the two core routers

Yeah I hadn't considered that. I guess we need two rewrites:

instance -> router_name
peer_descr -> instance

So we have an additional field "router_name" to differentiate these two.

Is it possible to duplicate the metric, before the transform, so it still shows up as a router metric ?

My understanding is this is replacing labels and writing the resulting new metrics as separate series to the db. So we still have the source metric as-is with tags unchanged? @fgiunchedi can you clarify that?

Is it possible to duplicate the metric, before the transform, so it still shows up as a router metric ?

My understanding is this is replacing labels and writing the resulting new metrics as separate series to the db. So we still have the source metric as-is with tags unchanged? @fgiunchedi can you clarify that?

Yes your understanding is correct: recording rules create new metrics based on the input query, metrics that are part of the input query are left untouched

Thanks !

Since we have to overwrite instance with the host instead of the router, that information is effectively lost, unless we move instance elsewhere. At any rate I think for a limited (i.e. internal bgp peers) use case it is fine.

The base query looks like this: label_replace(sum without (instance) (gnmi_bgp_neighbor_session_state), "instance", "$1:0", "peer_descr", "(.*)"), we would record that query into a new metric such as instance:gnmi_bgp_neighbor_session_state in modules/profile/files/prometheus/rules_ops.yml and we can alert if said metric == 6.

Is it possible to duplicate the metric, before the transform, so it still shows up as a router metric ?

yes, see above

And what happens if peer_descr is missing or empty ?

good question, in that case the instance label will be :0 since $1 will be empty. Is having peer_descr something that can happen?

I also noticed that for some of the results there's more information/duplication? That could be filtered too, e.g. of duplication

{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="DEFAULT", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", protocol_identifier="BGP", protocol_name="DEFAULT", site="ulsfo"}	6
{address="10.128.0.11", instance="lvs4010:0", job="gnmi", network_instance_name="master", peer_as="64600", peer_descr="lvs4010", peer_group="PyBal", peer_type="EXTERNAL", prometheus="ops", site="ulsfo"}	6

That's because in "legacy" subnets, the servers peer with the two core routers :

thank you that explains!

So if we re-write that state, it will lead to duplicates, and possibly wrong info if only one of the session is up and the other down. If we go that way we might want to keep the worse state. Or do more transformation to add a router_name (or peer_name or something else) field and add the current source/instance in there.

Indeed we'll have to rename both instance and peer_descr as you described

Overall for that usecase I think the cleanest would be to go with the prometheus-bird-exporter (especially if one day we go with dynamic BGP neighbors) but I lack knowledge on how complex it is to setup.

When we talked with @JMeybohm it looked possible though IIRC not very future proof since calico might ditch bird, totally doable though

On the other hand I can imagine other usecases where such transform would be useful. For example if a switch port goes down, or saturates, to have it alert as the matching host/team.

indeed that's a useful feature to have!

Thanks for the update @fgiunchedi

! In T384731#10566953, @fgiunchedi wrote:

And what happens if peer_descr is missing or empty ?

good question, in that case the instance label will be :0 since $1 will be empty. Is having peer_descr something that can happen?

The config for each peer is created by automation triggered by the 'bgp' flag for a host being set in Netbox. So in normal operations there is basically zero chance of "peer_descr" being missing/incorrect.

Overall for that usecase I think the cleanest would be to go with the prometheus-bird-exporter (especially if one day we go with dynamic BGP neighbors) but I lack knowledge on how complex it is to setup.

When we talked with @JMeybohm it looked possible though IIRC not very future proof since calico might ditch bird, totally doable though

I think it's easily doable by adding a sidecar to the calico-node pods that has access to the bird sockets (hostPath). Also there is on evidence that calico will be ditching bird so we could also go that route for now and revisit later if needed.

Overall for that usecase I think the cleanest would be to go with the prometheus-bird-exporter (especially if one day we go with dynamic BGP neighbors) but I lack knowledge on how complex it is to setup.

When we talked with @JMeybohm it looked possible though IIRC not very future proof since calico might ditch bird, totally doable though

I think it's easily doable by adding a sidecar to the calico-node pods that has access to the bird sockets (hostPath). Also there is on evidence that calico will be ditching bird so we could also go that route for now and revisit later if needed.

My apologies for the noise -- I clearly wasn't remembering correctly! If the sidecar case is easy to do then that's something to seriously consider IMHO. At the same time, addressing/thinking about the general problem (namely "what to do with alerts about hosts which don't come from the host itself") is worthwhile doing anyways.

And what happens if peer_descr is missing or empty ?

good question, in that case the instance label will be :0 since $1 will be empty. Is having peer_descr something that can happen?

Looks like it can't be empty or missing but can become (null). See T387220: BGP peers with missing descriptions.
I also asked in case there was any bugs in the pipeline and it somehow became missing.

And what happens if peer_descr is missing or empty ?

good question, in that case the instance label will be :0 since $1 will be empty. Is having peer_descr something that can happen?

Looks like it can't be empty or missing but can become (null). See T387220: BGP peers with missing descriptions.
I also asked in case there was any bugs in the pipeline and it somehow became missing.

Good to check. One thing this made me think is it probably would be a good idea to limit the number of new metrics we make, to reduce the total amount of duplicate data?

I'm not sure what the best way to do this would be. One way might be to filter in the rewrite rule to just the groups we are interested in? I guess we should do it and make the rule for all the groups with servers doing BGP? But we'd need to maintain a list of the groups somewhere. Or perhaps there are other ways - or maybe we don't care about the duplication?

ayounsi claimed this task.

Closing that parent task to focus on the remaining sub task.