Alert when anycast-healthchecker withdraws BGP route
Closed, Resolved, Public

Description

When working on the Anycast setup for the ceph swift service I noticed that we get no notification when anycast-healthchecker's check fails and it withdraws a BGP route.

I know that in most scenarios we have other checks on the service it is looking at, and those would alert, but I think it is worth adding some alerting on this so that things don't look "normal" when the host is not announcing anything for an odd reason.

One way to capture it might be to monitor the system logs and look for when the anycast-healthchecker marks something as DOWN?

Alternatively, we could count how many BGP routes are announced, using something like this:

cmooney@cephosd1003:~$ sudo birdc show route all export bgp1 | grep -c unicast
0
cmooney@cephosd1003:~$ sudo birdc show route all export bgp1 | grep -c unicast
1
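The same count could be collected from a script rather than a shell pipeline. The following is a minimal sketch, not an existing tool: the function names are hypothetical, the default protocol name `bgp1` is taken from the command above, and the script is assumed to run with enough privileges to talk to the bird control socket (the original command uses sudo).

```python
import subprocess


def count_unicast_lines(birdc_output: str) -> int:
    """Count lines containing 'unicast', mirroring `grep -c unicast`."""
    return sum(1 for line in birdc_output.splitlines() if "unicast" in line)


def exported_route_count(protocol: str = "bgp1") -> int:
    """Run birdc and count the routes exported to the given protocol.

    Assumes `birdc` is on PATH and the caller may read the control socket.
    """
    result = subprocess.run(
        ["birdc", "show", "route", "all", "export", protocol],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_unicast_lines(result.stdout)
```

A check built on this would compare the returned count against whatever number of prefixes the host is expected to announce.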

We could also maybe look at using something like this Prometheus bird exporter, which should give us a count of the prefixes exported for each neighbor.

Event Timeline

cmooney triaged this task as Low priority.

Thanks for filing this task! This is indeed something we have discussed in the past but not formally so let's use this task to do that.

I think the main challenge here is that it will be difficult to find a solution that fits all current bird setups in our infrastructure, because a host may be announcing no routes for a good reason, for example if it is intentionally depooled. Thus any check like the ones above, whether for no routes being announced or for fewer routes than a certain threshold, requires an additional check on the state of the host. That does not apply if the host should always be announcing a certain set of IPs or doesn't have any externally defined state, in which case the checks proposed in the task should be sufficient.

On the DNS hosts (using them as an example), the state of the host and the various services we define under it are defined by the state in etcd/confctl. So if we were to alert on such a check, we would have to confirm the state from there as well and only alert if there is a discrepancy between that and the routes currently being announced by the host.
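The decision described above can be sketched as a small function. This is only an illustration of the logic, not existing code: `should_alert`, its parameters, and the `expected_min` threshold are hypothetical names, and how the pooled state is actually read from etcd/confctl is left out.

```python
def should_alert(pooled: bool, announced_routes: int, expected_min: int = 1) -> bool:
    """Return True only when the intended state and the announcements disagree.

    pooled:           intended state of the host/service (e.g. from confctl).
    announced_routes: number of prefixes the host currently announces.
    expected_min:     hypothetical threshold of prefixes a pooled host should announce.
    """
    if not pooled:
        # A depooled host announcing nothing is expected, so stay quiet.
        return False
    return announced_routes < expected_min
```

With this shape, an intentionally depooled host never alerts, while a pooled host that announces fewer prefixes than expected does.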

Note that since T370068 was rolled out, anycast-healthchecker also exports certain metrics directly, such as:

# HELP anycast_healthchecker_service_state The status of the service check: 0 = healthy, any other value = unhealthy
# TYPE anycast_healthchecker_service_state gauge
anycast_healthchecker_service_state{ip_prefix="fd12:aba6:57db:ffff::1/128",service_name="foo1IPv6.bar.com"} 0.0
anycast_healthchecker_service_state{ip_prefix="10.52.12.1/32",service_name="foo.bar.com"} 0.0
anycast_healthchecker_service_state{ip_prefix="10.52.12.2/32",service_name="foo1.bar.com"} 0.0

While the metrics are not turned on in the current setup, they can be, and we can then use AlertManager to evaluate this data together with the state data (if any; I am also not sure how) before we actually send out an alert.
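For ad-hoc inspection, the exported text shown above can be read with a few lines of Python. This is a simplified sketch: the function names are hypothetical, and the parser only handles the plain `name{labels} value` lines seen in the sample, not the full Prometheus exposition format (for example, escaped quotes in label values are not supported).

```python
import re

# Matches lines such as: metric_name{label="value",...} 0.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def unhealthy_services(text):
    """List service_name values whose state is non-zero (0 = healthy)."""
    return [
        labels.get("service_name", "")
        for name, labels, value in parse_samples(text)
        if name == "anycast_healthchecker_service_state" and value != 0.0
    ]
```

Per the HELP line in the sample, any value other than 0 is treated as unhealthy.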

Note that the Bird exporter is already up and running: https://grafana.wikimedia.org/d/dxbfeGDZk/anycast

We could in theory correlate the etcd status to the service status to the anycast-HC status to the bird status all in prometheus/alertmanager, not sure how complex it is though.

The fastest option for now is to systematically alert if Bird doesn't advertise any prefix. This won't give us much granularity (should it advertise 1 or 10 prefixes?) but would significantly reduce the alerting blind spot. We can also check in Grafana's history how many times it would have triggered in the last month or so, to know whether it's realistic or too noisy.
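To estimate the noise from historical data, one could count how often the advertised prefix count dropped to zero. The sketch below assumes the history has already been fetched as an evenly spaced list of prefix counts; the function name and the `min_samples` parameter (how many consecutive zero samples are needed before an episode counts) are hypothetical.

```python
def count_zero_episodes(counts, min_samples=1):
    """Count episodes where the prefix count stayed at 0 for at least min_samples samples.

    Each continuous run of zeros is counted at most once.
    """
    episodes = 0
    run = 0
    for count in counts:
        if count == 0:
            run += 1
            if run == min_samples:
                episodes += 1  # the run just became long enough to fire
        else:
            run = 0
    return episodes
```

Raising `min_samples` gives a rough idea of how much a hold-off period before firing would reduce the number of alerts.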

Or go directly to anycast_healthchecker_service_state and not care about the Bird side for alerting (but have both on a dashboard).

The ideal would be to correlate it with the router side's BGP session, but we're not there yet :)

I am going to tackle this for the DNS hosts at least and then we can revisit a generic solution.

Change #1163858 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:bird and C:bird::anycast: support exporting Prom metrics

https://gerrit.wikimedia.org/r/1163858

Change #1163859 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: enable exporting prom metrics from doh1001 for anycast-hc

https://gerrit.wikimedia.org/r/1163859

Change #1164296 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] prometheus: add dnsbox_service_state_exporter

https://gerrit.wikimedia.org/r/1164296

Mentioned in SAL (#wikimedia-operations) [2025-07-03T13:18:32Z] <sukhe> sudo cumin 'C:bird' "disable-puppet 'merging CR 1163858'": T374619

Mentioned in SAL (#wikimedia-operations) [2025-07-03T13:21:39Z] <sukhe> sudo cumin -b11 'C:bird' "run-puppet-agent --enable 'merging CR 1163858'": NOOP change T374619

Change #1163858 merged by Ssingh:

[operations/puppet@production] P:bird and C:bird::anycast: support exporting Prom metrics

https://gerrit.wikimedia.org/r/1163858

Change #1163859 merged by Ssingh:

[operations/puppet@production] hiera: enable exporting anycast-hc prom metrics for O:wikidough

https://gerrit.wikimedia.org/r/1163859

Change #1166204 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: enable exporting anycast-hc prom metrics for O:wikidough

https://gerrit.wikimedia.org/r/1166204

Change #1164296 merged by Ssingh:

[operations/puppet@production] prometheus: add dnsbox_service_state_exporter

https://gerrit.wikimedia.org/r/1164296

Change #1166210 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter

https://gerrit.wikimedia.org/r/1166210

Change #1166204 merged by Ssingh:

[operations/puppet@production] hiera: enable exporting anycast-hc prom metrics for O:wikidough

https://gerrit.wikimedia.org/r/1166204

Change #1166222 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] bird/anycast-hc: allow setting SupplementaryGroups for anycast-hc unit

https://gerrit.wikimedia.org/r/1166222

Change #1166223 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: dnsbox: set supplementary_groups for anycast-hc

https://gerrit.wikimedia.org/r/1166223

Change #1166224 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] C:prometheus: dnsbox_service_state_exporter s/define/class

https://gerrit.wikimedia.org/r/1166224

Change #1166225 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/alerts@master] team-traffic: add dnsbox alert for service status mistmatch

https://gerrit.wikimedia.org/r/1166225

Change #1166224 merged by Ssingh:

[operations/puppet@production] C:prometheus: dnsbox_service_state_exporter s/define/class

https://gerrit.wikimedia.org/r/1166224

Change #1166222 merged by Ssingh:

[operations/puppet@production] bird/anycast-hc: allow setting SupplementaryGroups for anycast-hc unit

https://gerrit.wikimedia.org/r/1166222

Change #1166210 merged by Ssingh:

[operations/puppet@production] P:dns::auth::monitoring: add prometheus::dnsbox_service_state_exporter

https://gerrit.wikimedia.org/r/1166210

Change #1166836 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] C:prometheus: use updated file name for dnsbox_service_state

https://gerrit.wikimedia.org/r/1166836

Change #1166836 merged by Ssingh:

[operations/puppet@production] C:prometheus: use updated file name for dnsbox_service_state

https://gerrit.wikimedia.org/r/1166836

Change #1166838 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: enable anycast-hc prom metrics for wikidough

https://gerrit.wikimedia.org/r/1166838

Change #1166838 merged by Ssingh:

[operations/puppet@production] hiera: enable anycast-hc prom metrics for wikidough

https://gerrit.wikimedia.org/r/1166838

Mentioned in SAL (#wikimedia-operations) [2025-07-07T14:47:01Z] <sukhe> sudo cumin 'A:dnsbox' "disable-puppet 'merging CR 1166223'": rolling out prom metrics for anycast-hc: T374619

Change #1166223 merged by Ssingh:

[operations/puppet@production] hiera: dnsbox: set supplementary_groups and enable Prom metrics (anycast-hc)

https://gerrit.wikimedia.org/r/1166223

Mentioned in SAL (#wikimedia-operations) [2025-07-07T14:54:42Z] <sukhe@puppetserver1001> conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: testing CR 1166223: T374619]

Mentioned in SAL (#wikimedia-operations) [2025-07-07T14:58:24Z] <sukhe@puppetserver1001> conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: [done] testing CR 1166223: T374619]

Mentioned in SAL (#wikimedia-operations) [2025-07-07T15:00:37Z] <sukhe> sudo cumin -b1 -s120 'A:dnsbox and not P{dns7001*}' "run-puppet-agent --enable 'merging CR 1166223'": T374619

Change #1166225 merged by jenkins-bot:

[operations/alerts@master] team-traffic: add dnsbox alert for service status mismatch

https://gerrit.wikimedia.org/r/1166225

Change #1167659 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/alerts@master] team-traffic: dnsbox: use metrics anycast_healthchecker_service_state

https://gerrit.wikimedia.org/r/1167659

Change #1167659 merged by jenkins-bot:

[operations/alerts@master] team-traffic: dnsbox: use metrics anycast_healthchecker_service_state

https://gerrit.wikimedia.org/r/1167659

On the DNS hosts as of today, we have an alert in place if we detect a mismatch between the service state as defined by confd/confctl and the advertisements of the VIPs on the DNS hosts themselves. Here is how it works in case we are interested in expanding this to other places:

  • On the DNS hosts, a custom Prometheus exporter exports the metrics for the various DNS services. Example:
cat /var/lib/prometheus/node.d/dnsbox_service_state.prom 
# HELP dnsbox_service_state Service state: 0 = down, 1 = up
# TYPE dnsbox_service_state gauge
dnsbox_service_state{service_name="authdns-ns0"} 1.0
dnsbox_service_state{service_name="authdns-ns2"} 1.0
dnsbox_service_state{service_name="recdns"} 1.0
dnsbox_service_state{service_name="ntp-a"} 1.0
  • Within anycast-healthchecker itself, we enable the exporting of Prometheus metrics that give us the health of the services:
anycast_healthchecker_service_state{ip_prefix="10.3.0.5/32",service_name="hc-vip-ntp-a.anycast.wmnet"} 0.0
anycast_healthchecker_service_state{ip_prefix="208.80.154.238/32",service_name="hc-vip-ns0.wikimedia.org"} 0.0
anycast_healthchecker_service_state{ip_prefix="10.3.0.1/32",service_name="hc-vip-recdns.anycast.wmnet"} 0.0

This is enabled by setting the following two Hiera keys:

profile::bird::anycast::do_prom_exporter: true
profile::bird::anycast::supplementary_groups:
  - 'prometheus-node-exporter'

(You need supplementary_groups here; otherwise the anycast-hc systemd unit can't write to the Prometheus directory.)

  • In operations/alerts, we have alerting rules in team-traffic/dnsbox.yaml. We use them to send out an alert if we notice that a service is pooled but, for whatever reason, its healthcheck is failing and its VIP is therefore no longer being advertised.
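The comparison behind the three steps above can be illustrated outside of Prometheus. This is only a sketch of the idea and not the actual rule in team-traffic/dnsbox.yaml, which is not shown here. It assumes that dnsbox_service_state 1 means the service is intended to be up, that anycast_healthchecker_service_state 0 means healthy, and that the caller supplies a mapping between the two naming schemes; `find_mismatches` and `name_map` are hypothetical names.

```python
def find_mismatches(dnsbox_state, hc_state, name_map):
    """Return services that are intended to be up but whose healthcheck is failing.

    dnsbox_state: {service_name: value}, 1.0 = up, 0.0 = down.
    hc_state:     {healthcheck_service_name: value}, 0.0 = healthy.
    name_map:     caller-supplied {service_name: healthcheck_service_name}.
    """
    mismatches = []
    for service, state in dnsbox_state.items():
        hc_name = name_map.get(service)
        if hc_name is None or hc_name not in hc_state:
            continue  # no healthcheck data to compare against
        if state == 1.0 and hc_state[hc_name] != 0.0:
            mismatches.append(service)
    return sorted(mismatches)


# Illustrative values only; the pairing of names below is an assumption.
dnsbox = {"recdns": 1.0, "ntp-a": 1.0}
healthchecks = {
    "hc-vip-recdns.anycast.wmnet": 0.0,
    "hc-vip-ntp-a.anycast.wmnet": 1.0,
}
names = {
    "recdns": "hc-vip-recdns.anycast.wmnet",
    "ntp-a": "hc-vip-ntp-a.anycast.wmnet",
}
print(find_mismatches(dnsbox, healthchecks, names))
```

A service that is marked down and also fails its healthcheck is not reported, since that matches the intended state.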

I tested it today:

12:00:20 <+jinxer-wm> FIRING: DnsboxServiceMismatch: Service ntp-b state mismatch on dns7002:9100 -

There is a runbook at https://wikitech.wikimedia.org/wiki/DNS#DnsboxServiceMismatch.

Change #1167716 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/alerts@master] team-traffic: dnsbox: alert after rule is true for 1m

https://gerrit.wikimedia.org/r/1167716

Change #1167716 merged by jenkins-bot:

[operations/alerts@master] team-traffic: dnsbox: alert after rule is true for 1m

https://gerrit.wikimedia.org/r/1167716

ayounsi claimed this task.

All the tooling, metrics and examples are there for the service owners to set up their alerting like Traffic did for DNS.