Most production hosts run anycast-healthchecker 0.9.1-1+wmf12u1, while some (on bullseye) run 0.8.2-1+wmf11u1. The current stable release is 0.9.8. We should upgrade to it so that we do not diverge too far from the stable release and can make use of some of the new features.
A full changelog is available at https://github.com/unixsurfer/anycast_healthchecker/blob/master/ChangeLog#L54 (the relevant entries are from that line up).
- The most notable feature we might care about is support for exporting Prometheus metrics. There might be some overlap with the metrics from bird_exporter, so we should look into this more carefully. While not a priority, upgrading to the latest stable release would let us use these metrics to improve our monitoring (and alerting).
- We should also look into patching the logging level used when a healthchecked service is DOWN. Doing so would let us set:
  profile::bird::anycasthc_logging:
    level: 'warning'
    num_backups: 1
See https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050626
But the problem is that anycast-hc does this:
self.log.info("status DOWN", extra=self.extra)
This means that if we set the logging level to WARNING, we won't capture the DOWN state. Reducing logging by setting the level to WARNING can significantly decrease the disk space anycast-hc occupies and potentially also its CPU usage; on the dns hosts, anycast-hc consumes more CPU than even gdnsd:
https://grafana.wikimedia.org/goto/u1LTH3_Sg?orgId=1
I suspect the logs contribute to this (though not entirely, as there are indeed periodic healthchecks). A good way of isolating that would be to set the logging level to WARNING, but that does not currently work because we would miss the service DOWN state, which is not ideal. Thus I think we should patch this as part of this build.
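To illustrate the behaviour described above, here is a minimal sketch using only Python's standard logging module. It is not anycast-healthchecker's actual code: the logger name `anycasthc_sketch` and the in-memory handler are assumptions made for the demonstration. It shows that with the level set to WARNING, a "status DOWN" message emitted via `info()` is dropped, while the same message emitted via `warning()` (roughly what the proposed patch would do) is kept.

```python
import io
import logging

# Hypothetical stand-in for the service-check logger; this is NOT
# anycast-healthchecker's real setup, just stdlib logging.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("anycasthc_sketch")  # assumed name, for illustration
log.addHandler(handler)
log.propagate = False           # keep output confined to our handler
log.setLevel(logging.WARNING)   # equivalent of level: 'warning'

# Current upstream behaviour: DOWN is logged at INFO, so it is filtered out.
log.info("status DOWN")

# Sketch of the proposed patch: log DOWN at WARNING so it survives the filter.
log.warning("status DOWN")

# Routine messages at INFO are still suppressed, which is the desired saving.
log.info("status UP")

captured = stream.getvalue().splitlines()
print(captured)
```

Only the `warning()` call reaches the handler, so raising the DOWN message to WARNING would let us lower the overall log volume without losing the state change we care about.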

