
Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1)
Closed, Resolved · Public

Description

We are running anycast-healthchecker 0.9.1-1+wmf12u1 in production on most hosts; some hosts (on bullseye) are running 0.8.2-1+wmf11u1. The current stable release is 0.9.8, and we should upgrade to it so that we do not diverge too far from the stable release and can make use of some of the new features.

A full changelog is available at https://github.com/unixsurfer/anycast_healthchecker/blob/master/ChangeLog#L54 and up.

  1. The most notable feature we might care about is support for exporting Prometheus metrics. There might be some overlap with the metrics from bird_exporter, so we should look into this a bit more carefully. While not a priority, upgrading to the latest stable release would let us use these metrics to improve our monitoring (and alerting).
  2. We should also look into patching the log level used when a healthcheck service goes DOWN. Doing so would let us set:
profile::bird::anycasthc_logging:
  level: 'warning'
  num_backups: 1

See https://gerrit.wikimedia.org/r/c/operations/puppet/+/1050626

But the problem is that anycast-hc does this:

self.log.info("status DOWN", extra=self.extra)

This means that if we set the log level to WARNING, we won't capture the DOWN state. Reducing logging by setting the level to WARNING could significantly decrease the disk space anycast-hc occupies and potentially also its CPU usage; on the dns hosts, anycast-hc consumes more CPU than even gdnsd:

https://grafana.wikimedia.org/goto/u1LTH3_Sg?orgId=1

[Screenshot: 2024-07-15-113014_1890x476_scrot.png]

I suspect the logs contribute to this (though not entirely, as there are indeed periodic healthchecks). A good way of isolating that would be to set the log level to WARNING, but that does not currently work because we would miss the service DOWN state, which is not ideal. Thus I think we should patch this as part of this build.
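To illustrate the log-level problem described above, here is a minimal sketch using only Python's standard logging module. It is not anycast-healthchecker code: the function name captured and the logger names are hypothetical, and the only detail taken from the task is that "status DOWN" is emitted at INFO.

```python
import io
import logging


def captured(logger_level: int, down_level: int) -> list[str]:
    """Return the lines written by a logger set to `logger_level` when the
    DOWN message is emitted at `down_level`."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # Hypothetical logger name, unique per combination to avoid shared state.
    log = logging.getLogger(f"hc-sketch-{logger_level}-{down_level}")
    log.handlers.clear()
    log.addHandler(handler)
    log.setLevel(logger_level)
    log.propagate = False

    log.info("status UP")                # routine message, always INFO here
    log.log(down_level, "status DOWN")   # INFO as quoted in the task, WARNING if patched

    return stream.getvalue().splitlines()


if __name__ == "__main__":
    # Level WARNING with the DOWN message at INFO: nothing is written.
    print(captured(logging.WARNING, logging.INFO))
    # Level WARNING with the DOWN message at WARNING: only DOWN is written.
    print(captured(logging.WARNING, logging.WARNING))
```

With the logger at WARNING, the first call returns no lines at all, which is the gap the proposed patch is meant to close; the second call keeps the DOWN line while still dropping the routine UP message.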

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1054370 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/python-anycast-healthchecker@master] Release 0.9.8-1+wmf12u1

https://gerrit.wikimedia.org/r/1054370

Change #1054370 merged by Ssingh:

[operations/debs/python-anycast-healthchecker@master] Release 0.9.8-1+wmf12u1

https://gerrit.wikimedia.org/r/1054370

Mentioned in SAL (#wikimedia-operations) [2024-07-16T14:44:55Z] <sukhe> reprepro -C main include bookworm-wikimedia anycast-healthchecker_0.9.8-1+wmf12u1_amd64.changes: T370068

Mentioned in SAL (#wikimedia-operations) [2024-07-16T14:49:36Z] <sukhe> [durum1001] upgrade anycast-healthchecker to 0.9.8-1+wmf12u1: T370068

Mentioned in SAL (#wikimedia-operations) [2024-07-17T14:16:55Z] <sukhe> [durum3003] upgrade anycast-healthchecker to 0.9.8-1+wmf12u1: T370068

Change #1055973 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] durum1001: set anycast-hc log level to WARN

https://gerrit.wikimedia.org/r/1055973

Change #1055973 merged by Ssingh:

[operations/puppet@production] durum1001: set anycast-hc log level to WARN

https://gerrit.wikimedia.org/r/1055973

Mentioned in SAL (#wikimedia-operations) [2024-07-22T16:37:16Z] <sukhe> [doh1001] upgrade anycast-healthchecker to 0.9.8-1+wmf12u1: T370068

Change #1056000 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: dns6001: reduce anycast_hc logging level and backups

https://gerrit.wikimedia.org/r/1056000

Change #1056000 merged by Ssingh:

[operations/puppet@production] hiera: dns6001: reduce anycast_hc logging level and backups

https://gerrit.wikimedia.org/r/1056000

Mentioned in SAL (#wikimedia-operations) [2024-07-23T13:37:22Z] <sukhe@puppetmaster1001> conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org [reason: upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-07-23T13:40:53Z] <sukhe@puppetmaster1001> conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org [reason: finished upgrading anycast-hc: T370068]

On dns6001, we have anycast-hc 0.9.8 running with the patch that logs at WARNING when a service is down. We no longer log when a service is up, only when it goes down:

2024-07-23 13:41:03,403 anycast-healthchecker[433436] WARNING  hc-vip-recdns.anycast.wmnet  status DOWN
2024-07-23 13:41:03,404 anycast-healthchecker[433436] WARNING  hc-vip-ntp-a.anycast.wmnet   status DOWN
2024-07-23 13:41:03,405 anycast-healthchecker[433436] WARNING  hc-vip-ns2.wikimedia.org     status DOWN

The default log level is also set to WARNING. We will see how this plays out for a bit before rolling it out everywhere else.

https://grafana.wikimedia.org/goto/8urj7LXIR?orgId=1

[Screenshot: 2024-07-24-090819_1892x479_scrot.png]

The hypothesis that reducing logging would help the CPU usage was clearly wrong. Looking at the source code again, there also seem to be other important messages under INFO that we should probably keep. Given both, I am going to revert the patch, stick to the log level INFO but reduce the number of logs we keep, and do a new build.
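The effect of keeping INFO while retaining fewer rotated logs can be sketched with the standard library's size-based rotating handler. This is an illustration under stated assumptions, not the actual configuration: that num_backups corresponds to something like backupCount is an assumption, and the file name, size limit and message text are hypothetical.

```python
import logging
import logging.handlers
import os
import tempfile


def rotated_files(num_backups: int, max_bytes: int = 200, messages: int = 100) -> list[str]:
    """Write `messages` INFO lines through a rotating handler and return the
    names of the log files left on disk afterwards."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "anycast-hc-sketch.log")  # hypothetical file name
        handler = logging.handlers.RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=num_backups
        )

        log = logging.getLogger(f"rotation-sketch-{num_backups}")  # hypothetical name
        log.handlers.clear()
        log.addHandler(handler)
        log.setLevel(logging.INFO)
        log.propagate = False

        for i in range(messages):
            log.info("check %d: status UP", i)

        # Close the handler before listing and cleaning up the directory.
        log.removeHandler(handler)
        handler.close()
        return sorted(os.listdir(tmp))


if __name__ == "__main__":
    # One backup: the active file plus a single rotated copy.
    print(rotated_files(1))
    # Three backups: the active file plus three rotated copies.
    print(rotated_files(3))
```

In this sketch the retained volume is bounded by roughly (num_backups + 1) × max_bytes regardless of how chatty INFO is, which is why lowering the backup count reduces disk usage without dropping any message category.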

Change #1056500 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/debs/python-anycast-healthchecker@master] Release 0.9.8-1+wmf12u2

https://gerrit.wikimedia.org/r/1056500

Change #1056500 merged by Ssingh:

[operations/debs/python-anycast-healthchecker@master] Release 0.9.8-1+wmf12u2

https://gerrit.wikimedia.org/r/1056500

Mentioned in SAL (#wikimedia-operations) [2024-07-24T13:28:53Z] <sukhe> reprepro -C main include bookworm-wikimedia anycast-healthchecker_0.9.8-1+wmf12u2_amd64.changes: T370068

Mentioned in SAL (#wikimedia-operations) [2024-07-24T14:27:29Z] <sukhe@puppetmaster1001> conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org [reason: upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-07-24T14:31:46Z] <sukhe@puppetmaster1001> conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org [reason: finished upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-07-25T14:44:57Z] <sukhe@puppetmaster1001> conftool action : set/pooled=no; selector: name=dns4003.wikimedia.org [reason: upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-07-25T14:46:44Z] <sukhe> [dns4003] upgrade anycast-healthchecker to 0.9.8-1+wmf12u2: T370068

Mentioned in SAL (#wikimedia-operations) [2024-07-25T14:48:54Z] <sukhe@puppetmaster1001> conftool action : set/pooled=yes; selector: name=dns4003.wikimedia.org [reason: finished upgrading anycast-hc: T370068]

cmooney triaged this task as Medium priority. (Jul 29 2024, 2:28 PM)

Mentioned in SAL (#wikimedia-operations) [2024-07-29T14:33:52Z] <sukhe> A:wikidough: debdeploy upgrade anycast-hc to 0.9.8: T370068

Mentioned in SAL (#wikimedia-operations) [2024-07-29T15:09:39Z] <sukhe@puppetmaster1001> conftool action : set/pooled=no; selector: name=dns2006.wikimedia.org [reason: upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-07-29T15:10:56Z] <sukhe> [dns2006] upgrade anycast-healthchecker to 0.9.8-1+wmf12u2: T370068

Mentioned in SAL (#wikimedia-operations) [2024-07-29T15:13:05Z] <sukhe@puppetmaster1001> conftool action : set/pooled=yes; selector: name=dns2006.wikimedia.org [reason: finished upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-07-30T14:51:05Z] <sukhe@puppetmaster1001> conftool action : set/pooled=no; selector: name=dns7001.wikimedia.org [reason: upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-07-30T14:51:34Z] <sukhe> [dns7001] upgrade anycast-healthchecker to 0.9.8-1+wmf12u2: T370068

Mentioned in SAL (#wikimedia-operations) [2024-07-30T15:00:15Z] <sukhe@puppetmaster1001> conftool action : set/pooled=yes; selector: name=dns7001.wikimedia.org [reason: finished upgrading anycast-hc: T370068]

Mentioned in SAL (#wikimedia-operations) [2024-08-06T14:56:40Z] <sukhe> disable puppet on A:dnsbox for cluster-wide anycast-hc 0.9.8 upgrade on remaining hosts: T370068

Mentioned in SAL (#wikimedia-operations) [2024-08-06T16:08:26Z] <sukhe> sudo cumin "A:dnsbox" "run-puppet-agent --enable 'upgrading anycast-hc'": finish anycast-hc upgrade: T370068

We have upgraded all DNS boxes, Wikimedia DNS and durum hosts to the latest version of anycast-healthchecker. The only hosts that are left and on bookworm are:

cloudservices[2004-2005]-dev.codfw.wmnet,cloudservices[1005-1006].eqiad.wmnet

They will get the updated package on the next reimage, so I am going to mark this as resolved in the meantime.