Upgrade Alert Instances to Debian Bookworm
Overview
This task tracks the upgrade of our Alert* instances to Debian Bookworm and details the upgrade steps.
| Active Host: | alert1001.wikimedia.org |
| Standby Host: | alert2001.wikimedia.org |
Alert related packages
The following table lists the packages related to the Alert* hosts, with their currently installed versions and the target versions available upstream. Package upgrades are tracked in T357683:
| Package | Installed version | Upstream version | Compatibility |
| alertmanager-webhook-logger | v0.3 | v1.0 | Yes |
| icinga | Backported | | |
| karma | v0.114 | v0.116 | Yes |
| kthxbye | v0.8 | v0.16 | Yes |
| phalerts | 60942d8 | e2a0b3a (+1 commit) | Yes |
| prometheus-icinga-exporter | v0.20 | v0.20 | Yes |
| python-irc | v8.5.3 | v20.3.0 | ~Yes (Python3 version available) |
| python-phabricator | v0.7.0 | v0.8.1 | Yes |
| python-pyinotify | v0.9.6 | v0.9.6 | ~Yes (Python3 version available) |
| python3-service-checker | v0.2.1 | v0.2.1 | Yes |
| statograph | v0.1.2 | v0.1.2 | Yes |
| vopsbot | v0.3.6 | v0.3.6 | Yes |
The following table lists the services related to the Alert* hosts:
| Unit | Description |
| alertmanager-irc-relay.service | Send Prometheus Alerts to IRC using Webhooks |
| alertmanager-webhook-logger.service | Alertmanager Webhook Logger |
| alerts-triage.service | Help with triaging alerts |
| apache2.service | The Apache HTTP Server |
| icinga.service | LSB: icinga host/service/network monitoring and management system |
| ircecho.service | ircecho |
| karma.service | Alert dashboard for Prometheus Alertmanager |
| klaxon.service | "klaxon manual paging webapp" |
| kthxbye.service | Acknowledgements for Alertmanager alerts |
| nagios-nrpe-server.service | Nagios Remote Plugin Executor |
| nsca.service | LSB: Start/Stop the Nagios Service Check Acceptor (nsca) daemon |
| nic-saturation-exporter.service | Prometheus network interface saturation exporter |
| phalerts.service | Phabricator webhook for Prometheus Alertmanager |
| prometheus-alertmanager.service | Alertmanager for prometheus |
| prometheus-icinga-am.service | Prometheus Icinga AlertManager Forwarder |
| prometheus-icinga-exporter.service | Prometheus Icinga exporter |
| prometheus-ipmi-exporter.service | Prometheus exporter for IPMI devices |
| prometheus-node-exporter.service | Prometheus exporter for machine metrics |
| tcpircbot-logmsgbot.service | TCP socket to IRC bot: tcpircbot-logmsgbot |
| tcpircbot-logmsgbot_cloud.service | TCP socket to IRC bot: tcpircbot-logmsgbot_cloud |
| vopsbot.service | vopsbot, the irc bot to interact with splunk oncall |
1. Prerequisites
- Set up a Bookworm alerting_host in Pontoon
- Check that Puppet runs as expected (e.g. no packages missing)
- Check that daemons can start and that configurations are valid
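The daemon check above could be scripted roughly as follows. This is a minimal sketch, not part of the original procedure: the helper name `inactive_units` is hypothetical, and it assumes it runs on the Bookworm test host where `systemctl` is available.

```shell
#!/bin/sh
# Hypothetical helper: print every unit from the argument list that systemd
# does not report as active, and return non-zero if any unit is inactive.
inactive_units() {
    rc=0
    for unit in "$@"; do
        # `systemctl is-active --quiet` exits 0 only when the unit is active.
        if ! systemctl is-active --quiet "$unit"; then
            printf '%s\n' "$unit"
            rc=1
        fi
    done
    return "$rc"
}

# Example call, using a subset of the units from the services table:
# inactive_units icinga.service karma.service prometheus-alertmanager.service
```

An empty output with exit status 0 means every listed unit is active; any unit name printed needs a closer look with `systemctl status`.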
2. Upgrade steps:
Upgrading the alert* instances consists of several steps, executed from the active cumin host.
Stop meta-monitoring on the wikitech-static host by disabling the following cron jobs:
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org
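One way to disable these jobs without deleting them is to comment them out. The sketch below is an assumption about how this could be done, not the documented method: the function name `disable_check_icinga` is hypothetical, and it assumes the jobs live in a user crontab that can be read with `crontab -l` and reinstalled from stdin with `crontab -`.

```shell
# Hypothetical filter: read a crontab on stdin and comment out every active
# line that invokes /usr/local/bin/check_icinga. Lines that are already
# commented, and unrelated jobs, pass through unchanged.
disable_check_icinga() {
    sed -E '/\/usr\/local\/bin\/check_icinga / s/^[[:space:]]*([^#[:space:]])/#\1/'
}

# Possible usage on the wikitech-static host (review the output first):
# crontab -l | disable_check_icinga | crontab -
```

Commenting the lines out rather than removing them keeps the exact job definitions in place for the post-upgrade step.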
2.1 Reimage Standby Host (alert2001)
Re-image the standby host to Debian Bookworm
$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert2001
Arm the keyholder agent with the metamonitor passphrase (pwstore/pw.git/metamonitor-key-passphrase)
Ensure the keyholder agent is armed
$ sudo cumin 'alert2001.wikimedia.org' 'keyholder status'
Ensure key services like icinga are working as expected
2.2 Failover from the active host (alert1001) to the standby host (alert2001)
Merge the following patches:
- alert: Failover Icinga and Alertmanager to alert2001 (Change 1003513)
- alert: Resolve alerts DNS queries to alert2001 (Change 1003516)
- Failover icinga to alert2001 too (https://gerrit.wikimedia.org/r/c/operations/dns/+/1008882)
Run Puppet on the alert hosts
$ sudo cumin 'alert*' 'run-puppet-agent'
Update DNS records
$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
2.3 Reimage Standby Host (alert1001)
Re-image the standby host to Debian Bookworm
$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert1001
Arm the keyholder agent with the metamonitor passphrase (pwstore/pw.git/metamonitor-key-passphrase)
Ensure the keyholder agent is armed
$ sudo cumin 'alert1001.wikimedia.org' 'keyholder status'
Ensure key services like icinga are working as expected
2.4 Failover from the active host (alert2001) back to the standby host (alert1001)
Merge the following patches:
- Revert alert: Failover Icinga and Alertmanager to alert2001 (Change 1003513)
- Revert alert: Resolve alerts DNS queries to alert2001 (Change 1003516)
- Revert failover icinga to alert2001 too (https://gerrit.wikimedia.org/r/c/operations/dns/+/1008882)
Run Puppet on the alert hosts
$ sudo cumin 'alert*' 'run-puppet-agent'
Update DNS records
$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
Ensure key services like icinga are working as expected
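After the DNS update, it can help to confirm that the service name now resolves to the intended host. The following is a rough sketch under stated assumptions: the helper names `resolved_ipv4` and `same_target` are hypothetical, `dig` is assumed to be installed, and the service name in the usage comment (`alerts.wikimedia.org`) is an assumption rather than something taken from the patches above.

```shell
# Hypothetical helper: print the sorted IPv4 addresses a name resolves to.
# `dig +short` also prints CNAME targets, so keep only dotted-quad lines.
resolved_ipv4() {
    dig +short "$1" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | sort
}

# Hypothetical helper: succeed only if both names resolve to at least one
# address and to the same set of addresses.
same_target() {
    a=$(resolved_ipv4 "$1")
    b=$(resolved_ipv4 "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Possible usage after failing back (service name is an assumption):
# same_target alerts.wikimedia.org alert1001.wikimedia.org && echo "DNS OK"
```

Cached records may take up to their TTL to expire, so a mismatch right after `authdns-update` is not necessarily an error.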
3. Post-Upgrade Actions:
Start meta-monitoring by enabling the following cron jobs on the wikitech-static host:
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org
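If the jobs were disabled by commenting them out, re-enabling them amounts to stripping the leading `#`. This is a minimal sketch under that assumption; the function name `enable_check_icinga` is hypothetical, and it again assumes a user crontab managed through `crontab -l` and `crontab -`.

```shell
# Hypothetical filter: read a crontab on stdin and uncomment every line that
# invokes /usr/local/bin/check_icinga. Other lines pass through unchanged.
enable_check_icinga() {
    sed -E '/\/usr\/local\/bin\/check_icinga / s/^[[:space:]]*#+[[:space:]]*//'
}

# Possible usage on the wikitech-static host (review the output first):
# crontab -l | enable_check_icinga | crontab -
```

Running `crontab -l` afterwards should show both `check_icinga` entries without a leading `#`.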