Upgrade Alert Instances to Debian Bookworm
Overview
This task tracks the upgrade of our Alert* instances to Debian Bookworm and documents the upgrade steps.
Active Host: | alert1001.wikimedia.org |
Standby Host: | alert2001.wikimedia.org |
Alert related packages
The following table lists the packages related to the Alert* hosts, with their currently installed versions and the target versions available upstream. Package upgrades are tracked in T357683:
Package | Installed version | Upstream version | Compatibility |
alertmanager-webhook-logger | v0.3 | v1.0 | Yes |
icinga | Backported | ||
karma | v0.114 | v0.116 | Yes |
kthxbye | v0.8 | v0.16 | Yes |
phalerts | 60942d8 | e2a0b3a (+1 commit) | Yes |
prometheus-icinga-exporter | v0.20 | v0.20 | Yes |
python-irc | v8.5.3 | v20.3.0 | ~Yes (Python3 version available) |
python-phabricator | v0.7.0 | v0.8.1 | Yes |
python-pyinotify | v0.9.6 | v0.9.6 | ~Yes (Python3 version available) |
python3-service-checker | v0.2.1 | v0.2.1 | Yes |
statograph | v0.1.2 | v0.1.2 | Yes |
vopsbot | v0.3.6 | v0.3.6 | Yes |
The following table lists the services related to the Alert* hosts:
Unit | Description |
alertmanager-irc-relay.service | Send Prometheus Alerts to IRC using Webhooks |
alertmanager-webhook-logger.service | Alertmanager Webhook Logger |
alerts-triage.service | Help with triaging alerts |
apache2.service | The Apache HTTP Server |
icinga.service | LSB: icinga host/service/network monitoring and management system |
ircecho.service | ircecho |
karma.service | Alert dashboard for Prometheus Alertmanager |
klaxon.service | "klaxon manual paging webapp" |
kthxbye.service | Acknowledgements for Alertmanager alerts |
nagios-nrpe-server.service | Nagios Remote Plugin Executor |
nsca.service | LSB: Start/Stop the Nagios Service Check Acceptor (nsca) daemon |
nic-saturation-exporter.service | Prometheus network interface saturation exporter |
phalerts.service | Phabricator webhook for Prometheus Alertmanager |
prometheus-alertmanager.service | Alertmanager for prometheus |
prometheus-icinga-am.service | Prometheus Icinga AlertManager Forwarder |
prometheus-icinga-exporter.service | Prometheus Icinga exporter |
prometheus-ipmi-exporter.service | Prometheus exporter for IPMI devices |
prometheus-node-exporter.service | Prometheus exporter for machine metrics |
tcpircbot-logmsgbot.service | TCP socket to IRC bot: tcpircbot-logmsgbot |
tcpircbot-logmsgbot_cloud.service | TCP socket to IRC bot: tcpircbot-logmsgbot_cloud |
vopsbot.service | vopsbot, the irc bot to interact with splunk oncall |
1. Prerequisites
- Set up a Bookworm alerting_host in Pontoon
- Check that Puppet runs as expected (e.g. no missing packages)
- Check that daemons start and that their configurations are valid
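The daemon check above could be sketched as a small filter over `systemctl list-units` output. This is only an illustration: the function name `failed_units` is hypothetical, and it assumes the default column order (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION) that `systemctl` prints with `--plain --no-legend`.

```shell
# failed_units: read `systemctl list-units --plain --no-legend` output on
# stdin and print the name of every unit whose ACTIVE column is "failed".
# Hypothetical helper, not part of any existing tooling.
failed_units() {
    awk '$3 == "failed" { print $1 }'
}
```

On the Pontoon host this might be run as `systemctl list-units --type=service --all --plain --no-legend | failed_units`; empty output would suggest that no service unit is in the failed state.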
2. Upgrade steps:
Upgrading the alert* instances consists of several steps, executed from the active cumin host.
Stop meta-monitoring on the wikitech-static host by disabling the following cron jobs:
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org
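One possible way to disable these jobs is to comment them out rather than delete them, so they can be restored unchanged later. The sketch below assumes the jobs live in a user crontab that `crontab -l` can read (if they are in a file under /etc/cron.d, the same filter could be applied to that file instead); the function name `disable_check_icinga` is hypothetical.

```shell
# disable_check_icinga: read a crontab on stdin and prefix every active
# (not yet commented) line that mentions check_icinga with "#".
# All other lines pass through unchanged.
disable_check_icinga() {
    sed -E '/^[^#].*check_icinga/ s/^/#/'
}
```

A possible invocation on the wikitech-static host would be `crontab -l | disable_check_icinga | crontab -`, after saving a copy of the original crontab.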
2.1 Reimage Standby Host (alert2001)
Re-image the standby host to Debian Bookworm
$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert2001
Arm the keyholder agent with the metamonitor passphrase (stored in pwstore/pw.git/metamonitor-key-passphrase)
Ensure the keyholder agent is armed
$ sudo cumin 'alert2001.wikimedia.org' 'keyholder status'
Ensure key services like icinga are working as expected
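A service check of this kind could be sketched as follows. The helper name `inactive_units` is hypothetical, which units count as "key" is a judgment call, and the sketch assumes that `systemctl is-active` prints one state per line in the same order as the unit names it was given.

```shell
# inactive_units UNIT...: read the output of `systemctl is-active UNIT...`
# on stdin (one state per line, assumed to be in argument order) and print
# "UNIT STATE" for every unit that is not "active".
inactive_units() {
    for unit in "$@"; do
        # If fewer states than units were supplied, report "unknown".
        IFS= read -r state || state="unknown"
        [ "$state" = "active" ] || printf '%s %s\n' "$unit" "$state"
    done
}
```

For example, with `UNITS="icinga.service prometheus-alertmanager.service karma.service"`, running `systemctl is-active $UNITS | inactive_units $UNITS` on the reimaged host should print nothing when all three units are active.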
2.2 Failover from the active to the standby host
Merge the following patches:
- alert: Failover Icinga and Alertmanager to alert2001 (Change 1003513)
- alert: Resolve alerts DNS queries to alert2001 (Change 1003516)
- failover icinga to alert2001 too https://gerrit.wikimedia.org/r/c/operations/dns/+/1008882
Run Puppet on the alert hosts
$ sudo cumin 'alert*' 'run-puppet-agent'
Update DNS records
$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
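After the DNS update it may be worth confirming that the alerting service names now resolve to alert2001. The sketch below only compares `dig +short` output against an expected target; the helper name `points_at` is hypothetical, and the service name used in the usage note is an assumption that should be replaced with the record actually changed by the patches above.

```shell
# points_at TARGET: read `dig +short` output on stdin and succeed if any
# line equals TARGET, ignoring a trailing dot (as printed for CNAMEs).
points_at() {
    want="${1%.}"
    while IFS= read -r line; do
        [ "${line%.}" = "$want" ] && return 0
    done
    return 1
}
```

Assuming the record is a CNAME, a check might look like `dig +short alerts.wikimedia.org | points_at alert2001.wikimedia.org && echo "failover visible in DNS"`.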
2.3 Reimage Standby Host (alert1001)
Re-image alert1001, which is now the standby host, to Debian Bookworm
$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert1001
Arm the keyholder agent with the metamonitor passphrase (stored in pwstore/pw.git/metamonitor-key-passphrase)
Ensure the keyholder agent is armed
$ sudo cumin 'alert1001.wikimedia.org' 'keyholder status'
Ensure key services like icinga are working as expected
2.4 Failover from the active host (alert2001) back to the standby host (alert1001)
Merge the following patches:
- Revert alert: Failover Icinga and Alertmanager to alert2001 (Change 1003513)
- Revert alert: Resolve alerts DNS queries to alert2001 (Change 1003516)
- Revert failover icinga to alert2001 too https://gerrit.wikimedia.org/r/c/operations/dns/+/1008882
Run Puppet on the alert hosts
$ sudo cumin 'alert*' 'run-puppet-agent'
Update DNS records
$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
Ensure key services like icinga are working as expected
3. Post-Upgrade Actions:
Start meta-monitoring by enabling the following cron jobs on the wikitech-static host:
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org
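If the jobs were disabled by commenting them out, re-enabling them could be sketched as the inverse filter. As before, this assumes a user crontab readable with `crontab -l`, and the function name `enable_check_icinga` is hypothetical. Note that it would also uncomment any ordinary comment line that happens to mention check_icinga, so the result should be reviewed before installing it.

```shell
# enable_check_icinga: read a crontab on stdin and strip the leading "#"
# (plus any whitespace after it) from commented lines that mention
# check_icinga. All other lines pass through unchanged.
enable_check_icinga() {
    sed -E '/^#.*check_icinga/ s/^#[[:space:]]*//'
}
```

A possible invocation on the wikitech-static host would be `crontab -l | enable_check_icinga | crontab -`.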