# Upgrade Alert Instances to Debian Bookworm
## Overview
This task tracks the upgrade of our Alert* instances to Debian Bookworm and documents the upgrade steps.
| **Active Host**: | `alert1001.wikimedia.org` |
| **Standby Host**: | `alert2001.wikimedia.org` |
### Package Upgrade Requirements
The following table lists the packages on the Alert* hosts that need to be upgraded, with their currently installed versions and the target versions available upstream. Package upgrades are tracked in T357683.
| **Package** | **Installed version** | **Upstream version** | **Bookworm compatibility** |
| alertmanager-webhook-logger | [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/alertmanager-webhook-logger/ | v0.3 ]] | [[ https://github.com/tomtom-international/alertmanager-webhook-logger/releases/tag/1.0 | v1.0 ]] | Yes
| icinga | | | **Backported**
| karma | [[ https://gerrit.wikimedia.org/g/operations/debs/karma | v0.114 ]] | [[ https://github.com/prymitive/karma/releases/tag/v0.116 | v0.116 ]] | Yes
| kthxbye | [[ https://gerrit.wikimedia.org/g/operations/debs/kthxbye | v0.8 ]] | [[ https://github.com/prymitive/kthxbye | v0.16 ]] | Yes
| phalerts | [[ https://gerrit.wikimedia.org/g/operations/debs/phalerts | 60942d8 ]] | [[ https://github.com/knyar/phalerts/commit/e2a0b3acc6150b3d694f0207b5e9b1a0aa27c8c5 | e2a0b3a (+1 commit) ]] | Yes
| prometheus-icinga-exporter | [[ https://apt.wikimedia.org/wikimedia/pool/main/p/prometheus-icinga-exporter/prometheus-icinga-exporter_0.20-1_all.deb | v0.20 ]] | [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/prometheus-icinga-exporter/+/refs/heads/upstream/0.20 | v0.20 ]] | Yes
| **python-irc** | [[ https://packages.debian.org/buster/python-irc | v8.5.3 ]] | [[ https://github.com/jaraco/irc/releases/tag/v20.3.0 | v20.3.0 ]] | **~Yes** (Python 3 version available)
| python-phabricator | [[ https://apt.wikimedia.org/wikimedia/pool/main/p/python-phabricator/python3-phabricator_0.7.0-2~wmf2_all.deb | v0.7.0 ]] | [[ https://github.com/disqus/python-phabricator/releases/tag/0.8.1 | v0.8.1 ]] | Yes
| **python-pyinotify** | [[ https://packages.debian.org/buster/python-pyinotify | v0.9.6 ]] | [[ https://github.com/seb-m/pyinotify/releases/tag/0.9.6 | v0.9.6 ]] | **~Yes** (Python 3 version available)
| **python3-service-checker** | [[ https://apt.wikimedia.org/wikimedia/pool/main/s/service-checker/python3-service-checker_0.2.1-buster1_all.deb | v0.2.1 ]] | [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/service-checker/+/refs/tags/upstream/0.2.1 | v0.2.1 ]] | **Yes**
| statograph | [[ https://apt.wikimedia.org/wikimedia/pool/main/s/statograph/statograph_0.1.2-1_all.deb | v0.1.2 ]] | [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/statograph/+/refs/tags/0.1.2 | v0.1.2 ]] | Yes
| vopsbot | [[ https://apt.wikimedia.org/wikimedia/pool/main/v/vopsbot/vopsbot_0.3.6-1_amd64.deb | v0.3.6 ]] | [[ https://gitlab.wikimedia.org/repos/sre/vopsbot/-/tags/0.3.6 | v0.3.6 ]] | Yes
The following table lists the services running on the Alert* hosts:
| **Unit** | **Description** |
| alertmanager-irc-relay.service | Send Prometheus Alerts to IRC using Webhooks |
| alertmanager-webhook-logger.service | Alertmanager Webhook Logger |
| alerts-triage.service | Help with triaging alerts |
| apache2.service | The Apache HTTP Server |
| icinga.service | LSB: icinga host/service/network monitoring and management system |
| ircecho.service | ircecho |
| karma.service | Alert dashboard for Prometheus Alertmanager |
| klaxon.service | "klaxon manual paging webapp" |
| kthxbye.service | Acknowledgements for Alertmanager alerts |
| nagios-nrpe-server.service | Nagios Remote Plugin Executor |
| nsca.service | LSB: Start/Stop the Nagios Service Check Acceptor (nsca) daemon |
| nic-saturation-exporter.service | Prometheus network interface saturation exporter |
| phalerts.service | Phabricator webhook for Prometheus Alertmanager |
| prometheus-alertmanager.service | Alertmanager for prometheus |
| prometheus-icinga-am.service | Prometheus Icinga AlertManager Forwarder |
| prometheus-icinga-exporter.service | Prometheus Icinga exporter |
| prometheus-ipmi-exporter.service | Prometheus exporter for IPMI devices |
| prometheus-node-exporter.service | Prometheus exporter for machine metrics |
| tcpircbot-logmsgbot.service | TCP socket to IRC bot: tcpircbot-logmsgbot |
| tcpircbot-logmsgbot_cloud.service | TCP socket to IRC bot: tcpircbot-logmsgbot_cloud |
| vopsbot.service | vopsbot, the irc bot to interact with splunk oncall |
## 1. Prerequisites
* [x] Set up a Bookworm `alerting_host` in Pontoon
* [x] Check that Puppet runs as expected (e.g. no missing packages)
* [x] Check that daemons start and that their configurations are valid
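The configuration part of the last check can be scripted with each daemon's own validator. This is a sketch: `run_check` and `validate_alert_configs` are hypothetical helpers, not existing tools, and the paths `/etc/icinga/icinga.cfg` and `/etc/prometheus/alertmanager.yml` are Debian package defaults assumed here; they may differ from the Puppet-managed layout on the alert hosts.

```shell
# Hypothetical helper: run a validator command quietly and report OK/FAIL for it.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK:   $name"
    return 0
  fi
  echo "FAIL: $name"
  return 1
}

# Hypothetical wrapper: validate the main alerting configs.
# Returns non-zero if any check fails.
# Config paths are Debian package defaults (assumption); adjust to the Puppet-managed files.
validate_alert_configs() {
  local rc=0
  # Icinga 1.x: -v only verifies the configuration, it does not start the daemon.
  run_check icinga icinga -v /etc/icinga/icinga.cfg || rc=1
  # amtool is Alertmanager's command-line client.
  run_check alertmanager amtool check-config /etc/prometheus/alertmanager.yml || rc=1
  # Apache syntax check of the loaded configuration.
  run_check apache2 apache2ctl configtest || rc=1
  return "$rc"
}

# Usage (as root on the host being checked):
#   validate_alert_configs
```

Each validator runs even if an earlier one fails, so a single pass reports every broken configuration.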
## 2. Upgrade Steps
Upgrading the alert* instances consists of several steps, executed from the active cumin host.
### 2.1 Reimage Standby Host (`alert2001`)
Silence meta-monitoring.
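Merge the following patch:
- alert: Ensure the alert2001 host is reimaged with Puppet 7 ([Change 1003527](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003527))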
Re-image the standby host to Debian Bookworm and Puppet 7
$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert2001
Arm the keyholder agent with the `metamonitor` passphrase (found in `pwstore/pw.git/metamonitor-key-passphrase`), then ensure it is armed:
$ sudo cumin 'alert2001.wikimedia.org' 'keyholder status'
Verify that key services such as Icinga are working as expected.
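A quick first pass over the units from the service table is to ask systemd whether each one is active. In this sketch `check_units` is a hypothetical helper. Some units may be meant to run only on the active host (an assumption to confirm against the Puppet role), so an inactive unit is a prompt to investigate rather than proof of a failure.

```shell
# Hypothetical helper: print every given unit that systemd does not report as active.
# Returns non-zero if at least one unit is not active.
check_units() {
  local unit rc=0
  for unit in "$@"; do
    # is-active --quiet prints nothing and signals the state through its exit status.
    if ! systemctl is-active --quiet "$unit"; then
      echo "NOT ACTIVE: $unit"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the reimaged host (unit names taken from the service table):
#   check_units icinga.service prometheus-alertmanager.service karma.service \
#     kthxbye.service apache2.service prometheus-icinga-am.service
#
# A one-off equivalent from the cumin host:
#   sudo cumin 'alert2001.wikimedia.org' 'systemctl is-active icinga.service prometheus-alertmanager.service'
```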
### 2.2 Fail over from the active host (`alert1001`) to the standby host (`alert2001`)
Merge the following patches:
- alert: Failover Icinga and Alertmanager to alert2001 ([Change 1003513](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003513))
- alert: Resolve alerts DNS queries to alert2001 ([Change 1003516](https://gerrit.wikimedia.org/r/c/operations/dns/+/1003516))
Run Puppet on the alert hosts
$ sudo cumin 'A:alert' 'run-puppet-agent'
Update DNS records
$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
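After `authdns-update`, the switch can be spot-checked by comparing the addresses a service name resolves to with those of the target host. This is a sketch: `resolves_like` is a hypothetical helper, the record names to check are the ones changed in Change 1003516 (not listed in this task), and a recursive resolver may keep serving the old answer until the record's TTL expires.

```shell
# Hypothetical helper: succeed if NAME currently resolves to the same IPv4 address set
# as EXPECTED_HOST. Works whether NAME is an A record or a CNAME to the host.
resolves_like() {
  local name="$1" expected="$2" got want
  # dig +short prints CNAME targets and addresses; keep only the IPv4 address lines.
  got=$(dig +short "$name" A | awk '/^[0-9.]+$/' | sort)
  want=$(dig +short "$expected" A | awk '/^[0-9.]+$/' | sort)
  [ -n "$got" ] && [ "$got" = "$want" ]
}

# Usage (replace <record> with a name touched by the DNS patch):
#   resolves_like <record>.wikimedia.org alert2001.wikimedia.org && echo "switched"
```

Comparing address sets rather than the CNAME target keeps the check valid whichever record type the patch uses.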
### 2.3 Reimage Standby Host (`alert1001`)
Merge the following patch:
- alert: Ensure the alert1001 host is reimaged with Puppet 7 ([Change 1003531](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003531))
Re-image the standby host to Debian Bookworm and Puppet 7
$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert1001
Arm the keyholder agent with the `metamonitor` passphrase (found in `pwstore/pw.git/metamonitor-key-passphrase`), then ensure it is armed:
$ sudo cumin 'alert1001.wikimedia.org' 'keyholder status'
Verify that key services such as Icinga are working as expected.
### 2.4 Fail back from the active host (`alert2001`) to `alert1001`
Merge the following patches:
- **Revert** alert: Failover Icinga and Alertmanager to alert2001 ([Change 1003513](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003513))
- **Revert** alert: Resolve alerts DNS queries to alert2001 ([Change 1003516](https://gerrit.wikimedia.org/r/c/operations/dns/+/1003516))
Run Puppet on the alert hosts
$ sudo cumin 'A:alert' 'run-puppet-agent'
Update DNS records
$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
Verify that key services such as Icinga are working as expected.
## 3. Post-Upgrade Actions
Merge the following patches:
- **Revert** alert: Ensure the alert2001 host is reimaged with Puppet 7 ([Change 1003527](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003527))
- **Revert** alert: Ensure the alert1001 host is reimaged with Puppet 7 ([Change 1003531](https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003531))
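Services may need a little time to become ready after Puppet restarts them, so polling with retries avoids false alarms. The sketch below is meant for Alertmanager's `/-/ready` and `/-/healthy` endpoints; `wait_for_url` is a hypothetical helper, and the listen address `localhost:9093` (Alertmanager's upstream default) is an assumption that may not match the alert hosts' configuration.

```shell
# Hypothetical helper: poll URL until it returns a 2xx response.
# Arguments: URL [TRIES (default 10)] [DELAY in seconds between tries (default 3)].
wait_for_url() {
  local url="$1" tries="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$tries" ]; do
    # -f turns HTTP errors into a non-zero exit; --max-time bounds each attempt.
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "OK after $i attempt(s): $url"
      return 0
    fi
    # Only wait if another attempt follows.
    if [ "$i" -lt "$tries" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "FAIL after $tries attempts: $url"
  return 1
}

# Usage on alert1001 after the failback (port 9093 is an assumption):
#   wait_for_url http://localhost:9093/-/ready
#   wait_for_url http://localhost:9093/-/healthy
```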
**Package Upgrades:**
* [ ] alertmanager-webhook-logger
* [ ] karma
* [ ] kthxbye
* [ ] phalerts
* [ ] prometheus-icinga-exporter
* [ ] python-irc
* [ ] python-phabricator
* [ ] python-pyinotify
* [ ] python3-service-checker
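Once packages have been upgraded, the installed versions can be compared against the upstream targets in the package table. In this sketch `check_min_versions` is a hypothetical helper built on `dpkg-query` and `dpkg --compare-versions`; the Bookworm package names in the usage example (e.g. `python3-irc` in place of `python-irc`) are assumptions.

```shell
# Hypothetical helper: check that each package is installed at or above a minimum version.
# Arguments are "package=min_version" pairs; returns non-zero if any check fails.
check_min_versions() {
  local spec pkg want have rc=0
  for spec in "$@"; do
    pkg=${spec%%=*}
    want=${spec#*=}
    # Empty output (or a failed query) means the package is not installed.
    have=$(dpkg-query -W -f='${Version}' "$pkg" 2>/dev/null || true)
    if [ -z "$have" ]; then
      echo "MISSING:  $pkg"
      rc=1
    elif dpkg --compare-versions "$have" ge "$want"; then
      echo "OK:       $pkg $have"
    else
      echo "OUTDATED: $pkg $have < $want"
      rc=1
    fi
  done
  return "$rc"
}

# Usage with the upstream versions from the package table:
#   check_min_versions karma=0.116 kthxbye=0.16 alertmanager-webhook-logger=1.0 \
#     python3-phabricator=0.8.1 python3-irc=20.3.0
```

`dpkg --compare-versions` applies Debian's version ordering, so a packaged `0.116-1` correctly satisfies a target of `0.116`.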