Upgrade alert* hosts to Bookworm
Closed, Resolved (Public)

Description

Upgrade Alert Instances to Debian Bookworm

Overview

This task tracks the upgrade of our alert* instances to Debian Bookworm and documents the upgrade steps.

Active host: alert1001.wikimedia.org
Standby host: alert2001.wikimedia.org

Alert related packages

The following table lists the packages related to the alert* hosts, with their currently installed versions and the target versions available upstream. Package upgrades are tracked in T357683:

Package                      Installed version  Upstream version     Compatibility
alertmanager-webhook-logger  v0.3               v1.0                 Yes
icinga                       -                  -                    Backported
karma                        v0.114             v0.116               Yes
kthxbye                      v0.8               v0.16                Yes
phalerts                     60942d8            e2a0b3a (+1 commit)  Yes
prometheus-icinga-exporter   v0.20              v0.20                Yes
python-irc                   v8.5.3             v20.3.0              ~Yes (Python3 version available)
python-pyinotify             v0.9.6             v0.9.6               ~Yes (Python3 version available)
python3-service-checker      v0.2.1             v0.2.1               Yes
statograph                   v0.1.2             v0.1.2               Yes
vopsbot                      v0.3.6             v0.3.6               Yes
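
To see which rows of the table still need an upgrade, the installed and upstream versions can be compared with a version-aware sort. The following is a minimal sketch and not part of the actual upgrade tooling: the needs_upgrade helper is hypothetical, and it assumes a sort that supports -V (GNU coreutils).

```shell
# Hypothetical helper: succeed when the installed version is older than the
# upstream version. A leading "v" is stripped before comparing.
needs_upgrade() {
    installed=${1#v}
    upstream=${2#v}
    # Equal versions never need an upgrade.
    [ "$installed" != "$upstream" ] || return 1
    # sort -V orders version strings numerically per component, so 0.8 < 0.16.
    oldest=$(printf '%s\n%s\n' "$installed" "$upstream" | sort -V | head -n 1)
    [ "$oldest" = "$installed" ]
}

# Example with the karma row from the table above.
needs_upgrade v0.114 v0.116 && echo "karma: upgrade available"
```

A plain string comparison would get rows such as kthxbye (v0.8 vs v0.16) wrong, which is why the sketch relies on version sorting.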

The following table lists the services related to the alert* hosts:

Unit                                 Description
alertmanager-irc-relay.service       Send Prometheus Alerts to IRC using Webhooks
alertmanager-webhook-logger.service  Alertmanager Webhook Logger
alerts-triage.service                Help with triaging alerts
apache2.service                      The Apache HTTP Server
icinga.service                       LSB: icinga host/service/network monitoring and management system
ircecho.service                      ircecho
karma.service                        Alert dashboard for Prometheus Alertmanager
klaxon.service                       "klaxon manual paging webapp"
kthxbye.service                      Acknowledgements for Alertmanager alerts
nagios-nrpe-server.service           Nagios Remote Plugin Executor
nsca.service                         LSB: Start/Stop the Nagios Service Check Acceptor (nsca) daemon
nic-saturation-exporter.service      Prometheus network interface saturation exporter
phalerts.service                     Phabricator webhook for Prometheus Alertmanager
prometheus-alertmanager.service      Alertmanager for prometheus
prometheus-icinga-am.service         Prometheus Icinga AlertManager Forwarder
prometheus-icinga-exporter.service   Prometheus Icinga exporter
prometheus-ipmi-exporter.service     Prometheus exporter for IPMI devices
prometheus-node-exporter.service     Prometheus exporter for machine metrics
tcpircbot-logmsgbot.service          TCP socket to IRC bot: tcpircbot-logmsgbot
tcpircbot-logmsgbot_cloud.service    TCP socket to IRC bot: tcpircbot-logmsgbot_cloud
vopsbot.service                      vopsbot, the irc bot to interact with splunk oncall

1. Prerequisites

  • Set up a Bookworm alerting_host in Pontoon
  • Check that Puppet runs as expected (e.g. no missing packages)
  • Check that daemons can start and that their configurations are valid

2. Upgrade steps:

Upgrading the alert* instances consists of several steps, executed from the active cumin host.

Stop meta-monitoring on the wikitech-static host by disabling the following cron jobs:

*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org
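
Disabling and later re-enabling these entries can be done by commenting them out in the crontab. The sketch below is one possible way to do it, not the procedure that was actually followed: the two function names are hypothetical, and it assumes the jobs live in a user crontab on wikitech-static that can be read with crontab -l.

```shell
# Hypothetical helpers: read a crontab on stdin, write the edited one on stdout.

# Comment out every active line that calls check_icinga.
disable_check_icinga() {
    sed '\|/usr/local/bin/check_icinga| s|^[^#]|# &|'
}

# Remove the "# " prefix added by disable_check_icinga.
enable_check_icinga() {
    sed '\|/usr/local/bin/check_icinga| s|^# ||'
}

# Assumed usage (inspect the output before installing it):
#   crontab -l | disable_check_icinga | crontab -
#   crontab -l | enable_check_icinga  | crontab -
```

Lines that are already commented out are left alone, so running the disable step twice does not stack comment markers.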

2.1 Reimage Standby Host (alert2001)

Re-image the standby host to Debian Bookworm

$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert2001

Arm the keyholder agent with the metamonitor passphrase (stored at pwstore/pw.git/metamonitor-key-passphrase)

Ensure the keyholder agent is armed

$ sudo cumin 'alert2001.wikimedia.org' 'keyholder status'

Ensure key services like icinga are working as expected
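
One way to check the key services is to ask systemd whether each unit is active. The following is a rough sketch under stated assumptions: the selection of units in KEY_UNITS is only an example drawn from the services table in this task, and both the report_inactive_units function and the CHECKER override (useful for dry runs) are hypothetical.

```shell
# Example selection of units from the services table; adjust as needed.
KEY_UNITS="icinga.service prometheus-alertmanager.service karma.service apache2.service"

# Print every unit that is not active and return non-zero if any was found.
# CHECKER is a hypothetical hook; it defaults to "systemctl is-active --quiet".
report_inactive_units() {
    checker=${CHECKER:-systemctl is-active --quiet}
    failed=0
    for unit in "$@"; do
        if ! $checker "$unit"; then
            echo "NOT ACTIVE: $unit"
            failed=1
        fi
    done
    return "$failed"
}

# Assumed usage on the freshly reimaged host:
#   report_inactive_units $KEY_UNITS
```

A unit being active does not prove the service behaves correctly, so this complements rather than replaces a manual look at the Icinga and Alertmanager web interfaces.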

2.2 Failover from the active to the standby host

Merge the following patches:

Run Puppet on the alert hosts

$ sudo cumin 'alert*' 'run-puppet-agent'

Update DNS records

$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
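
After the DNS update it can be useful to confirm that the service names now resolve to the new active host. The sketch below is illustrative only: the resolves_to_host function and the RESOLVER override are hypothetical, and the service names in the usage comment are assumptions based on the patch titles in this task rather than a verified list of records.

```shell
# Hypothetical helper: succeed when NAME and HOST resolve to the same
# non-empty set of addresses. RESOLVER defaults to "dig +short".
resolves_to_host() {
    name=$1
    host=$2
    resolver=${RESOLVER:-dig +short}
    # dig +short prints CNAME targets with a trailing dot; drop those lines
    # so that only the final addresses are compared.
    name_addrs=$($resolver "$name" | grep -v '\.$' | sort)
    host_addrs=$($resolver "$host" | grep -v '\.$' | sort)
    [ -n "$name_addrs" ] && [ "$name_addrs" = "$host_addrs" ]
}

# Assumed usage after the failover to alert2001 (record names are assumptions):
#   resolves_to_host alerts.wikimedia.org alert2001.wikimedia.org && echo "alerts OK"
#   resolves_to_host icinga.wikimedia.org alert2001.wikimedia.org && echo "icinga OK"
```

Cached answers may linger until the record TTL expires, so a mismatch right after authdns-update is not necessarily an error.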

2.3 Reimage Standby Host (alert1001)

Re-image the standby host to Debian Bookworm

$ sudo cookbook sre.hosts.reimage --os bookworm -t T333615 alert1001

Arm the keyholder agent with the metamonitor passphrase (stored at pwstore/pw.git/metamonitor-key-passphrase)

Ensure the keyholder agent is armed

$ sudo cumin 'alert1001.wikimedia.org' 'keyholder status'

Ensure key services like icinga are working as expected

2.4 Failover from the active to the standby host

Merge the following patches

Run Puppet on the alert hosts

$ sudo cumin 'alert*' 'run-puppet-agent'

Update DNS records

$ sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'

Ensure key services like icinga are working as expected

3. Post-Upgrade Actions:

Start meta-monitoring by enabling the following cron jobs on the wikitech-static host:

*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org
*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org

Details

Repo               Branch      Lines +/-
labs/private       master      +0 -3
operations/puppet  production  +3 -4
operations/puppet  production  +3 -3
operations/dns     master      +3 -3
operations/dns     master      +1 -1
operations/puppet  production  +5 -1
operations/dns     master      +0 -1
operations/puppet  production  +5 -90
operations/puppet  production  +5 -0
operations/puppet  production  +5 -3
operations/puppet  production  +6 -1
operations/puppet  production  +2 -2
operations/puppet  production  +2 -2
operations/puppet  production  +8 -0
operations/puppet  production  +274 -17
operations/puppet  production  +1 -1
operations/puppet  production  +4 -0
operations/puppet  production  +1 -0

Event Timeline

Change 991360 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] monitoring: adjust default for cluster and group

https://gerrit.wikimedia.org/r/991360

Change 991361 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] puppetserver: move ::generators from puppetmaster

https://gerrit.wikimedia.org/r/991361

Change 991363 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pontoon: include profile::monitoring in base

https://gerrit.wikimedia.org/r/991363

Change 991364 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: remove ldap-icinga renmants

https://gerrit.wikimedia.org/r/991364

Change 991365 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] klaxon: bookworm/gunicorn compat

https://gerrit.wikimedia.org/r/991365

Change 991360 abandoned by Filippo Giunchedi:

[operations/puppet@production] monitoring: adjust default for cluster and group

Reason:

Not needed

https://gerrit.wikimedia.org/r/991360

Change 991365 merged by Filippo Giunchedi:

[operations/puppet@production] klaxon: bookworm/gunicorn compat

https://gerrit.wikimedia.org/r/991365

Change 991361 merged by Filippo Giunchedi:

[operations/puppet@production] puppetserver: move ::generators from puppetmaster

https://gerrit.wikimedia.org/r/991361

Change 991363 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: include profile::monitoring in base

https://gerrit.wikimedia.org/r/991363

Change 992083 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wikimedia.org: clean up ldap-icinga

https://gerrit.wikimedia.org/r/992083

Change 991364 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: remove ldap-icinga remnants

https://gerrit.wikimedia.org/r/991364

Change 992083 merged by Filippo Giunchedi:

[operations/dns@master] wikimedia.org: clean up ldap-icinga

https://gerrit.wikimedia.org/r/992083

andrea.denisse shifted this object from the S1 Public space to the Restricted Space space. (Feb 14 2024, 7:06 PM)
andrea.denisse updated the task description.
andrea.denisse shifted this object from the Restricted Space space to the S1 Public space. (Feb 14 2024, 7:07 PM)

Change 1003513 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Failover Icinga and Alertmanager to alert2001

https://gerrit.wikimedia.org/r/1003513

Change 1003516 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] alert: Resolve alerts DNS queries to alert2001

https://gerrit.wikimedia.org/r/1003516

Change 1003527 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Ensure the alert2001 host is reimaged with Puppet 7

https://gerrit.wikimedia.org/r/1003527

Change 1003531 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Ensure the alert1001 host is reimaged with Puppet 7

https://gerrit.wikimedia.org/r/1003531

Change 1003527 abandoned by Andrea Denisse:

[operations/puppet@production] alert: Ensure the alert2001 host is reimaged with Puppet 7

Reason:

Abandoning to migrate to Puppet 7 after the alert hosts upgrade is finished.

https://gerrit.wikimedia.org/r/1003527

Change 1003531 abandoned by Andrea Denisse:

[operations/puppet@production] alert: Ensure the alert1001 host is reimaged with Puppet 7

Reason:

Abandoning to migrate to Puppet 7 after the alert hosts upgrade is finished.

https://gerrit.wikimedia.org/r/1003531

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:16:10Z] <denisse_> starting the Alert hosts upgrade to Bookworm - T333615

Mentioned in SAL (#wikimedia-sre) [2024-02-20T15:16:25Z] <denisse_> starting the Alert hosts upgrade to Bookworm - T333615

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:32:05Z] <godog> temp disable meta-monitoring on wikitech-static.w.o - T333615

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:46:00Z] <godog> re-enable meta-monitoring on wikitech-static.w.o - T333615

Mentioned in SAL (#wikimedia-operations) [2024-02-20T15:46:28Z] <denisse> When doing the alert hosts upgrade we encountered some issues that prevented us to properly reimage the hosts to proceed with the upgrade. We're investigating this issue and inform of the new alert hosts upgrade date ASAP. - T333615

Mentioned in SAL (#wikimedia-sre) [2024-02-20T15:46:38Z] <denisse> When doing the alert hosts upgrade we encountered some issues that prevented us to properly reimage the hosts to proceed with the upgrade. We're investigating this issue and inform of the new alert hosts upgrade date ASAP. - T333615

Mentioned in SAL (#wikimedia-sre) [2024-02-26T15:02:45Z] <denisse> Disabling meta-monitoring for the alert hosts - T333615

Mentioned in SAL (#wikimedia-operations) [2024-02-26T15:02:59Z] <denisse> Disabling meta-monitoring for the alert hosts - T333615

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host alert2001.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host alert2001.wikimedia.org with OS bookworm executed with errors:

  • alert2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" alert2001.wikimedia.org to get a root shell, but depending on the failure this may not work.

Change 1006540 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] Set the alert2001 to insetup for the Bookworm upgrade

https://gerrit.wikimedia.org/r/1006540

Change 1006540 abandoned by Andrea Denisse:

[operations/puppet@production] Set the alert2001 to insetup for the Bookworm upgrade

Reason:

https://gerrit.wikimedia.org/r/1006540

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host alert2001.wikimedia.org with OS bookworm

Mentioned in SAL (#wikimedia-operations) [2024-02-26T17:35:56Z] <denisse> Enabled meta-monitoring for alert1001 - T333615

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host alert2001.wikimedia.org with OS bookworm executed with errors:

  • alert2001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402261722_denisse_3773295_alert2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" alert2001.wikimedia.org to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2024-03-05T15:09:00Z] <denisse> disable meta-monitoring for alert1001 - T333615

Change 1003513 merged by Andrea Denisse:

[operations/puppet@production] alert: Failover Icinga and Alertmanager to alert2001

https://gerrit.wikimedia.org/r/1003513

Change 1003516 merged by Andrea Denisse:

[operations/dns@master] alert: Resolve alerts DNS queries to alert2001

https://gerrit.wikimedia.org/r/1003516

Change 1008882 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wikimedia.org: failover icinga to alert2001 too

https://gerrit.wikimedia.org/r/1008882

Change 1008882 merged by Filippo Giunchedi:

[operations/dns@master] wikimedia.org: failover icinga to alert2001 too

https://gerrit.wikimedia.org/r/1008882

Mentioned in SAL (#wikimedia-operations) [2024-03-05T16:39:21Z] <denisse> enabling meta-monitoring for the alert* hosts - T333615

Something else that didn't work well: the current version of ircecho doesn't seem to attempt reopening the files it is supposed to watch in /var/log/icinga. I have "fixed" this by creating said .log files and then restarting ircecho, which then properly opened and tailed the files.
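
The workaround described above could be scripted roughly as follows. This is only a sketch: the function name and the RESTART_CMD override are hypothetical, the log file names are deliberately left as arguments because the comment does not list them, and file ownership and permissions are not handled here.

```shell
# Hypothetical helper: create any missing log file, then restart ircecho so
# that it opens and tails the files. RESTART_CMD allows a dry run.
recreate_logs_and_restart_ircecho() {
    for logfile in "$@"; do
        # Only create files that do not exist yet; existing content is kept.
        [ -e "$logfile" ] || touch "$logfile" || return 1
    done
    ${RESTART_CMD:-systemctl restart ircecho.service}
}

# Assumed usage, with the real file names substituted:
#   recreate_logs_and_restart_ircecho /var/log/icinga/<name>.log
```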

Mentioned in SAL (#wikimedia-operations) [2024-03-06T16:26:03Z] <denisse> Disable meta-monitoring for alert1001 - T333615

Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host alert1001.wikimedia.org with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host alert1001.wikimedia.org with OS bookworm completed:

  • alert1001 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403061652_denisse_187623_alert1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-03-26T16:18:00Z] <denisse> Importing karma 0.119 to reprepro - T333615

Change #1003531 restored by Andrea Denisse:

[operations/puppet@production] alert: Ensure the alert1001 host is reimaged with Puppet 7

https://gerrit.wikimedia.org/r/1003531

Change #1003527 restored by Andrea Denisse:

[operations/puppet@production] alert: Ensure the alert2001 host is reimaged with Puppet 7

https://gerrit.wikimedia.org/r/1003527

Special thanks to @fgiunchedi for their help and support.

Change #1016290 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Move alert* Hiera config to the role level

https://gerrit.wikimedia.org/r/1016290

Change #1016290 merged by Muehlenhoff:

[operations/puppet@production] Move alert* Hiera config to the role level

https://gerrit.wikimedia.org/r/1016290

Change #1017146 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[labs/private@master] Delete dummy TLS certificate for the performance host

https://gerrit.wikimedia.org/r/1017146

Change #1017146 merged by Andrea Denisse:

[labs/private@master] Delete dummy TLS certificate for the performance host

https://gerrit.wikimedia.org/r/1017146