Overview
This task tracks the hardware migration from the alert1001 and alert2001 hosts to their replacements, alert1002 and alert2002.
All four hosts run Debian Bookworm, so the Puppet role is expected to work as is.
Proposed solution:
Stage 1: Prepare hosts
- Add the alert1002 and alert2002 hosts to the Acme Chief list of authorized domains.
- Apply the alerting_host role to the alert1002 and alert2002 hosts.
- Merge Gerrit patch #1062444 - alert: Ensure the alert*002 hosts use the alerting_host role
- Run Puppet on the alert[12]002 hosts: sudo cumin 'alert*[02]*' 'run-puppet-agent'
- Arm the keyholder agent on the alert1002 and alert2002 hosts.
- Use the metamonitor passphrase stored in pwstore/pw.git/metamonitor-key-passphrase
- Ensure the keyholder agent is armed: sudo cumin 'alert*002.wikimedia.org' 'keyholder status'
- Verify the IP addresses are propagated across the infrastructure
- Add the new hosts to the Prometheus Blackbox exporter list
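One possible way to spot-check the IP propagation step is to confirm that both new hostnames resolve from the machine you are on. The sketch below is only an illustration: the check_hosts function and the RESOLVER variable are hypothetical names, not existing tooling, and it assumes that getent is available and that the hosts use the wikimedia.org domain shown in the crontab entries of this task.

```shell
#!/bin/sh
# Hypothetical helper: report whether each given host resolves.
# RESOLVER is an assumption of this sketch; it defaults to `getent hosts`
# and can be overridden, e.g. to exercise the loop without network access.
RESOLVER="${RESOLVER:-getent hosts}"

check_hosts() {
    rc=0
    for host in "$@"; do
        # $RESOLVER is intentionally unquoted so "getent hosts" splits into two words.
        if $RESOLVER "$host" >/dev/null 2>&1; then
            echo "OK $host"
        else
            echo "MISSING $host"
            rc=1
        fi
    done
    # Non-zero exit status if at least one host did not resolve.
    return $rc
}
```

It could be called as check_hosts alert1002.wikimedia.org alert2002.wikimedia.org. A successful lookup from one machine does not prove propagation everywhere, so this complements rather than replaces the manual verification.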
Stage 2: Enable hosts as passive alertmanagers
- Allow connections from the alert[12]002 addresses.
- Merge Gerrit patch #1064818 - alert: Allow connections from the alertx002 addresses
- Merge Gerrit patch #1064821 - alert: Allow Apache2 connections for the alertx002 hosts
- Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
- Add the alert1002 and alert2002 hosts as Icinga and Alertmanager partners so they work as passive hosts.
- Merge Gerrit patch #1064820 - alert: Add the alertx002 hosts as Icinga and AM partners
- Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
- Enable the alert[12]002 hosts as alertmanagers
- Verify the hosts work as intended as standby hosts (e.g. no Puppet or systemd unit failures)
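The unit failure check above could be scripted along these lines. This is a minimal sketch, not part of the planned procedure: count_failed_units is a hypothetical helper that counts the non-empty lines it reads, on the assumption that its input comes from systemctl list-units --state=failed --no-legend --plain run on the host being checked.

```shell
#!/bin/sh
# Hypothetical helper: count failed systemd units.
# Expects on stdin the output of:
#   systemctl list-units --state=failed --no-legend --plain
# Each non-empty line is treated as one failed unit.
count_failed_units() {
    awk 'NF { n++ } END { print n + 0 }'
}
```

On a host, systemctl list-units --state=failed --no-legend --plain | count_failed_units should print 0 when no unit has failed. Puppet failures are not covered by this sketch and still need to be checked separately.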
Stage 3: Make alert2002 the active alertmanager host
- Disable meta-monitoring for the alert hosts.
- SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
- Comment out the following crontab entries to stop meta-monitoring against both alert hosts:
- */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
- */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2001.wikimedia.org
- Stop services on the alert1001 host.
- Make alert2002 the active host.
- Merge Gerrit patch #1071700 - alert: Failover from alert1001 to alert2002
- Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
- Merge Gerrit patch #1072326 - alert: Resolve alerts DNS queries to alert2002
- Update DNS records: sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
- Ensure services work as expected.
- Enable meta-monitoring for the alert1001 and alert2002 hosts.
- SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
- Uncomment the alert1001 crontab entry to re-enable its meta-monitoring, and add an entry for the alert2002 host:
- # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
- # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2002.wikimedia.org
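The comment and uncomment steps in this stage can be expressed as two small sed filters. This is a sketch under stated assumptions rather than the procedure actually used: the function names disable_check_icinga and enable_check_icinga are hypothetical, and the filters assume that the only crontab lines mentioning check_icinga are the meta-monitoring entries listed above.

```shell
#!/bin/sh
# Hypothetical helpers operating on crontab text read from stdin.

# Comment out every active (not already commented) check_icinga entry.
disable_check_icinga() {
    sed -E '/^[^#].*check_icinga/ s/^/# /'
}

# Uncomment check_icinga entries by stripping a leading "#" and any
# whitespace that follows it. Active entries pass through unchanged.
enable_check_icinga() {
    sed -E '/check_icinga/ s/^#[[:space:]]*//'
}
```

One way to preview the result before touching the live crontab is crontab -l | disable_check_icinga, reviewing the output by hand and only then installing it.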
Stage 4: Make alert1002 the active alertmanager host
- Disable meta-monitoring for the alert hosts.
- SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
- Comment out the following crontab entries to stop meta-monitoring against both alert hosts:
- */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
- */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2002.wikimedia.org
- Stop services on the alert2002 host.
- Make alert1002 the active host.
- Merge Gerrit patch #1071701 - alert: Failover from alert2002 to alert1002
- Run Puppet on the alert hosts: sudo cumin 'alert*' 'run-puppet-agent'
- Merge Gerrit patch #1063078 - alert: Resolve alerts DNS queries to alert1002
- Update DNS records: sudo cumin 'dns1004.wikimedia.org' 'sudo -i authdns-update'
- Ensure services work as expected.
- Enable meta-monitoring for the alert1002 and alert2002 hosts.
- SSH as root into wikitech-static.wikimedia.org with the metamonitor-key-passphrase.
- Uncomment the alert2002 crontab entry to re-enable its meta-monitoring, and add an entry for the alert1002 host:
- # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1002.wikimedia.org
- # */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2002.wikimedia.org
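In both Stage 3 and Stage 4, the crontab ends up with one retired host swapped for its replacement (alert2001 for alert2002, then alert1001 for alert1002). A sed filter along the following lines could perform that swap. It is an illustrative sketch only: retarget_check_icinga is a hypothetical name, and it assumes that hostnames appear in the crontab as fully qualified names under wikimedia.org, as in the entries above.

```shell
#!/bin/sh
# Hypothetical helper: point check_icinga crontab entries at a new host.
# Usage: retarget_check_icinga OLD_HOST NEW_HOST   (short names, no domain)
# Reads crontab text on stdin and writes the rewritten text to stdout.
# Only lines mentioning check_icinga are touched.
retarget_check_icinga() {
    old="$1"
    new="$2"
    sed "/check_icinga/ s/${old}\.wikimedia\.org/${new}.wikimedia.org/"
}
```

For example, crontab -l | retarget_check_icinga alert1001 alert1002 prints the rewritten crontab for review. Commented entries are rewritten as well, so uncommenting remains a separate step.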
Stage 5: Cleanup
- Update hostnames for alertmanager tests