alert[12]001 will become the new active/standby Icinga servers. We need to ensure they are set up correctly for meta monitoring and contact syncing.
Subject | Repo | Branch | Lines +/-
---|---|---|---
alerts: combine alerts.wm.o and icinga.wm.o certificates | operations/puppet | production | +2 -8
Status | Assigned | Task
---|---|---
Stalled | None | T302086 Set scap minimum python version to 3.7
Resolved | None | T247045 Migrate all of production metal and VMs to Buster or later
Resolved | lmata | T247966 Migrate role::alerting_host to Buster
Resolved | herron | T261342 ensure alert[12]001 are prepared for meta monitoring
Event Timeline
Currently the sync_check_icinga_contacts unit is in a failed state on alert1001. I've armed the keyholder, but I'm not sure whether there's an additional step to carry out on the wikitech-static host to permit the key from a new host, or even whether the sync should be running from multiple places at the same time.
[1598463544] SERVICE ALERT: alert1001;Check the last execution of sync_check_icinga_contacts;CRITICAL;HARD;2;CRITICAL: Status of the systemd unit sync_check_icinga_contacts
@herron most likely you just have to SSH once to accept the fingerprint with:
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh root@wikitech-static.wikimedia.org
As for the sync, yes, it can be run from multiple hosts. It was already running on both icinga hosts, splayed by a Puppet cron function IIRC.
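To illustrate what "splayed" means here, the following is a minimal sketch of the general idea only: each host derives a stable minute offset from its own name, so two hosts running the same job do not fire at the same moment. It is not Puppet's implementation; the function name `splay_minute`, the hash choice, and the printed cron line are all assumptions made for illustration.

```python
import hashlib


def splay_minute(hostname: str, interval: int = 60) -> int:
    """Return a stable offset in [0, interval) derived from the hostname.

    Hypothetical helper: the same hostname always yields the same offset,
    and different hostnames usually land on different offsets.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % interval


if __name__ == "__main__":
    for host in ("alert1001.wikimedia.org", "alert2001.wikimedia.org"):
        # Illustrative cron line only; the real job definition lives in Puppet.
        print(f"{splay_minute(host)} * * * * sync_check_icinga_contacts  # {host}")
```

Because the offset depends only on the hostname, it stays the same across runs, which is what keeps the schedule stable from one Puppet run to the next.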
Thanks @Volans, sync_check_icinga_contacts is happy now on alert[12]001
Next issue to sort out is a certificate failure on the new icinga hosts:
wikitech-static:~# /usr/local/bin/check_icinga alert1001.wikimedia.org
2020-09-02 18:26:24,667 [ERROR] Unable to load existing state from /var/tmp/check_icinga_alert1001.wikimedia.org.state: [Errno 2] No such file or directory: '/var/tmp/check_icinga_alert1001.wikimedia.org.state'
2020-09-02 18:26:24,696 [INFO] Checking icinga host alert1001.wikimedia.org (is_active=False, active_host=icinga1001.wikimedia.org)
2020-09-02 18:26:34,848 [ERROR] Certificate did not match expected hostname: alert1001.wikimedia.org. Certificate: {'subjectAltName': [('DNS', 'alerts.wikimedia.org')], 'subject': ((('commonName', 'alerts.wikimedia.org'),),)}
2020-09-02 18:26:34,849 [ERROR] Check for host alert1001.wikimedia.org (1/3) failed: ["hostname 'alert1001.wikimedia.org' doesn't match 'alerts.wikimedia.org'"]
We could possibly address this by combining the alerts.wikimedia.org and icinga.wikimedia.org certificates. They will both be served by the same apache.
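As a rough illustration of why the check fails, the sketch below tests a hostname against the DNS names in a certificate dict of the shape shown in the error above (the format returned by Python's `ssl.SSLSocket.getpeercert()`). The helper names are hypothetical and this is not the check_icinga code. The SAN list of `combined_cert` is an assumption about what a combined certificate might contain; the thread does not show the actual patch or which name the check verifies afterwards.

```python
def cert_names(cert: dict) -> list:
    """Collect DNS names from a getpeercert()-style dict."""
    names = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    if not names:
        # Fall back to commonName only when no DNS SAN entries are present.
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName":
                    names.append(value)
    return names


def name_matches(pattern: str, hostname: str) -> bool:
    """Exact match, or a '*.' wildcard covering exactly one left-most label."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == hostname


def cert_covers(cert: dict, hostname: str) -> bool:
    """True if any name in the certificate matches the hostname."""
    return any(name_matches(name, hostname) for name in cert_names(cert))


# Certificate as reported in the error output above.
old_cert = {
    "subjectAltName": [("DNS", "alerts.wikimedia.org")],
    "subject": ((("commonName", "alerts.wikimedia.org"),),),
}

# ASSUMPTION: a combined certificate listing both service names.
combined_cert = {
    "subjectAltName": [("DNS", "alerts.wikimedia.org"), ("DNS", "icinga.wikimedia.org")],
    "subject": ((("commonName", "icinga.wikimedia.org"),),),
}

if __name__ == "__main__":
    print(cert_covers(old_cert, "alert1001.wikimedia.org"))   # False
    print(cert_covers(combined_cert, "icinga.wikimedia.org"))  # True
    print(cert_covers(combined_cert, "alerts.wikimedia.org"))  # True
```

Since both sites are served by the same Apache, a single certificate whose SAN list covers both service names lets one TLS endpoint satisfy checks made under either name.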
Change 623848 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] alerts: combine alerts.wm.o and icinga.wm.o certificates
Change 623848 merged by Herron:
[operations/puppet@production] alerts: combine alerts.wm.o and icinga.wm.o certificates
Icinga/alerts certificate issue has been fixed and meta monitoring is now working against the new alert[12]001 hosts.
wikitech-static:~# /usr/local/bin/check_icinga alert1001.wikimedia.org
2020-09-03 15:26:26,728 [INFO] Checking icinga host alert1001.wikimedia.org (is_active=False, active_host=icinga1001.wikimedia.org)
2020-09-03 15:26:29,627 [INFO] Check for host alert1001.wikimedia.org: OK
I've prepared crontab entries for meta monitoring (in root's crontab) but left them commented out. We can simply uncomment them when the alert[12]001 hosts become live.
wikitech-static:~# crontab -l | grep alert
#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2001.wikimedia.org
#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
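When the time comes, the uncommenting step could be scripted rather than done by hand. The following is a minimal sketch under stated assumptions: the function name `uncomment_matching` and the `needle` string are hypothetical, and it works purely on crontab text without touching the installed crontab.

```python
def uncomment_matching(crontab_text: str, needle: str) -> str:
    """Remove the leading '#' from commented lines that contain `needle`.

    Lines that are not comments, or that do not contain `needle`,
    are returned unchanged.
    """
    out = []
    for line in crontab_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#") and needle in stripped:
            line = stripped.lstrip("#").lstrip()
        out.append(line)
    return "\n".join(out)


# The two commented entries from the crontab above, plus an unrelated comment.
sample = "\n".join([
    "# m h dom mon dow command",
    '#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2001.wikimedia.org',
    '#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org',
])

if __name__ == "__main__":
    print(uncomment_matching(sample, "/usr/local/bin/check_icinga alert"))
```

Matching on the full command path keeps ordinary comments intact. The result would still need to be reviewed and installed separately, for example by piping it to `crontab -`.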