
ensure alert[12]001 are prepared for meta monitoring
Closed, Resolved (Public)

Description

alert[12]001 will become the new active/standby icinga servers. We need to ensure they are set up correctly for meta monitoring and contact syncing.

Event Timeline

The sync_check_icinga_contacts unit is currently in a failed state on alert1001. I've armed the keyholder, but I'm not sure whether an additional step is needed on the wikitech-static host to permit the key from a new host, or whether the sync should be running from multiple places at the same time at all.

[1598463544] SERVICE ALERT: alert1001;Check the last execution of sync_check_icinga_contacts;CRITICAL;HARD;2;CRITICAL: Status of the systemd unit sync_check_icinga_contacts
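For reference, the alert line above follows the usual Icinga/Nagios log layout: an epoch timestamp in brackets, then semicolon-separated fields. Below is a minimal Python sketch for picking it apart, assuming the field order host; service; state; state type; attempt; output. The name parse_service_alert is made up for illustration and is not part of any tooling mentioned in this task.

```python
from datetime import datetime, timezone


def parse_service_alert(line):
    """Split an Icinga 'SERVICE ALERT' log line into named fields.

    Assumes the layout: [epoch] SERVICE ALERT: host;service;state;type;attempt;output
    """
    stamp, _, rest = line.partition("] ")
    kind, _, payload = rest.partition(": ")
    if not stamp.startswith("[") or kind != "SERVICE ALERT":
        raise ValueError("not a SERVICE ALERT line")
    # The plugin output may itself contain ';', so split at most 5 times.
    host, service, state, state_type, attempt, output = payload.split(";", 5)
    return {
        "when": datetime.fromtimestamp(int(stamp[1:]), tz=timezone.utc),
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }


LINE = (
    "[1598463544] SERVICE ALERT: alert1001;"
    "Check the last execution of sync_check_icinga_contacts;"
    "CRITICAL;HARD;2;"
    "CRITICAL: Status of the systemd unit sync_check_icinga_contacts"
)
alert = parse_service_alert(LINE)
print(alert["when"].isoformat(), alert["host"], alert["state"], alert["state_type"])
```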

@herron most likely you just have to SSH once to accept the fingerprint, with: SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh root@wikitech-static.wikimedia.org

As for the sync, yes, it can be run from multiple hosts. It was already running from both icinga hosts, splayed by the Puppet cron function IIRC.

Thanks @Volans, sync_check_icinga_contacts is happy now on alert[12]001

The next issue to sort out is a certificate failure on the new icinga hosts:

wikitech-static:~# /usr/local/bin/check_icinga alert1001.wikimedia.org
2020-09-02 18:26:24,667 [ERROR] Unable to load existing state from /var/tmp/check_icinga_alert1001.wikimedia.org.state: [Errno 2] No such file or directory: '/var/tmp/check_icinga_alert1001.wikimedia.org.state'
2020-09-02 18:26:24,696 [INFO] Checking icinga host alert1001.wikimedia.org (is_active=False, active_host=icinga1001.wikimedia.org)
2020-09-02 18:26:34,848 [ERROR] Certificate did not match expected hostname: alert1001.wikimedia.org. Certificate: {'subjectAltName': [('DNS', 'alerts.wikimedia.org')], 'subject': ((('commonName', 'alerts.wikimedia.org'),),)}
2020-09-02 18:26:34,849 [ERROR] Check for host alert1001.wikimedia.org (1/3) failed: ["hostname 'alert1001.wikimedia.org' doesn't match 'alerts.wikimedia.org'"]

We could possibly address this by combining the alerts.wikimedia.org and icinga.wikimedia.org certificates, since both will be served by the same Apache instance.
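To illustrate the mismatch, here is a rough Python sketch of hostname matching against a certificate's subjectAltName entries. It is not the check_icinga code. The certificate dicts use the same shape as the one in the error above; the SAN list of the combined certificate is an assumption based on the proposal, and the helper names are hypothetical. Which hostname check_icinga actually expects is not shown in this task.

```python
def dns_names(cert):
    """Return the DNS entries of a cert dict shaped like the one in the log."""
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def cert_matches(cert, hostname):
    """Simplified matching: exact name, or a single left-most '*.' wildcard."""
    hostname = hostname.lower().rstrip(".")
    for name in dns_names(cert):
        name = name.lower()
        if name == hostname:
            return True
        if name.startswith("*.") and "." in hostname:
            # '*.example.org' covers exactly one extra label on the left.
            if hostname.split(".", 1)[1] == name[2:]:
                return True
    return False


# Certificate as reported in the error message.
old_cert = {
    "subjectAltName": [("DNS", "alerts.wikimedia.org")],
    "subject": ((("commonName", "alerts.wikimedia.org"),),),
}

# Assumed contents of a combined certificate (not taken from the patch itself).
combined_cert = {
    "subjectAltName": [
        ("DNS", "alerts.wikimedia.org"),
        ("DNS", "icinga.wikimedia.org"),
    ],
}

print(cert_matches(old_cert, "alert1001.wikimedia.org"))
print(cert_matches(old_cert, "icinga.wikimedia.org"))
print(cert_matches(combined_cert, "icinga.wikimedia.org"))
```

With only alerts.wikimedia.org in the SAN list, neither the host's own FQDN nor icinga.wikimedia.org can validate; a certificate carrying both service names would satisfy a check made against either of them.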

Change 623848 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] alerts: combine alerts.wm.o and icinga.wm.o certificates

https://gerrit.wikimedia.org/r/623848

Change 623848 merged by Herron:
[operations/puppet@production] alerts: combine alerts.wm.o and icinga.wm.o certificates

https://gerrit.wikimedia.org/r/623848

The Icinga/alerts certificate issue has been fixed, and meta monitoring is now working against the new alert[12]001 hosts.

wikitech-static:~# /usr/local/bin/check_icinga alert1001.wikimedia.org
2020-09-03 15:26:26,728 [INFO] Checking icinga host alert1001.wikimedia.org (is_active=False, active_host=icinga1001.wikimedia.org)
2020-09-03 15:26:29,627 [INFO] Check for host alert1001.wikimedia.org: OK

I've prepared crontab entries for meta monitoring (in root's crontab) but left them commented out. We can simply uncomment them when the alert[12]001 hosts go live.

wikitech-static:~# crontab -l | grep alert
#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert2001.wikimedia.org
#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga alert1001.wikimedia.org
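When the hosts go live, the uncommenting step could be scripted rather than done by hand. Below is a minimal Python sketch, assuming the entries keep the exact shape shown above. enable_alert_checks is a hypothetical helper that only transforms crontab text; it does not read or install a crontab.

```python
import re

# Leading '#' (plus optional spaces) on lines that end with a check_icinga call
# for alert1001 or alert2001. Other commented lines are left untouched.
ALERT_ENTRY = re.compile(
    r"^#\s*(?=\S.*check_icinga\s+alert[12]001\.wikimedia\.org\s*$)"
)


def enable_alert_checks(crontab_text):
    """Return crontab_text with the commented-out alert[12]001 checks enabled."""
    return "".join(
        ALERT_ENTRY.sub("", line, count=1)
        for line in crontab_text.splitlines(keepends=True)
    )


sample = (
    "# unrelated comment\n"
    '#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" '
    "/usr/local/bin/check_icinga alert2001.wikimedia.org\n"
    '#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" '
    "/usr/local/bin/check_icinga alert1001.wikimedia.org\n"
)
print(enable_alert_checks(sample), end="")
```

The output could then be reviewed and loaded manually, for example by piping it to `crontab -` as root.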
herron renamed this task from ensure alert[12]001 are configured for meta monitoring to ensure alert[12]001 are prepared for meta monitoring. Sep 3 2020, 3:30 PM
herron closed this task as Resolved.
herron claimed this task.