
ensure alert[12]001 are prepared for meta monitoring
Closed, Resolved (Public)


alert[12]001 will become the new active/standby icinga servers. We need to ensure they are set up correctly for meta monitoring and contact syncing.

Event Timeline

The sync_check_icinga_contacts unit is currently in a failed state on alert1001. I've armed the keyholder, but I'm not sure whether an additional step is needed on the wikitech-static host to permit the key from a new host, or whether the sync should be running from multiple places at the same time at all.

[1598463544] SERVICE ALERT: alert1001;Check the last execution of sync_check_icinga_contacts;CRITICAL;HARD;2;CRITICAL: Status of the systemd unit sync_check_icinga_contacts

@herron most likely you just have to SSH once to accept the fingerprint, with SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh

As for the sync: yes, it can be run from multiple hosts. It was already running on both icinga hosts, splayed by the Puppet cron function IIRC.

Thanks @Volans, sync_check_icinga_contacts is happy now on alert[12]001

The next issue to sort out is a certificate failure on the new icinga hosts:

wikitech-static:~# /usr/local/bin/check_icinga
2020-09-02 18:26:24,667 [ERROR] Unable to load existing state from /var/tmp/ [Errno 2] No such file or directory: '/var/tmp/'
2020-09-02 18:26:24,696 [INFO] Checking icinga host (is_active=False,
2020-09-02 18:26:34,848 [ERROR] Certificate did not match expected hostname: Certificate: {'subjectAltName': [('DNS', '')], 'subject': ((('commonName', ''),),)}
2020-09-02 18:26:34,849 [ERROR] Check for host (1/3) failed: ["hostname '' doesn't match ''"]

We could possibly address this by combining the alerts and icinga certificates, since both will be served by the same Apache instance.
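For illustration, here is a minimal sketch of the kind of hostname check that fails in the log above, and of why a combined certificate would pass it. It assumes the certificate is a dict in the shape returned by Python's ssl.SSLSocket.getpeercert(), which matches the format printed in the error. The function name cert_covers and the example.org hostnames are hypothetical and are not taken from check_icinga.

```python
def cert_covers(cert, hostname):
    """Return True if hostname is listed among the certificate's DNS SANs.

    Simplified on purpose: exact, case-insensitive match only,
    with no wildcard handling.
    """
    sans = [
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS"
    ]
    return hostname.lower() in sans


# Hypothetical certificates for illustration only.
single = {"subjectAltName": [("DNS", "alerts.example.org")]}
combined = {
    "subjectAltName": [
        ("DNS", "alerts.example.org"),
        ("DNS", "icinga.example.org"),
    ]
}

print(cert_covers(single, "icinga.example.org"))
print(cert_covers(combined, "icinga.example.org"))
```

With a single-name certificate the second hostname is not covered. Once both names are listed as subjectAltName entries, a check against either name succeeds.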

Change 623848 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] alerts: combine alerts.wm.o and icinga.wm.o certificates

Change 623848 merged by Herron:
[operations/puppet@production] alerts: combine alerts.wm.o and icinga.wm.o certificates

Icinga/alerts certificate issue has been fixed and meta monitoring is now working against the new alert[12]001 hosts.

wikitech-static:~# /usr/local/bin/check_icinga
2020-09-03 15:26:26,728 [INFO] Checking icinga host (is_active=False,
2020-09-03 15:26:29,627 [INFO] Check for host OK

I've prepared crontab entries for meta monitoring (in root's crontab) but left them commented out. We can simply uncomment them when the alert[12]001 hosts go live.

wikitech-static:~# crontab -l | grep alert
#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga
#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga
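As a rough sketch of the uncomment step, the snippet below strips the leading "#" from crontab lines that mention a given string and leaves all other lines untouched. The function name uncomment_matching and the sample text are hypothetical. Editing by hand with crontab -e works just as well.

```python
def uncomment_matching(crontab_text, needle):
    """Uncomment crontab lines that are commented out and contain needle."""
    out = []
    for line in crontab_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#") and needle in stripped:
            # Drop the leading '#' and any whitespace that follows it.
            line = stripped[1:].lstrip()
        out.append(line)
    return "\n".join(out) + "\n"


# Sample input for illustration only.
sample = (
    '#*/2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga\n'
    "# unrelated comment\n"
)

print(uncomment_matching(sample, "check_icinga"), end="")
```

The result could then be fed back via crontab - (reading the new table from standard input), after reviewing it.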
herron renamed this task from "ensure alert[12]001 are configured for meta monitoring" to "ensure alert[12]001 are prepared for meta monitoring". Sep 3 2020, 3:30 PM
herron closed this task as Resolved.
herron claimed this task.