Hosts with long-running issues, such as those described in T148891 (currently affecting cp1052), cause the Icinga check script for strongswan to report many critical errors. We do not want those errors to be reported while the machine is depooled.
FWIW, I saw these in the Icinga web UI, but icinga-wm was apparently not reporting them on IRC, even though I did not see "disabled notifications" or ACKs in the web UI. Then I ACKed them all and it announced that on IRC, which was a bit strange.
Icinga external commands include SCHEDULE_SVC_DOWNTIME, which seems handy. We could perhaps try writing a script that issues a SCHEDULE_SVC_DOWNTIME for the IPSec service for each host defined in the role::ipsec targets array?
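A rough sketch of what such a script could look like, assuming the classic Icinga 1 / Nagios external command format for SCHEDULE_SVC_DOWNTIME (host, service, start, end, fixed, trigger_id, duration, author, comment, separated by semicolons). The host names, the service description "IPsec", the author, and the command file path are placeholders for illustration, not values taken from the puppet repo; the real host list would have to come from the role::ipsec targets array.

```python
import time


def schedule_svc_downtime(host, service, duration_s, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME external command line.

    Fields follow the Icinga 1 / Nagios external command format:
    host;service;start;end;fixed;trigger_id;duration;author;comment
    The comment must not contain ';' since that is the field separator.
    """
    now = int(time.time()) if now is None else int(now)
    end = now + duration_s
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(now),         # start_time
        str(end),         # end_time
        "1",              # fixed downtime (starts and ends at the given times)
        "0",              # trigger_id: not triggered by another downtime
        str(duration_s),  # duration, only used for flexible downtimes
        author,
        comment,
    ]
    return "[{}] {}".format(now, ";".join(fields))


def downtime_commands(hosts, service, duration_s, author, comment, now=None):
    """Build one command line per host in the (hypothetical) targets list."""
    return [
        schedule_svc_downtime(h, service, duration_s, author, comment, now=now)
        for h in hosts
    ]


def submit(commands, command_file):
    """Write the commands to the Icinga command pipe.

    The path is an assumption and depends on the local Icinga setup.
    """
    with open(command_file, "w") as fh:
        for cmd in commands:
            fh.write(cmd + "\n")


if __name__ == "__main__":
    # Placeholder hosts and service name; printed rather than submitted.
    for line in downtime_commands(
        ["mc1033", "mc1034"], "IPsec", 3600, "ops", "host depooled"
    ):
        print(line)
```

The script only prints the command lines; calling submit() with the path of the Icinga command file would actually schedule the downtimes, and would need to run on the Icinga host with write access to that pipe.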
This was mostly about cache nodes back when those had ipsec, I think. The only remaining case that still uses ipsec is the memcached cluster. Does this matter there, and is it worth fixing there? Unsetting priority to get this some triage, and removing the traffic label.
During T196487, with all eqiad row D hosts downtimed, we got this:
14:06 <+icinga-wm> PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance={mc2033,mc2034,mc2035,mc2036} site=codfw tunnel={mc1033_v4,mc1034_v4,mc1035_v4,mc1036_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
So I would say that the issue is still there, and still worth fixing!
Change 632738 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: apply ipsec monitoring where enabled with ipsec_exporter
Change 632739 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: clean up ipsec aggregate check
Strongswan is going away because we do not need it anymore. We were using it for redis_sessions (T267581).
$ sudo cumin --dry-run R:Class=strongswan
16 hosts will be targeted:
mc[2020,2022,2026,2030,2032,2034,2036,2038].codfw.wmnet,mc[1038,1040,1042,1044,1048,1050,1052,1054].eqiad.wmnet
Change 632738 abandoned by Cwhite:
[operations/puppet@production] profile: apply ipsec monitoring where enabled with ipsec_exporter
Reason:
task was declined
Change 632739 abandoned by Cwhite:
[operations/puppet@production] profile: clean up ipsec aggregate check
Reason: