Strongswan Icinga check: do not report issues about depooled hosts
Closed, Declined · Public

Description

Hosts with long-running issues, such as those described in T148891 (currently affecting cp1052), cause the Icinga check script for strongswan to report lots of critical errors. We do not want those errors to be reported when the machine is depooled.
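
One way to implement the "skip when depooled" behavior would be for the check (or a wrapper around it) to ask conftool for the host's pooled state and exit OK early when the host is depooled. A minimal sketch in Python, assuming confctl is available on the checking host and that its get output is one JSON object per line keyed by the host's FQDN (both assumptions, not something this task specifies):

```
#!/usr/bin/env python3
"""Hypothetical wrapper: skip strongswan checks for depooled hosts.

The confctl invocation and its output shape are assumptions about
conftool, not something specified in this task.
"""
import json
import subprocess
import sys

def is_pooled(fqdn):
    # Ask conftool for the host's state; assume one JSON object per
    # output line, keyed by the FQDN, e.g.
    # {"cp1052.eqiad.wmnet": {"pooled": "yes", "weight": 10}, "tags": "..."}
    out = subprocess.run(
        ["confctl", "select", f"name={fqdn}", "get"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.strip() and json.loads(line).get(fqdn, {}).get("pooled") == "yes":
            return True
    return False

if __name__ == "__main__":
    host = sys.argv[1]  # e.g. cp1052.eqiad.wmnet
    if not is_pooled(host):
        print(f"OK: {host} is depooled, skipping strongswan tunnel checks")
        sys.exit(0)  # Icinga exit code 0 = OK
    # ...otherwise fall through to the real strongswan check...
```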

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. · Oct 24 2016, 2:59 PM

FWIW, I saw these in the Icinga web UI, but icinga-wm was apparently not reporting them on IRC, even though I did not see "disabled notifications" or ACKs in the web UI. Then I ACKed them all and icinga-wm did announce that on IRC, which was a bit strange.

Icinga's external commands include SCHEDULE_SVC_DOWNTIME, which seems handy. We could perhaps write a script that issues a SCHEDULE_SVC_DOWNTIME for the IPsec service of each host defined in the role::ipsec targets array; a rough sketch of that idea follows.
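
A minimal sketch of what that script might look like, assuming the Icinga command pipe lives at /var/lib/icinga/rw/icinga.cmd and the service description is "IPsec" (both placeholders; only the SCHEDULE_SVC_DOWNTIME syntax itself is Icinga's documented external-command format):

```
#!/usr/bin/env python3
"""Sketch of the SCHEDULE_SVC_DOWNTIME idea from this comment.

The command-pipe path, service description, and target list below are
placeholders; only the external-command syntax is Icinga's documented
format.
"""
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed Icinga command pipe
SERVICE = "IPsec"                           # assumed service description
DURATION = 4 * 3600                         # downtime length in seconds

# Stand-in for the hosts defined in the role::ipsec targets array.
targets = ["cp1052.eqiad.wmnet"]

start = int(time.time())
end = start + DURATION
with open(CMD_FILE, "w") as pipe:
    for host in targets:
        # [ts] SCHEDULE_SVC_DOWNTIME;host;service;start;end;fixed;trigger_id;duration;author;comment
        pipe.write(
            f"[{start}] SCHEDULE_SVC_DOWNTIME;{host};{SERVICE};"
            f"{start};{end};1;0;{DURATION};downtime-script;host is depooled\n"
        )
```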

BBlack raised the priority of this task from Medium to Needs Triage. · Sep 23 2020, 4:22 PM
BBlack removed a project: Traffic.
BBlack subscribed.

This was mostly about cache nodes back when those had ipsec, I think. The only remaining user of ipsec is the memcached cluster. Does this matter there? Is it worth fixing there? Unsetting the priority to get this triaged, and removing the Traffic label.

> This was mostly about cache nodes back when those had ipsec, I think. The only remaining user of ipsec is the memcached cluster. Does this matter there? Is it worth fixing there? Unsetting the priority to get this triaged, and removing the Traffic label.

During T196487, with all eqiad row D hosts downtimed, we got this:

14:06 <+icinga-wm> PROBLEM - Aggregate IPsec Tunnel Status codfw on alert1001 is CRITICAL: instance={mc2033,mc2034,mc2035,mc2036} site=codfw tunnel={mc1033_v4,mc1034_v4,mc1035_v4,mc1036_v4}
                   https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status

So I would say that the issue is still there, and still worth fixing!

herron triaged this task as Medium priority. · Sep 30 2020, 5:28 PM
herron moved this task from Radar to Inbox on the observability board.

Change 632738 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: apply ipsec monitoring where enabled with ipsec_exporter

https://gerrit.wikimedia.org/r/632738

Change 632739 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: clean up ipsec aggregate check

https://gerrit.wikimedia.org/r/632739

jijiki subscribed.

Strongswan is going away because we do not need it anymore; we were using it for redis_sessions (T267581).

$ sudo cumin --dry-run R:Class=strongswan
16 hosts will be targeted:
mc[2020,2022,2026,2030,2032,2034,2036,2038].codfw.wmnet,mc[1038,1040,1042,1044,1048,1050,1052,1054].eqiad.wmnet

Change 632738 abandoned by Cwhite:

[operations/puppet@production] profile: apply ipsec monitoring where enabled with ipsec_exporter

Reason:

task was declined

https://gerrit.wikimedia.org/r/632738

Change 632739 abandoned by Cwhite:

[operations/puppet@production] profile: clean up ipsec aggregate check

Reason:

https://gerrit.wikimedia.org/r/632739