
De-noise ipsec alerts (Reduce Icinga alert noise goal)
Closed, Resolved, Public

Description

Today, when an ipsec tunnel goes down, a large number of host alerts fire. This happens because a single per-host icinga check covers multiple tunnels, so every host with a tunnel to the affected peer alerts separately.

We should be able to move this to a prometheus check. At a high level, we would need to:

  • Add ipsec tunnel status metrics to prometheus
  • Alert on the aggregate ipsec status metrics (per-site)
  • Phase out the host based ipsec checks
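
As a rough illustration of the per-site aggregation idea above, the sketch below groups unhealthy tunnels by site so that one alert per site can list every affected instance and tunnel. The `Sample` type and `failing_by_site` function are hypothetical names used only for illustration, and the sketch assumes a status value of 0 means the tunnel is up and any other value is a problem. It is not the Prometheus rule that was eventually deployed.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Sample(NamedTuple):
    """One hypothetical ipsec status sample, labelled by site, instance and tunnel."""
    site: str
    instance: str
    tunnel: str
    value: int  # assumption: 0 means the tunnel is up, anything else is a problem


def failing_by_site(samples: Iterable[Sample]) -> Dict[str, Dict[str, Set[str]]]:
    """Collect the instances and tunnels that are not up, keyed by site."""
    result: Dict[str, Dict[str, Set[str]]] = defaultdict(
        lambda: {"instance": set(), "tunnel": set()}
    )
    for sample in samples:
        if sample.value != 0:
            result[sample.site]["instance"].add(sample.instance)
            result[sample.site]["tunnel"].add(sample.tunnel)
    # Sites with only healthy tunnels never get an entry, so they raise no alert.
    return dict(result)
```

With this shape, a single down peer produces one entry per site rather than one alert per host.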

Event Timeline

I've drafted a prometheus-ipsec-exporter package based on https://github.com/dennisstritzke/ipsec_exporter on boron.

Due to its dependencies, it currently builds successfully for buster, and the resulting package installs successfully on stretch.

I don't currently have the privs to create a new gerrit project operations/debs/prometheus-ipsec-exporter, so for the time being it will have to be browsed from my homedir on boron:

boron:/home/herron/prometheus-ipsec-exporter/prometheus-ipsec-exporter-0.3.1

And the built package can be fetched from:

boron:/var/cache/pbuilder/result/buster-amd64/prometheus-ipsec-exporter_0.3.1-1_amd64.deb

Light testing has been successful using a stretch VM running strongswan.

Change 530203 had a related patch set uploaded (by Herron; owner: Herron):
[operations/debs/prometheus-ipsec-exporter@master] prometheus-ipsec-exporter: initial commit of version 0.3.1

https://gerrit.wikimedia.org/r/530203

Change 530616 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add prometheus ipsec exporter service & config

https://gerrit.wikimedia.org/r/530616

Change 530203 merged by Herron:
[operations/debs/prometheus-ipsec-exporter@master] prometheus-ipsec-exporter: initial commit of version 0.3.1

https://gerrit.wikimedia.org/r/530203

Mentioned in SAL (#wikimedia-operations) [2019-08-26T17:34:48Z] <herron> beginning roll out of prometheus-ipsec-exporter in ulsfo T230236

Change 530616 merged by Herron:
[operations/puppet@production] prometheus: add prometheus ipsec exporter service & config in ulsfo

https://gerrit.wikimedia.org/r/530616

Change 532426 had a related patch set uploaded (by Herron; owner: Herron):
[operations/debs/prometheus-ipsec-exporter@master] change user to root

https://gerrit.wikimedia.org/r/532426

Change 532426 merged by Herron:
[operations/debs/prometheus-ipsec-exporter@master] change user to root

https://gerrit.wikimedia.org/r/532426

Change 533563 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: aggregate ipsec_status and add alert

https://gerrit.wikimedia.org/r/533563

Change 534210 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: deploy prometheus-ipsec-exporter to all sites

https://gerrit.wikimedia.org/r/534210

Change 534210 merged by Herron:
[operations/puppet@production] prometheus: deploy prometheus-ipsec-exporter to all sites

https://gerrit.wikimedia.org/r/534210

Change 533563 merged by Herron:
[operations/puppet@production] prometheus: aggregate ipsec_status and add alert

https://gerrit.wikimedia.org/r/533563

Change 536216 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: switch to per-site aggregate ipsec checks

https://gerrit.wikimedia.org/r/536216

Change 536216 merged by Herron:
[operations/puppet@production] prometheus: switch to per-site aggregate ipsec checks

https://gerrit.wikimedia.org/r/536216

Change 536671 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf

https://gerrit.wikimedia.org/r/536671

Change 536671 merged by Herron:
[operations/puppet@production] prometheus-ipsec-exporter: subscribe service to /etc/ipsec.conf

https://gerrit.wikimedia.org/r/536671

Looks like the alerts are working as intended, and we can move the per-host alerts to warning (or remove them).

11:17 -icinga-wm:#wikimedia-operations- PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2006:9536,cp2019:9536} site=codfw tunnel={cp3060_v4,cp3060_v6} 
          https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
11:18 -icinga-wm:#wikimedia-operations- PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=cp1081:9536 site=eqiad tunnel={cp3060_v4,cp3060_v6} 
          https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
11:19 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:21 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:30 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:31 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:31 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:33 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:33 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:36 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:37 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:37 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:37 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:37 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:38 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:38 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:38 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:38 -icinga-wm:#wikimedia-operations- RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
          https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
11:38 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:47 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:47 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:48 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
11:52 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:01 -icinga-wm:#wikimedia-operations- PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} 
          site=eqiad tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
12:02 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:06 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:08 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:10 -icinga-wm:#wikimedia-operations- PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:28 -icinga-wm:#wikimedia-operations- RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
12:29 -icinga-wm:#wikimedia-operations- RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
          https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
12:30 -icinga-wm:#wikimedia-operations- RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
          https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
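
To compare the noise from the two kinds of check in a log like the one above, a small parser can tally notifications by scope. This is a minimal sketch: `LINE_RE` and `count_notifications` are hypothetical names, and the pattern assumes the `HH:MM -icinga-wm:<channel>- STATE - <service> is <status>` layout seen in this paste.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the start of an icinga-wm notification line as pasted above.
LINE_RE = re.compile(
    r"^\d\d:\d\d -icinga-wm:\S+ (PROBLEM|RECOVERY) - (.+?) is (?:CRITICAL|WARNING|OK)\b"
)


def count_notifications(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Tally notifications as (scope, kind), where scope is 'aggregate' or 'per-host'."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # wrapped continuation lines carry only URLs or labels
        kind, service = match.groups()
        if service.startswith("Aggregate IPsec Tunnel Status"):
            scope = "aggregate"
        else:
            scope = "per-host"
        counts[(scope, kind)] += 1
    return counts
```

Feeding the pasted log into `count_notifications` gives the number of per-host versus aggregate PROBLEM and RECOVERY messages for the same incident.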

Change 546666 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ipsec: remove check_strongswan in favor of prometheus check

https://gerrit.wikimedia.org/r/546666

Change 546666 merged by Herron:
[operations/puppet@production] ipsec: remove check_strongswan in favor of prometheus check

https://gerrit.wikimedia.org/r/546666

https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status probably needs some cleanup (some of the graphs are empty, there's a note there to ignore icinga errors, etc). Also fix missing doc link on the alert?

Change 549481 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add ipsec_status to prometheus/global

https://gerrit.wikimedia.org/r/549481

Change 549481 merged by Herron:
[operations/puppet@production] prometheus: add ipsec_status to prometheus/global

https://gerrit.wikimedia.org/r/549481

Change 549932 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] icinga: add notes_url to aggregate ipsec alerts

https://gerrit.wikimedia.org/r/549932

Change 549932 merged by Herron:
[operations/puppet@production] icinga: add notes_url to aggregate ipsec alerts

https://gerrit.wikimedia.org/r/549932

herron claimed this task.

https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status probably needs some cleanup (some of the graphs are empty, there's a note there to ignore icinga errors, etc). Also fix missing doc link on the alert?

Done, and done! Also added some troubleshooting hints to the alert documentation link. I think we're in good shape here, but please reopen if you see anything needing follow-up.

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks