
Migrate Traffic Prometheus alerts from Icinga to Alertmanager
Closed, Resolved, Public

Description

There are a few traffic check_prometheus-based alerts that should be migrated to Alertmanager:

monitoring::check_prometheus { "aggregate-ipsec-tunnel-status-${site}":
monitoring::check_prometheus { "varnish_${title}":
monitoring::check_prometheus { "ats_${title}":
monitoring::check_prometheus { $title:
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-eqiad-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-codfw-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-esams-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-ulsfo-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-eqsin-kafka_drerr":
monitoring::check_prometheus { 'purged-event-lag':
monitoring::check_prometheus { 'purged-backlog':
monitoring::check_prometheus { 'varnishd-mmap-count':
monitoring::check_prometheus { 'excessive-lvs-rx-traffic':
monitoring::check_prometheus { 'lvs-cpu-saturated':
monitoring::check_prometheus { 'pybal_bgp_sessions':
monitoring::check_prometheus { 'varnish-frontend-check-child-start':

Notably, the "reduced availability" alerts can suffer from delays or excessive averaging because they use the global Prometheus instance and are subject to the Icinga evaluation period.
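The list above reads like grep output over the Puppet tree. As a rough sketch of how the remaining checks could be enumerated while the migration progresses, the resource titles can be extracted with a regular expression. The pattern and the function name `find_check_titles` are illustrative assumptions, not existing tooling.

```python
import re

# Matches the opening of a Puppet resource declaration such as
#   monitoring::check_prometheus { 'purged-backlog':
# and captures the title, which may be single-quoted, double-quoted,
# or a bare variable like $title.
CHECK_RE = re.compile(
    r"""monitoring::check_prometheus\s*\{\s*(?:'([^']*)'|"([^"]*)"|(\$\w+))\s*:"""
)


def find_check_titles(manifest_text: str) -> list[str]:
    """Return the titles of all check_prometheus resources found in Puppet source."""
    titles = []
    for match in CHECK_RE.finditer(manifest_text):
        # Exactly one of the three capture groups is set per match.
        titles.append(next(g for g in match.groups() if g is not None))
    return titles


sample = '''
monitoring::check_prometheus { "varnish_${title}":
monitoring::check_prometheus { $title:
monitoring::check_prometheus { 'purged-backlog':
'''
print(find_check_titles(sample))
```

Running this over each manifest before and after a removal patch gives a quick view of which Icinga checks are still defined.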

Event Timeline

jbond triaged this task as Medium priority. (Feb 16 2022, 4:57 PM)

Change 803368 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Add PyBal BGP sessions

https://gerrit.wikimedia.org/r/803368

Change 804450 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic Add alert for Varnish child restart

https://gerrit.wikimedia.org/r/804450

Change 803368 merged by jenkins-bot:

[operations/alerts@master] Traffic: Add PyBal BGP sessions

https://gerrit.wikimedia.org/r/803368

Change 804450 merged by BCornwall:

[operations/alerts@master] Traffic: Add alert for Varnish child restart

https://gerrit.wikimedia.org/r/804450

Change 805237 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: add varnishkafka delivery error alarms

https://gerrit.wikimedia.org/r/805237

varnishd-mmap-count is proving difficult to port because our current Prometheus data set does not appear to offer any way of determining a server's vm.max_map_count. The incumbent Icinga alert is passed the sysctl value as set by Puppet; moving the alert into a separate repository loses access to that value. I've searched around and, unfortunately, am not aware of any simple mechanism for retrieving that value in Prometheus.

There exists a sysctl exporter for Prometheus on BSD, but I do not see anything similar for Linux. I have been unable to find anything in the wider ecosystem that would provide these values.
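To make the dependency concrete: the check needs both the number of mappings a process holds and the vm.max_map_count limit, and only the former is exported today. The sketch below assumes the count is obtained by counting lines in /proc/<pid>/maps (one line per mapping); that is an assumption about how such a count can be taken, not a description of the existing exporter, and the function names are hypothetical. The value 65530 is the usual Linux default for vm.max_map_count, used here purely as sample input.

```python
from pathlib import Path


def count_mappings(maps_text: str) -> int:
    """Count memory mappings; each non-empty line of /proc/<pid>/maps is one mapping."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def mmap_usage_ratio(mapping_count: int, max_map_count: int) -> float:
    """Fraction of the vm.max_map_count limit currently in use."""
    if max_map_count <= 0:
        raise ValueError("max_map_count must be positive")
    return mapping_count / max_map_count


def mappings_for_pid(pid: int) -> int:
    """Read the live mapping count for a process (Linux only, needs read access)."""
    return count_mappings(Path(f"/proc/{pid}/maps").read_text())


# Abbreviated, made-up sample of /proc/<pid>/maps content.
sample_maps = """\
55d0c8a00000-55d0c8a21000 r--p 00000000 fd:01 131 /usr/sbin/varnishd
55d0c8a21000-55d0c8ad0000 r-xp 00021000 fd:01 131 /usr/sbin/varnishd
7ffd3b1e0000-7ffd3b201000 rw-p 00000000 00:00 0 [stack]
"""
count = count_mappings(sample_maps)
print(count, mmap_usage_ratio(count, 65530))
```

An alert expression would compare this ratio against a threshold, which is why the limit has to be available as a metric alongside the count.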

Change 805887 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Port IPsec/Strongswan connection alert

https://gerrit.wikimedia.org/r/805887

Change 805873 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] WIP: add VarnishHighMmapCount

https://gerrit.wikimedia.org/r/805873

Change 805887 abandoned by BCornwall:

[operations/alerts@master] Traffic: Port IPsec/Strongswan connection alert

Reason:

IPsec/strongswan is used only for memcache nowadays and will soon be retired entirely, so it doesn't make sense to port this over.

https://gerrit.wikimedia.org/r/805887

Change 806332 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Port over purged lag/queue monitors

https://gerrit.wikimedia.org/r/806332

BCornwall changed the task status from Open to In Progress. (Jun 17 2022, 6:30 PM)

Change 805237 merged by BCornwall:

[operations/alerts@master] data-engineering: add varnishkafka delivery errors

https://gerrit.wikimedia.org/r/805237

Change 806332 merged by BCornwall:

[operations/alerts@master] Traffic: Port over purged lag/queue monitors

https://gerrit.wikimedia.org/r/806332

the varnish-mmap-count situation could be resolved with https://github.com/prometheus/procfs/pull/176/files which has been merged in, but it's not available in node-exporter yet (and the readme for that repo mentions an experimental status) :(

Change 807214 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] traffic: Port over ATS restart alert

https://gerrit.wikimedia.org/r/807214

> the varnish-mmap-count situation could be resolved with https://github.com/prometheus/procfs/pull/176/files which has been merged in, but it's not available in node-exporter yet (and the readme for that repo mentions an experimental status) :(

Indeed, there is no sign of that metric. The simplest thing to do might be to go the textfile .prom way and add it ourselves, though, as Jessie mentioned, I'm now wondering whether the alert is still relevant nowadays. cc @BBlack and @Vgutierrez, as they would have an informed opinion!
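For reference, the textfile route could look roughly like the sketch below: read the sysctl from procfs and write it out in the Prometheus text exposition format for node-exporter's textfile collector to pick up. The metric name, the output file name, and the helper names are all hypothetical, and the collector directory is site-specific, so it is left as a parameter.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical metric name, chosen for illustration only.
METRIC = "node_sysctl_vm_max_map_count"


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> int:
    """Read the current vm.max_map_count value from procfs (Linux only)."""
    return int(Path(path).read_text().strip())


def render_prom(max_map_count: int) -> str:
    """Render the value in the Prometheus text exposition format."""
    return (
        f"# HELP {METRIC} Value of the vm.max_map_count sysctl.\n"
        f"# TYPE {METRIC} gauge\n"
        f"{METRIC} {max_map_count}\n"
    )


def write_atomically(content: str, dest: Path) -> None:
    """Write to a temporary file in the same directory, then rename it into
    place, so the collector never reads a half-written .prom file."""
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file with mode 0600; make it readable by the exporter.
    os.chmod(tmp_name, 0o644)
    os.replace(tmp_name, dest)


def main(textfile_dir: str) -> None:
    """Export the sysctl into textfile_dir (the directory is site-specific)."""
    dest = Path(textfile_dir) / "max_map_count.prom"
    write_atomically(render_prom(read_max_map_count()), dest)
```

Calling `main()` periodically from a timer with the collector directory as its argument would keep the metric current. The temporary file uses a `.tmp` suffix because the textfile collector only reads files ending in `.prom`.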

Change 807214 merged by BCornwall:

[operations/alerts@master] traffic: Port over ATS restart alert

https://gerrit.wikimedia.org/r/807214

Spoke with @Vgutierrez on IRC and they confirmed that the mmap maximum is worth monitoring. The suggested approach would be similar to prometheus::node_varnishd_mmap_count, which is made available by a simple shell script that runs routinely. I'll make a child ticket to track that work.

Change 805873 merged by jenkins-bot:

[operations/alerts@master] varnish: add VarnishHighMmapCount

https://gerrit.wikimedia.org/r/805873

@fgiunchedi

Looks like the rules mentioned in the ticket have all either been ported or confirmed as not worthy of porting. The only thing left here is to verify that these are indeed alerting rules running on our alertmanager instances (how can I verify that? I'm not sure where our alertmanager instances are!). After verification, then we need to remove the existing Icinga alerts from puppet, correct?

Ah, I've since learned where to look and verify where the rules are. Are we comfortable enough with the unit tests to rip out the Icinga alerts without a more "real-world" test?

Change 812424 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] varnish: Port over traffic_drop from Icinga

https://gerrit.wikimedia.org/r/812424

> @fgiunchedi
>
> Looks like the rules mentioned in the ticket have all either been ported or confirmed as not worthy of porting. The only thing left here is to verify that these are indeed alerting rules running on our alertmanager instances (how can I verify that? I'm not sure where our alertmanager instances are!). After verification, then we need to remove the existing Icinga alerts from puppet, correct?

That is correct, yes: once the alerting rules are in place in alerts.git, the old Icinga checks can be removed from Puppet.

> Ah, I've since learned where to look and verify where the rules are. Are we comfortable enough with the unit tests to rip out the Icinga alerts without a more "real-world" test?

For checks based on check_prometheus in Icinga, I'm personally confident enough about the semantics of the unit tests to be comfortable removing the Icinga alerts (and have done so in the past). Having said that, it is totally understandable to want a real-world test; I'm guessing that, depending on the system at hand, that will be easier or harder to achieve. If there's a simple way to do a real-world test, then +1.

Change 812424 merged by BCornwall:

[operations/alerts@master] varnish: Port over traffic_drop from Icinga

https://gerrit.wikimedia.org/r/812424

Change 814894 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Icinga: Remove traffic alerts

https://gerrit.wikimedia.org/r/814894

Change 814894 merged by BCornwall:

[operations/puppet@production] Icinga: Remove traffic alerts

https://gerrit.wikimedia.org/r/814894

Change 817844 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/puppet@production] prometheus: remove traffic_drop alerts

https://gerrit.wikimedia.org/r/817844

Change 817844 merged by BCornwall:

[operations/puppet@production] prometheus: remove traffic_drop alerts

https://gerrit.wikimedia.org/r/817844

Change 817866 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Remove more traffic alerts

https://gerrit.wikimedia.org/r/817866

Change 817866 merged by BCornwall:

[operations/puppet@production] Remove kafka alerting class

https://gerrit.wikimedia.org/r/817866

Change 889881 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] trafficserver: Remove restart count icinga alert

https://gerrit.wikimedia.org/r/889881

Change 889881 merged by BCornwall:

[operations/puppet@production] trafficserver: Remove restart count icinga alert

https://gerrit.wikimedia.org/r/889881