Migrate Traffic Prometheus alerts from Icinga to Alertmanager
Closed, Resolved (Public)

Assigned To

Authored By

	fgiunchedi
	Feb 2 2022, 11:11 AM

Description

There are a few Traffic check_prometheus-based alerts that should be migrated to Alertmanager:

monitoring::check_prometheus { "aggregate-ipsec-tunnel-status-${site}":
monitoring::check_prometheus { "varnish_${title}":
monitoring::check_prometheus { "ats_${title}":
monitoring::check_prometheus { $title:
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-eqiad-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-codfw-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-esams-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-ulsfo-kafka_drerr":
monitoring::check_prometheus { "varnishkafka-${instance}-${cache_segment}-eqsin-kafka_drerr":
monitoring::check_prometheus { 'purged-event-lag':
monitoring::check_prometheus { 'purged-backlog':
monitoring::check_prometheus { 'varnishd-mmap-count':
monitoring::check_prometheus { 'excessive-lvs-rx-traffic':
monitoring::check_prometheus { 'lvs-cpu-saturated':
monitoring::check_prometheus { 'pybal_bgp_sessions':
monitoring::check_prometheus { 'varnish-frontend-check-child-start':

Notably, the "reduced availability" alerts can suffer from delays or excessive averaging because they use the Prometheus global instance and the Icinga evaluation period.
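
To illustrate the averaging concern, here is a minimal sketch with made-up numbers. The sample values, window lengths, and threshold are assumptions chosen for illustration and are not taken from the actual checks. It shows how a short, sharp dip is obvious over a short window but gets diluted when averaged over a longer one.

```python
# Hypothetical per-minute availability samples (percent); not real data.
samples = [100.0] * 25 + [50.0] * 5  # 25 healthy minutes, then a 5-minute dip

THRESHOLD = 90.0  # assumed alert threshold, for illustration only


def window_average(values, minutes):
    """Average of the most recent `minutes` samples."""
    recent = values[-minutes:]
    return sum(recent) / len(recent)


short_avg = window_average(samples, 5)   # covers only the dip
long_avg = window_average(samples, 30)   # dip diluted by the healthy minutes

print(f"5m average:  {short_avg:.1f} -> alert={short_avg < THRESHOLD}")
print(f"30m average: {long_avg:.1f} -> alert={long_avg < THRESHOLD}")
```

With these numbers the 5-minute average falls below the assumed threshold while the 30-minute average stays above it, so a check that averages over the longer window would stay silent during the dip.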

Details

Subject	Repo	Branch	Lines +/-
trafficserver: Remove restart count icinga alert	operations/puppet	production	+0 -14
Remove kafka alerting class	operations/puppet	production	+0 -35
prometheus: remove traffic_drop alerts	operations/puppet	production	+0 -6
Icinga: Remove traffic alerts	operations/puppet	production	+0 -156
varnish: Port over traffic_drop from Icinga	operations/alerts	master	+83 -0
varnish: add VarnishHighMmapCount	operations/alerts	master	+77 -1
traffic: Port over ATS restart alert	operations/alerts	master	+38 -0
Traffic: Port over purged lag/queue monitors	operations/alerts	master	+145 -0
data-engineering: add varnishkafka delivery errors	operations/alerts	master	+99 -0
Traffic: Port IPsec/Strongswan connection alert	operations/alerts	master	+84 -0
Traffic: Add alert for Varnish child restart	operations/alerts	master	+41 -0
Traffic: Add PyBal BGP sessions	operations/alerts	master	+42 -0

Related Objects

Status	Assigned	Task
Open	None	T321808 Port most/all Icinga checks to Prometheus/Alertmanager
Open	None	T288622 All Prometheus based alerts move from Icinga to alert manager exclusively
Resolved	BCornwall	T300723 Migrate Traffic Prometheus alerts from Icinga to Alertmanager
Resolved	BCornwall	T311445 Create vm.max_map_count metrics for Prometheus

Event Timeline

fgiunchedi created this task. Feb 2 2022, 11:11 AM

Restricted Application added a subscriber: Aklapper. Feb 2 2022, 11:11 AM

Maintenance_bot added a project: SRE. Feb 2 2022, 11:45 AM

fgiunchedi updated the task description. Feb 2 2022, 1:12 PM

fgiunchedi updated the task description. Feb 2 2022, 1:40 PM

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. Feb 9 2022, 9:06 AM

jbond triaged this task as Medium priority. Feb 16 2022, 4:57 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q4); removed SRE Observability (FY2021/2022-Q3). Apr 11 2022, 1:15 PM

BCornwall claimed this task. Jun 2 2022, 8:56 PM

Change 803368 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Add PyBal BGP sessions

https://gerrit.wikimedia.org/r/803368

gerritbot added a project: Patch-For-Review. Jun 6 2022, 11:39 PM

Change 804450 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Add alert for Varnish child restart

https://gerrit.wikimedia.org/r/804450

Change 803368 merged by jenkins-bot:

[operations/alerts@master] Traffic: Add PyBal BGP sessions

https://gerrit.wikimedia.org/r/803368

Change 804450 merged by BCornwall:

[operations/alerts@master] Traffic: Add alert for Varnish child restart

https://gerrit.wikimedia.org/r/804450

Maintenance_bot removed a project: Patch-For-Review. Jun 13 2022, 6:30 PM

Change 805237 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: add varnishkafka delivery error alarms

https://gerrit.wikimedia.org/r/805237

gerritbot added a project: Patch-For-Review. Jun 13 2022, 10:25 PM

varnishd-mmap-count is proving difficult to port over, since we don't appear to have any way of determining a server's vm.max_map_count from our current Prometheus data set. The incumbent Icinga alert passes in the sysctl value as set by Puppet; moving the alert into a separate repository loses access to that value. I've searched around and am not aware of any simple mechanism for retrieving that value in Prometheus, unfortunately.

There is a sysctl exporter for Prometheus on BSD, but I do not see anything similar for Linux. I have been unable to find anything in the greater ecosystem that would provide these values.

Change 805887 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Port IPsec/Strongswan connection alert

https://gerrit.wikimedia.org/r/805887

Change 805873 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] WIP: add VarnishHighMmapCount

https://gerrit.wikimedia.org/r/805873

Change 805887 abandoned by BCornwall:

[operations/alerts@master] Traffic: Port IPsec/Strongswan connection alert

Reason:

IPsec/strongswan is used only for memcache nowadays and will soon be retired entirely, so it doesn't make sense to port this over.

https://gerrit.wikimedia.org/r/805887

Change 806332 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] Traffic: Port over purged lag/queue monitors

https://gerrit.wikimedia.org/r/806332

BCornwall changed the task status from Open to In Progress. Jun 17 2022, 6:30 PM

Change 805237 merged by BCornwall:

[operations/alerts@master] data-engineering: add varnishkafka delivery errors

https://gerrit.wikimedia.org/r/805237

Change 806332 merged by BCornwall:

[operations/alerts@master] Traffic: Port over purged lag/queue monitors

https://gerrit.wikimedia.org/r/806332

The varnish-mmap-count situation could be resolved with https://github.com/prometheus/procfs/pull/176/files, which has been merged, but it's not available in node-exporter yet (and the README for that repo mentions an experimental status) :(

Change 807214 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] traffic: Port over ATS restart alert

https://gerrit.wikimedia.org/r/807214

In T300723#8017017, @BCornwall wrote:

The varnish-mmap-count situation could be resolved with https://github.com/prometheus/procfs/pull/176/files, which has been merged, but it's not available in node-exporter yet (and the README for that repo mentions an experimental status) :(

Indeed, there is no sign of that metric. The simplest thing to do might be to go the textfile .prom way and add it ourselves, though, as Jessie mentioned, I'm now wondering whether the alert is still relevant nowadays. cc @BBlack and @Vgutierrez, as they would have an informed opinion!

Change 807214 merged by BCornwall:

[operations/alerts@master] traffic: Port over ATS restart alert

https://gerrit.wikimedia.org/r/807214

Spoke with @Vgutierrez on IRC and they confirmed that the mmap maximum is worth monitoring. The suggested approach would be similar to prometheus::node_varnishd_mmap_count, which is exposed by a simple shell script that runs periodically. I'll make a child ticket to track that work.
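
As a rough illustration of the textfile .prom approach discussed above, the sketch below reads vm.max_map_count from procfs and renders it in the Prometheus text exposition format for node_exporter's textfile collector. It is written in Python rather than shell, and it is not the script that was eventually deployed. The metric name and the output path in the usage note are hypothetical.

```python
import os
import tempfile


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the kernel's vm.max_map_count limit from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


def render_metric(value, name="node_sysctl_vm_max_map_count"):
    """Render one gauge in the Prometheus text exposition format.

    The default metric name is hypothetical, chosen for this sketch.
    """
    return (
        f"# HELP {name} Value of the vm.max_map_count sysctl.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


def write_prom_file(text, dest):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(dest) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
    os.replace(tmp, dest)


def main(dest):
    """Export the current limit to `dest` (a path chosen by the caller)."""
    write_prom_file(render_metric(read_max_map_count()), dest)
```

Run periodically (for example from a timer) with something like `main("/path/to/textfile-dir/max_map_count.prom")`, where the directory is whatever node_exporter's `--collector.textfile.directory` flag points at. An alerting rule could then compare the existing mmap count metric against this limit.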

BCornwall mentioned this in T311445: Create vm.max_map_count metrics for Prometheus. Jun 27 2022, 6:28 PM

fgiunchedi edited projects, added Observability-Alerting; removed SRE Observability (FY2021/2022-Q4). Jul 1 2022, 8:11 AM

BCornwall closed subtask T311445: Create vm.max_map_count metrics for Prometheus as Resolved. Jul 7 2022, 7:19 PM

Change 805873 merged by jenkins-bot:

[operations/alerts@master] varnish: add VarnishHighMmapCount

https://gerrit.wikimedia.org/r/805873

@fgiunchedi

Looks like the rules mentioned in the ticket have all either been ported or confirmed as not worthy of porting. The only thing left here is to verify that these are indeed alerting rules running on our alertmanager instances (how can I verify that? I'm not sure where our alertmanager instances are!). After verification, then we need to remove the existing Icinga alerts from puppet, correct?

Ah, I've since learned where to look and verify where the rules are. Are we comfortable enough with the unit tests to rip out the Icinga alerts without a more "real-world" test?
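
For readers wondering how such a check might be done programmatically: alerting rules are evaluated by Prometheus, which forwards firing alerts to Alertmanager, so one option is to ask a Prometheus server which rules it has loaded via its `/api/v1/rules` HTTP endpoint. The sketch below is an assumption about how one could do this, not the procedure used in this task. The sample payload is hand-written, and `SomeOtherAlert` is a hypothetical name.

```python
import json
import urllib.request


def fetch_rules(base_url):
    """Fetch the loaded rules from a Prometheus server's HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules") as resp:
        return json.load(resp)


def alerting_rule_names(payload):
    """Return the names of all alerting rules in an /api/v1/rules response."""
    names = set()
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") == "alerting":
                names.add(rule["name"])
    return names


def missing_rules(payload, expected):
    """Return the expected alert names that the server has not loaded."""
    return sorted(set(expected) - alerting_rule_names(payload))


# Hand-written sample in the shape of an /api/v1/rules response (not real output).
sample = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "varnish",
                "rules": [
                    {"name": "VarnishHighMmapCount", "type": "alerting"},
                    {"name": "example:recorded:metric", "type": "recording"},
                ],
            }
        ]
    },
}

print(missing_rules(sample, ["VarnishHighMmapCount", "SomeOtherAlert"]))
```

Against a live server, `missing_rules(fetch_rules(base_url), expected)` would do the same check, with `base_url` being the address of the Prometheus instance in question.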

Change 812424 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/alerts@master] varnish: Port over traffic_drop from Icinga

https://gerrit.wikimedia.org/r/812424

(forgot one last one!)

fgiunchedi edited parent tasks, added: T288622: All Prometheus based alerts move from Icinga to alert manager exclusively; removed: T281454: Onboard teams with Prometheus-based alerts to AlertManager. Jul 11 2022, 12:09 PM

In T300723#8066385, @BCornwall wrote:

@fgiunchedi

Looks like the rules mentioned in the ticket have all either been ported or confirmed as not worthy of porting. The only thing left here is to verify that these are indeed alerting rules running on our alertmanager instances (how can I verify that? I'm not sure where our alertmanager instances are!). After verification, then we need to remove the existing Icinga alerts from puppet, correct?

That is correct, yes: once the alerting rules are in place in alerts.git, the old Icinga checks can be removed from Puppet.

In T300723#8066564, @BCornwall wrote:

Ah, I've since learned where to look and verify where the rules are. Are we comfortable enough with the unit tests to rip out the Icinga alerts without a more "real-world" test?

For checks based on check_prometheus in Icinga, I'm personally confident enough about the semantics of the unit tests to be comfortable removing the Icinga alerts (and have done so in the past). Having said that, it is totally understandable to want a real-world test; I'm guessing that, depending on the system at hand, that will be easier or harder to achieve. If there's a simple way to do a real-world test, then +1.

Change 812424 merged by BCornwall:

[operations/alerts@master] varnish: Port over traffic_drop from Icinga

https://gerrit.wikimedia.org/r/812424

Change 814894 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Icinga: Remove traffic alerts

https://gerrit.wikimedia.org/r/814894

BCornwall closed this task as Resolved. Jul 27 2022, 4:05 PM

Change 814894 merged by BCornwall:

[operations/puppet@production] Icinga: Remove traffic alerts

https://gerrit.wikimedia.org/r/814894

Change 817844 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/puppet@production] prometheus: remove traffic_drop alerts