Reduce Icinga alert noise
Closed, Resolved (Public)

Description

It has been observed multiple times that Icinga alerts can be noisy, especially on IRC, and during incidents they can be very distracting. In particular, the following should help improve the signal-to-noise ratio:

Replace host-level IRC alerts with service-level equivalents.

Especially on IRC there's often no need to have notifications for single hosts, e.g. CPU alerts, dpkg broken, etc. In some cases these host-level alerts make more sense when aggregated (e.g. per cluster) and/or when shown only in the Icinga UI rather than sent to IRC.
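To illustrate the aggregation idea, here is a minimal Python sketch that turns per-host failure flags into a per-cluster failure percentage and alerts only on clusters above a threshold. The function names, the tuple layout, and the 10% default threshold are assumptions made for illustration; they are not taken from the actual Puppet/Prometheus changes on this task.

```python
from collections import defaultdict


def failure_percent_by_cluster(host_status):
    """Return {cluster: percent of hosts failing}.

    host_status is an iterable of (cluster, host, failed) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for cluster, _host, has_failed in host_status:
        totals[cluster] += 1
        if has_failed:
            failed[cluster] += 1
    return {c: 100.0 * failed[c] / totals[c] for c in totals}


def clusters_to_alert(host_status, threshold_percent=10.0):
    """Return the sorted clusters whose failure percentage meets the threshold."""
    percents = failure_percent_by_cluster(host_status)
    return sorted(c for c, p in percents.items() if p >= threshold_percent)
```

With this shape, a single broken host in a large cluster stays below the threshold and produces no IRC notification, while a cluster-wide problem produces one alert instead of one per host.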

[x] Alerts that page should say so

At the moment it is impossible to tell whether a given alert has paged folks. A paging alert indicates a serious issue, and a certain level of response is expected. Explicitly marking paging alerts will therefore help pick out serious issues (e.g. from IRC).
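One way to picture "alerts that page should say so" is a small helper that appends a marker to the alert description whenever the alert notifies a paging contact group. This is only a sketch: the `sms` group name and the `#page` marker are hypothetical placeholders, not necessarily what the monitoring configuration uses.

```python
# Hypothetical names, for illustration only.
PAGING_CONTACT_GROUPS = {"sms"}
PAGE_MARKER = "#page"


def describe_alert(description, contact_groups):
    """Append PAGE_MARKER to the description if any contact group pages."""
    pages = any(group in PAGING_CONTACT_GROUPS for group in contact_groups)
    if pages and PAGE_MARKER not in description:
        return f"{description} {PAGE_MARKER}"
    return description
```

Deriving the marker from the contact groups, rather than typing it by hand, keeps the description and the actual paging behaviour from drifting apart.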

[stretch] Downtime hosts from IRC

It would be useful if folks could downtime hosts from IRC, in a similar fashion to how we !log. This would help during incidents since we're on IRC anyway and the Icinga UI can be clunky/slow; the same goes for logging into the Icinga host and issuing downtime-host for each host.
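A rough sketch of what an IRC-driven downtime could look like: parse a bot command and turn it into an Icinga external command line. The `!downtime <host> <N>[m|h|d] [comment]` syntax is invented for this example, as is the host name used in the usage note. The output assumes the classic Icinga 1.x / Nagios `SCHEDULE_HOST_DOWNTIME` external command layout (host; start; end; fixed; trigger_id; duration; author; comment). Writing the line to the Icinga command file is left out on purpose.

```python
import re
import time

# Hypothetical IRC syntax: "!downtime <host> <N>[m|h|d] [comment]"
_COMMAND_RE = re.compile(
    r"^!downtime\s+(?P<host>[A-Za-z0-9.-]+)\s+"
    r"(?P<amount>\d+)(?P<unit>[mhd])(?:\s+(?P<comment>.+))?$"
)
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_downtime(message):
    """Return (host, duration_seconds, comment), or None if not a downtime command."""
    match = _COMMAND_RE.match(message.strip())
    if match is None:
        return None
    duration = int(match["amount"]) * _UNIT_SECONDS[match["unit"]]
    return match["host"], duration, match["comment"] or "downtime from IRC"


def external_command(host, duration, author, comment, now=None):
    """Build a fixed SCHEDULE_HOST_DOWNTIME line starting at `now`."""
    start = int(time.time()) if now is None else int(now)
    end = start + duration
    # ';' separates fields in external commands, so keep it out of the comment.
    safe_comment = comment.replace(";", ",")
    return (
        f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
        f"1;0;{duration};{author};{safe_comment}"
    )
```

For example, `parse_downtime("!downtime db1001 2h reimaging")` yields the host, a 7200-second duration, and the comment, which can then be passed to `external_command` together with the IRC nick as the author.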

De-noise puppet failed runs (T229262)

Event Timeline

Change 525502 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: aggregate puppet failure percent by cluster

https://gerrit.wikimedia.org/r/525502

Change 525511 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: add logstash 5xx dashboard to availability alerts

https://gerrit.wikimedia.org/r/525511

Change 525512 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: calculate nginx/varnish availability over 2m too

https://gerrit.wikimedia.org/r/525512

Change 525511 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: add logstash 5xx dashboard to availability alerts

https://gerrit.wikimedia.org/r/525511

Change 525502 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: aggregate puppet failure percent by cluster

https://gerrit.wikimedia.org/r/525502

Change 525512 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: calculate nginx/varnish availability over 2m too

https://gerrit.wikimedia.org/r/525512

Change 525535 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Consolidate 'critical' and 'contact groups' logic

https://gerrit.wikimedia.org/r/525535

Change 525536 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: tweak description for paging alerts

https://gerrit.wikimedia.org/r/525536

Regarding change 525511 ([operations/puppet@production] monitoring: add logstash 5xx dashboard to availability alerts, https://gerrit.wikimedia.org/r/525511):

This didn't turn out as I expected because of URL encoding. The link on IRC is https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X (the # is encoded as %23), and that results in a 404 from Kibana: {"statusCode":404,"error":"Not Found","message":"Not Found"}.
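The effect can be reproduced with a few lines of Python. This is only an illustration of the encoding behaviour, not the code path that renders the alert URL: percent-encoding the whole path turns the `#` that Kibana uses for client-side routing into `%23`, so the server sees a literal path it does not know. Marking `#` as safe keeps the fragment intact.

```python
from urllib.parse import quote

BASE = "https://logstash.wikimedia.org"
DASHBOARD_PATH = "/app/kibana#/dashboard/Varnish-Webrequest-50X"

# Default quoting only treats '/' as safe, so '#' becomes '%23'.
broken = BASE + quote(DASHBOARD_PATH)

# Treating '#' as safe preserves the fragment separator.
working = BASE + quote(DASHBOARD_PATH, safe="/#")

print(broken)
print(working)
```

The first URL matches the broken link seen on IRC; the second keeps `kibana#/dashboard/...` as intended.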

Change 526118 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: split puppet failed runs metrics

https://gerrit.wikimedia.org/r/526118

Change 526118 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: split puppet failed runs metrics

https://gerrit.wikimedia.org/r/526118

Change 525535 merged by Filippo Giunchedi:
[operations/puppet@production] Consolidate 'critical' and 'contact groups' logic

https://gerrit.wikimedia.org/r/525535

Change 527465 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: fix HTTP availability dashboard links

https://gerrit.wikimedia.org/r/527465

Change 527465 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: fix HTTP availability dashboard links

https://gerrit.wikimedia.org/r/527465

Following up on change 525511 (https://gerrit.wikimedia.org/r/525511) and the URL-encoding problem with the logstash dashboard link on IRC (kibana%23 leading to a 404 from Kibana): fixed in Ie4059468bfb47d1a

Change 525536 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: tweak description for paging alerts

https://gerrit.wikimedia.org/r/525536

Change 528733 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: start collecting mediawiki aggregated stats

https://gerrit.wikimedia.org/r/528733

Change 528733 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: start collecting mediawiki aggregated stats

https://gerrit.wikimedia.org/r/528733

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.

Change 530080 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: stop monitoring individual daemons

https://gerrit.wikimedia.org/r/530080

Change 530080 merged by Filippo Giunchedi:
[operations/puppet@production] swift: stop monitoring individual daemons

https://gerrit.wikimedia.org/r/530080

Change 531690 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] wdqs: improve alert description

https://gerrit.wikimedia.org/r/531690

Change 531690 merged by Filippo Giunchedi:
[operations/puppet@production] wdqs: improve alert description

https://gerrit.wikimedia.org/r/531690

Change 532707 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: bump logstash rate of ingestion threshold

https://gerrit.wikimedia.org/r/532707

Change 532707 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: bump logstash rate of ingestion threshold

https://gerrit.wikimedia.org/r/532707

fgiunchedi claimed this task.

Resolving as this is complete. The ipsec alerts subtask is still open, pending a firing of the legacy/spammy alerts to compare against the new ones, but is otherwise done. The systemd alerts have been stalled pending better aggregation/grouping capabilities.