Reduce Icinga alert noise
Closed, Resolved (Public)

Assigned To

Authored By

	fgiunchedi
	Jul 24 2019, 2:15 PM

Description

It has been observed multiple times that Icinga alerts can be noisy, especially on IRC, and during incidents the noise can be very distracting. In particular, the following should help improve the signal-to-noise ratio:

Replace host-level IRC alerts with equivalent service-level.

Especially on IRC, there is often no need for notifications about single hosts (e.g. CPU alerts, dpkg broken). In some cases these host-level alerts make more sense aggregated (e.g. per cluster), or shown in the Icinga UI only rather than sent to IRC.

[x] Alerts that page should say so

At the moment it is impossible to tell whether a given alert has paged folks. A paging alert indicates a serious issue and an expected level of response, so explicitly marking paging alerts will help pick out serious issues (e.g. from IRC).
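As a rough illustration of the idea, the sketch below appends a marker to an alert's description when the alert pages. The marker string `#page` and the function name `describe_alert` are assumptions made for this example; the real change lives in the puppet monitoring code and may look quite different.

```python
# Hypothetical marker; the exact string used in production is an assumption.
PAGE_MARKER = "#page"


def describe_alert(description: str, pages: bool) -> str:
    """Return the alert description, tagged with PAGE_MARKER if the alert pages."""
    if pages and PAGE_MARKER not in description:
        return f"{description} {PAGE_MARKER}"
    return description


if __name__ == "__main__":
    print(describe_alert("Varnish 5xx rate", pages=True))
    print(describe_alert("dpkg broken", pages=False))
```

With a marker like this, anyone reading IRC can search or highlight on a single token to find the alerts that demand an immediate response.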

[stretch] Downtime hosts from IRC

It would be useful if folks could downtime hosts from IRC, in a similar fashion to how we !log. This helps during incidents, since we're on IRC anyway and the Icinga UI can be clunky/slow; the same goes for logging into the Icinga host and issuing downtime-host for each host.
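One possible shape for such a command is sketched below. The `!downtime <host> <duration> [reason]` syntax, the duration suffixes and the `parse_downtime` helper are all hypothetical and chosen only to mirror the `!log` style; nothing here reflects an existing bot.

```python
import re
from datetime import timedelta

# Hypothetical syntax: "!downtime <host> <number><m|h|d> [reason]".
_COMMAND = re.compile(
    r"^!downtime\s+(?P<host>\S+)\s+(?P<amount>\d+)(?P<unit>[mhd])"
    r"(?:\s+(?P<reason>.*))?$"
)
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_downtime(message: str):
    """Parse an IRC line into (host, duration, reason), or None if it is not a downtime command."""
    match = _COMMAND.match(message.strip())
    if match is None:
        return None
    duration = timedelta(**{_UNITS[match["unit"]]: int(match["amount"])})
    return match["host"], duration, match["reason"] or ""


if __name__ == "__main__":
    print(parse_downtime("!downtime host1001 2h rebooting for kernel upgrade"))
    print(parse_downtime("!log restarted varnish"))
```

A bot would still need to authenticate the sender and hand the parsed values to whatever actually schedules the downtime; the sketch covers only the parsing step.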

De-noise puppet failed runs T229262
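One of the changes on this task is titled "prometheus: aggregate puppet failure percent by cluster". The sketch below illustrates that kind of aggregation in Python: many per-host failure states are collapsed into one percentage per cluster, so a single alert can fire for the cluster instead of one per host. The function names, the sample cluster names and the 25% threshold are assumptions for illustration, not the actual Prometheus rule.

```python
from collections import defaultdict


def failure_percent_by_cluster(hosts):
    """Given (cluster, failed) pairs, return the percent of failed hosts per cluster."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for cluster, failed in hosts:
        totals[cluster] += 1
        if failed:
            failures[cluster] += 1
    return {cluster: 100.0 * failures[cluster] / totals[cluster] for cluster in totals}


def clusters_over_threshold(percentages, threshold):
    """Return the clusters whose failure percentage exceeds the threshold, sorted by name."""
    return sorted(c for c, pct in percentages.items() if pct > threshold)


if __name__ == "__main__":
    # Hypothetical sample data: (cluster name, did the last puppet run fail?)
    sample = [("appserver", True), ("appserver", False), ("swift", False), ("swift", False)]
    percentages = failure_percent_by_cluster(sample)
    print(percentages)
    print(clusters_over_threshold(percentages, threshold=25.0))
```

A single failed host in a large cluster then stays below the threshold and is visible in the UI only, while a widespread failure still produces one notification.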

Details

Subject	Repo	Branch	Lines +/-
prometheus: bump logstash rate of ingestion threshold	operations/puppet	production	+2 -2
wdqs: improve alert description	operations/puppet	production	+1 -1
swift: stop monitoring individual daemons	operations/puppet	production	+6 -36
prometheus: start collecting mediawiki aggregated stats	operations/puppet	production	+15 -0
monitoring: tweak description for paging alerts	operations/puppet	production	+13 -4
monitoring: fix HTTP availability dashboard links	operations/puppet	production	+2 -3
Consolidate 'critical' and 'contact groups' logic	operations/puppet	production	+4 -9
prometheus: split puppet failed runs metrics	operations/puppet	production	+4 -2
prometheus: calculate nginx/varnish availability over 2m too	operations/puppet	production	+11 -2
prometheus: aggregate puppet failure percent by cluster	operations/puppet	production	+5 -0
monitoring: add logstash 5xx dashboard to availability alerts	operations/puppet	production	+4 -2

Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T228379 Improve our alerting capabilities (Q1 goal FY19-20)
Resolved	fgiunchedi	T228878 Reduce Icinga alert noise
Resolved	fgiunchedi	T229262 De-noise puppet failed runs (Reduce Icinga alert noise goal)
Resolved	herron	T230236 De-noise ipsec alerts (Reduce Icinga alert noise goal)
Resolved	fgiunchedi	T230396 De-noise per-host API appservers high CPU usage
Stalled	None	T230570 De-noise systemd alerts (Reduce Icinga alert noise goal)
Resolved	fgiunchedi	T232303 Tweak widespread puppet failures for small sites
Resolved	fgiunchedi	T260154 De-noise "Ensure local MW versions match expected deployment" alerts

Event Timeline

See also T225140: Icinga alerts that should open tasks instead of alerting about opening tasks, and T223458: mgmt outages for cloud* systems seem to page everyone, which is about splitting off WMCS contacts, especially for pages.

Change 525502 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: aggregate puppet failure percent by cluster

https://gerrit.wikimedia.org/r/525502

gerritbot added a project: Patch-For-Review. Jul 25 2019, 9:07 AM

fgiunchedi mentioned this in T228966: Passenger stderr warnings for regex and htpasswd.rb. Jul 25 2019, 9:14 AM

fgiunchedi updated the task description. Jul 25 2019, 9:33 AM

Change 525511 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: add logstash 5xx dashboard to availability alerts

https://gerrit.wikimedia.org/r/525511

Change 525512 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: calculate nginx/varnish availability over 2m too

https://gerrit.wikimedia.org/r/525512

Change 525511 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: add logstash 5xx dashboard to availability alerts

https://gerrit.wikimedia.org/r/525511

Change 525502 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: aggregate puppet failure percent by cluster

https://gerrit.wikimedia.org/r/525502

Change 525512 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: calculate nginx/varnish availability over 2m too

https://gerrit.wikimedia.org/r/525512

Maintenance_bot removed a project: Patch-For-Review. Jul 26 2019, 9:10 AM

Change 525535 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Consolidate 'critical' and 'contact groups' logic

https://gerrit.wikimedia.org/r/525535

Change 525536 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: tweak description for paging alerts

https://gerrit.wikimedia.org/r/525536

herron moved this task from Inbox to In progress on the observability board. Jul 26 2019, 4:30 PM

herron mentioned this in T228379: Improve our alerting capabilities (Q1 goal FY19-20).

In T228878#5364485, @gerritbot wrote:

Change 525511 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: add logstash 5xx dashboard to availability alerts

https://gerrit.wikimedia.org/r/525511

This didn't turn out as I expected because of URL encoding: the link on IRC is https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X, which results in a 404 from Kibana: {"statusCode":404,"error":"Not Found","message":"Not Found"}.
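The failure mode can be reproduced with Python's standard library. The sketch assumes the intended link contained a literal `#` (which is what `%23` decodes to); how the alert pipeline actually encoded the URL is not stated here, so `quote` only stands in for whatever did the escaping.

```python
from urllib.parse import quote, urlsplit

# Assumed intended link: "%23" in the broken URL decoded back to "#".
dashboard_url = "https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X"

# Escaping "#" turns the fragment separator into part of the path.
over_encoded = quote(dashboard_url, safe=":/")

# Keeping "#" in the safe set leaves the fragment intact.
preserved = quote(dashboard_url, safe=":/#")

if __name__ == "__main__":
    print(over_encoded)
    print(urlsplit(over_encoded).path)      # the server is asked for this whole path
    print(urlsplit(preserved).path)         # only /app/kibana is sent to the server
    print(urlsplit(preserved).fragment)     # the dashboard route stays client-side
```

Once `#` is escaped, the server receives `/app/kibana%23/dashboard/...` as a path it does not know, which is consistent with the 404 above; with the fragment preserved, only `/app/kibana` is requested and the rest is handled in the browser.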

Change 526118 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: split puppet failed runs metrics

https://gerrit.wikimedia.org/r/526118

Change 526118 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: split puppet failed runs metrics

https://gerrit.wikimedia.org/r/526118

herron updated the task description. Jul 29 2019, 6:26 PM

Change 525535 merged by Filippo Giunchedi:
[operations/puppet@production] Consolidate 'critical' and 'contact groups' logic

https://gerrit.wikimedia.org/r/525535

Change 527465 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: fix HTTP availability dashboard links

https://gerrit.wikimedia.org/r/527465

Change 527465 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: fix HTTP availability dashboard links

https://gerrit.wikimedia.org/r/527465

In T228878#5372226, @fgiunchedi wrote:

In T228878#5364485, @gerritbot wrote:

Change 525511 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: add logstash 5xx dashboard to availability alerts

https://gerrit.wikimedia.org/r/525511

This didn't turn out as I expected because of URL encoding: the link on IRC is https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X, which results in a 404 from Kibana: {"statusCode":404,"error":"Not Found","message":"Not Found"}.

Fixed in Ie4059468bfb47d1a

Change 525536 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: tweak description for paging alerts

https://gerrit.wikimedia.org/r/525536

Maintenance_bot removed a project: Patch-For-Review. Aug 5 2019, 9:10 AM

JHedden mentioned this in T229787: Toolforge: sudden issues in both gridengine and k8s webservices. Aug 5 2019, 1:29 PM

Change 528733 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: start collecting mediawiki aggregated stats

https://gerrit.wikimedia.org/r/528733

gerritbot added a project: Patch-For-Review. Aug 7 2019, 10:10 AM

Change 528733 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: start collecting mediawiki aggregated stats

https://gerrit.wikimedia.org/r/528733

Maintenance_bot removed a project: Patch-For-Review. Aug 7 2019, 3:10 PM

fgiunchedi updated the task description. Aug 13 2019, 1:47 PM

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.

Change 530080 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] swift: stop monitoring individual daemons

https://gerrit.wikimedia.org/r/530080

gerritbot added a project: Patch-For-Review. Aug 14 2019, 8:00 AM

fgiunchedi closed subtask T229262: De-noise puppet failed runs (Reduce Icinga alert noise goal) as Resolved. Aug 14 2019, 1:24 PM

Change 530080 merged by Filippo Giunchedi:
[operations/puppet@production] swift: stop monitoring individual daemons

https://gerrit.wikimedia.org/r/530080

Maintenance_bot removed a project: Patch-For-Review. Aug 19 2019, 8:10 AM

Change 531690 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] wdqs: improve alert description

https://gerrit.wikimedia.org/r/531690

gerritbot added a project: Patch-For-Review. Aug 22 2019, 12:49 PM

Change 531690 merged by Filippo Giunchedi:
[operations/puppet@production] wdqs: improve alert description

https://gerrit.wikimedia.org/r/531690

Maintenance_bot removed a project: Patch-For-Review. Aug 22 2019, 1:10 PM

Change 532707 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: bump logstash rate of ingestion threshold

https://gerrit.wikimedia.org/r/532707

gerritbot added a project: Patch-For-Review. Aug 27 2019, 1:05 PM

Change 532707 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: bump logstash rate of ingestion threshold

https://gerrit.wikimedia.org/r/532707

Maintenance_bot removed a project: Patch-For-Review. Aug 28 2019, 9:10 AM

fgiunchedi closed subtask T230396: De-noise per-host API appservers high CPU usage as Resolved. Sep 4 2019, 2:58 PM

fgiunchedi closed subtask T232303: Tweak widespread puppet failures for small sites as Resolved. Sep 20 2019, 7:37 AM

herron changed the status of subtask T230570: De-noise systemd alerts (Reduce Icinga alert noise goal) from Open to Stalled. Sep 25 2019, 2:19 PM

Resolving as this is complete. The ipsec alerts subtask is still open, pending a firing of the legacy/spammy alerts to compare against the new ones, but is otherwise done. The systemd alerts subtask is stalled pending better aggregation/grouping capabilities.

fgiunchedi mentioned this in T236379: Include #_page on host alerts that page SRE. Oct 24 2019, 12:46 PM

herron closed subtask T230236: De-noise ipsec alerts (Reduce Icinga alert noise goal) as Resolved. Nov 8 2019, 9:22 PM

fgiunchedi closed subtask T260154: De-noise "Ensure local MW versions match expected deployment" alerts as Resolved. Oct 27 2022, 1:50 PM
