Details
Status | Assigned | Task
---|---|---
Resolved | fgiunchedi | T228379 Improve our alerting capabilities (Q1 goal FY19-20)
Resolved | fgiunchedi | T228878 Reduce Icinga alert noise
Resolved | fgiunchedi | T229262 De-noise puppet failed runs (Reduce Icinga alert noise goal)
Resolved | herron | T230236 De-noise ipsec alerts (Reduce Icinga alert noise goal)
Resolved | fgiunchedi | T230396 De-noise per-host API appservers high CPU usage
Stalled | None | T230570 De-noise systemd alerts (Reduce Icinga alert noise goal)
Resolved | fgiunchedi | T232303 Tweak widespread puppet failures for small sites
Resolved | fgiunchedi | T260154 De-noise "Ensure local MW versions match expected deployment" alerts
Resolved | fgiunchedi | T228879 Produce and circulate an alerting roadmap
Resolved | fgiunchedi | T228880 Establish periodic alerts reviews, complete one by EOQ
Resolved | fgiunchedi | T230413 Include acknowledge information in icinga emails
Event Timeline
Recording here other ideas that emerged from IRC/chats, for tracking (relevant, but not necessarily part of this quarter's goal):
- There's confusion between is_critical in Puppet, which marks an alert as paging, and Icinga's CRITICAL state
- It is sometimes easy to miss an alert notification on IRC; a meta-alert about long-standing unacknowledged CRITICALs might be useful
- Once IRC alert noise is down, we should consider enabling Icinga's repeated notifications for unacked alerts ("obsess over service", if I'm not mistaken)
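To make the meta-alert idea a bit more concrete, here is a minimal sketch of the selection logic only. Everything in it is an assumption for illustration: the ServiceStatus record, its field names, and the one-hour threshold are hypothetical and do not correspond to Icinga's actual status format or to any existing check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class ServiceStatus:
    """Hypothetical, simplified view of one service check result."""
    host: str
    service: str
    state: str                    # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    acknowledged: bool
    last_state_change: datetime


def stale_unacked_criticals(
    statuses: Iterable[ServiceStatus],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # assumed threshold, not agreed anywhere
) -> List[ServiceStatus]:
    """Return CRITICAL services that nobody has acknowledged for at least max_age."""
    return [
        s
        for s in statuses
        if s.state == "CRITICAL"
        and not s.acknowledged
        and now - s.last_state_change >= max_age
    ]


if __name__ == "__main__":
    now = datetime(2019, 8, 1, 12, 0)
    sample = [
        ServiceStatus("db1", "disk space", "CRITICAL", False, now - timedelta(hours=3)),
        ServiceStatus("mw1", "puppet last run", "CRITICAL", True, now - timedelta(hours=5)),
        ServiceStatus("cp1", "ipsec", "CRITICAL", False, now - timedelta(minutes=10)),
    ]
    for s in stale_unacked_criticals(sample, now):
        print(f"{s.host}: {s.service} unacknowledged since {s.last_state_change}")
```

A meta-alert could then fire whenever the returned list is non-empty. Where the status data would come from and how the result would be reported back to Icinga are deliberately left out.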
Change 528410 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] icinga: add /alerts shortcut for faster ack'ing
Change 528462 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring::host: rename critical to paging
Change 528463 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring::service rename critical to paging
Change 528410 merged by Filippo Giunchedi:
[operations/puppet@production] icinga: add /alerts shortcut for faster ack'ing
Change 532335 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: alert on availability over two minutes
Change 528462 abandoned by Filippo Giunchedi:
monitoring::host: rename critical to paging
Reason:
See comments
Change 528463 abandoned by Filippo Giunchedi:
monitoring::service rename critical to paging
Reason:
See Icc4fee67d44
Change 532335 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: alert on availability over two minutes