
Improve our alerting capabilities (Q1 goal FY19-20)
Closed, Resolved (Public)

Description

  • Produce and circulate an alerting infrastructure roadmap (T228879)
  • Establish periodic alert reviews and complete one by end of quarter (T228880)
  • Reduce Icinga alert noise (T228878)

Event Timeline

herron updated the task description.
herron updated the task description.

Recording here, for tracking, other ideas that emerged from IRC and chats (relevant, but not necessarily part of this quarter's goal):

  • There's confusion between is_critical in Puppet, which marks an alert as paging, and Icinga's CRITICAL state
  • It is sometimes easy to miss an alert notification on IRC; a meta-alert about long-standing unacknowledged CRITICALs might be useful
  • Once IRC alert noise is down, we should consider enabling Icinga's repeated notifications for unacked alerts ("obsess over service", if I'm not mistaken)
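
The meta-alert idea above could be sketched roughly as follows. This is only an illustration in Python, not how Icinga or the puppet repository implement anything: the ServiceStatus record, the function names, and the one-hour threshold are all hypothetical, and a real check would have to read its status data from Icinga itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ServiceStatus:
    """Hypothetical snapshot of one monitored service (not an Icinga type)."""
    host: str
    service: str
    state: str                   # e.g. "OK", "WARNING", "CRITICAL"
    acknowledged: bool
    last_state_change: datetime


def stale_unacked_criticals(
    statuses: List[ServiceStatus], now: datetime, max_age: timedelta
) -> List[ServiceStatus]:
    """Return CRITICAL services that nobody has acknowledged for at least max_age."""
    return [
        s
        for s in statuses
        if s.state == "CRITICAL"
        and not s.acknowledged
        and now - s.last_state_change >= max_age
    ]


def meta_alert(
    statuses: List[ServiceStatus],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # assumed threshold
) -> str:
    """Summarise long-standing unacknowledged CRITICALs as a single status line."""
    stale = stale_unacked_criticals(statuses, now, max_age)
    if not stale:
        return "OK: no long-standing unacknowledged CRITICALs"
    names = ", ".join(f"{s.host}/{s.service}" for s in stale)
    return f"CRITICAL: {len(stale)} unacknowledged for too long: {names}"
```

Keeping the filter separate from the message formatting means the same selection could also feed a repeated-notification mechanism, should that be preferred over a dedicated meta-alert.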

Change 528410 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] icinga: add /alerts shortcut for faster ack'ing

https://gerrit.wikimedia.org/r/528410

Change 528462 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring::host: rename critical to paging

https://gerrit.wikimedia.org/r/528462

Change 528463 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring::service rename critical to paging

https://gerrit.wikimedia.org/r/528463

Change 528410 merged by Filippo Giunchedi:
[operations/puppet@production] icinga: add /alerts shortcut for faster ack'ing

https://gerrit.wikimedia.org/r/528410

Change 532335 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] monitoring: alert on availability over two minutes

https://gerrit.wikimedia.org/r/532335

Change 528462 abandoned by Filippo Giunchedi:
monitoring::host: rename critical to paging

Reason:
See comments

https://gerrit.wikimedia.org/r/528462

Change 528463 abandoned by Filippo Giunchedi:
monitoring::service rename critical to paging

Reason:
See Icc4fee67d44

https://gerrit.wikimedia.org/r/528463

Change 532335 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: alert on availability over two minutes

https://gerrit.wikimedia.org/r/532335

fgiunchedi claimed this task.
fgiunchedi updated the task description.

Completed! See T228878 for subtask status