
Streamline WMCS Alerting and Paging
Open, Needs Triage, Public

Description

In support of T310598.

Let's audit all existing alerts and on-call schedules with the following goals in mind:

  1. Improve paging by prioritizing team members who are awake.
  2. Prioritize alerts so that pages occur only for critical events.
  3. Remove unneeded alerts. Move informational alerts from pages to automated tickets filed via phaultfinder (https://phabricator.wikimedia.org/p/phaultfinder/).
  4. Make alertmanager the single source of truth and the interface for visualizing and responding to alerts.
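
To make goals 1 to 3 concrete, here is a minimal Python sketch of the intended routing behaviour. It is an illustration only, not the Alertmanager configuration or any existing WMCS tooling. The names `Member`, `is_awake`, `paging_order` and `route`, the 08:00 to 22:00 waking window, the fixed UTC offsets (which ignore daylight saving time) and the severity labels are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone

# Hypothetical waking window; a real schedule would come from each person.
AWAKE_START = time(8, 0)
AWAKE_END = time(22, 0)


@dataclass(frozen=True)
class Member:
    name: str
    utc_offset_hours: int  # simplification: fixed offset, no DST handling


def is_awake(member: Member, now_utc: datetime) -> bool:
    """Return True if the member's local time falls inside the waking window."""
    local_tz = timezone(timedelta(hours=member.utc_offset_hours))
    local_time = now_utc.astimezone(local_tz).time()
    return AWAKE_START <= local_time < AWAKE_END


def paging_order(members: list[Member], now_utc: datetime) -> list[Member]:
    """Goal 1: page awake members first, keeping the original order otherwise."""
    # sorted() is stable and False sorts before True, so awake members lead.
    return sorted(members, key=lambda m: not is_awake(m, now_utc))


def route(severity: str) -> str:
    """Goals 2 and 3: only critical alerts page; everything else becomes a ticket."""
    return "page" if severity == "critical" else "ticket"


if __name__ == "__main__":
    team = [Member("alice", 2), Member("bob", -7), Member("carol", 9)]
    now = datetime(2022, 7, 15, 3, 0, tzinfo=timezone.utc)
    print([m.name for m in paging_order(team, now)])
    print(route("critical"), route("warning"))
```

In practice this logic would most likely live in Alertmanager routing rules and on-call schedules rather than in standalone code; the sketch only shows the decision we want those rules to encode.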

Related Objects

Status     Subtype     Assigned        Task
Open                   dcaro
Resolved               dcaro
Open                   dcaro
Open                   dcaro
Resolved               dcaro
Resolved               dcaro
Duplicate              dcaro
Resolved               dcaro
Resolved               dcaro
Resolved               dcaro
Resolved               andrea.denisse
Resolved               andrea.denisse
Resolved   BUG REPORT  andrea.denisse
Resolved   BUG REPORT  andrea.denisse
Open       BUG REPORT  andrea.denisse
Resolved               andrea.denisse
Resolved               andrea.denisse
Resolved               taavi
Resolved               dcaro
Resolved               taavi
Resolved               taavi
Resolved               taavi
Resolved               JHedden
Resolved               JHedden
Resolved               Bstorm
Resolved               bd808
Resolved               Andrew
Declined               None
Open                   None
Resolved               nskaggs
Open                   None
Open                   None
Resolved               taavi
Resolved               taavi
Open                   None
Open                   None
Resolved               taavi
Open                   None
Open                   None
Resolved               taavi
Resolved               taavi
Open                   None
Open                   None
Resolved               taavi
Resolved               jbond
Resolved               taavi
Resolved               taavi
Resolved               taavi
Open                   None
Open                   taavi
Open                   None
Resolved               taavi
Resolved               dcaro
Resolved               Andrew
Resolved               dcaro

Event Timeline

Change 813898 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] labstore: Send prom stats for getent_check

https://gerrit.wikimedia.org/r/813898

Change 813915 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/alerts@master] wmcs: add ldap getent speed alerts

https://gerrit.wikimedia.org/r/813915

Change 813898 abandoned by David Caro:

[operations/puppet@production] labstore: Send prom stats for getent_check

Reason:

https://gerrit.wikimedia.org/r/813898