
Database alerting
Closed, Resolved (Public)

Description

This is an epic task to gather all database alerting work. It covers only alerts for potential emergencies (a.k.a. Icinga), not monitoring of trends or graphing. There are currently too many false positives and some gaps in the alerting, so the tools and the model have to change.

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-07-28_s5_(WikiData_and_dewiki)_read-only

Related Objects

Status      Assigned
Resolved    None
Resolved    jcrespo
Resolved    jcrespo
Declined    None
Resolved    aaron
Resolved    jcrespo
Resolved    Dzahn
Resolved    CDanis
Resolved    Volans
Resolved    CDanis
Resolved    CDanis
Resolved    Marostegui
Open        None
Resolved    jcrespo
Open        None
Resolved    Kormat
Resolved    jcrespo
Resolved    hashar
Declined    LSobanski
Open        None
Resolved    Kormat
Resolved    Marostegui
Resolved    Papaul
Open        None
Resolved    Kormat

Event Timeline

jcrespo triaged this task as Medium priority. (Aug 4 2017, 8:41 AM)
jcrespo moved this task from Triage to Meta/Epic on the DBA board.

Change 595149 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] monitoring: remove usages of 'dba' contact group

https://gerrit.wikimedia.org/r/595149

Change 595153 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Default monitor disk paging to false

https://gerrit.wikimedia.org/r/595153

Change 595149 merged by Jcrespo:
[operations/puppet@production] monitoring: remove usages of 'dba' contact group

https://gerrit.wikimedia.org/r/595149

Change 595153 merged by Jcrespo:
[operations/puppet@production] mariadb: Default monitor disk & process paging to false

https://gerrit.wikimedia.org/r/595153

LSobanski renamed this task from "Improve database alerting (tracking)" to "Database alerting". (May 14 2021, 10:37 AM)
Ladsgroup subscribed.

Database alerting in general still needs improvement, and we have made a lot of progress since this ticket was created. But tickets usually shouldn't be kept open forever. The subtickets remain open, and that should be enough. If needed, we can always create a new ticket for a specific or goal-oriented case ("switch all alerts to Prometheus", for example) and close it once done.