Database alerting
Closed, Resolved (Public)

Description

This is an epic task to gather all database alerting work. It covers only potential emergencies (a.k.a. Icinga alerts), not the monitoring of trends or graphing. There are currently too many false positives and some gaps in the alerting, so the tools and the model have to change.

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/2017-07-28_s5_(WikiData_and_dewiki)_read-only

Related Objects

Status    Assigned
Resolved  None
Resolved  jcrespo
Resolved  jcrespo
Declined  None
Resolved  aaron
Resolved  jcrespo
Resolved  Dzahn
Resolved  CDanis
Resolved  Volans
Resolved  CDanis
Resolved  CDanis
Resolved  Marostegui
Open      None
Resolved  jcrespo
Open      None
Resolved  Kormat
Resolved  jcrespo
Resolved  hashar
Declined  LSobanski
Open      None
Resolved  Kormat
Resolved  Marostegui
Resolved  Papaul
Open      None
Resolved  Kormat

Event Timeline

jcrespo triaged this task as Medium priority. (Aug 4 2017, 8:41 AM)
jcrespo moved this task from Triage to Meta/Epic on the DBA board.

Change 595149 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] monitoring: remove usages of 'dba' contact group

https://gerrit.wikimedia.org/r/595149

Change 595153 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Default monitor disk paging to false

https://gerrit.wikimedia.org/r/595153

Change 595149 merged by Jcrespo:
[operations/puppet@production] monitoring: remove usages of 'dba' contact group

https://gerrit.wikimedia.org/r/595149

Change 595153 merged by Jcrespo:
[operations/puppet@production] mariadb: Default monitor disk & process paging to false

https://gerrit.wikimedia.org/r/595153

LSobanski renamed this task from "Improve database alerting (tracking)" to "Database alerting". (May 14 2021, 10:37 AM)
Ladsgroup subscribed.

Database alerting in general still needs improvement, and we have made a lot of progress since this ticket was created. But tickets usually shouldn't be kept open forever. The subtickets remain open, and that should be enough. If needed, we can always create a new ticket for specific or goal-oriented cases ("switch all alerts to Prometheus", for example) and close it once done.