
Create a dashboard for database backups monitoring/reporting
Open, High, Public

Description

Create a dashboard that gives a better understanding of the state of database backups (success/failure, delays, current status, logs, etc.). If possible, also expose a subset of that information as public metrics for Prometheus, so we can integrate them with the alerting infrastructure. Also provide an inventory of the objects backed up, as well as historical trends.

Original description: As a person interested in backup status, I would like to have the ability to subscribe to alerts about failed / delayed backup tasks.

Details

Repo                            Branch      Lines +/-
operations/software/pampinus    master      +2 -0
operations/software/pampinus    master      +172 -103
operations/software/pampinus    master      +43 -10
operations/software/pampinus    master      +2 K -0
operations/puppet               production  +0 -12
operations/puppet               production  +36 -101
operations/software             master      +18 -0
operations/puppet               production  +38 -29
operations/puppet               production  +4 -0
operations/puppet               production  +489 -16
operations/puppet               production  +63 -14
operations/software/wmfbackups  master      +195 -103
labs/private                    master      +2 -0
operations/puppet               production  +0 -124
labs/private                    master      +0 -1
operations/puppet               production  +41 -3
labs/private                    master      +2 -3
labs/private                    master      +3 -0
operations/puppet               production  +0 -1
operations/puppet               production  +1 -0
operations/puppet               production  +5 -4
operations/puppet               production  +4 -4
operations/puppet               production  +158 -1

Event Timeline

jcrespo added subscribers: Marostegui, jcrespo.

This is not super urgent, but @LSobanski, could you think in more detail about exactly what kind of information about database backups you would like to have more accessible? @Marostegui and I will also think of use cases as backup maintainers and team members, but your unique perspective as both a manager and a general SRE is important to us.

  • Which level of detail would be adequate for you?
  • What are the things you would like to understand fastest and most clearly?
  • What would you prefer to be notified about (push) versus reading on demand (pull/dashboard)?
  • Do we need an API to interoperate with other systems (e.g. grafana, alert-manager, icinga, other automation, etc.)?
  • Are general statistics useful (not related directly to malfunction), like backup sizes, wiki sizes, table sizes, etc.?

For a list of all the data already available in the DB (there is a lot), one could check: https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Metadata. My question is how to present that in a more "friendly" way by filtering out the not-so-important information (your take on it), as we already have GBs of existing data.

Off-topic, but something to think about in the future is also how we want to present/integrate metadata about the several workflows of backups (database backups vs. general backups (bacula) vs. media backups, and more in the future). Or whether not being integrated is a feature, not a bug.

Some quick brain dump:

I think we first need to define what is acceptable and what is not, and only then present that information in one way or another.
To me, the key information is "are we ok backup-wise?", but defining what is ok and what is not is the hard part.

In my mind, what I would love to have would be a central place to see something high level like:

  • dataset to back up
  • last time it was backed up successfully
  • available number of "fresh" copies (not stored in long-term archive)
  • available number of copies stored in bacula (or any other long-term solution)
  • next time this backup is scheduled to run
  • average time it takes to finish a backup run (we'd need to define what a backup run is)
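
To make the list above concrete, here is a minimal sketch of what one row of such a high-level view could look like as a record. The class and field names are hypothetical, chosen only to mirror the items listed; they are not taken from the existing dbbackups metadata schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackupSummary:
    """One row of a hypothetical high-level overview (all names are illustrative)."""
    dataset: str                            # what is being backed up
    last_success: Optional[datetime]        # last time it was backed up successfully
    fresh_copies: int                       # copies not stored in the long-term archive
    archived_copies: int                    # copies in bacula or another long-term store
    next_run: Optional[datetime]            # next scheduled run, if known
    average_duration: Optional[timedelta]   # average time a backup run takes

    def age(self, now: datetime) -> Optional[timedelta]:
        """Time since the last successful backup, or None if there never was one."""
        if self.last_success is None:
            return None
        return now - self.last_success
```

Keeping `age` as a method rather than a stored field avoids stale values: it is always computed against the time the view is rendered.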

If I wanted to go low level, that would be a different place and with more specific data as you mentioned (file sizes, table sizes, backup total size) etc, but that's probably something secondary.

Along with the high level details, I believe having some documentation for the operators would be good:
i.e. the last available backup is too old (whatever "too old" means) -> follow this to get that metric back to green status.

This turned into a much bigger scope than my original intention but it's a good discussion to have. I am now thinking that my ask was not well defined and needs to be adjusted, especially given my dislike of using email for operational notification :)

The use case I was thinking of was receiving some sort of proactive notification (ideally a Phab task, alternatively an email) when whatever we consider to be a "bad state" is reached, e.g. multiple consecutive failed backups or a "too old" last backup (as Manuel said above).
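
As a rough illustration of the two "bad state" conditions mentioned here, the check could be sketched as below. Both thresholds and the function name are placeholders: the thread has not yet defined what "too old" means or how many consecutive failures are tolerable.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Placeholder thresholds; the real values are still to be agreed on.
MAX_AGE = timedelta(days=8)
MAX_CONSECUTIVE_FAILURES = 2


def bad_state_reasons(runs: List[Tuple[datetime, bool]], now: datetime) -> List[str]:
    """Return the reasons a dataset is in a "bad state"; an empty list means OK.

    `runs` holds (finish time, succeeded) pairs, in any order.
    """
    reasons = []
    ordered = sorted(runs, key=lambda r: r[0], reverse=True)  # newest first

    # Count the failures that happened after the most recent success.
    failures = 0
    for _, ok in ordered:
        if ok:
            break
        failures += 1
    if failures >= MAX_CONSECUTIVE_FAILURES:
        reasons.append(f"{failures} consecutive failed backups")

    # Check the age of the most recent successful run.
    last_success = next((t for t, ok in ordered if ok), None)
    if last_success is None:
        reasons.append("no successful backup found")
    elif now - last_success > MAX_AGE:
        age_days = (now - last_success).days
        reasons.append(f"last successful backup is {age_days} days old")
    return reasons
```

Returning a list of reasons rather than a boolean means a notification (a Phab task or an email) can say why the state is bad, which should help keep alerts actionable rather than spammy.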

LSobanski renamed this task from Backup alert email notification to Backup alert proactive notification. May 27 2021, 11:41 AM
LSobanski updated the task description.

Since this is a "complicated" topic, do you mind talking in our 1:1 about its challenges? I mean not just doing as requested (I don't mind spamming you with notifications if you really want them, :-D), but doing it efficiently, without it turning into useless spam, and with few false positives (which is an open issue, here and for other cases).

For the "metadata dashboard" usage, something like https://github.com/nocodb/nocodb could be interesting to evaluate.

jcrespo raised the priority of this task from Low to High. May 9 2022, 9:48 AM
jcrespo moved this task from Refine to In Progress on the Data-Persistence-Backup board.

I think this is partially covered by the icinga and prometheus monitoring that was set up long ago, plus the new dbbackup dashboard we are about to set up (including limited public prometheus support).

jcrespo renamed this task from Backup alert proactive notification to Create a dashboard for database backups monitoring/reporting. May 12 2022, 5:21 PM
jcrespo updated the task description.

Change 791414 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Setup backupmon1001 as a database backups monitoring service

https://gerrit.wikimedia.org/r/791414

Change 791560 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] alerting_host: Remove references to dbbackups monitoring

https://gerrit.wikimedia.org/r/791560

Change 791414 merged by Jcrespo:

[operations/puppet@production] dbbackups: Setup backupmon1001 as a database backups monitoring service

https://gerrit.wikimedia.org/r/791414

Change 793022 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Update definition to use Floats for type

https://gerrit.wikimedia.org/r/793022

Change 793023 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Temporarely disable notifications on backupmon hosts

https://gerrit.wikimedia.org/r/793023

Change 793022 merged by Jcrespo:

[operations/puppet@production] dbbackups: Update definition to use Floats for type

https://gerrit.wikimedia.org/r/793022

Change 793023 merged by Jcrespo:

[operations/puppet@production] dbbackups: Temporarely disable notifications on backupmon hosts

https://gerrit.wikimedia.org/r/793023

Change 793026 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reinstall backupmon1001 with bullseye

https://gerrit.wikimedia.org/r/793026

Change 793026 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reinstall backupmon1001 with bullseye

https://gerrit.wikimedia.org/r/793026

Change 793038 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reenable checks now that they are working as intended

https://gerrit.wikimedia.org/r/793038

Change 793038 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reenable checks now that they are working as intended

https://gerrit.wikimedia.org/r/793038

Change 793042 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] ddbackups: Remove old references to the check pass on the alert hosts

https://gerrit.wikimedia.org/r/793042

Change 793094 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] alert_host: Ensure packages and files from dbbackups check are gone

https://gerrit.wikimedia.org/r/793094

Change 793475 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] [WIP]django: Create custom django module and apply it to backupmon1001

https://gerrit.wikimedia.org/r/793475

Change 793498 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] dbbackups: Add django database password and secret key for pampinus

https://gerrit.wikimedia.org/r/793498

Change 793498 merged by Jcrespo:

[labs/private@master] dbbackups: Add django database password and secret key for pampinus

https://gerrit.wikimedia.org/r/793498

Change 793505 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] dbbackups: Fix hiera key formatting for db password and django secret

https://gerrit.wikimedia.org/r/793505

Change 793505 merged by Jcrespo:

[labs/private@master] dbbackups: Fix hiera key formatting for db password and django secret

https://gerrit.wikimedia.org/r/793505

Change 793094 merged by Jcrespo:

[operations/puppet@production] alert_host: Ensure packages and files from dbbackups check are gone

https://gerrit.wikimedia.org/r/793094

Change 793042 merged by Jcrespo:

[labs/private@master] ddbackups: Remove old references to the check pass on the alert hosts

https://gerrit.wikimedia.org/r/793042

Change 791560 merged by Jcrespo:

[operations/puppet@production] alerting_host: Remove references to dbbackups monitoring

https://gerrit.wikimedia.org/r/791560

Change 793771 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[labs/private@master] django: Add dummy django secret key and mysql pass to test compilation

https://gerrit.wikimedia.org/r/793771

Change 793771 merged by Jcrespo:

[labs/private@master] django: Add dummy django secret key and mysql pass to test compilation

https://gerrit.wikimedia.org/r/793771

Change 801657 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] check: Split common functionality to a WMFMetrics class

https://gerrit.wikimedia.org/r/801657

Change 801657 merged by Jcrespo:

[operations/software/wmfbackups@master] check: Split common functionality to a WMFMetrics class

https://gerrit.wikimedia.org/r/801657

Change 801741 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups::check: Add enabled flag to have a passive host on codfw

https://gerrit.wikimedia.org/r/801741

Change 801741 merged by Jcrespo:

[operations/puppet@production] dbbackups::check: Add enabled flag to have a passive host on codfw

https://gerrit.wikimedia.org/r/801741

Change 810885 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] Add new user for dbbackups database for django dashboard

https://gerrit.wikimedia.org/r/810885

Mentioned in SAL (#wikimedia-operations) [2022-07-04T12:38:22Z] <jynus> running alter table on dbbackups db T283017

Change 810885 merged by Jcrespo:

[operations/puppet@production] Add new user for dbbackups database for django dashboard

https://gerrit.wikimedia.org/r/810885

Change 817181 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] db_inventory: Cleanup zarcillo database grants

https://gerrit.wikimedia.org/r/817181

Change 817294 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/pampinus@master] Initial commit

https://gerrit.wikimedia.org/r/817294

This would be a typical warning scenario:

Screenshot_20220727_083857.png (161×2 px, 37 KB)

More status info:

Screenshot_20220726_171509.png (1×1 px, 162 KB)

Change 818088 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] Adapt mysql prometheus script to new zarcillo schema

https://gerrit.wikimedia.org/r/818088

Change 818538 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/pampinus@master] Add absolute number (bytes) changed & max staleness for backup status

https://gerrit.wikimedia.org/r/818538

Change 819025 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/pampinus@master] Attempt to follow Wikimedia's Design Style Guide

https://gerrit.wikimedia.org/r/819025

Change 820073 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/pampinus@master] Add the possibility of searching racks for instances, too

https://gerrit.wikimedia.org/r/820073

Change 820074 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software@master] [WIP]Add instance script with increased functionality over section

https://gerrit.wikimedia.org/r/820074

I made some CSS changes and uploaded a patch for them. Once that's deployed, I'll find a designer colleague and ask for a quick check and a few further improvements. I like the server view and generally think it should be split into its own service (probably forked from pampinus), but that's for later.

Change 817181 merged by Jcrespo:

[operations/puppet@production] db_inventory: Cleanup zarcillo database grants

https://gerrit.wikimedia.org/r/817181

Change 853950 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] zarcillo: Remove access to non-primary dc prometheus hosts

https://gerrit.wikimedia.org/r/853950

Change 853950 merged by Jcrespo:

[operations/puppet@production] zarcillo: Remove access to non-primary dc prometheus hosts

https://gerrit.wikimedia.org/r/853950

Change 817294 merged by Jcrespo:

[operations/software/pampinus@master] Initial commit

https://gerrit.wikimedia.org/r/817294

Change 818538 merged by Jcrespo:

[operations/software/pampinus@master] Add absolute number (bytes) changed & max staleness for backup status

https://gerrit.wikimedia.org/r/818538

Change 819025 merged by Jcrespo:

[operations/software/pampinus@master] Attempt to follow Wikimedia's Design Style Guide

https://gerrit.wikimedia.org/r/819025

Change 820073 merged by Jcrespo:

[operations/software/pampinus@master] Add the possibility of searching racks for instances, too

https://gerrit.wikimedia.org/r/820073