
Setup automation to detect long running backups
Closed, Resolved, Public

Description

Set up some kind of automation that checks backups periodically (e.g. every week), detects ongoing or finished backups that ran longer than a configurable amount of time (e.g. more than 12 hours), and sends an email listing them, similar to how data checks work.

This will serve 2 purposes:

  • Detect processes "stuck" for some unknown reason (this was already covered by the backup monitoring, but it is also useful on its own)
  • Give some hints about when ES clusters should be split to reduce both the backup time and, more importantly, the recovery time
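
As a rough illustration of the check described above, here is a minimal Python sketch. It is not the code from the patch: the `Backup` record, the `find_long_running` and `format_report` helpers, and the section names in the usage note are all hypothetical, and it assumes the backup start and end times are already available in memory. The 12-hour default comes from the example threshold in the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Backup:
    # Hypothetical record shape; the real backup metadata schema is not shown in this task.
    section: str
    start: datetime
    end: Optional[datetime]  # None while the backup is still ongoing


def find_long_running(
    backups: List[Backup],
    now: datetime,
    threshold: timedelta = timedelta(hours=12),  # configurable limit
) -> List[Tuple[Backup, timedelta]]:
    """Return (backup, duration) pairs whose run time exceeds the threshold."""
    flagged = []
    for backup in backups:
        # Ongoing backups are measured up to the current time.
        duration = (backup.end or now) - backup.start
        if duration > threshold:
            flagged.append((backup, duration))
    return flagged


def format_report(flagged: List[Tuple[Backup, timedelta]]) -> str:
    """Build a plain-text body that could be sent by email."""
    lines = []
    for backup, duration in flagged:
        state = "finished" if backup.end else "ongoing"
        hours = duration.total_seconds() / 3600
        lines.append(f"{backup.section}: {state}, {hours:.1f} h")
    return "\n".join(lines)
```

Calling `find_long_running(backups, datetime.now())` and passing the result to `format_report` gives one line per offending backup, for example `es4: ongoing, 16.0 h` for a hypothetical section `es4` that has been running for 16 hours. Treating ongoing and finished backups in the same loop serves both purposes listed above: stuck processes show up as "ongoing", and sections that are simply too large show up as "finished".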

Event Timeline

jcrespo triaged this task as Medium priority. Sep 13 2023, 12:40 PM

Code so far:

Screenshot_20230913_143059.png (369×1 px, 40 KB)

Change 957288 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Add new check (focused on ES) of long running backups

https://gerrit.wikimedia.org/r/957288

Maybe a link to a wikitech page about ES cluster rotate/split in the email? That's my only request.

This is the latest version:

Screenshot_20230914_195213.png (573×1 px, 128 KB)

The link points to https://wikitech.wikimedia.org/wiki/MariaDB/Backups/Long_running which provides the context and runbook for this alert. However, I think it should link to a different page for the actual cluster rolling change, as defined by the DBAs.

Change 957288 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add new check (focused on ES) of long running backups

https://gerrit.wikimedia.org/r/957288

Change 959166 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8

https://gerrit.wikimedia.org/r/959166

Change 959166 merged by Jcrespo:

[operations/puppet@production] dbbackups: Separate ensure file and 'ensure' job as followup to 51607b8

https://gerrit.wikimedia.org/r/959166

Change 959167 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Fix bug on config directory path: defaults -> default

https://gerrit.wikimedia.org/r/959167

Change 959167 merged by Jcrespo:

[operations/puppet@production] dbbackups: Fix bug on config directory path: defaults -> default

https://gerrit.wikimedia.org/r/959167

Deployed and manually tested. It should run automatically tomorrow.