
Improve regular production database backups handling
Open, Medium, Public

Description

  • Documentation, documentation, documentation
  • More flexibility: if possible per-table CSVs
  • More flexibility: Physical backups
  • Better recovery documentation: "one line to recover" (there is now a recover_section.py)
  • Faster point in time recovery/premade tools
  • Better compression
  • Prepare based on name of the backup, not just the section
  • Option optimization (e.g. double the use_memory)
  • More detailed health checks of backups (size, failures, objects, ...). E.g. check size is within a percentage of the previous backup.
  • Identify failures after a timeout (X amount of time passed) and make cleanup of leftover files easy (probably on T205627)
  • Purge old metadata and make sure logs are rotated T205627 T349360
  • Review and improve logging (beyond metadata)
  • 1 retry after initial failure
  • More optimization of certain database tables
  • Maybe some kind of locking of backups and/or transfer.py to prevent concurrent actions on the same source or target servers
  • Incremental/Differential backups T244884
  • Document (and potentially alert on) the last edit time of some sample tables (e.g. recentchanges or revision) to verify the source databases are up to date (e.g. if its master or intermediate master has replication stopped, or some other issue is causing recent backups to contain stale data)
  • Have a quick way to see which backup sources belong to each section (tendril, dashboard) (Done on puppet's site.pp)
  • Document and/or automate best server configuration for fast dump load (e.g. disable checksums, innodb transactionality, etc.)
  • Enable the possibility of editing per-table options such as the engine and compression
  • Work around the "myloader doesn't import empty dbs" bug
  • Binlog backups
  • Easier/automated recovery/backup testing
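
As a rough illustration of two of the items above (the "size is within a percentage of the previous backup" health check and the "1 retry after initial failure" behaviour), here is a minimal sketch. It is not taken from wmfbackups: the function names, parameters, and the 5% default are all hypothetical.

```python
def size_within_tolerance(current_bytes, previous_bytes, max_change_pct=5.0):
    """Return True if the new backup size is within max_change_pct percent
    of the previous one. All names and the 5% default are hypothetical."""
    if previous_bytes <= 0:
        # No usable baseline to compare against: treat as a failed check.
        return False
    change_pct = abs(current_bytes - previous_bytes) * 100.0 / previous_bytes
    return change_pct <= max_change_pct


def run_with_one_retry(backup_func):
    """Run backup_func, retrying exactly once if the first attempt raises."""
    try:
        return backup_func()
    except Exception:
        # Single retry; a second failure propagates to the caller.
        return backup_func()
```

A real check would probably also want separate warning and critical thresholds and some cleanup of leftover files before the retry, both of which this sketch leaves out.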

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160623-etherpad

Details

Repo | Branch | Lines +/-
operations/puppet | production | +2 -2
operations/software/wmfbackups | master | +9 -2
operations/puppet | production | +2 -2
operations/software/wmfbackups | master | +639 -1
operations/software/wmfbackups | master | +100 -31
operations/software/wmfbackups | master | +1 -0
operations/software/wmfbackups | master | +344 -124
operations/puppet | production | +20 -20
operations/puppet | production | +39 -39
operations/puppet | production | +56 -29
operations/software/wmfbackups | master | +101 -0
operations/puppet | production | +2 -2
operations/puppet | production | +4 -4
operations/puppet | production | +2 -0
operations/puppet | production | +2 -2
operations/puppet | production | +82 -82
labs/private | master | +1 -2
operations/puppet | production | +2 -2
operations/puppet | production | +24 -16
operations/software/wmfbackups | master | +53 -5
operations/puppet | production | +7 -225
operations/puppet | production | +9 -462
operations/puppet | production | +8 -1 K
operations/puppet | production | +15 -1 K
operations/software/wmfbackups | master | +4 -2
operations/software/wmfbackups | master | +5 -7
operations/software/wmfbackups | master | +10 -3
operations/puppet | production | +10 -3
operations/software/wmfmariadbpy | master | +10 -3
operations/software/wmfbackups | master | +10 -3
operations/puppet | production | +6 -3
operations/puppet | production | +15 -19
operations/puppet | production | +0 -2
operations/software/wmfmariadbpy | master | +2 -2
operations/puppet | production | +9 -9
operations/puppet | production | +3 -1
operations/puppet | production | +2 -662
operations/puppet | production | +2 -0
operations/puppet | production | +1 -1
operations/puppet | production | +2 -2
operations/software | master | +3 -3
operations/puppet | production | +29 -28
operations/puppet | production | +3 -2
operations/software/wmfmariadbpy | master | +2 -2
operations/puppet | production | +2 -2
operations/puppet | production | +19 -9
operations/puppet | production | +63 -22
operations/puppet | production | +8 -10
operations/puppet | production | +2 -2
operations/puppet | production | +1 -1
operations/puppet | production | +1 -0
operations/puppet | production | +3 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +10 -10
operations/puppet | production | +8 -2
operations/puppet | production | +0 -2
operations/puppet | production | +13 -12

Related Objects

Status | Assigned
Open | None
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Declined | None
Resolved | jcrespo
Resolved | jcrespo
Open | None
Resolved | jcrespo
Resolved | jcrespo
Resolved | None
Resolved | Marostegui
Resolved | Marostegui
Resolved | Marostegui
Declined | Marostegui
Resolved | mark
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Duplicate | None
Resolved | jcrespo
Open | None
Open | None
Resolved | jcrespo
Resolved | L0st3xpl0r3r
Resolved | jcrespo
Resolved | Papaul
Resolved | Cmjohnson
Resolved | Papaul
Resolved | Cmjohnson
Open | None
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Resolved | jcrespo
Duplicate | None

Event Timeline

Change 609101 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix remote_backup_mariadb.py deps and path

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609101

Change 609101 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Fix remote_backup_mariadb.py deps and path

https://gerrit.wikimedia.org/r/c/operations/puppet/+/609101

Change 615155 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Adjust check parameters to get less false positives

https://gerrit.wikimedia.org/r/615155

Change 615155 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Adjust check parameters to get less false positives

https://gerrit.wikimedia.org/r/615155

Change 618720 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Reenable notifications for dbprov2003 after maintenance

https://gerrit.wikimedia.org/r/618720

Change 618722 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Move x1 and misc logical dumps to dbprov1003

https://gerrit.wikimedia.org/r/618722

Change 618723 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] BackupStatistics: Do not raise an exception if metadata cannot be sent

https://gerrit.wikimedia.org/r/618723

Change 618723 abandoned by Jcrespo:
[operations/software/wmfmariadbpy@master] BackupStatistics: Do not raise an exception if metadata cannot be sent

Reason:
not a bug

https://gerrit.wikimedia.org/r/618723

Change 618720 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Reenable notifications for dbprov2003 after maintenance

https://gerrit.wikimedia.org/r/618720

Change 618722 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Move x1 and misc logical dumps to dbprov1003

https://gerrit.wikimedia.org/r/618722

Change 620651 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable snapshots being sent to bacula

https://gerrit.wikimedia.org/r/620651

Change 620651 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable snapshots being sent to bacula

https://gerrit.wikimedia.org/r/620651

Change 623525 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb-backups: Fix missing check on no ERROR msgs on logs

https://gerrit.wikimedia.org/r/623525

Change 623530 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfbackups@master] mariadb-backups: Fix missing check on no ERROR msgs on logs

https://gerrit.wikimedia.org/r/623530

Change 623532 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfbackups@master] mariadb-backups: Fix missing check on no ERROR msgs on logs

https://gerrit.wikimedia.org/r/623532

Change 623530 abandoned by Jcrespo:
[operations/software/wmfbackups@master] mariadb-backups: Fix missing check on no ERROR msgs on logs

Reason:
duplicate

https://gerrit.wikimedia.org/r/623530

Change 623525 abandoned by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb-backups: Fix missing check on no ERROR msgs on logs

Reason:
duplicate

https://gerrit.wikimedia.org/r/623525

Change 623538 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update backup logic to check errors on log

https://gerrit.wikimedia.org/r/623538

Change 623538 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update backup logic to check errors on log

https://gerrit.wikimedia.org/r/623538

Change 623532 merged by jenkins-bot:
[operations/software/wmfbackups@master] mariadb-backups: Fix missing check on no ERROR msgs on logs

https://gerrit.wikimedia.org/r/623532

Change 626172 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfbackups@master] remote_backup: Instead of using a preassigned port, autoselect one

https://gerrit.wikimedia.org/r/626172

Change 628163 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Stop using puppet to deploy wmfbackups and use debian packages

https://gerrit.wikimedia.org/r/628163

Change 628168 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfbackups@master] cli: Make /etc/wmfbackups the config dir for the main backup scripts

https://gerrit.wikimedia.org/r/628168

Change 628172 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfbackups@master] [WIP] Use the shared list of sections to validate backup checks

https://gerrit.wikimedia.org/r/628172

Change 626172 merged by Jcrespo:
[operations/software/wmfbackups@master] remote_backup: Instead of using a preassigned port, autoselect one

https://gerrit.wikimedia.org/r/626172

Change 628168 merged by Jcrespo:
[operations/software/wmfbackups@master] cli: Make /etc/wmfbackups the config dir for the main backup scripts

https://gerrit.wikimedia.org/r/628168

Change 629076 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Start using the package instead of the script on puppet

https://gerrit.wikimedia.org/r/629076

Change 629076 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Start using the package instead of the script on puppet

https://gerrit.wikimedia.org/r/629076

Change 629092 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Use the wmfbackups-remote package and not the puppet version

https://gerrit.wikimedia.org/r/629092

Change 628163 abandoned by Jcrespo:
[operations/puppet@production] mariadb: Stop using puppet to deploy wmfbackups and use debian packages

Reason:
split into 3 other patches, once per role

https://gerrit.wikimedia.org/r/628163

Change 629101 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Stop using puppet to deploy wmfbackups and use debian packages

https://gerrit.wikimedia.org/r/629101

Change 629101 merged by Jcrespo:
[operations/puppet@production] mariadb: Stop using puppet to deploy wmfbackups and use debian packages

https://gerrit.wikimedia.org/r/629101

Change 629092 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Use the wmfbackups-remote package and not the puppet version

https://gerrit.wikimedia.org/r/629092

Change 643220 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfbackups@master] Add new option, stats_file, to prevent sending passwords over the network/logs

https://gerrit.wikimedia.org/r/643220

Change 643223 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Make use of the new stats-file options, preventing password logging

https://gerrit.wikimedia.org/r/643223

Change 643300 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Make sure mydumper and xtrabackup are installed on the right hosts

https://gerrit.wikimedia.org/r/643300

Change 643220 merged by Jcrespo:
[operations/software/wmfbackups@master] Add new option, stats_file, to prevent sending passwords over the network/logs

https://gerrit.wikimedia.org/r/643220

Change 643223 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Use new stats-file option, preventing password logging

https://gerrit.wikimedia.org/r/643223

Change 643300 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Make sure mydumper and xtrabackup are installed

https://gerrit.wikimedia.org/r/643300

Change 662740 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] [WIP] Move database backups-related puppet code to its own profile/role

https://gerrit.wikimedia.org/r/662740

Change 663221 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[labs/private@master] dbbackups: Update password locations for database-backups db

https://gerrit.wikimedia.org/r/663221

Change 663221 merged by Jcrespo:
[labs/private@master] dbbackups: Update password locations for database-backups db

https://gerrit.wikimedia.org/r/663221

Change 662740 merged by Jcrespo:
[operations/puppet@production] dbbackups: Move database backups-related puppet code to its own profile/role

https://gerrit.wikimedia.org/r/662740

Change 663649 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbbackups: Create new puppet module dbbackups, move backup check to it

https://gerrit.wikimedia.org/r/663649

Change 663649 merged by Jcrespo:
[operations/puppet@production] dbbackups: Create new puppet module dbbackups, move backup check to it

https://gerrit.wikimedia.org/r/663649

Change 670174 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbbackups: Limit concurrency of es backups on codfw, too

https://gerrit.wikimedia.org/r/670174

Change 670217 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbbackups: Move files and templates to dbbackups hierarchy

https://gerrit.wikimedia.org/r/670217

Change 670174 merged by Jcrespo:
[operations/puppet@production] dbbackups: Limit concurrency of es backups on codfw, too

https://gerrit.wikimedia.org/r/670174

Change 670217 merged by Jcrespo:
[operations/puppet@production] dbbackups: Move files and templates to dbbackups hierarchy

https://gerrit.wikimedia.org/r/670217

Change 671112 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbbackups: Reduce External store retention now that bacula store it long term

https://gerrit.wikimedia.org/r/671112

Change 671112 merged by Jcrespo:
[operations/puppet@production] dbbackups: Reduce External store retention now that bacula is used long term

https://gerrit.wikimedia.org/r/671112

Change 673292 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfbackups@master] Add sql with the empty database structure to the repo

https://gerrit.wikimedia.org/r/673292

Change 673292 merged by jenkins-bot:
[operations/software/wmfbackups@master] Add sql with the empty database structure to the repo

https://gerrit.wikimedia.org/r/673292

Change 681317 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Add enabled parameter backups on the cluster::management role

https://gerrit.wikimedia.org/r/681317

Change 681317 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add enabled parameter backups on the cluster::management role

https://gerrit.wikimedia.org/r/681317

Change 739471 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Further reorganize backup location to optimize for total latency

https://gerrit.wikimedia.org/r/739471

Change 739471 merged by Jcrespo:

[operations/puppet@production] dbbackups: Further reorganize backup location to optimize for total latency

https://gerrit.wikimedia.org/r/739471

Change 739768 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switch s3 and x1 backup generation as an optimization

https://gerrit.wikimedia.org/r/739768

Change 739768 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switch s3 and x1 backup generation as an optimization

https://gerrit.wikimedia.org/r/739768

Change 767844 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] Refactor check_mariadb_backups.py and add enough tests for it

https://gerrit.wikimedia.org/r/767844

Change 767844 merged by jenkins-bot:

[operations/software/wmfbackups@master] Refactor check_mariadb_backups.py and add enough tests for it

https://gerrit.wikimedia.org/r/767844

Change 770021 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] Improve logic and quality of life for remote backups

https://gerrit.wikimedia.org/r/770021

Change 770023 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Migrate remote backup (snapshot) cmd line to 0.7 format

https://gerrit.wikimedia.org/r/770023

Change 628172 abandoned by Jcrespo:

[operations/software/wmfbackups@master] [WIP] Use the shared list of sections to validate backup checks

Reason:

not possible at the moment, see other related patches on check script

https://gerrit.wikimedia.org/r/628172

Change 771868 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] wmfbackups: Add manpages for all executables and config files

https://gerrit.wikimedia.org/r/771868

Change 770021 merged by Jcrespo:

[operations/software/wmfbackups@master] Improve logic and quality of life for remote backups

https://gerrit.wikimedia.org/r/770021

Change 771868 merged by Jcrespo:

[operations/software/wmfbackups@master] wmfbackups: Add manpages for all executables and config files

https://gerrit.wikimedia.org/r/771868

Change 770023 merged by Jcrespo:

[operations/puppet@production] dbbackups: Migrate remote backup (snapshot) cmd line to 0.7 format

https://gerrit.wikimedia.org/r/770023

Mentioned in SAL (#wikimedia-operations) [2022-03-28T08:50:12Z] <jynus> deploy new alerting (0.7.1) for db backups at alert1001 T138562

Change 774406 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] check: Fix typo causing x1 section to be unrecognized

https://gerrit.wikimedia.org/r/774406

Change 774406 merged by Jcrespo:

[operations/software/wmfbackups@master] check: Fix typo causing x1 section to be unrecognized

https://gerrit.wikimedia.org/r/774406

Change 817871 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Increase es dumps retention from 8 days to 10 days

https://gerrit.wikimedia.org/r/817871

Change 817871 merged by Jcrespo:

[operations/puppet@production] dbbackups: Increase es dumps retention from 8 days to 10 days

https://gerrit.wikimedia.org/r/817871