Improve regular production database backups handling
Open, Medium · Public

Description

  • Documentation, documentation, documentation
  • More flexibility: if possible per-table CSVs
  • More flexibility: Physical backups
  • Better recovery documentation: "one line to recover" (there is now a recover_section.py)
  • Faster point in time recovery/premade tools
  • Better compression
  • Prepare based on name of the backup, not just the section
  • Option optimization (e.g. double the use_memory)
  • Identify failures once a timeout (X amount of time) has passed, and provide easy cleanup of leftover files (probably on T205627)
  • Purge old metadata and make sure logs are rotated (T205627)
  • Review and improve logging (beyond metadata)
  • 1 retry after initial failure
  • More optimization of certain database tables
  • Maybe some kind of locking of backups and/or transfer.py to prevent concurrent actions on the same source or target servers
  • Differential backups
  • More detailed health checks of backups (size, failures, objects, ...). E.g. check size is within a percentage of the previous backup.
  • Document (and potentially alert on) the last edit time of some sample tables (e.g. recentchanges or revision) to verify that the source databases are up to date (e.g. to catch the case where the source's master or intermediate master has replication stopped, or some other issue causes recent backups to contain stale data)
  • Have a quick way to see which backup sources belong to each section (tendril, dashboard)
  • Document and/or automate best server configuration for fast dump load (e.g. disable checksums, innodb transactionality, etc.)
  • Enable the possibility of editing per-table options such as the engine and compression
  • Work around the "myloader doesn't import empty dbs" bug
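
A rough sketch of two of the health checks listed above (backup size within a percentage of the previous backup, and freshness of a sample table), in Python to match recover_section.py and transfer.py. The function names, the 5% tolerance and the 1 hour lag limit are placeholders chosen for illustration, not existing code or agreed thresholds. How the sizes and the last edit time are obtained (backup metadata, a query on recentchanges or revision) is left out on purpose.

```python
from datetime import datetime, timedelta


def size_within_tolerance(current_bytes: int, previous_bytes: int,
                          max_change_pct: float = 5.0) -> bool:
    """Return True if the new backup size differs from the previous one
    by at most max_change_pct percent (hypothetical default: 5%)."""
    if previous_bytes <= 0:
        # No usable baseline: treat as a failed check so a human looks at it.
        return False
    change_pct = abs(current_bytes - previous_bytes) * 100.0 / previous_bytes
    return change_pct <= max_change_pct


def source_is_fresh(last_edit: datetime, backup_start: datetime,
                    max_lag: timedelta = timedelta(hours=1)) -> bool:
    """Return True if the newest row of a sample table is recent enough
    relative to the backup start time (hypothetical default: 1 hour)."""
    return backup_start - last_edit <= max_lag


if __name__ == "__main__":
    # Example values only; real ones would come from backup metadata.
    print(size_within_tolerance(104, 100))
    print(source_is_fresh(datetime(2019, 12, 10, 14, 30),
                          datetime(2019, 12, 10, 15, 0)))
```

Both checks return a plain boolean so they could feed an existing alerting check. Whether a missing baseline should fail or only warn is an open question.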

Related Objects

Status | Subtype | Assigned | Task
Resolved | | RobH |
Resolved | | fgiunchedi |
Open | | None |
Resolved | | jcrespo |
Open | | None |
Resolved | | jcrespo |
Open | | jcrespo |
Declined | | None |
Resolved | | jcrespo |
Resolved | | jcrespo |
Open | | None |
Open | | None |
Open | | None |
Resolved | | jcrespo |
Resolved | | Cmjohnson |
Resolved | | Cmjohnson |
Resolved | | Cmjohnson |
Resolved | | jcrespo |
Resolved | | Marostegui |
Resolved | | RobH |
Resolved | | Andrew |
Resolved | | Cmjohnson |
Resolved | | jcrespo |
Resolved | | Cmjohnson |
Resolved | | Cmjohnson |
Resolved | | jcrespo |
Resolved | | Cmjohnson |
Resolved | | jcrespo |
Resolved | | Papaul |
Resolved | | Marostegui |
Resolved | | RobH |
Resolved | | RobH |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | None |
Resolved | | Marostegui |
Resolved | | Marostegui |
Resolved | | Marostegui |
Declined | | Marostegui |
Resolved | | mark |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Open | | Rduran |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Duplicate | | None |
Open | | None |
Open | | None |
Open | | None |
Open | | None |
Open | | None |
Resolved | | Marostegui |
Resolved | | Marostegui |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | Papaul |
Declined | | None |
Open | | None |
Resolved | | jcrespo |
Resolved | | Papaul |
Resolved | | jcrespo |
Resolved | | Papaul |
Resolved | | Cmjohnson |
Resolved | | Papaul |
Resolved | | jcrespo |
Resolved | | jcrespo |

Event Timeline

jcrespo created this task. · Jun 24 2016, 9:00 AM
Restricted Application added subscribers: Zppix, Aklapper. · Jun 24 2016, 9:00 AM

This is to improve handling of cases such as T138516.

jcrespo moved this task from Triage to Backlog on the DBA board. · Jun 24 2016, 2:49 PM

Change 296706 had a related patch set uploaded (by Jcrespo):
Correct invalid cron definition; add gtid to backups

https://gerrit.wikimedia.org/r/296706

Change 296706 merged by Jcrespo:
Correct invalid cron definition; add gtid to backups

https://gerrit.wikimedia.org/r/296706

greg moved this task from On-going to Follow-up on the Wikimedia-Incident board. · Jul 27 2016, 10:45 PM
greg added a subscriber: greg. · Sep 29 2016, 7:40 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

jcrespo triaged this task as Low priority. · Sep 30 2016, 8:24 AM
jcrespo renamed this task from "Improve db backup handling, specially of misc hosts" to "Improve db backup handling". · Apr 12 2017, 10:21 AM
jcrespo removed a project: Patch-For-Review.
jcrespo renamed this task from "Improve db backup handling" to "Improve regular production database backups handling". · Apr 12 2017, 10:26 AM
jcrespo raised the priority of this task from Low to Medium.
jcrespo moved this task from Backlog to Meta/Epic on the DBA board.
jcrespo updated the task description.
jcrespo updated the task description. · Mar 9 2018, 3:16 PM
greg removed a subscriber: greg. · Apr 11 2019, 9:20 PM
jcrespo updated the task description. · Jun 25 2019, 8:58 AM
jcrespo updated the task description. · Dec 2 2019, 10:03 AM
jcrespo updated the task description. · Dec 10 2019, 3:13 PM
jcrespo updated the task description.