Description
I think prod databases are backed up, so this should be too. Delayed replication and/or dumps?
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | yuvipanda | T105720 Labs team reliability goal for Q1 2015/16
Resolved | | Andrew | T105723 Eliminate SPOFs in Labs infrastructure
Resolved | | coren | T88234 Puppetize & fix tools-db
Duplicate | | coren | T88716 Make sure tools-db is backed up in some form
Event Timeline
(Presumably, back up offsite)
This requires one of two things: either we dump the database to labstore2001 (which also gets rsyncs of labstores) or we add a DB to codfw and slave it.
What is the purpose of the backups? If it is to guard against hardware failures and the like, I assume replication to another server that tools-db could be pointed at would be the best approach (and this task thus a duplicate of T88718).
Hinting that user databases are backed up will probably provoke support requests from someone who wants a table row restored in the form it had 39.437 days ago. In addition, I assume much (most?) of the data in the user databases is derived from replicated data and could thus be regenerated. So for "user database backups" in that sense, I would instead recommend advertising that users who need backups should set up a cron job that calls mysqldump tailored to their requirements.
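For illustration, a minimal sketch of such a self-service dump, assuming the tool's credentials live in the usual ~/replica.my.cnf; the database name, host alias, and output path are placeholders:

```
#!/bin/bash
# Illustrative self-service backup script, e.g. run nightly from the
# tool's crontab. The database name (s51234__mytool), the host alias,
# and the output path are placeholders; credentials come from the
# tool's ~/replica.my.cnf.
mysqldump --defaults-file="$HOME/replica.my.cnf" -h tools-db \
    --single-transaction s51234__mytool \
    | gzip > "$HOME/backups/mytool-$(date +%F).sql.gz"
```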
Once T88718 is finished (most of the work has been done already), backups can be taken from the slave consistently.
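One way that could look (a sketch only; host names and paths are made up) is to pause the slave's SQL thread so the dump reflects a single replication position:

```
# Sketch: consistent dump taken on the slave. Pausing the SQL thread
# freezes replication at one position for the duration of the dump;
# paths are illustrative.
mysql -e "STOP SLAVE SQL_THREAD;"
mysqldump --all-databases --single-transaction --events --routines \
    | gzip > /srv/backups/tools-db-$(date +%F).sql.gz
mysql -e "START SLAVE SQL_THREAD;"
```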
A slave replica will guard against:
- Hardware issues (e.g. secondary storage broken, server fried in general)
- Admin and security issues (data is rm'ed accidentally/by an attacker)
Pointing clients to the slave is trivial, and could even be done automatically.
A slave replica will not guard against:
- Software logic errors (a bad SQL command is executed and data is logically dropped, or MySQL has a bug that causes data loss).
For that, there are two options:
- Regular periodic backups, which allow going back by up to that period
- A delayed slave, where in the event of a bad SQL command there is a window of X hours before the slave executes it (see the sketch after this list)
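A sketch of how the delayed slave could be configured, assuming MySQL 5.6+ or MariaDB 10.2+ on the slave (older versions would need an external tool such as pt-slave-delay); the 24-hour delay is only an example:

```
# Sketch: make the slave apply events 24 hours behind the master.
# MASTER_DELAY is given in seconds and needs MySQL 5.6+ / MariaDB 10.2+.
mysql -e "STOP SLAVE;
          CHANGE MASTER TO MASTER_DELAY = 86400;
          START SLAVE;"
```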
The problem with backups on that host is that there are some largish databases that are derived directly from the MySQL production replicas and are not worth recovering. I would suggest doing a user poll / selecting specific databases to avoid duplicating 500 GB.
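A sketch of such a selective dump; the exclusion pattern is purely illustrative, as the real list of replica-derived databases would come from the poll:

```
# Sketch: dump user databases individually, skipping system schemas and
# (illustratively) anything matching a "derived from the replicas" pattern.
for db in $(mysql -N -e "SHOW DATABASES" \
            | grep -Ev '^(mysql|information_schema|performance_schema)$'); do
    case "$db" in
        *_copy|*_p) continue ;;  # illustrative exclusion pattern
    esac
    mysqldump --single-transaction "$db" \
        | gzip > "/srv/backups/${db}-$(date +%F).sql.gz"
done
```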
And I'd repeat my comment T88716#1181437:
> […]
> Hinting that user databases are backed up will probably provoke support requests from someone who wants a table row restored in the form it had 39.437 days ago. In addition, I assume much (most?) of the data in the user databases is derived from replicated data and could thus be regenerated. So for "user database backups" in that sense, I would instead recommend advertising that users who need backups should set up a cron job that calls mysqldump tailored to their requirements.
I can imagine scenarios where some form of point-in-time recovery is useful from a DBA perspective (imagine a maintenance script that is pointed at the wrong DB host and drops all databases), but IMHO users should be advised to mysqldump tables they think are important at intervals they deem appropriate. So if you think that replication guards against hardware issues and that a DB root accidentally dropping databases is unlikely enough, I would consider this task resolved (or a duplicate of T88718).
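If that kind of point-in-time recovery were ever wanted, it would typically mean restoring the last full dump and then replaying the binary logs up to just before the destructive statement; a sketch with purely illustrative file names and timestamps:

```
# Sketch of point-in-time recovery: restore the last full dump, then
# replay binary logs up to just before the bad statement. File names
# and the cut-off time are illustrative; in practice the replay would
# start from the binlog position recorded alongside the dump.
zcat /srv/backups/tools-db-2015-04-01.sql.gz | mysql
mysqlbinlog --stop-datetime="2015-04-02 11:59:00" \
    /srv/backups/binlog/mysql-bin.000123 /srv/backups/binlog/mysql-bin.000124 | mysql
```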
Just a comment: I would make sure that the lack of any recovery guarantee is announced to the labs list (and I intend to do so).