
Implement database binary backups into the production infrastructure
Closed, Resolved, Public

Description

Implement database binary backups into the production infrastructure:

We have working and monitored logical backups ("dumps") of all important mediawiki metadata and misc servers, but recovering from them, while highly flexible, is slow: recovering from a full section dump may take from 3 up to 24 hours.

A less flexible but faster recovery method is needed, in addition to the logical backups, for the most critical part of the infrastructure: "physical backups", "binary backups" or "snapshots", i.e. images of the files in their original format that can be loaded quickly without requiring any transformation. Options for them could be:

  • xtrabackup (or its mariadb version, mariabackup)
  • cold backups
  • volume snapshots
  • delayed slaves

Research was done to decide which is the appropriate one for our infrastructure, and we need to implement a proof of concept (without needing full coverage) that validates the decision taken and measures the Time to Recovery (TTR). mariabackup/xtrabackup worked for us, at least in custom testing, and we will move forward with it (xtrabackup stopped working for MariaDB, which is why mariabackup was being tested).
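
For reference, a minimal sketch of the mariabackup take-and-prepare cycle (the host name, credentials and target path here are illustrative, not the production configuration):

# Copy the running instance into an empty target directory; the source
# keeps serving traffic while the copy is taken.
mariabackup --backup --host=db1234.example.net --user=<backup_user> --password=<pass> \
    --target-dir=/srv/backups/snapshots/ongoing/snapshot.s1

# Apply the redo log so the copy is consistent; after this step the
# directory can be used directly as a datadir, or archived and transferred.
mariabackup --prepare --target-dir=/srv/backups/snapshots/ongoing/snapshot.s1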

The snapshots can also be used in the future for automatic provisioning, although that is out of scope for this goal for the moment.

To implement this system fully, extra hardware is also needed, both for taking the snapshots and for long-term storage; it will have to be acquired and provisioned accordingly.

  • Design the final architecture down to the physical architecture
  • Procure hardware for binary backups
  • Implement a more or less final snapshot cycle automation for mediawiki metadata and misc databases

Details

Repo                              Branch      Lines +/-
operations/puppet                 production  +10 -10
operations/puppet                 production  +41 -8
operations/software/wmfmariadbpy  master      +35 -2
operations/software/wmfmariadbpy  master      +1 -1
operations/puppet                 production  +3 -3
operations/puppet                 production  +93 -44
operations/puppet                 production  +15 -13
operations/puppet                 production  +1 -0
operations/puppet                 production  +30 -3
operations/puppet                 production  +52 -8
operations/puppet                 production  +10 -10
operations/puppet                 production  +5 -4
operations/puppet                 production  +10 -0
operations/software/wmfmariadbpy  master      +2 -2
operations/puppet                 production  +9 -4
operations/puppet                 production  +99 -18
operations/software/wmfmariadbpy  master      +51 -3
operations/puppet                 production  +44 -0
operations/puppet                 production  +15 -10
operations/puppet                 production  +5 -4
operations/puppet                 production  +2 -2
operations/puppet                 production  +23 -4
operations/puppet                 production  +26 -4
operations/puppet                 production  +2 -2
operations/puppet                 production  +47 -15
operations/software/wmfmariadbpy  master      +47 -15
operations/puppet                 production  +68 -68
operations/software/wmfmariadbpy  master      +68 -65
operations/puppet                 production  +8 -5
operations/puppet                 production  +8 -0
operations/puppet                 production  +1 -1
operations/puppet                 production  +17 -9
operations/puppet                 production  +2K -904
operations/software/wmfmariadbpy  master      +1K -901

Related Objects

Status    Assigned
Resolved  RobH
Resolved  fgiunchedi
Resolved  jcrespo
Resolved  jcrespo
Resolved  jcrespo
Resolved  jcrespo
Resolved  jcrespo
Resolved  jcrespo
Resolved  Papaul
Declined  None
Resolved  Guozr.im
Resolved  jcrespo
Resolved  Papaul
Resolved  jcrespo
Resolved  Papaul
Resolved  Cmjohnson
Resolved  Papaul
Resolved  jcrespo

Event Timeline


Change 499997 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Allow remote dumps from cumin hosts

https://gerrit.wikimedia.org/r/499997

Change 500980 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Setup full daily snapshots for all sections

https://gerrit.wikimedia.org/r/500980

Mentioned in SAL (#wikimedia-operations) [2019-04-04T17:33:07Z] <jynus> stopping replication on dbstore2001:s8 for backup testing T206203

Despite the failure and the kill, it is nice to see that a process kill gets properly handled and the backup is marked as failed.

Change 501506 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Only create x1 snapshots on dbstore1001

https://gerrit.wikimedia.org/r/501506

Change 501506 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Only create x1 snapshots on dbstore1001

https://gerrit.wikimedia.org/r/501506

Change 501546 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Stop replication during transfer

https://gerrit.wikimedia.org/r/501546

With replication down, the backup of s8 (with x1 concurrency, but it shouldn't matter in this case) took 4h45m:

  • 2h45m for the 1.5 TB transfer
  • 25s for the prepare
  • 2h for the compression

It is not great, but I guess it is better than not finishing? The source is not the final HW, anyway.

I think it is pretty good compared with doing it without stopping replication.
Once the sources have been migrated to the final HW, that will probably be reduced even more.

Change 501555 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5

https://gerrit.wikimedia.org/r/501555

Change 501555 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshot: Reduce codfw mariabackup generation to x1 and m5

https://gerrit.wikimedia.org/r/501555

Change 502828 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Allow new option --stop-slave for xtrabackup transfers

https://gerrit.wikimedia.org/r/502828

Change 506947 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Setup db2097 as the source of some codfw backups

https://gerrit.wikimedia.org/r/506947

Change 506947 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Setup new backup source hosts for codfw backups

https://gerrit.wikimedia.org/r/506947

Change 500980 abandoned by Jcrespo:
mariadb-snapshots: Setup full daily snapshots for all codfw sections

Reason:
it will be deployed with other hosts instead

https://gerrit.wikimedia.org/r/500980

Change 508801 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update transfer.py to 3b11a70cc1f356

https://gerrit.wikimedia.org/r/508801

Change 508805 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Setup snapshots of s1 stopping replication

https://gerrit.wikimedia.org/r/508805

Change 502828 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Allow new option --stop-slave for xtrabackup transfers

https://gerrit.wikimedia.org/r/502828

Change 508801 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update transfer.py to 3b11a70cc1f356

https://gerrit.wikimedia.org/r/508801

Change 501546 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Stop replication during transfer

https://gerrit.wikimedia.org/r/501546

Change 508810 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] transfer.py: Ignore stopping and starting slave if option is not set

https://gerrit.wikimedia.org/r/508810

Change 508810 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] transfer.py: Ignore stopping and starting slave if option is not set

https://gerrit.wikimedia.org/r/508810

Change 508812 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update transfer.py to 37c035ae06c54bbb79078c3d84

https://gerrit.wikimedia.org/r/508812

Change 508805 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Setup snapshots of s1 stopping replication

https://gerrit.wikimedia.org/r/508805

Change 508812 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update transfer.py to 37c035ae06c54bbb79078c3d84

https://gerrit.wikimedia.org/r/508812

root@cumin2001:~$ time transfer.py --no-checksum --no-encrypt --type=decompress dbprov2001.codfw.wmnet:/srv/backups/snapshots/latest/snapshot.s6.2019-05-07--20-00-02.tar.gz db2117.codfw.wmnet:/srv/sqldata
...
WARNING: Original size is 207411963527 but transferred size is 494357898256 for copy to db2117.codfw.wmnet
494357898256 bytes correctly transferred from dbprov2001.codfw.wmnet to db2117.codfw.wmnet

real    30m9.519s
user    0m3.612s
sys     0m0.872s

To set up replication on the destination, a question: does the metadata file contain only GTID coordinates, so we have to do the "translation" by looking at the master's binlog?

GTID "should work". But I have not tested it. You can try (you will only lose 30 minutes!) or do the other. All those are part of a provisioning script I don't have yet.

Executed as I mentioned above:

SET GLOBAL gtid_slave_pos = '0-180359184-3070029963,171970705-171970705-239075862,171974883-171974883-1239749870,171978766-171978766-1375989821,180359184-180359184-35604081,180363274-180363274-7949150,180367474-180367474-109042271'; # as obtained from /srv/sqldata/xtrabackup_slave_info

CHANGE MASTER TO MASTER_HOST='db2039.codfw.wmnet', MASTER_USER='<user>', MASTER_PASSWORD='<pass>', MASTER_SSL=1, master_use_gtid = slave_pos;

start slave io_thread;
show slave status\G
             # Slave_IO_Running: Yes
            # Slave_SQL_Running: No
start slave;
show slave status\G
             # Slave_IO_Running: Yes
            # Slave_SQL_Running: Yes

Unlike with logical copies, grants, heartbeat and other metadata should be there (although it still requires a double check of everything).
Remember our problem with GTID is replication, but this is (technically) a clone of the original replica.

That doesn't mean we should believe it blindly; we should test its correctness like we do every time we create new workflows that can fail (or every time we use proven ones, too!).

Nice! I guess I have had too many nightmares with GTID; I need to start trusting it again, and this will definitely help! :)
This probably needs to go to the cheatsheet, I think! \o/

Change 509012 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Setup all metadata sections for daily snapshots

https://gerrit.wikimedia.org/r/509012

Change 509036 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix bug of stopping slave after the backup

https://gerrit.wikimedia.org/r/509036

Change 509036 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Fix bug of stopping slave after the backup

https://gerrit.wikimedia.org/r/509036

Change 509012 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Setup all metadata sections for daily snapshots

https://gerrit.wikimedia.org/r/509012

Change 509343 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Force snapshots in certain order

https://gerrit.wikimedia.org/r/509343

Change 509343 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Force snapshots in certain order

https://gerrit.wikimedia.org/r/509343

Mentioned in SAL (#wikimedia-operations) [2019-05-10T09:17:15Z] <jynus> disabling replication lag alerts for backup source hosts on s1, s4, s8 T206203

Things pending I would like to work on:

  • Proper documentation
  • Identify failures after a given timeout/amount of time has passed, and easy cleanup of leftover files (probably on T205627)
  • Purge old metadata and make sure logs are rotated T205627
  • Review and improve logging (beyond metadata)
  • 1 retry after initial failure + prepare based on name of the backup, not just the section
  • More optimization of certain database tables
  • Option optimization (e.g. double the use_memory)
  • Maybe some kind of locking of backups and/or transfer.py to prevent concurrent actions on the same source or target servers (see the sketch after this list)
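
A minimal sketch of what that locking could look like, assuming flock(1) is available on the cumin hosts and reusing the decompress transfer shown earlier as the wrapped command; the lock file path and per-target-host scheme are illustrative choices, not an agreed implementation:

# Take an exclusive, non-blocking lock per target host before starting the operation;
# a second run against the same host fails immediately instead of piling up.
LOCK=/var/lock/mariadb-backups-db2117.codfw.wmnet.lock
flock --nonblock "$LOCK" \
    transfer.py --no-checksum --no-encrypt --type=decompress \
    dbprov2001.codfw.wmnet:/srv/backups/snapshots/latest/snapshot.s6.2019-05-07--20-00-02.tar.gz \
    db2117.codfw.wmnet:/srv/sqldata \
    || { echo "another backup/transfer is already running against db2117" >&2; exit 1; }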

All 9 + 9 backups worked, starting at 20:00 UTC, and the last one finished at 10:15 the next day. 14.4 TB of backups were produced in that interval (~5.6 TB after compression).
In comparison, dumps produce 14 + 13 backups and take from 17:00 UTC to 00:12 the next day, with a total size of 2.9 TB after compression.

As this looks like a wish list, I will throw some ideas here (of course not to be done as part of this goal, but just adding them for the record). They are mostly related to reporting/usability and probably to be done within the "new" tendril eventually.

  • Have a section that parses the backup table so we can check the status of the backups for a given week and check where the snapshots are (i.e. I need to provision 3 hosts with the s4 snapshot); a query sketch follows this list
    • Overall health of the attempted backups? (finished correctly, in progress, failed)
    • Where is the snapshot?
    • When is it from?
  • Have a quick way to see which backup sources belong to each section. Maybe mark it on tendril, so we don't have to always check site.pp and can quickly identify it from tendril.
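
As an illustration of the first item, such a report could be backed by something like the query below; the table and column names are assumptions based on the metadata row pasted later in this task, not the actual schema:

# Assumed: a backups metadata table with (name, status, source, host, type, section, start_date, end_date, total_size).
mysql -h <metadata-db-host> <metadata-db> -e "
  SELECT section, name, status, source, host, start_date, end_date,
         ROUND(total_size/1024/1024/1024) AS size_gb
  FROM backups
  WHERE type = 'snapshot'
    AND start_date > NOW() - INTERVAL 7 DAY
  ORDER BY section, start_date DESC;"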

Again, adding them here for the record, not expecting them to be worked out now or as part of this goal.

Change 510184 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Reduce frequency of backups to 3 times a week

https://gerrit.wikimedia.org/r/510184

Change 510184 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Reduce frequency of snapshots to 3 times a week

https://gerrit.wikimedia.org/r/510184

Change 510453 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Refactor dumps and snapshots

https://gerrit.wikimedia.org/r/510453

Change 510453 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Refactor dumps and snapshots

https://gerrit.wikimedia.org/r/510453

Change 510700 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Monitor snapshots alongside dumps on icinga

https://gerrit.wikimedia.org/r/510700

Change 510700 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Monitor snapshots alongside dumps on icinga

https://gerrit.wikimedia.org/r/510700

s1 and s6 eqiad backups have been in "prepare" status for almost 24h now.
Stracing the processes on dbprov1001 and dbprov1002, it looks like they are not really doing anything. I am going to give it a few more hours, then kill the processes and remove the data in the ongoing directory, to see if they get generated correctly tomorrow.

I have killed both; they were not doing anything, and the server was fully idle. strace didn't show any activity at all.
Killed both processes on cumin and on the dbprov hosts, and removed the s6 and s1 directories from /srv/backups/snapshots/ongoing.
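
For the record, the manual cleanup boils down to something like this (a sketch; the exact process to kill and the section directories to remove need to be checked case by case):

# On the dbprov host: confirm the prepare is stuck, then kill it.
pgrep -af mariabackup        # locate the prepare process, if still there
strace -f -p <pid>           # verify it shows no activity before killing it
kill <pid>
# Remove the partial data so the next scheduled run can regenerate the snapshot.
rm -rf /srv/backups/snapshots/ongoing/<stuck-snapshot-dir>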

s1 and s6 eqiad backups were completed correctly last night.
Now waiting for s3 eqiad and codfw to finish

s3 finished correctly on eqiad and codfw.
All backups worked fine last night.

I just noticed a small thing: it looks like the metadata for s4 codfw wasn't updated correctly (it says the latest backup is from 17 May):

| 1436 | snapshot.s4.2019-05-17--10-05-10 | finished | db2099.codfw.wmnet:3314 | dbprov2001.codfw.wmnet | snapshot | s4      | 2019-05-17 11:33:21 | 2019-05-17 12:26:51 | 1061051511697 |

However icinga reports:

snapshot for s4 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-05-19 23:12:20 from db2099.codfw.wmnet:3314 (992 GB)

Which also matches the fact that on dbprov2001 the backup is under latest and not ongoing:

root@dbprov2001:/srv/backups/snapshots/latest# ls -lh snapshot.s4.2019-05-19--21-43-58.tar.gz
-rw-r--r-- 1 dump dump 332G May 20 00:40 snapshot.s4.2019-05-19--21-43-58.tar.gz

And from the cumin2001 and dbprov2001 logs:

May 19 21:44:04 cumin2001 mariadb-snapshots[12384]: [21:44:04]: INFO - Running XtraBackup at db2099.codfw.wmnet:3314 and sending it to dbprov2001.codfw.wmnet
[00:40:41]: DEBUG - Archiving snapshot.s4.2019-05-17--10-05-10.tar.gz
[00:40:42]: INFO - Backup s4 generated correctly.

Note: it would be nice to also have the date printed in the log, not only the time. It can obviously be guessed from the backup name, which contains the date, but having date+time might be easier for grepping and future parsing of the logs.

Change 511454 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] check_mariadb_status.py: Clarify the status of the alert

https://gerrit.wikimedia.org/r/511454

Change 512894 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable checks of database snapshots

https://gerrit.wikimedia.org/r/512894

I am manually preparing the x1 and s7 snapshots, which failed again. Maybe s1 on codfw, too? Giving it a general check.

Change 511454 abandoned by Marostegui:
check_mariadb_status.py: Clarify the status of the alert

https://gerrit.wikimedia.org/r/511454

Change 512894 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable checks of database snapshots

https://gerrit.wikimedia.org/r/512894

Change 515063 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] WMFBackup: Increase xtrabackup memory use to 20GB

https://gerrit.wikimedia.org/r/515063

Change 515064 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Allow the passing of a full path to section on prepare

https://gerrit.wikimedia.org/r/515064

Change 515072 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Use full paths for postprocessing new snapshots

https://gerrit.wikimedia.org/r/515072

Change 515063 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] WMFBackup: Increase xtrabackup memory use to 20GB

https://gerrit.wikimedia.org/r/515063

Change 515064 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Allow the passing of a full path to section on prepare

https://gerrit.wikimedia.org/r/515064

Change 515072 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Use full paths for postprocessing new snapshots

https://gerrit.wikimedia.org/r/515072

I am going to close this as resolved and move the minor pending things to T138562 with lower priority.

jcrespo claimed this task.

4th time in a row with 0 failures; this is done.