Implement a utility equivalent to dump_section.py, but for taking snapshots (binary or raw backups). More information on the parent task.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | jcrespo | T206203 Implement database binary backups into the production infrastructure
Resolved | | jcrespo | T210292 Implement a proof of concept of a snapshot cycle automation for a mediawiki section database
Event Timeline
Change 489657 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] mariadb-package: Upgrade to 10.1.38, add mariabackup to path
Change 489657 merged by Jcrespo:
[operations/software@master] mariadb-package: Upgrade to 10.1.38, add mariabackup to path
Change 486264 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] transfer.py: Add the ability to transfer from a new mariabackup
Change 491251 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Modify dump_section to allow different types of dump
Change 491256 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Update mariadb logical path location
Change 491251 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Modify dump_section to allow different types of dump
Change 475471 merged by Jcrespo:
[operations/puppet@production] mariadb: Modify dump_section to allow different types of dump
Change 491256 merged by Jcrespo:
[operations/puppet@production] mariadb: Update mariadb logical path location
Mentioned in SAL (#wikimedia-operations) [2019-02-18T15:21:06Z] <jynus> move logical backups to subdirectory T210292
Change 491293 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb-backups: Fix bug when trying the default type
Change 491295 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix bug when trying the default type
Change 491293 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb-backups: Fix bug when trying the default type
Change 491295 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Fix bug when trying the default type
Change 491818 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Add the option of postprocessing backups
Change 491818 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Add the option of postprocessing backups
Change 493218 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update dump_section.py and recover_section.py
Change 493218 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update dump_section.py and recover_section.py
Change 494899 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts
Known bugs right now (CC @Marostegui):
- All transfers use port 4444, so concurrent transfers are not safe (either the second one fails or the data gets corrupted, probably the former).
- On transfer.py, if a host that is not on puppet is given, an exception is thrown (because of an out-of-bounds array access on cumin). Hosts should probably be validated.
- No validation of paths. While transfer.py doesn't allow overwriting files (and requires an empty directory), it would be nice to prevent writing to arbitrary locations.
- No good error handling or logging.
Despite all that, a full cycle is now running from db1114 to dbstore1001.
Mentioned in SAL (#wikimedia-operations) [2019-03-12T12:23:42Z] <jynus> testing snapshotting on db1117:3325 -> dbstore1001 T210292
Snapshotting is now working consistently. There was a locking issue caused by stdout piping combined with xtrabackup's verbose output (on long-running backups).
Next: Fixing the port multiplexing issue to allow for multiple simultaneous backups.
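One possible way to stop every transfer from colliding on port 4444 would be to pick a free port per transfer. The following is only a sketch: the function name and the 4400-4500 default range are made up for illustration and are not part of transfer.py. A local bind test also says nothing about firewall rules, and another process could still grab the port between the check and the real listener starting.

```python
import socket


def find_free_port(start=4400, end=4500, host=""):
    """Return the first port in [start, end] that can be bound on this host.

    Hypothetical helper: the range and the name are assumptions,
    not something transfer.py provides.
    """
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port already in use (or not allowed); try the next one.
                continue
            return port
    raise RuntimeError("no free port between {} and {}".format(start, end))
```

The check has to run on the receiving host, since that is where the listener is opened.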
Other issues: Popen does an active wait until the process is finished, which wastes CPU cycles. Maybe it could be done better with a non-blocking check plus a sleep?
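A minimal sketch of the "non-blocking check + sleep" idea, which also sends the command's verbose output to a file so that a full pipe buffer cannot stall the child. The function name, the log file and the polling interval are assumptions for illustration; this is not the code in the backup scripts, and in the real pipeline stdout may need to stay connected to the transfer instead.

```python
import subprocess
import time


def run_and_wait(cmd, log_path, poll_interval=1.0):
    """Run cmd, write its output to log_path, and sleep between status checks.

    Hypothetical helper: sleeping between poll() calls avoids spinning,
    and writing to a file avoids blocking on an unread pipe.
    """
    with open(log_path, "wb") as log:
        process = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        while process.poll() is None:
            time.sleep(poll_interval)
    return process.returncode
```

If nothing else has to happen while the command runs, a plain `process.wait()` with no timeout should also block without spinning.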
Mentioned in SAL (#wikimedia-operations) [2019-03-13T06:21:03Z] <marostegui> Testing snapshotting on db1117:3321 to > dbstore1001 - T210292
Mentioned in SAL (#wikimedia-operations) [2019-03-13T07:13:56Z] <marostegui> Test snapshot dbstore1001:3311 to dbstore1001 - T210292
Mentioned in SAL (#wikimedia-operations) [2019-03-13T11:34:10Z] <marostegui> Test snapshot db1117:3325 to dbstore1001 - T210292
I have been testing the snapshots too with daily_snapshot.py
What I have seen is also mostly related to what we discussed about the dumps, error handling and protection against human error. Other than that, I think it is good (I haven't tested all the options)
ie:
Trying to run only the post-processing by setting only_postprocess: True in the cnf triggers a full snapshot cycle. Is that expected?
only_postprocess: True
rotate: True
compress: True
archive: False
statistics:
  host: 'db1115.eqiad.wmnet'
  user: 'stats'
<snip>
If that line doesn't do anything, why continue the whole process instead of returning an error and stopping?
I also added a random line called:
thisdoesntwork: True
Same thing as I described with the dumps .py: if we run daily_snapshot.py without any option, I would expect it to print the usage rather than trigger the snapshot cycle.
Other than that and the pending fixes, I think the core functionality is there and I would suggest we start scheduling it in production and monitor how it goes, as probably there will be things that will arise once we consistently start using it every week (or every day).
Yeah, only_postprocess: True is ignored by daily_snapshot.py. I wonder if I should either implement it or error on it?
There are slight differences between the 2 backup files; I am working right now on documenting that. A check is "easy" to do. I didn't do it before because the options weren't set in stone and I wanted to accept arbitrary ones. I think I can first document them and then error out in case an option is not expected.
Yes, not a big deal if it is done now or after the doc (I agree, I would prefer the doc before the check), as it is mostly something that will be run in a cron, but it would be nice to error on unrecognized options rather than do the full cycle.
While I agree with the interactive usage of backup_mariadb.py, I am not sure about daily_snapshot.py, as it is not supposed to be run interactively, just from cron. Maybe I can implement an interactive version and rename it to snapshot_mariadb.py, or create a separate executable just for that, one of the 2.
Agree that it will mostly be run from cron, but not always. ie: snapshot that fails
Depending on how often we do snapshots, it might be something we need to run manually, ie: maybe a server has crashed, I need to rebuild it, and I want to take a snapshot now rather than use a 3-day-old one.
I think either a different script (interactive vs non interactive) or a check should be in place
Change 496714 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] backups: Make rentention policy configurable
Change 496746 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Better error and logging handling
@Marostegui better :-)?
Also:
root@cumin2001:~$ ./daily_snapshot.py
[10:37:49]: ERROR - Found unknown config option "only_postprocess" on section test
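For illustration, a check that produces this kind of error could look roughly like the sketch below. The set of allowed option names is an assumption taken from the config snippet quoted earlier in this task, not the real list, and the function name is hypothetical.

```python
# Assumed option names, for illustration only; the real list lives in the scripts.
KNOWN_OPTIONS = {"rotate", "compress", "archive", "statistics"}


def find_unknown_options(config, known=KNOWN_OPTIONS):
    """Return sorted (section, option) pairs whose option name is not known.

    config is expected to be a dict of section name -> dict of options,
    such as the result of loading the backups cnf file.
    """
    unknown = []
    for section, options in config.items():
        for option in options:
            if option not in known:
                unknown.append((section, option))
    return sorted(unknown)


def report_unknown_options(config):
    """Print one error per unknown option; return True if the config is clean."""
    problems = find_unknown_options(config)
    for section, option in problems:
        print('ERROR - Found unknown config option "{}" on section {}'.format(
            option, section))
    return not problems
```

Running the check before anything is started means a typo in the cnf stops the run instead of triggering a full cycle.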
Change 494899 merged by Jcrespo:
[operations/puppet@production] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts
Change 497265 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Make backups.cnf on managements hosts owned by root
Change 497265 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Make backups.cnf on managements hosts owned by root
Change 496746 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Better error and logging handling
Change 496714 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] backups: Make rentention policy configurable
Change 497843 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix missing --retention param for backup_mariadb.py
Change 497843 abandoned by Jcrespo:
mariadb-backups: Fix missing --retention param for backup_mariadb.py
Change 496746 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Better error and logging handling
Change 497853 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Fix snapshot statistics db credentials
Change 497853 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Fix snapshot statistics db credentials
Change 498024 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Make sure retention is handled correctly
Change 498029 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Allow the option to only postprocess snapshots
Change 498314 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Chose right backup source for s1
Change 498315 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] WMFBackup: Make sure retention is handled correctly
Change 498314 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Chose right backup source for s1
Change 498315 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] WMFBackup: Make sure retention is handled correctly
Change 498024 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Make sure retention is handled correctly
Change 498324 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup_mariadb: Output Log to /var/log
Change 498326 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] backup_mariadb: Output Log to /var/log
Change 498029 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Allow the option to only postprocess snapshots
Change 498326 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] backup_mariadb: Output Log to /var/log
Change 498324 merged by Jcrespo:
[operations/puppet@production] backup_mariadb: Output Log to /var/log
I would say this is done:
root@cumin2001:~$ daily_snapshot.py
[13:16:46]: INFO - Create a new empty directory at es2001.codfw.wmnet ...
[13:16:47]: INFO - Running XtraBackup at dbstore2002.codfw.wmnet:3311 and sending it to es2001.codfw.wmnet
[15:26:13]: INFO - Preparing backup at es2001.codfw.wmnet ...
[17:33:15]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/python3...atabase zarcillo'. 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/python3...atabase zarcillo'.
[17:33:15]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands. 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[17:33:15]: INFO - Backup finished correctly
On the provisioning host logs:
[15:26:14]: DEBUG - ['/opt/wmf-mariadb101/bin/mariabackup', '--prepare', '--target-dir', '/srv/backups/snapshots/ongoing/snapshot.s1.2019-03-22--13-16-46', '--use-memory', '10G']
[16:00:56]: DEBUG - ['/bin/tar', '--create', '--remove-files', '--file', '/srv/backups/snapshots/ongoing/snapshot.s1.2019-03-22--13-16-46.tar.gz', '--directory', '/srv/backups/snapshots/ongoing', '--use-compress-program', '/usr/bin/pigz -p 16', 'snapshot.s1.2019-03-22--13-16-46']
[17:33:15]: INFO - Backup s1 generated correctly.
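The two DEBUG lines are argument lists as they would be passed to subprocess. As a rough sketch, they could be built from the snapshot name like this; the function names and the constant are hypothetical, and only the paths and flags are copied from the log above.

```python
import os

# Directory taken from the log above; the constant name is an assumption.
ONGOING_DIR = "/srv/backups/snapshots/ongoing"


def prepare_command(name, memory="10G"):
    """Argument list for the mariabackup --prepare step, as seen in the log."""
    return ["/opt/wmf-mariadb101/bin/mariabackup", "--prepare",
            "--target-dir", os.path.join(ONGOING_DIR, name),
            "--use-memory", memory]


def compress_command(name, threads=16):
    """Argument list for the tar + pigz compression step, as seen in the log."""
    return ["/bin/tar", "--create", "--remove-files",
            "--file", os.path.join(ONGOING_DIR, name + ".tar.gz"),
            "--directory", ONGOING_DIR,
            "--use-compress-program", "/usr/bin/pigz -p {}".format(threads),
            name]
```

Passing a list rather than a shell string means the snapshot name never goes through shell quoting.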
It is very slow:
- 2 hours to run xtrabackup
- 30 minutes to prepare
- 1h30 to compress
But this is on an overloaded, slow, hd-based mysql source, and a slow, hd-based provisioning host.
The backup takes 1.1 TB uncompressed, and 400 GB compressed:
root@es2001:/srv/backups/snapshots/latest$ ls -al
-rw-r--r-- 1 root root 404797485854 Mar 22 17:33 snapshot.s1.2019-03-22--13-16-46.tar.gz
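As a quick cross-check of the rounded figures, the step durations and the compression ratio can be recomputed from the log timestamps and byte counts in this task. This is only illustrative arithmetic: the uncompressed size used here is the total_size value recorded in the metadata database, which is an assumption about what "uncompressed" should be measured against.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def elapsed(start, end):
    """Time between two same-day HH:MM:SS log timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Start and end timestamps of each step, copied from the logs in this task.
STEPS = {
    "xtrabackup": ("13:16:47", "15:26:13"),
    "prepare": ("15:26:14", "16:00:56"),
    "compress": ("16:00:56", "17:33:15"),
}
durations = {step: elapsed(*times) for step, times in STEPS.items()}

UNCOMPRESSED_BYTES = 1007150184199  # total_size in the metadata database
COMPRESSED_BYTES = 404797485854     # size of the .tar.gz reported by ls
ratio = COMPRESSED_BYTES / UNCOMPRESSED_BYTES

for step, duration in durations.items():
    print(step, duration)
print("compressed/uncompressed: {:.1%}".format(ratio))
```

This gives roughly 2h09 for xtrabackup, 35 minutes for the prepare and 1h32 for the compression, with the compressed file at about 40% of the recorded size.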
On metadata database:
root@db1115.eqiad.wmnet[zarcillo]> select * FROM backups where id = 959\G
*************************** 1. row ***************************
        id: 959
      name: snapshot.s1.2019-03-22--13-16-46
    status: finished
    source: dbstore2002.codfw.wmnet:3311
      host: es2001.codfw.wmnet
      type: snapshot
   section: s1
start_date: 2019-03-22 15:26:14
  end_date: 2019-03-22 17:33:15
total_size: 1007150184199
1 row in set (0.01 sec)

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backup_files where backup_id = 959 LIMIT 25;
+-----------+-----------+------------------------------+------------+---------------------+------------------+
| backup_id | file_path | file_name                    | size       | file_date           | backup_object_id |
+-----------+-----------+------------------------------+------------+---------------------+------------------+
|       959 |           | aria_log.00000001            |      16384 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | aria_log_control             |         52 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | backup-my.cnf                |        363 | 2019-03-22 15:26:12 |             NULL |
|       959 |           | enwiki                       |       8192 | 2019-03-22 15:25:47 |             NULL |
|       959 |           | heartbeat                    |         75 | 2019-03-22 15:25:47 |             NULL |
|       959 |           | ibdata1                      | 9170845696 | 2019-03-22 16:00:37 |             NULL |
|       959 |           | ib_buffer_pool               |   38204910 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | ib_logfile0                  |   50331648 | 2019-03-22 16:00:37 |             NULL |
|       959 |           | ib_logfile1                  |   50331648 | 2019-03-22 16:00:35 |             NULL |
|       959 |           | ib_lru_dump                  |   35749888 | 2019-03-22 15:26:12 |             NULL |
|       959 |           | mysql                        |       4096 | 2019-03-22 15:25:47 |             NULL |
|       959 |           | ops                          |       4096 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | performance_schema           |         27 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | sys                          |       8192 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | xtrabackup_binlog_pos_innodb |         30 | 2019-03-22 16:00:31 |             NULL |
|       959 |           | xtrabackup_checkpoints       |        115 | 2019-03-22 16:00:34 |             NULL |
|       959 |           | xtrabackup_info              |        560 | 2019-03-22 15:26:12 |             NULL |
|       959 |           | xtrabackup_logfile           | 7212761088 | 2019-03-22 16:00:34 |             NULL |
|       959 |           | xtrabackup_slave_info        |        249 | 2019-03-22 15:26:11 |             NULL |
|       959 | enwiki    | abuse_filter.frm             |       2951 | 2019-03-22 15:25:47 |             NULL |
|       959 | enwiki    | abuse_filter.ibd             |    5242880 | 2019-03-22 16:00:16 |             NULL |
|       959 | enwiki    | abuse_filter_action.frm      |       1752 | 2019-03-22 15:25:47 |             NULL |
|       959 | enwiki    | abuse_filter_action.ibd      |      90112 | 2019-03-22 13:17:24 |             NULL |
|       959 | enwiki    | abuse_filter_history.frm     |       3903 | 2019-03-22 15:25:47 |             NULL |
|       959 | enwiki    | abuse_filter_history.ibd     |   62914560 | 2019-03-22 13:17:32 |             NULL |
+-----------+-----------+------------------------------+------------+---------------------+------------------+
25 rows in set (0.00 sec)
Better hardware and tuning should improve those times a lot, but that will only happen once it runs on the final hardware.
Purging also works, as it did on a previous run.
There are probably many outstanding bugs and optimizations to be done, but the scope of this task was a POC; further work will be tracked on its parent task.
I will now document its usage, as it has a semi-stable interface.