Implement a proof of concept of a snapshot cycle automation for a mediawiki section database
Closed, ResolvedPublic

Description

Implement a utility equivalent to dump_section.py, but for taking snapshots (binary or raw backups). More information on the parent task.


Event Timeline

Marostegui moved this task from In progress to Next on the DBA board.Dec 4 2018, 6:19 AM
jcrespo moved this task from Next to In progress on the DBA board.Jan 22 2019, 12:22 PM

Change 486257 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] transfer.py: Add the ability to transfer from a new mariabackup

https://gerrit.wikimedia.org/r/486257

Change 486264 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] transfer.py: Add the ability to transfer from a new mariabackup

https://gerrit.wikimedia.org/r/486264

Change 486257 abandoned by Jcrespo:
transfer.py: Add the ability to transfer from a new mariabackup

https://gerrit.wikimedia.org/r/486257

Change 489657 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software@master] mariadb-package: Upgrade to 10.1.38, add mariabackup to path

https://gerrit.wikimedia.org/r/489657

Change 489657 merged by Jcrespo:
[operations/software@master] mariadb-package: Upgrade to 10.1.38, add mariabackup to path

https://gerrit.wikimedia.org/r/489657

Change 486264 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] transfer.py: Add the ability to transfer from a new mariabackup

https://gerrit.wikimedia.org/r/486264

Change 491251 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Modify dump_section to allow different types of dump

https://gerrit.wikimedia.org/r/491251

Change 491256 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Update mariadb logical path location

https://gerrit.wikimedia.org/r/491256

Change 491251 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Modify dump_section to allow different types of dump

https://gerrit.wikimedia.org/r/491251

Change 475471 merged by Jcrespo:
[operations/puppet@production] mariadb: Modify dump_section to allow different types of dump

https://gerrit.wikimedia.org/r/475471

Change 491256 merged by Jcrespo:
[operations/puppet@production] mariadb: Update mariadb logical path location

https://gerrit.wikimedia.org/r/491256

Mentioned in SAL (#wikimedia-operations) [2019-02-18T15:21:06Z] <jynus> move logical backups to subdirectory T210292

Change 491293 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb-backups: Fix bug when trying the default type

https://gerrit.wikimedia.org/r/491293

Change 491295 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix bug when trying the default type

https://gerrit.wikimedia.org/r/491295

Change 491293 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb-backups: Fix bug when trying the default type

https://gerrit.wikimedia.org/r/491293

Change 491295 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Fix bug when trying the default type

https://gerrit.wikimedia.org/r/491295

Change 491818 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Add the option of postprocessing backups

https://gerrit.wikimedia.org/r/491818

Change 491818 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Add the option of postprocessing backups

https://gerrit.wikimedia.org/r/491818

Change 493218 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Update dump_section.py and recover_section.py

https://gerrit.wikimedia.org/r/493218

Change 493218 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Update dump_section.py and recover_section.py

https://gerrit.wikimedia.org/r/493218

Marostegui assigned this task to jcrespo.Mar 1 2019, 5:55 AM

Assigning to Jaime as he is currently working on it

Change 494899 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts

https://gerrit.wikimedia.org/r/494899

Known bugs right now (CC @Marostegui):

  • All transfers use port 4444, which prevents concurrent transfers (a second one either fails or creates corruption, probably the former)
  • In transfer.py, specifying a host that is not in puppet throws an exception (because of an out-of-bounds array access in cumin). Hosts should probably be validated
  • Paths are not validated. While transfer.py doesn't allow overwriting files (and requires an empty directory), it would be nice to prevent writing to arbitrary locations
  • No good error handling or logging
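As a rough illustration of the path and port items above, here is a minimal sketch. It is not what transfer.py does: `ALLOWED_BASE`, `is_safe_target` and `find_free_port` are hypothetical names, and the base directory is only guessed from the /srv/backups paths seen in this task.

```python
import os
import socket

# Hypothetical base directory (an assumption, not taken from transfer.py).
ALLOWED_BASE = '/srv/backups'


def is_safe_target(path, base=ALLOWED_BASE):
    """Return True only if path resolves to a location strictly inside base."""
    base = os.path.realpath(base)
    target = os.path.realpath(path)
    return os.path.commonpath([base, target]) == base and target != base


def find_free_port():
    """Ask the kernel for a TCP port that is unused on this host right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]
```

Note that a port check like this would have to run on the receiving host, and the port could still be taken between the check and the actual transfer, so it only reduces the collision risk rather than removing it.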

Despite all that, a full cycle is now running from db1114 to dbstore1001.

Mentioned in SAL (#wikimedia-operations) [2019-03-12T12:23:42Z] <jynus> testing snapshotting on db1117:3325 -> dbstore1001 T210292

Snapshotting is now working consistently. There was a locking issue caused by stdout piping combined with xtrabackup's verbose output (on long-running backups).
Next: fixing the port multiplexing issue to allow multiple simultaneous backups.
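For context, this kind of lock-up typically happens when a child process writes more output than a pipe buffer holds and nobody reads the other end. The comment above does not say how it was fixed, so the following is only a sketch of one common approach: send the verbose output to a file instead of an unread pipe. `run_logged` is a hypothetical helper, and the noisy child is a stand-in for the real backup command.

```python
import subprocess
import sys
import tempfile


def run_logged(cmd, log_file):
    """Run cmd with stderr going straight to log_file, so no pipe buffer can fill up."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=log_file)
    return proc.wait()


# Stand-in for a verbose, long-running command: writes 1 MB to stderr.
noisy = [sys.executable, '-c', "import sys; sys.stderr.write('x' * 1000000)"]

with tempfile.TemporaryFile() as log:
    returncode = run_logged(noisy, log)
    log.seek(0)
    logged_bytes = len(log.read())
```

With `stderr=subprocess.PIPE` and a plain `wait()`, the same child could block forever once the pipe buffer is full.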

Other issues: Popen does an active wait until the process finishes, which wastes CPU cycles. Maybe this could be done better with a non-blocking check plus sleep?
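A minimal sketch of the "non-blocking check plus sleep" idea, assuming the current code spins on the process state; `wait_with_sleep` is a hypothetical helper, not part of the existing scripts.

```python
import subprocess
import sys
import time


def wait_with_sleep(proc, interval=1.0):
    """Check the process periodically, sleeping between checks instead of spinning."""
    while proc.poll() is None:
        time.sleep(interval)
    return proc.returncode


# Usage: a short-lived child standing in for a long-running backup.
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(0.2)'])
exit_code = wait_with_sleep(child, interval=0.05)
```

For backups that take hours, an interval of several seconds would make the polling cost negligible while still allowing progress checks or timeouts between polls.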

Mentioned in SAL (#wikimedia-operations) [2019-03-13T06:21:03Z] <marostegui> Testing snapshotting on db1117:3321 to > dbstore1001 - T210292

Mentioned in SAL (#wikimedia-operations) [2019-03-13T07:13:56Z] <marostegui> Test snapshot dbstore1001:3311 to dbstore1001 - T210292

Mentioned in SAL (#wikimedia-operations) [2019-03-13T11:34:10Z] <marostegui> Test snapshot db1117:3325 to dbstore1001 - T210292

I have been testing the snapshots too, with daily_snapshot.py.
What I have seen is mostly related to what we discussed about the dumps: error handling and protection against human error. Other than that, I think it is good (I haven't tested all the options).

e.g.:
Trying to run only the post-processing by setting only_postprocess: True in the cnf triggers a full snapshot cycle. Is that expected?

only_postprocess: True
rotate: True
compress: True
archive: False
statistics:
  host: 'db1115.eqiad.wmnet'
  user: 'stats'
<snip>

If that line doesn't do anything, why continue the whole process instead of returning an error and stopping?
I also added a random line:
thisdoesntwork: True

Same thing as I described with the dumps script: if we run daily_snapshot.py without any options, I would expect it to print the usage instead of triggering the snapshot cycle.

Other than that and the pending fixes, I think the core functionality is there. I would suggest we start scheduling it in production and monitor how it goes, as issues will probably arise once we start using it consistently every week (or every day).

Yeah, only_postprocess: True is ignored by daily_snapshot.py. I wonder if I should either implement it or error on it?

There are slight differences between the 2 backup files; I am working right now on documenting them. A check is "easy" to do. I didn't do it before because, as the options weren't set in stone, I wanted to accept arbitrary ones. I think I can first document them and then error out in case they are not expected.

Yes, it is not a big deal whether it is done now or after the doc (I agree, I would prefer the doc before the check), as it is mostly something that will be run from cron, but it would be nice to error on unrecognized options rather than run the full cycle.
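The check discussed here could look roughly like this. It is only a sketch: `KNOWN_OPTIONS` is a hypothetical allow-list built from option names quoted in this task, not the real list accepted by daily_snapshot.py, and `unknown_options` is an invented helper name.

```python
# Hypothetical allow-list (an assumption, not the script's real option set).
KNOWN_OPTIONS = {'rotate', 'compress', 'archive', 'statistics'}


def unknown_options(section_config, known=KNOWN_OPTIONS):
    """Return the option names in a section that are not in the allow-list."""
    return sorted(set(section_config) - set(known))


# Example section using the options from the test described above.
test_section = {
    'only_postprocess': True,
    'rotate': True,
    'compress': True,
    'archive': False,
    'thisdoesntwork': True,
}
rejected = unknown_options(test_section)
```

A caller would log each rejected name and exit before starting any transfer, instead of silently ignoring it and running the full cycle.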

While I agree about the interactive usage of backup_mariadb.py, I am not sure about daily_snapshot.py, as it is not supposed to be run interactively, only from cron. Maybe I can implement an interactive version and rename it to snapshot_mariadb.py, or create a separate executable just for that, one of the two.

Agree that it will mostly be run from cron, but not always, e.g. when a snapshot fails.
Depending on how often we do snapshots, it might be something we need to run manually, e.g. a server has crashed, I need to rebuild it, and I want to take a snapshot now rather than use a 3-day-old one.

I think either a different script (interactive vs non interactive) or a check should be in place

Change 496714 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] backups: Make rentention policy configurable

https://gerrit.wikimedia.org/r/496714

Change 496746 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Better error and logging handling

https://gerrit.wikimedia.org/r/496746

@Marostegui better :-)?

Also:

root@cumin2001:~$ ./daily_snapshot.py 
[10:37:49]: ERROR - Found unknown config option "only_postprocess" on section test

Oooooh :-)
Thanks! :)

Change 494899 merged by Jcrespo:
[operations/puppet@production] mariadb-backup: Deploy new snapshot cycle to cumin and provisioning hosts

https://gerrit.wikimedia.org/r/494899

Change 497265 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Make backups.cnf on managements hosts owned by root

https://gerrit.wikimedia.org/r/497265

Change 497265 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Make backups.cnf on managements hosts owned by root

https://gerrit.wikimedia.org/r/497265

Change 496714 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] backups: Make rentention policy configurable

https://gerrit.wikimedia.org/r/496714

Change 497843 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix missing --retention param for backup_mariadb.py

https://gerrit.wikimedia.org/r/497843

Change 497843 abandoned by Jcrespo:
mariadb-backups: Fix missing --retention param for backup_mariadb.py

https://gerrit.wikimedia.org/r/497843

Change 496746 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Better error and logging handling

https://gerrit.wikimedia.org/r/496746

Change 497853 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Fix snapshot statistics db credentials

https://gerrit.wikimedia.org/r/497853

Change 497853 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Fix snapshot statistics db credentials

https://gerrit.wikimedia.org/r/497853

Change 498024 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Make sure retention is handled correctly

https://gerrit.wikimedia.org/r/498024

Change 498029 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Allow the option to only postprocess snapshots

https://gerrit.wikimedia.org/r/498029

Change 498314 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-snapshots: Chose right backup source for s1

https://gerrit.wikimedia.org/r/498314

Change 498315 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] WMFBackup: Make sure retention is handled correctly

https://gerrit.wikimedia.org/r/498315

Change 498314 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Chose right backup source for s1

https://gerrit.wikimedia.org/r/498314

Change 498315 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] WMFBackup: Make sure retention is handled correctly

https://gerrit.wikimedia.org/r/498315

Change 498024 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Make sure retention is handled correctly

https://gerrit.wikimedia.org/r/498024

Change 498324 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] backup_mariadb: Output Log to /var/log

https://gerrit.wikimedia.org/r/498324

Change 498326 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] backup_mariadb: Output Log to /var/log

https://gerrit.wikimedia.org/r/498326

Change 498029 merged by Jcrespo:
[operations/puppet@production] mariadb-snapshots: Allow the option to only postprocess snapshots

https://gerrit.wikimedia.org/r/498029

Change 498326 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] backup_mariadb: Output Log to /var/log

https://gerrit.wikimedia.org/r/498326

Change 498324 merged by Jcrespo:
[operations/puppet@production] backup_mariadb: Output Log to /var/log

https://gerrit.wikimedia.org/r/498324

jcrespo closed this task as Resolved.Mar 22 2019, 6:23 PM

I would say this is done:

root@cumin2001:~$ daily_snapshot.py
[13:16:46]: INFO - Create a new empty directory at es2001.codfw.wmnet
...
[13:16:47]: INFO - Running XtraBackup at dbstore2002.codfw.wmnet:3311 and sending it to es2001.codfw.wmnet
[15:26:13]: INFO - Preparing backup at es2001.codfw.wmnet
...
[17:33:15]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/python3...atabase zarcillo'.
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/python3...atabase zarcillo'.
[17:33:15]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[17:33:15]: INFO - Backup finished correctly

On the provisioning host logs:

[15:26:14]: DEBUG - ['/opt/wmf-mariadb101/bin/mariabackup', '--prepare', '--target-dir', '/srv/backups/snapshots/ongoing/snapshot.s1.2019-03-22--13-16-46', '--use-memory', '10G']
[16:00:56]: DEBUG - ['/bin/tar', '--create', '--remove-files', '--file', '/srv/backups/snapshots/ongoing/snapshot.s1.2019-03-22--13-16-46.tar.gz', '--directory', '/srv/backups/snapshots/ongoing', '--use-compress-program', '/usr/bin/pigz -p 16', 'snapshot.s1.2019-03-22--13-16-46']
[17:33:15]: INFO - Backup s1 generated correctly.

It is very slow:

  • 2 hours to run xtrabackup
  • 30 minutes to prepare
  • 1h30 to compress

But this is on an overloaded, slow, hd-based mysql source, and a slow, hd-based provisioning host.
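The rounded figures above can be checked against the timestamps in the two logs. This is just a small worked calculation; the phase boundaries are my reading of the log lines (the last phase ends at "Backup s1 generated correctly", so it may include more than the tar/pigz step).

```python
from datetime import datetime

FMT = '%H:%M:%S'

# (phase, start, end) taken from the log timestamps quoted above.
marks = [
    ('xtrabackup', '13:16:47', '15:26:13'),
    ('prepare', '15:26:14', '16:00:56'),
    ('compress', '16:00:56', '17:33:15'),
]

durations = {
    name: datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    for name, start, end in marks
}

for name, delta in durations.items():
    print(f'{name}: {delta}')
```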

The backup takes 1.1 TB uncompressed, and 400 GB compressed:

root@es2001:/srv/backups/snapshots/latest$ ls -al
-rw-r--r-- 1 root root 404797485854 Mar 22 17:33 snapshot.s1.2019-03-22--13-16-46.tar.gz

On metadata database:

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backups where id = 959\G
*************************** 1. row ***************************
        id: 959
      name: snapshot.s1.2019-03-22--13-16-46
    status: finished
    source: dbstore2002.codfw.wmnet:3311
      host: es2001.codfw.wmnet
      type: snapshot
   section: s1
start_date: 2019-03-22 15:26:14
  end_date: 2019-03-22 17:33:15
total_size: 1007150184199
1 row in set (0.01 sec)
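As a quick sanity check on the sizes, the compression ratio can be derived from the two byte counts shown above (the `ls` output for the tarball and `total_size` in the metadata row). Only those two numbers come from this task; the variable names are illustrative.

```python
# Byte counts copied from the ls output and the zarcillo metadata row.
compressed_bytes = 404797485854
total_size_bytes = 1007150184199

ratio = compressed_bytes / total_size_bytes
print(f'compressed size is {ratio:.1%} of the recorded total_size')
```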

root@db1115.eqiad.wmnet[zarcillo]> select * FROM backup_files where backup_id = 959 LIMIT 25;
+-----------+-----------+------------------------------+------------+---------------------+------------------+
| backup_id | file_path | file_name                    | size       | file_date           | backup_object_id |
+-----------+-----------+------------------------------+------------+---------------------+------------------+
|       959 |           | aria_log.00000001            |      16384 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | aria_log_control             |         52 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | backup-my.cnf                |        363 | 2019-03-22 15:26:12 |             NULL |
|       959 |           | enwiki                       |       8192 | 2019-03-22 15:25:47 |             NULL |
|       959 |           | heartbeat                    |         75 | 2019-03-22 15:25:47 |             NULL |
|       959 |           | ibdata1                      | 9170845696 | 2019-03-22 16:00:37 |             NULL |
|       959 |           | ib_buffer_pool               |   38204910 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | ib_logfile0                  |   50331648 | 2019-03-22 16:00:37 |             NULL |
|       959 |           | ib_logfile1                  |   50331648 | 2019-03-22 16:00:35 |             NULL |
|       959 |           | ib_lru_dump                  |   35749888 | 2019-03-22 15:26:12 |             NULL |
|       959 |           | mysql                        |       4096 | 2019-03-22 15:25:47 |             NULL |
|       959 |           | ops                          |       4096 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | performance_schema           |         27 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | sys                          |       8192 | 2019-03-22 15:26:11 |             NULL |
|       959 |           | xtrabackup_binlog_pos_innodb |         30 | 2019-03-22 16:00:31 |             NULL |
|       959 |           | xtrabackup_checkpoints       |        115 | 2019-03-22 16:00:34 |             NULL |
|       959 |           | xtrabackup_info              |        560 | 2019-03-22 15:26:12 |             NULL |
|       959 |           | xtrabackup_logfile           | 7212761088 | 2019-03-22 16:00:34 |             NULL |
|       959 |           | xtrabackup_slave_info        |        249 | 2019-03-22 15:26:11 |             NULL |
|       959 | enwiki    | abuse_filter.frm             |       2951 | 2019-03-22 15:25:47 |             NULL |
|       959 | enwiki    | abuse_filter.ibd             |    5242880 | 2019-03-22 16:00:16 |             NULL |
|       959 | enwiki    | abuse_filter_action.frm      |       1752 | 2019-03-22 15:25:47 |             NULL |
|       959 | enwiki    | abuse_filter_action.ibd      |      90112 | 2019-03-22 13:17:24 |             NULL |
|       959 | enwiki    | abuse_filter_history.frm     |       3903 | 2019-03-22 15:25:47 |             NULL |
|       959 | enwiki    | abuse_filter_history.ibd     |   62914560 | 2019-03-22 13:17:32 |             NULL |
+-----------+-----------+------------------------------+------------+---------------------+------------------+
25 rows in set (0.00 sec)

Hardware and tuning should improve that by a lot, but that will happen when it runs on final hw.

Purging also works, as it did on a previous run.

There are probably many outstanding bugs and optimizations to be done, but the scope of this task was a POC; further work will be tracked on its parent task.

I will now document its usage, as it has a semi-stable interface.