
Implement cron-based mydumper backups on the dbstore role
Closed, Resolved (Public)

Event Timeline

Change 371925 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Install mydumper on dbstore_multiinstance hosts, drop tls

https://gerrit.wikimedia.org/r/371925

Change 371925 merged by Jcrespo:
[operations/puppet@production] mariadb: Install mydumper on dbstore_multiinstance hosts, drop tls

https://gerrit.wikimedia.org/r/371925

Trying:

shard=s1
numthreads=8
# Note: $(whoami) and $(date ...) are command substitutions; $(( )) would be arithmetic expansion.
mydumper --compress --host=localhost --threads=$numthreads --user=$(whoami) --socket=/run/mysqld/mysqld.$shard.sock --triggers --routines --events --rows=100000000 --logfile=$backupdir/dump.log --outputdir=$backupdir/$shard.$(date +%Y%m%d%H%M%S)

Change 371935 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable buffer pool loading and dumping on new dbstores

https://gerrit.wikimedia.org/r/371935

Change 371935 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable buffer pool loading and dumping on new dbstores

https://gerrit.wikimedia.org/r/371935

It took 3 hours to do the dump:

Started dump at: 2017-08-14 09:02:16
...
Finished dump at: 2017-08-14 12:04:49
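As a sanity check, the roughly 3-hour figure follows directly from the two log timestamps (assuming GNU date for the `-d` option):

```shell
# Compute the elapsed dump time, in minutes, from the log timestamps (GNU date).
start=$(date -d '2017-08-14 09:02:16' +%s)
end=$(date -d '2017-08-14 12:04:49' +%s)
echo "$(( (end - start) / 60 )) minutes"   # → 182 minutes
```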

That is much worse than T162789#3238231, but there are several differences:

  • the current instance was using only 15 GB of memory instead of 512 GB
  • the host is much older; probably the CPU, too
  • no SSDs
  • replication was running for the other instances, reducing available iops
  • replication was running for the dumped database (enwiki)
  • it has compressed tables, which probably makes queries much slower
  • in the last half hour, some threads were idle, waiting for all threads to complete (watchlist and templatelinks were still finishing)

Maybe we can stop replication on all instances and measure how that changes the export time.
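A minimal sketch of how replication could be stopped across the multi-instance sockets; the shard list and socket paths are assumptions based on the command above, and it is shown as a dry run:

```shell
# Dry run: print the STOP SLAVE command for each assumed shard socket.
for shard in s1 s2 s3 s4 s5 s6 s7; do
  echo "mysql --socket=/run/mysqld/mysqld.$shard.sock -e 'STOP SLAVE;'"
done
# To actually stop replication, run the mysql client directly instead of echoing it.
```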

Mentioned in SAL (#wikimedia-operations) [2017-08-14T12:21:22Z] <jynus> stopping replication on all instances of dbstore2001 T169516

s2 took less time, an hour and a half, but its tables are much smaller, we used 16 threads, and all replication threads were stopped:

Started dump at: 2017-08-14 12:24:42
Finished dump at: 2017-08-14 13:57:25

Change 371944 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] [WIP]mariadb: First attempt at a mydumper-based dump script

https://gerrit.wikimedia.org/r/371944

Change 374560 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Implement regular logical backups using mydumper

https://gerrit.wikimedia.org/r/374560

Change 371944 abandoned by Jcrespo:
[WIP]mariadb: First attempt at a mydumper-based dump script

Reason:
in favour of https://gerrit.wikimedia.org/r/374560

https://gerrit.wikimedia.org/r/371944

We need this ASAP: dbstore1001 crashed and is not in a good state; moreover, it can no longer catch up with replication reasonably well.

Change 374560 merged by Jcrespo:
[operations/puppet@production] mariadb: Implement regular logical backups using mydumper

https://gerrit.wikimedia.org/r/374560

Change 381472 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix require on cronjob

https://gerrit.wikimedia.org/r/381472

Change 381472 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Fix require on cronjob

https://gerrit.wikimedia.org/r/381472

Change 381491 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: backup user must be dump to match already in use mysql account

https://gerrit.wikimedia.org/r/381491

Change 381491 merged by Jcrespo:
[operations/puppet@production] mariadb: backup user must be dump to match already in use mysql account

https://gerrit.wikimedia.org/r/381491

The basic script works and the users are set up on dbstore2001:s5; a lot of follow-up is pending, both in hardware and in scripting.
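For reference, a hypothetical sketch of what the cron-driven wrapper might look like. None of these names come from the merged patch; the script path, backup directory, and crontab schedule are placeholders (the `dump` MySQL user is the one mentioned above), and the mydumper invocation is only echoed:

```shell
#!/bin/bash
# Hypothetical cron-driven wrapper around mydumper; paths and schedule are placeholders.
# Example crontab entry (daily at 01:00):
#   0 1 * * * root /usr/local/bin/dump-shard.sh s5
shard=${1:-s5}
backupdir=/srv/backups
outputdir="$backupdir/$shard.$(date +%Y%m%d%H%M%S)"
# Dry run: print the command that would be executed for this shard.
echo "Would run: mydumper --compress --user=dump --socket=/run/mysqld/mysqld.$shard.sock --outputdir=$outputdir"
```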