Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T138562 Improve regular production database backups handling
Resolved | | jcrespo | T162789 Create less overhead on bacula jobs when dumping production databases
Resolved | | jcrespo | T169658 Improve database backups' coverage, monitoring and data recovery time (part 1) (tracking)
Resolved | | jcrespo | T169516 Implement cron-based mydumper backups on the dbstore role
Event Timeline
Change 371925 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] Install mydumper on dbstore_multiinstance hosts, drop tls
Change 371925 merged by Jcrespo:
[operations/puppet@production] mariadb: Install mydumper on dbstore_multiinstance hosts, drop tls
Trying:

    # $backupdir must point at an existing directory
    shard=s1
    numthreads=8
    mydumper --compress --host=localhost --threads=$numthreads \
        --user=$(whoami) --socket=/run/mysqld/mysqld.$shard.sock \
        --triggers --routines --events --rows=100000000 \
        --logfile=$backupdir/dump.log \
        --outputdir=$backupdir/$shard.$(date +%Y%m%d%H%M%S)
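For the recovery side, restoring such a dump would go through myloader; a minimal sketch reusing the variables above (the timestamped directory name is an assumed example, not an actual run):

    # Hedged sketch: load a mydumper output directory back with myloader.
    dumpdir=$backupdir/$shard.20170814090216   # example: the directory the dump created
    myloader --host=localhost --socket=/run/mysqld/mysqld.$shard.sock \
        --user=$(whoami) --threads=$numthreads \
        --overwrite-tables --directory=$dumpdir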
Change 371935 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable buffer pool loading and dumping on new dbstores
Change 371935 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable buffer pool loading and dumping on new dbstores
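For context, buffer pool dumping and loading are controlled by two standard InnoDB server variables; a hedged sketch of the kind of fragment such a change would manage (the file path is an assumption, not the actual puppet layout):

    # Assumed my.cnf fragment (path illustrative:
    # /etc/mysql/mariadb.conf.d/no-buffer-pool-dump.cnf)
    [mysqld]
    innodb_buffer_pool_dump_at_shutdown = 0
    innodb_buffer_pool_load_at_startup  = 0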
It took 3 hours to do the dump:
Started dump at: 2017-08-14 09:02:16 ... Finished dump at: 2017-08-14 12:04:49
That is much worse than T162789#3238231, but the differences are:
- The current instance was using only 15 GB of memory instead of 512 GB
- The host is much older (probably the CPU, too)
- No SSDs
- Replication was running for the other instances, reducing available IOPS
- Replication was running for the dumped database (enwiki)
- It has compressed tables, which probably makes queries much slower
- In the last half hour, some threads were idle, waiting for all threads to complete (watchlist and templatelinks were still finishing)
Maybe we can stop replication on all hosts and measure how that changes the export time.
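A minimal sketch of what that would look like on one multi-instance host, assuming one socket per section as above (the section list is illustrative, not the actual layout):

    # Stop the replication threads on every local instance before timing
    # a dump run; START SLAVE again on each once the dump finishes.
    for shard in s1 s2 s3 s4 s5 s6 s7 s8; do
        mysql --socket=/run/mysqld/mysqld.$shard.sock -e "STOP SLAVE;"
    done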
Mentioned in SAL (#wikimedia-operations) [2017-08-14T12:21:22Z] <jynus> stopping replication on all instances of dbstore2001 T169516
s2 took less time, an hour and a half, but its tables are much smaller, we used 16 threads, and all replication threads were stopped:
Started dump at: 2017-08-14 12:24:42 Finished dump at: 2017-08-14 13:57:25
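mydumper records these start/finish lines in the metadata file it writes into the output directory; a hedged sketch of deriving the wall-clock duration from it (the directory name is an assumed example):

    # Hypothetical helper: compute the dump duration from the metadata file.
    meta=$backupdir/s2.20170814122442/metadata   # example path
    start=$(date -d "$(sed -n 's/^Started dump at: //p' "$meta")" +%s)
    end=$(date -d "$(sed -n 's/^Finished dump at: //p' "$meta")" +%s)
    echo "dump took $(( (end - start) / 60 )) minutes"   # here: 92 minutes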
Change 371944 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] [WIP]mariadb: First attempt at a mydumper-based dump script
Change 374560 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Implement regular logical backups using mydumper
Change 371944 abandoned by Jcrespo:
[WIP]mariadb: First attempt at a mydumper-based dump script
Reason:
in favour of https://gerrit.wikimedia.org/r/374560
We need this ASAP: dbstore1001 crashed and it is not in a good state; plus it can no longer catch up with replication reasonably well.
Change 374560 merged by Jcrespo:
[operations/puppet@production] mariadb: Implement regular logical backups using mydumper
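With that merged, the recurring job boils down to looping mydumper over the local sections; a hedged sketch of the general shape (paths, section list and options are assumptions, not the actual puppetized script):

    #!/bin/bash
    # Illustrative per-section backup loop; every name here is an example.
    backupdir=/srv/backups
    for shard in s1 s2 s3 s4 s5 s6 s7 s8; do
        mydumper --compress --triggers --routines --events \
            --rows=100000000 --threads=16 \
            --socket=/run/mysqld/mysqld.$shard.sock \
            --logfile=$backupdir/dump.$shard.log \
            --outputdir=$backupdir/$shard.$(date +%Y%m%d%H%M%S)
    done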
Change 381472 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Fix require on cronjob
Change 381472 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Fix require on cronjob
Change 381491 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: backup user must be dump to match already in use mysql account
Change 381491 merged by Jcrespo:
[operations/puppet@production] mariadb: backup user must be dump to match already in use mysql account
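For reference, the privileges such a dump account needs for the flags used here; a hedged sketch, not the exact production grants:

    # Assumed minimal grants: SELECT to read data, RELOAD for FLUSH TABLES
    # WITH READ LOCK, REPLICATION CLIENT for SHOW MASTER STATUS, and
    # TRIGGER/EVENT for the --triggers/--events definitions.
    mysql -e "GRANT SELECT, RELOAD, REPLICATION CLIENT, TRIGGER, EVENT ON *.* TO 'dump'@'localhost';"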
The basic script works and the users are set up on dbstore2001:s5; a lot of follow-up is pending, both in hardware and scripting.