
Recreate analytics-meta replica on db1108 from master on an-coord1001
Closed, Resolved, Public

Description

In https://phabricator.wikimedia.org/T279440#7489667 and https://phabricator.wikimedia.org/T284150#7490379 we decided to temporarily postpone the work on those tasks so that we could prioritize other goals for Q2 2021.

In the meantime, we'd like to at least make the analytics-meta backup replica on db1108 consistent (at least at the time of a restore) with the master instance running on an-coord1001. We cannot recreate from the failover replica on an-coord1002, as it too is showing data inconsistencies.

Event Timeline

@Marostegui, I'm reading https://wikitech.wikimedia.org/wiki/Backup-mariadb, can/should we use that to recreate a replica from a master? Is it as simple as

  • Installing backup-mariadb on db1108
  • Running:
backup-mariadb \
--host an-coord1001.eqiad.wmnet \
--port 3306 \
--user <username?> \
--password <password?> \
--type snapshot

Then following some steps in https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Provision_a_precompressed_and_prepared_snapshot_(preferred) to restore and start the slave from the correct position?
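
i.e. very roughly something like this on db1108, just to check my understanding (the snapshot file name is a placeholder and the real steps are on that wikitech page):

# unpack the prepared snapshot into the instance's data directory and fix ownership
sudo tar -xzf snapshot.analytics_meta.<date>.tar.gz -C /srv/sqldata.analytics_meta
sudo chown -R mysql:mysql /srv/sqldata.analytics_meta
# start the instance without replication
sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
sudo service mariadb@analytics_meta start
# then read the binlog file/position recorded by the backup (xtrabackup_binlog_info)
# and use it in CHANGE MASTER TO ... before running START SLAVE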

Ottomata triaged this task as High priority.

@jcrespo Oh, perhaps this is a question for you?

Following some direction from Data-Persistence we have decided to use the transfer.py tool with the --type=xtrabackup method to create the snapshot and transfer it to the destination.
As per this documentation: https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Copy_data_from_a_backup_source_to_a_host

So far I have tried the following command:

btullis@cumin1001:~$ time sudo transfer.py --no-encrypt --type=xtrabackup --verbose an-coord1001.eqiad.wmnet:/run/mysqld/mysqld.sock db1108.eqiad.wmnet:/srv/sqldata.analytics_meta.T295312

It is failing at the moment. I have a feeling that it might be because we're crossing the analytics VLAN boundary. Continuing to investigate.

It looks like this is the case. I can't copy a simple file between these two servers using transfer.py

btullis@cumin1001:~$ time sudo transfer.py --no-encrypt --verbose an-coord1001.eqiad.wmnet:/home/btullis/test/joke.txt db1108.eqiad.wmnet:/home/btullis/test/
2021-11-09 18:25:40  INFO: About to transfer /home/btullis/test/joke.txt from an-coord1001.eqiad.wmnet to ['db1108.eqiad.wmnet']:['/home/btullis/test/'] (29 bytes)
                     100.0% (1/1) of nodes failed to execute command '/bin/bash -c "/u...qiad.wmnet 4403"': an-coord1001.eqiad.wmnet
2021-11-09 18:26:49  ERROR: Copy from an-coord1001.eqiad.wmnet:/home/btullis/test/joke.txt to db1108.eqiad.wmnet:/home/btullis/test/ failed
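
To narrow it down, a quick way to check whether it's the firewall rather than transfer.py itself is to probe the port the receiving side listens on while a transfer attempt is in progress (4400 is transfer.py's default starting port; the failing run above appears to have picked 4403):

# from the sending host, check whether the transfer port on db1108 is reachable across the VLAN boundary
btullis@an-coord1001:~$ nc -vz -w 2 db1108.eqiad.wmnet 4403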

Maybe I should just create the local snapshot on an-coord1001 and save it to its local disk, then copy it another way after creation.

Maybe I should just create the local snapshot on an-coord1001

I don't think there is room!

But we should make transfer.py work between the VLANs anyway.

Change 737889 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/homer/public@master] Add port 4400/tcp to the mysql-replica rule in the analytics policy

https://gerrit.wikimedia.org/r/737889

Maybe I should just create the local snapshot on an-coord1001

I don't think there is room!

True. There might just about be enough room if we used compression, but it's too close for comfort.
an-coord1002 definitely doesn't have enough space either.
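
(For reference, the comparison I'm making is roughly this; the data directory path on an-coord1001 is an assumption on my part:)

# size of the live data directory vs. free space remaining on /srv
btullis@an-coord1001:~$ sudo du -sh /srv/sqldata
btullis@an-coord1001:~$ df -h /srv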

I tried to use an-presto1002 as an intermediate host, but that failed because it doesn't have the mbstream binary available in order to de-serialize the streamed backup.
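
(For future reference, an intermediate host would just need the mariadb-backup package, which I believe is what ships mbstream; unpacking a streamed backup would look roughly like this, with the file name and target path as placeholders:)

# mbstream reads the xbstream-format backup from stdin and unpacks it into the current directory
cd /srv/backup.analytics_meta && mbstream -x < /path/to/streamed_backup.xb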

So I have created this CR to add a specific port to Homer: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/737889

Change 737889 merged by Btullis:

[operations/homer/public@master] Add port 4400/tcp to the mysql-replica rule in the analytics policy

https://gerrit.wikimedia.org/r/737889

Success! The backup is now streaming from an-coord1001 to db1108.

Change 737906 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/homer/public@master] Revert the temporary change that was made for transfer.py

https://gerrit.wikimedia.org/r/737906

The backup completed successfully.

Now preparing the backup with:

root@db1108:/srv/sqldata.analytics_meta.T295312# xtrabackup --prepare --use-memory=100GB --target-dir=/srv/sqldata.analytics_meta.T295312
xtrabackup based on MariaDB server 10.4.18-MariaDB Linux (x86_64)
[00] 2021-11-10 13:42:55 cd to /srv/sqldata.analytics_meta.T295312/
[00] 2021-11-10 13:42:55 open files limit requested 0, set to 1024
[00] 2021-11-10 13:42:55 This target seems to be not prepared yet.
[00] 2021-11-10 13:42:55 mariabackup: using the following InnoDB configuration for recovery:
[00] 2021-11-10 13:42:55 innodb_data_home_dir = .
[00] 2021-11-10 13:42:55 innodb_data_file_path = ibdata1:12M:autoextend
[00] 2021-11-10 13:42:55 innodb_log_group_home_dir = .
[00] 2021-11-10 13:42:55 InnoDB: Using Linux native AIO
[00] 2021-11-10 13:42:55 Starting InnoDB instance for recovery.
[00] 2021-11-10 13:42:55 mariabackup: Using 107374182400 bytes for buffer pool (set by --use-memory parameter)
2021-11-10 13:42:55 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2021-11-10 13:42:55 0 [Note] InnoDB: Uses event mutexes
2021-11-10 13:42:55 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-11-10 13:42:55 0 [Note] InnoDB: Number of pools: 1
2021-11-10 13:42:55 0 [Note] InnoDB: Using SSE2 crc32 instructions
2021-11-10 13:42:55 0 [Note] InnoDB: Initializing buffer pool, total size = 100G, instances = 1, chunk size = 100G
2021-11-10 13:42:57 0 [Note] InnoDB: Completed initialization of buffer pool
2021-11-10 13:42:57 0 [Note] InnoDB: page_cleaner coordinator priority: -20
2021-11-10 13:42:57 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=633097821131
2021-11-10 13:42:57 0 [Note] InnoDB: Starting final batch to recover 454 pages from redo log.
2021-11-10 13:42:58 0 [Note] InnoDB: Last binlog file './analytics-meta-bin.018499', position 465589381
[00] 2021-11-10 13:42:58 Last binlog file ./analytics-meta-bin.018499, position 465589381
[00] 2021-11-10 13:43:00 completed OK!

  • Set downtime for both MariaDB instances on db1108.
  • Stopped the slave on both instances on db1108.
  • Stopped the analytics_meta instance on db1108.
  • Created a backup of analytics_meta on db1108.
  • Copied the prepared backup into place.

btullis@db1108:/srv$ sudo mv /srv/sqldata.analytics_meta /srv/sqldata.analytics_meta.db1108.T279440
btullis@db1108:/srv$ sudo cp -a sqldata.analytics_meta.T295312 sqldata.analytics_meta
  • Merged and deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/736780
  • btullis@db1108:/srv/sqldata.analytics_meta$ sudo chown -R mysql:mysql .
  • btullis@db1108:/srv/sqldata.analytics_meta$ sudo service mariadb@matomo start
  • btullis@db1108:/srv/sqldata.analytics_meta$ sudo mysql -S /var/run/mysqld/mysqld.matomo.sock;
  • start slave;

Change 737930 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the my.cnf file for the analytics_multiinstance backup DB host

https://gerrit.wikimedia.org/r/737930

Change 737930 merged by Btullis:

[operations/puppet@production] Fix the my.cnf file for the analytics_multiinstance backup DB host

https://gerrit.wikimedia.org/r/737930

Change 737932 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix a second small syntax error in the my.cnf file for db1108

https://gerrit.wikimedia.org/r/737932

Change 737932 merged by Btullis:

[operations/puppet@production] Fix a second small syntax error in the my.cnf file for db1108

https://gerrit.wikimedia.org/r/737932

There were a couple of syntax errors in the /etc/my.cnf file that had been caused by the refactor and had slipped through code review.
I merged two small corrections and have now successfully started the service with:

sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start"
sudo service mariadb@analytics_meta start
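
(Once replication is configured I'll presumably want to clear that override again, e.g.:)

sudo systemctl unset-environment MYSQLD_OPTS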

Now proceeding to set the slave configuration and GTID details.

MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='an-coord1001.eqiad.wmnet',MASTER_USER='repl',MASTER_PASSWORD='redacted',MASTER_SSL=1,master_use_gtid = slave_pos;
MariaDB [(none)]> start slave;

Oh, that didn't quite work.

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'

I reset the GTID position and restarted the slave with:

MariaDB [(none)]> stop slave;
MariaDB [(none)]> SET GLOBAL gtid_slave_pos = "0-171971944-460977441";
MariaDB [(none)]> start slave;

Now the slave status looks all OK.
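
(For the record, the check is just the show slave status output: Slave_IO_Running and Slave_SQL_Running both Yes, Seconds_Behind_Master low, and the Last_IO_Error / Last_SQL_Error fields empty.)

MariaDB [(none)]> show slave status\G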

All looks OK. Removed downtimes on services.

Removed the backups.

btullis@db1108:/srv$ sudo rm -rf sqldata.analytics_meta.T295312
btullis@db1108:/srv$ sudo rm -rf sqldata.analytics_meta.db1108.T279440

I was on vacation- you seemed to do well on your own :-).

Consider setting up a parallel remote-backup-mariadb setup in your own VLAN (so that you have full control of it) for periodic backups :-). It could make operations easier in the future (port handling, recovery, etc.), even if it is empty by default. E.g. we currently have profiles for the MediaWiki db backups, dbbackups::content and dbbackups::metadata; it would be trivial to add a simple dbbackups::analytics profile covering analytics' own needs.

We currently back up a few analytics dbs on the production VLAN, but you could consolidate those into the same location too, while I continue maintaining the automation for both environments.

Thanks @jcrespo - nice to hear.

Consider setting up a parallel remote-backup-mariadb setup in your own VLAN (so that you have full control of it) for periodic backups :-)

Yes, that could be potentially very useful.

Let me see if I've got this right...
This would require two new servers then, correct?

  1. A new backup replica of the production pair (those are currently an-coord100[12] but will soon be an-db100[12]). This would also replicate any other analytics-related databases that we won't consolidate onto the new database servers. Let's call this an-dbreplica1001 or something.
  2. A new backup provisioning server, let's call it an-dbprov1001 or something.

We/You would then configure:

  • an-dbprov1001 to take remote logical backups of an-dbreplica1001 using mydumper, with itself (an-dbprov1001) as the target for these dumps.
  • cumin1001 to take remote snapshot backups using transfer.py and the xtrabackup type, with an-dbprov1001 as the backup target. (Rough command sketches for both follow this list.)
  • Off-host backups of an-dbprov1001 to be configured with Bacula, stored locally in eqiad short-term and in codfw long-term.
  • We would then regularly test the recovery workflow by provisioning a copy of this database onto our test database host (currently an-test-coord1001, but that might change in future).
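
To make those two backup paths concrete, the commands involved would be roughly as follows (the host names are the placeholder names from the list above, the 'dump' user and paths are illustrative, and the real dbbackups automation wraps these with its own config):

# logical dumps, run from an-dbprov1001 against the backup replica
mydumper --host an-dbreplica1001.eqiad.wmnet --user dump --password <redacted> \
         --compress --triggers --events --routines \
         --outputdir /srv/backups/dumps/ongoing/

# snapshots, run from cumin1001, streaming an xtrabackup to the provisioning host
sudo transfer.py --type=xtrabackup --no-encrypt \
    an-dbreplica1001.eqiad.wmnet:/run/mysqld/mysqld.sock \
    an-dbprov1001.eqiad.wmnet:/srv/backups/snapshots/ongoing/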

Have I missed anything?

In the meantime, I notice that whilst we have weekly logical dumps of the analytics_meta and matomo database instances configured...
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/dbbackups/dbprov1002.cnf.erb$31-36

...it doesn't look like we have the daily snapshots of these instances configured:
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/profile/templates/dbbackups/cumin1001.cnf.erb

btullis@dbprov1002:~$ sudo tree /srv/backups/snapshots/latest
/srv/backups/snapshots/latest
├── snapshot.s1.2021-11-10--19-00-05.tar.gz
├── snapshot.s2.2021-11-10--22-24-57.tar.gz
├── snapshot.s3.2021-11-11--05-15-58.tar.gz
├── snapshot.s7.2021-11-11--01-45-14.tar.gz
└── snapshot.x1.2021-11-11--08-52-03.tar.gz

0 directories, 5 files

@jcrespo would you be happy for us to add these snapshots of analytics_meta and matomo to the backup schedule on cumin1001, or is there a reason why they shouldn't be here?

...it doesn't look like we have the daily snapshots of these instances configured:
would you be happy for us

I'd prefer it if the scheduling were separate; that is why I suggested setting up a parallel install- not only does it mix different domains, it mixes the monitoring (e.g. I will receive alerts about failed backups of dbs/machines I may not know how to fix- like when they grow a lot). But as long as the files are stored on separate hosts (we cannot at the moment handle the storage of more snapshots without more resources- we don't take snapshots of non-mw dbs because the implied SLA in case of a recovery is much lower), and it is a temporary setup, and it is a limited number of hosts, it won't be a huge issue.

In any case, my original suggestion was to make your life easier: for you to set up the profiles (obviously, with my help) on your existing machines- even with empty/no regularly scheduled backups (the automation), so that everything was ready whenever you needed it (e.g. in a cloning emergency), even without regular jobs happening. An empty but "ready to go" backup install can be set up easily on any host; it just requires a few Debian packages and some config- no disk space needed. That way you have more control over what to back up and how. For example- while transfer.py runs on port 4400 by default, if that is unavailable it will try 4401, 4402, etc. Plus the IPs of management and other hosts will vary over time. A separate setup makes it harder to forget about custom VLAN holes and other custom config.

This is not because of "analytics"- I would say the same to Cloud- if it is a separate domain, it should be as little dependent on outside infrastructure as possible, while sharing as much puppet and automation code as possible.

Let me know your thoughts on your needs and resources available.

Change 737906 merged by jenkins-bot:

[operations/homer/public@master] Revert the temporary change that was made for transfer.py

https://gerrit.wikimedia.org/r/737906

Thanks again @jcrespo - Lots to take in and consider there.

As things stand, I think that I can resolve this ticket because we've achieved what we set out to do in the description, which was to rebuild the replica of analytics_meta on db1108 from the known-good primary.
I've checked that the existing backup regime is working and I've familiarized myself enough with the current tooling and the monitoring that I could carry out a full or partial restore of analytics_meta (or matomo) from the logical dumps. Opening a hole in the firewall in order to carry out a restore wouldn't be a big deal.
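
(For my own notes, a restore from one of those dumps would pair myloader with the mydumper output; something roughly like this, with the dump path, target host and credentials all placeholders:)

# load a weekly logical dump from dbprov1002 into a scratch instance
myloader --directory /srv/backups/dumps/latest/dump.analytics_meta.<date> \
         --host an-test-coord1001.eqiad.wmnet --user root --password <redacted> \
         --overwrite-tables --threads 4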

I suppose that the point I'm still somewhat troubled by is this:

we don't take snapshots of non-mw dbs because the implied SLA in case of a recovery is much lower

It's probably something that I should discuss with the data engineering team, because I'm just a little nervous about only having a weekly backup schedule.
I think that we should, as a team, decide on our own recovery objectives for these databases, rather than default to the implied SLA.

Your points here are totally valid and I understand them:

we cannot at the moment handle the storage of more snapshots without more resources

I will receive alerts about failed backups of dbs/machines I may not know how to fix

if it is a separate domain, it should be as little dependent on outside infrastructure as possible, while sharing as much puppet and automation code as possible

Bearing those concerns in mind, I'll discuss with @odimitrijevic and the team how we should proceed and what priority to give this work.

We certainly wouldn't want to implement anything that would be burdensome for the Data-Persistence team, and we must be mindful of our own team's resources. However, I'd like to think that we can make some positive changes around our disaster recovery planning within the analytics domain.

I think that I can resolve this ticket

of course!

only having a weekly backup schedule

Note that you not only have that- you should also have binlogs available for point-in-time recovery, with transaction granularity. We do not, however, yet have a centralized solution for that. For MediaWiki and other misc dbs we rely on the high distribution of those files, and while we have scheduled work on a proper backup solution 0 1, there is nothing available at the moment. You should get familiar with that process, too: https://dev.mysql.com/doc/refman/5.6/en/point-in-time-recovery.html

Mentioning it because the previous Analytics SRE was happy with that combination (weekly dumps + binlogs for point-in-time recovery), granularity-wise.
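
In rough terms, the replay step pairs mysqlbinlog with the position recorded at backup time (the binlog file and position below are the ones from the prepare log further up; the stop time and socket path are placeholders):

# replay binlog events from the position recorded at backup time up to just before the incident
mysqlbinlog --start-position=465589381 \
            --stop-datetime="2021-11-10 13:00:00" \
            analytics-meta-bin.018499 \
  | mysql -S /var/run/mysqld/mysqld.analytics_meta.sock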

Note that you not only have that- you should also have binlogs available for point-in-time recovery, with transaction granularity.

Ah, OK. Thanks, that's good to know. I haven't used that method yet, but it makes sense. 👍