
Reimage dbstore2001 as jessie
Closed, Resolved (Public)

Description

  • Reimage dbstore2001 to jessie, in preparation for testing its upgrade and new disks. It is completely broken software-wise, but it can be used as a testbed for future work
  • Test MySQL 5.7 for its delayed replication and compatibility; see at least if it is viable in our infrastructure
  • Write and test some binary import scripts
  • Check InnoDB compressed size

Event Timeline


Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['dbstore2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610030930_marostegui_13056.log.

Completed auto-reimage of hosts:

['dbstore2001.codfw.wmnet']

Those hosts were successful:

[]

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['dbstore2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201610031003_marostegui_18776.log.

Completed auto-reimage of hosts:

['dbstore2001.codfw.wmnet']

Those hosts were successful:

['dbstore2001.codfw.wmnet']

So this host is now running Jessie

marostegui@dbstore2001:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:   	Debian GNU/Linux 8.6 (jessie)
Release:       	8.6
Codename:      	jessie

Given that we have to copy all the shards to this server, I was thinking of doing the following:

  • xtrabackup + import tablespaces

We could try xtrabackup from S3 - db2065 for instance - as there are hundreds of files, and it would be easier to xtrabackup them and leave the transportable tablespaces for shards with fewer files.

Once that shard is up and running, start importing tablespaces from the rest of the codfw shards and adding them to dbstore2001.
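For reference, a rough sketch of how that first copy could be streamed over the network (host names, the port and the target path are illustrative; it assumes xtrabackup/innobackupex 2.x and netcat on both ends, and the exact nc flags depend on the netcat flavour):

# on dbstore2001 (target): listen and unpack the stream
nc -l -p 9210 | xbstream -x -C /srv/sqldata.s3
# on db2065 (source): stream the backup over
innobackupex --stream=xbstream /tmp | nc dbstore2001.codfw.wmnet 9210
# back on dbstore2001, once the copy is finished: prepare it
innobackupex --apply-log /srv/sqldata.s3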

What about compression and 5.7? Because right now there is not enough space for all shards.

I was planning on testing the compression on dbstore, not on the slaves, so the plan I had in mind was:

  • Xtrabackup S3 and, once that shard is running, compress it.
  • Import another shard, compress it.
  • And so on with the rest of the shards.

Do you have any other idea about it or specific shards/tables you'd like to try?

That seems OK to me; it just was not clear to me.

Regarding 5.7, I was planning on installing:

marostegui@dbstore2001:~$ apt-cache show wmf-mysql57
Package: wmf-mysql57
Version: 5.7.9-1
Architecture: amd64
Maintainer: Jaime Crespo <jcrespo@wikimedia.org>
Priority: optional
Section: misc
Filename: pool/thirdparty/w/wmf-mysql57/wmf-mysql57_5.7.9-1_amd64.deb
Size: 261117056
SHA256: 2ef4af54b4c19382c60a9b24ca5d70feab09c159862ba7c4d9f4dfabce61944c
SHA1: 13077361f699efbb5b4bb0c2c844409d04213ee3
MD5sum: 1c89fee77d09823c0e5629f5ecf7ce9d
Description: MySQL plus patches.
Description-md5: e4135d7bd9aebcc954b3cc96597967b5

I have been trying https://github.com/giacomolozito/ibdata-shrinker for defragmenting in a small instance.
From the initial tests: it looks good and no data was lost.
More than the shrinking I was interested in the tablespace movement, which was done nicely.
Again, there were not many big tables and not too complicated ones.

When reimporting the tables I have not found any issue with existing data or when inserting new data (ie: AUTOINC values etc).

The trite tool looks very promising (and a lot more robust from a first glance)

After wasting the whole morning dealing with xtrabackup errors and issues, I have stopped mysql on db2057.codfw.wmnet (S3) and started a copy via netcat.

My idea is to bring up mysql and replication on dbstore2001 and start testing the packages with real data, roughly following the migration path to see what issues we face.

So that means

10.0.23 -> 10.0.27 -> 10.1

After that I will test 5.7

Allow me to suggest keeping the original tar, as I think that migration is possible, but the opposite is not. For testing, s3 is a bad candidate, as just mysql_upgrade may take 30 minutes to 1 hour (warning!).

Adding @chasemp so he is in the loop, because even if the ticket is not nominally related to labs, we are testing the same procedure we are about to apply for the new labsdbs there.

dbstore2001 is now working and replicating fine using GTID.
I have migrated to 10.0.27-1 and found something which I guess isn't expected?

root@dbstore2001:/opt/wmf-mariadb10# /etc/init.d/mysql status
-bash: /etc/init.d/mysql: No such file or directory
root@dbstore2001:/opt/wmf-mariadb10# ls -lh /etc/init.d/mysql
lrwxrwxrwx 1 root root 26 Oct 13 06:35 /etc/init.d/mysql -> /opt/wmf-mariadb10/service
root@dbstore2001:/opt/wmf-mariadb10# find . -name service
root@dbstore2001:/opt/wmf-mariadb10#

I have created the file manually and started MySQL, which started correctly.
mysql_upgrade also worked fine.

Did some basic testing with mysqld_safe and mysqld and all looked good from that side.

That is not a bug, that is intended. But I am having issues here: https://gerrit.wikimedia.org/r/315228

I have upgraded to 10.1.18.

The first thing I had to deal with was:

ERROR 1275 (HY000): Server is running in --secure-auth mode, but 'root'@'localhost' has a password in the old format; please change the password to the new format

That was an easy fix. We might need to think about it if we ever migrate to 10.1 in a massive way.

What I did:

/etc/init.d/mysql start --skip-slave-start --skip-secure-auth

Then got into MySQL prompt:

set password for 'root'@'localhost' = PASSWORD('xxxx');

Stopped MySQL and started again.

After that I was able to log in just fine and mysql_upgrade was able to run and fix all the errors related to the mysql.XX tables.

I have done a diff of the old and new variables before upgrading (as per: https://mariadb.com/kb/en/mariadb/upgrading-from-mariadb-100-to-101/#options-that-have-changed-default-values) but nothing really affects us, as we mostly set them explicitly from the start and do not let them fall back to their defaults.

After installing the package we had:

root@dbstore2001:/opt/wmf-mariadb101/bin# dpkg -l | grep maria
ii  wmf-mariadb10                  10.0.27-1                            amd64        MariaDB plus patches.
ii  wmf-mariadb101                 10.1.18-1                            amd64        MariaDB plus patches.

I changed the basedir and copied the service file to /opt/mariadb101 to make sure it was starting the 10.1 installation.
And decided to remove the 10.0.27-1 package to see what would happen.

And the first thing that happens by default is:

-bash: /usr/local/bin/mysql: No such file or directory

So I had to:

root@dbstore2001:/opt/wmf-mariadb101/bin# pwd
/opt/wmf-mariadb101/bin
root@dbstore2001:/opt/wmf-mariadb101/bin# cp mysql /usr/local/bin/
root@dbstore2001:~# mysql --skip-ssl -e "select version();"
+-----------------+
| version()       |
+-----------------+
| 10.1.18-MariaDB |
+-----------------+

I have not been able to test this: https://jira.mariadb.org/browse/MDEV-10703 but I have subscribed myself to that bug because it looks scary.

cp mysql /usr/local/bin/

That is done by the package itself; that is very strange.

Let me implement https://gerrit.wikimedia.org/r/315228 on 10.1- it should work.

Also, let's add the basedir on puppet to all servers so we do not run into issues with that under systemd.

Oh, I see what happens - on install, symbolic links are created in /usr/local, but if you delete a package, those have to be deleted too. The solution is to make the packages incompatible, but that defeats the whole purpose. No outage is created and reinstalling the package solves it. Not sure if it is a bug, really, but I accept alternatives.

Can you do some kind of massive import and/or stress testing? I would like to identify issues with 10.1 itself rather than with the package.

Oh, I see what happens - on install, symbolic links are created in /usr/local, but if you delete a package, those have to be deleted too. The solution is to make the packages incompatible, but that defeats the whole purpose. No outage is created and reinstalling the package solves it. Not sure if it is a bug, really, but I accept alternatives.

Not necessarily a bug per se. Maybe the way I upgraded it was meant to break, which was basically:

  1. Install new package
  2. Delete the old one

Instead of:

  1. Delete old package
  2. Install the new one

Which is a bit scarier.
I normally like to delete the old package to avoid having multiple packages from different versions installed, as it can be confusing.

A way of fixing it could be to remove the existing link (if there is one) and recreate it.

Can you do some kind of massive import and/or stress testing? I would like to identify issues with 10.1 itself rather than with the package.

Sure, I want to import another shard and get multi-source replication working (right now it is only replicating S3).

I came across this too: https://jira.mariadb.org/browse/MDEV-10123 - I want to see whether it happens with our workload or not.

Not sure if it's a temporary thing or warrants the work, but the Debian way to manage multiple packages owning a common command is https://wiki.debian.org/DebianAlternatives
Let me know if you want/need this for the wmf-mariadb package(s) at some point and I can help out.

Thanks @MoritzMuehlenhoff, that would actually fix the problem I believe, as we can set the priority for the latest installed package and then remove the new wmf-mariadb101 one without any problem.

Side note: I have converted the dbstore2001 slave to be multi-source, so s3 has its own channel now. I will import another shard in a different channel and see how it goes.

The alternatives system does that even automatically in the case of only two alternatives, so if you have wmf100 and wmf101 providing the symlink, and wmf100 gets removed, it will automatically point to the remaining alternative.

I am compiling the new packages with update-alternatives as we speak. However, that may break the existing old hardcoded symlinks.
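For reference, this is roughly the kind of calls the maintainer scripts would need (link names, paths and priorities here are assumptions, not the actual packaging):

# postinst: register the binaries as alternatives instead of hardcoding symlinks
update-alternatives --install /usr/local/bin/mysql mysql /opt/wmf-mariadb101/bin/mysql 101
update-alternatives --install /usr/local/bin/mysqldump mysqldump /opt/wmf-mariadb101/bin/mysqldump 101
# prerm: drop this package's entries so the remaining alternative takes over automatically
update-alternatives --remove mysql /opt/wmf-mariadb101/bin/mysql
update-alternatives --remove mysqldump /opt/wmf-mariadb101/bin/mysqldump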

Mentioned in SAL (#wikimedia-operations) [2016-10-14T10:28:12Z] <jynus> stopping and restarting mysql at dbstore2001 for misc tests T146261

This is now the procedure- it is not very elegant, but it works:

root@dbstore2001:/opt$ sudo dpkg -i ~/wmf-mariadb10_10.0.27-1_amd64.deb 
Selecting previously unselected package wmf-mariadb10.
(Reading database ... 44076 files and directories currently installed.)
Preparing to unpack .../wmf-mariadb10_10.0.27-1_amd64.deb ...
Unpacking wmf-mariadb10 (10.0.27-1) ...
Setting up wmf-mariadb10 (10.0.27-1) ...
update-alternatives: using /opt/wmf-mariadb10/bin/mysql to provide /usr/local/bin/mariadb (mariadb) in auto mode
update-alternatives: using /opt/wmf-mariadb10/bin/mysqldump to provide /usr/local/bin/mariadbdump (mariadbdump) in auto mode
update-alternatives: warning: alternative /opt/wmf-mariadb101/bin/mysql (part of link group mysql) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/mysql is dangling; it will be updated with best choice
update-alternatives: using /opt/wmf-mariadb10/bin/mysql to provide /usr/local/bin/mysql (mysql) in auto mode
update-alternatives: using /opt/wmf-mariadb10/bin/mysqldump to provide /usr/local/bin/mysqldump (mysqldump) in auto mode
root@dbstore2001:/opt$ sudo dpkg -i ~/wmf-mariadb101_10.1.18-1_amd64.deb 
Selecting previously unselected package wmf-mariadb101.
(Reading database ... 44751 files and directories currently installed.)
Preparing to unpack .../wmf-mariadb101_10.1.18-1_amd64.deb ...
Unpacking wmf-mariadb101 (10.1.18-1) ...
Setting up wmf-mariadb101 (10.1.18-1) ...
root@dbstore2001:/opt$ mysql --version
mysql  Ver 15.1 Distrib 10.0.27-MariaDB, for Linux (x86_64) using readline 5.1
root@dbstore2001:/opt$ sudo aptitude purge wmf-mariadb10
The following packages will be REMOVED:  
  wmf-mariadb10{p} 
0 packages upgraded, 0 newly installed, 1 to remove and 4 not upgraded.
Need to get 0 B of archives. After unpacking 0 B will be used.
Do you want to continue? [Y/n/?] Y
(Reading database ... 45467 files and directories currently installed.)
Removing wmf-mariadb10 (10.0.27-1) ...
update-alternatives: warning: alternative /opt/wmf-mariadb10/bin/mysql (part of link group mariadb) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/mariadb is dangling; it will be updated with best choice
update-alternatives: using /opt/wmf-mariadb101/bin/mysql to provide /usr/local/bin/mariadb (mariadb) in auto mode
update-alternatives: warning: alternative /opt/wmf-mariadb10/bin/mysqldump (part of link group mariadbdump) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/mariadbdump is dangling; it will be updated with best choice
update-alternatives: using /opt/wmf-mariadb101/bin/mysqldump to provide /usr/local/bin/mariadbdump (mariadbdump) in auto mode
update-alternatives: warning: alternative /opt/wmf-mariadb10/bin/mysql (part of link group mysql) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/mysql is dangling; it will be updated with best choice
update-alternatives: using /opt/wmf-mariadb101/bin/mysql to provide /usr/local/bin/mysql (mysql) in auto mode
update-alternatives: warning: alternative /opt/wmf-mariadb10/bin/mysqldump (part of link group mysqldump) doesn't exist; removing from list of alternatives
update-alternatives: warning: /etc/alternatives/mysqldump is dangling; it will be updated with best choice
update-alternatives: using /opt/wmf-mariadb101/bin/mysqldump to provide /usr/local/bin/mysqldump (mysqldump) in auto mode
Purging configuration files for wmf-mariadb10 (10.0.27-1) ...
                                         
root@dbstore2001:/opt$ mysql --version
mysql  Ver 15.1 Distrib 10.1.18-MariaDB, for Linux (x86_64) using readline 5.1

Mentioned in SAL (#wikimedia-operations) [2016-10-14T11:02:44Z] <marostegui> Stopping MySQL db2055 (S1-codfw) to import S1 to dbstore2001 - T146261

Importing the tablespace failed for one table (enwiki/change_tag) and crashed the whole server.

2016-10-14 13:47:38 140429149276928 [Note] InnoDB: Sync to disk
2016-10-14 13:47:38 140429149276928 [Note] InnoDB: Sync to disk - done!
2016-10-14 13:47:38 140429149276928 [Note] InnoDB: Phase III - Flush changes to disk
2016-10-14 13:47:38 140429149276928 [Note] InnoDB: Phase IV - Flush complete
InnoDB: Error: trying to access page number 490905633 in space 167173,
InnoDB: space name enwiki/change_tag,
InnoDB: which is outside the tablespace bounds.
InnoDB: Byte offset 0, len 16384, i/o type 10.
InnoDB: If you get this error at mysqld startup, please check that
InnoDB: your my.cnf matches the ibdata files that you have in the
InnoDB: MySQL server

Full log at: https://phabricator.wikimedia.org/P4222

I am going to try to skip that table and see what I can get

I started mysql with innodb_force_recovery = 1 to keep trying to import the tablespaces after the table that failed.
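(For context, that is just a temporary my.cnf change, removed again once the imports are done; a minimal sketch, placement assumed:)

[mysqld]
# 1 is the least aggressive recovery level; it lets the server run even if it detects a corrupt page
innodb_force_recovery = 1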

I was able to keep advancing on the tables, but it crashed again.

2016-10-14 15:19:11 139628524821248 [Note] InnoDB: Phase I - Update all pages
2016-10-14 15:19:13 139628524821248 [Note] InnoDB: Sync to disk
2016-10-14 15:19:13 139628524821248 [Note] InnoDB: Sync to disk - done!
2016-10-14 15:19:13 139628524821248 [Note] InnoDB: Phase III - Flush changes to disk
2016-10-14 15:19:13 139628524821248 [Note] InnoDB: Phase IV - Flush complete
InnoDB: Error: trying to access page number 1701967104 in space 167235,
InnoDB: space name enwiki/objectcache,
InnoDB: which is outside the tablespace bounds.
InnoDB: Byte offset 0, len 16384, i/o type 10.
InnoDB: If you get this error at mysqld startup, please check that
InnoDB: your my.cnf matches the ibdata files that you have in the
InnoDB: MySQL server.
2016-10-14 15:19:13 7efdcc9f5b00  InnoDB: Assertion failure in thread 139628524821248 in file fil0fil.cc line 5961
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
161014 15:19:13 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

The server is obviously totally corrupted. I will try the reimport again to see what happens and whether this is a bug or something else.

innodb_force_recovery = 1 also fails at this point:

2016-10-14 15:24:50 139809294985088 [ERROR] InnoDB: Attempted to open a previously opened tablespace. Previous tablespace abwiktionary/langlinks uses space ID: 460 at filepath: ./abwiktionary/langlinks.ibd. Cannot open tablespace enwiki/oldimage which uses space ID: 460 at filepath: ./enwiki/oldimage.ibd
2016-10-14 15:24:50 7f27e35d2780  InnoDB: Operating system error number 2 in a file operation.

The import process was something like this:

dbstore2001:
create the enwiki database with the table structure
alter table xx discard tablespace;

db2055:
stop slave
flush tables for export
stop mysql
nc to dbstore2001

Once the transfer was done:
dbstore2001:
alter table enwiki.xx import tablespace
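Put together, a rough sketch of that procedure (the table name xx, the port and the paths are illustrative; the SQL is shown as comments where it runs inside a client session):

# dbstore2001 (target): create the enwiki schema with the table definitions, then
#   ALTER TABLE xx DISCARD TABLESPACE;
# and listen for the incoming file:
nc -l -p 9210 > /srv/sqldata/enwiki/xx.ibd

# db2055 (source): in a mysql session run
#   STOP SLAVE;
#   FLUSH TABLES xx FOR EXPORT;
# then stop mysqld and push the file over:
nc dbstore2001.codfw.wmnet 9210 < /srv/sqldata/enwiki/xx.ibd

# dbstore2001, once the transfer is done (and after chown mysql:mysql on the file):
#   ALTER TABLE enwiki.xx IMPORT TABLESPACE;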

I have reinstalled 10.0.27 and I am copying S3 again. Last time it took 2 hours or so. Once that is done I will configure multisource and import S1.

S3 is now replicating in dbstore2001
I am going to import S1's tablespaces.

Mentioned in SAL (#wikimedia-operations) [2016-10-17T12:31:17Z] <marostegui> Stopping MySQL db2055 (S1-codfw) to import S1 to dbstore2001 - T146261

I have finished importing S1's enwiki tablespace.
There are some things that need to be mentioned.

The following command:

for i in `mysql --skip-ssl enwiki -e "show tables;" -B`; do echo $i; mysql --skip-ssl enwiki -e "alter table $i import tablespace;";done

Crashed exactly on the same table again: change_tag.
The server crashed:

161018  9:58:39 [Note] InnoDB: Phase IV - Flush complete
InnoDB: Error: trying to access page number 483211770 in space 167455,
InnoDB: space name enwiki/change_tag,
InnoDB: which is outside the tablespace bounds.
InnoDB: Byte offset 0, len 16384, i/o type 10.
InnoDB: If you get this error at mysqld startup, please check that
InnoDB: your my.cnf matches the ibdata files that you have in the
InnoDB: MySQL server.
2016-10-18 09:58:43 7f587f070700  InnoDB: Assertion failure in thread 140018065016576 in file fil0fil.cc line 5598
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
161018  9:58:43 [ERROR] mysqld got signal 6 ;

I started mysql with:

innodb_force_recovery = 1

It started fine and I continued importing the tables.

Then a different table crashed with the same error. I repeated what I wrote above and it went fine.
Then the rest of the tables finished fine.

MySQL starts fine and show table status for enwiki shows data for all the tables.
I am going to do a full scan of all the tables to see if any of them fails. For big tables like revision I might just run some queries rather than a full scan :-).
Basically I want to make sure that the whole table can be queried with no crashes.
I am not 100% sure that the data of S1 is valid, but I also have no reason to believe it is corrupt. I want to double check, so pt-table-checksum will be handy for that.

I am currently packaging the current sqldata so we can avoid copying S3 and S1 again for future testing; it is at: /srv/sqldata_s1_s3.tar.gz
Compressed with: tar cf - sqldata/ | pigz > sqldata_s1_s3.tar.gz
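Restoring it would be roughly the reverse (target path assumed):

pigz -dc /srv/sqldata_s1_s3.tar.gz | tar xf - -C /srv/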

S3 is replicating fine.
For S1 I will start replication and run some of the data checks mentioned above once the snapshot is done.

I have also documented how to import tables: https://wikitech.wikimedia.org/wiki/MariaDB/ImportTableSpace

Note: pt-table-checksum is OK (although it may need a patch to work) but pt-table-sync corrupts data with our config; make sure you do not use it.

Thanks for the heads up. At the moment I am planning to use only pt-table-checksum, but it is good (and quite important) to know about pt-table-sync.
Much appreciated!

Mentioned in SAL (#wikimedia-operations) [2016-10-19T08:15:36Z] <marostegui> Stopping db2062.codfw.wmnet to use it to clone another server - T146261

The table change_tag is definitely corrupted (that is the one that crashed the first time, and during the two different attempts at copying it).
It is impossible to use it, not even to select anything from it.

InnoDB: Error: trying to access page number 419221143 in space 167455,
InnoDB: space name enwiki/change_tag,
InnoDB: which is outside the tablespace bounds.
InnoDB: Byte offset 0, len 16384, i/o type 10.
InnoDB: If you get this error at mysqld startup, please check that
InnoDB: your my.cnf matches the ibdata files that you have in the
InnoDB: MySQL server.
2016-10-19 08:05:55 7f3acc132700  InnoDB: Assertion failure in thread 139890508637952 in file fil0fil.cc line 5598
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.

I have checked the rest of the tables and they are all fine, so at this point I am unsure whether the problem is the table itself (it doesn't look like a weird table, or one with anything that could be problematic, i.e. partitions) or the server it comes from (db2055), which is the one that was used both times I imported it.

Going to try a different server to see if the ibd itself is corrupted on the source host or if it is the procedure, which seems to be working fine for the rest of the tables.

I don't think the different versions between servers (10.0.27 vs 10.0.22) is playing any role here.

Maybe stopping the slave, exporting it (probably it is very small), dropping the table and recreating it logically on the source?

Check if it has a primary key - wasn't this the one that caused issues on toku, or on another host with innodb? Alternatively, we may be hitting a new (or old, but still latent) InnoDB bug. Which would be good news, because we could do something about it before it is too late.

Edit: I answer myself: yes, T123225, and we should set this as "very high": T147166

Maybe stopping the slave, exporting it (probably it is very small), dropping the table and recreating it logically on the source?

That is a good idea. I want to see if the new source has the same problem.
The table isn't too big indeed, around 8G

Check if it has a primary key - wasn't this the one that caused issues on toku, or on another host with innodb? Alternatively, we may be hitting a new (or old, but still latent) InnoDB bug. Which would be good news, because we could do something about it before it is too late.

Yes, it doesn't have a PK, but there are many tables without a PK on that host, so I didn't think that could be the issue - it is a good point actually.

A different server, same table crashed.
Interesting.

I am going to run the alter table described at T147166 on db2055.codfw.wmnet and see what happens - I can also try to export the table, drop it and import it again. However, I am unsure whether it is a problem of the table itself being corrupted or not.

It could be if both servers happened to be cloned from the same source (db2055 and db2062)

Update: The alter is done - https://phabricator.wikimedia.org/T147166#2728317
Going to try to copy the data again.

Mentioned in SAL (#wikimedia-operations) [2016-10-19T12:24:08Z] <marostegui> Stopping db2055 to clone another host - T146261

After adding the PK change I am now importing the enwiki tables and the change_tag table has been imported fine.
So, so far so good!
We will see how it goes with the rest of the tables; it is still importing tablespaces.

The import finished.
One table crashed during the import though; however, everything looks fine now.
This is the timeline:

transcodetable (which has a PK) crashed while importing.
Started with innodb_force_recovery=1 and tried to re-import it again, but it was already imported (and I was able to query the table without any problems).
After finishing importing the rest of the pending tables, they all got imported fine.

I have queried all the tables and none of them made the server crash again, so those were good news.
After restarting MySQL without the recovery mode, I queried all the tables again and it was all OK.

I have started replication on the S1 thread and so far so good; it is catching up.
Once it has caught up I will create a snapshot of this data and I will probably import S4 so we can have 3 threads running.

S1
S3
S4

I also need to run pt-table-checksum to make sure the data is actually fine (which I think it is now)

I will start testing InnoDB compression on S1 once it has caught up too.

Looks like after the crash today, the S3 replication thread got broken - reminder: s3 had been untouched for almost a week after it was copied over:

161020 14:23:02 [ERROR] Master 's3': Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-171974683-4735941203, which is not in the master's binlog', Internal MariaDB error code: 1236

That position indeed doesn't exist on the binary log of the master.
This is the query that exists on the relay log of the server but not on the master:

#161020 14:05:12 server id 171974683  end_log_pos 56202904     	Query  	thread_id=732334095    	exec_time=0    	error_code=0
SET TIMESTAMP=1476972312/*!*/;
UPDATE /* User::incEditCountImmediate  */  `user` SET user_editcount=user_editcount+1 WHERE user_id = 'xx' AND (user_editcount IS NOT NULL)
/*!*/;
# at 56203192
#161020 14:05:12 server id 171974683  end_log_pos 56202931     	Xid = 3060608412
COMMIT/*!*/;
# at 56203219
#161020 14:05:12 server id 171974683  end_log_pos 56202969     	GTID 0-171974683-4736550085

I have changed it to go back to not using GTID (change master to master_use_gtid=no;).
Duplicate entries are showing up now. I am skipping them (as this host isn't trusted) so we can keep testing things.
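Roughly what that looks like for a named connection (a sketch; in MariaDB multi-source the skip counter acts on the default connection, so it has to be selected first):

STOP SLAVE 's3';
CHANGE MASTER 's3' TO MASTER_USE_GTID = no;
SET @@default_master_connection = 's3';
SET GLOBAL sql_slave_skip_counter = 1;  -- skip one duplicate-entry event on this untrusted test host
START SLAVE 's3';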

@jcrespo thinks there might be something writing directly to this host. So I need to investigate that.

Pretty much at the same time, S1 got a duplicate entry; this one, however, does exist in the master's binlog:

161020 14:26:21 [ERROR] Master 's1': Slave SQL: Error 'Duplicate entry '873342839' for key 'PRIMARY'' on query. Default database: 'enwiki'. Query: 'INSERT /* RecentChange::save  */  INTO `recentchanges` (rc_timestamp,rc_namespace,rc_title,rc_type,rc_source,rc_minor,rc_cur_id,rc_user,rc_user_text,rc_comment,rc_this_oldid,rc_last_oldid,rc_bot,rc_ip,rc_patrolled,rc_new,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,rc_id) VALUES (xxxxxxx)', Gtid 0-171974683-4735996975, Internal MariaDB error code: 1062

I skipped this entry and S1 has been replicating fine ever since, and it has caught up with the master.
It is interesting that I have also tried to change S1 to use GTID with:

MariaDB DBSTORE localhost (none) > stop slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB DBSTORE localhost (none) > CHANGE MASTER TO MASTER_USE_GTID = current_pos;
Query OK, 0 rows affected (0.08 sec)

MariaDB DBSTORE localhost (none) > start slave;
161020 15:53:33 [ERROR] Master 's1': Error reading packet from server: Error: connecting slave requested to start from GTID 0-171966669-3275722362, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions ( server_errno=1236)
161020 15:53:33 [ERROR] Master 's1': Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-171966669-3275722362, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions', Internal MariaDB error code: 1236

So I have switched it back to normal replication. We need to investigate that too.
This is the first time I have attempted to switch a database that was imported by just moving the .ibd files from another host over to GTID. Not sure if this has played a role.

I have restored S3 snapshot so S3 is replicating again.

Mentioned in SAL (#wikimedia-operations) [2016-10-21T14:29:57Z] <marostegui> Stopping replication on db2055 to use it to clone another host - T146261

I have started a new copy of s1 from db2055, copying also the cfg files to dbstore2001.

Also, I have added a small note to the documentation stating that we are not sure whether the method I wrote works completely, so everyone is aware.
Despite that, I have started to change things to make sure the new instructions are there (i.e. the new way of copying .cfg files, no need to stop mysql on the source, etc.); nothing too pretty so far, just commands as I go.

We will see what happens.

S3 replication continues to be fine and it is catching up with the master quite nicely; it is using GTID at the moment, but I will probably change it to normal replication too once it has caught up.

I am now importing the ibd files for S1 enwiki after the transfer.
The transfer was done correctly and we have the same number of ibd files as cfg ones, so that looks good.

root@dbstore2001:/srv/sqldata/enwiki# ls -lh | grep ibd | wc -l
142
root@dbstore2001:/srv/sqldata/enwiki# ls -lh | grep cfg | wc -l
142

I have taken a snapshot before starting the process, as S3 was already replicating fine.
This time I am doing the import with the replication thread of S3 stopped, so nothing is being written to disk from another thread.

We will see how it gets imported...

Mentioned in SAL (#wikimedia-operations) [2016-10-22T12:34:22Z] <marostegui> Stopping replication in db2055 to use it to clone another host - T146261

TL;DR: S1 is now replicating fine, the import was correct after all the tests and the server hasn't crashed even once. So all good.
I have started replication again in S3 and it looks good too.

S1 -> not using GTID
S3 -> using GTID

More details about the tests I have done today below.


I have been doing a few tests with some single tables, just copying the ibd, cfg and frm files, and some other tests copying only the cfg and ibd.
Both tests were fine, but one of the tables (just one) was complaining:

MariaDB DBSTORE localhost enwiki > alter table abuse_filter_action import tablespace;
ERROR 1810 (HY000): IO Read error: (139862985341824, ) (null)
MariaDB DBSTORE localhost enwiki > select count(*) from abuse_filter_action;
ERROR 1814 (HY000): Tablespace has been discarded for table 'abuse_filter_action'
MariaDB DBSTORE localhost enwiki >

This table has exactly the same md5 as it has on the source host (db2055)

root@dbstore2001:/srv/sqldata/enwiki# md5sum abuse_filter_action.{frm,ibd}
c81a9f2856fdd66014068e0a4f22e7a1  abuse_filter_action.frm
85623ff2673ae027e07784d0fa1ff0c6  abuse_filter_action.ibd
root@db2055:/srv/sqldata/enwiki# md5sum abuse_filter_action.{frm,ibd}
c81a9f2856fdd66014068e0a4f22e7a1  abuse_filter_action.frm
85623ff2673ae027e07784d0fa1ff0c6  abuse_filter_action.ibd

The logs aren't too helpful (I have created a ticket with MariaDB to see if they can improve the error message a bit - https://jira.mariadb.org/browse/MDEV-11106):

InnoDB: Discarding tablespace of table "enwiki"."abuse_filter_action": Generic error

I have tried a few things like recreating the table, and importing and discarding the same tablespace, without too much luck.
The good thing is that the server never crashed during all these tests.

Obviously creating a mysqldump of this table (which is really small) would have solved the issue, but that wasn't the point of the tests.

After playing around with that a bit, I just did a single copy of only the ibd and cfg files and the import went fine.

As I said, this looks a lot better now.
Next steps:

  • If by Sunday both threads have caught up (they should, they were only delayed a few hours) and the server looks stable, I will take a snapshot.
  • Enable GTID on S1 so we can have both threads running GTID
  • Start testing compression on S1 to see how multisource+imported tablespaces deal with it. Because after that we can move all this data to a labsdb and see if we can move compressed InnoDB by just moving its tablespace, which can be a big big win.

I have started to change the documentation to reflect the new steps. I will keep doing that Monday probably.
Also, I've got some ideas in mind about how to automate all this process with a couple of scripts.

When going to stop both slaves to take the snapshot, the S3 replication thread had exactly the same issue it had on Friday when I tried to start it again:

161023  7:04:27 [ERROR] Master 's3': Error reading packet from server: Error: connecting slave requested to start from GTID 0-171974683-4750083833, which is not in the master's binlog ( server_errno=1236)
161023  7:04:27 [ERROR] Master 's3': Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-171974683-4750083833, which is not in the master's binlog', Internal MariaDB error code: 1236
161023  7:04:27 [Note] Master 's3': Slave I/O thread exiting, read up to log 'db2018-bin.002468', position 50404161; GTID position 0-171974683-4750083833

I will leave it like that so we can troubleshoot it again and see how we can overcome this problem, as in making sure we know how to re-enable replication once this happens, or at least try to understand why this is happening.
Both threads were replicating completely fine, and this only happened once I did stop all slaves; and then start all slaves again to see if they'd start correctly.
S1 did, but S3 didn't.

We can probably reset slave on S3 and start normal replication with the exec_master_log_pos we have below without using GTID

https://phabricator.wikimedia.org/P4288

I reset the S3 slave without using GTID (using the coordinates in the paste above) and now I can stop and start both of them just fine.
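Roughly the commands involved (a sketch; the file/position shown here are the I/O-thread coordinates from the log above, the actual exec coordinates were taken from the paste):

STOP SLAVE 's3';
RESET SLAVE 's3';
CHANGE MASTER 's3' TO
  MASTER_LOG_FILE = 'db2018-bin.002468',
  MASTER_LOG_POS = 50404161,
  MASTER_USE_GTID = no;
START SLAVE 's3';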

Stopping and starting MySQL also worked fine.
I am taking the snapshot now then.

The replication thread went fine throughout the night. I have been stopping and starting it a few times today, as well as stopping and starting MySQL.
No problems at all, so I would say this is now stable.

Again, GTID isn't running. @jcrespo and I have been discussing this: https://mariadb.com/kb/en/mariadb/gtid/#use-with-multi-source-replication-and-other-multi-master-setups

So it looks like we need to start playing around with the gtid_domain_id flag and see if by changing it per shard we can enable GTID.
I am doing tests now in my testing environment, with 1 slave getting replication from 2 upstream masters.
Right now, with no GTID enabled, I am generating some data, and then I will start playing with the domain_id flag and see what happens when it gets changed live.
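Changing it live is just a dynamic variable plus a binlog rotation (value illustrative):

SET GLOBAL gtid_domain_id = 1;
FLUSH LOGS;  -- rotate the binlog so new events carry the new domain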

Apart from that, I have tested compressing the archive table, then stopped and started replication and stopped and started MySQL, to make sure nothing happened.
It went fine.

This is what I ran:

ALTER TABLE archive row_format=compressed key_block_size=8;

I optimized the table before compressing it to see what the real compression ratio was for this particular table.

Before the optimize: 22G
After the optimize and before the compression: 17G
After the compression: 11G
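In other words, the per-table sequence was roughly (archive shown as the example, sizes as measured above):

OPTIMIZE TABLE archive;  -- rebuild/defragment first: 22G -> 17G
ALTER TABLE archive row_format=compressed key_block_size=8;  -- then compress: 17G -> 11G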

I also compressed watchlist table:

Before the optimize: 35G
After the optimize and before the compression: 24G
After the compression: 13G

After both compressions I stopped and started MySQL to make sure it was all fine. And it was.

I am now compressing the templatelinks table (135G); we will see how it goes.

The templatelinks table finished its compression and went from 135G to 39G.
I did not optimize this table first, though, so we do not know what the pure compression ratio was.

As the server is looking fine, I am going to take a snapshot now, before compressing all the tables in enwiki and before importing S4.

While this is getting done, I will test the gtid_domain_id out of band.

Mentioned in SAL (#wikimedia-operations) [2016-10-25T08:48:58Z] <marostegui> Stopping replication db2058 s4 - using it to clone another host - T146261

S4 cannot be imported as it has tables with partitions and MariaDB currently does not support moving partitions (https://jira.mariadb.org/browse/MDEV-10568).
I assume the way to do it would be to remove the partitions on the source host, copy the single ibd file and then recreate them.

After talking to Jaime, we are going to fix slaves in codfw (T149079)

(The snapshot was taken anyways)

Once one of them is fixed, I will use it to import the database. Meanwhile I am going to keep compressing tables on S1.

Change 317788 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1059

https://gerrit.wikimedia.org/r/317788

Change 317788 merged by jenkins-bot:
db-eqiad.php: Depool db1059

https://gerrit.wikimedia.org/r/317788

Mentioned in SAL (#wikimedia-operations) [2016-10-25T10:09:29Z] <marostegui@mira> Synchronized wmf-config/db-eqiad.php: Depool db1059 to clone another host from it - T146261 (duration: 01m 36s)

Mentioned in SAL (#wikimedia-operations) [2016-10-25T11:14:04Z] <marostegui@mira> Synchronized wmf-config/db-eqiad.php: Repool db1059 - T146261 (duration: 01m 22s)

I have been testing gtid_domain_id and I have seen a mix of good and bad news.

Good news: Changing it dynamically doesn't affect the connected replication threads.

Bad news: So far I have not been able to make a slave replicate from multiple masters if the gtid_domain_id isn't changed on the master _before_ the slave gets built. Looks like we need to rotate the relay logs (keep reading).


I have set up 2 masters (s1, s3) and one slave (s2) that replicates from both.

Started with no GTID.
Firstly I tried to reproduce the error we got in production, that is switching both threads to GTID.
The first thread works fine, and as soon as I switch the second one it breaks - expected.

Both masters have gtid_domain_id set to 0, which is the default.
So their binary logs events look like:

s1

#161025 12:45:46 server id 1  end_log_pos 4627 	GTID 0-1-25

s3

#161025 12:44:54 server id 3  end_log_pos 859   GTID 0-3-3

Once I change the gtid_domain_id to 1 and 3 their binary logs change (and the slave which is still not using GTID keeps replicating)

s1

#161025 12:46:28 server id 1  end_log_pos 5627  GTID 1-1-27

s3

#161025 12:46:26 server id 3  end_log_pos 3847  GTID 3-3-9

However, the slave breaks once I switch to GTID.

For the tests I tried to do a FLUSH LOGS on the master to see if that would make a difference, but it didn't.

After those tests, I decided to try another thing: configure replication from scratch (so the relay log gets flushed) once the gtid_domain_id has been changed, and not before.

  • Change the gtid_domain_id on the master so the binlogs started to inject the transaction ID.
  • Reconfigure replication on the slave. That worked.

The way I had to do it was:

  • Configure the first slave and start GTID on it.
  • Configure the second slave and start GTID on it.

If I configured both without GTID and then attempted to switch both to GTID at the same time, it would fail.

Maybe the issue is related to the relay logs then?

And while still testing it (inserting data on both masters and stopping/starting the slaves), replication got broken:

Oct 25 15:00:53 ubuntu-512mb-lon1-01 mysqld: 161025 15:00:53 [ERROR] Master 's1': Error reading packet from server: Error: connecting slave requested to start from GTID 0-3-21, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions ( server_errno=1236)

Oct 25 15:00:53 ubuntu-512mb-lon1-01 mysqld: 161025 15:00:53 [ERROR] Master 's1': Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-3-21, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions', Internal MariaDB error code: 1236

logging (50G->19G) and categorylinks (54G->19) have been compressed.

And while still testing it (inserting data on both masters and stopping/starting the slaves), replication got broken:

Oct 25 15:00:53 ubuntu-512mb-lon1-01 mysqld: 161025 15:00:53 [ERROR] Master 's1': Error reading packet from server: Error: connecting slave requested to start from GTID 0-3-21, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions ( server_errno=1236)

Oct 25 15:00:53 ubuntu-512mb-lon1-01 mysqld: 161025 15:00:53 [ERROR] Master 's1': Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-3-21, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions', Internal MariaDB error code: 1236

I have been doing more testing - the s3 thread breaks complaining about a GTID coordinate which is not in the master's binlog.

Oct 25 16:54:33 ubuntu-512mb-lon1-01 mysqld: 161025 16:54:33 [Note] Master 's3': Slave SQL thread initialized, starting replication in log 's3-bin.000017' at position 3586, relay log './s2-rel-s3.000001' position: 4; GTID position '1-1-443,0-1-38,3-3-376'
Oct 25 16:54:33 ubuntu-512mb-lon1-01 mysqld: 161025 16:54:33 [Note] Master 's3': Slave I/O thread: connected to master 'slave_user@127.0.0.1:3308',replication starts at GTID position '1-1-443,0-1-38,3-3-376'
Oct 25 16:54:33 ubuntu-512mb-lon1-01 mysqld: 161025 16:54:33 [ERROR] Master 's3': Error reading packet from server: Error: connecting slave requested to start from GTID 0-1-38, which is not in the master's binlog ( server_errno=1236)
Oct 25 16:54:33 ubuntu-512mb-lon1-01 mysqld: 161025 16:54:33 [ERROR] Master 's3': Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-38, which is not in the master's binlog', Internal MariaDB error code: 1236

And that is correct, because that GTID is in s1's binlogs:

root@ubuntu-512mb-lon1-01:/var/lib/mysql# mysqlbinlog --base64-output=DECODE-ROW s1-bin.000021 | grep -B20 "0-1-38"
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#161025 16:08:07 server id 1  end_log_pos 248  	Start: binlog v 4, server v 10.0.25-MariaDB-0ubuntu0.16.04.1 created 161025 16:08:07
# Warning: this binlog is either in use or was not closed properly.
# at 248
#161025 16:08:07 server id 1  end_log_pos 303  	Gtid list [1-1-388,
# 0-1-38]

It is also strange that the first event written in the binlog has a "0" domain, which is not supposed to be there, as gtid_domain_id is set to 1 for that host.

Right after that, the events get inserted fine with the correct ID:

#161025 16:08:07 server id 1  end_log_pos 303   Gtid list [1-1-388,
# 0-1-38]
# at 303
#161025 16:08:07 server id 1  end_log_pos 339   Binlog checkpoint s1-bin.000020
# at 339
#161025 16:08:07 server id 1  end_log_pos 375   Binlog checkpoint s1-bin.000021
# at 375
#161025 16:08:31 server id 1  end_log_pos 413   GTID 1-1-389
/*!100001 SET @@session.gtid_domain_id=1*//*!*/;
/*!100001 SET @@session.server_id=1*//*!*/;
/*!100001 SET @@session.gtid_seq_no=389*//*!*/;
"show global variables like '%gtid_domain_id%';"
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| gtid_domain_id | 1     |
+----------------+-------+

Looks like when doing a FLUSH LOGS on the master, that event is always there:

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#161025 17:00:13 server id 1  end_log_pos 248   Start: binlog v 4, server v 10.0.25-MariaDB-0ubuntu0.16.04.1 created 161025 17:00:13
# Warning: this binlog is either in use or was not closed properly.
# at 248
#161025 17:00:13 server id 1  end_log_pos 303   Gtid list [1-1-444,
# 0-1-38]

And that is why the slave breaks.

Smells like a bug? Like a race condition for a very specific replication state?

Could be a bug indeed

I am going to do a fresh install tomorrow to rule out any issues, as I have been playing around with binlogs, relay logs, the gtid mysql table, etc.

Another compression for a bunch of big tables just finished:

root@dbstore2001:/srv/sqldata/enwiki# ls -lh externallinks.ibd templatelinks.ibd text.ibd
-rw-rw---- 1 mysql mysql 22G Oct 26 05:44 externallinks.ibd
-rw-rw---- 1 mysql mysql 38G Oct 26 05:44 templatelinks.ibd
-rw-rw---- 1 mysql mysql 28G Oct 26 05:15 text.ibd

I am going to take a snapshot and import s4 from db2058

Mentioned in SAL (#wikimedia-operations) [2016-10-26T08:06:51Z] <marostegui> Stoppping replication on db2058 - using it to clone another host - T146261

I have started the tests from scratch on a completely new VM (conclusions at the end of the post).

This is what I did:

Started the three instances with a gtid_domain_id from the start (this time the slave also had one).

  • Configured normal replication so s2 replicates from s1 and s3 like it did yesterday.
  • Started to insert data on the masters
  • Stopped/started the slave a few times for both threads at the same time
  • Stopped the slaves, and: CHANGE MASTER TO MASTER_USE_GTID = current_pos; (see the sketch after this list)
  • Started them and all went fine.
  • All this has been done while the masters were getting data written. No problems.
  • Left a script stopping and starting both threads randomly while the masters were getting data inserted. Replication never broke. Good news!!
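For the record, the GTID switch for both named connections was roughly (a sketch, connection names as in the test):

STOP ALL SLAVES;
CHANGE MASTER 's1' TO MASTER_USE_GTID = current_pos;
CHANGE MASTER 's3' TO MASTER_USE_GTID = current_pos;
START ALL SLAVES;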

I noticed that the show slave status this time showed this:

Gtid_IO_Pos: 1-1-34,3-3-33

Only two domains (which is correct because there are only 2 masters)

While yesterday it was showing 3:

Gtid_Slave_Pos: 1-1-30,0-1-85,3-3-22

I believe this has to do with the fact that the slave at some point could've had binary logs (and domain id 0) which can explain the second coordinates.

This is still weird as I deleted the instance a few times but who knows...

The second test I did was a bit more similar to our current environment.

  • Set up the three instances with gtid_domain_id = 0
  • Start replication in the slave from both masters.
  • Change gtid_domain_id on each instance
  • FLUSH LOGS on the masters
  • Stop replications
  • Change replication to use GTID
  • Start replications

It breaks.

Gtid_Slave_Pos: 1-1-12,0-3-22,3-3-7

It gets a third domain again (0-3-22), which makes it break.

Looks like we have to change the gtid_domain_id on every single host (or at least on the masters + servers with multisource) before starting to clone the servers with multisource; otherwise, the slave gets a bit confused.
*Changing it on the fly will not work for a multisource slave.*

I have also tested what would happen if we have a single replication thread (and all the servers with their own gtid_domain_id) and a master dies.
How do we repoint replication to a server with a different domain id? So the tests were similar to what we'd face in production.

s1 is the master and s2 and s3 are slaves

  • All three instances without gtid_domain_id set
  • Slaves replicating with GTID

This is what the slave shows:

Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 0-1-30
  • Set gtid_domain_id to a different value on each host (and flush logs)

This is what the slaves shows now

Gtid_IO_Pos: 1-1-1,0-1-32
  • s1 dies and s3 is going to be the new master

When the master is dead, both slaves have executed up to the same GTID position

Gtid_IO_Pos: 1-1-31,0-1-32
  • s2 to become a slave of s3
change master to master_host='127.0.0.1', master_port=3308;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.01 sec)

The new master gets the writes and the replication flows without any problem.

I have tested the failover scenario with a current multisource+GTID server that needs to be repointed to a different server

  • s2 replicates from s1,s3 with GTID enabled. s4 is a standby master which replicates from s1 too.
  • s1 dies
  • s4 becomes the new master
  • s2 needs to be repointed to s4 without specifying any position, just the new IP.
  • All worked fine and s2 replicates the new writes from s4
  • Multiple and random stop/start of all slaves while data is being written to s4 and s3 shows no breakage.

Conclusions:

  • Changing gtid_domain_id on the fly:

I have seen no issues changing it live (it requires FLUSH LOGS, which can be risky depending on the workload)

  • Multisource + GTID

Works fine as long as we change all the masters + slaves (I would change the slaves too, for consistency) before rebuilding the future dbstores + labsdb servers, so they do not replicate from old binlogs without that flag injected

  • Impact on gtid_domain_id with any Production failover:

Looks like the fact that there are different gtid_domain_id values per server isn't a problem for repointing a slave to another master.

  • Multisource + GTID + Master failure

Looks like it works fine and a replication thread can be pointed to another master if the current one dies, even if it has a different domain_id.


As a next step maybe we can simply change the gtid_domain_id on one of the misc replication chains and leave it running for a few days so the logs get flushed etc.
If we want to puppetize this, maybe we can use the hostname (without the "db") as the gtid_domain_id.

I have finished importing s4 commonswiki and it is now replicating fine.
So there are now s1, s3 and s4 replicating.
Obviously not using GTID

root@dbstore2001:/srv# mysql --skip-ssl -e "nopager;show all slaves status\G" | egrep "Connection|Seconds"
              Connection_name: s1
        Seconds_Behind_Master: 22622
              Connection_name: s3
        Seconds_Behind_Master: 2444
              Connection_name: s4
        Seconds_Behind_Master: 0

The total size of the three shards is around 3.8T out of 6.6T.
Some tables are compressed already in enwiki. But as all of them are working fine and the server seems stable, I am going to go ahead, take a snapshot of the four shards and compress all the tables in enwiki (including revision)

After a chat with Jaime yesterday we decided to try one more thing:

(The master getting inserts during the whole process)

  • Slave replicating from two masters (s1 and s3), without GTID and without gtid_domain_id.
  • Enable gtid_domain_id (flush logs) on all the servers.

The slave sees the new domains and keeps replicating fine with the new domain from the masters

  • Switch to GTID

As expected the first thread that gets changed (s1) breaks - the other keeps working fine.

  • reset slave all on s1 (all the related replication files get removed fine)

The gtid_slave_pos table still contains data for domain 1, which is the one belonging to s1, the instance that breaks.
I noted down the positions and coordinates and deleted that data from the table.

  • Reconfigure the slave that broke (s1) to start replication from the last executed position on the master.

Once that has been done, stop the slave and start replication with GTID again.

That seems to work; however, as soon as you stop the slaves and start them again, the other instance breaks (s3).

Now, trying to do the same with s3

  • Note the position
  • Reset slave all
  • Clean the gtid_slave_pos table for anything related to domain_id=3
  • Start replication again
  • Change to GTID
  • stop and start all slaves replication

And now s1 breaks again.

What I tried after that was:

  • s1 is broken - note its position
  • stop s3 - note its position
  • reset all slaves
  • truncate gtid_slave_pos

Configure both slaves again - no luck, one thread keeps breaking

[ERROR] Master 's1': Error reading packet from server: Error: connecting slave requested to start from GTID 0-3-35, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions ( server_errno=1236)

So looks like that the procedure that works is: https://phabricator.wikimedia.org/T146261#2744128

As a side note: I have compressed all the tables in s1.enwiki on dbstore2001 and only pagelinks and revision are pending.
Right now the dataset is 876G and the two tables pending are 244G and 375G before optimizing, so I am expecting the dataset to be around 600G once those two are finished

The alter table for pagelinks is now running, after it failed last night because of:

ERROR 1799 (HY000): Creating index 'pl_from' required more than 'innodb_online_alter_log_max_size' bytes of modification log. Please try again.

http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_online_alter_log_max_size
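The straightforward workaround is to raise that (dynamic) limit before retrying the ALTER; something like the following, where the value is an assumption (the default is 128MB):

SET GLOBAL innodb_online_alter_log_max_size = 2147483648;  -- 2GB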

We need to test our use-case: single master with pt-heartbeat creating master events on the master's master. Should we put secondary datacenter masters with the same domain id as the primary master or a different one? I see issues with both approaches, although if it has worked until now, maybe it is not too much of a problem?

I can try that in my lab too.

Master 1 -> Master 2 -> MultiSource + GTID slave

To be honest, I would go for a different domain_id on every host, because that would make things easier from a management point of view (including puppet) - fewer snowflakes.
I am specifically thinking that if we have to fail over to another host, it is one more thing to keep in mind and it can easily be forgotten.
Also, from a Puppet point of view it could make things a lot more complicated, I think, as servers would need to be, somehow, aware of the topology.
What I had in mind to puppetize this was simply to add something like:

gtid_domain_id = <%= @my_hostname.gsub(/^db(\d+)\..*$/, '\1') %>

On the my.cnf template and that would get the digits of the hostname and use it as a domain_id

Anyways, I don't think it could break a multi source slave with GTID as the events are probably treated the same way, but it needs to be tested.
Thanks for pointing this out

I have tested these two scenarios (with the normal pt-heartbeat tool not our modified one)


A)
Master 1 (pt-heartbeat) -> Master 2 -> MultiSource + GTID slave
Master 3 (pt-heartbeat) -> Multisource + GTID slave

B)
Master 1 (pt-heartbeat) -> Master 2 (pt-heartbeat) -> MultiSource + GTID slave
Master 3 (pt-heartbeat) -> Multisource + GTID slave


Master 1 and Master 3 were getting normal writes at the same time, and nothing broke.
I stopped and started the MultiSource + GTID slave several times and nothing broke.

We can puppetize it (but not deploy it today), and start rolling it out on misc and codfw first.

Maybe track it on a separate ticket and make this one blocked on that.

We can puppetize it (but not deploy it today), and start rolling it out on misc and codfw first.

Maybe track it on a separate ticket and make this one blocked on that.

Yes, I would like to deploy the change on only one misc chain next week and leave it there for a few days, to make sure the logs rotate fine etc. and that all works fine.

pagelinks finished (15 hours ALTER) 255G -> 75G

I just thought about a test that I need to perform tomorrow which can give us a bit more flexibility:

Master 1 (no gtid_domain_id set) -> master 2 (gtid_domain_id set) -> multi source slave +gtid

If this works this can give us a bit more flexibility as we could enable it in codfw and not in eqiad (if we are not 100% confident yet) and build the new hosts.
Although from the previous tests it looks safe to enable.

Anyhow, as we have discussed before we should start only with misc shards first and leave it for a few days to make sure the logs rotate fine and all that.

revision finished its compression (it took 1 day and 2 hours).
It went from: 375G to 139G
The whole dataset is now compressed and it is 485G.

As db1073 is having some delay issues, I am going to monitor the s1 thread on dbstore2001 to see if it has the same issues (https://phabricator.wikimedia.org/T95501#2751531)

I am now taking a snapshot of dbstore2001

Manuel, I mentioned the labsdb1008 plan to Chase. No issue on doing that next week. We will have to ping him in advance, and I promised to move things there to labsdb1009 to continue the testing (databases, I assume also his home or wherever the scripts are).

Thanks - that sounds good. But we need to keep labsdb1008's space in mind if we want to place the three shards (s1, s3 and s4) there.
Right now dbstore2001 and dbstore2002 (with s1 compressed) are 3.4T and labsdb1008 has 3.6T available.

I am planning to compress S4 on the dbstore200x hosts anyway at some point.

Regarding dbstore2002:
S3 and S4 have caught up with the master already and S1 is still doing so; it looks good so far.
I have done a few basic tests like stopping all the slaves and starting them, and as soon as S1 has caught up I will stop and start mysql a few times to make sure it works fine.

If we are confident, we can move that snapshot to labsdb1008 and execute the scripts to sanitize it too.

Lastly, once we are ready to reinstall dbstore2001, I will take a final snapshot from it.
Given that db2016, db2018 and db2019 have expire_logs_days set to 30, that should give us enough time to reinstall dbstore2001 as soon as it has the new disks, restore that snapshot and start replication, so we do not have to copy the data over.

By doing all that we'd get dbstore2001/2 with 3 shards, and I would like to get s2 and s6 there at some point too.

This is done - I will create a different task to import all the shards to dbstore2001 and dbstore2002.