Deploy gtid_domain_id flag in our mysql hosts
Closed, Resolved (Public)

Description

The gtid_domain_id variable is needed in order to run multisource+GTID slaves.
The idea is to deploy this slowly: first to one of the misc shards, leaving it running for a few days to make sure the logs rotate fine, etc.
Then maybe to one shard in codfw.
This injects new objects into the binlogs, so it needs to be handled carefully.

More context on why this is needed: https://phabricator.wikimedia.org/T146261#2744128

From my tests, this would do the trick

gtid_domain_id = <%= @my_hostname.gsub(/^db(\d+)\..*$/, '\1') %>
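For readers who want to try the expression outside of the template, here is a minimal shell sketch of the same substitution. The function name domain_id_from_hostname is hypothetical, and sed -E stands in for Ruby's gsub. Like the Ruby version, it returns the hostname unchanged when the pattern does not match (for example for dbstore hosts).

```shell
# Hypothetical helper mirroring the ERB expression above:
# keep only the digits that follow "db" in a fully qualified hostname.
domain_id_from_hostname() {
  printf '%s\n' "$1" | sed -E 's/^db([0-9]+)\..*$/\1/'
}

domain_id_from_hostname db1009.eqiad.wmnet        # prints 1009
domain_id_from_hostname dbstore1001.eqiad.wmnet   # no match: printed unchanged
```

Note that the values actually deployed later in this task (for example 171966477) are much larger than what this expression yields, because the calculation was later aligned with the one used for server_id.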

Event Timeline

I have restarted the mXX hosts in codfw so they pick up the new gtid_domain_id. On Monday I will apply it on the rest of the servers (and the masters).

root@neodymium:~# for i in  db2010 db2012 db2030; do echo $i; mysql -h$i.codfw.wmnet -e "show global variables like 'gtid_domain_id'"; done
db2010
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180355085 |
+----------------+-----------+
db2012
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180355087 |
+----------------+-----------+
db2030
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180359186 |
+----------------+-----------+

I have deployed this flag to the following hosts:

db1009 (m5 master)
db1046 (m4 master - no slaves)
db1016 (m1 master)
db1001 (m1 slave)

root@neodymium:/home/marostegui/git/software/dbtools# for i in db1009 db1046 db1016 db1001; do echo $i; mysql -h$i -e "select @@gtid_domain_id;";done
db1009
+------------------+
| @@gtid_domain_id |
+------------------+
|        171966477 |
+------------------+
db1046
+------------------+
| @@gtid_domain_id |
+------------------+
|        171970595 |
+------------------+
db1016
+------------------+
| @@gtid_domain_id |
+------------------+
|        171966484 |
+------------------+
db1001
+------------------+
| @@gtid_domain_id |
+------------------+
|        171966469 |
+------------------+

Pending: m3 eqiad servers (including dbstoreXXXX)

I am going to start rolling this out in m3.
dbstore servers do not use GTID so it should be perfectly safe to deploy it there too.

root@neodymium:/home/marostegui/git/software/dbtools# for i in db1043 db1048 dbstore1001 db2012.codfw.wmnet dbstore1002; do echo $i; mysql -h$i -e "select @@gtid_domain_id;";done
db1043
+------------------+
| @@gtid_domain_id |
+------------------+
|                0 |
+------------------+
db1048
+------------------+
| @@gtid_domain_id |
+------------------+
|                0 |
+------------------+
dbstore1001
+------------------+
| @@gtid_domain_id |
+------------------+
|                0 |
+------------------+
db2012.codfw.wmnet
+------------------+
| @@gtid_domain_id |
+------------------+
|        180355087 |
+------------------+
dbstore1002
+------------------+
| @@gtid_domain_id |
+------------------+
|                0 |
+------------------+

Change 325303 had a related patch set uploaded (by Marostegui):
mariadb: Added gtid_domain_id variable

https://gerrit.wikimedia.org/r/325303

Change 325303 merged by Marostegui:
mariadb: Added gtid_domain_id variable

https://gerrit.wikimedia.org/r/325303

The new labs servers (1009, 1010 and 1011) as well as sanitarium2 (db1095) now have gtid_domain_id deployed.

root@neodymium:~# for i in db1095 labsdb1009 labsdb1010 labsdb1011; do echo $i; mysql -h$i -e "select @@gtid_domain_id;";done
db1095
+------------------+
| @@gtid_domain_id |
+------------------+
|        171974784 |
+------------------+
labsdb1009
+------------------+
| @@gtid_domain_id |
+------------------+
|        171967502 |
+------------------+
labsdb1010
+------------------+
| @@gtid_domain_id |
+------------------+
|        171975959 |
+------------------+
labsdb1011
+------------------+
| @@gtid_domain_id |
+------------------+
|        171975960 |
+------------------+

Change 325781 had a related patch set uploaded (by Marostegui):
mariadb: Enable gtid_domain_id - phabricator hosts

https://gerrit.wikimedia.org/r/325781

Change 325781 abandoned by Marostegui:
mariadb: Enable gtid_domain_id - phabricator hosts

Reason:
After discussing it, we will calculate this where server_id is calculated to make it independent from each other.

https://gerrit.wikimedia.org/r/325781

Change 326080 had a related patch set uploaded (by Marostegui):
mariadb: Added calculation for gtid_domain_id

https://gerrit.wikimedia.org/r/326080

Change 326080 merged by Marostegui:
mariadb: Added calculation for gtid_domain_id

https://gerrit.wikimedia.org/r/326080

Change 326086 had a related patch set uploaded (by Marostegui):
mariadb: Added gtid_domain_id to its own variable

https://gerrit.wikimedia.org/r/326086

Change 326086 merged by Marostegui:
mariadb: Added gtid_domain_id to its own variable

https://gerrit.wikimedia.org/r/326086

I have deployed the change to how we calculate gtid_domain_id (it is calculated the same way as server_id for now, but the two are now independent of each other).
This is the changelog for the record:

I have run puppet on the servers that already had this deployed and all went fine, e.g. db2030:

root@db2030:~# cat /etc/my.cnf | grep "^gtid"
gtid_domain_id  = 180359186

root@db2030:~# sudo puppet agent -t
Info: Retrieving pluginfacts
<snip>
Info: Applying configuration version '1481556364'
Notice: Finished catalog run in 16.33 seconds

root@db2030:~# cat /etc/my.cnf | grep "^gtid"
gtid_domain_id  = 180359186
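
The task only says that gtid_domain_id is calculated the same way as server_id; it does not spell out the formula. As an observation, the values shown in this task decode cleanly as IPv4 addresses packed into 32-bit integers. The sketch below, using the hypothetical helper int_to_ipv4, does that decoding. That the IDs are derived from each host's IP address is an assumption, not something stated in the task.

```shell
# Hypothetical helper: print a 32-bit integer as a dotted-quad IPv4 address.
# Assumption: the IDs in this task are IP-derived; the task does not say so.
int_to_ipv4() {
  printf '%d.%d.%d.%d\n' \
    $(( ($1 >> 24) & 255 )) \
    $(( ($1 >> 16) & 255 )) \
    $(( ($1 >> 8) & 255 )) \
    $(( $1 & 255 ))
}

int_to_ipv4 180359186   # db2030's value; prints 10.192.16.18
int_to_ipv4 171966477   # db1009's value; prints 10.64.0.13
```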

Change 326446 had a related patch set uploaded (by Marostegui):
mariadb: Enable gtid_domain_id - phabricator hosts

https://gerrit.wikimedia.org/r/326446

Mentioned in SAL (#wikimedia-operations) [2017-01-18T09:40:37Z] <marostegui> Restart mysql dbstore2002 to enable gtid_domain_id manually before deploying it on m3 - T149418

Mentioned in SAL (#wikimedia-operations) [2017-01-18T10:10:03Z] <marostegui> Restart mysql dbstore2001 to enable gtid_domain_id manually before deploying it on m3 - T149418

I am going to enable gtid_domain_id on dbstore2001 first, before doing it on m3, to make sure it is all good. This is the first time since the tests that we enable it on a multisource slave, and I want to make sure that the dbstore1001 and dbstore1002 servers on m3 won't have any issues once I deploy: https://gerrit.wikimedia.org/r/326446

Change 332228 had a related patch set uploaded (by Marostegui):
mariadb: Split dbstore role classes

https://gerrit.wikimedia.org/r/332228

Change 326446 merged by Marostegui:
mariadb: Enable gtid_domain_id - phabricator hosts

https://gerrit.wikimedia.org/r/326446

Mentioned in SAL (#wikimedia-operations) [2017-01-18T12:21:15Z] <marostegui> Enable gtid_domain_id on m3 - T149418

I have deployed the puppet change and manually enabled the gtid_domain_id flag on m3 (phabricator). No problems have been encountered, and replication is flowing fine on the slaves and on the dbstore servers (dbstore2001 also has the flag enabled).

Change 332965 had a related patch set uploaded (by Marostegui):
eventlogging: Enable gtid_domainid on eventlogging

https://gerrit.wikimedia.org/r/332965

phabricator hosts (m3) got the flag enabled yesterday.

I have submitted a change to enable it on the eventlogging host, so we can get all the misc shards done. It doesn't have replication, but for consistency I would like to have it enabled there: https://gerrit.wikimedia.org/r/#/c/332965/1

Mentioned in SAL (#wikimedia-operations) [2017-01-23T07:32:27Z] <marostegui> Deploy gtid_domain_id db1043 (passive master) - last host pending in m3 - T149418

Change 332965 merged by Marostegui:
eventlogging: Enable gtid_domainid on eventlogging

https://gerrit.wikimedia.org/r/332965

Mentioned in SAL (#wikimedia-operations) [2017-01-23T07:43:31Z] <marostegui> Enabling gtid_domain_id on db1046 (eventlogging master) - T149418

Mentioned in SAL (#wikimedia-operations) [2017-01-23T07:47:10Z] <marostegui> Enabling gtid_domain_id on db1047 (eventlogging host) - T149418

I have deployed the puppet change to enable gtid_domain_id on m4, which is eventlogging. I have manually enabled it after running puppet on those hosts (db1046 and db1047).

After this change, all the misc shards now have gtid_domain_id enabled.

Change 332228 merged by Marostegui:
mariadb: Split dbstore role classes

https://gerrit.wikimedia.org/r/332228

gtid_domain_id pushed to dbstore2 (that is dbstore2001 and dbstore2002).

Change 335816 had a related patch set uploaded (by Marostegui):
mariadb: Add gtid_domain_id to s4

https://gerrit.wikimedia.org/r/335816

Change 335817 had a related patch set uploaded (by Marostegui):
mariadb: Use the common gtid_domain_id

https://gerrit.wikimedia.org/r/335817

Change 335817 merged by Marostegui:
mariadb: Use the common gtid_domain_id

https://gerrit.wikimedia.org/r/335817

I was doing some tests to refresh my memory and try the new version (10.1.21), as this is a critical change, and I have filed this: https://jira.mariadb.org/browse/MDEV-12012
It doesn't really impact our production core environment, but it can break our multisource hosts once we turn on GTID (which is the whole point of deploying gtid_domain_id).

I have added some more comments to the bug report, as I have tested it with all 10.1 versions involved.

Hello,

I thought I would update the ticket with the last findings about gtid+multi source.
It looks like what we faced here, T146261#2744128, is more of an issue than we thought.

So when I was doing the last tests before enabling it in production, and later GTID on multisource slaves, I found the same issue and filed the bug report above.
Long story short: you either enable gtid_domain_id on the masters BEFORE setting up a multisource slave, or you will be in serious trouble when enabling GTID afterwards.
They have suggested a workaround that (so far) is not working. They believe this is the expected behaviour (the replication threads breaking), but it is not stated ANYWHERE in the documentation, so, as I mention below, I do consider this a bug.
Another solution they suggest is to "reset master", which is a no-go for production (as they pretty much say). Below is a bit of the thread we are having:

First mariadb answer:

At the beginning, since both masters are running with domain_id 0, the slave considers it a single replication stream. 
We can see how the slave only has 0-2-x, although master1 has 0-1-x and master2 has 0-2-x.
When the slave is switched to using GTID replication, it sends the position to each master. Although domain IDs on masters are now set to unique values, they of course stay aware of the old 0 domain they used to use, so they are trying to find the position 0-2-x which corresponds to this domain. But only one of the servers has it, the other one returns an error.
I think one way to overcome the problem is to adjust the slave position manually before switching to using GTID-based replication.
That is, assuming that you've ensured that all old-zero-domain events from both servers have been replicated by this time, after you stopped slaves and before you changed them to using GTID, you delete zero domain from mysql.gtid_slave_pos and from @@global.gtid_slave_pos:
stop all slaves;
 
delete from mysql.gtid_slave_pos where domain_id=0;
 
select @@global.gtid_slave_pos;
# Let's say it's 0-2-10,1-1-4,2-2-1 as in the data above
set global gtid_slave_pos= '1-1-4,2-2-1';
 
change master '1' to master_use_gtid=slave_pos;
change master '2' to master_use_gtid=slave_pos;
start all slaves;
It might produce a warning about missing domain "3", but it will be adjusted automatically.
I think it should help to get it up and running, although it's a bit cumbersome.
I'm not sure whether it's a bug. 
Kristian Nielsen, what do you think?
Also, maybe you can suggest a more elegant way to perform this transition?
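
The position rewrite in the suggested workaround (taking 0-2-10,1-1-4,2-2-1 and keeping only 1-1-4,2-2-1) is plain string manipulation, sketched below with a hypothetical helper named drop_gtid_domain. This only illustrates how the new gtid_slave_pos value is derived; it says nothing about whether the workaround itself succeeds, and the thread goes on to report that it did not work in testing.

```shell
# Hypothetical helper: remove every GTID belonging to one domain from a
# comma-separated position string (GTID format: domain-server-seq).
# The body runs in a subshell so the IFS change does not leak out.
drop_gtid_domain() (
  pos=$1
  domain=$2
  out=
  IFS=','
  for gtid in $pos; do
    if [ "${gtid%%-*}" != "$domain" ]; then
      out="${out:+$out,}$gtid"
    fi
  done
  printf '%s\n' "$out"
)

drop_gtid_domain '0-2-10,1-1-4,2-2-1' 0   # prints 1-1-4,2-2-1
```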

Second answer from mariadb:

I agree with Elena. When replicating with GTID, the master must have either all of the domain in the binlog, or none of the domain. Changing gtid_domain_id on a master at some point does not change the binlog of the past.
RESET MASTER on the masters will remove the old events with wrong domain id, though that may be inconvenient to do on a production setup as it loses all binlogs on the masters.

The suggested workaround doesn't work, and I have replied to MariaDB with this:

To be honest, I do consider this a bug, or at least a serious limitation, as it doesn't work as expected (especially keeping in mind that the documentation doesn't mention existing multisource slaves breaking when trying to set up gtid_domain_id+GTID). If this is the expected behaviour, it should be stated there as a big warning.
I consider this (if it is the expected behaviour) a serious limitation because it forces you to either:
* not use GTID with multisource slaves
or
* make sure that you set everything up from the start (gtid_domain_id) before setting up any sort of multisource replication, otherwise it will break. Which again is a massive blocker for existing environments where you'd like to take advantage of multisource (i.e. reduce hardware) and then add the safety of GTID on top of it.

Even if the suggested workaround worked (maybe we are doing something wrong and it really works?), it requires stopping the slave, which might be a no-go for some production environments. And it should be specified in the documentation.
But doing a RESET MASTER is probably a no-go for 95% of production environments, I would say.
Whatever the fix is, if it cannot be done live (the documentation suggests that gtid_domain_id can be changed on the fly, which is true, but switching to GTID will break anyway, so the whole point of gtid_domain_id is gone), it needs to be specified in the documentation.
I am happy to keep debugging this issue to see if we can reach a workaround that works.
Thanks again!
Manuel.

I think it is just easier to change domain_id everywhere, wait 15+ days, run CHANGE MASTER on the multisource slaves including the coordinates, and then enable GTID.

Not sure that will work as per my tests with mariadb, because you would still have this:

root@MISC m3[(none)]> select * from mysql.gtid_slave_pos;
+-----------+----------+-----------+-----------+
| domain_id | sub_id   | server_id | seq_no    |
+-----------+----------+-----------+-----------+
|         0 | 35332094 | 171970592 | 200818911 |
|         0 | 35332179 | 171970592 | 200818911 |
| 171970592 | 49703957 | 171970592 |  56150120 |
| 171970592 | 49703958 | 171970592 |  56150121 |
+-----------+----------+-----------+-----------+
4 rows in set (0.00 sec)

We still have domain_id = 0 there, and that is a host that we changed a long time ago (db1048). The workaround was basically to delete the entries with domain_id = 0 and change the last slave_pos, and that didn't work :-(

TL;DR: gtid_domain_id + gtid works fine with multiple switchovers just using GTID. Note: this remains broken for multisource+gtid. We can go ahead and start deploying: https://gerrit.wikimedia.org/r/#/c/335816/

Given all the issues we are having with gtid_domain_id+GTID+multisource...I wanted to test the normal case we'll have in production (leaving aside gtid+multisource) to see if it is safe to deploy the variable and then attempt master switchovers.

This is what I have tested:

  • 1 master (hostname: master1) + 2 slaves (hostnames: master2 and slave1). One of the slaves is called master2 because it will become a master at some point too. master1 and master2 have log-slave-updates and replicate from each other, like in production.

Current status:

master1----
|         |
master2  slave1
  • GTID enabled (slave_pos) but gtid_domain_id = 0 on all the hosts
  • Data gets replicated fine.
  • I do a first switchover so the new setup is:
master2-----
|          |
master1  slave1
  • Insert data on master2 normally and replication works fine.
  • Change gtid_domain_id on the slave1
  • Insert data normally and replication works fine
  • Change gtid_domain_id on master1 (reminder: this is a slave now, the primary master is master2).
  • Insert data normally and replication works fine
  • Change gtid_domain_id on master2 (primary master)
  • Insert data normally and replication works fine. At some point we see the following change on the slaves once gtid_domain_id has been enabled on the primary master:
Gtid_IO_Pos: 0-2-29

To

Gtid_IO_Pos: 0-2-29,2-2-3

Which shows the new domain_id=2 as the primary master has gtid_domain_id=2

  • We keep inserting data and all works fine
  • We switch over again to master1 to be the primary master (with gtid):
stop slave; change master to master_host='192.168.56.21'; start slave;

So the current topology is now:

master1----
|         |
master2  slave1
  • Without even inserting data we see that the show slave status already shows the gtid_domain_id=1 which is the new master:
Gtid_IO_Pos: 1-1-1,2-2-4,0-2-29
  • Insert data on the new master (master1) and all works fine.
  • Attempt another switchover where slave1 will be the master and we'll have a three-level replication chain:
  • New topology:
slave1
|          
master2
|
master1
  • Insert data on slave1 (now primary master) and replication works fine.
  • Last change, get back to the original topology:
master1----
|         |
master2  slave1
  • Insert data and replication works fine.
  • The slaves now show the gtid_domain_id of every server that had data inserted directly (outside of the replication thread):
Gtid_IO_Pos: 1-1-7,0-2-29,3-3-1,2-2-5

Obviously the "0" remains there, as at some point a server with gtid_domain_id "0" had data inserted.

  • flush logs + purge logs on the master doesn't clean the "0" there, but that is not an issue.
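
To read a Gtid_IO_Pos value like the ones in this test at a glance, it can help to list only the domain IDs it contains. A minimal sketch follows; the helper name gtid_domains is hypothetical, and it assumes each GTID has the form domain-server-seq.

```shell
# Hypothetical helper: print the distinct domain IDs found in a
# comma-separated GTID position, sorted numerically on one line.
gtid_domains() {
  printf '%s\n' "$1" | tr ',' '\n' | cut -d- -f1 | sort -nu | paste -sd' ' -
}

gtid_domains '1-1-7,0-2-29,3-3-1,2-2-5'   # prints 0 1 2 3
```

For the final position in the test above, this shows domains 1, 2 and 3 (one per server that acted as master) plus the legacy domain 0.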

Change 335816 merged by Marostegui:
mariadb: Add gtid_domain_id to s6

https://gerrit.wikimedia.org/r/335816

Mentioned in SAL (#wikimedia-operations) [2017-02-20T09:33:13Z] <marostegui> Manually deploy gtid_domain_id on s6 hosts - T149418

The patch to enable gtid_domain_id for hosts in s6 has been deployed. As this only takes effect upon restart, I have manually changed it on all of them (reminder: this value is set to the same value as server_id):

root@neodymium:/home/marostegui/git/software/dbtools# for i in `cat s6.hosts | cut -f1 -d " " | egrep -v "dbstore*|labs*|db1069|db1095" `;do echo $i; mysql -h$i -e "show global variables like 'gtid_domain_id'; show global variables like 'server_id';";done
db2039.codfw.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180363274 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 180363274 |
+---------------+-----------+
db2046.codfw.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180363370 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 180363370 |
+---------------+-----------+
db2053.codfw.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180367365 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 180367365 |
+---------------+-----------+
db2060.codfw.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180367372 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 180367372 |
+---------------+-----------+
db2067.codfw.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180367379 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 180367379 |
+---------------+-----------+
db2028.codfw.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 180359184 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 180359184 |
+---------------+-----------+
db1022.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171970571 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171970571 |
+---------------+-----------+
db1023.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171970572 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171970572 |
+---------------+-----------+
db1030.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171970579 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171970579 |
+---------------+-----------+
db1037.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171970586 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171970586 |
+---------------+-----------+
db1061.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171978766 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171978766 |
+---------------+-----------+
db1085.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171970663 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171970663 |
+---------------+-----------+
db1088.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171974770 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171974770 |
+---------------+-----------+
db1093.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171978904 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171978904 |
+---------------+-----------+
db1050.eqiad.wmnet
+----------------+-----------+
| Variable_name  | Value     |
+----------------+-----------+
| gtid_domain_id | 171970705 |
+----------------+-----------+
+---------------+-----------+
| Variable_name | Value     |
+---------------+-----------+
| server_id     | 171970705 |
+---------------+-----------+
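
Comparing the two values by eye across this many hosts is error-prone, so a small filter can flag any host where they differ. The sketch below is an assumption-laden illustration: report_id_mismatches is a hypothetical helper, and the commented usage relies on the mysql client's -B (batch) and -N (skip column names) options to emit both values on one line.

```shell
# Hypothetical helper: reads lines of "host gtid_domain_id server_id" on
# stdin, prints one line per host where the two values differ, and returns
# non-zero if any mismatch was found.
report_id_mismatches() {
  awk '$2 != $3 { print $1 ": gtid_domain_id=" $2 " server_id=" $3; bad = 1 }
       END { exit bad + 0 }'
}

# Assumed usage against live hosts (needs the mysql client and access):
#   for i in $(cut -f1 -d" " s6.hosts); do
#     printf '%s ' "$i"
#     mysql -h"$i" -BN -e 'select @@gtid_domain_id, @@server_id'
#   done | report_id_mismatches
```

No output and a zero exit status mean every host has matching values, as in the tables above.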

The hosts are seeing the new master's gtid_domain_id value on their gtid_io_pos:

root@neodymium:/home/marostegui/git/software/dbtools# for i in `cat s6.hosts | cut -f1 -d " " | egrep -v "dbstore*|labs*|db1069|db1095" `;do echo $i; mysql -h$i -e "show slave status\G" | grep Gtid_IO_Pos;done
db2039.codfw.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55119,0-171970705-3042619293
db2046.codfw.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55130,0-171970705-3042619293
db2053.codfw.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55134,0-171970705-3042619293
db2060.codfw.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55138,0-171970705-3042619294
db2067.codfw.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55145,0-171970705-3042619294
db2028.codfw.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55154,0-180359184-3042619294
db1022.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55154,0-180359184-3042619295
db1023.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55158,0-180359184-3042619295
db1030.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55161,0-180359184-3042619295
db1037.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55162,0-180359184-3042619295
db1061.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55163,0-180359184-3042619295
db1085.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55164,0-180359184-3042619295
db1088.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55168,0-180359184-3042619295
db1093.eqiad.wmnet
                  Gtid_IO_Pos: 171970705-171970705-55172,0-180359184-3042619295
db1050.eqiad.wmnet
                  Gtid_IO_Pos:
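
The check being made in the output above (does each slave's Gtid_IO_Pos include the new master's domain?) can be expressed as a small predicate. This is a sketch with a hypothetical helper name, has_gtid_domain, and it assumes the domain-server-seq GTID format seen in the output. The master itself (db1050) reports an empty Gtid_IO_Pos, so the predicate is false for it.

```shell
# Hypothetical helper: succeed if the comma-separated GTID position in $1
# contains an entry whose domain ID equals $2.
has_gtid_domain() {
  case ",$1" in
    *",$2-"*) return 0 ;;
    *)        return 1 ;;
  esac
}

if has_gtid_domain '171970705-171970705-55119,0-171970705-3042619293' 171970705; then
  echo "new master domain seen"
fi
```

The leading comma in the match keeps a server_id that happens to equal the domain ID (as it does here) from producing a false positive.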

Change 338734 had a related patch set uploaded (by Marostegui):
mariadb: Add gtid_domain_id to s2

https://gerrit.wikimedia.org/r/338734

@Marostegui : Hi, trying to keep an eye on this task. Do you think the analytics team will be able to use the new labsdb infra soon for testing and productionization?
Thanks !

Hey @JAllemandou,

We will not be able to enable GTID as soon as we would like to: we have found a bug in MariaDB's implementation: https://jira.mariadb.org/browse/MDEV-12012
We are deploying gtid_domain_id because it is harmless (or looks so) if it is not combined with multisource slaves (which is what we have in labsdb, technically, on their masters). So you might need to wait a bit longer to do your tests (I guess that is what you meant?)

Sorry for the inconvenience :-(

@Marostegui : Thanks for the update, I'll continue monitoring in here and ask for a heads-up every now and then :)
Thanks

Change 338734 merged by Marostegui:
mariadb: Add gtid_domain_id to s2

https://gerrit.wikimedia.org/r/338734

Mentioned in SAL (#wikimedia-operations) [2017-02-27T13:58:58Z] <marostegui> Manually deploy gtid_domain_id on s2 - T149418

I have manually deployed gtid_domain_id on all the hosts in s2.
Will prepare the patch to enable it on the config file across production in a bit, and deploy it in a couple of days.

Change 340130 had a related patch set uploaded (by Marostegui):
production.my.cnf: Enable gtid_domaid_id

https://gerrit.wikimedia.org/r/340130

So the above patch is ready to be deployed in production. I will roll it out in a couple of days. I know I am being careful, but I prefer that to enabling it on s2 and rolling this out on the same day.

Hi @Marostegui
We have questions around this task, as it prevents us from moving forward in productionizing our data extraction pipeline (one major goal for us this quarter / ping @Nuria).

  • Do you allow us to productionise our data extraction with current setting (low number of connections), or do you want us to wait for this task to be finished?
  • Also, can you provide us an ETA as to when this task will be finished, so that we can plan on testing with more connections?

Many thanks

Hey!

The main thing is that even if we finish this task (which we plan to do soon), this won't allow us to deploy GTID on the labs (or any other multisource slave) servers, because there is a bug with gtid+multisource (see https://jira.mariadb.org/browse/MDEV-12012)
The main reason we want GTID enabled there is because it would allow us to recover from slave crashes without "any" problem, and in general, it means we will not need to rebuild all the dataset, which takes lots of time and effort.

Jaime and I were discussing yesterday how we can try to overcome the MariaDB bug and find ways to enable it. We are planning to do some tests, hopefully this week, but that doesn't mean we will solve it, as, again, it looks like a GTID-related design bug.

I wouldn't mind if you go ahead with the current amount of connections as it shouldn't put so much load (OOM, etc) on the servers. It will be slow for you, but probably safer for everyone.

Thanks for the answer @Marostegui .
Slow data is better than no data :)
I'll keep listening to this task for news and ping once in a while.

TL;DR = I found a way to enable GTID+multisource but it crashes when MySQL gets restarted which is a no-go for us still. However, I think this will help MariaDB to find the root cause or a fix for it.

I have been testing again whether there is a way to enable gtid_domain_id + GTID on a multisource slave that is already running without gtid_domain_id, like the environment we have in production.

The initial scenario is as follows:
master1: 10.0.29, gtid_domain_id=0, no GTID enabled
master2: 10.0.29, gtid_domain_id=0, no GTID enabled
slave1: 10.1.21, gtid_domain_id=0, no GTID enabled

Initially there is no intermediate master, so slave1 replicates from both masters.
Before setting up the replication this is the situation on all the hosts:

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show slave status\G
+------------+
| @@hostname |
+------------+
| master1    |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        |       |
| gtid_binlog_state      |       |
| gtid_current_pos       |       |
| gtid_domain_id         | 0     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
+------------------------+-------+
7 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)




MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show slave status\G
+------------+
| @@hostname |
+------------+
| master2    |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        |       |
| gtid_binlog_state      |       |
| gtid_current_pos       |       |
| gtid_domain_id         | 0     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
+------------------------+-------+
7 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)



MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show slave status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        |       |
| gtid_binlog_state      |       |
| gtid_current_pos       |       |
| gtid_domain_id         | 0     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
| wsrep_gtid_domain_id   | 0     |
| wsrep_gtid_mode        | OFF   |
+------------------------+-------+
9 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)

Let's set up multisource replication:

MariaDB [(none)]> change master '1' to master_host='192.168.56.21', master_user='replication', master_password='password', master_log_pos=314, master_log_file='mariadb-bin.000001';
Query OK, 0 rows affected (0.03 sec)

MariaDB [(none)]> change master '2' to master_host='192.168.56.22', master_user='replication', master_password='password', master_log_pos=314, master_log_file='mariadb-bin.000001';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> start all slaves;
Query OK, 0 rows affected, 2 warnings (0.00 sec)

MariaDB [(none)]> show warnings;
+-------+------+-------------------+
| Level | Code | Message           |
+-------+------+-------------------+
| Note  | 1937 | SLAVE '1' started |
| Note  | 1937 | SLAVE '2' started |
+-------+------+-------------------+
2 rows in set (0.00 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 314
               Relay_Log_File: mysqld-relay-bin-1.000002
                Relay_Log_Pos: 538
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 314
              Relay_Log_Space: 839
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 5
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos:
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 314
               Relay_Log_File: mysqld-relay-bin-2.000002
                Relay_Log_Pos: 538
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 314
              Relay_Log_Space: 839
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 5
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos:
2 rows in set (0.00 sec)

Now I generate some writes on each master and rotate the logs on the masters too.
We can see that the slave now has some more coordinates:

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+--------+
| Variable_name          | Value  |
+------------------------+--------+
| gtid_binlog_pos        |        |
| gtid_binlog_state      |        |
| gtid_current_pos       | 0-2-35 |
| gtid_domain_id         | 0      |
| gtid_ignore_duplicates | OFF    |
| gtid_slave_pos         | 0-2-35 |
| gtid_strict_mode       | OFF    |
| wsrep_gtid_domain_id   | 0      |
| wsrep_gtid_mode        | OFF    |
+------------------------+--------+
9 rows in set (0.00 sec)

+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |     63 |         2 |     34 |
|         0 |     64 |         2 |     35 |
+-----------+--------+-----------+--------+
2 rows in set (0.00 sec)

*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000003
          Read_Master_Log_Pos: 3649
               Relay_Log_File: mysqld-relay-bin-1.000006
                Relay_Log_Pos: 3939
        Relay_Master_Log_File: mariadb-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3649
              Relay_Log_Space: 5111
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 83
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-35
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000006
          Read_Master_Log_Pos: 3310
               Relay_Log_File: mysqld-relay-bin-2.000012
                Relay_Log_Pos: 3600
        Relay_Master_Log_File: mariadb-bin.000006
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3310
              Relay_Log_Space: 5238
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 125
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-35
2 rows in set (0.00 sec)

And of course on the masters:

MariaDB [(none)]> select @@hostname;  show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show slave status\G
+------------+
| @@hostname |
+------------+
| master1    |
+------------+
1 row in set (0.00 sec)

+------------------------+--------+
| Variable_name          | Value  |
+------------------------+--------+
| gtid_binlog_pos        | 0-1-29 |
| gtid_binlog_state      | 0-1-29 |
| gtid_current_pos       | 0-1-29 |
| gtid_domain_id         | 0      |
| gtid_ignore_duplicates | OFF    |
| gtid_slave_pos         |        |
| gtid_strict_mode       | OFF    |
+------------------------+--------+
7 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)



MariaDB [(none)]> select @@hostname;  show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show slave status\G
+------------+
| @@hostname |
+------------+
| master2    |
+------------+
1 row in set (0.00 sec)

+------------------------+--------+
| Variable_name          | Value  |
+------------------------+--------+
| gtid_binlog_pos        | 0-2-35 |
| gtid_binlog_state      | 0-2-35 |
| gtid_current_pos       | 0-2-35 |
| gtid_domain_id         | 0      |
| gtid_ignore_duplicates | OFF    |
| gtid_slave_pos         |        |
| gtid_strict_mode       | OFF    |
+------------------------+--------+
7 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)
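
One way to read the outputs above: a GTID is domain_id-server_id-seq_no, and the slave position keeps one entry per domain. With both masters still writing in domain 0, master1's 0-1-29 and master2's 0-2-35 compete for the same slot, so the slave reports only one of them. The Python sketch below is a simplified model of that bookkeeping, not MariaDB's actual implementation, and the function names are made up for illustration.

```python
def parse_gtid(gtid):
    """Split 'domain-server-seq' into three integers."""
    domain_id, server_id, seq_no = (int(part) for part in gtid.split("-"))
    return domain_id, server_id, seq_no


def apply_events(gtids):
    """Simplified slave position: the last GTID applied wins, per domain."""
    position = {}
    for gtid in gtids:
        domain_id, server_id, seq_no = parse_gtid(gtid)
        position[domain_id] = (server_id, seq_no)
    return position


def format_position(position):
    """Render the position the way gtid_slave_pos is displayed."""
    return ",".join(
        f"{domain}-{server}-{seq}"
        for domain, (server, seq) in sorted(position.items())
    )


# Both masters in domain 0: only one entry survives on the slave.
shared_domain = format_position(apply_events(["0-1-29", "0-2-35"]))

# One domain per master: each stream keeps its own entry.
separate_domains = format_position(apply_events(["1-1-18", "2-2-16"]))

print(shared_domain)     # 0-2-35
print(separate_domains)  # 1-1-18,2-2-16
```

This is why each master needs its own gtid_domain_id before a multisource slave can track them independently.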

Let's go ahead and enable gtid_domain_id on all the hosts:

MariaDB [(none)]> select @@hostname;
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 3     |
+---------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> set global gtid_domain_id=3;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show global variables like 'gtid_domain_id';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| gtid_domain_id | 3     |
+----------------+-------+
1 row in set (0.00 sec)


MariaDB [(none)]>  select @@hostname;
+------------+
| @@hostname |
+------------+
| master1    |
+------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 1     |
+---------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> set global gtid_domain_id=1;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show global variables like 'gtid_domain_id';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| gtid_domain_id | 1     |
+----------------+-------+
1 row in set (0.00 sec)



MariaDB [(none)]>  select @@hostname;
+------------+
| @@hostname |
+------------+
| master2    |
+------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 2     |
+---------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> set global gtid_domain_id=2;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show global variables like 'gtid_domain_id';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| gtid_domain_id | 2     |
+----------------+-------+
1 row in set (0.00 sec)

Let's generate more writes, flush logs and so forth, and we can see that the slave has picked up the new coordinates:

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+----------------------+
| Variable_name          | Value                |
+------------------------+----------------------+
| gtid_binlog_pos        |                      |
| gtid_binlog_state      |                      |
| gtid_current_pos       | 0-2-20,1-1-18,2-2-16 |
| gtid_domain_id         | 3                    |
| gtid_ignore_duplicates | OFF                  |
| gtid_slave_pos         | 0-2-20,1-1-18,2-2-16 |
| gtid_strict_mode       | OFF                  |
| wsrep_gtid_domain_id   | 0                    |
| wsrep_gtid_mode        | OFF                  |
+------------------------+----------------------+
9 rows in set (0.00 sec)

+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |     43 |         2 |     19 |
|         0 |     44 |         2 |     20 |
|         1 |     77 |         1 |     17 |
|         1 |     78 |         1 |     18 |
|         2 |     71 |         2 |     15 |
|         2 |     72 |         2 |     16 |
+-----------+--------+-----------+--------+
6 rows in set (0.00 sec)

*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000019
          Read_Master_Log_Pos: 887
               Relay_Log_File: mysqld-relay-bin-1.000038
                Relay_Log_Pos: 1177
        Relay_Master_Log_File: mariadb-bin.000019
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 887
              Relay_Log_Space: 2375
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 337
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000012
          Read_Master_Log_Pos: 1014
               Relay_Log_File: mysqld-relay-bin-2.000024
                Relay_Log_Pos: 1304
        Relay_Master_Log_File: mariadb-bin.000012
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1014
              Relay_Log_Space: 2502
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 231
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16
2 rows in set (0.00 sec)
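
The gtid_slave_pos value and the mysql.gtid_slave_pos rows in the output above line up: for each domain_id, the row with the highest sub_id gives the current position. A small Python sketch of that lookup, using the six rows shown above (the function name is hypothetical, and this only mirrors the relationship visible in the output):

```python
# Rows copied from the mysql.gtid_slave_pos output above:
# (domain_id, sub_id, server_id, seq_no)
rows = [
    (0, 43, 2, 19),
    (0, 44, 2, 20),
    (1, 77, 1, 17),
    (1, 78, 1, 18),
    (2, 71, 2, 15),
    (2, 72, 2, 16),
]


def slave_pos_from_rows(table_rows):
    """Pick the row with the highest sub_id in each domain."""
    latest = {}
    for domain_id, sub_id, server_id, seq_no in table_rows:
        if domain_id not in latest or sub_id > latest[domain_id][0]:
            latest[domain_id] = (sub_id, server_id, seq_no)
    return ",".join(
        f"{domain}-{server}-{seq}"
        for domain, (_, server, seq) in sorted(latest.items())
    )


derived_pos = slave_pos_from_rows(rows)
print(derived_pos)  # 0-2-20,1-1-18,2-2-16
```

Note that the domain 0 entry (0-2-20) is still part of the position even though both masters have moved on to their own domains.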

Now, let's enable GTID replication on the slave and see how it breaks:

MariaDB [(none)]> stop all slaves;
Query OK, 0 rows affected, 2 warnings (0.00 sec)

MariaDB [(none)]> change master '1' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> change master '2' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> start all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State:
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000019
          Read_Master_Log_Pos: 887
               Relay_Log_File: mysqld-relay-bin-1.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mariadb-bin.000019
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 887
              Relay_Log_Space: 249
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-2-20, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-18,0-2-20,2-2-16
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 338
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000012
          Read_Master_Log_Pos: 1014
               Relay_Log_File: mysqld-relay-bin-2.000002
                Relay_Log_Pos: 714
        Relay_Master_Log_File: mariadb-bin.000012
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1014
              Relay_Log_Space: 1015
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-18,0-2-20,2-2-16
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 240
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16

Let's look at the error coming from connection '1':

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-2-20, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'

The error makes sense: there is no way master1's binlog contains position "0-2-20", as that position comes from master2 when it still had gtid_domain_id=0.
We can see this in the earlier mysql.gtid_slave_pos output (server_id=2 and seq_no=20):

+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |     43 |         2 |     19 |
|         0 |     44 |         2 |     20 |
|         1 |     77 |         1 |     17 |
|         1 |     78 |         1 |     18 |
|         2 |     71 |         2 |     15 |
|         2 |     72 |         2 |     16 |
+-----------+--------+-----------+--------+
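
To make the failure concrete, here is a rough Python model of the check. It assumes a requested GTID is a problem only when the master knows the domain but does not have that exact GTID, an assumption inferred from connection '2' starting fine above, not taken from the server code. The binlog contents are made-up stand-ins shaped like the earlier outputs, and the function names are hypothetical.

```python
def parse_gtid(gtid):
    """Split 'domain-server-seq' into three integers."""
    domain_id, server_id, seq_no = (int(part) for part in gtid.split("-"))
    return domain_id, server_id, seq_no


def unservable_gtids(requested_pos, master_binlog):
    """Return requested GTIDs whose domain the master has, but not the GTID."""
    binlog = {parse_gtid(gtid) for gtid in master_binlog}
    known_domains = {domain for domain, _, _ in binlog}
    missing = []
    for gtid in requested_pos.split(","):
        parsed = parse_gtid(gtid)
        if parsed[0] in known_domains and parsed not in binlog:
            missing.append(gtid)
    return missing


# Assumed binlog contents (illustrative only):
# master1 wrote in domain 0 first, then in domain 1 after the change.
master1_binlog = [f"0-1-{n}" for n in range(1, 30)] + [f"1-1-{n}" for n in range(1, 19)]
# master2 wrote in domain 0 first, then in domain 2 after the change.
master2_binlog = [f"0-2-{n}" for n in range(1, 21)] + [f"2-2-{n}" for n in range(1, 17)]

slave_pos = "0-2-20,1-1-18,2-2-16"

print(unservable_gtids(slave_pos, master1_binlog))  # ['0-2-20']
print(unservable_gtids(slave_pos, master2_binlog))  # []
```

Under these assumptions only connection '1' fails, because master1 has domain 0 in its binlog but only with server_id=1.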

First (unsuccessful) attempt to fix replication: reset the masters and the slave, then start replication again:

MariaDB [mysql]> reset master;
Query OK, 0 rows affected (0.01 sec)

MariaDB [mysql]> show master status\G
*************************** 1. row ***************************
            File: mariadb-bin.000001
        Position: 314
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)

MariaDB [mysql]> show global variables like '%gtid%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        |       |
| gtid_binlog_state      |       |
| gtid_current_pos       |       |
| gtid_domain_id         | 1     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
+------------------------+-------+
7 rows in set (0.00 sec)


MariaDB [(none)]>  reset master;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show global variables like '%gtid%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        |       |
| gtid_binlog_state      |       |
| gtid_current_pos       |       |
| gtid_domain_id         | 2     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
+------------------------+-------+
7 rows in set (0.00 sec)

Now let's reset the slave:

MariaDB [(none)]> reset slave '1' all;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> reset slave '2' all;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show all slaves status\G
Empty set (0.00 sec)

MariaDB [(none)]> delete from mysql.gtid_slave_pos;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> select * from mysql.gtid_slave_pos;
Empty set (0.00 sec)

Now let's configure replication again, this time directly with GTID:

MariaDB [(none)]> change master '2' to master_host='192.168.56.22', master_user='replication', master_password='password', master_log_pos=314, master_log_file='mariadb-bin.000001';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> change master '1' to master_host='192.168.56.21', master_user='replication', master_password='password', master_log_pos=314, master_log_file='mariadb-bin.000001';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> change master '2' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> change master '1' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> start all slaves;
Query OK, 0 rows affected, 2 warnings (0.02 sec)

MariaDB [(none)]> select * from mysql.gtid_slave_pos;
Empty set (0.00 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 314
               Relay_Log_File: mysqld-relay-bin-1.000002
                Relay_Log_Pos: 604
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 314
              Relay_Log_Space: 905
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-18,0-2-20,2-2-16
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 7
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16,3-3-2
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 314
               Relay_Log_File: mysqld-relay-bin-2.000002
                Relay_Log_Pos: 604
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 314
              Relay_Log_Space: 905
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-18,0-2-20,2-2-16
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 7
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16,3-3-2
2 rows in set (0.00 sec)

If we check Gtid_Slave_Pos, it is still pointing to the OLD transactions, even though replication has been reset, the mysql.gtid_slave_pos table is empty and multi-master.info is empty.

Let's write just one transaction on each of the two masters:

MariaDB [mysql]> show global variables like '%gtid%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        | 1-1-1 |
| gtid_binlog_state      | 1-1-1 |
| gtid_current_pos       | 1-1-1 |
| gtid_domain_id         | 1     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
+------------------------+-------+
7 rows in set (0.00 sec)




MariaDB [(none)]> show global variables like '%gtid%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        | 2-2-1 |
| gtid_binlog_state      | 2-2-1 |
| gtid_current_pos       | 2-2-1 |
| gtid_domain_id         | 2     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
+------------------------+-------+
7 rows in set (0.00 sec)

And replication breaks on both replication connections, because they are attempting to start from the OLD transactions:

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State:
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 314
               Relay_Log_File: mysqld-relay-bin-1.000002
                Relay_Log_Pos: 604
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 314
              Relay_Log_Space: 905
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-1-18, which is not in the master's binlog'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-18,0-2-20,2-2-16
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 7
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16,3-3-2
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State:
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 314
               Relay_Log_File: mysqld-relay-bin-2.000002
                Relay_Log_Pos: 604
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 314
              Relay_Log_Space: 905
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-16, which is not in the master's binlog'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-18,0-2-20,2-2-16
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 7
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-2-20,1-1-18,2-2-16,3-3-2
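
The two 1236 errors above can be reproduced by comparing the position the slave requests with what each master's binlog now contains. A minimal Python sketch of that comparison, assuming a simplified rule (a requested GTID cannot be served when its sequence number is higher than the master's latest one in the same domain). The real server logic has more cases, and the function names here are hypothetical:

```python
def parse_gtid_pos(pos):
    """Parse a MariaDB GTID position such as '0-2-20,1-1-18' into
    {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, (part.strip() for part in pos.split(","))):
        domain_id, server_id, seq_no = (int(x) for x in gtid.split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


def unservable_gtids(slave_pos, master_binlog_state):
    """Return the GTIDs in slave_pos that are ahead of the master's binlog.

    Simplifying assumption: only domains present on the master are checked.
    """
    slave = parse_gtid_pos(slave_pos)
    master = parse_gtid_pos(master_binlog_state)
    ahead = []
    for domain_id, (server_id, seq_no) in sorted(slave.items()):
        if domain_id in master and seq_no > master[domain_id][1]:
            ahead.append(f"{domain_id}-{server_id}-{seq_no}")
    return ahead


# Values taken from the output above.
stale_slave_pos = "0-2-20,1-1-18,2-2-16,3-3-2"
print(unservable_gtids(stale_slave_pos, "1-1-1"))  # against master 1
print(unservable_gtids(stale_slave_pos, "2-2-1"))  # against master 2
```

With the stale Gtid_Slave_Pos, each master is asked for a sequence number far beyond the single transaction it has written since its reset, which matches the GTIDs named in the two error messages.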

It looks like there is some sort of cache (or something similar) that requires a restart to be cleared, so that when we set up the replication threads after resetting them we do not get the same old position. I am only able to get a completely fresh start if I delete the file mariadb-bin.state.
Otherwise I cannot:

root@slave1:/var/log/mysql# /etc/init.d/mysql  stop
[ ok ] Stopping mysql (via systemctl): mysql.service.
root@slave1:/var/log/mysql# ls
mariadb-bin.000001  mariadb-bin.000002	mariadb-bin.index  mariadb-bin.state
root@slave1:/var/log/mysql# rm mariadb-bin.00000* mariadb-bin.index
root@slave1:/var/log/mysql# /etc/init.d/mysql  start
[ ok ] Starting mysql (via systemctl): mysql.service.
root@slave1:/var/log/mysql# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.1.21-MariaDB-1~jessie mariadb.org binary distribution

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        | 3-3-1 |
| gtid_binlog_state      | 3-3-1 |
| gtid_current_pos       | 3-3-1 |
| gtid_domain_id         | 3     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
| wsrep_gtid_domain_id   | 0     |
| wsrep_gtid_mode        | OFF   |
+------------------------+-------+
9 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)

MariaDB [(none)]> Ctrl-C -- exit!
Aborted
root@slave1:/var/log/mysql# mysql
MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        | 3-3-1 |
| gtid_binlog_state      | 3-3-1 |
| gtid_current_pos       | 3-3-1 |
| gtid_domain_id         | 3     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
| wsrep_gtid_domain_id   | 0     |
| wsrep_gtid_mode        | OFF   |
+------------------------+-------+

Now let's remove it:

root@slave1:/var/log/mysql# /etc/init.d/mysql  stop
[ ok ] Stopping mysql (via systemctl): mysql.service.
root@slave1:/var/log/mysql# ls
mariadb-bin.000001  mariadb-bin.index  mariadb-bin.state
root@slave1:/var/log/mysql# rm mariadb-bin.state
root@slave1:/var/log/mysql# /etc/init.d/mysql  start
[ ok ] Starting mysql (via systemctl): mysql.service.
root@slave1:/var/log/mysql# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.1.21-MariaDB-1~jessie mariadb.org binary distribution

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        | 3-3-1 |
| gtid_binlog_state      | 3-3-1 |
| gtid_current_pos       | 3-3-1 |
| gtid_domain_id         | 3     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
| wsrep_gtid_domain_id   | 0     |
| wsrep_gtid_mode        | OFF   |
+------------------------+-------+

So let's go back to a fresh start and remove all the replication-related files, including the mariadb-bin.state file, just before we re-enable GTID, so BEFORE it crashes:
So basically I reset all the masters and the slave, deleted all the binary logs and relay logs, set gtid_domain_id to 0 on all the servers, disabled GTID, stopped the servers, etc., to go back to the original issue.

So we are at the point where gtid_domain_id has been set on all hosts, but GTID replication remains disabled. Instead of enabling GTID right away (as it will break), let's clean everything up first.
So this is what I do:

Let's first clean up all the replication threads:

MariaDB [(none)]> stop all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)


MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State:
               Slave_IO_State:
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000014
          Read_Master_Log_Pos: 4639
               Relay_Log_File: mysqld-relay-bin-1.000028
                Relay_Log_Pos: 4929
        Relay_Master_Log_File: mariadb-bin.000014
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4639
              Relay_Log_Space: 6002
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 335
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-42,2-2-66
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State:
               Slave_IO_State:
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000006
          Read_Master_Log_Pos: 4887
               Relay_Log_File: mysqld-relay-bin-2.000012
                Relay_Log_Pos: 5177
        Relay_Master_Log_File: mariadb-bin.000006
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4887
              Relay_Log_Space: 10752
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: No
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 339
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-42,2-2-66
2 rows in set (0.00 sec)

MariaDB [(none)]> reset slave '1' all;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> reset slave '2' all;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> show all slaves status\G
Empty set (0.00 sec)

Even though we have reset the slaves, the old position still remains, even if we delete it from the table (as we have seen previously):

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+---------------------+
| Variable_name          | Value               |
+------------------------+---------------------+
| gtid_binlog_pos        |                     |
| gtid_binlog_state      |                     |
| gtid_current_pos       | 0-1-9,1-1-42,2-2-66 |
| gtid_domain_id         | 3                   |
| gtid_ignore_duplicates | OFF                 |
| gtid_slave_pos         | 0-1-9,1-1-42,2-2-66 |
| gtid_strict_mode       | OFF                 |
| wsrep_gtid_domain_id   | 0                   |
| wsrep_gtid_mode        | OFF                 |
+------------------------+---------------------+
9 rows in set (0.00 sec)

+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |     14 |         1 |      8 |
|         0 |     15 |         1 |      9 |
|         1 |    122 |         1 |     41 |
|         1 |    123 |         1 |     42 |
|         2 |    101 |         2 |     65 |
|         2 |    102 |         2 |     66 |
+-----------+--------+-----------+--------+
6 rows in set (0.00 sec)

Empty set (0.00 sec)

MariaDB [(none)]> delete from mysql.gtid_slave_pos;
Query OK, 6 rows affected (0.00 sec)

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+---------------------------+
| Variable_name          | Value                     |
+------------------------+---------------------------+
| gtid_binlog_pos        | 3-3-1                     |
| gtid_binlog_state      | 3-3-1                     |
| gtid_current_pos       | 0-1-9,1-1-42,2-2-66,3-3-1 |
| gtid_domain_id         | 3                         |
| gtid_ignore_duplicates | OFF                       |
| gtid_slave_pos         | 0-1-9,1-1-42,2-2-66       |
| gtid_strict_mode       | OFF                       |
| wsrep_gtid_domain_id   | 0                         |
| wsrep_gtid_mode        | OFF                       |
+------------------------+---------------------------+
9 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)

MariaDB [(none)]> Ctrl-C -- exit!
Aborted
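
The gtid_slave_pos value shown before the delete can be derived from the six rows of mysql.gtid_slave_pos: as the MariaDB documentation describes it, the current position for each domain is the row with the highest sub_id. A small Python sketch of that derivation, with a hypothetical function name and the rows copied from the output above:

```python
def slave_pos_from_rows(rows):
    """Build a GTID position string from mysql.gtid_slave_pos rows.

    Each row is (domain_id, sub_id, server_id, seq_no); for every domain
    the row with the highest sub_id is taken as the current position.
    """
    latest = {}
    for domain_id, sub_id, server_id, seq_no in rows:
        if domain_id not in latest or sub_id > latest[domain_id][0]:
            latest[domain_id] = (sub_id, server_id, seq_no)
    return ",".join(
        f"{domain_id}-{server_id}-{seq_no}"
        for domain_id, (_, server_id, seq_no) in sorted(latest.items())
    )


rows = [
    (0, 14, 1, 8), (0, 15, 1, 9),
    (1, 122, 1, 41), (1, 123, 1, 42),
    (2, 101, 2, 65), (2, 102, 2, 66),
]
print(slave_pos_from_rows(rows))  # 0-1-9,1-1-42,2-2-66
```

Since the variable still reports the same value after the rows are deleted, the server presumably also holds this position somewhere other than the table, which would fit the behaviour seen here.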

So let's try the hard way:

root@slave1:/var/lib/mysql# /etc/init.d/mysql  stop
[ ok ] Stopping mysql (via systemctl): mysql.service.
root@slave1:/var/lib/mysql# rm multi-master.info
root@slave1:/var/lib/mysql# cd /var/log/mysql/
root@slave1:/var/log/mysql# ls
mariadb-bin.000001  mariadb-bin.index  mariadb-bin.state
root@slave1:/var/log/mysql# rm -fr *
root@slave1:/var/log/mysql# /etc/init.d/mysql  restart
[ ok ] Restarting mysql (via systemctl): mysql.service.
root@slave1:/var/log/mysql# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.1.21-MariaDB-1~jessie mariadb.org binary distribution

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        |       |
| gtid_binlog_state      |       |
| gtid_current_pos       |       |
| gtid_domain_id         | 3     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
| wsrep_gtid_domain_id   | 0     |
| wsrep_gtid_mode        | OFF   |
+------------------------+-------+
9 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)

As it looks good now, let's configure replication from the point where it was left and enable GTID all at once:

MariaDB [(none)]> change master '1' to master_host='192.168.56.21', master_user='replication', master_password='password', master_log_pos=4639, master_log_file='mariadb-bin.000014';
Query OK, 0 rows affected (0.02 sec)

MariaDB [(none)]> change master '2' to master_host='192.168.56.22', master_user='replication', master_password='password', master_log_pos=4887, master_log_file='mariadb-bin.000006';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> change master '1' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> change master '2' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State:
               Slave_IO_State:
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000014
          Read_Master_Log_Pos: 4639
               Relay_Log_File: mysqld-relay-bin-1.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mariadb-bin.000014
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4639
              Relay_Log_Space: 249
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 0
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos:
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State:
               Slave_IO_State:
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000006
          Read_Master_Log_Pos: 4887
               Relay_Log_File: mysqld-relay-bin-2.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mariadb-bin.000006
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4887
              Relay_Log_Space: 249
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos:
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 0
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos:
2 rows in set (0.00 sec)
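
The CHANGE MASTER statements above reuse, for each connection, the Relay_Master_Log_File and Exec_Master_Log_Pos values recorded when the slaves were stopped. A sketch of generating them in Python, assuming those coordinates were saved beforehand. The helper name is hypothetical, and the user and password defaults are just the test values from this setup:

```python
def change_master_statements(name, host, log_file, log_pos,
                             user="replication", password="password"):
    """Return the two statements used for one named connection:
    first the binlog coordinates, then the switch to GTID.

    No SQL escaping is done; the inputs are assumed to be trusted.
    """
    return [
        f"change master '{name}' to master_host='{host}', "
        f"master_user='{user}', master_password='{password}', "
        f"master_log_pos={log_pos}, master_log_file='{log_file}';",
        f"change master '{name}' to master_use_gtid=slave_pos;",
    ]


# Coordinates from the status output taken when the slaves were stopped.
saved = [
    ("1", "192.168.56.21", "mariadb-bin.000014", 4639),
    ("2", "192.168.56.22", "mariadb-bin.000006", 4887),
]
for connection in saved:
    for statement in change_master_statements(*connection):
        print(statement)
```

The sketch prints both statements per connection, whereas the session above set the coordinates on both connections first and enabled GTID afterwards.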

Let's start the slaves and see if they catch up:

MariaDB [(none)]> start all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000014
          Read_Master_Log_Pos: 8893
               Relay_Log_File: mysqld-relay-bin-1.000020
                Relay_Log_Pos: 9183
        Relay_Master_Log_File: mariadb-bin.000014
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 8893
              Relay_Log_Space: 10256
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-63,0-1-9
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 382
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-63,2-2-88
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000006
          Read_Master_Log_Pos: 9389
               Relay_Log_File: mysqld-relay-bin-2.000012
                Relay_Log_Pos: 9679
        Relay_Master_Log_File: mariadb-bin.000006
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 9389
              Relay_Log_Space: 15254
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-2-6,2-2-88
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 429
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-63,2-2-88
2 rows in set (0.00 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000014
          Read_Master_Log_Pos: 13147
               Relay_Log_File: mysqld-relay-bin-1.000020
                Relay_Log_Pos: 13437
        Relay_Master_Log_File: mariadb-bin.000014
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 13147
              Relay_Log_Space: 14510
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-84,0-1-9
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 466
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-84,2-2-110
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000006
          Read_Master_Log_Pos: 13891
               Relay_Log_File: mysqld-relay-bin-2.000012
                Relay_Log_Pos: 14181
        Relay_Master_Log_File: mariadb-bin.000006
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 13891
              Relay_Log_Space: 19756
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-2-6,2-2-110
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 517
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-84,2-2-110
2 rows in set (0.00 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000014
          Read_Master_Log_Pos: 13147
               Relay_Log_File: mysqld-relay-bin-1.000020
                Relay_Log_Pos: 13437
        Relay_Master_Log_File: mariadb-bin.000014
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 13147
              Relay_Log_Space: 14510
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-84,0-1-9
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 466
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-84,2-2-110
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000006
          Read_Master_Log_Pos: 13891
               Relay_Log_File: mysqld-relay-bin-2.000012
                Relay_Log_Pos: 14181
        Relay_Master_Log_File: mariadb-bin.000006
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 13891
              Relay_Log_Space: 19756
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-2-6,2-2-110
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 517
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-1-9,1-1-84,2-2-110

They did!
So it looks like we HAVE TO delete that mariadb-bin.state file.
However, as soon as we restart MySQL, it crashes again, as it keeps trying to look for a position in the binlog of the other channel:

              Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-22, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
             Last_SQL_Errno: 0
             Last_SQL_Error:
Replicate_Ignore_Server_Ids:
           Master_Server_Id: 2
             Master_SSL_Crl:
         Master_SSL_Crlpath:
                 Using_Gtid: Slave_Pos
                Gtid_IO_Pos: 1-1-189,0-1-22,2-2-199
    Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
              Parallel_Mode: conservative
       Retried_transactions: 0
         Max_relay_log_size: 104857600
       Executed_log_entries: 1
  Slave_received_heartbeats: 0
     Slave_heartbeat_period: 1800.000
             Gtid_Slave_Pos: 0-1-22,1-1-189,2-2-199

I am going to report this upstream and keep digging a bit more.
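
One way to spot this situation before turning GTID on is to check whether two replication channels report the same domain coming from different server ids. This is only a rough sketch: it assumes the Gtid_IO_Pos values have been copied by hand out of SHOW ALL SLAVES STATUS, and the function names are made up for illustration.

```python
from collections import defaultdict


def parse_gtid_list(value):
    """Parse a 'domain-server-seq,...' string into a list of int triples."""
    gtids = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
        gtids.append((domain_id, server_id, seq_no))
    return gtids


def shared_domains(io_pos_by_connection):
    """Return {domain_id: sorted server_ids} for domains fed by more than one server."""
    servers = defaultdict(set)
    for value in io_pos_by_connection.values():
        for domain_id, server_id, _seq_no in parse_gtid_list(value):
            servers[domain_id].add(server_id)
    return {d: sorted(s) for d, s in servers.items() if len(s) > 1}


# Gtid_IO_Pos values as reported by the two channels in the status output above.
io_pos = {"1": "1-1-84,0-1-9", "2": "0-2-6,2-2-110"}
print(shared_domains(io_pos))
```

With those values, domain 0 is the only one reported by both server 1 and server 2, which matches the domain the failing channel is asking for.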

The only way I have found to get it enabled is as follows.
The main issue seems to come from the masters: there is no way to clean up the coordinates from when they used to have gtid_domain_id=0 unless you reset master AFTER you've set the desired value for gtid_domain_id on them:

+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| gtid_binlog_pos        | 0-2-44,2-2-88 |
| gtid_binlog_state      | 0-2-44,2-2-88 |
| gtid_current_pos       | 0-2-44,2-2-88 |
| gtid_domain_id         | 2             |
| gtid_ignore_duplicates | OFF           |
| gtid_slave_pos         |               |
| gtid_strict_mode       | OFF           |
+------------------------+---------------+
7 rows in set (0.00 sec)

And then, on the first transaction, it gets the new coordinates and the old domain 0 entry (which is what confuses the slave) disappears.

+------------------------+--------+
| Variable_name          | Value  |
+------------------------+--------+
| gtid_binlog_pos        | 2-2-22 |
| gtid_binlog_state      | 2-2-22 |
| gtid_current_pos       | 2-2-22 |
| gtid_domain_id         | 2      |
| gtid_ignore_duplicates | OFF    |
| gtid_slave_pos         |        |
| gtid_strict_mode       | OFF    |
+------------------------+--------+
7 rows in set (0.00 sec)
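
The difference between those two outputs can be checked mechanically. A minimal sketch, assuming the gtid_binlog_state value has been copied from SHOW GLOBAL VARIABLES on a top-level master (an intermediate master legitimately carries several domains, so this check would not apply there); stale_domains is a hypothetical helper name:

```python
def stale_domains(gtid_binlog_state, gtid_domain_id):
    """Return the domain ids in gtid_binlog_state other than the configured one."""
    domains = set()
    for item in gtid_binlog_state.split(","):
        item = item.strip()
        if item:
            # A GTID is written as domain-server-seq; only the domain matters here.
            domains.add(int(item.split("-")[0]))
    return sorted(domains - {gtid_domain_id})


print(stale_domains("0-2-44,2-2-88", 2))  # state before the reset
print(stale_domains("2-2-22", 2))         # state after the reset
```

The first call reports domain 0 as a leftover; the second reports nothing.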

This only works if the slave is completely reset as well; otherwise it will keep looking for "0-XXX".
So the procedure involves resetting the master and then the slave: once all the transactions with the old domain_id=0 have been executed, you also need to reset and reconfigure the slaves with:

MariaDB [(none)]> stop all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

MariaDB [(none)]> reset slave all ;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> reset slave '2' all;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> reset slave '1' all;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> delete from mysql.gtid_slave_pos;
Query OK, 2 rows affected (0.01 sec)

MariaDB [(none)]> Ctrl-C -- exit!
Aborted
root@slave1:/var/log/mysql# /etc/init.d/mysql  stop
[ ok ] Stopping mysql (via systemctl): mysql.service.
root@slave1:/var/log/mysql# ls
mariadb-bin.000001  mariadb-bin.index  mariadb-bin.state
root@slave1:/var/log/mysql# rm -fr *
root@slave1:/var/log/mysql# cd /var/lib/mysql/
root@slave1:/var/lib/mysql# rm -fr multi-master.info
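
The SQL part of the reset above could be generated per connection instead of typed by hand. This is only a sketch: it prints statements and runs nothing, it assumes connection names contain no quote characters, and the file removal (binlogs and multi-master.info) still has to be done manually with mysqld stopped. reset_statements is a made-up name.

```python
def reset_statements(connection_names):
    """Build the SQL used above to wipe replication state on a multi-source slave."""
    statements = ["STOP ALL SLAVES;"]
    for name in connection_names:
        # Named connections are quoted as in the transcript: reset slave '1' all;
        statements.append("RESET SLAVE '{}' ALL;".format(name))
    statements.append("DELETE FROM mysql.gtid_slave_pos;")
    return statements


for statement in reset_statements(["1", "2"]):
    print(statement)
```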

Once that is done and the slave is "fresh", you need to reenable replication:

root@slave1:/var/lib/mysql# /etc/init.d/mysql  restart
[ ok ] Restarting mysql (via systemctl): mysql.service.
root@slave1:/var/lib/mysql# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.1.21-MariaDB-1~jessie mariadb.org binary distribution

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| gtid_binlog_pos        |       |
| gtid_binlog_state      |       |
| gtid_current_pos       |       |
| gtid_domain_id         | 3     |
| gtid_ignore_duplicates | OFF   |
| gtid_slave_pos         |       |
| gtid_strict_mode       | OFF   |
| wsrep_gtid_domain_id   | 0     |
| wsrep_gtid_mode        | OFF   |
+------------------------+-------+
9 rows in set (0.00 sec)

Empty set (0.00 sec)

Empty set (0.00 sec)

MariaDB [(none)]> change master '1' to master_host='192.168.56.21', master_user='replication', master_password='password', master_log_pos=314, master_log_file='mariadb-bin.000001';

Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> change master '2' to master_host='192.168.56.22', master_user='replication', master_password='password', master_log_pos=314, master_log_file='mariadb-bin.000001';

Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> change master '2' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> change master '1' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> start all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

MariaDB [(none)]> select @@hostname; show slave hosts\G show global variables like '%gtid%'; select * from mysql.gtid_slave_pos; show all slaves status\G
+------------+
| @@hostname |
+------------+
| slave1     |
+------------+
1 row in set (0.00 sec)

Empty set (0.00 sec)

+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| gtid_binlog_pos        |               |
| gtid_binlog_state      |               |
| gtid_current_pos       | 1-1-21,2-2-22 |
| gtid_domain_id         | 3             |
| gtid_ignore_duplicates | OFF           |
| gtid_slave_pos         | 1-1-21,2-2-22 |
| gtid_strict_mode       | OFF           |
| wsrep_gtid_domain_id   | 0             |
| wsrep_gtid_mode        | OFF           |
+------------------------+---------------+
9 rows in set (0.00 sec)

+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         1 |     16 |         1 |     16 |
|         1 |     17 |         1 |     17 |
|         1 |     18 |         1 |     18 |
|         1 |     19 |         1 |     19 |
|         1 |     20 |         1 |     20 |
|         1 |     21 |         1 |     21 |
|         2 |     41 |         2 |     20 |
|         2 |     42 |         2 |     21 |
|         2 |     43 |         2 |     22 |
+-----------+--------+-----------+--------+
9 rows in set (0.00 sec)

*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 4568
               Relay_Log_File: mysqld-relay-bin-1.000002
                Relay_Log_Pos: 4858
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4568
              Relay_Log_Space: 5159
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-21
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 91
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 1-1-21,2-2-22
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000003
          Read_Master_Log_Pos: 4816
               Relay_Log_File: mysqld-relay-bin-2.000002
                Relay_Log_Pos: 5106
        Relay_Master_Log_File: mariadb-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4816
              Relay_Log_Space: 5407
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 2-2-22
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 95
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 1-1-21,2-2-22

Stopping and starting the slave seems to work fine and the slave processes all the stuff coming from the masters. Rotating logs on the master, flushing them and all that produces no issue on the master.
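
Reading those long status dumps by eye is error prone, so a small parser can help. A sketch, assuming the output of show all slaves status\G has been saved as text; the sample below is trimmed to the fields that are checked, and both function names are hypothetical:

```python
def parse_vertical(output):
    """Split vertical (backslash-G style) output into one dict per row."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("***") and "row" in line:
            rows.append({})
        elif ":" in line and rows:
            # Split on the first colon only, error texts may contain more.
            key, _, value = line.partition(":")
            rows[-1][key.strip()] = value.strip()
    return rows


def unhealthy_connections(rows):
    """Return names of connections not fully running with Using_Gtid=Slave_Pos."""
    bad = []
    for row in rows:
        ok = (
            row.get("Slave_IO_Running") == "Yes"
            and row.get("Slave_SQL_Running") == "Yes"
            and row.get("Using_Gtid") == "Slave_Pos"
        )
        if not ok:
            bad.append(row.get("Connection_name", "?"))
    return bad


sample = """
*************************** 1. row ***************************
              Connection_name: 1
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Using_Gtid: Slave_Pos
*************************** 2. row ***************************
              Connection_name: 2
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Using_Gtid: Slave_Pos
"""
print(unhealthy_connections(parse_vertical(sample)))
```

An empty list means every channel has both threads running and is using GTID.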

So, to enable multisource+GTID, it looks like you have to completely reset the master AND the slave while there are no writes happening. That is not what the documentation says (at all): according to it, just setting gtid_domain_id would do the trick.

I am going to test if this is the same if we have intermediate masters (which is the case we have for labs and sanitariums)

So for the last test I have used the same topology we have in production+sanitarium+labs

That is:
two masters -> sanitarium (with two replication threads) -> one slave (one replication thread)

So, in order to enable GTID on the sanitarium host I have done the above trick, and then enabled GTID on the final slave (the labsdb host).
The order in which I enabled gtid_domain_id was:
Sanitarium (master3)
labs (slave1)
master1
master2

Now, let's reset masters and sanitarium (and let's see what happens with labs).

  1. Reset slaves in sanitarium:
root@localhost[(none)]> stop all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

root@localhost[(none)]> reset slave '1' all;
Query OK, 0 rows affected (0.00 sec)

root@localhost[(none)]> reset slave '2' all;
Query OK, 0 rows affected (0.00 sec)

root@localhost[(none)]> delete from mysql.gtid_slave_pos;
Query OK, 4 rows affected (0.00 sec)
root@master3:/var/log/mysql# rm -fr *
root@master3:/var/log/mysql# cd /var/lib/mysql/
root@master3:/var/lib/mysql# rm multi-master.info

Now, let's reset the masters.

And finally, let's reset labs (the slave replicating from sanitarium):

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> reset slave all;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> change master to master_host='192.168.56.10', master_user='replication', master_password='password', master_log_pos=315, master_log_file='mariadb-bin.000001';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> change master to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.10
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000001
          Read_Master_Log_Pos: 315
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 605
        Relay_Master_Log_File: mariadb-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 315
              Relay_Log_Space: 904
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-10-201,2-2-88
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative

And let's write to the master!
Sanitarium works fine, but labsdb breaks:

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 2-2-88, which is not in the master's binlog'

It broke because we didn't reset it the "hard" way, deleting all the relays, and master info.
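
A rough way to predict this kind of failure is to compare the position the slave still remembers with what the master currently has in its binlog state. The sketch below is a simplification and an assumption on my side: the real check in MariaDB looks at the binlog contents, not only at the highest sequence number per domain. not_servable is a made-up name.

```python
def parse_gtid_list(value):
    """Parse 'domain-server-seq,...' into a list of (domain, server, seq) ints."""
    gtids = []
    for item in value.split(","):
        item = item.strip()
        if item:
            domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
            gtids.append((domain_id, server_id, seq_no))
    return gtids


def not_servable(slave_pos, master_binlog_state):
    """Return slave GTIDs whose domain is unknown to the master or ahead of it."""
    highest = {}
    for domain_id, _server_id, seq_no in parse_gtid_list(master_binlog_state):
        highest[domain_id] = max(seq_no, highest.get(domain_id, 0))
    problems = []
    for domain_id, server_id, seq_no in parse_gtid_list(slave_pos):
        if domain_id not in highest or seq_no > highest[domain_id]:
            problems.append("{}-{}-{}".format(domain_id, server_id, seq_no))
    return problems


# 2-2-88 is the GTID from the error above; the master state is a made-up
# example of a freshly reset binlog, not a value taken from the test hosts.
print(not_servable("2-2-88", "1-1-5,2-2-3"))
```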
Let's try again the whole process resetting labsdb the hard way:

root@localhost[(none)]> stop slave;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

root@localhost[(none)]> reset slave all;
Query OK, 0 rows affected (0.00 sec)

root@localhost[(none)]> delete from mysql.gtid_slave_pos;
Query OK, 4 rows affected (0.00 sec)
root@slave1:/var/log/mysql# rm -fr *
root@slave1:/var/log/mysql# cd /var/lib/mysql/
root@slave1:/var/lib/mysql# rm multi-master.info

And after reconfiguring all the replication threads in sanitarium and labsdb, it works.

Sanitarium with GTID enabled:

*************************** 1. row ***************************
              Connection_name: 1
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.21
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000002
          Read_Master_Log_Pos: 9040
               Relay_Log_File: mysqld-relay-bin-1.000004
                Relay_Log_Pos: 9330
        Relay_Master_Log_File: mariadb-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 9040
              Relay_Log_Space: 9676
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-85,2-2-22
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 272
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 1-1-85,2-2-111
*************************** 2. row ***************************
              Connection_name: 2
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.22
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000003
          Read_Master_Log_Pos: 13930
               Relay_Log_File: mysqld-relay-bin-2.000006
                Relay_Log_Pos: 14220
        Relay_Master_Log_File: mariadb-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 13930
              Relay_Log_Space: 14566
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-21,2-2-111
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 382
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 1-1-85,2-2-111

And labs with GTID enabled:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.10
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000002
          Read_Master_Log_Pos: 40087
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 9500
        Relay_Master_Log_File: mariadb-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 40087
              Relay_Log_Space: 9799
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 1-1-106,2-2-133
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative

So, if we want to enable gtid+multisource, at this point (let's see what MariaDB says) we need to stop slaves, unconfigure replication the hard way, reset the masters, and then configure them again from that first position.

Change 340130 merged by Marostegui:
[operations/puppet] production.my.cnf: Enable gtid_domaid_id

https://gerrit.wikimedia.org/r/340130

So, I have deployed the patch to enable gtid_domain_id on all the core hosts. It will be picked up upon restart, but I am also enabling it manually on all the shards, to do it in a controlled way.

So, misc, eventlogging, s6 and s2 already had it.
I just added it to s4.

Change 341500 had a related patch set uploaded (by marostegui):
[operations/puppet] dbstore.my.cnf: Enable gtid_domain_id

https://gerrit.wikimedia.org/r/341500

Change 341500 merged by Marostegui:
[operations/puppet] dbstore.my.cnf: Enable gtid_domain_id

https://gerrit.wikimedia.org/r/341500

s5 done
dbstore done (1001 and 1002; 2001 and 2002 had it enabled for a long time already)

Change 341505 had a related patch set uploaded (by marostegui):
[operations/puppet] tendri.my.cnf: Enable gtid_domain_id

https://gerrit.wikimedia.org/r/341505

Change 341505 merged by Marostegui:
[operations/puppet] tendril.my.cnf: Enable gtid_domain_id

https://gerrit.wikimedia.org/r/341505

tendril host is done (not that it really needs it, but for consistency)
s7 is done

Change 341511 had a related patch set uploaded (by marostegui):
[operations/puppet] labs.my.cnf: Add gtid_domain_id

https://gerrit.wikimedia.org/r/341511

Change 341511 merged by Marostegui:
[operations/puppet] labs.my.cnf: Add gtid_domain_id

https://gerrit.wikimedia.org/r/341511

x1 is now done.
old labsdb1001 and 1003 also done (for consistency).

Deployed on s1. All the hosts in the .hosts files have this variable now deployed and enabled.

Note: Having this deployed and enabled doesn't mean we can enable GTID yet on multisource slaves (at least not in a non-disruptive way; see: T149418#3070309 and T149418#3070692)

Mentioned in SAL (#wikimedia-operations) [2017-03-07T12:53:38Z] <marostegui> Just for the sake of having it logged: gtid_domain_id has been deployed in all the database servers - T149418

Change 342226 had a related patch set uploaded (by Marostegui):
[operations/puppet] sanitarium2.my.cnf: Using standard gtid_domain_id

https://gerrit.wikimedia.org/r/342226

Change 342226 merged by Marostegui:
[operations/puppet] sanitarium2.my.cnf: Using standard gtid_domain_id

https://gerrit.wikimedia.org/r/342226

I was doing some tests to refresh my memory and play with the new version (10.1.21), as this is a critical change, and I have filed this: https://jira.mariadb.org/browse/MDEV-12012
It doesn't really impact our production core environment, but it can break our multisource hosts once we turn on GTID (which is the whole point of deploying gtid_domain_id).

I thought I would update this, just for the record.
During the back and forth of the conversation about the bug we filed, Kristian replied a few days ago with a workaround that might work: https://jira.mariadb.org/browse/MDEV-12012?focusedCommentId=100529&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-100529
I haven't tried it, and it doesn't change our plans anyway.
Time allowing, it'd be nice to try it in a test environment just to see if it actually works.

New update from the bug:

Andrei Elkin closed MDEV-12012.
-------------------------------
    Fix Version/s: 10.1.30
                   10.2.11
                       (was: 10.1)
       Resolution: Fixed

Pushed aae4932775d.

Might be fixed in 10.1.30 then. I will try to test it during the code freeze in December in my local instance to see how that goes.
Maybe we would be able to enable GTID on the labs servers after all? :-)

I have been testing the workaround suggested on https://jira.mariadb.org/browse/MDEV-12012?focusedCommentId=100529&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-100529 in order to be able to enable GTID on multi-source slaves without needing to reset master.

Overall, it has worked fine on my lab (I have done it a few times to make sure it worked out every single time and that transactions were not lost or skipped).

This is what I have done.

  • Set up two masters with 10.1.33
  • Set up one slave with 10.1.33 replicating multi-source from those two masters.
  • Start replication and writes. All the hosts with gtid_domain_id=0 and GTID disabled.
  • Start changing gtid_domain_id on all three hosts (slave first, masters last) while writes keep coming.
  • Once a few writes have gone through with the new gtid_domain_id set, enable GTID.
  • Breakage happens on one thread as expected (writes keep happening on the master):
MariaDB [(none)]>  change master '33' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]>  change master '34' to master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> start all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

MariaDB [(none)]> select count(*) from test_34.t34;select count(*) from test_33.t33;
+----------+
| count(*) |
+----------+
|       20 |
+----------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|       20 |
+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 33
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.33
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000002
          Read_Master_Log_Pos: 3111
               Relay_Log_File: mysqld-relay-bin-33.000002
                Relay_Log_Pos: 2515
        Relay_Master_Log_File: mariadb-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3111
              Relay_Log_Space: 2817
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 33
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 33-33-15,0-33-6,34-34-10
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 81
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-32-20,32-32-3,33-33-15,34-34-10
*************************** 2. row ***************************
              Connection_name: 34
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State:
                  Master_Host: 192.168.56.34
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000002
          Read_Master_Log_Pos: 2179
               Relay_Log_File: mysqld-relay-bin-34.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mariadb-bin.000002
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2179
              Relay_Log_Space: 249
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-33-6, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 34
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 33-33-5,0-33-6,34-34-10
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 54
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-32-20,32-32-3,33-33-15,34-34-10
2 rows in set (0.00 sec)

MariaDB [(none)]> select count(*) from test_34.t34;select count(*) from test_33.t33;
+----------+
| count(*) |
+----------+
|       20 |
+----------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|       30 |
+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> stop all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

We apply the workaround suggested by MariaDB on the master whose replication broke, that is "34"; the GTID position that broke is 0-33-6.

MariaDB [test_34]> SET SESSION gtid_domain_id=0;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test_34]> SET SESSION server_id=33;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test_34]> SET SESSION gtid_seq_no=6;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test_34]> DROP TABLE IF EXISTS non_existing_table;
Query OK, 0 rows affected, 1 warning (0.00 sec)
  • Now apply the workaround on the slave: basically we have to get rid of 0-32-20 and replace it with 0-33-6
MariaDB [(none)]> SET GLOBAL gtid_slave_pos='0-33-6,32-32-3,33-33-25,34-34-10';
Query OK, 0 rows affected, 1 warning (0.02 sec)

MariaDB [(none)]> start all slaves;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

MariaDB [(none)]> show all slaves status\G
*************************** 1. row ***************************
              Connection_name: 33
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.33
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000002
          Read_Master_Log_Pos: 6951
               Relay_Log_File: mysqld-relay-bin-33.000002
                Relay_Log_Pos: 2578
        Relay_Master_Log_File: mariadb-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 6951
              Relay_Log_Space: 2880
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 33
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 33-33-35,0-33-6,32-32-3,34-34-10
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 169
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-32-20,32-32-3,33-33-35,34-34-40
*************************** 2. row ***************************
              Connection_name: 34
              Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.56.34
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mariadb-bin.000002
          Read_Master_Log_Pos: 8047
               Relay_Log_File: mysqld-relay-bin-34.000002
                Relay_Log_Pos: 6410
        Relay_Master_Log_File: mariadb-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 8047
              Relay_Log_Space: 6712
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 34
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 33-33-25,0-33-6,32-32-3,34-34-40
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
         Retried_transactions: 0
           Max_relay_log_size: 104857600
         Executed_log_entries: 183
    Slave_received_heartbeats: 0
       Slave_heartbeat_period: 1800.000
               Gtid_Slave_Pos: 0-32-20,32-32-3,33-33-35,34-34-40
2 rows in set (0.00 sec)
  • And we can see that no transactions are lost
MariaDB [(none)]> select count(*) from test_34.t34;select count(*) from test_33.t33;
+----------+
| count(*) |
+----------+
|       50 |
+----------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|       40 |
+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> select count(*) from test_34.t34;select count(*) from test_33.t33;
+----------+
| count(*) |
+----------+
|       53 |
+----------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|       40 |
+----------+
1 row in set (0.00 sec)

MariaDB [(none)]> select count(*) from test_34.t34;select count(*) from test_33.t33;
+----------+
| count(*) |
+----------+
|       53 |
+----------+
1 row in set (0.00 sec)

+----------+
| count(*) |
+----------+
|       41 |
+----------+
1 row in set (0.00 sec)
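
The two workaround steps above (master side, then slave side) can be sketched as a small helper. This is only a sketch and was not part of the tests described here: the function names `parse_gtid`, `master_statements` and `slave_position` are hypothetical, the position string passed in is illustrative, and the code only builds SQL text from the GTID named in the replication error; it does not connect to any server.

```python
def parse_gtid(gtid):
    """Split a 'domain-server-seq' GTID into three integers."""
    domain, server, seq = (int(part) for part in gtid.split("-"))
    return domain, server, seq


def master_statements(error_gtid):
    """SQL to run on the master whose connection broke (hypothetical helper).

    Mirrors the session shown above: pin domain, server_id and seq_no for
    the session, then run a no-op DDL, presumably so that a transaction
    with that exact GTID ends up in the master's binlog.
    """
    domain, server, seq = parse_gtid(error_gtid)
    return [
        f"SET SESSION gtid_domain_id={domain};",
        f"SET SESSION server_id={server};",
        f"SET SESSION gtid_seq_no={seq};",
        "DROP TABLE IF EXISTS non_existing_table;",
    ]


def slave_position(current_pos, error_gtid):
    """Return current_pos with the entry for the error's domain swapped out."""
    target_domain = parse_gtid(error_gtid)[0]
    kept = [g for g in current_pos.split(",") if parse_gtid(g)[0] != target_domain]
    return ",".join([error_gtid] + kept)


if __name__ == "__main__":
    for stmt in master_statements("0-33-6"):
        print(stmt)
    new_pos = slave_position("0-32-20,32-32-3,33-33-25,34-34-10", "0-33-6")
    print(f"SET GLOBAL gtid_slave_pos='{new_pos}';")
```

Only the domain 0 entry is swapped; every other domain keeps whatever position the slave already had, which is why the slave has to be stopped while the string is being built.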

This looks promising for our labsdb servers. However, there is an important note on the workaround:

Assuming no new transactions will be made in domain 0 anywhere, this should
make the servers all think that they are up-to-date in domain 0. So the
slave should be able to connect to either master without complaints about
domain 0.

A quick look at labsdb1010 reveals that somehow (I still don't get how), domain 0 has some position changing:

root@labsdb1010.eqiad.wmnet[mysql]> select * from mysql.gtid_slave_pos where domain_id=0;
+-----------+-------------+-----------+------------+
| domain_id | sub_id      | server_id | seq_no     |
+-----------+-------------+-----------+------------+
|         0 | 15718468932 | 171966669 | 4067321066 |
|         0 | 15718468938 | 171970637 | 5481803966 |
+-----------+-------------+-----------+------------+
2 rows in set (0.00 sec)

root@labsdb1010.eqiad.wmnet[mysql]> select * from mysql.gtid_slave_pos where domain_id=0;
+-----------+-------------+-----------+------------+
| domain_id | sub_id      | server_id | seq_no     |
+-----------+-------------+-----------+------------+
|         0 | 15718471510 | 171966669 | 4067321072 |
|         0 | 15718471515 | 171970637 | 5481803972 |
+-----------+-------------+-----------+------------+
2 rows in set (0.01 sec)

171966669 = db1075 (s3 master) however:

root@db1075[(none)]> select @@gtid_domain_id;
+------------------+
| @@gtid_domain_id |
+------------------+
|        171966669 |
+------------------+
1 row in set (0.00 sec)

171970637 = db1052 (s1 master) however:

root@db1052[(none)]> select @@gtid_domain_id;
+------------------+
| @@gtid_domain_id |
+------------------+
|        171970637 |
+------------------+
1 row in set (0.00 sec)

This last bit needs further research to see why gtid_domain_id is getting updates for domain 0 from those two hosts.
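
To keep an eye on this, the two samples above can be compared mechanically. A minimal sketch, assuming the domain 0 rows of mysql.gtid_slave_pos have already been fetched into dictionaries keyed by server_id; the function name `advancing_rows` is hypothetical, and the values are copied from the two labsdb1010 queries above.

```python
def advancing_rows(before, after):
    """Return {server_id: seq_no delta} for rows whose seq_no grew."""
    deltas = {}
    for server_id, seq_no in after.items():
        previous = before.get(server_id)
        if previous is not None and seq_no > previous:
            deltas[server_id] = seq_no - previous
    return deltas


# Domain 0 rows from the two queries above, as server_id -> seq_no.
first = {171966669: 4067321066, 171970637: 5481803966}
second = {171966669: 4067321072, 171970637: 5481803972}

if __name__ == "__main__":
    for server_id, delta in advancing_rows(first, second).items():
        print(f"domain 0 advanced by {delta} for server_id {server_id}")
```

An empty result over a long enough window would be the signal that nothing writes to domain 0 any more, which is the precondition quoted from the workaround.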

why gtid_domain_id is getting updates for domain 0 from those two hosts

My guess: I am not 100% sure all masters were restarted after their domain_id was changed.

That could explain it. We will see if it stops once we failover s1 master.

We could also think of testing strict mode to prevent out-of-band writes on the slaves (being very careful).

For what it's worth, I have replicated exactly the same setup we have in production to make sure nothing would break along the way with the suggested workaround. That is:

master (statement) -> slave (ROW) -> slave (ROW) -> slave with multi-source without GTID

Everything seems to be fine when applying the workaround.
However, we are not close to enabling this in production, and we have to be very careful with two things:

  1. T149418#4290941 (we have to be completely sure that there are no writes coming from domain=0)
  2. The fact that we have this on labsdb hosts:
Gtid_Slave_Pos: 0-171970637-5481822048,171966555-171966555-590,171966558-171966558-63,171966574-171966574-47630128,171966669-171966669-2609597550,171966670-171966670-2410812544,171970577-171970577-597926,171970589-171970589-201132050,171970593-171970593-3479,171970599-171970599-24483,171970637-171970637-1930265453,171970663-171970663-133,171970751-171970751-21842,171974668-171974668-34,171974686-171974686-867,171974769-171974769-52,171974784-171974784-44193860,171974883-171974883-105456337,171974884-171974884-1473084269,171978764-171978764-57,171978765-171978765-33,171978766-171978766-10301,171978767-171978767-2269180357,171978771-171978771-30,171978772-171978772-1562,171978775-171978775-2709368791,171978777-171978777-851322886,171978778-171978778-628571707,180359172-180359172-49702202,180359174-180359174-94123432,180359175-180359175-43143522

In theory it is just replacing: 0-171970637-5481822048 with whatever transaction we generate as the rest of positions do not contain domain 0, but we gotta be careful as there is looots of stuff in there.
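
One way to double-check that claim is to list every entry whose domain differs from its server_id. This is a rough sketch that assumes each host's gtid_domain_id equals its server_id, as the position string above suggests; `mismatched_domains` is a hypothetical name and `SAMPLE` is a shortened copy of that string, not the full value.

```python
def mismatched_domains(gtid_pos):
    """Return the GTIDs whose domain_id differs from their server_id."""
    odd = []
    for gtid in gtid_pos.split(","):
        domain, server, _seq = gtid.split("-")
        if domain != server:
            odd.append(gtid)
    return odd


# Shortened copy of the labsdb Gtid_Slave_Pos shown above.
SAMPLE = (
    "0-171970637-5481822048,"
    "171966555-171966555-590,"
    "171966669-171966669-2609597550,"
    "171970637-171970637-1930265453,"
    "180359175-180359175-43143522"
)

if __name__ == "__main__":
    for gtid in mismatched_domains(SAMPLE):
        print(f"needs attention: {gtid}")
```

If anything other than the domain 0 entry shows up when this is run against the full string, the replacement is not as simple as swapping a single GTID.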

Mentioned in SAL (#wikimedia-operations) [2020-03-23T07:46:30Z] <marostegui> Stop MySQL on db1077 (non used) for 10.4 upgrade and gtid_domain_id on multisource T149418

Given that the original bug report we submitted for this issue (https://jira.mariadb.org/browse/MDEV-12012) was closed with a fix for 10.1.30 and 10.3 (a fix we never tested), I am going to test this on 10.4, because if it is really fixed, that would be great news for the labs hosts.

This is what I have done:

db1077 running stretch+10.1 with gtid_domain_id set to:

root@db1077.eqiad.wmnet[labswiki]> select @@gtid_domain_id;
+------------------+
| @@gtid_domain_id |
+------------------+
|        171970751 |
+------------------+
1 row in set (0.01 sec)
  • restored m1 and m5 backups there and started multi-source replication (obviously without GTID).
  • Enabling GTID (slave_pos) doesn't break (10.1.43). This scenario is a bit different from the original one we saw on labs, as replication there started without gtid_domain_id set and we enabled it _after_ replication was set up.
  • I have tried to break it by doing set gtid_domain_id=0 on db1077, but so far it keeps running fine.
  • Enabled GTID without issues.

I am going to upgrade to buster and 10.4 and see what happens.

Change 582775 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Reimage db1077 to buster

https://gerrit.wikimedia.org/r/582775

Change 582775 merged by Marostegui:
[operations/puppet@production] install_server: Reimage db1077 to buster

https://gerrit.wikimedia.org/r/582775

Change 582776 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow reimage db1077

https://gerrit.wikimedia.org/r/582776

Change 582776 merged by Marostegui:
[operations/puppet@production] install_server: Allow reimage db1077

https://gerrit.wikimedia.org/r/582776

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1077.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202003231208_marostegui_34283.log.

Completed auto-reimage of hosts:

['db1077.eqiad.wmnet']

and were ALL successful.

I have been able to get GTID also running on buster and 10.4 and multi-source.
The only piece missing to fully replicate the original issue is having masters with some data inserted while gtid_domain_id was 0, which is not the case here, as replication started when the masters already had gtid_domain_id set.

I am going to try to fully replicate that in my lab, to see if the bug is fully fixed before considering upgrading labs and enabling GTID there.

So the idea is to fully replicate the labsdb hosts roadmap:

  • Set 2 masters with a version lower than 10.1.30 and gtid_domain_id set to 0
  • Set one slave with gtid_domain_id set to 0 and using something lower than 10.1.30 and replicating from both masters
  • Insert data, rotate binlogs etc.
  • Change gtid_domain_id on all the hosts.
  • Enable GTID on the slave (I would expect a breakage here)
  • Upgrade all the hosts to 10.1.43, try to enable GTID again
  • Upgrade all the hosts to buster and 10.4 and try to enable GTID again.

I have been able to enable GTID on 10.1.44 on my lab.

Looks like the key here is to run the following command on the masters:

FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0);

That command won't let you delete the domain if there is a binlog which still contains transactions from it, which shouldn't be our case, as those rotated years ago:

MariaDB [test_m1]> FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0);
ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('0') being deleted. Make sure to first purge those files.
MariaDB [test_m1]> purge binary logs to 'mariadb-bin.000011';
Query OK, 0 rows affected (0.01 sec)

MariaDB [test_m1]> FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0);
Query OK, 0 rows affected (0.02 sec)

So what I did was:

  • Stop all slaves and capture their positions
  • Reset all slaves
  • Run FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0);
  • Reenable replication and let it catch up
  • Once it was in sync, stopped the slave and enabled Slave_pos GTID
  • Keep inserting and replication works

The key part here is FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0);. The good part is that we enabled gtid_domain_id years ago, so there should be no binlogs with any of that anyways, as they've rotated.
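
Before running that command on a master, it may help to check whether domain 0 is still part of its binlog state at all. A minimal sketch, assuming the value of @@gtid_binlog_state has been read into a string and that DELETE_DOMAIN_ID takes a comma-separated list of domains; the helper names are hypothetical and the example state is made up for illustration.

```python
def domains_in_state(binlog_state):
    """Return the set of domain ids present in a GTID state string."""
    return {int(g.split("-")[0]) for g in binlog_state.split(",") if g.strip()}


def delete_domain_statement(domains):
    """Build the FLUSH statement for the given domain ids."""
    ids = ",".join(str(d) for d in sorted(domains))
    return f"FLUSH BINARY LOGS DELETE_DOMAIN_ID=({ids});"


if __name__ == "__main__":
    # Made-up example value, not taken from any production host.
    state = "0-33-6,33-33-35"
    if 0 in domains_in_state(state):
        print(delete_domain_statement([0]))
    else:
        print("domain 0 not present in binlog state, nothing to delete")
```

The statement can still fail with error 1076 as shown above if a binlog file containing domain 0 events has not been purged yet, so this check only says whether the command is needed, not whether it will succeed.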