
Evaluate removing gtid_domain_id from the infra
Closed, Declined (Public)

Description

After conversations with MariaDB about the gtid_domain_id deletion bug, they advised us not to set gtid_domain_id at all and to use the default domain_id (0) everywhere if we're never going to use multi-source or multi-master replication (which we only use for x2 and possibly for pcX, where we don't use GTID because we consider the data volatile).

Removing it could considerably simplify our issues with GTID for failovers and orchestrator.
To be able to do so, we'd still need to unset it and clean up all the existing domain_ids.

We'd need to explore FLUSH BINARY LOGS DELETE_DOMAIN_ID and how to operate with it safely across all topologies.

Stage 1 - simplify gtid_slave_pos on the replicas:

  • m3
  • db_inventory
  • x1
    • codfw
    • eqiad (done: dbstore1005, db1225, db1220)

Stage 2 - simplify gtid_slave_pos on intermediate masters

  • m3
  • db_inventory
  • x1
    • codfw

Stage 3 - Change gtid_domain_id on the intermediate masters and FLUSH unused DOMAIN_ID

  • m3
  • db_inventory
  • x1
    • codfw

Stage 4 - Change gtid_domain_id on active masters and FLUSH unused domain_id

  • m3
  • db_inventory
  • x1
    • eqiad

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

A quick check confirms that we do have to delete everything first, as simply changing the master to become gtid_domain_id=0 would break replication with:

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-171970580-683331037, which is not in the master's binlog'

The safest option is probably:

  • On the replicas, run STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE;
  • Do the gtid_domain_id change and cleanup
  • Then switch back to MASTER_USE_GTID=Slave_pos
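
The sequence above can be sketched as a small Python helper that only builds the list of statements for one replica and executes nothing. The function name `gtid_toggle_statements` and its `cleanup` parameter are hypothetical, and the second STOP SLAVE / START SLAVE pair is an assumption (the thread only says to go back to Slave_pos; stopping the replica before CHANGE MASTER is assumed here).

```python
def gtid_toggle_statements(cleanup):
    """Return the SQL statements to run on a replica, in order.

    `cleanup` is a list of caller-supplied statements for the middle step;
    this sketch deliberately does not guess what they are.
    """
    disable_gtid = [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_USE_GTID=no",
        "START SLAVE",
    ]
    # Assumption: the replica is stopped again before switching back to GTID.
    enable_gtid = [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_USE_GTID=slave_pos",
        "START SLAVE",
    ]
    return disable_gtid + list(cleanup) + enable_gtid


if __name__ == "__main__":
    # Print the plan for review instead of running it against a server.
    for statement in gtid_toggle_statements(cleanup=[]):
        print(statement + ";")
```

Keeping this as a plain list of strings makes it easy to review the plan per host before anything is run by hand.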

The first step I am going to tackle is cleaning up the slaves, before removing the domain_id on the master. The list of domain_ids we have is long enough that it is very complicated to do everything at once, so I'd rather go slowly but safely.
I used db1137 first to go through all the steps.

db1137 had:

select @@gtid_slave_pos;
0-171970768-717,1-171970580-1,171966572-171966572-896177597,171966628-171966628-1448099028,171970580-171970580-596994206,171970768-171970768-49223008,171974667-171974667-1417249615,171974681-171974681-198565537,180355159-180355159-115369055,180359202-180359202-331241772,180363268-180363268-39992186,180363398-180363398-187984835

While db1179, the master, had:

+-----------------------------------------------------------------------------------------------------------+
| @@gtid_binlog_state                                                                                       |
+-----------------------------------------------------------------------------------------------------------+
| 0-171970768-717,171966628-171966628-1448099028,171970768-171970768-49230848,180363398-180363398-187984835 |
+-----------------------------------------------------------------------------------------------------------+

So I stopped the slave, captured the position for the master's gtid_domain_id (171970768), and set gtid_slave_pos to only the domain_ids that exist on the master:

set global gtid_slave_pos="0-171970768-717,171966628-171966628-1448099028,171970768-171970768-49229648,180363398-180363398-187984835";

It all worked fine and the slave output is a lot more readable now:

Gtid_IO_Pos: 0-171970768-717,171966628-171966628-1448099028,171970768-171970768-49229648,180363398-180363398-187984835
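
The filtering done by hand above (keep only the domains that the master still has in its gtid_binlog_state) can be sketched in Python. This is a minimal illustration, not the procedure that was actually run: the function names are hypothetical, and it only reuses the replica's recorded positions, so the position for the master's own domain still has to be read after STOP SLAVE, as was done in the thread.

```python
def parse_gtid_list(gtid_list):
    """Map domain_id -> (server_id, sequence) for a comma-separated GTID list.

    If a domain appears more than once (possible in gtid_binlog_state),
    the last entry wins; that is fine here because only the set of
    domain ids is taken from the master's state.
    """
    positions = {}
    for gtid in gtid_list.split(","):
        domain_id, server_id, sequence = (int(p) for p in gtid.strip().split("-"))
        positions[domain_id] = (server_id, sequence)
    return positions


def restrict_slave_pos(slave_pos, master_binlog_state):
    """Keep only the replica's positions for domains the master still has."""
    master_domains = set(parse_gtid_list(master_binlog_state))
    kept = {
        domain_id: position
        for domain_id, position in parse_gtid_list(slave_pos).items()
        if domain_id in master_domains
    }
    return ",".join(
        f"{domain_id}-{server_id}-{sequence}"
        for domain_id, (server_id, sequence) in sorted(kept.items())
    )


if __name__ == "__main__":
    # Values copied from the db1137 / db1179 output quoted in this task.
    slave_pos = (
        "0-171970768-717,1-171970580-1,171966572-171966572-896177597,"
        "171966628-171966628-1448099028,171970580-171970580-596994206,"
        "171970768-171970768-49223008,171974667-171974667-1417249615,"
        "171974681-171974681-198565537,180355159-180355159-115369055,"
        "180359202-180359202-331241772,180363268-180363268-39992186,"
        "180363398-180363398-187984835"
    )
    master_state = (
        "0-171970768-717,171966628-171966628-1448099028,"
        "171970768-171970768-49230848,180363398-180363398-187984835"
    )
    print(restrict_slave_pos(slave_pos, master_state))
```

With the values quoted in this task, this keeps the four domains 0, 171966628, 171970768 and 180363398 and drops the other eight.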

My plan is to leave it running for a few hours and at the same time do a data check. x1 is RBR so it should break if I made a mistake with the positions.

Once the slaves are done, I can go ahead and flush the unused domain_ids on the master:

FLUSH BINARY LOGS DELETE_DOMAIN_ID=(0,171966628,180363398)
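
As a rough sketch, the list of domains for that statement can be derived from the master's gtid_binlog_state by dropping the domain that stays in use. The helper name `delete_domain_statement` and its parameters are hypothetical; the code only builds the statement text and does not check any of the preconditions the server enforces before it accepts a DELETE_DOMAIN_ID.

```python
def delete_domain_statement(binlog_state, keep_domains):
    """Build a FLUSH BINARY LOGS DELETE_DOMAIN_ID statement.

    `binlog_state` is the comma-separated value of @@gtid_binlog_state and
    `keep_domains` is a set of domain ids that must not be deleted.
    Returns None when there is nothing to delete.
    """
    to_delete = set()
    for gtid in binlog_state.split(","):
        # A GTID is domain_id-server_id-sequence; only the domain matters here.
        domain_id = int(gtid.strip().split("-")[0])
        if domain_id not in keep_domains:
            to_delete.add(domain_id)
    if not to_delete:
        return None
    ids = ",".join(str(domain_id) for domain_id in sorted(to_delete))
    return f"FLUSH BINARY LOGS DELETE_DOMAIN_ID=({ids})"


if __name__ == "__main__":
    # gtid_binlog_state of db1179 as quoted earlier in this task.
    state = (
        "0-171970768-717,171966628-171966628-1448099028,"
        "171970768-171970768-49230848,180363398-180363398-187984835"
    )
    print(delete_domain_statement(state, keep_domains={171970768}))
```

Generating the statement from the live binlog state rather than typing the ids by hand avoids listing a domain that is not actually present.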

I need to see how to handle 180363398, as that is codfw's domain_id, so I will need to double-check the steps.

I just cleaned up db2131. I am going to leave it running for the weekend; if all goes well, on Monday I will clean up the other replica.
I am going to start with the replicas and later flush the domain_ids on the codfw master.

Cleaned up db2101:3320. I now have to see how to clean up x1 codfw master before being able to replace gtid_domain_id for it.

I have cleaned up x1 codfw as much as possible before changing the domain_id to 0.
This was the procedure:

  • Changed the codfw master's gtid_slave_pos to the eqiad master's domain_id, as it appears in the eqiad master's gtid_binlog_state
  • Flushed all inactive gtid_domain_ids on the codfw master so the binlog state looks clean, with only the two active gtid_domain_ids:
root@db2115.codfw.wmnet[(none)]> select @@gtid_binlog_state;
+------------------------------------------------------------+
| @@gtid_binlog_state                                        |
+------------------------------------------------------------+
| 171970768-171970768-73774356,180363398-180363398-189278224 |
+------------------------------------------------------------+
1 row in set (0.034 sec)
  • Changed the codfw replicas to the new gtid_slave_pos, which only has the two active domains (the eqiad and codfw masters)
  • Changed gtid_domain_id on db2115 (codfw master) to 0, so it now shows up on the slaves:
root@db2096.codfw.wmnet[(none)]> select @@gtid_slave_pos;
+---------------------------------------------------------------------------+
| @@gtid_slave_pos                                                          |
+---------------------------------------------------------------------------+
| 0-180363398-41,171970768-171970768-73801690,180363398-180363398-189278687 |
+---------------------------------------------------------------------------+
1 row in set (0.032 sec)

root@db2096.codfw.wmnet[(none)]>
  • Flushed domain_id 180363398 (the codfw master's old gtid_domain_id) on db1179 (eqiad master)
  • Flushed domain_id 180363398 (the codfw master's old gtid_domain_id) on db2115 (codfw master)

The codfw master now looks like:

root@db2115.codfw.wmnet[(none)]>  select @@gtid_binlog_state;
+----------------------------------------------+
| @@gtid_binlog_state                          |
+----------------------------------------------+
| 0-180363398-402,171970768-171970768-73823942 |
+----------------------------------------------+
1 row in set (0.034 sec)
  • Changed codfw's gtid_slave_pos to the eqiad master's current gtid_binlog_state

Next: Review everything, set gtid_domain_id=0 on eqiad's master, and follow the procedure to flush everything.

Changed the codfw replicas to replicate only from domains 0 (codfw's new gtid_domain_id) and 171970768 (eqiad's).

Change 920986 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] phabricator.my.cnf.erb: Set gtid_domain_id=0

https://gerrit.wikimedia.org/r/920986

Change 920986 merged by Marostegui:

[operations/puppet@production] phabricator.my.cnf.erb: Set gtid_domain_id=0

https://gerrit.wikimedia.org/r/920986

I have successfully cleaned up m3 entirely. It has been a very complex process, during which I also discovered that I need to stop pt-heartbeat on the intermediate master before changing gtid_slave_pos to replicate from domain_id=0, as otherwise there will be collisions. It is now clean and all hosts have gtid_domain_id=0.

Connections on the primary master also needed to be killed, as already-established connections would keep writing with the old cached gtid_domain_id instead of 0.

I am still going to double-check with Kristian to make absolutely sure that gtid_domain_id=0 copes well with a primary master + intermediate master + local writes on that intermediate master.
I am also going to do a data check.

Change 922476 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db_inventory: Change gtid_domain_id to 0

https://gerrit.wikimedia.org/r/922476

Change 922476 merged by Marostegui:

[operations/puppet@production] db_inventory: Change gtid_domain_id to 0

https://gerrit.wikimedia.org/r/922476

Change 922790 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] misc_multiinstance.my.cnf.erb: Set gtid_domain_id=0

https://gerrit.wikimedia.org/r/922790

Change 922790 merged by Marostegui:

[operations/puppet@production] misc_multiinstance.my.cnf.erb: Set gtid_domain_id=0

https://gerrit.wikimedia.org/r/922790

db_inventory was cleaned up and is now replicating with domain 0.

I am going to close this bug because the above method isn't safe and is very error-prone. I will update the parent task with my latest tests.