
Clean up old gtid_domain_id
Open, Medium, Public

Description

Some of our hosts have very old GTID domains in their binary logs.
This can be a problem, and not only because they are very messy to read:

@@gtid_slave_pos: 0-171970637-5484646134,171966471-171966471-62,171970572-171970572-2942236266,171970577-171970577-101890,171970637-171970637-2116621969,171970661-171970661-3655324752,171970704-171970704-351094624,171970745-171970745-2419896119,171974720-171974720-2572451842,171974884-171974884-1473084269,171978765-171978765-199,171978768-171978768-202416,171978774-171978774-5,171978777-171978777-514400352,180355171-180355171-148310907,180359172-180359172-49702203,180359179-180359179-96523837,180363268-180363268-1082287825
1 row in set (0.001 sec)

This can lead to errors when trying to use GTID to switch masters, which is something we eventually want to do with Orchestrator (T322993).

In order to clean these up, we first need to identify the gtid_domain_id values that do not belong to any of the hosts in the section (mostly master/candidate) and then execute:

FLUSH BINARY LOGS DELETE_DOMAIN_ID=(XXX);
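
To make the identification step concrete, here is a minimal sketch that parses a gtid_slave_pos string, lists the domains that are not in a given "keep" set, and prints one cleanup statement per domain. The function names and the keep set are hypothetical; which domains really belong to a section has to be checked by hand against its current hosts.

```python
def parse_gtid_pos(gtid_pos):
    """Split a MariaDB GTID position into (domain_id, server_id, seq_no) tuples."""
    triples = []
    for item in gtid_pos.split(","):
        item = item.strip()
        if not item:
            continue
        domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
        triples.append((domain_id, server_id, seq_no))
    return triples


def stale_domains(gtid_pos, keep):
    """Return the sorted domain ids present in gtid_pos but absent from keep."""
    return sorted(d for d, _, _ in parse_gtid_pos(gtid_pos) if d not in keep)


def delete_statements(domains):
    """Build one FLUSH BINARY LOGS statement per domain id (nothing is executed)."""
    return [f"FLUSH BINARY LOGS DELETE_DOMAIN_ID=({d});" for d in domains]


if __name__ == "__main__":
    # Shortened sample taken from the gtid_slave_pos output above.
    sample = "0-171970637-5484646134,171966471-171966471-62,171978765-171978765-199"
    # Hypothetical keep set: the domains assumed to belong to current hosts.
    keep = {0, 171978765}
    for statement in delete_statements(stale_domains(sample, keep)):
        print(statement)
```

The sketch only generates statements; reviewing them before running anything fits the "very carefully" caveat that follows.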

This needs to be done very carefully, as we can run into replication issues. We should probably start with sections where replication isn't an issue (mX, x2) until we've built trust that we are doing the right thing.

Progress

  • s1
  • s2
  • s3
  • s4
  • s5
  • s6
  • s7
  • s8
  • m1
  • m2
  • m3
  • m5
  • es4
  • es5
  • pc1
  • pc2
  • pc3
  • x1
  • x2
  • db_inventory

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Going to mark x2 as done as we don't use GTID there (we don't care about data consistency on that section)

I am running into some issues while cleaning this up on m1. Some of the domain_ids cannot be cleaned up, and they error out with:

root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484);
ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files.

That domain_id belongs to es1027, which has never been in m1, so I guess it is just an IP that was re-used. Anyhow, that gtid_domain_id isn't in any of the binlogs. I am investigating, as we might find this issue on some other sections.
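
For tracing a domain back to a host, a small sketch follows. It assumes that gtid_domain_id values are the host's IPv4 address encoded as an integer, which the IP re-use remark suggests but this task does not state outright. The helper name is hypothetical.

```python
import ipaddress


def domain_id_to_ip(domain_id):
    """Interpret a gtid_domain_id as a 32-bit IPv4 address (assumption, see above)."""
    return str(ipaddress.IPv4Address(domain_id))


if __name__ == "__main__":
    # The domain id from the error message above.
    print(domain_id_to_ip(171966484))
```

Under that assumption, the resulting address can be looked up in DNS or in the host inventory to see which machine currently holds it.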

This is interesting... it looks like flush binary logs does check local files, and that domain_id is part of these files:

-rw-rw---- 1 mysql mysql 1004M Mar 14  2018 db1016-bin.013077
-rw-rw---- 1 mysql mysql 1001M Mar 14  2018 db1016-bin.013078
-rw-rw---- 1 mysql mysql 1001M Mar 14  2018 db1016-bin.013079
-rw-rw---- 1 mysql mysql 1001M Mar 14  2018 db1016-bin.013080

They are old leftover files of course.

root@db1117:/srv/sqldata.m1# ls | grep "db1016-bin" | wc -l
78

Purging those made no difference. There are no more binary logs on db1117 or db1195 (master) which contain that gtid (at least on file). So I am checking where that error is coming from.

My latest tests have focused on using RESET MASTER, which looks like the least error-prone approach, although it is the most invasive one (as it requires reconfiguring all the replicas).
Based on the tests I have done so far, it would require:

  • Read only (MW and mysql)
  • Reset master on the master itself; this unconfigures ALL the direct replicas.
    • We'd need to check what the status would be with the intermediate master. Probably the safest thing would be to do it first on codfw and then keep going up (wiki replicas).
      • We also need to check what happens with intermediate masters replicating gtid_domain_id downstream.
  • Issue a truncate (truncate gtid_slave_pos;) on all the involved hosts.
  • Issue a stop slave; SET GLOBAL gtid_slave_pos = ''; on the slaves
  • Reconfigure all the slaves
    • Start replication and then issue STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;
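
The draft above can be written down as an ordered plan of (host, statement) pairs, which makes it easier to review before touching any host. This is only a sketch that mirrors the draft order as written: the function name and host names are hypothetical, the table is assumed to be mysql.gtid_slave_pos, the MW read-only step is not covered, and the replica reconfiguration is left as a placeholder because the draft does not spell it out.

```python
def reset_master_plan(master, replicas):
    """Return (host, statement) pairs following the draft order; nothing is executed."""
    plan = [
        (master, "SET GLOBAL read_only = 1;"),  # mysql side of the read-only step
        (master, "RESET MASTER;"),              # unconfigures all direct replicas
    ]
    # Truncate the GTID position table on every involved host.
    for host in [master, *replicas]:
        plan.append((host, "TRUNCATE TABLE mysql.gtid_slave_pos;"))
    for host in replicas:
        plan.append((host, "STOP SLAVE;"))
        plan.append((host, "SET GLOBAL gtid_slave_pos = '';"))
        # Placeholder: the draft does not specify how replicas are reconfigured.
        plan.append((host, "-- reconfigure replication here (not specified in the draft)"))
        plan.append((host, "START SLAVE;"))
        plan.append((host, "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;"))
    return plan


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for host, statement in reset_master_plan("master-host", ["replica-host-1"]):
        print(f"{host}: {statement}")
```

Intermediate masters are deliberately left out, since their behaviour is still the open question in this draft.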

This is the initial draft, but I really need to set up a testing environment where I have an intermediate master so I can test the intermediate master gtid_domain_id replication when the first reset master is done on the intermediate master.

@Marostegui: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!