Start testing how orchestrator would recover an intermediate master (dc-master) replication topology without having any data loss
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T322993 Test orchestrator intermediate master topology auto-recovery |
| Open | | None | T324965 Clean up old gtid_domain_id |
| Resolved | | fnegri | T334947 ToolsDB: discard obsolete GTID domains |
| Declined | | Marostegui | T336228 Evaluate removing gtid_domain_id from the infra |
| Open | | Marostegui | T359163 Reclone m5 hosts from production into old hosts to simulate GTID cleaning up |
Event Timeline
Change 859972 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db1133: Move it to test-s4 section
Change 859972 merged by Marostegui:
[operations/puppet@production] db1133: Move it to test-s4 section
Some updates: I have been doing some of the first tests with the following replication topology:
The first test was an easy one: kill db1125 and let orchestrator relocate db1133 automatically.
It failed, which was more or less expected given the mess we have with gtid_domain_ids.
The error was:
[ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 171970572-171970572-2838977796, which is not in the master's binlog', Internal MariaDB error code: 1236
Which is due to:
root@db1133.eqiad.wmnet[(none)]> select @@gtid_binlog_pos;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_binlog_pos                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 171970572-171970572-2838977796,171970577-171970577-41557066,171970661-171970661-3655324752,171970745-171970745-2362445072,171974728-171974728-145270,180363268-180363268-1082287825 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
171970572-171970572-2838977796 is no longer valid for this host: domain 171970572 belongs to the s1 master, which was db1133's old master. The second value, 171970577, is the valid one, as it corresponds to the current test-s4 master. So the only ways to work around that error are either:
- Making orchestrator fall back to the traditional binlog file/position way of solving the issue
- Cleaning up the unused GTID domain with: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(171970572);
The FLUSH BINARY LOGS command worked, so now the first value in the list is the valid one:
root@db1133.eqiad.wmnet[(none)]> select @@gtid_binlog_pos;
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_binlog_pos                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| 171970577-171970577-41629632,171970661-171970661-3655324752,171970745-171970745-2362445072,171974728-171974728-145270,180363268-180363268-1082287825 |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
Now, killing db1125 results in orchestrator picking up the value that works.
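The cleanup step above can be sketched as a small helper. This is only an illustration, not something that was run as part of this task: the function names `parse_gtid_pos` and `delete_domain_statement` are made up here, and the set of obsolete domains is assumed to be supplied by whoever knows the history of the host. The helper only builds the statement string; running it stays a manual step.

```python
def parse_gtid_pos(pos):
    """Parse a MariaDB GTID position string into {domain_id: (server_id, seq_no)}.

    Each GTID has the form domain_id-server_id-seq_no; entries are comma separated.
    """
    result = {}
    for gtid in pos.split(","):
        gtid = gtid.strip()
        if not gtid:
            continue  # an empty position yields an empty mapping
        domain_id, server_id, seq_no = (int(x) for x in gtid.split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


def delete_domain_statement(binlog_pos, obsolete_domains):
    """Build the FLUSH BINARY LOGS statement for obsolete domains present in binlog_pos.

    obsolete_domains is an assumption of this sketch: a set of domain ids that
    the operator has already decided are no longer needed on this host.
    Returns None when none of them appear in the position.
    """
    present = parse_gtid_pos(binlog_pos)
    to_delete = sorted(d for d in obsolete_domains if d in present)
    if not to_delete:
        return None
    ids = ",".join(str(d) for d in to_delete)
    return "FLUSH BINARY LOGS DELETE_DOMAIN_ID=({});".format(ids)


# Position taken from db1133 before the cleanup described above.
binlog_pos = (
    "171970572-171970572-2838977796,171970577-171970577-41557066,"
    "171970661-171970661-3655324752,171970745-171970745-2362445072,"
    "171974728-171974728-145270,180363268-180363268-1082287825"
)
print(delete_domain_statement(binlog_pos, {171970572}))
```

With the position shown earlier and 171970572 marked as obsolete, this prints the same statement that was used on db1133.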
I still need to investigate what happened at the data layer.
I have set up db1124 (test-s4 master) to replicate enwiki.recentchanges and heartbeat.heartbeat from the s1 master. I want to start moving db1133 around and see what happens with the data arriving in recentchanges.
set global replicate_do_table='enwiki.recentchanges,heartbeat.heartbeat';
I am probably going to create an epic task to see if we can address and clean up the mess we have with gtid_domain_id, as the workaround described at T322993#8419212 seems to work in most cases.
Of course the other approach would be to disable GTID, but that's not something I am too comfortable with at the moment. I am going to do a few more tests trying to mess with db1133's gtid_binlog_pos.
That is because at the moment db1133 is showing this:
root@db1133.eqiad.wmnet[(none)]> select @@gtid_binlog_pos;
+-------------------+
| @@gtid_binlog_pos |
+-------------------+
|                   |
+-------------------+
1 row in set (0.001 sec)
root@db1133.eqiad.wmnet[information_schema]> select @@gtid_slave_pos;
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_slave_pos                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| 171966678-171966678-229039886,171970572-171970572-187842637,171970577-171970577-43189882,171970745-171970745-2370743389,171978825-171978825-1 |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
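To compare the two outputs above without reading the long strings by eye, something like the following could be used. It is a rough sketch: `domains` and `compare_positions` are hypothetical helper names, and it only compares which gtid_domain_ids appear in each position, not the sequence numbers.

```python
def domains(pos):
    """Return the set of gtid_domain_ids found in a GTID position string."""
    return {int(gtid.split("-")[0]) for gtid in pos.split(",") if gtid.strip()}


def compare_positions(binlog_pos, slave_pos):
    """Group domain ids by whether they appear in gtid_binlog_pos, gtid_slave_pos, or both."""
    b, s = domains(binlog_pos), domains(slave_pos)
    return {
        "only_in_binlog_pos": sorted(b - s),
        "only_in_slave_pos": sorted(s - b),
        "in_both": sorted(b & s),
    }


# Values copied from the db1133 output above.
gtid_binlog_pos = ""
gtid_slave_pos = (
    "171966678-171966678-229039886,171970572-171970572-187842637,"
    "171970577-171970577-43189882,171970745-171970745-2370743389,"
    "171978825-171978825-1"
)
report = compare_positions(gtid_binlog_pos, gtid_slave_pos)
for group, ids in report.items():
    print(group, ids)
```

For the values above, every domain ends up under `only_in_slave_pos`, since gtid_binlog_pos is empty. Note that 171970572, the s1 domain, is among them.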
Change 929945 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] orchestrator.conf: Add test-s4 to automatic recovery
Change 929945 merged by Marostegui:
[operations/puppet@production] orchestrator.conf: Add test-s4 to automatic recovery
Change 929947 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] orchestrator.conf: Add intermediate master recovery
Change 929947 merged by Marostegui:
[operations/puppet@production] orchestrator.conf: Add intermediate master recovery
Change 929948 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] db1124,db1125,db1133: Binlog set to SBR
Change 929948 merged by Marostegui:
[operations/puppet@production] db1124,db1125,db1133: Binlog set to SBR
@Marostegui: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!
