Test orchestrator intermediate master topology auto-recovery
Open, Medium, Public

Description

Start testing how orchestrator would recover the replication topology when an intermediate master (dc-master) fails, without any data loss.

Event Timeline

Marostegui triaged this task as Medium priority. Nov 14 2022, 8:21 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 859972 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1133: Move it to test-s4 section

https://gerrit.wikimedia.org/r/859972

Change 859972 merged by Marostegui:

[operations/puppet@production] db1133: Move it to test-s4 section

https://gerrit.wikimedia.org/r/859972

Some updates: I have been doing some of the first tests with the following replication topology:

[Screenshot of the replication topology: Captura de pantalla 2022-11-24 a las 8.12.58.png]

The first test was an easy one: killing db1125 and letting orchestrator move db1133 automatically.
The problem that showed up was more or less expected, given the mess we have with gtid_domain_ids:

root@db1133.eqiad.wmnet[mysql]> SELECT @@GLOBAL.gtid_slave_pos\G
*************************** 1. row ***************************
@@GLOBAL.gtid_slave_pos: 0-171970637-5484646134,171970572-171970572-2838977796,171970577-171970577-41556788,171970637-171970637-2116621969,171970661-171970661-3655324752,171970745-171970745-2362445072,171974720-171974720-2572451842,171978774-171978774-5,171978825-171978825-1,180355171-180355171-148310907,180359172-180359172-49702203,180363268-180363268-1082287825

The expected error happened:

[ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 171970572-171970572-2838977796, which is not in the master's binlog', Internal MariaDB error code: 1236

Which is due to:

root@db1133.eqiad.wmnet[(none)]> select @@gtid_binlog_pos;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_binlog_pos                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 171970572-171970572-2838977796,171970577-171970577-41557066,171970661-171970661-3655324752,171970745-171970745-2362445072,171974728-171974728-145270,180363268-180363268-1082287825 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)

171970572-171970572-2838977796 is no longer valid for this host, as that domain belongs to the s1 master (which was the old master for db1133). The second value, 171970577, is the valid one, as it corresponds to the current test-s4 master. So the ways to work around that error are either:

  • Making orchestrator fall back to the traditional binlog file/position way of solving the issue
  • Cleaning up the unused GTID domain with: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(171970572);

That last command worked, so now the first value is the valid one:

root@db1133.eqiad.wmnet[(none)]> select @@gtid_binlog_pos;
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_binlog_pos                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
| 171970577-171970577-41629632,171970661-171970661-3655324752,171970745-171970745-2362445072,171974728-171974728-145270,180363268-180363268-1082287825 |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
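
The effect of that cleanup on the position string can be modelled in a few lines of Python. This is only a sketch of the string handling, assuming the operator has already decided which domain is stale; the helper names (parse_gtid_pos, delete_domain_sql, drop_domains) are made up for illustration, and it does not replace whatever checks the server itself performs when running the statement.

```python
def parse_gtid_pos(position):
    """Split a MariaDB GTID position string into (domain_id, server_id, seq_no) tuples."""
    gtids = []
    for item in position.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty position such as ''
        domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
        gtids.append((domain_id, server_id, seq_no))
    return gtids


def delete_domain_sql(position, domains_to_drop):
    """Build the FLUSH statement, refusing domains that are not in the position."""
    present = {domain_id for domain_id, _, _ in parse_gtid_pos(position)}
    missing = [d for d in domains_to_drop if d not in present]
    if missing:
        raise ValueError(f"domains not in position: {missing}")
    ids = ",".join(str(d) for d in domains_to_drop)
    return f"FLUSH BINARY LOGS DELETE_DOMAIN_ID=({ids});"


def drop_domains(position, domains_to_drop):
    """Return the GTIDs that would remain once the given domains are removed."""
    drop = set(domains_to_drop)
    return [gtid for gtid in parse_gtid_pos(position) if gtid[0] not in drop]


# gtid_binlog_pos of db1133 before the cleanup, copied from the output above.
before = (
    "171970572-171970572-2838977796,171970577-171970577-41557066,"
    "171970661-171970661-3655324752,171970745-171970745-2362445072,"
    "171974728-171974728-145270,180363268-180363268-1082287825"
)

print(delete_domain_sql(before, [171970572]))
for domain_id, server_id, seq_no in drop_domains(before, [171970572]):
    print(f"{domain_id}-{server_id}-{seq_no}")
```

The remaining domains come out in the same order as in the second output above, with 171970577 first; the sequence numbers differ from that output because replication kept moving in between.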

Now, killing db1125 results in orchestrator picking up the value that works.
I still need to investigate what happened at the data level.

I have set up db1124 (test-s4 master) to replicate enwiki.recentchanges and heartbeat.heartbeat from the s1 master. I want to start moving db1133 around and see what happens with the data arriving in recentchanges.

set global replicate_do_table='enwiki.recentchanges,heartbeat.heartbeat';

I am probably going to create an epic task to see if we can address and clean up all the mess we have with gtid_domain_id, as the workaround described at T322993#8419212 seems to work in most cases.
Of course the other approach would be to disable GTID, but that's not something I am too comfortable with at the moment. I am going to do a few more tests, trying to mess with db1133's gtid_binlog_pos.

Because at the moment db1133 is showing this:

root@db1133.eqiad.wmnet[(none)]> select @@gtid_binlog_pos;
+-------------------+
| @@gtid_binlog_pos |
+-------------------+
|                   |
+-------------------+
1 row in set (0.001 sec)
root@db1133.eqiad.wmnet[information_schema]> select @@gtid_slave_pos;
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| @@gtid_slave_pos                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
| 171966678-171966678-229039886,171970572-171970572-187842637,171970577-171970577-43189882,171970745-171970745-2370743389,171978825-171978825-1 |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.001 sec)
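
To compare this gtid_slave_pos with the one captured during the first test, the two strings can be diffed per domain. A rough Python sketch, assuming only the two outputs quoted in this task; the function names are hypothetical and nothing here talks to a server.

```python
def domains(position):
    """Map domain_id -> (server_id, seq_no) for a MariaDB GTID position string."""
    result = {}
    for item in position.split(","):
        item = item.strip()
        if not item:
            continue
        domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


def diff_positions(old, new):
    """Return (only_in_old, only_in_new, changed) as sorted lists of domain IDs."""
    old_d, new_d = domains(old), domains(new)
    only_old = sorted(set(old_d) - set(new_d))
    only_new = sorted(set(new_d) - set(old_d))
    changed = sorted(d for d in set(old_d) & set(new_d) if old_d[d] != new_d[d])
    return only_old, only_new, changed


# gtid_slave_pos of db1133 during the first test (copied from earlier in this task).
first_test = (
    "0-171970637-5484646134,171970572-171970572-2838977796,"
    "171970577-171970577-41556788,171970637-171970637-2116621969,"
    "171970661-171970661-3655324752,171970745-171970745-2362445072,"
    "171974720-171974720-2572451842,171978774-171978774-5,"
    "171978825-171978825-1,180355171-180355171-148310907,"
    "180359172-180359172-49702203,180363268-180363268-1082287825"
)
# gtid_slave_pos of db1133 as shown just above.
current = (
    "171966678-171966678-229039886,171970572-171970572-187842637,"
    "171970577-171970577-43189882,171970745-171970745-2370743389,"
    "171978825-171978825-1"
)

only_old, only_new, changed = diff_positions(first_test, current)
print("gone since first test:", only_old)
print("new since first test:", only_new)
print("different sequence number:", changed)
```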

Change 929945 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] orchestrator.conf: Add test-s4 to automatic recovery

https://gerrit.wikimedia.org/r/929945

Change 929945 merged by Marostegui:

[operations/puppet@production] orchestrator.conf: Add test-s4 to automatic recovery

https://gerrit.wikimedia.org/r/929945

Change 929947 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] orchestrator.conf: Add intermediate master recovery

https://gerrit.wikimedia.org/r/929947

Change 929947 merged by Marostegui:

[operations/puppet@production] orchestrator.conf: Add intermediate master recovery

https://gerrit.wikimedia.org/r/929947

Change 929948 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1124,db1125,db1133: Binlog set to SBR

https://gerrit.wikimedia.org/r/929948

Change 929948 merged by Marostegui:

[operations/puppet@production] db1124,db1125,db1133: Binlog set to SBR

https://gerrit.wikimedia.org/r/929948

@Marostegui: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!