
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-02-12
Closed, Resolved · Public

Assigned To
Authored By
fnegri
Feb 12 2025, 5:22 PM
Referenced Files
F58424879: Screenshot 2025-02-19 at 16.12.50.png
Feb 19 2025, 3:24 PM
F58411591: Screenshot 2025-02-17 at 11.17.45.png
Feb 17 2025, 10:20 AM
F58399213: Screenshot 2025-02-14 at 16.17.16.png
Feb 14 2025, 3:19 PM
F58398454: Screenshot 2025-02-14 at 12.26.28.png
Feb 14 2025, 11:34 AM
F58394809: Screenshot 2025-02-13 at 12.15.07.png
Feb 13 2025, 11:21 AM
F58391989: Screenshot 2025-02-12 at 17.59.17.png
Feb 12 2025, 5:22 PM

Description

It has been a while since the last time this happened (see the parent task T357624 for a list).

The replica has been lagging frequently during the past week:

Screenshot 2025-02-12 at 17.59.17.png (612×1 px, 133 KB)

I believe some of the spikes were caused by repeated crashes of the primary (tracked in T385900: [toolsdb] mariadb crashing repeatedly (innodb_fatal_semaphore_wait_threshold)).

But today the replica is stuck replicating a big transaction, as we have seen many times in the past:

#250212  6:51:25 server id 2886729896  end_log_pos 1370378 CRC32 0x98d8bb5e     Annotate_rows:
#Q> DELETE FROM `store_edit` WHERE (`store_edit`.`batch_id` = 1867223 AND `store_edit`.`newrevid` < 2075780282)
#250212  6:51:25 server id 2886729896  end_log_pos 1370475 CRC32 0xfab66eac     Table_map: `s53685__editgroups`.`store_edit` mapped to number 1710992

This query was not logged in the slow query log of the primary, where we log all queries taking more than 30 minutes to complete. But as we have seen before, because of row-based replication this kind of query can take much longer on the replica than it took on the primary.
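
For reference, the 30-minute threshold corresponds to long_query_time = 1800 seconds if the primary uses the standard slow-log mechanism; a quick way to double-check the configuration (a sketch, the variable values are assumptions) is:

-- Hedged sketch: confirm the slow-query-log configuration on the primary.
-- 1800 seconds would match the 30-minute threshold mentioned above; the
-- actual values on the primary may differ.
SHOW GLOBAL VARIABLES LIKE 'slow_query_log';
SHOW GLOBAL VARIABLES LIKE 'long_query_time';
SHOW GLOBAL VARIABLES LIKE 'slow_query_log_file';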

Update 2025-02-13: the query above completed after a few hours, but replication is now stuck on a different query, on a different table (see comments for details).

Update 2025-02-17: replication keeps getting stuck on different queries. I discovered that capturing a backtrace with gdb has the side effect of getting it unstuck (see comments for details).

I can easily recreate the replica from scratch, but I would really like to understand what's going on.

Side note: we should really migrate the s53685__editgroups database to a dedicated Trove instance, as it's currently the third-largest database on ToolsDB, using about 180 GB. See also: T291782: Migrate largest ToolsDB users to Trove.
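
For context, per-database sizes can be estimated from information_schema (a sketch; data_length + index_length only approximates the on-disk footprint):

-- Approximate the largest databases on ToolsDB by data + index size.
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
  FROM information_schema.TABLES
 GROUP BY table_schema
 ORDER BY size_gb DESC
 LIMIT 10;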

Event Timeline

fnegri changed the task status from Open to In Progress. Feb 12 2025, 5:36 PM
fnegri triaged this task as High priority.

Compared to previous situations like T370760: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-07-23 this looks a bit different:

  • CPU usage is close to zero, so replication is not CPU-bound
  • disk I/O is very low, so replication is not disk-bound
  • the SHOW ENGINE INNODB STATUS output is static: no I/O, and it's not creating thousands of row locks as it usually does when replicating this kind of mass delete:
---TRANSACTION 145653083611, ACTIVE 8396 sec updating or deleting
mysql tables in use 1, locked 1
2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
MariaDB thread id 12, OS thread handle 139991545382592, query id 14559920 Delete_rows_log_event::ha_delete_row(-1) on table `store_edit`
DELETE FROM `store_edit` WHERE (`store_edit`.`batch_id` = 1867223 AND `store_edit`.`newrevid` < 2075780282)

[...]

--------
FILE I/O
--------
Pending flushes (fsync) log: 0; buffer pool: 0
707083 OS file reads, 8404133 OS file writes, 107788 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s

That transaction eventually completed around 21:00 UTC, replication lag went down quickly for about 1 hour, then replication got stuck on another transaction around 22:00 UTC:

Screenshot 2025-02-13 at 12.15.07.png (618×1 px, 91 KB)

The new stuck transaction is:

|     13 | system user     |           | s55902__xlinkstest_p | Slave_worker |  47698 | Update_rows_log_event::ha_update_row(-1) on table `host` | UPDATE host SET hst_added_count = hst_added_count + 1 WHERE hst_id = 200561 |    0.000 |

That looks like a very simple query, and I can run SELECT * FROM host WHERE hst_id = 200561; with no issues.

I have no idea why the Slave_worker cannot complete that update.
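
One thing worth checking in this situation is whether the stuck worker is actually waiting on an InnoDB lock held by another transaction (a sketch, assuming the InnoDB lock tables in information_schema are available on this MariaDB 10.6 build):

-- Hedged diagnostic sketch: list active InnoDB transactions and any lock waits.
SELECT trx_mysql_thread_id, trx_state, trx_started, trx_rows_locked, trx_query
  FROM information_schema.INNODB_TRX
 ORDER BY trx_started;

SELECT * FROM information_schema.INNODB_LOCK_WAITS;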

Full output of SHOW ENGINE INNODB STATUS:

1MariaDB [s55902__xlinkstest_p]> SHOW ENGINE INNODB STATUS\G
2*************************** 1. row ***************************
3 Type: InnoDB
4 Name:
5Status:
6=====================================
72025-02-13 11:18:55 0x7f4a835686c0 INNODB MONITOR OUTPUT
8=====================================
9Per second averages calculated from the last 46 seconds
10-----------------
11BACKGROUND THREAD
12-----------------
13srv_master_thread loops: 159 srv_active, 0 srv_shutdown, 130073 srv_idle
14srv_master_thread log flush and writes: 130182
15----------
16SEMAPHORES
17----------
18------------------------
19LATEST FOREIGN KEY ERROR
20------------------------
212025-02-12 22:02:29 0x7f52472a86c0 Transaction:
22TRANSACTION 145675054535, ACTIVE 0 sec inserting
23mysql tables in use 1, locked 1
243 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
25MariaDB thread id 19, OS thread handle 139991358015168, query id 41063861 Write_rows_log_event::write_row(-1) on table `cache_data`
26INSERT INTO cache_data (cdata_cache_id, cdata_url,
27 cdata_confidence, cdata_skipped,
28 cdata_excluded) VALUES ('+\?\?5\?)MJ\?\?:\?0\?ۤ+?/\?\?J\?\?', 'https://web.archive.org/web/20110605231429/http://www.locusmag.com/SFAwards/Db/Nebula1973.html', 0.0196078431372549, 0, 0)
29Foreign key constraint fails for table `s51330__copyvios_p`.`cache_data`:
30,
31 CONSTRAINT `cache_data_ibfk_1` FOREIGN KEY (`cdata_cache_id`) REFERENCES `cache` (`cache_id`) ON DELETE CASCADE ON UPDATE CASCADE in parent table, in index cdata_cache_id tuple:
32DATA TUPLE: 2 fields;
33 0: len 32; hex 1e2bb1d335b3294d4a19b08e3aa9300fbad2970810dba42b1bd12fd0ef4a8efc; asc + 5 )MJ : 0 + / J ;;
34 1: len 8; hex 0000000004124a39; asc J9;;
35
36But in parent table `s51330__copyvios_p`.`cache`, in index PRIMARY,
37the closest match we can find is record:
38PHYSICAL RECORD: n_fields 7; compact format; info bits 0
39 0: len 30; hex 1e352898928aeb1776a94ead14875927ccb20778028a13f6331fe137d98a; asc 5( v N Y' x 3 7 ; (total 32 bytes);
40 1: len 6; hex 000000000000; asc ;;
41 2: len 7; hex 80000000000000; asc ;;
42 3: len 4; hex 67ab3f5a; asc g ?Z;;
43 4: len 4; hex 80000008; asc ;;
44 5: len 4; hex 055f2041; asc _ A;;
45 6: len 1; hex 80; asc ;;
46
47------------------------
48LATEST DETECTED DEADLOCK
49------------------------
502025-02-12 22:02:26 0x7f52524c26c0
51*** (1) TRANSACTION:
52TRANSACTION 145675045099, ACTIVE 0 sec inserting
53mysql tables in use 1, locked 1
54LOCK WAIT 4 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 2
55MariaDB thread id 15, OS thread handle 139991544768192, query id 41054037 Write_rows_log_event::write_row(-1) on table `store_edit`
56INSERT INTO `store_edit` (`id`, `oldrevid`, `newrevid`, `oldlength`, `newlength`, `timestamp`, `title`, `namespace`, `uri`, `comment`, `parsedcomment`, `bot`, `minor`, `changetype`, `user`, `patrolled`, `batch_id`, `reverted`) VALUES (2380326460, 0, 2310647986, 0, 4541, '2025-02-12 21:50:56', 'Q132256083', 0, 'https://www.wikidata.org/wiki/Q132256083', '/* wbeditentity-create-item:0| */ #quickstatements; #temporary_batch_1739389835152', '‎<span dir=\"auto\"><span class=\"autocomment\">Created a new Item: </span></span> #quickstatements; #temporary_batch_1739389835152', 0, 0, 'new', 'Stevenliuyi', 1, 2174762, 0), (2380326465, 2288778247, 2310647991, 34027, 34036, '2025-02-12 21:50:57', 'Q122386242', 0, 'https://www.wikidata.org/wiki/Q122386242', '/* wbsetdescription-set:1|fi */ videopeli vuodelta 2020, fixing description ([[:toollabs:editgroups/b/OR/3844b10ad42|details]])', '??
57*** WAITING FOR THIS LOCK TO BE GRANTED:
58RECORD LOCKS space id 866093 page no 39254 n bits 320 index PRIMARY of table `s53685__editgroups`.`store_batch` trx id 145675045099 lock mode S locks rec but not gap waiting
59Record lock, heap no 38 PHYSICAL RECORD: n_fields 16; compact format; info bits 0
60 0: len 4; hex 80212de1; asc !- ;;
61 1: len 6; hex 0021eae8c8ee; asc ! ;;
62 2: len 7; hex 6e0020803b0c17; asc n ; ;;
63 3: len 14; hex 53616d6f6173616d626961426f74; asc SamoasambiaBot;;
64 4: len 11; hex 3338343462313061643432; asc 3844b10ad42;;
65 5: len 18; hex 666978696e67206465736372697074696f6e; asc fixing description;;
66 6: len 8; hex 01029a152f9a95c0; asc / ;;
67 7: len 8; hex 01029a1fbb286680; asc (f ;;
68 8: len 4; hex 8000afbc; asc ;;
69 9: len 4; hex 80000001; asc ;;
70 10: len 4; hex 8000afbc; asc ;;
71 11: len 4; hex 80000000; asc ;;
72 12: len 4; hex 80000000; asc ;;
73 13: len 8; hex 8000000000062d95; asc - ;;
74 14: len 8; hex 99b5d894140d54ef; asc T ;;
75 15: len 1; hex 80; asc ;;
76
77*** CONFLICTING WITH:
78RECORD LOCKS space id 866093 page no 39254 n bits 320 index PRIMARY of table `s53685__editgroups`.`store_batch` trx id 145675045102 lock_mode X locks rec but not gap
79Record lock, heap no 38 PHYSICAL RECORD: n_fields 16; compact format; info bits 0
80 0: len 4; hex 80212de1; asc !- ;;
81 1: len 6; hex 0021eae8c8ee; asc ! ;;
82 2: len 7; hex 6e0020803b0c17; asc n ; ;;
83 3: len 14; hex 53616d6f6173616d626961426f74; asc SamoasambiaBot;;
84 4: len 11; hex 3338343462313061643432; asc 3844b10ad42;;
85 5: len 18; hex 666978696e67206465736372697074696f6e; asc fixing description;;
86 6: len 8; hex 01029a152f9a95c0; asc / ;;
87 7: len 8; hex 01029a1fbb286680; asc (f ;;
88 8: len 4; hex 8000afbc; asc ;;
89 9: len 4; hex 80000001; asc ;;
90 10: len 4; hex 8000afbc; asc ;;
91 11: len 4; hex 80000000; asc ;;
92 12: len 4; hex 80000000; asc ;;
93 13: len 8; hex 8000000000062d95; asc - ;;
94 14: len 8; hex 99b5d894140d54ef; asc T ;;
95 15: len 1; hex 80; asc ;;
96
97Record lock, heap no 102 PHYSICAL RECORD: n_fields 16; compact format; info bits 0
98 0: len 4; hex 80212e21; asc !.!;;
99 1: len 6; hex 0021eae8c8ee; asc ! ;;
100 2: len 7; hex 6e0020803b0c7f; asc n ; ;;
101 3: len 10; hex 466c6f72656e74796e61; asc Florentyna;;
102 4: len 6; hex 323433363132; asc 243612;;
103 5: len 0; hex ; asc ;;
104 6: len 8; hex 01029a17b486aec0; asc ;;
105 7: len 8; hex 01029a1fbadc1b40; asc @;;
106 8: len 4; hex 800005f1; asc ;;
107 9: len 4; hex 80000002; asc ;;
108 10: len 4; hex 800005f1; asc ;;
109 11: len 4; hex 80000000; asc ;;
110 12: len 4; hex 80000000; asc ;;
111 13: len 8; hex 800000000001e141; asc A;;
112 14: len 8; hex 99b5d8c4290a8763; asc ) c;;
113 15: len 1; hex 80; asc ;;
114
115
116*** (2) TRANSACTION:
117TRANSACTION 145675045102, ACTIVE 0 sec starting index read
118mysql tables in use 1, locked 1
119LOCK WAIT 4 lock struct(s), heap size 1128, 5 row lock(s), undo log entries 4
120MariaDB thread id 10, OS thread handle 139991659656896, query id 41054044 Update_rows_log_event::find_row(-1) on table `store_batch`
121UPDATE `store_batch` SET `ended` = (CASE `id` WHEN 2174762 THEN '2025-02-12 21:51:07' WHEN 2174433 THEN '2025-02-12 21:51:06' WHEN 2174669 THEN '2025-02-12 21:51:04' WHEN 2174497 THEN '2025-02-12 21:51:01' WHEN 2174725 THEN '2025-02-12 21:51:02' ELSE `ended` END), `nb_edits` = (CASE `id` WHEN 2174762 THEN 3841 WHEN 2174433 THEN 44988 WHEN 2174669 THEN 2654 WHEN 2174497 THEN 1521 WHEN 2174725 THEN 766 ELSE `nb_edits` END), `nb_distinct_pages` = (CASE `id` WHEN 2174762 THEN 3841 WHEN 2174433 THEN 44988 WHEN 2174669 THEN 2654 WHEN 2174497 THEN 1521 WHEN 2174725 THEN 766 ELSE `nb_distinct_pages` END), `nb_reverted_edits` = (CASE `id` WHEN 2174762 THEN 0 WHEN 2174433 THEN 0 WHEN 2174669 THEN 0 WHEN 2174497 THEN 0 WHEN 2174725 THEN 0 ELSE `nb_reverted_edits` END), `nb_new_pages` = (CASE `id` WHEN 2174762 THEN 3841 WHEN 2174433 THEN 0 WHEN 2174669 THEN 0 WHEN 2174497 THEN 0 WHEN 21747
122*** WAITING FOR THIS LOCK TO BE GRANTED:
123RECORD LOCKS space id 866093 page no 39257 n bits 320 index PRIMARY of table `s53685__editgroups`.`store_batch` trx id 145675045102 lock_mode X locks rec but not gap waiting
124Record lock, heap no 3 PHYSICAL RECORD: n_fields 16; compact format; info bits 0
125 0: len 4; hex 80212f2a; asc !/*;;
126 1: len 6; hex 0021eae8b25f; asc ! _;;
127 2: len 7; hex 260011c00b1ee6; asc & ;;
128 3: len 11; hex 53746576656e6c69757969; asc Stevenliuyi;;
129 4: len 13; hex 31373339333839383335313532; asc 1739389835152;;
130 5: len 17; hex 23717569636b73746174656d656e74733b; asc #quickstatements;;;
131 6: len 8; hex 01029a1e10ce4840; asc H@;;
132 7: len 8; hex 01029a1fba620940; asc b @;;
133 8: len 4; hex 80000ef9; asc ;;
134 9: len 4; hex 80000006; asc ;;
135 10: len 4; hex 80000ef9; asc ;;
136 11: len 4; hex 80000000; asc ;;
137 12: len 4; hex 80000ef9; asc ;;
138 13: len 8; hex 800000000109e0a3; asc ;;
139 14: len 8; hex 99b5d93cf80bbd2b; asc < +;;
140 15: len 1; hex 80; asc ;;
141
142*** CONFLICTING WITH:
143RECORD LOCKS space id 866093 page no 39257 n bits 320 index PRIMARY of table `s53685__editgroups`.`store_batch` trx id 145675045099 lock mode S locks rec but not gap
144Record lock, heap no 3 PHYSICAL RECORD: n_fields 16; compact format; info bits 0
145 0: len 4; hex 80212f2a; asc !/*;;
146 1: len 6; hex 0021eae8b25f; asc ! _;;
147 2: len 7; hex 260011c00b1ee6; asc & ;;
148 3: len 11; hex 53746576656e6c69757969; asc Stevenliuyi;;
149 4: len 13; hex 31373339333839383335313532; asc 1739389835152;;
150 5: len 17; hex 23717569636b73746174656d656e74733b; asc #quickstatements;;;
151 6: len 8; hex 01029a1e10ce4840; asc H@;;
152 7: len 8; hex 01029a1fba620940; asc b @;;
153 8: len 4; hex 80000ef9; asc ;;
154 9: len 4; hex 80000006; asc ;;
155 10: len 4; hex 80000ef9; asc ;;
156 11: len 4; hex 80000000; asc ;;
157 12: len 4; hex 80000ef9; asc ;;
158 13: len 8; hex 800000000109e0a3; asc ;;
159 14: len 8; hex 99b5d93cf80bbd2b; asc < +;;
160 15: len 1; hex 80; asc ;;
161
162*** WE ROLL BACK TRANSACTION (1)
163------------
164TRANSACTIONS
165------------
166Trx id counter 145675059184
167Purge done for trx's n:o < 145675059181 undo n:o < 0 state: running
168History list length 328
169LIST OF TRANSACTIONS FOR EACH SESSION:
170---TRANSACTION (0x7f5247d67dc0), not started
1710 lock struct(s), heap size 1128, 0 row lock(s)
172---TRANSACTION 145675059157, ACTIVE (PREPARED) 47786 sec
1733 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
174MariaDB thread id 15, OS thread handle 139991544768192, query id 41068895 Waiting for prior transaction to commit
175---TRANSACTION 145675059164, ACTIVE (PREPARED) 47786 sec
1763 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
177MariaDB thread id 9, OS thread handle 139991660066496, query id 41068902 Waiting for prior transaction to commit
178---TRANSACTION 145675059163, ACTIVE (PREPARED) 47786 sec
1793 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
180MariaDB thread id 16, OS thread handle 139991544563392, query id 41068901 Waiting for prior transaction to commit
181---TRANSACTION 145675059161, ACTIVE (PREPARED) 47786 sec
1823 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
183MariaDB thread id 10, OS thread handle 139991659656896, query id 41068898 Waiting for prior transaction to commit
184---TRANSACTION 145675059166, ACTIVE (PREPARED) 47786 sec
1854 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 2
186MariaDB thread id 11, OS thread handle 139991659861696, query id 41068908 Waiting for prior transaction to commit
187---TRANSACTION 145675059168, ACTIVE (PREPARED) 47786 sec
1883 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
189MariaDB thread id 12, OS thread handle 139991545382592, query id 41068911 Waiting for prior transaction to commit
190---TRANSACTION 145675059162, ACTIVE 47786 sec updating or deleting
191mysql tables in use 1, locked 1
1922 lock struct(s), heap size 1128, 1 row lock(s)
193MariaDB thread id 13, OS thread handle 139991545177792, query id 41068896 Update_rows_log_event::ha_update_row(-1) on table `host`
194UPDATE host SET hst_added_count = hst_added_count + 1 WHERE hst_id = 200561
195---TRANSACTION 145675059159, ACTIVE (PREPARED) 47786 sec
1964 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 3
197MariaDB thread id 14, OS thread handle 139991544972992, query id 41068910 Waiting for prior transaction to commit
198---TRANSACTION 145675059172, ACTIVE (PREPARED) 47786 sec
1993 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
200MariaDB thread id 18, OS thread handle 139991358219968, query id 41068915 Waiting for prior transaction to commit
201---TRANSACTION 145675059171, ACTIVE (PREPARED) 47786 sec
2029 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 4
203MariaDB thread id 19, OS thread handle 139991358015168, query id 41068917 Waiting for prior transaction to commit
204---TRANSACTION 145675059165, ACTIVE (PREPARED) 47786 sec
2053 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
206MariaDB thread id 17, OS thread handle 139991544358592, query id 41068907 Waiting for prior transaction to commit
207---TRANSACTION 145675059158, ACTIVE (PREPARED) 47786 sec
2083 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 2
209MariaDB thread id 20, OS thread handle 139991357810368, query id 41068891 Waiting for prior transaction to commit
210---TRANSACTION (0x7f5247d5eec0), not started
2110 lock struct(s), heap size 1128, 0 row lock(s)
212---TRANSACTION (0x7f5247d5e3c0), not started
2130 lock struct(s), heap size 1128, 0 row lock(s)
214--------
215FILE I/O
216--------
217Pending flushes (fsync) log: 0; buffer pool: 0
2181871404 OS file reads, 26351275 OS file writes, 412456 OS fsyncs
2190.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
220-------------------------------------
221INSERT BUFFER AND ADAPTIVE HASH INDEX
222-------------------------------------
223Ibuf: size 1, free list len 393972, seg size 393974, 0 merges
224merged operations:
225 insert 0, delete mark 0, delete 0
226discarded operations:
227 insert 0, delete mark 0, delete 0
2280.00 hash searches/s, 0.00 non-hash searches/s
229---
230LOG
231---
232Log sequence number 110490006628940
233Log flushed up to 110490006628940
234Pages flushed up to 110489906580852
235Last checkpoint at 110489906580852
2360 pending log flushes, 0 pending chkp writes
23721963939 log i/o's done, 0.00 log i/o's/second
238----------------------
239BUFFER POOL AND MEMORY
240----------------------
241Total large memory allocated 33319550976
242Dictionary memory allocated 222326512
243Buffer pool size 2011776
244Free buffers 2232
245Database pages 1990955
246Old database pages 734952
247Modified db pages 33794
248Percent of dirty pages(LRU & free pages): 1.695
249Max dirty pages percent: 90.000
250Pending reads 0
251Pending writes: LRU 1, flush list 6
252Pages made young 621814, not young 14503080
2530.00 youngs/s, 0.00 non-youngs/s
254Pages read 1870617, created 2936597, written 4348240
2550.00 reads/s, 0.00 creates/s, 0.00 writes/s
256Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
257Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
258LRU len: 1990955, unzip_LRU len: 38333
259I/O sum[0]:cur[0], unzip sum[0]:cur[0]
260--------------
261ROW OPERATIONS
262--------------
2630 read views open inside InnoDB
264Process ID=0, Main thread ID=0, state: sleeping
265Number of rows inserted 18670783, updated 9525684, deleted 6635376, read 44582781
2660.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
267Number of system rows inserted 16558168, updated 0, deleted 16434622, read 16434689
2680.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
269----------------------------
270END OF INNODB MONITOR OUTPUT
271============================
272
2731 row in set (0.001 sec)

Full output of SHOW SLAVE STATUS:

1MariaDB [(none)]> SHOW SLAVE STATUS\G
2*************************** 1. row ***************************
3 Slave_IO_State: Waiting for master to send event
4 Master_Host: tools-db-4.tools.eqiad1.wikimedia.cloud
5 Master_User: repl
6 Master_Port: 3306
7 Connect_Retry: 60
8 Master_Log_File: log.018558
9 Read_Master_Log_Pos: 82950143
10 Relay_Log_File: tools-db-5-relay-bin.000449
11 Relay_Log_Pos: 39156783
12 Relay_Master_Log_File: log.018477
13 Slave_IO_Running: Yes
14 Slave_SQL_Running: Yes
15 Replicate_Do_DB:
16 Replicate_Ignore_DB:
17 Replicate_Do_Table:
18 Replicate_Ignore_Table:
19 Replicate_Wild_Do_Table:
20 Replicate_Wild_Ignore_Table:
21 Last_Errno: 0
22 Last_Error:
23 Skip_Counter: 0
24 Exec_Master_Log_Pos: 39156490
25 Relay_Log_Space: 9149935877
26 Until_Condition: None
27 Until_Log_File:
28 Until_Log_Pos: 0
29 Master_SSL_Allowed: No
30 Master_SSL_CA_File:
31 Master_SSL_CA_Path:
32 Master_SSL_Cert:
33 Master_SSL_Cipher:
34 Master_SSL_Key:
35 Seconds_Behind_Master: 48940
36 Master_SSL_Verify_Server_Cert: No
37 Last_IO_Errno: 0
38 Last_IO_Error:
39 Last_SQL_Errno: 0
40 Last_SQL_Error:
41 Replicate_Ignore_Server_Ids:
42 Master_Server_Id: 2886729896
43 Master_SSL_Crl:
44 Master_SSL_Crlpath:
45 Using_Gtid: Slave_Pos
46 Gtid_IO_Pos: 0-2886731673-33522724637,2886729896-2886729896-1364463741,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158
47 Replicate_Do_Domain_Ids:
48 Replicate_Ignore_Domain_Ids:
49 Parallel_Mode: optimistic
50 SQL_Delay: 0
51 SQL_Remaining_Delay: NULL
52 Slave_SQL_Running_State: Waiting for room in worker thread event queue
53 Slave_DDL_Groups: 35983
54Slave_Non_Transactional_Groups: 2531905
55 Slave_Transactional_Groups: 14071749
561 row in set (0.001 sec)

Raw binlog from the primary, starting at the Exec_Master_Log_Pos (39156490) reported above:

1root@tools-db-4:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose --start-position 39156490 log.018477 |head -n 50
2/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
3/*!40019 SET @@session.max_insert_delayed_threads=0*/;
4/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
5DELIMITER /*!*/;
6# at 39156490
7#250212 21:51:41 server id 2886729896 end_log_pos 39156532 CRC32 0x825281b4 GTID 2886729896-2886729896-1355755397 trans
8/*M!100101 SET @@session.skip_parallel_replication=0*//*!*/;
9/*M!100001 SET @@session.gtid_domain_id=2886729896*//*!*/;
10/*M!100001 SET @@session.server_id=2886729896*//*!*/;
11/*M!100001 SET @@session.gtid_seq_no=1355755397*//*!*/;
12START TRANSACTION
13/*!*/;
14# at 39156532
15# at 39156630
16#250212 21:51:41 server id 2886729896 end_log_pos 39156630 CRC32 0x1690376c Annotate_rows:
17#Q> UPDATE host SET hst_added_count = hst_added_count + 1 WHERE hst_id = 200561
18#250212 21:51:41 server id 2886729896 end_log_pos 39156699 CRC32 0x3031c028 Table_map: `s55902__xlinkstest_p`.`host` mapped to number 1904398
19# at 39156699
20#250212 21:51:41 server id 2886729896 end_log_pos 39156801 CRC32 0x14b370e4 Update_rows: table id 1904398 flags: STMT_END_F
21### UPDATE `s55902__xlinkstest_p`.`host`
22### WHERE
23### @1=200561
24### @2='net.digiporta.www.'
25### @3=5
26### @4=1
27### @5=12193
28### SET
29### @1=200561
30### @2='net.digiporta.www.'
31### @3=5
32### @4=1
33### @5=12194
34# Number of rows: 1
35# at 39156801
36#250212 21:51:41 server id 2886729896 end_log_pos 39156832 CRC32 0x9a49ffee Xid = 452907802
37COMMIT/*!*/;
38# at 39156832
39#250212 21:51:41 server id 2886729896 end_log_pos 39156874 CRC32 0x12df977a GTID 2886729896-2886729896-1355755398 trans

I will try to help with this more once I'm done with T386162: Some dewiktionary pageviews fatal "RevisionAccessException: Failed to load data blob from {address} for revision {revision}.", but a couple of small things come to mind:

  • A replica should basically never run into a deadlock or latency from concurrent writes. Replication is serial and no other script should be able to write to the replica (unless root) at the same time. Please make sure the replica is read only.
  • The query looks okay to me; it could be the disk suddenly becoming slow for lookups.
  • Or there are a LOT of updates like that happening.

I will try to debug more.

fnegri updated the task description.

Thanks @Ladsgroup!

Please make sure the replica is read only

MariaDB [(none)]> SELECT @@read_only;
+-------------+
| @@read_only |
+-------------+
|           1 |
+-------------+

The query looks okay to me; it could be the disk suddenly becoming slow for lookups.

I also suspected a slow disk (especially since it's on Ceph), but IOPS are super low, and I can SELECT that record just fine...
CPU usage is also very low.

there are a LOT of updates like that happening

Possibly, but it has been stuck on this single transaction for a few hours now...

There must be an explanation, but I can't think of any!

I tried STOP SLAVE; and that was also hanging. I killed all the replication threads with KILL 7; KILL 8; ... KILL 19; and that finally stopped replication:

MariaDB [(none)]> kill 19;
Query OK, 0 rows affected (0.001 sec)

MariaDB [(none)]> kill 20;
ERROR 1094 (HY000): Unknown thread id: 20
MariaDB [(none)]> SHOW PROCESSLIST;
+--------+-----------------+-----------+------+---------+--------+------------------------+------------------+----------+
| Id     | User            | Host      | db   | Command | Time   | State                  | Info             | Progress |
+--------+-----------------+-----------+------+---------+--------+------------------------+------------------+----------+
|      3 | event_scheduler | localhost | NULL | Daemon  | 157058 | Waiting on empty queue | NULL             |    0.000 |
| 262132 | root            | localhost | NULL | Query   |      0 | starting               | SHOW PROCESSLIST |    0.000 |
+--------+-----------------+-----------+------+---------+--------+------------------------+------------------+----------+
2 rows in set (0.000 sec)
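
For future reference, the worker thread IDs don't have to be typed by hand: they can be pulled from information_schema.PROCESSLIST (a sketch, assuming the replication workers show up under the 'system user' account as in the earlier SHOW PROCESSLIST output):

-- Hedged sketch: list the replication worker threads and generate KILL
-- statements for them instead of killing each ID manually.
SELECT ID, STATE, INFO
  FROM information_schema.PROCESSLIST
 WHERE USER = 'system user';

SELECT CONCAT('KILL ', ID, ';') AS kill_stmt
  FROM information_schema.PROCESSLIST
 WHERE USER = 'system user';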

CHECK TABLE and ANALYZE TABLE did not show any issues:

MariaDB [(none)]> CHECK TABLE s55902__xlinkstest_p.host EXTENDED;
+---------------------------+-------+----------+----------+
| Table                     | Op    | Msg_type | Msg_text |
+---------------------------+-------+----------+----------+
| s55902__xlinkstest_p.host | check | status   | OK       |
+---------------------------+-------+----------+----------+
1 row in set (9.550 sec)

MariaDB [(none)]> ANALYZE TABLE s55902__xlinkstest_p.host;
+---------------------------+---------+----------+----------+
| Table                     | Op      | Msg_type | Msg_text |
+---------------------------+---------+----------+----------+
| s55902__xlinkstest_p.host | analyze | status   | OK       |
+---------------------------+---------+----------+----------+
1 row in set (0.009 sec)

I restarted the whole process with sudo systemctl restart mariadb and replication resumed.

Let's see if it breaks again...

This is the journalctl log for the past hour, covering the thread kills and the mariadb restart; nothing interesting there:

1fnegri@tools-db-5:~$ sudo journalctl -u mariadb --no-pager --since -1h
2Feb 13 18:41:07 tools-db-5 mysqld[338883]: 2025-02-13 18:41:07 7 [Note] Slave I/O thread exiting, read up to log 'log.018604', position 70625464; GTID position 0-2886731673-33522724637,2886729896-2886729896-1369513031,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158, master tools-db-4.tools.eqiad1.wikimedia.cloud:3306
3Feb 13 18:41:13 tools-db-5 mysqld[338883]: 2025-02-13 18:41:13 8 [ERROR] Slave SQL: Connection was killed, Internal MariaDB error code: 1927
4Feb 13 18:41:13 tools-db-5 mysqld[338883]: 2025-02-13 18:41:13 8 [Warning] Slave: Connection was killed Error_code: 1927
5Feb 13 18:41:13 tools-db-5 mysqld[338883]: 2025-02-13 18:41:13 8 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'log.018477' position 39156490; GTID position '0-2886731673-33522724637,2886729896-2886729896-1355755396,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158'
6Feb 13 18:41:37 tools-db-5 mysqld[338883]: 2025-02-13 18:41:37 8 [Note] Slave SQL thread exiting, replication stopped in log 'log.018477' at position 39162668; GTID position '0-2886731673-33522724637,2886729896-2886729896-1355755408,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158', master: tools-db-4.tools.eqiad1.wikimedia.cloud:3306
7Feb 13 18:42:40 tools-db-5 systemd[1]: Stopping mariadb.service - mariadb database server...
8Feb 13 18:42:40 tools-db-5 mysqld[338883]: 2025-02-13 18:42:40 0 [Note] /opt/wmf-mariadb106/bin/mysqld (initiated by: unknown): Normal shutdown
9Feb 13 18:42:40 tools-db-5 mysqld[338883]: 2025-02-13 18:42:40 0 [Note] Event Scheduler: Killing the scheduler thread, thread id 3
10Feb 13 18:42:40 tools-db-5 mysqld[338883]: 2025-02-13 18:42:40 0 [Note] Event Scheduler: Waiting for the scheduler thread to reply
11Feb 13 18:42:40 tools-db-5 mysqld[338883]: 2025-02-13 18:42:40 0 [Note] Event Scheduler: Stopped
12Feb 13 18:42:42 tools-db-5 mysqld[338883]: 2025-02-13 18:42:42 0 [Note] InnoDB: FTS optimize thread exiting.
13Feb 13 18:42:42 tools-db-5 mysqld[338883]: 2025-02-13 18:42:42 0 [Note] InnoDB: Starting shutdown...
14Feb 13 18:42:42 tools-db-5 mysqld[338883]: 2025-02-13 18:42:42 0 [Note] InnoDB: Dumping buffer pool(s) to /srv/labsdb/data/ib_buffer_pool
15Feb 13 18:42:42 tools-db-5 mysqld[338883]: 2025-02-13 18:42:42 0 [Note] InnoDB: Restricted to 502944 pages due to innodb_buf_pool_dump_pct=25
16Feb 13 18:42:42 tools-db-5 mysqld[338883]: 2025-02-13 18:42:42 0 [Note] InnoDB: Buffer pool(s) dump completed at 250213 18:42:42
17Feb 13 18:42:46 tools-db-5 mysqld[338883]: 2025-02-13 18:42:46 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
18Feb 13 18:42:46 tools-db-5 mysqld[338883]: 2025-02-13 18:42:46 0 [Note] InnoDB: Shutdown completed; log sequence number 110490006985412; transaction id 145675059322
19Feb 13 18:42:46 tools-db-5 mysqld[338883]: 2025-02-13 18:42:46 0 [Note] /opt/wmf-mariadb106/bin/mysqld: Shutdown complete
20Feb 13 18:42:46 tools-db-5 systemd[1]: mariadb.service: Deactivated successfully.
21Feb 13 18:42:46 tools-db-5 systemd[1]: Stopped mariadb.service - mariadb database server.
22Feb 13 18:42:46 tools-db-5 systemd[1]: mariadb.service: Consumed 3h 54min 7.828s CPU time.
23Feb 13 18:42:46 tools-db-5 systemd[1]: Starting mariadb.service - mariadb database server...
24Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] Starting MariaDB 10.6.19-MariaDB-log source revision 8f020508c87461bbeee551c3001acac56d506ae7 server_uid anDeSVoV50i/z+3s0c78qLNIAdg= as process 436046
25Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [ERROR] mysqld: Plugin 'unix_socket' already installed
26Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [ERROR] Couldn't load plugin 'unix_socket' from 'auth_socket.so'.
27Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] InnoDB: Compressed tables use zlib 1.3.1
28Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] InnoDB: Number of pools: 1
29Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
30Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] InnoDB: Using liburing
31Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] InnoDB: Initializing buffer pool, total size = 33285996544, chunk size = 134217728
32Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] InnoDB: Completed initialization of buffer pool
33Feb 13 18:42:46 tools-db-5 mysqld[436046]: 2025-02-13 18:42:46 0 [Note] InnoDB: 128 rollback segments are active.
34Feb 13 18:42:47 tools-db-5 mysqld[436046]: 2025-02-13 18:42:47 0 [Note] InnoDB: Creating shared tablespace for temporary tables
35Feb 13 18:42:47 tools-db-5 mysqld[436046]: 2025-02-13 18:42:47 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
36Feb 13 18:42:47 tools-db-5 mysqld[436046]: 2025-02-13 18:42:47 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
37Feb 13 18:42:47 tools-db-5 mysqld[436046]: 2025-02-13 18:42:47 0 [Note] InnoDB: 10.6.19 started; log sequence number 110490006985412; transaction id 145675059323
38Feb 13 18:42:47 tools-db-5 mysqld[436046]: 2025-02-13 18:42:47 0 [Note] InnoDB: Loading buffer pool(s) from /srv/labsdb/data/ib_buffer_pool
39Feb 13 18:42:48 tools-db-5 mysqld[436046]: 2025-02-13 18:42:48 0 [Note] Server socket created on IP: '0.0.0.0'.
40Feb 13 18:42:48 tools-db-5 mysqld[436046]: 2025-02-13 18:42:48 0 [Note] Server socket created on IP: '::'.
41Feb 13 18:42:48 tools-db-5 mysqld[436046]: 2025-02-13 18:42:48 0 [Warning] 'proxies_priv' entry '@% root@labsdb1005' ignored in --skip-name-resolve mode.
42Feb 13 18:42:50 tools-db-5 mysqld[436046]: 2025-02-13 18:42:50 0 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MariaDB server acts as a replica and has its hostname changed. Please use '--log-basename=#' or '--relay-log=tools-db-5-relay-bin' to avoid this problem.
43Feb 13 18:42:50 tools-db-5 mysqld[436046]: 2025-02-13 18:42:50 2 [Note] Event Scheduler: scheduler thread started with id 2
44Feb 13 18:42:53 tools-db-5 mysqld[436046]: 2025-02-13 18:42:53 6 [Note] Slave I/O thread: Start asynchronous replication to master 'repl@tools-db-4.tools.eqiad1.wikimedia.cloud:3306' in log 'log.018477' at position 39162668
45Feb 13 18:42:53 tools-db-5 mysqld[436046]: 2025-02-13 18:42:53 6 [Note] Slave I/O thread: connected to master 'repl@tools-db-4.tools.eqiad1.wikimedia.cloud:3306',replication starts at GTID position '0-2886731673-33522724637,2886729896-2886729896-1355755408,2886731301-2886731301-9248274392,2886731673-2886731673-4887243158'
46Feb 13 18:42:53 tools-db-5 mysqld[436046]: 2025-02-13 18:42:53 0 [Note] /opt/wmf-mariadb106/bin/mysqld: ready for connections.
47Feb 13 18:42:53 tools-db-5 mysqld[436046]: Version: '10.6.19-MariaDB-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 MariaDB Server
48Feb 13 18:42:53 tools-db-5 systemd[1]: Started mariadb.service - mariadb database server.
49Feb 13 18:43:15 tools-db-5 mysqld[436046]: 2025-02-13 18:43:15 0 [Note] InnoDB: Buffer pool(s) load completed at 250213 18:43:15

Replication caught up in about 2 hours, and seems to be working fine.

Screenshot 2025-02-14 at 12.26.28.png (618×1 px, 95 KB)

If the storage is over the network (you mentioned Ceph) it's going to be quite a pain to maintain :( Maybe that can change, or we should reduce the write volume somehow.

Storage has been on Ceph for a few years and it has mostly worked fine, but I agree we should reduce the write volume. I think the most effective way to do that is T291782: Migrate largest ToolsDB users to Trove.

In the meantime, replication got stuck again after working fine for a few hours:

Screenshot 2025-02-14 at 16.17.16.png (614×1 px, 104 KB)

SHOW PROCESSLIST shows a bunch of inserts on s53430__nppbrowser_p.article_new:

1MariaDB [(none)]> SHOW PROCESSLIST;
2+--------+-----------------+-----------+----------------------+--------------+-------+------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
3| Id | User | Host | db | Command | Time | State | Info | Progress |
4+--------+-----------------+-----------+----------------------+--------------+-------+------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
5| 2 | event_scheduler | localhost | NULL | Daemon | 74152 | Waiting on empty queue | NULL | 0.000 |
6| 6 | system user | | NULL | Slave_IO | 74149 | Waiting for master to send event | NULL | 0.000 |
7| 8 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
8| 9 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
9| 10 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
10| 11 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
11| 12 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
12| 13 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
13| 14 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
14| 15 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
15| 16 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
16| 17 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
17| 18 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
18| 19 | system user | | s53430__nppbrowser_p | Slave_worker | 3513 | Write_rows_log_event::write_row(-1) on table `article_new` | INSERT INTO `article_new` (`wiki_id`, `is_npp`, `redirect`, `title`, `category_count`, `linkcount`, | 0.000 |
19| 7 | system user | | NULL | Slave_SQL | 3509 | Waiting for room in worker thread event queue | NULL | 0.000 |
20| 459358 | root | localhost | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
21+--------+-----------------+-----------+----------------------+--------------+-------+------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
2216 rows in set (0.000 sec)

I tried capturing backtraces for the stuck threads:

fnegri@tools-db-5:~$ sudo gdb --batch --eval-command="set print frame-arguments all"  --eval-command="thread apply all bt full" /opt/wmf-mariadb106/bin/mysqld $(pgrep -xn mysqld)  > mariadbd_full_bt_all_threads.txt

Replication is now working again, and I don't think the file generated by gdb contains any useful info: it looks like a snapshot of the threads taken after replication resumed, with many idle threads and one thread applying a different transaction.

So either I was very unlucky and ran the gdb command a few seconds after replication resumed, or running the command itself caused the threads to get unstuck, or I'm not understanding how gdb works. (Attaching and detaching gdb stops and resumes every thread via ptrace, which can interrupt a thread blocked in a system call, so it's at least plausible that the attach itself nudged the stuck worker.)

Let's see if replication gets stuck again or not in the next few days.

It did get stuck again, and running the same gdb command got it unstuck again.

SHOW PROCESSLIST before running gdb:

1MariaDB [(none)]> SHOW PROCESSLIST;
2
3+--------+-----------------+-----------+---------------------+--------------+-------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
4| Id | User | Host | db | Command | Time | State | Info | Progress |
5+--------+-----------------+-----------+---------------------+--------------+-------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
6| 2 | event_scheduler | localhost | NULL | Daemon | 83988 | Waiting on empty queue | NULL | 0.000 |
7| 6 | system user | | NULL | Slave_IO | 83985 | Waiting for master to send event | NULL | 0.000 |
8| 8 | system user | | NULL | Slave_worker | 3123 | Waiting for prior transaction to commit | NULL | 0.000 |
9| 9 | system user | | s51434__mixnmatch_p | Slave_worker | 3122 | Write_rows_log_event::write_row(-1) on table `entry` | INSERT IGNORE INTO `entry` (`catalog`,`ext_id`,`ext_url`,`ext_name`,`ext_desc`,`q`,`user`,`timestamp | 0.000 |
10| 10 | system user | | s51434__mixnmatch_p | Slave_worker | 3122 | Write_rows_log_event::write_row(-1) on table `entry` | INSERT IGNORE INTO `entry` (`catalog`,`ext_id`,`ext_url`,`ext_name`,`ext_desc`,`q`,`user`,`timestamp | 0.000 |
11| 11 | system user | | NULL | Slave_worker | 3123 | Waiting for prior transaction to commit | NULL | 0.000 |
12| 12 | system user | | s51434__mixnmatch_p | Slave_worker | 3122 | Write_rows_log_event::write_row(-1) on table `entry` | INSERT IGNORE INTO `entry` (`catalog`,`ext_id`,`ext_url`,`ext_name`,`ext_desc`,`q`,`user`,`timestamp | 0.000 |
13| 13 | system user | | s51434__mixnmatch_p | Slave_worker | 3122 | Write_rows_log_event::write_row(-1) on table `entry` | INSERT IGNORE INTO `entry` (`catalog`,`ext_id`,`ext_url`,`ext_name`,`ext_desc`,`q`,`user`,`timestamp | 0.000 |
14| 14 | system user | | s52481__scratch_id | Slave_worker | 3122 | Write_rows_log_event::write_row(-1) on table `articles` | INSERT INTO articles VALUES('3329244', 'https://id.wikipedia.org/wiki/Banzai,_Fujian', 'Banzai, Fuji | 0.000 |
15| 15 | system user | | s51434__mixnmatch_p | Slave_worker | 3122 | Write_rows_log_event::write_row(-1) on table `entry` | INSERT IGNORE INTO `entry` (`catalog`,`ext_id`,`ext_url`,`ext_name`,`ext_desc`,`q`,`user`,`timestamp | 0.000 |
16| 16 | system user | | NULL | Slave_worker | 3122 | Waiting for prior transaction to commit | NULL | 0.000 |
17| 17 | system user | | NULL | Slave_worker | 3123 | Waiting for prior transaction to commit | NULL | 0.000 |
18| 18 | system user | | s51434__mixnmatch_p | Slave_worker | 3122 | Write_rows_log_event::write_row(-1) on table `entry` | INSERT IGNORE INTO `entry` (`catalog`,`ext_id`,`ext_url`,`ext_name`,`ext_desc`,`q`,`user`,`timestamp | 0.000 |
19| 19 | system user | | NULL | Slave_worker | 3122 | Waiting for prior transaction to commit | NULL | 0.000 |
20| 7 | system user | | NULL | Slave_SQL | 3118 | Waiting for room in worker thread event queue | NULL | 0.000 |
21| 512184 | root | localhost | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 |
22+--------+-----------------+-----------+---------------------+--------------+-------+---------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
2316 rows in set (14.146 sec)

Replication worked from 19:00 UTC on Friday until 03:40 UTC on Saturday, when it got stuck again:

Screenshot 2025-02-17 at 11.17.45.png (614×1 px, 129 KB)

I didn't touch it over the weekend; it is still stuck on this INSERT query:

MariaDB [(none)]> SHOW PROCESSLIST;
|     13 | system user     |           | s55543__xlinks_p | Slave_worker | 114683 | Write_rows_log_event::write_row(-1) on table `revision` | INSERT INTO revision
        (
            rev_db,
            rev_revision,
            rev_user,
  |    0.000 |

Running gdb caused replication to resume:

fnegri@tools-db-5:~$ sudo gdb --batch --eval-command="set print frame-arguments all"  --eval-command="thread apply all bt full" /opt/wmf-mariadb106/bin/mysqld $(pgrep -xn mysqld)  > mariadbd_full_bt_all_threads_3.txt

But this time it got stuck again after only a few minutes. This is the new stuck query:

|     19 | system user     |           | s51434__mixnmatch_p | Slave_worker |  15843 | Update_rows_log_event::ha_update_row(-1) on table `entry` | UPDATE `entry` SET `q`=99682705,`user`=0,`timestamp`='20250215075942' WHERE `id`=177820721 AND (`q`  |    0.000 |

I'm going to try upgrading MariaDB to 10.6.20 on the replica.

tools-db-5 is now running MariaDB 10.6.20. I temporarily enabled additional logging (SET GLOBAL log_warnings = 3;); let's see if replication gets stuck again.
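
For the record, the extra verbosity was toggled dynamically (a sketch; log_warnings is a dynamic global variable, and the values below are only the ones used in this task, not necessarily the server defaults):

-- Increase server log verbosity on the replica while investigating...
SET GLOBAL log_warnings = 3;
SHOW GLOBAL VARIABLES LIKE 'log_warnings';  -- confirm the new value

-- ...and revert once the investigation is over.
SET GLOBAL log_warnings = 0;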

Replication has been working normally for the past 2 days, so I'm marking this task as Resolved.

Screenshot 2025-02-19 at 16.12.50.png (612×1 px, 121 KB)

The more verbose logging only collected two types of messages since I enabled it:

  • [Note] InnoDB: Online DDL [...] <- details on standard InnoDB operations
  • [Warning] InnoDB: Total InnoDB FTS size 15673893 for the table s51138__heritage_p.monuments_all_tmp exceeds the innodb_ft_cache_size 8000000 <- this indicates something that could be optimized (see the sketch below), but it has been a common presence in the logs since at least November, so it's not related to the recent replication issues
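
If someone wants to follow up on that FTS warning, the relevant settings can be inspected as shown below (a sketch; as far as I know innodb_ft_cache_size is not dynamic, so raising it would need a restart, which is worth verifying before acting on it):

-- Check the InnoDB full-text cache sizes that the warning refers to.
SHOW GLOBAL VARIABLES LIKE 'innodb_ft_cache_size';
SHOW GLOBAL VARIABLES LIKE 'innodb_ft_total_cache_size';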

Nothing interesting emerged from the additional logging, so I have now reverted it to the previous value (SET GLOBAL log_warnings = 0;).