
ToolsDB: master crashed, replica having consistency issues
Closed, Resolved · Public

Description

Today (2020-05-27) the ToolsDB master server crashed.

The master server, clouddb1001.clouddb-services.eqiad.wmflabs, was completely frozen. After checking that the replica (clouddb1002.clouddb-services.eqiad.wmflabs) was healthy, we decided to hard-reboot the master.

After the reboot, we were able to collect some logs indicating memory exhaustion:

May 27 05:47:14 clouddb1001 kernel: [18638293.176610] mysqld: page allocation stalls for 644076ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
May 27 05:47:14 clouddb1001 kernel: [18638293.176618] CPU: 1 PID: 8567 Comm: mysqld Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
May 27 05:47:14 clouddb1001 kernel: [18638293.176619] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1 04/01/2014
May 27 05:47:14 clouddb1001 kernel: [18638293.176620]  0000000000000000 ffffffffb9534524 ffffffffb9c01e18 ffff9f6706f5fa08
May 27 05:47:14 clouddb1001 kernel: [18638293.176623]  ffffffffb938bfaa 0342004a00000000 ffffffffb9c01e18 ffff9f6706f5f9a8
May 27 05:47:14 clouddb1001 kernel: [18638293.176625]  ffffffff00000010 ffff9f6706f5fa18 ffff9f6706f5f9c8 fee40a0c5d22bfa4
May 27 05:47:14 clouddb1001 kernel: [18638293.176627] Call Trace:
May 27 05:47:14 clouddb1001 kernel: [18638293.176636]  [<ffffffffb9534524>] ? dump_stack+0x5c/0x78
May 27 05:47:14 clouddb1001 kernel: [18638293.176641]  [<ffffffffb938bfaa>] ? warn_alloc+0x13a/0x160
May 27 05:47:14 clouddb1001 kernel: [18638293.176643]  [<ffffffffb938c99d>] ? __alloc_pages_slowpath+0x95d/0xbb0
May 27 05:47:14 clouddb1001 kernel: [18638293.176645]  [<ffffffffb938cdf1>] ? __alloc_pages_nodemask+0x201/0x260
May 27 05:47:14 clouddb1001 kernel: [18638293.176648]  [<ffffffffb93de751>] ? alloc_pages_current+0x91/0x140
May 27 05:47:14 clouddb1001 kernel: [18638293.176649]  [<ffffffffb9383684>] ? pagecache_get_page+0xb4/0x2a0
May 27 05:47:14 clouddb1001 kernel: [18638293.176650]  [<ffffffffb938388c>] ? grab_cache_page_write_begin+0x1c/0x40
May 27 05:47:14 clouddb1001 kernel: [18638293.176653]  [<ffffffffb94706f5>] ? iomap_write_begin+0x55/0x120
May 27 05:47:14 clouddb1001 kernel: [18638293.176654]  [<ffffffffb9470988>] ? iomap_write_actor+0xa8/0x180
May 27 05:47:14 clouddb1001 kernel: [18638293.176657]  [<ffffffffb94708e0>] ? iomap_write_end+0x70/0x70
May 27 05:47:14 clouddb1001 kernel: [18638293.176659]  [<ffffffffb9470fed>] ? iomap_apply+0xbd/0x140
May 27 05:47:14 clouddb1001 kernel: [18638293.176660]  [<ffffffffb94708e0>] ? iomap_write_end+0x70/0x70
May 27 05:47:14 clouddb1001 kernel: [18638293.176661]  [<ffffffffb94710d7>] ? iomap_file_buffered_write+0x67/0x90
May 27 05:47:14 clouddb1001 kernel: [18638293.176662]  [<ffffffffb94708e0>] ? iomap_write_end+0x70/0x70
May 27 05:47:14 clouddb1001 kernel: [18638293.176716]  [<ffffffffc060a4a4>] ? xfs_file_buffered_aio_write+0xe4/0x2f0 [xfs]
May 27 05:47:14 clouddb1001 kernel: [18638293.176736]  [<ffffffffc06179b7>] ? xfs_iunlock+0xe7/0x100 [xfs]
May 27 05:47:14 clouddb1001 kernel: [18638293.176754]  [<ffffffffc060a7bf>] ? xfs_file_write_iter+0x10f/0x160 [xfs]
May 27 05:47:14 clouddb1001 kernel: [18638293.176757]  [<ffffffffb940ac30>] ? new_sync_write+0xe0/0x130
May 27 05:47:14 clouddb1001 kernel: [18638293.176758]  [<ffffffffb940b3b0>] ? vfs_write+0xb0/0x190
May 27 05:47:14 clouddb1001 kernel: [18638293.176760]  [<ffffffffb940c7f2>] ? SyS_write+0x52/0xc0
May 27 05:47:14 clouddb1001 kernel: [18638293.176762]  [<ffffffffb9203b7d>] ? do_syscall_64+0x8d/0xf0
May 27 05:47:14 clouddb1001 kernel: [18638293.176766]  [<ffffffffb981924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6

Once the master was back up, @Marostegui fixed some inconsistencies in several of its tables, but replication to the replica is now showing inconsistencies in some tables as well.

May 27 05:57:30 clouddb1002 mysqld[14210]: 2020-05-27  5:57:30 140124118230784 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
May 27 05:57:30 clouddb1002 mysqld[14210]: 2020-05-27  5:57:30 140124118230784 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log.209037' at position 83547720; GTID position '0-2886731673-23684796837'
May 27 05:57:30 clouddb1002 mysqld[14210]: 2020-05-27  5:57:30 140124118230784 [ERROR] Slave I/O: error reconnecting to master 'repl@172.16.7.153:3306' - retry-time: 60  maximum-retries: 86400  message: Can't connect to MySQL server on '172.16.7.153' (111 "Connection refused"), Internal MariaDB error code: 2003
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27  6:10:00 140124118030080 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table s53685__editgroups.store_edit; Duplicate entry '1230218293' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log log.209041, end_log_pos 3041397, Gtid 0-2886731673-23684799152, Internal MariaDB error code: 1062
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27  6:10:00 140124118030080 [Warning] Slave: Duplicate entry '1230218293' for key 'PRIMARY' Error_code: 1062
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27  6:10:00 140124118030080 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'log.209041' position 3040679; GTID position '0-2886731673-23684799151'
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27  6:10:00 140124118030080 [Note] Slave SQL thread exiting, replication stopped in log 'log.209041' at position 3040679; GTID position '0-2886731673-23684799151'

Most of the data consistency issues on both servers have already been fixed by @Marostegui.
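
For context, the replication state quoted above is what the replica reports via SHOW SLAVE STATUS; a minimal check (a sketch, not the exact commands used during the incident) looks like:

-- On clouddb1002 (the replica): is replication running, how far behind is it,
-- and what did it last fail on?
SHOW SLAVE STATUS\G
-- Fields worth reading in the output:
--   Slave_IO_Running / Slave_SQL_Running  -- Yes/No per thread
--   Seconds_Behind_Master                 -- NULL while the SQL thread is stopped
--   Last_SQL_Errno / Last_SQL_Error       -- e.g. the 1062 duplicate-key error above
--   Gtid_IO_Pos                           -- GTID position replication stopped at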

Event Timeline

aborrero created this task. May 27 2020, 9:51 AM
aborrero updated the task description. May 27 2020, 9:52 AM

So far I have fixed all the drifts and replication is catching up. We'll see if we get more of those.

Marostegui closed this task as Resolved. May 27 2020, 11:28 AM
Marostegui claimed this task.

Replication has caught up, and there are no more drifts from what I can see.
So considering this solved?

Marostegui reopened this task as Open. May 27 2020, 1:13 PM
Marostegui removed Marostegui as the assignee of this task.

This is back - you might want to reclone the affected tables (or the whole server), or at least check with the users whether the data can be ignored.
My guess is that something got inserted on the master but never made it to the slave, which is now complaining about data that is not there locally but does exist on the master.

Is there a method you recommend for recloning the tables, @Marostegui? I can't say I have done a single-table clone to a MySQL replica this decade 🙂.

We can always check on the users/data, of course. For all I know, we have new databases that need to be excluded from replication. I'll poke it today. Thanks!

bd808 added a comment. May 27 2020, 2:28 PM

Not sure if it is of value, but I caught the MariaDB log output from the first startup of the database after we forced the instance to reboot.

-- Logs begin at Wed 2020-05-27 05:53:35 UTC, end at Wed 2020-05-27 06:01:56 UTC. --
May 27 05:58:13 clouddb1001 systemd[1]: Starting mariadb database server...
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] /opt/wmf-mariadb101/bin/mysqld (mysqld 10.1.39-MariaDB) starting as process 1923 ...
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [ERROR] Plugin 'unix_socket' already installed
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] mysqld: Aria engine: starting recovery
May 27 05:58:14 clouddb1001 mysqld[1923]: recovered pages: 0% 52% 100% (0.0 seconds); tables to flush: 5 4 3 2 1 0 (0.0 seconds);
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] mysqld: Aria engine: recovery done
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: Using mutexes to ref count buffer pool pages
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: The InnoDB memory heap is disabled
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: Compressed tables use zlib 1.2.11
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: Using Linux native AIO
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: Using SSE crc32 instructions
May 27 05:58:14 clouddb1001 mysqld[1923]: 2020-05-27 5:58:14 140561590362560 [Note] InnoDB: Initializing buffer pool, size = 31.0G
May 27 05:58:16 clouddb1001 mysqld[1923]: 2020-05-27 5:58:16 140561590362560 [Note] InnoDB: Completed initialization of buffer pool
May 27 05:58:16 clouddb1001 mysqld[1923]: 2020-05-27 5:58:16 140561590362560 [Note] InnoDB: Highest supported file format is Barracuda.
May 27 05:58:16 clouddb1001 mysqld[1923]: 2020-05-27 5:58:16 140561590362560 [Note] InnoDB: Starting crash recovery from checkpoint LSN=50443047733924
May 27 05:58:24 clouddb1001 mysqld[1923]: 2020-05-27 5:58:24 140561590362560 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553384 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553393 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232551956 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553400 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553403 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553474 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553478 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553476 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553482 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553506 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553540 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553795 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553731 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553732 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553759 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553767 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553770 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553813 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553856 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553882 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553918 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553987 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554038 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554040 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554041 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232553991 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554022 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554069 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554080 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554096 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232550766 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232550766 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Transaction 53232554747 was in the XA prepared state.
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: 36 transaction(s) which must be rolled back or cleaned up
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: in total 4715 row operations to undo
May 27 05:58:26 clouddb1001 mysqld[1923]: InnoDB: Trx id counter is 53232573696
May 27 05:58:26 clouddb1001 mysqld[1923]: 2020-05-27 5:58:26 140561590362560 [Note] InnoDB: Starting final batch to recover 14120 pages from redo log
May 27 05:58:31 clouddb1001 mysqld[1923]: InnoDB: Last MySQL binlog file position 0 83484603, file name /srv/labsdb/binlogs/log.209037
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140561590362560 [Note] InnoDB: 128 rollback segment(s) are active.
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140524866500352 [Note] InnoDB: Starting in background the rollback of recovered transactions
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140524866500352 [Note] InnoDB: To roll back: 4 transactions, 4715 rows
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140561590362560 [Note] InnoDB: Waiting for purge to start
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140524866500352 [Note] InnoDB: Rollback of trx with id 53232564680 completed
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140524866500352 [Note] InnoDB: Rollback of trx with id 53232564390 completed
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140524866500352 [Note] InnoDB: Rollback of trx with id 53232554087 completed
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140561590362560 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.43-84.3 started; log sequence number 50443085726861
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140524866500352 [Note] InnoDB: Rollback of trx with id 53232540583 completed
May 27 05:58:43 clouddb1001 mysqld[1923]: 2020-05-27 5:58:43 140524866500352 [Note] InnoDB: Rollback of non-prepared transactions completed
May 27 05:58:52 clouddb1001 mysqld[1923]: 2020-05-27 5:58:52 140524866500352 [Note] InnoDB: Dumping buffer pool(s) not yet started
May 27 05:58:52 clouddb1001 mysqld[1923]: 2020-05-27 05:58:52 7fce7ebff700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
May 27 05:58:52 clouddb1001 mysqld[1923]: 2020-05-27 5:58:52 140561590362560 [Note] Recovering after a crash using /srv/labsdb/binlogs/log
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [ERROR] Error in Log_event::read_log_event(): 'Event invalid', data_len: 0, event_type: 0
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [Note] Starting crash recovery...
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Starting recovery for XA transactions...
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554747 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554096 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554080 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554069 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554041 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554040 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554038 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232554022 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553991 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553987 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553918 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553882 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553856 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553813 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553795 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553770 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553767 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553759 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553732 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553731 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553540 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553506 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553482 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553478 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553476 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553474 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553403 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553400 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553393 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 34 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232553384 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 1 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232551956 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 10 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction 53232550766 in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: Transaction contains changes to 464 rows
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fd70ba9a1c0 InnoDB: 32 transactions in prepared state after recovery
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [Note] Found 32 prepared transaction(s) in InnoDB
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [Note] Crash recovery finished.
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [Note] Server socket created on IP: '::'.
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [ERROR] Can't start server : Bind on unix socket: No such file or directory
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [ERROR] Do you already have another mysqld server running on socket: /var/run/mysqld/mysqld.sock ?
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140561590362560 [ERROR] Aborting
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 05:58:53 7fce7ebff700 InnoDB: Buffer pool(s) load completed at 200527 5:58:53
May 27 05:58:53 clouddb1001 mysqld[1923]: 2020-05-27 5:58:53 140524866500352 [Warning] InnoDB: Dumping buffer pool(s) to .//ib_buffer_pool
May 27 05:58:54 clouddb1001 mysqld[1923]: 2020-05-27 5:58:54 140524866500352 [Warning] InnoDB: Buffer pool(s) dump completed at 200527 5:58:54
May 27 05:58:57 clouddb1001 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
May 27 05:58:57 clouddb1001 systemd[1]: Failed to start mariadb database server.
May 27 05:58:57 clouddb1001 systemd[1]: mariadb.service: Unit entered failed state.
May 27 05:58:57 clouddb1001 systemd[1]: mariadb.service: Failed with result 'exit-code'.

bd808 triaged this task as High priority. May 27 2020, 2:29 PM
bd808 added a project: Data-Services.
bd808 moved this task from Backlog to ToolsDB on the Data-Services board.

We have a script that we used to use on labsdb1004/labsdb1005 (the old ToolsDB) to sync tables: https://github.com/wikimedia/operations-software/blob/master/dbtools/reimport_from_master.sh

Circling back around to this. From what I can see, s51245__totoazero.maj_articles_recents is the major source of trouble. Timing-wise, I don't see any other issues.

That's the totoazero tool. Admins are @Framawiki and @totoazero. If that table is something you folks intend to keep, I can use/adapt that script to reclone it for you.

@Bstorm you might want to have the host up to date with replication before attempting the restore. Probably the best approach is to skip those tables with replication filters (I can help with that once you've identified the ones you want to skip).

Ahh, yeah, oops. I see Seconds_Behind_Master: NULL and dead threads all over the logs. So @Marostegui, were the earlier errors (the ones timestamped before T253738#6169013, I think) already fixed? If so, then I'm only aware of s51245__totoazero.maj_articles_recents.

Yes, the ones that I fixed might still cause issues, as more errors could show up once replication resumes.
If it says NULL, the table involved will be shown just above that; you'd need to exclude it with a replication filter and then run start slave; again, so it will ignore anything that arrives for that table.
You might need to do that for every table that keeps failing (if any).
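
Put concretely, a minimal sketch of that loop (the table name is just the one identified elsewhere in this task):

-- On the replica, after the SQL thread stops (Seconds_Behind_Master: NULL):
SHOW SLAVE STATUS\G
-- Last_SQL_Error names the offending table, e.g. s51245__totoazero.maj_articles_recents.
-- Exclude that table with a replication filter (my.cnf or the runtime variant
-- discussed below), then resume replication and watch for the next failure:
START SLAVE;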

Change 599926 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: Add misbehaving table to replication filters

https://gerrit.wikimedia.org/r/599926

So, stupid question, @Marostegui: if I apply that patch and restart mysqld on the slave without restarting the master, will that work, or do I need to deal with the whole CHANGE REPLICATION FILTER mess?

If you add the replication filter to my.cnf and restart the process on the slave, that will work; no need to touch the master.
You can also do it without restarting mysql by running `change replication filter REPLICATE_WILD_IGNORE_TABLE = ('$DB.$TABLE');` (make sure to include the existing filters, which you can get with `show slave status\G`).

I am writing from my phone, so I could've made some typos!
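
To make both routes concrete, here is a rough sketch with an illustrative table name. The CHANGE REPLICATION FILTER statement quoted above is the MySQL-flavoured syntax; the runtime equivalent assumed here is MariaDB's replicate_wild_ignore_table global variable, assuming the running version allows setting it dynamically (otherwise the my.cnf route plus a restart is the fallback):

-- Route 1 (what the puppet patch does): add the filter to my.cnf on the replica
-- and restart mysqld, e.g.
--   [mysqld]
--   replicate_wild_ignore_table = s51245__totoazero.maj_articles_recents
-- (note the table-vs-wildcard escaping caveat mentioned later in this task).

-- Route 2: set the filter at runtime, no restart. The slave threads must be
-- stopped first, and the new value must repeat any filters already in place
-- (check them with SHOW SLAVE STATUS\G beforehand).
STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table = 'existing_db.existing_table,s51245__totoazero.maj_articles_recents';
START SLAVE;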

Change 599926 merged by Bstorm:
[operations/puppet@production] toolsdb: Add misbehaving table to replication filters

https://gerrit.wikimedia.org/r/599926

So after merging that, I realized that the escapes don't seem to be correct for a table vs a wild_table, so I stopped puppet and made it a wildcard entry. That seems to have got replication moving again.

Mentioned in SAL (#wikimedia-cloud) [2020-05-29T23:42:04Z] <bstorm_> stopped puppet and restarting mariadb on clouddb1002 after filtering out a table T253738

Seconds_Behind_Master: 213890 sounds a lot better. Let's see if it actually catches up.

Keep in mind that it can break again at any time, and you might need to add more replication filters until it has caught up with the master (Seconds_Behind_Master: 0).

Change 601385 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: add another temporary filter

https://gerrit.wikimedia.org/r/601385

Change 601385 merged by Bstorm:
[operations/puppet@production] toolsdb: add another temporary filter

https://gerrit.wikimedia.org/r/601385

Mentioned in SAL (#wikimedia-cloud) [2020-06-01T18:51:28Z] <bstorm_> restarting mariadb on clouddb1002 T253738

So far so good. It's still replicating. It hasn't fully caught up yet, though.

bd808 added a comment. Jun 3 2020, 7:39 PM

Replication still progressing ('Seconds_Behind_Master: 108583').

bd808 added a comment. Jun 3 2020, 11:48 PM

New things seem to be coming in faster than they are going out right now (Seconds_Behind_Master: 123536), but at least it has not hit another hard failure yet.

bd808 added a comment. Jun 4 2020, 1:34 PM

Seconds_Behind_Master: 173230

This server is falling behind because an optimize issued by one of the users is causing high load:

|    5 | system user     |           | s51434__mixnmatch_p | Connect |  71110 | copy to tmp table                | optimize table entry |   44.832 |

That table is quite big:

-rw-rw---- 1 mysql mysql  70G Jun  3 17:58 entry.ibd

But it looks like it is almost done:

-rw-rw---- 1 mysql mysql   58G Jun  4 13:48 #sql-5595_5.ibd
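
For reference, long-running statements like that optimize can be spotted on the replica with something along these lines (the one-hour threshold is arbitrary):

-- Show statements that have been running for more than an hour.
SELECT id, user, db, time, state, LEFT(info, 60) AS query
FROM information_schema.PROCESSLIST
WHERE time > 3600
ORDER BY time DESC;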

bd808 added a comment. Jun 4 2020, 5:21 PM

New breakage:

Seconds_Behind_Master: NULL
Last_SQL_Error: Could not execute Write_rows_v1 event on table s51080__checkwiki_p.cw_new; Duplicate entry 'enwiki-Ituri Province' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log log.209740, end_log_pos 52679930

Change 602433 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolsdb: add 3rd temporary filter for replication

https://gerrit.wikimedia.org/r/602433

Mentioned in SAL (#wikimedia-cloud) [2020-06-04T17:41:25Z] <bd808> Restarting mariadb after hand applying {{gerrit|602433}} (T253738)

bd808 added a comment. Jun 4 2020, 5:41 PM

Replicating again (Seconds_Behind_Master: 31577)

bd808 added a comment. Jun 4 2020, 7:16 PM

Another busted table:

Last_SQL_Error: Could not execute Delete_rows_v1 event on table s51080__checkwiki_p.cw_dumpscan; Can't find record in 'cw_dumpscan', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.209768, end_log_pos 75595466

bd808 added a comment. Jun 4 2020, 7:20 PM

s51080__checkwiki_p.cw_dumpscan excluded from replication. (Seconds_Behind_Master: 26287)

bd808 added a comment. Jun 4 2020, 11:22 PM

Seconds_Behind_Master: 12774

bd808 added a comment. Jun 5 2020, 6:05 PM

Seconds_Behind_Master: 0!!!

Next steps will be to figure out what can be done about the 4 tables that were filtered out to get this far.

OK, so it seems that to use https://github.com/wikimedia/operations-software/blob/master/dbtools/reimport_from_master.sh we need to adjust some of the user information for root in MariaDB, if I'm reading that correctly. I'm poking around at that.

Change 602433 merged by Bstorm:
[operations/puppet@production] toolsdb: add more temporary filters for replication

https://gerrit.wikimedia.org/r/602433

So the tools with tables that are out of sorts are:

  • totoazero
  • request
  • checkwiki

Since cloning the data requires setting sessions or the whole DB to read-only, it may be better to work with the users on this. I imagine that dumping the table, dropping it, removing the filter, and then reloading it would work, right @Marostegui? I can make subtasks for working with the tool maintainers to decide whether to abandon having the data replicated or to do something akin to that to restore replication.

Unfortunately, setting specific sessions (or databases) to read-only isn't possible with MySQL; it is either everything or nothing.
What you could do is:

  • Talk to the tool owners to disable/stop those tools, to make sure no writes arrive to their tables.
  • Remove the replication filters.
  • Drop and re-import the tables from the master (a rough sketch follows below).
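
A rough, hypothetical sketch of those steps for one affected table (illustrative name; the linked reimport_from_master.sh automates the copy step):

-- 0. The owning tool has been stopped, so nothing writes to the table on the master.
-- 1. On the replica: stop replication and drop the stale copy.
STOP SLAVE;
DROP TABLE s51245__totoazero.maj_articles_recents;
-- 2. Remove the table's replication filter (my.cnf / puppet plus a mysqld restart,
--    or the runtime variant sketched earlier in this task).
-- 3. Load a fresh dump of the table taken from the master (e.g. with mysqldump on
--    clouddb1001), then let replication catch up again.
SOURCE /tmp/maj_articles_recents.sql;
START SLAVE;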

Bstorm added a comment. Jul 6 2020, 6:23 PM

Thanks @Marostegui. I'll get that ball rolling.

Change 617533 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: remove one of the temporary replication filters

https://gerrit.wikimedia.org/r/617533

Change 617533 merged by Bstorm:
[operations/puppet@production] toolsdb: remove one of the temporary replication filters

https://gerrit.wikimedia.org/r/617533

Bstorm closed this task as Resolved (Edited). Dec 17 2020, 3:10 PM
Bstorm claimed this task.

Replication is restored (T266587) and we are back to filtering only the four databases we filtered before this all went down.