
mariadb crashed on labsdb1009
Closed, ResolvedPublic

Description

Page came in for:

Notification Type: PROBLEM

Service: Check systemd state
Host: labsdb1009
Address: 10.64.4.14
State: CRITICAL

Date/Time: Tue Mar 9 19:23:07 UTC 2021

It turns out that the mariadb service seems to have restarted itself, which caused wmf-pt-kill to register a failure.

Event Timeline

This appears to have broken replication (which is expected). The master settings also appear to have been cleared.
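(For context: this host replicates every section from its own master via MariaDB multi-source replication, so the per-connection state, including whether the master settings survived the crash, can be inspected as below. This is an illustrative sketch, not commands taken from the task.)

```
-- Show the state of every named replication connection (s1..s8) at once.
SHOW ALL SLAVES STATUS\G

-- Or inspect a single section's connection, e.g. s1.
SHOW SLAVE 's1' STATUS\G
```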

Bstorm renamed this task from mariadb seems to have restarted on labsdb1009 or something like that to mariadb seems to have restarted on labsdb1009. Mar 9 2021, 7:35 PM
Bstorm added a subscriber: Marostegui.

Reading the logs, it appears that mysqld crashed (which was what I thought).

Bstorm renamed this task from mariadb seems to have restarted on labsdb1009 to mariadb crashed on labsdb1009. Mar 9 2021, 7:38 PM
Bstorm moved this task from Backlog to Wiki replicas on the Data-Services board.
Mar 09 19:19:23 labsdb1009 mysqld[15890]: 2021-03-09 19:19:23 7f3740836700  InnoDB: Assertion failure in thread 139875282282240 in file row0ins.cc line 291
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: Failing assertion: *cursor->index->name == TEMP_INDEX_PREFIX
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: We intentionally generate a memory trap.
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: If you get repeated assertion failures or crashes, even
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: immediately after the mysqld startup, there may be
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: corruption in the InnoDB tablespace. Please refer to
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
Mar 09 19:19:23 labsdb1009 mysqld[15890]: InnoDB: about forcing recovery.
Mar 09 19:19:23 labsdb1009 mysqld[15890]: 210309 19:19:23 [ERROR] mysqld got signal 6 ;
Mar 09 19:19:23 labsdb1009 mysqld[15890]: This could be because you hit a bug. It is also possible that this binary
Mar 09 19:19:23 labsdb1009 mysqld[15890]: or one of the libraries it was linked against is corrupt, improperly built,
Mar 09 19:19:23 labsdb1009 mysqld[15890]: or misconfigured. This error can also be caused by malfunctioning hardware.
Mar 09 19:19:23 labsdb1009 mysqld[15890]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Mar 09 19:19:23 labsdb1009 mysqld[15890]: We will try our best to scrape up some info that will hopefully help
Mar 09 19:19:23 labsdb1009 mysqld[15890]: diagnose the problem, but since we have already crashed,
Mar 09 19:19:23 labsdb1009 mysqld[15890]: something is definitely wrong and this may fail.
Mar 09 19:19:23 labsdb1009 mysqld[15890]: Server version: 10.1.43-MariaDB
Mar 09 19:19:23 labsdb1009 mysqld[15890]: key_buffer_size=134217728
Mar 09 19:19:23 labsdb1009 mysqld[15890]: read_buffer_size=131072
Mar 09 19:19:23 labsdb1009 mysqld[15890]: max_used_connections=191
Mar 09 19:19:23 labsdb1009 mysqld[15890]: max_threads=1026
Mar 09 19:19:23 labsdb1009 mysqld[15890]: thread_count=16
Mar 09 19:19:23 labsdb1009 mysqld[15890]: It is possible that mysqld could use up to
Mar 09 19:19:23 labsdb1009 mysqld[15890]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2384940 K  bytes of memory
Mar 09 19:19:23 labsdb1009 mysqld[15890]: Hope that's ok; if not, decrease some variables in the equation.
Mar 09 19:19:23 labsdb1009 mysqld[15890]: Thread pointer: 0x7eca2342b008
Mar 09 19:19:23 labsdb1009 mysqld[15890]: Attempting backtrace. You can use the following information to find out
Mar 09 19:19:23 labsdb1009 mysqld[15890]: where mysqld died. If you see no messages after this, something went
Mar 09 19:19:23 labsdb1009 mysqld[15890]: terribly wrong...
Mar 09 19:19:23 labsdb1009 mysqld[15890]: stack_bottom = 0x7f3740835438 thread_stack 0x30000
Mar 09 19:19:24 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(my_print_stacktrace+0x2e)[0x563da72a210e]
Mar 09 19:19:25 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(handle_fatal_signal+0x3bd)[0x563da6de2b8d]
Mar 09 19:19:27 labsdb1009 mysqld[15890]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7f37621db0e0]
Mar 09 19:19:28 labsdb1009 mysqld[15890]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7f37608b9fff]
Mar 09 19:19:29 labsdb1009 mysqld[15890]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f37608bb42a]
Mar 09 19:19:30 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(+0x8dff90)[0x563da710df90]
Mar 09 19:19:31 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(+0x8e208b)[0x563da711008b]
Mar 09 19:19:33 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(+0x8e24b4)[0x563da71104b4]
Mar 09 19:19:34 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(+0x8ee487)[0x563da711c487]
Mar 09 19:19:35 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(+0x843e65)[0x563da7071e65]
Mar 09 19:19:36 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(_ZN7handler12ha_write_rowEPh+0xed)[0x563da6ded56d]
Mar 09 19:19:38 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(_ZN14Rows_log_event9write_rowEP14rpl_group_infob+0x1aa)[0x563da6eb5e3a]
Mar 09 19:19:39 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(_ZN20Write_rows_log_event11do_exec_rowEP14rpl_group_info+0x7d)[0x563da6eb622d]
Mar 09 19:19:40 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x32b)[0x563da6ea97eb]
Mar 09 19:19:41 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(+0x397beb)[0x563da6bc5beb]
Mar 09 19:19:43 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(handle_slave_sql+0x2be3)[0x563da6bd0c93]
Mar 09 19:19:44 labsdb1009 mysqld[15890]: /opt/wmf-mariadb101/bin/mysqld(+0x81b191)[0x563da7049191]
Mar 09 19:19:45 labsdb1009 mysqld[15890]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7f37621d14a4]
Mar 09 19:20:11 labsdb1009 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
Mar 09 19:20:11 labsdb1009 systemd[1]: mariadb.service: Unit entered failed state.
Mar 09 19:20:11 labsdb1009 systemd[1]: mariadb.service: Failed with result 'signal'.
Mar 09 19:20:16 labsdb1009 systemd[1]: mariadb.service: Service hold-off time over, scheduling restart.
Mar 09 19:20:16 labsdb1009 systemd[1]: Stopped mariadb database server.
-- Subject: Unit mariadb.service has finished shutting down

I started replication on the host, and as soon as I did, mysql crashed again with the same error, so it is not looking great. I will dig into this tomorrow. Chances are the data is corrupted anyway.
Going to leave mysql up for now; I will see what I can do tomorrow.

Mar 09 19:48:15 labsdb1009 mysqld[21777]: 2021-03-09 19:48:15 7f41b1bf8700 InnoDB: Buffer pool(s) load completed at 210309 19:48:15
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391351097088 [Note] Master 's4': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3314' in log 'db1125-bin.005675' at position 973472404
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391350687488 [Note] Master 's7': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3317' in log 'db1125-bin.002936' at position 158033399
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391350277888 [Note] Master 's5': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3315' in log 'db1124-bin.001722' at position 97608628
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391349868288 [Note] Master 's1': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3311' in log 'db1124-bin.003812' at position 461422000
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391349458688 [Note] Master 's2': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3312' in log 'db1125-bin.003307' at position 439037341
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391191701248 [Note] Master 's8': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3318' in log 'db1124-bin.004870' at position 899283425
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391191291648 [Note] Master 's6': Slave I/O thread: Start asynchronous replication to master 'repl@db1125.eqiad.wmnet:3316' in log 'db1125-bin.002163' at position 622095190
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 140391190882048 [Note] Master 's3': Slave I/O thread: Start asynchronous replication to master 'repl@db1124.eqiad.wmnet:3313' in log 'db1124-bin.003119' at position 858613245
Mar 09 20:10:19 labsdb1009 mysqld[21777]: 2021-03-09 20:10:19 7faf6886b700  InnoDB: Assertion failure in thread 140391349663488 in file row0ins.cc line 291
Mar 09 20:10:19 labsdb1009 mysqld[21777]: InnoDB: Failing assertion: *cursor->index->name == TEMP_INDEX_PREFIX
Mar 09 20:10:19 labsdb1009 mysqld[21777]: InnoDB: We intentionally generate a memory trap.
Mar 09 20:10:19 labsdb1009 mysqld[21777]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
Mar 09 20:10:19 labsdb1009 mysqld[21777]: InnoDB: If you get repeated assertion failures or crashes, even
Mar 09 20:10:19 labsdb1009 mysqld[21777]: max_used_connections=14
Mar 09 20:10:19 labsdb1009 mysqld[21777]: max_threads=1026
Mar 09 20:10:19 labsdb1009 mysqld[21777]: thread_count=9
Mar 09 20:10:19 labsdb1009 mysqld[21777]: It is possible that mysqld could use up to
Mar 09 20:10:19 labsdb1009 mysqld[21777]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2384940 K  bytes of memory
Mar 09 20:10:19 labsdb1009 mysqld[21777]: Hope that's ok; if not, decrease some variables in the equation.
Mar 09 20:10:19 labsdb1009 mysqld[21777]: Thread pointer: 0x7f422f02b008
Mar 09 20:10:19 labsdb1009 mysqld[21777]: Attempting backtrace. You can use the following information to find out
Mar 09 20:10:19 labsdb1009 mysqld[21777]: where mysqld died. If you see no messages after this, something went
Mar 09 20:10:19 labsdb1009 mysqld[21777]: terribly wrong...
Mar 09 20:10:19 labsdb1009 mysqld[21777]: stack_bottom = 0x7faf6886a438 thread_stack 0x30000
Mar 09 20:10:20 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(my_print_stacktrace+0x2e)[0x561d4862b10e]
Mar 09 20:10:20 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(handle_fatal_signal+0x3bd)[0x561d4816bb8d]
Mar 09 20:10:20 labsdb1009 mysqld[21777]: 2021-03-09 20:10:20 140391351097088 [Note] Master 's4': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3314',replication started in log 'db1125-bin.005675' at position 973472404
Mar 09 20:10:21 labsdb1009 mysqld[21777]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7faf80bde0e0]
Mar 09 20:10:21 labsdb1009 mysqld[21777]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7faf7f2bcfff]
Mar 09 20:10:21 labsdb1009 mysqld[21777]: 2021-03-09 20:10:21 140391349868288 [Note] Master 's1': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3311',replication started in log 'db1124-bin.003812' at position 461422000
Mar 09 20:10:21 labsdb1009 mysqld[21777]: 2021-03-09 20:10:21 140391350277888 [Note] Master 's5': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3315',replication started in log 'db1124-bin.001722' at position 97608628
Mar 09 20:10:21 labsdb1009 mysqld[21777]: 2021-03-09 20:10:21 140391349458688 [Note] Master 's2': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3312',replication started in log 'db1125-bin.003307' at position 439037341
Mar 09 20:10:21 labsdb1009 mysqld[21777]: 2021-03-09 20:10:21 140391190882048 [Note] Master 's3': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3313',replication started in log 'db1124-bin.003119' at position 858613245
Mar 09 20:10:21 labsdb1009 mysqld[21777]: 2021-03-09 20:10:21 140391350687488 [Note] Master 's7': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3317',replication started in log 'db1125-bin.002936' at position 158033399
Mar 09 20:10:22 labsdb1009 mysqld[21777]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7faf7f2be42a]
Mar 09 20:10:22 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(+0x8dff90)[0x561d48496f90]
Mar 09 20:10:22 labsdb1009 mysqld[21777]: 2021-03-09 20:10:22 140391191291648 [Note] Master 's6': Slave I/O thread: connected to master 'repl@db1125.eqiad.wmnet:3316',replication started in log 'db1125-bin.002163' at position 622095190
Mar 09 20:10:22 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(+0x8e208b)[0x561d4849908b]
Mar 09 20:10:23 labsdb1009 mysqld[21777]: 2021-03-09 20:10:23 140391191701248 [Note] Master 's8': Slave I/O thread: connected to master 'repl@db1124.eqiad.wmnet:3318',replication started in log 'db1124-bin.004870' at position 899283425
Mar 09 20:10:23 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(+0x8e24b4)[0x561d484994b4]
Mar 09 20:10:23 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(+0x8ee487)[0x561d484a5487]
Mar 09 20:10:24 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(+0x843e65)[0x561d483fae65]
Mar 09 20:10:24 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(_ZN7handler12ha_write_rowEPh+0xed)[0x561d4817656d]
Mar 09 20:10:25 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(_ZN14Rows_log_event9write_rowEP14rpl_group_infob+0x1aa)[0x561d4823ee3a]
Mar 09 20:10:25 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(_ZN20Write_rows_log_event11do_exec_rowEP14rpl_group_info+0x7d)[0x561d4823f22d]
Mar 09 20:10:26 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x32b)[0x561d482327eb]
Mar 09 20:10:26 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(+0x397beb)[0x561d47f4ebeb]
Mar 09 20:10:27 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(handle_slave_sql+0x2be3)[0x561d47f59c93]
Mar 09 20:10:27 labsdb1009 mysqld[21777]: /opt/wmf-mariadb101/bin/mysqld(+0x81b191)[0x561d483d2191]
Mar 09 20:10:28 labsdb1009 mysqld[21777]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7faf80bd44a4]
Mar 09 20:10:36 labsdb1009 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
Mar 09 20:10:36 labsdb1009 systemd[1]: mariadb.service: Unit entered failed state.
Mar 09 20:10:36 labsdb1009 systemd[1]: mariadb.service: Failed with result 'signal'.
Mar 09 20:10:41 labsdb1009 systemd[1]: mariadb.service: Service hold-off time over, scheduling restart.
Mar 09 20:10:41 labsdb1009 systemd[1]: Stopped mariadb database server.
-- Subject: Unit mariadb.service has finished shutting down

I have started a table check across all databases, which will likely take more than 12h to complete. Let's see what it reports tomorrow morning.
MySQL is up but replication is stopped, so data isn't being updated.

I'll depool the server since it isn't stable or up to date.
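For reference, a host-wide table check like the one described above is usually driven with `mysqlcheck --all-databases --check` or per-table `CHECK TABLE` statements; the exact invocation used here is not recorded in the task. A minimal sketch:

```
-- Illustrative only: CHECK TABLE verifies one table at a time; the run
-- described above iterated over every table in every database on the host.
CHECK TABLE enwiki.iwlinks;
```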

Change 670276 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikireplicas: depool labsdb1009 due to instability

https://gerrit.wikimedia.org/r/670276

Mentioned in SAL (#wikimedia-operations) [2021-03-09T20:25:03Z] <bstorm> downtimed labsdb1009 so it doesn't keep paging T276980

Change 670276 merged by Bstorm:
[operations/puppet@production] wikireplicas: depool labsdb1009 due to instability

https://gerrit.wikimedia.org/r/670276

The table check is still ongoing, and it will take quite a bit longer (possibly more than 24h).

I have disabled notifications for this host, as it is likely to be down/under maintenance for some days

While checking all the tables, mysqld crashed again (it was on commons at the time).
I am going to try another approach and start replication section by section, so we can at least isolate the problem (if it is a specific section, we can dig into those wikis/tables).
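Since each section is a separate named replication connection on this host, it can be started, watched, and stopped individually; a sketch of what the section-by-section approach looks like (illustrative, not the exact commands used):

```
-- Start only one section's replication threads, then watch it.
START SLAVE 's1';
SHOW SLAVE 's1' STATUS\G

-- Stop that section again if it turns out to be the one causing crashes.
STOP SLAVE 's1';
```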

s1 made it crash.
Going for s2 now

s2 and s3 are now replicating. Let's see what happens after a few hours.

s2 and s3 synced with their master with no problem.
Going to go for s4 now.

s4 seems good, still catching up with no issues.
Just started s5...we'll see

s5 sync'ed fine.
s4 still ongoing
s6 just started

s6 sync'ed well with its master
s4 still catching up nicely
s7 just started, and I noticed `InnoDB: load corrupted index index "rc_ns_actor" of table "arwiki"."recentchanges"`. Going to fix that and will restart that replication thread.
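A corrupted secondary index is normally fixed by dropping and re-creating it (or rebuilding the whole table) so InnoDB regenerates it from the clustered index. Below is a sketch of what the fix for rc_ns_actor might look like; the column list is assumed from the MediaWiki recentchanges schema, and the task does not record the exact statements used:

```
-- Drop the corrupted secondary index and recreate it. Column list assumed
-- from the MediaWiki schema; verify with SHOW CREATE TABLE before running.
ALTER TABLE arwiki.recentchanges DROP INDEX rc_ns_actor;
ALTER TABLE arwiki.recentchanges ADD INDEX rc_ns_actor (rc_namespace, rc_actor);

-- Alternative: rebuild the whole table in place.
-- ALTER TABLE arwiki.recentchanges FORCE;
```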

Fixed:

mysql:root@localhost [arwiki]> check table recentchanges;
+----------------------+-------+----------+----------+
| Table                | Op    | Msg_type | Msg_text |
+----------------------+-------+----------+----------+
| arwiki.recentchanges | check | status   | OK       |
+----------------------+-------+----------+----------+
1 row in set (8.65 sec)

Replication restarted on s7.

s7 and s4 synced correctly.
I have started s8, which looks corrupted. I will try to fix the inconsistencies tomorrow and see if it replicates fine.

Mar 10 20:01:03 labsdb1009 mysqld[22149]: 2021-03-10 20:01:03 139779498170112 [ERROR] Master 's8': Slave SQL: Could not execute Update_rows_v1 event on table wikidatawiki.wb_changes; Can't find record in 'wb_changes', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1124-bin.004870, end_log_pos 898487947, Gtid 171970645-171970645-1289156356, Internal MariaDB error code: 1032

Meanwhile, I have left a check running on s1 to see if I can find out which table might be corrupted.
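The HA_ERR_KEY_NOT_FOUND error above means the replica is missing a row that the master just updated, i.e. the s8 data has drifted from its master. Typical repairs are either to restore the missing rows from the master or to skip the failing event and reconcile afterwards; the sketch below shows the skip variant on the s8 connection only, as an illustration rather than a record of what was actually done here:

```
-- Skip a single failing event on the s8 connection only. This accepts the
-- drift; the underlying row differences still have to be reconciled against
-- the master afterwards.
STOP SLAVE 's8';
SET @@default_master_connection = 's8';
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE 's8';
```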

The check on s1 finished. Only one table is reported as corrupted: enwiki.iwlinks.
Going to fix that and then start replication on s1 and see if it crashes again.

enwiki.iwlinks fixed, started replication on s1. We'll see how it goes

Fixed s8 inconsistencies.
s8 and s1 are now replicating. The rest of the sections are stopped for now. If s1 and s8 sync fine, I will start the rest and see what happens.

s1 sync'ed with the master correctly.
s8 is still doing so, but so far so good.

s8 has synced correctly.
At this point, all the sections have sync'ed correctly when run independently.

I am now going to start all of them one by one, and we'll see what happens.

All threads started and now catching up...fingers crossed!
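Once each section has been vetted on its own, bringing everything back up and watching the catch-up is straightforward; a sketch (illustrative):

```
-- Start all named replication connections at once, then watch
-- Seconds_Behind_Master per connection drop as each section catches up.
START ALL SLAVES;
SHOW ALL SLAVES STATUS\G
```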

All the replication threads synced correctly.
I am going to leave the host replicating until Monday and if there are no crashes, I will repool it.

labsdb1009 has been replicating fine since Thursday, so I am going to enable notifications and start repooling it.

Marostegui claimed this task.

Host repooled. Let's see how it goes with user traffic.